Online learning is very important in machine learning as it allows for the inclusion of new data samples without having to recalculate model parameters for the rest of the data. The aim of this exercise is to explore this concept.
We will now look into online estimation of a mean vector. The objective is to apply the following formula for estimating a mean (see Bishop Section 2.3.5):

$$\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)$$
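As a short sketch of where the sequential update comes from, the batch maximum-likelihood mean of $N$ points can be split into the contribution of the newest point $x_N$ and the mean of the first $N-1$ points. The notation $\mu_{ML}^{(N)}$ for "the estimate after $N$ points" is assumed here, following Bishop:

$$
\begin{aligned}
\mu_{ML}^{(N)} &= \frac{1}{N}\sum_{n=1}^{N} x_n \\
&= \frac{1}{N}\,x_N + \frac{N-1}{N}\,\mu_{ML}^{(N-1)} \\
&= \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)
\end{aligned}
$$

The last line shows that only the previous estimate, the new point and the count $N$ are needed; the earlier points never have to be revisited.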
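Let's first create a data generator. Create a function gen_data(n, k, mean, var) which returns an $n \times k$ array $X$ of $n$ vectors of dimension $k$, each drawn from $N_k(\mu, \sigma^2 I_k)$, where:
- $N_k()$ is the k-variate normal distribution
- $\mu$ (or mean) is the mean vector of dimension $k$
- $\sigma$ (or var) is the variance parameter. The example outputs below are consistent with a covariance of $\sigma^2 I_k$, so var acts as the standard deviation
- $I_k$ is the identity matrix
You should use np.random.multivariate_normal for this.
Example inputs and outputs (each example calls np.random.seed(1234) first):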
gen_data(2, 3, np.array([0, 1, -1]), 1.3)
[[ 0.61286571, -0.5482684 , 0.86251906],
[-0.40644746, 0.06323465, 0.15331182]]
gen_data(5, 1, np.array([0.5]), 0.5)
[[ 0.73571758],
[-0.09548785],
[ 1.21635348],
[ 0.34367405],
[ 0.13970563]]
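A minimal sketch of such a generator is given here, not as the reference solution but as one way to satisfy the description. It assumes that var is squared to build the covariance matrix, since that is what reproduces the example outputs listed above; the variable name X_small is made up for illustration.

```python
import numpy as np


def gen_data(n: int, k: int, mean: np.ndarray, var: float) -> np.ndarray:
    """Draw n samples from a k-variate normal with covariance var**2 * I_k."""
    # Assumption: `var` is squared to form the covariance. This choice
    # reproduces the example outputs shown in the exercise text.
    cov = np.power(var, 2) * np.identity(k)
    return np.random.multivariate_normal(mean, cov, n)


# Reproduce the first example: seed first, then draw 2 points in 3 dimensions.
np.random.seed(1234)
X_small = gen_data(2, 3, np.array([0, 1, -1]), 1.3)
```

The returned array has one row per sample, so X_small has shape (2, 3).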
Answer this question in a raw text file and submit it as 1_2_1.txt
Let's create some data
You can visualize your data using tools.scatter_3d_data to get a plot similar to the following:
You can also use tools.bar_per_axis to visualize the distribution of the data per dimension:
Do you expect the batch estimate to be exactly equal to the mean the data was generated with?
Continue to section 1.4.
We will now implement the sequential estimate.
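We want a function that returns the updated estimate $\mu_{ML}^{(N)}$, given the previous estimate $\mu_{ML}^{(N-1)}$, the new point $x_N$, and the number of points $N$ (counting the new one).
Create a function update_sequence_mean(mu, x, n) which performs the update in the equation above.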
Example inputs and outputs:
mean = np.mean(X, 0)
new_x = gen_data(1, 3, np.array([0, 0, 0]), 1)
update_sequence_mean(mean, new_x, X.shape[0]+1)
Results in an array similar to [[-0.21653761 -0.00721158 -0.15876203]] (since we're using random numbers, the values you get will probably not be exactly the same).
Gradescope will use np.random.seed(1234) before generating the data to test this function.
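The update can be sketched in a single line. The sketch below assumes that n counts the points including the new one, which matches the call update_sequence_mean(mean, new_x, X.shape[0]+1) in the example; the small hand-made data set is only there to check the arithmetic.

```python
import numpy as np


def update_sequence_mean(mu: np.ndarray, x: np.ndarray, n: int) -> np.ndarray:
    """Return the mean after folding the n-th point x into the previous mean mu."""
    # Assumption: n is the number of points seen so far, including x.
    return mu + (x - mu) / n


# Sanity check on a tiny hand-made data set (not from the exercise).
points = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
estimate = np.zeros(2)
for i, point in enumerate(points):
    estimate = update_sequence_mean(estimate, point, i + 1)
# estimate is now [3., 6.], identical to points.mean(axis=0)
```

Because the first update uses n = 1, the initial value of the estimate is overwritten by the first point, so the starting guess has no influence on the final result.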
Let's plot the estimates on all dimensions as the sequence estimate gets updated. You can use _plot_sequence_estimate() as a template. You should:
- Generate 100 3-dimensional points with mean [0, 0, 0] and variance 4.
- Set the initial estimate as $(0, 0, 0)$.
- Perform update_sequence_mean for each point in the set.
- Collect the estimates as you go.
For a different set of points this plot looks like the following:
Turn in your plot as 1_5_1.png
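The collection loop described in the steps above could look roughly like the following. This is a sketch under stated assumptions: the value 4 is treated the same way as the var argument of gen_data (squared to form the covariance), the data is drawn directly with np.random.multivariate_normal to keep the block self-contained, and plotting is left out.

```python
import numpy as np


def update_sequence_mean(mu, x, n):
    # Sequential mean update: fold the n-th point x into the previous mean mu.
    return mu + (x - mu) / n


np.random.seed(1234)
# Assumption: 4 is handled like the `var` argument of gen_data,
# i.e. it is squared to build the covariance matrix.
X = np.random.multivariate_normal(np.zeros(3), np.power(4, 2) * np.identity(3), 100)

mean = np.zeros(3)  # initial estimate (0, 0, 0)
estimates = []
for i in range(X.shape[0]):
    mean = update_sequence_mean(mean, X[i], i + 1)
    estimates.append(mean)
estimates = np.array(estimates)  # shape (100, 3): one row per update
```

Each column of estimates can then be plotted against the update index to get one curve per dimension.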
Let's now plot the squared error between the estimate and the actual mean after every update.
The squared error between, e.g., a ground truth $y$ and a prediction $\hat{y}$ is $(y - \hat{y})^2$.
Of course, our data will be 3-dimensional, so after calculating the squared error you will have a 3-dimensional error. Take the mean of those three values to get the average error across all three dimensions and plot those values.
You can use _plot_square_error and _square_error for this.
For a different distribution this plot looks like the following:
Turn in your plot as 1_6_1.png
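As an illustration of the averaging step, the sketch below computes one error value per update. The signature _square_error(y, y_hat) is an assumption, since the exercise only names the helper, and the two estimates are made-up numbers standing in for the collected sequence estimates.

```python
import numpy as np


def _square_error(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
    """Element-wise squared error (assumed signature)."""
    return np.power(y - y_hat, 2)


true_mean = np.zeros(3)
# Two made-up estimates, one row per update.
estimates = np.array([[1.0, -2.0, 0.0],
                      [0.5, 0.5, -0.5]])
# Average the three per-dimension errors to get one value per update.
errors = np.mean(_square_error(true_mean, estimates), axis=1)
```

Broadcasting lets the single true mean be compared against every row at once, so errors has one entry per update and can be plotted directly.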
Read this carefully before you submit your solution.
You should edit template.py to include your own code.
This is an individual project, so you can of course help each other out, but your code should be your own.
You are not allowed to import any non-built-in packages that are not already imported.
Files to turn in:
- template.py: This is your code
- 1_2_1.txt
- 1_5_1.png
- 1_6_1.png
Make sure the file names are exact. Submissions that do not pass the first two tests in Gradescope will not be graded.
What happens if the mean value changes (perhaps slowly) with time? What if
Create this type of data and formulate a method for tracking the mean.
Plot the estimate of all dimensions and the mean squared error over all three dimensions. Turn in these plots as indep_1.png and indep_2.png.
Write a short summary of how your method works.
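One possible approach, offered only as a starting point and not as the expected solution, is to replace the shrinking factor 1/N with a constant step size so that old points are gradually forgotten. In the sketch below the helper update_tracking_mean, the step size alpha = 0.05 and the linearly drifting data are all hypothetical choices made for illustration.

```python
import numpy as np


def update_tracking_mean(mu, x, alpha):
    # Hypothetical helper: a constant step size alpha replaces the 1/N factor,
    # so the influence of older points decays exponentially.
    return mu + alpha * (x - mu)


rng = np.random.default_rng(0)
steps = 1000
# Made-up drifting data: the true mean moves linearly from 0 to 10 in every dimension.
true_means = np.linspace(0.0, 10.0, steps)[:, None] * np.ones(3)
X = true_means + rng.normal(0.0, 1.0, size=(steps, 3))

seq_mean = np.zeros(3)    # plain 1/N sequential estimate
track_mean = np.zeros(3)  # estimate with forgetting
for i in range(steps):
    seq_mean = seq_mean + (X[i] - seq_mean) / (i + 1)
    track_mean = update_tracking_mean(track_mean, X[i], alpha=0.05)

# Mean squared error against the current (final) true mean.
seq_error = np.mean((true_means[-1] - seq_mean) ** 2)
track_error = np.mean((true_means[-1] - track_mean) ** 2)
```

The trade-off sits in alpha: a larger value follows the drift more closely but gives a noisier estimate, while a smaller value is smoother but lags behind.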