Add some solutions and outputs, tweaking setup to remove references to unused data #44

Merged
49 changes: 45 additions & 4 deletions _episodes/02-regression.md
@@ -301,7 +301,7 @@ plt.show()

Comparing the plots and errors, it seems that a polynomial regression of N=2 is a far superior fit to Dataset II than a linear fit. In fact, our polynomial fit almost perfectly matches Dataset II... which is because Dataset II was created from an N=2 polynomial equation!

> ## Exercise: Perform and compare linear and polynomial fits for Datasets I, III, and IV.
> ## Exercise: Perform and compare linear and polynomial fits for Datasets I, II, III, and IV.
> Which performs better for each dataset? Modify your polynomial regression function to take the degree `N` as an input parameter to your regression model. How does changing the degree of the polynomial fit affect each dataset?
> > ## Solution
> > ~~~
@@ -316,14 +316,49 @@ Comparing the plots and errors it seems like a polynomial regression of N=2 is a
> > ~~~
> > {: .language-python}
> >
> > ![Polynomial regression of dataset I](../fig/regress_polynomial_1st.png)
> > ![Polynomial regression of dataset II](../fig/regress_polynomial_2nd.png)
> > ![Polynomial regression of dataset III](../fig/regress_polynomial_3rd.png)
> > ![Polynomial regression of dataset IV](../fig/regress_polynomial_4th.png)
> >
> > The `N=2` polynomial fit is far better for Dataset II. According to the RMSE, the polynomial is a slightly better fit for Datasets I and III; however, it could be argued that a linear fit is good enough.
> > Dataset III looks like a linear relation with a single outlier, rather than a truly non-linear relation. The polynomial and linear fits perform just as well (or as poorly) on Dataset IV.
> > For Dataset IV it looks like `y` may be a better estimator of `x` than `x` is of `y`.
> > ~~~
> > def pre_process_poly(x, y, N):
> >     # sklearn requires a 2D array, so let's reshape our 1D arrays.
> >     x_data = np.array(x).reshape(-1, 1)
> >     y_data = np.array(y).reshape(-1, 1)
> >
> >     # create a polynomial representation of our data
> >     poly_features = PolynomialFeatures(degree=N)
> >     x_poly = poly_features.fit_transform(x_data)
> >
> >     return x_poly, x_data, y_data
> >
> > def fit_poly_model(x_poly, y_data):
> >     # Define and fit our estimator/model
> >     # (LinearRegression is assumed to be imported earlier in the
> >     # episode, as in the linear-fit code)
> >     poly_regress = LinearRegression()
> >     poly_regress.fit(x_poly, y_data)
> >     return poly_regress
> >
> > def predict_poly_model(poly_regress, x_poly, y_data):
> >     # predict y values and report the error of the fit
> >     # (math and mean_squared_error are assumed to be imported earlier)
> >     poly_data = poly_regress.predict(x_poly)
> >     print("RMSE =", math.sqrt(mean_squared_error(y_data, poly_data)))
> >     return poly_data
> >
> > def plot_poly_model(x_data, poly_data, N):
> >     # visualise!
> >     plt.plot(x_data, poly_data, label="poly fit N=" + str(N))
> >     plt.legend()
> >
> > def fit_predict_plot_poly(x, y, N):
> >     # Combine all of the steps
> >     x_poly, x_data, y_data = pre_process_poly(x, y, N)
> >     poly_regress = fit_poly_model(x_poly, y_data)
> >     poly_data = predict_poly_model(poly_regress, x_poly, y_data)
> >     plot_poly_model(x_data, poly_data, N)
> >
> >     return poly_regress
> >
> > for ds in ["I", "II", "III", "IV"]:
> >     # Sort our data in order of our x (feature) values
> >     data_ds = data[data["dataset"] == ds]
> >     data_ds = data_ds.sort_values("x")
> >     fit_predict_plot_linear(data_ds["x"], data_ds["y"])
> >     for N in range(2, 11):
> >         print("Polynomial degree =", N)
> >         fit_predict_plot_poly(data_ds["x"], data_ds["y"], N)
> >     plt.show()
> > ~~~
> > {: .language-python}
> >
@@ -344,6 +379,12 @@ Comparing the plots and errors it seems like a polynomial regression of N=2 is a
> > With a large enough polynomial you can fit through every point with a unique `x` value.
> > Datasets II and IV remain unchanged beyond `N=2`, as the polynomial has either converged (Dataset II) or cannot model the data (Dataset IV).
> > Datasets I and III slowly decrease their RMSE as N is increased, but it is likely that these more complex models are overfitting the data. Overfitting is discussed later in the lesson.
> >
> > ![Polynomial regression of dataset I with N between 1 and 10](../fig/regress_polynomial_n_1st.png)
> > ![Polynomial regression of dataset II with N between 1 and 10](../fig/regress_polynomial_n_2nd.png)
> > ![Polynomial regression of dataset III with N between 1 and 10](../fig/regress_polynomial_n_3rd.png)
> > ![Polynomial regression of dataset IV with N between 1 and 10](../fig/regress_polynomial_n_4th.png)
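> >
> > The steady fall in RMSE for Datasets I and III can also be checked numerically. Below is a minimal sketch of the effect, assuming `data` is the same DataFrame used in the loop above; `np.polyfit` stands in for the scikit-learn pipeline purely to keep the sketch short:
> >
> > ~~~
> > ds1 = data[data["dataset"] == "I"].sort_values("x")
> > x, y = ds1["x"].to_numpy(), ds1["y"].to_numpy()
> > for N in range(1, 11):
> >     coeffs = np.polyfit(x, y, N)   # fit an N-degree polynomial
> >     pred = np.polyval(coeffs, x)   # evaluate it at our x values
> >     rmse = np.sqrt(np.mean((y - pred) ** 2))
> >     print("N =", N, "RMSE =", round(rmse, 3))
> > ~~~
> > {: .language-python}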
> >
> {: .solution}
{: .challenge}

3 changes: 3 additions & 0 deletions _episodes/05-clustering.md
@@ -266,6 +266,9 @@ plots_labels(circles, labels)
~~~
{: .language-python}

![Kmeans attempting to cluster the concentric circles](../fig/kmeans_concentric_circle_2.png)
![Spectral clustering on the concentric circles](../fig/spectral_concentric_circle_2.png)
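
To see the two behaviours side by side without the lesson's helper functions, here is a minimal, self-contained sketch. The data generation and plotting code are stand-ins for the lesson's own `circles` data and `plots_labels` helper, and the parameter values are illustrative only:

~~~
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

# two concentric rings, similar to the lesson's example data
circles, _ = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)

# k-means partitions by distance to a centroid, so it cuts the rings in half;
# spectral clustering follows a nearest-neighbour graph, so it finds the rings
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(circles)
sp_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(circles)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, labels, title in zip(axes, [km_labels, sp_labels],
                             ["k-means", "spectral clustering"]):
    ax.scatter(circles[:, 0], circles[:, 1], c=labels, s=10)
    ax.set_title(title)
plt.show()
~~~
{: .language-python}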


> ## Comparing k-means and spectral clustering performance
> Modify the program we wrote in the previous exercise to use spectral clustering instead of k-means and save it as a new file.
10 changes: 5 additions & 5 deletions _episodes/06-dimensionality-reduction.md
@@ -246,11 +246,11 @@ Our example here is still a relatively simple example of 8x8 images and not very
> > from mpl_toolkits.mplot3d import Axes3D
> > # PCA
> > pca = decomposition.PCA(n_components=3)
> > pca.fit(x)
> > x_pca = pca.transform(x)
> > pca.fit(features)
> > x_pca = pca.transform(features)
> > fig = plt.figure(1, figsize=(4, 4))
> > ax = fig.add_subplot(projection='3d')
> > ax.scatter(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c=y,
> > ax.scatter(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c=labels,
> > cmap=plt.cm.nipy_spectral, s=9, lw=0)
> > plt.show()
> > ~~~
@@ -262,10 +262,10 @@ Our example here is still a relatively simple example of 8x8 images and not very
> > # t-SNE embedding
> > tsne = manifold.TSNE(n_components=3, init='pca',
> > random_state = 0)
> > x_tsne = tsne.fit_transform(x)
> > x_tsne = tsne.fit_transform(features)
> > fig = plt.figure(1, figsize=(4, 4))
> > ax = fig.add_subplot(projection='3d')
> > ax.scatter(x_tsne[:, 0], x_tsne[:, 1], x_tsne[:, 2], c=y,
> > ax.scatter(x_tsne[:, 0], x_tsne[:, 1], x_tsne[:, 2], c=labels,
> > cmap=plt.cm.nipy_spectral, s=9, lw=0)
> > plt.show()
> > ~~~
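> >
> > As a quick check of how much of the original variation the 3D projections keep, you could inspect the fitted `pca` object from the first block of this solution. A minimal sketch using scikit-learn's `explained_variance_ratio_` attribute:
> >
> > ~~~
> > # fraction of the total variance captured by each of the three components
> > print(pca.explained_variance_ratio_)
> > print("total:", pca.explained_variance_ratio_.sum())
> > ~~~
> > {: .language-python}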
Binary file added fig/kmeans_concentric_circle_2.png
Binary file added fig/regress_polynomial_1st.png
Binary file added fig/regress_polynomial_2nd.png
Binary file added fig/regress_polynomial_3rd.png
Binary file added fig/regress_polynomial_4th.png
Binary file added fig/regress_polynomial_n_1st.png
Binary file added fig/regress_polynomial_n_2nd.png
Binary file added fig/regress_polynomial_n_3rd.png
Binary file added fig/regress_polynomial_n_4th.png
Binary file added fig/spectral_concentric_circle_2.png
53 changes: 30 additions & 23 deletions setup.md
@@ -1,43 +1,50 @@
---
title: Setup
---
# Software Packages Required
# Requirements

You will need to have an installation of Python 3 with the matplotlib, pandas, numpy and optionally opencv packages.
## Software

The [Anaconda Distribution](https://www.anaconda.com/products/individual#Downloads) includes all of these except opencv by default.
You will need a terminal, Python 3.8+, and the ability to create Python virtual environments.

## Installing OpenCV with Anaconda
## Packages

* Load the Anaconda Navigator
* Click on "Environments" on the left hand side.
* Choose "Not Installed" from the pull down menu next to the channels button.
* Type "opencv" into the search box.
* Tick the box next to the opencv package and then click apply.
You will need the Matplotlib, pandas, NumPy, and OpenCV packages.

## Installing from the Anaconda command line
# Setup

From the Anaconda terminal run the command `conda install -c conda-forge opencv`
Create a new directory for the workshop, then launch a terminal in it:

# Download the data
~~~
mkdir workshop-ml
cd workshop-ml
~~~
{: .language-bash}

Please create a sub directory called data in the directory where you save any code you write.
## Creating a new virtual environment

We'll install the prerequisites in a virtual environment, to prevent them from cluttering up your Python environment and causing conflicts.

Download the following files to this directory:
To create a new virtual environment for the project, open the terminal and type:

* [Gapminder Life Expectancy Data](data/gapminder-life-expectancy.csv)
* [World Bank GDP Data](data/worldbank-gdp.csv)
* [World Bank GDP Data with outliers](data/worldbank-gdp-outliers.csv)
~~~
python3 -m venv venv
~~~
{: .language-bash}

> If you're on Linux and this doesn't work, try installing `python3-venv` using your package manager, e.g. `sudo apt-get install python3-venv`.
{: .info}

If you are using a Mac or Linux system the following commands will download this:
## Installing your prerequisites

Activate your virtual environment and install the prerequisites:

~~~
mkdir data
cd data
wget https://scw-aberystwyth.github.io/machine-learning-novice/data/worldbank-gdp.csv
wget https://scw-aberystwyth.github.io/machine-learning-novice/data/worldbank-gdp-outliers.csv
wget https://scw-aberystwyth.github.io/machine-learning-novice/data/gapminder-life-expectancy.csv
source venv/bin/activate
pip install numpy pandas matplotlib opencv-python
~~~
{: .language-bash}

You'll need to re-activate the virtual environment (with `source venv/bin/activate`) each time you open a new terminal during the session.
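
To check that the installation worked, try importing the packages from a Python prompt started inside the activated environment — a minimal check; an `ImportError` means the corresponding package did not install:

~~~
import cv2
import matplotlib
import numpy
import pandas

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
print("matplotlib", matplotlib.__version__)
print("opencv", cv2.__version__)
~~~
{: .language-python}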

{% include links.md %}