Add some solutions and outputs, tweaking setup to remove references to unused data #44

Merged
49 changes: 45 additions & 4 deletions _episodes/02-regression.md
@@ -301,7 +301,7 @@ plt.show()

Comparing the plots and errors, it seems that a polynomial regression of N=2 is a far superior fit to Dataset II than a linear fit. In fact, our polynomial fit almost perfectly matches Dataset II... which is because Dataset II was created from an N=2 polynomial equation!

> ## Exercise: Perform and compare linear and polynomial fits for Datasets I, III, and IV.
> ## Exercise: Perform and compare linear and polynomial fits for Datasets I, II, III, and IV.
> Which performs better for each dataset? Modify your polynomial regression function to take the degree `N` as an input parameter to your regression model. How does changing the degree of the polynomial fit affect each dataset?
> > ## Solution
> > ~~~
@@ -316,14 +316,49 @@ Comparing the plots and errors it seems like a polynomial regression of N=2 is a
> > ~~~
> > {: .language-python}
> >
> > ![Polynomial regression of dataset I](../fig/regress_polynomial_1st.png)
> > ![Polynomial regression of dataset II](../fig/regress_polynomial_2nd.png)
> > ![Polynomial regression of dataset III](../fig/regress_polynomial_3rd.png)
> > ![Polynomial regression of dataset IV](../fig/regress_polynomial_4th.png)
> >
> > The `N=2` polynomial fit is far better for Dataset II. According to the RMSE, the polynomial is a slightly better fit for Datasets I and III; however, it could be argued that a linear fit is good enough.
> > Dataset III looks like a linear relation with a single outlier, rather than a truly non-linear relation. The polynomial and linear fits perform just as well (or as poorly) on Dataset IV.
> > For Dataset IV it looks like `y` may be a better estimator of `x` than `x` is of `y`.
> > ~~~
> > def pre_process_poly(x, y, N):
> >     # sklearn requires a 2D array, so let's reshape our 1D arrays.
> >     x_data = np.array(x).reshape(-1, 1)
> >     y_data = np.array(y).reshape(-1, 1)
> >
> >     # create a polynomial representation of our data
> >     poly_features = PolynomialFeatures(degree=N)
> >     x_poly = poly_features.fit_transform(x_data)
> >
> >     return x_poly, x_data, y_data
> >
> > def fit_poly_model(x_poly, y_data):
> >     # Define and fit our estimator/model
> >     # (LinearRegression is assumed to be imported earlier in the
> >     # episode, as in the linear-fit code)
> >     poly_regress = LinearRegression()
> >     poly_regress.fit(x_poly, y_data)
> >     return poly_regress
> >
> > def predict_poly_model(poly_regress, x_poly, y_data):
> >     # predict y values and report the error of the fit
> >     # (math and mean_squared_error are assumed to be imported earlier)
> >     poly_data = poly_regress.predict(x_poly)
> >     print("RMSE =", math.sqrt(mean_squared_error(y_data, poly_data)))
> >     return poly_data
> >
> > def plot_poly_model(x_data, poly_data, N):
> >     # visualise!
> >     plt.plot(x_data, poly_data, label="poly fit N=" + str(N))
> >     plt.legend()
> >
> > def fit_predict_plot_poly(x, y, N):
> >     # Combine all of the steps
> >     x_poly, x_data, y_data = pre_process_poly(x, y, N)
> >     poly_regress = fit_poly_model(x_poly, y_data)
> >     poly_data = predict_poly_model(poly_regress, x_poly, y_data)
> >     plot_poly_model(x_data, poly_data, N)
> >
> >     return poly_regress
> >
> > for ds in ["I", "II", "III", "IV"]:
> >     # Sort our data in order of our x (feature) values
> >     data_ds = data[data["dataset"] == ds]
> >     data_ds = data_ds.sort_values("x")
> >     fit_predict_plot_linear(data_ds["x"], data_ds["y"])
> >     for N in range(2, 11):
> >         print("Polynomial degree =", N)
> >         fit_predict_plot_poly(data_ds["x"], data_ds["y"], N)
> >     plt.show()
> > ~~~
> > {: .language-python}
> >
@@ -344,6 +379,12 @@ Comparing the plots and errors it seems like a polynomial regression of N=2 is a
> > With a large enough polynomial you can fit through every point with a unique `x` value.
> > Datasets II and IV remain unchanged beyond `N=2`, as the polynomial has either converged (Dataset II) or cannot model the data (Dataset IV).
> > Datasets I and III slowly decrease their RMSE as N is increased, but it is likely that these more complex models are overfitting the data. Overfitting is discussed later in the lesson.
> >
> > ![Polynomial regression of dataset I with N between 1 and 10](../fig/regress_polynomial_n_1st.png)
> > ![Polynomial regression of dataset II with N between 1 and 10](../fig/regress_polynomial_n_2nd.png)
> > ![Polynomial regression of dataset III with N between 1 and 10](../fig/regress_polynomial_n_3rd.png)
> > ![Polynomial regression of dataset IV with N between 1 and 10](../fig/regress_polynomial_n_4th.png)
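> >
> > The steady fall in RMSE for Datasets I and III can also be checked numerically. Below is a minimal sketch of the effect, assuming `data` is the same DataFrame used in the loop above; `np.polyfit` stands in for the scikit-learn pipeline purely to keep the sketch short:
> >
> > ~~~
> > ds1 = data[data["dataset"] == "I"].sort_values("x")
> > x, y = ds1["x"].to_numpy(), ds1["y"].to_numpy()
> > for N in range(1, 11):
> >     coeffs = np.polyfit(x, y, N)   # fit an N-degree polynomial
> >     pred = np.polyval(coeffs, x)   # evaluate it at our x values
> >     rmse = np.sqrt(np.mean((y - pred) ** 2))
> >     print("N =", N, "RMSE =", round(rmse, 3))
> > ~~~
> > {: .language-python}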
> >
> {: .solution}
{: .challenge}

3 changes: 3 additions & 0 deletions _episodes/05-clustering.md
@@ -266,6 +266,9 @@ plots_labels(circles, labels)
~~~
{: .language-python}

![Kmeans attempting to cluster the concentric circles](../fig/kmeans_concentric_circle_2.png)
![Spectral clustering on the concentric circles](../fig/spectral_concentric_circle_2.png)
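
To see the two behaviours side by side without the lesson's helper functions, here is a minimal, self-contained sketch. The data generation and plotting code are stand-ins for the lesson's own `circles` data and `plots_labels` helper, and the parameter values are illustrative only:

~~~
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

# two concentric rings, similar to the lesson's example data
circles, _ = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)

# k-means partitions by distance to a centroid, so it cuts the rings in half;
# spectral clustering follows a nearest-neighbour graph, so it finds the rings
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(circles)
sp_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(circles)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, labels, title in zip(axes, [km_labels, sp_labels],
                             ["k-means", "spectral clustering"]):
    ax.scatter(circles[:, 0], circles[:, 1], c=labels, s=10)
    ax.set_title(title)
plt.show()
~~~
{: .language-python}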


> ## Comparing k-means and spectral clustering performance
> Modify the program we wrote in the previous exercise to use spectral clustering instead of k-means and save it as a new file.
10 changes: 5 additions & 5 deletions _episodes/06-dimensionality-reduction.md
@@ -246,11 +246,11 @@ Our example here is still a relatively simple example of 8x8 images and not very
> > from mpl_toolkits.mplot3d import Axes3D
> > # PCA
> > pca = decomposition.PCA(n_components=3)
> > pca.fit(x)
> > x_pca = pca.transform(x)
> > pca.fit(features)
> > x_pca = pca.transform(features)
> > fig = plt.figure(1, figsize=(4, 4))
> > ax = fig.add_subplot(projection='3d')
> > ax.scatter(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c=y,
> > ax.scatter(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c=labels,
> > cmap=plt.cm.nipy_spectral, s=9, lw=0)
> > plt.show()
> > ~~~
@@ -262,10 +262,10 @@ Our example here is still a relatively simple example of 8x8 images and not very
> > # t-SNE embedding
> > tsne = manifold.TSNE(n_components=3, init='pca',
> > random_state = 0)
> > x_tsne = tsne.fit_transform(x)
> > x_tsne = tsne.fit_transform(features)
> > fig = plt.figure(1, figsize=(4, 4))
> > ax = fig.add_subplot(projection='3d')
> > ax.scatter(x_tsne[:, 0], x_tsne[:, 1], x_tsne[:, 2], c=y,
> > ax.scatter(x_tsne[:, 0], x_tsne[:, 1], x_tsne[:, 2], c=labels,
> > cmap=plt.cm.nipy_spectral, s=9, lw=0)
> > plt.show()
> > ~~~
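> >
> > As a quick check of how much of the original variation the 3D projections keep, you could inspect the fitted `pca` object from the first block of this solution. A minimal sketch using scikit-learn's `explained_variance_ratio_` attribute:
> >
> > ~~~
> > # fraction of the total variance captured by each of the three components
> > print(pca.explained_variance_ratio_)
> > print("total:", pca.explained_variance_ratio_.sum())
> > ~~~
> > {: .language-python}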
Binary file added fig/kmeans_concentric_circle_2.png
Binary file added fig/regress_polynomial_1st.png
Binary file added fig/regress_polynomial_2nd.png
Binary file added fig/regress_polynomial_3rd.png
Binary file added fig/regress_polynomial_4th.png
Binary file added fig/regress_polynomial_n_1st.png
Binary file added fig/regress_polynomial_n_2nd.png
Binary file added fig/regress_polynomial_n_3rd.png
Binary file added fig/regress_polynomial_n_4th.png
Binary file added fig/spectral_concentric_circle_2.png
53 changes: 30 additions & 23 deletions setup.md
@@ -1,43 +1,50 @@
---
title: Setup
---
# Software Packages Required
# Requirements

You will need to have an installation of Python 3 with the matplotlib, pandas, numpy and optionally opencv packages.
## Software

The [Anaconda Distribution](https://www.anaconda.com/products/individual#Downloads) includes all of these except opencv by default.
You will need a terminal, Python 3.8+, and the ability to create Python virtual environments.

## Installing OpenCV with Anaconda
## Packages

* Load the Anaconda Navigator
* Click on "Environments" on the left hand side.
* Choose "Not Installed" from the pull down menu next to the channels button.
* Type "opencv" into the search box.
* Tick the box next to the opencv package and then click apply.
You will need the Matplotlib, pandas, NumPy, and OpenCV packages.

## Installing from the Anaconda command line
# Setup

From the Anaconda terminal run the command `conda install -c conda-forge opencv`
Create a new directory for the workshop, then launch a terminal in it:

# Download the data
~~~
mkdir workshop-ml
cd workshop-ml
~~~
{: .language-bash}

Please create a sub directory called data in the directory where you save any code you write.
## Creating a new virtual environment

We'll install the prerequisites in a virtual environment, to prevent them from cluttering up your Python environment and causing conflicts.

Download the following files to this directory:
To create a new virtual environment for the project, open the terminal and type:

* [Gapminder Life Expectancy Data](data/gapminder-life-expectancy.csv)
* [World Bank GDP Data](data/worldbank-gdp.csv)
* [World Bank GDP Data with outliers](data/worldbank-gdp-outliers.csv)
~~~
python3 -m venv venv
~~~
{: .language-bash}

> If you're on Linux and this doesn't work, try installing `python3-venv` using your package manager, e.g. `sudo apt-get install python3-venv`.
{: .info}

If you are using a Mac or Linux system the following commands will download this:
## Installing your prerequisites

Activate your virtual environment and install the prerequisites:

~~~
mkdir data
cd data
wget https://scw-aberystwyth.github.io/machine-learning-novice/data/worldbank-gdp.csv
wget https://scw-aberystwyth.github.io/machine-learning-novice/data/worldbank-gdp-outliers.csv
wget https://scw-aberystwyth.github.io/machine-learning-novice/data/gapminder-life-expectancy.csv
source venv/bin/activate
pip install numpy pandas matplotlib opencv-python
~~~
{: .language-bash}

You'll need to re-activate the virtual environment (with `source venv/bin/activate`) each time you open a new terminal during the session.
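
To check that the installation worked, try importing the packages from a Python prompt started inside the activated environment — a minimal check; an `ImportError` means the corresponding package did not install:

~~~
import cv2
import matplotlib
import numpy
import pandas

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
print("matplotlib", matplotlib.__version__)
print("opencv", cv2.__version__)
~~~
{: .language-python}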

{% include links.md %}