diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md
index 033645d..816d1c3 100644
--- a/_episodes/01-introduction.md
+++ b/_episodes/01-introduction.md
@@ -25,7 +25,7 @@ Machine learning is a set of techniques that enable computers to use data to imp
 The term machine learning (ML) is often mentioned alongside artificial intelligence (AI) and deep learning (DL). Deep learning is a subset of machine learning, and machine learning is a subset of artificial intelligence.
 
-AI is a broad term used to describe a system possessing a "general intelligence" that can be applied to solve a diverse range of problems, often mimicking the behaviour of intelligent biological systems. Modern attempts are getting close to fooling humans, but while there have been great advances in AI and ML research, human-like intelligence is only possible in a few specialist areas. Despite this technical definition, AI is often used to describe ML and DL systems in general.
+AI is increasingly used as a catch-all term for systems that encompass ML and DL - from simple email spam filters, to more complex image recognition systems, to large language models such as ChatGPT. The more specific term "Artificial General Intelligence" (AGI) describes a system possessing a "general intelligence" that can be applied to solve a diverse range of problems, often mimicking the behaviour of intelligent biological systems. Modern attempts at AGI are getting close to fooling humans, but while there have been great advances in AI research, human-like intelligence is only possible in a few specialist areas.
 
 ML refers to techniques where a computer can "learn" patterns in data, usually by being shown many training examples. While ML algorithms can learn to solve specific problems, or multiple similar problems, they are not considered to possess a general intelligence. ML algorithms often need hundreds or thousands of examples to learn a task and are confined to activities such as simple classifications. A human-like system could learn much more quickly than this, and could potentially learn from a single example by using its knowledge of many other problems.
@@ -104,14 +104,16 @@ Machine learning is about creating models from data: for that reason, we'll star
 
 Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features]
 
+We typically have a "features matrix" (usually referred to by the code variable `X`) which contains the "features" data we wish to train on.
+
 * n_samples: The number of samples. A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
 * n_features: The number of features (variables) that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.
 
+If we want our ML models to make predictions or classifications, we also provide "labels" as the expected "answers/results". The model is then trained on the input features and tries to match the labels we provide. This is done by providing a "target array" (usually referred to by the code variable `y`) which contains the "labels or values" that we wish to predict from the features data.
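+
+As a minimal sketch (toy numbers, not data from this lesson), the expected shapes look like this:
+
+~~~
+import numpy as np
+
+# Features matrix: one row per sample, one column per feature
+X = np.array([[1.0, 2.0],   # sample 0
+              [3.0, 4.0],   # sample 1
+              [5.0, 6.0]])  # sample 2
+
+# Target array: one label per sample
+y = np.array([0, 1, 0])
+
+print(X.shape)  # (3, 2) -> n_samples=3, n_features=2
+~~~
+{: .language-python}
+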
+![The features matrix `X` and target array `y` expected by scikit-learn](../fig/introduction/sklearn_input.png)
Figure from the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
 
-If we want our ML models to make predictions or classifications, we also provide "labels" as our expected "answers/results". The model will then be trained on the input features to try and match our provided labels.
-
 # What will we cover today?
 
 This lesson will introduce you to some of the key concepts and sub-domains of ML such as supervised learning, unsupervised learning, and neural networks.
diff --git a/_episodes/03-classification.md b/_episodes/03-classification.md
index 447ff37..db51cf0 100644
--- a/_episodes/03-classification.md
+++ b/_episodes/03-classification.md
@@ -16,9 +16,9 @@ keypoints:
 Classification is a supervised method to recognise and group data objects into pre-determined categories. Where regression uses labelled observations to predict a continuous numerical value, classification predicts a discrete category, or class, for each observation. Classification in ML leverages a wide range of algorithms to sort a set of data/datasets into their respective categories.
 
-In this lesson we are going to introduce the concept of supervised classification by classifying penguin data into different species of penguins using Scikit-Learn.
+In this episode we are going to introduce the concept of supervised classification by classifying penguin data into different species of penguins using Scikit-Learn.
 
-### The penguins dataset
+## The penguins dataset
 We're going to be using the penguins dataset of Allison Horst, published [here](https://github.com/allisonhorst/palmerpenguins) in 2020, which comprises 342 observations of three species of penguins: Adelie, Chinstrap & Gentoo. For each penguin we have measurements of bill length and depth (mm), flipper length (mm), body mass (g), and information on species, island, and sex.
 
 ~~~
@@ -35,15 +31,31 @@ As a rule of thumb for ML/DL modelling, it is best to start with a simple model
 
 For this lesson we will limit our dataset to only numerical values such as bill_length, bill_depth, flipper_length, and body_mass while we attempt to classify species.
 
-The above table contains multiple categorical objects such as species. If we attempt to include the other categorical fields, island and sex, we hinder classification performance due to the complexity of the data.
+The above table contains multiple categorical objects such as species. If we attempt to include the other categorical fields, island and sex, we might hinder classification performance due to the complexity of the data.
 
-### Training-testing split
-When undertaking any machine learning project, it's important to be able to evaluate how well your model works. In order to do this, we set aside some data (usually 20%) as a testing set, leaving the rest as your training dataset.
+## Training-testing split
+When undertaking any machine learning project, it's important to be able to evaluate how well your model works.
 
-> ## Why do we do this?
-> It's important to do this early, and to do all of your work with the training dataset - this avoids any risk of you introducing bias to the model based on your own observations of data in the testing set, and can highlight when you are over-fitting on your training data.
+Rather than evaluating this manually, we can set aside some of our data (usually 20%) and use it as a testing dataset.
We then train on the remaining 80% and use the testing dataset to evaluate the accuracy of our trained model.
+
+We lose a bit of training data in the process, but we can now easily evaluate the performance of our model. With more advanced train-test split techniques we can even recover this lost training data!
+
+> ## Why do we do this?
+> It's important to do this early, and to do all of your work with the training dataset - this avoids any risk of you introducing bias to the model based on your own manual observations of data in the testing set (after all, we want the model to make the decisions about parameters!). This can also highlight when you are over-fitting on your training data.
{: .callout}
+
+How we split the data into training and testing sets is also extremely important. We need to make sure that our training data is representative of both our test data and the actual data.
+
+For classification problems this means we should ensure that each class of interest is represented proportionately in both training and testing sets (a quick way to do this is shown in the sketch after the split below). For regression problems we should ensure that our training and test sets cover the range of feature values that we wish to predict.
+
+In the previous regression episode we created the penguin training data by taking the first 146 samples of the dataset. Unfortunately the penguin data is sorted by species, so our training data only contained one type of penguin and thus was not representative of the actual data we tried to fit. We could have avoided this issue by randomly shuffling our penguin samples before splitting the data.
+
+> ## When not to shuffle your data
+> Sometimes your data is dependent on its ordering, such as time-series data where past values influence future predictions. Creating train-test splits for this can be tricky at first glance, but fortunately there are existing techniques to tackle this (such as scikit-learn's `TimeSeriesSplit`): see [Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators) for more information.
{: .callout}
+
+Let's do some pre-processing on our dataset and specify our `X` features and `Y` labels:
+
 ~~~
 # Extract the data we need
 feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
@@ -57,7 +73,7 @@ Y = dataset['species']
 ~~~
 {: .language-python}
 
-Having extracted our features (X) and labels (Y), we can now split the data
+Having extracted our features `X` and labels `Y`, we can now split the data
 using the `train_test_split` function. We specify the fraction of data to use as test data, and the function randomly shuffles our data prior to splitting:
 
 ~~~
 from sklearn.model_selection import train_test_split
@@ -66,7 +82,7 @@ x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_
 ~~~
 {: .language-python}
 
-We'll use x_train and y_train to develop our model, and only look at x_test and y_test when it's time to evaluate its performance.
+We'll use `x_train` and `y_train` to develop our model, and only look at `x_test` and `y_test` when it's time to evaluate its performance.
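+
+Our penguin species are fairly balanced, so a plain shuffled split works well here. If the classes were unbalanced, one option is the `stratify` argument of `train_test_split`, which preserves the class proportions in both sets. A minimal sketch (an optional variation, not used in the rest of this episode; the `_s` variable names are just for illustration):
+
+~~~
+# Stratified split: each species appears in the same proportion
+# in the training and testing sets
+x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
+    X, Y, test_size=0.2, random_state=0, stratify=Y)
+~~~
+{: .language-python}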
 
 ### Visualising the data
 In order to better understand how a model might classify this data, we can first take a look at the data visually, to see what patterns we might identify.
@@ -97,27 +113,33 @@ We can see that penguins from each species form fairly distinct spatial clusters
 
 ## Classification using a decision tree
 We'll first apply a decision tree classifier to the data.
 Decision trees are conceptually similar to flow diagrams (or more precisely for the biologists: dichotomous keys). They split the classification problem into a binary tree of comparisons, at each step comparing a measurement to a value, and moving left or right down the tree until a classification is reached.
 
-(figure)
+![Decision tree for classifying penguins](../fig/decision_tree_example.png)
+
 Training and using a decision tree in Scikit-Learn is straightforward:
 
 ~~~
 from sklearn.tree import DecisionTreeClassifier, plot_tree
 
-clf = DecisionTreeClassifier()
+clf = DecisionTreeClassifier(max_depth=2)
 clf.fit(x_train, y_train)
 clf.predict(x_test)
 ~~~
 {: .language-python}
 
+> ## Hyper-parameters: parameters that tune a model
+> 'Max Depth' is an example of a *hyper-parameter* for the decision tree model. Where models use the parameters of an observation to predict a result, hyper-parameters are used to tune how a model works. Each model you encounter will have its own set of hyper-parameters, each of which affects model behaviour and performance in a different way. The process of adjusting hyper-parameters in order to improve model performance is called hyper-parameter tuning.
+{: .callout}
+
 We can conveniently check how our model did with the `.score()` function, which will make predictions and report what proportion of them were accurate:
 
 ~~~
 clf_score = clf.score(x_test, y_test)
+print(clf_score)
 ~~~
 {: .language-python}
 
-We can also look at the decision tree that was generated:
+Our model reports an accuracy of ~98% on the test data! We can also look at the decision tree that was generated:
 
 ~~~
 fig = plt.figure(figsize=(12, 10))
@@ -126,15 +148,20 @@ plt.show()
 ~~~
 {: .language-python}
 
-![Decision tree for classifying penguins](../fig/e3_dt_6.png)
+![Decision tree for classifying penguins](../fig/e3_dt_2.png)
+
+The first question (`depth=1`) splits the training data into "Adelie" and "Gentoo" branches using the criterion `flipper_length_mm <= 206.5`, and the next two questions (`depth=2`) split those branches into "Adelie & Chinstrap" and "Gentoo & Chinstrap" predictions.
+
-We can see from this that there's some very tortuous logic being used to tease out every single observation in the training set. For example, the single purple Gentoo node at the bottom of the tree. If we truncated that branch to the second level (Chinstrap), we'd have a little inaccuracy, a total of 9 non-Chinstraps in with 48 Chinstraps, but a less convoluted model.
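+
+If you want to check which measurements the tree actually relies on to make these splits, one option is the fitted model's `feature_importances_` attribute - a minimal sketch, reusing the `clf` and `feature_names` defined above:
+
+~~~
+import pandas as pd
+
+# One importance score per feature, in the same order as feature_names
+importances = pd.Series(clf.feature_importances_, index=feature_names)
+print(importances.sort_values(ascending=False))
+~~~
+{: .language-python}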
+
 ### Visualising the classification space
-We can visualise the delineation produced, but only for two parameters at a time, so the model produced here isn't exactly the same as that used above:
+We can visualise the classification space (decision tree boundaries) to get a more intuitive feel for what it is doing. Note that our 2D plot can only show two parameters at a time, so here we will train a new model on only two of the features:
 
 ~~~
 from sklearn.inspection import DecisionBoundaryDisplay
@@ -142,7 +169,7 @@
 f1 = feature_names[0]
 f2 = feature_names[3]
 
-clf = DecisionTreeClassifier()
+clf = DecisionTreeClassifier(max_depth=2)
 clf.fit(x_train[[f1, f2]], y_train)
 
 d = DecisionBoundaryDisplay.from_estimator(clf, x_train[[f1, f2]])
@@ -152,69 +179,17 @@ plt.show()
 ~~~
 {: .language-python}
 
-![Classification space for our decision tree](../fig/e3_dt_space_6.png)
-
-We can see that rather than clean lines between species, the decision tree produces orthogonal regions as each decision only considers a single parameter. Again, we can see that the model is over-fitting as the decision space is far more complex than needed, with regions that only select a single point.
+![Classification space for our decision tree](../fig/e3_dt_space_2.png)
+
+## Tuning the `max_depth` hyper-parameter
 
-## Classification using support vector machines
-Next, we'll look at another commonly used classification algorithm, and see how it compares. Support Vector Machines (SVM) work in a way that is conceptually similar to your own intuition when first looking at the data. They devise a set of hyperplanes that delineate the parameter space, such that each region contains ideally only observations from one class, and the boundaries fall between classes.
-
-### Normalising data
-Unlike decision trees, SVMs require an additional pre-processing step for our data. We need to normalise it. Our raw data has parameters with different magnitudes such as bill length measured in 10's of mm's, whereas body mass is measured in 1000's of grams. If we trained an SVM directly on this data, it would only consider the parameter with the greatest variance (body mass).
+Our decision tree with `max_depth=2` is fairly simple, and there are still some incorrect predictions in our final classifications. Let's try varying the `max_depth` hyper-parameter to see if we can improve our model predictions.
 
-Normalising maps each parameter to a new range so that it has a mean of 0 and a standard deviation of 1.
+
 ~~~
-from sklearn import preprocessing
 import pandas as pd
 
-scalar = preprocessing.StandardScaler()
-scalar.fit(x_train)
-x_train_scaled = pd.DataFrame(scalar.transform(x_train), columns=x_train.columns, index=x_train.index)
-x_test_scaled = pd.DataFrame(scalar.transform(x_test), columns=x_test.columns, index=x_test.index)
-~~~
-{: .language-python}
-
-Note that we fit the scalar to our training data - we then use this same pre-trained scalar to transform our testing data.
-
-With this scaled data, training the models works exactly the same as before.
- -~~~ -from sklearn import svm - -SVM = svm.SVC(kernel='poly', degree=3, C=1.5) -SVM.fit(x_train_scaled, y_train) - -svm_score = SVM.score(x_test_scaled, y_test) -print("Decision tree score is ", clf_score) -print("SVM score is ", svm_score) -~~~ -{: .language-python} - -We can again visualise the decision space produced, also using only two parameters: - -~~~ -x2 = x_train_scaled[[feature_names[0], feature_names[1]]] - -SVM = svm.SVC(kernel='poly', degree=3, C=1.5) -SVM.fit(x2, y_train) - -DecisionBoundaryDisplay.from_estimator(SVM, x2) #, ax=ax -sns.scatterplot(x2, x=feature_names[0], y=feature_names[1], hue=dataset['species']) -plt.show() -~~~ -{: .language-python} - -![Classification space generated by the SVM model](../fig/e3_svc_space.png) - -While this SVM model performs slightly worse than our decision tree (95.6% vs. 97.1%), we can see that the decision space is much simpler, and less likely to be overfit to the data. - - -## Reducing over-fitting in the decision tree -We can reduce the over-fitting of our decision tree model by limiting its depth, forcing it to use less decisions to produce a classification, and resulting in a simpler decision space. - -~~~ max_depths = [1, 2, 3, 4, 5] accuracy = [] @@ -236,13 +211,12 @@ plt.show() ![Performance of decision trees of various depths](../fig/e3_dt_overfit.png) -Here we can see that a maximum depth of two performs just as well as our original model with a depth of five. In this example it even performs a little better. +Here we can see that a `max_depth=2` performs slightly better on the test data than those with `max_depth > 2`. This can seem counter intuitive, as surely more questions should be able to better split up our categories and thus give better predictions? - -Reusing our visualisation code from above, we can inspect our simplified decision tree and decision space: +Let's reuse our fitting and plotting codes from above to inspect a decision tree that has `max_depth=5`: ~~~ -clf = DecisionTreeClassifier(max_depth=2) +clf = DecisionTreeClassifier(max_depth=5) clf.fit(x_train, y_train) fig = plt.figure(figsize=(12, 10)) @@ -251,15 +225,14 @@ plt.show() ~~~ {: .language-python} -![Simplified decision tree](../fig/e3_dt_2.png) - -Noting the added max_depth=2 parameter. +![Simplified decision tree](../fig/e3_dt_6.png) +It looks like our decision tree has split up the training data into the correct penguin categories and more accurately than the `max_depth=2` model did, however it used some very specific questions to split up the penguins into the correct categories. Let's try visualising the classification space for a more intuitive understanding: ~~~ f1 = feature_names[0] f2 = feature_names[3] -clf = DecisionTreeClassifier(max_depth=2) +clf = DecisionTreeClassifier(max_depth=5) clf.fit(x_train[[f1, f2]], y_train) d = DecisionBoundaryDisplay.from_estimator(clf, x_train[[f1, f2]]) @@ -269,17 +242,64 @@ plt.show() ~~~ {: .language-python} -![Classification space of the simplified decision tree](../fig/e3_dt_space_2.png) +![Classification space of the simplified decision tree](../fig/e3_dt_space_6.png) -We can see that both the tree and the decision space are much simpler, but still do a good job of classifying our data. We've succeeded in reducing over-fitting. +Earlier we saw that the `max_depth=2` model split the data into 3 simple bounding boxes, whereas for `max_depth=5` we see the model has created some very specific classification boundaries to correctly classify every point in the training data. 
-> ## Hyper-parameters: parameters that tune a model
-> 'Max Depth' is an example of a *hyper-parameter* for the decision tree model. Where models use the parameters of an observation to predict a result, hyper-parameters are used to tune how a model works. Each model you encounter will have its own set of hyper-parameters, each of which affects model behaviour and performance in a different way. The process of adjusting hyper-parameters in order to improve model performance is called hyper-parameter tuning.
-{: .callout}
+
+This is a classic case of over-fitting - our model has produced extremely specific decision boundaries that work for the training data but are not representative of our test data. Sometimes simplicity is better!
+
+
+## Classification using support vector machines
+Next, we'll look at another commonly used classification algorithm, and see how it compares. Support Vector Machines (SVM) work in a way that is conceptually similar to your own intuition when first looking at the data. They devise a set of hyperplanes that delineate the parameter space, such that each region ideally contains only observations from one class, and the boundaries fall between classes.
+
+### Normalising data
+Unlike decision trees, SVMs require an additional pre-processing step for our data. We need to normalise it. Our raw data has parameters with different magnitudes such as bill length measured in 10's of mm's, whereas body mass is measured in 1000's of grams. If we trained an SVM directly on this data, it would only consider the parameter with the greatest variance (body mass).
+
+Normalising maps each parameter to a new range so that it has a mean of 0 and a standard deviation of 1.
+
+~~~
+from sklearn import preprocessing
+import pandas as pd
+
+scaler = preprocessing.StandardScaler()
+scaler.fit(x_train)
+x_train_scaled = pd.DataFrame(scaler.transform(x_train), columns=x_train.columns, index=x_train.index)
+x_test_scaled = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns, index=x_test.index)
+~~~
+{: .language-python}
+
+Note that we fit the scaler to our training data - we then use this same pre-trained scaler to transform our testing data.
+
+With this scaled data, training the models works exactly the same as before.
+
+~~~
+from sklearn import svm
+
+SVM = svm.SVC(kernel='poly', degree=3, C=1.5)
+SVM.fit(x_train_scaled, y_train)
+
+svm_score = SVM.score(x_test_scaled, y_test)
+print("Decision tree score is ", clf_score)
+print("SVM score is ", svm_score)
+~~~
+{: .language-python}
+
+We can again visualise the decision space produced, also using only two parameters:
+
+~~~
+x2 = x_train_scaled[[feature_names[0], feature_names[1]]]
+
+SVM = svm.SVC(kernel='poly', degree=3, C=1.5)
+SVM.fit(x2, y_train)
+
+DecisionBoundaryDisplay.from_estimator(SVM, x2)
+sns.scatterplot(x2, x=feature_names[0], y=feature_names[1], hue=y_train)
+plt.show()
+~~~
+{: .language-python}
+
+![Classification space generated by the SVM model](../fig/e3_svc_space.png)
+
+While this SVM model performs slightly worse than our decision tree (95.6% vs. 98.5%), it's likely that its non-linear boundaries will perform better as the models are exposed to more real data, since decision trees are prone to over-fitting and need many orthogonal splits to approximate simple non-linear boundaries. It's important to pick a model that is appropriate for your problem and data trends!
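+
+To convince yourself that the normalisation step matters, one quick check is to train the same SVM on the raw, unscaled features and compare scores - a minimal sketch reusing the variables from above (the `SVM_raw` name is ours, and your exact numbers may vary):
+
+~~~
+# Same SVM, but trained on the unscaled features
+SVM_raw = svm.SVC(kernel='poly', degree=3, C=1.5)
+SVM_raw.fit(x_train, y_train)
+
+print("SVM score without scaling is ", SVM_raw.score(x_test, y_test))
+print("SVM score with scaling is ", svm_score)
+~~~
+{: .language-python}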
-### Note that care is needed when splitting data
-- You generally want to ensure that each class is represented proportionately in both training and testing (beware of just taking the first 80%).
-- Sometimes you want to make sure a group is excluded from the train/test split, e.g.: when multiple samples come from one individual.
-- This is often called stratification
-See [Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators) for more information.
diff --git a/fig/decision_tree_example.png b/fig/decision_tree_example.png
new file mode 100644
index 0000000..74c51f6
Binary files /dev/null and b/fig/decision_tree_example.png differ
diff --git a/fig/e3_dt_2.png b/fig/e3_dt_2.png
index 4bd3d95..363985c 100644
Binary files a/fig/e3_dt_2.png and b/fig/e3_dt_2.png differ
diff --git a/fig/e3_dt_space_2.png b/fig/e3_dt_space_2.png
index 13b4d01..dd4da2b 100644
Binary files a/fig/e3_dt_space_2.png and b/fig/e3_dt_space_2.png differ