
# Ensemble methods

What's better than one decision tree? Perhaps two, or three? How about enough trees to make up a forest?
Ensemble methods bundle individual models together and use each of their outputs to contribute towards a final consensus for a given problem. They are based on the mantra that the whole is greater than the sum of the parts.

Thinking back to the classification episode with decision trees, we quickly stumbled into the problem of overfitting our training data. If we combine the predictions from a series of over- or under-fitting estimators, we can often produce a better final prediction than a single reliable model would give, in the same way that humans often gather multiple opinions on a scenario before deciding on a final outcome. Decision trees and regressions are often very sensitive to training outliers, which makes them well suited to being part of an ensemble.

Ensemble methods are used for a variety of applications including, but not limited to, search systems and object detection. We can use any model/estimator available in scikit-learn to create an ensemble. There are three main approaches to creating ensembles:

* Stacking
* Bagging
* Boosting

Let's explore them in a bit more depth.

### Stacking

This is where we train a series of different models/estimators on the same input data, in parallel. We then take the output of each model and pass the outputs into a final decision algorithm/model that makes the overall prediction.

If we trained the same model multiple times on the same data we would expect very similar answers, so the emphasis with stacking is to choose different models that can build up a reliable consensus. Regression is typically a good choice for the final decision-making model.

![Stacking](../fig/stacking.jpeg)

[Image from Vasily Zubarev via their blog](https://vas3k.com/blog/machine_learning/)
### Bagging

Bagging (short for bootstrap aggregating) is where we train the same type of model/estimator on several different, randomly-drawn subsets of the training data, in parallel, and then combine the individual predictions (for example by majority vote or averaging) to make the final prediction.

The most common example is known as the Random Forest algorithm, which we'll take a closer look at later in this episode.

### Boosting

This is where we train a single type of model/estimator on an initial dataset, test its accuracy, and then subsequently train the same type of model on the poorly predicted samples, i.e. each new model pays most attention to the data that were incorrectly predicted by the last one.

Just like bagging, boosting trains mostly on subsets of the data; in this case, however, the subsets are not randomly generated but are instead built from the poorly estimated predictions. Boosting can produce some very high accuracies by learning from its mistakes, but due to the iterative nature of these improvements it doesn't parallelize well, unlike the other ensemble methods. Despite this it can still be a faster, and computationally cheaper, alternative to neural networks.

![Boosting](../fig/boosting.jpeg)

[Image from Vasily Zubarev via their blog](https://vas3k.com/blog/machine_learning/)
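To make this concrete, here is a minimal sketch of boosting in scikit-learn using AdaBoost, one classic boosting algorithm. The synthetic dataset and hyper-parameters here are purely illustrative assumptions:

~~~
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# a small synthetic classification problem for illustration
X_demo, y_demo = make_classification(n_samples=200, random_state=5)

# each new estimator in the sequence pays most attention to the
# samples that the previous estimators predicted poorly
boost = AdaBoostClassifier(n_estimators=50, random_state=5)
boost.fit(X_demo, y_demo)

print(boost.score(X_demo, y_demo))
~~~
{: .language-python}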

### Ensemble summary

Machine learning jargon can often be hard to remember, so here is a quick summary of the 3 ensemble methods:

* Stacking - same dataset, different models, trained in parallel
* Bagging - different subsets, same models, trained in parallel
* Boosting - subsets of bad estimates, same models, trained in series

## Using Bagging (Random Forests) for a classification problem

In this session we'll take another look at the penguins data, applying one of the most common bagging approaches, random forests, to try and solve our species classification problem. First we'll load in the dataset and define a train and test split.

~~~
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# load penguins data
penguins = sns.load_dataset('penguins')

# prepare and define our data and targets
feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
penguins.dropna(subset=feature_names, inplace=True)
species_names = penguins['species'].unique()

X = penguins[feature_names]
y = penguins.species

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

print("train size:", X_train.shape)
print("test size:", X_test.shape)
~~~
{: .language-python}

For comparison with our random forest example we'll also create a single decision tree estimator, as we did previously, then find its score and visualise the classification space.

~~~
from sklearn.tree import DecisionTreeClassifier

# define our model
tree = DecisionTreeClassifier()

# train our model
tree.fit(X_train, y_train)

# test our model
tree.predict(X_test)
print(tree.score(X_test, y_test))
~~~
{: .language-python}

If we think back to the classification episode, the decision tree trained using two parameters overfitted the data. By visualising the classification space for body mass and bill length we can easily see the regions that overfit to single points.

~~~
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt

# define two features to visualise
f1 = feature_names[0]
f2 = feature_names[3]

# redefine and train our model on just these two features
tree_2d = DecisionTreeClassifier()
tree_2d.fit(X_train[[f1, f2]], y_train)

d = DecisionBoundaryDisplay.from_estimator(tree_2d, X_train[[f1, f2]])
sns.scatterplot(X_train, x=f1, y=f2, hue=y_train, palette="husl")
plt.show()
~~~
{: .language-python}

We can now define a random forest estimator and train it using the penguin training data.

~~~
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

# define our model
# n_estimators is an extra parameter: the number of trees in the forest
# a leaf is a class label at the end of the decision tree
forest = RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_leaf=1)

# train our model
forest.fit(X_train, y_train)

# compare our models
print(tree.score(X_test, y_test))
print(forest.score(X_test, y_test))
~~~
{: .language-python}

We can visualise a few of the trees in our forest to see how they differ from each other:

~~~
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(12,6))

# plot first 5 trees in forest
for index in range(0, 5):
    plot_tree(forest.estimators_[index],
              class_names=species_names,
              feature_names=feature_names,
              filled=True,
              ax=axes[index])

plt.show()
~~~
{: .language-python}

If we train the random forest estimator using the same two parameters used to plot the classification space for our single decision tree, we can compare the random forest and decision tree classification spaces directly.

~~~
# plot classification space for body mass and bill length with random forest
forest_2d = RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_leaf=1, random_state=5)
forest_2d.fit(X_train[[f1, f2]], y_train)

d = DecisionBoundaryDisplay.from_estimator(forest_2d, X_train[[f1, f2]])
sns.scatterplot(X_train, x=f1, y=f2, hue=y_train, palette="husl")
plt.show()
~~~
{: .language-python}

There is still some overfitting, indicated by the regions that contain only single points, but compared with the decision tree classifier fitted using the same hyper-parameter settings we can see that the overfitting is reduced.

## Stacking a regression problem

We've had a look at a bagging approach, but now we'll take a look at a stacking approach and apply it to a regression problem. We'll also introduce a new dataset to play around with.

### The diabetes dataset
The diabetes dataset contains 10 baseline variables for 442 diabetes patients, where the target attribute is a quantitative measure of disease progression one year after baseline. For more information see [Efron et al., (2004)](https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf). The useful thing about this dataset is that it is available as part of the [scikit-learn library](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). We'll start by loading the dataset to very briefly inspect the attributes by printing them out.

~~~
from sklearn.datasets import load_diabetes
print(load_diabetes())
~~~
{: .language-python}

For the purposes of learning how to create and use ensemble methods, we are about to commit a cardinal sin of machine learning and blindly use this dataset without inspecting it any further.

> ## Exercise: Investigate and visualise the dataset
> For this episode we simply want to learn how to build and use an ensemble rather than actually solve a regression problem. To build up your skills as an ML practitioner, investigate and visualise this dataset. What can you say about the dataset itself, and what can you summarise about any potential relationships or prediction problems?
{: .challenge}
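If you want a starting point for the exercise, here is one possible way in: a brief sketch that loads the data as a pandas DataFrame (using the dataset's `as_frame` option) and plots a few attributes against the target. It assumes the seaborn and matplotlib imports from earlier in the episode, and the choice of columns is just an illustrative assumption:

~~~
# load the dataset as a pandas DataFrame for easier exploration
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame

# summary statistics for every attribute
print(df.describe())

# visualise a few attributes against the disease-progression target
sns.pairplot(df[["bmi", "bp", "s5", "target"]])
plt.show()
~~~
{: .language-python}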

Let's start by splitting the dataset into training and testing subsets:

~~~
from sklearn.model_selection import train_test_split
# load in data
X, y = load_diabetes(return_X_y=True)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
print(f'train size: {X_train.shape}')
print(f'test size: {X_test.shape}')
~~~
{: .language-python}

Let's stack a series of regression models. In the same way that the random forest classifier derives a result from a series of trees, we will combine the results from a series of different models in our stack. This is done using an ensemble meta-estimator called a `VotingRegressor`.

We'll apply a voting regressor to a random forest, a gradient boosting regressor, and a linear regressor.

> ## But wait, aren't random forests/decision trees for classification problems?
> Yes they are, but quite often in machine learning various models can be used to solve both regression and classification problems.
> A decision tree can also be used as a regressor: rather than a class label, each leaf of the regression tree holds a numerical value, and a prediction is made from the values of the training samples that end up in a leaf.
> This works well for periodic/repeating numerical data. These trees are extremely sensitive to the data they are trained on, which makes them a very good model to use in a random forest.
{: .callout}
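As a quick illustrative sketch of a regression tree (the noisy sine-wave data here is made up purely for demonstration):

~~~
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# generate a noisy sine wave to fit
rng = np.random.default_rng(5)
X_demo = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 80)

# fit a shallow regression tree: predictions are piecewise-constant,
# one constant value per leaf
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X_demo, y_demo)

print(tree_reg.predict([[3.0]]))
~~~
{: .language-python}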


> ## But wait again, isn't a random forest (and a gradient boosting model) an ensemble method instead of a regression model?
> Yes they are, but they can be thought of as one big complex model used like any other model. The awesome thing about ensemble methods, and the generalisation of Scikit-Learn models, is that you can put an ensemble in an ensemble!
{: .callout}

A `VotingRegressor` can train several base estimators on the whole dataset, and take the average of the individual predictions to form a final prediction.
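A minimal sketch of building and training this stack might look like the following (the variable names other than `voting_reg`, and the default hyper-parameters, are assumptions for illustration):

~~~
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# define our individual models
rf_reg = RandomForestRegressor(random_state=5)
gb_reg = GradientBoostingRegressor(random_state=5)
linear_reg = LinearRegression()

# fit the individual models so we can compare them to the stack later
rf_reg.fit(X_train, y_train)
gb_reg.fit(X_train, y_train)
linear_reg.fit(X_train, y_train)

# stack the models: a list of (name, estimator) tuples
voting_reg = VotingRegressor([("rf", rf_reg), ("gb", gb_reg), ("lr", linear_reg)])

# train our ensemble meta-estimator
voting_reg.fit(X_train, y_train)
~~~
{: .language-python}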
Expand Down Expand Up @@ -294,6 +290,8 @@ voting_reg.fit(X_train, y_train)
We fit the voting regressor in the same way we would fit a single model. When the voting regressor is instantiated we pass it a parameter containing a list of tuples that contain the estimators we wish to stack: in this case the random forest, gradient boosting and linear regressors. To get a sense of what this is doing, let's predict the first 20 samples in the test portion of the data and plot the results.

~~~
import matplotlib.pyplot as plt
# make predictions
X_test_20 = X_test[:20] # first 20 for visualisation
pred_rf = rf_reg.predict(X_test_20)
pred_gb = gb_reg.predict(X_test_20)
pred_linear = linear_reg.predict(X_test_20)
pred_voting = voting_reg.predict(X_test_20)

# plot the predictions
plt.figure()
plt.plot(pred_rf, "o", label="RandomForestRegressor")
plt.plot(pred_gb, "^", label="GradientBoostingRegressor")
plt.plot(pred_linear, "s", label="LinearRegression")
plt.plot(pred_voting, "*", label="VotingRegressor")
plt.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
plt.ylabel("predicted")
plt.xlabel("test samples")
plt.legend(loc="best")
plt.title("Regressor predictions")
plt.show()
~~~
{: .language-python}
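We can also score each model against the held-out test set to compare them numerically; a sketch, assuming the model names defined above:

~~~
print(f'random forest: {rf_reg.score(X_test, y_test)}')
print(f'gradient boosting: {gb_reg.score(X_test, y_test)}')
print(f'linear regression: {linear_reg.score(X_test, y_test)}')
print(f'voting regressor: {voting_reg.score(X_test, y_test)}')
~~~
{: .language-python}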

Each of our models scores a fairly poor 0.52-0.53, which is barely better than a coin flip. However, the stacked result generated by the voting regressor produces a slightly improved score of 0.55, which is better than any of the three models/estimators taken individually: the whole model is greater than the sum of its individual parts. And of course, we could try to improve the accuracy score by tweaking the individual model hyper-parameters, or by adjusting our training data features and train-test split.


> ## Exercise: Stacking a classification problem
> Scikit-learn also has a method for stacking ensemble classifiers, ```sklearn.ensemble.VotingClassifier```. Do you think you could apply a stack to the penguins dataset using a random forest, an SVM, and a decision tree classifier, or a selection of any other classifier estimators available in scikit-learn?
>
{: .challenge}
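One possible sketch of such a stack is shown below. The classifier choices follow the exercise text; everything else, including the default hyper-parameters, is an assumption, and we re-create the penguins train/test split since the earlier variables were overwritten by the diabetes data:

~~~
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# re-create the penguins train/test split (X_train etc. were
# overwritten by the diabetes data above)
X_train, X_test, y_train, y_test = train_test_split(
    penguins[feature_names], penguins.species, test_size=0.2, random_state=5)

# define and stack our classifiers
voting_clf = VotingClassifier([
    ("rf", RandomForestClassifier()),
    ("svm", SVC()),
    ("dt", DecisionTreeClassifier())])

# train and score the stack
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
~~~
{: .language-python}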
