Tweak ordering of classification
mike-ivs committed May 27, 2024
1 parent 8efe2db commit d9cbb0e
Showing 1 changed file with 20 additions and 16 deletions.
36 changes: 20 additions & 16 deletions _episodes/03-classification.md
@@ -37,6 +37,25 @@ For this lesson we will limit our dataset to only numerical values such as bill_

The above table contains multiple categorical objects such as species. If we attempt to include the other categorical fields, island and sex, we might hinder classification performance due to the complexity of the data.

### Preprocessing our data

Let's do some preprocessing on our dataset and specify our `X` features and `Y` labels:

~~~
# Extract the data we need
feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
dataset.dropna(subset=feature_names, inplace=True)
class_names = dataset['species'].unique()
X = dataset[feature_names]
Y = dataset['species']
~~~
{: .language-python}
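
As a minimal sketch of what this preprocessing step does, the example below applies the same `dropna(subset=...)` pattern to a small, made-up DataFrame. The `toy` table and its values are hypothetical stand-ins for the penguins data, not taken from the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset (values are made up)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 40.3, 46.5],
    "bill_depth_mm": [18.7, 14.2, 18.0, 17.9],
    "flipper_length_mm": [181.0, 217.0, 195.0, 192.0],
    "body_mass_g": [3750.0, 5000.0, 3250.0, 3500.0],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})

feature_names = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Drop only rows with a missing value in one of the feature columns;
# the missing `sex` entry is ignored because `sex` is not in `subset`
cleaned = toy.dropna(subset=feature_names)

X = cleaned[feature_names]
Y = cleaned["species"]

print(X.shape, Y.shape)
```

Because `X` and `Y` are taken from the same cleaned DataFrame, they keep matching row indices, which is what allows them to be split together later.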

Having extracted our features `X` and labels `Y`, we can now split the data using the `train_test_split` function.

## Training-testing split
When undertaking any machine learning project, it's important to be able to evaluate how well your model works.

@@ -58,22 +77,7 @@ In the previous regression episode we created the penguin training data by takin
> Sometimes your data is dependent on its ordering, such as time-series data where past values influence future predictions. Creating train-test splits for this can be tricky at first glance, but fortunately there are existing techniques to tackle this (such as time-series-aware splitting): See [Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators) for more information.
{: .callout}
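
As one illustration of splitting order-dependent data, the sketch below uses Scikit-Learn's `TimeSeriesSplit`, one of the iterators described on the linked page. The array `X_ts` is a hypothetical stand-in for ordered observations and is not part of the lesson's penguin data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical ordered data: 8 time steps with one feature each
X_ts = np.arange(8).reshape(-1, 1)

# Each fold trains on earlier time steps and tests on later ones
splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X_ts))

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Unlike a shuffled split, every training index here comes before every test index in the same fold, so the model is never evaluated on data older than what it was trained on.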

Let's do some preprocessing on our dataset and specify our `X` features and `Y` labels:

~~~
# Extract the data we need
feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
dataset.dropna(subset=feature_names, inplace=True)
class_names = dataset['species'].unique()
X = dataset[feature_names]
Y = dataset['species']
~~~
{: .language-python}

Having extracted our features `X` and labels `Y`, we can now split the data using the `train_test_split` function. We specify the fraction of data to use as test data, and the function randomly shuffles our data prior to splitting:
We specify the fraction of data to use as test data, and the function randomly shuffles our data prior to splitting:

~~~
from sklearn.model_selection import train_test_split