From 2a45dd7db7c666f647acd513b358db7e6f964498 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:53:47 -0800 Subject: [PATCH] Chapter 5 production polish (#78) * starting work on ch5+6; categorical type change; remove commented out R code * value counts, class name remap, replace in ch5 * remove warnings * polished ch5+6 up to euclidean dist * minor bugfix * minor bugfix * fixed worksheets link at end of chp * fix minor section heading wording in Ch1 * added nsmallest + note; better chaining for dist comps; removed comments; fixed colors (not working yet) * initial fit and predict polished; model spec -> model object * polishing preprocessing * balancing polished * pipelines * learning objs * mute warnings in ch5 * restore cls2 to main branch * Update classification1.md add output scroll for large table * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * remove random state specificaiton from ch5 * fixed fill plots * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * better intro of meshgrid * better warning filtering in chapters 4 and 5 * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * remove np.number and replace with just 'number' in dtype selection * Update source/classification1.md Co-authored-by: Joel Ostblom * Update source/classification1.md Co-authored-by: Joel Ostblom * properly gluing column names Co-authored-by: Lindsey Heagy Co-authored-by: Joel Ostblom --- source/classification1.md | 914 ++++++++++++++++---------------------- source/intro.md | 4 +- source/viz.md | 6 +- 3 files changed, 397 insertions(+), 527 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index db09a47a..0e2016b9 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,41 +12,21 @@ kernelspec: name: python3 --- -(classification)= -# Classification I: training & predicting - ```{code-cell} ipython3 :tags: [remove-cell] +import warnings +warnings.filterwarnings("ignore", category=DeprecationWarning) +warnings.filterwarnings("ignore", category=FutureWarning) -import random - -import altair as alt -from altair_saver import save +from myst_nb import glue import numpy as np -import pandas as pd -import sklearn -from sklearn.compose import make_column_transformer -from sklearn.neighbors import KNeighborsClassifier -from sklearn.pipeline import Pipeline, make_pipeline from sklearn.metrics.pairwise import euclidean_distances -from sklearn.preprocessing import StandardScaler -import plotly.express as px -import plotly.graph_objs as go -from plotly.offline import iplot, plot from IPython.display import HTML -from myst_nb import glue - -alt.data_transformers.disable_max_rows() - -# alt.renderers.enable('altair_saver', fmts=['vega-lite', 'png']) - -# # Handle large data sets by not embedding them in the notebook -# alt.data_transformers.enable('data_server') - -# # Save a PNG blob as a backup for when the Altair plots do not render -# 
alt.renderers.enable('mimetype') ``` +(classification)= +# Classification I: training & predicting + ## Overview In previous chapters, we focused solely on descriptive and exploratory data analysis questions. @@ -70,7 +50,8 @@ By the end of the chapter, readers will be able to do the following: - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. - Explain the $K$-nearest neighbor classification algorithm. - Perform $K$-nearest neighbor classification in Python using `scikit-learn`. -- Use `StandardScaler` to preprocess data to be centered, scaled, and balanced. +- Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled. +- Use `resample` to preprocess data to be balanced. - Combine preprocessing and model training using `make_pipeline`. +++ @@ -83,7 +64,7 @@ By the end of the chapter, readers will be able to do the following: ```{index} see: feature ; predictor ``` -In many situations, we want to make `predictions` based on the current situation +In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor's past experience with patients; an email provider might want to tag a given @@ -155,10 +136,11 @@ guide patient treatment. Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by -loading the `pandas` package needed for our analysis. +loading the `pandas` and `altair` packages needed for our analysis. ```{code-cell} ipython3 import pandas as pd +import altair as alt ``` In this case, the file containing the breast cancer data set is a `.csv` @@ -169,6 +151,7 @@ arguments, and then inspect its contents: ``` ```{code-cell} ipython3 +:tags: ["output_scroll"] cancer = pd.read_csv("data/wdbc.csv") cancer ``` @@ -206,65 +189,49 @@ total set of variables per image in this data set is: ```{index} info ``` -Below we use `.info()` to preview the data frame. This method can -make it easier to inspect the data when we have a lot of columns, -as it prints the data such that the columns go down -the page (instead of across). +Below we use the `info` method to preview the data frame. This method can +make it easier to inspect the data when we have a lot of columns: +it prints only the column names down the page (instead of across), +as well as their data types and the number of non-missing entries. ```{code-cell} ipython3 cancer.info() ``` -From the summary of the data above, we can see that `Class` is of type object. - -+++ +```{index} unique +``` -Given that we only have two different values in our `Class` column (B for benign and M -for malignant), we only expect to get two names back. +From the summary of the data above, we can see that `Class` is of type `object`. +We can use the `unique` method on the `Class` column to see all unique values +present in that column. We see that there are two diagnoses: +benign, represented by `'B'`, and malignant, represented by `'M'`. ```{code-cell} ipython3 cancer['Class'].unique() ``` +We will also improve the readability of our analysis +by renaming `'M'` to `'Malignant'` and `'B'` to `'Benign'` using the `replace` +method. The `replace` method takes one argument: a dictionary that maps +previous values to desired new values. 
+Furthermore, since we will be working with `Class` as a categorical statistical variable, +it is a good idea to convert it to the `category` type using the `astype` method +on the `cancer` data frame. We will verify the result using the `info` +and `unique` methods again. + +```{index} replace +``` + ```{code-cell} ipython3 -:tags: [remove-cell] +cancer['Class'] = cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' +}).astype('category') +cancer.info() +``` -## The above was based on the following text and code in R textbook. ## -####################################################################### -# Below we use `glimpse` \index{glimpse} to preview the data frame. This function can -# make it easier to inspect the data when we have a lot of columns, -# as it prints the data such that the columns go down -# the page (instead of across). - -# ```{r 05-glimpse} -# glimpse(cancer) -# ``` - -# From the summary of the data above, we can see that `Class` is of type character -# (denoted by ``). Since we will be working with `Class` as a -# categorical statistical variable, we will convert it to a factor using the -# function `as_factor`. \index{factor!as\_factor} - -# ```{r 05-class} -# cancer <- cancer |> -# mutate(Class = as_factor(Class)) -# glimpse(cancer) -# ``` - -# Recall that factors have what are called "levels", which you can think of as categories. We -# can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function. -# This function should return the name of each category in that column. Given -# that we only have two different values in our `Class` column (B for benign and M -# for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument; -# so we use the `pull` function to extract a single column (`Class`) and -# pass that into the `levels` function to see the categories -# in the `Class` column. - -# ```{r 05-levels} -# cancer |> -# pull(Class) |> -# levels() -# ``` +```{code-cell} ipython3 +cancer['Class'].unique() ``` ### Exploring the cancer data @@ -272,44 +239,63 @@ cancer['Class'].unique() ```{index} groupby, count ``` +```{code-cell} ipython3 +:tags: [remove-cell] +glue("benign_count", cancer['Class'].value_counts()['Benign']) +glue("benign_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Benign']))) +glue("malignant_count", cancer['Class'].value_counts()['Malignant']) +glue("malignant_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Malignant']))) +``` + Before we start doing any modeling, let's explore our data set. Below we use -the `.groupby()`, `.count()` methods to find the number and percentage -of benign and malignant tumor observations in our data set. When paired with `.groupby()`, `.count()` counts the number of observations in each `Class` group. -Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations. +the `groupby` and `count` methods to find the number and percentage +of benign and malignant tumor observations in our data set. When paired with +`groupby`, `count` counts the number of observations for each value of the `Class` +variable. Then we calculate the percentage in each group by dividing by the total +number of observations and multiplying by 100. 
We have +{glue:}`benign_count` ({glue:}`benign_pct`\%) benign and +{glue:}`malignant_count` ({glue:}`malignant_pct`\%) malignant +tumor observations. ```{code-cell} ipython3 -num_obs = len(cancer) explore_cancer = pd.DataFrame() explore_cancer['count'] = cancer.groupby('Class')['ID'].count() -explore_cancer['percentage'] = explore_cancer['count'] / num_obs * 100 +explore_cancer['percentage'] = 100 * explore_cancer['count']/len(cancer) explore_cancer ``` -```{index} visualization; scatter +```{index} value_counts ``` -Next, let's draw a scatter plot to visualize the relationship between the -perimeter and concavity variables. Rather than use `altair's` default palette, -we select our own colorblind-friendly colors—`"#efb13f"` -for light orange and `"#86bfef"` for light blue—and - pass them as the `scale` argument in the `color` argument. -We also make the category labels ("B" and "M") more readable by -changing them to "Benign" and "Malignant" using `.apply()` method on the dataframe. +The `pandas` package also has a more convenient specialized `value_counts` method for +counting the number of occurrences of each value in a column. If we pass no arguments +to the method, it outputs a series containing the number of occurences +of each value. If we instead pass the argument `normalize=True`, it instead prints the fraction +of occurrences of each value. ```{code-cell} ipython3 -:tags: [] +cancer['Class'].value_counts() +``` -colors = ["#86bfef", "#efb13f"] -cancer["Class"] = cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) +```{code-cell} ipython3 +cancer['Class'].value_counts(normalize=True) +``` + +```{index} visualization; scatter +``` + +Next, let's draw a colored scatter plot to visualize the relationship between the +perimeter and concavity variables. Recall that `altair's` default palette +is colorblind-friendly, so we can stick with that here. + +```{code-cell} ipython3 perim_concav = ( alt.Chart(cancer) - .mark_point(opacity=0.6, filled=True, size=40) + .mark_circle() .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) perim_concav @@ -338,7 +324,8 @@ you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like -the *prediction of an unobserved label* might be possible. +it may be possible to make accurate predictions of the `Class` variable (i.e., a diagnosis) for +tumor images with unknown diagnoses. +++ @@ -348,6 +335,8 @@ the *prediction of an unobserved label* might be possible. :tags: [remove-cell] new_point = [2, 4] +glue("new_point_1_0", new_point[0]) +glue("new_point_1_1", new_point[1]) attrs = ["Perimeter", "Concavity"] points_df = pd.DataFrame( {"Perimeter": new_point[0], "Concavity": new_point[1], "Class": ["Unknown"]} @@ -358,8 +347,6 @@ perim_concav_with_new_point_df = pd.concat((cancer, points_df), ignore_index=Tru my_distances = euclidean_distances(perim_concav_with_new_point_df.loc[:, attrs])[ len(cancer) ][:-1] -glue("1-new_point_0", new_point[0]) -glue("1-new_point_1", new_point[1]) ``` ```{index} K-nearest neighbors; classification @@ -377,8 +364,11 @@ $K$ for us. We will cover how to choose $K$ ourselves in the next chapter. 
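To give a sense of how the neighbors' diagnoses get combined into a single prediction, here is a small sketch of the majority vote idea using the `value_counts` method from earlier. The five neighbor labels below are made up purely for illustration; the most common label among them becomes the prediction.

```{code-cell} ipython3
# hypothetical diagnoses of the K = 5 nearest neighbors (made up for illustration)
neighbor_labels = pd.Series(["Malignant", "Malignant", "Benign", "Malignant", "Benign"])

# the predicted label is the most common value among the neighbors
neighbor_labels.value_counts().idxmax()
```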
To illustrate the concept of $K$-nearest neighbors classification, we will walk through an example. Suppose we have a -new observation, with standardized perimeter of {glue:}`1-new_point_0` and standardized concavity of {glue:}`1-new_point_1`, whose -diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in {numref}`fig:05-knn-2`. +new observation, with standardized perimeter +of {glue:}`new_point_1_0` and standardized concavity +of {glue:}`new_point_1_1`, whose +diagnosis "Class" is unknown. This new observation is +depicted by the red, diamond point in {numref}`fig:05-knn-2`. ```{code-cell} ipython3 :tags: [remove-cell] @@ -386,21 +376,17 @@ diagnosis "Class" is unknown. This new observation is depicted by the red, diamo perim_concav_with_new_point = ( alt.Chart( perim_concav_with_new_point_df, - # title="Scatter plot of concavity versus perimeter with new observation represented as a red diamond.", ) - .mark_point(opacity=0.6, filled=True, size=40) + .mark_circle() .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color( - "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), - title="Diagnosis", - ), + color=alt.Color("Class", title="Diagnosis"), shape=alt.Shape( "Class", scale=alt.Scale(range=["circle", "circle", "diamond"]) ), - size=alt.condition("datum.Class == 'Unknown'", alt.value(80), alt.value(30)) + size=alt.condition("datum.Class == 'Unknown'", alt.value(100), alt.value(30)), + stroke=alt.condition("datum.Class == 'Unknown'", alt.value('black'), alt.value(None)), ) ) glue('fig:05-knn-2', perim_concav_with_new_point, display=True) @@ -426,10 +412,11 @@ glue("1-neighbor_per", round(near_neighbor_df.iloc[0, :]['Perimeter'], 1)) glue("1-neighbor_con", round(near_neighbor_df.iloc[0, :]['Concavity'], 1)) ``` -{numref}`fig:05-knn-3` shows that the nearest point to this new observation is **malignant** and -located at the coordinates ({glue:}`1-neighbor_per`, {glue:}`1-neighbor_con`). The idea here is that if a point is close to another in the scatter plot, -then the perimeter and concavity values are similar, and so we may expect that -they would have the same diagnosis. +{numref}`fig:05-knn-3` shows that the nearest point to this new observation is +**malignant** and located at the coordinates ({glue:}`1-neighbor_per`, +{glue:}`1-neighbor_con`). The idea here is that if a point is close to another +in the scatter plot, then the perimeter and concavity values are similar, +and so we may expect that they would have the same diagnosis. ```{code-cell} ipython3 :tags: [remove-cell] @@ -446,7 +433,9 @@ glue('fig:05-knn-3', (perim_concav_with_new_point + line), display=True) :::{glue:figure} fig:05-knn-3 :name: fig:05-knn-3 -Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label. +Scatter plot of concavity versus perimeter. The new observation is represented +as a red diamond with a line to the one nearest neighbor, which has a malignant +label. 
::: ```{code-cell} ipython3 @@ -463,8 +452,8 @@ perim_concav_with_new_point_df2 = pd.concat((cancer, points_df2), ignore_index=T my_distances2 = euclidean_distances(perim_concav_with_new_point_df2.loc[:, attrs])[ len(cancer) ][:-1] -glue("2-new_point_0", new_point[0]) -glue("2-new_point_1", new_point[1]) +glue("new_point_2_0", new_point[0]) +glue("new_point_2_1", new_point[1]) ``` ```{code-cell} ipython3 @@ -473,7 +462,6 @@ glue("2-new_point_1", new_point[1]) perim_concav_with_new_point2 = ( alt.Chart( perim_concav_with_new_point_df2, - # title="Scatter plot of concavity versus perimeter with new observation represented as a red diamond.", ) .mark_point(opacity=0.6, filled=True, size=40) .encode( @@ -481,7 +469,6 @@ perim_concav_with_new_point2 = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -509,9 +496,10 @@ glue("2-neighbor_con", round(near_neighbor_df2.iloc[0, :]['Concavity'], 1)) glue('fig:05-knn-4', (perim_concav_with_new_point2 + line2), display=True) ``` -Suppose we have another new observation with standardized perimeter {glue:}`2-new_point_0` and -concavity of {glue:}`2-new_point_1`. Looking at the scatter plot in {numref}`fig:05-knn-4`, how would you -classify this red, diamond observation? The nearest neighbor to this new point is a +Suppose we have another new observation with standardized perimeter +{glue:}`new_point_2_0` and concavity of {glue:}`new_point_2_1`. Looking at the +scatter plot in {numref}`fig:05-knn-4`, how would you classify this red, +diamond observation? The nearest neighbor to this new point is a **benign** observation at ({glue:}`2-neighbor_per`, {glue:}`2-neighbor_con`). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. @@ -521,7 +509,9 @@ not, if you consider the other nearby points. :::{glue:figure} fig:05-knn-4 :name: fig:05-knn-4 -Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label. +Scatter plot of concavity versus perimeter. The new observation is represented +as a red diamond with a line to the one nearest neighbor, which has a benign +label. ::: ```{code-cell} ipython3 @@ -591,13 +581,13 @@ next chapter. ```{index} distance; K-nearest neighbors, straight line; distance ``` -We decide which points are the $K$ "nearest" to our new observation -using the *straight-line distance* (we will often just refer to this as *distance*). -Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$. -Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$; -$b_x$ and $b_y$ have similar definitions for observation $b$. -Then the straight-line distance between observation $a$ and $b$ on the x-y plane can -be computed using the following formula: +We decide which points are the $K$ "nearest" to our new observation using the +*straight-line distance* (we will often just refer to this as *distance*). +Suppose we have two observations $a$ and $b$, each having two predictor +variables, $x$ and $y$. Denote $a_x$ and $a_y$ to be the values of variables +$x$ and $y$ for observation $a$; $b_x$ and $b_y$ have similar definitions for +observation $b$. 
Then the straight-line distance between observation $a$ and +$b$ on the x-y plane can be computed using the following formula: $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ @@ -613,6 +603,13 @@ the $K=5$ neighbors that are nearest to our new point. You will see in the code below, we compute the straight-line distance using the formula above: we square the differences between the two observations' perimeter and concavity coordinates, add the squared differences, and then take the square root. +In order to find the $K=5$ nearest neighbors, we will use the `nsmallest` function from `pandas`. + +> **Note:** Recall that in the {ref}`intro` chapter, we used `sort_values` followed by `head` to obtain +> the ten rows with the *largest* values of a variable. We could have instead used the `nlargest` function +> from `pandas` for this purpose. The `nsmallest` and `nlargest` functions achieve the same goal +> as `sort_values` followed by `head`, but are slightly more efficient because they are specialized for this purpose. +> In general, it is good to use more specialized functions when they are available! ```{code-cell} ipython3 :tags: [remove-cell] @@ -633,7 +630,6 @@ perim_concav_with_new_point3 = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -660,14 +656,15 @@ Scatter plot of concavity versus perimeter with new observation represented as a ```{code-cell} ipython3 new_obs_Perimeter = 0 new_obs_Concavity = 3.5 -cancer_dist = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Class"]] -cancer_dist = cancer_dist.assign(dist_from_new = np.sqrt( - (cancer_dist["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer_dist["Concavity"] - new_obs_Concavity) ** 2 -)) -# sort the rows in ascending order and take the first 5 rows -cancer_dist = cancer_dist.sort_values(by="dist_from_new").head(5) -cancer_dist +( + cancer + [["Perimeter", "Concavity", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") +) ``` In {numref}`tab:05-multiknn-mathtable` we show in mathematical detail how @@ -675,36 +672,21 @@ we computed the `dist_from_new` variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data. 
-```{code-cell} ipython3 -:tags: [remove-cell] - -## Couldn't find ways to have nice Latex equations in pandas dataframe - -# cancer_dist_eq = cancer_dist.copy() -# cancer_dist_eq['Perimeter'] = round(cancer_dist_eq['Perimeter'], 2) -# cancer_dist_eq['Concavity'] = round(cancer_dist_eq['Concavity'], 2) -# for i in list(cancer_dist_eq.index): -# cancer_dist_eq.loc[ -# i, "Distance" -# ] = f"[({new_obs_Perimeter} - {round(cancer_dist_eq.loc[i, 'Perimeter'], 2)})² + ({new_obs_Concavity} - {round(cancer_dist_eq.loc[i, 'Concavity'], 2)})²]¹/² = {round(cancer_dist_eq.loc[i, 'dist_from_new'], 2)}" -# cancer_dist_eq[["Perimeter", "Concavity", "Distance", "Class"]] -``` - ```{table} Evaluating the distances from the new observation to each of its 5 nearest neighbors :name: tab:05-multiknn-mathtable | Perimeter | Concavity | Distance | Class | |-----------|-----------|----------------------------------------|-------| -| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| B | -| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| M | -| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| M | -| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| M | -| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| B | +| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| Benign | +| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| Malignant | +| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| Malignant | +| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| Malignant | +| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| Benign | ``` +++ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are -malignant (`M`); since this is the majority, we classify our new observation as malignant. +malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in {numref}`fig:05-multiknn-3`. ```{code-cell} ipython3 @@ -771,18 +753,20 @@ three predictors. new_obs_Perimeter = 0 new_obs_Concavity = 3.5 new_obs_Symmetry = 1 -cancer_dist2 = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Symmetry", "Class"]] -cancer_dist2["dist_from_new"] = np.sqrt( - (cancer_dist2["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer_dist2["Concavity"] - new_obs_Concavity) ** 2 - + (cancer_dist2["Symmetry"] - new_obs_Symmetry) ** 2 +( + cancer + [["Perimeter", "Concavity", "Symmetry", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") ) -# sort the rows in ascending order and take the first 5 rows -cancer_dist2 = cancer_dist2.sort_values(by="dist_from_new").head(5) -cancer_dist2 ``` -Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. +Based on $K=5$ nearest neighbors with these three predictors we would classify +the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. {numref}`fig:05-more` shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. 
@@ -845,7 +829,7 @@ for i, d in enumerate(fig.data): fig.data[i].marker.symbol = symbols[fig.data[i].name] # specify trace names and colors in a dict -colors = {"Malignant": "#86bfef", "Benign": "#efb13f", "Unknown": "red"} +colors = {"Malignant": "#ff7f0e", "Benign": "#1f77b4", "Unknown": "red"} # set all colors in fig for i, d in enumerate(fig.data): @@ -874,7 +858,6 @@ for neighbor_df in neighbor_df_list: fig.update_layout(margin=dict(l=0, r=0, b=0, t=1), template="plotly_white") plot(fig, filename="img/classification1/fig05-more.html", auto_open=False) -# display(HTML("img/classification1/fig05-more.html")) ``` ```{code-cell} ipython3 @@ -887,7 +870,10 @@ display(HTML("img/classification1/fig05-more.html")) :name: fig:05-more :figclass: caption-hack -3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes. +3D scatter plot of the standardized symmetry, concavity, and perimeter +variables. Note that in general we recommend against using 3D visualizations; +here we show the data in 3D only to illustrate what higher dimensions and +nearest neighbors look like, for learning purposes. ``` +++ @@ -897,9 +883,8 @@ display(HTML("img/classification1/fig05-more.html")) In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following: 1. Compute the distance between the new observation and each observation in the training set. -2. Sort the data table in ascending order according to the distances. -3. Choose the top $K$ rows of the sorted table. -4. Classify the new observation based on a majority vote of the neighbor classes. +2. Find the $K$ rows corresponding to the $K$ smallest distances. +3. Classify the new observation based on a majority vote of the neighbor classes. +++ @@ -914,29 +899,18 @@ or predict the class for multiple new observations. Thankfully, in Python, the $K$-nearest neighbors algorithm is implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions -in the `scikit-learn` package will help keep our code simple, readable and accurate; the -less we have to code ourselves, the fewer mistakes we will likely make. We -start by importing `KNeighborsClassifier` from the `sklearn.neighbors` module. - +in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the +less we have to code ourselves, the fewer mistakes we will likely make. +Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package +that we prefer using `pandas` data frames over regular arrays via the `set_config` function. ```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated, -# especially if we want to handle multiple classes, more than two variables, -# or predict the class for multiple new observations. 
Thankfully, in R, -# the $K$-nearest neighbors algorithm is -# implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip] -# included in `tidymodels`, along with -# many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip} -# that you will encounter in this and future chapters of the book. The `tidymodels` collection -# provides tools to help make and use models, such as classifiers. Using the packages -# in this collection will help keep our code simple, readable and accurate; the -# less we have to code ourselves, the fewer mistakes we will likely make. We -# start by loading `tidymodels`. +from sklearn import set_config +set_config(transform_output="pandas") ``` +We can now get started with $K$-nearest neighbors. The first step is to + import the `KNeighborsClassifier` from the `sklearn.neighbors` module. + ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier ``` @@ -946,103 +920,63 @@ We will use the `cancer` data set from above, with perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired -predictor variables and class label and store them as a new data set named `cancer_train`: +predictor variables and class label and store them with the name `cancer_train`: ```{code-cell} ipython3 -cancer_train = cancer.loc[:, ['Class', 'Perimeter', 'Concavity']] +cancer_train = cancer[['Class', 'Perimeter', 'Concavity']] cancer_train ``` -```{index} scikit-learn; model instance, scikit-learn; KNeighborsClassifier +```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier ``` -Next, we create a *model specification* for $K$-nearest neighbors classification -by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors -(we will discuss how to choose $K$ in the next chapter) and the straight-line -distance (`weights="uniform"`). The `weights` argument controls -how neighbors vote when classifying a new observation; by setting it to `"uniform"`, -each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, -which weigh each neighbor's vote differently, can be found on -[the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). - -```{code-cell} ipython3 -:tags: [remove-cell] +Next, we create a *model object* for $K$-nearest neighbors classification +by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors; +we will discuss how to choose $K$ in the next chapter. -## The above was based on: +> **Note:** You can specify the `weights` argument in order to control +> how neighbors vote when classifying a new observation. The default is `"uniform"`, where +> each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, +> which weigh each neighbor's vote differently, can be found on +> [the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). 
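As a brief aside, the sketch below shows how the `weights` argument mentioned in the note could be set to `"distance"`, which gives closer neighbors a larger say in the vote. This variant is shown only for illustration; we will stick with the default uniform voting throughout this book.

```{code-cell} ipython3
# a distance-weighted alternative, shown for illustration only; not used elsewhere in this chapter
KNeighborsClassifier(n_neighbors=5, weights="distance")
```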
-# Next, we create a *model specification* for \index{tidymodels!model specification} $K$-nearest neighbors classification -# by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors -# (we will discuss how to choose $K$ in the next chapter) and the straight-line -# distance (`weight_func = "rectangular"`). The `weight_func` argument controls -# how neighbors vote when classifying a new observation; by setting it to `"rectangular"`, -# each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, -# which weigh each neighbor's vote differently, can be found on -# [the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html). -# In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training -# the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification. -# Finally, we specify that this is a classification problem with the `set_mode` function. -``` ```{code-cell} ipython3 -knn_spec = KNeighborsClassifier(n_neighbors=5) -knn_spec +knn = KNeighborsClassifier(n_neighbors=5) +knn ``` ```{index} scikit-learn; X & y ``` -In order to fit the model on the breast cancer data, we need to call `fit` on the classifier object and pass the data in the argument. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the `X=cancer_train[["Perimeter", "Concavity"]]` and the `y=cancer_train['Class']` argument specifies -that `Class` is the target variable (the one we want to predict), -and both `Perimeter` and `Concavity` are to be used as the predictors. - -```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# In order to fit the model on the breast cancer data, we need to pass the model specification -# and the data set to the `fit` function. We also need to specify what variables to use as predictors -# and what variable to use as the target. Below, the `Class ~ Perimeter + Concavity` argument specifies -# that `Class` is the target variable (the one we want to predict), -# and both `Perimeter` and `Concavity` are to be used as the predictors. - - -# We can also use a convenient shorthand syntax using a period, `Class ~ .`, to indicate -# that we want to use every variable *except* `Class` \index{tidymodels!model formula} as a predictor in the model. -# In this particular setup, since `Concavity` and `Perimeter` are the only two predictors in the `cancer_train` -# data frame, `Class ~ Perimeter + Concavity` and `Class ~ .` are equivalent. -# In general, you can choose individual predictors using the `+` symbol, or you can specify to -# use *all* predictors using the `.` symbol. -``` - -```{code-cell} ipython3 -knn_spec.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); -``` +In order to fit the model on the breast cancer data, we need to call `fit` on +the model object. The `X` argument is used to specify the data for the predictor +variables, while the `y` argument is used to specify the data for the response variable. +So below, we set `X=cancer_train[["Perimeter", "Concavity"]]` and +`y=cancer_train['Class']` to specify that `Class` is the target +variable (the one we want to predict), and both `Perimeter` and `Concavity` are +to be used as the predictors. 
Note that the `fit` function might look like it does not +do much from the outside, but it is actually doing all the heavy lifting to train +the K-nearest neighbors model, and modifies the `knn` model object. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Here you can see the final trained model summary. It confirms that the computational engine used -# to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by -# the nearest neighbor model, but we will ignore this for now and discuss it in more detail -# in the next chapter. -# Finally, it shows (somewhat confusingly) that the "best" weight function -# was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier, -# R is just repeating those settings to us here. In the next chapter, we will actually -# let R find the value of $K$ for us. +knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); ``` ```{index} scikit-learn; predict ``` -Finally, we make the prediction on the new observation by calling `predict` on the classifier object, -passing the new observation itself. As above, -when we ran the $K$-nearest neighbors -classification algorithm manually, the `knn_fit` object classifies the new observation as "Malignant". Note that the `predict` function outputs a `numpy` array with the model's prediction. +After using the `fit` function, we can make a prediction on a new observation +by calling `predict` on the classifier object, passing the new observation +itself. As above, when we ran the $K$-nearest neighbors classification +algorithm manually, the `knn` model object classifies the new observation as +"Malignant". Note that the `predict` function outputs an `array` with the +model's prediction; you can actually make multiple predictions at the same +time using the `predict` function, which is why the output is stored as an `array`. ```{code-cell} ipython3 new_obs = pd.DataFrame({'Perimeter': [0], 'Concavity': [3.5]}) -knn_spec.predict(new_obs) +knn.predict(new_obs) ``` Is this predicted malignant label the true class for this observation? @@ -1101,19 +1035,23 @@ is said to be *standardized*, and all variables in a data set will have a mean o and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest neighbor algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up -until now. To keep things simple, we will just use the `Area`, `Smoothness`, and `Class` +until now. We will apply the same initial wrangling steps as we did earlier, +and to keep things simple we will just use the `Area`, `Smoothness`, and `Class` variables: ```{code-cell} ipython3 -unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") -unscaled_cancer = unscaled_cancer[['Class', 'Area', 'Smoothness']] +unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv")[['Class', 'Area', 'Smoothness']] +unscaled_cancer['Class'] = unscaled_cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' +}).astype('category') +unscaled_cancer unscaled_cancer ``` Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for -smoothness. Will this affect -predictions? In order to find out, we will create a scatter plot of these two +smoothness. Will this affect predictions? 
In order to find out, we will create a scatter plot of these two predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the `unscaled_cancer` data set with `scikit-learn`. @@ -1124,32 +1062,29 @@ standardize the `unscaled_cancer` data set with `scikit-learn`. ```{index} double: scikit-learn; pipeline ``` -In the `scikit-learn` framework, all data preprocessing and modeling can be built using a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), or a more convenient function [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) for simple pipeline construction. -Here we will initialize a preprocessor using `make_column_transformer` for -the `unscaled_cancer` data above, specifying -that we want to standardize the predictors `Area` and `Smoothness`: +The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate +data in the [`preprocessing` module](https://scikit-learn.org/stable/modules/preprocessing.html). +Here we will use the `StandardScaler` transformer to standardize the predictor variables in +the `unscaled_cancer` data. In order to tell the `StandardScaler` which variables to standardize, +we wrap it in a +[`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) object +using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. +`ColumnTransformer` objects also enable the use of multiple preprocessors at +once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. +The primary argument of the `make_column_transformer` function is a sequence of +pairs of (1) a preprocessor, and (2) the columns to which you want to apply that preprocessor. +In the present case, we just have the one `StandardScaler` preprocessor to apply to the `Area` and `Smoothness` columns. ```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# In the `tidymodels` framework, all data preprocessing happens -# using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes] -# Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for -# the `unscaled_cancer` data above, specifying -# that the `Class` variable is the target, and all other variables are predictors: -``` +from sklearn.preprocessing import StandardScaler +from sklearn.compose import make_column_transformer -```{code-cell} ipython3 preprocessor = make_column_transformer( (StandardScaler(), ["Area", "Smoothness"]), ) preprocessor ``` -So far, we have built a preprocessor so that each of the predictors have a mean of 0 and standard deviation of 1. - ```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform ``` @@ -1159,68 +1094,75 @@ So far, we have built a preprocessor so that each of the predictors have a mean ```{index} scikit-learn; fit, scikit-learn; transform ``` -You can now see that the recipe includes a scaling and centering step for all predictor variables. 
-Note that when you add a step to a `ColumnTransformer`, you must specify what columns to apply the step to. -Here we specified that `StandardScaler` should be applied to -all predictor variables. - -```{index} see: fit, transform, fit_transform; scikit-learn -``` - -At this point, the data are not yet scaled and centered. To actually scale and center -the data, we need to call `fit` and `transform` on the unscaled data ( can be combined into `fit_transform`). +You can see that the preprocessor includes a single standardization step +that is applied to the `Area` and `Smoothness` columns. +Note that here we specified which columns to apply the preprocessing step to +by individual names; this approach can become quite difficult, e.g., when we have many +predictor variables. Rather than writing out the column names individually, +we can instead use the +[`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For +example, if we wanted to standardize all *numerical* predictors, +we would use `make_column_selector` and specify the `dtype_include` argument to be `'number'`. +This creates a preprocessor equivalent to the one we created previously. ```{code-cell} ipython3 -:tags: [remove-cell] +from sklearn.compose import make_column_selector -# So far, there is not much in the recipe; just a statement about the number of targets -# and predictors. Let's add -# scaling (`step_scale`) \index{recipe!step\_scale} and -# centering (`step_center`) \index{recipe!step\_center} steps for -# all of the predictors so that they each have a mean of 0 and standard deviation of 1. -# Note that `tidyverse` actually provides `step_normalize`, which does both centering and scaling in -# a single recipe step; in this book we will keep `step_scale` and `step_center` separate -# to emphasize conceptually that there are two steps happening. -# The `prep` function finalizes the recipe by using the data (here, `unscaled_cancer`) \index{tidymodels!prep}\index{prep|see{tidymodels}} -# to compute anything necessary to run the recipe (in this case, the column means and standard -# deviations): +preprocessor = make_column_transformer( + (StandardScaler(), make_column_selector(dtype_include='number')), +) +preprocessor ``` -```{code-cell} ipython3 -:tags: [remove-cell] - -# You can now see that the recipe includes a scaling and centering step for all predictor variables. -# Note that when you add a step to a recipe, you must specify what columns to apply the step to. -# Here we used the `all_predictors()` \index{recipe!all\_predictors} function to specify that each step should be applied to -# all predictor variables. However, there are a number of different arguments one could use here, -# as well as naming particular columns with the same syntax as the `select` function. -# For example: - -# - `all_nominal()` and `all_numeric()`: specify all categorical or all numeric variables -# - `all_predictors()` and `all_outcomes()`: specify all predictor or all target variables -# - `Area, Smoothness`: specify both the `Area` and `Smoothness` variable -# - `-Class`: specify everything except the `Class` variable - -# You can find a full set of all the steps and variable selection functions -# on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html). 
- -# At this point, we have calculated the required statistics based on the data input into the -# recipe, but the data are not yet scaled and centered. To actually scale and center -# the data, we need to apply the `bake` \index{tidymodels!bake} \index{bake|see{tidymodels}} function to the unscaled data. +```{index} see: fit, transform, fit_transform; scikit-learn ``` +We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame. +This happens in two steps. We first use the `fit` function to compute the values necessary to apply +the standardization (the mean and standard deviation of each variable), passing the `unscaled_cancer` data as an argument. +Then we use the `transform` function to actually apply the standardization. +It may seem a bit unnecessary to use two steps---`fit` *and* `transform`---to standardize the data. +However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. +This enables us to compute the quantities needed to standardize using one data set, and then +apply that standardization to another data set. + ```{code-cell} ipython3 preprocessor.fit(unscaled_cancer) scaled_cancer = preprocessor.transform(unscaled_cancer) -# scaled_cancer = preprocessor.fit_transform(unscaled_cancer) -scaled_cancer = pd.DataFrame(scaled_cancer, columns=['Area', 'Smoothness']) -scaled_cancer['Class'] = unscaled_cancer['Class'] scaled_cancer ``` +```{code-cell} ipython3 +:tags: [remove-cell] +glue('scaled-cancer-column-0', scaled_cancer.columns[0]) +glue('scaled-cancer-column-1', scaled_cancer.columns[1]) +``` +It looks like our `Smoothness` and `Area` variables have been standardized. Woohoo! +But there are two important things to notice about the new `scaled_cancer` data frame. First, it only keeps +the columns from the input to `transform` (here, `unscaled_cancer`) that had a preprocessing step applied +to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` +is to *drop* the remaining columns. This default behavior works well with the rest of `sklearn` (as we will see below +in the {ref}`08:puttingittogetherworkflow` section), but for visualizing the result of preprocessing it can be useful to keep the other columns +in our original data frame, such as the `Class` variable here. +To keep other columns, we need to set the `remainder` argument to `'passthrough'` in the `make_column_transformer` function. + Furthermore, you can see that the new column names---{glue:}`scaled-cancer-column-0` +and {glue:}`scaled-cancer-column-1`---include the name +of the preprocessing step separated by underscores. This default behavior is useful in `sklearn` because we sometimes want to apply +multiple different preprocessing steps to the same columns; but again, for visualization it can be useful to preserve +the original column names. To keep original column names, we need to set the `verbose_feature_names_out` argument to `False`. + +> **Note:** Only specify the `remainder` and `verbose_feature_names_out` arguments when you want to examine the result +> of your preprocessing step. In most cases, you should leave these arguments at their default values. -It may seem redundant that we had to both `fit` *and* `transform` to scale and center the data. - However, we do this in two steps so we can specify a different data set in the `transform` step if we want. - For example, we may want to specify new data that were not part of the training set. 
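To see why having separate `fit` and `transform` steps can be handy, here is a small sketch in which the standardization quantities are computed from one subset of the rows of `unscaled_cancer` and then applied to a different subset. The particular split of rows used below (and the name `demo_preprocessor`) is arbitrary and purely for illustration.

```{code-cell} ipython3
# a preprocessor created just for this illustration
demo_preprocessor = make_column_transformer(
    (StandardScaler(), ["Area", "Smoothness"]),
)
# compute the means and standard deviations from the first 500 rows...
demo_preprocessor.fit(unscaled_cancer[:500])
# ...and apply that same standardization to the remaining rows
demo_preprocessor.transform(unscaled_cancer[500:])
```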
+```{code-cell} ipython3 +preprocessor_keep_all = make_column_transformer( + (StandardScaler(), make_column_selector(dtype_include='number')), + remainder='passthrough', + verbose_feature_names_out=False +) +preprocessor_keep_all.fit(unscaled_cancer) +scaled_cancer_all = preprocessor_keep_all.transform(unscaled_cancer) +scaled_cancer_all +``` You may wonder why we are doing so much work just to center and scale our variables. Can't we just manually scale and center the `Area` and @@ -1229,33 +1171,14 @@ technically *yes*; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a *different* centering / scaling than what we used while training. Proper use of a `ColumnTransformer` helps keep our code simple, -readable, and error-free. Furthermore, note that using `fit` and `transform` on the preprocessor is required only when you want to inspect the result of the preprocessing steps -yourself. You will see further on in -Section {ref}`08:puttingittogetherworkflow` that `scikit-learn` provides tools to -automatically streamline the preprocesser and the model so that you can call`fit` +readable, and error-free. Furthermore, note that using `fit` and `transform` on +the preprocessor is required only when you want to inspect the result of the +preprocessing steps +yourself. You will see further on in the +{ref}`08:puttingittogetherworkflow` section that `scikit-learn` provides tools to +automatically streamline the preprocesser and the model so that you can call `fit` and `transform` on the `Pipeline` as necessary without additional coding effort. -```{code-cell} ipython3 -:tags: [remove-cell] - -# It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data. -# However, we do this in two steps so we can specify a different data set in the `bake` step if we want. -# For example, we may want to specify new data that were not part of the training set. - -# You may wonder why we are doing so much work just to center and -# scale our variables. Can't we just manually scale and center the `Area` and -# `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well, -# technically *yes*; but doing so is error-prone. In particular, we might -# accidentally forget to apply the same centering / scaling when making -# predictions, or accidentally apply a *different* centering / scaling than what -# we used while training. Proper use of a `recipe` helps keep our code simple, -# readable, and error-free. Furthermore, note that using `prep` and `bake` is -# required only when you want to inspect the result of the preprocessing steps -# yourself. You will see further on in Section -# \@ref(puttingittogetherworkflow) that `tidymodels` provides tools to -# automatically apply `prep` and `bake` as necessary without additional coding effort. -``` - {numref}`fig:05-scaling-plt` shows the two scatter plots side-by-side—one for `unscaled_cancer` and one for `scaled_cancer`. Each has the same new observation annotated with its $K=3$ nearest neighbors. 
In the original unstandardized data plot, you can see some odd choices @@ -1285,7 +1208,7 @@ def class_dscp(x): attrs = ["Area", "Smoothness"] -new_obs = pd.DataFrame({"Class": ["Unknwon"], "Area": 400, "Smoothness": 0.135}) +new_obs = pd.DataFrame({"Class": ["Unknown"], "Area": 400, "Smoothness": 0.135}) unscaled_cancer["Class"] = unscaled_cancer["Class"].apply(class_dscp) area_smoothness_new_df = pd.concat((unscaled_cancer, new_obs), ignore_index=True) my_distances = euclidean_distances(area_smoothness_new_df.loc[:, attrs])[ @@ -1302,7 +1225,6 @@ area_smoothness_new_point = ( y=alt.Y("Smoothness"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1359,13 +1281,13 @@ area_smoothness_new_point = area_smoothness_new_point + line1 + line2 + line3 :tags: [remove-cell] attrs = ["Area", "Smoothness"] -new_obs_scaled = pd.DataFrame({"Class": ["Unknwon"], "Area": -0.72, "Smoothness": 2.8}) -scaled_cancer["Class"] = scaled_cancer["Class"].apply(class_dscp) +new_obs_scaled = pd.DataFrame({"Class": ["Unknown"], "Area": -0.72, "Smoothness": 2.8}) +scaled_cancer_all["Class"] = scaled_cancer_all["Class"].apply(class_dscp) area_smoothness_new_df_scaled = pd.concat( - (scaled_cancer, new_obs_scaled), ignore_index=True + (scaled_cancer_all, new_obs_scaled), ignore_index=True ) my_distances_scaled = euclidean_distances(area_smoothness_new_df_scaled.loc[:, attrs])[ - len(scaled_cancer) + len(scaled_cancer_all) ][:-1] area_smoothness_new_point_scaled = ( alt.Chart( @@ -1378,7 +1300,6 @@ area_smoothness_new_point_scaled = ( y=alt.Y("Smoothness", title="Smoothness (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1390,21 +1311,21 @@ area_smoothness_new_point_scaled = ( min_3_idx_scaled = np.argpartition(my_distances_scaled, 3)[:3] neighbor1_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[0], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[0], attrs], new_obs_scaled[attrs].T, ), axis=1, ).T neighbor2_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[1], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[1], attrs], new_obs_scaled[attrs].T, ), axis=1, ).T neighbor3_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[2], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[2], attrs], new_obs_scaled[attrs].T, ), axis=1, @@ -1451,24 +1372,6 @@ Comparison of K = 3 nearest neighbors with standardized and unstandardized data. ```{code-cell} ipython3 :tags: [remove-cell] -# Could not find something mimicing `facet_zoom` in R, here are 2 plots trying to -# illustrate similar points -# 1. interactive plot which allows zooming in/out -glue('fig:05-scaling-plt-interactive', area_smoothness_new_point.interactive()) -``` - -+++ {"tags": ["remove-cell"]} - -:::{glue:figure} fig:05-scaling-plt-interactive -:name: fig:05-scaling-plt-interactive - -Close-up of three nearest neighbors for unstandardized data. -::: - -```{code-cell} ipython3 -:tags: [remove-cell] - -# 2. Static plot, Zoom-in zoom_area_smoothness_new_point = ( alt.Chart( area_smoothness_new_df, @@ -1480,7 +1383,6 @@ zoom_area_smoothness_new_point = ( y=alt.Y("Smoothness", scale=alt.Scale(domain=(0.08, 0.14))), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1523,39 +1425,25 @@ what the data would look like if the cancer was rare. 
We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the `.head()` method, which takes the number of rows to select from the top (`n`). -The new imbalanced data is shown in {numref}`fig:05-unbalanced`. +We use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) +function from `pandas` to glue the two resulting filtered +data frames back together by passing them together in a sequence. +The new imbalanced data is shown in {numref}`fig:05-unbalanced`, +and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 -:tags: [remove-cell] - -# To better illustrate the problem, let's revisit the scaled breast cancer data, -# `cancer`; except now we will remove many of the observations of malignant tumors, simulating -# what the data would look like if the cancer was rare. We will do this by -# picking only 3 observations from the malignant group, and keeping all -# of the benign observations. We choose these 3 observations using the `slice_head` -# function, which takes two arguments: a data frame-like object, -# and the number of rows to select from the top (`n`). -# The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced). -``` +rare_cancer = pd.concat(( + cancer[cancer["Class"] == 'Benign'], + cancer[cancer["Class"] == 'Malignant'].head(3) +)) -```{code-cell} ipython3 -cancer = pd.read_csv("data/wdbc.csv") -rare_cancer = pd.concat( - (cancer.query("Class == 'B'"), cancer.query("Class == 'M'").head(3)) -) -colors = ["#86bfef", "#efb13f"] -rare_cancer["Class"] = rare_cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) rare_plot = ( - alt.Chart( - rare_cancer - ) - .mark_point(opacity=0.6, filled=True, size=40) + alt.Chart(rare_cancer) + .mark_circle() .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) rare_plot @@ -1568,6 +1456,10 @@ rare_plot Imbalanced data. ``` +```{code-cell} ipython3 +rare_cancer['Class'].value_counts() +``` + +++ Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification. @@ -1585,7 +1477,7 @@ in the training data that were tagged as malignant. attrs = ["Perimeter", "Concavity"] new_point = [2, 2] new_point_df = pd.DataFrame( - {"Class": ["Unknwon"], "Perimeter": new_point[0], "Concavity": new_point[1]} + {"Class": ["Unknown"], "Perimeter": new_point[0], "Concavity": new_point[1]} ) rare_cancer["Class"] = rare_cancer["Class"].apply(class_dscp) rare_cancer_with_new_df = pd.concat((rare_cancer, new_point_df), ignore_index=True) @@ -1604,7 +1496,6 @@ rare_plot = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1619,9 +1510,9 @@ min_7_idx = np.argpartition(my_distances, 7)[:7] # For loop: each iteration adds a line segment of corresponding color for i in range(7): - clr = "#86bfef" + clr = "#1f77b4" if rare_cancer.iloc[min_7_idx[i], :]["Class"] == "Malignant": - clr = "#efb13f" + clr = "#ff7f0e" neighbor = pd.concat( ( rare_cancer.iloc[min_7_idx[i], :][attrs], @@ -1654,21 +1545,24 @@ always "benign," corresponding to the blue color. 
```{code-cell} ipython3 :tags: [remove-cell] -knn_spec = KNeighborsClassifier(n_neighbors=7) -knn_spec.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]) +knn = KNeighborsClassifier(n_neighbors=7) +knn.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]) # create a prediction pt grid per_grid = np.linspace( - rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 100 + rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 50 ) con_grid = np.linspace( - rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 100 + rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 50 ) pcgrid = np.array(np.meshgrid(per_grid, con_grid)).reshape(2, -1).T pcgrid = pd.DataFrame(pcgrid, columns=["Perimeter", "Concavity"]) -knnPredGrid = knn_spec.predict(pcgrid) +pcgrid + +knnPredGrid = knn.predict(pcgrid) prediction_table = pcgrid.copy() prediction_table["Class"] = knnPredGrid +prediction_table # create the scatter plot rare_plot = ( @@ -1679,7 +1573,7 @@ rare_plot = ( .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -1689,7 +1583,7 @@ prediction_plot = ( prediction_table, title="Imbalanced data", ) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X( "Perimeter", @@ -1705,10 +1599,10 @@ prediction_plot = ( domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max()) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) -rare_plot + prediction_plot +#rare_plot + prediction_plot glue("fig:05-upsample-2", (rare_plot + prediction_plot)) ``` @@ -1727,26 +1621,22 @@ Despite the simplicity of the problem, solving it in a statistically sound manne fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more -voting power in the $K$-nearest neighbor algorithm. In order to do this, we will need an oversampling -step with the `resample` function from the `sklearn` Python package. -We show below how to do this, and also -use the `.groupby()` and `.count()` methods to see that our classes are now balanced: +voting power in the $K$-nearest neighbor algorithm. In order to do this, we will +first separate the classes out into their own data frames by filtering. +Then, we will +use the [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) function +from the `sklearn` package to increase the number of `Malignant` observations to be the same as the number +of `Benign` observations. We set the `n_samples` argument to be the number of `Malignant` observations we want. +Finally, we use the `value_counts` method to see that our classes are now balanced. +Note that `resample` picks which data to replicate *randomly*; we will learn more about properly handling randomness +in data analysis in the {ref}`classification2` chapter. 
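As a quick aside on what "oversampling" means mechanically (an illustrative sketch only, not part of the chapter's analysis, with a hypothetical toy data frame): `resample` simply draws rows with replacement, so asking for more rows than a data frame contains forces some rows to appear multiple times. The same idea is applied to the real `Malignant` observations in the cell below.

```python
import pandas as pd
from sklearn.utils import resample

# a toy stand-in for the rare class; the data frame and its values are hypothetical
toy = pd.DataFrame({"label": ["a", "b", "c"]})

# resample samples rows with replacement by default, so asking for more rows
# than exist necessarily replicates some of them -- this is all oversampling does
resample(toy, n_samples=7)
```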
```{code-cell} ipython3 :tags: [remove-cell] - -# Despite the simplicity of the problem, solving it in a statistically sound manner is actually -# fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. -# For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. \index{oversampling} -# In other words, we will replicate rare observations multiple times in our data set to give them more -# voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling -# step to the earlier `uc_recipe` recipe with the `step_upsample` function from the `themis` R package. \index{recipe!step\_upsample} -# We show below how to do this, and also -# use the `group_by` and `summarize` functions to see that our classes are now balanced: -``` - -```{code-cell} ipython3 -rare_cancer['Class'].value_counts() +# hidden seed call to make the below resample reproducible +# we haven't taught students about seeds / prngs yet, so +# for now just hide this. +np.random.seed(1) ``` ```{code-cell} ipython3 @@ -1755,10 +1645,10 @@ from sklearn.utils import resample malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"] benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"] malignant_cancer_upsample = resample( - malignant_cancer, n_samples=len(benign_cancer), random_state=100 + malignant_cancer, n_samples=len(benign_cancer) ) upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer)) -upsampled_cancer.groupby(by='Class')['Class'].count() +upsampled_cancer['Class'].value_counts() ``` Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data. @@ -1771,13 +1661,13 @@ closer to the benign tumor observations. ```{code-cell} ipython3 :tags: [remove-cell] -knn_spec = KNeighborsClassifier(n_neighbors=7) -knn_spec.fit( +knn = KNeighborsClassifier(n_neighbors=7) +knn.fit( X=upsampled_cancer.loc[:, ["Perimeter", "Concavity"]], y=upsampled_cancer["Class"] ) # create a prediction pt grid -knnPredGrid = knn_spec.predict(pcgrid) +knnPredGrid = knn.predict(pcgrid) prediction_table = pcgrid prediction_table["Class"] = knnPredGrid @@ -1800,21 +1690,21 @@ rare_plot = ( domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max()) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) # add a prediction layer, also scatter plot upsampled_plot = ( alt.Chart(prediction_table) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) -rare_plot + upsampled_plot +#rare_plot + upsampled_plot glue("fig:05-upsample-plot", (rare_plot + upsampled_plot)) ``` @@ -1827,33 +1717,28 @@ Upsampled data with background color indicating the decision of the classifier. +++ (08:puttingittogetherworkflow)= -## Putting it together in a `pipeline` +## Putting it together in a `Pipeline` ```{index} scikit-learn; pipeline ``` -The `scikit-learn` package collection also provides the `pipeline`, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. 
-To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
-First we will load the data, create a model, and specify a preprocessor for how the data should be preprocessed:
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# The `tidymodels` package collection also provides the `workflow`,
-# a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}}
-# together multiple data analysis steps without a lot of otherwise necessary code
-# for intermediate steps.
-# To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
-# First we will load the data, create a model,
-# and specify a recipe for how the data should be preprocessed:
-```
+The `scikit-learn` package also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline),
+a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
+To illustrate the whole workflow, let's start from scratch with the `unscaled_wdbc.csv` data.
+First we will load the data, create a model, and specify a preprocessor for the data.
 
 ```{code-cell} ipython3
-# load the unscaled cancer data
+# load the unscaled cancer data, make Class readable
 unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv")
+unscaled_cancer['Class'] = unscaled_cancer['Class'].replace({
+    'M' : 'Malignant',
+    'B' : 'Benign'
+}).astype('category')
+unscaled_cancer
+
 # create the KNN model
-knn_spec = KNeighborsClassifier(n_neighbors=7)
+knn = KNeighborsClassifier(n_neighbors=7)
 
 # create the centering / scaling preprocessor
 preprocessor = make_column_transformer(
@@ -1861,72 +1746,52 @@ preprocessor = make_column_transformer(
 )
 ```
 
-You will also notice that we did not call `.fit()` on the preprocessor; this is unnecessary when it is
-placed in a `Pipeline`.
-
 ```{index} scikit-learn; make_pipeline, scikit-learn; fit
 ```
 
-We will now place these steps in a `Pipeline` using the `make_pipeline` function,
-and finally we will call `.fit()` to run the whole `Pipeline` on the `unscaled_cancer` data.
+Next we place these steps in a `Pipeline` using
+the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function.
+The `make_pipeline` function takes the steps to apply in your data analysis as its arguments; in this
+case, we just have the `preprocessor` and `knn` steps.
+Finally, we call `fit` on the pipeline.
+Notice that we do not need to separately call `fit` and `transform` on the `preprocessor`; the
+pipeline handles doing this properly for us.
+Also notice that when we call `fit` on the pipeline, we can pass
+the whole `unscaled_cancer` data frame to the `X` argument, since the preprocessing
+step drops all the variables except the two we listed: `Area` and `Smoothness`.
+For the `y` response variable argument, we pass the `unscaled_cancer["Class"]` series as before.
 
 ```{code-cell} ipython3
-:tags: [remove-cell]
+from sklearn.pipeline import make_pipeline
 
-# Note that each of these steps is exactly the same as earlier, except for one major difference:
-# we did not use the `select` function to extract the relevant variables from the data frame,
-# and instead simply specified the relevant variables to use via the
-# formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
-# You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is -# placed in a workflow. - -# We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions, \index{tidymodels!add\_recipe}\index{tidymodels!add\_model} -# and finally we will use the `fit` function to run the whole workflow on the `unscaled_cancer` data. -# Note another difference from earlier here: we do not include a formula in the `fit` function. This \index{tidymodels!fit} -# is again because we included the formula in the recipe, so there is no need to respecify it: -``` - -```{code-cell} ipython3 -knn_fit = make_pipeline(preprocessor, knn_spec).fit( - X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"] +knn_fit = make_pipeline(preprocessor, knn).fit( + X=unscaled_cancer, + y=unscaled_cancer["Class"] ) knn_fit ``` As before, the fit object lists the function that trains the model. But now the fit object also includes information about -the overall workflow, including the standardizing preprocessing step. +the overall workflow, including the standardization preprocessing step. In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new observation, it will first apply the same preprocessing steps to the new observation. As an example, we will predict the class label of two new observations: one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. -```{code-cell} ipython3 -:tags: [remove-cell] - -# As before, the fit object lists the function that trains the model as well as the "best" settings -# for the number of neighbors and weight function (for now, these are just the values we chose -# manually when we created `knn_spec` above). But now the fit object also includes information about -# the overall workflow, including the centering and scaling preprocessing steps. -# In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new -# observation, it will first apply the same recipe steps to the new observation. -# As an example, we will predict the class label of two new observations: -# one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. -``` - ```{code-cell} ipython3 new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]}) prediction = knn_fit.predict(new_observation) prediction ``` -The classifier predicts that the first observation is benign ("B"), while the second is -malignant ("M"). {numref}`fig:05-workflow-plot-show` visualizes the predictions that this +The classifier predicts that the first observation is benign, while the second is +malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this trained $K$-nearest neighbor model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. For the interested reader who wants a learning challenge, we now include it below. -The basic idea is to create a grid of synthetic new observations using the `numpy.meshgrid` function, +The basic idea is to create a grid of synthetic new observations using the `meshgrid` function from `numpy`, predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low `opacity` value) and large point radius. 
See if you can figure out what each line is doing! @@ -1935,12 +1800,15 @@ predict the label of each, and visualize the predictions with a colored scatter > visualizations in their own data analyses. ```{code-cell} ipython3 +:tags: [remove-output] +import numpy as np + # create the grid of area/smoothness vals, and arrange in a data frame are_grid = np.linspace( - unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 100 + unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 50 ) smo_grid = np.linspace( - unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 100 + unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 50 ) asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"]) @@ -1977,26 +1845,30 @@ unscaled_plot = ( ) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) # 2. the faded colored scatter for the grid points prediction_plot = ( alt.Chart(prediction_table) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X("Area"), y=alt.Y("Smoothness"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) - unscaled_plot + prediction_plot ``` +```{code-cell} ipython3 +:tags: [remove-input] +glue("fig:05-workflow-plot", (unscaled_plot + prediction_plot)) +``` + ```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 -:name: fig:05-workflow-plot-show +:name: fig:05-workflow-plot :figclass: caption-hack Scatter plot of smoothness versus area where background color indicates the decision of the classifier. @@ -2008,13 +1880,13 @@ Scatter plot of smoothness versus area where background color indicates the deci Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) in the "Classification I: training and predicting" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of the worksheet by clicking "view worksheet." If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup -found in Chapter {ref}`move-to-your-own-machine`. This will ensure that the automated feedback +found in the {ref}`move-to-your-own-machine` chapter. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. +++ diff --git a/source/intro.md b/source/intro.md index bad9f768..a4f51df2 100644 --- a/source/intro.md +++ b/source/intro.md @@ -38,7 +38,7 @@ By the end of the chapter, readers will be able to do the following: - Read tabular data with `read_csv`. - Use `help()` to access help and documentation tools in Python. - Create new variables and objects in Python. -- Create and organize subsets of tabular data using `[]`, `loc[]`, and `sort_values`. +- Create and organize subsets of tabular data using `[]`, `loc[]`, `sort_values`, and `head`. - Chain multiple operations in sequence. - Visualize data with an `altair` bar plot. 
@@ -586,7 +586,7 @@ with multiple kinds of `category`. The data frame `aboriginal_lang` contains only 67 rows, and looks like it only contains Aboriginal languages. So it looks like the `loc[]` operation gave us the result we wanted! -### Using `sort_values` to order and `head` to select rows by value +### Using `sort_values` and `head` to select rows by ordered values ```{index} pandas.DataFrame; sort_values, pandas.DataFrame; head ``` diff --git a/source/viz.md b/source/viz.md index d5c0f49b..ff194776 100644 --- a/source/viz.md +++ b/source/viz.md @@ -16,11 +16,9 @@ kernelspec: :tags: [remove-cell] # ignore warnings from altair - import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn +warnings.filterwarnings("ignore", category=DeprecationWarning) +warnings.filterwarnings("ignore", category=FutureWarning) ```