From 4a67b26c818a4ad7d9a0bde9127e3cba12af38e0 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 14:33:27 -0800 Subject: [PATCH] bugfixing index --- source/classification1.md | 12 +++++------ source/clustering.md | 2 +- source/inference.md | 20 +++++++++--------- source/intro.md | 14 ++++++------- source/reading.md | 6 +++--- source/regression1.md | 4 ++-- source/viz.md | 6 +++--- source/wrangling.md | 44 +++++++++++++++++++-------------------- 8 files changed, 54 insertions(+), 54 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index f756a004..d5179ad5 100755 --- a/source/classification1.md +++ b/source/classification1.md @@ -183,7 +183,7 @@ total set of variables per image in this data set is: +++ -```{index} pandas.DataFrame; info +```{index} DataFrame; info ``` Below we use the `info` method to preview the data frame. This method can @@ -195,7 +195,7 @@ as well as their data types and the number of non-missing entries. cancer.info() ``` -```{index} pandas.Series; unique +```{index} Series; unique ``` From the summary of the data above, we can see that `Class` is of type `object`. @@ -213,7 +213,7 @@ method. The `replace` method takes one argument: a dictionary that maps previous values to desired new values. We will verify the result using the `unique` method. -```{index} pandas.Series; replace +```{index} Series; replace ``` ```{code-cell} ipython3 @@ -227,7 +227,7 @@ cancer["Class"].unique() ### Exploring the cancer data -```{index} pandas.DataFrame; groupby, pandas.Series;size +```{index} DataFrame; groupby, Series;size ``` ```{code-cell} ipython3 @@ -256,7 +256,7 @@ tumor observations. 
100 * cancer.groupby("Class").size() / cancer.shape[0] ``` -```{index} pandas.Series; value_counts +```{index} Series; value_counts ``` The `pandas` package also has a more convenient specialized `value_counts` method for @@ -1607,7 +1607,7 @@ Imbalanced data with background color indicating the decision of the classifier +++ -```{index} oversampling, pandas.DataFrame; sample +```{index} oversampling, DataFrame; sample ``` Despite the simplicity of the problem, solving it in a statistically sound manner is actually diff --git a/source/clustering.md b/source/clustering.md index 6d8d0b31..d5ad99be 100755 --- a/source/clustering.md +++ b/source/clustering.md @@ -308,7 +308,7 @@ have. clus = penguins_clustered[penguins_clustered["cluster"] == 0][["bill_length_standardized", "flipper_length_standardized"]] ``` -```{index} see: within-cluster sum-of-squared-distances; WSSD +```{index} see: within-cluster sum of squared distances; WSSD ``` ```{index} WSSD diff --git a/source/inference.md b/source/inference.md index 9188b98d..44136c9c 100755 --- a/source/inference.md +++ b/source/inference.md @@ -168,7 +168,7 @@ We can find the proportion of listings for each room type by using the `value_counts` function with the `normalize` parameter as we did in previous chapters. -```{index} pandas.DataFrame; [], pandas.DataFrame; value_counts +```{index} DataFrame; [], DataFrame; value_counts ``` ```{code-cell} ipython3 @@ -187,13 +187,13 @@ value, {glue:text}`population_proportion`, is the population parameter. Remember parameter value is usually unknown in real data analysis problems, as it is typically not possible to make measurements for an entire population. -```{index} pandas.DataFrame; sample, seed;numpy.random.seed +```{index} DataFrame; sample, seed;numpy.random.seed ``` Instead, perhaps we can approximate it with a small subset of data! 
To investigate this idea, let's try randomly selecting 40 listings (*i.e.,* taking a random sample of size 40 from our population), and computing the proportion for that sample.
We will use the `sample` method of the `DataFrame`
object to take the sample. The argument `n` of `sample` is the size of the sample to take,
and since we are starting to use randomness here,
we are also setting the random seed via numpy to make the results reproducible.
@@ -213,7 +213,7 @@ airbnb.sample(n=40)["room_type"].value_counts(normalize=True)
```

glue("sample_1_proportion", "{:.3f}".format(airbnb.sample(n=40, random_state=155)["room_type"].value_counts(normalize=True)["Entire home/apt"]))
```

-```{index} DataFrame; value_counts
```

Here we see that the proportion of entire home/apartment listings in this
@@ -248,7 +248,7 @@ commonly refer to as $n$) from a population is called a **sampling distribution**.
The sampling distribution will help us see how much we would expect our sample
proportions from this population to vary for samples of size 40.

-```{index} DataFrame; sample
```

We again use the `sample` method to take samples of size 40 from our
@@ -284,7 +284,7 @@ to compute the number of qualified observations in each sample; finally compute
Both the first and last few entries of the resulting data frame
are printed below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.

-```{index} DataFrame;groupby, DataFrame;reset_index
```

```{code-cell} ipython3
@@ -479,7 +479,7 @@ The price per night of all Airbnb rentals in Vancouver, BC
is \${glue:text}`population_mean`, on average. This value is our population parameter
since we are calculating it using the population data.
-```{index} pandas.DataFrame; sample +```{index} DataFrame; sample ``` Now suppose we did not have access to the population data (which is usually the @@ -987,7 +987,7 @@ mean of the sample is \${glue:text}`estimate_mean`. Remember, in practice, we usually only have this one sample from the population. So this sample and estimate are the only data we can work with. -```{index} bootstrap; in Python, pandas.DataFrame; sample (bootstrap) +```{index} bootstrap; in Python, DataFrame; sample (bootstrap) ``` We now perform steps 1–5 listed above to generate a single bootstrap @@ -1106,7 +1106,7 @@ generate a bootstrap distribution of these point estimates. The bootstrap distribution ({numref}`fig:11-bootstrapping5`) suggests how we might expect our point estimate to behave if we take multiple samples. -```{index} pandas.DataFrame;reset_index, pandas.DataFrame;rename, pandas.DataFrame;groupby, pandas.Series;mean +```{index} DataFrame;reset_index, DataFrame;rename, DataFrame;groupby, Series;mean ``` ```{code-cell} ipython3 @@ -1252,7 +1252,7 @@ Quantiles are expressed in proportions rather than percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively. 
-```{index} pandas.DataFrame; [], pandas.DataFrame;quantile +```{index} DataFrame; [], DataFrame;quantile ``` ```{index} percentile diff --git a/source/intro.md b/source/intro.md index 53645b31..606f3d27 100755 --- a/source/intro.md +++ b/source/intro.md @@ -437,13 +437,13 @@ can_lang ## Creating subsets of data frames with `[]` & `loc[]` -```{index} see: []; pandas.DataFrame +```{index} see: []; DataFrame ``` -```{index} see: loc[]; pandas.DataFrame +```{index} see: loc[]; DataFrame ``` -```{index} pandas.DataFrame; [], pandas.DataFrame; loc[], selecting columns +```{index} DataFrame; [], DataFrame; loc[], selecting columns ``` Now that we've loaded our data into Python, we can start wrangling the data to @@ -475,7 +475,7 @@ high-level categories of languages, which include "Aboriginal languages", our question we want to filter our data set so we restrict our attention to only those languages in the "Aboriginal languages" category. -```{index} pandas.DataFrame; [], filtering rows, logical statement, logical operator; equivalency (==), string +```{index} DataFrame; [], filtering rows, logical statement, logical operator; equivalency (==), string ``` We can use the `[]` operation to obtain the subset of rows with desired values @@ -521,7 +521,7 @@ can_lang[can_lang["category"] == "Aboriginal languages"] ### Using `[]` to select columns -```{index} pandas.DataFrame; [], selecting columns +```{index} DataFrame; [], selecting columns ``` We can also use the `[]` operation to select columns from a data frame. @@ -551,7 +551,7 @@ can_lang[["language", "mother_tongue"]] ### Using `loc[]` to filter rows and select columns -```{index} pandas.DataFrame; loc[], selecting columns +```{index} DataFrame; loc[], selecting columns ``` The `[]` operation is only used when you want to filter rows *or* select columns; @@ -612,7 +612,7 @@ So it looks like the `loc[]` operation gave us the result we wanted! 
## Using `sort_values` and `head` to select rows by ordered values -```{index} pandas.DataFrame; sort_values, pandas.DataFrame; head +```{index} DataFrame; sort_values, DataFrame; head ``` We have used the `[]` and `loc[]` operations on a data frame to obtain a table diff --git a/source/reading.md b/source/reading.md index 36f2dbee..327ede0a 100755 --- a/source/reading.md +++ b/source/reading.md @@ -407,7 +407,7 @@ canlang_data = pd.read_csv( canlang_data ``` -```{index} pandas.DataFrame; rename, pandas +```{index} DataFrame; rename, pandas ``` It is best to rename your columns manually in this scenario. The current column names @@ -790,7 +790,7 @@ that we need for analysis; we do eventually need to call `execute`. For example, `ibis` does not provide the `tail` function to look at the last rows in a database, even though `pandas` does. -```{index} pandas.DataFrame; tail +```{index} DataFrame; tail ``` ```{code-cell} ipython3 @@ -951,7 +951,7 @@ Databases are beneficial in a large-scale setting: ## Writing data from Python to a `.csv` file -```{index} write function; to_csv, pandas.DataFrame; to_csv +```{index} write function; to_csv, DataFrame; to_csv ``` At the middle and end of a data analysis, we often want to write a data frame diff --git a/source/regression1.md b/source/regression1.md index 33859a25..b7cd28f0 100755 --- a/source/regression1.md +++ b/source/regression1.md @@ -233,7 +233,7 @@ how well it predicts house sale price. This subsample is taken to allow us to illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. 
-```{index} pandas.DataFrame; sample +```{index} DataFrame; sample ``` To take a small random sample of size 30, we'll use the @@ -287,7 +287,7 @@ Scatter plot of price (USD) versus house size (square feet) with vertical line i +++ -```{index} pandas.DataFrame; abs, pandas.DataFrame; nsmallest +```{index} DataFrame; abs, DataFrame; nsmallest ``` We will employ the same intuition from {numref}`Chapters %s ` and {numref}`%s `, and use the diff --git a/source/viz.md b/source/viz.md index 56c52179..e500fbb7 100755 --- a/source/viz.md +++ b/source/viz.md @@ -718,7 +718,7 @@ in the magnitude of these two numbers! We can confirm that the two points in the upper right-hand corner correspond to Canada's two official languages by filtering the data: -```{index} pandas.DataFrame; loc[] +```{index} DataFrame; loc[] ``` ```{code-cell} ipython3 @@ -848,7 +848,7 @@ using `_` so that it is easier to read; this does not affect how Python interprets the number and is just added for readability. -```{index} pandas.DataFrame; column assignment, pandas.DataFrame; [] +```{index} DataFrame; column assignment, DataFrame; [] ``` ```{code-cell} ipython3 @@ -1228,7 +1228,7 @@ as `sort_values` followed by `head`, but are slightly more efficient because the In general, it is good to use more specialized functions when they are available! ``` -```{index} pandas.DataFrame; nlargest, pandas.DataFrame; nsmallest +```{index} DataFrame; nlargest, DataFrame; nsmallest ``` ```{code-cell} ipython3 diff --git a/source/wrangling.md b/source/wrangling.md index 30f2f2de..4cd6d36e 100755 --- a/source/wrangling.md +++ b/source/wrangling.md @@ -72,10 +72,10 @@ This knowledge will be helpful in effectively utilizing these objects in our dat ```{index} data frame; definition ``` -```{index} see: data frame; pandas.DataFrame +```{index} see: data frame; DataFrame ``` -```{index} pandas.DataFrame +```{index} DataFrame ``` A data frame is a table-like structure for storing data in Python. 
Data frames are
@@ -112,7 +112,7 @@ A data frame storing data regarding the population of various regions in Canada.

### What is a series?

-```{index} Series
```

In Python, `pandas` **series** are objects that can contain one or more elements (like a list).
@@ -375,7 +375,7 @@ represented as individual columns to make the data tidy.

### Tidying up: going from wide to long using `melt`

-```{index} DataFrame; melt
```

One task that is commonly performed to get data into a tidy format
@@ -545,7 +545,7 @@ been met:

(pivot-wider)=
### Tidying up: going from long to wide using `pivot`

-```{index} DataFrame; pivot
```

Suppose we have observations spread across multiple rows rather than in a single
@@ -651,7 +651,7 @@ lang_home_tidy.columns = [
lang_home_tidy
```

-```{index} DataFrame; reset_index
```

In the first step, note that we added a call to `reset_index`. When `pivot` is called with
@@ -665,7 +665,7 @@ The second operation we applied is to rename the columns. When we perform the `p
operation, it keeps the original column name `"count"` and adds the `"type"` as a second column name.
Having two names for a column can be confusing! So we rename, giving each column only one name.

-```{index} DataFrame; info
```

We can print out some useful information about our data frame using the `info` function.
@@ -702,7 +702,7 @@ more columns, and we would see the data set "widen."

(str-split)=
### Tidying up: using `str.split` to deal with multiple separators

-```{index} Series; str.split, separator
```

```{index} see: delimiter; separator
```
@@ -834,7 +834,7 @@ This section will highlight more advanced usage of the `[]` function, including
an in-depth treatment of the variety of logical statements one can use in the `[]`
to select subsets of rows.
-```{index} pandas.DataFrame; [], logical statement +```{index} DataFrame; [], logical statement ``` ```{index} see: logical statement; logical operator @@ -1093,7 +1093,7 @@ to make long chains of filtering operations a bit easier to read. (loc-iloc)= ## Using `loc[]` to filter rows and select columns -```{index} pandas.DataFrame; loc[] +```{index} DataFrame; loc[] ``` The `[]` operation is only used when you want to either filter rows **or** select columns; @@ -1172,7 +1172,7 @@ corresponding to the column names that start with the desired characters. tidy_lang.loc[:, tidy_lang.columns.str.startswith("most")] ``` -```{index} pandas.Series; str.contains +```{index} Series; str.contains ``` We could also have chosen the columns containing an underscore `_` by using the @@ -1184,7 +1184,7 @@ tidy_lang.loc[:, tidy_lang.columns.str.contains("_")] ``` ## Using `iloc[]` to extract rows and columns by position -```{index} pandas.DataFrame; iloc[], column range +```{index} DataFrame; iloc[], column range ``` Another approach for selecting rows and columns is to use `iloc[]`, which provides the ability to index with the position rather than the label of the columns. @@ -1219,7 +1219,7 @@ accidentally put in the wrong integer index! If you did not correctly remember that the `language` column was index `1`, and used `2` instead, your code might end up having a bug that is quite hard to track down. 
-```{index} pandas.Series; str.startswith +```{index} Series; str.startswith ``` +++ {"tags": []} @@ -1264,7 +1264,7 @@ region_lang = pd.read_csv("data/region_lang.csv") region_lang ``` -```{index} pandas.Series; min, pandas.Series; max +```{index} Series; min, Series; max ``` We use `.min` to calculate the minimum @@ -1294,7 +1294,7 @@ total number of people in the survey, we could use the `sum` summary statistic m region_lang["most_at_home"].sum() ``` -```{index} pandas.Series; sum, pandas.Series; mean, pandas.Series; median, pandas.Series; std, summary statistic +```{index} Series; sum, Series; mean, Series; median, Series; std, summary statistic ``` Other handy summary statistics include the `mean`, `median` and `std` for @@ -1402,7 +1402,7 @@ region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"]) +++ -```{index} pandas.DataFrame; groupby +```{index} DataFrame; groupby ``` What happens if we want to know how languages vary by region? In this case, we need a new tool that lets us group rows by region. This can be achieved @@ -1507,7 +1507,7 @@ region_lang.groupby("region")[["most_at_home", "most_at_work", "lang_known"]].ma To see how many observations there are in each group, we can use `value_counts`. -```{index} pandas.DataFrame; value_counts +```{index} DataFrame; value_counts ``` ```{code-cell} ipython3 @@ -1552,12 +1552,12 @@ we can see that this would be the columns from `mother_tongue` to `lang_known`. region_lang ``` -```{index} pandas.DataFrame; apply, pandas.DataFrame; loc[] +```{index} DataFrame; apply, DataFrame; loc[] ``` We can simply call the `.astype` function to apply it across the desired range of columns. 
-```{index} pandas.DataFrame; astype, pandas.Series; astype +```{index} DataFrame; astype, Series; astype ``` ```{code-cell} ipython3 @@ -1609,7 +1609,7 @@ you can use the more general [`apply`](https://pandas.pydata.org/docs/reference/ ## Modifying and adding columns -```{index} pandas.DataFrame; [], column assignment, assign +```{index} DataFrame; [], column assignment, assign ``` When we compute summary statistics or apply functions, @@ -1763,7 +1763,7 @@ For the rest of the book, we will silence that warning to help with readability. pd.options.mode.chained_assignment = None ``` -```{index} pandas.DataFrame; merge +```{index} DataFrame; merge ``` ```{note} @@ -1800,7 +1800,7 @@ english_lang ## Using `merge` to combine data frames -```{index} pandas.DataFrame; merge +```{index} DataFrame; merge ``` Let's return to the situation right before we added the city populations