Graphic design updates #565

Merged — 22 commits, Dec 20, 2023

Commits
ad3ebe4
empty commit
trevorcampbell Nov 17, 2023
e75f9d2
use default colors in inference
trevorcampbell Nov 20, 2023
c08434c
consistent font label clustering elbow
trevorcampbell Nov 20, 2023
b523442
consistent cluster centre style in clustering
trevorcampbell Nov 20, 2023
a066346
classification2 new graphics
trevorcampbell Nov 21, 2023
c737dff
added source files for cls2 new graphics
trevorcampbell Nov 21, 2023
6c81a90
orange2 -> darkorange; steelblue2 -> steelblue
trevorcampbell Nov 21, 2023
29ef8bf
landmass bar colors steelblue,darkorange
trevorcampbell Nov 21, 2023
072a8a7
steelblue/orange in cls2 predictor selection irrelevant plot
trevorcampbell Nov 21, 2023
71a0f67
dotted to dashed vert rule in reg1; thinner default dash in viz
trevorcampbell Nov 21, 2023
610ae81
improvements to consistency in visualizations in reg1,2
trevorcampbell Nov 22, 2023
3d4ec8b
Steelblue, darkorange consistency with prev chps
trevorcampbell Nov 22, 2023
a42ed8f
consistent style in inference
trevorcampbell Nov 22, 2023
7f7db50
centering all figs; new figs in wrangling (just names; not committing…
trevorcampbell Nov 22, 2023
d7ce717
change jpegs to pngs in intro chp
trevorcampbell Nov 25, 2023
a14b97c
graphic design: ch3
trevorcampbell Nov 25, 2023
be05f02
fix 1-6
trevorcampbell Nov 26, 2023
ec97921
version control graphics
trevorcampbell Dec 10, 2023
6e2b6ff
frontmatter figure chapter overview
trevorcampbell Dec 20, 2023
8e365c2
pop vs sample figure inference
trevorcampbell Dec 20, 2023
d4ac466
reading chp file tree fig update
trevorcampbell Dec 20, 2023
df6fd34
Added canada map to ch1
trevorcampbell Dec 20, 2023
1,707 changes: 1,707 additions & 0 deletions img/classification2/ML-paradigm-test.ai

Binary file modified img/classification2/ML-paradigm-test.png
3,059 changes: 3,059 additions & 0 deletions img/classification2/cv.ai
Binary file modified img/classification2/cv.png
2,005 changes: 2,005 additions & 0 deletions img/classification2/train-test-overview.ai
Binary file removed img/classification2/train-test-overview.jpeg
Binary file added img/classification2/train-test-overview.png
1,676 changes: 1,676 additions & 0 deletions img/classification2/training_test.ai
Binary file removed img/classification2/training_test.jpeg
Binary file added img/classification2/training_test.png
Binary file removed img/frontmatter/chapter_overview.jpeg
Binary file added img/frontmatter/chapter_overview.png
Binary file modified img/inference/intro-bootstrap.jpeg
Binary file modified img/inference/population_vs_sample.png
Binary file removed img/intro/arrange_function.jpeg
Binary file added img/intro/arrange_function.png
Binary file added img/intro/canada_map.png
Binary file removed img/intro/filter_function.jpeg
Binary file added img/intro/filter_function.png
Binary file removed img/intro/ggplot_function.jpeg
Binary file added img/intro/ggplot_function.png
Binary file removed img/intro/read_csv_function.jpeg
Binary file added img/intro/read_csv_function.png
Binary file removed img/intro/select_function.jpeg
Binary file added img/intro/select_function.png
Binary file removed img/reading/filesystem.jpeg
Binary file added img/reading/filesystem.png
Binary file modified img/version-control/vc-ba2-add.png
Binary file modified img/version-control/vc-ba3-commit.png
Binary file modified img/version-control/vc1-no-changes.png
Binary file modified img/version-control/vc2-changes.png
Binary file modified img/version-control/vc5-push.png
Binary file modified img/version-control/vc6-remote-changes.png
Binary file modified img/version-control/vc7-pull.png
1,774 changes: 1,774 additions & 0 deletions img/wrangling/data_frame_slides_cdn.004.ai
Binary file removed img/wrangling/data_frame_slides_cdn.004.jpeg
Binary file added img/wrangling/data_frame_slides_cdn.004.png
2,461 changes: 2,461 additions & 0 deletions img/wrangling/data_frame_slides_cdn.005.ai
Binary file removed img/wrangling/data_frame_slides_cdn.005.jpeg
Binary file added img/wrangling/data_frame_slides_cdn.005.png
1,597 changes: 1,597 additions & 0 deletions img/wrangling/data_frame_slides_cdn.007.ai
Binary file removed img/wrangling/data_frame_slides_cdn.007.jpeg
Binary file added img/wrangling/data_frame_slides_cdn.007.png
1,654 changes: 1,654 additions & 0 deletions img/wrangling/data_frame_slides_cdn.008.ai
Binary file removed img/wrangling/data_frame_slides_cdn.008.jpeg
Binary file added img/wrangling/data_frame_slides_cdn.008.png
2,237 changes: 2,237 additions & 0 deletions img/wrangling/data_frame_slides_cdn.009.ai
Binary file removed img/wrangling/data_frame_slides_cdn.009.jpeg
Binary file added img/wrangling/data_frame_slides_cdn.009.png
Binary file removed img/wrangling/mutate_function.jpeg
Binary file added img/wrangling/mutate_function.png
2,455 changes: 2,455 additions & 0 deletions img/wrangling/pivot_functions.001.ai
Binary file removed img/wrangling/pivot_functions.001.jpeg
Binary file added img/wrangling/pivot_functions.001.png
2,447 changes: 2,447 additions & 0 deletions img/wrangling/pivot_functions.002.ai
Binary file removed img/wrangling/pivot_functions.002.jpeg
Binary file added img/wrangling/pivot_functions.002.png
2,327 changes: 2,327 additions & 0 deletions img/wrangling/pivot_functions.003.ai
Binary file removed img/wrangling/pivot_functions.003.jpeg
Binary file added img/wrangling/pivot_functions.003.png
2,012 changes: 2,012 additions & 0 deletions img/wrangling/pivot_functions.004.ai
Binary file removed img/wrangling/pivot_functions.004.jpeg
Binary file added img/wrangling/pivot_functions.004.png
Binary file removed img/wrangling/pivot_longer.jpeg
Binary file added img/wrangling/pivot_longer.png
Binary file removed img/wrangling/pivot_wider.jpeg
Binary file added img/wrangling/pivot_wider.png
Binary file removed img/wrangling/separate_function.jpeg
Binary file added img/wrangling/separate_function.png
1,749 changes: 1,749 additions & 0 deletions img/wrangling/summarize.001.ai
Binary file removed img/wrangling/summarize.001.jpeg
Binary file added img/wrangling/summarize.001.png
2,501 changes: 2,501 additions & 0 deletions img/wrangling/summarize.002.ai
Binary file removed img/wrangling/summarize.002.jpeg
Binary file added img/wrangling/summarize.002.png
2,446 changes: 2,446 additions & 0 deletions img/wrangling/summarize.003.ai
Binary file removed img/wrangling/summarize.003.jpeg
Binary file added img/wrangling/summarize.003.png
2,045 changes: 2,045 additions & 0 deletions img/wrangling/summarize.004.ai
Binary file removed img/wrangling/summarize.004.jpeg
Binary file added img/wrangling/summarize.004.png
3,130 changes: 3,130 additions & 0 deletions img/wrangling/summarize.005.ai
Binary file removed img/wrangling/summarize.005.jpeg
Binary file added img/wrangling/summarize.005.png
2,380 changes: 2,380 additions & 0 deletions img/wrangling/tidy_data.001.ai
Binary file removed img/wrangling/tidy_data.001.jpeg
Binary file added img/wrangling/tidy_data.001.png
74 changes: 37 additions & 37 deletions source/classification1.Rmd
34 changes: 17 additions & 17 deletions source/classification2.Rmd
@@ -94,8 +94,8 @@ labels for new observations without known class labels.
> is. Imagine how bad it would be to overestimate your classifier's accuracy
> when predicting whether a patient's tumor is malignant or benign!

-```{r 06-training-test, echo = FALSE, warning = FALSE, fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
-knitr::include_graphics("img/classification2/training_test.jpeg")
+```{r 06-training-test, echo = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
+knitr::include_graphics("img/classification2/training_test.png")
```

How exactly can we assess how well our predictions match the actual labels for
@@ -108,7 +108,7 @@ test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test).

$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$
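As a quick check of the formula above (an illustrative Python sketch added here, not part of the book's R source), accuracy is simply the fraction of predicted labels that match the true labels:

```python
# Sketch of the accuracy formula:
# accuracy = (number of correct predictions) / (total number of predictions)
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical diagnosis labels for four test observations:
predicted = ["malignant", "benign", "benign", "malignant"]
actual = ["malignant", "benign", "malignant", "malignant"]
print(accuracy(predicted, actual))  # → 0.75 (3 of 4 correct)
```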

-```{r 06-ML-paradigm-test, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Process for splitting the data and finding the prediction accuracy.", fig.retina = 2, out.width = "100%"}
+```{r 06-ML-paradigm-test, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Process for splitting the data and finding the prediction accuracy.", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/classification2/ML-paradigm-test.png")
```

@@ -322,7 +322,7 @@ tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:
You will also notice that we set the random seed here at the beginning of the analysis
using the `set.seed` function, as described in Section \@ref(randomseeds).

-```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
+```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
# load packages
library(tidyverse)
library(tidymodels)
@@ -343,7 +343,7 @@ perim_concav <- cancer |>
ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
geom_point(alpha = 0.5) +
labs(color = "Diagnosis") +
-scale_color_manual(values = c("orange2", "steelblue2")) +
+scale_color_manual(values = c("darkorange", "steelblue")) +
theme(text = element_text(size = 12))

perim_concav
@@ -793,7 +793,7 @@ Here, $C=5$ different chunks of the data set are used,
resulting in 5 different choices for the **validation set**; we call this
*5-fold* cross-validation.
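To make the chunking concrete, here is a minimal sketch (in Python, with an invented helper name; the book itself does this via `tidymodels`) of how a $C$-fold split assigns every observation to exactly one validation set:

```python
# Sketch of C-fold cross-validation splitting: the data are divided
# into C chunks; each chunk serves once as the validation set while
# the remaining chunks together form the training set.
def cross_val_folds(n_obs, c):
    indices = list(range(n_obs))
    folds = []
    for i in range(c):
        validation = indices[i::c]  # every c-th observation, offset i
        training = [j for j in indices if j not in validation]
        folds.append((training, validation))
    return folds

# 5-fold split of 10 observations: five folds, each with a
# 2-observation validation set and an 8-observation training set.
for training, validation in cross_val_folds(10, 5):
    print(len(training), len(validation))
```

Across the five folds, the validation sets partition the data, so every observation is used for validation exactly once.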

-```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
+```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/classification2/cv.png")
```

@@ -989,7 +989,7 @@ accuracies
We can decide which number of neighbors is best by plotting the accuracy versus $K$,
as shown in Figure \@ref(fig:06-find-k).

-```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
@@ -1049,7 +1049,7 @@ we vary $K$ from 1 to almost the number of observations in the training set.
set.seed(1)
```

-```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))

knn_results <- workflow() |>
@@ -1093,7 +1093,7 @@ new data: if we had a different training set, the predictions would be
completely different. In general, if the model *is influenced too much* by the
training data, it is said to **overfit** the data.

-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Effect of K in overfitting and underfitting."}
ks <- c(1, 7, 20, 300)
plots <- list()

@@ -1137,7 +1137,7 @@ for (i in 1:length(ks)) {
size = 5.) +
labs(color = "Diagnosis") +
ggtitle(paste("K = ", ks[[i]])) +
-scale_color_manual(values = c("orange2", "steelblue2")) +
+scale_color_manual(values = c("darkorange", "steelblue")) +
theme(text = element_text(size = 18), axis.title=element_text(size=18))
}

@@ -1256,8 +1256,8 @@ by maximizing estimated accuracy via cross-validation. After we have tuned the
model we can use the test set to estimate its accuracy.
The overall process is summarized in Figure \@ref(fig:06-overview).

-```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of K-NN classification.", fig.retina = 2, out.width = "100%"}
-knitr::include_graphics("img/classification2/train-test-overview.jpeg")
+```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Overview of K-NN classification.", fig.retina = 2, out.width = "100%"}
+knitr::include_graphics("img/classification2/train-test-overview.png")
```

The overall workflow for performing K-nearest neighbors classification using `tidymodels` is as follows:
@@ -1344,7 +1344,7 @@ variables there are, the more (random) influence they have, and the more they
corrupt the set of nearest neighbors that vote on the class of the new
observation to predict.

-```{r 06-performance-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Effect of inclusion of irrelevant predictors."}
+```{r 06-performance-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Effect of inclusion of irrelevant predictors."}
# get accuracies after including k irrelevant features
ks <- c(0, 5, 10, 15, 20, 40)
fixedaccs <- list()
@@ -1418,7 +1418,7 @@ variables, the number of neighbors does not increase smoothly; but the general t
Figure \@ref(fig:06-fixed-irrelevant-features) corroborates
this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly.

-```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
+```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
plt_irrelevant_nghbrs <- ggplot(res) +
geom_line(mapping = aes(x=ks, y=nghbrs)) +
labs(x = "Number of Irrelevant Predictors",
@@ -1428,15 +1428,15 @@ plt_irrelevant_nghbrs <- ggplot(res) +
plt_irrelevant_nghbrs
```

-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.align = "center", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
names_to="Type",
values_to="accuracy")

plt_irrelevant_nghbrs <- ggplot(res_tmp) +
geom_line(mapping = aes(x=ks, y=accuracy, color=Type)) +
labs(x = "Number of Irrelevant Predictors", y = "Accuracy") +
-scale_color_discrete(labels= c("Tuned K", "K = 3")) +
+scale_color_manual(labels= c("Tuned K", "K = 3"), values = c("darkorange", "steelblue")) +
theme(text = element_text(size = 17), axis.title=element_text(size=17))

plt_irrelevant_nghbrs
@@ -1657,7 +1657,7 @@ predictors from the model! It is always worth remembering, however, that what cr
is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.

```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}

fwd_sel_accuracies_plot <- accuracies |>
ggplot(aes(x = size, y = accuracy)) +
50 changes: 28 additions & 22 deletions source/clustering.Rmd
@@ -44,7 +44,7 @@ hidden_print_cli <- function(x){
# set the colors in the graphs,
# some graphs with the code shown to students are hard coded
cbbPalette <- c(brewer.pal(9, "Paired"))
-cbpalette <- c("darkorange3", "dodgerblue3", "goldenrod1")
+cbpalette <- c("darkorange", "steelblue", "goldenrod1")

theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
@@ -143,7 +143,7 @@ Understanding this might help us with species discovery and classification in a
way. Note that we have reduced the size of the data set to 18 observations and 2 variables;
this will help us make clear visualizations that illustrate how clustering works for learning purposes.

-```{r 09-penguins, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A Gentoo penguin.", out.width="60%", fig.align = "center", fig.retina = 2}
+```{r 09-penguins, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "A Gentoo penguin.", out.width="60%", fig.align = "center", fig.retina = 2}
# image source: https://commons.wikimedia.org/wiki/File:Gentoo_Penguin._(8671680772).jpg
knitr::include_graphics("img/clustering/gentoo.jpg")
```
@@ -258,7 +258,7 @@ ggplot(penguins_clustered, aes(y = bill_length_standardized,
geom_point() +
xlab("Flipper Length (standardized)") +
ylab("Bill Length (standardized)") +
-scale_color_manual(values= c("darkorange3", "dodgerblue3", "goldenrod1"))
+scale_color_manual(values= c("darkorange", "steelblue", "goldenrod1"))
```

What are the labels for these groups? Unfortunately, we don't have any. K-means,
@@ -308,7 +308,7 @@ In the first cluster from the example, there are `r nrow(clus1)` data points. Th
(standardized flipper length `r round(mean(clus1$flipper_length_standardized),2)`, standardized bill length `r round(mean(clus1$bill_length_standardized),2)`) highlighted
in Figure \@ref(fig:10-toy-example-clus1-center).

-(ref:10-toy-example-clus1-center) Cluster 1 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in red.
+(ref:10-toy-example-clus1-center) Cluster 1 from the `penguins_standardized` data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline.

```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
base <- ggplot(penguins_clustered, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
@@ -318,7 +318,7 @@ base <- ggplot(penguins_clustered, aes(x = flipper_length_standardized, y = bill

base <- ggplot(clus1) +
geom_point(aes(y = bill_length_standardized, x = flipper_length_standardized),
-col = "dodgerblue3") +
+col = "steelblue") +
labs(x = "Flipper Length (standardized)", y = "Bill Length (standardized)") +
xlim(c(
min(clus1$flipper_length_standardized) - 0.25 *
@@ -334,8 +334,11 @@ base <- ggplot(clus1) +
)) +
geom_point(aes(y = mean(bill_length_standardized),
x = mean(flipper_length_standardized)),
-color = "#F8766D",
-size = 5) +
+size = 4,
+shape = 21,
+stroke = 1,
+color = "black",
+fill = "steelblue")+
theme(legend.position = "none")

base
@@ -354,7 +357,7 @@ S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (

These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists) for the first cluster of the penguin data example.
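The sum-of-squared-distances computation described above can be written out directly (an illustrative Python sketch added for this review, not the book's R code):

```python
# Sketch of the within-cluster sum of squared distances S^2:
# sum the squared Euclidean distances from each point in the
# cluster to the cluster center (mu_x, mu_y), the mean of the points.
def wssd(points):
    n = len(points)
    mu_x = sum(x for x, _ in points) / n
    mu_y = sum(y for _, y in points) / n
    return sum((x - mu_x) ** 2 + (y - mu_y) ** 2 for x, y in points)

# Toy cluster whose center is (1, 1):
print(wssd([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]))  # → 8.0
```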

-(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.
+(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguins_standardized` data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The distances from the observations to the cluster center are represented as black lines.

```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
base <- ggplot(clus1)
@@ -373,13 +376,16 @@ for (i in 1:nrow(clus1)) {
base <- base +
geom_point(aes(y = mean(bill_length_standardized),
x = mean(flipper_length_standardized)),
-color = "#F8766D",
-size = 5)
+size = 4,
+shape = 21,
+stroke = 1,
+color = "black",
+fill = "steelblue")

base <- base +
geom_point(aes(y = bill_length_standardized,
x = flipper_length_standardized),
-col = "dodgerblue3") +
+col = "steelblue") +
labs(x = "Flipper Length (standardized)", y = "Bill Length (standardized)") +
theme(legend.position = "none")

@@ -397,7 +403,7 @@ this means adding up all the squared distances for the 18 observations.
These distances are denoted by black lines in
Figure \@ref(fig:10-toy-example-all-clus-dists).

-(ref:10-toy-example-all-clus-dists) All clusters from the `penguins_standardized` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.
+(ref:10-toy-example-all-clus-dists) All clusters from the `penguins_standardized` data set example. Observations are small orange, blue, and yellow points with cluster centers denoted by larger points with a black outline. The distances from the observations to each of the respective cluster centers are represented as black lines.

```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
all_clusters_base <- ggplot(penguins_clustered)
@@ -434,20 +440,20 @@ all_clusters_base <- all_clusters_base +
color = cluster)) +
xlab("Flipper Length (standardized)") +
ylab("Bill Length (standardized)") +
-scale_color_manual(values= c("darkorange3",
-"dodgerblue3",
+scale_color_manual(values= c("darkorange",
+"steelblue",
"goldenrod1"))

all_clusters_base <- all_clusters_base +
geom_point(aes(y = cluster_centers$y[1],
x = cluster_centers$x[1]),
-color = "#F8766D", size = 3) +
+color = "black", fill = "darkorange", size = 4, stroke = 1, shape = 21) +
geom_point(aes(y = cluster_centers$y[2],
x = cluster_centers$x[2]),
-color = "#F8766D", size = 3) +
+color = "black", fill = "steelblue", size = 4, stroke = 1, shape = 21) +
geom_point(aes(y = cluster_centers$y[3],
x = cluster_centers$x[3]),
-color = "#F8766D", size = 3)
+color = "black", fill = "goldenrod1", size = 4, stroke = 1, shape = 21)

all_clusters_base
```
@@ -821,7 +827,7 @@ Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.

-```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
+```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
set.seed(3)

kclusts <- tibble(k = 1:9) |>
@@ -885,7 +891,7 @@ decrease the total WSSD, but by only a *diminishing amount*. If we plot the tota
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).
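One way to automate the elbow heuristic just described is to stop at the first K for which the next cluster yields only a small relative drop in total WSSD. This is a hypothetical sketch with an invented threshold and made-up WSSD values; in practice the book recommends judging the plot by eye:

```python
# Hypothetical elbow heuristic: choose the smallest K for which
# adding one more cluster reduces total WSSD by less than a fixed
# fraction of its current value.
def elbow_k(total_wssd, threshold=0.1):
    """total_wssd: dict mapping K -> total WSSD, for K = 1, 2, ..."""
    for k in range(1, max(total_wssd)):
        relative_drop = (total_wssd[k] - total_wssd[k + 1]) / total_wssd[k]
        if relative_drop < threshold:
            return k  # cluster k+1 barely helps, so K = k is the elbow
    return max(total_wssd)

# Invented WSSD values that level off after K = 3:
print(elbow_k({1: 36.0, 2: 20.0, 3: 6.0, 4: 5.5, 5: 5.1}))  # → 3
```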

-```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
+```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = "center", fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
geom_point(size = 2) +
geom_line() +
Expand All @@ -894,7 +900,7 @@ p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
xend = 3.1,
yend = 6),
arrow = arrow(length = unit(0.2, "cm"))) +
-annotate("text", x = 4.4, y = 19, label = "Elbow", size = 7, color = "blue") +
+annotate("text", x = 4.4, y = 19, label = "Elbow", size = 5) +
labs(x = "Number of Clusters", y = "Total WSSD") +
#theme(text = element_text(size = 20)) +
scale_x_continuous(breaks = 1:9)
@@ -1009,8 +1015,8 @@ cluster_plot <- ggplot(clustered_data,
labs(x = "Flipper Length",
y = "Bill Length",
color = "Cluster") +
-scale_color_manual(values = c("dodgerblue3",
-"darkorange3",
+scale_color_manual(values = c("steelblue",
+"darkorange",
"goldenrod1")) +
theme(text = element_text(size = 12))
