
# Chi-squared Test {#sec-chap05}

```{r}
#| label: setup
#| include: false

base::source(file = "R/helper.R")
ggplot2::theme_set(ggplot2::theme_bw()) 
```


```{r}
#| label: cranlogs
#| include: false
#| eval: false

## run only once manually #########
cranlogs_chi_squared_residuals <-  
    pkgs_dl(c("janitor", "questionr", "rstatix", "descr"))
save_data_file("chap05", cranlogs_chi_squared_residuals, "cranlogs_chi_squared_residuals.rds")

cranlogs_cramers_v <- pkgs_dl(c("lsr", "rcompanion", "rstatix", "DescTools",
         "confintr", "sjstats", "collinear"))
save_data_file("chap05", cranlogs_cramers_v, "cranlogs_cramers_v.rds")
```


## Achievements to unlock

::: {#obj-chap05}
::: {.my-objectives}
::: {.my-objectives-header}
Objectives for chapter 05
:::

::: {.my-objectives-container}
**SwR Achievements**

-   **Achievement 1**: Understanding the relationship between two
    categorical variables using bar charts, frequencies, and percentages (@sec-chap05-achievement1)
-   **Achievement 2**: Computing and comparing observed and expected
    values for the groups (@sec-chap05-achievement2)
-   **Achievement 3**: Calculating the chi-squared statistic for the
    test of independence (@sec-chap05-achievement3)
-   **Achievement 4**: Interpreting the chi-squared statistic and making
    a conclusion about whether or not there is a relationship (@sec-chap05-achievement4)
-   **Achievement 5**: Using Null Hypothesis Significance Testing to
    organize statistical testing (@sec-chap05-achievement5)
-   **Achievement 6**: Using standardized residuals to understand which
    groups contributed to significant relationships (@sec-chap05-achievement6)
-   **Achievement 7**: Computing and interpreting effect sizes to
    understand the strength of a significant chi-squared relationship (@sec-chap05-achievement7)
-   **Achievement 8**: Understanding the options for failed chi-squared
    assumptions (@sec-chap05-achievement8)
:::
:::
Achievements for chapter 05
:::

## The voter fraud problem

Studies suggest that voter fraud does happen but is rare. In contrast
to these studies, a sizable minority of people in the US (20-30%) believe
that voter fraud is a big problem. Many states are building barriers to
voting, while other states are making voting easier, for instance with
automatic voter registration bills.

## Resources & Chapter Outline

### Data, codebook, and R packages {#sec-chap05-data-codebook-packages}

::: {.my-resource}
::: {.my-resource-header}
:::::: {#lem-chap05-resources}
: Data, codebook, and R packages for learning about the chi-squared test
::::::
:::

::: {.my-resource-container}
**Data**

Two options for accessing the data:

1.  Download the data set `pew_apr_19-23_2017_weekly_ch5.sav` from
    <https://edge.sagepub.com/harris1e>
2.  Download the data set from the `r glossary("Pew Research Center")`
    website
    (<https://www.people-press.org/2017/06/28/public-supports-aimof-making-it-easy-for-all-citizens-to-vote/>)

**Codebook**

Two options for accessing the documentation:

1.  Download the documentation files `pew_voting_april_2017_ch5.pdf`,
    `pew_voting_demographics_april_2017_ch5.docx`, and
    `pew_chap5_readme.txt` from <https://edge.sagepub.com/harris1e>
2.  Download the data set from the [Pew Research Center website](https://www.pewresearch.org/download-datasets/) and the
    documentation will be included with the zipped file.

**Packages**

1.  Packages used with the book (sorted alphabetically)

-   {**descr**}: @pak-descr (Jakson Alves de Aquino)
-   {**fmsb**}: @pak-fmsb (Minato Nakazawa)
-   {**haven**}: @pak-haven (Hadley Wickham)
-   {**lsr**}: @pak-lsr (Danielle Navarro[^05-chi-squared-1])
-   {**tidyverse**}: @pak-tidyverse (Hadley Wickham)

2.  My additional packages (sorted alphabetically)
:::
:::

[^05-chi-squared-1]: Not Daniel Navarro as mentioned in the book.
    Danielle has changed her gender.

### Get data

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-get-pew-data}
: Get Pew data about public support for making it easy to vote
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: get-pew-data
#| eval: false

## run only once manually #########
vote <- haven::read_sav("data/chap05/pew_apr_19-23_2017_weekly_ch5.sav")

vote <- vote |> 
    labelled::remove_labels()
save_data_file("chap05", vote, "vote.rds")

```

***

(*No output is available for this R code chunk*)

:::::{.my-remark}
:::{.my-remark-header}
Removing labels
:::
::::{.my-remark-container}
`haven::zap_labels()` as used in the book removes value labels and not variable labels. The correct function would be `haven::zap_label()`. I have used the {**labelled**} package where you can use `labelled::remove_labels()` to delete both (variable & value labels).
::::
:::::

::::
:::::

:::::{.my-watch-out}
:::{.my-watch-out-header}
Error message with labelled data
:::
::::{.my-watch-out-container}
I removed the labels immediately after import, because summary functions (e.g., `base::summary()`, `skimr::skim()`, `dplyr::summarize()`) threw an error whenever I rendered the file (but not when I ran the code chunk on its own).

I didn't have time to look into this issue --- and I had to remove the labels anyway. 

What follows is the error message:

```
Quitting from lines 180-186 [show-pew-raw-data] (05-chi-squared.qmd)
Error in `dplyr::summarize()`:
ℹ In argument: `skimmed = purrr::map2(...)`.
Caused by error in `purrr::map2()`:
ℹ In index: 1.
ℹ With name: character.
Caused by error in `dplyr::summarize()`:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
Caused by error in `across()`:
! Can't compute column `state_~!@#$%^&*()-+character.empty`.
Caused by error in `as.character()`:
! Can't convert `x` <haven_labelled> to <character>.
Backtrace:
  1. skimr::skim(vote)
 28. skimr (local) `<fn>`(state)
 29. x %in% empty_strings
 31. base::mtfrm.default(`<hvn_lbll>`)
 33. vctrs:::as.character.vctrs_vctr(x)
 ```
::::
:::::


### Show raw data

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-show-pew-raw-data}
: Show raw Pew data about public support for making it easy to vote
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: show-pew-raw-data
#| results: hold
#| cache: true

vote <-  base::readRDS("data/chap05/vote.rds")
skimr::skim(vote)
```

::::
:::::


### Recode data for chapter 5

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-recode-pew-data}
: Select some columns from the pew data set
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: recode-pew-data
#| results: hold

vote <-  base::readRDS("data/chap05/vote.rds")

## create vote_clean #############
vote_clean <-  vote |> 
    dplyr::select(pew1a, pew1b, race, sex, 
                  mstatus, ownhome, employ, polparty) |> 
    labelled::remove_labels() |> 
    dplyr::mutate(dplyr::across(1:8, forcats::as_factor)) |> 
    naniar::replace_with_na(replace = list(
        pew1a = c(5, 9),
        pew1b = c(5, 9),
        race = 99,
        ownhome = c(8, 9)
    )) |> 
    dplyr::mutate(pew1a = forcats::fct_recode(pew1a,
             "Register to vote" = "1",
             "Make easy to vote" = "2",
             )) |> 
    dplyr::mutate(pew1b = forcats::fct_recode(pew1b,
             "Require to vote" = "1",
             "Choose to vote" = "2",
             )) |> 
    dplyr::mutate(race = forcats::fct_recode(race,
             "White non-Hispanic" = "1",
             "Black non-Hispanic" = "2",
             )) |> 
    dplyr::mutate(race = forcats::fct_collapse(race,
             "Hispanic" = c("3", "4", "5"),
             "Other" = c("6", "7", "8", "9", "10")
    )) |> 
    dplyr::mutate(sex = forcats::fct_recode(sex,
             "Male" = "1",
             "Female" = "2",
             )) |> 
    dplyr::mutate(ownhome = forcats::fct_recode(ownhome,
             "Owned" = "1",
             "Rented" = "2",
             )) |> 
    dplyr::mutate(dplyr::across(1:8, forcats::fct_drop)) |> 
    dplyr::rename(ease_vote = "pew1a",
                  require_vote = "pew1b")

save_data_file("chap05", vote_clean, "vote_clean.rds")
    
skimr::skim(vote_clean)
```

***
In this recoding chunk I have used several functions for the first time:

- I turned all selected columns into factor variables with just one line of code, using `dplyr::across()` in combination with `forcats::as_factor()`.
- I replaced the codes for missing values with `NA` using the `replace_with_na()` function of the {**naniar**} package (see @pak-naniar).
- I combined several levels with `forcats::fct_collapse()`.
- Finally, I dropped all unused levels in the whole data.frame using `dplyr::across()` in conjunction with `forcats::fct_drop()`.


::::
:::::


## Achievement 1: Relationship of two categorical variables {#sec-chap05-achievement1}

### Descriptive statistics

For better display I have reversed the order of the variables: Instead of grouping by ease of voting I group by race/ethnicity. This gives a narrower table with only two columns instead of four; the four-column version would not fit on the screen without horizontal scrolling.


:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-stats-voting-data}
: Frequencies between two categorical variables
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### summarize()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-summarize-ease-voting}
: Summarize relationship ease of vote and race/ethnicity
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: summarize-ease-voting
#| results: hold
#| cache: true

## load vote_clean ##########
vote_clean <-  base::readRDS("data/chap05/vote_clean.rds")

ease_vote_sum <- vote_clean |> 
    tidyr::drop_na(ease_vote) |> 
    tidyr::drop_na(race) |> 
    dplyr::group_by(race, ease_vote) |> 
    ## either summarize
    dplyr::summarize(n = dplyr::n(),
                     .groups = "keep")
    ## or count the observation in each group
    # dplyr::count()
ease_vote_sum
```
***
Here I used "standard" tidyverse code to count frequencies. Instead of the somewhat complex `dplyr::summarize()` call I could have used just `dplyr::count()` with the same result.


::::
:::::

:::::{.my-watch-out}
:::{.my-watch-out-header}
WATCH OUT! Prevent warning with `.groups` argument
:::
::::{.my-watch-out-container}
By using two variables inside `dplyr::group_by()` I got a warning message:

> `summarise()` has grouped output by 'ease_vote'. 
> You can override using the `.groups` argument.

At first I had to set the chunk option `warning: false` to turn off this warning. But finally I managed to prevent the warning with R code. See the [summarize help page](https://dplyr.tidyverse.org/reference/summarise.html) under arguments `.groups`. Another option to suppress the warning would have been `options(dplyr.summarise.inform = FALSE)`. See also the two [comments in StackOverflow](https://stackoverflow.com/questions/71914704/override-using-groups-argument) and [r-stats-tips](https://rstats-tips.net/2020/07/31/get-rid-of-info-of-dplyr-when-grouping-summarise-regrouping-output-by-species-override-with-groups-argument/).
::::
:::::
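
The effect of the `.groups` argument can be checked with `dplyr::group_vars()`. The following is only a small sketch with made-up data (the data.frame `toy` and its columns are hypothetical, not the Pew data):

```r
## hypothetical toy data, not the Pew data
toy <- base::data.frame(
    race = c("A", "A", "A", "B", "B"),
    opinion = c("yes", "no", "yes", "yes", "no")
)

grouped <- toy |>
    dplyr::group_by(race, opinion)

## keep both grouping variables
keep <- grouped |>
    dplyr::summarize(n = dplyr::n(), .groups = "keep")

## drop only the last grouping variable, stated explicitly
drop_last <- grouped |>
    dplyr::summarize(n = dplyr::n(), .groups = "drop_last")

## return an ungrouped tibble
dropped <- grouped |>
    dplyr::summarize(n = dplyr::n(), .groups = "drop")

## show which grouping variables remain in each result
dplyr::group_vars(keep)
dplyr::group_vars(drop_last)
dplyr::group_vars(dropped)
```

Which variant fits best depends on what happens next in the pipe: `"drop"` is convenient when no further grouped computation follows.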


###### pivot_wider()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pivot-wider-ease-voting}
: Summarize by converting data from long to wide with `pivot_wider()` from {**tidyr**}
::::::
:::
::::{.my-r-code-container}

:::{#lst-chap05-pivot-wider-ease-voting}
```{r}
#| label: pivot-wider-ease-voting
#| cache: true

ease_vote_wider <- vote_clean |> 
    tidyr::drop_na(ease_vote) |> 
    tidyr::drop_na(race) |> 
    dplyr::group_by(race, ease_vote) |> 
    dplyr::summarize(
        n = dplyr::n(),
        .groups = "keep") |> 
    tidyr::pivot_wider(
        names_from = ease_vote,
        values_from = n
    )
ease_vote_wider
```
Summarizing and converting data from long to wide with `pivot_wider()` from {**tidyr**}
:::

***

With `tidyr::pivot_wider()` we get a more neatly arranged table.
::::
:::::

###### table()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-base-table-ease-voting}
: Summarize with `base::table()`
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: base-table-ease-voting
#| results: hold

ease_vote_table <- base::table(
    vote_clean$race, 
    vote_clean$ease_vote,
    dnn = c("Race", "Ease of voting")
)
ease_vote_table
```
***

Note that NA's are automatically excluded from the table.


::::
:::::

With the simple `base::table()` we get a result very similar to the more complex `tidyr::pivot_wider()` variant in @lst-chap05-pivot-wider-ease-voting. 

But I still prefer the tidyverse version for several reasons:

:::::{.my-remark}
:::{.my-remark-header}
Some deficiencies of `base::table()` 
:::
::::{.my-remark-container}

- `table()` works on vectors (or on all columns of a data.frame at once) rather than on selected columns of a data.frame, so it does not chain well with other commands in a ` |> ` pipe.
- `table()` does not return a data.frame.
- `table()` output is difficult to format and to make print-ready.
::::
:::::
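
If a `table` object is needed as a data.frame later on, base R can convert it. A minimal sketch with made-up vectors (the names `race` and `opinion` and their values are hypothetical, not the Pew data):

```r
## hypothetical toy vectors, not the Pew data
race <- c("A", "A", "B", "B", "B", NA)
opinion <- c("yes", "no", "yes", "yes", "no", "yes")

## NA values are excluded by default
toy_table <- base::table(race, opinion)

## long format: one row per cell, counts in the column `Freq`
toy_long <- base::as.data.frame(toy_table)

## wide format: keeps the layout of the printed table
toy_wide <- base::as.data.frame.matrix(toy_table)

toy_long
toy_wide
```

The long format is the one that fits best into a tidyverse pipe; the wide format is closer to what `tidyr::pivot_wider()` returns.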

###### xtabs()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-base-xtabs-ease-voting}
: Summarize with `stats::xtabs()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: base-xtabs-ease-voting

ease_vote_xtabs <- stats::xtabs(n ~ race + ease_vote, data = ease_vote_sum)
ease_vote_xtabs
```

::::
:::::


###### tabyl()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-tabyl-voting-data}
: Frequencies with `tabyl()` from {**janitor**}
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: tabyl-voting-data
#| results: hold

ease_vote_tabyl <- vote_clean |> 
    janitor::tabyl(race, ease_vote, show_na = FALSE)
ease_vote_tabyl
```
***

`janitor::tabyl()` avoids the weaknesses of the `base::table()` function. It works with data.frames, is tidyverse compatible, and has many `adorn_*` functions (`adorn_` stands for "adornment") to format the output values.
::::
:::::

###### prop.table()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-contingency-prop-table-voting-data}
: Summarize with a base R proportion contingency table 
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: prop-contingency-table-voting-data
#| results: hold

base::prop.table(
    base::table(`Race / Ethnicity` = vote_clean$race,
          `Ease of voting` = vote_clean$ease_vote), margin = 1)
```
***
Everything I said about the flaws of `base::table()` is of course valid for the `base::prop.table()` function as well.

::::
:::::
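
The `margin` argument decides which totals are used as the denominator. A small sketch with a made-up table (the counts are hypothetical, not the Pew data) shows the difference between row, column, and overall proportions:

```r
## hypothetical counts, not the Pew data
toy_counts <- base::matrix(
    c(30, 20,
      10, 40),
    nrow = 2, byrow = TRUE,
    dimnames = list(group = c("A", "B"), opinion = c("yes", "no"))
)

## margin = 1: each row sums to 1 (percent within group)
row_props <- base::prop.table(toy_counts, margin = 1)

## margin = 2: each column sums to 1 (percent within opinion)
col_props <- base::prop.table(toy_counts, margin = 2)

## no margin: all cells together sum to 1
all_props <- base::prop.table(toy_counts)

base::rowSums(row_props)
base::colSums(col_props)
```

For a table with race/ethnicity in the rows, `margin = 1` is therefore the variant that gives percentages within each race/ethnicity group.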

###### tabyl() formatted

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-janitor-voting-data}
: Frequencies with `tabyl()` from {**janitor**} formatted
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: janitor-voting-data
#| results: hold

vote_clean |> 
    janitor::tabyl(race, ease_vote, show_na = FALSE) |> 
    janitor::adorn_percentages("row")  |> 
    janitor::adorn_pct_formatting(digits = 2)  |> 
    janitor::adorn_ns() |> 
    janitor::adorn_title(row_name = "Race / Ethnicity",
                         col_name = "Ease of voting")
```
***

In this example you can see the power of the {**janitor**} package. The main purpose of {**janitor**} is data cleaning, but because counting is such a fundamental part of data cleaning and exploration, the `tabyl()` and `adorn_*()` functions have been included in this package.
::::
:::::

###### Ease of voting

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-ease-voting-data}
: Ease of voting by race / ethnicity
::::::
:::
::::{.my-r-code-container}

::: {#lst-chap05-ease-voting}
```{r}
#| label: ease-voting-data
#| results: hold

vote_clean |> 
    janitor::tabyl(race, ease_vote, show_na = FALSE) |> 
    janitor::adorn_percentages("row")  |> 
    janitor::adorn_pct_formatting(digits = 2)  |> 
    janitor::adorn_ns() |> 
    janitor::adorn_title(row_name = "Race / Ethnicity",
                         col_name = "Ease of voting")
```

Ease of voting by race / ethnicity
:::

***

::: {.callout-tip}
The voting registration policy a person favors differed by race/ethnicity.

- White non-Hispanic participants were fairly evenly divided between those who thought people should register if they want to vote and those who thought voting should be made as easy as possible.
- The other three race-ethnicity groups had larger percentages in favor of making it as easy as possible to vote.
- Black non-Hispanic participants have the highest percentage (77.78%) in favor of making it easy to vote.
:::


::::
:::::

###### Require to vote

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-require-voting-data}
: Voting as requirement or free choice by race / ethnicity
::::::
:::
::::{.my-r-code-container}

::: {#lst-chap05-require-voting}
```{r}
#| label: require-voting-data
#| results: hold

vote_clean |> 
    janitor::tabyl(race, require_vote, show_na = FALSE) |> 
    janitor::adorn_percentages("row")  |> 
    janitor::adorn_pct_formatting(digits = 2)  |> 
    janitor::adorn_ns() |> 
    janitor::adorn_title(row_name = "Race / Ethnicity",
                         col_name = "Voting as citizen duty or as a free choice?")
```
Voting as requirement or free choice by race / ethnicity

:::

***

::: {.callout-tip}
Different ethnicities have distinct opinions about the character of voting. 

- About one-third of Black non-Hispanic and Hispanic participants believe that voting should be a requirement. On the other hand, this means that about two-thirds of both groups see voting as a free choice. 
- White non-Hispanic and other non-Hispanic participants differ from this: in those groups more than 80% favor voting as a free choice.

:::


::::
:::::

:::

::::
:::::

:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap05-cross-tabulation}
Cross-Tabulation
::::::
:::
::::{.my-resource-container}

- [Working with Tables in R](https://bookdown.org/kdonovan125/ibis_data_analysis_r4/working-with-tables-in-r.html) in [@donovan2019a].
- [Cross-Tabulation in R](https://www.marsja.se/cross-tabulation-in-r-creating-interpreting-contingency-tables/): Creating & Interpreting Contingency Tables [@marsja2023].
- [Tables in R](https://cran.r-project.org/web/packages/DescTools/vignettes/TablesInR.pdf): A Quick Practical Overview [@signorell2021], see also [@pak-DescTools].
- [Introduction to Crosstable](https://cran.r-project.org/web/packages/crosstable/vignettes/crosstable.html) [@chalthiel2023], see also [@pak-crosstable].

::::
:::::


### Graphs

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-descriptive-graphs}
: Descriptive graphs
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### geom_col()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pew-voting-geom-col-graph}
: Visualizing opinions about ease of voting by race / ethnicity 
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-pew-voting-geom-col-graph
#| fig-cap: "Opinion on ease of voting by race / ethnicity from a study of the Pew Research Center 2017 (n = 1,028)"

p_ease_vote <- vote_clean |> 
    ## prepare data
    tidyr::drop_na(ease_vote) |>
    tidyr::drop_na(race) |>
    dplyr::group_by(race, ease_vote) |>
    dplyr::count() |>
    dplyr::group_by(race) |>
    dplyr::mutate(perc = n / base::sum(n)) |>
    
    ## draw graph
    ggplot2::ggplot(
        ggplot2::aes(
            x = race, 
            y = perc,
            fill = ease_vote)
    ) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::scale_y_continuous(labels = scales::percent) +
    ggplot2::labs(
        x = "Race / Ethnicity",
        y = "Percent"
    ) +
    ggplot2::scale_fill_viridis_d(
        name = "Ease of voting",
        alpha = .8, # here alpha works!!
        begin = .25,
         end = .75,
        direction = -1,
        option = "viridis"
    )

p_ease_vote
```
***

I had several difficulties in drawing this graph:

1. Most important: I did not know that the second variable `ease_vote` has to be mapped to the `fill` aesthetic. That did not seem logical at first, but together with `position = "dodge"` it makes sense.
2. I didn't know that I have to group by race again (the line after `dplyr::count()`), so that the percentages are computed within each race / ethnicity group.
3. I thought that I could calculate the percentages with `ggplot2::after_stat()`. The solution was more trivial: Creating a new column with the calculated percentages and using `geom_col()` instead of `geom_bar()`.

Instead of `ggplot2::geom_col(position = "dodge")` I could have used `ggplot2::geom_bar(position = "dodge", stat = "identity")` with the same result. `geom_bar()` uses `ggplot2::stat_count()` by default. It is possible to override this default, as was done in the book code, but it is easier here to use `geom_col()` because it uses `stat_identity()` by default, i.e., it leaves the data as is.

::: {.callout-note #nte-chap05-changes}
**Two additional remarks**:

1. I have used here the percent scale from the {**scales**} package to get percent signs on the y-axis.
2. I practiced my learnings from @sec-chap03 about adding a color-friendly palette (see @sec-chap03-practice-test). (See also my color test in @cnj-chap05-color-test-bw.)
:::


::::
:::::

###### geom_bar()

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pew-voting-geom-bar-graph}
: Visualizing opinions about ease of voting by race / ethnicity 
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-pew-voting-geom-bar-graph
#| fig-cap: "Opinion on ease of voting by race / ethnicity from a study of the Pew Research Center 2017 (n = 1,028)"


vote_clean |> 
    tidyr::drop_na(ease_vote) |>
    tidyr::drop_na(race) |>
    ggplot2::ggplot(
        ggplot2::aes(
            x = race, 
            fill = ease_vote
        )
    ) +
    ggplot2::geom_bar(position = "dodge",
        ggplot2::aes(
            y = ggplot2::after_stat(count / base::sum(count))
        )) +
    ggplot2::scale_y_continuous(labels = scales::percent) +
    ggplot2::labs(
        x = "Race / Ethnicity",
        y = "Percent"
    ) +
    ggplot2::scale_fill_viridis_d(
        name = "Ease of voting",
        alpha = .8, # here alpha works!!
        begin = .25,
         end = .75,
        direction = -1,
        option = "viridis"
    )
```
***

Here I have used `geom_bar()` with the `after_stat()` calculation. It turned out that `count / sum(count)` computes each bar's share of all observations, not the percentage within each race / ethnicity group. This was not what I had intended.

I tried for several hours to use `after_stat()` with the same result as in @cnj-chap05-pew-voting-geom-col-graph, but I didn't succeed. I do not know if the reason is my missing knowledge (for instance to generate another structure of the data.frame) or if you can't do that in general. 

::::
:::::
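
One approach that may give within-group percentages directly in `geom_bar()` is to divide each count by the sum of the counts that share the same x position. The sketch below assumes that `after_stat()` is evaluated on the computed data of the whole layer, so that `stats::ave()` can sum `count` per `x`. The toy data are made up and are not the Pew data.

```r
## hypothetical toy data, not the Pew data
toy <- base::data.frame(
    race = base::rep(c("A", "B"), times = c(10, 20)),
    opinion = c(
        base::rep(c("yes", "no"), times = c(4, 6)),
        base::rep(c("yes", "no"), times = c(15, 5))
    )
)

p_within <- ggplot2::ggplot(
        toy,
        ggplot2::aes(x = race, fill = opinion)
    ) +
    ggplot2::geom_bar(
        position = "dodge",
        ggplot2::aes(
            ## count per bar divided by the total count at the same x position
            y = ggplot2::after_stat(
                count / stats::ave(count, x, FUN = base::sum)
            )
        )
    ) +
    ggplot2::scale_y_continuous(labels = scales::percent) +
    ggplot2::labs(x = "Group", y = "Percent within group")

p_within
```

Computing the percentages beforehand and using `geom_col()` remains the more transparent route, because the percentages are then visible in the data before plotting.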

###### geom_col() with labels

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pew-voting-geom-col-label-graph}
: Visualizing opinions about ease of voting by race / ethnicity 
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-pew-voting-geom-col-label-graph
#| fig-cap: "Opinion on ease of voting by race / ethnicity from a study of the Pew Research Center 2017 (n = 1,028)"

vote_clean |> 
    tidyr::drop_na(ease_vote) |>
    tidyr::drop_na(race) |>
    dplyr::group_by(race, ease_vote) |>
    dplyr::count() |>
    dplyr::group_by(race) |>
    dplyr::mutate(perc = n / base::sum(n)) |>
    ggplot2::ggplot(
        ggplot2::aes(
            x = race, 
            y = perc,
            fill = ease_vote)
    ) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::geom_label(
        ggplot2::aes(
            x = race,
            y = perc,
            label = paste0(round(100 * perc, 1),"%"),
            vjust = 1.5, hjust = -.035
        ),
        color = "white"
    ) +
    ggplot2::scale_y_continuous(labels = scales::percent) +
    ggplot2::labs(
        x = "Race / Ethnicity",
        y = "Percent"
    ) +
    ggplot2::scale_fill_viridis_d(
        name = "Ease of voting",
        alpha = .8, # here alpha works!!
        begin = .25,
         end = .75,
        direction = -1,
        option = "viridis"
    )
    
```
***

Here I have experimented with labels. Because `geom_label()` does not inherit `position = "dodge"` from `geom_col()`, the labels are not placed on the appropriate bars.

::::
:::::
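
One option for placing the labels on their bars is to give the label layer its own `ggplot2::position_dodge()`. The sketch assumes a dodge width of 0.9, which matches the default bar width; the toy percentages are made up and are not the Pew data.

```r
## hypothetical toy percentages, not the Pew data
toy_perc <- base::data.frame(
    race = c("A", "A", "B", "B"),
    opinion = c("yes", "no", "yes", "no"),
    perc = c(0.4, 0.6, 0.75, 0.25)
)

p_labels <- ggplot2::ggplot(
        toy_perc,
        ggplot2::aes(x = race, y = perc, fill = opinion)
    ) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::geom_label(
        ggplot2::aes(
            label = base::paste0(base::round(100 * perc, 1), "%")
        ),
        ## dodge the labels by the same width as the bars
        position = ggplot2::position_dodge(width = 0.9),
        vjust = 1.5,
        color = "white",
        show.legend = FALSE
    ) +
    ggplot2::scale_y_continuous(labels = scales::percent)

p_labels
```

The grouping for the dodge comes from the `fill` aesthetic in the global `aes()`, so the label layer needs no extra `group` mapping here.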

###### requirements

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pew-voting-requirements-by-race}
: Visualizing opinions about requirements of voting by race / ethnicity 
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-pew-voting-requirements-by-race
#| fig-cap: "Opinion on voting requirements by race / ethnicity from a study of the Pew Research Center 2017 (n = 1,028)"

p_require_vote <- vote_clean |> 
    ## prepare data
    tidyr::drop_na(require_vote) |>
    tidyr::drop_na(race) |>
    dplyr::group_by(race, require_vote) |>
    dplyr::count() |>
    dplyr::group_by(race) |>
    dplyr::mutate(perc = n / base::sum(n)) |>
    
    ## draw graph
    ggplot2::ggplot(
        ggplot2::aes(
            x = race, 
            y = perc,
            fill = require_vote)
    ) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::scale_y_continuous(labels = scales::percent) +
    ggplot2::labs(
        x = "Race / Ethnicity",
        y = "Percent"
    ) +
    ggplot2::scale_fill_viridis_d(
        name = "Requirements of voting",
        alpha = .8, # here alpha works!!
        begin = .25,
         end = .75,
        direction = -1,
        option = "viridis"
    )

p_require_vote
```


::::
:::::

###### Voting by race

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-voting-opinions-by-race}
: Visualizing opinions about voting by race / ethnicity 
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-pew-voting-by-race
#| fig-cap: "Opinion on ease of voting and voting requirements by race / ethnicity from a study of the Pew Research Center 2017 (n = 1,028)"
#| fig-height: 6
#| warning: false

p_ease <- p_ease_vote +
    ggplot2::labs(
        x = "",
        y = "Percent within group"
    ) +
    ggplot2::scale_fill_viridis_d(
        name = "Opinion on\nvoter registration",
        alpha = .8, 
        begin = .25,
        end = .75,
        direction = -1,
        option = "viridis"
    ) +
    ggplot2::theme(axis.text.x = ggplot2::element_blank())

p_require <- p_require_vote +
    ggplot2::labs(y = "Percent within group") +
    ggplot2::scale_fill_viridis_d(
        name = "Opinion on\nvoting",
        alpha = .8,
        begin = .25,
        end = .75,
        direction = -1,
        option = "viridis"
    )

gridExtra::grid.arrange(p_ease, p_require, ncol = 1)

```

::::
:::::


###### Color test

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-color-test-bw}
: Test how the colors used for the graph race by ease of voting look for printing in black & white
::::::
:::
::::{.my-r-code-container}


::: {#lst-chap05-color-test-bw}

```{r}
#| label: fig-color-test-bw
#| fig-cap: "Test whether the colors used in my graph of race by ease of voting are also readable in black & white printing"
#| fig-height: 3
#| results: hold

pal_data <- list(names = c("Normal", "desaturated"),
    color = list(scales::viridis_pal(
                                alpha = .8, 
                                begin = .25, 
                                 end = .75, 
                                direction = -1, 
                                option = "viridis")(2),
    colorspace::desaturate(scales::viridis_pal(
                                alpha = .8, 
                                begin = .25, 
                                end = .75, 
                                direction = -1, 
                                option = "viridis")(2)))
    )
list_plotter(pal_data$color, pal_data$names, 
    "Colors and black & white of graph race by ease of voting")

```

Test how the colors I have used for my graphs about race by ease of voting look in black & white
:::

::::
:::::

:::

::::
:::::


## Achievement 2: Comparing groups {#sec-chap05-achievement2}

The `r glossary("chi-squared")` test is useful for testing whether there may be a statistical relationship between two categorical variables. It is based on the observed values and the values expected to occur if there were no relationship between the variables.

### Observed values

We will use the observed values from @lst-chap05-ease-voting and @lst-chap05-require-voting.

### Expected values

For each cell in the table, multiply the row total for that row by the column total for that column and divide by the overall total.

To avoid computing the values manually I have used `CrossTable()` from the {**descr**} package (see @pak-descr and [StackOverflow](https://stackoverflow.com/a/34214973/7322615)).

$$
\text{Expected value} = \frac{\text{row total} \times \text{column total}}{\text{overall total}}
$$ {#eq-chap05-expected-values}
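
@eq-chap05-expected-values can also be applied to a whole table at once in base R. The following sketch uses made-up counts (the matrix `observed` is hypothetical, not the Pew data) and compares the manual result with the expected values that `stats::chisq.test()` reports:

```r
## hypothetical counts, not the Pew data
observed <- base::matrix(
    c(30, 20,
      10, 40),
    nrow = 2, byrow = TRUE,
    dimnames = list(group = c("A", "B"), opinion = c("yes", "no"))
)

## row total times column total, divided by the overall total, for every cell
expected <- base::outer(
    base::rowSums(observed),
    base::colSums(observed)
) / base::sum(observed)

## the same values as computed by the chi-squared test function
expected_check <- stats::chisq.test(observed, correct = FALSE)$expected

expected
expected_check
```

Both objects should hold the same numbers, which is a quick way to confirm that the formula was applied to the right margins.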


:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-expected-values}
: Show observed and expected values
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Ease

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-expected-ease-vote}
: Ease of voting by race / ethnicity
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: expected-race-by-ease-vote
#| results: hold
#| cache: true

vote_clean <- base::readRDS("data/chap05/vote_clean.rds")

vote_opinions <- vote_clean |> 
    dplyr::select(race, ease_vote, require_vote) |>
    tidyr::drop_na()

ct_ease <- descr::CrossTable(
    x = vote_opinions$race,
    y = vote_opinions$ease_vote,
    dnn = c("Race", "Ease of voting"),
    prop.r = FALSE, 
    prop.c = FALSE, 
    prop.t = FALSE,
    prop.chisq = FALSE,
    expected = TRUE
    )
ct_ease
```

***

::: {.callout-tip}

- Some of the cells have observed and expected values that are very close to each other. For example, the observed number of Other race-ethnicity people who want to make it easy to vote is 46, while the expected is 43.3. 
- But other categories show bigger differences. For example, the observed number of Black non-Hispanics who think people should register to vote is 28, and the expected value is nearly twice as high at 51.3.
:::

::::
:::::

###### Require

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-exprected-race-by-require}
: Status of voting by race / ethnicity
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: expected-race-by-require
#| results: hold
#| cache: true

ct_require <- descr::CrossTable(
        x = vote_opinions$race,
        y = vote_opinions$require_vote,
        dnn = c("Race", "Status of voting"),
        prop.r = FALSE, 
        prop.c = FALSE, 
        prop.t = FALSE,
        prop.chisq = FALSE,
        expected = TRUE
    )
ct_require
```

***

::: {.callout-tip}
The cells in the "Other" row have similar observed and expected values, but the rest show bigger differences.
:::

::::
:::::

###### Both

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-expected-voting-data}
: Computing expected values for ease and requirement of voting using the {**sjstats**} package
::::::
:::
::::{.my-r-code-container}


```{r}
#| label: expected-voting-data
#| results: hold
#| cache: true

## load vote_clean ##########
vote_clean <-  base::readRDS("data/chap05/vote_clean.rds")

vote_clean2 <- vote_clean |> 
    dplyr::select(race, ease_vote, require_vote) |> 
    tidyr::drop_na()

ease_vote_n <- vote_clean2 |> 
    dplyr::select(race, ease_vote) |> 
    dplyr::group_by(race, ease_vote) |> 
    dplyr::summarize(n_ease = dplyr::n(),
                     .groups = "keep")

ease_expected  <-  
    tibble::as_tibble(
        base::as.data.frame(
            sjstats::table_values(
                base::table(
                    vote_clean2$race, 
                    vote_clean2$ease_vote)
                )$expected,
                .name_repair = "unique")) |> 
    dplyr::arrange(Var1)

(
    ease_expected2 <- dplyr::bind_cols(
    ease_vote_n,
    exp_ease = ease_expected$Freq)
)

glue::glue(" ")
glue::glue("**********************************************************")
glue::glue(" ")

require_vote_n <- vote_clean2 |> 
    dplyr::select(race, require_vote) |> 
    dplyr::group_by(race, require_vote) |> 
    dplyr::summarize(n_require = dplyr::n(),
                     .groups = "keep")

require_expected  <-  
    tibble::as_tibble(
        base::as.data.frame(
            sjstats::table_values(
                base::table(
                    vote_clean2$race, 
                    vote_clean2$require_vote)
                )$expected,
                .name_repair = "unique")) |> 
    dplyr::arrange(Var1)

(
    require_expected2 <- dplyr::bind_cols(
    require_vote_n,
    exp_require = require_expected$Freq)
)
```

***

The `sjstats::table_values()` function has the advantage that its result can be converted to a data.frame. We can therefore manipulate the data and, for example, combine expected values for different variables.


::::
:::::

###### Together

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-expected-vote-data}
: Combining ease and require of voting
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: expacted-vote-data
#| results: hold

require_expected3 <- require_expected2 |> 
    dplyr::ungroup() |> 
    dplyr::select(-1)

vote_expected <- dplyr::bind_cols(
    ease_expected2,
    require_expected3
)

vote_expected
```

::::
:::::


:::

::::
:::::

:::::{.my-important}
:::{.my-important-header}
Differences between observed and expected values indicate that there may be a relationship between the variables. 
:::
:::::
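
How expected values come about can be sketched with a small table. The counts below are hypothetical (they are not the voting data) and the object names are made up for illustration. The only assumption is the usual definition of an expected count: row total times column total, divided by the overall sample size.

```{r}
#| label: sketch-expected-values

## Hypothetical counts, NOT the voting data
toy_observed <- base::matrix(
    c(20, 30,
      30, 20,
      10, 40),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        group = c("A", "B", "C"),
        opinion = c("yes", "no")
    )
)

## Expected count per cell = row total * column total / n
toy_expected <- base::outer(
    base::rowSums(toy_observed),
    base::colSums(toy_observed)
) / base::sum(toy_observed)

toy_expected

## Compare with the expected values stored in the test object
base::all.equal(
    toy_expected,
    stats::chisq.test(toy_observed)$expected,
    check.attributes = FALSE
)
```

Group A has identical observed and expected counts, whereas groups B and C deviate in opposite directions.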

### Assumptions of the chi-squared test of independence

***

::: {#bul-chap05-assumptions-chi-squared}
:::::{.my-bullet-list}
:::{.my-bullet-list-header}
Bullet List
:::
::::{.my-bullet-list-container}

- **The variables must be nominal or ordinal (usually nominal)**. We have categorical data with no order, i.e., nominal data: *The assumption is met.*
- **The expected values should be 5 or higher in at least 80% of groups**. We have 8 cells with expected values. None of these values is lower than 5: *The assumption is met.*
- **The observations must be independent**. The same set of people was not asked before and after an intervention, and the respondents are neither family members nor otherwise affiliated with each other: *The assumption is met.*

::::
:::::
Assumptions for the chi-squared test

:::
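
The second assumption can also be checked in code. This is a minimal sketch on hypothetical counts (not the voting data). It relies on the `expected` component that `stats::chisq.test()` returns.

```{r}
#| label: sketch-expected-count-rule

## Hypothetical counts, NOT the voting data
toy_observed <- base::matrix(
    c(20, 30,
      30, 20,
      10, 40),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        group = c("A", "B", "C"),
        opinion = c("yes", "no")
    )
)

toy_expected <- stats::chisq.test(toy_observed)$expected

## Share of cells with an expected count of 5 or higher
toy_share_ok <- base::mean(toy_expected >= 5)
toy_share_ok

## Rule of thumb: at least 80% of the cells
toy_share_ok >= 0.8
```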

***

## Achievement 3: Calculating the chi-squared statistic {#sec-chap05-achievement3}

The differences between observed and expected values can be combined into an overall statistic. Simply adding up the differences does not work, because positive and negative differences cancel each other out and the sum is always 0. So we will again, as in the computation of the variance, square the differences.

To prevent huge differences when observed and expected values are very large, there is an additional step in the computation of $\chi^2$: Divide the squared differences by the expected value of the appropriate cells.

$$
\chi^2 = \sum\frac{(observed - expected)^2}{expected}
$$ {#eq-chap05-chi-squared}

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-chi-squared-ease-voting}
: Compute chi-squared for race by ease of voting
::::::
:::
::::{.my-r-code-container}

:::{#lst-chap05-chi-squared-ease-voting}
```{r}
#| label: chi-squared-ease-voting
#| cache: true

vote_clean <- base::readRDS("data/chap05/vote_clean.rds")

stats::chisq.test(
    x = vote_clean$ease_vote,
    y = vote_clean$race
)

```

Chi-squared statistic for race by ease of voting
:::
::::
:::::
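
To see that @eq-chap05-chi-squared is all that happens behind the scenes, the statistic can be computed by hand. The following sketch uses hypothetical counts (not the voting data) and made-up object names. It compares the manual result with `stats::chisq.test()`.

```{r}
#| label: sketch-chisq-by-hand

## Hypothetical counts, NOT the voting data
toy_observed <- base::matrix(
    c(20, 30,
      30, 20,
      10, 40),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        group = c("A", "B", "C"),
        opinion = c("yes", "no")
    )
)

## Expected counts: row total * column total / n
toy_expected <- base::outer(
    base::rowSums(toy_observed),
    base::colSums(toy_observed)
) / base::sum(toy_observed)

## Sum of squared differences, each divided by its expected value
toy_chisq <- base::sum((toy_observed - toy_expected)^2 / toy_expected)
toy_chisq

## The same statistic from the built-in test
stats::chisq.test(toy_observed)$statistic
```

For 2 × 2 tables `stats::chisq.test()` applies a continuity correction by default, so the manual result only matches with `correct = FALSE` in that case.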

## Achievement 4: Interpreting the chi-squared statistic {#sec-chap05-achievement4}

In contrast to the binomial and normal distributions, which both have two parameters (n and p, resp. $\mu$ and $\sigma$), the `r glossary("chi-squared")` distribution has only one `r glossary("parameter")`: the `r glossary("degrees of freedom")`. The `df` can be used to find the population `r glossary("standard deviation")` for the distribution:

$$
\sqrt{2df}
$$ {#eq-chap05-pop-sd-df}


:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-chi-squared-dist}
: Chi-square probability distributions with different degrees of freedom
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### 4 $\chi^2$ dist extra

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-chi-squared-separately}
: Four chi-square probability distributions with different degrees of freedom
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-chi-squared-dist
#| fig-cap: "Chi-square probability distributions with different degrees of freedom"

# Define sequence of x-values
tib <- tibble::tibble(x = seq(0, 30, length.out = 600))

tib <- tib |> 
# Compute density values
    dplyr::mutate(
        y1 = stats::dchisq(x, df = 1),
        y3 = stats::dchisq(x, df = 3),
        y5 = stats::dchisq(x, df = 5),
        y7 = stats::dchisq(x, df = 7)
    )  
chi_sq1 <- tib |> 
# Plot the Chi-square distribution: df = 1
    ggplot2::ggplot(ggplot2::aes(x = x, y = y1)) +
    ggplot2::geom_line(color = "blue") +
    ggplot2::labs(x = "x", y = "Density", 
      title = paste("Chi-square with 1 degree of freedom")) 

chi_sq3 <- tib |> 
# Plot the Chi-square distribution: df = 3
    ggplot2::ggplot(ggplot2::aes(x = x, y = y3)) +
    ggplot2::geom_line(color = "blue") +
    ggplot2::labs(x = "x", y = "Density", 
      title = paste("Chi-square with 3 degrees of freedom"))

chi_sq5 <- tib |> 
# Plot the Chi-square distribution: df = 5
    ggplot2::ggplot(ggplot2::aes(x = x, y = y5)) +
    ggplot2::geom_line(color = "blue") +
    ggplot2::labs(x = "x", y = "Density", 
      title = paste("Chi-square with 5 degrees of freedom"))

chi_sq7 <- tib |> 
# Plot the Chi-square distribution: df = 7
    ggplot2::ggplot(ggplot2::aes(x = x, y = y7)) +
    ggplot2::geom_line(color = "blue") +
    ggplot2::labs(x = "x", y = "Density", 
      title = paste("Chi-square with 7 degrees of freedom"))

gridExtra::grid.arrange(chi_sq1, chi_sq3, chi_sq5, chi_sq7, ncol = 2)
```
***

:::::{.my-watch-out}
:::{.my-watch-out-header}
WATCH OUT! The graphs have different y scales!
:::
::::{.my-watch-out-container}
This is the replication of Figure 5.7 from the book.

Note: The first impression, that all probability distributions have the same height, is wrong! All four graphs have very different density scales! 

We will see that all four distributions overlaid into one graphic will give a different impression.
::::
:::::


::::
:::::


###### 4 $\chi^2$ dist together

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-chi-squared-dist-together}
: Four chi-square probability distributions with different degrees of freedom in one graph
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-chi-squared-dist-together
#| fig-cap: "Four chi-square probability distributions with different degrees of freedom"
#| results: hold

# Define sequence of x-values
tib_chisq <- tibble::tibble(x = seq(0, 30, length.out = 600))

tib_chisq |> 
# Compute density values
    dplyr::mutate(
        y1 = stats::dchisq(x, df = 1),
        y3 = stats::dchisq(x, df = 3),
        y5 = stats::dchisq(x, df = 5),
        y7 = stats::dchisq(x, df = 7)
    ) |> 
    tidyr::pivot_longer(-1) |>  
    
    ggplot2::ggplot(
        ggplot2::aes(x, value, color = name)) + 
    ggplot2::geom_line(linewidth = 1) +
    ggplot2::ylim(0, .3) +
    ggplot2::labs(y = "Density") +
    ggplot2::scale_color_viridis_d(
        name = "Degrees\nof Freedom",
        labels = c("1", "3", "5", "7"),
        option = "plasma",
        end = .8
    )

```

::::
:::::

See a more succinct example using a loop in [How to Plot a Chi-Square Distribution in R](https://lifewithdata.com/2023/07/30/how-to-plot-a-chi-square-distribution-in-r/) [@bprasad262023a]

###### $\chi^2$ as function

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-chi-squared-function}
: Chi-square probability distribution with 3 degrees of freedom
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-chi-squared-dist-function
#| fig-cap: "Chi-square probability distribution with 3 degrees of freedom"
#| results: hold

ggplot2::ggplot() +
    ggplot2::xlim(0, 30) +
    ggplot2::stat_function(
        fun = dchisq,
        args = list(df = 3)
    )
```

::::
:::::


:::

::::
:::::

:::::{.my-procedure}
:::{.my-procedure-header}
:::::: {#prp-compute-df}
: Compute degrees of freedom (df) and population standard deviation of a chi-squared distribution
::::::
:::
::::{.my-procedure-container}

1. Subtract 1 from the number of categories of each variable used for the test.
2. Multiplying the resulting numbers gives the degrees of freedom (df).
3. The square root of two times df is the population standard deviation $\sqrt{(2 \times df)}$.
::::
:::::     

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-comp-df}
: Compute degrees of freedom (`df`) and population standard deviation for the chi-squared distribution of `race` by `ease_vote`
::::::
:::
::::{.my-example-container}

I am following @prp-compute-df:

1. Subtract 1 from the number of categories of each variable used for the test.

- We have four categories in `race`: White non-Hispanic, Black non-Hispanic, Hispanic, Other. $4 - 1 = 3$.
- We have 2 categories in `ease_vote`: Register to vote and Make easy to vote. $2 - 1 = 1$.

2. Multiplying the resulting numbers gives the degrees of freedom (df): $3 \times 1 = 3$

3. The population standard deviation is $\sqrt{(2 \times df)}$ = $\sqrt{(2 \times 3)}$ = `r round(sqrt(2 * 3), 3)`.
::::
:::::
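
The worked example above can be sketched in a few lines of R. The helper `chisq_df_sd()` is hypothetical and written only for this illustration. The simulation at the end is just a rough empirical check of the $\sqrt{2 \times df}$ rule, not part of the procedure.

```{r}
#| label: sketch-df-sd

## Hypothetical helper, not part of any package
chisq_df_sd <- function(n_row_categories, n_col_categories) {
    df <- (n_row_categories - 1) * (n_col_categories - 1)
    list(df = df, sd = base::sqrt(2 * df))
}

## 4 categories in race, 2 categories in ease_vote
race_ease <- chisq_df_sd(4, 2)
race_ease

## Rough check: standard deviation of simulated chi-squared values
base::set.seed(42)
sim_sd <- stats::sd(stats::rchisq(1e5, df = race_ease$df))
sim_sd
```

The simulated standard deviation should land close to the theoretical value, although it will never match it exactly.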

The `r glossary("chi-squared")` distribution shown, which is the chi-squared `r glossary("probability density function")` (PDF), **shows the probability of a value of chi-squared occurring when there is no relationship** between the two variables contributing to the chi-squared.

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-chi-squared-pdf}
: Determine the probability using the chi-squared distribution 
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### test example1

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pdf-test-example1}
: Chi-squared probability distribution (df = 5)
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-pdf-test-example1
#| fig-cap: "Chi-squared probability distribution (df = 5)"


## Define start of shade
x_shade = 10 
y_shade = stats::dchisq(10, 5)


## Define sequence of x-values
tib <- tibble::tibble(x = seq(0, 30, length.out = 600)) |> 
    # Compute density values
    dplyr::mutate(
        y = stats::dchisq(x, df = 5)
    )

## Subset data for shaded area
shade_10 <- tib |> 
    dplyr::filter(x >= x_shade) |> 
    ## Necessary as starting point for y = 0!
    tibble::add_row(x = 10, y = 0, .before = 1)


tib |> 
    ## Plot the Chi-square distribution: df = 5
    ggplot2::ggplot(ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_line() +
    
    ## Draw segment 
    ggplot2::geom_segment(
        x = x_shade,
        y = 0,
        xend = x_shade,
        yend = y_shade
    ) +
    
    ## Shade curve
    ggplot2::geom_polygon(
        data = shade_10, 
        fill = "lightblue",
        ggplot2::aes(x = x, y = y)
        ) +
    ggplot2::labs(x = "x", y = "Density", 
      title = paste("Chi-square with 5 degrees of freedom and shaded area starting with x = 10.0"))


```
***

The height of the curve at a chi-squared of exactly 10 is, looking into the data, around 0.028. This is a density and not a probability: for a continuous distribution the probability of hitting one exact value is essentially zero. 

It is more useful to know the probability of getting a chi-squared of *10 or higher*. This probability is the area under the curve from 10 to the end of the distribution at the far right. 

The probability of the chi-squared value being 10 or higher is about 7.5%. Even though such a value is not very probable when there is no relationship between the variables, the probability is still above the 5% threshold for statistical significance. A chi-squared value of 10 therefore still lies well inside the probability density function (PDF).

In our test case the probability that a $\chi^2$ value of 10.0 or more occurs when there is no relationship is about 7.5%. We therefore cannot reject H0, because the p-value is not 5% or less. This can be seen clearly in the resulting graph of @lst-pdf-test-example2, created with the {**sjPlot**} package (see @pak-sjPlot).


::::
:::::
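
The numbers behind the graph can be checked directly with the distribution functions of {**stats**}. This is a small sketch with made-up object names. It computes the density at 10, the area of the shaded region, and the $\chi^2$ value that cuts off the upper 5% of the distribution.

```{r}
#| label: sketch-chisq-probabilities

## Height of the curve (density) at chi-squared = 10 with df = 5
dens_at_10 <- stats::dchisq(10, df = 5)
dens_at_10

## Area under the curve from 10 to the right (upper tail)
p_10_or_higher <- stats::pchisq(10, df = 5, lower.tail = FALSE)
p_10_or_higher

## Chi-squared value that leaves 5% of the area to its right
critical_value <- stats::qchisq(0.05, df = 5, lower.tail = FALSE)
critical_value
```

A $\chi^2$ value is statistically significant at the 5% level only if it is at least as large as `critical_value`.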

###### test example2

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pdf-test-example2}
: Chi-squared probability distribution (df = 5)
::::::
:::
::::{.my-r-code-container}

::: {#lst-pdf-test-example2}
```{r}
#| label: fig-pdf-test-example2
#| fig-cap: "Chi-squared probability distribution (df = 5)"

sjPlot::dist_chisq(chi2 = 10, deg.f = 5)
```

Chi-squared probability distribution (df = 5) created with {**sjPlot**}
:::
***

This graph uses the {**sjPlot**} package and is very easy to produce. It shows that the p-value for x = 10 is 0.08 (8%), i.e., higher than the standard value of 0.05 (5%). To be statistically significant, the $\chi^2$ value would need to be equal to or higher than 11.07.


::::
:::::

:::::{.my-watch-out}
:::{.my-watch-out-header}
{**sjPlot**}: Great package and easy to use in default mode, but you need time to learn the many configurations
:::
::::{.my-watch-out-container}
Even if the standard version of the plot is easy to create, adapting the graph is another issue. In the background {**sjPlot**} uses the {**ggplot2**} package, but you can’t specify changes by mixing {**sjPlot**} with {**ggplot2**} commands. I tried it and it produced two different plots. To customize plot appearance you have to learn the many arguments of `sjPlot::set_theme()` and `sjPlot::plot_grpfrq()`. (See also @pak-sjPlot)

(I managed to change the theme in {**sjPlot**} by setting the default theme in {**ggplot2**} with `ggplot2::theme_set(ggplot2::theme_bw())` as global option in the setup chunk.)
::::
:::::

###### test example3

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pdf-test-example3}
: Chi-squared probability distribution (df = 5)
::::::
:::
::::{.my-r-code-container}

::: {#lst-pdf-test-example3}
```{r}
#| label: fig-pdf-test-example3
#| fig-cap: "Chi-squared probability distribution (df = 5)"

nhstplot::plotchisqtest(chisq = 10, df = 5)
```

Chi-squared probability distribution (df = 5) created with {**nhstplot**}
:::
***

Working on @sec-chap10 on 2024-04-26, I just learned of {**nhstplot**} as another package for illustrating graphically the most common Null Hypothesis Significance Testing (`r glossary("NHST")`) procedures (See @pak-nhstplot). This package is even easier to use than the {**sjPlot**} package and is more visually appealing.

Especially valuable is that the axes are automatically scaled to present the relevant part and the overall shape of the probability density function. {**nhstplot**} is especially intended for educational purposes, as it helps to explain the Null Hypothesis Significance Testing process, its use and/or shortcomings.

::::
:::::


###### Race & voting

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-chisq-voting}
: Determine probability of ease of voting by race
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: chap05-chisq-voting
#| fig-cap: "Chi-squared probability distribution of `ease_vote` by `race`"
#| results: hold

(
    chisq_ease_vote_stats <- stats::chisq.test(ease_vote_table)
)

base::invisible(
    chisq_sjplot <- sjPlot::dist_chisq(
        chi2 = chisq_ease_vote_stats[["statistic"]][["X-squared"]],
        deg.f = chisq_ease_vote_stats[["parameter"]][["df"]]
        )
)

```
***

- The critical $\chi^2$ value, from which on the p-value is statistically significant (< 0.05), is `r round(chisq_sjplot[["plot_env"]][["cs"]], 2)` and therefore much lower than our test statistic. The label of $\chi^2$ = `r round(chisq_sjplot[["plot_env"]][["cs"]], 2)` is therefore not the actual chi-squared value (which is `r round(chisq_ease_vote_stats[["statistic"]][["X-squared"]],2)`), but the chi-squared value where the `r glossary("p-value")` is .05. From here on, bigger chi-squared values give ever smaller p-values until we finally reach at $\chi^2$ = `r round(chisq_ease_vote_stats[["statistic"]][["X-squared"]], 2)` a p-value of `r chisq_ease_vote_stats[["p.value"]]`.
- **p-value**: The p-value `r chisq_ease_vote_stats[["p.value"]]` is far below the significance level of 0.05. 
- **$\chi^2$**: The shaded area at or beyond `r round(chisq_ease_vote_stats[["statistic"]], 2)` is so small that you can’t see it.

::: {.callout-tip}
There is a statistically significant association between views on voting ease and race-ethnicity [$\chi^2(3) = 28.95; p < .05$].
:::


::::
:::::

:::::{.my-watch-out}
:::{.my-watch-out-header}
Whenever possible, use the actual p-value rather than p < .05
:::
::::{.my-watch-out-container}
In this case the `r glossary("p-value")` is so small that it wouldn’t look nice to provide the exact figure.
::::
:::::

###### Race & voting

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-chisq-voting2}
: Determine probability of ease of voting by race with {**nhstplot**}
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: chap05-chisq-voting2
#| fig-cap: "Chi-squared probability distribution of `ease_vote` by `race`"
#| results: hold

(
    chisq_ease_vote_stats <- stats::chisq.test(ease_vote_table)
)


nhstplot::plotchisqtest(
    chisq = chisq_ease_vote_stats[["statistic"]][["X-squared"]],
    df = chisq_ease_vote_stats[["parameter"]][["df"]]
    )


```


::::
:::::

:::

::::
:::::


## Achievement 5: Null Hypothesis Significance Testing {#sec-chap05-achievement5}

:::::{.my-procedure}
:::{.my-procedure-header}
:::::: {#prp-chap05-nhst}
: Null Hypothesis Significance Testing
::::::
:::
::::{.my-procedure-container}
1. Write the null and alternate hypotheses. 
2. Compute the test statistic. 
3. Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true). 
4. Decide based on this probability:
    - If the probability is very small, usually less than 5%, reject the null hypothesis. 
    - If the probability is not small, usually 5% or greater, retain the null hypothesis.
::::
:::::

:::::{.my-watch-out}
:::{.my-watch-out-header}
WATCH OUT! The last step has two alternative options
:::
::::{.my-watch-out-container}
In the book the above @prp-chap05-nhst has 5 steps. But the last two steps (4 and 5) are mutually exclusive alternatives. If one applies, the other does not. My @prp-chap05-nhst has therefore only 4 steps.
::::
:::::
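
The four steps can be sketched as a short script. The counts are hypothetical (not the voting data), the object names are made up, and the 5% threshold is the conventional choice mentioned in the procedure.

```{r}
#| label: sketch-nhst-steps

## Step 1: H0 = no relationship between group and opinion,
##         HA = there is a relationship
## Hypothetical counts, NOT the voting data
toy_observed <- base::matrix(
    c(20, 30,
      30, 20,
      10, 40),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        group = c("A", "B", "C"),
        opinion = c("yes", "no")
    )
)

## Step 2: compute the test statistic
toy_test <- stats::chisq.test(toy_observed)
toy_test$statistic

## Step 3: probability of a statistic at least this big if H0 is true
toy_test$p.value

## Step 4: reject or retain H0
toy_alpha <- 0.05
toy_decision <- if (toy_test$p.value < toy_alpha) "reject H0" else "retain H0"
toy_decision
```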


### NHST Step 1

The null (`r glossary("null hypothesis", "H0")`) and alternate (`r glossary("alternate hypothesis", "HA")`) hypotheses are written about the population and are tested using a sample from the population.

::: {.callout-note}
- **H0**: People’s opinions on voter registration are the same across race-ethnicity groups. 
- **HA**: People’s opinions on voter registration are not the same across race-ethnicity groups.
:::

### NHST Step 2

The second step is to compute the test statistic. When examining a relationship between two categorical variables the appropriate test statistic is the `r glossary("chi-squared")` statistic, $\chi^2$. You can see in the last line of @lst-chap05-chi-squared-ease-voting that $\chi^2 = 28.952$.

### NHST Step 3

The probability of seeing a chi-squared as big as 28.952 in our sample if there were no relationship in the population between opinion on voting ease and race-ethnicity group would be 0.000002293 or p < .05.

### NHST Step 4

If the null hypothesis, “People’s opinions on voter registration are the same across race-ethnicity groups,” were true in the population, the probability of seeing a chi-squared at least as big as the one in our sample would be 0.000002293 or p < .05. This very small probability indicates that the null hypothesis is not likely to be true and should therefore be *rejected.*

::: {.callout-tip}
We used the chi-squared test to test the null hypothesis that there was no relationship between opinions on voter registration and race/ethnicity group. We rejected the null hypothesis and concluded that there was a statistically significant association between views on voter registration and race-ethnicity [$\chi^2(3) = 28.952; p < .05$].
:::

:::::{.my-watch-out}
:::{.my-watch-out-header}
WATCH OUT! Chi-squared test and chi-squared goodness-of-fit test are not the same!
:::
::::{.my-watch-out-container}
The chi-squared goodness-of-fit test is used for comparing the values of a single categorical variable to values from a hypothesized or `r glossary("population")` variable. The goodness-of-fit test is often used when trying to determine whether `r glossary("samples")` are a good representation of the population.


::::
:::::
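
The difference is easy to see in code: for a goodness-of-fit test, `stats::chisq.test()` gets a single vector of counts plus hypothesized proportions in the argument `p`, instead of a two-way table. The counts and proportions below are hypothetical and serve only as an illustration.

```{r}
#| label: sketch-goodness-of-fit

## Hypothetical sample counts for ONE categorical variable
toy_counts <- c(A = 50, B = 30, C = 20)

## Hypothetical population proportions (must sum to 1)
toy_population <- c(A = 0.4, B = 0.4, C = 0.2)

## Goodness-of-fit test: does the sample match the population?
toy_gof <- stats::chisq.test(x = toy_counts, p = toy_population)
toy_gof

## Expected counts under the hypothesized proportions
toy_gof$expected
```

Here the degrees of freedom are the number of categories minus 1, because only one variable is involved.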

## Achievement 6: Standardized residuals {#sec-chap05-achievement6}

### Introduction

One limitation of the chi-squared independence test is that it determines whether or not there is a statistically significant relationship between two categorical variables but does not identify what makes the relationship significant. The name for this type of test is `r glossary("omnibus")`.

`r glossary("Standardized residuals")` (like `r glossary("z-score", "z-scores")`) can aid analysts in determining which of the observed frequencies are significantly larger or smaller than expected. The standardized residual is computed by subtracting the expected value in a cell from the observed value in a cell and dividing by the square root of the expected value.

$$
\text{Standardized residual} = \frac{observed - expected}{\sqrt{expected}}
$$ {#eq-chap05-standardized-residuals}

The standardized residual is distributed like a z-score. Values of the standardized residuals that are higher than 1.96 or lower than –1.96 indicate that the observed value in that group is much higher or lower than the expected value. These are the groups that are contributing the most to a large chi-squared statistic and could be examined further and included in the interpretation.
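
@eq-chap05-standardized-residuals can be applied by hand to every cell of a table. The following sketch uses hypothetical counts (not the voting data) and made-up object names. It compares the manual result with the `residuals` component of `stats::chisq.test()`.

```{r}
#| label: sketch-standardized-residuals

## Hypothetical counts, NOT the voting data
toy_observed <- base::matrix(
    c(20, 30,
      30, 20,
      10, 40),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        group = c("A", "B", "C"),
        opinion = c("yes", "no")
    )
)

toy_expected <- stats::chisq.test(toy_observed)$expected

## (observed - expected) / sqrt(expected), cell by cell
toy_std_resid <- (toy_observed - toy_expected) / base::sqrt(toy_expected)
base::round(toy_std_resid, 2)

## Cells beyond the +/- 1.96 thresholds
base::abs(toy_std_resid) > 1.96

## Same values as the residuals stored in the test object
base::all.equal(
    toy_std_resid,
    stats::chisq.test(toy_observed)$residuals,
    check.attributes = FALSE
)
```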

:::{#wat-chap05-adjusted-standardized-residuals}
:::::{.my-watch-out}
:::{.my-watch-out-header}
WATCH OUT! Adjusted Standardized Residuals
:::
::::{.my-watch-out-container}
There are also *adjusted* standardized residuals. To add to the confusion, Alan Agresti [-@agresti2018a] calls these residuals "Standardized Pearson Residual". To understand the difference between standardized and *adjusted* standardized residuals see [Standardized Residuals in Statistics: What are They?](https://www.statisticshowto.com/what-is-a-standardized-residuals/) [@glenn.d]. Adjusted standardized residuals have larger absolute values and, as far as I understand, cannot be read against the z-score thresholds used here (i.e., looking for values beyond ±2, resp. ±1.96 standard deviations). I will therefore stick with the (normal) standardized residuals.

$$
\begin{align*}
& \text{Adjusted residual} =  \\
& \frac{observed - expected}{\sqrt{expected \times (1-\text{row total proportion}) \times (1-\text{col total proportion})}}
\end{align*}
$$ {#eq-chap05-adjusted-standardized-residuals}

::::
:::::

What are Adjusted Standardized Residuals?
:::
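
As a sketch of @eq-chap05-adjusted-standardized-residuals, the adjusted residuals can be computed by hand and compared with the `stdres` component that `stats::chisq.test()` returns. The counts are hypothetical (not the voting data) and the object names are made up.

```{r}
#| label: sketch-adjusted-residuals

## Hypothetical counts, NOT the voting data
toy_observed <- base::matrix(
    c(20, 30,
      30, 20,
      10, 40),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        group = c("A", "B", "C"),
        opinion = c("yes", "no")
    )
)

toy_n <- base::sum(toy_observed)
toy_expected <- stats::chisq.test(toy_observed)$expected

## Row and column totals as proportions of the sample size
toy_row_prop <- base::rowSums(toy_observed) / toy_n
toy_col_prop <- base::colSums(toy_observed) / toy_n

## Adjusted residual, cell by cell
toy_adj_resid <- (toy_observed - toy_expected) /
    base::sqrt(toy_expected * base::outer(1 - toy_row_prop, 1 - toy_col_prop))
base::round(toy_adj_resid, 2)

## Same values as the `stdres` component of the test object
base::all.equal(
    toy_adj_resid,
    stats::chisq.test(toy_observed)$stdres,
    check.attributes = FALSE
)
```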


The book recommends getting the standardized residuals with `descr::CrossTable()`. But I have found that there are other possibilities as well. 


:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap05-chi-squared}
Packages with functions to get standardized residuals of chi-squared tests
::::::
:::
::::{.my-resource-container}
The following list collects the resources I have found together with the approximate average download numbers of the respective package. These figures will give you an idea about package use, but will not say anything about the quality of the package or the standardized residual function we are looking for.

- {**stats**}: `stats::chisq.test()$residuals`
- {**descr**}: `descr::CrossTable()`: It has arguments to show residuals, standardized residuals and adjusted standardized residuals 
- {**janitor**}: `janitor::chisq.test(<tabyl>)$residuals`
- {**questionr**}: `questionr::chisq.residuals()` 
- {**rstatix**}: `rstatix::pearson_residuals()`, `rstatix::std_residuals()` 

There is also the possibility to use `graphics::mosaicplot()` with the option `shade = TRUE` to examine residuals visually for the source of differences (See [@greenwood2022]).

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pkgs-chisq-residuals}
: Number of daily downloads for packages with functions to display chi-squared residuals
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: tbl-donwload-numbers-chi-squared-residuals-packages
#| tbl-cap: "Download average numbers of packages with chi-squared residuals functions" 
#| echo: false
#| cache: true

(cranlogs_chi_squared_residuals <- 
        base::readRDS("data/chap05/cranlogs_chi_squared_residuals.rds"))

```

::::
:::::


::::
:::::

### Computation

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-compute-standardized-residuals}
: Compute standardized residuals with functions of different packages
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### `descr::CrossTable()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-standardized-residuals-desc}
: Compute standardized residuals with `descr::CrossTable()`
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: standardized-residuals-desc

## load vote_clean ##########
vote_clean <-  base::readRDS("data/chap05/vote_clean.rds")

descr::CrossTable(
    x = ease_vote_table,
    expected = TRUE,
    prop.r = FALSE,
    prop.c = FALSE,
    prop.t = FALSE,
    prop.chisq = FALSE,
    chisq = TRUE,
    resid = TRUE,
    sresid = TRUE,
    asresid = TRUE 
)

```

***

Here I have displayed, for the first and only time, the adjusted standardized residuals as well. As you can see, they are much higher and do not obey the z-score distribution. I do not know how to interpret them. As far as I understand, they are only used by some software packages, e.g., [SDA](https://sda.berkeley.edu/), to highlight outstanding values. (See @wat-chap05-adjusted-standardized-residuals)

***

::: {#bul-prepare-interpretation-standardized-residuals}
:::::{.my-bullet-list}
:::{.my-bullet-list-header}
Bullet List
:::
::::{.my-bullet-list-container}


- From the very small p-value (which is almost 0) we see that there is a statistically significant association between opinions on ease of voting and race / ethnicity. 
- The Black non-Hispanic group contributes most to rejecting the null hypothesis that there is no association. A much bigger proportion of Black non-Hispanic respondents than expected supports ease of voting and is against registration for voting.
- Another trend goes in the reverse direction and concerns the White non-Hispanic group. A higher proportion of this group than expected endorses that people should register for voting.

::::
:::::
What do the standardized residuals tell us?
:::

***

::::
:::::

::: {.callout-tip}
We used the chi-squared test to test the null hypothesis that there was no relationship between opinions on voter registration and race/ethnicity group. We rejected the null hypothesis and concluded that there was a statistically significant association between views on voter registration and race-ethnicity [$\chi^2(3) = 28.95; p < .05$]. Based on standardized residuals, the statistically significant chi-squared test result was driven by more White non-Hispanic participants and fewer Black non-Hispanic participants than expected believing that people should prove they want to vote by registering, and more Black non-Hispanic participants than expected believing that the voting process should be made easier.
:::


###### `stats::chisq.test()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-standardized-residuals-stats}
: Compute standardized residuals with `stats::chisq.test()`
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: standardized-residuals-stats
#| results: hold

stats::chisq.test(ease_vote_table)$residuals

graphics::mosaicplot(
    x = ease_vote_table,
    shade = TRUE,
    main = "Ease of voting by race / ethnicity"
)
```

***

I think that this result using base R tools is easier to understand and interpret than the presentation provided by `descr::CrossTable()`. Especially the graph highlights the important differences. Solid lines represent proportions that are higher than expected, whereas dashed lines point to proportions that are smaller than expected. And the color scale gives you immediate feedback about the size of the difference.

::::
:::::

###### `janitor::chisq.test()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-standardized-residuals-janitor}
: Compute standardized residuals with `janitor::chisq.test()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: standardized-residuals-janitor

janitor::chisq.test(ease_vote_table)$residuals

```
***

Exactly the same result as with `stats::chisq.test()`.
::::
:::::

###### `questionr::chisq.residuals()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-standardized-residuals-questionr}
: Compute standardized residuals with `questionr::chisq.residuals()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: standardized-residuals-questionr

questionr::chisq.residuals(ease_vote_table)

```
***

The only difference in this result is that the values are rounded. This is nice because for the interpretation we do not need the detailed values.


::::
:::::

###### `rstatix::chisq_test()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-standardized-residuals-rstatix}
: Compute standardized residuals with `rstatix::chisq_test()` and `rstatix::chisq_descriptives()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: standardized-residuals-rstatix
#| results: hold


(chisq_ease_vote_rstatix <- rstatix::chisq_test(ease_vote_table))

rstatix::chisq_descriptives(chisq_ease_vote_rstatix)

```
***

The result with {**rstatix**} is very detailed. Using {**rstatix**} has the additional advantage that it is {**tidyverse**} compatible and you can use the pipe. The package includes many different tests and has with `r cranlogs::cran_downloads(package = "rstatix")$count` downloads from the RStudio CRAN mirror in one day (`r cranlogs::cran_downloads(package = "rstatix")$date`) a pretty big user group.


::::
:::::


:::

::::
:::::

:::::{.my-important}
:::{.my-important-header}
Which package should I use to show standardized residuals?
:::
::::{.my-important-container}
1. `descr::CrossTable()` is used in the book, but I can't recommend it. The result cannot be transformed into a data.frame or tibble and it is therefore neither {**tidyverse**} compatible nor can you use the pipe.

2. A good solution is the combination of `stats::chisq.test()` and `graphics::mosaicplot()`. Especially the mosaic plot helps to figure out quickly which cells are important.

3. The best solution in my opinion is {**rstatix**}: Its results can be very detailed. {**rstatix**} is {**tidyverse**} compatible and you can use the pipe. The package includes [many different tests](https://rpkgs.datanovia.com/rstatix/) and can therefore be used for other tasks as well. With `r cranlogs::cran_downloads(package = "rstatix")$count` downloads from the RStudio CRAN mirror in one day (`r cranlogs::cran_downloads(package = "rstatix")$date`) it has a pretty big user group.

**Because of the wide range of tests and the big user base I will apply {**rstatix**} as the preferred option whenever the result is the same with other packages.**
::::
:::::


## Achievement 7: Effect sizes {#sec-chap05-achievement7}

### Cramér’s V {#sec-chap05-cramers-v}

Concerning our data on opinions about ease of voting, we have established two facts:

1. There is an association between ease of voting opinions and race/ethnicity.
2. This relationship is driven mainly by black non-Hispanic respondents favoring easy voting more often than expected and, to a lesser degree, by white non-Hispanic respondents supporting more often than expected that people need to register to vote.

But we do not know the strength of this relationship. The strength of a relationship in statistics is referred to as `r glossary("effect size")`. For `r glossary("chi-squared")`, there are a few options, including the commonly used effect size statistic of `r glossary("Cramér’s V")`.

***

$$
V = \sqrt{\frac{\chi^2}{n(k-1)}}
$$ {#eq-chap05-cramers-v-formula}

- $\chi^2$: The chi-squared is the test statistic for the analysis.
- $n$: The sample size.
- $k$: The number of categories in the variable with the *fewest* categories.

***

$$
V = \sqrt{\frac{29.852}{977(2-1)}} = 0.17
$$ {#eq-chap05-cramers-v-example-calculation}
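
To check the arithmetic, here is a minimal sketch that reproduces the calculation in base R. The variable names are my own choice; the three numbers are the ones used in the equation above.

```{r}
#| label: sketch-cramers-v-by-hand

# Values taken from the worked example above
chi_squared <- 29.852  # chi-squared test statistic
n <- 977               # sample size
k <- 2                 # categories of the variable with the fewest categories

# Cramér's V: square root of chi-squared divided by n * (k - 1)
cramers_v <- sqrt(chi_squared / (n * (k - 1)))
round(cramers_v, 2)
```

With $k = 2$ the term $(k - 1)$ equals 1, so in this example only the sample size scales the chi-squared statistic.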

:::::{.my-assessment}
:::{.my-assessment-header}
:::::: {#cor-chap05-cramers-v}
: Interpretation of Cramér’s V
::::::
:::
::::{.my-assessment-container}
`r glossary("Cramér’s V")` is a measure of the strength of association between two nominal variables. It ranges from 0 to 1 where:

- Small or weak effect size for V = .1 
- Medium or moderate effect size for V = .3 
- Large or strong effect size for V = .5

A more detailed interpretation, based on the degrees of freedom, is given in [How to Interpret Cramér’s V (with Examples)](https://www.statology.org/interpret-cramers-v/) [@bobbittn.d].

| Degrees of freedom | Small | Medium | Large |
|--------------------|-------|--------|-------|
| 1                  | 0.10  | 0.30   | 0.50  |
| 2                  | 0.07  | 0.21   | 0.35  |
| 3                  | 0.06  | 0.17   | 0.29  |
| 4                  | 0.05  | 0.15   | 0.25  |
| 5                  | 0.04  | 0.13   | 0.22  |

: How to interpret Cramér’s V?  {#tbl-cramers-v} {.striped .hover}

::::
:::::

:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap05-cramer-v}
Number of daily downloads for packages with functions to compute Cramér’s V
::::::
:::
::::{.my-resource-container}

- {**lsr**}: `lsr::cramersV()`
- {**rcompanion**}: `rcompanion::cramerV()`
- {**DescTools**}: `DescTools::CramerV()`
- {**sjstats**}: `sjstats::cramer()`
- {**rstatix**}: `rstatix::cramer_v()`
- {**collinear**}: `collinear::cramer_v()`
- {**confintr**}: `confintr::cramersv()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-pkgs-cramers-v}
: Number of daily downloads for packages with functions to compute Cramér’s V
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: tbl-donwload-numbers-cramers-v-packages
#| tbl-cap: "Average download numbers of packages with Cramér’s V functions"
#| echo: false
#| cache: true

(cranlogs_cramers_v <- base::readRDS("data/chap05/cranlogs_cramers_v.rds"))

```
I have checked only {**lsr**} and {**rstatix**} as I was happy with the result of the {**rstatix**} package.
::::
:::::


::::
:::::


:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-computing-cramers-v}
: Computing Cramér’s V
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### `lsr::cramersV()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-computing-cramers-v-lsr}
: Computing Cramér’s V with {**lsr**}
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: computing-cramers-v-lsr

lsr::cramersV(ease_vote_table)
```

::::
:::::


###### `rstatix::cramer_v()`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-code-name-b}
: Computing Cramér’s V with {**rstatix**}
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: computing-cramers-v-rstatix

rstatix::cramer_v(ease_vote_table)
```

***

The more conservative interpretation from the book places the effect size between small and medium, corresponding to a weak to moderate relationship. Taking the degrees of freedom into account, we would get the starting point for a moderate relationship. I will use the more conservative interpretation.

::: {.callout-tip}
There is a statistically significant relationship between opinions on voter registration and race-ethnicity, and the relationship is weak to moderate. This is consistent with the frequencies, which are different from expected, but not by an enormous amount in most of the groups.
:::

::::
:::::

:::

::::
:::::


### Yates continuity correction

When both variables have just two categories, you should apply the `r glossary("Yates continuity correction")`. It subtracts an additional .5 from the absolute difference between observed and expected values in each group, or cell of the table. This makes the chi-squared test statistic smaller and therefore makes it harder to reach statistical significance.

The correction is necessary because the chi-squared distribution is not a perfect representation of the distribution of differences between observed and expected values in the situation where both variables are binary. Functions normally apply the correction by default whenever two binary variables are tested, but you can decide via an argument whether you want to apply it or not.

An exception is `descr::CrossTable()` which provides automatically both versions whenever you compute the test statistic for a 2 by 2 table. This is somewhat illogical because you would always need only the version with the correction for a 2 by 2 table (and not both) and sometimes you would also want to apply it when there are few observations in one or more of the cells. 
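
To see what the correction does, here is a minimal sketch that computes the chi-squared statistic by hand, once without and once with the Yates correction, and cross-checks both against `stats::chisq.test()`. The four counts are the ones from the housing status by voting opinion table used later in the odds ratio section. The row and column labels are my own shorthand and are only an assumption about how the categories could be named.

```{r}
#| label: sketch-yates-by-hand

# Observed 2 x 2 counts (labels are shorthand, not the original factor levels)
observed <- matrix(
    c(287, 375,
      112, 208),
    nrow = 2, byrow = TRUE,
    dimnames = list(
        housing = c("Owner", "Renter"),
        opinion = c("Register", "Easy")
    )
)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared without and with the Yates continuity correction
chisq_plain <- sum((observed - expected)^2 / expected)
chisq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

c(plain = chisq_plain, yates = chisq_yates)

# Cross-check against the built-in test
stats::chisq.test(observed, correct = FALSE)$statistic
stats::chisq.test(observed, correct = TRUE)$statistic
```

The corrected statistic is always the smaller one because every absolute difference is reduced before it is squared.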


:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-yates}
: Computing a chi-squared test statistic with the Yates continuity correction
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### descr::CrossTable() only test

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-yates-descr-only-test}
: Chi-squared test for ease of voting and home ownership
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: yates-descr-only-test
#| results: hold

## load vote_clean ##########
vote_clean <-  base::readRDS("data/chap05/vote_clean.rds")

descr::CrossTable(
    x = vote_clean$ease_vote,
    y = vote_clean$ownhome,
    expected = FALSE,
    prop.r = FALSE,
    prop.c = FALSE,
    prop.t = FALSE,
    prop.chisq = FALSE,
    chisq = TRUE,
    resid = FALSE,
    sresid = FALSE,
    asresid = FALSE
)
```

::::
:::::

###### rstatix::chisq_test() only test

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-yates-rstatix-only-test}
: Chi-squared test for ease of voting and home ownership with and without Yates continuity correction
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: yates-rstatix-only-test
#| results: hold

vote_ownhome_chisq1 <- rstatix::chisq_test(
    vote_clean$ease_vote,
    vote_clean$ownhome,
    correct = FALSE
)

vote_ownhome_chisq2 <- rstatix::chisq_test(
    vote_clean$ease_vote,
    vote_clean$ownhome,
    correct = TRUE
)

vote_ownhome_chisq <- 
    dplyr::bind_rows(
        vote_ownhome_chisq1,
        vote_ownhome_chisq2
        ) |> 
    tibble::add_column(
        "Yates" = c("No", "Yes"),
        .before = "p.signif"
        )

vote_ownhome_chisq
```

***

To compare the differences I computed the chi-squared test twice, with and without the Yates correction. Then I combined the results and added a column labeled Yes/No.

::::
:::::


###### descr::CrossTable() full data

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-yates-descr}
: Chi-squared test for ease of voting and home ownership
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: yates-descr
#| results: hold

descr::CrossTable(
    x = vote_clean$ease_vote,
    y = vote_clean$ownhome,
    expected = TRUE,
    prop.r = FALSE,
    prop.c = FALSE,
    prop.t = FALSE,
    prop.chisq = FALSE,
    chisq = TRUE,
    resid = TRUE,
    sresid = TRUE,
    asresid = FALSE
)
```

::::
:::::

###### rstatix::chisq_test() full data

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-yates-rstatix}
: Chi-squared test for ease of voting and home ownership with and without Yates continuity correction
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: yates-rstatix
#| results: hold

vote_ownhome_chisq

glue::glue(" ")
glue::glue("#####################################################################")
glue::glue(" ")

rstatix::chisq_descriptives(vote_ownhome_chisq)

```

***

To compare the differences I computed the chi-squared test twice, with and without the Yates correction. Then I combined the results and added a column labeled Yes/No.

::::
:::::


:::

::::
:::::

In all tabs of @exm-chap05-yates you can see that with the Yates continuity correction the $\chi^2$ value is smaller and results in a somewhat higher p-value. But that does not matter in this case: Both versions are statistically significant ($p < .05$).

:::::{.my-assessment}
:::{.my-assessment-header}
:::::: {#cor-statistically-signifant}
: What do the stars under the heading `p.signif` in the results of the chi-squared tests with {**rstatix**} mean?
::::::
:::
::::{.my-assessment-container}

| significance <br>code          | p-value          |
|:------------------------------:|------------------|
|             `****`             | [0, 0.0001]      |
|             `***`              | (0.0001, 0.001]  |
|             `**`               | (0.001, 0.01]    |
|             `*`                | (0.01, 0.05]     |
|             `ns`               | (0.05, 1]        |

: How to interpret stars as significance levels?  {#tbl-stars-significance} {.striped .hover}

::::
:::::

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-cramers-v-voting-homeowner}
: Computing the effect size with Cramér’s V
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: cramers-v-voting-homeowner

rstatix::cramer_v(
    vote_clean$ease_vote,
    vote_clean$ownhome,
    correct = TRUE
)

```

The Yates continuity correction also applies to the Cramér’s V effect size calculation. In this case the value of V falls into the weak or small effect size range.

::::
:::::

:::::{.my-remark}
:::{.my-remark-header}
Summary abbreviated
:::
::::{.my-remark-container}
I have not followed the `r glossary("NHST")` procedure or the analysis of the relationship between ease of voting and home ownership in detail. I understand and feel safe about most of the content, therefore I focus only on material where I have difficulties or where I need more practice (as with the `r glossary("Yates continuity correction")` and `r glossary("Cramér’s V")`).
::::
:::::


### Phi coefficient {#sec-chap05-phi-coefficient}

For 2 × 2 tables, the $k - 1$ term in the denominator of the Cramér’s V formula is always 1, so this term is not needed in the calculation. The formula without this term is called the `r glossary("phi coefficient")`.

:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap05-phi-coefficient}
: Formula for phi coefficient $\phi$
::::::
:::
::::{.my-theorem-container}
$$
\phi = \sqrt{\frac{\chi^2}{n}}
$$ {#eq-chap05-phi-coefficient}

n = sample size

::::
:::::
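
A minimal sketch to confirm that phi and Cramér’s V coincide for a 2 × 2 table. It uses the four counts from the housing status by voting opinion table of this chapter and, as an assumption on my part, the chi-squared statistic without the Yates correction.

```{r}
#| label: sketch-phi-vs-cramers-v

# 2 x 2 counts: rows = housing status, columns = voting opinion
observed <- matrix(c(287, 375, 112, 208), nrow = 2, byrow = TRUE)

# Uncorrected chi-squared statistic and sample size
chi_squared <- unname(stats::chisq.test(observed, correct = FALSE)$statistic)
n <- sum(observed)

# k is the number of categories of the variable with the fewest categories
k <- min(dim(observed))

phi <- sqrt(chi_squared / n)
cramers_v <- sqrt(chi_squared / (n * (k - 1)))

c(phi = phi, cramers_v = cramers_v)
```

Both values are identical because $k - 1 = 1$ for any 2 × 2 table.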


### Odds ratio {#sec-chap05-odds-ratio}

:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap05-odds-ratio}
Explaining the odds ratio
::::::
:::
::::{.my-resource-container}
The explanation in `r glossary("SwR")` is not easy to understand. So I have used other material as well:

- Frost, J. (2022, January 11). Odds Ratio: Formula, Calculating & Interpreting. Statistics By Jim. https://statisticsbyjim.com/probability/odds-ratio/
- Glen, S. (n.d.). Odds Ratio Calculation and Interpretation. Statistics How To. https://www.statisticshowto.com/probability-and-statistics/probability-main-index/odds-ratio/
- Poldrack, R. A. (2020, January 13). 10.12: [Odds and Odds Ratios](https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Statistical_Thinking_for_the_21st_Century_(Poldrack)/10%3A_Probability/10.12%3A_Odds_and_Odds_Ratios). Statistics LibreTexts. 
- Szumilas, M. (2010). Explaining Odds Ratios. Journal of the Canadian Academy of Child and Adolescent Psychiatry, 19(3), 227–229. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/
- Tenny, S., & Hoffman, M. R. (2024). Odds Ratio. In StatPearls. StatPearls Publishing. http://www.ncbi.nlm.nih.gov/books/NBK431098/

::::
:::::

Odds are usually defined in statistics as the probability that an event will occur divided by the probability that it will not occur. In other words, it’s a ratio of successes (or wins) to losses (or failures). As an example, if a racehorse runs 100 races and wins 20 times, the odds of the horse winning a race are 20/80 = 1/4 = 0.25.

The odds definition is different to the somewhat similar definition of probability, which is the fraction of times an event occurs in a certain number of trials. In the horse example, the probability of a win is 20/100 = 0.2. (see [@glenn.da])

:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap05-odds}
: Formula for odds
::::::
:::
::::{.my-theorem-container}

$$
Odds = \frac{\text{Probability Event Occurs (p)}}{{\text{Probability Event Does Not Occur (1-p)}}}
$$ {#eq-chap05-odds}

::::
:::::
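
The racehorse example can be sketched in a few lines of base R. The variable names are hypothetical; the numbers are the ones from the text above.

```{r}
#| label: sketch-odds-vs-probability

wins  <- 20
races <- 100

# Probability: wins as a fraction of all races
p_win <- wins / races

# Odds: probability of a win divided by probability of no win
odds_win <- p_win / (1 - p_win)

# Converting odds back to a probability
p_from_odds <- odds_win / (1 + odds_win)

c(probability = p_win, odds = odds_win, back_to_probability = p_from_odds)
```

The last line shows that odds and probability carry the same information, they are just expressed on different scales.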


Odds ratios with groups quantify the strength of the relationship between two conditions. They indicate how likely an outcome is to occur in one context relative to another. 


:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap05-odds-ratio}
: Formula for odds ratio
::::::
:::
::::{.my-theorem-container}

$$
\text{Odds Ratio} = \frac{\text{Odds of an Event (Condition A)}}{{\text{Odds of an Event (Condition B)}}}
$$ {#eq-chap05-odds-ratio}

::::
:::::


The denominator (condition B) in the odds ratio formula is the baseline or control group. Consequently, the OR tells you how much more or less likely the numerator events (condition A) are to occur relative to the denominator events. If you have a treatment and a control group, the treatment group will be in the numerator while the control group is in the denominator of the formula [@frost2022].

Taking the definitions of odds and odds ratio together we get the formula:

:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap05-odds-ratio2}
: Formula for odds ratio (2)
::::::
:::
::::{.my-theorem-container}

$$
\begin{align*}
\text{Odds Ratio} &= \frac{\text{Odds of an Event (Condition A)}}{\text{Odds of an Event (Condition B)}} \\
&= \frac{\text{Probability of Event (A)} / \text{Probability of Non-Event (A)}}{\text{Probability of Event (B)} / \text{Probability of Non-Event (B)}} \\
&= \frac{\text{Probability of Event (A)} \times \text{Probability of Non-Event (B)}}{\text{Probability of Non-Event (A)} \times \text{Probability of Event (B)}}
\end{align*}
$$ {#eq-chap05-odds-ratio2}

::::
:::::


The book's explanation of the `r glossary("odds ratio")` introduces two new concepts, `r glossary("exposure")` and `r glossary("outcome")`, and is therefore more difficult to understand. In this terminology the odds ratio measures how strongly an exposure is associated with a particular outcome: It compares the odds of the outcome occurring given a particular exposure with the odds of it occurring without that exposure. Or more generally: The odds ratio is the ratio of the odds of an event occurring in a treatment group to the odds of the same event occurring in a control group. (Still pretty difficult…)

In our case of voting opinion and housing status the odds ratio would measure the odds of people that think one should register to vote given owning a home, compared to the odds of people that think one should register to vote given not owning a home. 

:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap05-odds-ratio3}
: Formula for odds ratio (3)
::::::
:::
::::{.my-theorem-container}

$$
OR = \frac{\text{exposed with outcome} / \text{unexposed with outcome}}{\text{exposed no outcome} / \text{unexposed no outcome}}
$$ {#eq-chap05-odds-ratio3}

::::
:::::


To fill in the correct values one has to conceptualize a 2x2 table:

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-odds-ratio-table}
: Odds ratio table
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: odds-tatio-table
#| results: markup

tibble::tribble(
  ~Exposure,      ~Cases,  ~Control,
  "Exposed",     "a",     "b",
  "Not Exposed", "c",     "d"
)

```

***

The columns "Cases" and "Control" are the Outcomes:

- a = Number of exposed cases
- b = Number of exposed non-cases
- c = Number of unexposed cases
- d = Number of unexposed non-cases
::::
:::::

:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap05-odds-ratio4}
: Formula odds ratio (4)
::::::
:::
::::{.my-theorem-container}

$$
OR = \frac{a / c}{b / d} = \frac{a \times d}{b \times c}
$$ {#eq-chap05-odds-ratio-4}

::::
:::::


Now let's think about what this general structure means in our case with voting opinions (easy versus register) and housing status (owner or renter).


```{r}
#| label: voting-housing-table
#| results: markup

vote_clean <- base::readRDS("data/chap05/vote_clean.rds")
(
    vote_housing_table <- base::table(
        vote_clean$ownhome,
        vote_clean$ease_vote,
        dnn = c("Housing status", "Voting opinion")
    )
)
```

***

::: {#bul-chap05-odds-ratio-example}

:::::{.my-bullet-list}
:::{.my-bullet-list-header}
Bullet List
:::
::::{.my-bullet-list-container}


**Exposure and Outcome**

- Exposed: Home owners
- Not Exposed: Renters (tenants)
- Cases: People who favor registering for voting
- Control: People who want easy voting

**Cells and their values**

- Number of exposed cases [1,1] = (a) = Home owners who want people to register for voting = 287.
- Number of exposed non-cases [1,2] = (b) = Home owners who want easy voting = 375.
- Number of unexposed cases [2,1] = (c) = Renters who want people to register for voting = 112.
- Number of unexposed non-cases [2,2] = (d) = Renters who want easy voting = 208.


::::
:::::
Calculation of the odds ratio (OR) using the two-by-two frequency table of voting opinion by housing status

:::

***

:::::{.my-assessment}
:::{.my-assessment-header}
:::::: {#cor-chap05-odds-ratio}
: Interpretation of odds ratios using our example of voting opinion by housing status
::::::
:::
::::{.my-assessment-container}

**General rule**

- **OR = 1** indicates that the odds of the outcome are the same for exposed and unexposed.
- **OR > 1** indicates higher odds of the outcome for exposed compared to unexposed, i.e. the event/outcome is more likely to occur.
- **OR < 1** indicates lower odds of the outcome for exposed compared to unexposed, i.e. the event/outcome is less likely to occur.

**Our example**

- Home owners have 1.42 times the odds of thinking people should register to vote compared to people who do not own homes. 
- Or alternatively: Home owners have 42% higher odds of thinking people should register to vote compared to people who do not own homes.

::::
:::::


$$
OR = \frac{a / c}{b / d} = \frac{287 / 112}{375 / 208} = \frac{2.5625}{1.802885} = 1.42
$$ {#eq-chap05-odds-ratio-example}

The p-value for odds ratios has the same broad meaning as p-values for the chi-squared. But instead of being based on the area under the curve for the chi-squared distribution, it is based on the area under the curve for the log of the odds ratio, which is approximately normally distributed. The odds ratio can only be a positive number, and it results in a right-skewed distribution, which the log function can often transform to something close to normal.
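
To make the idea of a p-value based on the log of the odds ratio concrete, here is a minimal sketch of the normal (Wald) approximation. This is an assumption on my part about one common way to do it: the standard error of the log odds ratio is taken as the square root of the sum of the reciprocal cell counts. The packages used below may apply other methods (for instance mid-p), so their numbers can differ slightly.

```{r}
#| label: sketch-odds-ratio-wald

# Cell counts as defined in the odds ratio table:
# a = exposed cases, b = exposed non-cases,
# c = unexposed cases, d = unexposed non-cases
counts <- c(a = 287, b = 375, c = 112, d = 208)

# Odds ratio: (a / c) / (b / d) = (a * d) / (b * c)
odds_ratio <- unname((counts["a"] * counts["d"]) / (counts["b"] * counts["c"]))

# The log odds ratio is approximately normal with this standard error
log_or    <- log(odds_ratio)
se_log_or <- sqrt(sum(1 / counts))

# Two-sided p-value from the standard normal distribution
z       <- log_or / se_log_or
p_value <- 2 * stats::pnorm(-abs(z))

# 95% confidence interval, computed on the log scale and transformed back
ci_95 <- exp(log_or + c(-1, 1) * stats::qnorm(0.975) * se_log_or)

round(c(OR = odds_ratio, lower = ci_95[1], upper = ci_95[2], p = p_value), 3)
```

Because the interval is built on the log scale and then transformed back, it is not symmetric around the odds ratio.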

:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap05-packages-odds-ratio}
Packages with odds ratio function
::::::
:::
::::{.my-resource-container}
The book explains the manual calculation and recommends the {**fmsb**} package. Via internet research I found some other packages with an odds ratio function. The following list is sorted alphabetically:

- {**DescTools**}: `DescTools::OddsRatio()`
- {**epitools**}: `epitools::oddsratio()`
- {**fmsb**}: `fmsb::oddsratio()`

The packages {**tern**} and {**BioProbability**} also feature an odds ratio function. But I haven't looked into these packages because they have fewer than 100 daily downloads from the RStudio CRAN mirror server.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-pkgs-dl-odds-ratio}
: Number of daily downloads for packages with an odds ratio function
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: tbl-odds-ratio-pkgs
#| tbl-cap: "Daily downloads of packages with odds ratio function"
#| cache: TRUE

pkgs <- c("DescTools", "epitools", "fmsb", "tern", "BioProbability")
pkgs_dl(pkgs)
```

::::
:::::

::::
:::::


:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap05-odds-ratio}
: Computing the odds ratio
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Manually

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-odds-ratio-by-hand}
: Odds ratio of ease of voting by home ownership computed manually
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: odds-ratio-by-hand
#| results: hold


# Rows are the exposure (housing status), columns are the outcome (voting opinion)
glue::glue("############### Table format used ################## ")
(
    vote_housing_table <- base::table(
        vote_clean$ownhome,
        vote_clean$ease_vote,
        dnn = c("Housing status", "Voting opinion")
    )
)

# OR = (a / c) / (b / d), with the counts read from the table above
odds_ratio <- round((287 / 112) / (375 / 208), 2)

glue::glue(" ")
glue::glue("###################################################")
glue::glue("Odds ratio: {odds_ratio}")
```
***

The calculation uses the frequencies in the 2 × 2 table where the rows are the exposure and the columns are the outcome.
::::
:::::


###### {**fmsb**}

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-odds-ratio-fmsb}
: Odds ratio of ease of voting by home ownership using `fmsb::oddsratio()`
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: odds-ratio-fmsb
#| results: hold

glue::glue("*****************   Input counts manually   ***********")
fmsb::oddsratio(a = 287, b = 112, c = 375, d = 208)

glue::glue(" ")
glue::glue("*******************************************************")

fmsb::oddsratio(vote_housing_table)

```

***

Here I have replicated the code from the book. {**fmsb**} has a disadvantage: You have to specify the values manually, you can't use a table object. The function is said to work with a matrix as well, but then I got a warning message:

> Warning in N1 * N0 * M1 * M0: NAs produced by integer overflow

As a result of the produced NAs the p-value is not computed. (But the calculated odds ratio is correct.)

So the best option is to stick with manual input. Note that `fmsb::oddsratio()` labels the cells differently from the odds ratio table above: its `b` is the number of unexposed cases and its `c` the number of exposed non-cases. Besides this inconvenience, the table summary also uses medical labels ("Disease" / "Nondisease") that do not fit our data.


::::
:::::

###### {**DescTools**}


:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-odds-ratio-desctools}
: Odds ratio of ease of voting by home ownership using `DescTools::OddsRatio()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: odds-ratio-desctools

DescTools::OddsRatio(
    x = vote_housing_table, 
    conf.level = .95,
    method = "midp")
```
***

This is a very sparse output. In contrast to the two other packages it misses the table summary and the p-value.


::::
:::::

###### {**epitools**}

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap05-odds-ratio-epitools}
: Odds ratio of ease of voting by home ownership using `epitools::oddsratio()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: odds-ratio-epitools

epitools::oddsratio.midp(vote_housing_table, 
                    correction = TRUE,
                    verbose = TRUE)
```

***

This is the most detailed output. There is also a less verbose version without

- replicating the raw data $x$
- calculation of exposed proportions
- calculation of outcome proportions
- repeating the confidence level

This more concise version has the most important information and is in my opinion the best option for calculating the odds ratio.
::::
:::::


:::

::::
:::::

:::::{.my-important}
:::{.my-important-header}
{**epitools**} is my preferred method for the odds ratio calculation
:::
::::{.my-important-container}

Because of the somewhat inconvenient data input for the `oddsratio()` function of the {**fmsb**} package and the sparse output of the `OddsRatio()` function of the {**DescTools**} package I prefer the computation with {**epitools**} in its more concise option (`verbose = FALSE`).

::::
:::::


## Achievement 8: When chi-squared assumptions fail {#sec-chap05-achievement8}

What should you do when one of the chi-squared assumptions fails?

### Variables not nominal or ordinal

Use a different statistical test. Chi-squared is only appropriate for categorical variables.

### Sample too small

The assumption of expected values of 5 or higher in at least 80% of groups is necessary because the sampling distribution for the chi-squared statistic only approximates the actual chi-squared distribution but does not capture it completely accurately. When a sample is large, the approximation is better and using the chi-squared distribution to determine statistical significance works well.

However, for very small samples, the approximation is not great, so a different method of computing the p-value is better. The most commonly used method is `r glossary("Fisher’s exact test")` (`stats::fisher.test()`, `rstatix::fisher_test()`, `janitor::fisher.test()`, `fmsb::pairwise.fisher.test()`).
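
A minimal sketch of the small-sample case with `stats::fisher.test()`. The table is invented for illustration only and has nothing to do with the voting data. All of its expected counts are below 5, so the chi-squared assumption fails.

```{r}
#| label: sketch-fisher-exact

# Hypothetical 2 x 2 table with only 8 observations
small_table <- matrix(
    c(3, 1,
      1, 3),
    nrow = 2, byrow = TRUE,
    dimnames = list(
        group  = c("A", "B"),
        answer = c("Yes", "No")
    )
)

# Expected counts under independence (the warning about the poor
# approximation is suppressed because it is exactly the point here)
expected <- suppressWarnings(stats::chisq.test(small_table))$expected
expected

# Fisher's exact test computes the p-value without the chi-squared approximation
fisher_result <- stats::fisher.test(small_table)
fisher_result$p.value
```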

### Observations not independent

- If both variables are binary (have only two categories), use `r glossary("McNemar’s test")` (`stats::mcnemar.test()`).
- If there are three or more groups for one variable and a binary second variable, use `r glossary("Cochran’s Q-test")`. Besides the book recommendation `nonpar::cochrans.q()`, the test is also available in other packages that I have already used for this book: `DescTools::CochranQTest()`, `rstatix::cochran_qtest()`.
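
For the binary case, a minimal sketch with `stats::mcnemar.test()`. The data are invented: I assume the same 100 people answered a yes/no question twice, so the two answers are paired and not independent.

```{r}
#| label: sketch-mcnemar

# Hypothetical paired answers: rows = first answer, columns = second answer
paired_table <- matrix(
    c(30, 12,
       5, 53),
    nrow = 2, byrow = TRUE,
    dimnames = list(
        before = c("Yes", "No"),
        after  = c("Yes", "No")
    )
)

# McNemar's test only uses the two cells where the answer changed (12 and 5)
mcnemar_result <- stats::mcnemar.test(paired_table)
mcnemar_result
```

The test asks whether changes from yes to no are as frequent as changes from no to yes. The people who gave the same answer twice do not enter the test statistic.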


## Exercises (empty)


## Glossary

```{r}
#| label: glossary-table
#| echo: false

glossary_table()
```

------------------------------------------------------------------------

## Session Info {.unnumbered}

::: my-r-code
::: my-r-code-header
Session Info
:::

::: my-r-code-container
```{r}
#| label: session-info

sessioninfo::session_info()
```
:::
:::