From a4bd0f41751eff648ebf9f0cb683fb63ffdb00e8 Mon Sep 17 00:00:00 2001 From: Lieven Clement Date: Wed, 16 Oct 2024 14:34:47 +0200 Subject: [PATCH] update twosample and wilcoxon with movies --- 05-statisticalInference-twosampleT.Rmd | 2 ++ ...ametericStatistics-WilcoxonMannWithney.Rmd | 1 + docs/05-statisticalInference-twosampleT.html | 4 +++- .../figure-html/unnamed-chunk-2-1.png | Bin 47350 -> 47539 bytes ...metericStatistics-WilcoxonMannWithney.html | 4 +++- 5 files changed, 9 insertions(+), 2 deletions(-) diff --git a/05-statisticalInference-twosampleT.Rmd b/05-statisticalInference-twosampleT.Rmd index 622713f2..8c07590d 100644 --- a/05-statisticalInference-twosampleT.Rmd +++ b/05-statisticalInference-twosampleT.Rmd @@ -14,6 +14,8 @@ knitr::opts_chunk$set( library(tidyverse) ``` + + # Smelly armpit example - Smelly armpits are not caused by sweat, itself. The smell is caused by specific micro-organisms belonging to the group of *Corynebacterium spp.* that metabolise sweat. diff --git a/09-NonparametericStatistics-WilcoxonMannWithney.Rmd b/09-NonparametericStatistics-WilcoxonMannWithney.Rmd index fedf6523..e2d4ba2b 100644 --- a/09-NonparametericStatistics-WilcoxonMannWithney.Rmd +++ b/09-NonparametericStatistics-WilcoxonMannWithney.Rmd @@ -16,6 +16,7 @@ library(Rmisc) set.seed(140) ``` + # Introduction diff --git a/docs/05-statisticalInference-twosampleT.html b/docs/05-statisticalInference-twosampleT.html index 38cdb8a9..1cc8cf40 100644 --- a/docs/05-statisticalInference-twosampleT.html +++ b/docs/05-statisticalInference-twosampleT.html @@ -554,6 +554,8 @@

statOmics, Ghent University (Creative Commons License

+

1 Smelly armpit example

    @@ -777,7 +779,7 @@

    4 How to report?

    An effect can be extremely statistically significant, but scientifically irrelevant. With a CI you will spot this.

-
---
title: "5. Statistical Inference: Two-sample t-test"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
  include = TRUE, comment = NA, echo = TRUE,
  message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
```

# Smelly armpit example

- Smelly armpits are not caused by sweat, itself. The smell is caused by specific micro-organisms belonging to the group of *Corynebacterium spp.* that metabolise sweat.
Another group of abundant bacteria are the *Staphylococcus spp.*, these bacteria do not metabolise sweat in smelly compounds.

- The CMET-group at Ghent University does research to on transplanting the armpit microbiome to save people with smelly armpits.

- Proposed Therapy:
  	1. Remove armpit-microbiome with antibiotics
    2. Influence armpit microbiome with microbial  transplant (https://youtu.be/9RIFyqLXdVw)

- Experiment:

    - 20 subjects with smelly armpits are attributed to one of two treatment groups
    - placebo (only antibiotics)
    - transplant (antibiotica followed by microbial transplant).
    - The microbiome is sampled 6 weeks upon the treatment
    - The relative abundance of *Staphylococcus spp.* on *Corynebacterium spp.* + *Staphylococcus spp.* in the microbiome is measured via DGGE (*Denaturing Gradient Gel Electrophoresis*).

---

## Import the data

```{r}
ap <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/armpit.csv")
ap
```

## Data exploration

We plot the direct relative abundances in function of the treatment group. With the ggplot2 library we can easily build plots by adding layers.

```{r}
ap %>% ggplot(aes(x = trt, y = rel)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = "jitter")

ap %>% ggplot(aes(sample = rel)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~trt)
```

---

# Two sample T-test
## Notation

Suppose that $Y_{ij}$ is the response for subjects $i=1,\ldots, n_j$ from population $j=1,2$.

Use of the term **treatment** or **group** instead of population

Here the treatment is $j=1$ microbial transplant vs $j=2$ placebo.

We assume

$$Y_{ij}\text{ i.i.d. } N(\mu_j,\sigma^2)\;\;\;i=1,\ldots,n_i\;j=1,2.$$

Note, that we assume equal variances **homoscedastic**

(Unequal variances are referred to as  **heteroscedastic**)

---

## Hypotheses

Test $$ H_0: \mu_1 = \mu_2 $$
against
$$  H_1: \mu_1 \neq \mu_2 .$$

$H_1$ is again the research hypothesis: the average relative abundance of *Staphylococcus spp.* is different upon microbial transplant then upon placebo treatment.

$H_0$ and $H_1$ can also be specified in terms of the effect size between the two treatments, $\mu_1-\mu_2$
$$H_0: \mu_1-\mu_2 = 0,$$
$$H_1: \mu_1-\mu_2 \neq 0.$$

We can estimate the effect size using the difference in sample means:
$$\hat \mu_1-\hat \mu_2=\bar Y_1 -\bar Y_2.$$

---

## Variance estimator

The experimental units are independent so the sample means are also independent and the variance on the difference is
$$\text{Var}_{\bar Y_1 -\bar Y_2}=\frac{\sigma^2}{n_1}+\frac{\sigma^2}{n_2}=\sigma^2 \left(\frac{1}{n_1}+\frac{1}{n_2}\right).$$

And the standard error becomes
$$\sigma_{\bar Y_1 -\bar Y_2}=\sigma\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}.$$

The variance can be estimated within each group using the sample variance:

$$S_1^2 = \frac{1}{n_1-1}\sum_{i=1}^{n_1} (Y_{i1}-\bar{Y}_1)^2.$$

$$S_2^2 = \frac{1}{n_2-1}\sum_{i=1}^{n_2} (Y_{i2}-\bar{Y}_2)^2.$$

But, if we assume equal variances $\sigma_1^2=\sigma_2^2=\sigma^2$ than we can estimate the variance more precise by using all observations in both groups. This variance estimator is also referred to as the *pooled variance estimator: $S^2_p$*.

So $S_1^2$ en $S_2^2$ are estimators of the same parameter $\sigma^2$.

And we can combine them into one estimator based on all $n_1+n_2$ observations:

$$  S_p^2 = \frac{n_1-1}{n_1+n_2-2} S_1^2 + \frac{n_2-1}{n_1+n_2-2} S_2^2 = \frac{1}{n_1+n_2-2}\sum_{j=1}^2\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2.$$

$$ S_p^2= \sum\limits_{j=1}^2\sum\limits_{i=1}^{n_j} \frac{(Y_{ij}-\bar{Y}_{.j})^2}{n_1+n_2-2}$$


The pooled variance estimator uses the squared deviations of the observations from their group mean and has $n_1+n_2-2$ degrees of freedom.

---

## Test statistic
Two-sample $t$-teststatistiek:

$$T = \frac{\bar{Y}_1-\bar{Y}_2}{\sqrt{\frac{S_p^2}{n_1}+\frac{S_p^2}{n_2}}} =
  \frac{\bar{Y}_1 - \bar{Y}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}.$$

The statistic T follows a t-distribution with $n_1+n_2-2$ under $H_0$ is all data are independent, normally distributed and have equal variances.

---

## Armpit example

We can implement the test in R:

```{r}
t.test(rel ~ trt, data = ap, var.equal = TRUE)
```

On the $5\%$ significance level we reject the null hypothesis in favor of the alternative hypothesis and conclude that the relative abundance of  *Staphylococcus spp.* is on average extreme significant larger is in transplantation group than in the placebo group.

If there is no effect of the transplant we have a probability of less then 9 in $100000$ to observe a test statistic in a random sample that is at least as extreme as what we observed in the armpit experiment.

This is extremely rare under $H_0$.

If $H_1$ is correct, we expect that the test statistic is larger in absolute value and expect small p-values. Hence we decide that there is a lot of evidence against $H_0$ in favour of $H_1$.

**Good statistical practice** is to report the $p$-value, but also  effect size along with its confidence interval. So that we can judge the statistical significance and the biological relevance.

### Conclusion

On average the relative abundance of *Staphylococcus spp.* in the microbiome of the armpit in the transplant group is extremely significantly different from that in the placebo group ($p<<0.001$). The relative abundance of *Staphylococcus spp.* is on average `r round(diff(t.test(rel~trt,data=ap,var.equal=TRUE)$estimate),1)`% larger in the transplant group than in the placebo group (95\% CI [`r paste(format(-t.test(rel~trt,data=ap,var.equal=TRUE)$conf.int[2:1],digits=2,nsmall=1),collapse=",")`]%).

---

# Assumptions

Validity of t-test depends on distributional assumptions:

- Independence (design)
- One-sample t-test: normality of the observations
- Paired t-test: normality of the difference
- Two-sample t-test: Normality of the observations in both groups, and equal variances.

If the assumptions are not met, the null distribution does not follow a t-distribution, and, the p-values and critical values are incorrect.

To construct confidence intervals we also rely on these assumptions.

- We used quantiles from the t-distribution to calculate the lower and upper limit.

- The correct coverage of the CI depends on these assumptions

---

## Evaluate normality

 - Boxplots and histograms: shape of distribution and outliers

 - QQ-plots

There also exist hypothesis tests (goodness-of-fit test), but their null hypothesis is that the data are normally distributed so we make a weak conclusion!

  - Kolmogorov-Smirnov, Shapiro-Wilk en Anderson-Darling.
  - In small samples they have a low power
  - In large samples they often flag very small deviations as significant

Recommendation

- Start with graphical exploration of the data and keep the sample size in mind to avoid overinterpretation of the plots.

- If you have doubts, use simulation where you simulate data with the same sample size from a Normal distribution with the same mean and variance as the one that you observed in the sample

- If you observed deviations of normality check in the literature how sensitive your method is such deviations of normality. (e.g. T-tests for instance are rather insensitive to deviations as long as the distribution of the data is symmetric.)

- In large samples you can resort to the central limit theorem.

- You might resort to transformations of the response.

---

## Homoscedasticity

- Boxplots: The box size is the inter quartile range (IQR) a robust estimator of the variance.

- If the differences are not large $\rightarrow$ homoscedasticiteit

- Again you can use simulation to get insight in the differences you can expect.

- Formal F-test can be used to compare the variances, but again under the null you assume equal variances, so the same criticism as for normality tests applies here.

---

## Welch modified t-test

If the data are heteroscedastic, you can use a Welch two-sample T-test, which no longer uses the pooled variance estimator.

$$T =  \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\frac{S^2_1}{n_1}+\frac{S^2_2}{n_2}}}$$
with $S^2_1$ en $S^2_2$ the sample variances in both groups.

This statistic follows approximately a t-distribution with a number of degrees of freedom between  $\text{min}(n_1-1,n_2-1)$ and $n_1+n_2-2$.

In R the degrees of freedom are estimated using the Welch- Satterthwaite approximation. You can do this by using the `t.test` function with argument `var.equal=FALSE`.

```{r}
t.test(rel ~ trt, data = ap, var.equal = FALSE)
```

Note that you can see that the Welch T-test is adopted in the title. The adjusted degrees of freedom are $df = 17.876$ $\pm$ to that of the conventional T-test, because the variances are approximately equal.

---

# How to report?

- In the scientific literature there is too much attention for p-values

- It is much more informative to combine an estimate with its confidence interval.

**Rule of thumb**:

Report an estimate together with its  confidence interval (and its p-value)

1. The result of the test can be derived of the confidence interval
2. It allows the reader to judge **scientific relevance**.

```{r}
t.test(rel ~ trt, data = ap)
```

The result of an $\alpha$-level t-test is equivalent with comparing the effect size under $H_0$ with the $1-\alpha$ CI.

An effect can be extremely statistically significant, but scientifically irrelevant. With a CI you will spot this.

+
---
title: "5. Statistical Inference: Two-sample t-test"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
  include = TRUE, comment = NA, echo = TRUE,
  message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
```

<iframe width="560" height="315" src="https://www.youtube.com/embed/gghS0czPKGY?si=VsPkue-f7GgVeILc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

# Smelly armpit example

- Smelly armpits are not caused by sweat, itself. The smell is caused by specific micro-organisms belonging to the group of *Corynebacterium spp.* that metabolise sweat.
Another group of abundant bacteria are the *Staphylococcus spp.*, these bacteria do not metabolise sweat in smelly compounds.

- The CMET-group at Ghent University does research to on transplanting the armpit microbiome to save people with smelly armpits.

- Proposed Therapy:
  	1. Remove armpit-microbiome with antibiotics
    2. Influence armpit microbiome with microbial  transplant (https://youtu.be/9RIFyqLXdVw)

- Experiment:

    - 20 subjects with smelly armpits are attributed to one of two treatment groups
    - placebo (only antibiotics)
    - transplant (antibiotica followed by microbial transplant).
    - The microbiome is sampled 6 weeks upon the treatment
    - The relative abundance of *Staphylococcus spp.* on *Corynebacterium spp.* + *Staphylococcus spp.* in the microbiome is measured via DGGE (*Denaturing Gradient Gel Electrophoresis*).

---

## Import the data

```{r}
ap <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/armpit.csv")
ap
```

## Data exploration

We plot the direct relative abundances in function of the treatment group. With the ggplot2 library we can easily build plots by adding layers.

```{r}
ap %>% ggplot(aes(x = trt, y = rel)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = "jitter")

ap %>% ggplot(aes(sample = rel)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~trt)
```

---

# Two sample T-test
## Notation

Suppose that $Y_{ij}$ is the response for subjects $i=1,\ldots, n_j$ from population $j=1,2$.

Use of the term **treatment** or **group** instead of population

Here the treatment is $j=1$ microbial transplant vs $j=2$ placebo.

We assume

$$Y_{ij}\text{ i.i.d. } N(\mu_j,\sigma^2)\;\;\;i=1,\ldots,n_i\;j=1,2.$$

Note, that we assume equal variances **homoscedastic**

(Unequal variances are referred to as  **heteroscedastic**)

---

## Hypotheses

Test $$ H_0: \mu_1 = \mu_2 $$
against
$$  H_1: \mu_1 \neq \mu_2 .$$

$H_1$ is again the research hypothesis: the average relative abundance of *Staphylococcus spp.* is different upon microbial transplant then upon placebo treatment.

$H_0$ and $H_1$ can also be specified in terms of the effect size between the two treatments, $\mu_1-\mu_2$
$$H_0: \mu_1-\mu_2 = 0,$$
$$H_1: \mu_1-\mu_2 \neq 0.$$

We can estimate the effect size using the difference in sample means:
$$\hat \mu_1-\hat \mu_2=\bar Y_1 -\bar Y_2.$$

---

## Variance estimator

The experimental units are independent so the sample means are also independent and the variance on the difference is
$$\text{Var}_{\bar Y_1 -\bar Y_2}=\frac{\sigma^2}{n_1}+\frac{\sigma^2}{n_2}=\sigma^2 \left(\frac{1}{n_1}+\frac{1}{n_2}\right).$$

And the standard error becomes
$$\sigma_{\bar Y_1 -\bar Y_2}=\sigma\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}.$$

The variance can be estimated within each group using the sample variance:

$$S_1^2 = \frac{1}{n_1-1}\sum_{i=1}^{n_1} (Y_{i1}-\bar{Y}_1)^2.$$

$$S_2^2 = \frac{1}{n_2-1}\sum_{i=1}^{n_2} (Y_{i2}-\bar{Y}_2)^2.$$

But, if we assume equal variances $\sigma_1^2=\sigma_2^2=\sigma^2$ than we can estimate the variance more precise by using all observations in both groups. This variance estimator is also referred to as the *pooled variance estimator: $S^2_p$*.

So $S_1^2$ en $S_2^2$ are estimators of the same parameter $\sigma^2$.

And we can combine them into one estimator based on all $n_1+n_2$ observations:

$$  S_p^2 = \frac{n_1-1}{n_1+n_2-2} S_1^2 + \frac{n_2-1}{n_1+n_2-2} S_2^2 = \frac{1}{n_1+n_2-2}\sum_{j=1}^2\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2.$$

$$ S_p^2= \sum\limits_{j=1}^2\sum\limits_{i=1}^{n_j} \frac{(Y_{ij}-\bar{Y}_{.j})^2}{n_1+n_2-2}$$


The pooled variance estimator uses the squared deviations of the observations from their group mean and has $n_1+n_2-2$ degrees of freedom.

---

## Test statistic
Two-sample $t$-teststatistiek:

$$T = \frac{\bar{Y}_1-\bar{Y}_2}{\sqrt{\frac{S_p^2}{n_1}+\frac{S_p^2}{n_2}}} =
  \frac{\bar{Y}_1 - \bar{Y}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}.$$

The statistic T follows a t-distribution with $n_1+n_2-2$ under $H_0$ is all data are independent, normally distributed and have equal variances.

---

## Armpit example

We can implement the test in R:

```{r}
t.test(rel ~ trt, data = ap, var.equal = TRUE)
```

On the $5\%$ significance level we reject the null hypothesis in favor of the alternative hypothesis and conclude that the relative abundance of  *Staphylococcus spp.* is on average extreme significant larger is in transplantation group than in the placebo group.

If there is no effect of the transplant we have a probability of less then 9 in $100000$ to observe a test statistic in a random sample that is at least as extreme as what we observed in the armpit experiment.

This is extremely rare under $H_0$.

If $H_1$ is correct, we expect that the test statistic is larger in absolute value and expect small p-values. Hence we decide that there is a lot of evidence against $H_0$ in favour of $H_1$.

**Good statistical practice** is to report the $p$-value, but also  effect size along with its confidence interval. So that we can judge the statistical significance and the biological relevance.

### Conclusion

On average the relative abundance of *Staphylococcus spp.* in the microbiome of the armpit in the transplant group is extremely significantly different from that in the placebo group ($p<<0.001$). The relative abundance of *Staphylococcus spp.* is on average `r round(diff(t.test(rel~trt,data=ap,var.equal=TRUE)$estimate),1)`% larger in the transplant group than in the placebo group (95\% CI [`r paste(format(-t.test(rel~trt,data=ap,var.equal=TRUE)$conf.int[2:1],digits=2,nsmall=1),collapse=",")`]%).

---

# Assumptions

Validity of t-test depends on distributional assumptions:

- Independence (design)
- One-sample t-test: normality of the observations
- Paired t-test: normality of the difference
- Two-sample t-test: Normality of the observations in both groups, and equal variances.

If the assumptions are not met, the null distribution does not follow a t-distribution, and, the p-values and critical values are incorrect.

To construct confidence intervals we also rely on these assumptions.

- We used quantiles from the t-distribution to calculate the lower and upper limit.

- The correct coverage of the CI depends on these assumptions

---

## Evaluate normality

 - Boxplots and histograms: shape of distribution and outliers

 - QQ-plots

There also exist hypothesis tests (goodness-of-fit test), but their null hypothesis is that the data are normally distributed so we make a weak conclusion!

  - Kolmogorov-Smirnov, Shapiro-Wilk en Anderson-Darling.
  - In small samples they have a low power
  - In large samples they often flag very small deviations as significant

Recommendation

- Start with graphical exploration of the data and keep the sample size in mind to avoid overinterpretation of the plots.

- If you have doubts, use simulation where you simulate data with the same sample size from a Normal distribution with the same mean and variance as the one that you observed in the sample

- If you observed deviations of normality check in the literature how sensitive your method is such deviations of normality. (e.g. T-tests for instance are rather insensitive to deviations as long as the distribution of the data is symmetric.)

- In large samples you can resort to the central limit theorem.

- You might resort to transformations of the response.

---

## Homoscedasticity

- Boxplots: The box size is the inter quartile range (IQR) a robust estimator of the variance.

- If the differences are not large $\rightarrow$ homoscedasticiteit

- Again you can use simulation to get insight in the differences you can expect.

- Formal F-test can be used to compare the variances, but again under the null you assume equal variances, so the same criticism as for normality tests applies here.

---

## Welch modified t-test

If the data are heteroscedastic, you can use a Welch two-sample T-test, which no longer uses the pooled variance estimator.

$$T =  \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\frac{S^2_1}{n_1}+\frac{S^2_2}{n_2}}}$$
with $S^2_1$ en $S^2_2$ the sample variances in both groups.

This statistic follows approximately a t-distribution with a number of degrees of freedom between  $\text{min}(n_1-1,n_2-1)$ and $n_1+n_2-2$.

In R the degrees of freedom are estimated using the Welch- Satterthwaite approximation. You can do this by using the `t.test` function with argument `var.equal=FALSE`.

```{r}
t.test(rel ~ trt, data = ap, var.equal = FALSE)
```

Note that you can see that the Welch T-test is adopted in the title. The adjusted degrees of freedom are $df = 17.876$ $\pm$ to that of the conventional T-test, because the variances are approximately equal.

---

# How to report?

- In the scientific literature there is too much attention for p-values

- It is much more informative to combine an estimate with its confidence interval.

**Rule of thumb**:

Report an estimate together with its  confidence interval (and its p-value)

1. The result of the test can be derived of the confidence interval
2. It allows the reader to judge **scientific relevance**.

```{r}
t.test(rel ~ trt, data = ap)
```

The result of an $\alpha$-level t-test is equivalent with comparing the effect size under $H_0$ with the $1-\alpha$ CI.

An effect can be extremely statistically significant, but scientifically irrelevant. With a CI you will spot this.

-
---
title: "9. Nonparametric Statistics - Wilcoxon-Mann-Withney test"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
  include = TRUE, comment = NA, echo = TRUE,
  message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
library(Rmisc)
set.seed(140)
```


# Introduction

Inference was only correct if distributional assumptions were satisfied

- e.g Normal distribution
- equal variance

-  The $p$-value: $\text{P}_0\left[ \vert T\vert \geq \vert t \vert \right]$.

	- Calculated using the null distribution of $T$ that we derived under the assumptions
	- In correct if assumptions are violated

-  $95\%$ CI also builds upon these assumptions. If they are invalid then the intervals will not contain the population parameter with 95% probability.

- Asymptotic theory is more difficult to place: the $t$-test is asymptotically non-parametric because for very large samples the distributional assumptions of normality are no longer important.

- If assumptions hold the parametric approach

	- more efficient: larger power with same sample size + smaller CI.
	- more flexible: easier to analyse data with complex designs

---

## Cholesterol example

- Cholesterol concentration in blood measured for
  - 5 patients (group=1) two days upon a stroke
  - 5 healthy subject (groep=2).

- Is cholesterol concentration of hart patients and healthy subjects on average different?

```{r}
chol <- read_tsv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/chol.txt")
chol$group <- as.factor(chol$group)
nGroups <- table(chol$group)
n <- sum(nGroups)
chol
```

---


```{r, echo=FALSE, fig.align='center'}
chol %>% ggplot(aes(x = group, y = cholest)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = "jitter")

chol %>% ggplot(aes(sample = cholest)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~group)
```

- Possibly outliers
- Difficult to assess distributional assumptions when only 5 observations are available.


# Rank Tests

- Important group of non-parametric test
  - Non-parametric,
  - Exact $p$-values using a permutation null distribution.
  - No need for separate permutation distribution for each new dataset.
  - Permutation null distribution of rank tests only depends on sample size
  - Robust to outliers

---

# Ranks

Rank tests start from rank-transformed data.

- Let $Y_1, \ldots, Y_n$.
- In the absence of *ties*
  $$R_i=R(Y_i) = \#\{Y_j: Y_j\leq Y_i; j=1,\ldots, n\}$$
- Smallest observation has rank 1, second smallest rank 2, ... , largest observation gets rank $n$

```{r}
chol$cholest
rank(chol$cholest)
```

---

## Ties

Sometimes *ties* occur: two observations with identical values

```{r}
withTies <- c(403, 507, 507, 610, 651, 651, 651, 830, 900)
rank(withTies)
```

- Ties: 507 occurs twice, 651 occurs 3 times
- If ties occur *midranks* are used.

- **midrank** of observation $Y_i$ becomes
  \begin{eqnarray*}
   R_i &=& \frac{ \#\{Y_j: Y_j\leq Y_i\} + ( \#\{Y_j: Y_j < Y_i\} +1)}{2}.
   \end{eqnarray*}

---

## Ranks of pooled sample

- Let $Y_{ij}$, $i=1,\ldots, n_j$ be observations from two treatment groups $j=1,2$.
- They can also be represented by $Z_1,\ldots, Z_n$ ($n=n_1+n_2$), the outcomes of the pooled sample

```{r}
t(chol)
z <- chol$cholest
z
rank(z)
```
---

# Wilcoxon-Mann-Whitney Test

 Simultaneously developed by Wilcoxon, and,  Mann and Whitney:  **Wilcoxon-Mann-Whitney**, **Wilcoxon rank sum test**  or **Mann-Whitney U test**

## Hypotheses

Under $H_0$ the distributions of the two groups are equal
$$H_0: f_1=f_2$$


Under the alternative $H_1$ the distributions differ in location $$H_1: \mu_1\neq \mu_2$$

$H_1$ assumes **location-shift**, we will relax this assumption later on.

## Test statistic

Classic T-test: difference in sample means $\bar{Y}_1-\bar{Y}_2$.

Here: Difference in sample means based on rank transformed data

Ranks based on the pooled sample (upon joining the observations from the two groups): $R_{ij}=R(Y_{ij})$ is de rank of observation $Y_{ij}$ in the pooled sample.

\[
  T = \frac{1}{n_1}\sum_{i=1}^{n_1} R(Y_{i1}) - \frac{1}{n_2}\sum_{i=1}^{n_2} R(Y_{i2}) .
\]

- Under $H_0$ we expect the average rank of the first group to be close to that of the second group so $T$ is close to zero.

- Under $H_1$ we expect the mean ranks to differ so that $T$ deviates from zero.

- It is sufficient to only calculate
  $$S_1=\sum_{i=1}^{n_1} R(Y_{i1})$$.

- $S_1$ is the sum of the ranks of the first group: *rank sum test*.

- This holds because
\[
  S_1+S_2 = \text{sum of all ranks} = 1+2+\cdots + n=\frac{1}{2}n(n+1).
\]


- $S_1$ (or $S_2$) is a good test statistic

- Use permutations to determine the exact permutation distribution. (Permute the ranks between the groups)

- For a given $n$ and no *ties* the rank transformed data is always
  $$1, 2, \ldots, n$$
- For given $n_1$ en $n_2$ the permutation distribution is always the same!
- With current computing power this is not so important any more.

---

## Standardized statistic

Often the standardized test statistic is used
\[
  T = \frac{S_1-\text{E}_{0}\left[S_1\right]}{\sqrt{\text{Var}_{0}\left[S_1\right]}},
\]

- with $\text{E}_{0}\left[S_1\right]$ and $\text{Var}_{0}\left[S_1\right]$ the expect mean and variance of S1 under $H_0$.

- Under $H_0$
 \[
   \text{E}_{0}\left[S_1\right]= \frac{1}{2}n_1(n+1) \;\;\;\;\text{ en }\;\;\;\; \text{Var}_{0}\left[S_1\right]=\frac{1}{12}n_1n_2(n+1).
 \]

- Under $H_0$ and when $\min(n_1,n_2)\rightarrow \infty$
 \[
    T = \frac{S_1-\text{E}_{0}\left[S_1\right]}{\sqrt{\text{Var}_{0}\left[S_1\right]}} \rightarrow N(0,1).
 \]

Asymptotically the standardised statistic follows a standard normal distribution!

---

## Cholesterol example

We illustrate the result for the cholesterol example using the R function `wilcox.test`.
```{r}
wilcox.test(cholest ~ group, data = chol)
```

- We reject $H_0$ ($p=$ `r format(wilcox.test(cholest~group,data=chol)$p.value,digits=2)` $<0.05$)

- The output shows $W=$ `r wilcox.test(cholest~group,data=chol)$statistic`?

- Lets calculate
```{r}
S1 <- sum(rank(chol$cholest)[chol$group == 1])
S1
S2 <- sum(rank(chol$cholest)[chol$group == 2])
S2
```

- Where does $W=$ `r wilcox.test(cholest~group,data=chol)$statistic` comes from?

---

## Mann and Whitney test

Mann and Whitney test in absence of ties:
\[
 U_1 = \sum_{i=1}^{n_1}\sum_{k=1}^{n_2} \text{I}\left\{Y_{i1}\geq Y_{k2}\right\}.
\]

- with $\text{I}\left\{.\right\}$ an indicator that equals 1  if the expression is true and is zero otherwise.

- U counts how many times an observation of the first group is larger or equal to an observation from the second group.

```{r}
y1 <- subset(chol, group == 1)$cholest
y2 <- subset(chol, group == 2)$cholest
u1Hlp <- sapply(y1, function(y1i, y2) {
  y1i >= y2
}, y2 = y2)
colnames(u1Hlp) <- y1
rownames(u1Hlp) <- y2
```

```{r}
u1Hlp
U1 <- sum(u1Hlp)
U1
```

It can be shown that $U_1 = S_1 - \frac{1}{2}n_1(n_1+1).$

```{r}
S1 - nGroups[1] * (nGroups[1] + 1) / 2
```

1. $U_1$ en $S_1$ contain the same information
2. $U_1$ is also a rank statistic, and
3. Exact test based on $U_1$ and $S_1$ are equivalent.

---

## Probabilistic index

- $U_1$ has a better interpretation feature
- Let $Y_j$ a random observation from group $j$ ($j=1,2$). Then
\begin{eqnarray*}
  \frac{1}{n_1n_2}\text{E}\left[U_1\right]
     &=& \text{P}\left[Y_1 \geq Y_2\right].
\end{eqnarray*}

So we can estimate the probability by calculating the mean of all indicator variable values $\text{I}\left\{Y_{i1}\geq Y_{k2}\right\}$. Note, that we did $n_1 \times n_2$ comparisons

```{r}
mean(u1Hlp)
U1 / (nGroups[1] * nGroups[2])
```

- Probability $\text{P}\left[Y_1 \geq Y_2\right]$ is referred to as the *probabilistic index*.
- It is the probability that a random observation of the first group is larger or equal than a random observation of the second group
- If $H_0$ holds $\text{P}\left[Y_1 \geq Y_2\right]=\frac{1}{2}$.

- R function `wilcox.test` does not return the Wilcoxon rank sum statistic. It returns the Mann-Whitney statistic $U_1$.
- Lets revisit the result
```{r}
wTest <- wilcox.test(cholest ~ group, data = chol)
wTest
U1
probInd <- wTest$statistic / prod(nGroups)
probInd
```

Because $p=$ `r format(wTest$p.value,digits=3)` $<0.05$ we conclude at the $5\%$ significance level that the mean cholesterol level of hart patients is larger then that of healthy subjects.

  - Note that we have assumed that the location-shift model is valid in this conclusion.
  - We also know that higher cholesterol level are more likely for hart patients then for healthy subjects and this probability is
$U1/(n_1\times n_2)=$ `r probInd*100`%.
  - We should assess the location shift assumption. But this is not possible with only 5 observations.

Without the location-shift assumption the conclusion in terms of the probabilistic index remains valid!

  - So when we do not assume location shift we test for

\[H_0: F_1=F_2 \text{ vs } H_1: P[Y_1 \geq Y_2] \neq 0.5.\]


## Conclusion

There is a significant difference in the distribution of the cholesterol concentration of hart patients two days upon a stroke and that of healthy subject ($p=$ `r format(wTest$p.value,digits=3)`). It is more likely to observe higher cholesterol levels for hart patients then for healthy subjects. The point estimator for this probability is `r probInd*100`%.

+
---
title: "9. Nonparametric Statistics - Wilcoxon-Mann-Withney test"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
  include = TRUE, comment = NA, echo = TRUE,
  message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
library(Rmisc)
set.seed(140)
```

<iframe width="560" height="315" src="https://www.youtube.com/embed/W7Jou4ZY8Vg?si=YVGWEeM7PaEtQmcZ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

# Introduction

Inference was only correct if distributional assumptions were satisfied

- e.g Normal distribution
- equal variance

-  The $p$-value: $\text{P}_0\left[ \vert T\vert \geq \vert t \vert \right]$.

	- Calculated using the null distribution of $T$ that we derived under the assumptions
	- In correct if assumptions are violated

-  $95\%$ CI also builds upon these assumptions. If they are invalid then the intervals will not contain the population parameter with 95% probability.

- Asymptotic theory is more difficult to place: the $t$-test is asymptotically non-parametric because for very large samples the distributional assumptions of normality are no longer important.

- If assumptions hold the parametric approach

	- more efficient: larger power with same sample size + smaller CI.
	- more flexible: easier to analyse data with complex designs

---

## Cholesterol example

- Cholesterol concentration in blood measured for
  - 5 patients (group=1) two days upon a stroke
  - 5 healthy subject (groep=2).

- Is cholesterol concentration of hart patients and healthy subjects on average different?

```{r}
chol <- read_tsv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/chol.txt")
chol$group <- as.factor(chol$group)
nGroups <- table(chol$group)
n <- sum(nGroups)
chol
```

---


```{r, echo=FALSE, fig.align='center'}
chol %>% ggplot(aes(x = group, y = cholest)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = "jitter")

chol %>% ggplot(aes(sample = cholest)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~group)
```

- Possibly outliers
- Difficult to assess distributional assumptions when only 5 observations are available.


# Rank Tests

- Important group of non-parametric test
  - Non-parametric,
  - Exact $p$-values using a permutation null distribution.
  - No need for separate permutation distribution for each new dataset.
  - Permutation null distribution of rank tests only depends on sample size
  - Robust to outliers

---

# Ranks

Rank tests start from rank-transformed data.

- Let $Y_1, \ldots, Y_n$.
- In the absence of *ties*
  $$R_i=R(Y_i) = \#\{Y_j: Y_j\leq Y_i; j=1,\ldots, n\}$$
- Smallest observation has rank 1, second smallest rank 2, ... , largest observation gets rank $n$

```{r}
chol$cholest
rank(chol$cholest)
```

---

## Ties

Sometimes *ties* occur: two observations with identical values

```{r}
withTies <- c(403, 507, 507, 610, 651, 651, 651, 830, 900)
rank(withTies)
```

- Ties: 507 occurs twice, 651 occurs 3 times
- If ties occur *midranks* are used.

- **midrank** of observation $Y_i$ becomes
  \begin{eqnarray*}
   R_i &=& \frac{ \#\{Y_j: Y_j\leq Y_i\} + ( \#\{Y_j: Y_j < Y_i\} +1)}{2}.
   \end{eqnarray*}

---

## Ranks of pooled sample

- Let $Y_{ij}$, $i=1,\ldots, n_j$ be observations from two treatment groups $j=1,2$.
- They can also be represented by $Z_1,\ldots, Z_n$ ($n=n_1+n_2$), the outcomes of the pooled sample

```{r}
t(chol)
z <- chol$cholest
z
rank(z)
```
---

# Wilcoxon-Mann-Whitney Test

 Simultaneously developed by Wilcoxon, and,  Mann and Whitney:  **Wilcoxon-Mann-Whitney**, **Wilcoxon rank sum test**  or **Mann-Whitney U test**

## Hypotheses

Under $H_0$ the distributions of the two groups are equal
$$H_0: f_1=f_2$$


Under the alternative $H_1$ the distributions differ in location $$H_1: \mu_1\neq \mu_2$$

$H_1$ assumes **location-shift**, we will relax this assumption later on.

## Test statistic

Classic T-test: difference in sample means $\bar{Y}_1-\bar{Y}_2$.

Here: Difference in sample means based on rank transformed data

Ranks based on the pooled sample (upon joining the observations from the two groups): $R_{ij}=R(Y_{ij})$ is de rank of observation $Y_{ij}$ in the pooled sample.

\[
  T = \frac{1}{n_1}\sum_{i=1}^{n_1} R(Y_{i1}) - \frac{1}{n_2}\sum_{i=1}^{n_2} R(Y_{i2}) .
\]

- Under $H_0$ we expect the average rank of the first group to be close to that of the second group so $T$ is close to zero.

- Under $H_1$ we expect the mean ranks to differ so that $T$ deviates from zero.

- It is sufficient to only calculate
  $$S_1=\sum_{i=1}^{n_1} R(Y_{i1})$$.

- $S_1$ is the sum of the ranks of the first group: *rank sum test*.

- This holds because
\[
  S_1+S_2 = \text{sum of all ranks} = 1+2+\cdots + n=\frac{1}{2}n(n+1).
\]


- $S_1$ (or $S_2$) is a good test statistic

- Use permutations to determine the exact permutation distribution. (Permute the ranks between the groups)

- For a given $n$ and no *ties* the rank transformed data is always
  $$1, 2, \ldots, n$$
- For given $n_1$ en $n_2$ the permutation distribution is always the same!
- With current computing power this is not so important any more.

---

## Standardized statistic

Often the standardized test statistic is used
\[
  T = \frac{S_1-\text{E}_{0}\left[S_1\right]}{\sqrt{\text{Var}_{0}\left[S_1\right]}},
\]

- with $\text{E}_{0}\left[S_1\right]$ and $\text{Var}_{0}\left[S_1\right]$ the expect mean and variance of S1 under $H_0$.

- Under $H_0$
 \[
   \text{E}_{0}\left[S_1\right]= \frac{1}{2}n_1(n+1) \;\;\;\;\text{ en }\;\;\;\; \text{Var}_{0}\left[S_1\right]=\frac{1}{12}n_1n_2(n+1).
 \]

- Under $H_0$ and when $\min(n_1,n_2)\rightarrow \infty$
 \[
    T = \frac{S_1-\text{E}_{0}\left[S_1\right]}{\sqrt{\text{Var}_{0}\left[S_1\right]}} \rightarrow N(0,1).
 \]

Asymptotically the standardised statistic follows a standard normal distribution!

---

## Cholesterol example

We illustrate the result for the cholesterol example using the R function `wilcox.test`.
```{r}
wilcox.test(cholest ~ group, data = chol)
```

- We reject $H_0$ ($p=$ `r format(wilcox.test(cholest~group,data=chol)$p.value,digits=2)` $<0.05$)

- The output shows $W=$ `r wilcox.test(cholest~group,data=chol)$statistic`?

- Lets calculate
```{r}
S1 <- sum(rank(chol$cholest)[chol$group == 1])
S1
S2 <- sum(rank(chol$cholest)[chol$group == 2])
S2
```

- Where does $W=$ `r wilcox.test(cholest~group,data=chol)$statistic` comes from?

---

## Mann and Whitney test

Mann and Whitney test in absence of ties:
\[
 U_1 = \sum_{i=1}^{n_1}\sum_{k=1}^{n_2} \text{I}\left\{Y_{i1}\geq Y_{k2}\right\}.
\]

- with $\text{I}\left\{.\right\}$ an indicator that equals 1  if the expression is true and is zero otherwise.

- U counts how many times an observation of the first group is larger or equal to an observation from the second group.

```{r}
y1 <- subset(chol, group == 1)$cholest
y2 <- subset(chol, group == 2)$cholest
u1Hlp <- sapply(y1, function(y1i, y2) {
  y1i >= y2
}, y2 = y2)
colnames(u1Hlp) <- y1
rownames(u1Hlp) <- y2
```

```{r}
u1Hlp
U1 <- sum(u1Hlp)
U1
```

It can be shown that $U_1 = S_1 - \frac{1}{2}n_1(n_1+1).$

```{r}
S1 - nGroups[1] * (nGroups[1] + 1) / 2
```

1. $U_1$ en $S_1$ contain the same information
2. $U_1$ is also a rank statistic, and
3. Exact test based on $U_1$ and $S_1$ are equivalent.

---

## Probabilistic index

- $U_1$ has a better interpretation feature
- Let $Y_j$ a random observation from group $j$ ($j=1,2$). Then
\begin{eqnarray*}
  \frac{1}{n_1n_2}\text{E}\left[U_1\right]
     &=& \text{P}\left[Y_1 \geq Y_2\right].
\end{eqnarray*}

So we can estimate the probability by calculating the mean of all indicator variable values $\text{I}\left\{Y_{i1}\geq Y_{k2}\right\}$. Note, that we did $n_1 \times n_2$ comparisons

```{r}
mean(u1Hlp)
U1 / (nGroups[1] * nGroups[2])
```

- Probability $\text{P}\left[Y_1 \geq Y_2\right]$ is referred to as the *probabilistic index*.
- It is the probability that a random observation of the first group is larger or equal than a random observation of the second group
- If $H_0$ holds $\text{P}\left[Y_1 \geq Y_2\right]=\frac{1}{2}$.

- R function `wilcox.test` does not return the Wilcoxon rank sum statistic. It returns the Mann-Whitney statistic $U_1$.
- Lets revisit the result
```{r}
wTest <- wilcox.test(cholest ~ group, data = chol)
wTest
U1
probInd <- wTest$statistic / prod(nGroups)
probInd
```

Because $p=$ `r format(wTest$p.value,digits=3)` $<0.05$ we conclude at the $5\%$ significance level that the mean cholesterol level of hart patients is larger then that of healthy subjects.

  - Note that we have assumed that the location-shift model is valid in this conclusion.
  - We also know that higher cholesterol level are more likely for hart patients then for healthy subjects and this probability is
$U1/(n_1\times n_2)=$ `r probInd*100`%.
  - We should assess the location shift assumption. But this is not possible with only 5 observations.

Without the location-shift assumption the conclusion in terms of the probabilistic index remains valid!

  - So when we do not assume location shift we test for

\[H_0: F_1=F_2 \text{ vs } H_1: P[Y_1 \geq Y_2] \neq 0.5.\]


## Conclusion

There is a significant difference in the distribution of the cholesterol concentration of hart patients two days upon a stroke and that of healthy subject ($p=$ `r format(wTest$p.value,digits=3)`). It is more likely to observe higher cholesterol levels for hart patients then for healthy subjects. The point estimator for this probability is `r probInd*100`%.
