generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
09-NonparametericStatistics-KruskalWallis.Rmd
172 lines (127 loc) · 5.75 KB
/
09-NonparametericStatistics-KruskalWallis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
---
title: "9. Non-parametric statistics - Kruskal Wallis"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
set.seed(140)
```
# Comparison of $g$ groups
- Extend $F$-test from a one-way ANOVA to non-parametric alternatives.
# DMH example
Assess genotoxicity of 1,2-dimethylhydrazine dihydrochloride (DMH) (EU directive)
- 24 rats
- four groups with daily DMH dose
- control
- low
- medium
- high
- Genotoxicity in liver using comet assay on 150 liver cells per rat.
- Are there differences in DNA damage due to DMH dose?
## Comet Assay:
- Visualise DNA strand breaks
- Length comet tail is a proxy for strand breaks.
![Comet assay](https://raw.githubusercontent.com/GTPB/PSLS20/gh-pages/assets/figs/comet.jpg){ width=50% }
```{r}
dna <- read_delim("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/dna.txt", delim = " ")
dna$dose <- as.factor(dna$dose)
dna
```
```{r}
dna %>%
ggplot(aes(x = dose, y = length, fill = dose)) +
geom_boxplot() +
geom_point(position = "jitter")
dna %>%
ggplot(aes(sample = length)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~dose)
```
- Strong indication that data in control group has a lower variance.
- 6 observations per group are too few to check the assumptions
```{r}
plot(lm(length ~ dose, data = dna))
```
# Kruskal-Wallis Rank Test
- The Kruskal-Wallis Rank Test (KW-test) is a non-parameteric alternative for ANOVA F-test.
- Classical $F$-teststatistic can be written as
\[
F = \frac{\text{SST}/(g-1)}{\text{SSE}/(n-g)} = \frac{\text{SST}/(g-1)}{(\text{SSTot}-\text{SST})/(n-g)} ,
\]
- with $g$ the number of groups.
- SSTot depends only on outcomes $\mathbf{y}$ and will not vary in permutation test.
- SST can be used as statistic
$$\text{SST}=\sum_{j=1}^t n_j(\bar{Y}_j-\bar{Y})^2$$
- The KW test statistic is based on SST on rank-transformed outcomes^[we assume that no *ties* are available],
\[
\text{SST} = \sum_{j=1}^g n_j \left(\bar{R}_j - \bar{R}\right)^2 = \sum_{j=1}^t n_j \left(\bar{R}_j - \frac{n+1}{2}\right)^2 ,
\]
- with $\bar{R}_j$ the mean of the ranks in group $j$, and $\bar{R}$ the mean of all ranks,
\[
\bar{R} = \frac{1}{n}(1+2+\cdots + n) = \frac{1}{n}\frac{1}{2}n(n+1) = \frac{n+1}{2}.
\]
- The KW teststatistic is given by
\[
KW = \frac{12}{n(n+1)} \sum_{j=1}^g n_j \left(\bar{R}_j - \frac{n+1}{2}\right)^2.
\]
- The factor $\frac{12}{n(n+1)}$ is used so that $KW$ has a simple asymptotic null distribution. In particular under $H_0$, given thart $\min(n_1,\ldots, n_g)\rightarrow \infty$,
\[
KW \rightarrow \chi^2_{t-1}.
\]
- The exact KW-test can be executed by calculating the permutation null distribution (that only depends on $n_1, \ldots, n_g$) to test
$$H_0: f_1=\ldots=f_g \text{ vs } H_1: \text{ at least two means are different}.$$
- In order to allow $H_1$ to be formulated in terms of means, the assumption of locations shift should be valid.
- For DMH example this is not the case.
- If location-shift is invalid, we have to formulate $H_1$ in terms of probabilistic indices:
$$H_0: f_1=\ldots=f_g \text{ vs } H_1: \exists\ j,k \in \{1,\ldots,g\} : \text{P}\left[Y_j\geq Y_k\right]\neq 0.5$$
## DNA Damage Example
```{r}
kruskal.test(length ~ dose, data = dna)
```
- On the $5\%$ level of significance we can reject the null hypothesis.
- R-functie `kruskal.test` only returns the asymptotic approximation for $p$-values.
- With only 6 observaties per groep, this is not a good approximation of the $p$-value
- With the `coin` R package we can calculate the exacte $p$-value
```{r,warning=FALSE,message=FALSE}
library(coin)
kwPerm <- kruskal_test(length ~ dose,
data = dna,
distribution = approximate(B = 100000)
)
kwPerm
```
- We conclude that the difference in the distribution of the DNA damages due to the DMH dose is extremely significantly different.
- Posthoc analysis with WMW tests.
```{r}
pairwise.wilcox.test(dna$length, dna$dose)
```
- All DMH behandelingen are significantly different from the control.
- The DMH are not significantly different from one another.
- U1 does not occur in the `pairwise.wilcox.test` output. Point estimate on probability on higher DNA-damage?
```{r, echo=FALSE}
pairWilcox <- pairwise.wilcox.test(dna$length, dna$dose)
```
```{r}
nGroup <- table(dna$dose)
probInd <- combn(levels(dna$dose), 2, function(x) {
test <- wilcox.test(length ~ dose, subset(dna, dose %in% x))
return(test$statistic / prod(nGroup[x]))
})
names(probInd) <- combn(levels(dna$dose), 2, paste, collapse = "vs")
probInd
```
Because there are doubts on the location-shift model we draw our conclusions in terms of the probabilistic index.
### Conclusion
- There is an extremely significant difference in in the distribution of the DNA-damage measurements due to the treatment with DMH ($p<0.001$ KW-test).
- DNA-damage is more likely upon DMH treatment than in the control treatment (all p=0.013, WMW-testen).
- The probability on higher DNA-damage upon exposure to DMH is 100% (Calculation of a CI on the probabilistic index is beyond the scope of the course)
- There are no significant differences in the distributions of the comit-lengths among the treatment with the different DMH concentrations ($p=$ `r paste(format(range(pairWilcox$p.value[,-1],na.rm=TRUE),digit=2),collapse="-")`).
- DMH shows already genotoxic effects at low dose.
- (Alle paarswise tests are gecorrected for multiple testing using Holm's methode).