generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
09_2_lettuce_sol.Rmd
373 lines (288 loc) · 12.9 KB
/
09_2_lettuce_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
title: "Exercise 9.2: Non-parametric statistics on the lettuce dataset"
author: "Lieven Clement and Jeroen Gilis"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# The lettuce dataset
In a previous tutorial, we analysed the dataset on
lettuce plants using ANOVA. However, it was not clear
if all the assumptions of ANOVA were met. Indeed, with
only 7 datapoints per group, it is very hard to assess
the assumptions of normality and equal variances.
Therefore, we will re-analyse the dataset by using the
non-parametric alternative to ANOVA, the `Kruskal-Wallis test`.
We will first give a concise overview of what we saw in the
ANOVA analysis, which can be found in the
`ANOVA_lettuce_plants.Rmd` file.
The researchers want to find out if biochar, compost and
a combination of both biochar and compost have an influence
on the growth of lettuce plants. To this end, they grew up
lettuce plants in a greenhouse. The pots were filled with
one of four soil types;
1. Soil only (control)
2. Soil supplemented with biochar (refoak)
3. Soil supplemented with compost (compost)
4. Soil supplemented with both biochar and compost (cobc)
The dataset `freshweight_lettuce.txt` contains the freshweight
(in grams) for 28 lettuce plants (7 per condition).
Load the required libraries
```{r, message = FALSE}
library(tidyverse)
library(car)
library(coin)
```
# Data import
```{r, message=FALSE}
lettuce <- read_csv("https://raw.githubusercontent.com/statOmics/PSLSData/main/freshweight_lettuce.txt")
# Take a glimpse at the data
glimpse(lettuce)
```
```{r}
# treatment to factor
lettuce <- lettuce %>%
mutate(treatment = as.factor(treatment))
```
# Data exploration
```{r}
# Count the number of observations per treatment
table(lettuce$treatment)
```
Now let's make a boxplot displaying the freshweight
of each treatment condition:
```{r}
lettuce %>%
ggplot(aes(x = treatment, y = freshweight, fill = treatment)) +
scale_fill_brewer(palette = "RdGy") +
theme_bw() +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2) +
ggtitle("Boxplot of the freshweigth for each treatment condition") +
ylab("freshweight (gram)") +
stat_summary(fun = mean, geom = "point", shape = 5, size = 3, color = "black", fill = "black")
```
Note that there are no clear outliers in the data.
We can see that the mean freshweight is very comparable
between the control and refoak treatments and between the
compost and cobc treatments. We can also see that the mean
freshweight is much higher in the cobc and control treatments
than in the control and refoak treatments. But is this
observed difference significant?
# ANOVA
To study whether or not the observed difference between the
average freshweight values of the differentt treatment groups
are significant, we may perform an ANOVA.
## Hypotheses
The null hypothesis of ANOVA states that:
The mean freshweigth is equal between the different treatment groups.
The alternative hypothesis of ANOVA states that:
The mean freshweigth for at least one treatment group is different
than the mean freshweight in at least one other treatment group.
## Checking the assumptions of ANOVA
Before we may proceed with the analysis, we must make sure that all
assumptions for ANOVA are met. ANOVA has three assumptions:
1. The observations are independent of each other (in all groups)
2. The data (freshweigth) must be normally distributed (in all groups)
3. The variability within all groups is similar
### Assumption of independence
The first assumption is met; we started of with 28 lettuce plants and
we randomly submitted them to one of four treatment conditions. There
is no reason to believe that the plants display systematic differences
between ttreatment groups, other than the actual treatment.
### Assumption of normality
For the second assumption, we must check normality in each group.
```{r}
## get qqplots for each individual treatment group
par(mfrow = c(2, 2))
for (i in levels(lettuce$treatment)) {
qqPlot(subset(lettuce, treatment == i)$freshweight, main = i, ylab = "")
}
```
While in the `ANOVA_lettuce_plants.Rmd` file we accepted
the assumption of normality, it must be noted that it is
tricky to assess the assumption with only 7 datapoints.
See `ANOVA_lettuce_plants.Rmd`"` for more details on this.
**Let's say that here we decide not to assume normality.**
### Assumption of equal variances
We can check the assumption of equal variance with a boxplot:
```{r}
lettuce %>%
ggplot(aes(x = treatment, y = freshweight, fill = treatment)) +
scale_fill_brewer(palette = "RdGy") +
theme_bw() +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2) +
ggtitle("Boxplot of the freshweigth for each treatment condition") +
ylab("freshweight (gram)") +
stat_summary(fun = mean, geom = "point", shape = 5, size = 3, color = "black", fill = "black")
```
As a measure of variability, we may take the height
of each boxplot's box. This is the interval between
the 25% and 75% quantile. Here we can see that this
interval, as well as the length of the whiskers, is
approximately equal for most groups. However, the
variability of cobc does seem to be quite a bit larger
than the variability in the refoak group.
**While we accepted the assumption of equal variances in the**
**`ANOVA_lettuce_plants.Rmd` file, we will here reject the**
**assumption.**
Not all assumptions for ANOVA are met. As such, we will rely
on the non-parametric alternative of ANOVA: the Kruskal-Wallis
test.
# Kruskal-Wallis rank test
If we want to test for a difference in the median of the
different treatment groups, we have to **assume a location shift**,
saying that all treatment groups follow the same distribution,
but with a different median.
However, here, we might be **not prepared for taking this assumption**.
While the range and spread of the data is similar for most groups
(see boxplot), there is a quite a big difference between the IQR of
the refoak and cobc conditions.
When we reject the assuming the location shift, we can relax the
distributional assumptions even further and perform a test in terms
of probabilistic indices (see the `Non_parametric_shrimps.Rmd` file).
With this Kruskal-Wallis test, we will test whether or not the chance
that a random value of one treatment group is larger than or equal to
("$\geq$") a random value of another treatment group is significantly
different from 50%.
## Hypotheses
- Null hypothesis:
$H0$: The distribution of freshweights of lettuce plants are equal
for all treatment conditions.
- Alternative hypothesis:
$HA$: The chance that a random value of at least one treatment group
is larger than or equal to ("$\geq$") a random value of at least
one other treatment group is significantly different from 50%.
## Test
```{r}
set.seed(1)
kwPerm <- kruskal_test(freshweight ~ treatment, lettuce,
distribution = approximate(nresample = 100000)
)
kwPerm
```
Note that here we are comparing the observed test statistic
(chi-squared = `r format(kwPerm@statistic@teststatistic, digits=5)`) with the test statistics derived from
an empirical distribution that was generated by taking 10.000
permutations of the original lettuce dataset.
We find an extremly significant (p < 1e-05) of the treatment
on the freshweight. On the 5% global significance level, we may
state that the chance that a random value of at least one treatment
group is larger than or equal to ("$\geq$") a random value of at
least one other treatment group is significantly different from 50%.
Now, we will perform a post-hoc analysis to find out which specific
groups are different from each other.
# Post-hoc analysis
We will perform a post-hoc analysis with pairwise Wilcoxon rank
sum test. As we did not want to assume the location shift, we
will interpret the outcome in terms of probabilistic indices.
Note that after the analysis, we will need to correct the acquired
p-values for multiple testing.
## Hypotheses
For each pairwise test, we have the following hypotheses:
- Null hopothesis:
$H0$: The distribution of freshweights of lettuce plants are equal
for both treatment conditions.
- Alternative hypothesis:
$HA$: The chance that a random value of treatment group 1 is larger
than or equal to ("$\geq$") a random value of treatment group 2
is significantly different from 50%.
## Test
```{r}
## initial attempt to perform the analysis
pairwise.wilcox.test(lettuce$freshweight, lettuce$treatment)
```
We get the following warning message:
`cannot compute exact p-value with ties`.
This is because the `pairwise.wilcox.test()` use the
standard `wilcox.test()` function. In the help file
of this function (?wilcox.test), we can read that in the
presence of ties in the data, the function will perform
an asymptotic test rather than an exact test.
## Test (2)
If we do want to obtain exact p-values, we may use the
`wilcox_test()` function from the `coin` package for
each pairwise combination of treatments. The obtained
p-values must be corrected for multiple testing, e.g.
with the `p.adjust()` function.
```{r}
## caluclate the p-value for each treatment combination with wilcoxon_test
treatments <- levels(lettuce$treatment)
freshweight <- lettuce$freshweight
pvalues <- combn(treatments, 2, function(x) {
## Pairwise Wilcon test
test <- wilcox_test(freshweight ~ treatment, subset(lettuce, treatment %in% x), distribution = "exact")
## Get and store p-value of test
pvalue(test)
})
## Adjust for multiple testing
pvalues_holm <- p.adjust(pvalues, method = "holm")
## link the p-value with the correct pairwise test
names(pvalues_holm) <- combn(levels(lettuce$treatment), 2, paste, collapse = "_VS_")
pvalues_holm
```
The exact p-values do indeed deviate from those calculated
with the `pairwise.wilcox.test()` function. We will proceed
with the exact p-values.
Now we will compute the point estimation for the probabilistic
index (for each pairwise comparison). Note that we already
did this in the `Non_parametric_shrimps.Rmd` file for a single comparison.
```{r}
## Count the number of observations per group
nGroup <- table(lettuce$treatment)
## Compute the probabilistic index for each pairwise combination
treatments <- levels(lettuce$treatment)
probInd <- combn(treatments, 2, function(x) {
## Compute the U1 statistic
U1 <- wilcox.test(freshweight ~ treatment, subset(lettuce, treatment %in% x))$statistic
## Compute the probabilistic index
U1 / prod(nGroup[x])
})
## link the probabilistic index with the correct pairwise test
names(probInd) <- combn(levels(lettuce$treatment), 2, paste, collapse = "_VS_")
probInd
```
We again see the same warning message. Here, this is not a
prblem, we are not interested in the p-values (which we
already computed), we only care about the probabilistic indices,
that are calculated from the U1 statistic. U1 is computed as
the number of times an observation of group 1 is larger than or
equal to an observation of group 2. This statistic is thus
defined even in the case of ties.
# Conclusion
We find an extremly significant (p < 1e-05) of the treatment
on the freshweight. On the 5% global significance level, we may
state that the chance that a random value of at least one treatment
group is larger than or equal to ("$\geq$") a random value of at
least one other treatment group is significantly different from 50%.
For the post-hoc analysis:
- We find a highly significant difference in the distributions between
the compost treatment and the control treatment. The probability
that the freshweigth of plants grown in compost soil is higher than
or equal to ("$\geq$") the freshweigth of plants grown in control
soil is 100%. This is highly significantly different from 50%
(adjusted p-value=0.003).
- We find a highly significant difference in the distributions between
the cobc treatment and the control treatment. The probability
that the freshweigth of plants grown in compost soil is higher than
or equal to ("$\geq$") the freshweigth of plants grown in control
soil is 98%. This is highly significantly different from 50%
(adjusted p-value=0.005).
- We find a highly significant difference in the distributions between
the compost treatment and the refoak treatment. The probability
that the freshweigth of plants grown in compost soil is higher than
or equal to ("$\geq$") the freshweigth of plants grown in control
soil is 100%. This is highly significantly different from 50%
(adjusted p-value=0.003).
- We find a highly significant difference in the distributions between
the cobc treatment and the refoak treatment. The probability
that the freshweigth of plants grown in compost soil is higher than
or equal to ("$\geq$") the freshweigth of plants grown in control
soil is 100%. This is highly significantly different from 50%
(adjusted p-value=0.003).
For the other contrast, we do not enough find evidence to suggest
significant differences between the treatment groups.
We may conclude that supplementing soil with compost or with
both compost and biochar has a positive effect on the freshweigth
of lettuce plants. Note that, qualitatively, these conclusion are
exactly the same as with the ANOVA analysis of the dataset in the
`ANOVA_lettuce_plants.Rmd` file.