generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
04_2_pertussis.Rmd
137 lines (93 loc) · 4.21 KB
/
04_2_pertussis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Exercise 4.2: Exploring the pertussis dataset"
author: "Lieven Clement, and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Aim of this exercise
An exploratory data analysis is a crucial step in a data analysis to get
insight in the nature and distribution of the data, and to assess the
assumptions of the downstream data analysis.
In this exercise you will acquired the skills to conduct a data exploration for a two group comparison in R and to interpret the results.
# Background
Researchers wanted to study the immune response on pertussis.
They have set up an experiments with 40 rats.
16 rats were infected with pertussis and 24 rats received a control treatment.
Researchers measured the white blood cell concentration (WBC) in each rat (count per mm$^3$.
De data consists of two variables:
- WBC: white blood cell count (counts/mm$^3$).
- trt: treatment
- control: rat recieved control treatment
- pertussis: rat was infected with pertussis
Load the libraries
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```
## Import the dataset
Data path:
`https://raw.githubusercontent.com/statOmics/PSLSData/main/wbcon.csv`
```{r, eval=FALSE}
wbcon <-
```
```{r, eval=FALSE}
glimpse(wbcon)
```
## Aim of the study
The overarching goal of this study was to assess if the white blood cell count changes upon pertussis infection. To this end, researchers randomized 40 rats
to two treatments: A control treatment and a treatment in which the rat was infected with pertussis.
We will explore the data to get insight on the impact of the pertussis infection on the white blood cell count.
A secondary goal of the data exploration is to assess assumptions that will be required to use a formal statistical test to assess if the white blood cell count is on average different between infected and control rats (see later exercises).
For this test to be valid, we have to assess following assumptions:
1. The data in each treatment group are normally distributed.
2. The data from the two treatment groups has the same variance.
# Data visualization
A crucial first step in a data analysis is to visualize and to explore the raw data.
## Histogram
First, make a histogram of the data. Fill in the
missing parts in the chunk of code below:
```{r, eval=FALSE}
wbcon %>%
ggplot(aes(x=...,fill=...)) + ## fill in the correct values for x and fill
geom_histogram() +
facet_grid(rows = vars(...)) + ## fill in to put the histograms for both treatment conditions in a separate row
theme_bw() +
xlab("relative abundance (%)")
```
Based on this plot, it seems that the white blood cell count
is ... for rats that were randomized to the pertussis infection than rats that received the control treatment.
## Boxplot
However, given the small sample size a better option to visualize these data are `boxplots`:
```{r, eval=FALSE}
wbcon %>%
ggplot(aes(x=...,y=...,fill=...)) +
```
What do you observe?
## QQ-plots
To assess the assumption that the data are normally distributed in each treatment group, we will use QQ plots.
```{r, eval=FALSE}
wbcon %>%
ggplot(aes(sample=...)) +
... +
... +
facet_grid(...)
```
What do you observe?
# Descriptive statistics
Here, we will generate some informative descriptive statistics
for the dataset.
We first summarize the data and calculate the mean, standard
deviation, number of observations and standard error and store the
result in an object wbcSum via 'wbcSum<-`
1. We pipe the `wbcon` dataframe to the group_by function to group
the data by treatment groep `group_by(trt)`
2. We pipe the result to the `summarize()` function to summarize
the "WBC" variable and calculate the mean, standard deviation and
the number of observations
3. We pipe the result to the `mutate` function to make a new
variable in the data frame that is named `se` for which we calculate the
standard error ($\sigma / n$)
```{r, eval=FALSE}
## Use the instructions from above to generate the summary statistics
...
```
This concludes the data exploration. In the next exercise sessions, we will learn how to formally
test if the observed difference in WBC between rats that were infected with pertussis and those receiving the control treatment is **statistically significant**.