generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
10-categoricalDataAnalysis.Rmd
935 lines (677 loc) · 32.2 KB
/
10-categoricalDataAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
---
title: "10. Categorical Data Analysis"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = TRUE
)
```
```{r libraries, message=FALSE, warning=FALSE, cache=FALSE}
library(Rmisc)
library(tidyverse)
library(NHANES)
```
# Introduction
Until now we used models for a continuous response in function of categorical or continuous predictor. Here we will focus on a categorical response.
- we study the association of a categorical outcome and a categorical predictor and
- touch upon methods to model a categorical outcome with a continuous response.
---
# Test for a proportion
## Saksen-study
- Fairly closed population (few migration)
- Probability that an unborn child is male?
```{r}
boys <- 3175
n <- 6155
```
- On `r n` unborn children `r boys` boys are observed.
- Is there a difference in the probability on the gender of an unborn child?
---
- The data are derived from a binary random variable $X$
- $X=1$ for a boy and
- $X=0$ for a girl.
- Note: count problem: the outcome is a count (number of boys)
- Formally we have considered a population of unborn children where each individual is characterized by a 0 or 1.
---
## Bernoulli Probability Mass Distribution
- Binary data can be modelled using a Bernoulli distribution:
\begin{eqnarray*}
X_i &\sim& B(\pi) \text{ with}\\
B(\pi)&=&\pi^{X_i}(1-\pi)^{(1-X_i)},
\end{eqnarray*}
- It has one model parameter $\pi$
- Expected value of $X_i$: $\text{E}[X_i]=\pi,$
- Proportion of unborn boys (children with $X=1$) in the population.
- Hence $\pi$ is the probability that a random individual from the population is a boy (an observation with $X=1$).
- The variance of Bernoulli data is also related to $\pi$.
$$\text{Var}[X_i]=\pi (1-\pi).$$
---
Some Bernoulli probability mass functions
```{r, out.width='100%', fig.asp=1, fig.align='center',echo=FALSE}
par(mfrow = c(1, 3), pty = "s")
probs <- c(0.25, .5, .75)
for (i in 1:length(probs))
{
plot(c(0, 1), c(1 - probs[i], probs[i]), ylim = c(0, 1), type = "h", xaxt = "n", xlab = "X", ylab = "Probability", main = as.expression(substitute(pi == val, list(val = probs[i]))), lwd = 3)
axis(1, at = c(0, 1))
}
```
---
- In the Saksen study 6155 observaties were sampled at random from the population.
- We can estimate $\pi$ using the sample mean:
$$\hat \pi = \bar X = \frac{\sum\limits_{i=1}^n X_i}{n},$$
```{r}
pi <- boys / n
pi
```
In our example $\bar x =$ `r boys` / `r n` = `r format(pi*100,digits=3)`%.
---
## Binomial test
- Is the observed probability `r format(pi*100,digit=3)`% that an unborn child is male, sufficient evidence to conclude that there is a higher chance on a boy than on a girl?
- Statistical test for
$$H_0: \pi=1/2 \text{ versus } H_1: \pi\neq 1/2,$$
- We need known the distribution of
- $X$ and $\bar X$
- or of the sum $S=n\bar X$.
---
- Suppose that $H_0:\pi=1/2$ is true (boys and girls have an equal frequency in the population)
- A random individual from the population has a probability of $$P(X=1)=\pi=1/2.$$ to be a boy
- Two children that are drawn independently :
- Probability of $\pi=1/2$ on a boy for the first and second child (independent from each other)
- Outcomes $(x_1, x_2)$ for both children have 4 possible combinations:
$(0,0),(0,1),(1,0)\text{ en }(1,1).$
- Each of them has a probability of $1/4 = 1/2 \times 1/2$.
- Random variable $S$ is the sum of the outcomes :
$(x_1,x_2)$|$s$|$P(S = s)$|
|:---:|:---:|:---:|
(0,0)|0|1/4|
(0,1), (1,0)|1|1/2|
(1,1)|2|1/4|
---
### General: $n$ independent samples
- Probability for $\pi$ on success ($X=1$)
- Total number of successes $S$ (sum of all values of 1) can take $n+1$ possible values
$$S=k\text{, met }k=0,\ldots,n$$
- Distribution of $S$?
\begin{equation}
P(S=k) = \left (
\begin{array}{c}
n \\
k \\
\end{array}
\right ) \pi^k (1-\pi)^{n-k}
\end{equation}
- $1-\pi$: probability on 0 for individual observation and
- binomial coefficient
\begin{equation*}
\left (
\begin{array}{c}
n \\
k \\
\end{array}
\right ) = \frac{n \times (n-1) \times ...\times (n-k+1) }{ k!} = \frac{ n!}{ k!(n-k)! }
\end{equation*}
- In R you can calculate the probabilities on each $S=k$ with the function `dbinom(k,n,p)`
---
### Binomial Distribution
S is a:
- *Binomial distributed random variable* with *Binomial probability mass function*
- Parameters
- $n$ (total number of draws from the population, is equivalent to maximal value of the outcome)
- $\pi$ (probability on `success` for each draw).
- Calculate probability on $k$ successes on $n$ independent draws each with a succes probability of $\pi$.
- For binary data X.
- e.g.: wild type vs mutant of a gene, infected or not infected with HIV, ...
- Use: Compare proportions or risks on a particular event between groups.
---
### Some Binomial probability mass functions.
```{r binoms,fig.cap='Binomial distributions.', out.width='80%', fig.asp=.8, fig.align='center',echo=FALSE}
par(mfrow = c(2, 2))
probs <- c(0.25, .5, .75)
for (i in 1:length(probs))
{
plot(0:10, dbinom(0:10, prob = probs[i], size = 10), ylim = c(0, 1), type = "h", xlab = "X", ylab = "Kans (Dichtheid)", main = as.expression(substitute(pi == val, list(val = paste(probs[i], ", nobs=10")))), lwd = 3)
}
plot(2925:3225, dbinom(2925:3225, prob = .5, size = n), type = "h", xlab = "X", ylab = "Kans (Dichtheid)", main = as.expression(substitute(pi == val, list(val = paste0("0.5, nobs=", n)))))
```
---
Test statistic $$H_0:\pi=1/2\text{ vs }H_1:\pi\neq 1/2$$
- $\bar X-1/2$ or, equivalent,
- $\Delta=n(\bar X-\pi_0)=S-s_0$.
- Distribution of the latter test statistic can be derived immediately from the Binomial distribution:
- We observe $s=$ `r boys` and thus $\delta=s-s_0=$ `r boys` $-$ `r n` $\times 0.5=$ `r boys-n/2`.
- When boys and girls are equally likely, i.e. under $H_0:\pi=1/2$, we will get following two-sided p-value:
$$p=\text{P}_0\left[S-s_0\geq \vert \delta\vert \right] + \text{P}_0\left[S-s_0\leq - \vert \delta\vert \right].$$
- Note, that we can rewrite it in terms of S.
$$p=\text{P}_0\left[S\geq s_0+ \vert \delta\vert \right] + \text{P}_0\left[S \leq s_0 - \vert \delta\vert \right].$$
---
- For the Saksen study we calculate:
\begin{eqnarray*}
\text{P}_0\left[S\geq s_0+ \vert \delta\vert \right] &=& P(S \geq 6155 \times 0.5 + \vert 3175 - 6155 \times 0.5\vert ) \\&=& P(S \geq 3175)\\
&= &P(S= 3175) + P(S=3176) + ... + P(S=6155)\\
& =& 0.0067\\\\
\text{P}_0\left[S \leq s_0 - \vert \delta\vert \right] &=& P(S \leq 6155 \times 0.5 - \vert 3175- 6155 \times 0.5\vert) \\&=& P(S \leq 2980)\\ &= &P(S=0) + ... + P(S=2980) \\
&=&0.0067
\end{eqnarray*}
- The Binomial distribution is symmetric for$\pi=1/2$:
$$\text{P}_0\left[S\geq s_0+ \vert \delta\vert \right] = \text{P}_0\left[S \leq s_0 - \vert \delta\vert \right]$$
- This is no longer the case when $\pi$ deviate from 0.5.
---
```{r}
pi0 <- 0.5
s0 <- pi0 * n
delta <- abs(boys - s0)
delta
sUp <- s0 + delta
sDown <- s0 - delta
c(sDown, sUp)
pUp <- 1 - pbinom(sUp - 1, n, pi0)
pDown <- pbinom(sDown, n, pi0)
p <- pUp + pDown
c(pUp, pDown, p)
```
---
- If $\pi= 1/2$, the probability to observe at least $\delta=$ `r delta` boys more or less than the mean under $H_0: s_0=$ `r s0` in a random sample under H_0 is only `r format(p*100,digits=3)`% is: **$p$-value of binomial test.**
- Very unlikely to observe such a large number of boys in a random sample when boys and girls have the same frequency in the population (under $H_0$).
- Hence the hypothesis that the frequency of boys and girls is the same is not supported by the data.
---
```{r, out.width='100%', fig.asp=.8, fig.align='center',echo=FALSE}
plot(s0 + seq(-150.5, 150.5, 1), dbinom(s0 + seq(-150.5, 150.5, 1), prob = .5, size = n), type = "h", xlab = "X", ylab = "Probability", main = as.expression(substitute(pi == val, list(val = paste0("0.5, nobs=", n)))), ylim = c(-.0009, 0.011))
abline(v = s0, lwd = 2, col = 4)
abline(v = boys, lwd = 2, col = 2)
abline(v = sDown, lwd = 1, col = 2, lty = 2)
text(c(sDown, s0, boys, boys), c(rep(0.011, 3), 0.01), labels = c(expression(paste(s[0] - delta)), expression(s[0]), expression(s == s[0] + delta), as.expression(substitute(s == val, list(val = boys)))), pos = 4, col = c(2, 4, 2))
text(sDown - 50, -.0005, label = "p-value", col = 2, pos = 4)
text(sUp, -.0005, label = "p-value", col = 2, pos = 4)
arrows(s0 + 1500.5, -.0009, sUp, -.0009, col = 2, lwd = 2, angle = 20, length = .1)
arrows(s0 - 1500.5, -.0009, sDown, -.0009, col = 2, lwd = 2, angle = 20, length = .1)
```
---
The test can immediately be conducted using the `binomial.test` function in R.
```{r}
binom.test(x = boys, n = n, p = pi0)
```
On the 5% significance-level we conclude that there is on average a higher probability on a unborn male child than an unborn female child.
---
### Confidence interval on a proportion
- Estimator of the proportion of boys in the population is the sample mean $\hat \pi=\bar x=$ `r format(pi,digits=3)`
- The standard error is
$$SE_{\bar x}=\sqrt{\frac{\text{Var}[X]}{n}}=\sqrt{\frac{\pi(1-\pi)}{n}}$$
- We can estimate this based on the sample: $SE_{\bar x}=\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}=$ `r format(sqrt(pi*(1-pi)/n),digits=2)`.
- 95% CI via CLT: $\hat\pi \pm 1.96 SE_{\hat\pi}.$
```{r}
se <- sqrt(pi * (1 - pi) / n)
pi + c(-1, 1) * qnorm(0.975) * se
```
---
### CI on proportion in small sample?
```{r}
CI <- binom.test(x = boys, n = n, p = pi0)$conf.int
CI
```
---
### Conclusion
- Note that the test for a proportion is equivalent to a one-sample t-test for binary data.
- For the Saksen population we conclude at the 5% significance level that the gender of unborn children is more likely to be male than female ($p=$ `r round(binom.test(x=boys,n=n,p=pi0)$p.value,3)`).
The probability that an unborn child is male equals `r format(pi*100,digits=3)`% (95% CI [`r paste(format(CI*100,digits=3),collapse=",")`]%).
---
# Test for association between two qualitative variables
## Paired observations
- 2 measurements on same individual
- e.g. before and after exposure to a chemical substance
- observe a categorical outcome before and after.
- *paired binary responses*
- Statistical analysis has to account for paired design.
---
### Example: Females choose gentle, but not healthy or macho males in Campbell dwarf hamsters (Rogovin et al. 2017)
```{r, out.width='70%', fig.asp=1, fig.align='center',echo=FALSE}
library(plotrix)
par(mar = c(0, 0, 0, 0), mai = c(0, 0, 0, 0))
plot(0, 0, axes = FALSE, xlab = "", ylab = "", col = 0, xlim = c(0, 3), ylim = c(0, 3))
rect(0, 0, 3, 3)
lines(c(1, 1), c(0, 3))
lines(c(2, 2), c(0, 3))
lines(c(1, 1), c(1, 2), lwd = 2)
lines(c(1, 1), c(1, 2), lwd = 4)
lines(c(2, 2), c(1, 2), lwd = 4)
rect(1.45, 1, 1.55, 1.3)
draw.circle(1.5, 1.36, 0.05)
lines(c(1.5, 1.5), c(1, .85))
text(1.5, 1.15, "\\VE", vfont = c("sans serif", "bold"), cex = 1.3)
rect(0.5, 1.05, 0.8, 1.15)
draw.circle(0.86, 1.1, 0.05)
lines(c(0.5, 0.35), c(1.1, 1.1))
lines(c(0.8, 0), c(1.1, 1.5), lty = 2)
text(0.65, 1.1, "\\MA", vfont = c("sans serif", "bold"), cex = 1.3)
rect(2.5, 1.95, 2.2, 1.85)
draw.circle(2.14, 1.9, 0.05)
lines(c(2.5, 2.65), c(1.9, 1.9))
lines(c(2.2, 3), c(1.9, 1.5), lty = 2)
text(2.35, 1.9, "\\MA", vfont = c("sans serif", "bold"), cex = 1.3)
```
---
- The gate is opened upon 3 minutes
- aggressive vs non-agressive male
- Each female undergoes the test twice:
- once upon stay in hostile environment (high population density, shortage of feed, a lot of competition)
- and once upon stay in gentle evironment.
```{r catHamster, echo=FALSE}
hamster <- matrix(c(3, 17, 1, 13), ncol = 2, byrow = TRUE)
rownames(hamster) <- c("hostile_agressive", "hostile_non-agressive")
colnames(hamster) <- c("friendly_agressive", "friendly_non-agressive")
hamsterTot <- matrix(0, nrow = 3, ncol = 3)
hamsterTot[1:2, 1:2] <- hamster
hamsterTot[3, 1:2] <- colSums(hamster)
hamsterTot[, 3] <- rowSums(hamsterTot[, 1:2])
hamsterLetters <- matrix(c(" (e)", " (f)", "", " (g)", " (h)", "", "", "", ""), ncol = 3, byrow = TRUE)
hamsterTot <- matrix(paste0(hamsterTot, hamsterLetters), byrow = FALSE, ncol = 3)
colnames(hamsterTot) <- c(colnames(hamster), "total")
rownames(hamsterTot) <- c(rownames(hamster), "total")
knitr::kable(hamsterTot, booktabs = TRUE)
```
---
### Absolute risk difference (ARD)
- $\pi_1=P[\text{agressive male } \vert \text{hostile}]$
- $\hat \pi_1=(e+f)/n$, with $n=e+f+g+h$.
- $\pi_0=P[\text{agressive male } \vert \text{ friendly}]$
- $\hat \pi_0=(e+g)/n$
\begin{equation*}
\widehat{\text{ARD}}=\hat\pi_1-\hat\pi_0=\frac{e+f}{n}-\frac{e+g}{n}=\frac{f-g}{n}
\end{equation*}
- Only affected by discordant pairs $f$ en $g$
---
- Standard error on ARD
\begin{equation*}
\text{SE}_{\widehat{\text{ARD}}}=\frac{1}{n}\sqrt{f+g-\frac{(f-g)^2}{n}}
\end{equation*}
- If we have large number of observations we can use the CLT to establish an $(1-\alpha)100\%$ CI on the ARD
$$\left[\widehat{\text{ARD}}-z_{\alpha/2}\text{SE}_{\widehat{\text{ARD}}},\widehat{\text{ARD}}-z_{\alpha/2}\text{SE}_{\widehat{\text{ARD}}}\right]$$
or
$$\left[\frac{f-g}{n}-\frac{z_{\alpha/2}}{n}\sqrt{f+g-\frac{(f-g)^2}{n}},\frac{f-g}{n}+\frac{z_{\alpha/2}}{n}\sqrt{f+g-\frac{(f-g)^2}{n}}\right] $$
---
```{r}
hamster <- matrix(c(3, 17, 1, 13), ncol = 2, byrow = TRUE)
rownames(hamster) <- c("hostile-agressive", "hostile-non-agressive")
colnames(hamster) <- c("friendly-agressive", "friendly-non-agressive")
f <- hamster[1, 2]
g <- hamster[2, 1]
n <- sum(hamster)
riskdiff <- (f - g) / n
riskdiff
se <- sqrt(f + g - (f - g)^2 / n) / n
se
ci <- riskdiff + c(-1, 1) * qnorm(0.975) * se
ci
```
---
\begin{equation*}
\widehat{\text{ARD}}=\frac{17-1}{34}=0.471
\end{equation*}
The absolute risk difference on choosing for an agressive male is 47.1% larger upon staying in a hostile environment that when residing in a gentle environment.
- The standard error
\begin{equation*}
\text{SE}_{\widehat{\text{ARD}}}=\frac{1}{34}\sqrt{17+1-\frac{(17-1)^2}{34}}=0.0952
\end{equation*}
- A 95\% confidence interval on this absolute risk difference is
\begin{equation*}
\left[0.471-1.96\times 0.0952,0.471+1.96\times 0.0952\right]=[0.284,0.658]
\end{equation*}
---
## McNemar test
```{r,echo=FALSE}
knitr::kable(hamsterTot, booktabs = TRUE)
```
- Assess if risk on choice for agressive male differs between residing in hostile and friendly environment.
- Only discordant pairs give information.
- $f>g$ indication against $H_0$: choice of partner not associated with environment.
- Evaluate probability that in a random discordant pair, a female chooses an agressive male upon a stay in a hostile environment.
- We estimate the probability as
$$\frac{f}{f+g}$$
---
\begin{eqnarray*}
\text{E}\left[f/(f+g)\right]&\stackrel{H_0}{=}& 0.5\\
f & \stackrel{H_0}{\sim}& \text{Binom}(n=f+g,\pi=0.5)\\
\text{SE}_{\frac{f}{f+g}} & \stackrel{H_0}{=}& \sqrt{(f+g)\times 0.5\times 0.5}=\frac{\sqrt{f+g}}{2}
\end{eqnarray*}
---
- Asymptotically a one-sample z-test (based on the Normal distribution)
\begin{equation*}
z=\frac{f-(f+g)/2}{\sqrt{f+g}/2}=\frac{f-g}{\sqrt{f+g}}
\end{equation*}
- Normal approximation is good if $$f \times g/(f+g) \geq 5$$
The **Mc Nemar test** is the analogon of the paired t-test for binary qualitative variables.
---
In R we can conduct the analysis using the `mcnemar.test` function
```{r}
mcnemar.test(hamster)
```
- We reject the null hypothesis at the 5% significance level
- We conclude that the choice of partner is extremely significantly associated with the environment.
---
- Normale approximation is not optimal and we can perform an exact test using the binomial test
```{r}
binom.test(x = f, n = f + g, p = 0.5)
```
---
### Conclusion
- We conclude that the partner choice is extremely signficantly associated with the environment ($p<0.001$).
- The probability on choosing an agressive male is on average `r format(riskdiff*100,digits=3)`% higher when a female hamster resides in a hostile environment than when she resides in a friendly environment (95% CI [`r paste(format(ci*100,digits=3),collapse=",")`]%).
---
# Unpaired observations
## Genetic association study
- Are genetic polymorphisms in the BRCA1 gene associated with breast cancer?
- Retrospective case-control study with 800 breast cancer cases en 572 controls
```{r}
brca <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/brca.csv")
head(brca)
summary(brca)
```
---
```{r echo=FALSE}
brcaHlp <- data.frame(Genotype = c("Pro/Pro", "Pro/Leu", "Leu/Leu", "Totaal"), Control = c("266 (a)", "250 (b)", "56 (c)", "572 (a+b+c)"), Cases = c("342 (d)", "369 (e)", "89 (f)", "800 (d+e+f)"), Total = c("608 (a+d)", "619 (b+e)", "145 (c+f)", "1372 (n)"))
knitr::kable(brcaHlp,
booktabs = TRUE
)
```
- In case-control study one chooses a fixed number of cases and controls and registers which exposures they had in the past.
- Retrospective study
- Impossible to assess risks and risk-differences on breast cancer because the proportion of cases and controls does not reflect that in the population!
---
```{r, tidy=FALSE,echo=FALSE}
brcaHlp=data.frame(Genotype=c("Pro/Pro","Pro/Leu","Leu/Leu","Totaal"),Controls=c("266 (a)","250 (b)","56 (c)","572 (a+b+c)"),Cases=c("342 (d)","369 (e)","89 (f)","800 (d+e+f)"),Total=c("608 (a+d)","619 (b+e)","145 (c+f)","1372 (n)"))
knitr::kable(brcaHlp,
booktabs = TRUE
)
```
- Is possible to assess probability on allel Leu/Leu
- cases: $\pi_1=f/(d+e+f)=89/800=11.1\%$
- controls:$\pi_0=c/(a+b+c)=56/572=9.8\%$
- Relative risk on exposure for cases versus controls is $11.1/9.8=1.14$.
- The probability on the Leu/Leu genotype is 14% higher for women with breast cancer than for women without borstkanker.
- It suggests an association, but it does not let us conclude how much higher the risk is on breast cancer for women with the Leu/Leu genotype as compared to the other women.
- Other statistic?
---
\begin{equation*}
Odds=\frac{p}{1-p}
\end{equation*}
with $p$ the probability on the event.
Transformation of risk with following properties:
- odds has value between 1 and $\infty$.
- odds only equals 1 for probability $p=1/2$.
- de odds increases if probability increases.
- popular in gambling: how much more likely is it to win than to loose
---
```{r, tidy=FALSE,echo=FALSE}
brcaHlp=data.frame(Genotype=c("Pro/Pro","Pro/Leu","Leu/Leu","Totaal"),Controls=c("266 (a)","250 (b)","56 (c)","572 (a+b+c)"),Cases=c("342 (d)","369 (e)","89 (f)","800 (d+e+f)"),Total=c("608 (a+d)","619 (b+e)","145 (c+f)","1372 (n)"))
knitr::kable(brcaHlp,
booktabs = TRUE
)
```
Odds on allel Leu/Leu
- Cases: $\mbox{odds}_1=\frac{ f/(d+e+f)}{(d+e)/(d+e+f)}=f/(d+e)=89/711=0.125$. The double Leu/Leu variant is 8 times less likely than other allele combinations in the breast cancer cases
- Controls: $\mbox{odds}_2=c/(a+b)=56/516=0.109$.
- Association between exposure and outcome:
$$
OR_{Leu/Leu}=\frac{\mbox{odds}_T}{\mbox{odds}_C}= \frac{f/(d+e)}{c/(a+b)}=\frac{f/(d+e)}{c/(a+b)}=1.15
$$
---
```{r, tidy=FALSE,echo=FALSE}
brcaHlp=data.frame(Genotype=c("Pro/Pro","Pro/Leu","Leu/Leu","Totaal"),Controls=c("266 (a)","250 (b)","56 (c)","572 (a+b+c)"),Cases=c("342 (d)","369 (e)","89 (f)","800 (d+e+f)"),Total=c("608 (a+d)","619 (b+e)","145 (c+f)","1372 (n)"))
knitr::kable(brcaHlp,
booktabs = TRUE
)
```
- If the study would have been a random sample of the population (number of cases and controls not fixed by design) then we would be able to calculate the odds ratio on breast cancer for people with and without the double Leu/leu variant.
\begin{equation*}
OR_{case}=\frac{ \frac{ f}{c}}{ \frac{(d+e)}{(a+b)}} = \frac{f(a+b)}{c(d+e)}=OR_{Leu/Leu}=1.15,
\end{equation*}
- OR is a symmetric statistic!
- OR on borstcancer can be estimated!
- The odds on borstcancer is 15% higher for women with this specific allele combination.
---
- Is the difference large enough to generalize the effect in the sample towards the population?
- We will first rewrite the data in a 2x2 table
```{r leu4, tidy=FALSE,echo=FALSE}
brcaTab2 <- table(brca$variant2,brca$cancer)
brcaTab2 <- brcaTab2[2:1,]
brcaHlp2 <- data.frame(Genotype=c("other","Leu/Leu","Total"),Controls=c("516 (a)","56 (b)","572 (a+b)"),Cases=c("711 (c)","89 (d)","800 (c+d)"),Totaal=c("1227 (a+c)","145 (b+d)","1372 (n)"))
knitr::kable(brcaHlp2,
booktabs = TRUE
)
```
---
## Pearson Chi-square test for independent samples
- Test association between categorical exposure(e.g. variant, X) en categorical response (e.g. disease, Y).
$$H_0: \text{There is no association between } X \text{ and } Y \text{ vs } H_1: X \text{ and } Y \text{ are associated}$$
- Consider row totals $n_\text{other}=a+c$, $n_\text{leu,leu}=b+d$ and
- column totals $n_\text{contr}=a+b$ en $n_\text{case}=c+d$.
- They give information on *marginal distribution* of the exposure (variant, X) and outcome (disease, Y),
but not on the association between these variables.
- Under $H_0$ $X$ and $Y$ are independent and one expects that $(b+d)/n$
of the $a+b$ controls have a Leu/Leu variant, or that $(a+b)(b+d)/n$ of them has a Leu/Leu variant
- We can calculate this expected number $E_{ij}$ under the null hypothesis for *each cell* of the $2
\times 2$ table.
---
- $E_{11}$ = Expected number under $H_0$ in (1,1)-cell = `r sum(brcaTab2[1,])` $\times$ `r sum(brcaTab2[,1])`/`r sum(brcaTab2)` = `r format(sum(brcaTab2[1,])*sum(brcaTab2[,1])/sum(brcaTab2),digits=4)` ;
- $E_{12}$ = Expected number under $H_0$ in (1,2)-cell = `r sum(brcaTab2[1,])` $\times$ `r sum(brcaTab2[,2])`/`r sum(brcaTab2)` = `r format(sum(brcaTab2[1,])*sum(brcaTab2[,2])/sum(brcaTab2),digits=4)` ;
- $E_{21}$ = Expected number under $H_0$ in (2,1)-cell = `r sum(brcaTab2[2,])` $\times$ `r sum(brcaTab2[,1])`/`r sum(brcaTab2)` = `r format(sum(brcaTab2[2,])*sum(brcaTab2[,1])/sum(brcaTab2),digits=4)` ;
- $E_{22}$ = Expected number under $H_0$ in (2,2)-cell = `r sum(brcaTab2[2,])` $\times$ `r sum(brcaTab2[,2])`/`r sum(brcaTab2)` = `r format(sum(brcaTab2[2,])*sum(brcaTab2[,2])/sum(brcaTab2),digits=4)` ;
Test-statistic:
\begin{eqnarray*}
X^2 &=& \frac{\left (|O_{11} - E_{11}| - .5 \right)^2 }{ E_{11}} + \frac{
\left ( |O_{12} - E_{12}| - .5 \right)^2 }{E_{12} }+ \\
&&\quad\quad
\frac{ \left ( |O_{21}
- E_{21}| - .5 \right)^2 }{E_{21}}+ \frac{ \left ( |O_{22} - E_{22}| - .5
\right)^2 }{E_{22} }\\
X^2 &\stackrel{H_0}{\longrightarrow}& \chi^2(df=1)
\end{eqnarray*}
---
```{r echo=FALSE}
grid <- seq(0, 10, .1)
plot(grid, dchisq(grid, 1), type = "l", lwd = 2)
dfs <- c(1, 2, 5)
for (i in 2:3) {
lines(grid, dchisq(grid, dfs[i]), col = i, lwd = 2)
}
legend("topright", lty = 1, lwd = 2, col = 1:3, legend = sapply(dfs, function(d) as.expression(substitute(chi[df == val]^2, list(val = d)))))
```
---
- A large value of the test statistic gives an indication that the null hypothesis is false.
- The test will reject $H_0$ at the $\alpha 100\%$
significance level as soon as the observed test-statistic is larger than the $100\%(1-\alpha)$-quantile, $\chi^2_{1,
\alpha}$, of the $\chi^2_1$-distribution.
- Otherwise we do not reject $H_0$.
- The p-value of a 2-sided test is the probability to observe a larger test-statistic in a random sample under the null than what we observed in our sample.
$$p=P_0[\chi^2_1 \geq x^2]$$.
---
```{r}
expected <- matrix(0, nrow = 2, ncol = 2)
for (i in 1:2) {
for (j in 1:2) {
expected[i, j] <-
sum(brcaTab2[i, ]) * sum(brcaTab2[, j]) / sum(brcaTab2)
}
}
expected
x2 <- sum((abs(brcaTab2 - expected) - .5)^2 / expected)
1 - pchisq(x2, 1)
```
---
- Because $O_{ij}$ are discrete values, the $X^2$ can only take discrete values and the continuouss $\chi^2_1$-distribution is only an approximation of the real distribution.
- To improve the approximation one substracts 0.5 from the value in each cell:
*continuity-correction*
- We refer to this test as *Pearson Chi-squared test with Yates correction*.
- When the correction is not used (i.e. when the values of `0.5' in the estimator $X^2$ are replaced) we use the *Pearson Chi-squared test*.
---
In R you can switch the correction on or off by setting the argument `correct` on TRUE or FALSE, respectively:
```{r}
chisq.test(brcaTab2)
chisq.test(brcaTab2, correct = FALSE)
```
---
- Even with the $\chi^2_1$ correction the approximation is only valid if non of the cells has an expected count below 5 under $H_0$.
- When the $\chi^2$-approximation is invalid we will use *Fisher's exact test*.
- Null hypothesis is also that $X$ and $Y$ are independent, and, the alternative hypothesis that $X$ and $Y$ are dependent.
```{r}
fisher.test(brcaTab2)
```
---
## Extension to categorical variables with multiple levels
- The $\chi^2$-test can also be used if one of the categorical variables $X$ and $Y$ has more than 2 levels
- Again: the null hypothesis $H_0$: $X$ and $Y$ are independent (not associated), against the alternative $H_A: X$ and $Y$
are associated.
- Let the variable in the rows have $r$ different possible outcomes and the one in the columns $c$ possible outcomes, then we obtain an $r \times c$ table.
- Again we will compare the observed values in cell $(i,j)$, referred to as $O_{ij}$, with the expected number under $H_0$, $E_{ij}$
- Again $E_{ij}$ is the product of the $i^\text{th}$ row-total and the $j^\text{th}$ column total devided by the overall total.
\begin{equation*}
X^2 = \sum_{ij} \frac{\left (O_{ij} - E_{ij}\right)^2 }{ E_{ij}}
\end{equation*}
---
- We can show that the statistic follows a $\chi^2$ distribution with $(r-1) \times
(c-1)$ degrees of freedom under $H_0$.
- No continuity correction
- **Pearson $\chi^2$ test** is analogon of one-way ANOVA for qualitative variables.
---
```{r}
brcaTab <- table(brca$variant, brca$cancer)
chisq.test(brcaTab)
```
- To assess if the variant of the BRCA1 gene is associated with breast cancer we conducted a Pearson $\chi^2$test for the $3 \times 2$ table.
- The statistic is now `r format(chisq.test(brcaTab)$statistic,digits=4)` and follows a $\chi^2$ distribution with `r chisq.test(brcaTab)$parameter` degrees of freedom under $H_0$. The probability to obtain a $\chi^2$-test statistic in a random sample under $H_0$ that is more extreme than `r format(chisq.test(brcaTab)$statistic,digits=4)`, is `r format(chisq.test(brcaTab)$p.value*100,digits=2)`%.
- On the 5% level of significance we conclude that the variant of the BRCA-gene is not associated with breast cancer.
---
# Logistic regression
- Framework to model binary data (e.g cancer vs no cancer): *logistic regression model*.
- Model binary data with continuous and/or dummy variables.
- The model assumes that observations for subject $i=1,\ldots,n$ are independent and follow a Bernoulli distribution.
- The logarithm of the odds is modelled using a linear model, also referred to as linear predictor:
\begin{equation}
\left\{
\begin{array}{ccl}
Y_i&\sim&B(\pi_i)\\\\
\log \frac{\pi_i}{1-\pi_i}&=&\beta_0 + \beta_1X_{i1} + \ldots + \beta_p X_{ip}
\end{array}\right.
\end{equation}
---
## Categorical predictor
- breast cancer example: is BRCA 1 variant associated with breast cancer.
- As in the Anova context, a factor in logistic regression is coded using dummy variables.
- 1 dummy variable less than the number of groups.
- For BRCA 1 we need two dummy variables and obtain the following linear predictor:
\begin{eqnarray*}
\log \frac{\pi_i}{1-\pi_i} &=& \beta_0+\beta_1 x_{i1} +\beta_2 x_{i2}
\end{eqnarray*}
- with :
$$x_{i1} = \left\{ \begin{array}{ll}
1 & \text{ if subject $i$ is heterozygous, Pro/Leu variant} \\
0 & \text{if subject $i$ is homozygous, (Pro/Pro or Leu/Leu variant)} \end{array}\right. .$$
$$x_{i2} = \left\{ \begin{array}{ll}
1 & \text{ if subject $i$ is homozygous is the Leucine mutation: Leu/Leu } \\
0 & \text{ of subject $i$ is not homozygous in the Leu/Leu variant} \end{array}\right. .$$
- Homozygosity in the wild type allele Pro/Pro is the **reference group**.
---
We fit the model in R.
- Note that we use the function `as.factor` to convert the cancer variable to a factor variable. - We further use the function `relevel` to specify the control treatment as the reference group.
```{r}
brca$cancer <- brca$cancer %>%
as.factor() %>%
relevel("control")
brca$variant <- brca$variant %>%
as.factor() %>%
relevel("pro/pro")
brcaLogit <- glm(cancer ~ variant, data = brca, family = binomial)
summary(brcaLogit)
```
The intercept is the log-odds on cancer in the reference class (Pro/Pro) and the slope terms are log odds ratios between treatment and reference class:
\begin{eqnarray*}
\log \text{ODDS}_\text{Pro/Pro}&=&\beta_0\\\\
\log \text{ODDS}_\text{Pro/Leu}&=&\beta_0+\beta_1\\\\
\log \text{ODDS}_\text{Leu/Leu}&=&\beta_0+\beta_2\\\\
\log \frac{\text{ODDS}_\text{Pro/Leu}}{\text{ODDS}_\text{Pro/Pro}}&=&\log \text{ODDS}_\text{Pro/Leu}-\log ODDS_{Pro/Pro}\\
&=&\beta_0+\beta_1-\beta_0=\beta_1\\\\
\log \frac{\text{ODDS}_\text{Leu/Leu}}{\text{ODDS}_\text{Pro/Pro}}&=&\beta_2
\end{eqnarray*}
- The analysis allows us to interpret the result immediately in oddses and odds ratios!
---
```{r}
anova(brcaLogit, test = "Chisq")
```
The $\chi^2$-test on the logsitic regression model also indicates that there is no significant association between cancer status and the genetic variant of the BRCA gene ($p=$ `r format(anova(brcaLogit,test="Chisq")[2,"Pr(>Chi)"],digits=3)`).
De p-value is almost equivalent to the one of the $\chi^2$-test (see previous section).
---
- Significant association?
- Post-hoc tests to evaluate which of the odds ratios are different from 1.
- For BRCA1 example we did not reject the omnibus hypothesis so no post-hoc analysis
- For your reference we include the post hoc analysis so that you would have the code.
---
```{r}
suppressPackageStartupMessages({
library(multcomp)
})
posthoc <- glht(brcaLogit, linfct = mcp(variant = "Tukey"))
posthocTests <- summary(posthoc)
posthocTests
```
---
```{r}
posthocCI <- confint(posthoc)
posthocCI
```
- With the `confint` function CI are obtained on the log-odds ratios corrected for multiple testing.
---
- CI's can be backtransformed to odds ratios:
```{r}
OR <- exp(posthocCI$confint)
OR
```
- De odds ratios that we obtain is exactly equal to the one we calculated based on the contingency table:
- e.g $\text{OR}_\text{Leu/Leu-Pro/Pro}=89\times 266/(56\times 342)=$ `r format((89/56)/(342/266),digits=4)`.
- Note, that statistical inference for logistic regression relies on asymptotic theory.
---
## Continuous Predictor
- Toxicolocal effect of carbon disulfite (CS$_2$) on beetles.
- Research hypothesis is there an effect of the CS$_2$ concentration on the mortality of beetles?
Design:
- 32 independent experiments
- Each time 1 beatle is exposed to one of 8 CS$_2$ concentrations (mg/l).
- The outcome: mortality ($y=1$) or survival ($y=0$).
```{r}
beetles <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/beetles.csv")
head(beetles)
table(beetles$dose, beetles$status)
```
---
We build a model for the log odds in function of the dose $x_i$:
$$\log \frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 \times x_i.$$
```{r}
beatleModel <- glm(status ~ dose, data = beetles, family = binomial)
summary(beatleModel)
```
---
- Intercept has an interpretation of a log odds on mortality when no $\text{CS}_2$ gas is applied.
- Very small odds on mortality ($\pi/(1-\pi)=\exp(-53.2)$) so the probability is almost 0.
- Note that this is a very large extrapolation: minimum dose in dataset is `r min(beetles$dose)` mg/l.
- Estimated odds ratio for the effect of dose on the mortality probability is $\exp(0.3013)=1.35$.
- So a beatle exposed to a CS$_2$ concentration that is 1 mg/l larger than another beatle, will on average have an odds ratio on mortality of $1.35$.
---
- We conclude that this effect is very significant ($p=$ `r round(summary(beatleModel)$coef[2,4],3)`).
- Increasing the CS$_2$ dose increases the mortality.
```{r}
beetlesTab <- table(beetles) %>% data.frame()
data.frame(grid = seq(min(beetles$dose), max(beetles$dose), .1)) %>%
mutate(piHat = predict(beatleModel,
newdata = data.frame(dose = grid),
type = "response"
)) %>%
ggplot(aes(grid, piHat)) +
geom_line() +
xlab("dose") +
ylab("probability (dead)") +
geom_text(aes(x = dose %>% as.character() %>% as.double(), y = status %>% as.character() %>% as.double(), label = Freq), beetlesTab %>% filter(status == 0)) +
geom_text(aes(x = dose %>% as.character() %>% as.double(), y = status %>% as.character() %>% as.double(), label = Freq), beetlesTab %>% filter(status == 1))
```