---
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    theme: readable
    toc: yes
    toc_depth: 3
---
```{r setup0, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
# load packages for chapter
options(digits = 10)
library(arrangements)
library(bookdown)
library(kableExtra)
library(knitr)
library(pander)
library(car)
library(tidyr)
library(ggplot2)
library(latex2exp)
library(plotrix)
library(animation)
options(scipen = 999)
```
```{r ShadeLeftTail, message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
left.tail <- function(mean = 0, sd = 1, p = 0.025, lblx = "", ...){
  x <- qnorm(p = p, mean = mean, sd = sd)
  from <- mean - 3.5 * sd
  to <- mean + 3.5 * sd
  curve(dnorm(x, mean = mean, sd = sd),
        from = from,
        to = to,
        lwd = 2.5,
        xlab = "",
        ylab = "",
        ylim = c(-0.10 * dnorm(x = mean, mean = mean, sd = sd),
                 dnorm(x = mean, mean = mean, sd = sd)))
  # Add the shaded area.
  cord.x <- c(from, seq(from = from, to = x, length.out = 50), x)
  cord.y <- c(0, dnorm(x = seq(from = from, to = x, length.out = 50),
                       mean = mean, sd = sd), 0)
  polygon(cord.x, cord.y, col = "orange")
  text(x = x, y = -0.05 * dnorm(x = mean, mean = mean, sd = sd), labels = lblx)
}
```
# Confidence intervals and Hypothesis testing {#chHotest}
## Learning Objectives for Chapter
1. Describe the purpose of hypothesis testing, and in what situation it would be necessary.
1. State null and alternative hypotheses in words and equations.
1. Calculate and explain the meaning of level of significance in words and with a graph.
1. Calculate power of a test and explain it with a graph.
1. Calculate and interpret confidence intervals.
1. Describe how sample size and alpha affect CI's.
1. Explain the difference between a one-tailed and two-tailed test and determine when each one is applicable.
1. Recreate and explain a table showing the outcomes of statistical tests and the types of errors.
1. Test if the mean of a population is equal to a hypothetical mean when variance is known and unknown.
## Introduction to inference
With this chapter we enter the topic of **Inference**. Statistical inference consists of making statements about population characteristics based on information obtained from samples. We aim to present a general idea: in order to answer interesting practical questions we need to estimate parameters or other unknown and unknowable quantities, and we need to determine how confident we are about the estimated values. Frequently, the topics of estimation and inference are separated in textbooks, because estimation methods can be treated independently of the statistical distributions of the estimators and associated probabilities. Many important properties of estimators, like expected value and variance, can be worked out without assuming any particular distribution for the random components of models. One estimation method we mentioned before is *least squares*; another is *maximum likelihood*.
The main point of this chapter is to present methods to test hypotheses where unknown parameters, typically the mean of a normal distribution, are estimated and hypothesized to have specific known values. The mean is unknown so we estimate it with the average of a sample. Using the mathematical "trick" that is the most important equation in this course, we also use a single sample to estimate the sampling variance of the sample average. Armed with a known distribution shape (Normal), an estimate of its mean (the sample average), and an estimate of the variance of the sample average, we make inferences about the population. We test hypotheses of specific values of the mean by calculating how likely it would be to observe the data in hand or more extreme data if indeed the hypothesis were true. In later chapters we expand the test to two and then several means.
<br>
```{block StatementCertainty, type='stattip'}
- A main goal of statistics is to estimate unknown parameters or quantities and state the estimates and our degree of certainty about the estimate. Confidence intervals are the primary tool for this and have a simple structure that is the same across apparently diverse statistical tests:
$$CI = \text{estimate} \ \pm \ \text{margin of error}$$
```
<br>
## Testing a hypothesis
A **statistical hypothesis** is an assertion made about some parameter or characteristic of a population. By testing a statistical hypothesis we are making a statistical inference, which consists of making a statement about population characteristics based on information obtained from samples. Keep in mind that statistics, like science in general, is not meant to lead to "the truth." The scientific method and its direct implementation in statistical inference only allow us to make good guesses about which model of reality is better. Other things being equal, the models that agree more with the data are preferred.
Performing a hypothesis test consists of posing a statement about a population parameter or a "model" of some aspect or phenomenon. Then, we gather data and calculate the probability of observing such data under the assumption that the statement or model is true. If the probability is too low, we guess that the statement is not correct, and we reject it. Obviously, the whole testing procedure is a random process because it is based on a random sample, and we never know for sure if our conclusion is correct or not. We just know that the probability of incorrectly rejecting the statement is low. This is the logic of frequentist hypothesis testing. We will come back to this in more detail.
We insist on understanding the logic behind hypothesis testing in an overt attempt to correct the widespread misunderstanding and misuse of hypothesis testing in science. One of the most important points of this chapter is to understand the logic behind hypothesis testing and to be able to spot its misuse and misinterpretation. Correctly used and interpreted, hypothesis testing can be a useful tool in science and business. First, let's show the mechanics of hypothesis testing. Then we will point out pitfalls and things to watch out for.
The steps involved for conducting a hypothesis test are:
1) State null and alternative hypotheses
2) Set the level of "significance"
3) Determine/state the decision rule
4) Calculate test-statistic
5) Interpret the decision and state conclusion
### State null and alternative hypotheses
A hypothesis is stated about a population, usually involving a parameter. For example, the mean photosynthetic rate for the non-native eelgrass *Zostera japonica* found on the south side quadrant of Padilla Bay, WA is the same as the mean photosynthetic rate for *Z. japonica* eelgrass found in all quadrants in Padilla Bay.
The null hypothesis states that the mean photosynthetic rate of the eelgrass population in the south quadrant $(\mu_{south})$ is the *same* as the mean photosynthetic rate of the eelgrass population for all quadrants $(\mu_{all})$ (i.e., no difference in photosynthetic rate between the two populations). Thus, the **null hypothesis**, which is denoted by $H_o$, states that there is no difference between means.
The alternative hypothesis for this example would state that the mean photosynthetic rate of the eelgrass population in the south quadrant is *different* from the mean photosynthetic rate of eelgrass found in all quadrants. The alternative hypothesis is denoted $H_a$.
Null hypothesis:
<br>
$$H_o: \mu_{south} = \mu_{all}$$
Alternative hypothesis:
$$H_a: \mu_{south} \neq \mu_{all}$$
<br>
Above, the **alternative hypothesis** states that the mean photosynthetic rate of the south quadrant population is *not equal* to the mean photosynthetic rate of the all quadrant population. An important aspect of this statement is that it is not specified how the two population means are different; it does not say how different they are or which one is expected to be greater. The null hypothesis might be rejected regardless of which one is greater, and the test is said to be **two-tailed**. In general, null hypotheses like this one, in which means are hypothesized to be exactly equal, are not very useful. Unless we are sampling exactly the same population in the same conditions, it is unlikely that means are truly equal to an infinitesimal degree. It is more useful and informative to hypothesize that means differ by a biologically or practically relevant amount in a given direction.
The more useful and informative test of whether the mean of one population is greater than the other by a biologically or practically significant amount is referred to as a **one-tailed** test. The "tails" refer to left and right areas under the statistical distribution of sample averages, which represent extreme values relative to what would be expected if the null hypothesis were true.
Inference is strongest when the null hypothesis is rejected. Failure to reject a null hypothesis is not as strong, because it may just mean that we simply do not have enough data to "see" the difference. Because of this, the statement of equality should always be part of the null hypothesis, not of the alternative one.
- Two-tailed test:
$$H_o: \mu = \mu_0$$
$$H_a: \mu \neq \mu_0$$
- One-tailed test:
$$H_o: \mu \le \mu_0$$
$$H_a: \mu \gt \mu_0$$
- One-tailed test:
$$H_o: \mu \ge \mu_0$$
$$H_a: \mu \lt \mu_0$$
### Level of significance $\alpha$ and the p-value
Informally, the level of significance is a measure of how unusual the result has to be to reject the null hypothesis. It is a measure of how consistent the data are with the null hypothesis. This measure is the probability of observing the data in the sample or more extreme data if the null hypothesis were actually true. The details of the logic behind this **frequentist** hypothesis testing scheme are explored further down.
Setting the level of significance $(\alpha)$ determines at what calculated probability value the null hypothesis is either rejected or not rejected. The desired level must be set prior to the experiment and should not be changed, to avoid the introduction of subjective bias in the tests. Typical values are 0.05 and 0.01 or 5 and 1\%. If we changed $\alpha$ after the data are known, we could control the results of the tests by simply choosing a very low $\alpha$ for hypotheses that we do not want to reject, and a larger one for those that we feel like rejecting.
By definition, $\alpha$ is the probability of obtaining a sample with "extreme" data for a given parameter value (usually the mean of the normal distribution). Those extreme samples may actually come from a population with the hypothesized mean, but the method establishes that in those cases the null hypothesis is rejected. Therefore, $\alpha$ is the probability that a true null hypothesis will be rejected in error. The rejection of a true null hypothesis is called **Error Type I**, and it is expected to happen 5 or 1 times out of 100 tests when $\alpha$ is 0.05 or 0.01, respectively. The problem is that we will never know for sure which conclusions were incorrect.
#### What is the meaning of the p-value?
We use the term p-value to refer to the probability of observing a test statistic of equal or greater value than the one actually observed, given or assuming that the null hypothesis is correct. This is a conditional probability.
**Frequentist test of hypothesis**
If $P(\text{calculated } t \ | \ \text{True } H_0) < \alpha$ then, reject $H_0$.
Thus, the test of hypothesis is based on a conditional probability, the probability that the summary statistics or calculated t value is greater than or equal to the one observed, given that the distribution of the observations is normal and independent with mean equal to that hypothesized. For this to make more sense formally we need to distinguish between the random function or variable calculated, call it $s = (\bar{Y} - \mu_0)/S_{\bar{Y}}$ (s for "statistic") and any specific value that it takes for a sample, say $t_{calc}$ (because it is the name we generally use). In symbols:
$$\text{p-value} = P(s \ge t_{calc} \mid Y \sim N(\mu_0, \sigma))$$
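In R, this conditional probability is simply a tail area of the reference distribution. The following is a minimal sketch with made-up numbers (the calculated t of 2.1 and the sample size of 16 are purely illustrative, not from any sample in this chapter):

```{r}
# One-sided (right-tail) p-value for a calculated t statistic.
# The numbers below are illustrative only.
t.calc <- 2.1
r <- 16
pt(q = t.calc, df = r - 1, lower.tail = FALSE)
```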
The following video has a brief description of the historical origin of the p-value. Unfortunately, a scientific culture developed where the p-value lost its original meaning and resulted in one of the major issues in science today.
<a href = "https://player.pbs.org/viralplayer/3010227907/" target="_blank"> Tea and milk</a>
[@Nuzzo2014] stated:
>P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor’s new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”, presumably for the acronym it would yield.
```{block, type = 'think'}
Interesting... It looks like we are studying p-values and that they may mean “statistical hypothesis inference testing.” You may think: "I do not have time for this SHI Testing!" but you do. You are traveling the statistical times at an exciting crossroads!
```
Do not confuse the incorrect use of p-values with incorrect p-values. The formal procedures of obtaining and interpreting p-values are correct. P-values in themselves are not flawed; the problem is that they are used in ways that are not correct. A culture developed in which "statistical significance with p = 0.05" took on a life of its own. Data are analyzed with the intention to achieve this false icon, and in doing so, scientists invalidate the p-value. How should the p-value be used then?
>Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
>
> [@WassersteinLazar2016]
There are three main problems with the p-value:
1. Misunderstanding. The meaning of the p-value is widely misunderstood. Its correct meaning is stated above. The p-value **is not** the probability that the null hypothesis is true.
1. Irrelevant significance. The "culture" of science established that small p-values and "significant" results are more interesting than results that have larger p-values. In general, they should not be. But the idea of "statistically significant" became conflated with that of biological or practical significance. In reality they are different. A difference of 10 g in body mass of mature cattle may be statistically significant, but it is not biologically or practically significant.
1. Cherry-picking of results. Researchers became keen on getting "significant" results, because journals and peers want to have significant results. Because of this, the papers that are published are heavily biased to report only those experiments that show significant results. The problem is that any random experiment will always result in significant results and rejection of any null hypothesis if repeated a sufficient number of times. In the logic of null hypothesis testing (see below) this is equivalent to hypothesizing that an urn with 80 blue and 20 yellow marbles has 80% blue marbles and then taking samples of 5 marbles with replacement until one gets a sample with more than 4 yellow marbles. At that point one has a "significant" result that is published. This is not good because publications do not usually state how many times the experiment was repeated. By "cherry-picking" only the significant results for publication we contribute to a weakening of science.
### Test-statistic
In order to assess the probability of observing the data in the sample or more extreme data under the assumption that the null hypothesis is true, a test-statistic needs to be calculated. The test statistic selected depends on the specific parameter being estimated, because the statistic has to have a *sampling statistical distribution* that is appropriate. When we test hypotheses about population means we use the standard normal distribution when the variance is known or Student's t distribution when the variance is not known. For tests about values of population variances we use the $\chi^2$ distribution, whereas to compare variances we use the F distribution. Those distributions were derived mathematically by looking at what happens when samples are used to estimate means and variances.
The **sampling distribution** of a statistic is the distribution of the values that the statistic takes over infinitely many samples obtained exactly in the same manner. For example, the sample average is a statistic that has a normal distribution if the original distribution is normal, or if sample size is large, regardless of the original distribution. The sample variance is a statistic that is proportional to a $\chi^2$, whereas the ratio of two independent estimates of the sample variance has an F distribution.
When the data are normally distributed, or approximately normal given a large sample size (e.g., n >30), tests about population means with **known variance** use the z-score as the statistic. The **z-statistic** is calculated as follows:
<br>
$$z_{calc} = \frac{\bar{Y} - \mu_0}{\sigma / \sqrt{r}}$$
<br>
where z is the calculated *z-statistic*, $\bar{Y}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $\sigma$ is the known population standard deviation and r is the sample size. If the calculated z falls beyond the critical value of the standard normal distribution that leaves $\alpha$ (e.g., 5 or 1\%) in the tail or tails, the null hypothesis about the mean is rejected.
If the population variance is unknown, it can be estimated with the sample variance. When the data are normally distributed, or approximately normal given a large sample size (e.g., n >30), tests about population means with **unknown variance** use the t-score as the statistic. The t-statistic is calculated as follows:
<br>
$$t_{calc} = \frac{\bar{Y} - \mu_0}{S / \sqrt{r}}$$
where $t_{calc}$ is the calculated statistic, which under a true null hypothesis should have a Student's t distribution, and S is the sample standard deviation. The t-distribution is necessary because the unknown variance increases the uncertainty in the test. The t distribution accounts for this by having more variance than the normal, and its variance increases as the number of observations used to estimate the mean declines. This is in addition to the effect through the $\sqrt{r}$.
<br>
## Decision rule
Once the test statistic is calculated it has to be inspected to make a decision about the null hypothesis. The decision can be made in two ways that are equivalent in the result:
1. If the calculated statistic value is more extreme than the critical value, reject the null hypothesis; otherwise, fail to reject the null hypothesis. The critical value is the one that leaves $\alpha$ in the tail or tails. Critical values are quantiles: values on the x-axis of the distribution beyond which lies $100 \ \alpha$\% of the population.
1. If the tail probability associated with the calculated statistic is smaller than $\alpha$, reject the null hypothesis, otherwise, fail to reject it.
The decision rules depend on the nature of the hypothesis, because the error rate $\alpha$ has to be distributed over the tails. If the test is one-tailed, the whole $\alpha$ (e.g., 0.05) is the area under the curve on the side of rejection, but when the test is two-tailed, only $\alpha/2$ (e.g., 0.025) is in each tail.
<br>
```{block, type='stattip'}
- Never accept the null hypothesis (unless your power is very large, but that is another story).
```
<br>
```{r NormalTails, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '90%', fig.align='center', echo=FALSE, fig.cap ="Decision rules for test of hypothesis about an unknown mean using an estimated variance."}
knitr::include_graphics("images/NormalTails.png")
```
<!-- Figure \@ref(fig:prob_dist) is a visual representation of test-statistic values under a normal distribution. -->
## Test of hypothesis about a mean when variance is known
Suppose that methane production per unit milk production for US cows in the last 5 decades has a mean of 13 g/kg with variance of 9 g^2^/kg^2^. A new feed additive was tested in 16 randomly selected cows and resulted in a mean emission of 11.5 g/kg. Assuming that the treated population has the same variance as the untreated one, is there evidence that the treatment reduces emissions relative to the mean of the untreated population?
<br>
$$H_0: \mu \ge \mu_0 = 13 \ g/kg \qquad H_a: \mu < \mu_0 = 13 \ g/kg$$
The population variance is known, therefore we use a z-test.
$$\mu_0 = 13 \quad g/kg \qquad \bar{Y} = 11.5 \quad g/kg \qquad \sigma^2 = 9 \qquad r = 16\\[25pt]
z_{calc} = \frac{\bar{Y} - \mu_0}{\sigma / \sqrt{r}} = \frac{11.5 - 13}{\sqrt{\frac{9}{16}}} = `r (11.5 - 13) / sqrt(9/16)`$$
<br>
The calculated z value of `r (11.5 - 13) / sqrt(9/16)` has to be checked for "significance." We can do this in two ways that are equivalent, shown in Figure \@ref(fig:LeftTailTestFig). First, the null hypothesis indicates a one-tailed (left-tailed) test where the rejection region is the range of z values that have 5% of the area under the curve on the left of the normal distribution. Since we hypothesize that the new mean is greater than or equal to the untreated population mean, we reject it only if the calculated value is too low to be consistent with the null. The calculated z is compared to the standard normal quantile that corresponds to 0.05 on the left: `qnorm(p = 0.05)` = `r qnorm(p = 0.05)`. Because the calculated z = `r (11.5 - 13) / sqrt(9/16)` is lower than the critical value of `r qnorm(p = 0.05)`, we reject the null hypothesis.
Second, we can calculate the p-value or area under the curve on the left side of the calculated z. In R code that is `pnorm(q = -2)` = `r pnorm(q = -2)`. Because the probability is smaller than $\alpha$, we reject the null hypothesis and accept the alternative one.
<br>
```{r LeftTailTestFig, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '70%', fig.align='center', echo=FALSE, fig.cap ="Area of rejection for a left-tailed test. Calculated z is less than the critical z, therefore, the null hypothesis is rejected."}
knitr::include_graphics("images/LeftTailTestFig.png")
```
<br>
Note that when the population variance is known, the test of the mean does not require that we know the individual values in the sample. We just need to know their average.
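The example can also be reproduced with a few lines of R. This is a minimal sketch using only the numbers given in the problem (the object names are arbitrary); both decision rules, critical value and p-value, reach the same conclusion.

```{r}
# One-sample z-test of H0: mu >= 13 vs. Ha: mu < 13 with known variance.
y.bar  <- 11.5  # sample average (g/kg)
mu0    <- 13    # hypothesized mean (g/kg)
sigma2 <- 9     # known population variance
r      <- 16    # sample size
z.calc <- (y.bar - mu0) / sqrt(sigma2 / r)
z.crit <- qnorm(p = 0.05)    # left-tail critical value
p.val  <- pnorm(q = z.calc)  # left-tail area beyond the calculated z
z.calc < z.crit  # TRUE: reject H0 by the critical-value rule
p.val < 0.05     # TRUE: reject H0 by the p-value rule
```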
## Test of hypothesis about a mean when variance is unknown
We will use the same example as above, except that we will not assume that the variance is known. Instead, we will use the following values from a sample of 16 cows that received the food additive:
```{r cowMeth, echo=FALSE, message=FALSE}
set.seed(121023)
cow.meth <- data.frame(cow = 1:16,
methane = round(rnorm(16, mean = 11.5, sd = 4), 1),
". " = "")
names(cow.meth)[3] <- ""
methane <- cow.meth$methane
kable(cow.meth, caption = "Grams of methane produced per kg of milk by a sample of US dairy cows that received a new feed supplement (fictitious data).") %>%
kable_styling(full_width = FALSE, position = "float_left",
bootstrap_options = c("striped", "condensed"),
font_size = 12)
```
The sample standard deviation is used to estimate the population standard deviation. Because we are using an estimate of the standard deviation we need to use Student's t distribution.
$$\bar{Y} = `r round(mean(methane), 2)`$$
$$S_Y = `r round(sd(methane), 2)`$$
$$t_{calc} = \frac{\bar{Y} - \mu_0}{S_y / \sqrt{r}} = \frac{`r round(mean(methane), 2)` - 13.0}{`r round(sd(methane), 2)` / \sqrt{16}} = `r (mean(methane) - 13)/(sd(methane)/sqrt(16))`$$
This calculated t has to be compared to the critical value for a 5% left tail in a Student's t distribution with 16 - 1 = 15 degrees of freedom, $t_{critical} = `r qt(0.05, df = 15)`$. Because the calculated t is smaller than the critical t, we reject the null hypothesis.
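As a check on the hand calculation, the base R function `t.test()` runs the same one-sample, left-tailed test on the `methane` values from the table above (a sketch; the reported t statistic and p-value should match the calculations in this section up to rounding).

```{r}
# One-sample, left-tailed t-test of H0: mu >= 13 g/kg.
t.test(methane, mu = 13, alternative = "less")
```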
## Test of hypothesis about an unknown variance
First of all, make sure that you understand that it would not make sense to talk about a test for the variance when the variance is known! For this example we hypothesize that the variance of the population that receives the feed additive is the same as the known variance of the untreated population. We use the same sample of 16 cows above.
$$H_0: \ \sigma^2 = \sigma_0^2 = 9 \\[25pt]
H_a: \ \sigma^2 \ne \sigma_0^2 = 9$$
Under the assumption that sampling was random, that each observation is independent from the others and that the null hypothesis is true, the sum of squared deviations of a sample divided by the hypothesized variance has a $\chi^2$ distribution with r - 1 degrees of freedom. Because the null is an equality, the test is two-tailed and there is a rejection area equal to 0.025 on each side of the curve.
$$\chi^2_{calc} = \sum_{i = 1}^{16} \ \frac{(Y_i - \bar{Y})^2}{\sigma_0^2} = `r round(sum((methane - mean(methane)) ^ 2) / 9, 2)`$$
The lower critical value is the 0.025 quantile of a $\chi^2$ with 15 df: $\chi^2_{Lcritical} = `r qchisq(p = 0.025, 15)`$, whereas the upper critical value is $\chi^2_{Ucritical} = `r qchisq(p = 0.975, 15)`$. Because the calculated value falls between the critical levels, we *cannot reject* the null hypothesis.
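A minimal sketch of the same chi-square test in R, using the `methane` sample and the hypothesized variance of 9, follows; the object names are arbitrary.

```{r}
# Chi-square test of H0: sigma^2 = 9 against a two-sided alternative.
sigma2.0 <- 9
r <- length(methane)
chisq.calc <- sum((methane - mean(methane))^2) / sigma2.0
chisq.lo <- qchisq(p = 0.025, df = r - 1)  # lower critical value
chisq.hi <- qchisq(p = 0.975, df = r - 1)  # upper critical value
# TRUE means the calculated value falls between the critical values,
# so we fail to reject the null hypothesis.
chisq.calc > chisq.lo & chisq.calc < chisq.hi
```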
<br>
```{r TwoTailedChisqFig, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '60%', fig.align='center', echo=FALSE, fig.cap ="Area of rejection for a two-tailed test of an unknown variance. Calculated statistic value is greater than the lower critical level and less than the upper critical level, therefore, the null hypothesis cannot be rejected."}
knitr::include_graphics("images/TwoTailedChisq.png")
```
<br>
## Confidence Intervals for the Mean
Another way to make statistical statements without conducting a hypothesis test is to use confidence intervals. A **confidence interval** is a range of values that has a known probability of containing the true value of an unknown parameter. Just like sample averages, confidence intervals vary from sample to sample. The level of confidence selected, or the probability that a randomly selected (because the sample is random) confidence interval (CI) contains the actual value of the parameter, is the complement of $\alpha$: $1 - \alpha$. Thus, we use and refer to 95% and 99% confidence intervals. The confidence refers to the probability that any given sample yields an interval that actually brackets the parameter value. Once an interval is calculated, the parameter either is or is not in it, but we will never know which.
One way to visualize the relationship between confidence interval and population mean is to use a simulation with a population of known mean and variance. We simulate samples from the population and for each random sample we calculate a 95% confidence interval for the mean.
<br>
When the variance of the population is known the confidence intervals are calculated as follows:
$$\text{Lower Boundary} = \bar{Y} - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt r}$$
$$\text{Upper Boundary} = \bar{Y} + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt r}$$
<br>
where $z_{1-\alpha/2}$ is the critical value of the standard normal distribution, which for 95% confidence equals 1.96.
We simulate a population of horseshoe wear, where the amount of horseshoe wear has a mean of 2 mm with standard deviation equal to 0.4 mm. We pretend that we do not know the mean and take random samples of size 9 from the population using the `rnorm` R function. For each sample we calculate the average and create a 95\% confidence interval. Those intervals are plotted about the true mean to determine the proportion of intervals that actually contain the true mean. This proportion is called "coverage" and it is expected to be equal to the confidence level. The actual proportion may be different due to the random component in sampling, but it should approach the confidence level as the number of samples increases. In real samples, the difference between expected and real coverage is also affected by deviations from normality and independence assumptions.
<br>
```{r zCiCoverage, message=FALSE, fig.cap="Location of 100 confidence intervals about the true mean of a population. Population standard deviation is 0.4 and sample size = 9. Each CI is a vertical line. Red CI's do not include the true mean."}
mean <- 2
sd <- 0.4
ssize <- 9
set.seed(141149)
samples <- matrix(rnorm(n = 900, mean = mean, sd = sd),
byrow = TRUE, nrow = 100, ncol = 9)
averages <- apply(samples, 1, mean)
# Calculate CI boundaries using z distribution
# because variance is known.
ci.lo <- averages + qnorm(0.025) * sd / sqrt(ssize)
ci.hi <- averages + qnorm(0.975) * sd / sqrt(ssize)
ymin <- 0.90 * min(ci.lo)
ymax <- 1.10 * max(ci.hi)
ci.colors <- c("red", "blue")
ci.color <- ci.colors[as.numeric(ci.lo < 2 & ci.hi > 2) + 1]
plot(averages, xlab = "Sample", ylab = "95% CI",
ylim = c(ymin, ymax), col = ci.color, pch = 16)
abline(h = mean, col = "grey")
segments(x0 = 1:100, y0 = ci.lo, y1 = ci.hi, col = ci.color)
coverage <- paste("Coverage = ", as.character(sum(ci.lo < 2 & ci.hi > 2)), "%", sep = "")
text(x = 20, y = 1.05 * max(ci.hi), labels = coverage, cex = 1.2)
```
Given that the population standard deviation of the horseshoe wear is known to be 0.4 mm, and given that it is known that wear follows a normal distribution, we know that averages of samples of size r = 9 will have a normal distribution with mean equal to the true wear mean and standard deviation equal to 0.4/3. This means that, in the long run, 95% of the sample averages will fall within $1.96 \times 0.4/3 = 0.26$ mm (margin of error) of the true mean. Thus, if we extend "arms" that are 0.26 mm long from each average we calculate, the segment encompassed should contain the mean 95% of the time, in the long, long run. Just like when drawing marbles from an urn with white and blue marbles there is a certain probability that the marble drawn is blue, when we "draw" confidence intervals, there is a certain probability that they contain the mean. We make the confidence intervals such that the probability is 95% or another desired degree of certainty.
**Margin of Error of estimated mean when variance is known:**
$$MOE = z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt r}$$
Given the equation above, either decreasing the standard deviation or increasing sample size would reduce the margin of error (the length of the "arm" on each side of the average) and give a narrower range within which the true population mean lies. For any given standard deviation and sample size, the margin of error depends on the degree of certainty $1-\alpha$ selected. The greater the certainty desired, the larger the z and the wider the CI. Obviously, the CI looks for a balance between width and certainty. One can be immediately certain that the true mean is something between $-\infty$ and $+\infty$, but that is not very helpful. By observing the sample we can eliminate values that are extremely unlikely and concentrate our guess of the mean value within a narrow section of the real numbers.
**Margin of Error of estimated mean when variance is NOT known:**
$$MOE = t_{r-1, \ 1-\alpha/2} \cdot \frac{S}{\sqrt r}$$
When the variance is not known, the test statistic has a Student's t distribution that depends on sample size. The t distribution has degrees of freedom equal to the divisor used to estimate the variance, r - 1. In this case, an increase in sample size has a positive impact in reducing the margin of error through two effects (illustrated numerically after the list):
- as before, increasing r reduces the $S / \sqrt{r}$ factor, and
- as r increases, degrees of freedom increase and the critical t value decreases.
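Both effects can be seen numerically. The short sketch below computes the z-based and t-based margins of error for a few sample sizes, setting the standard deviation equal to the 0.4 of the horseshoe example so that only the critical value and $\sqrt{r}$ change; the chosen sample sizes are illustrative.

```{r}
# Margin of error with known variance (z) and estimated variance (t)
# for increasing sample sizes, holding the standard deviation at 0.4.
sigma <- 0.4
r <- c(4, 9, 16, 25, 100)
moe.z <- qnorm(p = 0.975) * sigma / sqrt(r)
moe.t <- qt(p = 0.975, df = r - 1) * sigma / sqrt(r)
round(data.frame(r = r, moe.z = moe.z, moe.t = moe.t), 3)
```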
The impact of not knowing the variance on the CI is illustrated by repeating the calculations for Figure \@ref(fig:zCiCoverage), but using each sample's standard deviation and the t distribution instead of the known variance and the z distribution. Because of the additional uncertainty introduced due to not knowing the population variance, confidence intervals in Figure \@ref(fig:tCiCoverage) are wider than in Figure \@ref(fig:zCiCoverage).
```{r tCiCoverage, message=FALSE, fig.cap="Location of 100 confidence intervals about the true mean of a population. Population variance is NOT known and sample size = 9. Each CI is a vertical line. Red CI's do not include the true mean."}
mean <- 2
sd <- 0.4
ssize <- 9
set.seed(141149)
samples <- matrix(rnorm(n = 900, mean = mean, sd = sd),
byrow = TRUE, nrow = 100, ncol = 9)
averages <- apply(samples, 1, mean)
ses <- apply(samples, 1, sd) / sqrt(ssize)
# Calculate CI boundaries using the t distribution
# because the variance is estimated from each sample.
ci.lo <- averages + qt(p = 0.025, df = ssize - 1) * ses
ci.hi <- averages + qt(p = 0.975, df = ssize - 1) * ses
ci.colors <- c("red", "blue")
ci.color <- ci.colors[as.numeric(ci.lo < 2 & ci.hi > 2) + 1]
plot(averages, xlab = "Sample", ylab = "95% CI",
ylim = c(ymin, ymax), col = ci.color, pch = 16)
abline(h = mean, col = "grey")
segments(x0 = 1:100, y0 = ci.lo, y1 = ci.hi, col = ci.color)
coverage <- paste("Coverage = ", as.character(sum(ci.lo < 2 & ci.hi > 2)), "%", sep = "")
text(x = 20, y = 1.05 * max(ci.hi), labels = coverage, cex = 1.2)
```
## Confidence Intervals for the Variance
In the section about [$\chi^2$ distribution](#chisqDist) we noted that the $\chi^2$ distribution is the distribution of a random variable created by summing the squares of multiple standard normal random variables. A little work on the formula for the sample variance shows that the estimated variance is a random variable with a distribution related to the $\chi^2$.
<br>
$$\begin{aligned}
&\hat{\sigma}^2 = S^2 = \sum_{i = 1}^r \ \frac{(Y_i - \bar{Y})^2}{r - 1} \implies S^2 \ (r - 1) = \sum_{i = 1}^r \ (Y_i - \bar{Y})^2 \\[25pt]
&\text{Because } Y_i \sim N(\mu, \sigma^2),\qquad (Y_i - \bar{Y}) \sim N(0, \sigma^2) \ \text{ and} \\[25pt]
&\frac{(Y_i - \bar{Y})}{\sigma} \sim N(0, 1) \ \text{ is a standard normal random variable, and} \\[25pt]
&\frac{S^2 \ (r - 1)}{\sigma^2} = \sum_{i = 1}^r \ \frac{(Y_i - \bar{Y})^2}{\sigma^2} \sim \ \chi^2_{r-1}
\end{aligned}$$
<br>
Based on this result, we use the $\chi^2_{r-1}$ distribution to make confidence intervals for the true variance based on the sample variance. By definition
<br>
$$P \left (\chi^2_{r-1, \ \alpha/2} \ \lt \frac{S^2 \ (r - 1)}{\sigma^2} \lt \chi^2_{r-1, \ 1 - \alpha/2} \right ) = 0.95$$
<br>
therefore, after taking the reciprocal of the expression, changing the direction of the inequalities to preserve their validity, and multiplying all parts by $S^2 \ (r - 1)$, we get
<br>
$$P \left (\frac{S^2 \ (r - 1)}{\chi^2_{r-1, \ 1 - \alpha/2}} \ \lt \sigma^2 \lt \frac{S^2 \ (r - 1)}{\chi^2_{r-1, \ \alpha/2}} \right ) = 0.95 \\[35pt]
\text{Lower Boundary} = \frac{S^2 \ (r - 1)}{\chi^2_{r-1, \ 1 - \alpha/2}} \\[35pt]
\text{Upper Boundary} = \frac{S^2 \ (r - 1)}{\chi^2_{r-1, \ \alpha/2}}$$
<br>
The construction of confidence intervals for the variance is illustrated with R and a plot of coverage is produced below. Note the asymmetry of confidence intervals, which stems from the asymmetry of the $\chi^2$ distribution.
<br>
```{r chiCiCoverage, message=FALSE, fig.cap="Location of 100 confidence intervals about the true variance of a population. Population variance is NOT known and sample size = 9. Each CI is a vertical line. Red CI's do not include the true variance"}
mean <- 2
sd <- 0.4
ssize <- 9
nsamples <- 100
set.seed(141307)
samples <- matrix(rnorm(n = ssize * nsamples, mean = mean, sd = sd),
byrow = TRUE, nrow = nsamples, ncol = ssize)
averages <- apply(samples, 1, mean)
vars <- apply(samples, 1, var)
# Calculate CI boundaries using the chi-square distribution,
# because the CI is for the variance.
lo.chisq <- qchisq(p = 0.025, df = ssize - 1)
hi.chisq <- qchisq(p = 0.975, df = ssize - 1)
ci.lo <- vars * (ssize - 1) / hi.chisq
ci.hi <- vars * (ssize - 1) / lo.chisq
ci.colors <- c("red", "blue")
ci.color <- ci.colors[as.numeric(ci.lo < sd^2 & ci.hi > sd^2) + 1]
ymin <- 0.90 * min(ci.lo)
ymax <- 1.10 * max(ci.hi)
plot(vars, xlab = "Sample", ylab = "95% CI",
ylim = c(ymin, ymax), col = ci.color, pch = 16)
abline(h = sd^2, col = "grey")
segments(x0 = 1:nsamples, y0 = ci.lo, y1 = ci.hi, col = ci.color)
coverage <- paste("Coverage = ",
as.character(
sum(ci.lo < sd^2 & ci.hi > sd^2) * 100 / nsamples),
"%", sep = "")
text(x = 20, y = 1.05 * max(ci.hi), labels = coverage, cex = 1.2)
```
<br>
We use the first sample obtained above to illustrate the calculation for a CI.
<br>
$$ Y = \{`r round(samples[1, ], 2)`\} \\[30pt]
S^2 = `r round(var(samples[1, ]), 4)` \\[30pt]
df = r-1 = 9 - 1 = 8 \\[30pt]
\chi^2_{8, \ 0.975} = `r round(qchisq(p = 0.975, df = ssize - 1), 3)` \\[30pt]
\chi^2_{8, \ 0.025} = `r round(qchisq(p = 0.025, df = ssize - 1), 3)` \\[30pt]
\text{Lower Boundary } = `r round(var(samples[1, ]) * (ssize - 1) / qchisq(p = 0.975, df = ssize - 1), 3)` \\[30pt]
\text{Upper Boundary } = `r round(var(samples[1, ]) * (ssize - 1) / qchisq(p = 0.025, df = ssize - 1), 3)` $$
<br>
### Correct and incorrect interpretation of the Confidence Interval
Example of a published incorrect interpretation of a confidence interval:
"Owing to variation, we know the sample point estimate is not the true population value. Consequently, we add to it a statement indicating how far the point estimate is likely to be from the true value. One way is to supplement the point estimate by interval estimate. We can say, for example, that from the sample evidence we are confident that the number of farmers in Boone County who sprayed was between 345 and 736. By "confident" we mean that the probability is 95 chances in 100 that the interval from 345 to 736 contains the true but unknown number of farmers in Boone County who sprayed."
This is incorrect because it takes the CI as constant and the mean as variable. The boundary values of the CI will change from one sample to another, they will not be fixed at 345 and 736. The CI is the random range that changes. We know that based on the theory we expect 95 out of 100 confidence intervals to bracket the fixed true mean, and we know that the CI we actually calculated is one of those 100. Therefore, there is a 95% chance that the one we calculated brackets the mean.
Correct interpretation of the confidence interval found in another page in the same statistics book [@SnedecorCochran1989]:
"When we make a 95% confidence interval statement, the population mean $\mu$ has a fixed but unknown value. The uncertainty attached to a confidence interval for $\mu$ comes from the variability in the sampling process. We do not know whether the confidence interval constructed from the data from any particular sample includes $\mu$ or not. We do know that in the long run 95% of the intervals will include $\mu$, and similarly, 95% of the statements that "$\mu$ lies between the lower and the upper limit"" will be correct"
The main difference that makes the first statement incorrect and the second one correct is that the second statement does not specify the values for the extremes of the confidence interval. Those values will be changing from sample to sample. Thus, the probability statement refers to a confidence interval that changes from sample to sample, not to a single specific confidence interval. In other words, when we make a hundred statements the extremes of the confidence interval change from one statement to the next. In 100 statements we expect to be correct 95 times, but all 100 statements will be different. Notice that the 95% correct rate is expected and may not be realized in any specific set of statements or samples.
## Logic of Null Hypothesis test
The structure of decision making when testing null hypotheses within a frequentist framework is not intuitive, and it has weaknesses. In this section we explain the logic with a clear example and discuss some of the issues that are common and hard to resolve.
The logic behind tests of null hypotheses is analogous to the following situation. Imagine that there are two jars with blue and yellow marbles. These jars represent the two possible cases: jar "Different" on the left is the case where we have different treatment means, whereas jar "Equal" on the right represents the case where all means are the same. One of the jars is the true situation available to be observed, the jar from which the marbles are selected at random, representing the process of collecting data. However, we do not know which one is being observed. We just know what the jars have inside, which represents the two possible states of reality, means equal or means different. The observation/experimentation process does not allow us to see the "real" jar, but we get a sample of what it has inside. Of course, we can only sample or partially observe the "real" situation. You can think of the hypothesis as being "The jar sampled (the real situation) has mostly yellow marbles." This is equivalent, in the example, to hypothesizing that the means are equal, because the jar "Equal" has 80% yellow marbles.
On the outside, the jars are identical and you cannot see inside. The "Different" jar has 90% blue marbles, whereas the "Equal" jar has only 20% blue marbles. Suppose that the number of marbles is large enough that the probabilities are not affected by the removal of marbles, or that the marbles are returned to the jar after observing them. Only one jar is available to sample from, depending on whether the true but unknown means are different or not, but you do not know which one. You sample 5 marbles and they are all blue. Did you sample from "Different" or "Equal?" The null hypothesis is that the means are equal, which in the analogy signifies that the hypothesis is that we sampled the jar on the right labeled E for "Equal." We cannot know for sure where the marbles came from, because there are both blue and yellow marbles in both jars.
<br>
<br>
```{r jars1, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '60%', fig.align='center', echo=FALSE, fig.cap ="Representation of the logic of hypothesis testing in frequentist statistics. Jar D on the left represents the situation when means of treatments are different. Jar E on the right represents the situation when means are all equal. Imagine that there is a large number of marbles in each jar or that sampling is done with replacement. Marbles in each jar represent the distribution of the test statistic under each condition. A sample of r = 5 marbles represents the process of experimentation, sampling and calculation of the observed statistic value. Intuitively, if most of the marbles sampled are blue, we will tend to say that the jar sampled was D. As an exercise, design jars with various proportions of colors and see if it becomes easier or harder to guess the sampled jar."}
knitr::include_graphics("images/UrnsAnova1.png")
```
<br>
<br>
Intuition indicates that if all 5 marbles sampled are blue, we should reject the null hypothesis. Say that you sample 5 marbles and 4 are blue. The logic of Ho testing would go something like this: "I hypothesize that we sampled the E jar with mostly yellow marbles, but I got 4 blue, and the chances of getting 4 blue marbles from the E jar are very small, so I will guess that in reality I sampled from the D jar and reject the hypothesis that I sampled from the E jar. I will guess that the means are different." In the case of the marbles, all parameters of the problem are known, so we can calculate the exact probabilities of making errors. Because sampling is with replacement (or the number of marbles is really large), if the means are all equal, i.e., if we sampled from jar E, the probability of getting a blue marble P(blue) in any draw would be constant at 0.20, and the number of blue marbles drawn would have a binomial distribution. Therefore, the probability of getting 4 or more blue marbles from E would be `r round(1 - pbinom(3, 5, 0.20), 5)`. This is the probability that we reject the null hypothesis while in fact it is true. We can call the number of blue marbles the "decision variable" or sample statistic that we use to determine whether to reject the null hypothesis. The probability of drawing 3 or more blue marbles out of 5 from E is `r round(1 - pbinom(2, 5, 0.20), 3)`, which is close to the standard $\alpha = 0.05$. Thus, we can say that we will reject the null that we sampled from E if we get 3 or more blue marbles out of 5. In this case, calculating the sample statistic and its sampling distribution is rather simple: to calculate the sample statistic we count the blue marbles, and the number of blue marbles has a binomial distribution.
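The probabilities used in this paragraph come straight from the binomial distribution and can be reproduced in R; this is a minimal sketch with the same numbers (5 draws, P(blue) = 0.20 under jar E).

```{r}
# Probability of 4 or more blue marbles in 5 draws from jar E (P(blue) = 0.20).
1 - pbinom(q = 3, size = 5, prob = 0.20)
# Probability of 3 or more blue marbles in 5 draws from jar E.
1 - pbinom(q = 2, size = 5, prob = 0.20)
```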
<br>
```{block, type = 'think'}
The proportion of marbles in the D jar was not used for anything! What if the D jar had just one more blue marble than the E jar??? This is a problem in frequentist hypothesis testing: it does not take into account the sampling distribution of statistics under plausible alternative hypotheses.
```
<br>
In many experiments the response variable of interest is not binary; the variable observed is not like the number of marbles and it does not have a binomial distribution. A slightly more complicated procedure than applying the binomial distribution is used to calculate a statistic (calculated F-value) whose theoretical distribution is known when the null hypothesis is true (F distribution). This procedure is the analysis of variance, and the statistic calculated is the calculated F-value. When all means are the same, the calculated F has an F-distribution defined by certain parameters called "degrees of freedom" that depend on sample size and number of treatments compared.
```{block, type = "stattip"}
LOGIC OF TESTS OF NULL HYPOTHESIS:
IF the assumptions are true, AND the null hypothesis is true,\
THEN, over many samples, the calculated t has a Student's t distribution with r-1 df.\
THEREFORE, the probability of $t_{calc} \gt t_{critical}$ is small.\
If $t_{calc}$ in this test is greater than $t_{critical}$ we doubt (reject) the null hypothesis.\
Essentially, we say that events with low probability did not occur.
```
```{block, type = "stattip"}
- If you try a sufficiently large number of times, any hypothesis, regardless of its truthfulness, will be rejected. When somebody tells you they rejected a hypothesis, make sure to ask how many times they tried. Remember that correctly applied statistical methods will lead to rejection of true null hypotheses at an expected rate of 5 out of 100 tests when $\alpha = 0.05$.
```
<!-- ** Add shiny for students to run simulations of draws from randomly selected urns and the be shown the true urn. Use probabilities such that they can explore both types of errors. Assign homework where they explore the rates of errors as a function sample size and probabilities.** -->
<!-- A lot of the course contents can be reduced to comparing two unknown parameters. ANOVA is used when whe have more than one pair of parameters, comparison of two population means is used when we have just two parameters, and a single population test is used when one of the values is a known number. Therefore, the concept of comparing two means generalizes to ANOVA and becomes more specific in one population mean. The equations are all versions of the same general concept: from two random variables (one for each parameter estimate) we build a single random variable (the difference between estimates) and estimate its variance. In all cases the best estimate of the variance of the experimental error is the MSE, the pooled within group variance. The variance of the new single random variable is derived directly from the MSE. For example, in one-population one of the estimated parameters has variance 0 ($\mu$ is known) and there is a single group for the other one, so the MSE is equal to the sample variance, and the variance of the estimated parameter follows direclty from the most important formula for PLS 120, which is simply a rehash of the variance for the sum of independent random variables, where independence is obtained by sampling (assumption). -->
## Types of Errors in Hypothesis testing
```{block, type = 'stattip'}
- Even when everything is done correctly and assumptions are met, we are expected to make errors in hypothesis testing. We will reject true null hypotheses and fail to reject false ones. Statistics gives us methods to estimate and set the approximate rates at which we make different types of mistakes. If assumptions are not valid, or methods are applied incorrectly, we will make mistakes with unknown frequency.
```
Two types of errors can be made when properly testing hypotheses with frequentist methods (Figure \@ref(fig:TypesOfErrors)). A true null hypothesis may be incorrectly rejected (Error type I) or a false null hypothesis may not be rejected (Error type II). The probability of making an Error type I is $\alpha$, because $\alpha$ is the probability of observing calculated statistics in the rejection range of values *when the null hypothesis is TRUE*. Therefore, by setting the value of $\alpha$ we are setting the rate of Error type I.
<br>
```{r TypesOfErrors, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '90%', fig.align='center', echo=FALSE, fig.cap ="Error types in hypothesis testing. Errors can be thought of as analogous to trial decisions where the consequence of convicting an innocent person is highly undesirable."}
knitr::include_graphics("images/TypesOfErrors.png")
```
<br>
The probability of making an Error type II is called $\beta$. Its complement, the probability of NOT making an error type II, is called "power." Thus, power = $1 - \beta$. Unlike $\alpha$, power is harder to control directly. Power in a test of the null hypothesis that a mean is equal to a preset number depends on four factors:
1. Variance of the population.
2. Sample size.
3. Difference between the true and the hypothesized mean.
4. Selected value for $\alpha$.
Power decreases with increasing variance and it increases with increasing sample size, difference between true and hypothesized mean, and $\alpha$.
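For the known-variance (z-test) case these effects can be computed directly. The sketch below uses the situation in the figure that follows (hypothesized mean 50, true mean 60) with an assumed population standard deviation of 15 and a few illustrative sample sizes; the standard deviation and sample sizes are not taken from the figure.

```{r}
# Power of a right-tailed z-test of H0: mu <= 50 when the true mean is 60.
mu0     <- 50
mu.true <- 60
sigma   <- 15    # assumed population standard deviation (illustrative)
alpha   <- 0.05
r       <- c(4, 9, 16, 25)
# Under the true mean, the calculated z is normal with mean
# (mu.true - mu0) / (sigma / sqrt(r)) and standard deviation 1;
# we reject H0 when it exceeds qnorm(1 - alpha).
power <- 1 - pnorm(qnorm(1 - alpha) - (mu.true - mu0) / (sigma / sqrt(r)))
round(data.frame(r = r, power = power), 3)
```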
<br>
```{r PowerFigure1, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '100%', fig.align='center', echo=FALSE, fig.cap ="Effect of sample size on power of a test that the mean of a population equals 50 or less when variance is known and the true mean is 60. The orange area represents the probability of error type I and the blue area is the probability of error type II."}
knitr::include_graphics("images/PowerFigure1.png")
```
<br>
## Pitfalls to avoid
Avoid the following statements or concepts:

- "The probability that the null hypothesis is correct is 0.95."
- "The probability that the null hypothesis is correct is 0.05."
- "We accept the null hypothesis."
- "There is a statistically significant difference, therefore there is something interesting and important to discuss and/or report."
## Facts to remember
- Any null hypothesis will be rejected if it is tested enough times, because there are always samples for which the null is rejected. Cure: look at all studies done, not just those published, which are heavily biased towards significance. At least, look at many published studies. Publish negative results. Publish data.
- With enough sample size almost any null hypothesis of equality will be rejected, because no two parts of the world are exactly the same. Of course, if we sample exactly the same population, the null is true and we will reject it only at the rate $\alpha$. Cure: Test whether differences are greater than a value that has practical, theoretical or biological significance.
- Originally, statistical significance was designed to be an indicator that there was probably something interesting warranting further investigation. Statistical significance was the start of the work, not the end. Cure: Value and promote replication of experiments in multiple labs.
- Data are fact, except for errors of measurement or transcription. Interpretations and models are ephemeral; they can be disproved and improved.
## Exercises and Solutions
### Exercise
Generate 200 random samples from the same population and test the null hypothesis that the mean is equal to the true value. How many times is the null hypothesis incorrectly rejected? What is the expected number of times that it will be rejected?
Solution:
We choose to select samples of size 4 from a population with a normal distribution, mean equal to 50 and variance equal to 16. We set $\alpha$ to 0.05 and perform a two-tailed test.
<br>
```{r}
# Calculate the critical z value.
zcrit <- qnorm(p = 0.05/2)
# Generate 200 samples of size 4, or 800 random normal numbers
# and format as a matrix with one column per observation and
# one row per sample.
set.seed(131649) # set seed to get always the same random numbers.
samples <- matrix(
rnorm(n = 800, mean = 50, sd = sqrt(16)),
nrow = 200, byrow = TRUE)
# Apply a function to each row. The function finds the calculated
# z score for each row and compares it to the critical z.
results <- apply(samples, 1, FUN = function(x) {
abs((mean(x) - 50) / sqrt(16/4)) > abs(zcrit)
})
# Show the vector of tests. TRUE means Ho was rejected.
results
# Print results in normal English.
print(paste0("The true null hypothesis was rejected in ", 100 * sum(results) / length(results), "% of the tests (", sum(results), " out of ", length(results), " tests)."))
```
<br>
By definition, the probability of rejecting a true null hypothesis is $\alpha$. We expect about 5% of the 200 tests, or about 10 tests, to reject the null hypothesis.
### Exercise
Generate 200 random samples from the same population and, for each one, test the null hypothesis that the mean is equal to the true value $\pm \delta$. Count how many times this false null hypothesis is rejected. Plot the proportion of times that Ho is rejected as $\delta$ increases. What should the graph look like? Compare it to the theoretical normal distribution function.
Solution
The probability of rejecting the false null hypothesis is the power of the test. When the true mean is almost equal to the hypothesized mean, the probability of rejection tends to $\alpha$. As the difference $\delta$ increases in absolute value, power increases and approaches 1.0. The shape of the curve should be the same as the cumulative distribution function of the normal, because power is the area under the right side of a normal curve as it slides away from the critical value.
<br>
```{r}
# Prepare a data frame to receive results
plot.data <- data.frame(delta = (1:200)/20, prop.rejected = NA)
# Calculate the critical z value.
zcrit <- qnorm(p = 0.05/2)
# Repeat for many values of delta
for (i in 1:200) {
# Generate 200 samples of size 4, or 800 random normal numbers
# and format as a matrix with one column per observation and
# one row per sample.
samples <- matrix(
rnorm(n = 800, mean = 50 + i / 20, sd = sqrt(16)),
nrow = 200, byrow = TRUE)
# Apply a function to each row. The function finds the calculated
# z score for each row and compares it to the critical z.
plot.data$prop.rejected[i] <-
sum(apply(samples, 1,
FUN = function(x) {abs((mean(x) - 50) / sqrt(16/4)) > abs(zcrit)}))/200
}
plot(plot.data$prop.rejected ~ plot.data$delta, type = "l",
xlab = "Difference between Ho and true mean.",
     ylab = "Power or probability of rejecting the false Ho.")
```
<br>
### Exercise
A population of pears is known to have individual pear mass with a normal distribution with mean of 200 g and standard deviation equal to 18 g. Pears that weigh less than 180 g are used for processing and the larger ones are sold fresh. What proportion of the pears is used for processing?
<br>
```{r PearRejects1, message=FALSE, warning=FALSE, out.width = '60%', fig.align='center', fig.cap ="Proportion of pears that are used for processing because they are too small."}
left.tail(mean = 200,
sd = 18,
p = pnorm(q = 180, mean = 200, sd = 18),
lblx = "180 g")
# Proportion rejected is:
pnorm(q = 180, mean = 200, sd = 18)
```
<br>
### Exercise
A population of pears is known to have individual pear mass with a normal distribution with mean of 200 g and standard deviation equal to 18 g. What should be the lightest pear accepted if the lowest 10% is to be rejected?
<br>
```{r PearRejects2, message=FALSE, warning=FALSE, out.width = '60%', fig.align='center', fig.cap ="Left 10% of pears that are to be rejected."}
left.tail(mean = 200,
sd = 18,
p = 0.10,
lblx = "what is this value?")
# The value that leaves 10% below is:
qnorm(p = 0.1, mean = 200, sd = 18)
```
<br>
### Exercise
Ten pears are selected at random from the same population. What is the probability that 3 or more will be rejected if all pears lighter than `r round(qnorm(p = 0.1, mean = 200, sd = 18), 3)` g are rejected?
Solution
The probability of rejection is constant, and pears are selected independently from each other. Therefore, the number of rejected pears in a set number of 10 trials should follow a binomial distribution. It is simpler to calculate the complement probability of rejecting 0, 1 or 2 and subtract from 1 than to calculate and sum the probabilities of 3, 4, ... 10.
$$P(r \ge 3) = 1 - P(r \le 2) = 1 - \sum_{i = 0}^2 \ C^{10}_i \ 0.10^i \ 0.90^{10-i} \\[25pt]
= 1 - \left(1 \times 0.10^0 \times 0.90^{10} + 10 \times 0.10^1 \times 0.90^{9} + 45 \times 0.10^2 \times 0.90^{8}\right) \\[25pt]
= `r round(1 - pbinom(q = 2, prob = 0.10, size = 10), 4)`$$
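The same result can be checked in R, either with the cumulative binomial distribution function `pbinom()` or by summing the individual binomial probabilities with `dbinom()`:

```{r}
# Probability that 3 or more of 10 pears are rejected when each pear
# is rejected independently with probability 0.10.
1 - pbinom(q = 2, size = 10, prob = 0.10)
# Equivalent: 1 minus the sum of the probabilities of 0, 1 and 2 rejections.
1 - sum(dbinom(x = 0:2, size = 10, prob = 0.10))
```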
## Homework
### Calculate a normal probability
A population of soybean yields in fields in Southern Brazil is known to have a normal distribution with mean equal to 3500 kg/ha and standard deviation equal to 500 kg/ha. What is the proportion of fields that have yields between 3000 and 4500 kg/ha?
This is the area under a normal pdf with mean 3500 and sd 500 and between 3000 and 4500. In R, I calculate this as the area under the curve for values lower than 4500 minus the area under the curve for values lower than 3000: `pnorm(q = 4500, mean = 3500, sd = 500) - pnorm(q = 3000, mean = 3500, sd = 500)`. If I have to use a z table where only right-tail areas are given for the SND, I reason as follows: the area below 4500 in a normal with mean 3500 and sd 500 equals the area under the SND below (4500 - 3500)/500 = 2, which is equal to 1 minus the area above 2. Find the area above 2 in the table and subtract it from 1. The result is 0.9772499. Then I find the area under the curve to the left of 3000. Standardized, 3000 becomes (3000 - 3500)/500 = -1. Because of the symmetry of the standard normal distribution, the area under the curve to the left of -1 (which is not shown in the table) is the same as the area under the curve above +1, which you can look up in the table, and is equal to 0.1586553.
Therefore, the answer to the question is 0.9772499 - 0.1586553 = 0.8185946.
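The same computation as a runnable chunk:

```{r}
# Proportion of fields with yields between 3000 and 4500 kg/ha
# for a normal distribution with mean 3500 and sd 500.
pnorm(q = 4500, mean = 3500, sd = 500) - pnorm(q = 3000, mean = 3500, sd = 500)
```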
What is the proportion of fields that have yields between 2500 and 4000 kg/ha? You can use R, a table or any other method to get the answer. Specify how you obtained the answer.
```{r eval=FALSE}
answer1 <- NA # <- delete the NA and type your answer there, to the left of the # symbol; use 4 decimals
```
### Calculate a normal quantile
Fields in the lower 10-percentile are eligible for subsidies. What is the highest yield that a field can have to qualify for a subsidy?
This can be solved by directly asking R for the quantile that corresponds to the lowest 10th percentile:
`qnorm(p = 0.10, mean = 3500, sd = 500)` = 2859.224
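Or, as a runnable chunk:

```{r}
# Yield below which 10% of the fields fall (normal with mean 3500, sd 500).
qnorm(p = 0.10, mean = 3500, sd = 500)
```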
What is the highest yield that a field can have to qualify for a subsidy?
```{r}
answer2 <- NA # <- delete the NA and type your answer there,
# on the left of the # symbol; use 4 decimals
```
### Calculate a Student's t probability
Nine rats were randomly selected from a population of laboratory rats. Each rat was tested in a spatial memory maze and the time necessary for the rat to exit the maze was recorded. Maze times recorded for the 9 rats in seconds were: 94.5, 104.7, 97.9, 103.4, 99.1, 95.2, 88.3, 105.1, 98.0.
A vector with the values can be made in R using the `c()` function. For example,
`v1 <- c(1, 4, 2)`
makes a vector with the numbers 1, 4, 2 and puts it into an object called "v1."
With the R code `t.avg <- mean(v1)` we obtain the sample average and put it into `t.avg`.
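For example, the maze times listed above can be put into a vector and summarized as just described; the object names below (`maze.times`, `t.avg`, `t.sd`) are arbitrary:

```{r}
# Maze times for the 9 rats, in seconds.
maze.times <- c(94.5, 104.7, 97.9, 103.4, 99.1, 95.2, 88.3, 105.1, 98.0)
t.avg <- mean(maze.times) # sample average
t.sd  <- sd(maze.times)   # sample standard deviation
c(average = t.avg, std.dev = t.sd)
```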
### Calculate a Student's t quantile
### Test a hypothesis about a mean when variance is known
### Test a hypothesis about a mean when variance is unknown
### Make a confidence interval for the mean with known variance
### Make a confidence interval for the mean with unknown variance
## Laboratory Exercises
### Plant Sciences Lab
"---" Unquote the three dashes
Title: "Normal distribution, T-distribution, Confidence Intervals, and One-sample Hypothesis testing"
author: "YourFirstName YourLastName"
date: "today's date here"
output: html_document
"---" Unquote the three dashes
### Instructions
For this lab you will modify and submit this file with the file name changed so it has your email ID (the part before @) in lower case instead of "email." Do not add spaces to the file name.
This is a markdown document. You will type your code and run it one line at a time as you add it in the lines indicated below. Add code **ONLY** in the areas between "\```{r}" and "\```". These areas are highlighted with a light grey color. Run each line and part to learn and experiment until you get the result you want. Keep the lines that worked and move on. At any time you can see if your document "knits" or not by clicking on the Knit HTML icon at the top. Once you have completed all work, knit your document and save the html file produced with the same filename but with an html extension (Lab02email.html).
**Submit BOTH files for your lab report using the appropriate Canvas tool**
For each part and question below, type your code in the grey area below, between the sets of back-ticks (```) to perform the desired computation and get output. Type your answers below the corresponding grey area.
```{r setup1, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Part 1. Normal Distribution R functions [10 points]
Normal Distribution
A statistical distribution is a function that associates a probability to results of random experiments. For example, the random experiment can be the flip of a coin. Let's "measure" the result by assigning 1 to a tail and 0 to a head. The result is a discrete random variable Y that can take values 0 or 1. The statistical distribution, in this case a Probability Mass Function (pmf), is a function that associates a probability to each value of Y. For the example of the coin, we usually pick a function defined as follows: P(Y = 1) = P(Y = 0) = 0.5. This reads as "the probability of Y taking a value of 1 is equal to the probability of Y taking a value of 0 and is equal to 0.5." Keep in mind that the actual pmf can and probably does differ for different individual coins. For an extreme example, remember that for a coin with two heads P(Y = 1) = 0 and P(Y = 0) = 1.0. The probability for the real coin experiment also depends on who flips the coin. It is possible to learn to get any side of the coin one wants.
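As a quick sketch of this idea, a fair coin can be simulated and the observed relative frequencies compared with the pmf above; the number of flips and the seed used here are arbitrary choices:

```{r}
# Simulate flips of a fair coin coded as 1 = tail, 0 = head and compare
# the observed relative frequencies with P(Y = 0) = P(Y = 1) = 0.5.
set.seed(1)
flips <- rbinom(n = 10000, size = 1, prob = 0.5)
table(flips) / length(flips)
```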
For continuous random variables we have statistical distributions that are called "Probability Density Functions" for which the probability of any specific number is 0, but the probability of getting a number in any interval, no matter how small, can be positive. The Normal distribution is one of those pdf's that is most used in statistics for many reasons. In this section we will explore and operate with the Normal distribution.
In the R code section below, we will read the help for "Normal" and list the functions that are explained there. Set the seed of random numbers to 39 so everyone gets the same random numbers. Create a vector called “Y5” containing 5 random numbers from the standard normal distribution using the “rnorm()” function. In the “rnorm()” function, give values to the arguments n, mean and sd by using their names as shown in the **Usage** part of the help. For example, rnorm(n = 5, mean = 0, sd = 1). Calculate the sample variance of Y5. Now, set the seed to 57 and create a new vector called “Y50” containing 50 random numbers from the standard normal distribution. Calculate the sample variance of Y50. Compare the variances. Is the relationship of the sample variances what you should have expected? What is the true variance of the populations from which you obtained the samples? Why do the calculated variances differ?
```{r}
help("Normal")
# Note, dnorm gives the probability density, pnorm gives the probability distribution function, qnorm gives the quantile function, and rnorm generates random values from the distribution.
set.seed(39)
# obtain 5 random numbers from the standard normal distribution and put into Y5
Y5 <- rnorm(n = 5,
mean = 0,
sd = 1)
# obtain 50 random numbers
Y50 <- rnorm(n = 50,
mean = 0,
sd = 1) # obtain 50 random numbers
# obtain the variances
(var5 <- var(Y5))
(var50 <- var(Y50))
```
Is the relationship of the sample variances what you should have expected? What is the true variance of the populations from which you obtained the samples? Why do the calculated variances differ?
Answers here:
### Part 2. Normal PDF [10 points]
Plot the pdf for a standard normal distribution. Modify the code and produce a graph of the pdf for a normal distribution with mean = 2 and sd = 0.7.
```{r}
# This plots the standard normal distribution.
plot(function(x) dnorm(x, mean = 0, sd = 1),
xlim = c(-5,5),
ylim = c(0, 0.5))
#Modify for mean = 2, sd = 0.7:
plot(function(x) dnorm(x, mean = 2, sd = 0.7),
xlim = c(-5,5),
ylim = c(0, 0.6))
```
### Part 3. Students's t Distribution [20 points]
When the sample size is small and the true population standard deviation is unknown, for a continuous variable that we assume follows a normal distribution, we use the t distribution instead of the standard normal.
```{r}
#plot t distribution with 100 degrees of freedom
plot(function(x) dt(x, df = 100),
xlim = c(-5,5),
ylim = c(0, 0.6))
#note the arguments “xlim” and “ylim” set the axis boundaries for the plot
```
Confidence interval from the t distribution, in relation to the alpha level: use R to determine the sample size (n), mean, standard deviation, standard error, and 95% (alpha = 0.05) confidence interval (CI) for sepal length in the iris data. Then calculate the confidence intervals at the alpha levels of 0.10 (90% CI) and 0.01 (99% CI).
```{r iris.info, echo = TRUE, include = TRUE}
myiris <- iris #import the iris data into a data frame called myiris
str(myiris) # see components and types of iris
myiris.sl <- myiris$Sepal.Length # Create sepal length vector
myiris.n <- length(myiris.sl) # get sample size
myiris.mean <- mean(myiris.sl) # get mean
myiris.sd <- sd(myiris.sl) # get standard deviation