---
title: "Time Series Cross Sectional (TSCS) Data Tutorial"
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    toc: yes
    df_print: paged
date: "June 21, 2024"
link-citations: yes
bibliography: ref.bib
---
# Introduction
This workshop offers an introduction to time series cross sectional (TSCS) data with a focus on applications in R. Some familiarity with time series cross sectional econometrics, as well as R, is assumed. We will cover some basic theoretical results, but this workshop is not intended to substitute for a more detailed theoretical treatment.
The workshop starts with some basic description of the structure of TSCS data and the risks of the pooled (OLS) estimator, then covers standard approaches to unit-level heterogeneity: the fixed effect (within) estimator and the random effect estimator. Interpretation and drawbacks of the fixed effect estimator are discussed, as well as a potential solution to the time-invariant covariate problem via penalized maximum likelihood. We discuss testing the key assumption behind the random effects model, and introduce the unifying "within-between" model.
The second section focuses on issues related to time. We start by considering the implications of a static model, then incorporate basic dynamics via the finite distributed lag and lagged dependent variable (LDV) specifications. We briefly discuss tests for serial correlation in the model errors, and introduce the Anderson-Hsiao estimator for FE estimation with an LDV.
# Setup
## Downloads
[Training Repo, including this file](https://github.com/Lucy-Family-Institute/tscs_demo.git)
[R](https://cran.r-project.org)
[Rstudio IDE](https://www.rstudio.com/products/rstudio/download/)
## Packages
*install.packages()* and *library()* are the standard functions for installing and loading packages. Here we'll load some useful packages; *pacman::p_load()* combines both steps in one function.
```{r, results='hide', message=F, warning=F}
if (!require("pacman")) install.packages("pacman", repos = "https://cloud.r-project.org")
pacman::p_load("tidyverse", "data.table", "readstata13", "plm", "bife", "survival", "brglm2", "ivreg", "lmtest", "sandwich", "tseries", "plotly")
#setwd()
set.seed(1015)
```
# Review of OLS and the Linear Model
Some clarification of terminology
A linear regression models a dependent variable $Y$ as a function of one or more independent variables ($X$) and an error term $\epsilon$
(Ordinary) Least Squares (OLS/LS) is a popular estimator of the linear model, but not the only estimator
Ridge Regression / Lasso are examples of other estimators of the linear model
From an ML perspective, linear regression is a form of supervised learning, while OLS fits it by minimizing error as measured by the sum of squared residuals
## Intuition
Create a univariate linear relationship using a Monte Carlo simulation
```{r, warning=F, message=F}
sample <- 200
x <- rnorm(sample, 0, 2)
e <- rnorm(sample, 0 , 10)
y <- 3 + x + e
univariate <- data.frame(y, x)
univariate
plot<-univariate %>%
ggplot(aes(x, y)) +
geom_point()
ggplotly(plot)
```
Estimating a univariate linear model is equivalent to drawing a straight line through this scatter plot
Can fit any number of linear models to this data - need a criterion for determining the best model
OLS posits that the best model is the one that minimizes the error between the model's predicted $Y$ and the real $Y$
OLS measures model error as the sum of squared residuals
```{r}
univariate$pred <- predict(lm(y~x, data=univariate))
univariate$resid <- univariate$y - univariate$pred
resid_plot<-univariate %>%
ggplot(aes(x, y)) +
geom_point()+
geom_smooth(method = "lm", se=F)+
geom_segment(aes(xend=x, yend = pred), color="red")
ggplotly(resid_plot)
```
## Formalization
Write the unknown, but linear, data generating process for $Y$ as
$$ Y = \beta_0 + \beta_1X + \epsilon $$
Our linear model returns estimates of coefficients, $\beta$, which we denote with $\hat{}$. We use these estimates to construct the estimate of $Y$
$$\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X $$
We define the residual $e$ as the difference between $Y$ and its estimate $\hat{Y}$, which we use as an estimate of the unknown disturbance $\epsilon$. Note that we move to estimates for an individual observation $i$ below:
$$ e_i = y_i - \hat{y}_i$$
Finally, we denote the residual sum of squares (RSS) as the sum of the squared residuals for all observations in the data
$$ RSS = \sum_{i=1}^{n}e_i^2$$
OLS estimates the coefficients $\beta$ that minimize $RSS$. For the simple bivariate case the coefficient on $X$ is simply the ratio of its covariance with $Y$ over its variance:
$$\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}$$
The constant is simply the mean of $Y$ minus the fitted slope times the mean of $X$, so the regression line passes through $(\overline{x}, \overline{y})$
$$\hat{\beta_0} = \overline{y} - \hat{\beta_1}\overline{x}$$
Given that the coefficient estimates that minimize $RSS$ are known there is no need to perform model training or tuning to estimate them - they can be calculated directly from the data
```{r}
coef(lm(y~x, data=univariate))
mean_x <- mean(x)
mean_y <- mean(y)
beta1_hat <-(sum((univariate$x - mean_x)*(univariate$y - mean_y)))/(sum((univariate$x-mean_x)^2))
beta0_hat <- mean_y - beta1_hat*mean_x
print(c(beta0_hat, beta1_hat))
```
## What's Special About RSS?
Under certain assumptions, OLS is the 'best' estimator for the linear model
*Zero conditional mean assumption* (along with some other technical assumptions) is sufficient to ensure unbiasedness ($E(\hat{\beta})=\beta$):
$$E(\epsilon|X) = 0 $$
Note that this is an assumption about the unobserved error $\epsilon$ and cannot be tested - OLS will fit residuals $e$ that are mean zero whether or not the assumption holds
The additional assumption of *Homoskedastic and Serially Uncorrelated Errors* makes OLS the Minimum Variance Unbiased Linear Estimator
$$Var(\epsilon | X) = \sigma^2 $$
$$Corr(\epsilon_t, \epsilon_s) = 0, \forall t \neq s $$
Informally, these two assumptions ensure that of all the unbiased linear estimators of the linear model, OLS has the smallest variance
This does not mean OLS is the absolute best estimator in any circumstance
If the errors satisfy the above assumptions, but are not normally distributed, then non-linear unbiased estimators may offer improvement, such as trimmed least squares
Variance may be reduced by using a biased estimator, such as ridge regression
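As a quick illustration (a sketch of our own using *MASS::lm.ridge()*, which ships with R; not part of the original workshop code), the ridge penalty shrinks the slope toward zero as the penalty grows:
```{r}
#Ridge fits for increasing penalties - lambda = 0 reproduces OLS
MASS::lm.ridge(y ~ x, data = univariate, lambda = c(0, 50, 500))
```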
## Coefficient Variance
OLS is unbiased, but any individual estimate of $\hat\beta$ may be far from $\beta$
The variance of $\hat\beta_1$ is a function of the variance of the error and the variance of $X$
Higher variance in the error makes estimation less precise while higher variance in $X$ makes it more precise:
$$ Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline{x})^2} $$
The denominator is known (it is a function of the data), while the numerator is unknown and must be estimated
An unbiased estimator of $\sigma^2$ is $RSS/(n-2)$ where $n-2$ is a degrees of freedom adjustment due to constraints of OLS
For test statistics, we work with the standard error of the coefficient, as opposed to the variance
$$se(\hat\beta_1) = \sqrt\frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline{x})^2} = \sqrt\frac{RSS/(n-2)}{\sum_{i=1}^n(x_i-\overline{x})^2} $$
## Multiple Regression
Consider a linear model where $Y$ is a function of two independent variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
```{r}
x1 <- rnorm(sample, 0, 5)
x2 <- 2*x1 + rnorm(sample, 0, 5)
y <- 2*x1-x2+rnorm(sample, 0,5)
multivariate <- data.frame(x1, x2, y)
multi_plot<-plot_ly(multivariate, x = ~x1, y = ~x2, z = ~y)
multi_plot<-multi_plot %>%
add_markers()
multi_plot
```
Estimation of this linear model via OLS is similar
Recall OLS minimizes $RSS$
$$ RSS = \sum_{i=1}^{n}e_i^2$$
$$ = \sum_{i=1}^{n}(\hat{y_i} - y_i)^2 $$
$$ = \sum_{i=1}^{n}((\hat{\beta_0} + \hat{\beta_1}X_{1i} + \hat{\beta_2}X_{2i}) - y_i)^2 $$
When we model $Y$ as a function of $X_1$ and $X_2$, OLS minimizes $RSS$ over both regressors
The formulas for coefficients of multivariate regression are more complicated, but one representation:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}\hat{r_{i1}}y_i}{\sum_{i=1}^{n}(\hat{r_{i1}})^2} $$
Where $\hat{r_{i1}}$ is the residual from the auxiliary regression of $X_1$ on $X_2$
$$ X_1 = \hat\gamma_0+\hat\gamma_1{X_2} + \hat{r_1} $$
Note that our new estimate for $\hat\beta_1$ is identical to our old estimate except that $Cov(\hat{r}, y)$ and $Var(\hat{r})$ have replaced $Cov(x, y)$ and $Var(x)$ respectively
The specific form of these variances and covariances is slightly different - note no means! We've applied two simplifications to drop means
$$ E[\hat{r}] = 0 $$
$$ Cov(x, y) = \sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) = \sum_{i=1}^{n}(x_i - \overline{x})y_i $$
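A quick numeric check of the second identity, using the earlier simulated data:
```{r}
#The x deviations sum to zero, so dropping the mean of y changes nothing
sum((univariate$x - mean_x)*(univariate$y - mean_y))
sum((univariate$x - mean_x)*univariate$y)
```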
### Illustration of Multiple Regression by Partial Regression
```{r}
multivariate$resid <- residuals(lm(x1~x2, multivariate))
partial <- lm(y~resid, multivariate)
full <- lm(y~x1+x2, multivariate)
print(c(coef(partial), coef(full)))
```
### Illustration of Omitted Variable Bias
```{r}
short <- lm(y~x1, multivariate)
print(coef(short))
aux <- lm(x2~x1, multivariate)
coef(full)[2] + coef(full)[3]*coef(aux)[2]
```
This shows the relationship between the coefficients of the correct regression with $X_2$ and what is called the 'short' regression (because it's short $X_2$) where $Y$ is regressed on $X_1$ alone
$$ \tilde\beta_1 = \hat{\beta_1} + \hat{\beta_2}\hat{\gamma_1}$$
Where $\tilde{\beta_1}$ is the coefficient on $X_1$ from the short regression. The second term is the omitted variable bias from the exclusion of $X_2$ - the effect of $X_2$ on $Y$ (controlling for $X_1$) as well as the effect of $X_1$ on $X_2$ from the auxiliary regression of $X_2$ on $X_1$ ($\hat{\gamma_1}$)
Can $\tilde{\beta_1} = \hat{\beta_1}$?
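It can - whenever $\hat{\beta_2} = 0$ or $\hat{\gamma_1} = 0$. A quick check (our own addition to the simulation) using the auxiliary residual, which is orthogonal to $X_1$ by construction:
```{r}
#r is the part of x2 orthogonal to x1, so gamma1 = 0 and omitting it causes no bias
r <- residuals(lm(x2 ~ x1, data = multivariate))
coef(lm(y ~ x1, data = multivariate)) #short regression
coef(lm(y ~ x1 + r, data = multivariate)) #coefficient on x1 is unchanged
```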
# Frequentist Statistical Inference
Fundamental question: does $X$ affect $Y$?
Frequentist inference begins by assuming the opposite. We refer to this as the null hypothesis:
$H_0$: There is no relationship between $X$ and $Y$
Under the null hypothesis - what do we expect $\hat\beta_1$ to be?
Statistically, we want to test how likely our estimate of $\hat\beta_1$ is under the null hypothesis - in other words, if the effect of $X$ is zero, how unlikely is our estimate?
This depends on the coefficient and its variance, which we use to construct a t-statistic:
$$t = \frac{\hat\beta_1 - \beta_{1|H_0}}{se(\hat\beta_1)} $$
Under the *normality assumption*:
$$ \epsilon \sim Normal(0, \sigma^2) $$
The t-statistic follows a t-distribution with $n-k-1$ degrees of freedom. Knowing this, we can calculate a *p-value*, which represents the probability of drawing a test statistic at least as large as $t$ (one-tailed test) or at least as large as $|t|$ (two-tailed)
Note that normality implies homoskedasticity
## Manual Illustration
```{r}
univariate$y_hat = beta0_hat + beta1_hat * univariate$x
rss = sum((univariate$y_hat - univariate$y)^2)
se_beta1 <- sqrt(
(rss/(sample-2))/
sum((univariate$x - mean_x)^2)
)
t <- beta1_hat/se_beta1
p <- 2*pt(abs(t), df=sample-2, lower=F)
print(c(t, p))
```
## LM Function
```{r}
uni_model <- lm(y~x, data=univariate) #avoid naming the object `lm`, which masks the function
summary(uni_model)
plot(uni_model)
```
# TSCS Data
TSCS data have observations of multiple units *N* over time *T*.
Each specific observation within the data is of one unit at one time, e.g.
```{r, warning=F, message=F}
data<-read.dta13("ms_mil_spend_data.dta")
data %>%
filter(ccode==2, year==2000)
```
This observation is for the US (2) for the year 2000.
```{r}
data %>%
dplyr::arrange(ccode, year) %>%
dplyr::select(ccode, year, LMILEX, v2x_polyarchy)
data %>%
dplyr::filter(ccode %in% c(2,20)) %>%
ggplot(aes(x=year, y=LMILEX)) +
geom_line() +
facet_wrap(~ccode)
```
# Within / Between Decomposition
```{r}
summary(data$LMILEX)
#Between unit variation
data %>%
group_by(ccode) %>%
summarize(between=mean(na.omit(LMILEX)))%>%
dplyr::select(between) %>%
ungroup() %>%
summary()
#Within unit variation
data %>%
group_by(ccode) %>%
mutate(between=mean(na.omit(LMILEX)),
within=LMILEX-mean(na.omit(LMILEX))) %>%
dplyr::select(within) %>%
ungroup() %>%
summary()
```
# Pooling
The simplest approach to estimating a model on TSCS is to pool the data:
$Y_{it}=\beta_0+\beta_1X_{it}+e_{it}$
Where Y is the DV, X is the IV, $\beta_0$ is the intercept, $\beta_1$ is the coefficient on X and $e_{it}$ is the error.
What are we assuming here?
All of the $N$ units share a common intercept $\beta_0$ and a common slope $\beta_1$ on X.
Standard OLS assumptions regarding the error term (specifically zero conditional mean and white noise errors).
Whether pooling is appropriate depends on whether those assumptions are correct.
## Clean Pool
```{r}
#Monte Carlo simulations to better control the relationships
N<-seq(1,5,1)
Time<-seq(1,50,1)
clean<-expand.grid(N, Time)
names(clean)<-c("N", "Time")
clean<- clean %>%
dplyr::arrange(N, Time)
tibble(clean)
#Draw the IV, DV, error
clean$X<-rnorm(nrow(clean), 1, .25)
clean$Y<-3-2*clean$X+rnorm(nrow(clean), 0, .5)
tibble(clean)
#Run the pooled model
clean_pool<-lm(Y~X, data=clean)
summary(clean_pool)
#Plot to understand why this worked
clean %>%
ggplot(aes(x=X, y=Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
```
Pooled model works here because there is no difference in the underlying relationship - we know this because we drew X without respect to N, Time
## Dirty Pool
Let's assume instead that there exists some unit-level heterogeneity. Some $Z$ affects both $Y$ and $X$ but is time constant.
You may recognize this as the setup for omitted variable bias.
```{r}
#Monte Carlo simulations to better control the relationships
N<-seq(1,5,1)
Time<-seq(1,50,1)
dirty<-expand.grid(N, Time)
names(dirty)<-c("N", "Time")
dirty<- dirty %>%
dplyr::arrange(Time, N)
#Draw the unobserved effect, IV, DV, error
Z<-rnorm(length(unique(dirty$N)), 0, .5)
dirty$Z<-rep(Z, length(Time))
dirty$X<-dirty$Z+rnorm(nrow(dirty), 1, .25)
dirty$Y<-3-2*dirty$X+4*dirty$Z+rnorm(nrow(dirty),0, .5)
#Run the pooled model
dirty_pool<-lm(Y~X, data=dirty)
summary(dirty_pool)
#Plot the pooled model - what are we looking at?
dirty %>%
ggplot(aes(x=X, y=Y))+
geom_point()+
geom_smooth(method="lm", se=F)
dirty %>%
ggplot(aes(x=X, y=Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
#Plot the per unit regressions
dirty %>%
ggplot(aes(x=X, y=Y, color=as.factor(N)))+
geom_point()+
geom_smooth(method="lm", se=F)
```
In this case, pooling estimates a weak positive relationship between X and Y, despite the fact that the true relationship is negative.
This occurs because there is omitted variable bias introduced by Z.
Note that pooling would be fine if we could estimate $Y_{it}=\beta_1X_{it}+\beta_2Z_i+e_{it}$
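Since we simulated $Z$ ourselves, we can check this directly - in real data $Z$ would be unobserved:
```{r}
#Controlling for Z recovers the true slope on X of -2
coef(lm(Y ~ X + Z, data = dirty))
```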
# Fixed Effects
With TSCS data that contains time-constant unit-level omitted variable bias, fixed effects can be used to estimate the unbiased relationship, even if the omitted variable $Z$ is not observed.
```{r}
fe<-plm(Y~X, dirty, index=c("N", "Time"), model="within")
summary(fe)
```
Why does this work?
The fixed effect estimator considers only the within-unit variation, and removes the between unit variation.
In other words, the data are group demeaned - the mean for that unit over time is subtracted from each observation for each *i*.
Because the effect of *Z* is constant over time, demeaning removes its effect and the bias it introduces
We can do this manually:
```{r}
demean<-dirty %>%
dplyr::arrange(N,Time)%>%
dplyr::group_by(N) %>%
dplyr::mutate(mean_X=mean(X),
mean_Y=mean(Y),
demean_X=X-mean_X,
demean_Y=Y-mean_Y)
demean
fe_manual<-lm(demean_Y~demean_X-1, data=demean)
summary(fe_manual)
demean %>%
ggplot(aes(x=demean_X, y=demean_Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
```
The within estimator is typically introduced with dummy variable notation, where N indicators are included, one for each unit (or N-1 plus the intercept). These are equivalent in OLS (not in GLMs). Most software (plm in R; xtreg, fe in Stata) uses the within transform
```{r}
fe_dummy <-
lm(Y~X+as.factor(N)-1, data=dirty)
summary(fe_dummy)
summary(fe_manual)
summary(fe)
```
Note that the standard errors are slightly different because the demeaned model doesn't have the right degrees of freedom. You should use plm() for the within transform as it corrects for this.
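To see what that correction involves, here is a rough by-hand rescaling (a sketch of the adjustment plm applies internally):
```{r}
#The demeaned lm uses df = NT - 1; the within estimator should use NT - N - 1
NT <- nrow(demean)
N_units <- length(unique(demean$N))
naive_se <- coef(summary(fe_manual))["demean_X", "Std. Error"]
corrected_se <- naive_se * sqrt((NT - 1) / (NT - N_units - 1))
c(naive = naive_se, corrected = corrected_se, plm = summary(fe)$coefficients["X", "Std. Error"])
```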
## Interpretation
Interpretation is standard, but with a caveat
```{r}
summary(fe)
sd(dirty$X)
coef(fe)*sd(dirty$X)
sd(demean$demean_X)
coef(fe)*sd(demean$demean_X)
```
Takeaway: Coefficients on Linear FE regressors are correctly interpreted as the effect of a one unit increase in *the within unit* X. The within unit variance is always no larger than the total variance in X (often significantly smaller, here about 1/2). Calculate substantive effects using within X. (@mummolo2018improving)
## Drawbacks
1. No estimates for time-invariant variables
```{r}
time_inv<-dirty
time_inv$K<-ifelse(dirty$N<=2, 1, 0)
time_inv$Y<-3-2*time_inv$X+4*time_inv$Z-3*time_inv$K+rnorm(nrow(time_inv),0, .5)
drop_model<-plm(Y~X+K, model="within", index = c("N", "Time"), data=time_inv)
summary(drop_model)
#Why not?
time_inv %>%
group_by(N) %>%
mutate(var_k=K-mean(K)) %>%
ungroup() %>%
select(var_k) %>%
summary()
```
The FE estimator sweeps out all variation on K, because K only has between-unit variance, not within unit variance.
2. Loss of observations without variation in the DV in either conditional or unconditional logistic regression.
```{r}
data %>%
group_by(ccode) %>%
summarize(var_dem=var(na.omit(vdem_dem)))
logit<-clogit(vdem_dem~LMILEX+strata(ccode), data=data)
summary(logit)
```
A lot has been made of this issue, but does it matter? (@cook2020fixed)
"Feels" problematic that the model ignores a portion of the data. Can be severe if the event in question is rare/common.
Concrete issues:
1. No estimates of FE, so no individual MEs (similar to Cox PH model).
2. Biased estimates of the baseline rate of the event because non-event units are removed (counted below). The correct interpretation is the baseline rate for units that experience the event.
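We can count how many units the conditional logit drops because their DV never varies:
```{r}
#Units with zero variance in vdem_dem contribute nothing to the conditional likelihood
data %>%
  group_by(ccode) %>%
  summarize(varies = var(na.omit(vdem_dem)) > 0) %>%
  count(varies)
```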
### Penalized ML
@cook2020fixed recast the unconditional (unit-dummy) FE problem as one of *separation*.
Separation occurs when a variable perfectly predicts the DV - here the unit dummy does. (Imagine a cross-tabulation of Y and the unit indicator N; if one or more cells are empty, this is separation)
ML solutions *do not exist* under separation - the log-likelihood has no maximum because fit can always be improved by pushing the separating coefficient further towards negative (positive) infinity.
A well-known solution is to penalize the maximum likelihood to punish large coefficients (this may be familiar from ridge/lasso, though the specific penalty here differs.)
```{r}
penalized_fe<-glm(vdem_dem~LMILEX+as.factor(ccode), data=data,
family=binomial(link="logit"), method="brglmFit")
summary(penalized_fe)
```
Note that the unconditional FE estimator is biased due to the incidental parameters problem, which can be severe in small T. Penalizing the ML does not change this.
The incidental parameters problem is just a special case of the well-known small-sample bias of MLE (MLE is consistent, not unbiased, and consistency relies on asymptotics).
As N grows without T, the number of 'incidental parameters' (the unit dummies) grows, so we are using T periods to estimate a greater and greater number of parameters.
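A small simulation of our own (not from the workshop data) makes the bias concrete; the true slope is 1, and the unconditional FE logit with small T typically overstates it:
```{r, warning=F}
keep_seed <- .Random.seed #preserve the tutorial's random stream for later chunks
n_units <- 200
T_small <- 5
ip_sim <- expand.grid(unit = 1:n_units, t = 1:T_small)
alpha_i <- rnorm(n_units) #unit effects
ip_sim$x <- rnorm(nrow(ip_sim))
ip_sim$y <- rbinom(nrow(ip_sim), 1, plogis(alpha_i[ip_sim$unit] + ip_sim$x))
#Unconditional FE logit: one dummy per unit (expect separation warnings for units that never vary)
ip_fit <- glm(y ~ x + as.factor(unit), family = binomial, data = ip_sim)
coef(ip_fit)["x"] #typically noticeably above the true value of 1
.Random.seed <- keep_seed
```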
# Random Effects
What if $Z$ does not cause $X$?
There's no OVB (recall necessary conditions)
FE estimator is still unbiased
Pooled estimator unbiased, but incorrect SE/t-stats because unit effect creates serial correlation in errors
```{r}
#Make some data to better control the relationships
N<-seq(1,5,1)
Time<-seq(1,50,1)
random<-expand.grid(N, Time)
names(random)<-c("N", "Time")
random<- random %>%
dplyr::arrange(Time, N)
#Draw the unobserved effect, IV, DV, error
Z<-rnorm(length(unique(random$N)), -2, .5)
random$Z<-rep(Z, length(Time))
random$X<-rnorm(nrow(random), 1, .25)
random$Y<-3-2*random$X+4*random$Z+rnorm(nrow(random),0, .5)
#Run the pooled model
random_pool<-lm(Y~X, data=random)
summary(random_pool)
#Plot the pooled model - what are we looking at?
random %>%
ggplot(aes(x=X, y=Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
#Plot unit intercepts/slopes
random %>%
ggplot(aes(x=X, y=Y, color=as.factor(N)))+
geom_point()+
geom_smooth(method="lm", se=F)
#RE estimator
re<-plm(Y~X, random, model = "random")
summary(re)
```
## Pooled model has incorrect variance
```{r}
#True error
random$error <-random$Y-(3-2*random$X+4*random$Z)
#Get estimated error from pooled model
random$resid<-resid(random_pool)
random %>%
dplyr::select(error, resid) %>%
pivot_longer(cols = everything()) %>%
ggplot()+
geom_histogram(aes(value, fill=name), alpha=.2)
e_resid = sum((random$resid)^2)/248 #RSS/(n-2): 250 observations, 2 estimated coefficients
e_error = sum((random$error)^2)/250 #true error is known, so divide by n
e_resid
e_error
# How does this affect beta se and t-stats?
sqrt(e_resid / sum((random$X - mean(random$X))^2))
sqrt(e_error / sum((random$X - mean(random$X))^2))
summary(random_pool)
summary(re)
```
We can see that our SEs (and by extension t-stats) for the pooled model are incorrect. This is because our estimate of $\sigma^2$ (variance of the error term) is wrong because the pooled estimator includes the effect of $Z$ on $Y$ in the variance.
Random effects purge the variance contributed by $Z$ and therefore return SE that match SE calculated with the known error term.
### Autocorrelation robust standard errors
Since the pooled model is unbiased, we can simply correct the standard error estimates. Here using cluster-robust SE.
However, cluster-robust SE are *asymptotically* correct (in the number of units $N$ *not* the number of observations). Our *N* is 5, which is...not asymptotic.
```{r}
coeftest(random_pool, vcov=sandwich::vcovCL(random_pool, cluster=random$N, type="HC0")) #SEs clustered by unit
```
These look very similar to our uncorrected SEs.
Make the data bigger.
```{r}
#Make some data to better control the relationships
N_large<-seq(1,5000,1)
Time<-seq(1,50,1)
random_large<-expand.grid(N_large, Time)
names(random_large)<-c("N", "Time")
random_large<- random_large %>%
dplyr::arrange(Time, N)
#Draw the unobserved effect, IV, DV, error
Z_large<-rnorm(length(unique(random_large$N)), -2, .5)
random_large$Z<-rep(Z_large, length(Time))
random_large$X<-rnorm(nrow(random_large), 1, .25)
random_large$Y<-3-2*random_large$X+4*random_large$Z+rnorm(nrow(random_large),0, .5)
#Run the pooled model
random_pool_large<-lm(Y~X, data=random_large)
coeftest(random_pool_large, vcov=sandwich::vcovCL(random_pool_large, cluster=random_large$N, type="HC0"))
large_e_error <- sum((random_large$Y - (3 - 2*random_large$X + 4*random_large$Z))^2)/nrow(random_large)
sqrt(large_e_error / sum((random_large$X - mean(random_large$X))^2))
```
## Testing the validity of the RE assumption
In order to be unbiased, the Random Effects estimator requires that there is no correlation between the unit effects and the explanatory variables.
This assumption is typically tested with a Hausman test comparing an FE and RE model with the same specification
```{r}
#Fit a FE model to the same data
fe<-plm(Y~X, random, model = "within")
#Hausman Test
phtest(fe, re)
#Note, order of the arguments does not matter
#Can also calculate the Hausman statistic by hand
t(coef(fe) - coef(re)[2])%*% solve(vcov(fe)-vcov(re)[2,2]) %*% (coef(fe)-coef(re)[2])
```
The null hypothesis of the Hausman test is (roughly) that the coefficient vectors are equal in both the FE and RE models. Rejection indicates the two vectors are statistically different. Rejection is typically taken to suggest violation of the RE assumption.
Recall that both the FE and RE models are unbiased under the RE assumptions (the former is not efficient). Thus, if the RE assumptions hold, we would not expect a difference. In contrast, if the RE assumption regarding the correlation of the individual effects with the independent variables is violated, only the FE estimator is unbiased and the RE estimator is biased. Therefore, we would expect differences in the estimates.
# The Within-Between Model
Rather than choosing between fixed and random effects, we can estimate a compound model.
Recall that the goal of both the FE and RE models is to account for unit-level heterogeneity. The fundamental difference is that the RE model requires no covariance between the unit effects and the regressors. The FE model is agnostic on this because it eliminates all variation on all variables due to unit effects via the within transform (subtracting the unit means from each realization of $Y$, $X$)
The risk of the RE model, then, can be thought of as correlation between these unit means and the error term. As with standard OVB, putting the unit means into the equation will remove the bias and allow the RE estimator to recover the unbiased coefficients on $X$.
The traditional approach comes from @mundlak1978pooling and involves estimating the following equation with the RE estimator:
$Y_{it}=\beta_1X_{it}+\beta_2\overline{X_i}+e_{it}$
Where $\overline{X_i}$ is the unit level mean of $X$
Alternatively, the time-specific realizations of $X$ can be within-transformed prior to estimation.
$Y_{it}=\beta_1(X_{it}-\overline{X_i})+\beta_2\overline{X_i}+e_{it}$
The specific model you should use depends on which coefficients are more of interest: the individual or aggregate. The former (Mundlak) specification facilitates interpretation of the aggregate (group mean) variable.
The group demeaning in the latter approach ensures that the individual realizations are uncorrelated with the group-level mean. As a result, the group-level term cannot be interpreted, because it does not control for the effect of the individual realizations.
```{r}
#We'll fit the second equation to our RE data
#First, we have to group demean
demean_random<-random %>%
dplyr::arrange(N,Time)%>%
group_by(N) %>%
mutate(mean_X=mean(X),
demean_X=X-mean_X)
# Now include both X terms in the RE model
rebw<-plm(Y~demean_X + mean_X, demean_random, model = "random")
coef(rebw)
coef(fe)
```
Notice that the coefficient on the within effect from the REBW model is the same as the FE estimate.
We can also estimate the Mundlak version
```{r}
#Mundlak formulation
mundlak<-plm(Y~X + mean_X, demean_random, model = "random")
coef(mundlak)
```
Again, we've successfully purged the bias from the coefficient on the individual realizations of $X$. Our coefficient on $\overline{X}$ has changed: the REWB estimate was -5.2, which is the sum of the Mundlak within and mean coefficients of -2 and -3.2
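We can verify that identity directly from the two fitted models:
```{r}
#REWB coefficient on the unit mean equals the sum of the Mundlak coefficients
coef(rebw)["mean_X"]
coef(mundlak)["X"] + coef(mundlak)["mean_X"]
```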
# Dynamic Models
```{r}
#Check how our data look
data %>%
dplyr::select(c(ccode, year, LMILEX, v2x_polyarchy)) %>%
filter(ccode<100) %>%
pivot_longer(cols = c(LMILEX, v2x_polyarchy), values_to = "values")%>%
ggplot(aes(y=values, x=year, color=as.factor(ccode)))+
geom_line()+
facet_wrap(~name, scales="free_y")
```
## Panel Unit Root Testing
### Single TS Unit Root Testing
Many time series are persistent - current values depend on past values.
Formally, we can define an autoregressive process of order one AR(1) series as one that can be written
$$Y_t = \rho_1Y_{t-1}+\epsilon_t$$
where $\epsilon_t$ is iid. Order one means that $Y$ is only a function of one past realization.
$Y$ is *stationary* if $|\rho_1| < 1$. Informally, this means the correlations between $Y$ and its past realizations are asymptotically 0. Over a long enough time period, the past is irrelevant.
When $|\rho_1| = 1$, $Y$ is said to have a *unit root.* We also call such a series *integrated* of order one, or I(1). An I(1) series is only one type of unit root process, and determining which kind a series has is difficult.
Such series are problematic for modeling. First, correlations with the past do not die out, even asymptotically. Second, unit root series have increasing variance over time, meaning the central limit theorem does not hold (test statistics are no longer asymptotically normal), so the usual hypothesis tests are invalid. Third, regressions with non-stationary variables can produce spurious inferences - two independent integrated series will appear correlated due to integration, not a causal relationship.
Let's see:
```{r}
errors <- rnorm(1001,0,1)
unit_root <- cumsum(errors)
unit_root_data <- data.frame(error=errors, unit_root=unit_root, time=seq(1,1001,1))
unit_root_data <- unit_root_data %>%
mutate(stationary1 = as.numeric(arima.sim(list(ar=.5), n=n())), #AR(1) with rho = .5
stationary2 = as.numeric(arima.sim(list(ar=.9), n=n())), #AR(1) with rho = .9
first_diff = unit_root - dplyr::lag(unit_root))
unit_root_data %>%
dplyr::select(-error) %>%
pivot_longer(-time) %>%
ggplot(aes(y=value, x=time, color=name))+
geom_line()
```
You can see that over a long enough time period the persistence of the series is irrelevant - it's still mean reverting. In contrast, the unit root gets "stuck" at high or low values and stays there. In real data, the distinction is typically less obvious.
I(1) series are *difference stationary* meaning that if we subtract the prior value, the series is stationary.
$$Y_t = \rho_1Y_{t-1}+\epsilon_t$$
$$Y_t = Y_{t-1}+\epsilon_t$$
$$Y_t - Y_{t-1} = \epsilon_t$$
Typically we denote a differenced series with $\Delta$, e.g. $\Delta Y$
#### Augmented Dickey-Fuller Test
If a series has a unit root, then $\rho_1 = 1$. We could formally test this with a regression of $Y_t$ against its past values, and use a hypothesis test on the coefficient. This is the basis of the Dickey-Fuller test. However, we use different critical values since the central limit theorem is violated by a unit root, and we typically use the Augmented Dickey-Fuller test, which includes additional lags to clean up residual serial correlation. Note that the function we're using automatically determines these based on series length.
```{r}
apply(unit_root_data[, c("unit_root", "stationary1", "stationary2", "first_diff")], 2, function(x) tseries::adf.test(na.omit(x)))
```
As noted by the test, the null hypothesis is that the series has a unit root. Thus at a small p-value we reject the null and treat the series as stationary.
### Back to Panels
Panel unit root testing is difficult and R implementations are somewhat basic
We can perform our own test via the p-value aggregation method of @choi2001unit
We perform an augmented Dickey-Fuller test on each panel, aggregate the p-values
Specific aggregation is $-2\sum_{n=1}^{N} \ln(p_n)$
Test statistic is distributed $\chi^2(2N)$
N here indicates number of panels
We're going to loop over the panels, perform the test, store the results
(You could also do this as a vectorized function...)
```{r, warning=F, message=F}
pvals <- NULL
for (i in unique(data$ccode)){
#tryCatch returns NULL when the test fails (e.g., too few observations); record NA for that panel
dfuller <- tryCatch(adf.test(na.omit(data[data$ccode==i, "LMILEX"]), k = 1),
error = function(e) NULL)
pvals <- rbind(pvals, ifelse(is.null(dfuller), NA, dfuller$p.value))
}
head(pvals)
#Choi's P-value
test_stat<- -2*sum(log(na.omit(pvals)))
test_stat
#That's going to reject, but let's get a p-value for completeness
#Test is distributed chi2(2N) where N is panels
round(pchisq(test_stat, df=2*nrow(na.omit(pvals)), lower.tail = F), digits=3)
```
In the p-value aggregation case, we are testing a specific null hypothesis that all the panels have a unit root against the alternative that at least one is stationary.
That might not be the most reassuring rejection; other tests have different nulls.
## First Differencing
The panel ADF didn't indicate a need for differencing, but let's difference anyway to illustrate.
(Usual caveats here; first differencing is only appropriate for I(1) series, you need to test the differenced series for stationarity, etc.)
```{r}
#Difference these out to be safe
fd<-pdata.frame(data, index = c("ccode", "year"))
fd$d_lmilex=diff(fd$LMILEX)
fd$d_polyarchy=diff(fd$v2x_polyarchy)
fd %>%
dplyr::select(c(ccode, year, d_lmilex, d_polyarchy)) %>%
filter(ccode %in% seq(2,99,1)) %>%
pivot_longer(cols = c(d_lmilex, d_polyarchy), values_to = "values")%>%
ggplot(aes(y=values, x=year, color=as.factor(ccode)))+
geom_point()+
facet_wrap(~name, scales="free_y")
```
## Static
More details in @de2008taking
```{r}
static<-lm(d_lmilex~d_polyarchy, data=fd)
summary(static)
#Illustrate the impact of 1 SD change in polyarchy (temporary)
time<-seq(1,10,1)
dem<-c(rep(0,2),
sd(na.omit(fd$d_polyarchy)),
rep(0, 7))
pred_dat<-cbind.data.frame(time,dem)
pred_dat$milex_hat<-pred_dat$dem*coef(static)[2]
pred_dat %>%
ggplot(aes(x=time, y=milex_hat))+
geom_line()
```
If the regression equation is $Y_t=\beta_0+\beta_1X_t+\epsilon_t$
The effect of $X_t$ on $Y_t$ is $\frac{\delta Y_t}{\delta X_t}$ which is $\beta_1$
Moving forward one period: $Y_{t+1}=\beta_0+\beta_1X_{t+1}+\epsilon_{t+1}$
The derivative of that regression with respect to $X_t$ is zero.
## Finite Lag
```{r}
fd$lag_polyarchy=lag(fd$d_polyarchy)
fd$lag_lmilex=lag(fd$d_lmilex)
lag1<-lm(d_lmilex~d_polyarchy+lag_polyarchy, data=fd)
summary(lag1)
pred_dat<-pred_dat %>%
mutate(lag_dem=dplyr::lag(dem))
pred_dat$milex_hat<-pred_dat$dem*coef(lag1)[2]+pred_dat$lag_dem*coef(lag1)[3]
pred_dat %>%
ggplot(aes(x=time, y=milex_hat))+
geom_line()
```
If the regression equation is $Y_t=\beta_0+\beta_1X_t+\beta_2X_{t-1}+\epsilon_t$
The effect of $X_t$ on $Y_t$ is $\frac{\delta}{\delta X} \beta_0+\beta_1X_t+\beta_2X_{t-1}+\epsilon_t$ which is $\beta_1$
Moving forward one period: $Y_{t+1}=\beta_0+\beta_1X_{t+1}+\beta_2X_t+\epsilon_{t+1}$
The derivative of that regression with respect to $X_t$ is $\beta_2$
Derivative in periods beyond $t+1$ is zero as before.
## Lagged Dependent Variable Models
```{r}
ldv<-lm(d_lmilex~lag_lmilex+d_polyarchy, data=fd)
summary(ldv)
pred_dat[1:3,]$milex_hat<-coef(ldv)[3]*pred_dat[1:3,]$dem
pred_dat[4:10,]$milex_hat<-coef(ldv)[2]^(pred_dat[4:10,]$time-3)*coef(ldv)[3]*pred_dat[3,"dem"]
pred_dat %>%
ggplot(aes(x=time, y=milex_hat))+
geom_line()
#What about a permanent shift?
coef(ldv)
lrm<-coef(ldv)[3]*sd(na.omit(fd$d_polyarchy))/(1-coef(ldv)[2])
lrm
```
In the case of a temporary shock to X, the effect decays geometrically - a restriction forced by the functional form of the model
To see that, let's consider the effect of $X_t$ at time $t+1$ and $t+2$
The general equation $Y_{t}=\beta_0+\alpha Y_{t-1}+\beta_1X_{t}+\epsilon_{t}$
At $t+1$ we have
$Y_{t+1}=\beta_0+\alpha Y_{t}+\beta_1X_{t+1}+\epsilon_{t+1}$
Substitute for $Y_t$ - drop out the error terms, constant for ease
$Y_{t+1}=\alpha(\alpha Y_{t-1}+\beta_1X_{t})+\beta_1X_{t+1}$
First derivative with respect to $X_t$ is $\alpha\times\beta_1$
This makes intuitive sense - $X_t$ affects $Y_t$ by $\beta_1$. In turn, $Y_t$ affects $Y_{t+1}$ by $\alpha$, the extent to which $Y$ is serially correlated.
Similarly at $t+2$
$Y_{t+2}=\beta_0+\alpha Y_{t+1}+\beta_1X_{t+2}+\epsilon_{t+2}$
Substitute for $Y_{t+1}$
$Y_{t+2}=\beta_0+\alpha (\alpha Y_{t}+\beta_1X_{t+1})+\beta_1X_{t+2}+\epsilon_{t+2}$
And again for $Y_{t}$
$Y_{t+2}=\beta_0+\alpha [\alpha(\alpha Y_{t-1}+\beta_1X_{t})+\beta_1X_{t+1}]+\beta_1X_{t+2}+\epsilon_{t+2}$
This is kind of ugly, but anything that's not multiplied by $X_t$ can go away, which is basically everything
We're left with $\alpha^2 \times \beta_1$
This generalizes: the effect of $X_t$ at any arbitrary lead into the future is $\alpha^d \times \beta_1$, where $d$ is the number of leads.
Note that the effect of $X_t$ decays because of the requirement that $\vert \alpha \vert < 1$ - otherwise the effect does not decay at all ($= 1$) or increases in time ($>1$).
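Tracing these dynamic multipliers with the estimated coefficients from above:
```{r}
#Dynamic multipliers alpha^d * beta1 for leads d = 0,...,9
alpha <- coef(ldv)["lag_lmilex"]
beta1 <- coef(ldv)["d_polyarchy"]
round(alpha^(0:9) * beta1, 4)
#The long-run multiplier is the geometric sum beta1/(1 - alpha), as computed above
```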
### Serial correlation in error term
Haven't made too much of this until now.
In general, serial correlation in the error is not a big issue (no bias / inconsistency).
We do get bad SE/t-stats under serial correlation.
Two approaches:
1. Just use OLS and clean up the SEs to be robust to serial correlation (HAC SEs, PCSEs) - see the sketch after this list
2. Try to better model the dynamics to eliminate serial correlation - for example with deeper lags of $Y$
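A minimal sketch of approach 1, refitting the LDV model as a pooled plm model so we can use plm's Arellano-type covariance, which is robust to within-unit serial correlation:
```{r}
ldv_pool_hac <- plm(d_lmilex ~ lag_lmilex + d_polyarchy, data = fd, model = "pooling")
coeftest(ldv_pool_hac, vcov = vcovHC(ldv_pool_hac, method = "arellano", type = "HC0"))
```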
But, with an LDV, serial correlation in the error creates bias. (@achen2000lagged, @keele2006dynamic)
Need to be careful to ensure no serial correlation in error term of the model.
```{r}
pbgtest(d_lmilex~lag_lmilex+d_polyarchy, data=fd, model="pooling", order=1)
ldv<-plm(d_lmilex~lag_lmilex+d_polyarchy, data=fd, model="pooling")
res_df <- data.frame(residual = residuals(ldv), attr(residuals(ldv), "index"))
res_df<- res_df %>%
left_join(fd, by=c("ccode", "year"))
res_df<-plm::pdata.frame(res_df, index=c("ccode", "year"))
res_df$lag_res <- lag(res_df$residual) #convert to pdata.frame first so the lag respects panel boundaries
wooldridge <- plm(residual~lag_res, model="pooling", data=res_df)