---
title: "Time Series Cross Sectional (TSCS) Data Tutorial"
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    toc: yes
    df_print: paged
date: "June 21, 2024"
link-citations: yes
bibliography: ref.bib
---
# Introduction
This workshop offers an introduction to time series cross sectional (TSCS) data with a focus on applications in R. Some familiarity with time series cross sectional econometrics, as well as R, is assumed. We will cover some basic theoretical results, but this workshop is not intended to substitute for a more detailed theoretical treatment.
The workshop starts with some basic description of the structure of TSCS data and the risks of the pooled (OLS) estimator, then covers standard approaches to unit-level heterogeneity: the fixed effect (within) estimator and the random effect estimator. Interpretation and drawbacks of the fixed effect estimator are discussed, as well as a potential solution to the time-invariant covariate problem via penalized maximum likelihood. We discuss testing the key assumption behind the random effects model, and introduce the unifying "within-between" model.
The second section focuses on issues related to time. We start by considering the implications of a static model, then incorporate basic dynamics via the finite distributed lag and lagged dependent variable (LDV) specifications. We briefly discuss tests for serial correlation in the model errors, and introduce the Anderson-Hsiao estimator for FE estimation with an LDV.
# Setup
## Downloads
[Training Repo, including this file](https://github.com/Lucy-Family-Institute/tscs_demo.git)
[R](https://cran.r-project.org)
[Rstudio IDE](https://www.rstudio.com/products/rstudio/download/)
## Packages
*install.packages()* and *library()* are the standard functions for installing and loading packages. Here we'll load some useful packages; *pacman::p_load()* combines both steps in one function.
```{r, results='hide', message=F, warning=F}
if (!require("pacman")) install.packages("pacman", repos = "https://cloud.r-project.org")
pacman::p_load("tidyverse", "data.table", "readstata13", "plm", "bife", "survival", "brglm2", "ivreg", "lmtest", "sandwich", "tseries", "plotly")
#setwd()
set.seed(1015)
```
# Review of OLS and the Linear Model
Some clarification of terminology
A linear regression models a dependent variable $Y$ as a function of one or more independent variables ($X$) and an error term $\epsilon$
(Ordinary) Least Squares (OLS/LS) is a popular estimator of the linear model, but not the only estimator
Ridge Regression / Lasso are examples of other estimators of the linear model
From an ML perspective, linear regression is a form of supervised learning, while OLS fits it by minimizing error as measured by the sum of squared residuals
## Intuition
Create a univariate linear relationship using a Monte Carlo simulation
```{r, warning=F, message=F}
sample <- 200
x <- rnorm(sample, 0, 2)
e <- rnorm(sample, 0 , 10)
y <- 3 + x + e
univariate <- data.frame(y, x)
univariate
plot<-univariate %>%
ggplot(aes(x, y)) +
geom_point()
ggplotly(plot)
```
Estimating a univariate linear model is equivalent to drawing a straight line through this scatter plot
Can fit any number of linear models to this data - need a criterion for determining the best model
OLS posits that the best model is the one that minimizes the error between the model's predicted $Y$ and the real $Y$
OLS measures model error as the sum of squared residuals
```{r}
univariate$pred <- predict(lm(y~x, data=univariate))
univariate$resid <- univariate$y - univariate$pred
resid_plot<-univariate %>%
ggplot(aes(x, y)) +
geom_point()+
geom_smooth(method = "lm", se=F)+
geom_segment(aes(xend=x, yend = pred), color="red")
ggplotly(resid_plot)
```
## Formalization
Write the unknown, but linear, data generating process for $Y$ as
$$ Y = \beta_0 + \beta_1X + \epsilon $$
Our linear model returns estimates of coefficients, $\beta$, which we denote with $\hat{}$. We use these estimates to construct the estimate of $Y$
$$\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X $$
We define the residual $e$ as the difference between $Y$ and its estimate $\hat{Y}$, which we use as an estimate of the unknown disturbance $\epsilon$. Note that we move to estimates for an individual observation $i$ below:
$$ e_i = y_i - \hat{y}_i$$
Finally, we denote the residual sum of squares (RSS) as the sum of the squared residuals for all observations in the data
$$ RSS = \sum_{i=1}^{n}e_i^2$$
OLS estimates the coefficients $\beta$ that minimize $RSS$. For the simple bivariate case the coefficient on $X$ is simply the ratio of its covariance with $Y$ over its variance:
$$\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}$$
The constant is simply the mean of $Y$ minus the fitted slope times the mean of $X$, so the regression line passes through $(\overline{x}, \overline{y})$
$$\hat{\beta_0} = \overline{y} - \hat{\beta_1}\overline{x}$$
Given that the coefficient estimates that minimize $RSS$ are known there is no need to perform model training or tuning to estimate them - they can be calculated directly from the data
```{r}
coef(lm(y~x, data=univariate))
mean_x <- mean(x)
mean_y <- mean(y)
beta1_hat <-(sum((univariate$x - mean_x)*(univariate$y - mean_y)))/(sum((univariate$x-mean_x)^2))
beta0_hat <- mean_y - beta1_hat*mean_x
print(c(beta0_hat, beta1_hat))
```
## What's Special About RSS?
Under certain assumptions, OLS is the 'best' estimator for the linear model
*Zero conditional mean assumption* (along with some other technical assumptions) is sufficient to ensure unbiasedness ($E(\hat{\beta})=\beta$):
$$E(\epsilon|X) = 0 $$
Note that this is an assumption about the unobserved error $\epsilon$ and cannot be tested - OLS will fit residuals $e$ that are mean zero whether or not the assumption holds
The additional assumption of *Homoskedastic and Serially Uncorrelated Errors* makes OLS the Minimum Variance Unbiased Linear Estimator
$$Var(\epsilon | X) = \sigma^2 $$
$$Corr(\epsilon_t, \epsilon_s) = 0, \forall t \neq s $$
Informally, these two assumptions ensure that of all the unbiased linear estimators of the linear model, OLS has the smallest variance
This does not mean OLS is the absolute best estimator in any circumstance
If the errors satisfy the above assumptions, but are not normally distributed, then non-linear unbiased estimators may offer improvement, such as trimmed least squares
Variance may be reduced by using a biased estimator, such as ridge regression
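As a quick illustration (a sketch of our own using *MASS::lm.ridge()*, which ships with R; not part of the original workshop code), the ridge penalty shrinks the slope toward zero as the penalty grows:
```{r}
#Ridge fits for increasing penalties - lambda = 0 reproduces OLS
MASS::lm.ridge(y ~ x, data = univariate, lambda = c(0, 50, 500))
```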
## Coefficient Variance
OLS is unbiased, but any individual estimate of $\hat\beta$ may be far from $\beta$
The variance of $\hat\beta_1$ is a function of the variance of the error and the variance of $X$
Higher variance in the error makes estimation less precise while higher variance in $X$ makes it more precise:
$$ Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline{x})^2} $$
The denominator is known (it is a function of the data), while the numerator is unknown and must be estimated
An unbiased estimator of $\sigma^2$ is $RSS/(n-2)$ where $n-2$ is a degrees of freedom adjustment due to constraints of OLS
For test statistics, we work with the standard error of the coefficient, as opposed to the variance
$$se(\hat\beta_1) = \sqrt\frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline{x})^2} = \sqrt\frac{RSS/(n-2)}{\sum_{i=1}^n(x_i-\overline{x})^2} $$
## Multiple Regression
Consider a linear model where $Y$ is a function of two independent variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
```{r}
x1 <- rnorm(sample, 0, 5)
x2 <- 2*x1 + rnorm(sample, 0, 5)
y <- 2*x1-x2+rnorm(sample, 0,5)
multivariate <- data.frame(x1, x2, y)
multi_plot<-plot_ly(multivariate, x = ~x1, y = ~x2, z = ~y)
multi_plot<-multi_plot %>%
add_markers()
multi_plot
```
Estimation of this linear model via OLS is similar
Recall OLS minimizes $RSS$
$$ RSS = \sum_{i=1}^{n}e_i^2$$
$$ = \sum_{i=1}^{n}(\hat{y_i} - y_i)^2 $$
$$ = \sum_{i=1}^{n}((\hat{\beta_0} + \hat{\beta_1}X_{1i} + \hat{\beta_2}X_{2i}) - y_i)^2 $$
When we model $Y$ as a function of $X_1$ and $X_2$, OLS minimizes $RSS$ over both regressors
The formulas for coefficients of multivariate regression are more complicated, but one representation:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}\hat{r_{i1}}y_i}{\sum_{i=1}^{n}(\hat{r_{i1}})^2} $$
Where $\hat{r_{i1}}$ is the residual from the auxiliary regression of $X_1$ on $X_2$
$$ X_1 = \hat\gamma_0+\hat\gamma_1{X_2} + \hat{r_1} $$
Note that our new estimate for $\hat\beta_1$ is identical to our old estimate except that $Cov(\hat{r}, y)$ and $Var(\hat{r})$ have replaced $Cov(x, y)$ and $Var(x)$ respectively
The specific form of these variances and covariances is slightly different - note no means! We've applied two simplifications to drop means
$$ E[\hat{r}] = 0 $$
$$ Cov(x, y) = \sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) = \sum_{i=1}^{n}(x_i - \overline{x})y_i $$
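A quick numeric check of the second identity, using the earlier simulated data:
```{r}
#The x deviations sum to zero, so dropping the mean of y changes nothing
sum((univariate$x - mean_x)*(univariate$y - mean_y))
sum((univariate$x - mean_x)*univariate$y)
```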
### Illustration of Multiple Regression by Partial Regression
```{r}
multivariate$resid <- residuals(lm(x1~x2, multivariate))
partial <- lm(y~resid, multivariate)
full <- lm(y~x1+x2, multivariate)
print(c(coef(partial), coef(full)))
```
### Illustration of Omitted Variable Bias
```{r}
short <- lm(y~x1, multivariate)
print(coef(short))
aux <- lm(x2~x1, multivariate)
coef(full)[2] + coef(full)[3]*coef(aux)[2]
```
This shows the relationship between the coefficients of the correct regression with $X_2$ and what is called the 'short' regression (because it's short $X_2$) where $Y$ is regressed on $X_1$ alone
$$ \tilde\beta_1 = \hat{\beta_1} + \hat{\beta_2}\hat{\gamma_1}$$
Where $\tilde{\beta_1}$ is the coefficient on $X_1$ from the short regression. The second term is the omitted variable bias from the exclusion of $X_2$ - the effect of $X_2$ on $Y$ (controlling for $X_1$) as well as the effect of $X_1$ on $X_2$ from the auxiliary regression of $X_2$ on $X_1$ ($\hat{\gamma_1}$)
Can $\tilde{\beta_1} = \hat{\beta_1}$?
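It can - whenever $\hat{\beta_2} = 0$ or $\hat{\gamma_1} = 0$. A quick check (our own addition to the simulation) using the auxiliary residual, which is orthogonal to $X_1$ by construction:
```{r}
#r is the part of x2 orthogonal to x1, so gamma1 = 0 and omitting it causes no bias
r <- residuals(lm(x2 ~ x1, data = multivariate))
coef(lm(y ~ x1, data = multivariate)) #short regression
coef(lm(y ~ x1 + r, data = multivariate)) #coefficient on x1 is unchanged
```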
# Frequentist Statistical Inference
Fundamental question: does $X$ affect $Y$?
Frequentist inference begins by assuming the opposite. We refer to this as the null hypothesis:
$H_0$: There is no relationship between $X$ and $Y$
Under the null hypothesis - what do we expect $\hat\beta_1$ to be?
Statistically, we want to test how likely our estimate of $\hat\beta_1$ is under the null hypothesis - in other words, if the effect of $X$ is zero, how unlikely is our estimate?
This depends on the coefficient and its variance, which we use to construct a t-statistic:
$$t = \frac{\hat\beta_1 - \beta_{1|H_0}}{se(\hat\beta_1)} $$
Under the *normality assumption*:
$$ \epsilon \sim Normal(0, \sigma^2) $$
The t-statistic follows a t-distribution with $n-k-1$ degrees of freedom. Knowing this, we can calculate a *p-value*, which represents the probability of drawing a test statistic at least as large as $t$ (one-tailed test) or at least as large as $|t|$ (two-tailed)
Note that normality implies homoskedasticity
## Manual Illustration
```{r}
univariate$y_hat = beta0_hat + beta1_hat * univariate$x
rss = sum((univariate$y_hat - univariate$y)^2)
se_beta1 <- sqrt(
(rss/(sample-2))/
sum((univariate$x - mean_x)^2)
)
t <- beta1_hat/se_beta1
p <- 2*pt(abs(t), df=sample-2, lower=F)
print(c(t, p))
```
## LM Function
```{r}
uni_model <- lm(y~x, data=univariate) #avoid naming the object `lm`, which masks the function
summary(uni_model)
plot(uni_model)
```
# TSCS Data
TSCS data have observations of multiple units *N* over time *T*.
Each specific observation within the data is of one unit at one time, e.g.
```{r, warning=F, message=F}
data<-read.dta13("ms_mil_spend_data.dta")
data %>%
filter(ccode==2, year==2000)
```
This observation is for the US (2) for the year 2000.
```{r}
data %>%
dplyr::arrange(ccode, year) %>%
dplyr::select(ccode, year, LMILEX, v2x_polyarchy)
data %>%
dplyr::filter(ccode %in% c(2,20)) %>%
ggplot(aes(x=year, y=LMILEX)) +
geom_line() +
facet_wrap(~ccode)
```
# Within / Between Decomposition
```{r}
summary(data$LMILEX)
#Between unit variation
data %>%
group_by(ccode) %>%
summarize(between=mean(na.omit(LMILEX)))%>%
dplyr::select(between) %>%
ungroup() %>%
summary()
#Within unit variation
data %>%
group_by(ccode) %>%
mutate(between=mean(na.omit(LMILEX)),
within=LMILEX-mean(na.omit(LMILEX))) %>%
dplyr::select(within) %>%
ungroup() %>%
summary()
```
# Pooling
The simplest approach to estimating a model on TSCS is to pool the data:
$Y_{it}=\beta_0+\beta_1X_{it}+e_{it}$
Where Y is the DV, X is the IV, $\beta_0$ is the intercept, $\beta_1$ is the coefficient on X and $e_{it}$ is the error.
What are we assuming here?
All of the $N$ units share a common intercept $\beta_0$ and a common slope $\beta_1$ on X.
Standard OLS assumptions regarding the error term (specifically zero conditional mean and white noise errors).
Whether pooling is appropriate depends on whether those assumptions are correct.
## Clean Pool
```{r}
#Monte Carlo simulations to better control the relationships
N<-seq(1,5,1)
Time<-seq(1,50,1)
clean<-expand.grid(N, Time)
names(clean)<-c("N", "Time")
clean<- clean %>%
dplyr::arrange(N, Time)
tibble(clean)
#Draw the IV, DV, error
clean$X<-rnorm(nrow(clean), 1, .25)
clean$Y<-3-2*clean$X+rnorm(nrow(clean), 0, .5)
tibble(clean)
#Run the pooled model
clean_pool<-lm(Y~X, data=clean)
summary(clean_pool)
#Plot to understand why this worked
clean %>%
ggplot(aes(x=X, y=Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
```
Pooled model works here because there is no difference in the underlying relationship - we know this because we drew X without respect to N, Time
## Dirty Pool
Let's assume instead that there exists some unit-level heterogeneity. Some $Z$ affects both $Y$ and $X$ but is time constant.
You may recognize this as the setup for omitted variable bias.
```{r}
#Monte Carlo simulations to better control the relationships
N<-seq(1,5,1)
Time<-seq(1,50,1)
dirty<-expand.grid(N, Time)
names(dirty)<-c("N", "Time")
dirty<- dirty %>%
dplyr::arrange(Time, N)
#Draw the unobserved effect, IV, DV, error
Z<-rnorm(length(unique(dirty$N)), 0, .5)
dirty$Z<-rep(Z, length(Time))
dirty$X<-dirty$Z+rnorm(nrow(dirty), 1, .25)
dirty$Y<-3-2*dirty$X+4*dirty$Z+rnorm(nrow(dirty),0, .5)
#Run the pooled model
dirty_pool<-lm(Y~X, data=dirty)
summary(dirty_pool)
#Plot the pooled model - what are we looking at?
dirty %>%
ggplot(aes(x=X, y=Y))+
geom_point()+
geom_smooth(method="lm", se=F)
dirty %>%
ggplot(aes(x=X, y=Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
#Plot the per unit regressions
dirty %>%
ggplot(aes(x=X, y=Y, color=as.factor(N)))+
geom_point()+
geom_smooth(method="lm", se=F)
```
In this case, pooling estimates a weak positive relationship between X and Y, despite the fact that the true relationship is negative.
This occurs because there is omitted variable bias introduced by Z.
Note that pooling would be fine if we could estimate $Y_{it}=\beta_1X_{it}+\beta_2Z_i+e_{it}$
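Since we simulated $Z$ ourselves, we can check this directly - in real data $Z$ would be unobserved:
```{r}
#Controlling for Z recovers the true slope on X of -2
coef(lm(Y ~ X + Z, data = dirty))
```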
# Fixed Effects
With TSCS data that contains time-constant unit-level omitted variable bias, fixed effects can be used to estimate the unbiased relationship, even if the omitted variable $Z$ is not observed.
```{r}
fe<-plm(Y~X, dirty, index=c("N", "Time"), model="within")
summary(fe)
```
Why does this work?
The fixed effect estimator considers only the within-unit variation, and removes the between unit variation.
In other words, the data are group demeaned - the mean for that unit over time is subtracted from each observation for each *i*.
Because the effect of *Z* is constant over time, demeaning removes its effect and the bias it introduces
We can do this manually:
```{r}
demean<-dirty %>%
dplyr::arrange(N,Time)%>%
dplyr::group_by(N) %>%
dplyr::mutate(mean_X=mean(X),
mean_Y=mean(Y),
demean_X=X-mean_X,
demean_Y=Y-mean_Y)
demean
fe_manual<-lm(demean_Y~demean_X-1, data=demean)
summary(fe_manual)
demean %>%
ggplot(aes(x=demean_X, y=demean_Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
```
The within estimator is typically introduced with dummy variable notation, where N indicators are included, one for each unit (or N-1 plus the intercept). These are equivalent in OLS (not in GLMs). Most software (plm in R; xtreg, fe in Stata) uses the within transform
```{r}
fe_dummy <-
lm(Y~X+as.factor(N)-1, data=dirty)
summary(fe_dummy)
summary(fe_manual)
summary(fe)
```
Note that the standard errors are slightly different because the demeaned model doesn't have the right degrees of freedom. You should use plm() for the within transform as it corrects for this.
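To see what that correction involves, here is a rough by-hand rescaling (a sketch of the adjustment plm applies internally):
```{r}
#The demeaned lm uses df = NT - 1; the within estimator should use NT - N - 1
NT <- nrow(demean)
N_units <- length(unique(demean$N))
naive_se <- coef(summary(fe_manual))["demean_X", "Std. Error"]
corrected_se <- naive_se * sqrt((NT - 1) / (NT - N_units - 1))
c(naive = naive_se, corrected = corrected_se, plm = summary(fe)$coefficients["X", "Std. Error"])
```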
## Interpretation
Interpretation is standard, but with a caveat
```{r}
summary(fe)
sd(dirty$X)
coef(fe)*sd(dirty$X)
sd(demean$demean_X)
coef(fe)*sd(demean$demean_X)
```
Takeaway: Coefficients on Linear FE regressors are correctly interpreted as the effect of a one unit increase in *the within unit* X. The within unit variance is always no larger than the total variance in X (often significantly smaller, here about 1/2). Calculate substantive effects using within X. (@mummolo2018improving)
## Drawbacks
1. No estimates for time-invariant variables
```{r}
time_inv<-dirty
time_inv$K<-ifelse(dirty$N<=2, 1, 0)
time_inv$Y<-3-2*time_inv$X+4*time_inv$Z-3*time_inv$K+rnorm(nrow(time_inv),0, .5)
drop_model<-plm(Y~X+K, model="within", index = c("N", "Time"), data=time_inv)
summary(drop_model)
#Why not?
time_inv %>%
group_by(N) %>%
mutate(var_k=K-mean(K)) %>%
ungroup() %>%
select(var_k) %>%
summary()
```
The FE estimator sweeps out all variation on K, because K only has between-unit variance, not within unit variance.
2. Loss of observations without variation in the DV in either conditional or unconditional logistic regression.
```{r}
data %>%
group_by(ccode) %>%
summarize(var_dem=var(na.omit(vdem_dem)))
logit<-clogit(vdem_dem~LMILEX+strata(ccode), data=data)
summary(logit)
```
A lot has been made of this issue, but does it matter? (@cook2020fixed)
"Feels" problematic that the model ignores a portion of the data. Can be severe if the event in question is rare/common.
Concrete issues:
1. No estimates of FE, so no individual MEs (similar to Cox PH model).
2. Biased estimates of the baseline rate of the event because non-event units are removed (counted below). The correct interpretation is the baseline rate for units that experience the event.
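We can count how many units the conditional logit drops because their DV never varies:
```{r}
#Units with zero variance in vdem_dem contribute nothing to the conditional likelihood
data %>%
  group_by(ccode) %>%
  summarize(varies = var(na.omit(vdem_dem)) > 0) %>%
  count(varies)
```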
### Penalized ML
@cook2020fixed recast the unconditional (unit-dummy) FE problem as one of *separation*.
Separation occurs when a variable perfectly predicts the DV - here the unit dummy does. (Imagine a cross-tabulation of Y and the unit indicator N; if one or more cells are empty, this is separation)
ML solutions *do not exist* under separation - the log-likelihood has no maximum because fit can always be improved by pushing the separating coefficient further towards negative (positive) infinity.
A well-known solution is to penalize the maximum likelihood to punish large coefficients (this may be familiar from ridge/lasso, though the specific penalty here differs.)
```{r}
penalized_fe<-glm(vdem_dem~LMILEX+as.factor(ccode), data=data,
family=binomial(link="logit"), method="brglmFit")
summary(penalized_fe)
```
Note that the unconditional FE estimator is biased due to the incidental parameters problem, which can be severe in small T. Penalizing the ML does not change this.
The incidental parameters problem is just a special case of the well-known small-sample bias of MLE (MLE is consistent, not unbiased, and consistency relies on asymptotics).
As N grows without T, the number of 'incidental parameters' (the unit dummies) grows, so we are using T periods to estimate a greater and greater number of parameters.
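A small simulation of our own (not from the workshop data) makes the bias concrete; the true slope is 1, and the unconditional FE logit with small T typically overstates it:
```{r, warning=F}
keep_seed <- .Random.seed #preserve the tutorial's random stream for later chunks
n_units <- 200
T_small <- 5
ip_sim <- expand.grid(unit = 1:n_units, t = 1:T_small)
alpha_i <- rnorm(n_units) #unit effects
ip_sim$x <- rnorm(nrow(ip_sim))
ip_sim$y <- rbinom(nrow(ip_sim), 1, plogis(alpha_i[ip_sim$unit] + ip_sim$x))
#Unconditional FE logit: one dummy per unit (expect separation warnings for units that never vary)
ip_fit <- glm(y ~ x + as.factor(unit), family = binomial, data = ip_sim)
coef(ip_fit)["x"] #typically noticeably above the true value of 1
.Random.seed <- keep_seed
```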
# Random Effects
What if $Z$ does not cause $X$?
There's no OVB (recall necessary conditions)
FE estimator is still unbiased
Pooled estimator unbiased, but incorrect SE/t-stats because unit effect creates serial correlation in errors
```{r}
#Make some data to better control the relationships
N<-seq(1,5,1)
Time<-seq(1,50,1)
random<-expand.grid(N, Time)
names(random)<-c("N", "Time")
random<- random %>%
dplyr::arrange(Time, N)
#Draw the unobserved effect, IV, DV, error
Z<-rnorm(length(unique(random$N)), -2, .5)
random$Z<-rep(Z, length(Time))
random$X<-rnorm(nrow(random), 1, .25)
random$Y<-3-2*random$X+4*random$Z+rnorm(nrow(random),0, .5)
#Run the pooled model
random_pool<-lm(Y~X, data=random)
summary(random_pool)
#Plot the pooled model - what are we looking at?
random %>%
ggplot(aes(x=X, y=Y))+
geom_point(aes(color=as.factor(N)))+
geom_smooth(method="lm", se=F)
#Plot unit intercepts/slopes
random %>%
ggplot(aes(x=X, y=Y, color=as.factor(N)))+
geom_point()+
geom_smooth(method="lm", se=F)
#RE estimator
re<-plm(Y~X, random, model = "random")
summary(re)
```
## Pooled model has incorrect variance
```{r}
#True error
random$error <-random$Y-(3-2*random$X+4*random$Z)
#Get estimated error from pooled model
random$resid<-resid(random_pool)
random %>%
dplyr::select(error, resid) %>%
pivot_longer(cols = everything()) %>%
ggplot()+
geom_histogram(aes(value, fill=name), alpha=.2)
e_resid = sum((random$resid)^2)/248 #RSS/(n-2): 250 observations, 2 estimated coefficients
e_error = sum((random$error)^2)/250 #true error is known, so divide by n
e_resid
e_error
# How does this affect beta se and t-stats?
sqrt(e_resid / sum((random$X - mean(random$X))^2))
sqrt(e_error / sum((random$X - mean(random$X))^2))
summary(random_pool)
summary(re)
```
We can see that our SEs (and by extension t-stats) for the pooled model are incorrect. This is because our estimate of $\sigma^2$ (variance of the error term) is wrong because the pooled estimator includes the effect of $Z$ on $Y$ in the variance.
Random effects purge the variance contributed by $Z$ and therefore return SE that match SE calculated with the known error term.
### Autocorrelation robust standard errors
Since the pooled model is unbiased, we can simply correct the standard error estimates. Here using cluster-robust SE.
However, cluster-robust SE are *asymptotically* correct (in the number of units $N$ *not* the number of observations). Our *N* is 5, which is...not asymptotic.
```{r}
coeftest(random_pool, vcov=sandwich::vcovCL(random_pool, cluster=random$N, type="HC0")) #SEs clustered by unit
```
These look very similar to our uncorrected SEs.
Make the data bigger.
```{r}
#Make some data to better control the relationships
N_large<-seq(1,5000,1)
Time<-seq(1,50,1)
random_large<-expand.grid(N_large, Time)
names(random_large)<-c("N", "Time")
random_large<- random_large %>%
dplyr::arrange(Time, N)
#Draw the unobserved effect, IV, DV, error
Z_large<-rnorm(length(unique(random_large$N)), -2, .5)
random_large$Z<-rep(Z_large, length(Time))
random_large$X<-rnorm(nrow(random_large), 1, .25)
random_large$Y<-3-2*random_large$X+4*random_large$Z+rnorm(nrow(random_large),0, .5)
#Run the pooled model
random_pool_large<-lm(Y~X, data=random_large)
coeftest(random_pool_large, vcov=sandwich::vcovCL(random_pool_large, cluster=random_large$N, type="HC0"))
large_e_error <- sum((random_large$Y - (3 - 2*random_large$X + 4*random_large$Z))^2)/nrow(random_large)
sqrt(large_e_error / sum((random_large$X - mean(random_large$X))^2))
```
## Testing the validity of the RE assumption
In order to be unbiased, the Random Effects estimator requires that there is no correlation between the unit effects and the explanatory variables.
This assumption is typically tested with a Hausman test comparing an FE and RE model with the same specification
```{r}
#Fit a FE model to the same data
fe<-plm(Y~X, random, model = "within")
#Hausman Test
phtest(fe, re)
#Note, order of the arguments does not matter
#Can also calculate the Hausman statistic by hand
t(coef(fe) - coef(re)[2])%*% solve(vcov(fe)-vcov(re)[2,2]) %*% (coef(fe)-coef(re)[2])
```
The null hypothesis of the Hausman test is (roughly) that the coefficient vectors are equal in both the FE and RE models. Rejection indicates the two vectors are statistically different. Rejection is typically taken to suggest violation of the RE assumption.
Recall that both the FE and RE models are unbiased under the RE assumptions (the former is not efficient). Thus, if the RE assumptions hold, we would not expect a difference. In contrast, if the RE assumption regarding the correlation of the individual effects with the independent variables is violated, only the FE estimator is unbiased and the RE estimator is biased. Therefore, we would expect differences in the estimates.
# The Within-Between Model
Rather than choosing between fixed and random effects, we can estimate a compound model.
Recall that the goal of both the FE and RE models is to account for unit-level heterogeneity. The fundamental difference is that the RE model requires no covariance between the unit effects and the regressors. The FE model is agnostic on this because it eliminates all variation on all variables due to unit effects via the within transform (subtracting the unit means from each realization of $Y$, $X$)
The risk of the RE model, then, can be thought of as correlation between these unit means and the error term. As with standard OVB, putting the unit means into the equation will remove the bias and allow the RE estimator to recover the unbiased coefficients on $X$.
The traditional approach comes from @mundlak1978pooling and involves estimating the following equation with the RE estimator:
$Y_{it}=\beta_1X_{it}+\beta_2\overline{X_i}+e_{it}$
Where $\overline{X_i}$ is the unit level mean of $X$
Alternatively, the time-specific realizations of $X$ can be within-transformed prior to estimation.
$Y_{it}=\beta_1(X_{it}-\overline{X_i})+\beta_2\overline{X_i}+e_{it}$
The specific model you should use depends on which coefficients are more of interest: the individual or aggregate. The former (Mundlak) specification facilitates interpretation of the aggregate (group mean) variable.
The group demeaning in the latter approach ensures that the individual realizations are uncorrelated with the group-level mean. As a result, the group-level term cannot be interpreted, because it does not control for the effect of the individual realizations.
```{r}
#We'll fit the second equation to our RE data
#First, we have to group demean
demean_random<-random %>%
dplyr::arrange(N,Time)%>%
group_by(N) %>%
mutate(mean_X=mean(X),
demean_X=X-mean_X)
# Now include both X terms in the RE model
rebw<-plm(Y~demean_X + mean_X, demean_random, model = "random")
coef(rebw)
coef(fe)
```
Notice that the coefficient on the within effect from the REBW model is the same as the FE estimate.
We can also estimate the Mundlak version
```{r}
#Mundlak formulation
mundlak<-plm(Y~X + mean_X, demean_random, model = "random")
coef(mundlak)
```
Again, we've successfully purged the bias from the coefficient on the individual realizations of $X$. Our coefficient on $\overline{X}$ has changed: the REWB estimate was -5.2, which is the sum of the Mundlak within and mean coefficients of -2 and -3.2
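We can verify that identity directly from the two fitted models:
```{r}
#REWB coefficient on the unit mean equals the sum of the Mundlak coefficients
coef(rebw)["mean_X"]
coef(mundlak)["X"] + coef(mundlak)["mean_X"]
```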
# Dynamic Models
```{r}
#Check how our data look
data %>%
dplyr::select(c(ccode, year, LMILEX, v2x_polyarchy)) %>%
filter(ccode<100) %>%
pivot_longer(cols = c(LMILEX, v2x_polyarchy), values_to = "values")%>%
ggplot(aes(y=values, x=year, color=as.factor(ccode)))+
geom_line()+
facet_wrap(~name, scales="free_y")
```
## Panel Unit Root Testing
### Single TS Unit Root Testing
Many time series are persistent - current values depend on past values.
Formally, we can define an autoregressive process of order one AR(1) series as one that can be written
$$Y_t = \rho_1Y_{t-1}+\epsilon_t$$
where $\epsilon_t$ is iid. Order one means that $Y$ is only a function of one past realization.
$Y$ is *stationary* if $|\rho_1| < 1$. Informally, this means the correlations between $Y$ and its past realizations are asymptotically 0. Over a long enough time period, the past is irrelevant.
When $|\rho_1| = 1$, $Y$ is said to have a *unit root.* We also call such a series *integrated* of order one, or I(1). An I(1) series is only one type of unit root process, and determining which kind a series has is difficult.
Such series are problematic for modeling. First, correlations with the past do not die out, even asymptotically. Second, unit root series have increasing variance over time, meaning the central limit theorem does not hold (test statistics are no longer asymptotically normal), so the usual hypothesis tests are invalid. Third, regressions with non-stationary variables can produce spurious inferences - two independent integrated series will appear correlated due to integration, not a causal relationship.
Let's see:
```{r}
errors <- rnorm(1001,0,1)
unit_root <- cumsum(errors)
unit_root_data <- data.frame(error=errors, unit_root=unit_root, time=seq(1,1001,1))
unit_root_data <- unit_root_data %>%
mutate(stationary1 = as.numeric(arima.sim(list(ar=.5), n=n())), #AR(1) with rho = .5
stationary2 = as.numeric(arima.sim(list(ar=.9), n=n())), #AR(1) with rho = .9
first_diff = unit_root - dplyr::lag(unit_root))
unit_root_data %>%
dplyr::select(-error) %>%
pivot_longer(-time) %>%
ggplot(aes(y=value, x=time, color=name))+
geom_line()
```
You can see that over a long enough time period the persistence of the series is irrelevant - it's still mean reverting. In contrast, the unit root gets "stuck" at high or low values and stays there. In real data, the distinction is typically less obvious.
I(1) series are *difference stationary* meaning that if we subtract the prior value, the series is stationary.
$$Y_t = \rho_1Y_{t-1}+\epsilon_t$$
$$Y_t = Y_{t-1}+\epsilon_t$$
$$Y_t - Y_{t-1} = \epsilon_t$$
Typically we denote a differenced series with $\Delta$, e.g. $\Delta Y$
#### Augmented Dickey-Fuller Test
If a series has a unit root, then $\rho_1 = 1$. We could formally test this with a regression of $Y_t$ against its past values, and use a hypothesis test on the coefficient. This is the basis of the Dickey-Fuller test. However, we use different critical values since the central limit theorem is violated by a unit root, and we typically use the Augmented Dickey-Fuller test, which includes additional lags to clean up residual serial correlation. Note that the function we're using automatically determines these based on series length.
```{r}
apply(unit_root_data[, c("unit_root", "stationary1", "stationary2", "first_diff")], 2, function(x) tseries::adf.test(na.omit(x)))
```
As noted by the test, the null hypothesis is that the series has a unit root. Thus at a small p-value we reject the null and treat the series as stationary.
### Back to Panels
Panel unit root testing is difficult and R implementations are somewhat basic
We can perform our own test via the p-value aggregation method of @choi2001unit
We perform an augmented Dickey-Fuller test on each panel, aggregate the p-values
Specific aggregation is $-2\sum_{n=1}^{N} \ln(p_n)$
Test statistic is distributed $\chi^2(2N)$
N here indicates number of panels
We're going to loop over the panels, perform the test, store the results
(You could also do this as a vectorized function...)
```{r, warning=F, message=F}
pvals <- NULL
for (i in unique(data$ccode)){
#tryCatch returns NULL when the test fails (e.g., too few observations); record NA for that panel
dfuller <- tryCatch(adf.test(na.omit(data[data$ccode==i, "LMILEX"]), k = 1),
error = function(e) NULL)
pvals <- rbind(pvals, ifelse(is.null(dfuller), NA, dfuller$p.value))
}
head(pvals)
#Choi's P-value
test_stat<- -2*sum(log(na.omit(pvals)))
test_stat
#That's going to reject, but let's get a p-value for completeness
#Test is distributed chi2(2N) where N is panels
round(pchisq(test_stat, df=2*nrow(na.omit(pvals)), lower.tail = F), digits=3)
```
In the p-value aggregation case, we are testing a specific null hypothesis that all the panels have a unit root against the alternative that at least one is stationary.
That might not be the most reassuring rejection; other tests have different nulls.
## First Differencing
The panel ADF didn't indicate a need for differencing, but let's difference anyway to illustrate.
(Usual caveats here; first differencing is only appropriate for I(1) series, you need to test the differenced series for stationarity, etc.)
```{r}
#Difference these out to be safe
fd<-pdata.frame(data, index = c("ccode", "year"))
fd$d_lmilex=diff(fd$LMILEX)
fd$d_polyarchy=diff(fd$v2x_polyarchy)
fd %>%
dplyr::select(c(ccode, year, d_lmilex, d_polyarchy)) %>%
filter(ccode %in% seq(2,99,1)) %>%
pivot_longer(cols = c(d_lmilex, d_polyarchy), values_to = "values")%>%
ggplot(aes(y=values, x=year, color=as.factor(ccode)))+
geom_point()+
facet_wrap(~name, scales="free_y")
```
## Static
More details in @de2008taking
```{r}
static<-lm(d_lmilex~d_polyarchy, data=fd)
summary(static)
#Illustrate the impact of 1 SD change in polyarchy (temporary)
time<-seq(1,10,1)
dem<-c(rep(0,2),
sd(na.omit(fd$d_polyarchy)),
rep(0, 7))
pred_dat<-cbind.data.frame(time,dem)
pred_dat$milex_hat<-pred_dat$dem*coef(static)[2]
pred_dat %>%
ggplot(aes(x=time, y=milex_hat))+
geom_line()
```
If the regression equation is $Y_t=\beta_0+\beta_1X_t+\epsilon_t$
The effect of $X_t$ on $Y_t$ is $\frac{\delta Y_t}{\delta X_t}$ which is $\beta_1$
Moving forward one period: $Y_{t+1}=\beta_0+\beta_1X_{t+1}+\epsilon_{t+1}$
The derivative of that regression with respect to $X_t$ is zero.
## Finite Lag
```{r}
fd$lag_polyarchy=lag(fd$d_polyarchy)
fd$lag_lmilex=lag(fd$d_lmilex)
lag1<-lm(d_lmilex~d_polyarchy+lag_polyarchy, data=fd)
summary(lag1)
pred_dat<-pred_dat %>%
mutate(lag_dem=dplyr::lag(dem))
pred_dat$milex_hat<-pred_dat$dem*coef(lag1)[2]+pred_dat$lag_dem*coef(lag1)[3]
pred_dat %>%
ggplot(aes(x=time, y=milex_hat))+
geom_line()
```
If the regression equation is $Y_t=\beta_0+\beta_1X_t+\beta_2X_{t-1}+\epsilon_t$
The effect of $X_t$ on $Y_t$ is $\frac{\delta}{\delta X} \beta_0+\beta_1X_t+\beta_2X_{t-1}+\epsilon_t$ which is $\beta_1$
Moving forward one period: $Y_{t+1}=\beta_0+\beta_1X_{t+1}+\beta_2X_t+\epsilon_{t+1}$
The derivative of that regression with respect to $X_t$ is $\beta_2$
Derivative in periods beyond $t+1$ is zero as before.
## Lagged Dependent Variable Models
```{r}
ldv<-lm(d_lmilex~lag_lmilex+d_polyarchy, data=fd)
summary(ldv)
pred_dat[1:3,]$milex_hat<-coef(ldv)[3]*pred_dat[1:3,]$dem
pred_dat[4:10,]$milex_hat<-coef(ldv)[2]^(pred_dat[4:10,]$time-3)*coef(ldv)[3]*pred_dat[3,"dem"]
pred_dat %>%
ggplot(aes(x=time, y=milex_hat))+
geom_line()
#What about a permanent shift?
coef(ldv)
lrm<-coef(ldv)[3]*sd(na.omit(fd$d_polyarchy))/(1-coef(ldv)[2])
lrm
```
In the case of a temporary shock to X, the effect decays geometrically - a restriction forced by the functional form of the model
To see that, let's consider the effect of $X_t$ at time $t+1$ and $t+2$
The general equation $Y_{t}=\beta_0+\alpha Y_{t-1}+\beta_1X_{t}+\epsilon_{t}$
At $t+1$ we have
$Y_{t+1}=\beta_0+\alpha Y_{t}+\beta_1X_{t+1}+\epsilon_{t+1}$
Substitute for $Y_t$ - drop out the error terms, constant for ease
$Y_{t+1}=\alpha(\alpha Y_{t-1}+\beta_1X_{t})+\beta_1X_{t+1}$
First derivative with respect to $X_t$ is $\alpha\times\beta_1$
This makes intuitive sense - $X_t$ affects $Y_t$ by $\beta_1$. In turn, $Y_t$ affects $Y_{t+1}$ by $\alpha$, the extent to which $Y$ is serially correlated.
Similarly at $t+2$
$Y_{t+2}=\beta_0+\alpha Y_{t+1}+\beta_1X_{t+2}+\epsilon_{t+2}$
Substitute for $Y_{t+1}$
$Y_{t+2}=\beta_0+\alpha (\alpha Y_{t}+\beta_1X_{t+1})+\beta_1X_{t+2}+\epsilon_{t+2}$
And again for $Y_{t}$
$Y_{t+2}=\beta_0+\alpha [\alpha(\alpha Y_{t-1}+\beta_1X_{t})+\beta_1X_{t+1}]+\beta_1X_{t+2}+\epsilon_{t+2}$
This is kind of ugly, but anything that's not multiplied by $X_t$ can go away, which is basically everything
We're left with $\alpha^2 \times \beta_1$
This generalizes: the effect of $X_t$ at any arbitrary lead into the future is $\alpha^d \times \beta_1$, where $d$ is the number of leads.
Note that the effect of $X_t$ decays because of the requirement that $\vert \alpha \vert < 1$ - otherwise the effect does not decay at all ($= 1$) or increases in time ($>1$).
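Tracing these dynamic multipliers with the estimated coefficients from above:
```{r}
#Dynamic multipliers alpha^d * beta1 for leads d = 0,...,9
alpha <- coef(ldv)["lag_lmilex"]
beta1 <- coef(ldv)["d_polyarchy"]
round(alpha^(0:9) * beta1, 4)
#The long-run multiplier is the geometric sum beta1/(1 - alpha), as computed above
```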
### Serial correlation in error term
Haven't made too much of this until now.
In general, serial correlation in the error is not a big issue (no bias / inconsistency).
We do get bad SE/t-stats under serial correlation.
Two approaches:
1. Just use OLS and clean up the SEs to be robust to serial correlation (HAC SEs, PCSEs) - see the sketch after this list
2. Try to better model the dynamics to eliminate serial correlation - for example with deeper lags of $Y$
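A minimal sketch of approach 1, refitting the LDV model as a pooled plm model so we can use plm's Arellano-type covariance, which is robust to within-unit serial correlation:
```{r}
ldv_pool_hac <- plm(d_lmilex ~ lag_lmilex + d_polyarchy, data = fd, model = "pooling")
coeftest(ldv_pool_hac, vcov = vcovHC(ldv_pool_hac, method = "arellano", type = "HC0"))
```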
But, with an LDV, serial correlation in the error creates bias. (@achen2000lagged, @keele2006dynamic)
Need to be careful to ensure no serial correlation in error term of the model.
```{r}
pbgtest(d_lmilex~lag_lmilex+d_polyarchy, data=fd, model="pooling", order=1)
ldv<-plm(d_lmilex~lag_lmilex+d_polyarchy, data=fd, model="pooling")
res_df <- data.frame(residual = residuals(ldv), attr(residuals(ldv), "index"))
res_df<- res_df %>%
left_join(fd, by=c("ccode", "year"))
res_df<-plm::pdata.frame(res_df, index=c("ccode", "year"))
res_df$lag_res <- lag(res_df$residual) #convert to pdata.frame first so the lag respects panel boundaries
wooldridge <- plm(residual~lag_res, model="pooling", data=res_df)