Panel Data Regression: Fixed and Random Effects

Goal of today

We learn how to estimate panel data models with fixed effects and random effects.

The source code for this lab can be found here.

The data used for this lab can be found here.

Preparation

Load packages

library(plm)

If you receive an error message saying “Error in library(plm) : there is no package called ‘plm’”, delete the pound sign in the following line and execute it. Otherwise, move on.

# install.packages("plm", dependencies = TRUE)

rm(list = ls())       # delete everything in the memory
library(foreign)      # for read.dta
library(ggplot2)      # for graphics
library(stargazer)    # Regression table
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(effects)      # for effect
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(plm)          # for plm
# remotes::install_github("jhelvy/renderthis")
# remotes::install_github('rstudio/chromote')

Age and life satisfaction

Load the beatles data.

bt <- read.dta("beatles.dta")
# Take a look:
bt
##    year id   name age lsat
## 1  1968  1   John  28    8
## 2  1969  1   John  29    6
## 3  1970  1   John  30    5
## 4  1968  2   Paul  26    5
## 5  1969  2   Paul  27    2
## 6  1970  2   Paul  28    1
## 7  1968  3 George  25    4
## 8  1969  3 George  26    3
## 9  1970  3 George  27    1
## 10 1968  4  Ringo  28    9
## 11 1969  4  Ringo  29    8
## 12 1970  4  Ringo  30    6

In a hypothetical survey, we asked four people (John, Paul, George, and Ringo) to rate how satisfied they are with their lives on a scale from 0 to 10. We conducted the survey three times (1968, 1969, and 1970). The variable lsat records each respondent’s answer in each year; this is our DV. The other variables should be self-explanatory.
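As a quick check (a minimal sketch using base R), we can confirm that this is a balanced panel: every person appears exactly once in every year.

# cross-tabulate persons and years; a balanced panel has a 1 in every cell
with(bt, table(name, year))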

Scatterplot ignoring the panel structure

g <- ggplot(bt, aes(x = age, y = lsat)) + geom_point()
g <- g + xlab("Age") + ylab("Life satisfaction") + theme_bw()
g

It “appears” that there is a positive relationship between age and life satisfaction, implying that, as you get older, you tend to be more satisfied with life (to put it more accurately, you say you are more satisfied with life).
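To put a rough number on this apparent association, we can compute the pooled correlation between age and lsat (a minimal check that, like the scatterplot, ignores the panel structure).

# pooled correlation between age and lsat, ignoring who gave which answer
cor(bt$age, bt$lsat)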

Now to add a regression line to the scatter plot, we use the geom_smooth function, as follows.

g <- g + geom_smooth(method = lm, alpha = 0)
g
## `geom_smooth()` using formula = 'y ~ x'

  • “method = lm” means that we ask R to fit a straight line rather than a curve.
  • “alpha = 0” means that we ask R to make the confidence band completely transparent. You can tweak these arguments to see what happens.

g <- g + geom_smooth(method = lm, alpha = 0.5)
g
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

As we saw in the lecture, this apparent positive effect of age is just an artifact caused by ignoring the panel structure.

In order to create a scatterplot where data points for different people are given different colors, we modify the aes argument as follows.

g <- ggplot(bt, aes(x = age, y = lsat, colour = name)) + geom_point()
g <- g + xlab("Age") + ylab("Life satisfaction") + theme_bw()
g

# "colour = name" (NB: British spelling) means that we ask R to use different colors depending on the value of the "name" variable.

We have also learned how to insert text labels into a graph.

g <- ggplot(bt, aes(x = age, y = lsat, colour=name)) + geom_point()
g <- g + geom_text(aes(y = lsat-.5, label = name)) + theme_bw()
g <- g + xlab("Age") + ylab("Life satisfaction")
g <- g + theme(legend.position = "none")
g

We can clearly see that age actually has a negative, not positive, effect on lsat if we make a within-unit (over time) comparison.
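One way to see this numerically is to subtract each person’s own mean from both variables and regress the demeaned outcome on the demeaned predictor. This is only a minimal sketch of the “within” transformation that the fixed-effects estimator below performs automatically (the .dm column names are just illustrative); the slope matches the FE estimate, but the standard error does not, because lm does not adjust the degrees of freedom for the demeaning.

# demean age and lsat within each person (ave() computes group means)
bt$age.dm  <- bt$age  - ave(bt$age,  bt$name)
bt$lsat.dm <- bt$lsat - ave(bt$lsat, bt$name)
# OLS on the demeaned data reproduces the within (FE) slope
summary(lm(lsat.dm ~ age.dm, data = bt))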

Panel data regression: The plm package

The first step in using the plm package is to declare that the data set you have is a panel data set. We do this by using the pdata.frame function.

As we learned in the lecture, we supply the information about the cross-sectional unit and the time unit using the index argument.

bt.panel <- pdata.frame(bt, index = c("name", "year") )
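As a quick check (a minimal sketch using plm’s pdim function), we can verify how the panel was declared:

# report the panel dimensions: individuals, time periods, and observations
pdim(bt.panel)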

Now we are ready to estimate models for panel data.

Model 1: Pooled model

fit.bt.pool <- plm(lsat ~ age, data = bt.panel, model = "pooling")

stargazer(fit.bt.pool, type = "text")
## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         lsat            
## ----------------------------------------
## age                     0.690           
##                        (0.491)          
##                                         
## Constant               -14.322          
##                       (13.656)          
##                                         
## ----------------------------------------
## Observations             12             
## R2                      0.165           
## Adjusted R2             0.081           
## F Statistic      1.973 (df = 1; 10)     
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

As we learned in the lecture, the results should be identical to the simple linear regression results we would obtain using the familiar lm function.

fit.bt.slr <- lm(lsat ~ age, data = bt)

stargazer(fit.bt.pool, fit.bt.slr, type = "text")
## 
## =====================================================
##                              Dependent variable:     
##                          ----------------------------
##                                      lsat            
##                             panel           OLS      
##                             linear                   
##                              (1)            (2)      
## -----------------------------------------------------
## age                         0.690          0.690     
##                            (0.491)        (0.491)    
##                                                      
## Constant                   -14.322        -14.322    
##                            (13.656)      (13.656)    
##                                                      
## -----------------------------------------------------
## Observations                  12            12       
## R2                          0.165          0.165     
## Adjusted R2                 0.081          0.081     
## Residual Std. Error                   2.612 (df = 10)
## F Statistic (df = 1; 10)    1.973          1.973     
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01
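As a quick numeric check (a minimal sketch using the objects defined above), we can confirm that the two sets of estimates are the same:

# the pooled plm estimates should equal the OLS estimates from lm
coef(fit.bt.pool)
coef(fit.bt.slr)
all.equal(coef(fit.bt.pool), coef(fit.bt.slr))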

Model 2: FE model

To estimate a fixed-effects (FE) model, also known as a “within” model, we replace “pooling” with “within”, as follows.

fit.bt.fe <- plm(lsat ~ age, data = bt.panel, model = "within")

stargazer(fit.bt.pool, fit.bt.fe, type = "text")
## 
## =====================================================
##                        Dependent variable:           
##              ----------------------------------------
##                                lsat                  
##                     (1)                  (2)         
## -----------------------------------------------------
## age                0.690              -1.625***      
##                   (0.491)              (0.166)       
##                                                      
## Constant          -14.322                            
##                   (13.656)                           
##                                                      
## -----------------------------------------------------
## Observations         12                  12          
## R2                 0.165                0.932        
## Adjusted R2        0.081                0.893        
## F Statistic  1.973 (df = 1; 10) 95.919*** (df = 1; 7)
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01

The estimated coefficient for age is now negative and significant, as we would expect from the revised scatterplot.

We could obtain the same estimate of the effect of age (-1.625) by manually including individual-specific intercepts (fixed-effects dummies).

bt $ John <- ifelse(bt $ name == "John", 1, 0) # create dummies to account for the grouping structure in the data, when we examine lsat ~ age
bt $ Paul <- ifelse(bt $ name == "Paul", 1, 0)
bt $ George <- ifelse(bt $ name == "George", 1, 0)
bt $ Ringo <- ifelse(bt $ name == "Ringo", 1, 0)

# model with four dummies and no intercept
fit.bt.fe.2 <- lm(lsat ~ age + John + Paul + Ringo + George + 0, 
  data = bt)

# model with an intercept: all four dummies are entered, but R drops one
# of them (George) because of perfect collinearity with the intercept
fit.bt.fe.3 <- lm(lsat ~ age + John + Paul + Ringo + George, 
  data = bt)

stargazer(fit.bt.fe, fit.bt.fe.2, fit.bt.fe.3, type = "text")
## 
## ===============================================================================================
##                                                     Dependent variable:                        
##                              ------------------------------------------------------------------
##                                                             lsat                               
##                                      panel                             OLS                     
##                                     linear                                                     
##                                       (1)                   (2)                    (3)         
## -----------------------------------------------------------------------------------------------
## age                                -1.625***             -1.625***              -1.625***      
##                                     (0.166)               (0.166)                (0.166)       
##                                                                                                
## John                                                     53.458***              8.542***       
##                                                           (4.819)                (0.628)       
##                                                                                                
## Paul                                                     46.542***              1.625***       
##                                                           (4.488)                (0.418)       
##                                                                                                
## Ringo                                                    54.792***              9.875***       
##                                                           (4.819)                (0.628)       
##                                                                                                
## George                                                   44.917***                             
##                                                           (4.322)                              
##                                                                                                
## Constant                                                                        44.917***      
##                                                                                  (4.322)       
##                                                                                                
## -----------------------------------------------------------------------------------------------
## Observations                          12                     12                    12          
## R2                                   0.932                 0.996                  0.981        
## Adjusted R2                          0.893                 0.993                  0.970        
## Residual Std. Error (df = 7)                               0.469                  0.469        
## F Statistic                  95.919*** (df = 1; 7) 327.335*** (df = 5; 7) 90.953*** (df = 4; 7)
## ===============================================================================================
## Note:                                                               *p<0.1; **p<0.05; ***p<0.01
stargazer(fit.bt.fe, fit.bt.fe.2, fit.bt.fe.3, type = "text", omit.stat = c("f"))
## 
## ==========================================================
##                                   Dependent variable:     
##                              -----------------------------
##                                          lsat             
##                                panel           OLS        
##                               linear                      
##                                 (1)       (2)       (3)   
## ----------------------------------------------------------
## age                          -1.625*** -1.625*** -1.625***
##                               (0.166)   (0.166)   (0.166) 
##                                                           
## John                                   53.458*** 8.542*** 
##                                         (4.819)   (0.628) 
##                                                           
## Paul                                   46.542*** 1.625*** 
##                                         (4.488)   (0.418) 
##                                                           
## Ringo                                  54.792*** 9.875*** 
##                                         (4.819)   (0.628) 
##                                                           
## George                                 44.917***          
##                                         (4.322)           
##                                                           
## Constant                                         44.917***
##                                                   (4.322) 
##                                                           
## ----------------------------------------------------------
## Observations                    12        12        12    
## R2                             0.932     0.996     0.981  
## Adjusted R2                    0.893     0.993     0.970  
## Residual Std. Error (df = 7)             0.469     0.469  
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

These three results are just three different ways of representing a single model.

(The R^2 and adjusted R^2 values differ across the columns because plm reports the R^2 of the within (demeaned) regression, while lm reports the overall R^2 with the dummies included; model (2), which has no intercept, also computes R^2 without centering, which inflates it further.)
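We can also recover the individual-specific intercepts implied by the within model with plm’s fixef function; a minimal sketch (the values should match the John/Paul/George/Ringo coefficients in model (2)):

# individual intercepts absorbed by the within (FE) estimator
fixef(fit.bt.fe)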

The fixest package

These days, the feols function from the fixest package has become more popular.

library(fixest)

fit.bt.fe.4 <- feols(lsat ~ age | name, data = bt.panel, panel.id = ~ name)

feols estimates the fixed effects and, by default, clusters the standard errors by the fixed-effects variable (here, name), which makes the inference more robust.

Unfortunately, stargazer does not support fixest objects, so we need to use another package (here, texreg) to export our results.

library(texreg)

screenreg(list(fit.bt.fe.4),
          caption = "FE-OLS models with clustered standard errors at individual level.", 
       label = "tab:ols",
       digits = 4,
       caption.above = T,
       include.nobs = TRUE,
       include.ci = TRUE, 
       include.rsquared = TRUE,
       include.adjrs = TRUE,
       include.rmse = TRUE)
## 
## =================================
##                        Model 1   
## ---------------------------------
## age                    -1.6250 **
##                        (0.1311)  
## ---------------------------------
## Num. obs.              12        
## Num. groups: name       4        
## R^2 (full model)        0.9811   
## R^2 (proj model)        0.9320   
## Adj. R^2 (full model)   0.9703   
## Adj. R^2 (proj model)   0.9223   
## =================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The coefficient estimate is the same, with a slightly smaller (clustered) standard error.
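If we want conventional standard errors comparable to the plm within model, summary accepts a vcov argument; a minimal sketch (the clustered default is shown first for comparison):

# default: standard errors clustered by name
summary(fit.bt.fe.4)
# conventional (iid) standard errors, comparable to the plm within model
summary(fit.bt.fe.4, vcov = "iid")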

Model 3: RE model

To estimate a random-effects (RE) model, we simply replace “within” with “random”, as follows.

fit.bt.re <- plm(lsat ~ age, data = bt.panel, model = "random")

stargazer(fit.bt.pool, fit.bt.fe, fit.bt.re, type = "text")
## 
## ===============================================================
##                             Dependent variable:                
##              --------------------------------------------------
##                                     lsat                       
##                     (1)                  (2)             (3)   
## ---------------------------------------------------------------
## age                0.690              -1.625***       -1.172***
##                   (0.491)              (0.166)         (0.383) 
##                                                                
## Constant          -14.322                             37.364***
##                   (13.656)                            (10.724) 
##                                                                
## ---------------------------------------------------------------
## Observations         12                  12              12    
## R2                 0.165                0.932           0.483  
## Adjusted R2        0.081                0.893           0.431  
## F Statistic  1.973 (df = 1; 10) 95.919*** (df = 1; 7) 9.346*** 
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01

Tests

To test whether FE is necessary or the pooled model is enough, we perform an F test for individual effects.

pFtest(fit.bt.fe, fit.bt.pool)
## 
## 	F test for individual effects
## 
## data:  lsat ~ age
## F = 100.9, df1 = 3, df2 = 7, p-value = 4.006e-06
## alternative hypothesis: significant effects

The p-value is very small, so we reject the null hypothesis of no individual effects: the FE model fits the data much better than the pooled model.

To test whether FE is necessary or RE is enough, we perform a “Hausman test” comparing the FE and RE models.

phtest(fit.bt.fe, fit.bt.re)
## 
## 	Hausman Test
## 
## data:  lsat ~ age
## chisq = 1.7151, df = 1, p-value = 0.1903
## alternative hypothesis: one model is inconsistent

The p-value is greater than the conventional threshold, so we fail to reject the null hypothesis that the individual effects are uncorrelated with the regressors, i.e., that both the FE and RE estimators are consistent (and, as the F test showed, both are much better than the pooled model). This means that we can use the random-effects model (which is more efficient) rather than the fixed-effects model (which is more flexible but less efficient) or the pooled model.

  • Efficiency: The RE model is typically more efficient (i.e., has smaller standard errors) when the individual-specific effects are uncorrelated with the independent variables.

  • Flexibility: The FE model is more flexible as it does not assume uncorrelated effects, but it is usually less efficient because it uses up more degrees of freedom by estimating effects for each cross-sectional unit.

Of course, looking at the regression table, the RE model also confirms the relationship we uncovered with the FE model: the effect of age on lsat is negative and significant.