Regression Discontinuity Design
Goal of today
We learn how to run logit regression and present substantive effect plots.
The source code for this lab can be found here.
The data used for this lab can be found here.
rm(list = ls()) # delete everything in the memory
cat("\014") # clear your Console
library(foreign) # for read.dta
library(ggplot2) # for graphics
library(stargazer) # Regression table
library(effects) # effect plots
library(sjPlot) # effect plots
library(gridExtra) # for plots
Load Titanic data
td <- read.dta("titanic3.dta")
summary(td)
## survived name pclass age
## Min. :0.000 Length:1309 Min. :1.000 Min. : 0.1667
## 1st Qu.:0.000 Class :character 1st Qu.:2.000 1st Qu.:21.0000
## Median :0.000 Mode :character Median :3.000 Median :28.0000
## Mean :0.382 Mean :2.295 Mean :29.8811
## 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:39.0000
## Max. :1.000 Max. :3.000 Max. :80.0000
## NA's :263
## child old female sibsp parch
## Adult:931 Min. :0.0000 Male :843 Min. :0.0000 Min. :0.000
## Child:115 1st Qu.:0.0000 Female:466 1st Qu.:0.0000 1st Qu.:0.000
## NA's :263 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.1052 Mean :0.4989 Mean :0.385
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :8.0000 Max. :9.000
## NA's :263
## alone fare cherbourg queenstown
## Min. :0.0000 Min. : 0.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.: 7.896 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.0000 Median : 14.454 Median :0.0000 Median :0.00000
## Mean :0.6035 Mean : 33.294 Mean :0.2066 Mean :0.09411
## 3rd Qu.:1.0000 3rd Qu.: 31.275 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :512.000 Max. :1.0000 Max. :1.00000
## NA's :1 NA's :2 NA's :2
## southampton
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6993
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :2
names(td)
## [1] "survived" "name" "pclass" "age" "child"
## [6] "old" "female" "sibsp" "parch" "alone"
## [11] "fare" "cherbourg" "queenstown" "southampton"
# Codebook:
#
# survived Passenger survived
# name Name of passenger
# pclass Passenger class
# age Age of passenger
# child Child (< 16 years old)
# old Old passenger (>= 50 years old)
# female Female passenger
# sibsp Number of siblings and spouses aboard
# parch Number of parents and children aboard
# alone Passenger travelled alone
# fare Passenger fare (in Pre-1970 British Pounds)
# cherbourg Embarked at Cherbourg (France)
# queenstown Embarked at Queenstown (Ireland)
# southampton Embarked at Southampton (UK)
Note: No passenger embarked at ports other than Cherbourg, Queenstown, or Southampton. That is, these three dummy variables can be thought of as an alternative representation of a three-value categorical variable.
Estimate logit and probit models
# Logit
logit.1 <- glm(survived ~ age + fare +
cherbourg + queenstown + southampton + 0, # again + 0 means we don't include the intercept.
data = td,
family = binomial(link = logit))
stargazer(logit.1, type = "text")
##
## =============================================
## Dependent variable:
## ---------------------------
## survived
## ---------------------------------------------
## age -0.018***
## (0.005)
##
## fare 0.011***
## (0.002)
##
## cherbourg 0.432**
## (0.220)
##
## queenstown -0.725**
## (0.350)
##
## southampton -0.372**
## (0.155)
##
## ---------------------------------------------
## Observations 1,043
## Log Likelihood -649.801
## Akaike Inf. Crit. 1,309.602
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Remember. We don’t include the intercept (+0), so we can have all three effects from the categorical variables estimated.
# Probit
probit.1 <- glm(survived ~ age + fare +
cherbourg + queenstown + southampton + 0,
data = td,
family = binomial(link = probit))
stargazer(logit.1, probit.1, type = "text")
##
## ==============================================
## Dependent variable:
## ----------------------------
## survived
## logistic probit
## (1) (2)
## ----------------------------------------------
## age -0.018*** -0.011***
## (0.005) (0.003)
##
## fare 0.011*** 0.006***
## (0.002) (0.001)
##
## cherbourg 0.432** 0.267**
## (0.220) (0.134)
##
## queenstown -0.725** -0.448**
## (0.350) (0.208)
##
## southampton -0.372** -0.228**
## (0.155) (0.095)
##
## ----------------------------------------------
## Observations 1,043 1,043
## Log Likelihood -649.801 -650.458
## Akaike Inf. Crit. 1,309.602 1,310.916
## ==============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Coefficients do differ between logit and probit. However, as we will see below, the two models describe essentially identical relationships.
Calculate effect plots
This is important. Whenever you can (as long as the model allows), always report effect plots in your results rather than just the regression table because these plots are more accessible.
So previously, we saw that the effect of age on Pr(survival)
is negative and statistically significant. Let’s see if this is substantively significant as well by calculating and plotting the marginal effect. And when I say marginal effects, I mean calculating the predicted probabilities of your y variable and see how they change with different values in your primary x variable.
# Logit
effect(term = "age", mod = logit.1)
##
## age effect
## age
## 0.2 20 40 60 80
## 0.5440882 0.4536030 0.3652326 0.2850953 0.2165431
# Probit
effect(term = "age", mod = probit.1)
##
## age effect
## age
## 0.2 20 40 60 80
## 0.5371519 0.4502535 0.3648564 0.2856988 0.2157928
As we can see, the predicted probabilities based on logit are almost identical to those based on probit.
From now on, we will only use logit model to avoid redundancy.
Calculate effect plots while holding other covariates at their constant
As we learned in the lecture, what we did above (reproduced below) is not entirely correct.
effect(term = "age", mod = logit.1)
##
## age effect
## age
## 0.2 20 40 60 80
## 0.5440882 0.4536030 0.3652326 0.2850953 0.2165431
If we don’t specify anything, R will hold the values of ALL the other independent variables (fare, cherbourg, queenstown, and southampton) at their mean value in calculating the predicted probabilities of survival for age = 20, 40, 60, and 80.
Holding the fare variable constant at its mean is fine, but holding the cherbourg variable constant at its mean is nonsensical. People used to care about this a lot. So I show you the right way to do these effect plots, though they are cared less these days.
First of all, the cherbourg variable can only take two values, 0 or 1. Setting it to its mean does not make much sense.
Second of all, the cherbourg variable is actually one of the three variables that collectively identify where a passenger embarked at. If cherboug is set equal to 1, for example, the queenstown variable and the southampton variable must be set equal to 0.
That is, we need to study the following three cases
- Case 1: cherbourg = 1, queenstown = 0, southampton = 0
- Case 2: cherbourg = 0, queenstown = 1, southampton = 0
- Case 3: cherbourg = 0, queenstown = 0, southampton = 1
while setting the fare variable at its representative (mean or median) value.
mean(td $ fare, na.rm = TRUE)
## [1] 33.29447
As we learned in the lecture, we use the given.values
option, as follows.
Cherbourg
effect(term = "age", mod = logit.1,
given.values = c(fare = 33.29447,
cherbourg = 1, queenstown = 0, southampton = 0) )
##
## age effect
## age
## 0.2 20 40 60 80
## 0.6894873 0.6070165 0.5170402 0.4259452 0.3396139
Queenstown
effect(term = "age", mod = logit.1,
given.values = c(fare = 33.29447,
cherbourg = 0, queenstown = 1, southampton = 0) )
##
## age effect
## age
## 0.2 20 40 60 80
## 0.4110253 0.3268067 0.2517568 0.1891005 0.1391381
Southampton
effect(term = "age", mod = logit.1,
given.values = c(fare = 33.29447,
cherbourg = 0, queenstown = 0, southampton = 1) )
##
## age effect
## age
## 0.2 20 40 60 80
## 0.4984137 0.4087149 0.3239054 0.2492750 0.1870819
We can see that the age variable indeed has a negative effect on Pr(survival)
for passengers, no matter where the passenger embarked at.
Plot marginal effects
Let’s plot the marginal effect of age on Pr(survival).
We will store the effects in objects.
Cherbourg
eff.age.cher <- effect(term = "age", mod = logit.1,
given.values = c(fare = 33.29447,
cherbourg = 1, queenstown = 0, southampton = 0) )
Queenstown
eff.age.quee <- effect(term = "age", mod = logit.1,
given.values = c(fare = 33.29447,
cherbourg = 0, queenstown = 1, southampton = 0) )
Southampton
eff.age.sout <- effect(term = "age", mod = logit.1,
given.values = c(fare = 33.29447,
cherbourg = 0, queenstown = 0, southampton = 1) )
Let’s plot them one by one.
Note: when plotting the effect output for limited DV models, make sure you use the type = “response” option. If you don’t do that, in the effect function the y-axis in the resulting graph may look really weird due to re-scaling. It doesn’t happen with the current example, but it will happen when analyzing the effect of the fare variable.
plot(eff.age.quee) # The y-axis looks odd due to rescaling.
plot(eff.age.quee, type = "response")
plot(eff.age.cher, type = "response")
plot(eff.age.sout, type = "response")
These three graphs do seem to differ a little, but it’s a bit difficult to compare them. Let’s use a common scale for the y-axis. We do this using the ylim option. We will set it equal to the entire range of Pr(y=1), which is between 0 and 1.
# Cherbourg
plot(eff.age.cher, type = "response",
ylim = c(0,1),
main = "Effect of age on survival (Cherbourg)")
# Queenstown
plot(eff.age.quee, type = "response",
ylim = c(0,1),
main = "Effect of fare on survival (Queenstown)")
# Southampton
plot(eff.age.sout, type = "response",
ylim = c(0,1),
main = "Effect of fare on survival (Southampton)")
Let’s save each of these graphs in an object so we can put them side by side to facilitate a comparison.
# Cherbourg
plot.cher <- plot(eff.age.cher, type = "response",
ylim = c(0,1),
main = "Effect of age on survival\n(Cherbourg)")
# Queenstown
plot.quee <- plot(eff.age.quee, type = "response",
ylim = c(0,1),
main = "Effect of fare on survival\n(Queenstown)")
# Southampton
plot.sout <- plot(eff.age.sout, type = "response",
ylim = c(0,1),
main = "Effect of fare on survival\n(Southampton)")
grid.arrange(plot.cher, plot.quee, plot.sout, ncol = 3)
sjPlot
I recommend using sjPlot
nowadays to generate your effect plots under the ggplot environment because they look nicer.
It automatically holds all other variables (the mean) at a constant value while varying the variable of interest in the plot.
plot_model(logit.1, type = "pred", terms = c("age [all]"), ci.lvl = 0.95, bpe.style = "line", robust = T, legend.title = "", vcov.type = c("HC3"), colors="bw") +
theme_bw()
Let’s make the plot at a publishable standard.
plot_model(logit.1, type = "pred", terms = c("age [all]"), ci.lvl = 0.95, bpe.style = "line", robust = T, legend.title = "", vcov.type = c("HC3"), colors="bw") + #,
theme_bw() +
ggtitle(element_blank()) +
xlab("Age of passengers") + ylab("Predicted probability of survival") +
labs(title = "Effect Plot") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 18))