Regression III Regression Discontinuity Designs


Dave Armstrong
University of Western Ontario
Department of Political Science
Department of Statistics and Actuarial Science (by courtesy)

Motivation

Often, we want to use regression analysis to make causal statements. We can only do this if all of our modeling assumptions hold, including independence between $X$ and $\varepsilon$.
Normally, with observational data, these assumptions are unlikely to hold.
Some research designs can leverage near-random assignment to mimic an experimental situation.

Example: State-building in Vietnam

What were the effects of different military strategies on security, development, governance, civil society, etc. in Vietnam?
Why can't we just do:

$$\text{Modernization} = b_0 + b_1\,\text{Bombing} + Z + e$$

where each observation is a hamlet in Vietnam?

US Government Metrics

The US DoD used several metrics to guide military strategy.
A battery of 169 questions about security, politics and economics was combined using Bayes' rule to identify a security score $S$ on the interval $[0, 5]$.
The mainframe wouldn't print out the continuous score, so they rounded it and printed out the rounded numbers.
Identification of causal effects can be obtained by considering hamlets that are close on the continuous score but get rounded into different categories.

Discontinuity

We know that in the assignment of a score, a discontinuity exists at the rounding threshold.
How can we estimate the effect of bombings, which are assigned largely based on the discontinuity?
How do we know that the effect is real and not some modeling artifact?
What assumptions are needed to motivate this type of analysis?

[Figure: discontinuity at the rounding threshold]

Reference

This lecture is based primarily on the working manuscript:
Matias D. Cattaneo, Nicolás Idrobo & Rocío Titiunik (2017) A Practical Introduction to Regression Discontinuity Designs

Sharp vs. Fuzzy RDD

[Figure: sharp vs. fuzzy RDD]

Potential Outcomes Framework

Each observation has two potential outcomes:
$Y_i(0)$: the outcome observed under the control, and
$Y_i(1)$: the outcome observed under the treatment.
Comparing these two outcomes should give us a sense of the causal effect of a variable.
However, we only observe one or the other for each observation.

Extrapolation and RDD

$$E[Y_i \mid X_i] = \begin{cases} E[Y_i(0) \mid X_i] & \text{if } X_i < \bar{x} \\ E[Y_i(1) \mid X_i] & \text{if } X_i \geq \bar{x} \end{cases}$$

In the sharp design, there is no joint support over $Y_i(0)$ and $Y_i(1)$.
Extrapolation is required to identify the causal effect.

The Main Idea

The fundamental idea is that the discontinuity can provide a measure of the causal impact if $E[Y_i(0) \mid X_i = x]$ and $E[Y_i(1) \mid X_i = x]$ are both continuous in $x$ at the discontinuity $X_i = \bar{x}$, where

$$\tau_{\text{SRD}} = E[Y_i(1) - Y_i(0) \mid X_i = \bar{x}] = \lim_{x \downarrow \bar{x}} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[Y_i \mid X_i = x]$$

Fuzzy Design

In the fuzzy design, Pr(Treated) changes at $\bar{x}$, but not from 0 to 1 like in the sharp design.
This could happen if everyone above $\bar{x}$ was eligible for the treatment, but only some took part.

$$\tau_{\text{FRD}} = \frac{E[(D_i(1) - D_i(0))(Y_i(1) - Y_i(0)) \mid X_i = \bar{x}]}{E[(D_i(1) - D_i(0)) \mid X_i = \bar{x}]} = \frac{\lim_{x \downarrow \bar{x}} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[Y_i \mid X_i = x]}{\lim_{x \downarrow \bar{x}} E[D_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[D_i \mid X_i = x]}$$

where $D_i(0)$ is the treatment take-up indicator for those assigned to the control group, and $D_i(1)$ is the treatment take-up indicator for those assigned to the treatment group.
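To make the fuzzy estimand concrete, here is a small worked example with made-up numbers. The four one-sided limits below are hypothetical and are not estimates from any data used in this lecture. Suppose the mean outcome jumps from 10 to 12 at the cutoff while the probability of treatment jumps from 0.2 to 0.7:

```latex
\tau_{\text{FRD}}
  = \frac{\lim_{x \downarrow \bar{x}} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[Y_i \mid X_i = x]}
         {\lim_{x \downarrow \bar{x}} E[D_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[D_i \mid X_i = x]}
  = \frac{12 - 10}{0.7 - 0.2}
  = \frac{2}{0.5}
  = 4
```

The sharp design is the special case in which the denominator is $1 - 0 = 1$, so the fuzzy ratio reduces to the jump in the outcome alone.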

Let's get Kinky: Kink Designs

The Kink RD tries to estimate first derivatives of the regression function rather than the function itself.

$$\tau_{\text{SKRD}} = \frac{d}{dx} E[Y_i(1) - Y_i(0) \mid X_i = x] \Big|_{x = \bar{x}} = \lim_{x \downarrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x]$$

$$\tau_{\text{FKRD}} = \frac{\frac{d}{dx} E[(D_i(1) - D_i(0))(Y_i(1) - Y_i(0)) \mid X_i = x] \big|_{x = \bar{x}}}{\frac{d}{dx} E[(D_i(1) - D_i(0)) \mid X_i = x] \big|_{x = \bar{x}}} = \frac{\lim_{x \downarrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x]}{\lim_{x \downarrow \bar{x}} \frac{d}{dx} E[D_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} \frac{d}{dx} E[D_i \mid X_i = x]}$$

Other Designs

Multi-cutoff Designs
Multiple Score/Geographic Designs

RD Effects are Local

The difference between $E[Y_i(1) \mid X]$ and $E[Y_i(0) \mid X]$ is calculated at a single point ($\bar{x}$) along the support of $X$.
The effect will not necessarily generalize as we move away from the threshold without strong (usually unjustified) assumptions about the regression function.

Could Be (but isn't) Useful

    library(readstata13)
    data <- read.dta13("polecon.dta")
    # Column names are kept as printed in the source
    Y <- data$y
    X <- data$x
    Z <- data$z
    Z_X <- Z * X
    plot(Y ~ X, xlab = "Islamic Victory", ylab = "Female High School Share")
    abline(v = 0)  # mark the cutoff

[Figure: scatterplot of Female High School Share against Islamic Victory, with a vertical line at 0]

Binning Estimator

We can partition the observations into bins and then take the average of $Y$ within bins to get a sense of how the discontinuity looks.

$$\bar{Y}_{-,j} = \frac{1}{\#\{X_i \in B_{-,j}\}} \sum_{i: X_i \in B_{-,j}} Y_i \quad \text{and} \quad \bar{Y}_{+,j} = \frac{1}{\#\{X_i \in B_{+,j}\}} \sum_{i: X_i \in B_{+,j}} Y_i$$

RD Plot

    library(rdrobust)
    out <- rdplot(Y, X, nbins = c(20, 20), binselect = "esmv")

[Figure: RD plot]

Notes on the Previous Slide

1. The binning and global parametric model certainly make it easier to see what is happening with respect to the discontinuity.
2. Global polynomials are not necessarily great because they are known to be unstable in the tails, and the tail is, by definition, the place we're looking.
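The binning estimator can be sketched in base R without rdplot. This is only an illustration on simulated data: the data-generating process, the helper name bin_means, and the choice of J = 10 evenly spaced bins per side are all assumptions made for the example, not part of the lecture's analysis.

```r
# Simulated running variable and outcome with a jump of 1 at the cutoff 0 (assumed data)
set.seed(2)
x <- runif(500, -1, 1)
y <- 0.5 * x + 1 * (x >= 0) + rnorm(500, sd = 0.3)

J <- 10  # number of evenly spaced bins on each side (hypothetical choice)

# Average the outcome within J evenly spaced bins on [lower, upper]
bin_means <- function(x, y, lower, upper, J) {
  breaks <- seq(lower, upper, length.out = J + 1)
  bins <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)
  data.frame(
    mid  = (head(breaks, -1) + tail(breaks, -1)) / 2,  # bin midpoints
    n    = as.vector(table(bins)),                     # observations per bin
    ybar = as.vector(tapply(y, bins, mean))            # within-bin mean of y
  )
}

left  <- bin_means(x[x < 0],  y[x < 0],  -1, 0, J)  # bins below the cutoff
right <- bin_means(x[x >= 0], y[x >= 0],  0, 1, J)  # bins above the cutoff

# Plot the binned means on both sides of the cutoff
plot(c(left$mid, right$mid), c(left$ybar, right$ybar),
     xlab = "Running variable", ylab = "Binned mean of outcome")
abline(v = 0)
```

Evenly spaced bins like these can hold very different numbers of observations; quantile-spaced bins would instead fix the count per bin and let the bin widths vary.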

Binning Estimators

Bins can be:
Evenly spaced (with different numbers of observations in each category)
Quantile spaced (with different distances between bin boundaries)
There are a number of methods to optimally pick the number of bins.

Bins Example

    out <- rdplot(Y, X, binselect = 'es')
    out <- rdplot(Y, X, binselect = 'qs')

[Figure: two RD plots, evenly spaced bins and quantile-spaced bins]

Optimally Choosing Bins: IMSE

Some methods optimize Integrated Mean Squared Error (IMSE), so as to make the optimal tradeoff between bias and variance.
This is not always best because it could produce an overly smooth plot.
Omitting the nbins argument and specifying binselect = 'es' or binselect = 'qs' will generate these optimal bins for evenly spaced and quantile-spaced bins, respectively.

Optimally Choosing Bins: Mimicking Variability

Bins can be chosen such that the variability in the binned means mimics the variability in the raw data.
Not overly smooth like the IMSE binned estimator.
Generally results in more bins than the IMSE method.
ES bins can sometimes encourage ...
binselect = 'esmv' and binselect = 'qsmv' will generate the mimicking variance estimators.

Bins Example

    out <- rdplot(Y, X, binselect = 'esmv')
    out <- rdplot(Y, X, binselect = 'qsmv')

[Figure: two RD plots, evenly spaced and quantile-spaced mimicking-variance bins]

RD Plots

Good for illustration and investigation, but not for estimating the treatment effect: polynomials are too variable at the boundary points.
Use MV bins (both QS and ES side-by-side) to illustrate the design, with a global 4th or 5th order polynomial.

Continuity-based Approach

Better for point estimates and inference of the treatment effect.
Use polynomial methods local to the cutoff to model $E[Y_i \mid X_i = x]$ from either side and treat $\tau_{\text{SRD}}$ as a parameter to be estimated.
Either global polynomials (when all observations are used) or local polynomials (when only observations near the cutoff are used) model the treatment effect.
The main point of interest and attention here is how the regression function is specified. This has huge effects on the robustness and credibility of the design and inference.

Fundamentals

The running variable ($X$) is assumed to be continuous, and so there are few, if any, observations at $X = \bar{x}$.
To estimate $E[Y_i(1) \mid X_i = \bar{x}]$ and $E[Y_i(0) \mid X_i = \bar{x}]$, points near (but not at) the cutoff need to be used.
The primary tool for estimating the effect is a low-order local polynomial regression.

LPR in RDD

1. Choose the order of the polynomial.
2. Choose bandwidth $h$, such that only observations in $[\bar{x} - h, \bar{x} + h]$ are used to fit the LPR.
3. In the LPR, use weights $w_i = K\left(\frac{x_i - \bar{x}}{h}\right)$. The intercept from this LPR is an estimate $\hat{\mu}_+$ of $\mu_+ = E[Y_i(1) \mid X_i = \bar{x}]$.
4. Estimate $\hat{\mu}_-$ of $\mu_- = E[Y_i(0) \mid X_i = \bar{x}]$.
5. $\hat{\tau}_{\text{SRD}} = \hat{\mu}_+ - \hat{\mu}_-$.

Example: First-order LPR

[Figure: first-order LPR on either side of the cutoff]

Choices to make in LPR

Kernel: the triangular kernel (with MSE-optimal bandwidth selection) leads to a point estimate with optimal MSE properties. Here, the weight declines linearly moving away from $\bar{x}$. Other common options are the uniform and Epanechnikov kernels, but results tend to be robust with respect to this choice.
Polynomial order: in an effort to make the appropriate bias-variance tradeoff, a polynomial order of $p = 1$ or $p = 2$ is usually recommended, with optimal bandwidth selection to maximize the accuracy of the estimate. Most research relies on local linear regression.
Bandwidth: automatically selected given the two choices above (more below) to make the appropriate bias-variance tradeoff.

Bias and Bandwidth

[Figure: bias and bandwidth]
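The five LPR steps can be sketched by hand in base R with weighted least squares. This is a minimal illustration under stated assumptions, not how rdrobust is implemented: the data are simulated with a true jump of 2, the bandwidth h = 0.3 is fixed by hand rather than chosen by an MSE-optimal selector, and the names tri_kernel, fit_above and fit_below are invented for the example.

```r
# Simulated sharp RD with a true jump of 2 at the cutoff 0 (assumed data)
set.seed(1)
n <- 2000
x <- runif(n, -1, 1)
y <- 1 + 0.5 * x + 2 * (x >= 0) + rnorm(n, sd = 0.5)

cutoff <- 0
h <- 0.3  # step 2: bandwidth, fixed by hand here (not data-driven)

# Step 3: triangular kernel weights, declining linearly to 0 at distance h
tri_kernel <- function(u) pmax(0, 1 - abs(u))
w <- tri_kernel((x - cutoff) / h)

above <- x >= cutoff & w > 0
below <- x <  cutoff & w > 0

# Step 1: polynomial order p = 1 (local linear), fit separately on each side.
# Centering x at the cutoff makes each intercept the fitted value at the cutoff.
fit_above <- lm(y ~ I(x - cutoff), weights = w, subset = above)
fit_below <- lm(y ~ I(x - cutoff), weights = w, subset = below)

mu_plus  <- coef(fit_above)[1]  # step 3: estimate of E[Y(1) | X = cutoff]
mu_minus <- coef(fit_below)[1]  # step 4: estimate of E[Y(0) | X = cutoff]

# Step 5: the sharp RD estimate is the difference in intercepts
tau_hat <- unname(mu_plus - mu_minus)
tau_hat
```

The conventional standard errors from these two lm fits ignore the bias introduced by the bandwidth choice, which is why the lecture relies on rdrobust for inference.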

Optimal Bandwidth Choice

Generally chosen to minimize $\text{MSE} = \text{Bias}^2 + \text{Variance}$.
The bias is found by relating the local linear estimator to the curvature of the unknown regression function and depends primarily on the $(p + 1)$th derivative of the function.
The variance term is a function of the density of the running variable around the cutoff (which is negatively related to variance) and the conditional variability of the estimate.
Different bandwidths can be chosen on either side of the cutoff since the treatment effect is the difference between two one-sided estimates.
A regularization term is often included to prevent strange behavior when bias is nearly zero (i.e., when a global linear model fits well).

Optimal BW Selection in R

    summary(rdbwselect(Y, X, kernel = 'triangular', p = 1, bwselect = 'msetwo'))

[Output: rdbwselect summary with BW type msetwo, triangular kernel, VCE method NN, order est. (p) = 1 and order bias = 2 on both sides, reporting BW est. (h) and BW bias (b) left and right of the cutoff]

Use the argument bwselect = 'mserd' for a single bandwidth across both regions.

Using rdrobust to Calculate Treatment Effect

    summary(rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd"))

[Output: rdrobust summary with BW type mserd, triangular kernel, VCE method NN, order est. (p) = 1, order bias = 2]

    Method         95% C.I.
    Conventional   [0.223, 5.817]
    Robust         [-0.309, 6.276]

Using rdrobust to Calculate Treatment Effect (2)

    summary(rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "msetwo"))

[Output: rdrobust summary with BW type msetwo, triangular kernel, VCE method NN, order est. (p) = 1, order bias = 2]

    Method         95% C.I.
    Conventional   [0.243, 5.695]
    Robust         [-0.245, 6.152]
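The bias-variance tradeoff behind the MSE-optimal bandwidth can be made explicit with the usual first-order approximation. The constants $B$ (a bias constant driven by the $(p+1)$th derivative) and $V$ (a variance constant driven by the density and conditional variance at the cutoff) are generic notation assumed here for the sketch, and the regularization term is left out:

```latex
\text{MSE}(h) \approx h^{2(p+1)} B^2 + \frac{V}{nh}

\frac{d\,\text{MSE}(h)}{dh} = 2(p+1)\, h^{2p+1} B^2 - \frac{V}{n h^2} = 0
\quad\Longrightarrow\quad
h_{\text{MSE}} = \left( \frac{V}{2(p+1)\, B^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}
```

For local linear regression ($p = 1$) the bandwidth shrinks at the rate $n^{-1/5}$. The formula also shows why a regularization term is useful: as $B$ approaches zero the unregularized bandwidth grows without bound.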

RD Plot, Optimal Bandwidth

    bandwidth <- rdrobust(Y, X, kernel = 'triangular', p = 1, bwselect = 'mserd')$h_l
    out <- rdplot(Y[abs(X) <= bandwidth], X[abs(X) <= bandwidth], p = 1, kernel = 'triangular')

[Figure: RD plot restricted to observations within the optimal bandwidth]

Inference

Inference is less straightforward here, for reasons similar to those we've seen before.
The bandwidth has been selected to make the optimal bias-variance tradeoff.
An implication of this is that the model is almost necessarily misspecified, because the algorithm didn't minimize bias alone but a combination of bias and variance.
Cattaneo et al. propose a robust, bias-corrected confidence interval for hypothesis testing.
It is centered around a bias-corrected parameter estimate.
Its variance takes into account the variability in the bias-correction phase as well as sampling variability.

Inference in Practice

    out <- rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd", all = TRUE)
    cbind(out$coef, out$ci)

[Output: Coeff, CI Lower and CI Upper for the Conventional, Bias-Corrected and Robust methods]

Including Covariates

Covariates can be included in the RD design with the covs argument in rdrobust.
The estimate is only really considered a treatment effect if the covariates are determined and fixed before the assignment of the treatment.
In the best-case scenario, covariates can reduce sampling variability without increasing bias.

    Z <- data[, c("vshr_islam1994", "partycount", "lpop1994", "merkezi", "merkezp", "subbuyuk", "buyuk")]
    outcov <- rdrobust(Y, X, covs = Z, kernel = 'triangular', scaleregul = 1, p = 1, bwselect = 'mserd')
    cbind(outcov$coef, outcov$ci)

[Output: Coeff, CI Lower and CI Upper for the Conventional, Bias-Corrected and Robust methods]

Randomization Inference Approach

The previous approach leveraged the assumption of continuity and smoothness of $E[Y_i(0) \mid X_i = x]$ and $E[Y_i(1) \mid X_i = x]$ at the cutoff to make inferences.
Randomization inference views the RD design as a randomized experiment around the cutoff $\bar{x}$.
The sharp differences in treatment status at the cutoff resemble a randomized controlled trial at the cutoff.
Units whose score values (values on the running variable) are in a small window around the cutoff can be analyzed as being from a randomly assigned experiment.

Local Randomization Overview

Local randomization inference is particularly useful when the running variable is discrete or has relatively few points.
It can be used as a robustness check for continuity-based designs, but local randomization requires stronger assumptions. We assume that:
For points in a small window around the cutoff, $W_0 = [\bar{x} - w_0, \bar{x} + w_0]$, assignment to treatment or control can be considered random (also known as as-if random assignment).
Not only is the assignment random, but the running variable in the window must be unrelated to the outcome.

Similarity of RD and Experiments:

[Figure]

Formalization

In the strongest version, we assume: for $X_i \in W_0$, $Y_i(X_i, T_i) = Y_i(T_i)$; the running variable only influences $Y$ through the treatment indicator.
In a weaker version, we could relax the above to: $\phi(Y_i(X_i, T_i), X_i, T_i) = \tilde{Y}_i(T_i)$; there exists a transformation for which the first condition mentioned above is true.

Estimation and Inference

Estimation could take the form of large-sample statistical estimators if there are lots of $X_i \in W_0$, but this is often not the case.
Randomization inference has exact, finite-sample properties, which makes it quite attractive for this case.
Fisherian inference: potential outcomes are non-stochastic (i.e., fixed, no random sampling assumed).

$$H_0^F: Y_i(0) = Y_i(1) \;\; \forall i$$

Under the null, all outcomes are observed because for each observation the two outcomes are the same.

Hypothetical Example of Fisherian Inference

Imagine we have 5 units in $W_0$ and we randomly assign $n_{W_0,+} = 3$ units to the treatment and $n_{W_0,-} = n_{W_0} - n_{W_0,+} = 2$ units to the control.
Under full randomization, we could assume that $n_{W_0,+}$, and by extension $n_{W_0,-}$, are fixed and find all possible vectors $t$ of the treatment and control that preserve the marginal distribution of $T$.
In our example, there are $\binom{5}{3} = 10$ possible assignments to treatment and control.
Assume that $Y = (5, 2, 2, 5, 5)$ and that $T = (1, 0, 0, 1, 1)$; then the observed difference in means is $S^{obs} = \bar{Y}_+ - \bar{Y}_- = 5 - 2 = 3$.
If complete enumeration of all possible outcomes is not feasible, simulation can be used.

Distribution of Test Statistic under Null

[Figure: distribution of the test statistic under the null]

Test Statistics

Fisherian inference is general and should work for any test statistic. Some other common choices for RD designs are:
Kolmogorov-Smirnov (KS) statistic: $S_{KS} = \sup_y |\hat{F}_1(y) - \hat{F}_0(y)|$, the biggest absolute difference between the two empirical CDFs. Better than the difference of means when departures from the null are in other moments or quantiles.
Wilcoxon rank sum statistic: $S_{WR} = \sum_{i: T_i = 1} R_i^y$, where $R_i^y$ is the outcome rank. $S_{WR}$ is not affected by the cardinal values of the outcome, only their ordering.

Randomization Inference for a Regression Coefficient

    library(MASS)
    set.seed(493)
    # Three correlated predictors and an outcome with known coefficients
    X <- mvrnorm(100, c(0, 0, 0), matrix(c(1, .25, .25, .25, 1, .25, .25, .25, 1), ncol = 3))
    b <- c(.3, -1, 2)
    y <- X %*% b + rnorm(100, 0, 1.5)
    printCoefmat(summary(mod <- lm(y ~ X))$coef)

[Output: coefficient table for (Intercept), X1, X2 and X3; X1 is significant at the 0.05 level, X2 and X3 at the 0.001 level]

    # Permute the first predictor, refit, and collect the coefficients
    ranb <- NULL
    for (i in 1:2500) {
      X[, 1] <- sample(X[, 1], nrow(X), replace = FALSE)
      ranb <- rbind(ranb, coef(update(mod)))
    }
    # Two-sided randomization p-value for the coefficient on X1
    2 * (1 - mean(coef(mod)[2] > ranb[, 2]))

[Output: randomization p-value]
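The hypothetical five-unit example is small enough to enumerate completely in base R. The sketch below uses the Y and T vectors from the example; the helper name diff_means and the use of the absolute difference for a two-sided p-value are choices made here for illustration.

```r
# Outcomes and observed assignment from the hypothetical example
Y     <- c(5, 2, 2, 5, 5)
T_obs <- c(1, 0, 0, 1, 1)

# Test statistic: difference in means between treated and control
diff_means <- function(y, t) mean(y[t == 1]) - mean(y[t == 0])
S_obs <- diff_means(Y, T_obs)

# All ways to choose which 3 of the 5 units are treated (fixed margins)
treated_sets <- combn(length(Y), sum(T_obs))

# Under the sharp null Y_i(0) = Y_i(1), outcomes do not change with assignment,
# so the statistic can be recomputed for every possible assignment
S_null <- apply(treated_sets, 2, function(idx) {
  t <- integer(length(Y))
  t[idx] <- 1L
  diff_means(Y, t)
})

# Exact two-sided p-value: share of assignments at least as extreme as observed
p_value <- mean(abs(S_null) >= abs(S_obs))

S_obs    # 3
p_value  # 0.1
```

Only one of the ten assignments (the observed one, with all three 5s treated) gives a difference as large as 3, so the exact p-value is 1/10. With only ten possible assignments, no p-value below 0.1 is attainable, which is why the minimum number of observations in the window matters.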

Example

    library(rdlocrand)
    rdrandinf(Y, X, wl = -2.5, wr = 2.5, seed = 50, reps = 2500)

[Output: Selected window = [-2.5;2.5]; Number of obs = 2629; Order of poly = 0; Kernel type = uniform; Reps = 2500; Window = set by user; H0: tau = 0; Randomization = fixed margins; Cutoff c = 0. The results table reports the difference-in-means statistic with its finite-sample and large-sample p-values and power vs d = 4.27.]

Choosing the Window

Some options:
1. Ad hoc or theoretically defined: both are different flavors of arbitrary.
2. Use pre-treatment covariates to select the window.
Assumes that there exists a variable $Z$ that is related to the running variable outside the window, but not inside the window (without this assumption, the procedure breaks down).
The effect of the treatment on $Z$, since it is pre-determined, is 0 by construction.

Data-driven Choice of Window

1. Identify $H_0^F$: $Z$ is unrelated to $T$, or balanced on $T$.
2. Start with the smallest possible window and test $H_0^F$.
3. Continue to widen the window until $H_0^F$ is rejected at a pre-specified significance level.
4. The chosen window is the largest one that continues to fail to reject $H_0^F$.

We need to choose the following things:
Relevant covariates
Test statistic
Randomization mechanism
Minimum n in the smallest window
Significance level

Formalization of the Procedure

1. Start with a symmetric window of length $2w_j$, $W_j = [\bar{x} - w_j, \bar{x} + w_j]$.
2. Compute the test statistic either for each covariate individually or compute the omnibus test p-value.
3. Find the smallest p-value $p_{min}$ and evaluate whether $p_{min} > \alpha$. If yes, then fail to reject $H_0$ and increase the size of the window by a pre-specified step. If no, then use the window $W_{j-1}$.

The step procedure can be defined by a fixed length (wstep in R) or such that a certain number of observations is included (wobs in R).
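The window-widening procedure can be sketched in base R with a single covariate and a permutation test on the difference in means. Everything here is an assumption made for illustration: the simulated data, the helper perm_pvalue, the fixed step of 0.5 and the level alpha = 0.15. It is not the algorithm used inside rdwinselect, which handles several covariates and other statistics.

```r
# Simulated score and a pre-treatment covariate that is related to the score (assumed data)
set.seed(3)
n <- 400
x <- runif(n, -5, 5)
z <- 0.4 * x + rnorm(n)

# Permutation p-value for the absolute difference in covariate means (fixed margins)
perm_pvalue <- function(z, treated, reps = 500) {
  obs <- abs(mean(z[treated]) - mean(z[!treated]))
  null <- replicate(reps, {
    tp <- sample(treated)  # reshuffle the treatment labels
    abs(mean(z[tp]) - mean(z[!tp]))
  })
  mean(null >= obs)
}

alpha   <- 0.15                 # pre-specified significance level (hypothetical)
windows <- seq(0.5, 5, by = 0.5)  # half-widths w_j, widened by a fixed step

# Balance-test p-value for the covariate within each candidate window
pvals <- sapply(windows, function(w) {
  in_w <- abs(x) <= w
  perm_pvalue(z[in_w], x[in_w] >= 0)
})

# Stop at the first rejection and keep the previous window W_{j-1}
first_reject <- which(pvals <= alpha)[1]
chosen <- if (is.na(first_reject)) {
  max(windows)
} else if (first_reject == 1) {
  NA  # even the smallest window fails the balance test
} else {
  windows[first_reject - 1]
}

cbind(half_width = windows, p_value = pvals)
chosen
```

With several covariates, step 3 of the procedure would take the minimum p-value across covariates in each window before comparing it with alpha.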

Window Selection in R

    Z <- data[, c("i89", "vshr_islam1994", "partycount", "lpop1994", "merkezi", "merkezp", "subbuyuk", "buyuk")]
    rdwinselect(X, Z, seed = 50, reps = 1000, wobs = 2)

[Output: Window selection for RD under local randomization; Number of obs = 2629; Order of poly = 0; Kernel type = uniform; Reps = 1000; Testing method = rdrandinf; Balance test = diffmeans; Cutoff c = 0. For each window the table reports the window length / 2, the minimum p-value with the corresponding variable name, the binomial test p-value, and the number of observations below and at or above the cutoff.]

Recommended window is [-0.944;0.944] with 38 observations (17 below, 21 above).

Example

    library(rdlocrand)
    rdrandinf(Y, X, wl = -.944, wr = .944, seed = 50, reps = 2500)

[Output: Selected window = [-0.944;0.944]; Number of obs = 2629; Order of poly = 0; Kernel type = uniform; Reps = 2500; Window = set by user; H0: tau = 0; Randomization = fixed margins; Cutoff c = 0. The results table reports the difference-in-means statistic with its finite-sample and large-sample p-values.]

Local Randomization or Continuity Approach?

Local randomization requires stronger assumptions than the continuity-based approach, thus one might use this approach to probe the conditions under which inference makes sense.
The continuity-based approach requires reasonable data density around the cutoff. If this isn't the case, then the local randomization approach might be better.
When the running variable is discrete (even potentially with lots of values, e.g., age in years), the local randomization approach could be better because there will be mass points with multiple observations.

Validation

There are threats to validity with RD designs.
If the cutoff is known to the observations ahead of time, this can threaten the validity of the RD design.
Observations may try to actively manipulate their score if they are just below the cutoff.
There are empirical tests aimed at evaluating the validity of the design:
1. Continuity of the score density around the cutoff.
2. Null treatment effects on pre-treatment covariates and placebos.
3. Regression function continuity at arbitrary alternative cutoffs.

Density of the Running Variable

If units don't have the ability to manipulate their score, then there should be similar data density on both sides of the cutoff.

    library(rddensity)  # needed for rddensity()
    summary(rddensity(X))

Null Effects on Pre-treatment Covariates and Placebos

If the effect is causal, then it should not be related to pre-treatment covariates or placebo conditions.
Anything determined before the treatment counts as a pre-treatment covariate.
Placebo outcomes are context-specific.

Covariates and Placebos

    # Robust RD estimate and confidence interval for each covariate
    robs <- lapply(1:ncol(Z), function(x) rdrobust(Z[, x], X))
    names(robs) <- colnames(Z)
    t(round(sapply(robs, function(x) cbind(x$coef, x$ci)[3, ]), 3))

[Output: Coeff, CI Lower and CI Upper for i89, vshr_islam1994, partycount, lpop1994, merkezi, merkezp, subbuyuk and buyuk]

With Randomization Inference

    # Randomization test for each covariate within the selected window
    robs <- lapply(1:ncol(Z), function(x) rdrandinf(Z[, x], X, wl = -.944, wr = .944))
    names(robs) <- colnames(Z)
    t(round(sapply(robs, function(x) c(stat = x$obs.stat, pval = x$p.value)), 4))

[Output: stat and pval for i89, vshr_islam1994, partycount, lpop1994, merkezi, merkezp, subbuyuk and buyuk]
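For the density check on the running variable, a very simple alternative is a binomial test on the counts just below and just above the cutoff, in the spirit of the Bin.test column reported by rdwinselect. This is a swapped-in, cruder technique than the local polynomial density estimator used by rddensity, shown on simulated data with a hand-picked window half-width h = 0.2 (both assumptions for the example).

```r
# Simulated running variable with no manipulation around the cutoff 0 (assumed data)
set.seed(4)
x <- runif(1000, -1, 1)
h <- 0.2  # half-width of the window around the cutoff (hypothetical choice)

# Counts of observations just below and just above the cutoff
n_below <- sum(x < 0 & x >= -h)
n_above <- sum(x >= 0 & x <= h)

# Without sorting around the cutoff, a unit in the window is about
# equally likely to fall on either side
bt <- binom.test(n_above, n_above + n_below, p = 0.5)

c(below = n_below, above = n_above)
bt$p.value
```

A small p-value would suggest more units on one side of the cutoff than chance alone explains, which is the pattern expected if units can manipulate their score.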

Regression Function Continuity

One of the assumptions we made before was that the regression functions are continuous at the cutoff for both treatment and control groups.

    treat <- which(X >= 0)
    contr <- which(X < 0)
    # Placebo cutoffs on either side of the true cutoff at 0
    cutoffs <- seq(-5, 5, by = 1)
    cutoffs <- cutoffs[-which(cutoffs == 0)]
    res <- list()
    for (i in 1:length(cutoffs)) {
      if (cutoffs[i] < 0) {
        # Use only control observations for placebo cutoffs below 0
        res[[i]] <- rdrobust(Y[contr], X[contr], c = cutoffs[i])
      } else {
        # Use only treated observations for placebo cutoffs above 0
        res[[i]] <- rdrobust(Y[treat], X[treat], c = cutoffs[i])
      }
    }
    cbind(cutoff = cutoffs, t(round(sapply(res, function(x) cbind(x$coef, x$ci)[3, ]), 3)))

[Output: cutoff, Coeff, CI Lower and CI Upper for each of the 10 placebo cutoffs]

Sensitivity to Observations Close to Cutoff

If there is potential for manipulation, it would be those observations closest to the cutoff that are most susceptible.
Take them out and evaluate the effect.

    rdrobust(Y[abs(X) >= 0.25], X[abs(X) >= 0.25])[c("coef", "ci")]

[Output: $coef and $ci for the Conventional, Bias-Corrected and Robust methods]

Donut-hole Estimation

    # Robust estimate and confidence interval after dropping observations within radius i of the cutoff
    out <- t(sapply(seq(0, 1.25, by = .25), function(i)
      with(rdrobust(Y[abs(X) >= i], X[abs(X) >= i]), c(coef = coef[3], ci[3, ]))))
    out <- cbind(radius = seq(0, 1.25, by = .25), out)
    out

[Output: radius, coef, CI Lower and CI Upper for each of the 6 radii]

Conclusion

The RDD approach can be valuable with the right data and question.
Have to be careful that the causal effect is not a modeling artifact.
Use data-driven tools to estimate the appropriate bandwidth, window width, etc.
Do sensitivity testing to make sure that your results are not sensitive to modeling choices.

Dave Armstrong, University of Western Ontario, Department of Political Science, Department of Statistics and Actuarial Science (by courtesy). e: dave.armstrong@uwo.ca