Bayesian nonparametric predictive approaches for causal inference: Regression Discontinuity Methods

George Karabatsos, University of Illinois-Chicago
ERCIM Conference, 14-16 December 2013, Senate House, University of London
Session ES38: Bayesian Nonparametric Regression, Sunday 15.12.2013, 08:45-10:25
In collaboration with S.G. Walker
Research is supported by NSF-MMS Grant SES-1156372.

Introduction: Outline
I. Review the causal inference framework (counterfactual).
II. Randomized studies and non-randomized studies.
-- The regression discontinuity (RD) design for non-randomized studies (Thistlethwaite & Campbell, 1960; Cook, 2008).
-- Causal modeling framework: DAG and extended conditional independence (Dawid, 2002, 2010).
III. Issues of current causal models for RD designs.
IV. Propose a Bayesian nonparametric regression model for RD designs.
-- For a sharp RD design (full treatment compliance among subjects).
-- For a fuzzy RD design (imperfect treatment compliance).
V. Illustrate the Bayesian nonparametric model on two real data sets.
-- Impact of a new teacher education curriculum on student performance.
-- Impact of basic skills on teaching ability.
VI. Time permitting: consider more recent work on RD-based causal inference, involving the restricted DP mixture of linear regressions model (Wade, Walker, & Petrone, 2013), which more directly exploits the local randomization feature of RD designs.

Introduction: Randomized Studies
Causal inference: a basic aim of scientific research.
Randomized studies: the gold standard of causal inference (Rubin, 2008). Randomization ensures that the (pretreatment) covariate distribution does not differ between treatment subjects and non-treatment subjects. Then any difference between treatment outcomes and non-treatment outcomes is due only to changes in the treatment variable, i.e., it is attributable to the causal effect of the treatment on the outcome.
A randomized study is often infeasible, for financial, ethical, or timeliness reasons.
Regression Discontinuity (RD) Design
-- Y: outcome.
-- R: assignment variable.
-- $A = 1(R > r_0)$: treatment assignment indicator.
-- T: treatment receipt indicator.
Sharp RD: full compliance (A = T). Fuzzy RD: imperfect compliance.
For subjects with R observations located near $r_0$, treatments are as good as randomly assigned, under mild conditions.

Sharp RD Design Illustration
[Figure: outcomes Y versus assignment variable R, with non-treatment subjects (T = 0; R < .6) and treatment subjects (T = 1; R >= .6).]
At the cutoff of .6, the average jump size of Y from the black line (control) to the red line (treatment) is 4.3.
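The illustration can be mimicked with a minimal Python sketch. Only the cutoff (.6) and the jump size (4.3) come from the slide; the linear trend, noise level, sample size, seed, and all function names are hypothetical choices made for this example. The sketch simulates a sharp RD design with full compliance and recovers the jump as the gap between two straight-line fits evaluated at the cutoff.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def simulate_sharp_rd(n=2000, r0=0.6, jump=4.3, seed=1):
    """Hypothetical data: linear trend in R plus a jump at the cutoff."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        r = rng.random()
        t = 1 if r >= r0 else 0          # sharp design: T is determined by R
        y = 5.0 + 20.0 * r + jump * t + rng.gauss(0.0, 1.0)
        data.append((r, t, y))
    return data

def jump_at_cutoff(data, r0=0.6):
    """Gap between the treatment and control lines, both evaluated at r0."""
    left = [(r, y) for r, t, y in data if t == 0]
    right = [(r, y) for r, t, y in data if t == 1]
    a0, b0 = fit_line(*zip(*left))
    a1, b1 = fit_line(*zip(*right))
    return (a1 + b1 * r0) - (a0 + b0 * r0)

data = simulate_sharp_rd()
estimated_jump = jump_at_cutoff(data)
print(round(estimated_jump, 2))
```

Because the simulated trend is exactly linear, two global line fits are adequate here; with a curved trend, the same comparison would need a more flexible fit near the cutoff.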

RD Assumptions for Causal Inference
Characterizing RD assumption: $\lim_{r \downarrow r_0} E(T \mid r) \neq \lim_{r \uparrow r_0} E(T \mid r)$.
Sharp RD: $f(t \mid r) = \Pr(T = t \mid r) = t\,1(r \geq r_0) + (1 - t)\,1(r < r_0)$.
Probabilistic DAG for sharp RD: $\Psi_R \in \{\emptyset, r_0\}$ and $\Psi_T \in \{\emptyset, 0, 1\}$ are intervention parameters. $\Psi_Y$ is a general regime parameter that specifies the circumstances of Y: experimental conditions, environment, kind of subject, etc.
The DAG implies the conditional independence (CI) properties:
$R \perp\!\!\!\perp (\Psi_T, \Psi_Y) \mid \Psi_R$, $\quad T \perp\!\!\!\perp (\Psi_Y, \Psi_R) \mid (R, \Psi_T)$, $\quad Y \perp\!\!\!\perp (\Psi_T, \Psi_R) \mid (R, T, \Psi_Y)$.
Local stability ("SUTVA"): $Y \perp\!\!\!\perp \Psi_Y \mid (R = r_0, T)$, i.e., $f(y \mid r_0, t, \psi_Y) = f(y \mid r_0, t)$.
In the idle state, i.e., $\psi_R = \psi_T = \emptyset$, the joint p.d.f. is left at its "undisturbed state": $f(r, t, y) = f(r)\,f(t \mid r)\,f(y \mid r, t)$.
All previous CI assumptions imply a causal property: $Y \perp\!\!\!\perp (\Psi_T, \Psi_R) \mid (R = r_0, T)$.
An intervention regime $(\Psi_R = r_0, \Psi_T = t_0)$, $t_0 \in \{0, 1\}$, modifies $f(r, t, y)$ to $f(y \mid \Psi_R = r_0, \Psi_T = t_0) = f(y \mid r_0, t_0)$.
Causal effect: comparison of functionals of $f(y \mid r_0, T = 1)$ and $f(y \mid r_0, T = 0)$.

RD Assumptions for Causal Inference
Causal effect: comparison of functionals of $f(y \mid r_0, T = 1)$ and $f(y \mid r_0, T = 0)$. Conditioning on $R = r_0$ is motivated by the following assumption.
Local Randomization (LR) (Lee, 2008): Each subject, described by all unobserved and observed pre-treatment covariates, W, has "imprecise control" over R, i.e., $F_R(r \mid w) = \Pr(R \leq r \mid w)$ is continuous in r at $r_0$, with $0 < F_R(r_0 \mid w) < 1$. Then the p.d.f. of all observed pretreatment covariates, $f(x \mid w)$, is the same for all subjects just to the left and just to the right of the cutoff $r_0$.
Estimate of the causal effect of T on Y:
-- Sharp RD: $E(h\{Y\} \mid r^+) - E(h\{Y\} \mid r^-)$, for any chosen functional $h\{\cdot\}$, where $r^+$ denotes the setting $(R = r_0,\ 1(R \geq r_0) = 1)$ and $r^-$ denotes the setting $(R = r_0,\ 1(R \geq r_0) = 0)$, as covariates in a regression model.
-- Fuzzy RD (imperfect treatment compliance; $f(t \mid r)$ is not a point mass): $[E(h\{Y\} \mid r^+) - E(h\{Y\} \mid r^-)] \,/\, [E(T \mid r^+) - E(T \mid r^-)]$, under the additional assumption of a local exclusion restriction, i.e., conditionally on $R = r_0$, any effect of A on Y is only through T.
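To make the two estimands concrete, the following Python sketch computes the sharp-style difference in outcomes and the fuzzy ratio on simulated data, with h taken as the identity (so the functional is the mean). Everything in the simulation is an assumption for illustration: the compliance probabilities (.2 below the cutoff, .8 above), the true effect of 2, the flat trend in R, and the crude use of window means as stand-ins for the limits at the cutoff.

```python
import random

def window_mean(values, r, r0, width, above):
    """Mean of `values` for subjects within `width` of r0, on one side."""
    if above:
        picked = [v for v, ri in zip(values, r) if r0 <= ri < r0 + width]
    else:
        picked = [v for v, ri in zip(values, r) if r0 - width <= ri < r0]
    return sum(picked) / len(picked)

def rd_estimates(y, t, r, r0, width):
    """Return (jump in mean outcome, fuzzy RD ratio) near the cutoff."""
    dy = window_mean(y, r, r0, width, True) - window_mean(y, r, r0, width, False)
    dt = window_mean(t, r, r0, width, True) - window_mean(t, r, r0, width, False)
    return dy, dy / dt

rng = random.Random(7)
r0, true_effect = 0.5, 2.0
r, t, y = [], [], []
for _ in range(20000):
    ri = rng.random()
    a = 1 if ri >= r0 else 0                 # treatment assignment A
    p_treat = 0.8 if a == 1 else 0.2         # imperfect compliance (assumed)
    ti = 1 if rng.random() < p_treat else 0  # treatment receipt T
    r.append(ri)
    t.append(ti)
    # A affects Y only through T (local exclusion restriction holds by design)
    y.append(1.0 + true_effect * ti + rng.gauss(0.0, 1.0))

outcome_jump, fuzzy_effect = rd_estimates(y, t, r, r0, width=0.1)
print(round(outcome_jump, 2), round(fuzzy_effect, 2))
```

In this setup the jump in mean outcome is diluted to roughly 0.6 times the true effect, because compliance jumps by only 0.6 at the cutoff; dividing by the jump in E(T) undoes the dilution.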

Standard Models for RD designs
A standard model for sharp RD designs (e.g., Bloom, 2012):
$Y_i = \beta_0 + \beta_1(r_i) + \tau\,1(r_i > r_0) + \beta_2(r_i)\,1(r_i > r_0) + \varepsilon_i$, $\quad \varepsilon_i \sim N(0, \sigma^2)$,
where $\tau$ is the average causal effect of the treatment, and $\beta_1(r_i)$ and $\beta_2(r_i)$ are each linear or polynomial effects of R.
The estimate of the causal effect ($\tau$) can be easily biased by outliers.
Local linear models (Fan & Gijbels, 1996) provide an outlier-resistant alternative (Imbens & Lemieux, 2008). A bandwidth parameter is chosen to assign higher weight to observations that are located close to the cutoff $r_0$ (Imbens & Lemieux, 2008). The local linear model has been extended to quantile regression, to provide causal effects in terms of quantiles (Frandsen et al., 2012). Local linear models can estimate the effect (for a chosen functional h) in either sharp or fuzzy RD designs.
However:
-- Bandwidth choices only have large-sample justifications (Imbens & Kalyanaraman, 2012).
-- The quantile regression method has the quantile-crossing problem.
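A local linear estimate of the sharp RD effect can be sketched in Python as follows. This is a generic illustration, not the estimator of any cited paper: the triangular kernel, the fixed bandwidth of 0.15, the simulated curved trend, and the function names are assumptions, and no data-driven bandwidth selection is attempted.

```python
import math
import random

def weighted_line_at(xs, ys, ws, x0):
    """Weighted least squares line through (xs, ys); returns the fit at x0."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

def local_linear_rd(r, y, r0, bandwidth):
    """Difference of one-sided local linear fits at the cutoff r0."""
    def side(keep):
        # Triangular kernel: weight decays linearly to 0 at distance `bandwidth`.
        pts = [(ri, yi, 1.0 - abs(ri - r0) / bandwidth)
               for ri, yi in zip(r, y)
               if keep(ri) and abs(ri - r0) < bandwidth]
        xs, ys, ws = zip(*pts)
        return weighted_line_at(xs, ys, ws, r0)
    return side(lambda v: v > r0) - side(lambda v: v <= r0)

# Hypothetical data: curved trend in R plus a jump of 2 above the cutoff.
rng = random.Random(3)
r0 = 0.5
r = [rng.random() for _ in range(4000)]
y = [math.sin(3.0 * ri) + 2.0 * (ri > r0) + rng.gauss(0.0, 0.5) for ri in r]

effect_hat = local_linear_rd(r, y, r0, bandwidth=0.15)
print(round(effect_hat, 2))
```

A smaller bandwidth reduces bias from curvature but uses fewer observations, which is the trade-off behind the large-sample bandwidth rules mentioned above.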

Modeling RD designs
For RD designs, a regression model is desired that:
-- Is flexible enough to make accurate predictions, while being able to capture $r_0$-local effects. Accurate estimation of causal effects relies on a predictively accurate regression model.
-- Can provide coherent inferences of the causal effect of the treatment (versus the non-treatment) on the outcome Y, in terms of the mean, variance, chosen quantiles, or probability density of Y (i.e., for general functionals $h\{\cdot\}$ of Y).
-- Involves no quantile-crossing problems.

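IPMW Model for Sharp RD Designs
$f(y_i \mid r_i; \zeta) = \sum_{j=-\infty}^{\infty} n(y_i \mid \mu_j, \sigma_j^2)\,\omega_j(\eta(r_i), \sigma(r_i))$, $\quad i = 1, \ldots, n$,
$\omega_j(\eta(r), \sigma(r)) = \Phi\!\left(\dfrac{j - \eta(r)}{\sigma(r)}\right) - \Phi\!\left(\dfrac{j - 1 - \eta(r)}{\sigma(r)}\right)$,
$\eta(r) = \beta_0 + \beta_1 r + \beta_2\,1(r \geq r_0)$,
$\sigma(r) = \exp\{\lambda_0 + \lambda_1 r + \lambda_2\,1(r \geq r_0)\}^{1/2}$,
$(\mu_j, \sigma_j^2) \sim N(\mu_j \mid \mu_\mu, \sigma_\mu^2)\,IG(\sigma_j^2 \mid 1, b_\sigma)$,
$(\mu_\mu, \sigma_\mu) \sim N(\mu_\mu \mid \mu_0, \sigma_0^2)\,Un(\sigma_\mu \mid 0, b_{\sigma\mu})$,
$(b_\sigma, \beta, \lambda) \sim Ga(b_\sigma \mid a_0, b_0) \cdots$
IPMW: Infinite-Probits Mixture Weights model (Karabatsos & Walker, 2012, EJS).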

IPMW Model (Karabatsos & Walker, 2012)
$f(y \mid r) = \sum_{j=-\infty}^{\infty} n(y \mid \mu_j, \sigma_j^2)\,\omega_j(\eta(r), \sigma(r))$, with $\eta(r) = \beta_0 + \beta_1 r + \beta_2\,1(r \geq r_0)$ and $\sigma(r) = \exp\{\lambda_0 + \lambda_1 r + \lambda_2\,1(r \geq r_0)\}^{1/2}$.
Weights $\omega_j(\eta(r), \sigma(r))$ indicate how well r explains Y. $\sigma(r)$ controls multimodality. If $\beta_2 \neq 0$ or $\lambda_2 \neq 0$, then there is a regression discontinuity causal effect of T on the p.d.f. of Y.
[Figure: four panels, for $\sigma(r)$ = 1/20, 1/2, 1, and 2. Top row: mixture weight $\omega_j(r)$ versus index j. Bottom row: density $f(y \mid r)$ versus y.]
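The behavior of the mixture weights can be checked numerically. The Python sketch below assumes the weights take the difference-of-probits form $\omega_j = \Phi((j - \eta)/\sigma) - \Phi((j - 1 - \eta)/\sigma)$, writing $\eta$ for the value of the mean function and $\sigma$ for the value of the scale function at a given r (symbol names assumed here). The truncation of the index j to a finite range and the function names are choices made for this example only.

```python
import math

def norm_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ipmw_weights(eta, sigma, j_min=-50, j_max=50):
    """Difference-of-probits weights over a truncated index range."""
    return {j: norm_cdf((j - eta) / sigma) - norm_cdf((j - 1 - eta) / sigma)
            for j in range(j_min, j_max + 1)}

# Small scale: the mass concentrates on very few components.
w_small = ipmw_weights(eta=1.0, sigma=0.05)
# Larger scale: the mass spreads over many components.
w_large = ipmw_weights(eta=1.0, sigma=2.0)

for label, w in (("sigma=0.05", w_small), ("sigma=2.0", w_large)):
    active = sum(1 for v in w.values() if v > 0.01)
    print(label, "sum =", round(sum(w.values()), 6), "components > .01:", active)
```

The sum telescopes, so the weights add to one up to truncation error. As the scale grows, more components receive non-negligible weight, which is how the scale function can produce a multimodal density.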

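Posterior Predictive Inference
Fast MCMC sampling/estimation of the posterior $\Pi(\zeta \mid \text{Data}) \propto \prod_i f(y_i \mid r_i; \zeta)\,\pi(\zeta)$, with $\zeta = ((\mu_j, \sigma_j^2)_{j \in \mathbb{Z}}, \mu_\mu, \sigma_\mu^2, b_\sigma, \beta, \lambda)$ (Karabatsos & Walker, 2012).
Inference focuses on the posterior predictive density: $f_n(y \mid r, t) = \int f(y \mid r, t; \zeta)\, d\Pi(\zeta \mid \text{Data})$.
Sharp RD design, causal effect estimate: $E_n(h\{Y\} \mid r^+) - E_n(h\{Y\} \mid r^-)$.
Fuzzy RD design, causal effect estimate: $\{E_n(h\{Y\} \mid r^+) - E_n(h\{Y\} \mid r^-)\} \,/\, \{E_n(T \mid r^+) - E_n(T \mid r^-)\}$.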
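As a sketch of how the sharp RD estimate could be computed from MCMC output, the Python code below averages, over parameter draws, the difference between the predictive mean at the cutoff with the treatment indicator set to 1 and with it set to 0 (h is the identity). It assumes the difference-of-probits weight form with a linear mean function (coefficients `beta`) and a log-linear variance function (coefficients `lam`). The two "draws" are made-up numbers, not posterior samples from the talk, and the dictionary layout and function names are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predictive_mean(draw, r, above, js):
    """E(Y | r, indicator) for one draw: sum over j of weight_j * mu_j."""
    b0, b1, b2 = draw["beta"]
    l0, l1, l2 = draw["lam"]
    eta = b0 + b1 * r + b2 * above
    sigma = math.sqrt(math.exp(l0 + l1 * r + l2 * above))
    total = 0.0
    for j in js:
        w = norm_cdf((j - eta) / sigma) - norm_cdf((j - 1 - eta) / sigma)
        total += w * draw["mu"][j]
    return total

def sharp_rd_mean_effect(draws, r0, js):
    """Average over draws of E(Y | r+) - E(Y | r-), both at R = r0."""
    diffs = [predictive_mean(d, r0, 1, js) - predictive_mean(d, r0, 0, js)
             for d in draws]
    return sum(diffs) / len(diffs)

js = range(-30, 31)                 # truncated component index (assumption)
mu = {j: float(j) for j in js}      # toy component means, mu_j = j
draws = [                           # two made-up parameter draws
    {"beta": (0.0, 0.5, 1.0), "lam": (0.0, 0.0, 0.0), "mu": mu},
    {"beta": (0.0, 0.5, 2.0), "lam": (0.0, 0.0, 0.0), "mu": mu},
]
effect = sharp_rd_mean_effect(draws, r0=0.6, js=js)
print(round(effect, 6))
```

With the toy means $\mu_j = j$ and equal scales on both sides, shifting the mean function by an integer shifts the predictive mean by the same integer, so the two draws contribute 1 and 2 and the average is 1.5. Other functionals (variance, quantiles) would be computed from the same mixture, draw by draw.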

Two Data Applications of the IPMW Model for Causal Inference
Both data sets involve sharp RD designs.
Prior parameter specification: $b_{\sigma\mu} = 5$, for $\sigma_\mu \sim U(0, b_{\sigma\mu})$. Same priors for all other model parameters, as before.
40K samples were retained from 200K MCMC samples and a 2K burn-in.
For parameters of interest:
-- Trace plots showed good mixing of model parameters.
-- The half-widths of the 95% Monte Carlo confidence intervals were sufficiently small (near .00), according to the sub-sampling batch method (Flegal & Jones, 2011).
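The half-width check can be illustrated with a short Python sketch. It uses plain non-overlapping batch means with batch size $\lfloor \sqrt{n} \rfloor$ and a normal critical value of 1.96, which is a simplification and not necessarily the exact sub-sampling variant used in the talk. The chain is a stand-in of 40K independent standard normal draws, not real MCMC output, and the function name is hypothetical.

```python
import math
import random

def batch_means_half_width(chain, z=1.96):
    """Half-width of an approximate 95% Monte Carlo interval for the chain mean."""
    n = len(chain)
    b = int(math.sqrt(n))                  # batch size
    a = n // b                             # number of batches
    used = chain[:a * b]                   # drop any leftover draws
    overall = sum(used) / (a * b)
    means = [sum(used[k * b:(k + 1) * b]) / b for k in range(a)]
    # Batch means estimate of the asymptotic variance of the chain.
    var_hat = b * sum((m - overall) ** 2 for m in means) / (a - 1)
    return z * math.sqrt(var_hat / (a * b))

rng = random.Random(11)
chain = [rng.gauss(0.0, 1.0) for _ in range(40000)]   # stand-in for retained draws
half_width = batch_means_half_width(chain)
print(round(half_width, 4))
```

For an autocorrelated chain the batch means spread out more than for independent draws, so the half-width grows accordingly; a value near zero indicates that the Monte Carlo error of the posterior mean estimate is small.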

IPMW Data Application #1
A new teacher education curriculum, CTPP (Chicago Teacher Pipeline Partnership), was implemented at one of the four Chicago schools of education, starting in the Fall of 2010.
Data on n = 347 undergraduate math teaching candidates (90% female), who had just completed a course on how to teach algebra. Pre-CTPP and Post-CTPP data (Fall 2007 - Spring 2013).
Dependent variable: Z-score, learning to teach Math assessment.
Covariates:
-- TimeF10 = (Year - 2010.9)/10 [2010.9 is the Fall 2010 cutoff].
-- CTPP = 1({Year - 2010.9} > 0) [treatment assignment indicator].
IPMW Model Results: Standardized residuals ranged from -.8 to .8. R-squared = .92. The posterior distributions of the $\beta$ and $\lambda$ slopes, for CTPP, each concentrate around zero.
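The covariate coding can be written out as a small Python sketch. The cutoff value 2010.9 and the two formulas, TimeF10 = (Year - 2010.9)/10 and CTPP = 1({Year - 2010.9} > 0), come from the slide; the example year values and the convention that a decimal of .9 means Fall and .4 means Spring are assumptions made only to have something to code.

```python
CUTOFF_YEAR = 2010.9  # Fall 2010 cutoff, as stated on the slide

def time_f10(year):
    """Assignment variable centered at the cutoff, in decades."""
    return (year - CUTOFF_YEAR) / 10.0

def ctpp(year):
    """Treatment assignment indicator, with the strict inequality of the slide."""
    return 1 if (year - CUTOFF_YEAR) > 0 else 0

# Hypothetical term codes (assumed convention: .9 = Fall, .4 = Spring).
years = [2007.9, 2010.4, 2010.9, 2011.4, 2013.4]
coded = [(yr, round(time_f10(yr), 3), ctpp(yr)) for yr in years]
for row in coded:
    print(row)
```

Centering at the cutoff means that TimeF10 = 0 is the point at which the two treatment conditions are compared.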

IPMW Data Application #1
[Figure: density (p.d.f., -- 95%) of Z_posttest at TimeF10 = 0, for CTPP = 0 (blue) vs. CTPP = 1 (red).]
The new curriculum, compared to the old curriculum, increased the LMT scores, shifting the density of LMT scores to the right. This shift corresponds to an increase in the mean (from .17 to .20), the 10%ile (-1.43 to -1.35), and the median (.07 to .15), and to a decrease in the variance (1.78 to 1.69).

IPMW Data Application #2
Causal link between basic skills and teaching ability? (Gitomer et al., 2011, J Teacher Education).
Data on n = 205 undergraduate teaching candidates, under CTPP.
Dependent variable: Haberman Z-score on urban teaching ability (persistence; organization & planning; values student learning; theory to practice; at-risk students; approach to students; survive in bureaucracy; explains teacher success; explains student success; fallibility).
Covariates:
-- B240d10 = (min[Reading, Language, Math, Write] - 240) / 10.
-- BasicPass = 1(B240d10 > 0) = 1(Pass Reading Test).
IPMW Model Results: Standardized residuals ranged from -1.3 to 1.2. R-squared = .99. For BasicPass, the posterior mean (s.d.) estimate of the $\beta$ slope is 1.49 (1.54), and the posterior mean (s.d.) estimate of the $\lambda$ slope is -.04 (.49).

IPMW Data Application #2
[Figure: density (p.d.f., -- 95%) of Z_haberman at B240d10 = 0, for BasicPass = 0 (blue) vs. BasicPass = 1 (red). Annotation: four clusters of students.]
A detailed inspection revealed that passing the basic skills reading test causally increased the Haberman z-score, in terms of the mean (from .31 to .45), 25%ile (-.65 to -.62), 75%ile (1.30 to 1.43), and 95%ile (2.36 to 2.82).

Conclusions
We proposed a Bayesian nonparametric regression model for RD designs.
The model provides a way to estimate the causal effect of a treatment (versus non-treatment), in terms of the treatment's regression discontinuity effect on the entire density of the outcome variable.
Through the analyses of real data, we showed how the model can provide a causal analysis of how a treatment variable impacts the full distribution of the outcomes, including the mean, variance, quantiles, p.d.f., and so forth.
The model can be easily extended for the analysis of discrete-valued or (left- and/or right-) censored outcomes.
Manuscript, "A Bayesian Nonparametric Causal Model for Regression Discontinuity Designs": http://arxiv.org/abs/1311.4482
User-friendly software has been developed for the model.

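Restricted DP (rdp) Mixture model
$X_1, \ldots, X_n \mid r, \rho_n, (\beta_j^*, \sigma_j^{2*})_{j=1}^{k_n} \sim \prod_{j=1}^{k_n} \prod_{i:\, s_i = j} \text{Normal}(x_i \mid \mathbf{r}_i^\top \beta_j^*, \sigma_j^{2*})$,
$(\beta_j^*, \sigma_j^{2*}) \sim \text{Normal}(\beta_j^* \mid \beta_0, \sigma_j^{2*} C^{-1})\,\text{InverseGamma}(\sigma_j^{2*} \mid a, b)$,
$\pi(\rho_n \mid r) = \dfrac{\Gamma(\alpha)\,\Gamma(n + 1)}{\Gamma(\alpha + n)}\, \dfrac{\alpha^{k_n}}{k_n!} \prod_{j=1}^{k_n} \dfrac{1}{n_j}\; 1(s_{r(1)} \leq \cdots \leq s_{r(n)})$,
where:
-- X is an observed pre-treatment covariate (or prognostic/propensity score);
-- $\mathbf{r}_i = (1, r_i)$ are vectors of the assignment variables (i = 1, ..., n);
-- $(\beta_j^*, \sigma_j^{2*})_{j=1}^{k_n}$ are the $k_n \leq n$ distinct values of the parameters $(\beta_1, \sigma_1^2), \ldots, (\beta_n, \sigma_n^2)$ that are assigned to each of the n subjects, with $k_n$ random;
-- $\rho_n = (s_1, \ldots, s_n)$ is the random partition of the n observations, with $s_i = j$ if $(\beta_i, \sigma_i^2) = (\beta_j^*, \sigma_j^{2*})$, and $n_j = \sum_i 1\{(\beta_i, \sigma_i^2) = (\beta_j^*, \sigma_j^{2*})\}$;
-- $r(1), \ldots, r(n)$ is the permutation of the first n integers that rearranges $(r_1, \ldots, r_n)$ in increasing order, as $r_{r(1)} \leq \cdots \leq r_{r(n)}$, with corresponding values $x_{r(1)}, \ldots, x_{r(n)}$ and $s_{r(1)}, \ldots, s_{r(n)}$ of x and $s_1, \ldots, s_n$.
The rdp has precision $\alpha$ and Normal-InvGamma baseline distribution.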
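The partition prior can be evaluated and sanity-checked in a few lines of Python. The sketch assumes the prior has the form $\pi(\rho_n \mid r) = \frac{\Gamma(\alpha)\Gamma(n+1)}{\Gamma(\alpha+n)} \frac{\alpha^{k_n}}{k_n!} \prod_j \frac{1}{n_j}$ on partitions whose clusters are contiguous blocks in the ordering of r, so that a partition is fully described by its ordered cluster sizes. That reading, and the function names, are assumptions of this example.

```python
import math

def log_rdp_partition_prior(cluster_sizes, alpha):
    """Log prior of an order-restricted partition, given its ordered cluster sizes."""
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    return (math.lgamma(alpha) + math.lgamma(n + 1) - math.lgamma(alpha + n)
            + k * math.log(alpha) - math.lgamma(k + 1)
            - sum(math.log(nj) for nj in cluster_sizes))

def compositions(n):
    """All ordered ways to split n sorted subjects into contiguous clusters."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest

alpha = 1.0
total = sum(math.exp(log_rdp_partition_prior(c, alpha)) for c in compositions(6))
one_cluster = math.exp(log_rdp_partition_prior((6,), alpha))
print(round(total, 6), round(one_cluster, 6))
```

Summing over all contiguous partitions of six subjects gives total probability one, which is a quick check that the assumed form is a proper distribution on the restricted partition space.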

Posterior of Random Partition
$\pi(\rho_n \mid x, r) \propto \dfrac{\alpha^{k_n}}{k_n!} \prod_{j=1}^{k_n} \dfrac{1}{n_j}\, \dfrac{|C|^{1/2}}{|C + R_j^\top R_j|^{1/2}}\, \dfrac{b^a\,\Gamma(a + n_j/2)}{\Gamma(a)\,(b + V_j^2/2)^{a + n_j/2}}\; 1(s_{r(1)} \leq \cdots \leq s_{r(n)})$,
$V_j^2 = (x_j - \hat{x}_j)^\top W_j (x_j - \hat{x}_j)$, $\quad W_j = I_j - R_j (C + R_j^\top R_j)^{-1} R_j^\top$, $\quad$ and $\hat{x}_j = R_j \beta_0$,
where $x_j$ is the vector of $x_i$, and $R_j$ is the matrix of $\mathbf{r}_i = (1, r_i)$, for subjects in cluster j.
The posterior is sampled by an RJ-MCMC algorithm, which either splits or merges a randomly selected cluster.
A causal inference strategy for sharp RD:
-- Identify the subject $i = i_0$ with observed $r_i$ nearest to the cutoff $r_0$.
-- For each draw of the partition $\rho_n$ from its posterior $\pi(\rho_n \mid x, r)$, find the cluster where that subject is located. Within that cluster, use a two-sample test statistic to compare the outcomes ($y_i$) for treatment subjects (having $r_i > r_0$) with the outcomes for non-treatment subjects (having $r_i < r_0$).
-- Average the two-sample statistics over a large number of RJ-MCMC draws.
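The three-step strategy can be sketched in Python, using the difference in means as the two-sample statistic. The partition draws here are hand-written label vectors standing in for RJ-MCMC output, and the data, the function names, and the choice to skip draws in which the cluster lacks subjects on one side of the cutoff are all assumptions of this example.

```python
def nearest_to_cutoff(r, r0):
    """Index of the subject whose assignment value is closest to the cutoff."""
    return min(range(len(r)), key=lambda i: abs(r[i] - r0))

def cluster_mean_difference(y, r, labels, r0, i0):
    """Treated-minus-control mean outcome within the cluster containing i0."""
    members = [i for i, s in enumerate(labels) if s == labels[i0]]
    treated = [y[i] for i in members if r[i] > r0]
    control = [y[i] for i in members if r[i] < r0]
    if not treated or not control:
        return None                      # statistic undefined for this draw
    return sum(treated) / len(treated) - sum(control) / len(control)

def averaged_statistic(y, r, partition_draws, r0):
    """Average the within-cluster statistic over partition draws."""
    i0 = nearest_to_cutoff(r, r0)
    stats = [cluster_mean_difference(y, r, labels, r0, i0)
             for labels in partition_draws]
    stats = [s for s in stats if s is not None]
    return sum(stats) / len(stats)

# Toy data, sorted by r; clusters are contiguous blocks, as in the rdp.
r = [0.10, 0.40, 0.48, 0.55, 0.60, 0.90]
y = [0.0, 1.0, 1.0, 3.0, 4.0, 5.0]
partition_draws = [                      # made-up stand-ins for RJ-MCMC draws
    [0, 1, 1, 1, 1, 2],
    [0, 0, 1, 1, 2, 2],
]
avg_diff = averaged_statistic(y, r, partition_draws, r0=0.5)
print(avg_diff)
```

In the first draw the cutoff cluster holds four subjects (difference 3.5 - 1.0 = 2.5) and in the second it holds two (difference 3.0 - 1.0 = 2.0), so the average is 2.25. Any other two-sample statistic, such as a t-statistic or a quantile difference, can be substituted for the mean difference.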

Basic skills example (again). rdp: $\alpha$ = 1, vague N-IG baseline prior.
Posterior mean (95% posterior credible interval) for various test statistics, comparing treatment outcomes ($y_i$) vs. non-treatment outcomes, for the cluster of subjects around the cutoff $r_0$.

Statistic               Non-Treatment            Treatment
sample size             103.1 (3, 190)           6.7 (2, 16)
mean                    .37 (.07, 1.55)          1.23 (.97, 1.59)
variance                .76 (.01, 1.04)          .47 (.01, 0.85)
interquartile range     1.17 (.18, 1.71)         .90 (.24, 1.41)
skewness                .11 (-1.26, .71)         .03 (-.63, 0.82)
kurtosis                2.69 (1.45, 3.41)        2.06 (1.00, 3.06)
1%ile                   -1.34 (-2.20, 1.47)      .27 (-0.65, 1.47)
10%ile                  -.77 (-1.35, 1.47)       .42 (.01, 1.47)
25%ile                  -.20 (-0.65, 1.47)       .74 (.30, 1.47)
50%ile                  .34 (.18, 1.47)          1.22 (1.00, 1.59)
75%ile                  .98 (.53, 1.65)          1.65 (1.47, 2.06)
90%ile                  1.43 (1.24, 1.71)        2.14 (1.71, 2.77)
99%ile                  2.04 (1.71, 2.42)        2.28 (1.71, 2.89)

Test                    Statistic                p-value
t-statistic             -2.02 (-4.21, .88)       .19 (.00, .91)
F test, variance        4.86 (.02, 34.45)        .65 (.05, .98)
Pr(Y1 ? Y0 | C_r0)      0.70 (.21, .93)
Pr(Y1 ? Y0 | C_r0)      0.22 (.04, .67)
KS test                 .28 (.05, .98)

References
Bloom, H. (2012). Modern regression discontinuity analysis. Journal of Research on Educational Effectiveness, 5, 43-82.
Cattaneo, M., Frandsen, B., and Titiunik, R. (2013). Randomization Inference in the Regression Discontinuity Design: An Application to the Study of Party Advantages in the U.S. Senate. University of Michigan. February 19th. Unpublished manuscript.
Cook, T. (2008). Waiting for life to arrive: A history of the regression discontinuity design in psychology, statistics and economics. Journal of Econometrics, 142, 636-654.
Dawid, A. (2000). Causal inference without counterfactuals. Journal of the American Statistical Association, 95, 407-424.
Dawid, A. (2002). Influence diagrams for causal modelling and inference. International Statistical Review, 70, 161-189.

Dawid, A. (2010). Beware of the DAG! Journal of Machine Learning Research-Proceedings Track, 6, 59-86.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. London: Chapman and Hall/CRC.
Flegal, J.M., and Jones, G.L. (2011). Implementing Markov chain Monte Carlo: Estimating with confidence. In S.P. Brooks, A.E. Gelman, G.L. Jones, and X.L. Meng (Eds.), Handbook of Markov Chain Monte Carlo, pp. 175-197. Boca Raton, FL: CRC Press.
Frandsen, B., Frölich, M., and Melly, B. (2012). Quantile treatment effects in the regression discontinuity design. Journal of Econometrics, 168, 382-395.
Gitomer, D.H., Brown, T.L., and Bonett, J. (2011). Useful signal or unnecessary obstacle? The role of basic skills tests in teacher preparation. Journal of Teacher Education, 62, 431-445.

Hahn, J., Todd, P., and van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69, 201-209.
Imbens, G. and Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, 79, 933-959.
Imbens, G. W. and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142, 615-635.
Kalli, M., Griffin, J., and Walker, S.G. (2010). Slice Sampling Mixture Models. Statistics and Computing, 21, 93-105.
Karabatsos, G. and Walker, S. (2012). Adaptive-modal Bayesian nonparametric regression. Electronic Journal of Statistics, 6, 2038-2068.
Lee, D. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142, 675-697.

Lee, D. and Lemieux, T. (2010). Regression Discontinuity Designs in Economics. The Journal of Economic Literature, 48, 281-355.
Rubin, D.B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2, 808-840.
Thistlethwaite, D. and Campbell, D. (1960). Regression-discontinuity analysis: An alternative to the ex-post facto experiment. Journal of Educational Psychology, 51, 309-317.
Wade, S., Walker, S.G., and Petrone, S. (2013, to appear). A Predictive Study of Dirichlet Process Mixture Models for Curve Fitting. Scandinavian Journal of Statistics.
Wong, V., Steiner, P., and Cook, T. (2013). Analyzing Regression-Discontinuity Designs With Multiple Assignment Variables: A Comparative Study of Four Estimation Methods. Journal of Educational and Behavioral Statistics, 38, 107-141.