Causal Inference with Big Data Sets

Marcelo Coca Perraillon, University of Colorado AMC, November 2016

Outline
- Big data
- Causal inference in economics and statistics
- Regression discontinuity design
- Examples
  1. Five-star ratings of nursing homes
  2. Early intervention for very low-birth-weight infants
  3. Hospital readmission penalties
- Estimation issues and extensions

Big data: What is big data?
- The new big data: remarkable advances in the ability to store and query very large data sets in real time
- Think of Uber, Google Maps, Kayak
- Also advances in prediction methods: think Netflix recommendations and machine learning
- That same technology could be applied to decision-making tools for physicians in real time through electronic health records
- Query speed is fundamental

Big data: Big data in old-fashioned research
- No immediate benefit in terms of causal inference and traditional research questions (no hypothesis testing in machine learning)
- In health economics and health services research, big data has been used for a long time
- Think of studies using Medicare claims: millions of enrollees with multiple claims per year over multiple years. That's very big data
- Causal problems are the same regardless of the data size; a lot of bad data is the same as a little bad data
- However, some observational methods are better suited for large data sets

Causal inference
- Causal inference has been a constant in economics for almost 100 years: supply and demand curves (around 1918)
- New advances in the last 15 years in several fields
- A new language, notation, and a unified theory in the counterfactual framework (see Imbens and Rubin, 2015)
- Today: regression discontinuity design (RDD)
- Not a big data method per se, but in practice one needs large datasets to implement it

Regression discontinuity: Thistlethwaite and Campbell
- Studied the impact of merit awards on future academic outcomes
- Awards were allocated based on test scores. A test score greater than c (the cutoff point) guaranteed receipt of the scholarship
- Research question: What is the effect of receiving a scholarship on future grades or income?
- We cannot just compare students who received the scholarship to those who did not (recipients scored higher and likely differ in other ways)
- But... Thistlethwaite and Campbell realized they could compare individuals just above and just below the cutoff point

Regression discontinuity: Simple idea
- Is there much difference among students scoring, say, 720 versus 740? Probably not (given some assumptions)
- Can be empirically verified: compare measured baseline characteristics of those just above and just below the cutoff point c
- We cannot compare unmeasured characteristics
- Perhaps the main reason RDD is becoming popular is that one can see the design working
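
The balance check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the window width, and the tiny data set (test scores with a hypothetical cutoff of 730 and age as the baseline covariate) are made up for the example and do not come from the slides.

```python
from statistics import mean

def balance_near_cutoff(running, covariate, cutoff, window):
    """Difference in the mean of a baseline covariate between units
    just above and just below the cutoff (within +/- window)."""
    above = [z for x, z in zip(running, covariate) if cutoff < x <= cutoff + window]
    below = [z for x, z in zip(running, covariate) if cutoff - window <= x <= cutoff]
    if not above or not below:
        raise ValueError("need observations on both sides of the cutoff")
    return mean(above) - mean(below)

# Hypothetical data: test scores and age at baseline
scores = [700, 715, 722, 728, 731, 735, 741, 760]
ages = [17, 18, 18, 17, 18, 17, 18, 19]

# A difference close to zero is what one hopes to see if the design works
diff = balance_near_cutoff(scores, ages, cutoff=730, window=15)
print(diff)
```

In practice one would repeat this for every measured baseline characteristic and attach a standard error to each difference.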

Regression discontinuity: Assumptions and validity
- The assignment mechanism is known and depends on a continuous variable (the assignment variable is often called the running or forcing variable)
- The cutoff point is known (the probability of treatment jumps to 1 if test score > c)
- Key assumption: individuals cannot precisely manipulate their assignment variable
- Data can be analyzed as if it were a (conditionally) randomized experiment
- Limitation: the estimated treatment effect applies only to those near the cutoff point (a local average treatment effect, LATE)

Regression discontinuity: Assumptions and validity
- Expressing the key assumption of RDD as the absence of precise manipulation is common in economics because economists model the behavior of agents making choices
- Some authors write that validity depends on the assignment being arbitrary or on the assignment variable being measured with error (e.g. Moscoe et al., 2015)
- These are slightly different ways of saying the same thing, which can be somewhat confusing
- Arbitrary: the cutoff point is not related to the outcome; it could be any other value (think of weight)
- Measured with error: if the assignment variable is measured with error, then the cutoff point c is essentially (somewhat) arbitrary

Regression discontinuity: Manipulation
- Manipulation example 1: Test with few questions and plenty of time
- Manipulation example 2: DMV test to get a driver's license
- Manipulation example 3: Eligibility criteria to obtain some benefit (say, income below 150% of the Federal Poverty Level)
- Consequence: units close to the cutoff point are not comparable (but this is somewhat testable)
- Some manipulation does NOT invalidate RDD. Whether manipulation is precise, and the lack of relation between the cutoff point and the outcome, are key to identifying causal effects

Regression discontinuity: Graphical example
[Figure: simulated data with c = 140]

Regression discontinuity: No effect

Regression discontinuity: Sharp and fuzzy RDD
- Sharp RDD: the assignment or running variable completely determines treatment. The probability of treatment jumps from 0 to 1 at the cutoff point
- Fuzzy RDD: the cutoff point increases the probability of treatment, but the running variable does not completely determine treatment
- Which brings us back to the world of instrumental variables; think of encouragement designs where treatment assignment is the instrument
- Unlike most instruments, one can still check whether observations are similar near the cutoff point
- Fuzzy RDD designs are underutilized
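
The instrumental-variables reading of a fuzzy RDD can be illustrated with a local Wald ratio: the jump in the mean outcome at the cutoff divided by the jump in the probability of treatment. The sketch below is a simplification under assumptions (plain means within a fixed window, made-up data, hypothetical function name); applied work would normally fit local regressions on each side instead of comparing raw means.

```python
from statistics import mean

def fuzzy_rdd_wald(running, treated, outcome, cutoff, window):
    """Local Wald estimate: jump in mean outcome / jump in treatment rate."""
    rows = list(zip(running, treated, outcome))
    above = [(d, y) for x, d, y in rows if cutoff < x <= cutoff + window]
    below = [(d, y) for x, d, y in rows if cutoff - window <= x <= cutoff]
    if not above or not below:
        raise ValueError("need observations on both sides of the cutoff")
    # First stage: how much the cutoff shifts the probability of treatment
    jump_treatment = mean(d for d, _ in above) - mean(d for d, _ in below)
    # Reduced form: how much the cutoff shifts the mean outcome
    jump_outcome = mean(y for _, y in above) - mean(y for _, y in below)
    if jump_treatment == 0:
        raise ValueError("no jump in treatment probability at the cutoff")
    return jump_outcome / jump_treatment

# Hypothetical data: crossing the cutoff raises the treatment rate from 25% to 75%,
# and treatment raises the outcome by 2 for everyone who receives it
x = [-4, -3, -2, -1, 1, 2, 3, 4]
d = [0, 0, 0, 1, 1, 1, 1, 0]
y = [2 * di for di in d]

wald = fuzzy_rdd_wald(x, d, y, cutoff=0, window=4)
print(wald)
```

In a sharp design the denominator equals 1, so the ratio reduces to the jump in the outcome.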

Examples: Examples from literature
- Research question: Do the benefits of additional medical expenditures for at-risk newborns exceed their costs?
- Almond et al. (2010): the assignment variable is birth weight. Newborns weighing less than 1,500 grams (about 3 pounds) receive more medical treatment
- Fuzzy design: low birth weight increases the probability of more medical treatment but does not completely determine treatment (first stage)
- Mortality as the outcome (second stage)
- Restricted the analysis to a small window around the threshold (85 grams)
- Clinical guidelines are a good place to find (fuzzy) discontinuities

Examples: Examples from literature
- Do consumers respond to composite ratings of quality in nursing homes (Perraillon et al., 2016)?
- In 2008 the Centers for Medicare & Medicaid Services (CMS) released composite ratings of quality in nursing homes (1 to 5 stars)
- Overall stars are assigned based on deficiency data transformed into a points system
- Percentiles based on this score (the running variable) determine the cutoff points used to assign stars
- Outcome: new admissions six months after the release of ratings
- Nursing homes that received an additional star gained more admissions, except for lower-rated nursing homes

Examples: Assignment of stars based on scores

Examples: Other examples
- Hospital readmission penalties (with Rich Lindrooth)
- Anderson and Magruder (2012) and Luca (2012): Yelp.com ratings have an underlying continuous score. The distribution determines cutoff points for 1 to 5 stars. Effect of an extra star on future reservations and revenue
- Anderson et al. (2012): Young adults lose their health insurance as they age (older than 18 and in college, but different after the ACA). Age changes the probability of having health insurance (fuzzy design)
- Card et al. (2009): Does Medicare save lives? Older adults can get Medicare after age 65. They compared ER admissions for people just below and just above 65. Found that patients over 65 get more services (first stage) and that mortality is reduced (second stage)

Examples: Estimation (parametric)
- The simplest case is a linear relationship between Y and X: $Y_i = \beta_0 + \beta_1 T_i + \beta_3 X_i + \epsilon_i$
- $T_i = 1$ if subject i received treatment and $T_i = 0$ otherwise. Often written as $T_i = \mathbf{1}(X_i > c)$ or $T_i = \mathbf{1}_{[X_i > c]}$
- X is the assignment variable, usually centered at the cutoff point: $Y_i = \beta_0 + \beta_1 T_i + \beta_3 (X_i - c) + \epsilon_i$
- The treatment effect is given by $\beta_1$: $E[Y \mid T = 1, X = c] = \beta_0 + \beta_1$ and $E[Y \mid T = 0, X = c] = \beta_0$, so $E[Y \mid T = 1, X = c] - E[Y \mid T = 0, X = c] = \beta_1$
- What about covariates?
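
As a rough illustration of the centered linear model above, the following sketch fits it by ordinary least squares using only the Python standard library. The data are noise-free and invented for the example (true coefficients 1, 2 and 0.5, with c = 140), and the helper names `ols` and `sharp_rdd_linear` are hypothetical; a real analysis would use a statistics package that also reports standard errors.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def sharp_rdd_linear(x, y, cutoff):
    """Fit Y = b0 + b1*T + b3*(X - c) with T = 1(X > c); b1 is the treatment effect."""
    design = [[1.0, 1.0 if xi > cutoff else 0.0, xi - cutoff] for xi in x]
    return ols(design, y)

# Invented, noise-free data with a jump of 2 at c = 140
c = 140
x = list(range(120, 161))
y = [1.0 + (2.0 if xi > c else 0.0) + 0.5 * (xi - c) for xi in x]

b0, b1, b3 = sharp_rdd_linear(x, y, c)
print(b0, b1, b3)
```

Because X is centered at c, the intercept is the expected outcome for untreated units at the cutoff and the coefficient on T is the jump at the cutoff.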

Examples: Need to model the relationship between X and Y correctly
- What if it is nonlinear?
- Assuming a linear model could then result in a biased treatment effect

Examples: Main issues in estimation
- Correctly modeling the functional form relating the outcome to the running variable
- The paper by Hahn, Todd, and Van der Klaauw (2001) clarified the assumptions of RDD and framed estimation as a nonparametric problem
- The idea is to estimate a model that does not assume a functional form for the relationship between Y and X. The model is something like $Y_i = f(X_i) + \epsilon_i$
- Emphasized using weighted local polynomial regression (instead of LOWESS)
- Problem: it is difficult to add covariates, and the outcome may not be normally distributed
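
One way to picture the nonparametric approach is a local linear regression fitted separately on each side of the cutoff, with observations weighted by their distance to the cutoff. The sketch below assumes a triangular kernel and a fixed bandwidth; both are choices made for this illustration, not details taken from the slides, and the data and function names are invented.

```python
def local_linear_intercept(u, y, bandwidth):
    """Weighted least squares of y on (1, u) with triangular kernel weights.
    u is the running variable centered at the cutoff; returns the intercept,
    i.e. the fitted value at the cutoff."""
    w = [max(0.0, 1.0 - abs(ui) / bandwidth) for ui in u]
    sw = sum(w)
    swu = sum(wi * ui for wi, ui in zip(w, u))
    swuu = sum(wi * ui * ui for wi, ui in zip(w, u))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swuy = sum(wi * ui * yi for wi, ui, yi in zip(w, u, y))
    denom = sw * swuu - swu ** 2
    if denom == 0:
        raise ValueError("not enough observations with positive weight")
    return (swuu * swy - swu * swuy) / denom

def local_linear_rdd(x, y, cutoff, bandwidth):
    """Difference between the right and left local linear fits at the cutoff."""
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if xi <= cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if xi > cutoff]
    a_left = local_linear_intercept([u for u, _ in left], [v for _, v in left], bandwidth)
    a_right = local_linear_intercept([u for u, _ in right], [v for _, v in right], bandwidth)
    return a_right - a_left

# Invented data: different slopes on each side and a jump of 3 at c = 140
x = [140 + k / 2 for k in range(-20, 21)]
y = [(5.0 + 0.3 * (xi - 140)) if xi > 140 else (2.0 + 0.1 * (xi - 140)) for xi in x]

effect = local_linear_rdd(x, y, cutoff=140, bandwidth=5)
print(effect)
```

Observations farther than one bandwidth from the cutoff receive zero weight, so only nearby data influence the estimate.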

Examples: A window or all data?
- Should we use all the data or just the data near the cutoff point?
- Typical bias-variance trade-off; the optimal bandwidth literature was never influential
- Common advice was to use high-order polynomials to control for the running variable until Gelman and Imbens (2014), "Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs":
- "We argue that estimators for causal effects based on [higher order polynomials] can be misleading, and we recommend researchers do not use them, and instead use estimators based on local linear or quadratic polynomials..."
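
The bias side of the trade-off can be seen in a small example. The sketch below uses an invented smooth cubic outcome with no true jump at c = 140 and fits a separate straight line on each side within several window widths; the function names and window values are assumptions for illustration. Because the data are noise-free, the example shows only bias, not the extra variance that comes with narrower windows.

```python
def fit_line(u, y):
    """Ordinary least squares for y = a + b*u; returns (a, b)."""
    n = len(u)
    u_bar = sum(u) / n
    y_bar = sum(y) / n
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    sxy = sum((ui - u_bar) * (yi - y_bar) for ui, yi in zip(u, y))
    b = sxy / sxx
    return y_bar - b * u_bar, b

def jump_within_window(x, y, cutoff, window):
    """Difference in fitted values at the cutoff from separate linear fits
    on each side, using only observations within +/- window."""
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - window <= xi <= cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff < xi <= cutoff + window]
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left

# Invented data: a smooth cubic outcome, so the true effect at the cutoff is zero
x = [140 + k / 2 for k in range(-40, 41)]
y = [0.001 * (xi - 140) ** 3 for xi in x]

estimates = {w: jump_within_window(x, y, 140, w) for w in (20, 10, 5, 2)}
for w, est in estimates.items():
    print(w, est)
```

The spurious jump shrinks toward zero as the window narrows, which is the motivation for local estimators; with noisy data, fewer observations in the window would also make the estimate less precise.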

Examples: Extensions and other issues
- Statistical tests for manipulation (clustering close to the cutoff point)
- More than one assignment variable (e.g. a scholarship based on verbal and quantitative tests)
- Discontinuities in geography
- Fuzzy RDD is underutilized
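
A very rough way to look for clustering near the cutoff is to count observations in a narrow window on each side and ask whether the split is compatible with 50/50. The sketch below uses an exact two-sided binomial test; it is a simplified stand-in for formal density tests, it assumes the density of the running variable is roughly flat within the window, and the data, window, and function names are invented for the example.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes
    that are no more likely than observing k successes out of n."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

def bunching_check(running, cutoff, window):
    """Counts just below and just above the cutoff, plus a binomial p-value."""
    below = sum(1 for x in running if cutoff - window <= x <= cutoff)
    above = sum(1 for x in running if cutoff < x <= cutoff + window)
    return below, above, binomial_two_sided_p(above, below + above)

# Invented running variable with suspicious bunching just above c = 140
running = [120, 130, 136, 139,
           140.5, 141, 141, 141.5, 142, 142, 143, 143.5, 144, 145,
           150, 160]

below, above, p_value = bunching_check(running, cutoff=140, window=5)
print(below, above, p_value)
```

A small p-value suggests more units on one side of the cutoff than chance would produce, which is a reason to investigate manipulation further rather than proof of it.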

Examples: Summary of method
- Developed to estimate causal treatment effects in non-experimental settings
- Good internal validity; some assumptions can be empirically verified
- Fairly convincing when it is possible to show balance close to the cutoff point
- Treatment effects are local (LATE), which limits external validity
- Relatively easy to estimate
- In practice it requires large datasets to have enough observations within the bandwidth around the cutoff
- Stata code and more slides: http://tinyurl.com/mcperraillon