PSC 504: Dynamic Causal Inference

Matthew Blackwell
4/8/2013

The problem

Let's go back to a problem we faced earlier: how to estimate causal effects with treatments that vary over time. We could have a panel data situation, where we have repeated measurements of the outcome, or we could see a back-and-forth between the covariates and the treatment, with only one outcome observed at the end. An example of the latter is negative advertising in campaigns: a campaign decides to go negative based on past polling, but that negativity may have an impact on future polling. Of course, the outcome is the final vote share, and that is only observed at the end of the campaign.

What do we have to sort out here? One issue is what types of effects we are interested in. Earlier we saw that panel data regression models can only identify the contemporaneous effect of the treatment and cannot estimate the effect of treatment history. We'll see how to estimate these treatment-history effects today.

Notation

Let $\bar{A}_{it} = (A_{i1}, \ldots, A_{it})$ be the partial history of treatment up to time $t$. Define $\bar{X}_{it}$ similarly. Sometimes we'll need to reference a specific instance of these variables, and we'll use $\bar{a}_t$ and $\bar{x}_t$ to do so. We drop the $t$ subscript to refer to the entire sequence: $\bar{A}_i = (A_{i1}, \ldots, A_{iT})$. Of course, we'll need potential outcomes, and these could be functions of the entire treatment history: $Y(\bar{a})$. Note that this notation implies that the regimes are static, so that the decision to go negative at time $t$ is fixed by the treatment history and does not change based on covariates that are evolving over time.

We can refer more generally to treatment regimes, which are like strategies in game theory: they are rules that dictate what actions/treatments units should take given a certain covariate history. We define these as $g(\bar{x}_t)$, which takes values $a_t$. Clearly the static treatment histories above are treatment regimes that are fixed across covariates. Obviously we might have different outcomes if we follow different treatment regimes, even if those regimes are observationally equivalent for some people. Thus, we want to write potential outcomes for the regimes: $Y_i(g)$. Treatment regimes are complicated and generally fairly hard to deal with, but the temptation is fairly obvious: these are the kinds of effects people want to know about. In order to connect the potential outcomes and the observed data:

$$Y_i = Y_i(g) \quad \text{if } \bar{A}_i = g(\bar{X}_i).$$

This says that if a unit's observed history is equal to the prescription of the treatment regime, then the observed outcome equals the potential outcome under that regime.

g-computation

We would like to estimate the effects of these regimes, something like the following:

$$\tau(g, g') = E[Y(g) - Y(g')]$$

In medical studies, the goal is often to estimate the optimal regime, which is the following:

$$\arg\max_g E[Y(g)]$$

Here, we are trying to find the regime that maximizes the outcome (assuming the outcome is beneficial). Either of these estimands requires us to estimate the mean of the potential outcome under a given regime. How do we do that? In general, we are going to have to make an ignorability assumption, which we will call sequential ignorability:

$$Y(g) \perp\!\!\!\perp A_t \mid \bar{X}_t = \bar{x}_t, \; \bar{A}_{t-1} = g(\bar{x}_{t-1})$$

This assumption says that the potential outcome under some regime is independent of the treatment at time $t$, conditional on the past values of the covariates and the treatment regime. Again, this is similar to running a sequential experiment, where the randomization can depend on the past. We also have to assume positivity, which says that there are no deterministic treatments. That is, if $\Pr[\bar{A}_{t-1} = \bar{a}_{t-1}, \bar{X}_t = \bar{x}_t] > 0$, then

$$\Pr[A_t = a_t \mid \bar{X}_t = \bar{x}_t, \bar{A}_{t-1} = \bar{a}_{t-1}] > 0.$$

But how do we calculate the marginal mean of the potential outcomes? The same way we have before: by marginalizing over the distribution of the covariates. Here it is more tricky because they vary over time. Jamie Robins came up with what he calls the g-computation formula for these types of marginal means. In general, it looks like this:

$$E[Y(g)] = \int_{x_T} \cdots \int_{x_0} E[Y \mid \bar{X} = \bar{x}, \bar{A} = g(\bar{x})] \prod_{j=0}^{T} f(x_j \mid \bar{X}_{j-1} = \bar{x}_{j-1}, \bar{A}_{j-1} = g(\bar{x}_{j-1})) \, d\mu(x_j)$$

Notice that the right-hand side here only has observable quantities. To get this, we had to invoke sequential ignorability. To see how this works, let's work with a simpler example: two time periods, with a binary covariate $X$ measured between the two treatments. First note that, under consistency, we know that $E[Y(g) \mid \bar{A} = g(\bar{X})] = E[Y \mid \bar{A} = g(\bar{X})]$.

$$\begin{aligned}
E[Y(g)] &= E[Y(g) \mid A_1 = g_1] \\
&= E[Y(g) \mid X = 1, A_1 = g_1] \Pr[X = 1 \mid A_1 = g_1] + E[Y(g) \mid X = 0, A_1 = g_1] \Pr[X = 0 \mid A_1 = g_1] \\
&= E[Y(g) \mid X = 1, A_2 = g_2(1), A_1 = g_1] \Pr[X = 1 \mid A_1 = g_1] + E[Y(g) \mid X = 0, A_2 = g_2(0), A_1 = g_1] \Pr[X = 0 \mid A_1 = g_1] \\
&= E[Y \mid X = 1, A_2 = g_2(1), A_1 = g_1] \Pr[X = 1 \mid A_1 = g_1] + E[Y \mid X = 0, A_2 = g_2(0), A_1 = g_1] \Pr[X = 0 \mid A_1 = g_1]
\end{aligned}$$

The first equality comes from sequential ignorability (the first period is randomized), the second from the law of iterated expectations, the third from sequential ignorability (conditional on the first treatment and the covariate, the second treatment is random), and the last from consistency. Note that we can't just collapse the last line with the law of iterated expectations because the conditioning sets for the outcome and the covariates are different.

In the cross-sectional case this is true as well, but typically we average across the marginal distribution of $X_i$. That is relatively easy because we can usually just use the empirical distribution of the data. Here, though, we need the distribution of the $X_{it}$ conditional on the past, which means we will almost certainly need a model for the relationship between the covariates and the past in addition to the model for the outcome. Ugh. Lots of modeling. Obviously, if the covariates are continuous, we have to replace the sum over the distribution of the covariates with an integral, which is what we see in the g-computation formula.

One approach is to write down models for the covariates and the outcome, construct a likelihood, estimate the parameters of that likelihood, and plug them into the g-formula. Robins has some work showing that this is problematic in most situations because there may be no setting of the parameters that corresponds to the null hypothesis of no effect. That is, it might be difficult to test the null hypothesis of no effect of the treatment regime. It might be possible to estimate effects in this situation, but this is a relatively open question. There are ways to reparameterize the distribution of the data to make progress, and these models are usually called structural nested models.
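To make the plug-in logic of the two-period derivation concrete, here is a minimal Python sketch, assuming a hypothetical data frame df with columns a1, x, a2, and y (first treatment, intermediate binary covariate, second treatment, outcome); the column names and the helper function are illustrative, not from the original notes.

```python
import pandas as pd

def g_computation_two_period(df, g1, g2):
    """Plug-in g-computation estimate of E[Y(g)] with two binary
    treatments, one binary covariate x between them, and outcome y.
    g1 is the fixed first-period treatment; g2 maps x to the
    second-period treatment."""
    est = 0.0
    for x in (0, 1):
        a1, a2 = g1, g2(x)
        # E[Y | X = x, A2 = g2(x), A1 = g1]: outcome mean among units
        # that followed the regime through period 2 (requires
        # positivity: each such cell must be nonempty)
        cell = df[(df.x == x) & (df.a1 == a1) & (df.a2 == a2)]
        # Pr[X = x | A1 = g1]: covariate distribution among units
        # that followed the regime in period 1
        followers = df[df.a1 == a1]
        est += cell.y.mean() * (followers.x == x).mean()
    return est

# e.g., the "always negative" regime:
# g_computation_two_period(df, g1=1, g2=lambda x: 1)
```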

Marginal structural models

With g-computation, we had to write down models for the covariates in addition to the outcome, which is a large pain. Ideally, we would just be able to run a regression and read off the coefficients in a causal way. This is what marginal structural models give us, and we can use a weighting approach to estimate their parameters. A marginal structural model (MSM) is a model for the marginal mean of the potential outcome for a given treatment history (we're going to ignore treatment regimes for now). That is, it's a model of the form:

$$E[Y(\bar{a})] = h(\bar{a}; \beta)$$

Here $h$ is a link function and $\beta$ is a set of parameters. There are a lot of modeling choices we might make here. With a binary treatment variable, there are $2^T$ possible treatment histories.

IPTW

Let's say we randomly assigned treatment histories. In the single-shot case, we could always nonparametrically estimate $E[Y(1)]$ and $E[Y(0)]$ using simple means. Here, though, we are unlikely to observe anyone following any particular history. Thus, simple means like these are not going to work. We are going to need a model for the marginal potential outcome means. How should we model this? It could be that the number of treated periods is all that matters:

$$E[Y(\bar{a})] = \beta_0 + \beta_1 \sum_{t=1}^{T} a_t$$

Or it could be that the effect varies over time:

$$E[Y(\bar{a})] = \beta_0 + \beta_1 \sum_{t=1}^{T/2} a_t + \beta_2 \sum_{t=T/2+1}^{T} a_t$$

In any case, we have to write down a model for the outcome in terms of the treatment history. But how do we actually estimate these models? Can we use regression or matching to estimate the parameters? Unfortunately not. Imagine we estimate the following model:

$$E[Y \mid \bar{A}, \bar{X}]$$

This model conditions on variables that are changing over time (we call them time-varying confounders). Imagine we are in the two-period case and we estimate this:

$$E[Y \mid A_1, A_2, X] = \alpha_0 + \alpha_1 A_1 + \alpha_2 A_2 + \alpha_3 X$$

Here we conditioned on the covariate to remove omitted variable bias, but we have actually introduced post-treatment bias. If the effect of negativity early in the race flows through the polls and we condition on the polls, this is going to underestimate the effect of earlier negativity. From this point of view, maybe we omit the polls and estimate this model:

$$E[Y \mid A_1, A_2] = \alpha_0 + \alpha_1 A_1 + \alpha_2 A_2$$

Do these parameters equal the causal parameters $\beta$? No, because now there is confounding between polling and going negative later in the race. Thus, polling is simultaneously post- and pre-treatment: controlling for it induces post-treatment bias, and omitting it induces confounding bias.

What can we do? We can instead use a weighting approach. Remember that regression and matching looked within subsets of the covariates to find balance conditional on the covariates. Weighting instead reweights the data to remove those imbalances. How do we weight? Well, we know that with a single-shot treatment, we can weight by the inverse of the propensity score:

$$W_i = \frac{A_i}{\Pr[A_i = 1 \mid X_i]} + \frac{1 - A_i}{\Pr[A_i = 0 \mid X_i]}$$
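As a reference point, here is a minimal sketch of those single-shot weights in Python, assuming a covariate matrix X and a binary treatment vector a (hypothetical names), with the propensity score fit by logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def single_shot_weights(X, a):
    """Inverse propensity-score weights 1 / Pr[A_i = a_i | X_i]
    for a binary, one-period treatment."""
    pscore = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]
    # Probability of the treatment each unit actually received
    p_received = np.where(a == 1, pscore, 1 - pscore)
    return 1.0 / p_received
```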

Here, the propensity scores are more complicated because the treatment is more complicated. We have to weight by the inverse of the probability of observing the entire history, $\Pr[\bar{A} \mid \bar{X}]$:

$$W_i = \prod_{t=1}^{T} \frac{1}{\Pr(A_{it} \mid \bar{A}_{i,t-1}, \bar{X}_{it})}$$

Thus, in this case, we weight each unit by the inverse of the probability of receiving the treatment history it did, conditional on the past. Let's look at this in the two-period example. Say we see a campaign that is positive in the first period, is trailing in the polls, and then goes negative later in the race. The weight we would calculate would be:

$$W_i = \frac{1}{\Pr(\text{pos}_1) \Pr(\text{neg}_2 \mid \text{trail}, \text{pos}_1)}$$

Why does the weighting work? It balances the distribution of the data so that, in the reweighted data, any arrows pointing from $X$ to the treatments are removed. Thus, in the reweighted data, there is no confounding, no backdoor paths, and no need to control for the covariates anymore. To see this, we can look at how the weights affect the joint distribution of the observed data:

$$f_W(Y, \bar{A}, \bar{X}) = \frac{W(\bar{A}, \bar{X})}{\omega} f(Y, \bar{A}, \bar{X}) = \frac{1}{\omega} \frac{f(Y \mid \bar{A}, \bar{X}) f(\bar{A} \mid \bar{X}) f(\bar{X})}{f(\bar{A} \mid \bar{X})} = \frac{1}{\omega} f(Y \mid \bar{A}, \bar{X}) f(\bar{X})$$

Here, $\omega$ is a normalizing constant. Basically, the weights balance the distribution of the treatment with respect to the covariates, but don't change the relationship between the treatment and the outcome. This is exactly what we want.
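Putting the pieces together, here is a minimal two-period sketch, again assuming a hypothetical data frame df with columns a1, x, a2, and y: each period's treatment probability is modeled given the observed past, the inverse probabilities are multiplied across periods, and the MSM is then fit by weighted least squares without conditioning on x.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def iptw_msm_two_period(df):
    """Fit the MSM E[Y(a1, a2)] = b0 + b1*a1 + b2*a2 by IPTW."""
    # Period 1: no prior history, so use the marginal probability
    p1 = df.a1.mean()
    pr_a1 = np.where(df.a1 == 1, p1, 1 - p1)

    # Period 2: Pr(A2 | A1, X) from a logistic regression on the past
    past = df[["a1", "x"]].to_numpy()
    p2 = LogisticRegression().fit(past, df.a2).predict_proba(past)[:, 1]
    pr_a2 = np.where(df.a2 == 1, p2, 1 - p2)

    # Inverse probability of the observed treatment history
    w = 1.0 / (pr_a1 * pr_a2)

    # Weighted least squares for the MSM; note we do NOT include x
    design = sm.add_constant(df[["a1", "a2"]])
    return sm.WLS(df.y, design, weights=w).fit()

# msm = iptw_msm_two_period(df); print(msm.params)
```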