Propensity Score Methods for Estimating Causal Effects from Complex Survey Data

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Robert D. Ashmead, M.S.
Graduate Program in Biostatistics
The Ohio State University
2014

Dissertation Committee:
Dr. Bo Lu, Advisor
Dr. Rebecca Andridge
Dr. Eloise Kaizar

© Copyright by Robert D. Ashmead 2014

Abstract

Propensity score based adjustments are popular in analyzing observational data. To obtain valid causal estimates, we often assume that the sample is a simple random sample from the population of interest or that the treatment effect is homogeneous across the population. When data from surveys with complex designs are used, ad hoc adjustments to incorporate survey weights are often applied without rigorous justification. In this dissertation, we propose a super population framework, which includes a pair of potential outcomes for every unit in the population, to streamline the propensity score analysis for complex survey data. Based on the proposed framework, we develop propensity score stratification, weighting, and matching estimators along with a new class of hybrid estimators and corresponding variance estimators that adjust for survey design features. Additionally, we argue that in this context we should estimate the propensity scores by a weighted logistic regression using the sampling weights. Various estimators are compared in simulation studies that calculate the bias, mean-squared error, and coverage of the estimators. As the treatment effect becomes more heterogeneous, the gains of adjusting for the survey design increase. Lastly, we demonstrate the proposed methods using a real data example that estimates the effect of health insurance on self-rated health for adults in Ohio who may be eligible for tax credits to purchase medical insurance from the healthcare insurance exchange.

For my family and friends. Thank you for your love and support.

Acknowledgments

I would first like to thank my advisor, Dr. Lu, for his always helpful and steady guidance during the past few years. It has been a truly wonderful experience to learn from and collaborate with you. I would also like to acknowledge Dr. Andridge, Dr. Kaizar, and Dr. Stasny for their input, time, and encouragement as committee members. A special acknowledgment also goes to Dr. Nagaraja, who convinced me not to give up on my PhD during my moment of crisis. Thanks to everyone at the Government Resource Center, especially Tim Sahr, Rachel Tumin, and Dan Weston, for their support, advice, and friendship during the past several years. Thanks to my dear friends Dr. Chris and Robin Nau for your constant friendship, support, and countless Wednesday dinners over the past six years. Finally, a profound thanks to my parents. Without your love, support, and encouragement, I would not be where I am today.

Vita

November 2, Born - Grand Rapids, Michigan

Education
B.A. Mathematics, College of Wooster, Wooster, Ohio
M.S. Statistics, The Ohio State University, Columbus, Ohio

Professional Experience
Graduate Teaching Associate, The Ohio State University, Columbus, Ohio
Summer Researcher, Eunice Kennedy Shriver National Institute of Child Health and Development, Bethesda, Maryland
Fellow, Leadership Education in Neurodevelopmental Disabilities, Columbus, Ohio
2012-Present: Graduate Research Specialist, Ohio Colleges of Medicine Government Resource Center, Columbus, Ohio

Fields of Study
Major Field: Biostatistics

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction and Literature Review
   The Potential Outcomes Framework
   Population and Sample Average Treatment Effects
   The Propensity Score
   Common Propensity Score Methods
   Propensity Score Matching
   Propensity Score Stratification
   Propensity Score Inverse Probability Weighting
   Variance Estimators for Matching, Stratification, and IPW Estimators
   Matching Estimators
   Stratification Estimators
   IPW Estimators
   Complex Survey Sampling
   Literature Review of Propensity Score Methods with Complex Survey Data

2. General Framework, Weighting and Stratification Estimators
   A General Framework
   Sampling Assumptions
   Stratification Estimators
   Weighting Estimators
   Estimating the Propensity Score
   Evaluating Balance
   Variance Estimators
   Variance of Stratification Estimators
   Variance of Weighting Estimators
   Estimators for PATT
   Stratification Estimators
   Weighting Estimators
   Simulation Study
   Tables and Figures

3. Stratification-Weighting Hybrid Estimators
   Hybrid Estimators in the Non-Survey Context
   Simulation
   Hybrid Estimators in the Survey Context
   Evaluating Balance
   Simulation
   Tables and Figures

4. Matching Estimators
   Point Estimators
   1:K Survey Matching Estimator
   Survey Full Matching Estimator
   Decomposition of the Estimators
   Bias of the Estimators
   Variance Estimators
   Evaluating Balance
   Simulation Studies
   Tables and Figures

5. Example: Health Reform Impact on Individual Health Outcomes
   Introduction
   Data
   Methods
   Results
   Conclusions
   Tables and Figures

6. Discussion and Future Work
   Discussion
   Limitations
   Future Work
   Cluster Designs
   Missing Data
   Identifying Heterogeneity

Bibliography

List of Tables

2.1 Weighting and stratification simulation results with $\beta = \beta_{str}$
2.2 Weighting and stratification simulation results with $\beta = \beta_{mod}$
3.1 Mean number of propensity score strata used in propensity score stratification and hybrid estimators with stratification rules in a non-survey context
3.2 Simulation results for propensity score stratification, weighting, and hybrid estimators when the propensity score model is misspecified in a non-survey context
5.1 Point estimates, estimated variance, and estimated 95% confidence intervals for the effect of health insurance on self-rated health as a continuous variable
5.2 Point estimates, estimated variance, and estimated 95% confidence intervals for the effect of health insurance on the prevalence of fair/poor self-rated health

List of Figures

2.1 Graphical representation of the general potential outcomes framework
2.2 Graphical representation of a typical potential outcomes framework
2.3 Estimated propensity scores using weighted and unweighted logistic regression from one iteration of the simulation study
2.4 Percent bias of propensity score stratification, weighting, and regression estimators
2.5 Mean squared error of propensity score stratification, weighting, and regression estimators
2.6 Coverage of the 95% confidence intervals of the propensity score stratification and weighting estimators
3.1 Percent bias of propensity score stratification, weighting, and hybrid estimators with variable number of propensity score strata in a non-survey context
3.2 MSE of propensity score stratification, weighting, and hybrid estimators with variable number of propensity score strata in a non-survey context
3.3 MSE of propensity score stratification, weighting, hybrid, and regression estimators with stratification rules in a non-survey context
3.4 Coverage of 95% confidence intervals created by propensity score stratification, weighting, and hybrid estimators with variable number of propensity score strata in a non-survey context
3.5 Coverage of 95% confidence intervals created by propensity score stratification and hybrid estimators with stratification rules in a non-survey context
3.6 MSE of propensity score stratification, weighting, and hybrid estimators with variable number of propensity score strata in a survey context with $\gamma = \gamma_{str}$
3.7 Coverage of 95% confidence intervals created by propensity score stratification, weighting, and hybrid estimators with variable number of propensity score strata in a survey context with $\gamma = \gamma_{str}$
4.1 Percent bias of propensity score weighting and matching estimators for PATE
4.2 MSE of propensity score weighting and matching estimators for PATE
4.3 Coverage of 95% confidence intervals created by propensity score weighting and matching estimators for PATE
4.4 Histogram of the estimated propensity scores in a sample from one iteration of the PATT simulation
4.5 Percent bias and MSE of propensity score matching estimators for PATT
4.6 Coverage of 95% confidence intervals created by propensity score matching estimators for PATT
5.1 The smoothed distributions and box and whiskers plot of the estimated propensity score for treated and control groups
5.2 Estimated propensity scores using unweighted and weighted logistic regression
5.3 The balance of treated and control groups before and after propensity score weighting using weights implied by $\hat{\Delta}_{ipw2.svy}$, using propensity scores from weighted logistic regression
5.4 The balance of treated and control groups before and after propensity score weighting using weights implied by $\hat{\Delta}_{ipw2.svy}$, using propensity scores from an unweighted logistic regression
5.5 The balance of treated and control groups before and after stratification using $\hat{\Delta}_{s.svy}$
5.6 The balance of treated and control groups before and after stratification and weighting using $\hat{\Delta}_{h.ipw2.svy}$
5.7 The balance of treated and control groups before and after matching using $\hat{\Delta}^{t}_{matchk.svy}$
5.8 The balance of treated and control groups before and after matching using $\hat{\Delta}^{t}_{matchfull.svy}$

Chapter 1: Introduction and Literature Review

A major advantage of randomized studies over observational studies is that the randomization mechanism balances the distribution of the covariates across treatment groups. This includes both observed and unobserved covariates. In contrast, associations between the treatment exposure and the covariates may be present in an observational study, leading to imbalance in the distribution of the covariates across treatment groups. If the covariates are also associated with the outcome, then without proper adjustment, confounding bias may occur when attempting to make inference about the treatment effect. There exist many methods that attempt to adjust for confounding in observational data to get an unbiased estimate of the treatment effect, including matching on the covariates, stratifying on the covariates, weighting, and regression modeling.

The propensity score, defined as the conditional probability of receiving treatment given observed covariates, was introduced by Rosenbaum and Rubin [36] and provides an approach to correcting the imbalance of observed covariates between treatment groups. The propensity score serves as a dimension reduction tool that summarizes treatment-assignment-related covariate information into a scalar value. Matching, stratifying, and weighting using the propensity score are popular methods for estimating treatment effects.

This dissertation develops new methods for estimating population average treatment effects using propensity scores when data are collected from a complex probability sample of a population. Specifically, methods based on weighting, stratifying, and matching by the estimated propensity score are examined. The remaining sections of this chapter give a review of the potential outcomes framework, propensity scores, propensity score estimators, complex sample surveys, and the existing literature relevant to our proposed methods. Next, Chapter 2 discusses our proposed general framework for potential outcomes in a complex probability survey, weighting and stratification estimators, and their respective variance estimators. We demonstrate the properties of these estimators in a simulation study. Chapter 3 proposes and investigates hybrid stratification-weighting estimators for both the complex survey and non-survey context. We illustrate through a simulation how these estimators are more robust against misspecification of the propensity score. In Chapter 4 we propose propensity score matching estimators for use with complex survey data and investigate their properties through a simulation. Chapter 5 consists of a real data example which uses our proposed methods to estimate the effect of health insurance on self-rated health. Lastly, in Chapter 6, we discuss the conclusions, limitations, and future directions of this research.

1.1 The Potential Outcomes Framework

The potential outcomes framework provides a way of thinking about the effect of a treatment or intervention on an individual, unit, sample, or population that is both intuitive and mathematically convenient. A treatment can be broadly defined depending on the area of study. Some examples are taking a drug (e.g. aspirin),

a health behavior (e.g. smoking cigarettes), or a program (e.g. having health insurance). The potential outcomes framework compares what the outcome would be under the treatment with what the outcome would be under some other treatment or lack of treatment. This comparison defines the treatment effect.

Consider a treatment or intervention with two levels of exposure, defined generically as treatment and control. Let $Z$ be an indicator of the observed treatment exposure, so that $Z = 1$ indicates receiving treatment and $Z = 0$ indicates receiving control. Define a pair of potential outcomes $(Y^{z=1}, Y^{z=0})$ for each individual, which represent the outcomes for that individual if he or she were to receive treatment ($Y^{z=1}$) and if he or she were to receive control ($Y^{z=0}$). The potential outcomes framework is also known as the counterfactual framework in the psychological literature. This framework allows us to define a causal effect for an individual. The treatment exposure has a causal effect for individual $i$ if $Y_i^{z=1} \neq Y_i^{z=0}$ [2]. The individual causal effect can be measured by comparisons of the potential outcomes such as $Y_i^{z=1} - Y_i^{z=0}$ or $Y_i^{z=1} / Y_i^{z=0}$. In real life we can only ever observe one of the potential outcomes for each individual, which leads to what Holland calls the fundamental problem of causal inference [7]. We are never able to observe the treatment effect for a unit because we only ever observe one potential outcome for each unit at a given time.

Instead of individual causal effects, we focus on average population causal effects. The treatment exposure has a causal effect in the population of $N$ units if $E[Y^{z=1}] \neq E[Y^{z=0}]$ [2], where $E[Y^{z=1}] = \frac{1}{N}\sum_{i=1}^{N} Y_i^{z=1}$ and $E[Y^{z=0}] = \frac{1}{N}\sum_{i=1}^{N} Y_i^{z=0}$ are the averages of the potential outcomes of units in the population of interest. Here we borrow the notation of expectation to denote the population mean. From this point on we simply refer to the average population causal effect as the average

causal effect or average treatment effect. To quantify the strength of the causal effect we use effect measures such as

1. $E[Y^{z=1}] - E[Y^{z=0}]$,
2. $E[Y^{z=1}] / E[Y^{z=0}]$, or
3. $\dfrac{P[Y^{z=1} = 1] / P[Y^{z=1} = 0]}{P[Y^{z=0} = 1] / P[Y^{z=0} = 0]}$,

depending on the type of outcome variable in our study and how we wish to interpret the effect [2]. Here we focus exclusively on estimating the average treatment effect difference $E[Y^{z=1}] - E[Y^{z=0}]$. This measure is meaningful both when $Y$ is a continuous variable and when it is a binary variable. Define the average treatment effect ATE as

$$\mathrm{ATE} = E[Y^{z=1}] - E[Y^{z=0}]. \tag{1.1}$$

The average treatment effect ATE can be thought of as the average difference in the outcome if all units had received treatment rather than if all units had received control. A slightly different estimand that is often used is the average treatment effect in the treated [6], which is defined as

$$\mathrm{ATT} = E[Y^{z=1} \mid Z = 1] - E[Y^{z=0} \mid Z = 1]. \tag{1.2}$$

The average treatment effect in the treated is essentially the treatment effect restricted to the treated subpopulation. Similarly, we can define the average treatment effect in the control

$$\mathrm{ATC} = E[Y^{z=1} \mid Z = 0] - E[Y^{z=0} \mid Z = 0]. \tag{1.3}$$
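The three estimands in (1.1) to (1.3) can be illustrated on simulated data in which, unlike in real data, both potential outcomes are visible for every unit. The sketch below is only an illustration under assumptions that are not part of the dissertation: the population, the binary effect modifier `x`, the treatment probabilities, and the effect sizes are all hypothetical choices made so that the three estimands and the naive treated-minus-control comparison come apart.

```python
import random


def simulate_population(n=20000, seed=0):
    """Hypothetical population with a binary effect modifier x.

    The individual effect is 1 when x = 0 and 3 when x = 1, and units with
    x = 1 are more likely to be treated, so ATE, ATT and ATC differ.
    """
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if x == 1 else 0.2          # P(Z = 1 | X = x)
        z = 1 if rng.random() < p_treat else 0
        y0 = x + rng.gauss(0.0, 1.0)              # potential outcome under control
        y1 = y0 + (3.0 if x == 1 else 1.0)        # potential outcome under treatment
        units.append((x, z, y0, y1))
    return units


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def estimands(units):
    """Average the individual effects over all, treated, and control units."""
    ate = mean(y1 - y0 for _, _, y0, y1 in units)
    att = mean(y1 - y0 for _, z, y0, y1 in units if z == 1)
    atc = mean(y1 - y0 for _, z, y0, y1 in units if z == 0)
    return ate, att, atc


units = simulate_population()
ate, att, atc = estimands(units)

# The naive comparison uses only what would be observed in practice:
# Y^{z=1} for treated units and Y^{z=0} for control units.
naive = (mean(y1 for _, z, _, y1 in units if z == 1)
         - mean(y0 for _, z, y0, _ in units if z == 0))

print(f"ATE={ate:.2f}  ATT={att:.2f}  ATC={atc:.2f}  naive={naive:.2f}")
```

Because `x` both modifies the effect and drives treatment, the treated group contains more high-effect units than the control group, which is why ATT exceeds ATE and ATE exceeds ATC in this setup. The naive difference is larger still because `x` also shifts the control outcome.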

These estimands, ATE, ATT, and ATC, will only differ when the treatment effect is heterogeneous. For example, assume that a binary variable $X$ is an effect modifier, and the distribution of $X$ differs between the treated and control populations. This implies that the distribution of $X$ is different between the overall, treated, and control populations, which implies that ATE, ATC, and ATT will differ. The choice of which estimand is of interest depends on the particular research question and the interpretation of the results that the researcher is seeking. In general, in this dissertation we will focus on average treatment effects, with a smaller amount of attention given to the average treatment effect in the treated. When using the treatment effect difference, such as in (1.1), the results for ATT apply directly to ATC because we can simply switch the meaning of treatment and control.

Assume we collect data, either experimental or observational, and define $Y$ as the observed outcome. We can write $Y$ as a combination of the potential outcomes and the observed treatment exposure, namely

$$Y = Y^{z=1} Z + Y^{z=0} (1 - Z). \tag{1.4}$$

The problem in estimating (1.1) from the observed data is that while we may be able to observe $E[Y \mid Z = 1] = E[Y^{z=1} \mid Z = 1]$ and $E[Y \mid Z = 0] = E[Y^{z=0} \mid Z = 0]$, these are not necessarily equal to $E[Y^{z=1}]$ and $E[Y^{z=0}]$, respectively. In an observational study, $Z$ is not assigned randomly to individuals, and thus may be associated with the potential outcomes. In contrast, a randomized study guarantees that $(Y^{z=1}, Y^{z=0}) \perp Z$, where $\perp$ denotes statistical independence. Under this condition $E[Y \mid Z = 1] = E[Y^{z=1}]$ and $E[Y \mid Z = 0] = E[Y^{z=0}]$.

Now consider a set of covariates associated with both the potential outcomes and the treatment assignment. These covariates are typically called confounders. Let us

write these as $X$. If we are able to identify all confounders and measure them in $X$, then the potential outcomes are independent of the treatment assignment conditional on $X$. We write this condition as

$$(Y^{z=1}, Y^{z=0}) \perp Z \mid X. \tag{1.5}$$

Condition (1.5) implies that among individuals with the same value of $X$, treatment assignment is independent of the potential outcomes. In other words, the treatment assignment is effectively random for those who share a value of $X$. It then follows from (1.5) that

$$E\{E[Y \mid Z = 1, X]\} = E\{E[Y^{z=1} \mid X]\} = E[Y^{z=1}], \text{ and}$$
$$E\{E[Y \mid Z = 0, X]\} = E\{E[Y^{z=0} \mid X]\} = E[Y^{z=0}],$$

allowing us to estimate (1.1). We emphasize that (1.5) is an untestable assumption, and is commonly referred to as the assumption of strongly ignorable treatment assignment [36]. Similarly, we can get the quantities needed for (1.2) and (1.3) by seeing that under condition (1.5)

$$E_{X \mid Z=0}\{E[Y \mid Z = 1, X]\} = E_{X \mid Z=0}\{E[Y^{z=1} \mid Z = 0, X]\} = E[Y^{z=1} \mid Z = 0], \text{ and}$$
$$E_{X \mid Z=1}\{E[Y \mid Z = 0, X]\} = E_{X \mid Z=1}\{E[Y^{z=0} \mid Z = 1, X]\} = E[Y^{z=0} \mid Z = 1].$$

1.2 Population and Sample Average Treatment Effects

Since we are ultimately interested in the complex probability survey setting, and not a simple random sample, it is important to consider differences between the sample and the population. One such difference to consider is the estimand. Consider a population of $N$ units of which we sample $n$. Let $S$ be the set of $n$ units in our sample and let $i = 1, 2, \ldots, N$ index the population units. Define the population average treatment effect as

$$\mathrm{PATE} = E[Y^{z=1}] - E[Y^{z=0}] = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i^{z=1} - Y_i^{z=0}\right), \tag{1.6}$$

which is the difference in potential outcomes averaged across the $N$ units in the population [2]. This is the treatment effect in the population, rather than in our sample. The sample average treatment effect is defined as

$$\mathrm{SATE} = E[Y^{z=1} \mid S = 1] - E[Y^{z=0} \mid S = 1] = \frac{1}{n}\sum_{i \in S}\left(Y_i^{z=1} - Y_i^{z=0}\right), \tag{1.7}$$

which is the difference in potential outcomes averaged across the $n$ units in the sample. Similarly, if we are concerned with the average treatment effect in the treated we define

$$\mathrm{PATT} = E[Y^{z=1} \mid Z = 1] - E[Y^{z=0} \mid Z = 1] = \frac{1}{N_1}\sum_{i=1}^{N} Z_i\left(Y_i^{z=1} - Y_i^{z=0}\right), \text{ and} \tag{1.8}$$

$$\mathrm{SATT} = E[Y^{z=1} \mid Z = 1, S = 1] - E[Y^{z=0} \mid Z = 1, S = 1] = \frac{1}{n_1}\sum_{i \in S} Z_i\left(Y_i^{z=1} - Y_i^{z=0}\right), \tag{1.9}$$

where $N_1$ and $n_1$ are the number of treated units in the population and sample respectively. For the average treatment effect in the control we define

$$\mathrm{PATC} = E[Y^{z=1} \mid Z = 0] - E[Y^{z=0} \mid Z = 0] = \frac{1}{N_0}\sum_{i=1}^{N} (1 - Z_i)\left(Y_i^{z=1} - Y_i^{z=0}\right), \text{ and} \tag{1.10}$$

$$\mathrm{SATC} = E[Y^{z=1} \mid Z = 0, S = 1] - E[Y^{z=0} \mid Z = 0, S = 1] = \frac{1}{n_0}\sum_{i \in S} (1 - Z_i)\left(Y_i^{z=1} - Y_i^{z=0}\right), \tag{1.11}$$

where $N_0$ and $n_0$ are the number of control units in the population and sample respectively. While SATE, SATT, and SATC are completely valid estimands and circumstances exist where they would be of interest, most of the time we are ultimately interested in estimating the corresponding population level quantities.

When estimating PATE, PATT, or PATC we worry about two situations that may result in bias: sample selection and treatment imbalance [2]. Treatment imbalance refers to differences in the distribution of confounders between treatment and control groups. Sample selection refers to differences in the distribution of the treatment effect between the sample and the population, meaning for example $\mathrm{SATE} \neq \mathrm{PATE}$. Sample selection bias can only occur when we have a heterogeneous treatment effect. If the treatment effect is a function of a variable $X$, and the distribution of $X$ differs between our sample and population, then not correcting for the sample selection will lead to a biased estimate. Little and Rubin think about this idea in terms of a non-ignorable selection mechanism [25], meaning that the selection mechanism can be ignored if the sampled potential outcomes have the same distribution as the potential outcomes not in the sample. When attempting to estimate PATE, PATT, or PATC with survey data, it is necessary to adjust for any differences between the sample and the population (sample selection) using the sampling weights, and for differences in covariate distributions of treated and control groups (treatment imbalance) using a method such as propensity scores.

In the context of a population-based sample survey, the assumption of a homogeneous treatment effect seems unwarranted in most cases, especially when dealing with human subjects. Kurth et al. [22] and Lunt et al. [28] show that the average

effect estimate is highly sensitive to the propensity score technique in the presence of treatment effect heterogeneity because of the implied distribution of each method.

1.3 The Propensity Score

Propensity score methods are a useful way of summarizing confounders $X$ and can be used to estimate the causal effect of a treatment in observational studies. The propensity score is defined as the probability of treatment given the observed covariates. We write the propensity score as $e(x) = P(Z = 1 \mid X)$. A necessary assumption is that the propensity score is bounded away from 0 and 1. Intuitively, if the propensity score is equal to zero or one for some observations, then it is difficult to argue that both potential outcomes exist. The basis of all propensity score methodology is a result of the work of Rosenbaum and Rubin [36], who showed that under condition (1.5)

$$(Y^{z=1}, Y^{z=0}) \perp Z \mid e(x) \text{ and} \tag{1.12}$$
$$X \perp Z \mid e(x). \tag{1.13}$$

Equation (1.12) implies that treatment assignment is at random among individuals who share the same propensity score. The major advantage of conditioning on the propensity score rather than $X$ is in dimensionality. The set of variables $X$ may potentially contain tens or hundreds of variables while $e(x)$ is only a single variable. It then follows from (1.12) that

$$E\{E[Y \mid Z = 1, e(x)]\} = E\{E[Y^{z=1} \mid e(x)]\} = E[Y^{z=1}], \text{ and}$$
$$E\{E[Y \mid Z = 0, e(x)]\} = E\{E[Y^{z=0} \mid e(x)]\} = E[Y^{z=0}],$$

allowing us to estimate the average treatment effect (1.1). Similarly, following (1.12),

$$E_{e(x) \mid Z=0}\{E[Y \mid Z = 1, e(x)]\} = E_{e(x) \mid Z=0}\{E[Y^{z=1} \mid Z = 0, e(x)]\} = E[Y^{z=1} \mid Z = 0], \text{ and}$$
$$E_{e(x) \mid Z=1}\{E[Y \mid Z = 0, e(x)]\} = E_{e(x) \mid Z=1}\{E[Y^{z=0} \mid Z = 1, e(x)]\} = E[Y^{z=0} \mid Z = 1],$$

which allows us to estimate (1.2) and (1.3).

A result of (1.13) is that the distribution of the covariates $X$ will be balanced, in the sense of having equivalent distributions, among those units in the treatment and control groups who share the same propensity score. In finite samples this balance is only approximate, and checking the balance is an important step in the analysis to show that the propensity score is adequately modeled. Balance is a characteristic of our sample, not the population. However, analysis methods that we apply to our sample data may imply different distributions of $X$ than the original sample, through weighting observations or simply choosing a matched subset of our sample. In these circumstances, the balance should be evaluated in these implied sample distributions rather than in the original sample.

The propensity score is generally not observed and therefore must be estimated from the observed data. Typically in practice we assume that $e(x)$ follows a parametric model such as a logistic regression model or a probit regression model, though other modeling choices are possible. Let $\hat{e}$ be the estimated propensity score from our model.
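As a rough illustration of the two steps just described, estimating $\hat{e}$ with a logistic regression and then checking balance in the implied (weighted) sample, the sketch below fits an unweighted logistic model by Newton-Raphson and compares a standardized mean difference before and after inverse probability weighting. Everything in it is an assumption made for illustration: the simulated covariates, the coefficients in `true_beta`, and the helper names `fit_logistic` and `std_diff` are hypothetical, and it describes a simple random sample rather than the complex survey setting, where the dissertation argues for a weighted logistic regression using the sampling weights.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(xs, zs, iters=15):
    """Newton-Raphson fit of an unweighted logistic regression of Z on X."""
    rows = [[1.0] + list(x) for x in xs]          # add an intercept column
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, z in zip(rows, zs):
            e = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j in range(p):
                grad[j] += (z - e) * row[j]
                for k in range(p):
                    hess[j][k] += e * (1.0 - e) * row[j] * row[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


def std_diff(x, zs, weights):
    """Weighted treated-minus-control mean of x, scaled by the unweighted sd of x."""
    def wmean(group):
        num = sum(w * v for v, z, w in zip(x, zs, weights) if z == group)
        den = sum(w for z, w in zip(zs, weights) if z == group)
        return num / den
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / (len(x) - 1))
    return (wmean(1) - wmean(0)) / sd


rng = random.Random(1)
n = 5000
true_beta = [-0.5, 0.8, -0.5]                     # hypothetical model used only to simulate Z
xs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
zs = [1 if rng.random() < sigmoid(true_beta[0] + true_beta[1] * a + true_beta[2] * b) else 0
      for a, b in xs]

beta_hat = fit_logistic(xs, zs)
e_hat = [sigmoid(beta_hat[0] + beta_hat[1] * a + beta_hat[2] * b) for a, b in xs]

# Balance of the first covariate in the original sample and in the sample
# implied by inverse probability weights built from the estimated scores.
x1 = [a for a, _ in xs]
ipw_weights = [z / e + (1 - z) / (1.0 - e) for z, e in zip(zs, e_hat)]
before = std_diff(x1, zs, [1.0] * n)
after = std_diff(x1, zs, ipw_weights)
print(f"standardized difference in x1: before={before:.3f}, after={after:.3f}")
```

The standardized difference is only one of several possible balance diagnostics. It is used here because it can be computed in the weighted sample with the same formula as in the original sample.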

1.4 Common Propensity Score Methods

Three of the most common types of propensity score methods are matching, stratifying, and weighting, with variations of each. Each method has its own advantages and disadvantages, though in practice matching seems to be the most commonly used.

1.4.1 Propensity Score Matching

Matching is a very appealing method because it is intuitive to non-researchers. We simply match units who are similar. Instead of matching treatment and control observations together based on one or several covariates, we create matched pairs or groups based on the estimated propensity score $\hat{e}$. Once the matches are created, a number of analysis techniques can be used. While matched pairs or sets have the same or similar propensity scores, they do not necessarily have the exact same covariate pattern. Instead, the distribution of the covariates of the entire set of matches should be balanced between the treatment and control groups.

Multiple matching designs are possible, including 1:K matching, matching with replacement, and full matching. Assuming there are fewer treated units than control units, in 1:K matching, $K$ control units are matched to each treated unit by the propensity score. Increasing $K$ leads to a more efficient estimator, provided that the additional controls are good matches. Several methods can be used to estimate the treatment effect from the matched pairs or sets, the simplest being the mean of the pair/set differences, which can be written as

$$\hat{\Delta}^{t}_{matchk} = \frac{1}{n_1}\sum_{j=1}^{n_1}\left(Y_j^{1} - \bar{Y}_j^{0}\right), \tag{1.14}$$

where $j$ indexes the $n_1$ matched sets and $\bar{Y}_j^{0}$ is the average of the $K$ control units in matched set $j$ [7]. Not all control observations are included in (1.14), and as a result (1.14) estimates ATT rather than ATE.

Matching with replacement attempts to utilize all observations in the data by matching each observation, treatment or control, with $M$ observations of the opposite treatment assignment. The estimator can be written as

$$\hat{\Delta}_{matchrep} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i(1) - \hat{Y}_i(0)\right) = \frac{1}{n}\sum_{i=1}^{n}(2Z_i - 1)\left(1 + \frac{K_M(i)}{M}\right)Y_i, \tag{1.15}$$

where

$$\hat{Y}_i(1) = \begin{cases} \frac{1}{M}\sum_{j \in J_M(i,\theta)} Y_j, & \text{if } Z_i = 0, \\ Y_i, & \text{if } Z_i = 1, \end{cases} \qquad \hat{Y}_i(0) = \begin{cases} Y_i, & \text{if } Z_i = 0, \\ \frac{1}{M}\sum_{j \in J_M(i,\theta)} Y_j, & \text{if } Z_i = 1, \end{cases}$$

$J_M(i, \theta)$ is the set of $M$ observations in the treatment group opposite to $i$ with similar propensity scores, and $K_M(i)$ is the number of times that unit $i$ is used as a match []. This estimator, $\hat{\Delta}_{matchrep}$, estimates ATE, but a simple modification can be made to estimate ATT [3]. That estimator can be written as

$$\hat{\Delta}^{t}_{matchrep} = \frac{1}{n_1}\sum_{Z_i = 1}\left(Y_i - \hat{Y}_i(0)\right) = \frac{1}{n_1}\sum_{i=1}^{n}\left(Z_i - (1 - Z_i)\frac{K_M(i)}{M}\right)Y_i. \tag{1.16}$$

Full matching uses all sampled units and creates matched sets which consist of one treated unit and multiple controls, or one control unit and multiple treated units, with the number of units in a matched set being variable. This estimator can be written as

$$\hat{\Delta}_{matchfull} = \frac{1}{n}\sum_{j=1}^{J} n_j\left(\frac{\sum_{i=1}^{n_j} Z_{ji}Y_{ji}}{\sum_{i=1}^{n_j} Z_{ji}} - \frac{\sum_{i=1}^{n_j}(1 - Z_{ji})Y_{ji}}{n_j - \sum_{i=1}^{n_j} Z_{ji}}\right), \tag{1.17}$$

where the data are divided into $J$ mutually exclusive sets with similar propensity scores, $n_j$ is the number of observations in the $j$th matched set, and $Y_{ji}$ denotes the $i$th observation in the $j$th matched set. $\hat{\Delta}_{matchfull}$ estimates ATE. Replacing the number of observations each matched set represents with the number of treated observations will instead give us an estimator of ATT. It can be written as

$$\hat{\Delta}^{t}_{matchfull} = \frac{1}{n_1}\sum_{j=1}^{J}\left(\sum_{i=1}^{n_j} Z_{ji}\right)\left(\frac{\sum_{i=1}^{n_j} Z_{ji}Y_{ji}}{\sum_{i=1}^{n_j} Z_{ji}} - \frac{\sum_{i=1}^{n_j}(1 - Z_{ji})Y_{ji}}{n_j - \sum_{i=1}^{n_j} Z_{ji}}\right). \tag{1.18}$$

Rank-based tests like the Wilcoxon signed-rank test can be used to test for a treatment effect in matched samples as well. The Hodges-Lehmann estimator based on the signed-rank test can provide a point estimate of the treatment effect.

In order to carry out any of these matching designs, we first must find good matches. A number of algorithms can be used to complete this task [38]. These include nearest neighbor matching, calipers, optimal matching, kernel matching, and combinations of each. In nearest neighbor matching, units from the treated group are randomly ordered, and then, in order, the control unit(s) with the nearest propensity score is chosen as a match. This is continued until all units are matched. Nearest neighbor matching can result in bad matches if there are no units with similar propensity scores available. To guard against bad matches a caliper can be used, which sets the maximum possible distance between two matches. If there is no control unit within the caliper of the treated unit, then the treated unit is not used in the analysis. Instead of finding the best match one at a time, optimal matching minimizes

an objective function across the entire matched set, such as the total distance between matches. It is more computationally expensive than nearest neighbor matching, but it should result in better matches. Kernel matching estimates the missing potential outcome for each unit in the treated group by taking a weighted average of nearby observations from the control group. The weights are based on the distances, the density function, and the bandwidth chosen.

1.4.2 Propensity Score Stratification

Stratification, also called subclassification, can be thought of as a coarser matching procedure. After estimation of the propensity score for all observations, the observations are split into $K$ strata based on their estimated propensity scores. The treatment effect is estimated within each stratum by the difference of treatment and control means and then averaged across strata. The estimator of ATE is given by

$$\hat{\Delta}_{s} = \sum_{k=1}^{K}\left(\frac{n_k}{n}\right)\left[\frac{1}{n_{1k}}\sum_{i=1}^{n} Z_i Y_i I(\hat{e}_i \in Q_k) - \frac{1}{n_{0k}}\sum_{i=1}^{n}(1 - Z_i) Y_i I(\hat{e}_i \in Q_k)\right], \tag{1.19}$$

where $n$ is the sample size, $n_k$ is the sample size in stratum $k = 1, \ldots, K$, $n_{1k}$ and $n_{0k}$ are the number of treatment and control observations in stratum $k$, and $Q_k$ defines the $k$th stratum. If we are instead interested in estimating ATT, replace $n_k$ with $n_{1k}$ and $n$ with $n_1$. This estimator can be written as

$$\hat{\Delta}^{t}_{s} = \sum_{k=1}^{K}\left(\frac{n_{1k}}{n_1}\right)\left[\frac{1}{n_{1k}}\sum_{i=1}^{n} Z_i Y_i I(\hat{e}_i \in Q_k) - \frac{1}{n_{0k}}\sum_{i=1}^{n}(1 - Z_i) Y_i I(\hat{e}_i \in Q_k)\right]. \tag{1.20}$$

Stratification works on the principle that within each stratum we have strongly ignorable treatment assignment, so that treatment assignment is at random within a stratum. This is typically at best only approximately true. In practice $K = 5$

strata defined by the quintiles of the estimated propensity score are often used, due to a result by Rosenbaum and Rubin [37] that found stratifying on the quintiles eliminates approximately 90% of the bias. That result has been shown empirically not to hold in all circumstances, even very reasonable ones [27]. In fact, the bias can be quite a lot higher, over 25% in some simulations [27].

Instead of stratifying on the quintiles, or other quantiles that create equal frequencies in strata, Myers and Louis investigate creating optimal stratifications as a function of MSE [30]. They find that strata formed by equal frequency (quantiles of the propensity score) do as well or nearly as well as optimal stratification in almost all cases of their simulation. They conclude that the added analytical cost of calculating optimal strata, the additional estimation, and the assumptions needed are not worth the small gains in MSE. They also recommend increasing the number of strata until the overall estimate remains relatively constant or until the estimated variance greatly increases. The number of strata used has a bias-variance tradeoff relationship depending on the degree of confounding. In an unconfounded situation, the best solution in terms of MSE would be to have a single stratum, because this minimizes the variance. However, as the degree of confounding increases, having a larger number of strata to control the bias is the best solution. In their simulation Lunceford and Davidian found that increasing the number of strata from 5 to 10 reduced the bias by about 65% while keeping the variance roughly constant [27].

While creating strata strictly by equal frequency leaves us with the same number of total observations in each stratum, it does not guard against having too few treated or control observations in a stratum. This is a particular problem towards the extremes of the propensity score distribution. For instance, when the propensity

score is equal to 0.1 we expect 9 control observations for every treated observation. This could lead to problems in the propensity score stratum that contains the observations with the smallest propensity scores if it is not large enough. There may be too few treated observations to properly estimate the variance, and additionally the few treated observations would be extremely influential in determining the estimated treatment effect in that propensity score stratum.

An extension of the stratification technique that seeks to reduce within-stratum confounding is to model the outcome through a regression model within each propensity score stratum and average the model-assisted estimated treatment effects across strata as before. The estimator can be written as

$$\hat{\Delta}_{sr} = \sum_{k=1}^{K} \left(\frac{n_k}{n}\right) \left[ \frac{1}{n_k} \sum_{i=1}^{n} I(\hat{e}_i \in Q_k) \left\{ m^{(k)}(1, X_i, \hat{\alpha}^{(k)}) - m^{(k)}(0, X_i, \hat{\alpha}^{(k)}) \right\} \right], \tag{1.21}$$

where $m^{(k)}(Z, X, \hat{\alpha}^{(k)})$ is the regression outcome model in the $k$th stratum with estimated coefficients $\hat{\alpha}^{(k)}$. $\hat{\Delta}_{sr}$ is a consistent estimator of ATE when we are able to correctly model the outcome, but adding the modeling component leaves us open to model misspecification [27]. The advantage of $\hat{\Delta}_{sr}$ over trying to directly model the outcome through a regression model is that $\hat{\Delta}_{sr}$ is less affected by model misspecification [27].

1.4.3 Propensity Score Inverse Probability Weighting

A third type of common propensity score method is inverse probability weighting (IPW). IPW uses the inverse of the propensity score as a weight, similarly to the way the Horvitz-Thompson estimator uses the inverse probability of selection as a weight [9].
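As a rough numerical illustration of the weighting idea, the following Python sketch simulates data with a known propensity score and compares the treated-group mean of the outcome with the inverse-probability-weighted mean $n^{-1}\sum_i Z_i Y_i / e_i$. The data-generating model (a standard normal covariate, a logistic propensity score, and a linear outcome with $E[Y^{z=1}] = 2$), the sample size, and the seed are all assumptions chosen for illustration; none of them come from the dissertation.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Simulate (e, Z, Y1) under a hypothetical model with known propensity score."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                 # single confounder X
    e = 1.0 / (1.0 + np.exp(-x))           # true propensity score e(X)
    z = rng.binomial(1, e)                 # treatment indicator Z
    y1 = 2.0 + x + rng.normal(size=n)      # potential outcome Y^{z=1}, mean 2
    return e, z, y1


e, z, y1 = simulate()

# Y^{z=1} is only observed for treated units, so only Z * Y1 is used below.
naive_mean = y1[z == 1].mean()             # treated-only mean, confounded by X
ipw_mean = np.mean(z * y1 / e)             # (1/n) * sum_i Z_i Y_i / e_i

print(f"naive treated mean: {naive_mean:.3f}")
print(f"IPW mean:           {ipw_mean:.3f}  (target E[Y^(z=1)] = 2)")
```

Under these assumptions the treated units tend to have larger values of $X$, so the unweighted treated mean overstates $E[Y^{z=1}]$, while the weighted mean should land close to 2 because each treated unit is weighted up by the inverse of its probability of being treated.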

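Recall equation (1.4), which defined $Y = Y^{z=1} Z + Y^{z=0}(1 - Z)$. Also note that $Z(1 - Z) = 0$. These equations imply that

$$E\left[\frac{ZY}{e(X)}\right] = E\left[\frac{ZY^{z=1}}{e(X)}\right],$$

and assuming (1.2) we show that

$$\begin{aligned}
E\left[\frac{ZY^{z=1}}{e(X)}\right]
&= E\left\{ E\left[ \frac{I(Z = 1)\,Y^{z=1}}{e(X)} \,\Big|\, Y^{z=1}, X \right] \right\} \\
&= E\left\{ \frac{Y^{z=1}}{e(X)}\, E\left[ I(Z = 1) \mid Y^{z=1}, X \right] \right\} \\
&= E\left\{ \frac{Y^{z=1}}{e(X)}\, e(X) \right\} \\
&= E\left[ Y^{z=1} \right].
\end{aligned}$$

Similarly

$$E\left[\frac{(1 - Z)Y}{1 - e(X)}\right] = E\left[Y^{z=0}\right],$$

which leads to an estimate of ATE given by

$$\hat{\Delta}_{ipw} = \frac{1}{n} \sum_{i=1}^{n} \frac{Z_i Y_i}{\hat{e}_i} - \frac{1}{n} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i}{1 - \hat{e}_i}. \tag{1.22}$$

If we are instead interested in ATT, we can estimate it by

$$\hat{\Delta}^{t}_{ipw} = \frac{1}{n_1} \sum_{i=1}^{n} Z_i Y_i - \frac{1}{n_1} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i \hat{e}_i}{1 - \hat{e}_i}. \tag{1.23}$$

Assuming (1.2) these weights give an unbiased estimator of ATT because

$$E[ZY] = E[ZY \mid Z = 1]P(Z = 1) + E[ZY \mid Z = 0]P(Z = 0) = E[Y^{z=1} \mid Z = 1]\left(\frac{n_1}{n}\right),$$

and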
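$$\begin{aligned}
E\left[\frac{(1 - Z)Y e(X)}{1 - e(X)}\right]
&= E\left\{ \frac{Y^{z=0} e(X)}{1 - e(X)}\, E\left[ I(Z = 0) \mid X, Y^{z=0} \right] \right\} \\
&= E\left\{ Y^{z=0} e(X) \right\} \\
&= E\left\{ E[Y^{z=0} \mid X, Z = 1]\, P(Z = 1 \mid X) \right\} \\
&= \int E[Y^{z=0} \mid X = x, Z = 1]\, P(Z = 1 \mid X = x)\, P(X = x)\, dx \\
&= \int \left\{ \int y\, P(Y^{z=0} = y \mid X = x, Z = 1)\, dy \right\} P(Z = 1 \mid X = x)\, P(X = x)\, dx \\
&= \int \left\{ \int y\, P(Y^{z=0} = y \mid X = x, Z = 1)\, dy \right\} \frac{P(Z = 1)}{P(Z = 1)}\, P(Z = 1, X = x)\, dx \\
&= P(Z = 1) \int \left\{ \int y\, P(Y^{z=0} = y \mid X = x, Z = 1)\, dy \right\} P(X = x \mid Z = 1)\, dx \\
&= P(Z = 1) \int\!\!\int y\, P(Y^{z=0} = y, X = x \mid Z = 1)\, dy\, dx \\
&= P(Z = 1) \int E[Y^{z=0} \mid X = x, Z = 1]\, P(X = x \mid Z = 1)\, dx \\
&= \frac{n_1}{n}\, E[Y^{z=0} \mid Z = 1].
\end{aligned}$$

A second version of the IPW estimator of ATE is given by

$$\hat{\Delta}_{ipw2} = \left( \sum_{i=1}^{n} \frac{Z_i}{\hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{Z_i Y_i}{\hat{e}_i} - \left( \sum_{i=1}^{n} \frac{1 - Z_i}{1 - \hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i}{1 - \hat{e}_i}, \tag{1.24}$$

which divides the weighted sums of treated and control outcomes by the sum of their respective weights rather than by $n$. This estimator is motivated by the fact that $E[Z/e(X)] = E[(1 - Z)/(1 - e(X))] = 1$, and it performs better than (1.22) in practice [27]. The corresponding estimator of ATT is given by

$$\hat{\Delta}^{t}_{ipw2} = (n_1)^{-1} \sum_{i=1}^{n} Z_i Y_i - \left( \sum_{i=1}^{n} \frac{(1 - Z_i)\hat{e}_i}{1 - \hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i \hat{e}_i}{1 - \hat{e}_i}. \tag{1.25}$$

The estimators (1.22) and (1.24) fall in a more general class of estimators of ATE defined by the solutions to the equations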
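$$\begin{aligned}
\sum_{i=1}^{n} \left\{ \frac{Z_i (Y_i - \mu_1)}{e_i} + \eta_1 \left( \frac{Z_i - e_i}{e_i} \right) \right\} &= 0, \text{ and} \\
\sum_{i=1}^{n} \left\{ \frac{(1 - Z_i)(Y_i - \mu_0)}{1 - e_i} - \eta_0 \left( \frac{Z_i - e_i}{1 - e_i} \right) \right\} &= 0.
\end{aligned} \tag{1.26}$$

Assuming the propensity score is known and setting $(\eta_0, \eta_1) = (\mu_0, \mu_1)$ gives $\hat{\Delta}_{ipw}$, while $(\eta_0, \eta_1) = (0, 0)$ yields $\hat{\Delta}_{ipw2}$. While $\hat{\Delta}_{ipw}$ is an unbiased estimator of ATE, in general this class of estimators is not unbiased; however, they are consistent when the propensity score is correctly specified [33]. This can be seen from a combination of Slutsky's theorem and the law of large numbers. Finding the values $(\eta_1, \eta_0)$ that minimize the large sample variance for members of this class leads to $\hat{\Delta}_{ipw3}$ [27], which is defined as

$$\hat{\Delta}_{ipw3} = \left\{ \sum_{i=1}^{n} \frac{Z_i}{\hat{e}_i} \left( 1 - \frac{C_1}{\hat{e}_i} \right) \right\}^{-1} \sum_{i=1}^{n} \frac{Z_i Y_i}{\hat{e}_i} \left( 1 - \frac{C_1}{\hat{e}_i} \right) - \left\{ \sum_{i=1}^{n} \frac{1 - Z_i}{1 - \hat{e}_i} \left( 1 - \frac{C_0}{1 - \hat{e}_i} \right) \right\}^{-1} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i}{1 - \hat{e}_i} \left( 1 - \frac{C_0}{1 - \hat{e}_i} \right), \tag{1.27}$$

where

$$C_1 = \frac{\sum_{i=1}^{n} \{ (Z_i - \hat{e}_i)/\hat{e}_i \}}{\sum_{i=1}^{n} \{ (Z_i - \hat{e}_i)/\hat{e}_i \}^2}, \quad \text{and} \quad C_0 = -\frac{\sum_{i=1}^{n} \{ (Z_i - \hat{e}_i)/(1 - \hat{e}_i) \}}{\sum_{i=1}^{n} \{ (Z_i - \hat{e}_i)/(1 - \hat{e}_i) \}^2}.$$

The terms $\left(1 - C_1/\hat{e}_i\right)$ and $\left(1 - C_0/(1 - \hat{e}_i)\right)$ reduce or increase the weight of each observation proportionately by how $Z_i/\hat{e}_i$ deviates from its expectation of one. This results in numerical stability, especially with smaller sample sizes. The more general class of estimators that estimate ATT can be written as solutions to the equations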
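$$\begin{aligned}
\sum_{i=1}^{n} \left\{ Z_i (Y_i - \mu_1) \right\} &= 0, \text{ and} \\
\sum_{i=1}^{n} \left\{ \frac{(1 - Z_i)(Y_i - \mu_0) e_i}{1 - e_i} - \eta_0 e_i \left( \frac{Z_i - e_i}{1 - e_i} \right) \right\} &= 0,
\end{aligned} \tag{1.28}$$

with $\eta_0 = \mu_0$ implying (1.23) and $\eta_0 = 0$ implying (1.25). Minimizing the large sample variance with respect to $\eta_0$ yields the following estimator:

$$\hat{\eta}_0 = -E\left[ \frac{(1 - Z)(Y - \mu_0) e^2}{(1 - e)^2} \right] \Big/ E\left[ \left( \frac{Z - e}{1 - e} \right)^2 e^2 \right].$$

This leads to the estimator

$$\hat{\Delta}^{t}_{ipw3} = (n_1)^{-1} \sum_{i=1}^{n} Z_i Y_i - \left\{ \sum_{i=1}^{n} \frac{(1 - Z_i)\hat{e}_i}{1 - \hat{e}_i} \left( 1 - \frac{C_0 \hat{e}_i}{1 - \hat{e}_i} \right) \right\}^{-1} \sum_{i=1}^{n} \frac{(1 - Z_i)\hat{e}_i Y_i}{1 - \hat{e}_i} \left( 1 - \frac{C_0 \hat{e}_i}{1 - \hat{e}_i} \right), \tag{1.29}$$

where

$$C_0 = -\frac{\sum_{i=1}^{n} \{ (Z_i - \hat{e}_i)\hat{e}_i/(1 - \hat{e}_i) \}}{\sum_{i=1}^{n} \{ (Z_i - \hat{e}_i)\hat{e}_i/(1 - \hat{e}_i) \}^2}.$$

In general, inverse probability weights can be thought of similarly to survey sampling weights. When estimating ATE, each observed treated unit receives weight $1/\hat{e}_i$, so it counts for itself and $1/\hat{e}_i - 1$ other units not assigned to the treatment group, and each control unit receives weight $1/(1 - \hat{e}_i)$, so it counts for itself and $1/(1 - \hat{e}_i) - 1$ other units not assigned to the control group. Likewise, when estimating ATT, each observed treated unit counts only for itself, and each control unit counts for $\hat{e}_i/(1 - \hat{e}_i)$ units. Inverse probability weighting seeks to create a pseudo-population in which the distribution of $X$ is balanced between treatment and control units. When we are trying to estimate the average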
treatment effect, this pseudo-population's distribution of $X$ should reflect the distribution of $X$ in the whole sample. When estimating the average treatment effect in the treated, the pseudo-population's distribution of $X$ should reflect the distribution of $X$ in the treated group.

Other weighting estimators are possible, many of which attempt to model the outcome parametrically. Robins et al. [33] propose a doubly robust estimator which utilizes a regression model of the outcome in much the same way as $\hat{\Delta}_{sr}$. The doubly robust estimator of ATE is defined as

$$\hat{\Delta}_{dr} = \frac{1}{n} \sum_{i=1}^{n} \frac{Z_i Y_i - (Z_i - \hat{e}_i)\, m_1(X_i, \hat{\alpha}_1)}{\hat{e}_i} - \frac{1}{n} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i + (Z_i - \hat{e}_i)\, m_0(X_i, \hat{\alpha}_0)}{1 - \hat{e}_i}, \tag{1.30}$$

where $m_1(X, \hat{\alpha}_1)$ and $m_0(X, \hat{\alpha}_0)$ are the predicted outcomes for the treatment and control groups specified by regression models. When both the propensity score model and the outcome model are correctly specified, $\hat{\Delta}_{dr}$ will have the smallest large sample variance of any weighting-type estimator [33]. It is also doubly robust in the sense that it is consistent if either the propensity score model or the outcome model is correctly specified.

1.5 Variance Estimators for Matching, Stratification, and IPW Estimators

1.5.1 Matching Estimators

Opinions on how best to estimate the standard error of matching estimators vary in the literature, with the main argument being whether we should treat our matches as individually matched sets or as overall matched groups [42] [5]. With the exclusion
of the non-parametric matching estimators, all the matching estimators discussed earlier can be motivated by either of these two philosophies. When the matches are treated as fixed, the simplest variance estimation strategy is to estimate the variance of the matched set differences using the typical variance estimator, though matching with replacement creates problems with this approach. Additionally, this approach does not take into account the variance due to the estimation of the propensity score or the matching process.

The motivation for treating the matches as matched groups instead of individuals is that matching on the propensity score does not guarantee that matched units have the same covariate values. Instead, the covariate distributions should be approximately balanced overall. The matching process is a way to make comparable groups, and the actual matches are irrelevant. Instead of comparing group means of the matched sample groups, some authors recommend making further adjustments for confounders, such as using a regression model to estimate the treatment effect [5]. This of course comes with the risk of model misspecification, but it also provides a way of estimating standard errors and a way to accommodate effect modifiers.

Lechner (2001) proposes the variance estimator for matching with replacement

$$Var(\hat{\Delta}_{ATT}) = \frac{1}{N_1} Var(Y^{z=1} \mid Z = 1) + \frac{\sum_{j \in I_0} (w_j)^2}{(N_1)^2} Var(Y^{z=0} \mid Z = 0), \tag{1.31}$$

where $N_1$ is the number of matched treated individuals and $w_j$ is the number of times individual $j$ from the control group has been matched [23]. Abadie and Imbens (2006) propose an estimator of the variance of their estimator $\hat{\Delta}_{matchrep}$ (1.5), which is written as
$$Var(\hat{\Delta}_{matchrep}) = \frac{1}{n^2} \sum_{i=1}^{n} \left( \hat{Y}_i^{z=1} - \hat{Y}_i^{z=0} - \hat{\Delta}_{matchrep} \right)^2 + \frac{1}{n^2} \sum_{i=1}^{n} \left[ \left( \frac{K_M(i)}{M} \right)^2 + \left( \frac{2M - 1}{M} \right) \left( \frac{K_M(i)}{M} \right) \right] \hat{\sigma}^2(X_i, Z_i), \tag{1.32}$$

where $M$ is the number of matches per unit, $K_M(i)$ is the number of times unit $i$ is used as a match, and $\hat{\sigma}^2(X_i, Z_i)$ is the estimated conditional variance given by

$$\hat{\sigma}^2(X_i, Z_i) = \frac{J}{J + 1} \left( Y_i - \frac{1}{J} \sum_{m=1}^{J} Y_{l_m(i)} \right)^2,$$

where $J$ is a fixed number of similar units and $l_m(i)$ is the $m$th closest unit to unit $i$ among units with the same $Z$ value [1].

A final strategy for estimating the variance of matching estimators is bootstrapping, which seeks to estimate the sampling distribution of the matching estimator. For each bootstrap sample, the propensity score is estimated, matches are made, and an estimate is created. Though time intensive, bootstrapping seems appealing as a way to include the variance of the propensity score estimation and matching procedures. However, Abadie and Imbens show that bootstrapping does not in general lead to valid confidence intervals for nearest-neighbor matching with a fixed number of matches [2].

1.5.2 Stratification Estimators

Typically in practice, the variance of a stratification estimator is estimated by treating the within-propensity-score stratum effect estimates as independent estimates of the treatment effect, and then pooling the variance across strata [27]. Within a propensity score stratum, the treatment and control estimators are assumed to be
independent. The total variance is the sum of the variances of the two estimators. For example, take the stratified estimator $\hat{\Delta}_{s}$ from (1.19). A variance estimate is

$$\widehat{var}(\hat{\Delta}_{s}) = \sum_{k=1}^{K} \left( \frac{n_k}{n} \right)^2 \hat{\sigma}^2_k, \tag{1.33}$$

where

$$\hat{\sigma}^2_k = n_{1k}^{-1} s^2_{1k} + n_{0k}^{-1} s^2_{0k},$$

$$s^2_{1k} = n_{1k}^{-1} \sum_{i=1}^{n} I(\hat{e}_i \in Q_k)\, z_i (y_i - \bar{y}_{1k})^2 \quad \text{and} \quad s^2_{0k} = n_{0k}^{-1} \sum_{i=1}^{n} I(\hat{e}_i \in Q_k)\, (1 - z_i)(y_i - \bar{y}_{0k})^2,$$

and

$$\bar{y}_{1k} = n_{1k}^{-1} \sum_{i=1}^{n} I(\hat{e}_i \in Q_k)\, z_i y_i \quad \text{and} \quad \bar{y}_{0k} = n_{0k}^{-1} \sum_{i=1}^{n} I(\hat{e}_i \in Q_k)\, (1 - z_i) y_i.$$

If we are interested in estimating ATT instead, simply use

$$\widehat{var}(\hat{\Delta}^{t}_{s}) = \sum_{k=1}^{K} \left( \frac{n_{1k}}{n_1} \right)^2 \hat{\sigma}^2_k. \tag{1.34}$$

For the model-assisted estimator $\hat{\Delta}_{sr}$, a variance estimator is

$$\widehat{var}(\hat{\Delta}_{sr}) = \sum_{k=1}^{K} \left( \frac{n_k}{n} \right)^2 \hat{\sigma}^2_k, \tag{1.35}$$

but now $\hat{\sigma}^2_k$ is the estimate of the variance of the regression estimator in the $k$th stratum. These variance estimators have been found to do well in many situations, though they are possibly slightly conservative in certain scenarios [27] [45].

Williamson et al. [45] investigate the impact of including the propensity score estimates in the variance estimator by treating the propensity score parameters, strata boundaries, fraction of the population in each stratum, and the stratified treatment effect as solutions to a set of estimating equations. They derive the large sample marginal variance of the stratified estimator through M-estimation theory, although they conclude that in
real-life situations this derivation is too complex and bootstrapping should be used instead.

1.5.3 IPW Estimators

The variance of IPW estimators can be estimated by viewing the estimates as solutions to a set of estimating equations and using the resulting empirical sandwich estimator of variance. Hernan and Robins [3] recommend fitting a generalized estimating equations model with an independent working covariance matrix to estimate the variance. Define $W$ as the $n \times n$ weight matrix where $W_{ij} = 0$ for $i \neq j$, and

$$W_{ii} = w_i = \left( z_i w_1^{-1} + (1 - z_i) w_0^{-1} \right) \left( z_i (\hat{e}_i)^{-1} + (1 - z_i)(1 - \hat{e}_i)^{-1} \right),$$

where $w_1 = \sum_{i=1}^{n} z_i/\hat{e}_i$ and $w_0 = \sum_{i=1}^{n} (1 - z_i)/(1 - \hat{e}_i)$. Also define $A = (\mathbf{1}, z)$, an $n \times 2$ matrix, and $\eta = (\mu_0, \Delta)'$, where $z$ is the vector of observed treatment exposures. Then the IPW estimator $\hat{\Delta}_{ipw2}$ (1.24) can be computed using the weighted least-squares estimator

$$\hat{\eta} = (\hat{\mu}_0, \hat{\Delta})' = (A'WA)^{-1} A'W y,$$

where $y$ is the vector of observed outcomes. The GEE empirical sandwich estimator for the variance of $\hat{\eta}$ is given by

$$Var(\hat{\eta}) = \left\{ (A'WA)^{-1} (A'W \hat{V} W A) (A'WA)^{-1} \right\},$$

where $\hat{V}$ is an empirical estimate of the covariance of $y$. The GEE model implies that we estimate $\hat{V}$ by the $n \times n$ diagonal matrix of squared residuals, $R$, where $R_{ii} = r_i^2 = (y_i - A_i \hat{\eta})^2$.
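The weighted least-squares computation and the sandwich formula above can be sketched in a few lines of Python. This is a minimal sketch, not the author's implementation: the function name `ipw2_gee`, the simulated data, and the use of the true propensity score in place of an estimated one are assumptions made only for illustration. As in the GEE approach described above, the propensity scores are treated as fixed.

```python
import numpy as np


def ipw2_gee(y, z, e_hat):
    """IPW2 estimate and GEE sandwich covariance via weighted least squares.

    y, z, e_hat are 1-d arrays of outcomes, 0/1 treatment indicators, and
    propensity scores (treated as fixed, as in the GEE approach).
    """
    w1 = np.sum(z / e_hat)                          # sum of treated weights
    w0 = np.sum((1 - z) / (1 - e_hat))              # sum of control weights
    w = z / (w1 * e_hat) + (1 - z) / (w0 * (1 - e_hat))  # diagonal of W

    A = np.column_stack([np.ones_like(y), z])       # n x 2 design matrix (1, z)
    AtWA = A.T @ (w[:, None] * A)
    eta_hat = np.linalg.solve(AtWA, A.T @ (w * y))  # (mu0_hat, delta_hat)

    r = y - A @ eta_hat                             # residuals y_i - A_i eta_hat
    meat = A.T @ ((w**2 * r**2)[:, None] * A)       # A' W R W A, R = diag(r_i^2)
    bread = np.linalg.inv(AtWA)
    cov = bread @ meat @ bread                      # sandwich covariance of eta_hat
    return eta_hat, cov, w, r


# Hypothetical data: one confounder, known logistic propensity score.
rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-0.5 * x))
z = rng.binomial(1, e)
y = 1.0 + 2.0 * z + x + rng.normal(size=n)

eta_hat, cov, w, r = ipw2_gee(y, z, e)

# Direct evaluation of the ratio form of the IPW2 estimator.
delta_direct = (np.sum(z * y / e) / np.sum(z / e)
                - np.sum((1 - z) * y / (1 - e)) / np.sum((1 - z) / (1 - e)))

print(eta_hat[1], delta_direct)          # WLS coefficient vs. ratio form
print(cov[1, 1], np.sum(r**2 * w**2))    # sandwich entry vs. sum of r_i^2 w_i^2
```

Because the weights are normalized to sum to one within each treatment group, $A'WA$ has a simple fixed form, and the entry of the sandwich matrix corresponding to $\hat{\Delta}$ collapses to $\sum_i r_i^2 w_i^2$; the two printed pairs should therefore agree up to floating-point error.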

The GEE model variance estimate for $\hat{\Delta}_{ipw2}$ reduces to

$$Var(\hat{\Delta}_{ipw2}) = \sum_{i=1}^{n} r_i^2 w_i^2. \tag{1.36}$$

While this estimator of variance is routinely used, it is often far too conservative [45]. Lunceford and Davidian [27] include the propensity score model in the estimating equations, which makes the estimation more exact.

The theory of M-estimation implies that for IPW estimators $\hat{\Delta}$, $n^{1/2}(\hat{\Delta} - \Delta_0)$ converges in distribution to a $N(0, \Sigma)$ variable, where $\Delta_0$ denotes the true value. M-estimators are solutions to the vector equation $\sum_{i=1}^{n} \psi(Y_i, \theta) = 0$, where $Y_1, \ldots, Y_n$ are independent (but not necessarily identically distributed) variables, $\theta$ is a $p$-dimensional parameter, and $\psi$ is a known $(p \times 1)$-function that does not depend on $i$ or $n$ [4]. Under suitable regularity conditions $\hat{\theta}$ is asymptotically multivariate normal as $n \to \infty$, with mean $\theta_0$, the true value of $\theta$, and variance $(1/n)V(\theta_0)$, where

$$V(\theta_0) = A(\theta_0)^{-1} B(\theta_0) \left\{ A(\theta_0)^{-1} \right\}',$$

$$A(\theta_0) = E\left[ -\frac{\partial}{\partial \theta'} \psi(Y, \theta) \Big|_{\theta = \theta_0} \right], \quad \text{and} \quad B(\theta_0) = E\left[ \psi(Y, \theta_0)\, \psi(Y, \theta_0)' \right].$$

If we assume for the moment that the true propensity scores are known, or equivalently that the coefficients of the true propensity score model are known, then, for example, for $\hat{\Delta}_{ipw}$,

$$\psi(Y_i, \hat{\theta}) = \frac{Z_i Y_i}{e_i} - \frac{(1 - Z_i) Y_i}{1 - e_i} - \hat{\Delta}.$$

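It then follows that $A(\theta_0) = 1$, and, because $Z(1 - Z) = 0$ removes the cross term between the treated and control pieces,

$$\begin{aligned}
B(\theta_0) &= E\left[ \left( \frac{ZY}{e} - \frac{(1 - Z)Y}{1 - e} - \Delta_0 \right)^2 \right] \\
&= E\left[ \left( \frac{ZY^{z=1}}{e} \right)^2 + \left( \frac{(1 - Z)Y^{z=0}}{1 - e} \right)^2 \right] + \Delta_0^2 - 2\Delta_0 E\left[ \frac{ZY^{z=1}}{e} - \frac{(1 - Z)Y^{z=0}}{1 - e} \right] \\
&= E\left[ \frac{Z (Y^{z=1})^2}{e^2} + \frac{(1 - Z)(Y^{z=0})^2}{(1 - e)^2} \right] + \Delta_0^2 - 2\Delta_0^2 \\
&= E\left[ \frac{(Y^{z=1})^2}{e} + \frac{(Y^{z=0})^2}{1 - e} \right] - \Delta_0^2.
\end{aligned}$$

This implies that the large sample variance of $\hat{\Delta}_{ipw}$ is

$$\Sigma_{ipw} = E\left[ \frac{(Y^{z=1})^2}{e} + \frac{(Y^{z=0})^2}{1 - e} \right] - \Delta_0^2.$$

Similarly, the large-sample variances for $\hat{\Delta}_{ipw2}$ and $\hat{\Delta}_{ipw3}$ are

$$\Sigma_{ipw2} = E\left[ \frac{(Y^{z=1} - \mu_1)^2}{e} + \frac{(Y^{z=0} - \mu_0)^2}{1 - e} \right]$$

and

$$\Sigma_{ipw3} = E\left[ \frac{(Y^{z=1} - \mu_1)^2}{e} + \frac{(Y^{z=0} - \mu_0)^2}{1 - e} \right] + \eta_1 E\left( \frac{Y^{z=1} - \mu_1}{e} \right) + \eta_0 E\left( \frac{Y^{z=0} - \mu_0}{1 - e} \right) + 2\eta_1 \eta_0,$$

where $\mu_1 = E[Y^{z=1}]$, $\mu_0 = E[Y^{z=0}]$, $\eta_1 = -E\{Z(Y - \mu_1)/e^2\}/E\{(Z - e)^2/e^2\}$, and $\eta_0 = -E\{(1 - Z)(Y - \mu_0)/(1 - e)^2\}/E\{(Z - e)^2/(1 - e)^2\}$.

Now, instead assume that we estimate the propensity scores using the logistic regression model $e(X, \beta) = \{1 + \exp(-X'\beta)\}^{-1}$ and estimate $\beta$ by maximum likelihood by solving

$$\sum_{i=1}^{n} \psi_{\beta}(Z_i, X_i, \beta) = \sum_{i=1}^{n} \frac{Z_i - e(X_i, \beta)}{e(X_i, \beta)\,(1 - e(X_i, \beta))} \frac{\partial}{\partial \beta} e(X_i, \beta) = 0. \tag{1.37}$$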
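For the logistic model, $\partial e(X_i, \beta)/\partial \beta = e(X_i, \beta)\{1 - e(X_i, \beta)\} X_i$, so the score equation (1.37) simplifies to $\sum_i \{Z_i - e(X_i, \beta)\} X_i = 0$. The following Python sketch solves this equation by Newton-Raphson. The function name `fit_logistic_ps`, the simulated data, and the coefficient values are hypothetical choices for illustration, and the sketch is unweighted; it does not incorporate the sampling weights discussed elsewhere in the dissertation.

```python
import numpy as np


def fit_logistic_ps(X, z, n_iter=25, tol=1e-10):
    """Solve sum_i (z_i - e_i) x_i = 0 for beta by Newton-Raphson.

    X is an n x p design matrix (include a column of ones for an intercept)
    and z is the 0/1 treatment indicator. Returns (beta_hat, e_hat).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        e = 1.0 / (1.0 + np.exp(-X @ beta))          # current propensity scores
        score = X.T @ (z - e)                        # left-hand side of the score equation
        info = X.T @ ((e * (1 - e))[:, None] * X)    # minus the derivative of the score
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    e_hat = 1.0 / (1.0 + np.exp(-X @ beta))
    return beta, e_hat


# Hypothetical data: intercept plus one covariate.
rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.0])
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta_hat, e_hat = fit_logistic_ps(X, z)
print("beta_hat:", beta_hat)
print("score at beta_hat:", X.T @ (z - e_hat))       # should be numerically zero
print("sum of e_hat vs. number treated:", e_hat.sum(), z.sum())
```

Because the design matrix contains an intercept column, the first component of the score equation forces $\sum_i \hat{e}_i = \sum_i Z_i = n_1$ at the solution, which the last printed line checks.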
