Propensity Score Weighting with Multilevel Data


Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum

Introduction In comparative effectiveness studies, the goal is usually to estimate the causal (treatment) effect. In noncausal descriptive studies, a common goal is to conduct an unconfounded comparison of two populations. Population-based observational studies are an increasingly important source for both types of studies. Proper adjustment for differences between treatment groups is crucial to both descriptive and causal analyses. Regression has long been the standard method. The propensity score (Rosenbaum and Rubin, 1983) offers a robust alternative to regression adjustment, applicable to both causal and descriptive studies.

Multilevel data Propensity score methods have been developed and applied mainly in cross-sectional settings. Data in medical care and health policy research are often multilevel: subjects are grouped in natural clusters, e.g., geographical areas, hospitals, health service providers. There are significant within- and between-cluster variations. Ignoring the cluster structure often leads to invalid inference: inaccurate standard errors, and cluster-level effects confounded with individual-level effects. Hierarchical regression models provide a unified framework to study multilevel data.

Propensity score methods for multilevel data Propensity score methods for multilevel data have been less explored. Arpino and Mealli (2011) investigated propensity score matching in multilevel settings. We focus on propensity score weighting strategies: 1. Investigate the performance of different modeling and weighting strategies in the presence of model misspecification due to cluster-level confounders. 2. Clarify the differences and connections between causal and unconfounded descriptive comparisons.

Notations Setting: (1) two-level structure, (2) treatment assigned at the individual level. Clusters: h = 1, ..., H. Individuals within cluster h: k = 1, ..., n_h; total sample size: n = Σ_h n_h. "Treatment" (individual-level): Z_hk, 0 for control, 1 for treatment. Covariates: X_hk = (U_hk, V_h), with U_hk individual-level and V_h cluster-level. Outcome: Y_hk.

Controlled descriptive comparisons "Assignment": a nonmanipulable state defining membership in one of two groups. Objective: an unconfounded comparison of the observed outcomes between the groups. Estimand: the average controlled difference (ACD) - the difference in the means of Y in two groups with balanced covariate distributions: π_ACD = E_X[E(Y | X, Z = 1) − E(Y | X, Z = 0)]. Examples: comparing outcomes among populations of different races, or of patients treated in two different years.

Causal comparisons Assignment: a potentially manipulable intervention. Objective: a causal effect - a comparison of the potential outcomes under treatment versus control in a common set of units. Assuming SUTVA, each unit has two potential outcomes: Y_hk(0), Y_hk(1). Estimand: the average treatment effect (ATE): π_ATE = E[Y(1) − Y(0)]. Examples: evaluating the treatment effect of a drug, therapy or policy for a given population. Under the assumption of unconfoundedness, (Y(0), Y(1)) ⊥ Z | X, we have π_ATE = π_ACD.

Propensity score Propensity score: e(X) = Pr(Z = 1 | X). Balancing property - conditioning on the propensity score also balances the covariates across groups: X ⊥ Z | e(X). Using the propensity score is a two-step procedure: Step 1: estimate the propensity score, e.g., by logistic regression. Step 2: estimate the treatment effect by incorporating (matching, weighting, stratification, etc.) the estimated propensity score. The propensity score is a less parametric alternative to regression adjustment.

Propensity score weighting Foundation of propensity score weighting: E[ZY/e(X) − (1 − Z)Y/(1 − e(X))] = π_ACD = π_ATE = π. Inverse-probability (Horvitz-Thompson) weight: w_hk = 1/e(X_hk) for Z_hk = 1, and w_hk = 1/(1 − e(X_hk)) for Z_hk = 0. HT weighting balances (in expectation) the weighted distribution of covariates in the two groups.
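The weighting identity above can be checked numerically. The sketch below is a hypothetical single-level illustration (not data from the talk): one confounder shifts both the treatment probability and the outcome, and the normalized weighted difference in means recovers the effect while the naive comparison does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-level illustration (not from the talk):
# one confounder X shifts both the treatment probability and the outcome.
n = 50_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-0.5 * X))               # true propensity score e(X)
Z = rng.binomial(1, e)
Y = 2.0 * Z + 1.5 * X + rng.normal(size=n)   # true effect pi = 2

# Horvitz-Thompson weights: 1/e(X) for treated, 1/(1-e(X)) for controls.
w = np.where(Z == 1, 1 / e, 1 / (1 - e))

# Normalized weighted difference in means targets pi_ATE = pi_ACD;
# the raw difference in means remains confounded by X.
pi_hat = (np.sum(w * Z * Y) / np.sum(w * Z)
          - np.sum(w * (1 - Z) * Y) / np.sum(w * (1 - Z)))
naive = Y[Z == 1].mean() - Y[Z == 0].mean()
print(pi_hat, naive)
```

This uses the normalized (Hajek) form of the weighted means, which is the version applied cluster-by-cluster later in the talk.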

Step 1: Propensity score models Marginal model - ignoring the multilevel structure: logit(e_hk) = δ_0 + X_hk α. Fixed effects model - adding a cluster-specific main effect δ_h: logit(e_hk) = δ_h + U_hk α. Key: cluster membership is treated as a nominal covariate; δ_h absorbs the effects of both observed and unobserved cluster-level covariates V_h. This estimates a balancing score without knowledge of V_h, but might lead to larger variance than the propensity score estimated under a correct model with fully observed V_h. The Neyman-Scott problem: inconsistent estimates of δ_h and α given a large number of small clusters.
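The contrast between the two models can be sketched on simulated data (all names and parameter values below are illustrative, not from the talk); a hand-rolled Newton-Raphson logistic fit shows how cluster indicators absorb an unobserved δ_h that the marginal model omits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-level data: an unobserved cluster effect delta_h
# enters the true assignment model (parameter values are illustrative).
H, n_h = 50, 200
delta = rng.normal(size=H)
cluster = np.repeat(np.arange(H), n_h)
U = rng.normal(size=H * n_h)                  # individual-level covariate
Z = rng.binomial(1, 1 / (1 + np.exp(-(delta[cluster] + 0.5 * U))))

def fit_logit(Xmat, z, iters=25):
    """Plain Newton-Raphson logistic regression (no penalty)."""
    beta = np.zeros(Xmat.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xmat @ beta))
        grad = Xmat.T @ (z - p)                          # score
        hess = (Xmat * (p * (1 - p))[:, None]).T @ Xmat  # information
        beta += np.linalg.solve(hess, grad)
    return beta

# Marginal model: intercept + U only, ignoring the cluster structure.
b_marg = fit_logit(np.column_stack([np.ones(H * n_h), U]), Z)

# Fixed effects model: one indicator per cluster absorbs delta_h.
b_fe = fit_logit(np.column_stack([np.eye(H)[cluster], U]), Z)

# The marginal slope on U is attenuated by the omitted delta_h;
# the fixed effects slope is typically much closer to the true 0.5.
print(b_marg[1], b_fe[-1])
```

With many small clusters the fixed effects fit would instead run into the Neyman-Scott problem noted above; here the clusters are deliberately large.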

Step 1: Propensity score models Random effects model: logit(e_hk) = δ_h + X_hk α, with δ_h ~ N(δ_0, σ²_δ). Due to the shrinkage of the random effects, V_h needs to be included. Pros: borrows information across clusters, works better with many small clusters. Cons: produces biased estimates if the δ_h are correlated with the covariates; does not guarantee balance within each cluster. Random slopes can be incorporated.

Step 2: Estimators for the ACD or the ATE Two types of propensity-score-weighted estimators: 1. Nonparametric: applying the weights directly to the observed outcomes. 2. Parametric: applying the weights to the fitted outcomes from a parametric model (doubly-robust estimators).

Nonparametric Estimators Marginal estimator - ignore clustering: π̂_ma = Σ_{Z_hk=1} Y_hk w_hk / w_1 − Σ_{Z_hk=0} Y_hk w_hk / w_0, where w_hk is the HT weight using the estimated propensity score, and w_z = Σ_{h,k: Z_hk=z} w_hk for z = 0, 1. Cluster-weighted estimator: (1) obtain cluster-specific ATEs; (2) average over clusters: π̂_cl = Σ_h w_h π̂_h / Σ_h w_h, where π̂_h = Σ_{k: Z_hk=1} Y_hk w_hk / w_h1 − Σ_{k: Z_hk=0} Y_hk w_hk / w_h0, and w_hz = Σ_{k: Z_hk=z} w_hk for z = 0, 1.
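A minimal sketch of the cluster-weighted estimator π̂_cl (the helper name is hypothetical; the cluster weight w_h is taken here as the total HT weight in the cluster, one common choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def cluster_weighted_ate(Y, Z, w, cluster):
    """Cluster-weighted nonparametric estimator pi_hat_cl: a weighted
    difference in means within each cluster, then a weighted average
    across clusters (cluster weight w_h = total HT weight in cluster)."""
    pis, whs = [], []
    for h in np.unique(cluster):
        m = cluster == h
        t, c = m & (Z == 1), m & (Z == 0)
        if not t.any() or not c.any():
            continue  # skip clusters with no treated or no control units
        pi_h = (np.sum(w[t] * Y[t]) / np.sum(w[t])
                - np.sum(w[c] * Y[c]) / np.sum(w[c]))
        pis.append(pi_h)
        whs.append(w[m].sum())
    return np.average(pis, weights=whs)

# Toy check: cluster intercepts eta_h, constant effect 2, uniform weights.
H, n_h = 40, 50
cluster = np.repeat(np.arange(H), n_h)
eta = rng.normal(size=H)
Z = rng.binomial(1, 0.5, size=H * n_h)
Y = eta[cluster] + 2.0 * Z + rng.normal(size=H * n_h)
est = cluster_weighted_ate(Y, Z, np.ones(H * n_h), cluster)
print(est)
```

Because each π̂_h compares treated and controls within the same cluster, the η_h cancel even though they are never modeled.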

Parametric doubly-robust (DR) estimators DR estimator (Robins and colleagues): augment the weighted observed outcomes with fitted outcomes from a parametric model: π̂_dr = Σ_{h,k} Î_hk / n, where Î_hk = [Z_hk Y_hk / ê_hk − (Z_hk − ê_hk) Ŷ¹_hk / ê_hk] − [(1 − Z_hk) Y_hk / (1 − ê_hk) + (Z_hk − ê_hk) Ŷ⁰_hk / (1 − ê_hk)], and Ŷ^z_hk is the fitted outcome from a parametric model in group z. Double-robustness: in large samples, π̂_dr is consistent if either the p.s. model or the outcome model is correct, not necessarily both. SEs for these estimators can be obtained via the delta method.
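The DR construction above can be written in a few lines; this sketch (hypothetical names, simulated data) pairs the true propensity score with a deliberately misspecified outcome model to illustrate the double robustness:

```python
import numpy as np

rng = np.random.default_rng(3)

def dr_estimate(Y, Z, e_hat, mu1_hat, mu0_hat):
    """Doubly-robust (AIPW) estimate of pi: IPW terms augmented by
    fitted outcomes mu1_hat, mu0_hat from an outcome model."""
    I = (Z * Y / e_hat - (Z - e_hat) * mu1_hat / e_hat) \
        - ((1 - Z) * Y / (1 - e_hat) + (Z - e_hat) * mu0_hat / (1 - e_hat))
    return I.mean()

# Confounded data: X drives both treatment and outcome; true effect = 2.
n = 50_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
Z = rng.binomial(1, e)
Y = 2.0 * Z + X + rng.normal(size=n)

# Correct p.s. paired with a badly wrong outcome model (constant zero
# fits); consistency survives because one of the two models is right.
pi_dr = dr_estimate(Y, Z, e, np.zeros(n), np.zeros(n))
print(pi_dr)
```

With both fitted-outcome arguments set to zero the estimator reduces to plain HT weighting; supplying good outcome fits instead typically shrinks its variance.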

DR estimators: outcome model Marginal model - ignore clustering: Y_hk = η_0 + Z_hk γ + X_hk β + ε_hk. Fixed effects model - adjust for cluster-level main effects η_h and covariates: Y_hk = η_h + Z_hk γ + U_hk β + ε_hk. Random effects model - assume cluster-specific effects η_h ~ N(0, σ²_η): Y_hk = η_h + Z_hk γ + X_hk β + ε_hk. Different DR estimators: combinations of different choices of propensity score model and outcome model.

Large sample properties We investigate the large-sample bias of the nonparametric estimators in the presence of intra-cluster influence, in the setting with no observed covariates. n_hz: number of subjects in group z in cluster h; n_z = Σ_h n_hz, n_h = n_h1 + n_h0, n = n_1 + n_0. Assume a Bernoulli assignment mechanism with rates varying by cluster: Z_hk ~ Bernoulli(p_h). (1) Assume an outcome model with cluster-specific random intercepts η_h and constant treatment effect π: Y_hk = η_h + Z_hk π + τ d_h + ε_hk, η_h ~ N(η_0, σ²_η), ε_hk ~ N(0, σ²_ε), (2) where d_h = n_h1/n_h is the cluster-specific proportion treated.

Nonparametric marginal estimator with marginal p.s. Violation of unconfoundedness and/or SUTVA. Under the marginal propensity score model, ê_hk = n_1/n for all units. Then for the nonparametric marginal estimator: π̂_ma^ma = π + τ V/V_0 + Σ_h η_h (n_h1/n_1 − n_h0/n_0) + (Σ_{h,k: Z_hk=1} ε_hk / n_1 − Σ_{h,k: Z_hk=0} ε_hk / n_0), where V = Σ_h (n_h/n) d_h² − (n_1/n)² is the weighted sample variance of {d_h}, and V_0 = n_1 n_0 / n² is the maximum of V, attained when each cluster is assigned entirely to treatment or entirely to control (treatment assigned at the cluster level). The expectations of the last two terms are asymptotically 0.

Nonparametric marginal estimator with marginal p.s. Therefore, under the generating model (1) and (2), the large-sample bias of π̂_ma^ma is τV/V_0. Interpreting the result: 1. τ measures the between-cluster variation in the outcome generating mechanism; 2. V/V_0 measures the between-cluster variation in the treatment assignment mechanism. Both are ignored in π̂_ma^ma, introducing bias. We also analyze the nonparametric estimators that account for clustering in at least one stage - all lead to consistent estimates. Conclusion: with violations of unconfoundedness due to unobserved cluster effects, (1) ignoring clustering in both stages of p.s. weighting leads to biased estimates, (2) but accounting for clustering in at least one stage leads to consistent estimates.
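The bias formula τV/V_0 can be checked by simulation under the generating model (1)-(2); the sketch below (illustrative parameter values, no observed covariates) compares the realized error of the marginal estimator with the predicted bias:

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical check of the large-sample bias tau * V / V0 under the
# generating model (1)-(2); parameter values are illustrative.
H, n_h, pi, tau = 200, 50, 1.0, 3.0
errs, predicted = [], []
for _ in range(200):
    p = rng.uniform(0.2, 0.8, size=H)             # cluster assignment rates
    cluster = np.repeat(np.arange(H), n_h)
    Z = rng.binomial(1, p[cluster])
    d = np.bincount(cluster, weights=Z) / n_h     # d_h = n_h1 / n_h
    eta = rng.normal(size=H)
    Y = eta[cluster] + pi * Z + tau * d[cluster] + rng.normal(size=H * n_h)
    # Marginal estimator with marginal p.s.: e_hat = n1/n for every unit,
    # so the HT-weighted means reduce to simple group means.
    pi_ma = Y[Z == 1].mean() - Y[Z == 0].mean()
    n1, n = Z.sum(), Z.size
    V = np.sum((n_h / n) * d**2) - (n1 / n) ** 2  # weighted variance of d_h
    V0 = n1 * (n - n1) / n**2
    errs.append(pi_ma - pi)
    predicted.append(tau * V / V0)
print(np.mean(errs), np.mean(predicted))
```

The average realized error and the average of τV/V_0 agree closely; the residual gap is the η and ε terms, whose expectations vanish.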

Simulation Design The asymptotics do not apply to a "large number of small clusters". The Neyman-Scott problem: MLEs for δ_h and α are inconsistent due to the growing number of parameters (clusters). The primary role of the p.s. is to balance covariates - even an inconsistent p.s. estimate may still work well, as long as it balances the covariates. We simulate under balanced designs: (i) small number of large clusters: H = 30, n_h = 400; (ii) large number of medium clusters: H = 200, n_h = 20; (iii) large number of small clusters: H = 400, n_h = 10. Additional goal: investigate the impact of an unmeasured cluster-level confounder in more realistic settings with observed covariates.

Simulation Design Two individual-level covariates: U_1 ~ N(1, 1); U_2 ~ Bernoulli(0.4). One cluster-level covariate V: (i) uncorrelated with U: V ~ N(1, 2); (ii) correlated with the p.s. model: V_h = 1 + 2δ_h + u, u ~ N(0, 0.5). True treatment assignment: logit Pr(Z_hk = 1 | X_hk) = δ_h + X_hk α. True parameters: δ_h ~ N(0, 1) and α = (1, .5, α_3). True (potential) outcome model - random effects model with interaction (η_h ~ N(0, 1), γ_h ~ N(0, 1), ε_hk ~ N(0, σ²_y)): Y_hk = η_h + X_hk β + Z_hk (V_h κ + γ_h) + ε_hk. True parameters: κ = 2 and β = (1, .5, β_3).

Simulation Design Under each scenario, 500 replicates are generated. For each simulated data set: 1. Calculate the true value of π as Σ_{h,k} [Y_hk(1) − Y_hk(0)]/n from the simulated units. 2. Estimate the propensity score using the three models with U_1, U_2, omitting the cluster-level V. 3. Fit the three outcome models with U_1, U_2, omitting the cluster-level V. 4. Obtain the nonparametric estimates and the DR estimates of π from the above. For comparison, estimates from the true (benchmark) models with U_1, U_2, V are also calculated. Random effects models are fitted via the lmer function in the R package lme4 (penalized likelihood estimation).

Simulation Results (bias and mse of estimates of π; rows: propensity score model, with BE = benchmark, MA = marginal, FE = fixed effects, RE = random effects; doubly-robust columns labeled by outcome model):

(H, n_h)   p.s.    nonparametric               doubly-robust
                   marginal     cluster       benchmark    marginal     fixed        random
                   bias  mse    bias  mse     bias  mse    bias  mse    bias  mse    bias  mse
(30, 400)  BE       .15   .22    .07   .10    .029  .037    .23   .33    .10   .13   .029  .037
(30, 400)  MA      4.46  4.50    .27   .32    .023  .029   4.46  4.51    .28   .34   .024  .030
(30, 400)  FE       .15   .22    .07   .10    .030  .039    .23   .35    .10   .14   .030  .039
(30, 400)  RE       .17   .23    .07   .09    .029  .037    .25   .34    .10   .13   .029  .037
(200, 20)  BE       .49   .54    .16   .19    .041  .052    .82   .88    .13   .19   .044  .055
(200, 20)  MA      4.46  4.48    .29   .32    .036  .045   4.46  4.48    .16   .20   .046  .057
(200, 20)  FE       .54   .59    .12   .15    .046  .058    .67   .75    .14   .18   .048  .060
(200, 20)  RE      1.24  1.25    .13   .16    .040  .050   1.59  1.61    .12   .15   .044  .054
(400, 10)  BE       .59   .63    .20   .23    .043  .053   1.06  1.10    .13   .17   .052  .065
(400, 10)  MA      4.49  4.50    .31   .33    .038  .048   4.49  4.51    .16   .18   .067  .081
(400, 10)  FE      1.00  1.03    .12   .15    .047  .059   1.25  1.29    .15   .18   .055  .068
(400, 10)  RE      1.84  1.85    .17   .20    .040  .051   2.32  2.33    .13   .16   .056  .069

Simulation Summaries 1. Estimators that ignore clustering in both the propensity score and outcome models have much larger bias and RMSE than all the others. 2. The choice of the outcome model appears to have a larger impact on the results than the choice of the p.s. model. 3. Among the DR estimators, those with the benchmark outcome model and those with the random effects outcome model perform best. 4. Given the same propensity score model, the nonparametric cluster-weighted estimator generally gives smaller bias than the DR estimator with the fixed effects outcome model. 5. As the number of clusters increases and the size of each cluster decreases, bias and RMSE increase considerably for all the estimators; the largest increase is observed in the nonparametric marginal estimators.

Application: racial disparity in health care Disparity: racial differences in care attributable to the operation of the health care system. Breast cancer screening data for different health insurance plans are collected from the Centers for Medicare and Medicaid Services (CMS). We focus on the comparison between whites and blacks, and on plans with at least 25 whites and 25 blacks: 64 plans with a total sample size of 75012. We subsample 3000 subjects from large (>3000) clusters to restrict the impact of extremely large clusters, giving a final sample size of 56480.

Racial disparity data Treatment Z_hk: black race (1 = black, 0 = white). Outcome Y_hk: whether the subject received breast cancer screening. Goal: investigate racial disparity in breast cancer screening - an unconfounded descriptive comparison. Cluster-level covariates V_h: geographical code, nonprofit/for-profit status, practice model. Individual-level covariates X_hk: age category, eligibility for Medicaid, poor neighborhood. Analysis: fit the three p.s. models and the three outcome models, and obtain the estimates.

Balance Check: different propensity score models Figure: Histograms of cluster-specific weighted proportions of black enrollees (x-axis: cluster-specific proportion of blacks; y-axis: frequency), under four schemes: unweighted, marginal p.s. model, fixed effects p.s. model, random effects p.s. model. [Histograms not reproduced in this transcription.] All p.s. models suggest: living in a poor neighborhood, being eligible for Medicaid and enrollment in a for-profit insurance plan are significantly associated with black race.

Results Table: Average controlled difference (in percentage points) in the proportion receiving breast cancer screening between blacks and whites; rows: p.s. model, columns: estimator, standard errors in parentheses.

p.s. model   weighted                      doubly-robust
             marginal     clustered     marginal     fixed        random
marginal    -4.96 (.79)  -1.73 (.83)  -4.43 (.85)  -2.15 (.41)  -1.65 (.43)
fixed       -2.49 (.92)  -1.78 (.81)  -1.93 (.82)  -2.21 (.42)  -1.96 (.41)
random      -2.56 (.91)  -1.78 (.82)  -2.00 (.44)  -2.22 (.39)  -1.95 (.39)

All estimators show that the rate of receipt of breast cancer screening is significantly lower among blacks than among whites with similar characteristics. Ignoring clustering in both stages roughly doubles the estimates relative to analyses that account for clustering in at least one stage; the between-cluster variation is large. The DR estimates have smaller SEs.

Conclusion The propensity score is a powerful tool to balance covariates, for both causal and descriptive comparisons. We introduce and compare several propensity score weighting methods for multilevel data. Using analytical derivations and simulations, we show that: 1. Ignoring the multilevel structure in both stages of p.s. weighting generally leads to severe bias in estimates of the ATE/ACD. 2. Exploiting the multilevel structure, either parametrically or nonparametrically, in at least one stage can greatly reduce the bias. In complex observational data, correctly specifying the outcome model may be challenging; propensity score methods are more robust. Interference between units within a cluster may exist (so SUTVA does not hold); further research is needed.