Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum
Introduction In comparative effectiveness studies, the goal is usually to estimate the causal (treatment) effect. In noncausal descriptive studies, a common goal is an unconfounded comparison of two populations. Population-based observational studies are increasingly important sources for both types of studies. Proper adjustment for differences between treatment groups is crucial to both descriptive and causal analyses. Regression has long been the standard method. The propensity score (Rosenbaum and Rubin, 1983) is a robust alternative to regression adjustment, applicable to both causal and descriptive studies.
Multilevel data The propensity score has mostly been developed and applied in cross-sectional settings. Data in medical care and health policy research are often multilevel: subjects are grouped in natural clusters, e.g., geographical areas, hospitals, health service providers, etc., with significant within- and between-cluster variation. Ignoring the cluster structure often leads to invalid inference: inaccurate standard errors, and cluster-level effects confounded with individual-level effects. Hierarchical regression models provide a unified framework for studying multilevel data.
Propensity score methods for multilevel data Propensity score methods for multilevel data have been less explored. Arpino and Mealli (2011) investigated propensity score matching in multilevel settings. We focus on propensity score weighting strategies: 1. Investigate the performance of different modeling and weighting strategies in the presence of model misspecification due to cluster-level confounders 2. Clarify the differences and connections between causal and unconfounded descriptive comparisons.
Notations Setting: (1) two-level structure; (2) treatment assigned at the individual level. Cluster: $h = 1, \dots, H$. Within-cluster index: $k = 1, \dots, n_h$; total sample size $n = \sum_h n_h$. Treatment (individual-level): $Z_{hk}$, 0 control, 1 treatment. Covariates: $X_{hk} = (U_{hk}, V_h)$, with $U_{hk}$ individual-level and $V_h$ cluster-level. Outcome: $Y_{hk}$.
Controlled descriptive comparisons Assignment: a nonmanipulable state defining membership in one of two groups. Objective: an unconfounded comparison of the observed outcomes between the groups. Estimand: the average controlled difference (ACD), the difference in the means of $Y$ between the two groups with balanced covariate distributions: $\pi_{ACD} = E_X[E(Y \mid X, Z = 1) - E(Y \mid X, Z = 0)]$. Examples: comparing outcomes among populations of different races, or of patients treated in two different years.
Causal comparisons Assignment: a potentially manipulable intervention. Objective: the causal effect, a comparison of the potential outcomes under treatment versus control in a common set of units. Assuming SUTVA, each unit has two potential outcomes: $Y_{hk}(0)$, $Y_{hk}(1)$. Estimand: the average treatment effect (ATE), $\pi_{ATE} = E[Y(1) - Y(0)]$. Examples: evaluating the treatment effect of a drug, therapy, or policy for a given population. Under the unconfoundedness assumption, $(Y(0), Y(1)) \perp Z \mid X$, we have $\pi_{ATE} = \pi_{ACD}$.
Propensity score Propensity score: $e(X) = \Pr(Z = 1 \mid X)$. Balancing property: balancing the propensity score also balances the covariates of the two groups, $X \perp Z \mid e(X)$. Using the propensity score is a two-step procedure. Step 1: estimate the propensity score, e.g., by logistic regression. Step 2: estimate the treatment effect by incorporating (via matching, weighting, stratification, etc.) the estimated propensity score. The propensity score is a less parametric alternative to regression adjustment.
Propensity score weighting Foundation of propensity score weighting: $E\left[\frac{ZY}{e(X)} - \frac{(1-Z)Y}{1-e(X)}\right] = \pi_{ACD} = \pi_{ATE} = \pi$. Inverse-probability (Horvitz-Thompson) weight: $w_{hk} = 1/e(x_{hk})$ for $Z_{hk} = 1$, and $w_{hk} = 1/(1 - e(x_{hk}))$ for $Z_{hk} = 0$. HT weighting balances (in expectation) the weighted distributions of the covariates in the two groups.
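As a concrete numerical illustration (not from the talk), a minimal sketch of HT weighting with a single simulated confounder; the variable names and data-generating model below are invented for illustration, written in Python rather than the R used for the talk's analyses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-level data with one confounder x (illustrative only):
# treatment probability and outcome both depend on x.
n = 5000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-0.5 * x))               # true propensity score e(x)
z = rng.binomial(1, e)
y = 1.0 * z + 2.0 * x + rng.normal(size=n)   # true effect pi = 1

# Horvitz-Thompson (inverse-probability) weights:
# 1/e(x) for treated units, 1/(1 - e(x)) for controls.
w = np.where(z == 1, 1 / e, 1 / (1 - e))

# The unweighted comparison is confounded by x; the weighted
# comparison recovers pi (up to sampling noise).
naive = y[z == 1].mean() - y[z == 0].mean()
pi_hat = (np.sum(y * w * z) / np.sum(w * z)
          - np.sum(y * w * (1 - z)) / np.sum(w * (1 - z)))
```

The weighted difference in means lands near the true effect, while the raw difference absorbs the confounding through $x$.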
Step 1: Propensity score models Marginal model, ignoring the multilevel structure: $\text{logit}(e_{hk}) = \delta_0 + X_{hk}'\alpha$. Fixed effects model, adding a cluster-specific main effect $\delta_h$: $\text{logit}(e_{hk}) = \delta_h + U_{hk}'\alpha$. Key: the cluster membership is a nominal covariate; $\delta_h$ absorbs the effects of both observed and unobserved cluster-level covariates $V_h$. This estimates a balancing score without knowledge of $V_h$, but may lead to larger variance than the propensity score estimated under a correct model with fully observed $V_h$. The Neyman-Scott problem: inconsistent estimates of $\delta_h$ and $\alpha$ given a large number of small clusters.
Step 1: Propensity score models Random effects model: $\text{logit}(e_{hk}) = \delta_h + X_{hk}'\alpha$, with $\delta_h \sim N(\delta_0, \sigma^2_\delta)$. Due to the shrinkage of the random effects, $V_h$ needs to be included. Pros: borrows information across clusters; works better with many small clusters. Cons: produces biased estimates if the $\delta_h$ are correlated with the covariates; does not guarantee balance within each cluster. Random slopes can also be incorporated.
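The marginal and fixed effects propensity score models above can be sketched as follows (a Python illustration rather than the R used in the talk; the fixed effects model is fit by adding one dummy column per cluster, and all data-generating values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logit(X, z, iters=50):
    """Plain Newton-Raphson (IRLS) logistic regression - a minimal sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (z - p)                       # score
        hess = (X * (p * (1 - p))[:, None]).T @ X  # Fisher information
        beta += np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return beta

# Toy two-level data: cluster effects delta_h play the role of
# unobserved cluster-level covariates.
H, n_h = 30, 200
delta = rng.normal(0, 1, H)
cluster = np.repeat(np.arange(H), n_h)
u = rng.normal(size=H * n_h)                       # individual covariate U
p_true = 1 / (1 + np.exp(-(delta[cluster] + 0.5 * u)))
z = rng.binomial(1, p_true)

# Marginal model: intercept + U only (ignores clustering; the omitted
# cluster effects tend to attenuate the slope).
X_ma = np.column_stack([np.ones(H * n_h), u])
beta_ma = fit_logit(X_ma, z)

# Fixed effects model: one dummy per cluster absorbs delta_h (and any V_h).
X_fe = np.column_stack([np.eye(H)[cluster], u])
beta_fe = fit_logit(X_fe, z)
alpha_fe = beta_fe[-1]                             # coefficient on U
e_hat_fe = 1 / (1 + np.exp(-(X_fe @ beta_fe)))
```

With a modest number of large clusters, as here, the dummy-variable fit recovers the individual-level coefficient well; with many small clusters the Neyman-Scott problem noted above kicks in.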
Step 2: Estimators for the ACD or the ATE Two types of propensity-score-weighted estimators: 1. Nonparametric: applying the weights directly to the observed outcomes. 2. Parametric: applying the weights to the fitted outcomes from a parametric model (doubly-robust estimators).
Nonparametric Estimators Marginal estimator, ignoring clustering: $\hat\pi^{ma} = \sum_{h,k: Z_{hk}=1} Y_{hk} w_{hk} / w_1 - \sum_{h,k: Z_{hk}=0} Y_{hk} w_{hk} / w_0$, where $w_{hk}$ is the HT weight using the estimated propensity score and $w_z = \sum_{h,k: Z_{hk}=z} w_{hk}$ for $z = 0, 1$. Cluster-weighted estimator: (1) obtain cluster-specific estimates; (2) average over clusters: $\hat\pi^{cl} = \sum_h w_h \hat\pi_h / \sum_h w_h$, where $\hat\pi_h = \sum_{k: Z_{hk}=1} Y_{hk} w_{hk} / w_{h1} - \sum_{k: Z_{hk}=0} Y_{hk} w_{hk} / w_{h0}$ and $w_{hz} = \sum_{k: Z_{hk}=z} w_{hk}$ for $z = 0, 1$.
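The two nonparametric estimators can be sketched as plain functions (a Python sketch; taking the cluster weights $w_h$ proportional to cluster size is an assumption here, as the talk leaves $w_h$ generic):

```python
import numpy as np

def pi_marginal(y, z, w):
    """Marginal HT estimator: pooled weighted group means, ignoring clusters."""
    return (np.sum(y * w * z) / np.sum(w * z)
            - np.sum(y * w * (1 - z)) / np.sum(w * (1 - z)))

def pi_cluster(y, z, w, cluster):
    """Cluster-weighted estimator: within-cluster HT estimates pi_h,
    averaged with cluster weights w_h (cluster sizes - an assumption)."""
    labels = np.unique(cluster)
    w_h = np.array([np.sum(cluster == h) for h in labels], dtype=float)
    pi_h = np.array([pi_marginal(y[cluster == h], z[cluster == h],
                                 w[cluster == h]) for h in labels])
    return np.sum(w_h * pi_h) / np.sum(w_h)

# Sanity check on toy clustered data with known effect pi = 1,
# using the true assignment probabilities as weights.
rng = np.random.default_rng(2)
H, n_h = 20, 100
p_h = rng.uniform(0.2, 0.8, H)                 # cluster-varying rates
cluster = np.repeat(np.arange(H), n_h)
z = rng.binomial(1, p_h[cluster])
y = rng.normal(0, 1, H)[cluster] + 1.0 * z + rng.normal(size=H * n_h)
w = np.where(z == 1, 1 / p_h[cluster], 1 / (1 - p_h[cluster]))
est_ma, est_cl = pi_marginal(y, z, w), pi_cluster(y, z, w, cluster)
```

With correctly specified weights both estimators are close to the true effect; the distinction between them matters when the propensity score model misses the cluster structure.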
Parametric doubly-robust (DR) estimators DR estimator (Robins and colleagues): augment the weighted observed outcomes with fitted outcomes from a parametric model: $\hat\pi^{dr} = \sum_{h,k} \hat I_{hk} / n$, where $\hat I_{hk} = \left[\frac{Z_{hk} Y_{hk}}{\hat e_{hk}} - \frac{(Z_{hk} - \hat e_{hk})\hat Y^1_{hk}}{\hat e_{hk}}\right] - \left[\frac{(1 - Z_{hk}) Y_{hk}}{1 - \hat e_{hk}} + \frac{(Z_{hk} - \hat e_{hk})\hat Y^0_{hk}}{1 - \hat e_{hk}}\right]$, and $\hat Y^z_{hk}$ is the fitted outcome from a parametric model in group $z$. Double robustness: in large samples, $\hat\pi^{dr}$ is consistent if either the propensity score model or the outcome model is correct, not necessarily both. SEs of these estimators can be obtained via the delta method.
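The DR estimator above amounts to averaging the terms $\hat I_{hk}$. A Python sketch, with an invented check of double robustness (correct propensity scores paired with a deliberately wrong, constant-zero outcome model):

```python
import numpy as np

rng = np.random.default_rng(3)

def dr_estimate(y, z, e_hat, y1_hat, y0_hat):
    """Doubly-robust (AIPW) estimator: average of the I_hk terms."""
    i_hk = (z * y / e_hat - (z - e_hat) * y1_hat / e_hat
            - ((1 - z) * y / (1 - e_hat)
               + (z - e_hat) * y0_hat / (1 - e_hat)))
    return np.mean(i_hk)

# Toy data with true effect pi = 2 and confounding through x.
n = 20000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                 # true propensity score
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)

# Wrong outcome model (predicts 0 for everyone), correct p.s.:
# the estimate should still be close to 2.
pi_dr = dr_estimate(y, z, e, y1_hat=np.zeros(n), y0_hat=np.zeros(n))
```

Symmetrically, a correct outcome model rescues a misspecified propensity score; the estimator fails only when both models are wrong, which is exactly the pattern seen in the simulation results below.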
DR estimators: outcome model Marginal model, ignoring clustering: $Y_{hk} = \eta_0 + Z_{hk}\gamma + X_{hk}'\beta + \epsilon_{hk}$. Fixed effects model, adjusting for cluster-level main effects $\eta_h$ and covariates: $Y_{hk} = \eta_h + Z_{hk}\gamma + U_{hk}'\beta + \epsilon_{hk}$. Random effects model, assuming cluster-specific effects $\eta_h \sim N(0, \sigma^2_\eta)$: $Y_{hk} = \eta_h + Z_{hk}\gamma + X_{hk}'\beta + \epsilon_{hk}$. Different DR estimators arise from the combinations of propensity score model and outcome model.
Large sample properties We investigate the large-sample bias of the nonparametric estimators in the presence of intra-cluster influence, in the setting with no observed covariates. Let $n_{hz}$ be the number of subjects in group $z$ in cluster $h$, with $n_z = \sum_h n_{hz}$, $n_h = n_{h1} + n_{h0}$, and $n = n_1 + n_0$. Assume a Bernoulli assignment mechanism with rates varying by cluster: $Z_{hk} \sim \text{Bernoulli}(p_h)$. (1) Assume an outcome model with cluster-specific random intercepts $\eta_h$ and a constant treatment effect $\pi$: $Y_{hk} = \eta_h + Z_{hk}\pi + \tau d_h + \epsilon_{hk}$, $\eta_h \sim N(\eta_0, \sigma^2_\eta)$, $\epsilon_{hk} \sim N(0, \sigma^2_\epsilon)$, (2) where $d_h = n_{h1}/n_h$ is the cluster-specific proportion treated.
Nonparametric marginal estimator with marginal p.s. This setting involves a violation of unconfoundedness and/or SUTVA. Under the marginal propensity score model, $\hat e_{hk} = n_1/n$ for all units. Then the nonparametric marginal estimator satisfies $\hat\pi^{ma}_{ma} = \pi + \tau \frac{V}{V_0} + \sum_h \eta_h \left(\frac{n_{h1}}{n_1} - \frac{n_{h0}}{n_0}\right) + \left(\sum_{h,k: Z_{hk}=1} \frac{\epsilon_{hk}}{n_1} - \sum_{h,k: Z_{hk}=0} \frac{\epsilon_{hk}}{n_0}\right)$, where $V = \sum_h (n_h/n) d_h^2 - (n_1/n)^2$ is the weighted sample variance of $\{d_h\}$, and $V_0 = n_1 n_0 / n^2$ is the maximum of $V$, attained when each cluster is assigned entirely to treatment or entirely to control (treatment assigned at the cluster level). The expectations of the last two terms are asymptotically 0.
Nonparametric marginal estimator with marginal p.s. Therefore, under generating models (1) and (2), the large-sample bias of $\hat\pi^{ma}_{ma}$ is $\tau V/V_0$. Interpreting the result: 1. $\tau$ measures the between-cluster variation in the outcome generating mechanism; 2. $V/V_0$ measures the between-cluster variation in the treatment assignment mechanism. Both are ignored in $\hat\pi^{ma}_{ma}$, introducing bias. We also analyze the nonparametric estimators that account for clustering in at least one stage; all lead to consistent estimates. Conclusion: with violations of unconfoundedness due to unobserved cluster effects, (1) ignoring clustering in both stages of propensity score weighting leads to biased estimates, (2) but accounting for clustering in at least one stage leads to consistent estimates.
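The bias formula can be checked numerically. A Monte Carlo sketch under generating models (1) and (2), with invented parameter values, comparing the empirical bias of the marginal-on-marginal estimator to $\tau V/V_0$:

```python
import numpy as np

rng = np.random.default_rng(4)

H, n_h = 50, 40
pi_true, tau = 1.0, 3.0
p_h = rng.uniform(0.1, 0.9, H)          # cluster-varying assignment rates

biases, predicted = [], []
for _ in range(300):
    eta = rng.normal(0, 1, H)
    z = rng.binomial(1, np.repeat(p_h, n_h))
    d = z.reshape(H, n_h).mean(axis=1)   # d_h = n_h1 / n_h
    y = (np.repeat(eta + tau * d, n_h) + pi_true * z
         + rng.normal(0, 1, H * n_h))
    n1, n = z.sum(), H * n_h
    # Under the marginal p.s. model e_hat = n1/n, the weights are
    # constant, so the estimator reduces to a raw difference in means.
    est = y[z == 1].mean() - y[z == 0].mean()
    V = np.sum((n_h / n) * d**2) - (n1 / n) ** 2    # weighted var of d_h
    V0 = n1 * (n - n1) / n**2                       # maximum of V
    biases.append(est - pi_true)
    predicted.append(tau * V / V0)

avg_bias = np.mean(biases)
avg_pred = np.mean(predicted)
```

Averaged over replicates, the empirical bias tracks the analytical value $\tau V/V_0$ closely, while the $\eta$ and $\epsilon$ terms average out.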
Simulation Design The asymptotics do not apply to a large number of small clusters. The Neyman-Scott problem: MLEs for $\delta_h$ and $\alpha$ are inconsistent due to the growing number of parameters (clusters). The primary role of the propensity score is to balance covariates; even an inconsistent propensity score estimate may still work well, as long as it balances the covariates. We simulate under balanced designs: (i) small number of large clusters: $H = 30$, $n_h = 400$; (ii) large number of medium clusters: $H = 200$, $n_h = 20$; (iii) large number of small clusters: $H = 400$, $n_h = 10$. Additional goal: investigate the impact of an unmeasured cluster-level confounder in more realistic settings with observed covariates.
Simulation Design Two individual-level covariates: $U_1 \sim N(1, 1)$; $U_2 \sim \text{Bernoulli}(0.4)$. One cluster-level covariate $V$: (i) uncorrelated with $U$: $V \sim N(1, 2)$; (ii) correlated with the p.s. model: $V_h = 1 + 2\delta_h + u$, $u \sim N(0, 0.5)$. True treatment assignment: $\text{logit} \Pr(Z_{hk} = 1 \mid X_{hk}) = \delta_h + X_{hk}'\alpha$, with true parameters $\delta_h \sim N(0, 1)$ and $\alpha = (1, .5, \alpha_3)$. True (potential) outcome model: a random effects model with interaction, $\eta_h \sim N(0, 1)$, $\gamma_h \sim N(0, 1)$, $\epsilon_{hk} \sim N(0, \sigma^2_y)$: $Y_{hk} = \eta_h + X_{hk}'\beta + Z_{hk}(V_h \kappa + \gamma_h) + \epsilon_{hk}$, with true parameters $\kappa = 2$ and $\beta = (1, .5, \beta_3)$.
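One replicate of this design can be generated as follows (a Python sketch for scenario (ii), $V$ correlated with the p.s. model; the values of $\alpha_3$, $\beta_3$, $\sigma_y$ are left unspecified on the slide, so the defaults here are placeholders, and $N(0, 0.5)$ is read as variance 0.5):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(H, n_h, alpha3=1.0, beta3=1.0, sigma_y=1.0):
    """One replicate of the two-level design (placeholder parameter values)."""
    delta = rng.normal(0, 1, H)
    V = 1 + 2 * delta + rng.normal(0, np.sqrt(0.5), H)  # cluster covariate
    g = np.repeat(np.arange(H), n_h)                    # cluster labels
    U1 = rng.normal(1, 1, H * n_h)
    U2 = rng.binomial(1, 0.4, H * n_h)
    # Treatment: logit Pr(Z=1) = delta_h + X'alpha, alpha = (1, .5, alpha3).
    lin = delta[g] + 1.0 * U1 + 0.5 * U2 + alpha3 * V[g]
    Z = rng.binomial(1, 1 / (1 + np.exp(-lin)))
    # Potential outcomes: Y(z) = eta_h + X'beta + z(V_h*kappa + gamma_h) + eps,
    # kappa = 2, with a shared noise draw across the two arms (an assumption).
    eta, gamma = rng.normal(0, 1, H), rng.normal(0, 1, H)
    xb = eta[g] + 1.0 * U1 + 0.5 * U2 + beta3 * V[g]
    eps = rng.normal(0, sigma_y, H * n_h)
    Y0 = xb + eps
    Y1 = xb + 2.0 * V[g] + gamma[g] + eps
    pi_true = np.mean(Y1 - Y0)    # the replicate-specific true value of pi
    return dict(cluster=g, U1=U1, U2=U2, V=V[g], Z=Z,
                Y=np.where(Z == 1, Y1, Y0), Y1=Y1, Y0=Y0, pi=pi_true)

rep = simulate(H=30, n_h=400)
```

Fitting the three propensity score and outcome models to such replicates (with $V$ withheld) reproduces the comparison described on the next slide.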
Simulation Design Under each scenario, 500 replicates are generated. For each replicate: 1. Calculate the true value of $\pi$ as $\sum_{h,k} [Y_{hk}(1) - Y_{hk}(0)]/n$ from the simulated units. 2. Estimate the propensity score using the three models with $U_1, U_2$, omitting the cluster-level $V$. 3. Fit the three outcome models with $U_1, U_2$, omitting the cluster-level $V$. 4. Obtain the nonparametric estimates and the DR estimates of $\pi$ from the above. For comparison, estimates from the true models (benchmark) with $U_1, U_2, V$ are also calculated. Random effects models are fitted via the lmer function in the R package lme4 (penalized likelihood estimation).
Simulation Results Each cell reports bias / mse. Rows are indexed by the propensity score model (BE = benchmark, MA = marginal, FE = fixed effects, RE = random effects); columns give the nonparametric (NP) estimators and the DR estimators by outcome model.

| (H, n_h) | p.s. | NP marginal | NP cluster | DR benchmark | DR marginal | DR fixed | DR random |
|---|---|---|---|---|---|---|---|
| (30,400) | BE | .15 / .22 | .07 / .10 | .029 / .037 | .23 / .33 | .10 / .13 | .029 / .037 |
| (30,400) | MA | 4.46 / 4.50 | .27 / .32 | .023 / .029 | 4.46 / 4.51 | .28 / .34 | .024 / .030 |
| (30,400) | FE | .15 / .22 | .07 / .10 | .030 / .039 | .23 / .35 | .10 / .14 | .030 / .039 |
| (30,400) | RE | .17 / .23 | .07 / .09 | .029 / .037 | .25 / .34 | .10 / .13 | .029 / .037 |
| (200,20) | BE | .49 / .54 | .16 / .19 | .041 / .052 | .82 / .88 | .13 / .19 | .044 / .055 |
| (200,20) | MA | 4.46 / 4.48 | .29 / .32 | .036 / .045 | 4.46 / 4.48 | .16 / .20 | .046 / .057 |
| (200,20) | FE | .54 / .59 | .12 / .15 | .046 / .058 | .67 / .75 | .14 / .18 | .048 / .060 |
| (200,20) | RE | 1.24 / 1.25 | .13 / .16 | .040 / .050 | 1.59 / 1.61 | .12 / .15 | .044 / .054 |
| (400,10) | BE | .59 / .63 | .20 / .23 | .043 / .053 | 1.06 / 1.10 | .13 / .17 | .052 / .065 |
| (400,10) | MA | 4.49 / 4.50 | .31 / .33 | .038 / .048 | 4.49 / 4.51 | .16 / .18 | .067 / .081 |
| (400,10) | FE | 1.00 / 1.03 | .12 / .15 | .047 / .059 | 1.25 / 1.29 | .15 / .18 | .055 / .068 |
| (400,10) | RE | 1.84 / 1.85 | .17 / .20 | .040 / .051 | 2.32 / 2.33 | .13 / .16 | .056 / .069 |
Simulation Summaries 1. Estimators that ignore clustering in both the propensity score and outcome models have much larger bias and RMSE than all the others. 2. The choice of the outcome model appears to have a larger impact on the results than the choice of the propensity score model. 3. Among the DR estimators, those with the benchmark outcome model and those with the random effects outcome model perform best. 4. Given the same propensity score model, the nonparametric cluster-weighted estimator generally gives smaller bias than the DR estimator with the fixed effects outcome model. 5. As the number of clusters increases and the size of each cluster decreases, bias and RMSE increase considerably for all estimators; the largest increase is observed in the nonparametric marginal estimators.
Application: racial disparity in health care Disparity: racial differences in care attributed to the operations of the health care system. Breast cancer screening data in different health insurance plans are collected from the Centers for Medicare and Medicaid Services (CMS). We focus on the comparison between whites and blacks, and on plans with at least 25 whites and 25 blacks: 64 plans with a total sample size of 75,012. We subsample 3,000 subjects from large (>3,000) clusters to restrict the impact of extremely large clusters, resulting in a sample size of 56,480.
Racial disparity data Treatment $Z_{hk}$: black race (1 = black, 0 = white). Outcome $Y_{hk}$: whether the subject received screening for breast cancer. Goal: investigate racial disparity in breast cancer screening, an unconfounded descriptive comparison. Cluster-level covariates $V_h$: geographical code, non/for-profit status, practice model. Individual-level covariates $X_{hk}$: age category, eligibility for Medicaid, poor neighborhood. Analysis: fit the three propensity score and the three outcome models, and obtain the estimates.
Balance Check: different propensity score models Figure: Histograms of the cluster-specific weighted proportions of black enrollees, unweighted and weighted under the marginal, fixed effects, and random effects propensity score models. All propensity score models suggest that living in a poor neighborhood, being eligible for Medicaid, and enrollment in a for-profit insurance plan are significantly associated with black race.
Results Table: Average controlled difference (in percentage points) in the proportion receiving breast cancer screening between blacks and whites; standard errors in parentheses. Rows: propensity score model.

| p.s. model | weighted marginal | weighted clustered | DR marginal | DR fixed | DR random |
|---|---|---|---|---|---|
| marginal | -4.96 (.79) | -1.73 (.83) | -4.43 (.85) | -2.15 (.41) | -1.65 (.43) |
| fixed | -2.49 (.92) | -1.78 (.81) | -1.93 (.82) | -2.21 (.42) | -1.96 (.41) |
| random | -2.56 (.91) | -1.78 (.82) | -2.00 (.44) | -2.22 (.39) | -1.95 (.39) |

All estimators show that the rate of receipt of breast cancer screening is significantly lower among blacks than among whites with similar characteristics. Ignoring clustering in both stages roughly doubles the estimate relative to analyses that account for clustering in at least one stage; the between-cluster variation is large. The DR estimates have smaller SEs.
Conclusion The propensity score is a powerful tool for balancing covariates, for both causal and descriptive comparisons. We introduce and compare several propensity score weighting methods for multilevel data. Using analytical derivations and simulations, we show that 1. ignoring the multilevel structure in both stages of propensity score weighting generally leads to severe bias in estimating the ATE/ACD; 2. exploiting the multilevel structure, either parametrically or nonparametrically, in at least one stage can greatly reduce the bias. In complex observational data, correctly specifying the outcome model may be challenging; propensity score methods are more robust. Interference between units in a cluster may exist (SUTVA may not hold); further research is needed.