Combining multiple observational data sources to estimate causal effects

Shu Yang*, Department of Statistics, North Carolina State University
Joint work with Peng Ding, UC Berkeley
May 23, 2018, Atlantic Causal Inference Conference

A textbook setup: one observational study
Causal inference is a central goal of research in the social, health, and economic sciences.
One main statistical approach is the potential outcomes framework (Rubin, 1974):
  binary treatment $A$, potential outcomes $Y(1)$ and $Y(0)$, observed outcome $Y = Y(A)$
  treatment effects: $Y(1) - Y(0)$ and $E\{Y(1) - Y(0)\}$
A key assumption is unconfoundedness, or ignorability:
  $A \perp \{Y(1), Y(0)\} \mid$ observed covariates
Causal effects can be identified and estimated using regression imputation, (augmented) inverse probability weighting, and matching.

A modern setup: multiple data sources
Multiple data sources are increasingly available.
  New opportunities: more information
  New challenges: e.g., varying levels of confounder information, different sampling schemes and data structures, various modes of output, and so on
Motivating example: chronic obstructive pulmonary disease $\to$ herpes zoster
  Yang et al. (2011): positive association
    2005 Longitudinal Health Insurance Database in Taiwan (big main data)
    missing confounders: cigarette smoking and alcohol consumption
  Lin and Chen (2014): causal analysis with more confounders
    2005 National Health Interview Survey in Taiwan (validation data)
    regression analysis
Goal: a general framework to estimate causal effects by combining
  big main data with unmeasured confounders
  smaller validation data with supplementary information on these confounders

Two common types of studies
Classical two-phase samples (Neyman, 1938; Cochran, 2007)
  some variables (e.g., $A$, $X$, and $Y$) may be cheaper to measure
  some variables (e.g., $U$) are more expensive
  first phase: easy-to-obtain variables measured for all units
  second phase: expensive variables measured for a validation subsample
  large literature, e.g., Breslow et al. (1988, 1997a,b), Schill et al. (1997), and Breslow et al. (2003) for (logistic) regression
Combining multiple big data sources
  validation data with full information on $(A, X, U, Y)$
  external big data with only $(A, X, Y)$ (e.g., from electronic health records, claims databases, disease data registries, and census data)
  Chatterjee et al. (2016, 2017) for parametric regression analyses

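Notation: two data sources
Notation
  binary treatment $A \in \{0, 1\}$, potential outcomes $Y(0)$ and $Y(1)$, observed outcome $Y = Y(A)$
  pretreatment covariates $(X, U)$, where $X$ is fully observed but $U$ may not be fully observed
Two observed data sources with a nested two-phase structure (for illustration)
  main data $O_1 = \{(A_i, X_i, Y_i) : i \in S_1\}$ with size $n_1 = |S_1|$
  validation data $O_2 = \{(A_j, X_j, U_j, Y_j) : j \in S_2\}$ with size $n_2 = |S_2|$
Data structure (columns: covariates $X$ and $U$, treatment $A$, outcomes $Y(1)$, $Y(0)$, and $Y$; "?" marks a missing value)
  units $1, \ldots, n_2$ (validation data $O_2$, nested in the main data $O_1$): $X$, $U$, $A$, and $Y$ observed
  units $n_2 + 1, \ldots, n_1$ (rest of the main data $O_1$): $X$, $A$, and $Y$ observed; $U$ = ?
  every unit: $Y(0)$ = ? when $A = 1$, and $Y(1)$ = ? when $A = 0$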

Sampling and parameter of interest
$S_1$ for the main data is a random sample from a super-population
  $\{A_i, X_i, U_i, Y_i(0), Y_i(1) : i \in S_1\}$ are IID
$S_2$ for the validation data is a simple random sample from $S_1$
  $\{A_j, X_j, U_j, Y_j(0), Y_j(1) : j \in S_2\}$ are also IID
  later we relax $S_2$ to be a general probability sample
The estimand of interest is, e.g., the average causal effect (ACE) $\tau = E\{Y(1) - Y(0)\}$
  we drop the indices $i$ and $j$ in the expectation because of IID

Assumptions for identification
Assumption (Ignorability)
  $Y(a) \perp A \mid (X, U)$ for $a = 0$ and $1$
  It implies that for $a = 0, 1$:
    $P\{A = 1 \mid X, U, Y(a)\} = P(A = 1 \mid X, U) = e(X, U)$
    $E\{Y(a) \mid X, U\} = E\{Y(a) \mid A = a, X, U\} = E(Y \mid A = a, X, U) = \mu_a(X, U)$
Assumption (Overlap)
  There exist constants $c_1$ and $c_2$ such that, with probability 1, the propensity score is bounded, i.e., $0 < c_1 \leq e(X, U) \leq c_2 < 1$
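The two assumptions above are what make $\tau$ a function of the observed-data distribution. The following short derivation is a standard argument, included here as a sketch rather than as part of the talk; it uses only the symbols $\mu_a(X, U)$ and $e(X, U)$ defined on this slide.

```latex
\begin{align*}
\tau &= E\{Y(1) - Y(0)\} \\
     &= E\big[ E\{Y(1) \mid X, U\} - E\{Y(0) \mid X, U\} \big] \\
     &= E\{\mu_1(X, U) - \mu_0(X, U)\}
        && \text{(ignorability)} \\[4pt]
E\left\{ \frac{A Y}{e(X, U)} \,\Big|\, X, U \right\}
     &= \frac{E\{A \mid X, U\}\, E\{Y(1) \mid X, U\}}{e(X, U)}
      = \mu_1(X, U)
        && \text{(ignorability, overlap)} \\[4pt]
\tau &= E\left\{ \frac{A Y}{e(X, U)} - \frac{(1 - A) Y}{1 - e(X, U)} \right\}
\end{align*}
```

The first representation motivates regression imputation and the second motivates inverse probability weighting; overlap keeps the denominators away from zero.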

Assumptions for estimation
The outcome distribution and the propensity score are often unknown.
Assumption (Outcome model)
  The parametric model $\mu_a(X, U; \beta_a)$ is a correct specification for $\mu_a(X, U)$, for $a = 0, 1$; i.e., $\mu_a(X, U) = \mu_a(X, U; \beta_a^*)$, where $\beta_a^*$ is the true model parameter, for $a = 0, 1$
Assumption (Propensity score model)
  The parametric model $e(X, U; \alpha)$ is a correct specification for $e(X, U)$; i.e., $e(X, U) = e(X, U; \alpha^*)$, where $\alpha^*$ is the true model parameter
Consistency of different estimators requires different assumptions.

Commonly used estimators: regression, weighting (based on validation data only)
Example (Regression imputation)
  $\hat\tau_{\mathrm{reg},2} = n_2^{-1} \sum_{j \in S_2} \{\mu_1(X_j, U_j; \hat\beta_1) - \mu_0(X_j, U_j; \hat\beta_0)\}$
  consistent if the outcome model is correctly specified
Example (Inverse probability weighting, IPW)
  $\hat\tau_{\mathrm{IPW},2} = n_2^{-1} \sum_{j \in S_2} \frac{A_j Y_j}{e(X_j, U_j; \hat\alpha)} - n_2^{-1} \sum_{j \in S_2} \frac{(1 - A_j) Y_j}{1 - e(X_j, U_j; \hat\alpha)}$
  consistent if the propensity score model is correctly specified
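As a minimal sketch of the two validation-only estimators on this slide, the Python functions below take already-fitted values as inputs. The function names, the plain-list interface, and the toy numbers are illustrative assumptions, not part of the talk; fitting the outcome and propensity score models is left to standard software.

```python
def regression_imputation(mu1_hat, mu0_hat):
    """Average of fitted mu_1 - mu_0 over the validation units (hypothetical helper)."""
    n2 = len(mu1_hat)
    return sum(m1 - m0 for m1, m0 in zip(mu1_hat, mu0_hat)) / n2


def ipw(a, y, e_hat):
    """IPW estimate from fitted propensity scores e_hat (hypothetical helper)."""
    n2 = len(a)
    treated = sum(ai * yi / ei for ai, yi, ei in zip(a, y, e_hat)) / n2
    control = sum((1 - ai) * yi / (1 - ei) for ai, yi, ei in zip(a, y, e_hat)) / n2
    return treated - control


# Toy inputs, made up for illustration only.
a = [1, 0, 1, 0]
y = [3.0, 1.0, 5.0, 2.0]
e_hat = [0.5, 0.5, 0.5, 0.5]
tau_ipw = ipw(a, y, e_hat)
tau_reg = regression_imputation([2.0, 3.0], [1.0, 1.5])
```

Keeping model fitting outside these helpers mirrors the slide's split between the estimator formula and the modelling assumption each estimator relies on.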

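AIPW (based on validation data only)
Example (Augmented inverse probability weighting)
  The AIPW estimator is $\hat\tau_{\mathrm{AIPW},2} = n_2^{-1} \sum_{j \in S_2} \hat\tau_{\mathrm{AIPW},2,j}$, where
  $\hat\tau_{\mathrm{AIPW},2,j} = \frac{A_j Y_j}{e(X_j, U_j; \hat\alpha)} - \frac{A_j - e(X_j, U_j; \hat\alpha)}{e(X_j, U_j; \hat\alpha)} \mu_1(X_j, U_j; \hat\beta_1) - \frac{(1 - A_j) Y_j}{1 - e(X_j, U_j; \hat\alpha)} - \frac{A_j - e(X_j, U_j; \hat\alpha)}{1 - e(X_j, U_j; \hat\alpha)} \mu_0(X_j, U_j; \hat\beta_0)$
  Doubly robust: consistent if either the outcome model or the propensity score model is correct
  Locally efficient if both the outcome model and the propensity score model are correct
$\hat\tau_{\mathrm{reg},2}$, $\hat\tau_{\mathrm{IPW},2}$, and $\hat\tau_{\mathrm{AIPW},2}$ are regular asymptotically linear (RAL):
  $\hat\tau_2 - \tau = n_2^{-1} \sum_{j \in S_2} \psi(A_j, X_j, U_j, Y_j) + o_P(n_2^{-1/2})$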
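The double robustness claim can be checked numerically with a small sketch. The function below implements the standard AIPW form from fitted values; its name, interface, and the toy data are assumptions made for illustration. In the toy data the outcome model fits perfectly, so the estimate does not depend on the (deliberately arbitrary) propensity scores.

```python
def aipw(a, y, e_hat, mu1_hat, mu0_hat):
    """Standard AIPW point estimate from fitted values (hypothetical helper)."""
    terms = []
    for ai, yi, ei, m1, m0 in zip(a, y, e_hat, mu1_hat, mu0_hat):
        # Weighted outcome for the treated arm, augmented by the outcome model.
        treated_part = ai * yi / ei - (ai - ei) / ei * m1
        # Weighted outcome for the control arm, augmented by the outcome model.
        control_part = (1 - ai) * yi / (1 - ei) + (ai - ei) / (1 - ei) * m0
        terms.append(treated_part - control_part)
    return sum(terms) / len(terms)


# Toy data: each observed outcome equals the fitted mean for the received arm.
a = [1, 0]
y = [3.0, 1.0]
mu1_hat = [3.0, 2.0]
mu0_hat = [0.0, 1.0]
tau_wrong_ps = aipw(a, y, [0.2, 0.9], mu1_hat, mu0_hat)   # arbitrary scores
tau_other_ps = aipw(a, y, [0.5, 0.5], mu1_hat, mu0_hat)   # different scores
```

Both calls return the average of `mu1_hat - mu0_hat`, which is the regression imputation estimate for the same fitted values.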

Matching (based on validation data only)
Impute the counterfactual outcome of unit $j$ via matching
  matching based on $V$ (e.g., $(X, U)$), fixed $M$, with replacement
Example (Matching)
  $\hat\tau^{(0)}_{\mathrm{mat},2} = n_2^{-1} \sum_{j \in S_2} \{\hat Y_j(1) - \hat Y_j(0)\}$
Abadie and Imbens (2006): biased if matching on a $p$-dimensional variable ($p \geq 2$)
  bias-corrected estimator $\hat\tau_{\mathrm{mat},2}$
  asymptotically linear and Normal by martingale theory:
    $\hat\tau_{\mathrm{mat},2} - \tau = n_2^{-1} \sum_{j \in S_2} \psi_{\mathrm{mat},j} + o_P(n_2^{-1/2})$
  not regular (functional forms are not smooth for a fixed number of matches)
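A minimal sketch of the simple (not bias-corrected) matching estimator, assuming Euclidean distance on the matching variable $V$ and $M$ nearest neighbours with replacement. The function name, the tuple-based interface, and the toy data are illustrative assumptions; the bias correction of Abadie and Imbens (2006) is not implemented here.

```python
import math


def matching_estimate(v, a, y, m=1):
    """Simple matching estimator with replacement, without bias correction."""
    diffs = []
    for j in range(len(y)):
        # Candidate matches: all units in the opposite treatment group.
        pool = [k for k in range(len(y)) if a[k] != a[j]]
        # Keep the m candidates closest to unit j in Euclidean distance on V.
        nearest = sorted(pool, key=lambda k: math.dist(v[j], v[k]))[:m]
        imputed = sum(y[k] for k in nearest) / len(nearest)
        # The observed outcome fills one arm, the matched average fills the other.
        y1, y0 = (y[j], imputed) if a[j] == 1 else (imputed, y[j])
        diffs.append(y1 - y0)
    return sum(diffs) / len(diffs)


# Toy data, made up for illustration: one-dimensional V stored as 1-tuples.
v = [(0.0,), (1.0,), (0.1,), (1.1,)]
a = [1, 1, 0, 0]
y = [5.0, 7.0, 2.0, 3.0]
tau_mat = matching_estimate(v, a, y, m=1)
```

Because a control unit can serve as the match for several treated units (and vice versa), the imputed outcomes are not smooth functions of the data, which is the source of the non-regularity noted on the slide.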

A general strategy for efficient estimation combining main and validation data
Validation data (credible): consistent but inefficient estimators
Main big data (powerful): large sample size but error-prone
  large sample size, some information
  naively applying existing estimators: error-prone
How to leverage both data sources to improve efficiency?
A simple idea:
  $n_2^{1/2} \begin{pmatrix} \hat\tau_2 - \tau \\ \hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}} \end{pmatrix} \to_d N\left\{ 0_{L+1}, \begin{pmatrix} v_2 & \Gamma^T \\ \Gamma & V \end{pmatrix} \right\}$
  $\hat\tau_2$: a consistent estimator from the validation data
  $\hat\tau_{2,\mathrm{ep}}$ and $\hat\tau_{1,\mathrm{ep}}$: two error-prone estimators with the same bias
  asymptotically Normal: applies to all the estimators reviewed before
  consistent variance estimators $\hat v_2$, $\hat\Gamma$, and $\hat V$

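Strategy: eliminate conditional bias and improve efficiency
If the sampling distribution holds exactly, then
  $n_2^{1/2}(\hat\tau_2 - \tau) \mid n_2^{1/2}(\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}}) \sim N\{ n_2^{1/2} \Gamma^T V^{-1} (\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}}),\ v_2 - \Gamma^T V^{-1} \Gamma \}$
  conditional bias: $n_2^{1/2} \Gamma^T V^{-1} (\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}})$
  conditional variance: $v_2 - \Gamma^T V^{-1} \Gamma$
Correction for conditional bias
  $\hat\tau = \hat\tau_2 - \hat\Gamma^T \hat V^{-1} (\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}})$
More efficient estimator achieving the conditional variance
  $n_2^{1/2}(\hat\tau - \tau) \to_d N(0, v_2 - \Gamma^T V^{-1} \Gamma)$
Asymptotic variance estimator: $\hat v = (\hat v_2 - \hat\Gamma^T \hat V^{-1} \hat\Gamma) / n_2$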
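The correction step is a one-line computation once the pieces are available. The sketch below covers only the scalar case $L = 1$ (a single error-prone estimator), which is an assumption made to keep the example free of matrix algebra; the function name, argument names, and all numbers are hypothetical.

```python
def combine(tau2, tau2_ep, tau1_ep, v2, gamma, big_v, n2):
    """Bias-corrected estimate and its variance estimate for scalar L = 1."""
    coef = gamma / big_v                      # Gamma^T V^{-1} when L = 1
    tau = tau2 - coef * (tau2_ep - tau1_ep)   # subtract the conditional bias
    var = (v2 - gamma * coef) / n2            # (v2 - Gamma^T V^{-1} Gamma) / n2
    return tau, var


# Made-up inputs for illustration only; they are not results from the talk.
tau_hat, var_hat = combine(
    tau2=0.02, tau2_ep=0.03, tau1_ep=0.025,
    v2=4.0, gamma=3.0, big_v=3.0, n2=100,
)
naive_var = 4.0 / 100  # variance estimate of the validation-only estimator
```

With these inputs the variance estimate drops from `naive_var` to `var_hat`; the reduction grows with the correlation between $\hat\tau_2$ and the error-prone contrast. For $L > 1$ the division by `big_v` would become a linear solve.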

Some remarks on our strategy
$\hat\tau$ is asymptotically optimal among
  linear combinations $\{\hat\tau_2 + \lambda^T (\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}}) : \lambda \in \mathbb{R}^L\}$ (e.g., Fuller, 2009)
  the class of estimators $\{\hat\tau = f(\hat\tau_2, \hat\tau_{1,\mathrm{ep}}, \hat\tau_{2,\mathrm{ep}}) : f(x, y, z)$ is smooth and $\hat\tau$ is consistent for $\tau\}$
Error-prone estimators do not need to be consistent for $\tau$
  $\hat\tau_{d,\mathrm{ep}}$ ($d = 1, 2$) can be a vector in $\mathbb{R}^L$
  the only requirement is $\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}} \to_P 0$
Choice of $\hat\tau_{d,\mathrm{ep}}$ ($d = 1, 2$)
  increasing the dimension will increase the asymptotic efficiency of $\hat\tau$
  increasing the dimension may harm the finite-sample properties
  suggestion: $\hat\tau_{d,\mathrm{ep}}$ is of the same type as $\hat\tau_2$
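The optimality within the linear class can be verified directly. The derivation below is a standard quadratic minimisation, sketched under the joint asymptotic normality with blocks $v_2$, $\Gamma$, and $V$; the shorthand $g(\lambda)$ is introduced here for illustration only.

```latex
\begin{align*}
g(\lambda)
  &= \text{asymptotic variance of }
     n_2^{1/2}\{\hat\tau_2 + \lambda^T(\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}}) - \tau\} \\
  &= v_2 + 2\lambda^T \Gamma + \lambda^T V \lambda, \\
\nabla g(\lambda) &= 2\Gamma + 2V\lambda = 0
  \quad\Longrightarrow\quad \lambda^* = -V^{-1}\Gamma, \\
g(\lambda^*) &= v_2 - 2\Gamma^T V^{-1}\Gamma + \Gamma^T V^{-1}\Gamma
             = v_2 - \Gamma^T V^{-1}\Gamma .
\end{align*}
```

Since $V$ is positive definite, $\lambda^*$ is the unique minimiser, and enlarging $L$ can only increase $\Gamma^T V^{-1} \Gamma$, which matches the remark that a higher-dimensional error-prone estimator improves asymptotic efficiency.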

Comparison with the literature
Related approaches
  Survey calibration weighting (e.g., Deville and Särndal, 1992; Fuller, 2009)
  Generalized method of moments (e.g., Imbens and Lancaster, 1994)
  Constrained empirical likelihood (e.g., Chen and Sitter, 1999)
  Regression analyses of two-phase sampling (e.g., Chatterjee et al., 2016)
  Optimality issues (e.g., Deville and Särndal, 1992)
Advantages of our strategy
  simple, requires only standard software for existing methods
  can deal with estimators not derived from moment conditions, e.g., matching
  does not require a correct model specification of $U$ given $(A, X, Y)$
  coupled with a unified wild bootstrap procedure for inference

An application
Exposure $A$: chronic obstructive pulmonary disease (COPD)
  causes systemic inflammation
  dysregulates a patient's immune function
Outcome $Y$: development of herpes zoster (HZ)
Main data used by Yang et al. (2011): without $U$ (smoking and alcohol consumption)
  2005 Longitudinal Health Insurance Database in Taiwan
  8,486 subjects with COPD ($A = 1$) and 33,944 subjects without ($A = 0$)
Validation data used by Lin and Chen (2014)
  2005 National Health Interview Survey in Taiwan
  comparable to the main study sample
  244 subjects diagnosed with COPD and 904 subjects without

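Average causal effect estimation
Estimator                                            Est   SE    95% CI
$\hat\tau_{\mathrm{reg},2}$                          ...   ...   (-0.0047, 0.0402)
$\hat\tau_{\mathrm{reg},2\&\mathrm{reg}}$            ...   ...   (0.0109, 0.0200)
$\hat\tau_{\mathrm{AIPW},2}$                         ...   ...   (-0.0044, 0.0402)
$\hat\tau_{\mathrm{AIPW},2\&\mathrm{AIPW}}$          ...   ...   (0.0109, 0.0203)
$\hat\tau_{\mathrm{IPW},2}$                          ...   ...   (-0.0048, 0.0398)
$\hat\tau_{\mathrm{IPW},2\&\mathrm{IPW}}$            ...   ...   (0.0108, 0.0202)
$\hat\tau_{\mathrm{mat},2}$                          ...   ...   (-0.0101, 0.0273)
$\hat\tau_{\mathrm{mat},2\&\mathrm{mat}}$            ...   ...   (-0.0011, 0.0183)
Conclusions
  combining the main and validation data improves efficiency
  the matching estimator shows the least improvement from using two-phase sampling (a similar phenomenon appears in simulation)
  on average, COPD increases the probability of HZ by about 1.55 percentage points
Caveat: the causal interpretation relies on the assumption that all confounders are measured in the validation data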

More general two-phase sampling
Let $I_i$ be the indicator of selecting unit $i$ into the validation data
  $I_i$ is the missing data indicator of $U_i$
We have assumed that $S_2$ is a simple random sample from $S_1$
  i.e., $I \perp (A, X, U, Y)$, or $U$ is missing completely at random (MCAR)
We now relax this to allow more general sampling
Assumption
  $\{(I_i, A_i, X_i, U_i, Y_i) : i \in S_1\}$ are IID
  $S_2$ is selected from $S_1$ with a known inclusion probability $\pi = P(I = 1 \mid A, X, U, Y) > c$ for some positive constant $c$
We can allow $\pi$ to depend on unknown parameters

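More general two-phase sampling
Obtain $\hat\tau_{1,\mathrm{ep}}$ as before from $S_1$
Obtain $\hat\tau_2$ and $\hat\tau_{2,\mathrm{ep}}$ using the weighted procedures with sampling weight $\pi_j^{-1}$ for unit $j$ in $S_2$
Theorem
  Under certain regularity conditions, the joint asymptotic normality holds for the sampling-weighted estimators:
  $n_2^{1/2} \begin{pmatrix} \hat\tau_2 - \tau \\ \hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}} \end{pmatrix} \to_d N\left\{ 0_{L+1}, \begin{pmatrix} v_2 & \Gamma^T \\ \Gamma & V \end{pmatrix} \right\}$
The proposed estimator is $\hat\tau = \hat\tau_2 - \hat\Gamma^T \hat V^{-1} (\hat\tau_{2,\mathrm{ep}} - \hat\tau_{1,\mathrm{ep}})$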
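To illustrate what a "weighted procedure" can look like, the sketch below computes a normalised (Hajek-type) weighted mean with weights $\pi_j^{-1}$. Whether the talk normalises by the sum of the weights or by $n_1$ is not stated, so the normalisation is an assumption, as are the function name and the toy numbers.

```python
def weighted_mean(values, pi):
    """Hajek-type mean: each validation unit j gets weight 1 / pi_j."""
    weights = [1.0 / p for p in pi]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Toy example: fitted differences mu_1 - mu_0 for two validation units,
# the second of which was sampled with a lower inclusion probability.
diffs = [1.0, 3.0]
pi = [0.5, 0.25]
tau_reg_weighted = weighted_mean(diffs, pi)

# With equal inclusion probabilities the weights cancel and the
# estimate reduces to the unweighted mean used under simple random sampling.
tau_reg_srs = weighted_mean(diffs, [0.3, 0.3])
```

Units that were less likely to be selected count for more, so the weighted validation sample stands in for the full main sample $S_1$.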

Connection with the missing data literature
Proposition
  $\hat\tau$ has an asymptotic linear form
  $n_1^{1/2}(\hat\tau - \tau) = n_1^{-1/2} \sum_{i \in S_1} \left\{ \frac{I_i}{\pi_i} s(A_i, X_i, U_i, Y_i) - \left( \frac{I_i}{\pi_i} - 1 \right) \kappa(A_i, X_i, Y_i) \right\} + o_P(1)$
  where $s(A_i, X_i, U_i, Y_i)$ is $\psi(A_i, X_i, U_i, Y_i)$ for RAL estimators and $\psi_{\mathrm{mat},i}$ for the matching estimator, and a similar definition applies to the $\phi_i$ term in $\kappa(A_i, X_i, Y_i) = \Gamma^T V^{-1} \phi_i$
Robins et al. (1994) discussed optimality: $\kappa_{\mathrm{opt}}(A, X, Y) = E\{s(A, X, U, Y) \mid A, X, Y\}$
To obtain the optimal estimator, we need a model for $P(U \mid A, X, Y)$

Summary
Combining big data improves the efficiency of estimating the population ACE based on gold-standard validation data, where:
  treatment assignment is ignorable
  sampling selection is ignorable
More data fusion problems for causal inference
  Covariate shift and misalignment
  Sampling selection bias
  Versions of treatment
  Unmeasured confounding
  Complex data structure

Thank you!
