Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary Thompson) November 13, 2015

2 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

3 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

4 / 32 (1) Two Independent Samples Two groups: treatment vs control Response Y: Y 1 for treatment and Y 0 for control Sample data on Y: { Y11,, Y 1n1 } Covariates might be involved µ 1 = E(Y 1 ) and µ 0 = E(Y 0 ) and F 1 (t) = P(Y 1 t) and F 0 (t) = P(Y 0 t) S 1 (t) = P(Y 1 > t) = 1 F 1 (t) S 0 (t) = P(Y 0 > t) = 1 F 0 (t) { Y01,, Y 0n0 }

5 / 32 (1) Two Independent Samples Inference on treatment effect: θ = µ 1 µ 0 Inference on survival functions (distribution functions): Y 1 is stochastically larger than Y 0 if S 1 (t) > S 0 (t) for t > 0. Empirical likelihood methods for two-sample problems (Wu and Yan, 2012): Two independent samples with no missing values Two independent samples with responses missing at random Two independent survey samples with no missing values

6 / 32 (2) Pretest-Posttest Studies A very popular approach in medical and social sciences Measure changes resulting from a treatment or an intervention A version of two-sample problems Design I: Paired comparison (less popular one) Select a random sample of n units from the target population; Measure the response Y on all units BEFORE and AFTER the treatment/intervention.

7 / 32 (2) Pretest-Posttest Studies Design II: (commonly used method) A random sample of n units is taken from the target population Measures on certain baseline variables Z are obtained for ALL n individuals (pretest measures) Among the n units, n 1 are randomly selected and assigned to treatment ; Values of the response variable Y are obtained (posttest measures) The other n 0 = n n 1 units are assigned to control ; Values of the response variable Y are also obtained

8 / 32 (2) Pretest-Posttest Studies Y 1 : response under treatment Y 0 : response under control The available data (n = n 1 + n 0 ) i 1 2 n 1 n 1 + 1 n 1 + 2 n Z Z 1 Z 2 Z n1 Z n1 +1 Z n1 +2 Z n Y 1 Y 11 Y 12 Y 1n1 Y 0 Y 0(n1 +1) Y 0(n1 +2) Y 0n Two distinct features of the design: Response Missing-by-Design Availability of baseline information for all n units The Z variables follow the same distributions for both groups (due to randomization)

9 / 32 (2) Pretest-Posttest Studies A two-sample problem with unique features Parameter of primary interest: θ = µ 1 µ 0 Test H 0 : θ = 0 vs H 1 : θ 0 (or θ > 0) Test H 0 : F 1 = F 0 vs H 1 : F 1 < F 0 (OR H 0 : S 1 = S 0 vs H 1 : S 1 > S 0 ) Question: How to effectively use the baseline information and the feature of missing-by-design?

10 / 32 (3) Two-sample Problems for Observational Studies Baseline information (Z) collected for all n units Each unit is assigned to either treatment or control (Missing-by-Design) Assignments to treatment or control are not randomized: Example 1. Patients self-selection of treatment among two alternative choices. Example 2. Voluntary participation in a school smoking-intervention education program. Example 3. Modes of survey data collection: Web versus telephone interview.

11 / 32 An EL Approach to Pretest-Posttest Studies (Huang, Qin and Follmann, JASA, 2008) Parameter of interest: θ = µ 1 µ 0 Find the EL estimators of µ 1 and µ 0 separately Estimate θ by ˆθ = ˆµ 1 ˆµ 0 Estimate the standard error of ˆθ through a bootstrap method Inference on θ = µ 1 µ 0 using (ˆθ θ)/se(ˆθ) Finding ˆµ 1 (and ˆµ 0 ) is the main focus of the HQF paper

12 / 32 An Imputation-based Two-Sample EL Approach Why imputation? More efficient use of baseline information! i 1 2 n 1 n 1 + 1 n 1 + 2 n Z Z 1 Z 2 Z n1 Z n1 +1 Z n1 +2 Z n Y 1 Y 11 Y 12 Y 1n1 Y 0 Y 0(n1 +1) Y 0(n1 +2) Y 0n Regression modelling: Y 1i = Z T i β 1 + ɛ 1i, i = 1,, n, (1) Y 0i = Z T i β 0 + ɛ 0i, i = 1,, n, (2)

13 / 32 An Imputation-based Two-Sample EL Approach Let R i = 1 if i is under treatment, R i = 0 otherwise, ˆβ 1 = ˆβ 0 = ( n ) 1 n R i Z i Z T i R i Z i Y 1i, i=1 i=1 ( n ) 1 n (1 R i )Z i Z T i (1 R i )Z i Y 0i i=1 i=1 Regression imputation: Y 1i = Z T i ˆβ 1, i = n 1 + 1,, n Y 0i = Z T i ˆβ 0, i = 1,, n 1

14 / 32 Imputation-basede EL Inference on θ = µ 1 µ 0 Two augmented samples after imputation: {Ỹ 1i = R i Y 1i + (1 R i )Y 1i, i = 1,, n} {Ỹ 0i = (1 R i )Y 0i + R i Y 0i, i = 1,, n}. Each sample has an enlarged sample size at n = n 1 + n 0 The imputed samples are no longer independent Inferences are under the assumed regression models (the imputation model)

15 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

16 / 32 Two Independent Surveys Two independent survey samples S 1 and S 0 from the same population Two (possibly different) designs: d 1i, i S 1 ; d 0i, i S 0 d 1i = d 1i k S 1 d 1k and d0i = d 0i k S 0 d 0k Response variables Y 1 and Y 0 ; Survey sample data: } } {(y 1i, z 1i ), i S 1 and {(y 0i, z 0i ), i S 0 Parameter of interest: θ = µ y1 µ y0 Design-based estimator of θ: ˆθ 1 = ˆµ y1 ˆµ y0 = i S1 d 1i y 1i i S0 d 0i y 0i

17 / 32 Two-sample Pseudo Empirical Likelihood Two-sample pseudo EL function l(p, q) = 1 d1i log(p 1i ) + 1 d0i log(q 0i ) 2 2 i S 1 i S 0 Normalization constraints i S 1 p 1i = 1 and i S 0 q 0i = 1 Parameter constraint p 1i y 1i q 0i y 0i = θ i S 1 i S 0 Additional constraints on z 1 and z 0 depending on what s available

18 / 32 The ITC Four Country Survey (ITC4) The International Tobacco Control Policy Evaluation Project (The ITC Project) ITC Four Country Survey: Waves 1-7 by telephone interview ITC Four Country Survey: Wave 8, respondents chose to complete the survey either by telephone interview or self-administered web survey Total number of respondents at wave 8: 4507 Number of respondents by telephone: 2709 Number of respondents by web: 1798 The research question: Examine the effect of two different modes for data collection Y 1 : response by web (treatment) Y 0 : response by telephone (control)

19 / 32 Mode Effect in Survey Data Collection Survey sample S of size n; design weights d i, i = 1,, n. Units i = 1,, n 1 chose treatment Units i = n 1 + 1,, n chose control Let R i = 1 if unit i chose treatment, R i = 0 if unit i chose control, i = 1,, n Baseline information z i observed for all i = 1,, n Response variable missing-by-design: Y 1i observed for i = 1,, n 1 Y 0i observed for i = n 1 + 1,, n Point estimate and confidence interval for θ = µ y1 µ y0

20 / 32 Mode Effect Under randomization to treatment and control: E(Y 1 R = 1) E(Y 0 R = 0) = E(Y 1 ) E(Y 0 ) = θ With self-selection of treatment and control: E(Y 1 R = 1) E(Y 1 ), E(Y 0 R = 0) E(Y 0 ) Ignorable treatment assignment (Rosenbaum and Rubin, 1983): (Y 1, Y 0 ) and R are independent given Z Test ignobility using a two-phase sampling technique? (Chen and Kim, 2014)

21 / 32 Mode Effect: Propensity Score Adjustment (PSA) Treatment assignment (self-selection) depends only on Z P(R = 1 Y 1, Y 0, Z = z) = P(R = 1 Z = z) = r(z) Fit a feasible model to obtain the propensity scores ˆr i = ˆr(z i ), i = 1,, n Available data {(y 1i, z i ), i = 1,, n 1 } and { } (y 0i, z i ), i = n 1 + 1,, n Point estimator for θ = µ y1 µ y0 using PSA: ˆθ 2 = n 1 i=1 d i ˆr i y 1i n i=n 1 +1 d i 1 ˆr i y 0i

22 / 32 Pseudo EL Under Propensity Score Adjustment Design weights d i = P(i S), i = 1,, n; d i = d i / k S d k Pseudo empirical likelihood function and constraints l(p, q) = n 1 i=1 n 1 d i ˆr i log(p i ) + p i = 1, i=1 n 1 p i y 1i i=1 n j=n 1 +1 n j=n 1 +1 n j=n 1 +1 d j 1 ˆr j log(q j ) (3) q j = 1 (4) q j y 0j = θ (5) ˆp(θ) and ˆq(θ): Maximizer of (3) under constraints (4) and (5) The maximum ( pseudo EL ) estimator ˆθ 3 : Maximize l ˆp(θ), ˆq(θ) w.r.t. θ

23 / 32 A Simulation Study N = 20, 000; n = 200; Single stage PPS sampling x i1 Bernoulli(0.5); x i2 U[0, 1]; x i3 0.5 + 2 exp(1) π i x i3 ; max π i / min π i = 45 Linear models for the responses y i0 and y i1 : y ik = β 0k + β 1k x i1 + β 2k x i2 + β 3k x i3 + ε i, i = 1, 2,..., N Logistic regression model for R i : r i = P(R i = 1 x i ) ( ri ) log = γ 0 + γ 1 x i1 + γ 2 x i2 + γ 3 x i3, 1 r i i = 1, 2,..., N Three point estimators of θ = µ y1 µ y0 : ˆθ1, ˆθ2, ˆθ3 B = 1000 repeated simulation runs

24 / 32 A Simulation Study Table : Absolute Relative Bias (ARB, %) and Mean Square Error (MSE) θ = µ y1 µ y0 = 1 θ = µ y1 µ y0 = 2 E(R) ˆθ1 ˆθ2 ˆθ3 ˆθ1 ˆθ2 ˆθ3 0.50 ARB 3.2 2.4 2.8 1.4 0.3 0.6 MSE 0.2 0.1 0.1 0.3 0.1 0.1 0.60 ARB 42.9 17.9 10.6 26.7 8.7 4.9 MSE 0.4 0.5 0.2 0.6 0.5 0.2 0.70 ARB 104.8 67.2 14.1 61.6 33.8 6.4 MSE 1.5 12.1 0.7 2.0 12.1 0.7

25 / 32 Post-stratification by Propensity Score With self-selection of treatment and control, the distributions of Z R = 1 and Z R = 0 tend to be different Matching by propensity score is an effective way to balance the distribution of Z between the treated and the untreated (Rosenbaum and Rubin, 1983, 1984) Order the units based on fitted propensity scores ˆr (1) ˆr (2) ˆr (n) Form K strata based on suitable cut-off of the propensity scores (Popular choice: K = 5) Units within the same stratum have similar values of propensity scores

26 / 32 Post-stratification by Propensity Score Post-stratified samples: S = Q 1 Q K Estimated stratum weights ( ) ( ) Ŵ k = / d i, k = 1,, K Population means µ y1 = i Q k d i i S K W k µ y1k and µ y0 = k=1 K W k µ y0k k=1 Treatment and control groups within Q k = S 1k S 0k : R i = 1 if i S 1k and R i = 0 if i S 0k

27 / 32 Post-stratification by Propensity Score For each Q k, the distributions of Z over S 1k and S 0k are approximately the same Balance diagnostics tools are available (Austin, 2008, 2009) The post-stratified estimator of θ: ˆµ y1k = ˆθ = K k=1 Ŵ k ( ˆµ y1k ˆµ y0k ) i S 1k d i y 1i i S 1k d i, ˆµ y0k = i S 0k d i y 0i i S 0k d i Post-stratification provides an effective way of using baseline information on Z

28 / 32 Using Z in Pseudo EL Through Additional Constraints The pseudo EL function for the post-stratified samples l = K Ŵ k i S 1k dik log(p ik ) + k=1 k=1 K Ŵ k i S 0k dik log(q ik ) Constraints p ik = 1, q ik = 1, k = 1,, K i S 1k i S 0k K K Ŵ k p ik y 1i Ŵ k q ik y 0i = θ k=1 i S 1k k=1 i S 0k K K Ŵ k p ik z i = Ŵ k q ik z i i S 1k i S 0k k=1 k=1

29 / 32 Using Z in Pseudo EL Through Imputation For each Q k = S 1k S 0k : (y 1i, z i ), i S 1k plus z i, i S 0k (y 0i, z i ), i S 0k plus z i, i S 1k Build a model using {(y 1i, z i ), i S 1k } Predict y 1i for i S 0k using z i, i S 0k Build a model using {(y 0i, z i ), i S 0k } Predict y 0i for i S 1k using z i, i S 1k Using the two imputed stratified samples for pseudo EL inference

30 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

31 / 32 Concluding Remarks Confidence intervals or hypothesis tests using the EL ratio statistic have better performances than the conventional normal theory methods Performances of imputation-based EL approach to pretest-posttest studies depend on two crucial conditions: Prediction power of the baseline variables Z Reliability of the model used for imputation Linear regression models are convenient, but kernel regression models can also be used Efficient computational procedures for EL are available for practical implementations We are currently conducting further simulation studies on mode effects in survey data collection

32 / 32 Acknowledgement This talk is partially based on Min Chen s PhD dissertation research at the University of Waterloo Chen, M., Wu, C. and Thompson, M.E. (2015). An Imputation Based Empirical Likelihood Approach to Pretest-Posttest Studies. The Canadian Journal of Statistics, 43, 378 402. Chen, M., Wu, C. and Thompson, M.E. (2015). Mann-Whitney Test with Empirical Likelihood Methods for Pretest-Posttest Studies. Revised for Journal of Nonparametric Statistics. Wu, C. and Yan, Y. (2012). Weighted Empirical Likelihood Inference for Two-sample Problems. Statistics and Its Interface, 5, 345 354.