Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Similar documents
Empirical Likelihood Methods for Pretest-Posttest Studies

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

Causal Inference Basics

An Introduction to Causal Analysis on Observational Data using Propensity Scores

Selection on Observables: Propensity Score Matching.

Empirical Likelihood Methods for Sample Survey Data: An Overview

Fractional Imputation in Survey Sampling: A Comparative Review

Data Integration for Big Data Analysis for finite population inference

Propensity Score Methods for Causal Inference

Propensity Score Weighting with Multilevel Data

Robustness to Parametric Assumptions in Missing Data Models

Empirical Likelihood Inference for Two-Sample Problems

Covariate Balancing Propensity Score for General Treatment Regimes

arxiv: v1 [stat.me] 15 May 2011

Combining Non-probability and Probability Survey Samples Through Mass Imputation

New Developments in Nonresponse Adjustment Methods

Recent Advances in the analysis of missing data with non-ignorable missingness

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

MISSING or INCOMPLETE DATA

Propensity Score Analysis with Hierarchical Data

Causal Inference for Mediation Effects

Combining multiple observational data sources to estimate causal eects

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

DATA-ADAPTIVE VARIABLE SELECTION FOR

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

PROPENSITY SCORE MATCHING. Walter Leite

Generalized Pseudo Empirical Likelihood Inferences for Complex Surveys

What s New in Econometrics. Lecture 1

Empirical Likelihood Methods

Empirical likelihood methods in missing response problems and causal interference

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing

Semiparametric Regression Based on Multiple Sources

MISSING or INCOMPLETE DATA

Introduction to Statistical Analysis

Advising on Research Methods: A consultant's companion. Herman J. Ader Gideon J. Mellenbergh with contributions by David J. Hand

Variable selection and machine learning methods in causal inference

Statistical Data Mining and Machine Learning Hilary Term 2016

Survival Analysis for Case-Cohort Studies

DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA

Classification. Chapter Introduction. 6.2 The Bayes classifier

Difference-in-Differences Methods

A comparison of weighted estimators for the population mean. Ye Yang Weighting in surveys group

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics

Mann-Whitney Test with Adjustments to Pre-treatment Variables for Missing Values and Observational Study

Section 9c. Propensity scores. Controlling for bias & confounding in observational studies

Master s Written Examination

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation

Survival Analysis Math 434 Fall 2011

Jong-Min Kim* and Jon E. Anderson. Statistics Discipline Division of Science and Mathematics University of Minnesota at Morris

Strategy of Bayesian Propensity. Score Estimation Approach. in Observational Study

Extending causal inferences from a randomized trial to a target population

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Statistics 135 Fall 2008 Final Exam

Propensity score modelling in observational studies using dimension reduction methods

Outline of GLMs. Definitions

The propensity score with continuous treatments

Master s Written Examination - Solution

Guanglei Hong. University of Chicago. Presentation at the 2012 Atlantic Causal Inference Conference

Propensity Score Methods for Estimating Causal Effects from Complex Survey Data

Balancing Covariates via Propensity Score Weighting

Parametric fractional imputation for missing data analysis

Local Polynomial Wavelet Regression with Missing at Random

Topics and Papers for Spring 14 RIT

Model comparison and selection

Likelihood-based inference with missing data under missing-at-random

AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE

Lecture 12: Application of Maximum Likelihood Estimation:Truncation, Censoring, and Corner Solutions

UNIVERSITÄT POTSDAM Institut für Mathematik

A note on multiple imputation for general purpose estimation

Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary Treatments

Propensity score adjusted method for missing data

Test of Association between Two Ordinal Variables while Adjusting for Covariates

Missing covariate data in matched case-control studies: Do the usual paradigms apply?

OUTCOME REGRESSION AND PROPENSITY SCORES (CHAPTER 15) BIOS Outcome regressions and propensity scores

Two-phase sampling approach to fractional hot deck imputation

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1

Semiparametric Regression Based on Multiple Sources

Causal Sensitivity Analysis for Decision Trees

High Dimensional Propensity Score Estimation via Covariate Balancing

Implementing Matching Estimators for. Average Treatment Effects in STATA

Linear models and their mathematical foundations: Simple linear regression

Introduction An approximated EM algorithm Simulation studies Discussion

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

On the covariate-adjusted estimation for an overall treatment difference with data from a randomized comparative clinical trial

Probabilistic Index Models

Introduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017

Sampling Techniques. Esra Akdeniz. February 9th, 2016

Bounds on Causal Effects in Three-Arm Trials with Non-compliance. Jing Cheng Dylan Small

Lecture 5: Poisson and logistic regression

Ratio of Mediator Probability Weighting for Estimating Natural Direct and Indirect Effects

Cox s proportional hazards model and Cox s partial likelihood

The Effective Use of Complete Auxiliary Information From Survey Data

Transcription:

1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary Thompson) November 13, 2015

2 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

3 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

4 / 32 (1) Two Independent Samples Two groups: treatment vs control Response Y: Y 1 for treatment and Y 0 for control Sample data on Y: { Y11,, Y 1n1 } Covariates might be involved µ 1 = E(Y 1 ) and µ 0 = E(Y 0 ) and F 1 (t) = P(Y 1 t) and F 0 (t) = P(Y 0 t) S 1 (t) = P(Y 1 > t) = 1 F 1 (t) S 0 (t) = P(Y 0 > t) = 1 F 0 (t) { Y01,, Y 0n0 }

5 / 32 (1) Two Independent Samples Inference on treatment effect: θ = µ 1 µ 0 Inference on survival functions (distribution functions): Y 1 is stochastically larger than Y 0 if S 1 (t) > S 0 (t) for t > 0. Empirical likelihood methods for two-sample problems (Wu and Yan, 2012): Two independent samples with no missing values Two independent samples with responses missing at random Two independent survey samples with no missing values

6 / 32 (2) Pretest-Posttest Studies A very popular approach in medical and social sciences Measure changes resulting from a treatment or an intervention A version of two-sample problems Design I: Paired comparison (less popular one) Select a random sample of n units from the target population; Measure the response Y on all units BEFORE and AFTER the treatment/intervention.

7 / 32 (2) Pretest-Posttest Studies Design II: (commonly used method) A random sample of n units is taken from the target population Measures on certain baseline variables Z are obtained for ALL n individuals (pretest measures) Among the n units, n 1 are randomly selected and assigned to treatment ; Values of the response variable Y are obtained (posttest measures) The other n 0 = n n 1 units are assigned to control ; Values of the response variable Y are also obtained

8 / 32 (2) Pretest-Posttest Studies Y 1 : response under treatment Y 0 : response under control The available data (n = n 1 + n 0 ) i 1 2 n 1 n 1 + 1 n 1 + 2 n Z Z 1 Z 2 Z n1 Z n1 +1 Z n1 +2 Z n Y 1 Y 11 Y 12 Y 1n1 Y 0 Y 0(n1 +1) Y 0(n1 +2) Y 0n Two distinct features of the design: Response Missing-by-Design Availability of baseline information for all n units The Z variables follow the same distributions for both groups (due to randomization)

9 / 32 (2) Pretest-Posttest Studies A two-sample problem with unique features Parameter of primary interest: θ = µ 1 µ 0 Test H 0 : θ = 0 vs H 1 : θ 0 (or θ > 0) Test H 0 : F 1 = F 0 vs H 1 : F 1 < F 0 (OR H 0 : S 1 = S 0 vs H 1 : S 1 > S 0 ) Question: How to effectively use the baseline information and the feature of missing-by-design?

10 / 32 (3) Two-sample Problems for Observational Studies Baseline information (Z) collected for all n units Each unit is assigned to either treatment or control (Missing-by-Design) Assignments to treatment or control are not randomized: Example 1. Patients self-selection of treatment among two alternative choices. Example 2. Voluntary participation in a school smoking-intervention education program. Example 3. Modes of survey data collection: Web versus telephone interview.

11 / 32 An EL Approach to Pretest-Posttest Studies (Huang, Qin and Follmann, JASA, 2008) Parameter of interest: θ = µ 1 µ 0 Find the EL estimators of µ 1 and µ 0 separately Estimate θ by ˆθ = ˆµ 1 ˆµ 0 Estimate the standard error of ˆθ through a bootstrap method Inference on θ = µ 1 µ 0 using (ˆθ θ)/se(ˆθ) Finding ˆµ 1 (and ˆµ 0 ) is the main focus of the HQF paper

12 / 32 An Imputation-based Two-Sample EL Approach Why imputation? More efficient use of baseline information! i 1 2 n 1 n 1 + 1 n 1 + 2 n Z Z 1 Z 2 Z n1 Z n1 +1 Z n1 +2 Z n Y 1 Y 11 Y 12 Y 1n1 Y 0 Y 0(n1 +1) Y 0(n1 +2) Y 0n Regression modelling: Y 1i = Z T i β 1 + ɛ 1i, i = 1,, n, (1) Y 0i = Z T i β 0 + ɛ 0i, i = 1,, n, (2)

13 / 32 An Imputation-based Two-Sample EL Approach Let R i = 1 if i is under treatment, R i = 0 otherwise, ˆβ 1 = ˆβ 0 = ( n ) 1 n R i Z i Z T i R i Z i Y 1i, i=1 i=1 ( n ) 1 n (1 R i )Z i Z T i (1 R i )Z i Y 0i i=1 i=1 Regression imputation: Y 1i = Z T i ˆβ 1, i = n 1 + 1,, n Y 0i = Z T i ˆβ 0, i = 1,, n 1

14 / 32 Imputation-basede EL Inference on θ = µ 1 µ 0 Two augmented samples after imputation: {Ỹ 1i = R i Y 1i + (1 R i )Y 1i, i = 1,, n} {Ỹ 0i = (1 R i )Y 0i + R i Y 0i, i = 1,, n}. Each sample has an enlarged sample size at n = n 1 + n 0 The imputed samples are no longer independent Inferences are under the assumed regression models (the imputation model)

15 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

16 / 32 Two Independent Surveys Two independent survey samples S 1 and S 0 from the same population Two (possibly different) designs: d 1i, i S 1 ; d 0i, i S 0 d 1i = d 1i k S 1 d 1k and d0i = d 0i k S 0 d 0k Response variables Y 1 and Y 0 ; Survey sample data: } } {(y 1i, z 1i ), i S 1 and {(y 0i, z 0i ), i S 0 Parameter of interest: θ = µ y1 µ y0 Design-based estimator of θ: ˆθ 1 = ˆµ y1 ˆµ y0 = i S1 d 1i y 1i i S0 d 0i y 0i

17 / 32 Two-sample Pseudo Empirical Likelihood Two-sample pseudo EL function l(p, q) = 1 d1i log(p 1i ) + 1 d0i log(q 0i ) 2 2 i S 1 i S 0 Normalization constraints i S 1 p 1i = 1 and i S 0 q 0i = 1 Parameter constraint p 1i y 1i q 0i y 0i = θ i S 1 i S 0 Additional constraints on z 1 and z 0 depending on what s available

18 / 32 The ITC Four Country Survey (ITC4) The International Tobacco Control Policy Evaluation Project (The ITC Project) ITC Four Country Survey: Waves 1-7 by telephone interview ITC Four Country Survey: Wave 8, respondents chose to complete the survey either by telephone interview or self-administered web survey Total number of respondents at wave 8: 4507 Number of respondents by telephone: 2709 Number of respondents by web: 1798 The research question: Examine the effect of two different modes for data collection Y 1 : response by web (treatment) Y 0 : response by telephone (control)

19 / 32 Mode Effect in Survey Data Collection Survey sample S of size n; design weights d i, i = 1,, n. Units i = 1,, n 1 chose treatment Units i = n 1 + 1,, n chose control Let R i = 1 if unit i chose treatment, R i = 0 if unit i chose control, i = 1,, n Baseline information z i observed for all i = 1,, n Response variable missing-by-design: Y 1i observed for i = 1,, n 1 Y 0i observed for i = n 1 + 1,, n Point estimate and confidence interval for θ = µ y1 µ y0

20 / 32 Mode Effect Under randomization to treatment and control: E(Y 1 R = 1) E(Y 0 R = 0) = E(Y 1 ) E(Y 0 ) = θ With self-selection of treatment and control: E(Y 1 R = 1) E(Y 1 ), E(Y 0 R = 0) E(Y 0 ) Ignorable treatment assignment (Rosenbaum and Rubin, 1983): (Y 1, Y 0 ) and R are independent given Z Test ignobility using a two-phase sampling technique? (Chen and Kim, 2014)

21 / 32 Mode Effect: Propensity Score Adjustment (PSA) Treatment assignment (self-selection) depends only on Z P(R = 1 Y 1, Y 0, Z = z) = P(R = 1 Z = z) = r(z) Fit a feasible model to obtain the propensity scores ˆr i = ˆr(z i ), i = 1,, n Available data {(y 1i, z i ), i = 1,, n 1 } and { } (y 0i, z i ), i = n 1 + 1,, n Point estimator for θ = µ y1 µ y0 using PSA: ˆθ 2 = n 1 i=1 d i ˆr i y 1i n i=n 1 +1 d i 1 ˆr i y 0i

22 / 32 Pseudo EL Under Propensity Score Adjustment Design weights d i = P(i S), i = 1,, n; d i = d i / k S d k Pseudo empirical likelihood function and constraints l(p, q) = n 1 i=1 n 1 d i ˆr i log(p i ) + p i = 1, i=1 n 1 p i y 1i i=1 n j=n 1 +1 n j=n 1 +1 n j=n 1 +1 d j 1 ˆr j log(q j ) (3) q j = 1 (4) q j y 0j = θ (5) ˆp(θ) and ˆq(θ): Maximizer of (3) under constraints (4) and (5) The maximum ( pseudo EL ) estimator ˆθ 3 : Maximize l ˆp(θ), ˆq(θ) w.r.t. θ

23 / 32 A Simulation Study N = 20, 000; n = 200; Single stage PPS sampling x i1 Bernoulli(0.5); x i2 U[0, 1]; x i3 0.5 + 2 exp(1) π i x i3 ; max π i / min π i = 45 Linear models for the responses y i0 and y i1 : y ik = β 0k + β 1k x i1 + β 2k x i2 + β 3k x i3 + ε i, i = 1, 2,..., N Logistic regression model for R i : r i = P(R i = 1 x i ) ( ri ) log = γ 0 + γ 1 x i1 + γ 2 x i2 + γ 3 x i3, 1 r i i = 1, 2,..., N Three point estimators of θ = µ y1 µ y0 : ˆθ1, ˆθ2, ˆθ3 B = 1000 repeated simulation runs

24 / 32 A Simulation Study Table : Absolute Relative Bias (ARB, %) and Mean Square Error (MSE) θ = µ y1 µ y0 = 1 θ = µ y1 µ y0 = 2 E(R) ˆθ1 ˆθ2 ˆθ3 ˆθ1 ˆθ2 ˆθ3 0.50 ARB 3.2 2.4 2.8 1.4 0.3 0.6 MSE 0.2 0.1 0.1 0.3 0.1 0.1 0.60 ARB 42.9 17.9 10.6 26.7 8.7 4.9 MSE 0.4 0.5 0.2 0.6 0.5 0.2 0.70 ARB 104.8 67.2 14.1 61.6 33.8 6.4 MSE 1.5 12.1 0.7 2.0 12.1 0.7

25 / 32 Post-stratification by Propensity Score With self-selection of treatment and control, the distributions of Z R = 1 and Z R = 0 tend to be different Matching by propensity score is an effective way to balance the distribution of Z between the treated and the untreated (Rosenbaum and Rubin, 1983, 1984) Order the units based on fitted propensity scores ˆr (1) ˆr (2) ˆr (n) Form K strata based on suitable cut-off of the propensity scores (Popular choice: K = 5) Units within the same stratum have similar values of propensity scores

26 / 32 Post-stratification by Propensity Score Post-stratified samples: S = Q 1 Q K Estimated stratum weights ( ) ( ) Ŵ k = / d i, k = 1,, K Population means µ y1 = i Q k d i i S K W k µ y1k and µ y0 = k=1 K W k µ y0k k=1 Treatment and control groups within Q k = S 1k S 0k : R i = 1 if i S 1k and R i = 0 if i S 0k

27 / 32 Post-stratification by Propensity Score For each Q k, the distributions of Z over S 1k and S 0k are approximately the same Balance diagnostics tools are available (Austin, 2008, 2009) The post-stratified estimator of θ: ˆµ y1k = ˆθ = K k=1 Ŵ k ( ˆµ y1k ˆµ y0k ) i S 1k d i y 1i i S 1k d i, ˆµ y0k = i S 0k d i y 0i i S 0k d i Post-stratification provides an effective way of using baseline information on Z

28 / 32 Using Z in Pseudo EL Through Additional Constraints The pseudo EL function for the post-stratified samples l = K Ŵ k i S 1k dik log(p ik ) + k=1 k=1 K Ŵ k i S 0k dik log(q ik ) Constraints p ik = 1, q ik = 1, k = 1,, K i S 1k i S 0k K K Ŵ k p ik y 1i Ŵ k q ik y 0i = θ k=1 i S 1k k=1 i S 0k K K Ŵ k p ik z i = Ŵ k q ik z i i S 1k i S 0k k=1 k=1

29 / 32 Using Z in Pseudo EL Through Imputation For each Q k = S 1k S 0k : (y 1i, z i ), i S 1k plus z i, i S 0k (y 0i, z i ), i S 0k plus z i, i S 1k Build a model using {(y 1i, z i ), i S 1k } Predict y 1i for i S 0k using z i, i S 0k Build a model using {(y 0i, z i ), i S 0k } Predict y 0i for i S 1k using z i, i S 1k Using the two imputed stratified samples for pseudo EL inference

30 / 32 1 Empirical Likelihood for Two-sample Problems 2 Survey Samples for Observational Studies 3 Concluding Remarks

31 / 32 Concluding Remarks Confidence intervals or hypothesis tests using the EL ratio statistic have better performances than the conventional normal theory methods Performances of imputation-based EL approach to pretest-posttest studies depend on two crucial conditions: Prediction power of the baseline variables Z Reliability of the model used for imputation Linear regression models are convenient, but kernel regression models can also be used Efficient computational procedures for EL are available for practical implementations We are currently conducting further simulation studies on mode effects in survey data collection

32 / 32 Acknowledgement This talk is partially based on Min Chen s PhD dissertation research at the University of Waterloo Chen, M., Wu, C. and Thompson, M.E. (2015). An Imputation Based Empirical Likelihood Approach to Pretest-Posttest Studies. The Canadian Journal of Statistics, 43, 378 402. Chen, M., Wu, C. and Thompson, M.E. (2015). Mann-Whitney Test with Empirical Likelihood Methods for Pretest-Posttest Studies. Revised for Journal of Nonparametric Statistics. Wu, C. and Yan, Y. (2012). Weighted Empirical Likelihood Inference for Two-sample Problems. Statistics and Its Interface, 5, 345 354.