Data Integration for Big Data Analysis for finite population inference


Jae-kwang Kim (ISU), January 23, 2018

What is big data?

Data do not speak for themselves. (Slide diagram: Data → Interpretation → Information → Reproducibility → Knowledge.)

Population and Sample. (Slide diagram: a sample is drawn from the population; the estimator computed from the sample is generalized, via inference, to the population parameter.)

Survey Sampling
Survey: measurement. Sampling: representation.

Table: Survey Methodology and Sampling Statistics

              Survey Methodology              Sampling Statistics
Based on      Psychology, cognitive science   Statistics
Studies       Nonsampling error               Sampling error
Topics        Questionnaire design            Sampling design, estimation

Two wings of survey data

Big Data era: Freeconomics

Survey sample data vs. Big Data

Table: Features

                    Survey sample data   Big Data
Cost                C = C_0 + C_1 n      C is not linear in n
Representativeness  Bias = 0             Bias ≠ 0
Variance            Variance = K/n       Variance = 0

Selection Bias
Finite population: U = {1, ..., N}.
Parameter of interest: Ȳ_N = (1/N) Σ_{i=1}^N y_i.
Big data sample: B ⊂ U, with indicator δ_i = 1 if i ∈ B and δ_i = 0 otherwise.
Estimator: ȳ_B = (1/N_B) Σ_{i=1}^N δ_i y_i, where N_B = Σ_{i=1}^N δ_i is the big data sample size (N_B < N).

MSE of the Big Data Estimator
MSE formula:
  E_δ(ȳ_B − Ȳ_N)² = E_δ(ρ²_{δ,y}) · σ² · (1 − f_B)/f_B,
where ρ_{δ,y} = Corr(δ, Y), σ² = Var(Y), f_B = N_B/N, and E_δ(·) is the expectation with respect to the big data sampling mechanism, which is generally unknown.
If E_δ(ρ_{δ,y}) = 0, then E_δ(ρ²_{δ,y}) = O(1/N_B) and the MSE is of order 1/N_B.
If E_δ(ρ_{δ,y}) ≠ 0, then E_δ(ρ²_{δ,y}) = O(1) and the MSE is of order (1/f_B) − 1.
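The MSE formula rests on an exact finite-population identity, ȳ_B − Ȳ_N = ρ_{δ,y} √((1 − f_B)/f_B) σ (Meng 2018), which can be checked numerically. A minimal numpy sketch, where the simulated population and the self-selection mechanism are illustrative assumptions, not the slides' example:

```python
import numpy as np

rng = np.random.default_rng(2018)
N = 100_000
y = rng.normal(size=N)                      # finite population of y-values

# Self-selection positively correlated with y, creating selection bias
delta = rng.random(N) < 1 / (1 + np.exp(-y))

f_B = delta.mean()                          # big data fraction N_B / N
rho = np.corrcoef(delta, y)[0, 1]           # realized rho_{delta,y}
sigma = y.std()                             # finite-population SD (ddof = 0)

error = y[delta].mean() - y.mean()          # ybar_B - Ybar_N
identity = rho * np.sqrt((1 - f_B) / f_B) * sigma

assert abs(error - identity) < 1e-8         # the identity holds exactly
```

Squaring both sides and taking E_δ recovers the MSE formula above.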

Effective sample size
  n_eff = f_B / {(1 − f_B) E_δ(ρ²_{δ,y})}.
If ρ_{δ,y} = 0.05 and f_B = 1/2, then n_eff = 400. For example, suppose the population size is N = 10,000,000 and we have 50% of the population collected in the big data. If ρ_{δ,y} = 0.05, then the MSE of the big data sample mean equals that of an SRS mean with size n = 400.
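The arithmetic behind the 400 figure is a one-liner; a small sketch (the function name is ours, not from the slides):

```python
def effective_sample_size(rho, f_B):
    """n_eff = f_B / ((1 - f_B) * rho^2): the SRS size whose sample mean
    has the same MSE as the biased big-data sample mean."""
    return f_B / ((1 - f_B) * rho**2)

# rho = 0.05 with half the population observed (f_B = 1/2):
print(effective_sample_size(0.05, 0.5))   # ≈ 400
```

Five million records with ρ = 0.05 thus carry no more information about the mean than 400 SRS draws.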

Paradox of Big Data (Meng 2018)
Confidence interval from the big data sample, ignoring the selection bias:
  CI = (ȳ_B − 1.96 √((1 − f_B) S²/N_B), ȳ_B + 1.96 √((1 − f_B) S²/N_B)).
As N_B → ∞, we have Pr(Ȳ_N ∈ CI) → 0.
Paradox: if one ignores the bias and applies the standard method of estimation, then the bigger the dataset, the more misleading it is for valid statistical inference.

Salvation of Big Data: Data Integration

Data integration: Basic Idea
Two data sets: big data and survey data. The big data may be subject to selection bias. For simplicity, assume a binary Y variable:

           δ = 1    δ = 0
  Y = 1    N_B1     N_C1     N_1
  Y = 0    N_B0     N_C0     N_0
           N_B      N_C      N

where δ_i = 1 if unit i belongs to the big data sample and δ_i = 0 otherwise. Parameter of interest: P = P(Y = 1).

Data integration: Basic Idea (Cont'd)
In addition, we have survey data of size n obtained by SRS, with the following sample-level counts:

           δ = 1    δ = 0
  Y = 1    n_B1     n_C1     n_1
  Y = 0    n_B0     n_C0     n_0
                             n

How can we combine the two data sources?

Combined estimation
Note that
  P(Y = 1) = P(Y = 1 | δ = 1) P(δ = 1) + P(Y = 1 | δ = 0) P(δ = 0).
Three components:
1. P(δ = 1): big data proportion (known).
2. P(Y = 1 | δ = 1) = N_B1/N_B: obtained from the big data.
3. P(Y = 1 | δ = 0): estimated by n_C1/(n_C0 + n_C1) from the survey data.
Final estimator:
  P̂ = P_B W_B + P̂_C (1 − W_B),   (1)
where W_B = N_B/N, P_B = N_B1/N_B, and P̂_C = n_C1/(n_C0 + n_C1).
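Estimator (1) is a direct computation from the two 2×2 tables; a minimal sketch with hypothetical counts (the numbers below are illustrative, not from the slides):

```python
def combined_estimate(N, N_B, N_B1, n_C1, n_C0):
    """Equation (1): P_hat = W_B * P_B + (1 - W_B) * P_C_hat."""
    W_B = N_B / N                     # big data proportion (known)
    P_B = N_B1 / N_B                  # P(Y=1 | delta=1) from the big data
    P_C_hat = n_C1 / (n_C0 + n_C1)    # P(Y=1 | delta=0) from the survey
    return W_B * P_B + (1 - W_B) * P_C_hat

# Hypothetical counts: 80% of the population is in the big data
print(combined_estimate(N=1_000_000, N_B=800_000, N_B1=496_000,
                        n_C1=55, n_C0=45))   # ≈ 0.606
```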

Remark 1
Variance:
  V(P̂) = (1 − W_B)² V(P̂_C) ≐ (1 − W_B) (1/n) P_C(1 − P_C).
If W_B is close to one, this variance is very small. Instead of using P̂_C = n_C1/(n_C0 + n_C1), we can construct a ratio estimator of P_C to improve the efficiency. That is, use
  P̂_C,r = 1/(1 + θ̂_C), where θ̂_C = (N_B0/N_B1)/(n_B0/n_B1) · (n_C0/n_C1).

Remark 2
The combined estimator is essentially a post-stratified estimator using δ as the post-stratification variable. The post-stratification idea applies directly to a continuous Y variable.
Practical issues:
- δ may be obtained inaccurately (due to imperfect matching).
- We may have measurement errors in y in the big data.
- The survey sample may not observe y at all.

Two setups (A: survey sample data, B: big data)
Parameter of interest: θ = Σ_{i∈U} y_i.
Setup One: the probability sample A does not observe the study variable Y; the big data B observes X and Y but is not representative.
Setup Two: the probability sample A does observe the study variable Y.

Data Integration for Setup One
Rivers (2007) idea:
1. Use X to create a nearest-neighbor imputed value y*_i for each unit i ∈ A.
2. Compute θ̂ = Σ_{i∈A} w_i y*_i, where y*_i is the imputed value of y_i for i ∈ A.
- Based on the MAR (missing at random) assumption: f(y | x, δ = 1) = f(y | x).
- The bias may not be negligible if the dimension of x is high (curse of dimensionality).
- The naive variance estimator works well (the estimation error of imputation is asymptotically negligible).
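The two steps above can be sketched in numpy for a scalar x with brute-force distances (a toy illustration of the idea, not Rivers's production matching):

```python
import numpy as np

def rivers_total(x_A, w_A, x_B, y_B):
    """Rivers-type mass imputation: for each survey unit in A, borrow y
    from its nearest big-data neighbor on x, then return the weighted
    total theta_hat = sum_{i in A} w_i * y_i_star."""
    nn = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)  # nearest unit in B
    y_star = y_B[nn]                                         # imputed values
    return np.sum(w_A * y_star)

# Toy data: two survey units, three big-data records (hypothetical numbers)
x_A = np.array([1.0, 2.0]); w_A = np.array([3.0, 3.0])
x_B = np.array([0.9, 2.1, 5.0]); y_B = np.array([10.0, 20.0, 50.0])
print(rivers_total(x_A, w_A, x_B, y_B))   # 3*10 + 3*20 = 90.0
```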

Proposed method 1
1. Obtain δ_i for i ∈ A, by matching or by asking about membership in the big data.
2. Fit a model for P(δ = 1 | x) using sample A.
3. Use θ̂ = Σ_{i∈B} y_i/π̂_i, where π̂_i = P̂(δ_i = 1 | x_i), adjusted to satisfy Σ_{i∈B} 1/π̂_i = N.
- Based on the MAR assumption.
- Requires correct specification of the model for π(x) = P(δ = 1 | x).
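Step 3, given already-fitted propensities, is a short computation; a sketch with hypothetical π̂ values (the model-fitting step is omitted):

```python
import numpy as np

def ipw_total(y_B, pi_hat, N):
    """theta_hat = sum_{i in B} y_i / pi_hat_i, with the inverse weights
    rescaled so that sum_{i in B} 1 / pi_hat_i = N (the adjustment in step 3)."""
    d = 1.0 / pi_hat
    d *= N / d.sum()            # normalize the weights to the population size
    return np.sum(d * y_B)

# Toy example with hypothetical fitted propensities
y_B = np.array([1.0, 2.0, 3.0])
pi_hat = np.array([0.2, 0.5, 0.5])
print(ipw_total(y_B, pi_hat, N=18))   # weights (10, 4, 4) give 30.0
```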

Proposed method 2: doubly robust (DR) estimation
1. Fit a working model for E(Y | x) to get ŷ_i = Ê(Y_i | x_i) for each i ∈ A and i ∈ B.
2. Fit a working model for P(δ = 1 | x) to get π̂_i = P̂(δ_i = 1 | x_i) for each i ∈ B.
3. Use
  θ̂_DR = Σ_{i∈A} w_i ŷ_i + Σ_{i∈B} (y_i − ŷ_i)/π̂_i.
- Based on the MAR assumption.
- Requires only one of the two models to be correctly specified.
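Given the fitted values from steps 1 and 2, the DR total in step 3 is one line; a sketch with hypothetical model fits (the fitting itself is omitted):

```python
import numpy as np

def dr_total(w_A, yhat_A, y_B, yhat_B, pi_hat_B):
    """theta_DR = sum_{i in A} w_i * yhat_i
               + sum_{i in B} (y_i - yhat_i) / pi_hat_i."""
    return np.sum(w_A * yhat_A) + np.sum((y_B - yhat_B) / pi_hat_B)

# Toy inputs (hypothetical fitted values and propensities)
w_A = np.array([2.0, 2.0]);  yhat_A = np.array([1.0, 3.0])
y_B = np.array([2.0, 2.0]);  yhat_B = np.array([1.5, 2.5])
pi_hat_B = np.array([0.5, 0.5])
print(dr_total(w_A, yhat_A, y_B, yhat_B, pi_hat_B))   # 8.0: residuals cancel here
```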

Justification for DR estimation
Let θ̂_HT = Σ_{i∈A} w_i y_i be the Horvitz-Thompson estimator that could be used if y_i were observed in sample A. Note that
  θ̂_DR − θ̂_HT = −Σ_{i∈A} w_i ê_i + Σ_{i∈B} ê_i/π̂_i,
where ê_i = y_i − ŷ_i.
Double robustness:
1. If the model for P(δ = 1 | x) is correctly specified, then
  E_δ{θ̂_DR − θ̂_HT} = −Σ_{i∈A} w_i ê_i + Σ_{i∈U} ê_i,
which is design-unbiased to zero.
2. If the model for E(Y | x) is correctly specified, then E(ê_i) = 0 under MAR.

Data Integration for Setup Two
In Setup Two, the probability sample A observes the study variable Y, and the big data B provides auxiliary information. We are interested in estimating θ = Σ_{i∈U} y_i from the two data sources.

Note that we can compute θ̂_A = Σ_{i∈A} w_i y_i from sample A alone. Thus, unlike Setup One, the goal of data integration is to improve efficiency (i.e., reduce the variance), not to remove the selection bias. How can we incorporate the partial auxiliary information in data B?
1. If B = U, it is an easy problem: calibration weighting.
2. For B ⊂ U, we can treat B as a sub-population and apply the same calibration weighting over A ∩ B.

Calibration weighting in survey sampling
Initial (design) weight: w_i. Final weight: w*_i satisfying
  Σ_{i∈A} w*_i (1, x_i) = Σ_{i∈U} (1, x_i).   (2)
Calibration weighting problem: find the w*_i that minimize
  D(w, w*) = Σ_{i∈A} w_i (w*_i/w_i − 1)²
subject to (2).
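The chi-square-distance problem above has a closed-form (GREG-type) solution: w*_i = w_i(1 + x_i'λ), with λ solving (Σ_{i∈A} w_i x_i x_i') λ = T − Σ_{i∈A} w_i x_i. A minimal numpy sketch with toy totals (the data are illustrative assumptions):

```python
import numpy as np

def calibrate(w, X, totals):
    """Chi-square-distance calibration: returns w_star minimizing
    sum_i w_i (w_star_i / w_i - 1)^2 subject to X.T @ w_star = totals.
    X has one row per sampled unit; include a column of ones so the
    weights also calibrate to the population size."""
    lam = np.linalg.solve(X.T @ (w[:, None] * X), totals - X.T @ w)
    return w * (1.0 + X @ lam)

# Toy sample: equal design weights, calibrated to known N and total of x
w = np.ones(4)
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
w_star = calibrate(w, X, totals=np.array([5.0, 14.0]))
print(X.T @ w_star)   # the calibration constraints now hold: [5. 14.]
```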

Calibration weighting for big data integration
The auxiliary variable x_i is observed only when δ_i = 1. The calibration equation becomes
  Σ_{i∈A} w*_i (1 − δ_i, δ_i, δ_i x_i) = Σ_{i∈U} (1 − δ_i, δ_i, δ_i x_i).   (3)
If y_i = x_i, this reduces to the post-stratification estimator in (1).

Simulation Study: Setup One
Goal: compare four estimators.
1. Naive estimator: mean of sample B.
2. Rivers estimator.
3. Proposed estimator 1 (PS estimator), using propensity score weighting.
4. Proposed estimator 2 (DR estimator), using a working model for E(Y | x) and a working model for P(δ = 1 | x).
Three scenarios for the simulation study:
1. Both models are correct.
2. Only the model for E(Y | x) is correct (i.e., the true P(δ = 1 | x) differs from the working model).
3. Only the model for P(δ = 1 | x) is correct.

Simulation study one: Setup
Outcome regression model:
1. Linear model: y_i = 1 + x_{1,i} + x_{2,i} + ε_i for i = 1, ..., N, where x_{1,i} ~ N(1, 1), x_{2,i} ~ Exp(1), ε_i ~ N(0, 1), N = 1,000,000, and (x_{1,i}, x_{2,i}, ε_i) are pairwise independent.
2. Nonlinear model: y_i = 0.5(x_{1,i} − 1.5)² + x_{2,i} + ε_i, where (x_{1,i}, x_{2,i}, ε_i) are as in the linear model.
Big data sampling mechanism:
1. Linear logistic model: δ_i | p_i ~ Ber(p_i) for i = 1, ..., N, where logit(p_i) = x_{2,i}.
2. Nonlinear logistic model: δ_i | p_i ~ Ber(p_i) for i = 1, ..., N, where logit(p_i) = 0.5 + 0.5(x_{2,i} − 2)².

Simulation Result

Table: Bias, standard error (S.E.), and coverage rate (C.R.) of the four estimators

                      n = 500                  n = 1000
Scenario  Estimator   Bias    S.E.   C.R.      Bias    S.E.   C.R.
I         Naive       0.187   0.001  0.000     0.187   0.001  0.000
          Rivers      0.000   0.077  0.950    −0.002   0.054  0.954
          PS         −0.001   0.023  0.950     0.000   0.016  0.946
          DR         −0.002   0.063  0.950    −0.002   0.044  0.950
II        Naive      −0.097   0.001  0.000    −0.097   0.001  0.000
          Rivers     −0.003   0.077  0.955    −0.001   0.055  0.945
          PS          0.110   0.183  0.986     0.084   0.085  0.996
          DR         −0.001   0.063  0.947     0.000   0.046  0.946
III       Naive       0.187   0.001  0.000     0.187   0.001  0.000
          Rivers      0.000   0.074  0.944     0.000   0.053  0.948
          PS         −0.001   0.022  0.946    −0.001   0.016  0.947
          DR         −0.001   0.050  0.950     0.001   0.035  0.950

Simulation Study: Setup Two
Finite population of size N = 1,000,000:
  x_i ~ N(2, 1),
  y_i = 3 + 0.7(x_i − 2) + e_i,
  y*_i = 2 + 0.9(y_i − 3) + u_i,
where e_i ~ N(0, 0.51) and u_i ~ N(0, 0.5²). Note that y*_i is an inaccurate measurement of y_i.
Sampling mechanism for A: SRS of size n = 500.
Big data sampling mechanism: stratified random sampling.
1. Create two strata using x_i ≤ 2 and x_i > 2.
2. Within each stratum, select n_h elements by SRS independently, where n_1 = 300,000 and n_2 = 200,000.
3. The stratum information is not available to the data analyst.

In sample A, we observe y_i. Two scenarios for sample B:
1. Observe y_i: the big data are subject to selection bias.
2. Observe y*_i: the big data are subject to selection bias and measurement error.
We can identify the elements in A ∩ B.
Three estimators for θ = E(Y):
1. Mean of sample A (Mean A).
2. Mean of sample B (Mean B).
3. Proposed data integration (DI) method using calibration weighting: in scenario one, we calibrate using (1 − δ_i, δ_i y_i); in scenario two, we calibrate using (1 − δ_i, δ_i y*_i).

Simulation Result

Table: Monte Carlo mean, variance, and MSE of the three estimators (true mean = 3.00156)

Scenario  Method       Mean   Variance (×10^-4)   MSE (×10^-4)
1         Mean A       3.00   18.6                19
          Mean B       2.89    0.0                121
          Proposed DI  3.00    8.8                9
2         Mean A       3.00   18.6                19
          Mean B       1.90    0.0                12,130
          Proposed DI  3.00   11.4                11

Discussion
- Big data should not be analyzed naively (the big data paradox!).
- Data integration is a useful tool for harnessing big data for finite population inference.
- Two setups were considered. In Setup One, both the Rivers method and the DR method are promising. In Setup Two, the calibration weighting method is useful.
- Setup One uses the MAR assumption; Setup Two does not need it.
- This is a promising area of research.