Big Data Analysis for Finite Population Inference
Jae-kwang Kim
ISU
January 23, 2018
What is big data?
Data do not speak for themselves
[Diagram: Data → (Interpretation) → Information → (Reproducibility) → Knowledge]
Population and Sample
[Diagram: Sample/Estimator → (Inference, Generalization) → Population/Parameter]
Survey Sampling
- Survey: Measurement
- Sampling: Representation

Table: Survey Methodology and Sampling Statistics

              Survey Methodology               Sampling Statistics
  Rooted in   Psychology, Cognitive Science    Statistics
  Studies     Nonsampling error                Sampling error
  Topics      Questionnaire design             Sampling design, estimation
Two wings of survey data
Big Data
- Big Data era: freeconomics
Big Data: Survey Sample Data vs. Big Data

Table: Features

                       Survey sample data     Big Data
  Cost function        C = C_0 + C_1 n        C is not linear in n
  Representativeness   Bias = 0               Bias ≠ 0
  Variance             Variance = K/n         Variance ≈ 0
Big Data: Selection Bias
- Finite population: $U = \{1, \ldots, N\}$.
- Parameter of interest: $\bar{Y}_N = N^{-1} \sum_{i=1}^N y_i$.
- Big data sample: $B \subset U$, with
  $\delta_i = 1$ if $i \in B$ and $\delta_i = 0$ otherwise.
- Estimator: $\bar{y}_B = N_B^{-1} \sum_{i=1}^N \delta_i y_i$, where $N_B = \sum_{i=1}^N \delta_i$ is the big data sample size ($N_B < N$).
Big Data: MSE of the Big Data Estimator
- MSE formula:
$$E_\delta(\bar{y}_B - \bar{Y}_N)^2 = E_\delta(\rho^2_{\delta,y}) \cdot \sigma^2 \cdot \frac{1 - f_B}{f_B}$$
  where $\rho_{\delta,y} = \mathrm{Corr}(\delta, Y)$, $\sigma^2 = \mathrm{Var}(Y)$, $f_B = N_B/N$, and $E_\delta(\cdot)$ is the expectation with respect to the big data sampling mechanism, which is generally unknown.
- If $E_\delta(\rho_{\delta,y}) = 0$, then $E_\delta(\rho^2_{\delta,y}) = O(N_B^{-1})$ and the MSE is of order $1/N_B$.
- If $E_\delta(\rho_{\delta,y}) \neq 0$, then $E_\delta(\rho^2_{\delta,y}) = O(1)$ and the MSE is of order $f_B^{-1} - 1$.
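A minimal sketch verifying the identity behind this formula on a synthetic population: for any realized δ, the error of the big data mean equals the finite-population correlation between δ and y times σ√((1 − f_B)/f_B). The population and the self-selection rule below are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(2018)
N = 1_000_000
y = rng.normal(3, 1, N)

p = 1 / (1 + np.exp(-(y - 3)))      # selection tied to y: larger y, more likely in B
delta = rng.random(N) < p           # big data membership indicator

f_B = delta.mean()
err = y[delta].mean() - y.mean()    # realized error of the big data mean

rho = np.corrcoef(delta.astype(float), y)[0, 1]   # finite-population Corr(delta, y)
print(err, rho * y.std() * np.sqrt((1 - f_B) / f_B))   # the two values agree
```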
Big Data: Effective Sample Size
$$n_{\mathrm{eff}} = \frac{f_B}{1 - f_B} \cdot \frac{1}{E_\delta(\rho^2_{\delta,y})}$$
- If $\rho_{\delta,y} = 0.05$ and $f_B = 1/2$, then $n_{\mathrm{eff}} = 400$.
- For example, suppose the population size is $N = 10{,}000{,}000$ and we have 50% of the population collected in the big data. If $\rho_{\delta,y} = 0.05$, the MSE of the big data sample mean equals that of an SRS mean with sample size $n = 400$.
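A quick numeric check of the formula with the slide's values:

```python
# n_eff = f_B / (1 - f_B) / rho**2, using rho = 0.05 and f_B = 1/2
rho, f_B = 0.05, 0.5
n_eff = f_B / (1 - f_B) / rho**2
print(n_eff)   # 400.0: 5,000,000 big data records behave like an SRS of 400
```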
Big Data: Paradox of Big Data (Meng 2018)
- Confidence interval using the big data sample (ignoring the selection bias):
$$CI = \left(\bar{y}_B - 1.96\sqrt{(1 - f_B) S^2 / N_B},\ \ \bar{y}_B + 1.96\sqrt{(1 - f_B) S^2 / N_B}\right)$$
- As $N_B \to \infty$, we have $\Pr(\bar{Y}_N \in CI) \to 0$.
- Paradox: if one ignores the bias and applies the standard method of estimation, the bigger the dataset, the more misleading it is for valid statistical inference.
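A minimal Monte Carlo sketch of the paradox: the naive 95% CI's coverage of the finite-population mean collapses as N (hence N_B) grows. The mild self-selection mechanism is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def coverage(N, reps=500):
    hits = 0
    for _ in range(reps):
        y = rng.normal(3, 1, N)
        p = 1 / (1 + np.exp(-0.2 * (y - 3)))   # weak selection tied to y
        b = y[rng.random(N) < p]               # big data sample
        f_B, N_B = b.size / N, b.size
        half = 1.96 * np.sqrt((1 - f_B) * b.var(ddof=1) / N_B)
        hits += abs(b.mean() - y.mean()) <= half
    return hits / reps

for N in (100, 1_000, 10_000, 100_000):
    print(N, coverage(N))   # coverage drifts from near nominal toward 0
```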
Salvation of Big Data: Data Integration
Data Integration: Basic Idea
- Two data sets: big data and survey data.
- Big data may be subject to selection bias.
- For simplicity, assume a binary Y variable:

              δ = 1    δ = 0
    Y = 1     N_B1     N_C1     N_1
    Y = 0     N_B0     N_C0     N_0
              N_B      N_C      N

  where δ_i = 1 if unit i belongs to the big data sample and δ_i = 0 otherwise.
- Parameter of interest: P = P(Y = 1).
Data Integration: Basic Idea (Cont'd)
- In addition, we have a survey sample of size n, drawn by SRS, with the following sample-level counts:

              δ = 1    δ = 0
    Y = 1     n_B1     n_C1     n_1
    Y = 0     n_B0     n_C0     n_0
                                n

- How do we combine the two data sources?
Data Integration: Combined Estimation
- Note that
$$P(Y = 1) = P(Y = 1 \mid \delta = 1) P(\delta = 1) + P(Y = 1 \mid \delta = 0) P(\delta = 0).$$
- Three components:
  1. P(δ = 1): big data proportion (known).
  2. P(Y = 1 | δ = 1) = N_B1/N_B: obtained from the big data.
  3. P(Y = 1 | δ = 0): estimated by n_C1/(n_C0 + n_C1) from the survey data.
- Final estimator:
$$\hat{P} = P_B W_B + \hat{P}_C (1 - W_B) \quad (1)$$
  where $W_B = N_B/N$, $P_B = N_{B1}/N_B$, and $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$.
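A minimal sketch of the combined estimator (1). All counts below are hypothetical; in practice N, N_B, N_B1 would come from the big data and n_C0, n_C1 from the survey.

```python
# hypothetical counts for illustration only
N, N_B, N_B1 = 1_000_000, 600_000, 240_000
n_C0, n_C1 = 280, 120                # survey units with delta = 0

W_B = N_B / N                        # P(delta = 1), known big data proportion
P_B = N_B1 / N_B                     # P(Y = 1 | delta = 1), exact from big data
P_C_hat = n_C1 / (n_C0 + n_C1)       # P(Y = 1 | delta = 0), survey estimate

P_hat = P_B * W_B + P_C_hat * (1 - W_B)
print(P_hat)                         # 0.24 * 0.6 + 0.30 * 0.4 = 0.264
```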
Remark 1
- Variance:
$$V(\hat{P}) = (1 - W_B)^2 V(\hat{P}_C) \doteq (1 - W_B) \frac{1}{n} P_C (1 - P_C).$$
- If W_B is close to one, this variance is very small.
- Instead of using $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$, we can construct a ratio estimator of P_C to improve the efficiency. That is, use
$$\hat{P}_{C,r} = \frac{1}{1 + \hat{\theta}_C}, \quad \text{where } \hat{\theta}_C = \frac{N_{B0}/N_{B1}}{n_{B0}/n_{B1}} \cdot \frac{n_{C0}}{n_{C1}}.$$
Remark 2
- The combined estimator is essentially a post-stratified estimator using δ as a post-stratification variable.
- The post-stratification idea is directly applicable to a continuous Y variable.
Practical issues:
- δ may be obtained inaccurately (due to imperfect matching).
- We may have measurement errors in y in the big data.
- The survey sample may not observe y at all.
Two Setups (A: survey sample data, B: big data)
- Parameter of interest: $\theta = \sum_{i \in U} y_i$.

Table: Setup One

    Data    X    Y    Represent?
    A       ✓         ✓
    B       ✓    ✓

  The probability sample does not observe the study variable.

Table: Setup Two

    Data    X    Y    Represent?
    A       ✓    ✓    ✓
    B       ✓

  The probability sample does observe the study variable.
Data Integration for Setup One: Rivers (2007) Idea
1. Use X to create a nearest-neighbor imputed value for each unit i ∈ A: match each i ∈ A to its closest unit in B in terms of x and borrow that unit's y.
2. Compute $\hat{\theta} = \sum_{i \in A} w_i y_i^*$, where $y_i^*$ is the imputed value of $y_i$ for $i \in A$.
- Based on the MAR (missing at random) assumption: f(y | x, δ = 1) = f(y | x).
- The bias may not be negligible if the dimension of x is high (curse of dimensionality).
- The naive variance estimator works well (the estimation error is asymptotically negligible).
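A minimal sketch of the matching idea with a scalar x. The population, the self-selection rule for B, and the SRS design for A are assumed for illustration; MAR holds here because selection depends only on x.

```python
import numpy as np

def rivers_total(x_A, w_A, x_B, y_B):
    # nearest neighbor in B for each unit of A, by distance in x
    idx = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)
    return np.sum(w_A * y_B[idx])            # weighted total of imputed y

rng = np.random.default_rng(7)
N, n = 20_000, 500
x = rng.normal(2, 1, N)
y = 3 + 0.7 * (x - 2) + rng.normal(0, 0.7, N)
delta = rng.random(N) < 1 / (1 + np.exp(-(x - 2)))   # biased big data B
a = rng.choice(N, n, replace=False)                  # SRS for sample A
theta_hat = rivers_total(x[a], np.full(n, N / n), x[delta], y[delta])
print(theta_hat / N, y.mean())               # estimated vs. true population mean
```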
Data Integration for Setup One: Proposed Method 1
1. Obtain δ_i for i ∈ A, by matching or by asking about big data membership.
2. Fit a model for P(δ = 1 | x) using sample A.
3. Use $\hat{\theta} = \sum_{i \in B} \hat{\pi}_i^{-1} y_i$, where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$ is adjusted to satisfy $\sum_{i \in B} \hat{\pi}_i^{-1} = N$.
- Based on the MAR assumption.
- Requires correct specification of the model for π(x) = P(δ = 1 | x).
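A minimal sketch of this propensity score (PS) estimator under assumed data: δ is observed on sample A (e.g., via matching), and the propensity model is a survey-weighted logistic regression fit by Newton-Raphson, with no external libraries assumed.

```python
import numpy as np

def ps_total(x_A, delta_A, w_A, x_B, y_B, N, steps=25):
    X = np.column_stack([np.ones_like(x_A), x_A])
    beta = np.zeros(2)
    for _ in range(steps):                    # weighted Newton-Raphson
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (w_A * (delta_A - p))
        hess = (X * (w_A * p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    pi_B = 1 / (1 + np.exp(-np.column_stack([np.ones_like(x_B), x_B]) @ beta))
    d = 1 / pi_B
    d *= N / d.sum()                          # adjust so weights sum to N
    return np.sum(d * y_B)

rng = np.random.default_rng(7)
N, n = 20_000, 500
x = rng.normal(2, 1, N)
y = 3 + 0.7 * (x - 2) + rng.normal(0, 0.7, N)
delta = (rng.random(N) < 1 / (1 + np.exp(-(x - 2)))).astype(float)
a = rng.choice(N, n, replace=False)
est = ps_total(x[a], delta[a], np.full(n, N / n), x[delta > 0], y[delta > 0], N)
print(est / N, y.mean())
```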
Data Integration for Setup One: Proposed Method 2, Doubly Robust (DR) Estimation
1. Fit a working model for E(Y | x) to get $\hat{y}_i = \hat{E}(Y_i \mid x_i)$ for each i ∈ A and i ∈ B.
2. Fit a working model for P(δ = 1 | x) to get $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$ for each i ∈ B.
3. Use
$$\hat{\theta}_{DR} = \sum_{i \in A} w_i \hat{y}_i + \sum_{i \in B} \hat{\pi}_i^{-1} (y_i - \hat{y}_i).$$
- Based on the MAR assumption.
- Requires only one of the two models to be correctly specified.
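A minimal sketch of the DR estimator under assumed inputs: the working outcome model is an OLS fit of y on (1, x) within B, and pi_B holds estimated propensities for the B units (e.g., from the Newton fit sketched above).

```python
import numpy as np

def dr_total(x_A, w_A, x_B, y_B, pi_B):
    XB = np.column_stack([np.ones_like(x_B), x_B])
    coef, *_ = np.linalg.lstsq(XB, y_B, rcond=None)   # working E(Y | x)
    yhat_A = np.column_stack([np.ones_like(x_A), x_A]) @ coef
    resid_B = y_B - XB @ coef
    # projection term over A plus inverse-propensity-weighted residuals in B
    return np.sum(w_A * yhat_A) + np.sum(resid_B / pi_B)
```

If the outcome model is correct, the residual term is centered near zero even when pi_B is off; if the propensities are correct, the residual term repairs a misspecified outcome fit. That is the double robustness justified on the next slide.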
Justification for DR Estimation
- Let $\hat{\theta}_{HT} = \sum_{i \in A} w_i y_i$ be the Horvitz-Thompson estimator that could be used if $y_i$ were observed in sample A.
- Note that
$$\hat{\theta}_{DR} - \hat{\theta}_{HT} = -\sum_{i \in A} w_i \hat{e}_i + \sum_{i \in B} \hat{\pi}_i^{-1} \hat{e}_i,$$
  where $\hat{e}_i = y_i - \hat{y}_i$.
Double robustness:
1. If the model for P(δ = 1 | x) is correctly specified, then
$$E_\delta\{\hat{\theta}_{DR} - \hat{\theta}_{HT}\} \doteq -\sum_{i \in A} w_i \hat{e}_i + \sum_{i \in U} \hat{e}_i,$$
   which is design-unbiased for zero.
2. If the model for E(Y | x) is correctly specified, then $E(\hat{e}_i) = 0$ under MAR.
Data Integration for Setup Two

Table: Setup Two

    Data    X    Y
    A       ✓    ✓
    B       ✓

We are interested in estimating $\theta = \sum_{i \in U} y_i$ from the two data sources.
Data Integration for Setup Two
- Note that we can compute $\hat{\theta}_A = \sum_{i \in A} w_i y_i$ from sample A alone. Thus, unlike Setup One, the goal of data integration is to improve the efficiency (i.e., reduce the variance), not to reduce the selection bias.
- How do we incorporate the partial auxiliary information in data B?
  1. If B = U, this is an easy problem: calibration weighting.
  2. For B ⊂ U, we can treat B as a subpopulation and apply the same calibration weighting to A ∩ B.
Calibration Weighting in Survey Sampling
- Initial (design) weight: $w_i$.
- Final weight: $w_i^*$ satisfying
$$\sum_{i \in A} w_i^* (1, x_i) = \sum_{i \in U} (1, x_i). \quad (2)$$
- Calibration weighting problem: find $w_i^*$ that minimize
$$D(w^*, w) = \sum_{i \in A} w_i \left( \frac{w_i^*}{w_i} - 1 \right)^2$$
  subject to (2).
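A minimal sketch of this chi-square-distance calibration, which has the closed form $w_i^* = w_i(1 + z_i'\lambda)$ with λ chosen to meet the benchmark totals exactly. The population, sample, and benchmarks below are assumptions for illustration.

```python
import numpy as np

def calibrate(z, w, T):
    # z: n x p calibration variables (first column all ones)
    # w: design weights; T: population totals of z
    lam = np.linalg.solve((z * w[:, None]).T @ z, T - w @ z)
    return w * (1 + z @ lam)

rng = np.random.default_rng(3)
N, n = 10_000, 400
x = rng.gamma(2.0, 1.0, N)
a = rng.choice(N, n, replace=False)          # SRS sample A
z = np.column_stack([np.ones(n), x[a]])
w = np.full(n, N / n)
T = np.array([N, x.sum()])                   # known benchmarks: N and sum of x
w_star = calibrate(z, w, T)
print(w_star @ z - T)                        # ~ [0, 0]: equation (2) holds
```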
Calibration Weighting for Big Data Integration
- The auxiliary variable $x_i$ is observed only when $\delta_i = 1$.
- The calibration equation changes to
$$\sum_{i \in A} w_i^* (1 - \delta_i,\ \delta_i,\ \delta_i x_i) = \sum_{i \in U} (1 - \delta_i,\ \delta_i,\ \delta_i x_i). \quad (3)$$
- If $y_i = x_i$, this reduces to the post-stratification estimator in (1).
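A self-contained sketch of equation (3), reusing the calibrate() idea from the previous slide's sketch. The membership rule, outcome model, and seed are hypothetical; the population totals of (1 − δ, δ, δx) are computable because x is observed for every δ = 1 unit.

```python
import numpy as np

def calibrate(z, w, T):
    lam = np.linalg.solve((z * w[:, None]).T @ z, T - w @ z)
    return w * (1 + z @ lam)

rng = np.random.default_rng(3)
N, n = 10_000, 400
x = rng.gamma(2.0, 1.0, N)
y = 3 + 0.7 * (x - 2) + rng.normal(0, 0.5, N)    # hypothetical outcome
delta = (x > 1.5).astype(float)                  # hypothetical membership in B
a = rng.choice(N, n, replace=False)              # SRS sample A
w = np.full(n, N / n)

z = np.column_stack([1 - delta[a], delta[a], delta[a] * x[a]])
T = np.array([N - delta.sum(), delta.sum(), (delta * x).sum()])
w_star = calibrate(z, w, T)
print((w_star @ y[a]) / N, y.mean())             # calibrated estimate vs. truth
```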
Simulation Study: Setup One
Goal: compare four estimators.
1. Naive estimator: mean of sample B.
2. Rivers estimator.
3. Proposed estimator 1 (PS estimator) using propensity score weighting.
4. Proposed estimator 2 (DR estimator) using a working model for E(Y | x) and a working model for P(δ = 1 | x).
Three scenarios for the simulation study:
1. Both models are correct.
2. Only the model for E(Y | x) is correct (i.e., the true distribution for P(δ = 1 | x) differs from the working model).
3. Only the model for P(δ = 1 | x) is correct.
Simulation Study One: Setup
Outcome regression model:
1. Linear model: $y_i = 1 + x_{1,i} + x_{2,i} + \epsilon_i$ for $i = 1, \ldots, N$, where $x_{1,i} \sim N(1, 1)$, $x_{2,i} \sim Ex(1)$, $\epsilon_i \sim N(0, 1)$, $N = 1{,}000{,}000$, and $(x_{1,i}, x_{2,i}, \epsilon_i)$ are pairwise independent.
2. Nonlinear model: $y_i = 0.5(x_{1,i} - 1.5)^2 + x_{2,i} + \epsilon_i$, where $(x_{1,i}, x_{2,i}, \epsilon_i)$ are the same as in the linear model.
Big data sampling mechanism:
1. Linear logistic model: $\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$ for $i = 1, \ldots, N$, where $\mathrm{logit}(p_i) = x_{2,i}$.
2. Nonlinear logistic model: $\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$ for $i = 1, \ldots, N$, where $\mathrm{logit}(p_i) = 0.5 + 0.5(x_{2,i} - 2)^2$.
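A minimal sketch generating one cell of this design (linear outcome model with linear logistic selection); the seed is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(31)
N = 1_000_000
x1 = rng.normal(1, 1, N)
x2 = rng.exponential(1.0, N)             # Ex(1)
eps = rng.normal(0, 1, N)

y = 1 + x1 + x2 + eps                    # linear outcome model
p = 1 / (1 + np.exp(-x2))                # logit(p_i) = x_{2,i}
delta = rng.random(N) < p                # big data membership
print(delta.mean(), y[delta].mean() - y.mean())   # f_B and the naive bias
```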
Simulation Result

                        n = 500                   n = 1000
  Scenario  Method      Bias     S.E.    C.R.     Bias     S.E.    C.R.
  I         Naive       0.187    0.001   0.000    0.187    0.001   0.000
            Rivers      0.000    0.077   0.950   -0.002    0.054   0.954
            PS         -0.001    0.023   0.950    0.000    0.016   0.946
            DR         -0.002    0.063   0.950   -0.002    0.044   0.950
  II        Naive      -0.097    0.001   0.000   -0.097    0.001   0.000
            Rivers     -0.003    0.077   0.955   -0.001    0.055   0.945
            PS          0.110    0.183   0.986    0.084    0.085   0.996
            DR         -0.001    0.063   0.947    0.000    0.046   0.946
  III       Naive       0.187    0.001   0.000    0.187    0.001   0.000
            Rivers      0.000    0.074   0.944    0.000    0.053   0.948
            PS         -0.001    0.022   0.946   -0.001    0.016   0.947
            DR         -0.001    0.050   0.950    0.001    0.035   0.950
Simulation Study: Setup Two
- Finite population of size $N = 1{,}000{,}000$:
  $x_i \sim N(2, 1)$
  $y_i = 3 + 0.7 (x_i - 2) + e_i$
  $y_i^* = 2 + 0.9 (y_i - 3) + u_i$
  where $e_i \sim N(0, 0.51)$ and $u_i \sim N(0, 0.5^2)$. Note that $y_i^*$ is an inaccurate measurement of $y_i$.
- Sampling mechanism for A: SRS of size n = 500.
- Big data sampling mechanism: stratified random sampling.
  1. Create two strata using $x_i \le 2$ and $x_i > 2$.
  2. Within each stratum, select $n_h$ elements by SRS independently, where $n_1 = 300{,}000$ and $n_2 = 200{,}000$.
  3. The stratum information is not available to the data analyst.
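A minimal sketch of this population and both samples; the seed is an assumption, and the flat N/n weights for sample A follow from the SRS design.

```python
import numpy as np

rng = np.random.default_rng(33)
N, n = 1_000_000, 500
x = rng.normal(2, 1, N)
y = 3 + 0.7 * (x - 2) + rng.normal(0, np.sqrt(0.51), N)   # Var(e) = 0.51
y_star = 2 + 0.9 * (y - 3) + rng.normal(0, 0.5, N)        # mismeasured y

# stratified big data: 300,000 from x <= 2 and 200,000 from x > 2, by SRS
delta = np.zeros(N, dtype=bool)
delta[rng.choice(np.flatnonzero(x <= 2), 300_000, replace=False)] = True
delta[rng.choice(np.flatnonzero(x > 2), 200_000, replace=False)] = True

a = rng.choice(N, n, replace=False)                       # SRS sample A
# Mean A (unbiased), Mean B (selection-biased, near 2.89), true mean
print(y[a].mean(), y[delta].mean(), y.mean())
```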
Simulation Study: Setup Two
- In sample A, we observe $y_i$.
- Two scenarios for sample B:
  1. Observe $y_i$: big data are subject to selection bias.
  2. Observe $y_i^*$: big data are subject to selection bias and measurement error.
- We can identify the elements in A ∩ B.
- Three estimators for θ = E(Y):
  1. Mean of sample A (Mean A).
  2. Mean of sample B (Mean B).
  3. Proposed data integration (DI) method using calibration weighting: in scenario one, we calibrate using $(1 - \delta_i, \delta_i y_i)$; in scenario two, we calibrate using $(1 - \delta_i, \delta_i y_i^*)$.
Simulation Result

Table: Monte Carlo results of mean, variance, and MSE of the three estimators (true mean = 3.00156)

  Scenario  Method        Mean    Variance (×10⁴)   MSE (×10⁴)
  1         Mean A        3.00    18.6              19
            Mean B        2.89     0.0              121
            Proposed DI   3.00     8.8              9
  2         Mean A        3.00    18.6              19
            Mean B        1.90     0.0              12,130
            Proposed DI   3.00    11.4              11
Discussion
- Big data should not be analyzed naively (big data paradox!).
- Data integration is a useful tool for harnessing big data for finite population inference.
- Two setups are considered:
  - In Setup One, both the Rivers method and the DR method are promising.
  - In Setup Two, the calibration weighting method is useful.
- In Setup One, the MAR assumption is used. In Setup Two, we do not need the MAR assumption.
- This is a promising area of research.