Imputation for Missing Data under PPSWR Sampling

Similar documents
Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

Nonresponse weighting adjustment using estimated response probability

6. Fractional Imputation in Survey Sampling

Combining data from two independent surveys: model-assisted approach

Jong-Min Kim* and Jon E. Anderson. Statistics Discipline Division of Science and Mathematics University of Minnesota at Morris

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

Jong-Min Kim* and Jon E. Anderson. Statistics Discipline Division of Science and Mathematics University of Minnesota at Morris

Chapter 8: Estimation 1

Recent Advances in the analysis of missing data with non-ignorable missingness

A measurement error model approach to small area estimation

Model Assisted Survey Sampling

ESTIMATION OF DISTRIBUTION FUNCTION AND QUANTILES USING THE MODEL-CALIBRATED PSEUDO EMPIRICAL LIKELIHOOD METHOD

Topic 12 Overview of Estimation

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Inference with Imputed Conditional Means

REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLES

Graybill Conference Poster Session Introductions

Chapter 3: Maximum Likelihood Theory

Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014

Regression Models - Introduction

ASYMPTOTIC NORMALITY UNDER TWO-PHASE SAMPLING DESIGNS

Sampling from Finite Populations Jill M. Montaquila and Graham Kalton Westat 1600 Research Blvd., Rockville, MD 20850, U.S.A.

analysis of incomplete data in statistical surveys

Combining Non-probability and Probability Survey Samples Through Mass Imputation

Asymptotic Normality under Two-Phase Sampling Designs

A decision theoretic approach to Imputation in finite population sampling

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari

An Overview of the Pros and Cons of Linearization versus Replication in Establishment Surveys

Data Integration for Big Data Analysis for finite population inference

Empirical Likelihood Methods for Sample Survey Data: An Overview

What is Survey Weighting? Chris Skinner University of Southampton

Introduction to Survey Data Integration

Review of Econometrics

On the bias of the multiple-imputation variance estimator in survey sampling

Chapter 4: Imputation

Fractional Imputation in Survey Sampling: A Comparative Review

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation

Generalized Linear Models Introduction

Chapter 4. Replication Variance Estimation. J. Kim, W. Fuller (ISU) Chapter 4 7/31/11 1 / 28

A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Parametric fractional imputation for missing data analysis

Statistics II. Management Degree Management Statistics IIDegree. Statistics II. 2 nd Sem. 2013/2014. Management Degree. Simple Linear Regression

TWO-WAY CONTINGENCY TABLES UNDER CONDITIONAL HOT DECK IMPUTATION

University of Michigan School of Public Health

ABHELSINKI UNIVERSITY OF TECHNOLOGY

Cluster Sampling 2. Chapter Introduction

Plausible Values for Latent Variables Using Mplus


Two-phase sampling approach to fractional hot deck imputation

Weighting in survey analysis under informative sampling

The Effective Use of Complete Auxiliary Information From Survey Data

Link lecture - Lagrange Multipliers

RESEARCH REPORT. Vanishing auxiliary variables in PPS sampling with applications in microscopy.

in Survey Sampling Petr Novák, Václav Kosina Czech Statistical Office Using the Superpopulation Model for Imputations and Variance

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Linear Models in Machine Learning

J.N.K. Rao, Carleton University Department of Mathematics & Statistics, Carleton University, Ottawa, Canada

of being selected and varying such probability across strata under optimal allocation leads to increased accuracy.

INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING

Introduction to Survey Data Analysis

Petter Mostad Mathematical Statistics Chalmers and GU

arxiv: v2 [math.st] 20 Jun 2014

A Design-Sensitive Approach to Fitting Regression Models With Complex Survey Data

REPLICATION VARIANCE ESTIMATION FOR THE NATIONAL RESOURCES INVENTORY

Simple and Multiple Linear Regression

Discussing Effects of Different MAR-Settings

Bootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University babu.

Simple Linear Regression Analysis

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes

This paper is not to be removed from the Examination Halls

A note on multiple imputation for general purpose estimation

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Linear Regression. Junhui Qian. October 27, 2014

MA 575 Linear Models: Cedric E. Ginestet, Boston University Bootstrap for Regression Week 9, Lecture 1

Graduate Econometrics I: Unbiased Estimation

Some Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2

Multidimensional Control Totals for Poststratified Weights

STAT 512 sp 2018 Summary Sheet

The logistic regression model is thus a glm-model with canonical link function so that the log-odds equals the linear predictor, that is

Workpackage 5 Resampling Methods for Variance Estimation. Deliverable 5.1

Admissible Estimation of a Finite Population Total under PPS Sampling

Propensity score adjusted method for missing data

Applied Econometrics (QEM)

Calibration estimation using exponential tilting in sample surveys

L2: Two-variable regression model

Nonstationary Panels

Gibbs Sampling in Endogenous Variables Models

VARIANCE ESTIMATION FOR NEAREST NEIGHBOR IMPUTATION FOR U.S. CENSUS LONG FORM DATA

The Simple Regression Model. Part II. The Simple Regression Model

Bessel s Equation. MATH 365 Ordinary Differential Equations. J. Robert Buchanan. Fall Department of Mathematics

Topic 16 Interval Estimation

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

2017 Financial Mathematics Orientation - Statistics

Economics 582 Random Effects Estimation

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

Econometrics Multiple Regression Analysis: Heteroskedasticity

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.

Transcription:

July 5, 2010 Beijing Imputation for Missing Data under PPSWR Sampling Guohua Zou Academy of Mathematics and Systems Science Chinese Academy of Sciences 1 23

() Outline () Imputation method under PPSWR sampling () Special case: SRS sampling and uniform response () () 2 23

() Item nonresponse: occurs frequently in sample surveys. Example In sample survey on transportation, some vehicles may not be found, but their tonnage or seat capacity is known to us. Solutions: (1) Increasing response probability; (2) Imputing the missing values of the sampled units. Imputation methods: Ratio imputation, regression imputation, random imputation etc. Shortcoming: Uniform response (often), simple random sampling PPSWR sampling: the sampling with probability proportional to size with replacement which is often used in the first stage of multistage sampling. 3 23

() Imputation method under PPSWR sampling 1. Notation Survey population U: consist of N distinct units identified through the labels i = 1,..., N. Suppose that the auxiliary variable, X, is available for each unit of the population, but the variable of interest, Y, is missing for some of the sampled units. s n : sample with size n drawn from U by PPSWR sampling. s r : the respondent set with size r( 1). 4 23 s n r : the nonrespondent set with size n r.

Uniform response mechanism: independent response across sample units and equal response probability, p (q 1 p). Non-uniform response mechanism: independent response across sample units and possibly unequal response probabilities, p i (q i 1 p i ). Response indicator on y i : { 1, if the unit i responds to y i, I i = 0, otherwise. 5 23

2. Imputation method We first let p i be known. For the missing Y values, we suggest the following imputation method: y i = 1 n r q j y j x i, i s n r. p j s j x j r 6 23

It is interesting to note that yi is an approximation of the weighted least squares predictor under the following superpopulation model: y i = βx i + e i, ε(e i ) = 0, ε(e 2 i ) = σ2 x 2 i, ε(e ie j ) = 0 (i j). In fact, it can be seen that under the above superpopulation model, the weighted least squares estimator of β with the weights w i q i /(p i x 2 i ) is given by s ˆβ = r (q i y i )/(p i x i ) s r 1/p i r. Further, the expectation of s r 1/p i with respect to the response mechanism is n. 7 23

3. Estimator of population mean and its variance Applying the above imputation method to the PPSWR sampling, the corresponding Hansen-Hurwitz estimator of the population mean Ȳ is ˆȲ P P S = X n i s r y i = X p i x i n i s n y i p i x i I i. is design-unbiased under the non- Theorem 1. The estimator ˆȲ P P S uniform response mechanism. Theorem 2. Under the non-uniform response, the variance of ˆȲ P P S is given by V ( ˆȲ P P S) = 1 N 2 where Z i = X i /X. { 1 n N i=1 ( ) 2 Yi Z i Y + 1 Z i n N i=1 } q i Yi 2, p i Z i 8 23

4. Jackknife variance estimator Define the imputed values as follows: For i s n r, ( 1 q k y ) k x i, j s r, n r p k x k yi a k s (j) = r {j} ( 1 q k y ) k x i, j s n r, n r 1 p k x k k s r when the j-th sample unit is deleted. Based on these imputed values, the estimator of Ȳ can be obtained as X y i, j s r, n 1 p i x i ˆȲ P a i s P S(j) = r {j} X y i, j s n r, n 1 p i x i i s r when the j-th sample unit is deleted. 9 23

Define the j-th pseudovalue as follows: ˆȲ j = n ˆȲ P a P S (n 1) ˆȲ P P S(j). A jackknife variance estimator of ˆȲ P P S v J ( ˆȲ P P S ) = n 1 n = [ ˆȲ P P S j s n X 2 n(n 1) i s r is then given by y2 i p 2 i x2 i ˆȲ a P P S(j)] 2 1 n ( i s r y i p i x i ) 2. Theorem 3. Let n > 1. Then under the non-uniform response, we have E[v J ( ˆȲ P P S)] = V ( ˆȲ P P S). Theorem 3 shows that the jackknife variance estimator v J ( ˆȲ P P S ) is designunbiased under the non-uniform response. 10 23

5. Extension to the case of unknown response probability In practice, the response probability p i is rare to be known. Here we model the response probability p i by a parametric model p i = g(x i ; θ), where g is a known smooth function and θ is an unknown finitedimensional parameter. An example of such a model is the logistic regression model. The estimator ˆθ of the parameter θ can be obtained by the maximum likelihood approach. Let ˆp i = g(x i ; ˆθ) be the estimated response probability. Then the corresponding estimator of the population mean Ȳ and its jackknife variance estimator are given by ˆȲ e P P S = X n i s r y i ˆp i x i, 11 23

and respectively. v J ( ˆȲ e P P S) = X 2 n(n 1) i s r y2 i ˆp 2 i x2 i 1 n ( i s r y i ˆp i x i ) 2, For the relationship between the estimators with known and unknown p i, under some regularity conditions, we have and ˆȲ e P P S = ˆȲ P P S + O p ( 1 n ), ( ) v J ( ˆȲ P e P S) = v J ( ˆȲ 1 P P S) + O p. n 3/2 12 23

() Special case: SRS sampling and uniform response Imputed value: y i = Imputed estimator: 1 r y j x j s j r ˆȲ m SRS= ū r x n + where ū r = 1 u j with u j = y j /x j. r j s r x i, i s n r. r(n 1) (r 1)n (ȳ r ū r x r ), 13 23

The estimator ˆȲ SRS m is a design-unbiased estimator under uniform response. It is interesting that it is just the version of Hartley-Ross estimator (Hartley and Ross 1954, Nature) for estimating the population mean under two-phase sampling. Variance: V ( ˆȲ SRS) m = 1 ( ) np [S2 Y + (1 p)(2ȳ Ū X + Ū 2 T Ū 2 X2 2Ū Z)] 1 + O n 2, where T = 1 N and N T i with T i = Xi 2, and Z = 1 N i=1 S 2 Y = 1 N 1 N (Y i Ȳ )2. i=1 N Z i with Z i = Y i X i, i=1 14 23

Jackknife variance estimator: v J ( ˆȲ SRS m ) = 1 { (n 1)(r 1)ū 2 n(n 1)(r 1) r s 2 x(n) + 1 (r 2) 2[n(r 2) x n (nr r 2) x r ] 2 s 2 u(r) + (n 2)2 r 2 (r 2) 2 s2 y(r) where s y ν 1 u ν 2x ν 3(r) = 1 r 1 + (n 2)r(nr 2r2 + 4r 4) (r 2) 2 ū 2 r s 2 x(r) 2(n 2)r + (r 2) 2 [n(r 2) x n (nr r 2) x r ] s yu (r) 2 (r 2) 2[(n2 r n(r 2 r + 2) (r 1)(r 2))(r 2) x n (n 2 r 2 nr(r 2 + 4) + 2(3r 2 4r + 4)) x r ]ū r s ux (r) 2(n 2)r(nr r2 + r 2) (r 2) 2 ū r s yx (r) 2 r 2 [n(r 2) x n (nr r 2) x r ]s u 2 x(r) + 2(nr r2 + r 2) ū r s ux 2(r) r 2 2r +s u 2 x2(r) n (ȳ r ū r x r )s ux (r) + 2(n 2)r s yux (r) r 2 } r2 n(r 1) (ȳ r ū r x r ) 2, j s r (y j ȳ r ) ν 1 (u j ū r ) ν 2 (x j x r ) ν 3. 15 23

Two approximate Jackknife variance estimators: v J ( ˆȲ SRS m ) = nū2 1 rs 2 x(n) + 1 r [( x r x n ) 2 s 2 u(r) 2( x r x n )s yu (r) + s 2 y(r)] ( 1 + r 2 ) ( 1 ū 2 n rs 2 x(r) + 2 r 1 ) ū r [( x r x n )s ux (r) s yx (r)], n and v J( ˆȲ SRS) m = nū2 1 rs 2 x(n) + 1 ( 1 r s2 y(r) + r 2 ) ( 1 ū 2 n rs 2 x(r) 2 r 1 ) ū r s yx (r). n 16 23

Asymptotic design-unbiasedness: (i) E[v J ( ˆȲ SRS m m )] = V ( ˆȲ SRS ) + O ( ) 1 n. 2 (ii) E[v J ( ˆȲ SRS m m )] = V ( ˆȲ SRS ) + O ( ) 1 n. 2 (iii) E[v J ( ˆȲ SRS m m )] = V ( ˆȲ SRS ) + O ( ) 1 n. 2 Remark: We should note that the (approximate) design-unbiasedness is the main requirement for a good estimator in survey sampling. The approximate design-unbiasedness of the Jackknife variance estimators has been found first in Zou and Feng (1998), and then in Skinner and Rao (2002) and Zou et al. (2002). Such a property universally holds from our subsequent analysis. 17 23

() The data are generated from the three ratio models which are different only in the auxiliary variables: y i = 3.9x i + x i ε i with x i U(0.1, 2.1), N(1, 1), and N(20, 16), respectively, ε i N(0, 1), and x i and ε i are assumed to be independent. In the case of uniform response, we set p = 0.76; in the case of nonuniform response, the unequal response probability p i for the unit i follows the logistic model p i = exp( 1 + 2.3x i) 1 + exp( 1 + 2.3x i ). 18 23

We first generate a finite population with the size of N = 10, 000. Then the samples with n = 100 and n = 500 are drawn from the finite population by the PPSWR sampling, respectively. We repeat the process B = 5, 000 times. For the b-th run, denote the estimators of the population mean under the uniform (b) and non-uniform responses as and ˆȲ P P S, respectively. We calculate the ˆȲ I(b) P P S simulated means and variances as follows: and E ( ˆȲ I P P S) = 1 B V ( ˆȲ I P P S) = 1 B B ( b=1 B b=1 ˆȲ I(b) P P S, ˆȲ I(b) P P S Ȳ )2, E ( ˆȲ P P S) = 1 B V ( ˆȲ P P S) = 1 B B b=1 B ( b=1 ˆȲ (b) P P S ; ˆȲ (b) P P S Ȳ )2. Similarly, the corresponding jackknife variance estimators are calculated as E [v J ( ˆȲ I P P S)] = 1 B B b=1 v (b) J ( ˆȲ I P P S), and E [v J ( ˆȲ P P S)] = 1 B B b=1 v (b) J ( ˆȲ P P S). 19 23

20 23

Table 1 summarizes the results on the simulated mean, variance and jackknife variance estimate. It can be seen from the table that both of the estimators ˆȲ P I P S and ˆȲ P P S are very close to the true population means for the three distributions of auxiliary variable. Also, the jackknife variance estimators perform very well. 21 23

To study the effect of the response probability, we set various response probabilities: p = 0.5 for uniform response, and p i follows p i = exp{0.3(x i X)} 1 + exp{0.3(x i X)} for non-uniform response. The results are presented in Table 2. It is observed that the approximate design-unbiasedness of the proposed estimators still holds. On the other hand, it is also clear that the variances become larger for low response probability. For some other settings of response probability, we obtain the similar results but omit them here for saving space. 22 23

23 23

() Incomplete auxiliary information and multiple auxiliary information Estimation of response probability: the use of non-parametric approach. Other unequal probability sampling 24 23

Thank you! 25 23