
AN EMPIRICAL LIKELIHOOD BASED ESTIMATOR FOR RESPONDENT DRIVEN SAMPLED DATA

Sanjay Chaudhuri, Department of Statistics and Applied Probability, National University of Singapore
Mark Handcock, Department of Statistics, University of California, Los Angeles

INTRODUCTION

In classical survey sampling, the estimators of population means and totals usually use unit-wise inclusion probabilities. Pairwise inclusion probabilities, when available, are mostly used to estimate the standard errors. Likelihood based estimators, however, often benefit from a correct specification of the dependence. In this talk we present a new empirical likelihood based procedure for estimating population parameters from data collected through a complex sampling scheme. The proposed procedure depends on the pairwise inclusion probabilities directly and works well for dependent data.

A MOTIVATING EXAMPLE

There were 4600 counties in the United States of America in 2004. The county-wise totals of cast votes, and of those in favour of the three presidential candidates, namely John Kerry, George W. Bush and Ralph Nader, are available. Suppose we want to estimate the proportion of counties John Kerry won in the election. The true value is known to be 0.3276. However, we want to estimate this number by drawing samples of size n = 40 from the population. Usually the more populous counties in urban areas vote Democrat, so it is natural to consider PPS sampling schemes where the sampling probabilities $\pi_i$ are proportional to the total votes cast. Three PPS schemes are considered: (i) Tillé sampling, (ii) Midzuno sampling and (iii) systematic sampling. These three schemes sample units with various degrees of pairwise dependence, reflected in the pairwise inclusion probabilities $\pi_{ij}$ between units $i$ and $j$.

MOTIVATION (CONTD.)

Suppose $y_i$ is the indicator that the $i$th county was won by John Kerry. We consider three estimators of the population proportion:

Sample mean $= \frac{1}{n}\sum_{i=1}^{n} y_i$, which ignores the unequal probability sampling.

Hájek estimator $= \sum_{i=1}^{n} y_i \pi_i^{-1} \Big/ \sum_{i=1}^{n} \pi_i^{-1}$, which only involves the first order selection probabilities.

We propose a new estimator which uses pairwise selection probabilities for estimating the proportion. The estimator is given by:

Proposed estimator $= \dfrac{\frac{1}{2}\sum_{i=1}^{n}\sum_{j=i+1}^{n} (y_i + y_j)\,\pi_{ij}^{-1}}{\sum_{i=1}^{n}\sum_{j=i+1}^{n} \pi_{ij}^{-1}}$.

We first compare the performances of the above three estimators under the three PPS sampling schemes by directly sampling from the population. The results presented below are based on 10,000 samples.
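The following short Python/NumPy sketch (not part of the original slides) illustrates how the three estimators can be computed from a single drawn sample; the arrays y, pi and pi2 are assumed to hold the sampled indicators and the first- and second-order inclusion probabilities produced by whichever PPS scheme is used.

```python
import numpy as np

def compare_estimators(y, pi, pi2):
    """Compute the three estimators of the population proportion from one sample.

    y   : (n,) 0/1 indicators for the sampled units (county won by Kerry)
    pi  : (n,) first-order inclusion probabilities pi_i of the sampled units
    pi2 : (n, n) pairwise inclusion probabilities pi_ij (off-diagonal entries used)
    """
    n = len(y)
    sample_mean = y.mean()                          # ignores the PPS design
    hajek = np.sum(y / pi) / np.sum(1.0 / pi)       # first-order probabilities only

    iu, ju = np.triu_indices(n, k=1)                # all pairs i < j
    inv_pi2 = 1.0 / pi2[iu, ju]
    proposed = 0.5 * np.sum((y[iu] + y[ju]) * inv_pi2) / np.sum(inv_pi2)
    return sample_mean, hajek, proposed
```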

RESULTS

[Figure: histograms of the sample mean (SM), proposed estimator (PE) and Hájek estimator (HJ) over the 10,000 samples, one panel each for Tillé, Midzuno and systematic sampling; x-axis: proportion of counties for Kerry, y-axis: density.]

PPS                       Sample mean              Hájek estimator          Proposed estimator
Scheme                    mse    bias   var        mse    bias   var        mse    bias   var
Tillé      (dep=0.002)    0.017  0.110  0.005      0.035  0.022  0.035      0.025  0.029  0.024
Midzuno    (dep=0.010)    0.017  0.111  0.005      0.037  0.019  0.037      0.027  0.027  0.026
Systematic (dep=130.79)   0.016  0.110  0.004      0.034  0.017  0.034      0.014  0.049  0.012

OBSERVATIONS

In terms of mse, the sample mean, which ignores the PPS, surprisingly performs quite well. It is very biased, but has the smallest variance.

The Hájek estimator actually performs worst in terms of mse. It has the lowest bias for all sampling schemes, but its variance is the highest.

The proposed estimator does well for systematic sampling. In fact it has the smallest mse overall, even though it is more biased than the Hájek estimator.

From the histograms it is seen that, for the Tillé and Midzuno schemes, the proposed estimator is quite close to the Hájek estimator. However, for systematic sampling, with more dependence, the performance of the former improves. Even though its absolute bias increases, there is a marked accumulation near the true value.

In summary, when the sampling is dependent the proposed estimator may be better. On the other hand, for independent or nearly independent sampling it probably won't be worse than the Hájek estimator.

In what follows we provide an empirical likelihood based justification of our proposed estimator.

TOWARDS A COMPOSITE LIKELIHOOD BASED METHOD (SETUP)

We discuss a general construction of a composite empirical likelihood below. We need some structural assumptions on the sampling design and the scheme.

Super population: $(Y, X, D)$ → Finite population: $N$ i.i.d. draws of $(Y, X, D)$ → Sample: $(Y, X, Z)$, of size $n$, drawn from the finite population according to the design.

Consider a super population with response $Y$, auxiliary variables $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(p)}\}$ and design variables $D$. The population $P$ is an i.i.d. sample of size $N$ from the super population. A random sample $S$ of $n$ observations is drawn from $P$ according to a design depending on $D$. The data do not contain all of $D$; only a subset $Z = \{Z^{(1)}, Z^{(2)}, \ldots, Z^{(m)}\}$ is supplied. The variables in $X$ are not in the design; however, $Y$ and $X$ may depend on some variables in $Z$. The set $A$ contains all the explanatory variables in the model. Further, let $V = \{Y\} \cup X \cup Z$.

SAMPLE AND THE DESIGN

For $s \subseteq P$, let $I_s$ be the random indicator that $s$ is included in the sample. The sample $S$ is the unique largest subset $s$ of $P$ such that $I_s = 1$.

If the sample units are drawn according to a design, the sampling mechanism may not be ignorable. That is, the observed distribution of $V$ in the sample $S$ may differ from its distribution in the population and may depend on the particular sample selected.

For any $s \subseteq P$, the design specifies the conditional probability of $I_s = 1$ given $D_P$. Suppose
$$\pi_s = \Pr_P(I_s = 1 \mid D_P),$$
where $\Pr_P(\cdot)$ denotes probability under the population. Notice that $\pi_s$ is a random variable, because of $D_P$. Though a bit controversial, in practice this assumption is unavoidable. It also facilitates the analysis, as we shall see later.

BASIC ASSUMPTIONS ON THE DESIGN AND SAMPLE SELECTION

[Diagram: graphical representation of the sampling scheme, showing the relations among $X_s$, $Y_s$, $Z_s$, $Z^c_s$, $\pi_s$, $V_s$ and $I_s$.]

Basic assumptions: for all $s \subseteq P$, under the population distribution we assume

1. $\pi_s \perp (Y_P, X_P) \mid D_P$;
2. $I_s \perp (X_P, Y_P, D_P) \mid \pi_s$ (the sampling scheme).

Implications of the basic assumptions:

Assumption 1 implies $\pi_s = \pi(s, D_P)$. Note: $\pi$ is user specified.

$I_s \perp D_P \mid \pi_s$, for all $s$: selection ignorability (Sugden and Smith [1984]).

$I_s \perp (X_P, Y_P) \mid D_P$: the basic design assumption of Scott [1977].

$\Pr_P[I_s = 1 \mid V_s] = E_P[\pi_s \mid V_s]$.

$\Pr_P[I_s = 1, V_s] = \Pr_P[I_s = 1 \mid V_s]\,\Pr_P[V_s] = E_P[\pi_s \mid V_s]\,\Pr_P[V_s]$.

THE LIKELIHOOD

Suppose $i_S = \{(i,j) : i < j,\ i, j \in S\}$. Following Pfeffermann et al., for any $(i,j) \in i_S$ define
$$dF^{(S)}_{(ij)}(V_i, V_j) = \Pr_P[(V_i, V_j) \mid (i,j) \in i_S] = \frac{\Pr_P[(i,j) \in i_S \mid V_i, V_j]\, dF^{(P)}_{(ij)}}{\Pr_P[(i,j) \in i_S]},$$
where $F^{(P)}$ denotes the unknown population distribution. Under our assumptions we get
$$dF^{(S)}_{(ij)} = \frac{E_P[\pi_{ij} \mid V_i, V_j]\, dF^{(P)}_{(ij)}}{\int E_P[\pi_{ij} \mid V_i, V_j]\, dF^{(P)}_{(ij)}}.$$
Following Pfeffermann et al. again, we can construct a composite likelihood of the sample by multiplying $dF^{(S)}_{(ij)}$ over the set $i_S$. The likelihood is given by
$$\prod_{(i,j) \in i_S} dF^{(S)}_{(ij)} = \prod_{(i,j) \in i_S} \frac{E_P[\pi_{ij} \mid V_i, V_j]\, dF^{(P)}_{(ij)}}{\int E_P[\pi_{ij} \mid V_i, V_j]\, dF^{(P)}_{(ij)}}.$$

EMPIRICAL LIKELIHOOD

Suppose $\nu_{ij} = E_P[\pi_{ij} \mid V_i, V_j]$. Clearly
$$\int E_P[\pi_{ij} \mid V_i, V_j]\, dF^{(P)}_{(ij)} = \int \nu_{ij}\, dF^{(P)}_{(ij)} = E_P[\pi_{ij}].$$
Now if we further assume (cf. Godambe, Hartley and Rao) that $E_P[\pi_{ij}] = \Upsilon$ for all $(i,j) \in i_S$, we can write an empirical likelihood as
$$L(w, \nu) = \prod_{(i,j) \in i_S} \left\{ \frac{\nu_{ij}\, w_{ij}}{\sum_{(i,j) \in i_S} \nu_{ij}\, w_{ij}} \right\}.$$
Here we estimate the true $F^{(P)}$ using empirical likelihood. The unknown weights $w_{ij}$ are the jumps of a distribution $F$ at the points $(V_i, V_j)$, and we estimate $\Upsilon$ by $\sum_{(i,j) \in i_S} \nu_{ij} w_{ij}$. This likelihood is the same as Vardi's likelihood; we are using the pairwise inclusion probabilities.

ESTIMATING THE MEAN

In order to estimate $\mu_0$, we maximise $L(w, \nu)$ over the set
$$W = \left\{ w : w \in \Delta^{\binom{n}{2}-1},\ \sum_{(i,j) \in i_S} w_{ij}\{(y_i - \mu) + (y_j - \mu)\} = 0,\ \mu \in \mathbb{R} \right\},$$
where $\Delta^{\binom{n}{2}-1}$ is the $\binom{n}{2} - 1$ dimensional simplex. Our maximal empirical likelihood estimator of $\mu_0$ is given by
$$\hat\mu^{(n)}_E = \underset{w \in W}{\arg\max}\, \{L(w, \nu)\}.$$
It follows that, under our setup,

1. The composite likelihood $L(w, \nu)$ has a unique maximum in $W$ at $\hat w_{ij} = \nu_{ij}^{-1} \big/ \sum_{(i,j) \in i_S} \nu_{ij}^{-1}$.

2. The corresponding estimator $\hat\mu^{(n)}_E$ of $\mu_0$ is unique and is given by
$$\hat\mu^{(n)}_E = \frac{1}{2} \sum_{(i,j) \in i_S} (y_i + y_j)\,\nu_{ij}^{-1} \Big/ \sum_{(i,j) \in i_S} \nu_{ij}^{-1}.$$
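A minimal sketch of this closed-form solution (illustrative only, assuming the matrix of $\nu_{ij}$ values for the sampled pairs is available as a NumPy array):

```python
import numpy as np

def el_weights_and_mean(y, nu):
    """Closed-form EL weights w_hat_ij and the estimator mu_hat of the mean.

    y  : (n,) sampled responses
    nu : (n, n) symmetric matrix with nu[i, j] = nu_ij = E_P[pi_ij | V_i, V_j]
    """
    n = len(y)
    iu, ju = np.triu_indices(n, k=1)        # pairs (i, j) with i < j
    inv_nu = 1.0 / nu[iu, ju]

    w_hat = inv_nu / inv_nu.sum()           # unique maximiser of L(w, nu)
    mu_hat = 0.5 * np.sum((y[iu] + y[ju]) * inv_nu) / inv_nu.sum()
    return w_hat, mu_hat
```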

ASYMPTOTIC PROPERTIES

Our estimator is based on non-degenerate U-statistics, so its asymptotic properties can be obtained easily. In particular, one can show that, under the population distribution, when the variables are well behaved and some regularity conditions hold:

1. $\hat\mu^{(N)}_E$ is strongly consistent for $\mu_0$ under the distribution in the population.

2. Suppose $U_1 = E_P\left[(Y_1 + Y_2 - 2\mu_0)\,\nu_{12}^{-1} \,\middle|\, V_1 = v_1\right]$. Then $\sqrt{N}\left(\hat\mu^{(N)}_E - \mu_0\right)$ converges in distribution to a $N(0, \sigma^2)$ variable, with
$$\sigma^2 = \mathrm{Var}_P[U_1] \Big/ \left\{ E_P\left[\nu_{12}^{-1}\right] \right\}^2.$$

PREDICTING FINITE POPULATION MEAN

Our predictor $\hat\mu_N$ for the finite population mean is given by $\hat\mu^{(n)}_E$. The variance of $\hat\mu_N$, with a finite population correction, is given by
$$\mathrm{Var}_P[\hat\mu_N] = \left(1 - \frac{n-1}{N-1}\right) \frac{\sigma^2}{n},$$
from which the standard errors can be computed. Using $\hat w$ and $\hat\mu^{(n)}_E$, we can estimate $\sigma^2/n$ from the data. The first estimator, which assumes that $\mathrm{Var}_P[U_1] = E_P[U_1^2]$, is given by
$$\widehat{\mathrm{Var}}^{(1)}_P[\hat\mu_N] = \left(1 - \frac{n-1}{N-1}\right) \frac{\sum_{i=1}^{n} \left\{ \sum_{j=1, j \neq i}^{n} (y_i + y_j - 2\hat\mu^{(n)}_E)\,\nu_{ij}^{-2} \right\}^2}{\left\{ \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \nu_{ij}^{-2} \right\}^2}.$$
Alternatively, one can compute the actual variance of $U_1$ from the data and get an estimate $\widehat{\mathrm{Var}}^{(2)}_P[\hat\mu_N]$ of $\mathrm{Var}_P[\hat\mu_N]$.
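A sketch of the first variance estimator, under the same array conventions as before (illustrative only; N denotes the finite population size):

```python
import numpy as np

def var1_hat(y, nu, mu_hat, N):
    """First estimator of Var_P[mu_hat_N], assuming Var_P[U_1] = E_P[U_1^2].

    y      : (n,) sampled responses
    nu     : (n, n) matrix of nu_ij (off-diagonal entries used)
    mu_hat : empirical likelihood estimate of the mean
    N      : finite population size (for the finite population correction)
    """
    n = len(y)
    off = ~np.eye(n, dtype=bool)                  # mask selecting j != i
    nu_safe = nu.copy()
    np.fill_diagonal(nu_safe, 1.0)                # the diagonal is never used

    r = (y[:, None] + y[None, :] - 2.0 * mu_hat) / nu_safe**2
    num = np.sum(np.where(off, r, 0.0).sum(axis=1) ** 2)
    den = np.where(off, 1.0 / nu_safe**2, 0.0).sum() ** 2
    fpc = 1.0 - (n - 1) / (N - 1)
    return fpc * num / den
```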

RESULTS FOR THE ELECTION DATA

As an illustration, we estimate the standard errors and the coverages of 95% confidence intervals for the proposed estimator on the election data used before. The estimated standard errors are compared with those obtained by directly sampling from the population, as in the previous table. These results are based on 1000 samples, each of size 40. We use the Hartley-Rao estimator of the variance of the Hájek estimator; the Yates-Grundy estimator turns out to be negative for systematic sampling.

PPS          Proposed estimator                              Hájek estimator
Scheme       Var     Var-hat(1)  cov     Var-hat(2)  cov     Var     Hartley-Rao  cov
Tillé        0.155   0.151       0.830   0.117       0.734   0.187   0.138        0.735
Midzuno      0.161   0.150       0.807   0.113       0.680   0.192   0.134        0.681
Systematic   0.110   0.139       0.923   0.120       0.879   0.184   0.138        0.749

From the table it appears that the proposed estimator of the standard error, especially $\widehat{\mathrm{Var}}^{(1)}_P$, is quite close to the variance obtained from direct sampling from the population. The coverage seems to be good, especially for systematic sampling.

ESTIMATING STANDARD ERROR WITH DEPENDENCE

In finite samples $y_i$ and $y_j$ are dependent, so the variance formula above is not very accurate. We use an ad hoc general estimator of the standard error. Suppose we define
$$\hat w_{ji} = \hat w_{ij} \text{ for } j < i, \qquad \hat w_i = \frac{\sum_{j=1, j \neq i}^{n} \hat w_{ij}}{\sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \hat w_{ij}}, \qquad \hat U_i = \sum_{j=1, j \neq i}^{n} \frac{\hat w_{ij}}{\hat w_i} \cdot \frac{y_i + y_j - 2\hat\mu_E}{\nu_{ij}}.$$
With these definitions the denominator can be estimated by
$$\hat E_P\left[\nu_{12}^{-1}\right] = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \frac{\hat w_{ij}}{\nu_{ij}}.$$
If the sampled units were selected independently, the estimator would take the form
$$\frac{\widehat{\mathrm{Var}}_P[U_1]}{n} = \sum_{i=1}^{n} \hat w_i^2 \hat U_i^2 = \sum_{i=1}^{n} \left\{ \sum_{j=1, j \neq i}^{n} \hat w_{ij}\, \frac{y_i + y_j - 2\hat\mu_E}{\nu_{ij}} \right\}^2.$$

ESTIMATING STANDARD ERROR WITH DEPENDENCE (CONTD.)

If the sampled units are dependent, the above estimator requires a correction which depends on the sampling probabilities. Suppose
$$s = \mathrm{sign}\left( \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} (\pi_{ij} - \pi_i \pi_j) \right).$$
The correction term is given by
$$s' = s \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \left( 1 - \frac{\pi_i \pi_j}{\pi_{ij}} \right) \left\{ \sum_{k=1, k \neq i}^{n} \hat w_{ik}\, \frac{y_i + y_k - 2\hat\mu_E}{\nu_{ik}} \right\} \left\{ \sum_{l=1, l \neq j}^{n} \hat w_{jl}\, \frac{y_j + y_l - 2\hat\mu_E}{\nu_{jl}} \right\}.$$
So finally the variance of $\hat\mu$ is given by
$$\widehat{\mathrm{Var}}_P[\hat\mu_N] = \frac{\displaystyle \sum_{i=1}^{n} \hat w_i^2 \hat U_i^2 + s \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \left( 1 - \frac{\pi_i \pi_j}{\pi_{ij}} \right) \hat w_i \hat w_j \hat U_i \hat U_j}{\left\{ \displaystyle \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \frac{\hat w_{ij}}{\nu_{ij}} \right\}^2}.$$
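A numerical sketch of this corrected estimator, written directly from the reconstructed formulas above (illustrative only, not the authors' code; the finite population correction from the earlier slide can be applied to the returned value if desired):

```python
import numpy as np

def var_hat_dependent(y, nu, pi, pi2, mu_hat):
    """Ad hoc variance estimator with the dependence correction.

    y, pi  : (n,) responses and first-order probabilities pi_i
    nu     : (n, n) matrix of nu_ij   (diagonal unused)
    pi2    : (n, n) matrix of pi_ij   (diagonal unused)
    mu_hat : EL estimate of the mean
    """
    n = len(y)
    off = ~np.eye(n, dtype=bool)
    nu_safe = nu.copy();   np.fill_diagonal(nu_safe, 1.0)    # diagonals never used
    pi2_safe = pi2.copy(); np.fill_diagonal(pi2_safe, 1.0)

    w = np.where(off, 1.0 / nu_safe, 0.0)
    w /= w.sum()                                             # EL weights w_ij

    # t_i = w_i * U_i = sum_{j != i} w_ij (y_i + y_j - 2 mu_hat) / nu_ij
    resid = (y[:, None] + y[None, :] - 2.0 * mu_hat) / nu_safe
    t = (w * np.where(off, resid, 0.0)).sum(axis=1)

    denom = np.where(off, w / nu_safe, 0.0).sum() ** 2       # {E_hat[1/nu_12]}^2
    base = np.sum(t ** 2)                                    # independent-sampling part

    s = np.sign(np.where(off, pi2 - np.outer(pi, pi), 0.0).sum())
    corr = np.where(off, 1.0 - np.outer(pi, pi) / pi2_safe, 0.0)
    correction = s * np.sum(corr * np.outer(t, t))
    return (base + correction) / denom
```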

EXTENSIONS TO REGRESSION MODELS

There are several possible extensions of our proposed procedure. We discuss the one for regression estimation. Suppose $V = (Y, A, Z)$, where $Y$ is the response, $A$ is the set of auxiliary variables, which may include some design variables, and $Z$ is the set of observed design variables not in $A$. A regression model is specified by a vector of functions $\psi_\theta(Y, A)$ depending on $Y$, $A$ and a parameter $\theta \in \mathbb{R}^d$, satisfying $E_P[\psi_\theta(Y_1, A_1)] = 0$.

The composite likelihood $L(w, \nu)$ can be used here as well. However, the unknown $w$ is now determined by maximising the likelihood over the set
$$W = \left\{ w : w \in \Delta^{\binom{n}{2}-1},\ \sum_{(i,j) \in i_S} w_{ij}\{\psi_\theta(y_i, a_i) + \psi_\theta(y_j, a_j)\} = 0,\ \theta \in \mathbb{R}^d \right\}.$$
As before, the parameter $\theta$ can be estimated by the maximal argument of $L$ over $W$, and the rest, such as the standard errors, can be estimated similarly.
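As one concrete illustration (a sketch under the assumption $\psi_\theta(y, a) = a(y - a^\top\theta)$, the least-squares score; not the authors' implementation), the pairwise constraint with the closed-form weights $\hat w_{ij} \propto \nu_{ij}^{-1}$ reduces to a weighted least-squares problem with unit-level weights $c_i = \sum_{j \neq i} \hat w_{ij}$:

```python
import numpy as np

def el_regression(y, A, nu):
    """EL-weighted regression sketch with psi_theta(y, a) = a * (y - a @ theta).

    y  : (n,) responses
    A  : (n, p) matrix whose rows are the auxiliary vectors a_i
    nu : (n, n) matrix of nu_ij (off-diagonal entries used)
    """
    n = len(y)
    off = ~np.eye(n, dtype=bool)
    w = np.zeros_like(nu, dtype=float)
    w[off] = 1.0 / nu[off]                    # w_ij proportional to 1/nu_ij
    c = w.sum(axis=1)                         # unit-level weights c_i = sum_{j != i} w_ij
    # solve sum_i c_i a_i (y_i - a_i' theta) = 0  (any positive rescaling of w gives the same theta)
    theta_hat = np.linalg.solve(A.T @ (A * c[:, None]), A.T @ (c * y))
    return theta_hat
```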

AN RDS EXAMPLE

We consider a networked population of high-risk HIV-positive persons in Colorado Springs in 1990. The network of social ties was collected by the El Paso County department of public health (see Potterat et al. 2004). The network has 2587 nodes and 10381 edges, and covariate information is available.

The RDS sample was drawn with 10 seeds selected randomly with probability proportional to their degrees, with 2 tickets each, and sampling was continued until the sample reached size 500. In order to find the selection probabilities, 50000 such samples were drawn and the proportions of individual, pairwise, etc. occurrences were computed. The variation in the probabilities thus obtained was about 3-4%.

We are interested in estimating the proportion of non-white members in the network. The actual proportion is known to be 0.2810205.
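A sketch of this Monte Carlo computation of the selection probabilities (not part of the original slides; draw_rds_sample is a hypothetical placeholder for one run of the RDS scheme described above):

```python
import numpy as np

def mc_inclusion_probs(draw_rds_sample, N, reps=50_000, seed=0):
    """Estimate unit and pairwise selection probabilities by simulation.

    draw_rds_sample : callable(rng) -> array of sampled node indices
                      (hypothetical placeholder: 10 degree-proportional seeds,
                      2 tickets each, stop once the sample reaches size 500)
    N               : number of nodes in the network
    reps            : number of simulated samples
    """
    rng = np.random.default_rng(seed)
    count1 = np.zeros(N)
    count2 = np.zeros((N, N))
    for _ in range(reps):
        s = np.asarray(draw_rds_sample(rng), dtype=int)
        count1[s] += 1
        count2[np.ix_(s, s)] += 1            # joint occurrences of sampled pairs
    pi = count1 / reps
    pi2 = count2 / reps
    np.fill_diagonal(pi2, pi)                # diagonal carries the unit probabilities
    return pi, pi2
```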

AN RDS EXAMPLE (CONTD.)

Some results:

Estimator    bias      se        rmse     efficiency   est. se
SS           0.0063    0.09648   0.0967   1.000        0.0965
Hájek       -0.0007    0.09287   0.0928   0.923        0.0509
Proposed    -0.0011    0.08884   0.0889   0.844        0.0882

The SS estimator (Krista Gile) is the current state of the art; it beats many other traditional estimators. Clearly, the proposed estimator is better than SS in terms of both bias and variance. Our variance estimator, obtained using the formula above, underestimates. The variance of the SS estimator is usually determined by a bootstrap.

CONCLUSION

Several extensions of the basic methodology described here are possible. Population level constraints not depending on the parameter of interest can easily be included; a two-step method of estimation would be useful in that case. Using U-statistics, empirical likelihood and pairwise sampling probabilities, a new estimator of the population variance can also be considered. It is possible to handle degenerate U-statistics.

The pairwise inclusion probabilities are often not available. In such cases we need to use approximations such as those of Hartley-Rao or Overton. The performance of the proposed estimator with approximate pairwise weights needs to be looked at.

There are several unresolved issues about the finite sample and asymptotic behaviour of the estimator. Several applications to respondent driven sampling, snowball sampling, adaptive cluster sampling, etc. are possible.