Combining Non-probability and Probability Survey Samples Through Mass Imputation


1 Combining Non-probability and Probability Survey Samples Through Mass Imputation

Jae-Kwang Kim, Iowa State University & KAIST
October 27,
Joint work with Seho Park, Yilin Chen, and Changbao Wu

2 Outline

1. Introduction
2. Proposed method
3. Variance estimation
4. Replication variance estimation
5. A real data application
6. Conclusion

Kim (ISU) Survey Data Integration October 27, / 26

3 1. Introduction

We are interested in combining information from two samples, one obtained by probability sampling (sample A) and the other by non-probability sampling (sample B), such as a voluntary sample.
We observe X from the probability sample and (X, Y) from the non-probability sample.
The sampling mechanism for sample B is unknown.

Table: Data Structure

Data | X | Y | Representativeness
A | ✓ | | Yes
B | ✓ | ✓ | No

4 Mass Imputation

Rivers (2007) idea:
1. Use X to find the nearest neighbor in sample B for each unit $i \in A$.
2. Compute $\hat\theta = \sum_{i \in A} w_i y_i^*$, where $y_i^*$ is the value imputed for $y_i$, $i \in A$, by nearest neighbor imputation.

The method is based on the MAR (missing at random) assumption
$$f(y \mid x, \delta = 1) = f(y \mid x),$$
where $\delta_i = 1$ if $i \in B$ and $\delta_i = 0$ if $i \notin B$.
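The two steps of the nearest neighbor idea can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function name `nn_mass_imputation`, the use of Euclidean distance, and the division by the sum of the weights (to turn the weighted total into a mean) are all assumptions made for the example.

```python
import math

def nn_mass_imputation(x_a, w_a, x_b, y_b):
    """Nearest-neighbor mass imputation of a population mean (illustrative).

    x_a, x_b: lists of covariate tuples for samples A and B.
    w_a: design weights for sample A.  y_b: responses observed in sample B.
    """
    y_star = []
    for xa in x_a:
        # Step 1: index of the sample-B unit closest to xa (Euclidean distance).
        donor = min(range(len(x_b)), key=lambda j: math.dist(xa, x_b[j]))
        y_star.append(y_b[donor])
    # Step 2: weighted combination of the imputed values.
    # Dividing by sum(w_a) is an assumption made here to obtain a mean.
    theta_hat = sum(w * y for w, y in zip(w_a, y_star)) / sum(w_a)
    return theta_hat, y_star
```

Only the covariates of sample A and its weights enter the final estimate; sample B contributes solely through the donated y-values.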

5 Mass Imputation for data integration

Basic Steps
1. Use sample B to estimate the conditional distribution $f(y \mid x)$.
2. Predict y-values for sample A using the estimated conditional distribution.

If $f(y \mid x)$ is correctly specified and MAR holds, then the mass imputation estimator is unbiased.
Question: How to estimate the variance of the mass imputation estimator?

6 2. Proposed method

Assume
$$y_i = m(x_i; \beta) + e_i \qquad (1)$$
for some $\beta$ with known function $m(\cdot)$, with $E(e_i \mid x_i) = 0$.
We assume that $\hat\beta$ is the unique solution to
$$\hat U(\beta) \equiv \sum_{i \in B} \{y_i - m(x_i; \beta)\}\, h(x_i; \beta) = 0 \qquad (2)$$
for some p-dimensional vector $h(x_i; \beta)$.
Thus, we use the observations in sample B to obtain $\hat\beta$ and then use it to construct $\hat y_i = m(x_i; \hat\beta)$ for all $i \in A$.
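For a concrete picture of the method, the sketch below takes the linear case $m(x_i; \beta) = x_i^\top \beta$ with $h(x_i; \beta) = x_i$, for which the estimating equation (2) is the least squares normal equation. The function names and the choice of NumPy are assumptions made for illustration; a nonlinear $m$ would need an iterative solver instead.

```python
import numpy as np

def fit_beta_linear(x_b, y_b):
    """Solve sum_{i in B} (y_i - x_i' beta) x_i = 0, i.e. ordinary least squares."""
    beta, *_ = np.linalg.lstsq(x_b, y_b, rcond=None)
    return beta

def mass_imputation_mean(x_a, w_a, beta, n_pop):
    """Return N^{-1} sum_{i in A} w_i m(x_i; beta) and the imputed values."""
    y_hat = x_a @ beta  # imputed values for every unit of sample A
    return float(np.sum(w_a * y_hat) / n_pop), y_hat
```

A typical call fits `beta = fit_beta_linear(x_b, y_b)` on sample B and then passes the design matrix and weights of sample A to `mass_imputation_mean`; the y-values of sample B are not needed after the fit.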

7 Theorem

Suppose that model (1) and the MAR condition hold. Under some regularity conditions, the mass imputation estimator
$$\bar y_I = \frac{1}{N} \sum_{i \in A} w_i\, m(x_i; \hat\beta) \qquad (3)$$
satisfies
$$\bar y_I = \tilde y_I(\beta_0) + o_p(n_B^{-1/2}), \qquad (4)$$
where
$$\tilde y_I(\beta) = N^{-1} \sum_{i \in A} w_i\, m(x_i; \beta) + n_B^{-1} \sum_{i \in B} \{y_i - m(x_i; \beta)\}\, h(x_i; \beta)^\top c,$$
$$c = \Big[ n_B^{-1} \sum_{i \in B} \dot m(x_i; \beta_0)\, h^\top(x_i; \beta_0) \Big]^{-1} N^{-1} \sum_{i=1}^{N} \dot m(x_i; \beta_0),$$
$\beta_0$ is the true value of $\beta$ in (1), and $\dot m(x; \beta) = \partial m(x; \beta)/\partial \beta$.

8 Also,
$$E\{\tilde y_I(\beta_0) - \bar y_N\} = 0, \qquad (5)$$
and
$$V\{\tilde y_I(\beta_0) - \bar y_N\} = V\Big\{ N^{-1} \sum_{i \in A} w_i\, m(x_i; \beta_0) - N^{-1} \sum_{i \in U} m(x_i; \beta_0) \Big\} + E\Big[ n_B^{-2} \sum_{i \in B} E(e_i^2 \mid x_i)\, \{h(x_i; \beta_0)^\top c\}^2 \Big], \qquad (6)$$
where $e_i = y_i - m(x_i; \beta_0)$.

9 Example

Under the special case of the linear model $y_i = x_i^\top \beta + e_i$ with $e_i \sim (0, \sigma_e^2)$, we can use $\hat y_i = x_i^\top \hat\beta$ with $\hat\beta = (\sum_{i \in B} x_i x_i^\top)^{-1} \sum_{i \in B} x_i y_i$ to construct regression mass imputation.
If we assume SRS for sample A, the asymptotic variance in (6) reduces, for a scalar covariate with slope $\beta_1$, to
$$V(\hat\theta_{I,reg}) = \frac{1}{n_A} \beta_1^2 \sigma_x^2 + \frac{1}{n_B} \sigma_e^2 + E\Big\{ \frac{(\bar x_N - \bar x_B)^2}{\sum_{i \in B} (x_i - \bar x_B)^2} \Big\} \sigma_e^2.$$
If sample B were an independent random sample of size $n_B$, then the third term would be of order $O(n_B^{-2})$ and negligible. However, as sample B is a non-probability sample, the third term is not negligible.
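The last two terms of this expression can be traced back to the second part of (6). The derivation below is a sketch under assumptions that the slide does not spell out: a scalar covariate with an intercept, written here with the illustrative symbol $z_i = (1, x_i)^\top$, the choice $h(x_i; \beta) = \dot m(x_i; \beta) = z_i$, homoscedastic errors $E(e_i^2 \mid x_i) = \sigma_e^2$, and the shorthand $s_B^2 = n_B^{-1} \sum_{i \in B} (x_i - \bar x_B)^2$.

$$
\begin{aligned}
c &= \Big[ n_B^{-1} \sum_{i \in B} z_i z_i^\top \Big]^{-1} (1, \bar x_N)^\top, \\
z_i^\top c &= 1 + \frac{(x_i - \bar x_B)(\bar x_N - \bar x_B)}{s_B^2}, \\
\sum_{i \in B} (z_i^\top c)^2 &= n_B + \frac{n_B (\bar x_N - \bar x_B)^2}{s_B^2}, \\
n_B^{-2} \sum_{i \in B} \sigma_e^2 (z_i^\top c)^2 &= \frac{\sigma_e^2}{n_B} + \frac{(\bar x_N - \bar x_B)^2}{\sum_{i \in B} (x_i - \bar x_B)^2}\, \sigma_e^2 .
\end{aligned}
$$

The cross term vanishes in the third line because $\sum_{i \in B} (x_i - \bar x_B) = 0$. Taking the expectation of the last line gives the second and third terms of the variance; the first term is the SRS variance of the sample-A mean of the fitted line, with the finite population correction ignored.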

10 3. Variance estimation

For variance estimation of the mass imputation estimator (3), we have only to estimate the variance of the linearized estimator $\tilde y_I(\beta_0)$ in (4).
Since the variance formula can be written as
$$V\{\tilde y_I(\beta_0) - \bar y_N\} = V_A + V_B,$$
where
$$V_A = V\Big\{ N^{-1} \sum_{i \in A} w_i\, m(x_i; \beta_0) - N^{-1} \sum_{i \in U} m(x_i; \beta_0) \Big\},$$
$$V_B = E\Big[ n_B^{-2} \sum_{i \in B} E(e_i^2 \mid x_i)\, \{h(x_i; \beta_0)^\top c\}^2 \Big],$$
we can estimate $V_A$ and $V_B$ separately.

11 To estimate $V_A$, we can use
$$\hat V_A = N^{-2} \sum_{i \in A} \sum_{j \in A} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}}\, w_i\, m(x_i; \hat\beta)\, w_j\, m(x_j; \hat\beta),$$
where $\pi_{ij}$ is the joint inclusion probability for units i and j, which is assumed to be positive.
To estimate $V_B$, we can use
$$\hat V_B = n_B^{-2} \sum_{i \in B} \hat e_i^2\, \{h(x_i; \hat\beta)^\top \hat c\}^2, \qquad (7)$$
where $\hat e_i = y_i - m(x_i; \hat\beta)$ and
$$\hat c = \Big[ n_B^{-1} \sum_{i \in B} \dot m(x_i; \hat\beta)\, h^\top(x_i; \hat\beta) \Big]^{-1} N^{-1} \sum_{i \in A} w_i\, \dot m(x_i; \hat\beta).$$
Hence, the variance of $\bar y_{I,reg}$ can be estimated by $\hat V(\bar y_{I,reg}) = \hat V_A + \hat V_B$.
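The two pieces of the linearization variance estimator can be computed directly from their definitions. The sketch below is illustrative only: it assumes the linear model with $h(x_i; \beta) = \dot m(x_i; \beta) = x_i$, design weights $w_i = 1/\pi_i$, and a matrix `pi_joint` of joint inclusion probabilities whose diagonal holds the $\pi_i$. The function names are hypothetical.

```python
import numpy as np

def v_a_hat(y_hat, pi, pi_joint, n_pop):
    """Design-based part: double sum over sample A with w_i = 1 / pi_i."""
    t = y_hat / pi                                   # w_i * m(x_i; beta_hat)
    delta = (pi_joint - np.outer(pi, pi)) / pi_joint
    return float(t @ delta @ t) / n_pop**2

def v_b_hat(x_a, w_a, x_b, y_b, beta_hat, n_pop):
    """Model-based part, equation (7), for m(x; beta) = x' beta and h = x."""
    n_b = len(y_b)
    resid = y_b - x_b @ beta_hat                     # e_hat_i
    m_dot_bar = (w_a[:, None] * x_a).sum(axis=0) / n_pop
    c_hat = np.linalg.solve(x_b.T @ x_b / n_b, m_dot_bar)
    return float(np.sum(resid**2 * (x_b @ c_hat) ** 2)) / n_b**2
```

The total variance estimate is the sum of the two returned values. Note that `v_b_hat` needs the individual records of sample B, which is the practical limitation that motivates a replication approach.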

12 Remark

If $n_A/n_B = o(1)$, then $V_B$ is of smaller order than $V_A$ and the total variance is dominated by $V_A$. Otherwise, both variances contribute to the total variance.

If sample B is big data, $n_B$ is huge and $V_B$ can be safely ignored.

However, to compute $\hat V_B$ in (7), we use the individual observations $(x_i, y_i)$ in sample B, which are not necessarily available when only sample A with mass imputation is released to the public.

Note that the goal of mass imputation is to produce a representative sample with synthetic observations using sample B as training data. Once the mass imputation is performed, the training data are no longer necessary for computing the point estimate. So, it is desirable to develop a variance estimation method that does not require access to the sample B observations.

13 4. Replication variance estimation

We consider a bootstrap method for variance estimation that creates replicated synthetic data $\{\hat y_i^{(k)}; i \in A\}$ corresponding to each set of bootstrap weights $\{w_i^{(k)}; i \in A\}$ associated with sample A only.

This method enables the user to correctly estimate the variance of the mass imputation estimator $\bar y_{I,reg}$ without access to the training data $\{(y_i, x_i) : i \in B\}$ from sample B.

The data file will contain additional columns of $\{\hat y_i^{(k)}; i \in A\}$ associated with the columns of weights $\{w_i^{(k)}; i \in A\}$ ($k = 1, \ldots, L$), where L is the number of replicates created from sample A.

14 Kim & Rao (2012) developed a replication method for survey integration when sample B is also a probability sample.

1. Obtain $\hat\beta^{(k)}$, the k-th replicate of $\hat\beta$, by solving the same estimating equation for $\beta$ using the replication weights for sample B.
2. The k-th replicate of the mass imputation estimator $\bar y_{I,reg} = \sum_{i \in A} w_i \hat y_i$ is
$$\bar y_{I,reg}^{(k)} = \sum_{i \in A} w_i^{(k)} \hat y_i^{(k)},$$
where $\hat y_i^{(k)} = m(x_i; \hat\beta^{(k)})$.

How to modify the method of Kim & Rao (2012) when sample B is a non-probability sample?

15 In order to develop a valid bootstrap method for the mass imputation estimator $\bar y_{I,reg}$ in (3), it is critical to develop a valid bootstrap method for estimating $V(\hat\beta)$ when $\hat\beta$ is computed from (2).

Note that, under assumption (1) and MAR, we can obtain
$$V(\hat\beta) \doteq (J^\top)^{-1}\, \Omega\, J^{-1}, \qquad (8)$$
where
$$J = E\Big\{ n_B^{-1} \sum_{i \in B} \dot m_i h_i^\top \Big\} \quad \text{and} \quad \Omega = E\Big\{ n_B^{-2} \sum_{i \in B} E(e_i^2 \mid x_i)\, h_i h_i^\top \Big\},$$
with $\dot m_i = \dot m(x_i; \beta_0)$ and $h_i = h(x_i; \beta_0)$.

In (8), the reference distribution is the joint distribution of the superpopulation model and the unknown sampling mechanism for sample B.
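A plug-in version of the sandwich formula (8) is easy to compute in the linear case. The sketch below assumes $m(x_i; \beta) = x_i^\top \beta$ and $h_i = \dot m_i = x_i$, so that $J$ is symmetric and the estimator coincides with the familiar heteroscedasticity-robust (HC0) covariance of least squares. The function name is hypothetical, and the expectations in $J$ and $\Omega$ are replaced by their sample-B counterparts with residuals in place of $e_i$.

```python
import numpy as np

def sandwich_variance(x_b, y_b):
    """Plug-in estimate of V(beta_hat) in (8) for the linear model with h = x."""
    n_b = len(y_b)
    beta_hat, *_ = np.linalg.lstsq(x_b, y_b, rcond=None)
    resid = y_b - x_b @ beta_hat
    j_hat = x_b.T @ x_b / n_b                           # n_B^{-1} sum x_i x_i'
    omega_hat = (x_b * resid[:, None] ** 2).T @ x_b / n_b**2
    j_inv = np.linalg.inv(j_hat)
    return beta_hat, j_inv.T @ omega_hat @ j_inv
```

No sampling weights for sample B appear anywhere in this computation, which reflects the point that the formula has the same form as under simple random sampling.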

16 Interestingly, the variance formula in (8) is exactly equal to the variance of $\hat\beta$ when sample B is selected by simple random sampling (SRS). That is, even though the sample design for sample B is not SRS, its effect on the variance of $\hat\beta$ is essentially the same as under SRS.

Roughly speaking, the MAR assumption makes the effect of the sampling design ignorable for estimating $\beta$, even though it is still not ignorable for estimating $\bar Y_N$.

Therefore, we can safely ignore the sampling design for sample B when estimating $\beta$ and develop a valid bootstrap method for variance estimation of $\hat\beta$ using the bootstrap method for SRS.

17 Thus, the proposed bootstrap method can be described in the following steps:

1. Treating sample B as a simple random sample, generate the k-th bootstrap sample from sample B to compute $\hat\beta^{(k)}$, the k-th bootstrap replicate of $\hat\beta$, using the same estimation formula (2) applied to the bootstrap sample.
2. Using $\hat\beta^{(k)}$ from Step 1, compute $\hat y_i^{(k)} = m(x_i; \hat\beta^{(k)})$ for each $i \in A$. Using the $\hat y_i^{(k)}$, we obtain the replicated mass imputation estimator $\bar y_{I,reg}^{(k)} = \sum_{i \in A} w_i^{(k)} \hat y_i^{(k)}$.
3. The resulting bootstrap variance estimator of $\bar y_{I,reg}$ is then
$$\hat V_b(\bar y_{I,reg}) = L^{-1} \sum_{k=1}^{L} \big( \bar y_{I,reg}^{(k)} - \bar y_{I,reg} \big)^2. \qquad (9)$$
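The three steps can be sketched as follows for the linear model. Several choices here are assumptions made for illustration and are not taken from the slides: the function name, the division by $N$ to match (3), and in particular the way the bootstrap weights for sample A are generated (sample A is treated as a simple random sample and the weights are multiplied by multinomial resampling counts). A survey with a complex design would supply its own replicate weights instead.

```python
import numpy as np

def bootstrap_variance(x_a, w_a, x_b, y_b, n_pop, n_rep=200, seed=0):
    """Bootstrap variance of the regression mass imputation mean (linear m)."""
    rng = np.random.default_rng(seed)
    n_a, n_b = len(w_a), len(y_b)
    beta_hat, *_ = np.linalg.lstsq(x_b, y_b, rcond=None)
    y_bar = float(np.sum(w_a * (x_a @ beta_hat)) / n_pop)
    reps = np.empty(n_rep)
    for k in range(n_rep):
        # Step 1: resample B with replacement, as if it were an SRS, and refit.
        idx = rng.integers(0, n_b, size=n_b)
        beta_k, *_ = np.linalg.lstsq(x_b[idx], y_b[idx], rcond=None)
        # Bootstrap weights for A (assumption: A is an SRS; multinomial counts).
        w_k = w_a * rng.multinomial(n_a, np.full(n_a, 1.0 / n_a))
        # Step 2: replicate imputed values and replicate estimator.
        reps[k] = np.sum(w_k * (x_a @ beta_k)) / n_pop
    # Step 3: equation (9), centered at the full-sample estimate.
    return y_bar, float(np.mean((reps - y_bar) ** 2))
```

In a public-use file, the loop body would be run once by the data producer, and the columns `w_k` and `x_a @ beta_k` stored for each replicate, so that users can evaluate (9) without seeing sample B.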

18 5. A real data application

Pew Research Center (PRC) data in 2015: a non-probability sample of size n = 9,301 with 56 variables, provided by eight different vendors with unknown sampling and data collection strategies.

The PRC dataset aims to study the relation between people and community.

We choose 9 variables as response variables in our analysis: 8 are binary and 1 is continuous.

We consider two probability samples with common auxiliary variables. The first is the Behavioral Risk Factor Surveillance System (BRFSS) survey data and the second is the Volunteer Supplement survey data from the Current Population Survey (CPS), both collected in

19 Comparison of covariates from three datasets

Table: Estimated Population Mean of Covariates from the Three Samples

Columns: $\hat X_{PRC}$, $\hat X_{BRFSS}$, $\hat X_{CPS}$ (numeric entries are missing from this transcription).

Rows:
Age category: <; >=30,<; >=50,<; >=
Gender: Female
Race: White only; Black only
Origin: Hispanic/Latino
Region: Northeast; South; West
Marital status: Married
Employment: Working; Retired

20 Comparison of covariates from three datasets (Cont'd)

Columns: $\hat X_{PRC}$, $\hat X_{BRFSS}$, $\hat X_{CPS}$ (numeric entries are missing from this transcription; rows marked NA have one entry reported as NA).

Rows:
Education: High school or less; Bachelor's degree and above; Bachelor's degree (NA); Postgraduate (NA)
Household: Presence of child in household (NA); Home ownership (NA)
Health: Smoke everyday (NA); Smoke never (NA)
Financial status: No money to see doctors (NA); Having medical insurance (NA); Household income < 20K (NA); Household income > 100K (NA)
Volunteer works: Volunteered (NA)

21 Remark

There are noticeable differences between the naive estimates from the PRC sample and the estimates from the two probability samples for covariates such as Origin (Hispanic/Latino), Education (High school or less), Household (with children), Health (Smoking) and Volunteer works.

This is strong evidence that the PRC dataset is not a representative sample of the population.

22 Mass Imputation using a single set of common covariates

Table: Estimated Population Mean Using A Single Set of Common Covariates

Columns: Binary response y, sample, $\hat\theta$, $v_l$ ($\times 10^{-5}$), $v_b$ ($\times 10^{-5}$) (numeric entries are missing from this transcription).

Each response has one row for each of the PRC, BRFSS and CPS samples:
Talked with neighbours frequently
Tended to trust neighbours
Expressed opinions at a government level
Voted local elections

23 Mass Imputation using a single set of common covariates

Table: Estimated Population Mean Using A Single Set of Common Covariates

Columns: Binary response y, sample, $\hat\theta$, $v_l$ ($\times 10^{-5}$), $v_b$ ($\times 10^{-5}$) (numeric entries are missing from this transcription).

Each response has one row for each of the PRC, BRFSS and CPS samples:
Participated in school groups
Participated in service organizations
Participated in sports organizations
No money to buy food

24 Mass Imputation using a single set of common covariates

Table: Estimated Population Mean Using A Single Set of Common Covariates

Columns: Continuous response y, sample, $\hat\theta_I$, $v_l$ ($\times 10^{-2}$), $v_b$ ($\times 10^{-2}$) (numeric entries are missing from this transcription).

The response has one row for each of the PRC, BRFSS and CPS samples:
Days had at least one drink last month

25 Discussion

There are substantial discrepancies between the mass imputation estimator and the naive estimator in most cases.

The mass imputation estimates obtained with two different probability samples are comparable for all cases.

The two variance estimators obtained by using the linearization and the bootstrap methods generally agree with each other.

26 6. Conclusion

Using non-probability sample data as a training set for prediction, we can implement mass imputation for survey sample data.

A nonparametric model can be used for the imputation model.

Machine learning algorithms can also be used for mass imputation.

A promising area of research.

27 REFERENCES

Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika 99,

Rivers, D. (2007). Sampling for web surveys. In Proceedings of the Survey Research Methods Section of the American Statistical Association.


More information

IEOR165 Discussion Week 5

IEOR165 Discussion Week 5 IEOR165 Discussion Week 5 Sheng Liu University of California, Berkeley Feb 19, 2016 Outline 1 1st Homework 2 Revisit Maximum A Posterior 3 Regularization IEOR165 Discussion Sheng Liu 2 About 1st Homework

More information

Chapter 4. Replication Variance Estimation. J. Kim, W. Fuller (ISU) Chapter 4 7/31/11 1 / 28

Chapter 4. Replication Variance Estimation. J. Kim, W. Fuller (ISU) Chapter 4 7/31/11 1 / 28 Chapter 4 Replication Variance Estimation J. Kim, W. Fuller (ISU) Chapter 4 7/31/11 1 / 28 Jackknife Variance Estimation Create a new sample by deleting one observation n 1 n n ( x (k) x) 2 = x (k) = n

More information

Lecture 4 Multiple linear regression

Lecture 4 Multiple linear regression Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters

More information

Selective Inference for Effect Modification

Selective Inference for Effect Modification Inference for Modification (Joint work with Dylan Small and Ashkan Ertefaie) Department of Statistics, University of Pennsylvania May 24, ACIC 2017 Manuscript and slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/.

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Linear Regression With Special Variables

Linear Regression With Special Variables Linear Regression With Special Variables Junhui Qian December 21, 2014 Outline Standardized Scores Quadratic Terms Interaction Terms Binary Explanatory Variables Binary Choice Models Standardized Scores:

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions

Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions Joe Schafer Office of the Associate Director for Research and Methodology U.S. Census

More information

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse Nanhua Zhang Division of Biostatistics & Epidemiology Cincinnati Children s Hospital Medical Center (Joint work

More information

Robust Bayesian Variable Selection for Modeling Mean Medical Costs

Robust Bayesian Variable Selection for Modeling Mean Medical Costs Robust Bayesian Variable Selection for Modeling Mean Medical Costs Grace Yoon 1,, Wenxin Jiang 2, Lei Liu 3 and Ya-Chen T. Shih 4 1 Department of Statistics, Texas A&M University 2 Department of Statistics,

More information

Job Training Partnership Act (JTPA)

Job Training Partnership Act (JTPA) Causal inference Part I.b: randomized experiments, matching and regression (this lecture starts with other slides on randomized experiments) Frank Venmans Example of a randomized experiment: Job Training

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

A Course in Applied Econometrics. Lecture 2 Outline. Estimation of Average Treatment Effects. Under Unconfoundedness, Part II

A Course in Applied Econometrics. Lecture 2 Outline. Estimation of Average Treatment Effects. Under Unconfoundedness, Part II A Course in Applied Econometrics Lecture Outline Estimation of Average Treatment Effects Under Unconfoundedness, Part II. Assessing Unconfoundedness (not testable). Overlap. Illustration based on Lalonde

More information

Even Simpler Standard Errors for Two-Stage Optimization Estimators: Mata Implementation via the DERIV Command

Even Simpler Standard Errors for Two-Stage Optimization Estimators: Mata Implementation via the DERIV Command Even Simpler Standard Errors for Two-Stage Optimization Estimators: Mata Implementation via the DERIV Command by Joseph V. Terza Department of Economics Indiana University Purdue University Indianapolis

More information

Table B1. Full Sample Results OLS/Probit

Table B1. Full Sample Results OLS/Probit Table B1. Full Sample Results OLS/Probit School Propensity Score Fixed Effects Matching (1) (2) (3) (4) I. BMI: Levels School 0.351* 0.196* 0.180* 0.392* Breakfast (0.088) (0.054) (0.060) (0.119) School

More information

Low-Income African American Women's Perceptions of Primary Care Physician Weight Loss Counseling: A Positive Deviance Study

Low-Income African American Women's Perceptions of Primary Care Physician Weight Loss Counseling: A Positive Deviance Study Thomas Jefferson University Jefferson Digital Commons Master of Public Health Thesis and Capstone Presentations Jefferson College of Population Health 6-25-2015 Low-Income African American Women's Perceptions

More information

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid Applied Economics Regression with a Binary Dependent Variable Department of Economics Universidad Carlos III de Madrid See Stock and Watson (chapter 11) 1 / 28 Binary Dependent Variables: What is Different?

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Truncation and Censoring

Truncation and Censoring Truncation and Censoring Laura Magazzini laura.magazzini@univr.it Laura Magazzini (@univr.it) Truncation and Censoring 1 / 35 Truncation and censoring Truncation: sample data are drawn from a subset of

More information

Medical GIS: New Uses of Mapping Technology in Public Health. Peter Hayward, PhD Department of Geography SUNY College at Oneonta

Medical GIS: New Uses of Mapping Technology in Public Health. Peter Hayward, PhD Department of Geography SUNY College at Oneonta Medical GIS: New Uses of Mapping Technology in Public Health Peter Hayward, PhD Department of Geography SUNY College at Oneonta Invited research seminar presentation at Bassett Healthcare. Cooperstown,

More information

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

More information

Globally Estimating the Population Characteristics of Small Geographic Areas. Tom Fitzwater

Globally Estimating the Population Characteristics of Small Geographic Areas. Tom Fitzwater Globally Estimating the Population Characteristics of Small Geographic Areas Tom Fitzwater U.S. Census Bureau Population Division What we know 2 Where do people live? Difficult to measure and quantify.

More information

Empirical Likelihood Methods for Sample Survey Data: An Overview

Empirical Likelihood Methods for Sample Survey Data: An Overview AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use

More information

Advanced Topics in Survey Sampling

Advanced Topics in Survey Sampling Advanced Topics in Survey Sampling Jae-Kwang Kim Wayne A Fuller Pushpal Mukhopadhyay Department of Statistics Iowa State University World Statistics Congress Short Course July 23-24, 2015 Kim & Fuller

More information

emerge Network: CERC Survey Survey Sampling Data Preparation

emerge Network: CERC Survey Survey Sampling Data Preparation emerge Network: CERC Survey Survey Sampling Data Preparation Overview The entire patient population does not use inpatient and outpatient clinic services at the same rate, nor are racial and ethnic subpopulations

More information

Statistical Education - The Teaching Concept of Pseudo-Populations

Statistical Education - The Teaching Concept of Pseudo-Populations Statistical Education - The Teaching Concept of Pseudo-Populations Andreas Quatember Johannes Kepler University Linz, Austria Department of Applied Statistics, Johannes Kepler University Linz, Altenberger

More information

Generating Private Synthetic Data: Presentation 2

Generating Private Synthetic Data: Presentation 2 Generating Private Synthetic Data: Presentation 2 Mentor: Dr. Anand Sarwate July 17, 2015 Overview 1 Project Overview (Revisited) 2 Sensitivity of Mutual Information 3 Simulation 4 Results with Real Data

More information

Weighted Least Squares

Weighted Least Squares Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w

More information

AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE

AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE Statistica Sinica 24 (2014), 1097-1116 doi:http://dx.doi.org/10.5705/ss.2012.074 AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE Sheng Wang 1, Jun Shao

More information

Might using the Internet while travelling affect car ownership plans of Millennials? Dr. David McArthur and Dr. Jinhyun Hong

Might using the Internet while travelling affect car ownership plans of Millennials? Dr. David McArthur and Dr. Jinhyun Hong Might using the Internet while travelling affect car ownership plans of Millennials? Dr. David McArthur and Dr. Jinhyun Hong Introduction Travel habits among Millennials (people born between 1980 and 2000)

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Categorical Predictor Variables

Categorical Predictor Variables Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively

More information

DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA

DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA Statistica Sinica 22 (2012), 149-172 doi:http://dx.doi.org/10.5705/ss.2010.069 DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA Qi Long, Chiu-Hsieh Hsu and Yisheng Li Emory University,

More information

Gov 2002: 4. Observational Studies and Confounding

Gov 2002: 4. Observational Studies and Confounding Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What

More information

Treatment Effects. Christopher Taber. September 6, Department of Economics University of Wisconsin-Madison

Treatment Effects. Christopher Taber. September 6, Department of Economics University of Wisconsin-Madison Treatment Effects Christopher Taber Department of Economics University of Wisconsin-Madison September 6, 2017 Notation First a word on notation I like to use i subscripts on random variables to be clear

More information

Successive Difference Replication Variance Estimation in Two-Phase Sampling

Successive Difference Replication Variance Estimation in Two-Phase Sampling Successive Difference Replication Variance Estimation in Two-Phase Sampling Jean D. Opsomer Colorado State University Michael White US Census Bureau F. Jay Breidt Colorado State University Yao Li Colorado

More information

Chapter 2 The Simple Linear Regression Model: Specification and Estimation

Chapter 2 The Simple Linear Regression Model: Specification and Estimation Chapter The Simple Linear Regression Model: Specification and Estimation Page 1 Chapter Contents.1 An Economic Model. An Econometric Model.3 Estimating the Regression Parameters.4 Assessing the Least Squares

More information

Gender Recognition Act Survey ONLINE Fieldwork: 19th-21st October 2018

Gender Recognition Act Survey ONLINE Fieldwork: 19th-21st October 2018 Recognition Act Survey ONLINE Fieldwork: thst October 0 Table Q. Do you think that those who wish to legally change their gender on official documentation (e.g. birth certificate, passport) should or should

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

Bias-Variance in Machine Learning

Bias-Variance in Machine Learning Bias-Variance in Machine Learning Bias-Variance: Outline Underfitting/overfitting: Why are complex hypotheses bad? Simple example of bias/variance Error as bias+variance for regression brief comments on

More information

Graybill Conference Poster Session Introductions

Graybill Conference Poster Session Introductions Graybill Conference Poster Session Introductions 2013 Graybill Conference in Modern Survey Statistics Colorado State University Fort Collins, CO June 10, 2013 Small Area Estimation with Incomplete Auxiliary

More information