Weighting Missing Data Coding and Data Preparation Wrap-up Preview of Next Time. Data Management

Similar documents
Introduction to Survey Data Analysis

Sampling Techniques and Questionnaire Design

FCE 3900 EDUCATIONAL RESEARCH LECTURE 8 P O P U L A T I O N A N D S A M P L I N G T E C H N I Q U E

MN 400: Research Methods. CHAPTER 7 Sample Design

How to Use the Internet for Election Surveys

Survey Sample Methods

ICES training Course on Design and Analysis of Statistically Sound Catch Sampling Programmes

MISSING or INCOMPLETE DATA

Topic 3 Populations and Samples

Part 7: Glossary Overview

Nonresponse weighting adjustment using estimated response probability

Lecture 5: Sampling Methods

Model Assisted Survey Sampling

TECH 646 Analysis of Research in Industry and Technology

Module 16. Sampling and Sampling Distributions: Random Sampling, Non Random Sampling

ECON1310 Quantitative Economic and Business Analysis A

Now we will define some common sampling plans and discuss their strengths and limitations.

Sampling. Module II Chapter 3

Data Collection: What Is Sampling?

Business Statistics: A First Course

Unit 3 Populations and Samples

Advising on Research Methods: A consultant's companion. Herman J. Ader Gideon J. Mellenbergh with contributions by David J. Hand

BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition, International Publication,

Sampling : Error and bias

New Developments in Nonresponse Adjustment Methods

Time-Invariant Predictors in Longitudinal Models

Lecturer: Dr. Adote Anum, Dept. of Psychology Contact Information:

SYA 3300 Research Methods and Lab Summer A, 2000

Statistics for Managers Using Microsoft Excel 5th Edition

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

POPULATION AND SAMPLE

Known unknowns : using multiple imputation to fill in the blanks for missing data

Basics of Modern Missing Data Analysis

Matching. Quiz 2. Matching. Quiz 2. Exact Matching. Estimand 2/25/14

SAMPLING- Method of Psychology. By- Mrs Neelam Rathee, Dept of Psychology. PGGCG-11, Chandigarh.

Fundamentals of Applied Sampling

Biostat 2065 Analysis of Incomplete Data

MISSING or INCOMPLETE DATA

Prepared by: DR. ROZIAH MOHD RASDI Faculty of Educational Studies Universiti Putra Malaysia

Lesson 6 Population & Sampling

Formalizing the Concepts: Simple Random Sampling. Juan Muñoz Kristen Himelein March 2012

Time-Invariant Predictors in Longitudinal Models

Fractional Imputation in Survey Sampling: A Comparative Review

Sampling. Sampling. Sampling. Sampling. Population. Sample. Sampling unit

Formalizing the Concepts: Simple Random Sampling. Juan Muñoz Kristen Himelein March 2013

Regression Analysis in R

TECH 646 Analysis of Research in Industry and Technology

A comparison of weighted estimators for the population mean. Ye Yang Weighting in surveys group

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

No is the Easiest Answer: Using Calibration to Assess Nonignorable Nonresponse in the 2002 Census of Agriculture

Analyzing Pilot Studies with Missing Observations

Sampling Techniques. Esra Akdeniz. February 9th, 2016

Ch. 16 SAMPLING DESIGNS AND SAMPLING PROCEDURES

3. When a researcher wants to identify particular types of cases for in-depth investigation; purpose less to generalize to larger population than to g

Sampling distributions and the Central Limit. Theorem. 17 October 2016

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

CHOOSING THE RIGHT SAMPLING TECHNIQUE FOR YOUR RESEARCH. Awanis Ku Ishak, PhD SBM

Sample size and Sampling strategy

Multidimensional Control Totals for Poststratified Weights

Comparing Group Means When Nonresponse Rates Differ

2 Naïve Methods. 2.1 Complete or available case analysis

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood

Taking into account sampling design in DAD. Population SAMPLING DESIGN AND DAD

Inferential Statistics. Chapter 5

Sampling from Finite Populations Jill M. Montaquila and Graham Kalton Westat 1600 Research Blvd., Rockville, MD 20850, U.S.A.

6. Fractional Imputation in Survey Sampling

Why Sample? Selecting a sample is less time-consuming than selecting every item in the population (census).

What is Survey Weighting? Chris Skinner University of Southampton

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

Deriving indicators from representative samples for the ESF

Discussing Effects of Different MAR-Settings

Chapter Goals. To introduce you to data collection

MMWS Software Program Manual

TWO-WAY CONTINGENCY TABLES UNDER CONDITIONAL HOT DECK IMPUTATION

Some methods for handling missing values in outcome variables. Roderick J. Little

Log-linear Modelling with Complex Survey Data

The propensity score with continuous treatments

GALLUP NEWS SERVICE JUNE WAVE 1

STA 291 Lecture 16. Normal distributions: ( mean and SD ) use table or web page. The sampling distribution of and are both (approximately) normal

SAS/STAT 14.2 User s Guide. Introduction to Survey Sampling and Analysis Procedures

Chapter 10. Theory and Practice of Sampling. Business Research Methods Verónica Rosendo Ríos Enrique Pérez del Campo Marketing Research

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

Sampling in Space and Time. Natural experiment? Analytical Surveys

Research Methods in Environmental Science

Introduction to Survey Sampling

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Applied Statistics in Business & Economics, 5 th edition

104 Business Research Methods - MCQs

analysis of incomplete data in statistical surveys

VARIANCE ESTIMATION FOR NEAREST NEIGHBOR IMPUTATION FOR U.S. CENSUS LONG FORM DATA

An Overview of the Pros and Cons of Linearization versus Replication in Establishment Surveys

Training and Technical Assistance Webinar Series Statistical Analysis for Criminal Justice Research

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Sociology 6Z03 Review I

Data Integration for Big Data Analysis for finite population inference

Use of Matching Methods for Causal Inference in Experimental and Observational Studies. This Talk Draws on the Following Papers:

Successive Difference Replication Variance Estimation in Two-Phase Sampling

Transcription:

Data Management Department of Political Science and Government Aarhus University November 24, 2014

Data Management Weighting Handling missing data Categorizing missing data types Imputation Summary measures Scale construction Combining question branches Coding and editing Open-ended questions Marking problematic data Data preparation Codebook creation File formats Archiving, access, and rights

1 Weighting 2 Missing Data 3 Coding and Data Preparation 4 Wrap-up 5 Preview of Next Time

1 Weighting 2 Missing Data 3 Coding and Data Preparation 4 Wrap-up 5 Preview of Next Time

Goal of Survey Research The goal of survey research is to estimate population-level quantities (e.g., means, proportions, totals) Samples estimate those quantities with uncertainty (sampling error) Sample estimates are unbiased if they match population quantities

Realities of Survey Research Sample may not match population for a variety of reasons: Due to constraints on design Due to sampling frame coverage Due to intentional over/under-sampling Due to nonresponse Due to sampling error

Realities of Survey Research Sample may not match population for a variety of reasons: Due to constraints on design Due to sampling frame coverage Due to intentional over/under-sampling Due to nonresponse Due to sampling error Weights can be used to correct a sample

Realities of Survey Research Sample may not match population for a variety of reasons: Due to constraints on design Due to sampling frame coverage Due to intentional over/under-sampling Due to nonresponse Due to sampling error Weights can be used to correct a sample Weighting is never perfect Limited to work with observed variables Rarely have good knowledge of coverage, nonresponse, or sampling error Weighting can increase sampling variance

Three Kinds of Weights Design Weights Nonresponse Weights Post-Stratification Weights

Design Weights Address design-related unequal probability of selection into a sample Applied to complex survey designs: Disproportionate allocation stratified sampling Oversampling of subpopulations Cluster sampling Combinations thereof

Design Weights: Simple Random Sampling Imagine sampling frame of 100,000 units Sample size will be 1,000 What is the probability that a unit in the sampling frame is included in the sample?

Design Weights: Simple Random Sampling Imagine sampling frame of 100,000 units Sample size will be 1,000 What is the probability that a unit in the sampling frame is included in the sample? p = 1000 100,000 =.01

Design Weights: Simple Random Sampling Imagine sampling frame of 100,000 units Sample size will be 1,000 What is the probability that a unit in the sampling frame is included in the sample? p = 1000 100,000 =.01 Design weight for all units is w = 1/p = 100 SRS is self-weighting

Design Weights: Stratified Sample Imagine sampling frame of 100,000 units 90,000 Danes & 10,000 Immigrants Sample size will be 1,000 (proportionate allocation) 900 Danes & 100 Immigrants What is the probability that a unit in the sampling frame is included in the sample?

Design Weights: Stratified Sample Imagine sampling frame of 100,000 units 90,000 Danes & 10,000 Immigrants Sample size will be 1,000 (proportionate allocation) 900 Danes & 100 Immigrants What is the probability that a unit in the sampling frame is included in the sample? p Danish = 900 90,000 =.01 p Imm = 100 10,000 =.01

Design Weights: Stratified Sample Imagine sampling frame of 100,000 units 90,000 Danes & 10,000 Immigrants Sample size will be 1,000 (proportionate allocation) 900 Danes & 100 Immigrants What is the probability that a unit in the sampling frame is included in the sample? p Danish = 900 90,000 =.01 p Imm = 100 10,000 =.01 Design weight for all units is w = 1/p = 100 Proportionate allocation is self-weighting

Design Weights: Stratified Sample Imagine sampling frame of 100,000 units 90,000 Danes & 10,000 Immigrants Sample size will be 1,000 (disproportionate allocation) 500 Danes & 500 Immigrants What is the probability that a unit in the sampling frame is included in the sample?

Design Weights: Stratified Sample Imagine sampling frame of 100,000 units 90,000 Danes & 10,000 Immigrants Sample size will be 1,000 (disproportionate allocation) 500 Danes & 500 Immigrants What is the probability that a unit in the sampling frame is included in the sample? p Danish = 500 90,000 =.0056 p Imm = 500 10,000 =.05

Design Weights: Stratified Sample Imagine sampling frame of 100,000 units 90,000 Danes & 10,000 Immigrants Sample size will be 1,000 (disproportionate allocation) 500 Danes & 500 Immigrants What is the probability that a unit in the sampling frame is included in the sample? p Danish = 500 90,000 =.0056 p Imm = 500 10,000 =.05 Design weights differ across units: w Danish = 1/p Danish = 178.57 w Imm = 1/p Imm = 20 Disproportionate allocation is not self-weighting

Design Weights: Cluster Sample Imagine sampling frame of 1000 units in 5 clusters of varying sizes Sample size will be 10 each from 3 clusters What is the probability that a unit in the sampling frame is included in the sample? p = n clusters /N clusters 1/n cluster = 3 5 1/n cluster

Design Weights: Cluster Sample Imagine sampling frame of 1000 units in 5 clusters of varying sizes Sample size will be 10 each from 3 clusters What is the probability that a unit in the sampling frame is included in the sample? p = n clusters /N clusters 1/n cluster = 3 5 1/n cluster

Design Weights: Cluster Sample Imagine sampling frame of 1000 units in 5 clusters of varying sizes Sample size will be 10 each from 3 clusters What is the probability that a unit in the sampling frame is included in the sample? p = n clusters /N clusters 1/n cluster = 3 5 1/n cluster Design weights differ across units: Clusters are equally likely to be sampled Probability of selection within cluster varies with cluster size Cluster sampling is rarely self-weighting

Nonresponse Weights Correct for nonresponse Require knowledge of nonrespondents on variables that have been measured for respondents Requires data are missing at random Two common methods Weighting classes Propensity score subclassification

Nonresponse Weights: Example Imagine immigrants end up being less likely to respond 1 RR Danish = 1.0 RR Imm = 0.8 1 This refers to a lower RR in this particular survey sample, not in general.

Nonresponse Weights: Example Imagine immigrants end up being less likely to respond 1 RR Danish = 1.0 RR Imm = 0.8 Using weighting classes: w rr,danish = 1/1 = 1 w rr,imm = 1/0.8 = 1.25 Can generalize to multiple variables and strata 1 This refers to a lower RR in this particular survey sample, not in general.

Post-Stratification Correct for nonresponse, coverage errors, and sampling errors

Post-Stratification Correct for nonresponse, coverage errors, and sampling errors Reweight sample data to match population distributions Divide sample and population into strata Weight units in each stratum so that the weighted sample stratum contains the same proportion of units as the population stratum does

Post-Stratification Correct for nonresponse, coverage errors, and sampling errors Reweight sample data to match population distributions Divide sample and population into strata Weight units in each stratum so that the weighted sample stratum contains the same proportion of units as the population stratum does There are numerous other related techniques

Post-Stratification: Example Imagine our sample ends up skewed on immigration status and gender relative to the population Group Pop. Sample Rep. Weight Danish, Female.45.5 Danish, Male.45.4 Immigrant, Female.05.07 Immigrant, Male.05.03 PS weight is just w ps = N l /n l

Post-Stratification: Example Imagine our sample ends up skewed on immigration status and gender relative to the population Group Pop. Sample Rep. Weight Danish, Female.45.5 Over Danish, Male.45.4 Under Immigrant, Female.05.07 Over Immigrant, Male.05.03 Under PS weight is just w ps = N l /n l

Post-Stratification: Example Imagine our sample ends up skewed on immigration status and gender relative to the population Group Pop. Sample Rep. Weight Danish, Female.45.5 Over 0.900 Danish, Male.45.4 Under Immigrant, Female.05.07 Over Immigrant, Male.05.03 Under PS weight is just w ps = N l /n l

Post-Stratification: Example Imagine our sample ends up skewed on immigration status and gender relative to the population Group Pop. Sample Rep. Weight Danish, Female.45.5 Over 0.900 Danish, Male.45.4 Under 1.125 Immigrant, Female.05.07 Over Immigrant, Male.05.03 Under PS weight is just w ps = N l /n l

Post-Stratification: Example Imagine our sample ends up skewed on immigration status and gender relative to the population Group Pop. Sample Rep. Weight Danish, Female.45.5 Over 0.900 Danish, Male.45.4 Under 1.125 Immigrant, Female.05.07 Over 0.714 Immigrant, Male.05.03 Under PS weight is just w ps = N l /n l

Post-Stratification: Example Imagine our sample ends up skewed on immigration status and gender relative to the population Group Pop. Sample Rep. Weight Danish, Female.45.5 Over 0.900 Danish, Male.45.4 Under 1.125 Immigrant, Female.05.07 Over 0.714 Immigrant, Male.05.03 Under 1.667 PS weight is just w ps = N l /n l

Post-Stratification Should only be done after correcting for sampling design Strata must be large (n > 15) Need accurate population-level stratum sizes Only useful if stratifying variables are related to key constructs of interest

Post-Stratification Should only be done after correcting for sampling design Strata must be large (n > 15) Need accurate population-level stratum sizes Only useful if stratifying variables are related to key constructs of interest This is the basis for inference in non-probability samples Probability samples make design-based inferences

Questions about weighting?

1 Weighting 2 Missing Data 3 Coding and Data Preparation 4 Wrap-up 5 Preview of Next Time

Sources of Missing Data Unit or item nonresponse Attrition or break-off Data loss

Effects of Missing Data Sampling variance and effective sample size

Effects of Missing Data Sampling variance and effective sample size Scale construction and multi-variate analysis

Effects of Missing Data Sampling variance and effective sample size Scale construction and multi-variate analysis Bias in estimates

Imputation Definition

Imputation Definition: Systematic replacement of missing values

Imputation Definition: Systematic replacement of missing values Why? Casewise deletion creates loss of information Preserve sampling variances (i.e., no loss of precision)

Imputation Definition: Systematic replacement of missing values Why? Casewise deletion creates loss of information Preserve sampling variances (i.e., no loss of precision) Considerations Why are data missing? How do we impute? What are the consequences of imputation?

Missing Data Assumptions Missing Completely At Random (MCAR) Missing At Random (MAR) Missing Not At Random (MNAR)

Imputation Methods Single Imputation

Imputation Methods Single Imputation Mean imputation Top/bottom category imputation Random imputation Hot deck imputation Regression imputation

Imputation Methods Single Imputation Mean imputation Top/bottom category imputation Random imputation Hot deck imputation Regression imputation Multiple Imputation

Imputation Methods Single Imputation Mean imputation Top/bottom category imputation Random imputation Hot deck imputation Regression imputation Multiple Imputation Single imputation multiple times, combining results across data sets Can apply numerous imputation methods Accounts for uncertainty due to missingness

1 Weighting 2 Missing Data 3 Coding and Data Preparation 4 Wrap-up 5 Preview of Next Time

Coding What is coding? Categorizing responses Assigning numeric values to categories

Coding What is coding? Categorizing responses Assigning numeric values to categories When in the data collection process do we code? In the field After data collection

Coding What is coding? Categorizing responses Assigning numeric values to categories When in the data collection process do we code? In the field After data collection How do we code? Create set of exhaustive, mutually exclusive categories Assign responses to categories Add new categories, as needed

Practice Coding 1 Code the Gordon Brown responses as: Correct Incorrect Don t know 2 Code the MIP responses into issue categories

Data Editing Leftover of manual data recording Software handles most data editing now Online survey tools (e.g., Qualtrics) CATI systems May still have problematic data points that need to be marked or changed If still in field, may clarify answers with respondents

Anonymizing Data Why do data need to be anonymous?

Anonymizing Data Why do data need to be anonymous? Guarantees of anonymity Sensitive data

Anonymizing Data Why do data need to be anonymous? Guarantees of anonymity Sensitive data When are data non-anonymous?

Anonymizing Data Why do data need to be anonymous? Guarantees of anonymity Sensitive data When are data non-anonymous? Identifying information Statistical identifiability

Anonymizing Data Why do data need to be anonymous? Guarantees of anonymity Sensitive data When are data non-anonymous? Identifying information Statistical identifiability How do we anonymize?

Anonymizing Data Why do data need to be anonymous? Guarantees of anonymity Sensitive data When are data non-anonymous? Identifying information Statistical identifiability How do we anonymize? Restrict data access Remove identifying variables

Data Storage, Archiving, and Sharing In what formats can we store survey data?

Data Storage, Archiving, and Sharing In what formats can we store survey data? Paper Punchcards Digitally

Data Storage, Archiving, and Sharing In what formats can we store survey data? Paper Punchcards Digitally Considerations in digital formats Open versus proprietary Human-readable versus machine-readable File sizes Study-level metadata Question-level metadata

Study-level Metadata Title Creator/Author Sponsor Description Date of publication Dates of data collection Population, sampling frame, etc. Sampling design Sample size Recruitment details Mode Rights

Question-level Metadata Response codes Response labels Variable labels Variable names Missing data categories Variable types Mode

Question-level Metadata II Details of randomization or question order Exclusion criteria Source of data (if not from R) Frequencies or summary statistics Interviewer instructions Constraints on responses

Example Codebook 2 Question B 10 DK Which party did you vote for in that election? (Denmark) Variable name and label: prtvtcdk Party voted for in last national election, Denmark Values and categories 01 Socialdemokraterne - the Danish social democtrats 02 Det Radikale Venstre - Danish Social-Liberal Party 03 Det Konservative Folkeparti - Conservative 04 SF Socialistisk Folkeparti - Socialist People s Party 05 Dansk Folkeparti - Danish peoples party 06 Kristendemokraterne - Christian democtrats 07 Venstre, Danmarks Liberale Parti - Venstre 08 Liberal Alliance - Liberal Alliance 09 Enhedslisten - Unity List - The Red-Green Alliance 10 Andet - other 66 Not applicable 77 Refusal 88 Don t know 99 No answer Filter: If code 1 at B9 2 From: http://www.europeansocialsurvey.org/docs/round6/survey/ess6_appendix_a7_e02_0.pdf

File Formats Go to the course website Open data files under Week 11 All contain the same data What do you notice about the different files?

Metadata Standards Most survey data are stored in proprietary formats using codebooks constructed in arbitrary formats This makes it hard to work with survey data There are common standards for metadata Dublin Core (DC) Data Documentation Initiative (DDI)

For Your Project Discuss appropriate file format for data storage/sharing Discuss how data can be used after collection (i.e., rights) Discuss codebook creation When do you create a codebook What goes in your codebook Where do you record study-level metadata

Questions about handling survey data?

1 Weighting 2 Missing Data 3 Coding and Data Preparation 4 Wrap-up 5 Preview of Next Time

Total Survey Error Design-Related Errors Coverage Error Sampling Error Nonresponse Error Adjustment Error

Total Survey Error Design-Related Errors Coverage Error Sampling Error Nonresponse Error Adjustment Error Measurement Errors Construct Validity Measurement Error and Response Biases Processing Error

Total Survey Error Design-Related Errors Coverage Error Sampling Error Nonresponse Error Adjustment Error Measurement Errors Construct Validity Measurement Error and Response Biases Processing Error Our goal: Minimize total error (thus maximizing data quality), within the constraints of time, cost, and other resources

1 Weighting 2 Missing Data 3 Coding and Data Preparation 4 Wrap-up 5 Preview of Next Time

Agenda for next two classes Presentations Prepare questions to get help with Email me if you want to meet (after Dec. 4)