Survey nonresponse and the distribution of income

Similar documents
Survey Nonresponse and the Distribution of Income

kmr: A Command to Correct Survey Weights for Unit Nonresponse using Group s Response Rates

VARIANCE ESTIMATION FOR NEAREST NEIGHBOR IMPUTATION FOR U.S. CENSUS LONG FORM DATA

Population Profiles

Poverty, Inequality and Growth: Empirical Issues

Club Convergence and Clustering of U.S. State-Level CO 2 Emissions

What is Survey Weighting? Chris Skinner University of Southampton

Updating Small Area Welfare Indicators in the Absence of a New Census

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Commuting paradox revisited - Compensation for commutes in two-earner households? Konferenz Verkehrsökonomik und -politik 2017

Introduction to Survey Data Integration

The impact of residential density on vehicle usage and fuel consumption*

Vanessa Nadalin Lucas Mation IPEA - Institute for Applied Economic Research Regional Studies Association Global Conference Fortaleza 2014

extreme weather, climate & preparedness in the american mind

A multivariate multilevel model for the analysis of TIMMS & PIRLS data

More on Roy Model of Self-Selection

An Overview of the Pros and Cons of Linearization versus Replication in Establishment Surveys

UNIVERSITY OF WARWICK. Summer Examinations 2015/16. Econometrics 1

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

Introduction to Survey Data Analysis

Social Vulnerability Index. Susan L. Cutter Department of Geography, University of South Carolina

VALIDATING A SURVEY ESTIMATE - A COMPARISON OF THE GUYANA RURAL FARM HOUSEHOLD SURVEY AND INDEPENDENT RICE DATA

Data Collection. Lecture Notes in Transportation Systems Engineering. Prof. Tom V. Mathew. 1 Overview 1

Multivariate Statistics

Typical information required from the data collection can be grouped into four categories, enumerated as below.

Gini Coefficient. A supplement to Mahlerʼs Guide to Loss Distributions. Exam C. prepared by Howard C. Mahler, FCAS Copyright 2017 by Howard C. Mahler.

Eric V. Slud, Census Bureau & Univ. of Maryland Mathematics Department, University of Maryland, College Park MD 20742

Might using the Internet while travelling affect car ownership plans of Millennials? Dr. David McArthur and Dr. Jinhyun Hong

UK Data Archive Study Number British Social Attitudes, British Social Attitudes 2008 User Guide. 2 Data collection methods...

Robust Non-Parametric Techniques to Estimate the Growth Elasticity of Poverty

Summary Article: Poverty from Encyclopedia of Geography

A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

MEASURING THE SOCIO-ECONOMIC BIPOLARIZATION PHENOMENON

Chapter 11. Regression with a Binary Dependent Variable

CS229 Machine Learning Project: Allocating funds to the right people

DEEP, University of Lausanne Lectures on Econometric Analysis of Count Data Pravin K. Trivedi May 2005

Income Inequality in Missouri 2000

Math 142 Lecture Notes. Section 7.1 Area between curves

Introduction to Econometrics. Assessing Studies Based on Multiple Regression

GROWING APART: THE CHANGING FIRM-SIZE WAGE PREMIUM AND ITS INEQUALITY CONSEQUENCES ONLINE APPENDIX

Measuring Poverty. Introduction

Sidestepping the box: Designing a supplemental poverty indicator for school neighborhoods

TUESDAYS AT APA PLANNING AND HEALTH. SAGAR SHAH, PhD AMERICAN PLANNING ASSOCIATION SEPTEMBER 2017 DISCUSSING THE ROLE OF FACTORS INFLUENCING HEALTH

TWO-WAY CONTINGENCY TABLES UNDER CONDITIONAL HOT DECK IMPUTATION

Electronic Research Archive of Blekinge Institute of Technology


GIS-Based Analysis of the Commuting Behavior and the Relationship between Commuting and Urban Form

A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and

PALS: Neighborhood Identification, City of Frederick, Maryland. David Boston Razia Choudhry Chris Davis Under the supervision of Chao Liu

THE USAGE OF GIS APPLICATIONS IN SURVEYS DEPARTMENT. Tools for planning and managing the work of interviewers

Molinas. June 15, 2018

2011 NATIONAL SURVEY ON DRUG USE AND HEALTH

HOUSEHOLD surveys collect information on incomes,

The Impact of Residential Density on Vehicle Usage and Fuel Consumption: Evidence from National Samples

Known unknowns : using multiple imputation to fill in the blanks for missing data

CIRPÉE Centre interuniversitaire sur le risque, les politiques économiques et l emploi

Determining Changes in Welfare Distributions at the Micro-level: Updating Poverty Maps By Chris Elbers, Jean O. Lanjouw, and Peter Lanjouw 1

Multidimensional Control Totals for Poststratified Weights

Dwelling Price Ranking vs. Socio-Economic Ranking: Possibility of Imputation

On Measuring Growth and Inequality. Components of Changes in Poverty. with Application to Thailand

New Developments in Nonresponse Adjustment Methods

Economic Theory of Spatial Costs. of Living Indices with. Application to Thailand

Planned Missingness Designs and the American Community Survey (ACS)

PhD/MA Econometrics Examination January 2012 PART A

COSTA RICA Limon City-Port Project

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

CRP 608 Winter 10 Class presentation February 04, Senior Research Associate Kirwan Institute for the Study of Race and Ethnicity

Poverty and Distributional Outcomes of Shocks and Policies

Interpreting coefficients for transformed variables

2005 Mortgage Broker Regulation Matrix

SOUTH COAST COASTAL RECREATION METHODS

Notes on Heterogeneity, Aggregation, and Market Wage Functions: An Empirical Model of Self-Selection in the Labor Market

Chapter. Organizing and Summarizing Data. Copyright 2013, 2010 and 2007 Pearson Education, Inc.

A simple bivariate count data regression model. Abstract

Poverty and Hazard Linkages

Secondary Towns and Poverty Reduction: Refocusing the Urbanization Agenda

Secondary Towns, Population and Welfare in Mexico

A Joint Tour-Based Model of Vehicle Type Choice and Tour Length

Targeting the Poor. Towards evidence-based implementation. Johan A. Mistiaen (World Bank)

Analyzing Pilot Studies with Missing Observations

BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition, International Publication,

Market access and rural poverty in Tanzania

Incorporating Level of Effort Paradata in Nonresponse Adjustments. Paul Biemer RTI International University of North Carolina Chapel Hill

CIV3703 Transport Engineering. Module 2 Transport Modelling

Plausible Values for Latent Variables Using Mplus

12 Lorenz line and Gini index

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS

NEW YORK DEPARTMENT OF SANITATION. Spatial Analysis of Complaints

A s i a n J o u r n a l o f M u l t i d i m e n s i o n a l R e s e a r c h KUZNETS CURVE: THE CASE OF THE INDIAN ECONOMY

INTRODUCTION TO TRANSPORTATION SYSTEMS

Assisting Countries in the Collection and Analysis of National Statistics

SOCIO-DEMOGRAPHIC INDICATORS FOR REGIONAL POPULATION POLICIES

Effective estimation of population mean, ratio and product in two-phase sampling in presence of random non-response

Two-phase sampling approach to fractional hot deck imputation

COSTA RICA Limon City-Port Project

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010

Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement

Statistics for Managers Using Microsoft Excel 5th Edition

Unless provided with information to the contrary, assume for each question below that the Classical Linear Model assumptions hold.

Transcription:

Survey nonresponse and the distribution of income Emanuela Galasso* Development Research Group, World Bank Module 1. Sampling for Surveys

1: Why are we concerned about non response? 2: Implications for measurement of poverty and inequality 3: Evidence for the US Estimation methods Results 4: An example for China 2

1: Why do we care? 3

Types of nonresponse Item-nonresponse (participation to the survey but non-response on single questions) Imputation methods using matching Lillard et al. (1986); Little and Rubin (1987) 4

Types of nonresponse Item-nonresponse Imputation methods using matching Lillard et al. (1986); Little and Rubin (1987) The idea: Observations with complete data X Yes Y Yes missing data Yes No For sub-sample with complete data: Then impute missing data using: Y = M (X ) Y ˆ = Mˆ ( X ) 5

Types of nonresponse Unit-nonresponse ( non-compliance ) (non-participation to the survey altogether) 6

Unit-nonresponse: possible solutions Ex-ante: Replace non respondents with similar households Increase the sample size to compensate for it Using call-backs, monetary incentives: Van Praag et al. (1983), Alho (1990), Nijman and Verbeek (1992) Ex-post: Corrections by re-weighting the data Use imputation techniques (hot-deck, cold-deck, warm-deck, etc.) to simulate the answers of nonrespondents 7

Unit-nonresponse: possible solutions Ex-ante: Replace nonrespondents with similar households Increase the sample size to compensate for it Using call-backs, monetary incentives: Van Praag et al. (1983), Alho (1990), Nijman and Verbeek (1992) Ex-post: Corrections by re-weighting the data Use imputation techniques (hot-deck, cold-deck, warm-deck, etc.) to simulate the answers of nonrespondents None of the above 8

The best way to deal with unit-nonresponse is to prevent it Lohr, Sharon L. Sampling: Design & Analysis (1999)

Qualification Motivation Training Work Load Data collection method Interviewers Total Nonresponse Availability Socio-economic Type of survey Respondents Burden Economic Motivation Demographic Proxy 10 Source: Some factors affecting Non-Response. by R. Platek. 1977. Survey Methodology. 3. 191-214

Rising concern about unitnonresponse High nonresponse rates of 10-30% are now common LSMS: 0-26% nonresponse (Scott and Steele, 2002) UK surveys: 15-30% US: 10-20% Concerns that the problem might be increasing 11

Nonresponse is a choice, so we need to understand behavior Survey participation is a matter of choice nobody is obliged to comply with the statistician s randomized assignment There is a perceived utility gain from compliance the satisfaction of doing one s civic duty But there is a cost too An income effect can be expected 12

Nonresponse bias in measuring poverty and inequality Compliance is unlikely to be random: Rich people have: higher opportunity cost of time more to hide (tax reasons) more likely to be away from home? multiple earners Poorest might also not comply: alienated from society? homeless 13

2: Implications for poverty and inequality measures 14

Implications for poverty F(y) is the true income distribution, density f(y) Fˆ ( y) is the observed distribution, density fˆ ( y) Note: ˆ ( F( y ) F y = and F( y ) F y ) = 0 p = p ) 0 r ˆ ( = r 15

Implications for poverty F(y) is the true income distribution, density f(y) Fˆ ( y) is the observed distribution, density fˆ ( y) Note: ˆ ( F( y ) F y = and F( y ) F y ) = 0 p = p ) 0 r ˆ ( = r Definition: correction factor w(y) such that: f ( y) = w( y) fˆ( y) y F ( y) = w( x) fˆ( x) dx y p 16

Implications for poverty cont., If compliance falls with income then poverty is overestimated for all measures and poverty lines. i.e., first-order dominance: if w (y) > 0 for all y (y P, y R ), F ( y) < Fˆ ( y) then for all y (y P, y R ) 17

First-order dominance 1 0.8 0.6 w (y) > 0 0.4 0.2 observed corrected 0 0 1 2 3 4 5 6 18

Example Poor Non-poor Estimated distribution (%) 81 19 However, Response rate (%) 90 50 True distribution of 70 30 population (%) Correction factors 0.87 1.56 19

Implications for inequality If compliance falls with income (w (y) > 0) then the implications for inequality are ambiguous Lorenz curves intersect so some inequality measures will show higher inequality, some lower 20

Example of crossing Lorenz Curves 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 0.4 0.6 0.8 1 21

3: Evidence for the U.S. 22

Current Population Survey Source: CPS March supplement, 1998 2002, Census Bureau 3 types of non-interviews: type A: individual refused to respond or could not be reached what we define as non-response type B: housing unit vacant; type C: housing unit demolished we ignore type B/C in our analysis Year total number of households Type A households rate of nonresponse (%) 1998 54,574 4,221 7.73% 1999 55,103 4,318 7.84% 2000 54,763 3,747 6.84% 2001 53,932 4,299 7.97% 2002 84,831 6,566 7.74% All years 303,203 23,151 7.64% 23

Dependence of response rate on income Response rate and average per-capita income for 51 US states, CPS March supplement 2002 State Response Rate Average Income Maryland 86.77% $31,500 District Of Columbia 87.21% $34,999 Alaska 88.16% $26,564 New York 88.61% $26,013 New Jersey 88.71% $28,746 California 89.66% $26,822 Mississippi 95.08% $17,821 Indiana 95.21% $23,909 North Dakota 95.36% $20,154 Georgia 95.66% $23,893 West Virginia 96.65% $18,742 Alabama 97.24% $21,155 24

Dependence of response rate on income Response rate and average per-capita income for 51 US states, CPS March supplement 2002 98% 96% probability of compliance 94% 92% 90% 88% 86% 16000 18000 20000 22000 24000 26000 28000 30000 32000 34000 36000 average state income 25

Estimation method In survey data, the income of non-responding households is by definition unobservable. However, we can observe the survey compliance rates by geographical areas. The observed characteristics of responding households, in conjunction with the observed compliance rates of the areas in which they live, allow one to estimate the household-specific probability of survey response. Thus we can correct for selective compliance by re-weighting the survey data. 26

Estimation method cont., {(X ij, m ij )} set of households in state j s.t. m ij households each carry characteristics X ij, where X ij includes e.g. ln(y ij ), a constant, etc. total number of households in state j: M j representative sample S j in state j with sampled households m j = Σ m ij for each sampled household ε there s a probability of response D εij {0,1} P ( D = 1 X, θ ) = P = logistic( X θ ) εij ij i i 27

Estimation method cont., The observed mass of respondents of group i in state/area j is E ( m ) = obs ij obs mij E [ ] = mij Pi Then summing up for a given j yields: Now let s define: i m obs ij P i = i=1 obs obs obs mij mij mij ψ j ( θ ) { E[ ]} = { m j} P P P i i m i ij w ij P i = m These are the individual weights j i i This is known! 28

Estimation method cont., obs obs obs mij mij mij ψ j ( θ ) { E[ ]} = { m j} P P P i i i i i where obviously [ ( θ )] = 0 E ψ j Then we can estimate θ = arg min Ψ( θ ) ψ W ( θ )' ψ ( θ ) ˆ 1 ( θ ) 29

Estimation method cont., Optimal weighting matrix W = Var(ψ(θ)) Hansen (1982) Assume for single state j: [ ] 2 ψ ( θ ) m σ Var = j j This can be estimated as Finally, Var ( ˆ) 2 θ = ˆ σ [ G' NG] 1 ˆ σ 2 = where ψ j ( θ ) w j 2 ψ G = θ ( θ ) 30

Alternative Specifications Specification Ψ(θ) min θ 1 θ 2 θ 3 1: P = logit (θ 1 + θ 2 ln (y ) + θ 3 ln (y ) 2 ) 27.866 32.55-4.151 0.1193 (85.95) (-15.90) (0.7320) 2: P = logit (θ 1 + θ 2 ln (y )) 27.940 17.81-1.489 (3.51) (-0.329) 3: P = logit (θ 1 + θ 3 ln (y ) 2 ) 28.068 9.551-0.0666 (1.646) (-0.0145) 4: P = logit (θ 2 ln (y ) + θ 3 ln (y ) 2 ) 28.324 1.725-0.1438 (0.289) (-0.0272) 5: P = logit (θ 1 + θ 2 y ) 34.303 2.995-13.11 10-6 6: P = logit (θ 1 + θ 2 y + θ 3 y 2 ) 7: P = logit (θ 1 + θ 2 y + θ 3 ln (y )) (0.202) (-4.76 10-6 ) 28.639 3.792-37.45 10-6 67.21 10-12 (0.463) (-14.92 10-6 ) (58.33 10-12 ) 27.891 19.64 1.889 10-6 -1.671 (12.13) (17.35 10-6 ) (-1.229) 31

Results From Specification 2 P = logit(θ 1 + θ 2 ln(y)) Year Ψ(θ) min θ 1 θ 2 Gini uncorr Gini corr ΔGini 1998 17.321 19.90-1.697 45.49% 50.92% 5.43% (4.58) (-0.43) 1999 21.437 18.10-1.528 45.21% 49.03% 3.82% (4.42) (-0.418) 2000 12.558 22.21-1.890 44.30% 47.67% 3.37% (4.46) (-0.413) 2001 17.793 20.11-1.702 44.99% 49.47% 4.48% (3.82) (-0.355) 2002 27.94 17.81-1.489 44.36% 48.02% 3.66% (3.51) (-0.329) All 102.16 19.47-1.654 44.83% 49.07% 4.24% (1.89) (-0.177) 32

Graph of specification 2: Probability of compliance as a function of income 1 probability of compliance 0.8 0.6 0.4 0.2 0 $0 $50,000 $100,000 $150,000 $200,000 $250,000 Y... income/capita 33

Empirical and Corrected Cumulative Income Distribution 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 observed corrected 0 $0 $50,000 $100,000 $150,000 $200,000 $250,000 34

Income Distribution: Magnification 0.3 0.25 0.2 0.15 0.1 0.05 observed corrected 0 $0 $2,500 $5,000 $7,500 $10,000 35

Correction by Percentile of Income 50% 40% implied correction of income 30% 20% 10% 0% 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 population percentile 36

Empirical and Corrected Lorenz Curve 1 0.9 observed corrected 0.8 0.7 % total income 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 % population 37

Lorenz Curves: Magnification 1 0.9999 observed corrected 0.9998 0.9997 0.9996 0.9995 0.9994 0.9993 0.9992 0.9991 0.999 0.99995 0.999975 1.00000 % population 38

Specifications with Other Variables Specifications 10 18, P = logit(θ 1 + θ 2 ln(y)+ θ 3 X 1 + θ 4 X 2 ): Specification Ψ(θ) min θ 1 θ 2 θ 3 θ 4 Gini corrected ΔGini 2: (baseline) 102.16 19.47-1.654 49.07% 4.24% (1.89) (-0.177) 10: X 1 = age X 2 = age 2 11: X 1 = age 100.19 101.84 17.78 (3.52) 20.17 (2.61) -1.695 (-0.188) -1.679 (-0.201) 0.09321 (0.10569) -0.00888 (-0.01648) -0.00092 (-0.00095) 49.09% 49.23% 4.26% 4.40% 12: X 2 = age 2 101.57 20.11 (2.31) -1.688 (0.198) -0.00010 (-0.00014) 49.26% 4.43% 13: X 1 = (age>64) 14: X 1 = edu X 2 = edu 2 15: X 1 = edu 100.52 99.795 101.15 20.04 (2.06) 25.90 (7.59) 18.71-1.696 (-0.188) -1.469 (-0.334) -1.502-0.6123 (-0.5753) -1.481 (-1.447) -0.08292 0.06235 (0.06456) 49.23% 48.52% 48.68% 4.40% 3.69% 3.85% (2.48) (-0.333) (-0.12667) 16: X 1 = (edu>39) 98.725 18.44 (1.93) -1.456 (-0.233) -1.352 (-1.187) 48.53% 3.70% 17: X 1 = sex 101.00 19.37 (1.92) -1.627 (-0.187) -0.4785 (-0.5315) 48.84% 4.01% 18: X 1 = race 93.353 17.51-1.516 0.5877 48.26% 3.43% (1.96) (-0.183) (0.1592) 19: X 1 = size 100.11 21.51-1.777-0.3102 49.15% 4.32% (2.12) (-0.189) (-0.1316) 20: X 1 = race 91.709 19.15-1.618 0.5672-0.229 48.38% 3.55% X 2 = size (2.16) (-0.189) (0.1574) (-0.1289) 39

4: China 40

Example for China Urban Household Survey of NBS Two stages in sampling Stage 1: Large national random sample with very short questionnairre and high repsonse rate Stage 2: Random sample drawn from Stage 1 sample, given very detailed survey, including daily diary, regular visits etc Use Stage 1 data to model determinants of compliance Then re-weight the data 41

Further reading Korinek, Anton, Johan Mistiaen and Martin Ravallion, An Econometric Method of Correcting for unit Nonresponse Bias in Surveys, Journal of Econometrics, (2007), 136: 213-235 Korinek, Anton, Johan Mistiaen and Martin Ravallion, Survey Nonresponse and the Distribution of Income. Journal of Economic Inequality, (2006), 4:33-55 42