How to Use the Internet for Election Surveys


Simon Jackman and Douglas Rivers
Stanford University and Polimetrix, Inc.
May 9, 2008

Theory and Practice

                        Theory works    Theory doesn't work
Practice works          Great!          Black magic
Practice doesn't work   Too bad         Don't!

What Works in Theory and Practice
- Current Population Survey Biennial Registration and Voting Supplements in November of election years
- High response rate (92% in 2004)
- Overstates registration and voting slightly (about 2-3%)
- Aside from misreporting, it can't have large errors
- By the method of Cochran, Mosteller and Tukey, the error bound is about 7%

2004 National Election Study
- 66% response rate in the pre-election wave, with an 88% reinterview rate, for an overall response rate of 58%
- 89% registration rate and 76% turnout rate are too high
  - Versus CPS estimates of 72% registration and 64% turnout
  - Actual was about 70% registration and 61% VEP turnout
- CMT bounds for NES are typically around ±30%
- But, in practice, it works reasonably well
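The arithmetic behind these figures can be checked directly. The sketch below is a minimal illustration in Python, assuming only the percentages quoted on this slide; the function and variable names are hypothetical and not part of the original presentation.

```python
def overall_response_rate(wave_rate, reinterview_rate):
    """Share of the original sample interviewed in both waves."""
    return wave_rate * reinterview_rate

# NES: 66% pre-election response rate, 88% reinterview rate.
nes_overall = overall_response_rate(0.66, 0.88)

# Registration and turnout figures quoted on the slide (percent).
# "actual" uses the approximate values given there.
actual = {"registration": 70, "turnout": 61}
estimates = {
    "CPS": {"registration": 72, "turnout": 64},
    "NES": {"registration": 89, "turnout": 76},
}

# Overstatement in percentage points for each survey.
overstatement = {
    survey: {key: values[key] - actual[key] for key in actual}
    for survey, values in estimates.items()
}

print(f"NES overall response rate: {nes_overall:.0%}")
print(overstatement)
```

On these numbers the CPS overstates by 2 to 3 points, while the NES overstates registration by 19 points and turnout by 15.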

RDD Telephone Surveys
- Actual response rates for short field periods are very low (15-20%)
- Higher contact (and response) rates possible for longer field periods
- Within-household selection usually non-random
  - Gallup uses quotas
  - ABC/Washington Post uses unequal probabilities of selection, but ignores them in weighting
  - Oldest male/youngest female
- Something that fails in theory, but seems to work in practice

Errors in Democratic Primary Polls in 2008
[Figure: Poll Errors. Error in Obama lead (percentage) plotted against days to election.]

Actual and Theoretical Error Distribution in NH
[Figure: Distribution of Poll Errors. Density of theoretical and actual errors (in SD units).]

Don't believe the MOE!
[Figure: Sample Size and Reported Margin of Error. Reported margin of error plotted against 1/sqrt(n), with reference lines for the M.O.E. with DE = 2.56 and for the M.O.E. with DE = 1 rounded to the nearest integer.]
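The effect of a design effect (DE) on the margin of error can be sketched in a few lines of Python. This assumes the usual 95% normal-approximation formula for a proportion at p = 0.5. The slides do not say how the value 2.56 was obtained, so the Kish approximation shown for computing a design effect from weights is a common convention included for illustration, not the authors' stated method. All function names are hypothetical.

```python
import math

def margin_of_error(n, design_effect=1.0, p=0.5, z=1.96):
    """95% margin of error in percentage points for a proportion p."""
    return 100.0 * z * math.sqrt(design_effect * p * (1.0 - p) / n)

def kish_design_effect(weights):
    """Kish approximation: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)

for n in (400, 600, 1000):
    simple = margin_of_error(n)
    adjusted = margin_of_error(n, design_effect=2.56)
    print(f"n={n}: DE=1 -> {simple:.1f}, DE=2.56 -> {adjusted:.1f}")
```

Because the margin of error scales with the square root of the design effect, DE = 2.56 widens every interval by a factor of 1.6.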

Incorporating a Design Effect for Weighting
[Figure: Assuming Design Effect of 2.56. Density of theoretical and actual errors (in SD units).]

2008 Web Election Surveys
- 2008 is the first year with large Internet-based election surveys
  - NES (dedicated panel, recruited by KN)
  - NAES and AP/Yahoo/Harvard (KN access panel)
  - CCAP, CCES, and campaigns (YG/Polimetrix access panel)

Benefits of RDD Recruitment
- Coverage of non-internet households
- RDD widely accepted as probability-based
- Can calculate a meaningful response rate
- Avoids volunteers

Problems with RDD Recruitment
- Expensive
- Small
- Overuse
- Response rates are low
- Attrition is high
- Practical fixes require abandoning theoretical purity

An Alternative Approach
- Sample selection from a panel with unknown selection probabilities
- Use large opt-in panel
- Availability of large amounts of auxiliary data on both population and panel from voter and consumer databases and high-quality probability samples
- We know a lot about both the target population and the panelists and should use it
- Sample matching used for selection of subsamples from the panel
- Works in both theory and practice

Comparison of RDD with Web Opt-ins

Group       Unweighted RDD   Web Opt-ins   2004 ACS   2003 CPS Internet
Blacks      7.9%             4.3%          11.8%      9.3%
Hispanics   4.8%             3.3%          13.5%      7.2%
Postgrad    17.2%            23.3%         9.4%       14.7%
Age 18-24   6.4%             8.7%          10.3%      16.0%
Male        41.9%            58.8%         48.9%      48.7%
Married     57.7%            60.4%         54.3%      55.3%

Bias and SEs
- Standard errors measure sampling variability, not bias.
- It is possible to calculate SEs without knowing the data generating process.
  - Observations are, at a minimum, exchangeable and usually independent.
- There is no logical difference between nonresponse and self-selection.
  - In both cases, the selection probabilities are unknown.
  - Validity of estimates depends upon untestable modeling assumptions.
- Standard approaches for both RDD and Web panels leave substantial amounts of bias.

Auxiliary Information
- Voter and consumer databases provide a sampling frame for social science research.
- The frame contains a large amount of auxiliary information that can be used for bias reduction.
  - Age, gender, vote history, address, name
  - Home value, children, interests, magazines
- Data are available for everyone, both panelists and population.
- Idea: Draw a sample from the frame and find the closest matching respondents from a panel.

Sample Matching
- Recruit a large reservoir of respondents who are accessible for interviewing (the "panel").
- Obtain a population frame containing auxiliary information for matching.
- Select a target sample from the frame.
- For each unit in the target sample, find the closest matching unit in the reservoir. This is the matched sample.
- Variants:
  - Use a high-quality sample from another source as the target sample.
  - Dynamic matching: match to multiple studies simultaneously using a flow of invitations.

Notation
- $N$ = size of panel
- $n$ = size of sample
- $X$ = covariates ($k$-vector)
- $Y$ = survey measurements
- $Z$ = panel membership indicator
- Panel: $(\tilde X_1, \tilde Y_1), \ldots, (\tilde X_N, \tilde Y_N)$

Distributions and Parameters
- $f_X(x)$ = density of $X$ in the population
- $\tilde f_X(x)$ = density of $X$ conditional on $Z_i = 1$
- $f_{Y|X}(y \mid x)$ = conditional distribution of $Y$ given $X$
- $\tilde f_{Y|X}(y \mid x)$ = conditional distribution in the panel
- $\mu(x) = E(Y \mid X = x) = \int y \, f_{Y|X}(y \mid x) \, dy$
- $\theta_0 = E(Y) = \int \mu(x) \, f_X(x) \, dx$
- $\sigma_1^2(x) = V(Y \mid X = x, Z = 1)$

Assumptions
- IID Data Generating Process: $(X_i, Y_i, Z_i)$ are i.i.d.
- Ignorable Selection: $\tilde f_{Y|X}(y \mid x) = f_{Y|X}(y \mid x)$
- Continuous Covariates: $X$ has a continuous distribution with bounded convex support
- Overlap: The support of $X$ is the same in the panel as in the population, with density bounded away from zero
- Continuity: $\mu(x)$ is Lipschitz continuous
- Regularity: $V(Y \mid X, Z = 1)$ is uniformly bounded

Matching Process
- Target Sample: Choose a (stratified) random sample of size $n$ from the frame, $(X_1, \ldots, X_n)$.
- For each element of the target sample, find the closest matching element $M(i)$ in the panel:
  $M(i) = j$ iff $\|X_i - \tilde X_j\| \le \|X_i - \tilde X_l\|$ for all $l$ in the panel
- Let $X_i^* = \tilde X_{M(i)}$ and $Y_i^* = \tilde Y_{M(i)}$.
- Matched Sample: $(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)$
- Matching Estimator: $\hat\theta = n^{-1} \sum_{i=1}^{n} Y_i^*$
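The matching process can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions: Euclidean distance, brute-force search, and matching with replacement (a panelist may be matched to more than one target unit). The slides do not specify any of these choices, and the function names and toy data are hypothetical.

```python
import numpy as np

def match_nearest(target_x, panel_x):
    """Return M(i): index of the closest panel unit for each target unit.

    target_x has shape (n, k); panel_x has shape (N, k).
    """
    diffs = target_x[:, None, :] - panel_x[None, :, :]   # shape (n, N, k)
    dist2 = (diffs ** 2).sum(axis=2)                     # squared distances
    return dist2.argmin(axis=1)

def matching_estimate(target_x, panel_x, panel_y):
    """Mean of the panel outcomes over the matched sample."""
    matched = match_nearest(target_x, panel_x)
    return panel_y[matched].mean()

# Toy data (hypothetical): three panelists, three target units, k = 1.
panel_x = np.array([[0.0], [1.0], [2.0]])
panel_y = np.array([10.0, 20.0, 30.0])
target_x = np.array([[0.1], [1.9], [2.2]])

print(match_nearest(target_x, panel_x))               # indices M(i)
print(matching_estimate(target_x, panel_x, panel_y))
```

Brute-force search costs O(nN) distance evaluations; for large panels a spatial index would be the natural replacement, and in practice the covariates would need to be put on a common scale before distances are computed.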

Scalar Matching
- The conditional density of $X_i^*$ given $X_i = x$ is
  $N \tilde f_X(x^*) \left[ 1 - \tilde F_X(x + |x^* - x|) + \tilde F_X(x - |x^* - x|) \right]^{N-1}$
  where $\tilde F_X$ is the distribution function of $X$ in the panel.
- Conditional on $X_i = x$, the limiting distribution of $N(X_i^* - x)$ is Laplace with mean zero and variance $1 / (2 \tilde f_X(x)^2)$.
- The approximate distribution of $X_i^* - X_i$ is, thus, Laplace with mean zero and variance $O(1/N^2)$.
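The Laplace limit can be checked numerically. The sketch below is a small simulation assuming a panel drawn from Uniform(0, 1), so the panel density at the target point is 1 and the limiting variance of N times the matching error should be close to 1/2. The uniform panel, the target point 0.5, and all names are illustrative assumptions rather than anything taken from the slides.

```python
import numpy as np

def scaled_match_errors(panel_size, x=0.5, reps=4000, seed=0):
    """Simulate N * (nearest panel value - x) for a Uniform(0, 1) panel."""
    rng = np.random.default_rng(seed)
    panel = rng.uniform(0.0, 1.0, size=(reps, panel_size))
    nearest = np.abs(panel - x).argmin(axis=1)        # closest unit per replication
    matched = panel[np.arange(reps), nearest]
    return panel_size * (matched - x)

errors = scaled_match_errors(panel_size=500)
print("mean:", errors.mean())
print("variance:", errors.var(), "(limit: 0.5)")
```

Doubling the panel size should leave the variance of the scaled error roughly unchanged, which is another way of saying that the unscaled matching error has variance of order 1/N^2.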

Asymptotic Distribution of $X^* - X$
[Figure: density of the limiting distribution.]

Theoretical Results
- When matching on a $\sqrt{n}$-consistent estimator of the propensity score, the matched-sample estimate is consistent and asymptotically normal.
- When using nearest-neighbor matching and the number of matching variables is greater than two, the estimate is consistent, but involves a bias of order $O(n^{k/2}/N)$, where $k$ is the number of matching variables.
- In general, propensity score matching is to be preferred.

Monte Carlo Simulations
- Covariates have different means and covariance structure
  - Support of $X$ is $[-1, 1] \times [-1, 1]$
  - Population distribution is truncated bivariate normal with mean zero, SD 1, and correlation 0.3
  - Panel distribution is truncated bivariate normal with mean (0.2, 0.3), SD 1, and correlation 0.5
- $Y \mid X \sim N(X_1 + X_2/2, 1)$ (ignorable selection)
- Sample size of $n = 1000$
- Panel size of $N$ = 1500, 2000, 5000 or 10000
- 1000 Monte Carlo simulations
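A scaled-down version of this design can be sketched in Python with NumPy. The sketch assumes rejection sampling for the truncated bivariate normals and nearest-neighbor matching with replacement. It uses much smaller sizes than the slide (n = 200, N = 1000, 50 replications) to keep the run short, and it compares only the unmatched panel mean with the matched-sample mean, not the weighted estimators. The true value is zero, since the population distribution is symmetric about the origin. All names are hypothetical.

```python
import numpy as np

def truncated_bvn(rng, size, mean, corr, low=-1.0, high=1.0):
    """Rejection-sample a bivariate normal (SD 1) truncated to a square."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    kept, total = [], 0
    while total < size:
        draw = rng.multivariate_normal(mean, cov, size=4 * size)
        inside = np.all((draw >= low) & (draw <= high), axis=1)
        kept.append(draw[inside])
        total += int(inside.sum())
    return np.vstack(kept)[:size]

def simulate_once(rng, n, N):
    """One replication: (unmatched panel mean, matched-sample mean)."""
    target_x = truncated_bvn(rng, n, mean=[0.0, 0.0], corr=0.3)
    panel_x = truncated_bvn(rng, N, mean=[0.2, 0.3], corr=0.5)
    # Ignorable selection: Y depends on X in the same way in the panel.
    panel_y = panel_x[:, 0] + panel_x[:, 1] / 2 + rng.normal(size=N)
    dist2 = ((target_x[:, None, :] - panel_x[None, :, :]) ** 2).sum(axis=2)
    matched_y = panel_y[dist2.argmin(axis=1)]
    return panel_y.mean(), matched_y.mean()

def run(reps=50, n=200, N=1000, seed=1):
    rng = np.random.default_rng(seed)
    results = np.array([simulate_once(rng, n, N) for _ in range(reps)])
    theta0 = 0.0                                  # true population mean of Y
    bias = results.mean(axis=0) - theta0
    rmse = np.sqrt(((results - theta0) ** 2).mean(axis=0))
    return bias, rmse

bias, rmse = run()
print("bias (panel mean, matched):", bias)
print("RMSE (panel mean, matched):", rmse)
```

Under these assumptions the unmatched panel mean is biased upward because the panel over-represents large values of both covariates, while the matched-sample mean should sit much closer to zero.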

Simulated Bias of Estimators
[Figure: bias plotted against the ratio of reservoir size to sample size (1.5, 2, 3, 5, 10) for four estimators: SRS from target, matched sample, cell weighted matched sample, and propensity score weighted matched sample.]

Simulated RMSE of Estimators
[Figure: RMSE plotted against the ratio of reservoir size to sample size (1.5, 2, 3, 5, 10) for four estimators: SRS from target, matched sample, cell weighted matched sample, and propensity score weighted matched sample.]

Empirical Application: 2006 CCES
- Consortium of 37 universities
- Pre- and post-election interviews
- n = 37,000, N = 129,000
- Frame was 2005 ACS matched to 2004 NEP Exit Poll
- 7-point party ID and 5-point ideology imputed from 2004 NAES
- Matched on age, race, gender, education, income, marital status, party ID, and ideology

Senate Estimates and 95% Confidence Intervals

Governor Estimates and 95% Confidence Intervals
[Figure: poll percentage plotted against vote percentage.]

Comparison of Accuracy

Source                  n     Mean Error   RMSE
Phone                   255   2.76         8.34
Rasmussen (IVR)         83    3.82         8.47
SurveyUSA (IVR)         63    3.4          7.25
Zogby (Internet)        72    4.86         9.36
Polimetrix (Internet)   40    -0.47        5.21

Source: Blumenthal and Franklin (2007)
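The two accuracy measures in the table summarize different things: mean error is signed and reflects systematic bias, while RMSE also penalizes poll-to-poll variability. A minimal Python sketch follows, using made-up errors rather than the underlying poll data (which the slides do not include); the function names are hypothetical.

```python
import math

def mean_error(errors):
    """Average signed error; positive and negative misses cancel."""
    return sum(errors) / len(errors)

def rmse(errors):
    """Root mean squared error; large misses in either direction count."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical poll errors in percentage points (not from the slides).
hypothetical_errors = [3.0, -1.0, 4.0, -2.0]

print("mean error:", mean_error(hypothetical_errors))
print("RMSE:", round(rmse(hypothetical_errors), 2))
```

A source can have a mean error near zero and still a sizeable RMSE if its misses are large but offsetting, which is why the table reports both.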

Some Lessons from 2006
- Some categories are in short supply, even in large panels: older, low-education minorities.
- Panelists recruited through public opinion surveys have high levels of political interest (and over-report registration and turnout).
- The much-maligned paid survey takers are helpful for matching unregistered voters.
- Matching is imperfect, so weighting is important to remove remaining biases. However, much smaller weights are needed after matching.
- Panel attrition requires modeling regardless of the method of sampling.
- The question is not whether you do weighting, but whether you do it well.
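As an illustration of weighting a matched sample, the sketch below applies simple cell weighting (post-stratification) in Python: each respondent's weight is the population share of their cell divided by the sample share. This is a generic textbook procedure offered as an assumption about what cell weighting might look like here; the slides do not describe the weighting actually used, and the cells, shares, and names are hypothetical.

```python
from collections import Counter

def cell_weights(sample_cells, population_shares):
    """Weight = population share of the cell / sample share of the cell."""
    counts = Counter(sample_cells)
    n = len(sample_cells)
    return [population_shares[cell] / (counts[cell] / n) for cell in sample_cells]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical matched sample: one young and three old respondents,
# while the population is split evenly between the two cells.
cells = ["young", "old", "old", "old"]
answers = [1, 0, 0, 0]
shares = {"young": 0.5, "old": 0.5}

weights = cell_weights(cells, shares)
print(weights)
print(weighted_mean(answers, weights))
```

The closer the matched sample already is to the population cell shares, the closer these weights are to 1, which is the sense in which smaller weights are needed after matching.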

Party ID Trends 2007-08
[Figure: two panels, Democratic percentage and Republican percentage by date, November to May.]

Political and Campaign Interest 2007-08
[Figure: two panels, Interest in Politics and Public Affairs and Interest in Campaign, percentage by date, November to May.]

Trial Heats 2008
[Figure: two panels, Obama vs. McCain and Clinton vs. McCain, percentage by date, February 11 to May 1.]