Regression Diagnostics for Survey Data

Regression Diagnostics for Survey Data
Richard Valliant, Joint Program in Survey Methodology, University of Maryland and University of Michigan, USA
Jianzhu Li (Westat), Dan Liao (JPSM)

Introduction
Topics: adaptations to survey data of
- Leverages
- DFBETAS
- DFFITS
- Cook's D
- Collinearity measures
- Forward search
Comparisons to standard diagnostics

Linear Regression on Survey Data
Weighted least squares estimates (fixed effects):
$Y_i = x_i^T \beta + \varepsilon_i$, $\varepsilon_i \sim (0, \sigma^2 v_i)$, independent (no clustering, but clustering can be handled)
$\hat{\beta} = (X^T W V^{-1} X)^{-1} X^T W V^{-1} Y$
If constant variance, $\hat{\beta} = (X^T W X)^{-1} X^T W Y$
$W$ = diagonal matrix of survey weights
$\hat{\beta}$ can be interpreted as an estimate of (i) a parameter in an underlying model or (ii) a census-fit parameter
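As an illustrative sketch (not part of the slides), the constant-variance estimator $\hat{\beta} = (X^T W X)^{-1} X^T W Y$ can be computed directly with numpy. The data and weight range below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 100, n)])  # intercept + one predictor
beta_true = np.array([2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 1.0, n)
w = rng.uniform(1, 160, n)  # survey weights; range mimics the examples later in the talk

# beta_hat = (X^T W X)^{-1} X^T W Y with W = diag(w)
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * y)
beta_hat = np.linalg.solve(XtWX, XtWy)
```

Note that $W$ never needs to be formed as an $n \times n$ matrix; multiplying rows of $X$ by $w$ gives the same products.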

Reasons for Using Diagnostics
Extreme points can affect regression parameter estimates, hypothesis tests, and confidence intervals. Extremes can be due to
- outlying X's or Y's (survey or non-survey data)
- large weights (survey data)
- interaction of weights with X's and Y's

[Figure: scatterplot of Y versus X with three marked points A, B, and C.]
A, B, and C are all influential. A and B may affect the estimated slope; C will not affect the slope but may reduce the SE of the slope.

[Figure: scatterplots of Y versus Beds and Y versus Adds; generated data based on a survey of mental health organizations.]
The 5 points in the lower right may or may not be influential, depending on the size of their survey weights.

Survey Weights
Survey weights are intended to expand a sample to a finite population. They are NOT the same as the inverse-variance weights in usual WLS regression.
Reasons for variation in size of weights due to sample design:
- Household surveys: different sampling rates for demographic groups (e.g., to get equal sample sizes for groups)
- Business/institution surveys: varying sampling rates by type of business (retail, service, etc.); PPS sampling (probabilities proportional to no. of employees)

More reasons for variation in size of weights:
- Differential follow-up for nonresponse, i.e., subsampling of neighborhoods at different rates for nonresponse conversion, callbacks
- Low response rates followed by large nonresponse adjustments in some groups
- Use of auxiliary data in estimation: poststratification by age, race, sex; general regression estimation using no. of employees, prior-year expenditures, etc.

Examples
- 1999-2002 National Health & Nutrition Examination Survey (NHANES): weight range for Mexican-Americans 698 to 103,831 (148:1)
- 1998 Survey of Mental Health Organizations: weight range 1 to 159
- 2002 Status of the Armed Forces Survey: weight range 2.3 to 384 (168:1)

Hat Matrix and Leverages (Li & Valliant, Survey Methodology 2009)
Predicted values: $\hat{Y} = HY$, where $H = X A^{-1} X^T W$ with $A = X^T W X$.
Leverages are the diagonal elements of the hat matrix: $h_{ii} = x_i^T A^{-1} x_i w_i$.
When the model has an intercept, the leverage can be decomposed as
$h_{ii} = w_i \left[ 1/\hat{N} + (x_i - \bar{x}_W)^T S^{-1} (x_i - \bar{x}_W) \right]$,
where $S$ is a weighted cross-product matrix involving the $x$'s, $\bar{x}_W$ is the weighted mean of the $x$'s, and $\hat{N} = \sum_{i \in s} w_i$.
A point has high leverage if its weight is much larger than average or $x_i$ is toward the edge of the ellipsoid centered at $\bar{x}_W$.
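A sketch of computing the survey-weighted leverages $h_{ii} = w_i\, x_i^T (X^T W X)^{-1} x_i$ with numpy; the data are illustrative, and the $2p/n$ cutoff is the rule of thumb used later in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.exponential(50, n)])  # intercept + skewed predictor
w = rng.uniform(1, 160, n)                                 # survey weights

A_inv = np.linalg.inv(X.T @ (w[:, None] * X))              # (X^T W X)^{-1}
h = np.einsum("ij,jk,ik->i", X, A_inv, X) * w              # h_ii = w_i x_i^T A^{-1} x_i

p = X.shape[1]
cutoff = 2 * p / n                                         # rule-of-thumb cutoff 2p/n
flagged = np.where(h > cutoff)[0]                          # candidate high-leverage cases
```

Since $H$ is idempotent, the leverages sum to $p$, which is a handy sanity check on the computation.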

An Example
1998 Survey of Mental Health Organizations (SMHO); PPS sample.
Regress expenditures on no. of beds (BEDS) and no. of patients added during the year (ADDS).

Quantiles of variables in the SMHO regression:

Variable               0%     25%     50%      75%     100%
Expenditures (1000s)   17   2,932   6,240   11,842  519,863
BEDS                    0       6      36       93    2,405
ADDS                    0     558   1,410    2,406   79,808
Weights                 1    1.42    2.48     7.76   158.86

[Figure: scatterplots of expenditures versus beds and versus additions. High-leverage points based on OLS are highlighted in the top row; those based on SW leverages in the bottom row.]

[Figure: plot of survey-weighted (SW) leverages versus OLS unweighted leverages.]
Rule-of-thumb cutoff is $2p/n$. A = detected by SW only; B = detected by OLS only.

OLS and SW parameter estimates of the SMHO regression using all 875 sample cases:

                      OLS Estimation           SW Estimation
Variable            Coef.     SE      t      Coef.      SE      t
Intercept          -1,201    526   -2.3       514    1,158    0.4
No. of Beds            94      3   31          81       13    6.2
No. of Additions      2.3   0.13   18         1.8      0.8    2.4

After deleting observations with leverages greater than $2p/n = 0.007$:

Variable            Coef.     SE      t      Coef.      SE      t
Intercept           2,987    490    6       1,994      354    5.6
No. of Beds            69    4.4   16          76      6.7   11.2
No. of Additions     0.95   0.20    4.7       1.0     0.20    4.7

After deleting the high-leverage points, SEs are reduced and the OLS and SW estimates are closer to each other. Significance of the coefficients is unchanged (except for the intercept).

Variance Estimators
Estimators of $\mathrm{Var}(\hat{\beta})$ are needed for several diagnostics. Options are the Binder sandwich estimator (ISR 1983) or replication (jackknife, BRR, bootstrap). These are both design- and model-consistent.
A purely model-based estimator, useful for setting cutoffs:
$v_M(\hat{\beta}) = \hat{\sigma}^2 A^{-1} X^T W^2 X A^{-1}$ with $\hat{\sigma}^2 = \sum_{i \in s} w_i e_i^2 / (\hat{N} - p)$,
where $e_i = Y_i - x_i^T \hat{\beta}$ and $\hat{N} = \sum_{i \in s} w_i$.

Standardized Residuals
Standardizing so that residuals have (approximate) variance 1 makes interpretation easier: use $e_i / \hat{\sigma}$.
Cutoff for "large": 2 or 3, based on the Gauss inequality. (There is no design-based distribution theory for residuals, even asymptotically.)
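A sketch of the standardized residuals using the model-based $\hat{\sigma}^2 = \sum_{i \in s} w_i e_i^2 / (\hat{N} - p)$ from the previous slide; the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 100, n)])
y = X @ np.array([3.0, 0.8]) + rng.normal(0.0, 5.0, n)
w = rng.uniform(1, 50, n)                        # illustrative survey weights

# survey-weighted fit and residuals
beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
e = y - X @ beta_hat

# sigma^2-hat = sum_s w_i e_i^2 / (N-hat - p)
N_hat = w.sum()
sigma2_hat = (w * e**2).sum() / (N_hat - p)
std_resid = e / np.sqrt(sigma2_hat)

flagged = np.abs(std_resid) > 3                  # Gauss-inequality cutoff of 3
```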

DFBETAS, DFFITS (Li & Valliant 2009, submitted)
Measure the effect of a single unit on each $\hat{\beta}_j$ separately:
$\mathrm{DFBETAS}_{ij} = \dfrac{c_{ji} e_i}{(1 - h_{ii}) \sqrt{v(\hat{\beta}_j)}}$, with $c_{ji} = (A^{-1} x_i w_i)_j$; $i$ = unit, $j$ = parameter.
Based on $\mathrm{DFBETA}_i = \hat{\beta} - \hat{\beta}_{(i)} = \dfrac{A^{-1} x_i e_i w_i}{1 - h_{ii}}$.
Large if any of the weight, residual, or leverage is large. A lot to look at: $np$ values.

Measure the effect of unit $i$ on the prediction: multiply $\mathrm{DFBETA}_i$ by $x_i^T$ to get
$\mathrm{DFFITS}_i = \dfrac{h_{ii} e_i}{1 - h_{ii}}$ (standardized by its estimated SE).
Heuristic cutoffs: $|\mathrm{DFBETAS}_{ij}| > z/\sqrt{n}$ and $|\mathrm{DFFITS}_i| > z\sqrt{p/n}$, with $z = 2$ or 3.
(A Bonferroni adjustment to the cutoffs can be used.)
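A sketch of the unstandardized single-deletion quantity $\mathrm{DFBETA}_i = A^{-1} x_i e_i w_i / (1 - h_{ii})$, checked against a direct refit with the unit deleted (the two agree exactly, by a Sherman-Morrison argument); data are illustrative, and the standardization by $\sqrt{v(\hat{\beta}_j)}$ is omitted here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, n)
w = rng.uniform(1, 20, n)

A_inv = np.linalg.inv(X.T @ (w[:, None] * X))
beta_hat = A_inv @ (X.T @ (w * y))
e = y - X @ beta_hat
h = np.einsum("ij,jk,ik->i", X, A_inv, X) * w          # weighted leverages

# DFBETA_i = A^{-1} x_i e_i w_i / (1 - h_ii), one row per unit
c = w * e / (1 - h)
dfbeta = (A_inv @ (X * c[:, None]).T).T                # n x p

# verify the deletion identity for unit 0 by refitting without it
keep = np.arange(n) != 0
Xd, yd, wd = X[keep], y[keep], w[keep]
beta_del = np.linalg.solve(Xd.T @ (wd[:, None] * Xd), Xd.T @ (wd * yd))
```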

Extended Cook's D
Measures the effect of a single unit on the vector estimate $\hat{\beta}$:
$ED_i = (\hat{\beta} - \hat{\beta}_{(i)})^T \left[ v(\hat{\beta}) \right]^{-1} (\hat{\beta} - \hat{\beta}_{(i)})$
Compare to quantiles of the $\chi^2(p)$ distribution. Influential units are ones that define a large ellipsoid centered at $\hat{\beta}$.
Per Atkinson (JRSS-B 1982), an alternative that detects more points is $MD_i = \sqrt{n \, ED_i / p}$. The heuristic cutoff for $MD_i$ is 2 or 3.

SMHO Data: regress expenditures on BEDS and ADDS.
[Figure: SW versus OLS DFBETAS for Beds and for Adds, with cases A, B, C, and D marked.]
C and D are cases identified by OLS but not by SW; these are all cases with small weights. OLS flags 57 cases; SW flags 9.

[Figure: SW Cook's D versus OLS Cook's D.]
A = cases identified by SW only; B = by OLS only. OLS flags 44 cases; MD flags 10.

OLS and SW parameter estimates after deleting observations with large modified Cook's distance:

All cases:
                      OLS Estimation           SW Estimation
Variable            Coef.     SE      t      Coef.      SE      t
Intercept          -1,201    526   -2.3       514    1,158    0.4
No. of Beds            94      3   31          81       13    6.2
No. of Additions      2.3   0.13   18         1.8      0.8    2.4

After deletion (44 units deleted for OLS; 10 for SW):
Variable            Coef.     SE      t      Coef.      SE      t
Intercept           1,660    335    4.9       932      345    2.7
No. of Beds            81    2.4   33          83      5.7   14.5
No. of Additions      1.2   0.12    9.7       1.4      0.3    5.4

Forward Search (Atkinson & Riani 2000 book; Li & Valliant 2009 draft)
One outlier can mask the effect of another. Identify groups of influential observations to avoid the masking effect.

Method
- Fit a robust regression (e.g., least median of squares) to a subsample of size $m = p$, chosen to minimize $\mathrm{median}(e_{OLS,i}^2)$
- Find the $m + 1$ cases with the smallest squared residuals, refit, and repeat, tracking $\hat{\sigma}^2$ as the subset grows
- Look for a point at which $\hat{\sigma}^2$ makes an abrupt change; all cases entering after that point are called outliers (no abrupt change means no outliers)
Adaptations were made for survey data.
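A minimal, unweighted sketch of the forward-search loop above (illustrative only: it uses a crude residual-based starting subset rather than a least-median-of-squares fit, and ignores survey weights, which the talk's adaptation would incorporate):

```python
import numpy as np

def forward_search(X, y, n_start):
    """Grow the fitting subset one case at a time, tracking sigma^2-hat along the way."""
    n, p = X.shape
    # crude starting subset: the n_start cases with smallest squared residuals from a full fit
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    subset = np.argsort((y - X @ b0) ** 2)[:n_start]
    sigma2_path = []
    while len(subset) < n:
        b = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        e2 = (y - X @ b) ** 2
        sigma2_path.append(e2[subset].sum() / max(len(subset) - p, 1))
        # next subset: the m+1 cases with smallest squared residuals
        subset = np.argsort(e2)[: len(subset) + 1]
    return np.array(sigma2_path)

# simulated data with a cluster of 5 masked outliers
rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, n)
y[-5:] += 30
path = forward_search(X, y, n_start=5)
```

An abrupt jump in the tracked $\hat{\sigma}^2$ near the end of the path marks the point at which the outlying cluster enters the subset.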

SMHO Data again
[Figure: intercept, Beds, and Adds parameter estimates from the forward search, plotted against subset size.]
83 points identified as influential; 20 were never identified by single-case deletion methods (DFBETAS, DFFITS, modified Cook's D, etc.)
The method may have promise, but more work is needed.

Collinearity
Collinearity is worrisome for both numerical and statistical reasons:
- Estimates of slopes can be numerically unstable, i.e., small changes in the X's or the Y's can produce large changes in estimates.
- Correlation among predictors can lead to slope estimates with large variances.
- When X's are strongly correlated, $R^2$ can be large while the individual slope estimates are not statistically significant.
- Even if slope estimates are significant, they may have the opposite sign of what is expected.

Variance inflation factor (VIF)
Measure of how much $\mathrm{Var}(\hat{\beta}_j)$ is inflated compared to what it would be if the $x$'s were orthogonal:
$\mathrm{Var}_M(\hat{\beta}_k) = \dfrac{\sigma^2}{\sum_{i \in s} x_{ik}^2} \cdot \underbrace{\dfrac{1}{1 - R_k^2}}_{\mathrm{VIF}}$
$R_k^2$ is the R-square from regressing $x_k$ on the other $x$'s; $x_k$ = column $k$ of $X$.
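A sketch of the ordinary (unweighted) VIF computed as $1/(1 - R_k^2)$ via the auxiliary regressions of each column on the others; the data are simulated, with two nearly collinear predictors.

```python
import numpy as np

def vif(X):
    """VIF_k = 1 / (1 - R_k^2), with R_k^2 from regressing column k on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for k in range(p):
        xk = X[:, k]
        Z = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])  # intercept + other x's
        fitted = Z @ np.linalg.lstsq(Z, xk, rcond=None)[0]
        r2 = 1 - ((xk - fitted) ** 2).sum() / ((xk - xk.mean()) ** 2).sum()
        out[k] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
```

The collinear pair should show large VIFs while the independent predictor stays near 1.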

For the survey-weighted regression estimator, if $Y = X\beta + \varepsilon$, $\varepsilon \sim (0, V)$, then
$\mathrm{Var}_M(\hat{\beta}_k) = \dfrac{\zeta_k \eta_k}{1 - R^2_{SW(k)}} \times \left[ \mathrm{Var} \text{ if } x_k \text{ orthogonal to others} \right]$
where $1/(1 - R^2_{SW(k)}) = \mathrm{VIF}_{SW}(k)$, and
$R^2_{SW(k)}$ = R-square from the SW regression of $x_k$ on the other $x$'s,
$\zeta_k = \dfrac{e_{(k)}^T WVW e_{(k)}}{e_{(k)}^T W e_{(k)}}$, $\quad \eta_k = \dfrac{x_k^T W x_k}{x_k^T WVW x_k}$,
$e_{(k)}$ = vector of residuals from regressing $x_k$ on the other $x$'s.

Approaches to estimation
- Purely model-based
- Think of the census value of $\zeta_k \eta_k / (1 - R^2_{SW(k)})$; fill in design-based estimates of each component
Variance decomposition using the SVD: use it to identify pairs of $x$'s that are collinear (à la Belsley, Kuh, and Welsch 1980). Work is in progress on this.

Conclusion
Different points can be influential in OLS and SW regression; specialized diagnostics are needed for survey data (assuming survey-weighted LS is used).
If you adopt OLS regression, use OLS diagnostics; if you adopt SW regression, use SW diagnostics.
Little formal distribution theory is available.
Packages do not currently include diagnostics for survey regressions.

Implications of dropping points based on diagnostics
- The core model being fitted is one that fits the portion of the population excluding the influential points
- The idea of estimating a census parameter is lost

What if a mechanical procedure is used that automatically drops points?
- SEs are too small, CIs cover at less than the nominal rate, and hypothesis tests reject too often
- These are similar to the problems known for stepwise regression (Zhang, BMKA 1992; Hurvich & Tsai, TAS 1990)
Collinearity has similar effects on survey estimators as in regular regression:
- The same inference problems as above may exist if an automatic procedure is used.