Checking model assumptions with regression diagnostics


Checking model assumptions with regression diagnostics
Graeme L. Hickey, University of Liverpool
@graemeleehickey | www.glhickey.com | graeme.hickey@liverpool.ac.uk

Conflicts of interest
None. Assistant Editor (Statistical Consultant) for EJCTS and ICVTS.

Question: who routinely checks model assumptions when analyzing data? (raise your hand if the answer is Yes)

Outline
Illustrated with multiple linear regression; a plethora of residuals and diagnostics exist for other model types. The focus is not on what to do if you detect a problem, but on how to diagnose (potential) problems.

My personal experience*
Reviewer for EJCTS and ICVTS for 5 years. Authors almost never report whether they assessed model assumptions. Example: only one paper has been submitted where the authors considered sphericity in RM-ANOVA at first submission. Usually one or more comments are sent to authors regarding model assumptions.
* My views do not reflect those of the EJCTS, ICVTS, or of other statistical reviewers

Linear regression modelling
Collect some data:
y_i: the observed continuous outcome for subject i (e.g. a biomarker)
x_1i, x_2i, …, x_pi: p covariates (e.g. age, male, …)
Want to fit the model y_i = β_0 + β_1 x_1i + β_2 x_2i + … + β_p x_pi + ε_i
Estimate the regression coefficients β̂_0, β̂_1, β̂_2, …, β̂_p
Report the coefficients and make inference, e.g. report 95% CIs
But we do not stop there

Residuals
For a linear regression model, the residual for the i-th observation is r_i = y_i − ŷ_i, where ŷ_i is the predicted value given by ŷ_i = β̂_0 + β̂_1 x_1i + β̂_2 x_2i + … + β̂_p x_pi
Lots of useful diagnostics are based on residuals
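As a minimal sketch of the definitions above (simulated data, not from the talk), the coefficients can be estimated by ordinary least squares and the residuals formed as r_i = y_i − ŷ_i:

```python
import numpy as np

# Illustrative sketch: fit a multiple linear regression by least squares
# and compute the residuals r_i = y_i - yhat_i.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))                 # p covariates
beta_true = np.array([1.0, 2.0, -0.5])      # intercept + 2 slopes (assumed for the demo)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=n)

Xd = np.column_stack([np.ones(n), X])       # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta_hat                       # predicted values yhat_i
residuals = y - y_hat                       # residuals r_i

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero -- a quick sanity check.
print(round(residuals.sum(), 6))
```

The residual vector computed here is the raw material for all the diagnostic plots that follow.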

Linearity of functional form
Assumption: a scatterplot of (x_i, r_i) should not show any systematic trends. Trends imply that higher-order terms are required, e.g. quadratic, cubic, etc.

[Figure: four panels. Panels A and B show the data and residuals for the fitted model Y = β_0 + β_1 X + ε (a systematic trend remains in the residuals); panels C and D show the same for the fitted model Y = β_0 + β_1 X + β_2 X² + ε (no trend).]

Homogeneity
We often assume that ε_i ~ N(0, σ²). The assumption here is that the variance is constant, i.e. homogeneous. Estimates and predictions are robust to violation, but inferences (e.g. F-tests, confidence intervals) are not. We should not see any pattern in a scatterplot of (ŷ_i, r_i); residuals should be symmetric about 0.
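A hypothetical quick check (simulated data, not the talk's method): split the residuals at the median fitted value and compare their spread, in the informal spirit of a Goldfeld–Quandt comparison. Growing spread suggests heteroscedasticity:

```python
import numpy as np

# Sketch: simulate data whose error spread grows with x, fit a line,
# then compare residual spread for low vs high fitted values.
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.2 * (1 + 2 * x), size=n)  # heteroscedastic errors

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
fitted = X @ beta

# Compare residual standard deviations in the low vs high fitted halves.
order = np.argsort(fitted)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
ratio = high.std() / low.std()
print(ratio > 1.5)  # spread clearly grows with the fitted value here
```

For homoscedastic errors this ratio should sit near 1; it is an informal screen, not a substitute for the residual-versus-fitted plot.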

[Figure: panel A, homoscedastic residuals; panel B, heteroscedastic residuals; each plotted against the fitted value.]

Normality
If we want to make inferences, we generally assume ε_i ~ N(0, σ²). It is not always a critical assumption, e.g. if we only want to estimate the best-fit line, if we only want to make predictions, or if the sample size is quite large and the other assumptions are met. We can assess it graphically using a Q-Q plot or a histogram. Note: the assumption is about the errors, not the outcomes y_i.
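A sketch of what a normal Q-Q plot actually computes (simulated residuals, not from the talk): pair the sorted residuals with theoretical normal quantiles; points tracking a straight line suggest approximately normal errors.

```python
import numpy as np
from statistics import NormalDist

# Simulated residuals with true standard deviation 2.0 (assumed for the demo).
rng = np.random.default_rng(1)
residuals = rng.normal(scale=2.0, size=200)

# Q-Q plot coordinates: sorted residuals vs standard-normal quantiles
# at plotting positions (i - 0.5) / n.
sample_q = np.sort(residuals)
probs = (np.arange(1, len(sample_q) + 1) - 0.5) / len(sample_q)
theory_q = np.array([NormalDist().inv_cdf(p) for p in probs])

# For normal data the points lie near a line whose slope estimates sigma.
slope = np.polyfit(theory_q, sample_q, 1)[0]
print(round(slope, 1))  # close to the true sigma of 2.0
```

Skewed residuals would bend away from the line at one tail, which is what the histogram comparison on the next slide illustrates.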

[Figure: Q-Q plots and histograms comparing normal residuals (points on the reference line; roughly symmetric histogram) with skewed residuals (points curving away from the line; skewed histogram).]

Independence
We assume the errors are independent. Potential violations can usually be identified from the study design and analysis plan, e.g. with repeated measures, we should not treat each measurement as independent. If independence holds, plotting the residuals against time (or the order of the observations) should show no pattern.

[Figure: panel A, independent residuals (no pattern against X); panel B, non-independent residuals (a clear pattern against X).]

Multicollinearity
Correlation among the predictors (independent variables) is known as collinearity (multicollinearity when >2 predictors). If the aim is inference, it can lead to inflated standard errors (in some cases very large) and nonsensical parameter estimates (e.g. wrong signs or extremely large values). If the aim is prediction, it tends not to be a problem. The standard diagnostic is the variance inflation factor (VIF): VIF(X_j) = 1 / (1 − R_j²), where R_j² is the R² from regressing X_j on the other predictors. Rule of thumb: VIF > 10 indicates multicollinearity.
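The VIF formula above can be computed directly from its definition. A minimal sketch (hypothetical data; `vif` is a helper written for this demo, not a library function):

```python
import numpy as np

def vif(X):
    """VIF for each column of predictor matrix X (no intercept column):
    regress X_j on the remaining predictors and apply 1 / (1 - R_j^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                   # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs > 10)   # the collinear pair is flagged by the VIF > 10 rule of thumb
```

Here the nearly collinear pair produces VIFs of roughly 100, while the independent predictor stays near 1.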

Outliers & influential points
[Figure: four datasets, each with correlation r = 0.82 and fitted line y = 3.00 + 0.500x, yet very different scatterplots; one contains an outlier and another a high-leverage point (cf. Anscombe's quartet).]

Diagnostics to detect influential points
DFBETA (or Δβ): leave the i-th observation out and refit the model to get estimates β̂_0^(−i), β̂_1^(−i), β̂_2^(−i), …, β̂_p^(−i); repeat for i = 1, 2, …, n.
Cook's distance (D-statistic): a measure of how influential each data point is. Automatically computed / visualized in modern software. Rule of thumb: D_i > 1 implies the point is influential.
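A leave-one-out sketch of the DFBETA idea (simulated data with a planted outlier, not the talk's code): refit the model without each observation and record how much the coefficients move.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
y[0] += 10.0                                 # plant one gross outlier

X = np.column_stack([np.ones(n), x])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# DFBETA: change in (intercept, slope) when observation i is dropped.
dfbeta = np.empty((n, 2))
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    dfbeta[i] = beta_full - beta_i

# The planted outlier should shift the coefficients the most.
most_influential = int(np.argmax(np.abs(dfbeta).sum(axis=1)))
print(most_influential)
```

In practice software reports standardized DFBETAs (and Cook's distance) automatically; this loop only makes the leave-one-out logic explicit.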

Residuals from other models
GLMs (incl. logistic regression): deviance, Pearson, response, partial, Δβ
Cox regression: martingale, deviance, score, Schoenfeld, Δβ
Useful for exploring the influence of individual observations and model fit

Two scenarios Statistical methods routinely submitted to EJCTS / ICVTS include: 1. Repeated measures ANOVA 2. Cox proportional hazards regression Each has very important assumptions

Repeated measures ANOVA
Assumptions: those used for classical ANOVA + sphericity. Sphericity: the variances of the differences of all pairs of the within-subject conditions (e.g. time) are equal.

Patient    T0   T1   T2   T0−T1   T0−T2   T1−T2
1          30   27   20      3      10       7
2          35   30   28      5       7       2
3          25   30   20     −5       5      10
4          15   15   12      0       3       3
5           9   12    7     −3       2       5
Variance                  17.0    10.3    10.3

It's a questionable a priori assumption for longitudinal data
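The slide's table can be reproduced with standard-library tools: compute the pairwise differences between time points for each patient, then the sample variance of each difference column.

```python
from statistics import variance

# Patient data from the slide: (T0, T1, T2) per patient.
data = {
    1: (30, 27, 20),
    2: (35, 30, 28),
    3: (25, 30, 20),
    4: (15, 15, 12),
    5: (9, 12, 7),
}
d01 = [t0 - t1 for t0, t1, t2 in data.values()]   # T0 - T1
d02 = [t0 - t2 for t0, t1, t2 in data.values()]   # T0 - T2
d12 = [t1 - t2 for t0, t1, t2 in data.values()]   # T1 - T2

# Sphericity would require these three variances to be (roughly) equal.
print([round(variance(d), 1) for d in (d01, d02, d12)])  # [17.0, 10.3, 10.3]
```

Here the first variance (17.0) differs noticeably from the other two (10.3), which is exactly the kind of discrepancy Mauchly's test assesses.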

Mauchly's test
A popular test (but criticized due to power and robustness)
H0: sphericity satisfied (i.e. σ²_{T0−T1} = σ²_{T0−T2} = σ²_{T1−T2})
H1: non-sphericity (at least one variance is different)
If rejected, it is usual to apply a correction to the degrees of freedom (df) in the RM-ANOVA F-test. The correction is ε × df, where ε is the epsilon statistic (either Greenhouse-Geisser or Huynh-Feldt). Software (e.g. SPSS) will automatically report ε and the corrected tests.

Proportionality assumption
Cox regression assumes proportional hazards: h_i(t) = h_0(t) exp(β_1 x_1i + … + β_p x_pi). Equivalently, the hazard ratio must be constant over time. There are many ways to assess this assumption, including two using residual diagnostics: graphical inspection of the (scaled) Schoenfeld residuals, and a test* based on the Schoenfeld residuals.
* Grambsch & Therneau. Biometrika. 1994; 81: 515-26.

[Figure: scaled Schoenfeld residuals, Beta(t), plotted against time for age, sex, and wt.loss from a simple Cox model fitted to the North Central Cancer Treatment Group lung cancer data set*; individual test p-values of 0.5385, 0.1253, and 0.8769 for the three covariates; global Schoenfeld test p = 0.416.]
If proportionality is valid, then we should not see any association between the residuals and time. We can formally test the correlation for each covariate, and we can also formally test global proportionality.
* Loprinzi CL et al. Journal of Clinical Oncology. 12(3): 601-7, 1994.

Conclusions
Residuals are incredibly powerful for diagnosing issues in regression models. If a model doesn't satisfy the required assumptions, don't expect subsequent inferences to be correct. Assumptions can usually be assessed using methods other than (or in combination with) residuals. Always report in the manuscript: what diagnostics were used, even if they are absent from the Results section, and any corrections or adjustments made as a result of the diagnostics.

Thanks for listening Any questions? Statistical Primer article to be published soon! Slides available (shortly) from: www.glhickey.com