Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

In an earlier lecture we studied the statistical assumptions underlying the regression model, including the following points:

Formal statement of assumptions.
Consequences of violations.
Methods for detecting violations.
Remedies for violations.

Methods for detecting violations of assumptions are often referred to as diagnostics in that they are used to diagnose or reveal problems in the data. We now extend our use of diagnostics to address two other issues of potential importance in almost any application of multiple regression:

Outliers
Multicollinearity

Each of these phenomena can have a major impact on regression results and interpretation. Any high-quality application of MLR should include a direct and careful investigation of both issues to determine if any serious problem is present and, if so, how it should be remedied.

Outliers

Outliers are atypical data points that do not fit with the rest of the data. An outlier may arise due to some sort of contamination or error, or may be a valid but very extreme observation. (More on this later.)

Outliers may have a dramatic impact on the results of regression analyses, potentially having a major impact on effect sizes and regression coefficients. Outliers may cause a weak (or zero) linear relationship to appear to be a strong linear relationship, or may have the opposite effect by masking a strong linear relationship. Outliers tend to have a stronger effect when n is small than when n is large.

Detection of outliers

In applications of MLR it is useful to be able to detect and identify outliers. As we shall see later, once an outlier is detected the investigator must carefully attempt to determine the source or cause of the atypical observation and then consider how to remedy the situation. But first we will consider methods for detecting outliers.

Outlier detection methods involve the use of statistics that are obtained for each case (or observation). Three types of detection measures are commonly used:

Leverage: Extremity of each case on the IVs.
Discrepancy: Extremity of each case on the DV.
Influence: Influence of each case on regression results.

These statistics are commonly available in commercial statistical software. The general approach is to obtain such measures for each case and then determine which cases, if any, exhibit sufficiently extreme values to be considered outliers.

Leverage: Measures of leverage assess the extremity, or atypicality, of each case on the IVs. Extreme cases have the potential to have great influence on results of regression analyses.

When there is only one IV, the usual measure of leverage is given by:

h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}

Cases near the mean of X produce low values of h, whereas cases further from the mean produce larger values. This measure of leverage can be extended to the case of k IVs, but the formula requires matrix representation and is omitted here. In this more general context, observations near the centroid (joint mean) of the distribution of the IVs yield low values of h, and cases further from the centroid yield larger values.

Once we obtain h values, one for each of the n cases, we need to examine them to identify extreme values. There are two common approaches:

1) Index plot: This is a plot where the horizontal axis represents case number, running from 1 to n, and the vertical axis represents the leverage measure. See examples in the text. Visual inspection of the plot can reveal cases with extreme leverage values. Index plots are most useful when n is not too large.

2) Cutoff values: There are rule-of-thumb cutoffs that are more useful when n is large. Common values include h_i > 2(k+1)/n or h_i > 3(k+1)/n. A general principle is that we do not want to identify a large number of cases as outliers, so the use of more extreme cutoff values will result in fewer cases identified.

This process helps us to identify observations that are highly discrepant on the IVs and which thus have a large potential influence on results of the regression analysis.

Discrepancy: Measures of discrepancy assess extremity on the DV in the context of the regression model. A simple measure of extremity would be the raw regression residual for each case:

Y_i - \hat{Y}_i

However, it must be kept in mind that an extreme observation will influence the regression line in such a way as to make the corresponding residual smaller for that observation. A better measure of discrepancy for case i would be the value of the residual that would be obtained if that case were not included in the regression model.

This value is designated

d_i = Y_i - \hat{Y}_{i(i)}

where \hat{Y}_{i(i)} is the predicted value of Y that would be obtained for case i using a regression equation derived from the sample excluding case i. Observations exhibiting a large value of d_i are cases that are deviant in terms of their residuals when the regression equation is derived based on the rest of the sample. To put these values on a standardized scale we define

t_i = \frac{d_i}{SE_{d_i}}

These values are called Studentized residuals. We then wish to identify extreme values by using either an index plot (when n is small to moderate) or cutoff values. Since these residuals approximately follow a t-distribution, common cutoffs are ±2 in small to moderate samples, and ±3 or ±4 in large samples. By this process we can identify observations that are highly discrepant on the DV in the context of the regression model.
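The leverage and discrepancy measures above can be computed directly from their definitions. The following is a minimal sketch in Python, assuming numpy is available; the data, variable names, and sample size are hypothetical placeholders, and the leave-one-out loop follows the definition of d_i given above rather than any particular software package's implementation.

```python
import numpy as np

# Hypothetical single-IV data; in practice x and y come from your sample.
rng = np.random.default_rng(0)
n, k = 30, 1
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

# Leverage for one IV: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Studentized (deleted) residual for each case: refit the regression with
# case i excluded, predict Y_i, and standardize the deleted residual d_i.
X = np.column_stack([np.ones(n), x])
t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    resid_i = y[keep] - X[keep] @ b_i
    mse_i = np.sum(resid_i ** 2) / (keep.sum() - (k + 1))      # MS_residual(i)
    d_i = y[i] - X[i] @ b_i                                    # deleted residual
    se_d_i = np.sqrt(mse_i * (1.0 + X[i] @ np.linalg.inv(X[keep].T @ X[keep]) @ X[i]))
    t[i] = d_i / se_d_i

# Flag cases using the rule-of-thumb cutoffs discussed above.
print("High-leverage cases:", np.where(h > 2 * (k + 1) / n)[0])
print("Discrepant cases:   ", np.where(np.abs(t) > 2)[0])
```

With simulated data like this, few or no cases are typically flagged; the point of the sketch is only to show how the formulas map onto a computation, not to endorse particular variable names or data.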

Influence: Influence statistics measure the influence of each case on the regression model. These measures combine information represented by the notions of leverage and discrepancy. There are two kinds of influence measures: global measures, and measures of influence on a specific regression coefficient.

Global influence: A global influence measure assesses the change in the predicted Y values as a function of whether an observation, i, is included in the sample or not. For each case we obtain a measure called DFFITS_i:

DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MS_{residual(i)} \, h_i}}

The numerator represents the difference in predicted Y values obtained when case i is included vs. excluded from the analysis, and the denominator serves to standardize these values.

This index of influence is closely related to another commonly used index, Cook's distance (D_i):

D_i \approx \frac{DFFITS_i^2}{k + 1}
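As a concrete illustration of these definitions, here is a hedged sketch in Python (numpy only; the simulated data and the choice of k = 2 IVs are placeholders). It computes DFFITS_i by literally refitting the model with each case removed, and then applies the approximate relation to Cook's distance given above.

```python
import numpy as np

# Hypothetical data: a constant plus k = 2 IVs and one DV.
rng = np.random.default_rng(0)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)    # leverage h_i per case

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    yhat_i_deleted = X[i] @ b_i                                  # predicted Y_i with case i excluded
    resid_i = y[keep] - X[keep] @ b_i
    ms_resid_i = np.sum(resid_i ** 2) / (keep.sum() - (k + 1))   # MS_residual(i)
    dffits[i] = (yhat[i] - yhat_i_deleted) / np.sqrt(ms_resid_i * hat[i])

cooks_d = dffits ** 2 / (k + 1)   # approximate Cook's distance, per the relation above

print("Largest |DFFITS| cases:        ", np.argsort(-np.abs(dffits))[:3])
print("Largest Cook's distance cases: ", np.argsort(-cooks_d)[:3])
```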

Once we obtain one of these measures for each case we again seek to identify extreme values using an index plot, when n is not too large, or cutoffs. Commonly used cutoffs are values > 1 when n is small to moderate, or values > 2\sqrt{(k+1)/n} when n is large. This process helps us to identify observations that have a relatively large global influence on the results of the regression analysis.

Influence on specific regression coefficients: In some studies we may have a primary interest in estimating and interpreting the value of regression coefficients associated with specific IVs. In such situations we would be interested in whether those particular coefficients might be highly influenced by outliers. Such influences can be assessed using an index called DFBETA. For a given IV, j, we can obtain a DFBETA value for each case, i:

DFBETA_{ij} = \frac{B_j - B_{j(i)}}{SE_{B_{j(i)}}}

where B_j is the regression coefficient obtained from the full sample, and B_{j(i)} is the coefficient obtained when case i is excluded from the sample.

This value represents the influence of observation i on the regression coefficient B_j. Once we obtain one of these measures for each case we again seek to identify extreme values using an index plot, when n is not too large, or cutoffs. Commonly used cutoffs are ±1 for small to moderate n, and larger values such as ±2 when n is large. This approach allows us to identify observations that have a large influence on the value of B_j. In practice we can obtain DFBETA values associated with each B_j if so desired.

General approach

Given a sample of n observations on k IVs and one DV, we can obtain any or all of these diagnostic measures. For each measure we can identify extreme observations using index plots and/or conventional cutoff values. Index plots are more useful for small to moderate n, while cutoffs are useful for larger n. A consolidated sketch of this workflow is given below.
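To illustrate the general approach, the following sketch uses Python with the statsmodels package, whose OLSInfluence helper reports all of the case statistics discussed above. The simulated data are placeholders, the attribute names belong to statsmodels rather than to this lecture, and the cutoffs applied are the small-to-moderate-n rules of thumb described here.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: k = 2 IVs plus a constant, and one DV.
rng = np.random.default_rng(0)
n, k = 30, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

results = sm.OLS(y, X).fit()
infl = results.get_influence()

h       = infl.hat_matrix_diag               # leverage
t       = infl.resid_studentized_external    # Studentized (deleted) residuals
dffits  = infl.dffits[0]                     # global influence
cooks_d = infl.cooks_distance[0]             # Cook's distance
dfbetas = infl.dfbetas                       # one column per coefficient

# Flag cases using the small-to-moderate-n rules of thumb discussed above.
flags = {
    "leverage":    np.where(h > 2 * (k + 1) / n)[0],
    "discrepancy": np.where(np.abs(t) > 2)[0],
    "DFFITS":      np.where(np.abs(dffits) > 1)[0],
    "Cook's D":    np.where(cooks_d > 1)[0],
    "DFBETA":      np.unique(np.where(np.abs(dfbetas) > 1)[0]),
}
for measure, cases in flags.items():
    print(measure, "-> cases", cases)
```

Index plots can be produced from the same arrays (case number on the horizontal axis, the chosen measure on the vertical axis) with any plotting library.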

In any event, we do not want this process to result in identification of large numbers of outliers. Any outliers that are identified must be evaluated, and it is impractical to evaluate large numbers of such cases.

Of the various diagnostic statistics defined above, measures of influence are probably the most useful and important. These statistics combine information represented by measures of leverage and discrepancy and indicate the actual influence of each case on the results of regression analyses. Cases that exhibit extreme values of leverage or discrepancy but do not exhibit substantial influence are probably not problematic. For cases identified as having high influence, measures of leverage and discrepancy can provide more detail about the specific nature of the extremity or atypicality of those cases.

What to do when outliers are identified

When outliers are clearly identified it is useful and potentially of great importance to attempt to determine their source or cause.

Two primary causes:

Contamination or error: Some sort of error occurred in the measurement or data recording process. In such cases, the error may be fixed or, if that is not possible, the outliers may be deleted from the sample.

Valid but rare cases: The outliers are valid observations but are extreme in some way relative to the rest of the sample.

Determining what to do with outliers of the second type can be problematic. There is a tension between two goals:

1) Retaining and seeking to account for all valid data.
2) Obtaining results that represent the general effects present in the population, are not overly influenced by individual cases, and generalize and cross-validate well.

Both objectives are legitimate and important. As a result, outliers should not be deleted casually.

An attempt to determine the nature of the outliers can provide important information and insights. For example:

Outliers may be observations from a different population than the one of interest.
Outliers may arise due to unexpected or unrecognized effects or phenomena.
Outliers may reveal misspecification of the regression model (e.g., nonlinear vs. linear).

It should also be noted that failure to delete outliers may be ethically problematic. It can and does happen that a finding of a statistically significant effect occurs because of the effects of a small number of outliers, and that if those outliers are removed the effect vanishes. One could argue that it would be unethical to report such a significant effect without noting its dependence on the presence of a few extreme observations.

If a decision is made to delete outliers from the data set so as to eliminate their influence, then the following points should be kept in mind:

1) It is essential that the investigator report this decision along with the number and nature of the outliers that have been deleted.

2) The deletion of outliers produces, in effect, a new sample. In that new sample, values of all of the diagnostic statistics defined above would be different than they were prior to the deletion of the outliers. It is advisable to re-compute the diagnostic measures to determine whether any observations would now be identified as outliers, although they were not identified as outliers in the full sample.

Finally, note that there exist robust regression methods that are designed to be less sensitive to the effects of outliers. These methods do not use ordinary least squares estimation as in the standard MLR methods. See text for discussion and references.
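For a sense of how such methods behave, here is a brief hedged sketch in Python using the statsmodels RLM routine with a Huber M-estimator, which downweights cases with large residuals. This is one robust alternative among several, not necessarily the specific method discussed in the text, and the data (including the injected outlier) are simulated placeholders.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical single-IV data with one gross outlier injected at the last case.
rng = np.random.default_rng(0)
n = 30
X = sm.add_constant(rng.normal(size=n))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
y[-1] += 15.0                                           # contaminated observation

ols = sm.OLS(y, X).fit()                                # ordinary least squares
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()    # iteratively reweighted robust fit

print("OLS coefficients:   ", np.round(ols.params, 3))
print("Robust coefficients:", np.round(rlm.params, 3))
print("Weight the robust fit gives the outlier:", round(rlm.weights[-1], 3))
```

The robust coefficients should sit closer to the values used to generate the clean data, and the reported weight shows how strongly the contaminated case has been downweighted rather than deleted.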