Prediction Intervals in the Presence of Outliers


David J. Olive
Southern Illinois University
July 21, 2003

Abstract

This paper presents a simple procedure for computing prediction intervals when the data comes from a population that produces a small percentage of easily detected randomly occurring outliers. The multiple linear regression model with normal errors is used to illustrate the procedure.

KEY WORDS: Regression, Robust Statistics

David J. Olive is Assistant Professor, Department of Mathematics, Southern Illinois University, Mailcode 4408, Carbondale, IL 62901-4408, USA. E-mail address: dolive@math.siu.edu. This research was supported by NSF grant DMS 0202922.

1 INTRODUCTION

Outliers are observations that are far from the bulk of the data and can cause classical estimators to perform very poorly. Typing and recording errors may create outliers, and a data set can have a large proportion of outliers if there is an omitted categorical variable (e.g. gender, species, or geographical location) where the data behaves differently for each category. Outliers should always be examined to see whether they follow a pattern, are recording errors, or could be explained adequately by a more complicated model. Recording errors can sometimes be corrected and omitted variables can be included, but often there is no simple explanation for a group of data that differs from the bulk of the data.

Assume that the population that generates the data is such that a certain proportion γ of the cases will be easily identified but randomly occurring unexplained outliers, where γ < α < 0.2, and assume that the remaining proportion 1 − γ of the cases will be well approximated by the statistical model.

A common suggestion for examining a data set that has unexplained outliers is to run the analysis on the full data set and to run the analysis on the cleaned data set with the outliers deleted. Then the statistician may consult with the collectors of the data in order to decide which analysis is more appropriate. Although the analysis of the cleaned data may be useful for describing the bulk of the data, it is not very useful if prediction or description of the entire population is of interest. Similarly, the analysis of the full data set will likely be unsatisfactory for prediction since numerical statistical methods tend to be inadequate when outliers are present.

Classical estimators will frequently fit neither the bulk of the data nor the outliers well, while an analysis from a good practical robust estimator (if available) should be similar to the analysis of the cleaned data set. Hence neither of the two analyses alone is appropriate for prediction or description of the actual population. Instead, information from both analyses should be used. The cleaned data will be used to show that the bulk of the data is well approximated by the statistical model, but the full data set will be used along with the cleaned data for prediction and for description of the entire population.

To illustrate the above discussion, consider the multiple linear regression model

    Y = Xβ + e    (1.1)

where Y is an n × 1 vector of dependent variables, X is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of errors. The ith case (Y_i, x_i^T) corresponds to the ith row x_i^T of X and the ith element Y_i of Y. Since prediction intervals are desired, also assume that the errors e_i are independent identically distributed (iid) random variables from a normal distribution with zero mean and variance σ².

Finding prediction intervals for future observations is a standard problem in multiple linear regression. Let β̂ denote the ordinary least squares (OLS) estimator of β and let

    MSE = Σ_{i=1}^n r_i² / (n − p)

where r_i = Y_i − x_i^T β̂ is the ith residual. Following Neter, Wasserman and Kutner (1983, p. 246), if the errors are iid normal, then a (1 − α)100% prediction interval (PI) for a new observation Y_h(new) corresponding to a vector of predictors x_h is given by

    Ŷ_h ± t_{1−α/2, n−p} se(pred)    (1.2)

where Ŷ_h = x_h^T β̂, P(t ≤ t_{1−α/2, n−p}) = 1 − α/2 where t has a t distribution with n − p degrees of freedom, and

    se(pred) = √( MSE (1 + x_h^T (X^T X)⁻¹ x_h) ).

For discussion, suppose that 1 − γ = 0.92 so that 8% of the cases are outliers. If interest is in a 95% PI, then using the full data set will fail because outliers are present, and using the clean data set with the outliers deleted will fail since only 92% of future observations will behave like the clean data. A simple remedy is to create a nominal 100(1 − α)% PI for future cases from this population by making a classical 100(1 − α*)% PI from the clean cases, where

    1 − α* = (1 − α)/(1 − γ).    (1.3)

Since this PI is valid when Y_h(new) is clean, if no outliers fall in the PI then

    P(Y_h(new) is in the PI) ≥ P(Y_h(new) is in the PI and clean)
        = P(Y_h(new) is in the PI | Y_h(new) is clean) P(Y_h(new) is clean)
        ≈ (1 − α*)(1 − γ) = (1 − α).

Assume that there are n_c clean cases and n_o outlying cases where n_c + n_o = n. Then the formula for this PI is

    Ŷ_h ± t_{1−α*/2, n_c − p} se(pred)    (1.4)

where Ŷ_h and se(pred) are obtained after performing OLS on the n_c clean cases. For example, if α = 0.1 and γ = 0.08, then 1 − α* ≈ 0.98. Since γ will be estimated from the data, the coverage will be approximately valid.
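The procedure in (1.2)-(1.4) is easy to carry out with standard software. The sketch below is one possible Python implementation using numpy and scipy (the paper itself used ARC, not Python); the function names, and the assumption that the analyst supplies the indices of the flagged outliers, are illustrative rather than part of the paper.

# Sketch of the classical PI (1.2) and of the outlier-adjusted PI (1.3)-(1.4).
import numpy as np
from scipy import stats

def ols_prediction_interval(X, y, x_h, alpha=0.1):
    """Classical (1 - alpha)100% PI for Y_h(new) at x_h, assuming iid normal
    errors (equation 1.2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_h = np.asarray(x_h, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y              # OLS estimator
    resid = y - X @ beta_hat
    mse = resid @ resid / (n - p)
    y_hat = x_h @ beta_hat
    se_pred = np.sqrt(mse * (1.0 + x_h @ XtX_inv @ x_h))
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - p)
    return y_hat - t_crit * se_pred, y_hat + t_crit * se_pred

def adjusted_prediction_interval(X, y, outlier_idx, x_h, alpha=0.1):
    """Nominal (1 - alpha)100% PI when a proportion gamma of the cases are
    outliers: delete the flagged cases and use the wider confidence level
    1 - alpha* = (1 - alpha)/(1 - gamma) on the clean cases (1.3)-(1.4)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    gamma = len(outlier_idx) / n              # estimated outlier proportion
    alpha_star = 1.0 - (1.0 - alpha) / (1.0 - gamma)
    clean = np.setdiff1d(np.arange(n), outlier_idx)
    return ols_prediction_interval(X[clean], y[clean], x_h, alpha=alpha_star)

With alpha = 0.1 and 8% of the cases flagged as outliers, alpha_star works out to about 0.02, matching the value 1 − α* ≈ 0.98 computed above.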

2 An Example

The following example illustrates the procedure. STATLIB provides a data set that is available from the website (http://lib.stat.cmu.edu/datasets/bodyfat). The data set (contributed by Roger W. Johnson) includes 252 cases, 14 predictor variables, and a response variable Y = bodyfat. The correlation between Y and the first predictor x_1 = density is extremely high, and the plot of x_1 versus Y looks like a straight line except for four points. If simple linear regression is used, the residual plot of the fitted values versus the residuals is curved and five outliers are apparent. The curvature suggests that x_1² should be added to the model, but the least squares fit does not resist outliers well. If the five outlying cases are deleted, four more outliers show up in the plot. The residual plot for the quadratic fit looks reasonable after deleting cases 6, 48, 71, 76, 96, 139, 169, 182 and 200. Cases 71 and 139 were much less discrepant than the other seven outliers.

These nine cases appear to be outlying at random: if the purpose of the analysis was description, we could say that a quadratic fits 96% of the cases well, but 4% of the cases are not fit especially well. If the purpose of the analysis was prediction, deleting the outliers and then using the clean data to find a 99% PI would not make sense if 4% of future cases are outliers. To create a nominal 90% PI for future cases from this population, make a classical 100(1 − α*)% PI from the clean cases where 1 − α* = 0.9/(1 − γ). For the bodyfat data, we can take 1 − γ ≈ 1 − 9/252 ≈ 0.964 and 1 − α* ≈ 0.94. Notice that (0.94)(0.96) ≈ 0.9.

Figure 1 is useful for presenting the analysis. The top two plots have the nine outliers deleted.

Figure 1a is a forward response plot of the fitted values Ŷ_i versus the response Y_i, while Figure 1b is a residual plot of the fitted values Ŷ_i versus the residuals r_i. These two plots suggest that the multiple linear regression model fits the bulk of the data well. Next consider using weighted least squares where cases 6, 48, 71, 76, 96, 139, 169, 182 and 200 are given weight zero and the remaining cases weight one. Figures 1c and 1d give the forward response plot and residual plot for the entire data set. Notice that seven of the nine outlying cases can be seen in these plots.

ARC (Cook and Weisberg 1999) was used to make the residual plots, to find outliers and to find prediction intervals. After making the residual plot, the outliers can be highlighted and deleted from the data set. It took less than five minutes to detect the outliers graphically.

The classical 90% PI using x = (1, 1, 1)^T and all 252 cases was

    Ŷ_h ± t_{0.95, 249} se(pred) = 46.3152 ± 1.651(1.3295) = (44.12, 48.51).

When the 9 outliers are deleted, n_c = 243 cases remain. Hence the 90% PI using equation (1.4) with 9 cases deleted was

    Ŷ_h ± t_{0.97, 240} se(pred) = 44.961 ± 1.88972(0.0371) = (44.89, 45.03).

Notice that the classical PI is about 31 times longer than the new PI; the short snippet at the end of this section checks this arithmetic.

The focus of this article is on prediction intervals for multiple linear regression, but similar ideas hold for prediction intervals in the location model. Whitmore (1986) gives a useful introduction to prediction intervals in the location model, and Horn (1988) gives a partial solution for obtaining prediction intervals when the underlying distribution has heavy tails. Fisher and Horn (1994) discuss robust prediction intervals in the regression setting.
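The critical values and the roughly 31-fold difference in interval length reported above can be checked with a few lines of code. The snippet below is a sketch using scipy (not the ARC software used in the paper); the se(pred) values 1.3295 and 0.0371 are taken from the intervals quoted above, and p = 3 corresponds to the quadratic model with an intercept, x_1 and x_1².

from scipy import stats

n, p = 252, 3                        # quadratic model: intercept, x_1, x_1^2
n_c = n - 9                          # 243 clean cases after deleting the outliers

# Classical 90% PI on all 252 cases uses t_{0.95, n - p}
t_full = stats.t.ppf(0.95, n - p)    # approximately 1.651
half_full = t_full * 1.3295          # half-length of the classical PI

# Adjusted PI on the clean cases uses 1 - alpha* of about 0.94, i.e. t_{0.97, n_c - p}
t_clean = stats.t.ppf(0.97, n_c - p) # approximately 1.890
half_clean = t_clean * 0.0371        # half-length of the new PI

print(half_full / half_clean)        # roughly 31, as stated in the text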

3 References

Cook, R.D., Weisberg, S., 1999. Applied Regression Including Computing and Graphics. Wiley, New York.

Fisher, A., Horn, P.S., 1994. Robust prediction intervals in a regression setting. Comput. Statist. Data Anal. 17, 129-140.

Horn, P.S., 1988. A biweight prediction interval for random samples. J. Amer. Statist. Assoc. 83, 249-256.

Neter, J., Wasserman, W., Kutner, M.H., 1983. Applied Linear Regression Models. Irwin, Homewood, IL.

Whitmore, G.A., 1986. Prediction limits for a univariate normal observation. Amer. Statist. 40, 141-143.

Figure 1: Plots for Summarizing the Entire Population. Panels a) and b) show the response plot (Y versus FIT) and residual plot (R versus FIT) with the nine outliers deleted; panels c) and d) show the same plots for the full data set, with the outlying cases labeled.