Assumptions in Regression Modeling


Fall Semester, 2001
Statistics 621, Lecture 2
Robert Stine

Preliminaries

Preparing for class
- Read the casebook prior to class. The pace in class is too fast to absorb without some prior reading.
- Identify questions; find places where you did not follow the analysis.
- Focus on the managerial questions that motivate and guide the analysis. What conclusion should be drawn from the analysis? Under what conditions?

Once in class
- Take notes directly in the casebook rather than redrawing figures.
- Participate in the discussion. Everything will not make sense the first time you see it!

Feedback
- Please fill out the feedback forms, based on the last digit of your student ID number.
- Contact either the quality circle for the class or the cohort academic reps (please identify yourselves).

Office hours
- After class on Monday and Wednesday (3-5 p.m.), 3014 SH-DH in the Statistics Department.
- Check e-mail in the evening; use the FAQ on my web page: www-stat.wharton.upenn.edu/~bob/stat621

Teaching assistants
- Room 3009 SH-DH (as in Stat 603/604).
- Hours for the TAs appear in the course syllabus.

Discussion of Assignment #1

Due on Friday
- Drop off a printed copy in the Statistics Department, one copy per team. Do not send it electronically.

Questions about the assignment
- Get the data from the web.
- Question #1: Regression estimates of the beta of a stock. Subset the data either by manually selecting the rows and using Exclude Rows, or by using the Tables menu to build a new data set.
- Question #2: Transformations and fuel consumption. Use a single plot with three models for the data. For example, the plot below compares a log-log model (red) to the linear model (green).

[Figure: scatterplot of MPG City (15-40) versus weight (1500-4000) with two fitted models overlaid]

In the assignment, discuss which model is best in the context of some question that can be answered with any of the models, such as prediction for cars of a given weight.

Discussion from Class 1 (Fitting Equations)

Key points
- Smoothing (using splines) produces a summary of how the average of the response (Y) depends upon the predictor (X).
- Regression produces an equation for this summary. The equation gives the average (or "expected") value of the response at a value of the predictor.
- The equation can be linear (easy to interpret) or nonlinear (harder).

Interpreting a linear regression equation
- The intercept and slope have units (unlike a correlation): the intercept takes its units from the response, the slope from response/predictor.
- Beware the dangers of extrapolating outside the range of the data: e.g., prices for very small or very large diamonds.

Transformations
- The intercept and slope change interpretation when the variables are transformed.
- Heuristic for logs and percentage change (pages 19-22).
- Transform the original measurements to new scales (e.g., log or reciprocal) to capture curvature in the data (e.g., diminishing returns to scale).
- Subtle point: a model that is curved on the original scales is linear on the new scales (p 32). Note the different spacing of values on the horizontal axis.

[Figure: Sales (0-450) plotted against Display Feet (0-8) and against Log Feet (0.0-2.0)]
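
To make the subtle point concrete, here is a small simulation (mine, not the casebook's; the names display_feet and sales are made up for this sketch) showing that a relationship with diminishing returns fits poorly as a straight line on the original scale but well after a log transformation:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: sales grow with diminishing returns in display feet.
    display_feet = rng.uniform(1, 8, 100)
    sales = 50 + 150 * np.log(display_feet) + rng.normal(0, 20, size=100)

    # Fit a straight line on the original scale and on the log scale.
    slope_raw, intercept_raw = np.polyfit(display_feet, sales, 1)
    slope_log, intercept_log = np.polyfit(np.log(display_feet), sales, 1)

    # The log-scale fit captures the curvature, so its residuals are smaller.
    resid_raw = sales - (intercept_raw + slope_raw * display_feet)
    resid_log = sales - (intercept_log + slope_log * np.log(display_feet))
    print("SD of residuals, original scale:", round(resid_raw.std(ddof=2), 1))
    print("SD of residuals, log scale:     ", round(resid_log.std(ddof=2), 1))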

JMP-IN Notes

Fit Y by X view
- Behavior changes depending on whether the predictor is categorical or continuous:
  - Categorical: you get a t-test when you have two groups (Stat 603).
  - Continuous: you get a regression modeling tool.
- Options for this view include the ability to fit models with transformations ("Fit special"), polynomials ("Fit polynomial"), and smooth curves ("Splines").
- Get a residual plot in this view by using the button associated with the specific fitted model (each fitted model in the view has its own residual plot).

Subsets
- Manually select the desired observations, then use Exclude from the Rows menu.
- Or use the Subset command from the Tables menu.

New Application for Today

Providing a margin for error
- What is a confidence interval for the optimal shelf space? What is the probable accuracy of a forecast or prediction?
- Confidence intervals provide the best answers to these types of questions.
- To get these from regression, we need to embellish the regression equation with assumptions that complete the statistical model.

Regression Model

Four key elements of the standard regression model:
0. The equation, stated either in terms of the average of the response,
       ave(Y_i | X_i) = β_0 + β_1 X_i,
   or for the values of the response as
       Y_i = β_0 + β_1 X_i + ε_i.
1. Independent observations (independent errors).
2. Constant variance observations (equal error variance σ²).
3. Normally distributed around the regression line (errors ε_i are N(0, σ²)).

The standard model is a utopian model
- An idealized data-generating process. The closer the data come to this ideal model, the more reliable inference becomes.
- The model is indexed by 3 unknown parameters that define the specific characteristics of the process: the intercept β_0, the slope β_1, and the SD of the errors σ.
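
A minimal sketch of this data-generating process (simulated values, with illustrative parameters β_0 = 10, β_1 = 2, σ = 3 chosen for the example, not taken from the casebook): each assumption of the standard model appears as one line of code.

    import numpy as np

    rng = np.random.default_rng(1)

    # Assumed "true" parameters of the utopian model (chosen for illustration).
    beta0, beta1, sigma = 10.0, 2.0, 3.0

    n = 50
    x = rng.uniform(0, 20, n)
    errors = rng.normal(0, sigma, n)   # independent, constant variance, normal
    y = beta0 + beta1 * x + errors     # Y_i = beta_0 + beta_1 * X_i + eps_i

    # Least squares estimates of the unknown intercept and slope.
    b1, b0 = np.polyfit(x, y, 1)
    print(f"b0 = {b0:.2f} (true 10), b1 = {b1:.2f} (true 2)")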

Supporting Concepts and Terminology

Discussion: What are the errors in a regression model?
- They represent the net effect of omitted factors. The model is a statement only about averages,
      ave(Y_i | X_i) = β_0 + β_1 X_i,
  and the errors represent the deviations from these averages.
- The errors combine all of the omitted factors that determine the response.

The problem with checking the assumptions
- Assumptions 1-3 are statements about the errors ε_i = Y_i − β_0 − β_1 X_i, but the errors are not directly observed.
- The residual is an estimate, or proxy, for the unobservable error term: the observed vertical deviation from the fitted line,
      e_i = Y_i − β̂_0 − β̂_1 X_i.
- RMSE (root mean squared error) is the SD of the residuals.

Transformations and residuals
- The assumptions presume that you know how to transform the data. Check the residuals on the associated transformed scales.

Least squares
- The best fitted line is chosen to minimize the sum of squared residuals.

Assumption           Diagnostic (graphical)           Problem
Linearity            Scatterplots (data/residuals)    Nonlinearity
Independence         Plots showing trends             Autocorrelation
Constant variance    Residual plots                   Heteroscedasticity
Normality            Quantile plots                   Outliers, skewness
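
To make the distinction between errors and residuals concrete, this sketch (a toy data set of my own, not from the casebook) fits the least squares line from its closed-form formulas and computes the RMSE as the SD of the residuals:

    import numpy as np

    # Small made-up data set for illustration.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 4.3, 5.8, 8.4, 9.7, 12.2])

    # Least squares chooses (b0, b1) to minimize the sum of squared residuals.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()

    # Residuals: observed vertical deviations from the fitted line,
    # proxies for the unobservable errors.
    resid = y - (b0 + b1 * x)

    # RMSE, the SD of the residuals (n - 2 df for the two fitted coefficients).
    rmse = np.sqrt(np.sum(resid**2) / (len(x) - 2))
    print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, RMSE = {rmse:.2f}")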

Autocorrelation

- Occurs most clearly with time series data, in which case the residuals appear to track over time.
- Cellular case study (various places in the casebook: pp. 29, 53, 304).
  - Transformed response (the uninterpretable Y^(1/4)).
  - The linear model (on the transformed scale) and its residuals are shown below.

[Figure: Subscribers^(1/4) (0-80) and residuals (-2.0 to 2.0), each plotted against Period (0-25)]

Residuals and transformation
- Subtle point: residual plots are more useful if you plot the data after applying the transformation to the column, rather than looking at the data on the raw scales.
- The view of the fitted regression model on the original scale stresses qualitative features such as diminishing returns to scale. However, when looking for problems in the residuals, it is often best to look at the data on the scales that make the equation linear. The assumptions live on the linear scale.
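
A numerical sketch of what "tracking" looks like (simulated data, not the cellular series): the errors below follow an AR(1) process, and the lag-1 correlation of the residuals exposes dependence that the fitted line alone would hide.

    import numpy as np

    rng = np.random.default_rng(2)

    # Simulate a trend whose errors track over time: AR(1), coefficient 0.8.
    n = 100
    t = np.arange(n)
    eps = np.zeros(n)
    for i in range(1, n):
        eps[i] = 0.8 * eps[i - 1] + rng.normal(0, 1)
    y = 5 + 0.3 * t + eps

    # Fit a line over time and inspect the residuals.
    b1, b0 = np.polyfit(t, y, 1)
    resid = y - (b0 + b1 * t)

    # Lag-1 autocorrelation of the residuals; near 0 under independence.
    lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    print(f"lag-1 autocorrelation of residuals: {lag1:.2f}")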

Heteroscedasticity

- Lack of constant variance implies that we are not using the data efficiently.
- Shows up as residuals that do not have constant variation.
- Remedied by a weighted analysis (we'll not explore this in class).
- Room-cleaning example (pp. 7, 57).

[Figure: RoomsClean (0-80) and residuals (-20 to 20), each plotted against Number of Crews (0-15); the residual spread grows with the number of crews]
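
A quick numerical sketch of the pattern (simulated, not the room-cleaning data): the error SD below grows with the predictor, and comparing the residual spread in the lower and upper halves of X reveals the funnel shape without a plot.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulate rooms cleaned with an error SD proportional to crew size.
    crews = rng.uniform(1, 15, 120)
    rooms = 4 * crews + rng.normal(0, 1, 120) * crews

    b1, b0 = np.polyfit(crews, rooms, 1)
    resid = rooms - (b0 + b1 * crews)

    # Compare the residual spread for small vs. large operations.
    small = resid[crews < crews.mean()]
    large = resid[crews >= crews.mean()]
    print(f"residual SD, small crews: {small.std():.1f}")
    print(f"residual SD, large crews: {large.std():.1f}")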

Outliers (unusually small/large values of X or Y)

Distinguish what makes an observation unusual:
- Unusual in X: leveraged.
- Unusual in Y: a residual of large magnitude (negative or positive).
- A large absolute residual plus a leveraged location makes an influential observation.

The least squares fit is VERY sensitive to influential observations. The goal is not just to remove an unusual point, but to understand why it is unusual.

JMP script illustrating leverage and outliers
- See the impact of moving points.
- The data ellipse quantifies the size of the correlation (more on correlation next time).
- The squared errors of large residuals dominate the least squares procedure.

[Figure: scatterplot of Y (60-130) versus X (20-100) with data ellipse, corr = 0.87]
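
A sketch of leverage and influence on simulated data (illustrative only, not the casebook's script): in simple regression the leverage of point i is h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)², and refitting without a leveraged outlier shows how far one point can move the slope.

    import numpy as np

    rng = np.random.default_rng(4)

    # Well-behaved data plus one point unusual in both X and Y.
    x = np.append(rng.uniform(20, 60, 30), 100.0)                  # leveraged
    y = np.append(60 + 0.7 * x[:30] + rng.normal(0, 5, 30), 70.0)  # big residual

    # Leverage in simple regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx.
    n = len(x)
    h = 1 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
    print(f"leverage of the unusual point: {h[-1]:.2f} (average is {2/n:.2f})")

    # Influence: compare the least squares slope with and without that point.
    slope_all = np.polyfit(x, y, 1)[0]
    slope_wo = np.polyfit(x[:-1], y[:-1], 1)[0]
    print(f"slope with the point:    {slope_all:.2f}")
    print(f"slope without the point: {slope_wo:.2f}")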

Examples for Today (and into Next Class)

Lots of examples to illustrate the issues:
- Utopian data: What happens when all goes well?
- Cellular phones: Dependence apparent in the residuals.
- Cleaning crews: Variance changes for larger operations.
- Philadelphia housing: Unusual points and their effects.
- Building profits.

The ideal regression model (Utopia.jmp, page 47)
- What do the plots look like if everything is working properly?
- Visual training, getting comfortable with the graphical displays.
- Simulated data for which we know the true model.
- Note the interpretation of RMSE (p 48).

Predicting cellular phone use (Cellular.jmp, page 57; discussed above)
- How many subscribers do you predict by the end of this year?
- Nonlinear on the original scale; linear on the transformed scale (p 54).
- The RMSE of the fit is small compared to the scale of the response (p 54).
- The fit appears quite good, until one zooms in with residual plots (p 55).
- The residuals show autocorrelation, tracking over time. Predictions will likely be too small due to the local trend at the end of the data.
- Revisited in Class 12 (see also Class 1, pp 29-37).

Efficiency of cleaning crews (Cleaning.jmp, page 57; discussed above)
- How well can we predict the number of rooms cleaned?
- Variability in the residuals increases with size: heteroscedastic (p 58).
- Not a big deal for estimating the slope, but the lack of constant variance is very important when assessing predictions.

Philadelphia housing prices (Phila.jmp, page 62)
- How do crime rates impact the average selling price of houses?
- The initial plot shows that Center City is leveraged (unusual in X).
- The initial analysis with all the data finds a $577 impact per crime (p 64). The residuals show a lack of normality (p 65).
- Without Center City, the regression has a much steeper decay, $2289 per crime (p 66). The residuals remain non-normal (p 67).
- Why is Center City an outlier? What do we learn from this point? An alternative analysis with a transformation suggests it may not be so unusual (see pages 68-70).

Housing construction (Cottages.jmp, page 78)
- Should the company pursue the market for larger homes?
- The whole analysis hinges on how we treat one very large cottage.
- With this large cottage, we find a steep, positive slope (p 80), as though we have only two points. Without this cottage, we find a much less clear relationship.
- We continue the analysis of these data in Class 3 on Thursday and discover the impact of this single point on prediction and on the statistical properties of our model.

Key Take-Away Points

The regression model adds assumptions to the equation:
- Independent observations.
- Constant variance around the regression line.
- Ideally, a normal distribution around the regression line.

Residuals provide the key to checking the assumptions:
- Poor equation: zoom in with a plot of the residuals.
- Dependence: plots of the residuals over time.
- Constant variance: plots of the residuals versus the predictor.
- Normality: quantile plots of the residuals.

One outlier can exert much influence on a regression:
- Ideas of leverage and influence.
- The difficult choice of what to do about an outlier.

Next Time
- Build on the assumptions we have developed today: confidence intervals, prediction intervals, and general inference (making statements about the underlying population).