Lecture 6: Linear Regression

Reading: Sections 3.1-3
STATS 202: Data mining and analysis
Jonathan Taylor, 10/5
Slide credits: Sergio Bacallado

Simple linear regression

Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma)$ i.i.d.

The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to minimize the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$

[Figure 3.1: scatter plot of Sales against TV.]

Estimates $\hat\beta_0$ and $\hat\beta_1$

A little calculus shows that the minimizers of the RSS are:

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$
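As a quick check, these closed-form estimates can be computed directly and compared with lm(). A minimal sketch with simulated data; the x and y below are made-up stand-ins, not the Advertising data from the figure.

    # Sketch: closed-form simple-regression estimates vs. lm() (simulated data).
    set.seed(1)
    x <- runif(100, 0, 300)
    y <- 7 + 0.05 * x + rnorm(100, sd = 3)
    beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    beta0_hat <- mean(y) - beta1_hat * mean(x)
    c(beta0_hat, beta1_hat)
    coef(lm(y ~ x))                     # agrees with the formulas above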

Assessing the accuracy of $\hat\beta_0$ and $\hat\beta_1$

The standard errors of the parameters are:

$$\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.$$

The 95% confidence intervals are approximately:

$$\hat\beta_0 \pm 2\,\mathrm{SE}(\hat\beta_0), \qquad \hat\beta_1 \pm 2\,\mathrm{SE}(\hat\beta_1).$$

These calculations depend on the model.

[Figure 3.3: two panels of simulated data, Y against X.]
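A quick numerical check of these formulas on simulated data; the plug-in estimate of $\sigma^2$ uses $n - 2$ degrees of freedom, and confint() gives the exact t-based intervals for comparison.

    # Sketch: plug-in standard errors and rough 95% intervals (simulated data).
    set.seed(1)
    x <- runif(100, 0, 300)
    y <- 7 + 0.05 * x + rnorm(100, sd = 3)
    fit <- lm(y ~ x)
    n  <- length(y)
    s2 <- sum(resid(fit)^2) / (n - 2)                          # estimate of sigma^2
    se_b1 <- sqrt(s2 / sum((x - mean(x))^2))
    se_b0 <- sqrt(s2 * (1 / n + mean(x)^2 / sum((x - mean(x))^2)))
    coef(fit)[2] + c(-2, 2) * se_b1                            # beta1_hat +/- 2 SE
    confint(fit)                                               # exact t-based intervals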

Hypothesis test

$H_0$: There is no relationship between X and Y, i.e. $\beta_1 = 0$.
$H_a$: There is some relationship between X and Y, i.e. $\beta_1 \neq 0$.

Test statistic:

$$t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}.$$

Under the null hypothesis (a special case of the model), this has a t-distribution with $n - 2$ degrees of freedom.
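This test can be carried out by hand and checked against R's coefficient table; pt() supplies the two-sided p-value. Again a sketch on simulated data.

    # Sketch: t statistic for H0: beta1 = 0 and its two-sided p-value (simulated data).
    set.seed(1)
    x <- runif(100, 0, 300)
    y <- 7 + 0.05 * x + rnorm(100, sd = 3)
    fit <- lm(y ~ x)
    se_b1  <- summary(fit)$coefficients["x", "Std. Error"]
    t_stat <- coef(fit)["x"] / se_b1
    p_val  <- 2 * pt(abs(t_stat), df = length(y) - 2, lower.tail = FALSE)
    c(t_stat, p_val)
    summary(fit)$coefficients           # R reports the same t value and Pr(>|t|)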

Interpreting the hypothesis test

If we reject the null hypothesis, can we assume there is an exact linear relationship?

No. A quadratic relationship may be a better fit, for example. There is evidence of linear association.

If we don't reject the null hypothesis, can we assume there is no relationship between X and Y?

No. This test is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.

Multiple linear regression

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma) \text{ i.i.d.}$$

or, in matrix notation:

$$\mathbb{E}[y] = X\beta,$$

where $y = (y_1, \dots, y_n)^T$, $\beta = (\beta_0, \dots, \beta_p)^T$, and $X$ is our usual data matrix with an extra column of ones on the left to account for the intercept.

[Figure 3.4: Y as a function of X1 and X2.]

Multiple linear regression can answer several questions:

- Is at least one of the variables $X_i$ useful for predicting the outcome $Y$?
- Which subset of the predictors is most important?
- How good is a linear model for these data?
- Given a set of predictor values, what is a likely value for $Y$, and how accurate is this prediction?

The estimates $\hat\beta$

Our goal again is to minimize the RSS:

$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \dots - \beta_p x_{i,p})^2.$$

One can show that this is minimized by the vector $\hat\beta$:

$$\hat\beta = (X^T X)^{-1} X^T y.$$

We also write RSS for the minimized sum of squares.
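A small sketch of this closed-form solution with a simulated design matrix; note that in practice lm() solves the least-squares problem via a QR decomposition rather than forming $(X^T X)^{-1}$.

    # Sketch: normal-equations solution on simulated data.
    set.seed(2)
    X <- cbind(1, matrix(rnorm(100 * 3), 100, 3))        # intercept column + 3 predictors
    beta <- c(1, 2, -1, 0.5)
    y <- drop(X %*% beta) + rnorm(100)
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)             # (X'X)^{-1} X'y
    cbind(beta_hat, coef(lm(y ~ X[, -1])))                # matches lm()'s estimates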

Which variables are important?

Consider the hypothesis that the last $q$ predictors have no relation with $Y$:

$$H_0: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0.$$

Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. The $F$-statistic is defined by:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}.$$

Under the null hypothesis, this has an $F$-distribution with $q$ and $n - p - 1$ degrees of freedom.

Example: If $q = p$, we test whether any of the variables is important. In that case,

$$\mathrm{RSS}_0 = \sum_{i=1}^n (y_i - \bar y)^2.$$
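The statistic can be computed directly and checked against anova(), which performs the same model comparison; the data frame below is simulated, not one of the lecture's datasets.

    # Sketch: F-test for dropping the last q = 2 of p = 3 predictors (simulated data).
    set.seed(3)
    dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
    dat$y <- 1 + 2 * dat$x1 + rnorm(100)                 # x2, x3 truly irrelevant
    full <- lm(y ~ x1 + x2 + x3, data = dat)
    red  <- lm(y ~ x1, data = dat)                       # model without x2, x3
    rss <- sum(resid(full)^2); rss0 <- sum(resid(red)^2)
    n <- nrow(dat); p <- 3; q <- 2
    F_stat <- ((rss0 - rss) / q) / (rss / (n - p - 1))
    c(F_stat, pf(F_stat, q, n - p - 1, lower.tail = FALSE))
    anova(red, full)                                     # same F statistic and p-value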

Which variables are important?

A multiple linear regression in R has the following output:

[R output shown on the slide.]
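A sketch of how such output is produced; the data frame name Advertising and its columns (TV, radio, newspaper, sales) are assumptions about the dataset behind the slide, not taken from the transcription.

    # Hypothetical call producing a coefficient table like the one on the slide.
    fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
    summary(fit)        # estimates, standard errors, t values, and p-values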

Which variables are important?

The t-statistic associated with the $i$th predictor is the square root of the $F$-statistic for the null hypothesis which sets only $\beta_i = 0$.

A low p-value indicates that the predictor is important (it adds something to the model not explained by the other variables).

P-hacking warning: If there are many predictors, even under the null hypothesis, some of the t-tests will have low p-values.

How many variables are important?

When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

- Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step (a small sketch follows below).
- Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.
- Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing a model this way is a form of tuning. P-hacking warning: hypothesis tests and confidence intervals are not valid after selection!
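A minimal, hypothetical implementation of forward selection as described above, adding at each step the predictor that most reduces the RSS; the function name and the dat/response arguments are illustrative, not from the lecture.

    # Sketch: greedy forward selection by RSS (illustrative helper, not an official routine).
    forward_select <- function(dat, response, k) {
      remaining <- setdiff(names(dat), response)
      chosen <- character(0)
      for (s in seq_len(k)) {
        rss <- sapply(remaining, function(v) {
          f <- reformulate(c(chosen, v), response)       # e.g. y ~ x3 + x1
          sum(resid(lm(f, data = dat))^2)
        })
        best <- names(which.min(rss))                    # largest RSS reduction
        chosen <- c(chosen, best)
        remaining <- setdiff(remaining, best)
      }
      chosen
    }
    # Example: forward_select(dat, "y", k = 2) on the simulated data frame above.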

How good is the fit?

To assess the fit, we focus on the residuals. The RSS always decreases as we add more variables. The residual standard error (RSE) corrects for this:

$$\mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}.$$

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions.

[Figure: Sales plotted against TV and Radio.]
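For reference, the RSE is what summary() reports as the residual standard error and what sigma() extracts; a quick check on simulated data, assuming nothing about the lecture's datasets.

    # Sketch: RSE computed by hand vs. sigma() (simulated data).
    set.seed(4)
    dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
    dat$y <- 1 + dat$x1 - dat$x2 + rnorm(50)
    fit <- lm(y ~ x1 + x2, data = dat)
    rss <- sum(resid(fit)^2)
    sqrt(rss / (nrow(dat) - 2 - 1))      # RSE with p = 2 predictors
    sigma(fit)                           # same value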

How good are the predictions?

The function predict in R outputs predictions from a linear model:

- Confidence intervals reflect the uncertainty in $\hat\beta$.
- Prediction intervals reflect the uncertainty in $\hat\beta$ as well as the irreducible error $\varepsilon$.
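A short sketch of both interval types on simulated data; the interval argument of predict() selects which one is returned.

    # Sketch: confidence vs. prediction intervals from predict() (simulated data).
    set.seed(5)
    x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
    fit <- lm(y ~ x)
    new <- data.frame(x = c(-1, 0, 1))
    predict(fit, newdata = new, interval = "confidence")  # uncertainty in beta-hat only
    predict(fit, newdata = new, interval = "prediction")  # also includes the error term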

Dealing with categorical or qualitative predictors

Example: Credit dataset.

[Figure: scatterplot matrix of the quantitative variables Balance, Age, Cards, Education, Income, Limit, and Rating.]

In addition, there are 4 qualitative variables:

- gender: male, female.
- student: student or not.
- status: married, single, divorced.
- ethnicity: African American, Asian, Caucasian.

Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. ethnicity:

- Choose a baseline category, e.g. African American.
- For every other category, define a new predictor: $X_{\text{Asian}}$ is 1 if the person is Asian and 0 otherwise; $X_{\text{Caucasian}}$ is 1 if the person is Caucasian and 0 otherwise.

The model will be:

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_7 X_7 + \beta_{\text{Asian}} X_{\text{Asian}} + \beta_{\text{Caucasian}} X_{\text{Caucasian}} + \varepsilon.$$

$\beta_{\text{Asian}}$ is the relative effect on balance for being Asian compared to the baseline category.
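In R this coding is handled automatically for factors; model.matrix() shows the dummy variables that are created, and relevel() changes the baseline. The small data frame here is a made-up stand-in for the Credit data.

    # Sketch: dummy coding of a qualitative predictor (made-up data).
    dat <- data.frame(
      balance   = c(500, 300, 700, 200, 900, 400),
      ethnicity = factor(c("African American", "Asian", "Caucasian",
                           "Asian", "Caucasian", "African American"))
    )
    model.matrix(~ ethnicity, data = dat)        # baseline = African American (alphabetical)
    dat$ethnicity <- relevel(dat$ethnicity, ref = "Caucasian")   # change the baseline
    coef(lm(balance ~ ethnicity, data = dat))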

Dealing with categorical or qualitative predictors

- The model fit and predictions are independent of the choice of the baseline category.
- However, hypothesis tests derived from these variables are affected by the choice.
- Solution: To check whether ethnicity is important, use an $F$-test for the hypothesis $\beta_{\text{Asian}} = \beta_{\text{Caucasian}} = 0$. This does not depend on the coding.
- Other ways to encode qualitative predictors produce the same fit $\hat f$, but the coefficients have different interpretations.
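In R, this coding-independent F-test is the comparison of the models with and without the factor, e.g. via anova(); this continues the made-up data frame from the previous sketch.

    # Sketch: F-test for the whole ethnicity factor (made-up data from the previous sketch).
    fit0 <- lm(balance ~ 1, data = dat)            # model without ethnicity
    fit1 <- lm(balance ~ ethnicity, data = dat)    # model with ethnicity
    anova(fit0, fit1)                              # tests beta_Asian = beta_Caucasian = 0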

Recap

So far, we have:

- Defined multiple linear regression.
- Discussed how to test the importance of variables.
- Described one approach to choose a subset of variables.
- Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit?

To assess the fit, we focus on the residuals.

- $R^2 = \mathrm{Corr}(Y, \hat Y)^2$, which always increases as we add more variables.
- The residual standard error, $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n - p - 1)}$, does not always improve with more predictors.
- Visualizing the residuals can reveal phenomena that are not accounted for by the model.

Potential issues in linear regression

1. Interactions between predictors
2. Non-linear relationships
3. Correlation of error terms
4. Non-constant variance of error (heteroskedasticity)
5. Outliers
6. High leverage points
7. Collinearity

Interactions between predictors

Linear regression has an additive assumption:

$$\mathrm{sales} = \beta_0 + \beta_1 \cdot \mathrm{tv} + \beta_2 \cdot \mathrm{radio} + \varepsilon,$$

i.e. an increase of $100 in TV ads causes a fixed increase in sales, regardless of how much you spend on radio ads.

If we visualize the residuals, it is clear that this is false.

[Figure: residuals of the regression of Sales on TV and Radio.]

Interactions between predictors

One way to deal with this is to include multiplicative variables in the model:

$$\mathrm{sales} = \beta_0 + \beta_1 \cdot \mathrm{tv} + \beta_2 \cdot \mathrm{radio} + \beta_3 \cdot (\mathrm{tv} \times \mathrm{radio}) + \varepsilon.$$

The interaction variable is high when both tv and radio are high.

Interactions between predictors

R makes it easy to include interaction variables in the model:

[R output shown on the slide.]
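A sketch of the two equivalent ways of writing this in a formula; the data frame name Advertising and its column names are assumptions about the dataset, not taken from the slide.

    # Sketch: interaction terms in an lm() formula (assumed Advertising-style data frame).
    # TV:radio adds only the product term; TV * radio expands to TV + radio + TV:radio.
    fit <- lm(sales ~ TV + radio + TV:radio, data = Advertising)
    fit <- lm(sales ~ TV * radio, data = Advertising)     # equivalent shorthand
    summary(fit)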

Non-linearities

Example: Auto dataset. A scatterplot between a predictor and the response may reveal a non-linear relationship.

[Figure: Miles per gallon vs. Horsepower, with linear, degree-2, and degree-5 polynomial fits.]

Solution: include polynomial terms in the model:

$$\mathrm{MPG} = \beta_0 + \beta_1 \cdot \mathrm{horsepower} + \beta_2 \cdot \mathrm{horsepower}^2 + \beta_3 \cdot \mathrm{horsepower}^3 + \dots + \varepsilon.$$
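In R, polynomial terms can be added with poly() (orthogonal polynomials by default) or with explicit powers via I(); an Auto data frame with mpg and horsepower columns, as in the ISLR package, is assumed here.

    # Sketch: quadratic fit to the Auto data (assumes the Auto data frame is loaded).
    fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)
    # same fitted values with explicit powers (the coefficients differ):
    fit2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
    summary(fit2)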

Non-linearities

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors? Plot the residuals against the fitted values and look for a pattern.

[Figure: residual plots for the linear fit and the quadratic fit, residuals vs. fitted values.]
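A sketch of this diagnostic, again assuming the Auto data frame; plot(fit, which = 1) produces the residuals-vs-fitted panel directly.

    # Sketch: residuals vs. fitted values for a linear and a quadratic fit (assumes Auto).
    fit1 <- lm(mpg ~ horsepower, data = Auto)
    fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)
    plot(fitted(fit1), resid(fit1), xlab = "Fitted values", ylab = "Residuals")
    plot(fitted(fit2), resid(fit2), xlab = "Fitted values", ylab = "Residuals")
    # or simply: plot(fit1, which = 1)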

Correlation of error terms

We assumed that the errors for each sample are independent:

$$y_i = f(x_i) + \varepsilon_i; \qquad \varepsilon_i \sim N(0, \sigma) \text{ i.i.d.}$$

What if this breaks down?

The main effect is that this invalidates any assertions about standard errors, confidence intervals, and hypothesis tests.

Example: Suppose that by accident, we double the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of $\sqrt{2}$.
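A quick simulated illustration of this effect (purely illustrative, not from the slides):

    # Sketch: duplicating the data shrinks reported standard errors by about sqrt(2).
    set.seed(6)
    x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
    se_orig    <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
    se_doubled <- summary(lm(rep(y, 2) ~ rep(x, 2)))$coefficients[2, "Std. Error"]
    c(se_orig, se_doubled, se_orig / se_doubled)   # the ratio is close to sqrt(2)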

Correlation of error terms

When could this happen in real life?

- Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.
- Spatial data: Each sample corresponds to a different location in space.
- A study on predicting height from weight at birth: suppose some of the subjects in the study are in the same family; their shared environment could make them deviate from $f(x)$ in similar ways.

Correlation of error terms

Simulations of time series with increasing correlations between the $\varepsilon_i$.

[Figure: residuals over 100 observations for ρ = 0.0, ρ = 0.5, and ρ = 0.9.]

Non-constant variance of error (heteroskedasticity)

The variance of the error depends on the input. To diagnose this, we can plot residuals vs. fitted values.

[Figure: residual plots for the response Y and for the response log(Y).]

Solution: If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
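A sketch of this diagnosis and fix on simulated heteroskedastic data; the funnel shape in the first residual plot disappears after the log transform.

    # Sketch: diagnosing heteroskedasticity and refitting on log(y) (simulated data).
    set.seed(7)
    x <- runif(200, 1, 10)
    y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.3))        # multiplicative noise
    fit     <- lm(y ~ x)
    fit_log <- lm(log(y) ~ x)
    plot(fitted(fit), resid(fit))          # funnel shape: spread grows with the fit
    plot(fitted(fit_log), resid(fit_log))  # roughly constant spread after the transform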