Lecture 6: Linear Regression
Reading: Sections 3.1-3
STATS 202: Data mining and analysis
Jonathan Taylor, 10/5
Slide credits: Sergio Bacallado
Simple linear regression

Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma)$ i.i.d.

[Figure 3.1: Sales versus TV advertising budget.]

The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to minimize the residual sum of squares (RSS):

RSS = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.
Estimates $\hat\beta_0$ and $\hat\beta_1$

A little calculus shows that the minimizers of the RSS are:

\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.
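A minimal R sketch of these formulas on simulated data (the variable names and numbers below are made up for illustration); the closed-form estimates agree with the coefficients returned by lm():

```r
set.seed(1)
x <- runif(100, 0, 300)                 # hypothetical TV budgets
y <- 7 + 0.05 * x + rnorm(100, sd = 3)  # hypothetical sales

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # matches the closed-form estimates
```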
Assessing the accuracy of $\hat\beta_0$ and $\hat\beta_1$

[Figure 3.3: Y versus X.]

The standard errors for the parameters are:

SE(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right], \qquad SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.

The 95% confidence intervals:

\hat\beta_0 \pm 2\, SE(\hat\beta_0), \qquad \hat\beta_1 \pm 2\, SE(\hat\beta_1).

These calculations depend on the model.
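Continuing the simulated x and y from the previous sketch, the standard-error formulas can be computed directly (using the usual estimate of sigma) and compared with R's confint(), which uses the exact t quantile rather than the factor of 2:

```r
fit <- lm(y ~ x)
n <- length(y)
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - 2))   # estimate of sigma

se_beta0 <- sigma_hat * sqrt(1 / n + mean(x)^2 / sum((x - mean(x))^2))
se_beta1 <- sigma_hat / sqrt(sum((x - mean(x))^2))

# Approximate 95% intervals using the "estimate +/- 2 SE" rule
rbind(coef(fit)[1] + c(-2, 2) * se_beta0,
      coef(fit)[2] + c(-2, 2) * se_beta1)
confint(fit)   # R's exact intervals are very close
```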
Hypothesis test

H_0: \beta_1 = 0 (there is no relationship between X and Y).
H_a: \beta_1 \neq 0 (there is some relationship between X and Y).

Test statistic:

t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}.

Under the null hypothesis (a special case of the model), this has a t-distribution with n - 2 degrees of freedom.
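A brief sketch of this test on the simulated fit from above; the hand-computed t statistic and two-sided p-value match the row for x in summary():

```r
t_stat <- coef(fit)[2] / se_beta1
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value
c(t = t_stat, p = p_val)

summary(fit)$coefficients   # same "t value" and "Pr(>|t|)" for x
```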
Interpreting the hypothesis test

If we reject the null hypothesis, can we assume there is an exact linear relationship?

No. A quadratic relationship may be a better fit, for example. There is evidence of linear association.

If we don't reject the null hypothesis, can we assume there is no relationship between X and Y?

No. This test is only powerful against certain monotone alternatives; there could be more complex non-linear relationships.
Multiple linear regression

Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma) \text{ i.i.d.}

[Figure 3.4: Y as a function of X1 and X2.]

or, in matrix notation:

E[y] = X\beta,

where $y = (y_1, \dots, y_n)^T$, $\beta = (\beta_0, \dots, \beta_p)^T$, and X is our usual data matrix with an extra column of ones on the left to account for the intercept.
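In R, model.matrix() builds exactly this X, including the leading column of ones; a small sketch with two hypothetical predictors:

```r
d <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
model.matrix(~ x1 + x2, data = d)
# Columns: (Intercept), x1, x2 -- the column of ones corresponds to beta_0
```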
Multiple linear regression can answer several questions

Is at least one of the variables X_i useful for predicting the outcome Y?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for Y, and how accurate is this prediction?
The estimates $\hat\beta$

Our goal again is to minimize the RSS:

RSS = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \dots - \beta_p x_{i,p})^2.

One can show that this is minimized by the vector $\hat\beta$:

\hat\beta = (X^T X)^{-1} X^T y.

We also write RSS for the minimized sum of squares.
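A sketch of this closed-form solution on simulated data; lm() actually uses a numerically stabler QR decomposition, but the estimates agree:

```r
set.seed(2)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)

X <- model.matrix(~ x1 + x2, data = d)        # column of ones + predictors
beta_hat <- solve(t(X) %*% X, t(X) %*% d$y)   # (X^T X)^{-1} X^T y
drop(beta_hat)
coef(lm(y ~ x1 + x2, data = d))               # same values from lm()
```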
Which variables are important?

Consider the hypothesis:

H_0: the last q predictors have no relation with Y, i.e. \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0.

Let RSS_0 be the residual sum of squares for the model which excludes these variables.

The F-statistic is defined by:

F = \frac{(RSS_0 - RSS)/q}{RSS/(n - p - 1)}.

Under the null hypothesis, this has an F-distribution with q and n - p - 1 degrees of freedom.

Example: If q = p, we test whether any of the variables is important. In that case,

RSS_0 = \sum_{i=1}^n (y_i - \bar y)^2.
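In R, anova() on two nested fits computes exactly this F statistic; a sketch with hypothetical predictors, where the reduced model drops the last q = 2 of them:

```r
set.seed(3)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
sim$y <- 2 + sim$x1 + rnorm(100)           # x2 and x3 are truly irrelevant here

fit_full    <- lm(y ~ x1 + x2 + x3, data = sim)
fit_reduced <- lm(y ~ x1, data = sim)      # RSS_0 comes from this model

anova(fit_reduced, fit_full)               # F = ((RSS_0 - RSS)/q) / (RSS/(n - p - 1))
```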
Which variables are important?

A multiple linear regression in R produces a summary table with one row per coefficient: the estimate, its standard error, the t value, and the p-value. A sketch of the call appears below.
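A hedged reconstruction of the fit behind the slide's output, assuming the Advertising data from the ISLR text (the file path and the column names TV, Radio, Newspaper, Sales are assumptions):

```r
adv <- read.csv("Advertising.csv")   # assumed local copy of the Advertising data
fit_adv <- lm(Sales ~ TV + Radio + Newspaper, data = adv)
summary(fit_adv)   # coefficients, SEs, t values, p-values, RSE, R^2, F-statistic
```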
Which variables are important?

The t-statistic associated with the ith predictor is the square root of the F-statistic for the null hypothesis which sets only \beta_i = 0.

A low p-value indicates that the predictor is important: it adds something to the model not explained by the other variables.

P-hacking warning: if there are many predictors, some of the t-tests will have low p-values even under the null hypothesis.
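A quick simulation sketch of the warning: with 50 pure-noise predictors and no real relationship, a few individual t-tests still come out below 0.05 just by chance:

```r
set.seed(4)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                                   # no real relationship at all
fit_null <- lm(y ~ X)
pvals <- summary(fit_null)$coefficients[-1, 4]  # per-predictor p-values, intercept dropped
sum(pvals < 0.05)                               # typically a handful of "significant" hits
```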
How many variables are important?

When we select a subset of the predictors, we have 2^p choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing a model this way is a form of tuning. P-hacking: hypothesis tests and confidence intervals are not valid after selection! (A sketch using R's step() follows.)
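R's step() automates this kind of search. It ranks candidate models by AIC rather than by raw RSS or p-values, but the forward/backward/both mechanics illustrate the same idea; a sketch on simulated data:

```r
set.seed(5)
dat <- data.frame(matrix(rnorm(100 * 5), 100, 5))
names(dat) <- paste0("x", 1:5)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)

null_fit <- lm(y ~ 1, data = dat)                  # intercept-only ("null") model
full_fit <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)

# Forward selection from the null model; step() compares models by AIC
step(null_fit, scope = y ~ x1 + x2 + x3 + x4 + x5, direction = "forward")

# Backward elimination from the full model; direction = "both" resembles mixed selection
step(full_fit, direction = "backward")
```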
How good is the fit?

To assess the fit, we focus on the residuals.

The RSS always decreases as we add more variables. The residual standard error (RSE) corrects for this:

RSE = \sqrt{\frac{1}{n - p - 1}\, RSS}.

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions.

[Figure: Sales as a function of TV and Radio.]
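In R the RSE of a fitted model is returned by sigma(); a sketch computing it by hand as well, reusing the simulated dat from the selection example above:

```r
fit_two <- lm(y ~ x1 + x2, data = dat)       # dat from the step() sketch
p <- 2                                       # number of predictors in this model
rse <- sqrt(sum(resid(fit_two)^2) / (nrow(dat) - p - 1))
c(by_hand = rse, from_R = sigma(fit_two))    # the two values agree
```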
How good are the predictions?

The function predict in R outputs predictions from a linear model.

Confidence intervals reflect the uncertainty in $\hat\beta$.

Prediction intervals reflect the uncertainty in $\hat\beta$ and the irreducible error $\varepsilon$ as well.
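A sketch of both kinds of interval, continuing with fit_two from above; prediction intervals are wider because they also account for the irreducible error:

```r
new_obs <- data.frame(x1 = 0.5, x2 = -1)                      # hypothetical new observation
predict(fit_two, newdata = new_obs, interval = "confidence")  # uncertainty in beta-hat only
predict(fit_two, newdata = new_obs, interval = "prediction")  # adds the irreducible error
```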
Dealing with categorical or qualitative predictors

Example: Credit dataset.

[Figure: scatterplot matrix of Balance, Age, Cards, Education, Income, Limit, and Rating.]

In addition, there are 4 qualitative variables:

gender: male, female.
student: student or not.
status: married, single, divorced.
ethnicity: African American, Asian, Caucasian.
Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. ethnicity:

Choose a baseline category, e.g. African American.

For every other category, define a new predictor: X_Asian is 1 if the person is Asian and 0 otherwise; X_Caucasian is 1 if the person is Caucasian and 0 otherwise.

The model will be:

Y = \beta_0 + \beta_1 X_1 + \dots + \beta_7 X_7 + \beta_{Asian} X_{Asian} + \beta_{Caucasian} X_{Caucasian} + \varepsilon.

\beta_{Asian} is the relative effect on balance of being Asian compared to the baseline category.
Dealing with categorical or qualitative predictors

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether ethnicity is important, use an F-test for the hypothesis \beta_{Asian} = \beta_{Caucasian} = 0. This does not depend on the coding.

Other ways to encode qualitative predictors produce the same fit $\hat f$, but the coefficients have different interpretations.
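A sketch of how R handles all of this through factors, on hypothetical Credit-like data: model.matrix() shows the 0/1 dummy columns, relevel() changes the baseline without changing the fit, and anova() gives the coding-independent F-test:

```r
set.seed(6)
credit <- data.frame(
  income    = rnorm(90, mean = 45, sd = 10),
  ethnicity = factor(rep(c("African American", "Asian", "Caucasian"), 30))
)
credit$balance <- 500 + 2 * credit$income + rnorm(90, sd = 50)

head(model.matrix(~ ethnicity, data = credit))     # dummy columns relative to the baseline
fit1 <- lm(balance ~ income + ethnicity, data = credit)

credit$ethnicity <- relevel(credit$ethnicity, ref = "Caucasian")
fit2 <- lm(balance ~ income + ethnicity, data = credit)
all.equal(fitted(fit1), fitted(fit2))              # same fit, different coefficient coding

anova(lm(balance ~ income, data = credit), fit1)   # F-test for ethnicity as a whole
```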
Recap

So far, we have:

Defined multiple linear regression.
Discussed how to test the importance of variables.
Described one approach to choose a subset of variables.
Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?
How good is the fit?

To assess the fit, we focus on the residuals.

R^2 = Corr(Y, \hat Y)^2 always increases as we add more variables.

The residual standard error (RSE) does not always improve with more predictors:

RSE = \sqrt{\frac{1}{n - p - 1}\, RSS}.

Visualizing the residuals can reveal phenomena that are not accounted for by the model.
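A one-line check of the correlation identity on any fitted lm object, here fit_two from the earlier sketch:

```r
c(summary(fit_two)$r.squared, cor(dat$y, fitted(fit_two))^2)   # the two numbers coincide
```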
Potential issues in linear regression

1. Interactions between predictors
2. Non-linear relationships
3. Correlation of error terms
4. Non-constant variance of the error (heteroskedasticity)
5. Outliers
6. High leverage points
7. Collinearity
Interactions between predictors

Linear regression has an additive assumption:

sales = \beta_0 + \beta_1 \cdot tv + \beta_2 \cdot radio + \varepsilon

i.e. an increase of $100 in TV ads causes a fixed increase in sales, regardless of how much you spend on radio ads.

If we visualize the residuals, it is clear that this is false.

[Figure: Sales as a function of TV and Radio.]
Interactions between predictors

One way to deal with this is to include multiplicative variables in the model:

sales = \beta_0 + \beta_1 \cdot tv + \beta_2 \cdot radio + \beta_3 \cdot (tv \times radio) + \varepsilon

The interaction variable is high when both tv and radio are high.
Interactions between predictors

R makes it easy to include interaction variables in the model (a sketch of the call follows).
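The slide's R output is not reproduced here. In a formula, TV:Radio denotes the product term, and TV * Radio expands to the main effects plus the interaction (column names assumed as in the Advertising sketch above):

```r
fit_int <- lm(Sales ~ TV * Radio, data = adv)   # same as Sales ~ TV + Radio + TV:Radio
summary(fit_int)
```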
Non-linearities

Example: Auto dataset.

[Figure: Miles per gallon versus Horsepower, with linear, degree-2, and degree-5 polynomial fits.]

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model:

MPG = \beta_0 + \beta_1 \cdot horsepower + \beta_2 \cdot horsepower^2 + \beta_3 \cdot horsepower^3 + \dots + \varepsilon
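In R, polynomial terms can be added with poly() or with I(); a sketch assuming the Auto data frame from the ISLR package is available:

```r
library(ISLR)                        # provides the Auto data set (assumed installed)
fit_lin   <- lm(mpg ~ horsepower, data = Auto)
fit_quad  <- lm(mpg ~ poly(horsepower, 2), data = Auto)          # orthogonal polynomials
fit_quad2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto) # same fit, raw terms
anova(fit_lin, fit_quad)             # F-test for whether the quadratic term helps
```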
Non-linearities

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Plot the residuals against the fitted values and look for a pattern.

[Figure: residual plot for the linear fit and residual plot for the quadratic fit, residuals versus fitted values.]
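In R this is one line; a sketch for the quadratic Auto fit from above:

```r
plot(fitted(fit_quad), resid(fit_quad),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# plot(fit_quad, which = 1) gives the same diagnostic with a smoother overlaid
```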
Correlation of error terms

We assumed that the errors for each sample are independent:

y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma) \text{ i.i.d.}

What if this breaks down?

The main effect is that this invalidates any assertions about standard errors, confidence intervals, and hypothesis tests.

Example: Suppose that by accident, we double the data (we use each sample twice). Then the standard errors would be artificially smaller by a factor of \sqrt{2}.
Correlation of error terms

When could this happen in real life?

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Study on predicting height from weight at birth: suppose some of the subjects in the study are in the same family; their shared environment could make them deviate from f(x) in similar ways.
Correlation of error terms

Simulations of time series with increasing correlations between the \varepsilon_i.

[Figure: residuals versus observation number for rho = 0.0, 0.5, and 0.9.]
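A sketch of how such panels can be generated, simulating AR(1) errors with correlation rho between consecutive terms:

```r
set.seed(7)
simulate_ar1 <- function(n, rho) {
  eps <- numeric(n)
  eps[1] <- rnorm(1)
  for (t in 2:n) eps[t] <- rho * eps[t - 1] + rnorm(1)   # each error leans on the previous one
  eps
}

par(mfrow = c(3, 1))
for (rho in c(0.0, 0.5, 0.9)) {
  plot(simulate_ar1(100, rho), type = "l",
       xlab = "Observation", ylab = "Residual", main = paste("rho =", rho))
}
```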
Non-constant variance of error (heteroskedasticity)

The variance of the error depends on the input.

To diagnose this, we can plot residuals vs. fitted values.

[Figure: residuals versus fitted values for response Y and for response log(Y).]

Solution: If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
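A sketch of the diagnosis and the log fix on simulated data whose noise grows with the mean (all numbers hypothetical):

```r
set.seed(8)
x <- runif(200, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.2))   # multiplicative noise => funnel-shaped residuals

par(mfrow = c(1, 2))
fit_raw <- lm(y ~ x)
plot(fitted(fit_raw), resid(fit_raw), main = "Response Y")        # spread grows with the fit
fit_log <- lm(log(y) ~ x)
plot(fitted(fit_log), resid(fit_log), main = "Response log(Y)")   # roughly constant spread
```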