Lecture 6: Linear Regression (continued)

Lecture 6: Linear Regression (continued)
Reading: Sections 3.1-3.3
STATS 202: Data mining and analysis
October 6, 2017

Multiple linear regression

Y = β_0 + β_1 X_1 + ... + β_p X_p + ε,   ε ~ N(0, σ) i.i.d.

or, in matrix notation:

y = Xβ + ε,

where y = (y_1, ..., y_n)^T, β = (β_0, ..., β_p)^T, and X is our usual data matrix with an extra column of ones on the left to account for the intercept.

(Figure 3.4: the least squares regression plane over two predictors X1 and X2.)

The estimates β̂

Our goal is to minimize the RSS (training error):

RSS = Σ_{i=1}^n (y_i - ŷ_i)^2
    = Σ_{i=1}^n (y_i - β_0 - β_1 x_{i,1} - ... - β_p x_{i,p})^2.

This is minimized by the vector β̂:

β̂ = (X^T X)^{-1} X^T y.

This only exists when X^T X is invertible, which requires n ≥ p.
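
A minimal sketch in R, on simulated data (not from the lecture), showing that the closed-form estimate (X^T X)^{-1} X^T y agrees with what lm() computes:

    set.seed(1)
    n <- 100; p <- 3
    X <- cbind(1, matrix(rnorm(n * p), n, p))   # data matrix with a column of ones
    beta <- c(2, 1, -1, 0.5)
    y <- drop(X %*% beta + rnorm(n))

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
    fit <- lm(y ~ X[, -1])                      # lm() adds its own intercept column
    cbind(closed_form = as.numeric(beta_hat), lm_fit = coef(fit))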

Testing whether a group of variables is important

F-test: H_0: β_{p-q+1} = β_{p-q+2} = ... = β_p = 0.

RSS_0 is the residual sum of squares for the model in H_0.

F = [(RSS_0 - RSS)/q] / [RSS/(n - p - 1)].

Special case: q = p. Test whether any of the predictors are related to Y.

Special case: q = 1, i.e. exclude a single variable. Test whether this variable is related to Y after linearly correcting for all other variables. Equivalent to the t-tests in R output. Must be careful with multiple testing.
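
A minimal sketch of this F-test in R, using anova() to compare nested models; `ads` is a hypothetical Advertising-style data frame with columns sales, TV, radio, newspaper, and the hypothesis tested is β_radio = β_newspaper = 0:

    fit0 <- lm(sales ~ TV, data = ads)                    # model under H0
    fit1 <- lm(sales ~ TV + radio + newspaper, data = ads)
    anova(fit0, fit1)   # F = [(RSS0 - RSS)/q] / [RSS/(n - p - 1)], with q = 2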

How many variables are important?

When choosing a subset of the predictors, we have 2^p choices. We cannot test every possible subset! Instead we will use a stepwise approach:
1. Construct a sequence of p models with an increasing number of variables.
2. Select the best model among them.

Three variants of stepwise selection

Forward selection: starting from the null model (intercept only), include variables one at a time, minimizing the RSS at each step.

Backward selection: starting from the full model, eliminate variables one at a time, choosing the one with the largest t-test p-value at each step.

Mixed selection: starting from the null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.
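
A minimal sketch of stepwise selection with R's step() function, which uses the same greedy idea but is guided by AIC rather than raw RSS; `ads` is the hypothetical Advertising-style data frame from above:

    null_fit <- lm(sales ~ 1, data = ads)
    full_fit <- lm(sales ~ TV + radio + newspaper, data = ads)
    step(null_fit, scope = formula(full_fit), direction = "forward")
    # direction = "backward" or "both" gives the other two variants.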

How many variables are important?

The output of a stepwise selection method is a nested sequence of models:
{}
{tv}
{tv, newspaper}
{tv, newspaper, radio}
{tv, newspaper, radio, facebook}
{tv, newspaper, radio, facebook, twitter}

Choosing among these 6 models is far cheaper than searching all 2^5 = 32 subsets. We use different tuning methods to decide which model to use; e.g. cross-validation, AIC, BIC.

Dealing with categorical or qualitative predictors

Example: Credit dataset.

(Figure: scatterplot matrix of the quantitative variables Balance, Age, Cards, Education, Income, Limit, and Rating.)

In addition, there are 4 qualitative variables:
gender: male, female.
student: student or not.
status: married, single, divorced.
ethnicity: African American, Asian, Caucasian.

Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. status:
Choose a baseline category, e.g. single.
For every other category, define a new predictor:
X_married is 1 if the person is married and 0 otherwise.
X_divorced is 1 if the person is divorced and 0 otherwise.

The model will be:
Y = β_0 + β_1 X_1 + ... + β_7 X_7 + β_married X_married + β_divorced X_divorced + ε.

β_married is the relative effect on Balance of being married compared to the baseline category.
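
A minimal sketch of this coding in R; `credit` is a hypothetical data frame with a `status` factor (the ISLR Credit data uses different qualitative variables). R builds the 0/1 dummy columns automatically once the baseline level is set:

    credit$status <- relevel(factor(credit$status), ref = "single")  # baseline = single
    fit <- lm(Balance ~ Income + Limit + status, data = credit)
    summary(fit)              # coefficients statusmarried, statusdivorced
    model.matrix(fit)[1:3, ]  # the dummy (0/1) columns R created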

Dealing with categorical or qualitative predictors

The model fit f̂ and the predictions f̂(x_0) are independent of the choice of baseline category. However, the interpretation of the parameters and the associated hypothesis tests depend on the baseline category.

Solution: to check whether status is important, use an F-test for the hypothesis β_married = β_divorced = 0. This does not depend on the coding of the baseline category.

How uncertain are the predictions?

The function predict in R outputs predictions from a linear model; e.g. for x_0 = (5, 10, 15):

Confidence intervals reflect the uncertainty in β̂; i.e. a confidence interval for f(x_0).

Prediction intervals reflect the uncertainty in β̂ as well as the irreducible error ε; i.e. a confidence interval for y_0.
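
A minimal sketch of the two intervals in R, on the hypothetical `ads` fit, at a new point x_0 = (TV = 5, radio = 10, newspaper = 15):

    fit <- lm(sales ~ TV + radio + newspaper, data = ads)
    x0  <- data.frame(TV = 5, radio = 10, newspaper = 15)
    predict(fit, newdata = x0, interval = "confidence")  # interval for f(x0)
    predict(fit, newdata = x0, interval = "prediction")  # wider: interval for y0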

Recap

So far, we have:
Defined multiple linear regression.
Discussed how to test the relevance of variables.
Described one approach to choosing a subset of variables.
Explained how to code qualitative variables.
Discussed confidence intervals surrounding predictions.

Now, how do we evaluate the model fit? Is the linear model any good? What can go wrong?

How good is the fit?

To assess the fit, we focus on the residuals.

R² = Corr(Y, Ŷ)² always increases as we add more variables.

The residual standard error (RSE) does not always improve with more predictors:
RSE = sqrt( RSS / (n - p - 1) ).

Visualizing the residuals can reveal phenomena that are not accounted for by the model.
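
A minimal sketch recomputing these quantities by hand on the hypothetical `ads` fit and checking them against summary():

    fit <- lm(sales ~ TV + radio + newspaper, data = ads)
    rss <- sum(resid(fit)^2)
    n   <- nrow(ads); p <- 3
    c(R2          = summary(fit)$r.squared,
      RSE         = summary(fit)$sigma,
      RSE_by_hand = sqrt(rss / (n - p - 1)))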

Potential issues in linear regression
1. Interactions between predictors
2. Non-linear relationships
3. Correlation of error terms
4. Non-constant variance of the error (heteroskedasticity)
5. Outliers
6. High leverage points
7. Collinearity

Interactions between predictors

Linear regression has an additive assumption:
sales = β_0 + β_1·tv + β_2·radio + ε,
i.e. an increase of $100 in TV ads causes a fixed increase in sales, regardless of how much you spend on radio ads.

When we visualize the residuals, we see a pronounced non-linear relationship.

(Figure: 3D plot of Sales against TV and Radio.)

Interactions between predictors

One way to deal with this is to include a multiplicative (interaction) variable in the model:
sales = β_0 + β_1·tv + β_2·radio + β_3·(tv × radio) + ε.

The interaction variable is high when both tv and radio are high.

Interactions between predictors

R makes it easy to include interaction variables in the model, using the formula syntax sketched below.
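
A minimal sketch (hypothetical `ads` data) of the two equivalent ways to specify the interaction in an R formula:

    lm(sales ~ TV + radio + TV:radio, data = ads)  # explicit interaction term
    lm(sales ~ TV * radio, data = ads)             # shorthand: main effects + interaction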

Non-linearities

Example: Auto dataset.

(Figure: Miles per gallon vs. Horsepower, with linear, degree-2, and degree-5 polynomial fits.)

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model:
MPG = β_0 + β_1·horsepower + β_2·horsepower² + β_3·horsepower³ + ... + ε.
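
A minimal sketch fitting polynomial terms in horsepower, assuming the Auto data from the ISLR package:

    library(ISLR)
    fit_lin  <- lm(mpg ~ horsepower, data = Auto)
    fit_quad <- lm(mpg ~ poly(horsepower, 2), data = Auto)  # degree-2 polynomial
    fit_q5   <- lm(mpg ~ poly(horsepower, 5), data = Auto)  # degree-5 polynomial
    anova(fit_lin, fit_quad, fit_q5)  # F-tests for the added polynomial terms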

Non-linearities

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have many predictors?

Plot the residuals against the fitted values and look for a pattern.

(Figure: residual plots, residuals vs. fitted values, for the linear fit and for the quadratic fit.)
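
A minimal sketch of this diagnostic plot, continuing the Auto example (ISLR package assumed):

    library(ISLR)
    fit <- lm(mpg ~ horsepower, data = Auto)
    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)
    lines(lowess(fitted(fit), resid(fit)), col = "red")  # smooth curve reveals a pattern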

Correlation of error terms

We assumed that the errors for each sample are independent:
y_i = f(x_i) + ε_i,   ε_i ~ N(0, σ) i.i.d.

What if this assumption breaks down?

The main effect is that this invalidates any assertions about standard errors, confidence intervals, and hypothesis tests.

Example: suppose that by accident, we double the data (we use each sample twice). Then the standard errors would be artificially smaller by a factor of √2.

Correlation of error terms

When could this happen in real life?

Time series: each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: each sample corresponds to a different location in space.

A study predicting height from weight at birth: if some of the subjects are in the same family, their shared environment could make them deviate from f(x) in similar ways.

Correlation of error terms

Simulations of time series with increasing correlation between the ε_i.

(Figure: residuals vs. observation index for ρ = 0.0, ρ = 0.5, and ρ = 0.9.)
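
A minimal sketch reproducing the idea of the simulation: AR(1) errors, where ρ is the correlation between consecutive error terms:

    set.seed(1)
    n <- 100
    par(mfrow = c(3, 1))
    for (rho in c(0.0, 0.5, 0.9)) {
      eps <- numeric(n)
      eps[1] <- rnorm(1)
      for (i in 2:n) eps[i] <- rho * eps[i - 1] + rnorm(1)  # AR(1) error process
      plot(eps, type = "l", xlab = "Observation", ylab = "Residual",
           main = paste("rho =", rho))
    }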

Non-constant variance of the error (heteroskedasticity)

The variance of the error depends on the input. To diagnose this, we can plot the residuals vs. the fitted values.

(Figure: residual plots, residuals vs. fitted values, for the response Y and for the transformed response log(Y).)

Solution: if the trend in the variance is relatively simple, we can transform the response, using a logarithm for example.
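
A minimal sketch of this diagnosis and fix; `d` is a hypothetical data frame with a positive response y and a predictor x:

    fit_raw <- lm(y ~ x, data = d)
    fit_log <- lm(log(y) ~ x, data = d)
    par(mfrow = c(1, 2))
    plot(fitted(fit_raw), resid(fit_raw), main = "Response Y",
         xlab = "Fitted values", ylab = "Residuals")
    plot(fitted(fit_log), resid(fit_log), main = "Response log(Y)",
         xlab = "Fitted values", ylab = "Residuals")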