BANA 7046 Data Mining I Lecture 2. Linear Regression, Model Assessment, and Cross-validation

1 BANA 7046 Data Mining I
Lecture 2. Linear Regression, Model Assessment, and Cross-validation [1]
Shaobo Li, University of Cincinnati
[1] Partially based on Hastie et al. (2009), ESL, and James et al. (2013), ISLR
Data Mining I Lecture 2. Linear Regression 1 / 21

2 Linear Regression: Fundamental Learning Algorithm
Supervised learning method
It assumes the dependence of $Y$ on $X$ is linear
Widely used in many disciplines
Simple and interpretable
Fundamental in data science

3 Linear Regression Models
Simple linear regression
$Y = \beta_0 + \beta_1 X + \epsilon$
Multiple linear regression
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$
- $Y$: dependent variable (response, outcome)
- $X$'s: independent variables (covariates, explanatory variables)
- $\beta$'s: regression coefficients
- $\epsilon$: random error (irreducible error)

4 Linear Regression Models
Using matrix format
$Y = X\beta + \epsilon$
$X$ is called the design matrix, with its first column being 1's
The estimated linear regression model is
$\hat{Y} = E(Y \mid X) = X\hat{\beta}$
Goal: estimate the regression coefficients $\beta$

5 Model Assumptions
$E(Y \mid X)$ is a linear function of $X$ or of its basis expansion, such as $X_1^2, X_2^3, \ldots$
The error terms $\{\epsilon_1, \ldots, \epsilon_n\}$ are i.i.d. $N(0, \sigma^2)$

6 Least Squares Solution

7 Least Squares Solution
We want to minimize the residual sum of squares
$RSS(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$
Take the first-order derivative with respect to $\beta$ and set it to 0
$0 = \frac{\partial RSS(\beta)}{\partial \beta} = -2 X^T (y - X\beta) \;\Rightarrow\; X^T y = X^T X \beta$
This is called the normal equation.

8 Least Squares Solution
Assuming $p < n$ and that $X$ has full column rank (so that $X^T X$ is invertible), the solution is
$\hat{\beta} = (X^T X)^{-1} X^T y$
The predicted value is
$\hat{y} = X (X^T X)^{-1} X^T y$
$H = X (X^T X)^{-1} X^T$ is called the hat matrix or projection matrix
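The normal equation can be solved numerically without forming an explicit inverse. The following is a minimal sketch in plain Python, not code from the lecture: the helper names `solve_linear_system` and `least_squares` and the toy data are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copies are made so the inputs stay untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Return beta_hat solving the normal equation X^T X beta = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # design matrix, first column of 1's
    k = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta_hat = least_squares(rows, y)
print([round(b, 6) for b in beta_hat])
```

Because the toy response is noise-free, the recovered coefficients should match the generating values up to rounding. In practice a QR-based solver is usually preferred for numerical stability.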

9 Least Squares Solution
Least squares produces unbiased estimates, i.e., $E(\hat{\beta}) = \beta$
Variance-covariance matrix
$Var(\hat{\beta}) = (X^T X)^{-1} \sigma^2$
Typically, the estimate of $\sigma^2$ is
$\hat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
This is an unbiased estimate (Exercise: show that $E(\hat{\sigma}^2) = \sigma^2$)
It is commonly called the mean squared error (MSE)
Exercise: show that $(n - p - 1)\hat{\sigma}^2 / \sigma^2 \sim \chi^2_{n-p-1}$
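The unbiasedness of $\hat{\sigma}^2$ can be checked empirically by simulation. This is only an illustrative sketch for simple regression ($p = 1$), not a proof and not part of the lecture: the function names, the fixed design, the true coefficients, and the seed are all assumptions.

```python
import random


def sigma2_hat(x, y):
    """Residual variance estimate RSS / (n - p - 1) for simple regression (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (n - 2)


def average_sigma2_hat(true_sigma, n=20, n_sim=2000, seed=1):
    """Average sigma2_hat over n_sim simulated data sets with a fixed design."""
    rng = random.Random(seed)
    x = [i / 2 for i in range(n)]  # fixed covariate values (assumed)
    total = 0.0
    for _ in range(n_sim):
        # Assumed true model: y = 1 + 0.5 x + N(0, true_sigma^2) noise.
        y = [1.0 + 0.5 * xi + rng.gauss(0.0, true_sigma) for xi in x]
        total += sigma2_hat(x, y)
    return total / n_sim


avg = average_sigma2_hat(true_sigma=2.0)
print(f"average of sigma2_hat: {avg:.3f} (true sigma^2 = 4)")
```

The average should land close to the true $\sigma^2 = 4$. Dividing by $n$ instead of $n - p - 1$ would pull it systematically below 4.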

10 R Square
It is the proportion of variation in $Y$ explained by the model
$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$
However, $R^2$ never decreases as more explanatory variables are added, even uninformative ones.
Adjusted $R^2$
$R^2_{adj} = 1 - \frac{n - 1}{n - p - 1} \cdot \frac{RSS}{TSS}$
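As a small worked example of the two formulas, the sketch below computes $R^2$ and adjusted $R^2$ from observed and fitted values. The function name `r_squared` and the numbers are made up for illustration.

```python
def r_squared(y, y_hat, p):
    """Return (R^2, adjusted R^2) for a model with p explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)              # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (n - 1) / (n - p - 1) * rss / tss
    return r2, r2_adj


# Hypothetical observed and fitted values from a model with p = 1.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.9, 4.9]
r2, r2_adj = r_squared(y, y_hat, p=1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

Here TSS = 10 and RSS = 0.08, so $R^2 = 0.992$, and the adjusted value is slightly smaller because of the $(n-1)/(n-p-1) = 4/3$ penalty factor.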

11 Hypothesis Testing: Test Individual Coefficients
Based on least squares, the distribution of the coefficient estimates is
$\hat{\beta} \sim N(\beta, (X^T X)^{-1} \sigma^2)$
Testing for an individual $\beta_j$
- $H_0: \beta_j = 0$; $H_1: \beta_j \neq 0$
- Use a t-test, since the true variance is unknown
$T = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}} \sim t_{n-p-1}$
where $v_j$ is the $j$th diagonal element of $(X^T X)^{-1}$
- Reject $H_0$ if p-value $< \alpha$, or equivalently if $|T| > t^{(n-p-1)}_{1-\alpha/2}$
Confidence interval: $\hat{\beta}_j \pm se(\hat{\beta}_j) \, t^{(n-p-1)}_{1-\alpha/2}$
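The test for a single coefficient can be sketched for the slope of a simple regression, where $v_j = 1 / \sum_i (x_i - \bar{x})^2$. This is an illustration under stated assumptions: the function name `slope_t_test` and the data are hypothetical, and because the standard library has no t-distribution quantile function, the critical value is passed in by the caller (2.306 is the 0.975 quantile for 8 degrees of freedom from a standard t table).

```python
import math


def slope_t_test(x, y, t_crit):
    """t statistic and confidence interval for the slope in simple regression.

    t_crit is the 1 - alpha/2 quantile of the t distribution with n - 2
    degrees of freedom, supplied by the caller (for example from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))      # n - p - 1 with p = 1
    se_b1 = sigma_hat * math.sqrt(1.0 / sxx)  # v_j = 1 / Sxx for the slope
    t_stat = b1 / se_b1
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    reject = abs(t_stat) > t_crit             # two-sided test
    return b1, se_b1, t_stat, ci, reject


# Hypothetical data with n = 10, so the test uses 8 degrees of freedom.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
b1, se_b1, t_stat, ci, reject = slope_t_test(x, y, t_crit=2.306)
print(f"slope = {b1:.4f}, se = {se_b1:.4f}, T = {t_stat:.1f}, reject H0: {reject}")
print(f"95% CI for the slope: ({ci[0]:.4f}, {ci[1]:.4f})")
```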

12 Hypothesis Testing: Test Multiple Coefficients
F-test for overall significance
- $H_0: \beta_1 = \cdots = \beta_p = 0$; $H_1$: at least one $\beta_j \neq 0$
- F statistic
$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,\, n-p-1}$
where $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares
F-test for a group of coefficients
- $H_0$: the additional coefficients are all zero (the reduced model is adequate); $H_1$: the alternative (larger) model gives a significant improvement
$F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(n - p_1 - 1)} \sim F_{p_1 - p_0,\, n-p_1-1}$
where $RSS_0$ is the residual sum of squares of the reduced model (with $p_0$ variables) and $RSS_1$ that of the larger model (with $p_1$ variables)
- An application is to test the significance of a categorical variable (dummy variables)
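Both F statistics are simple arithmetic once the sums of squares are known. The sketch below uses made-up numbers; the function names `overall_f` and `partial_f` are assumptions for illustration, and the p-value lookup is left to an F table or a statistics package.

```python
def overall_f(tss, rss, n, p):
    """F statistic for H0: beta_1 = ... = beta_p = 0."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


def partial_f(rss0, rss1, p0, p1, n):
    """F statistic comparing a reduced model (p0 variables, RSS rss0)
    with a larger nested model (p1 variables, RSS rss1)."""
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1 - 1))


# Hypothetical sums of squares for n = 25 observations.
f_all = overall_f(tss=100.0, rss=40.0, n=25, p=4)
f_group = partial_f(rss0=50.0, rss1=40.0, p0=2, p1=4, n=25)
print(f_all, f_group)  # compare with F(4, 20) and F(2, 20) critical values
```

With these numbers the overall statistic is (60/4)/(40/20) = 7.5 and the group statistic is (10/2)/(40/20) = 2.5.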

13 Bootstrap
Resampling method
A powerful tool to quantify uncertainty
- standard error
- confidence interval
Random sampling with replacement
Example:
- refit the model on each of 1000 bootstrap samples
- store all the parameter estimates
- calculate the standard error and confidence interval from the stored estimates
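The three steps of the example can be sketched as follows for the slope of a simple regression. This is one possible variant, not necessarily the one used in the lab: it resamples (x, y) pairs and reports a percentile confidence interval, and the function names, synthetic data, and seeds are assumptions.

```python
import random
import statistics


def slope(x, y):
    """Least squares slope of a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx


def bootstrap_slope(x, y, n_boot=1000, seed=0):
    """Bootstrap standard error and approximate 95% percentile interval."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # degenerate resample: slope is undefined
            continue
        estimates.append(slope(xs, [y[i] for i in idx]))
    estimates.sort()
    se = statistics.stdev(estimates)
    lower = estimates[(25 * n_boot) // 1000]       # about the 2.5th percentile
    upper = estimates[(975 * n_boot) // 1000 - 1]  # about the 97.5th percentile
    return se, (lower, upper)


# Synthetic data (assumed model: y = 2 + 0.5 x + N(0, 1) noise).
data_rng = random.Random(42)
x = [float(i) for i in range(30)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]
se, (lower, upper) = bootstrap_slope(x, y)
print(f"slope = {slope(x, y):.3f}, bootstrap se = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```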

14 Model Diagnostics
Check the assumptions on the error term
- Independent and normally distributed?
- $E(\epsilon_i) = 0$?
- $Var(\epsilon_i) = \sigma^2$ = constant?
Residual plot (figure: an ideal residual plot) [2]
[2] Source: Camm et al., Essentials of Business Analytics

15 Residual plot
Which type of assumption is violated? (figure: residual plots) [3]
[3] Source: Camm et al., Essentials of Business Analytics

16 Other Diagnostic Plots
Normal Quantile-Quantile Plot
- It plots the standardized residuals vs. theoretical quantiles
- An easy way to visually check the normality assumption
- If the residuals follow a normal distribution, the dots should lie close to the diagonal straight line
Residual-Leverage Plot
- This plot checks whether there are any influential points, i.e., points whose exclusion could substantially change the analysis
- Points that lie outside the dashed lines (Cook's distance contours) are considered influential points

17 Cross-Validation
Resampling method
Refit the models of interest
Provides a more reliable estimate of prediction error
Repeat the training-testing procedure and average all testing errors (mean squared prediction error, MSPE); hence the variance of the MSPE is reduced
Applications:
- model evaluation/comparison
- tuning parameter selection
Note: cross-validation is NOT used to build a model

18 K-Fold Cross-validation
Instead of doing training vs. testing once, we do it K times
Example with K = 5 folds:
- Use folds 2, 3, 4, 5 as training and fold 1 as testing
- Use folds 1, 3, 4, 5 as training and fold 2 as testing
- Keep doing this loop until every fold has been used once for testing
Average the 5 testing errors; that is the CV score
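The loop described above can be sketched in a few lines for a simple regression model. This is a minimal illustration, not the lab code: the function names, the random fold assignment, and the synthetic data are assumptions.

```python
import random


def fit_simple(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def k_fold_cv(x, y, k=5, seed=0):
    """Average test MSPE over k folds (the CV score)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)      # random assignment to folds
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        b0, b1 = fit_simple([x[i] for i in train_idx], [y[i] for i in train_idx])
        mspe = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mspe)
    return sum(fold_errors) / k


# Synthetic data (assumed model: y = 3 + 2 x + N(0, 1) noise).
data_rng = random.Random(7)
x = [float(i) for i in range(20)]
y = [3.0 + 2.0 * xi + data_rng.gauss(0.0, 1.0) for xi in x]
cv_score = k_fold_cv(x, y, k=5)
print(f"5-fold CV score (average MSPE): {cv_score:.3f}")
```

To compare candidate models, the same folds would be reused for each model and the one with the lowest CV score preferred.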

19 Leave-one-out Cross-validation
As the name suggests, it requires repeating the training-testing procedure $n$ times
However, for a least squares linear model, there is a shortcut that makes the cost of LOOCV the same as that of a single model fit
$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$
where $h_i$ is the $i$th diagonal element of the "hat" matrix.
In general, the estimates from LOOCV are highly correlated, hence their average can have high variance
In practice, K = 5 or 10 is recommended
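The shortcut can be verified numerically against the brute-force procedure. The sketch below does this for simple regression with an intercept, where the leverage has the closed form $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The function names and the data are hypothetical.

```python
def fit_simple(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def loocv_brute_force(x, y):
    """Refit the model n times, each time leaving one observation out."""
    n = len(x)
    total = 0.0
    for i in range(n):
        b0, b1 = fit_simple(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - b0 - b1 * x[i]) ** 2
    return total / n


def loocv_shortcut(x, y):
    """Single fit on all data, residuals rescaled by 1 - h_i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_simple(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of observation i
        total += ((yi - b0 - b1 * xi) / (1.0 - h)) ** 2
    return total / n


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]
brute = loocv_brute_force(x, y)
short = loocv_shortcut(x, y)
print(f"brute force: {brute:.6f}, shortcut: {short:.6f}")
```

The two values agree up to floating-point error, while the shortcut needs only one model fit instead of $n$.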

20 An Example

23 Case Study
Boston Housing Data
Investigate how different attributes affect median housing values in suburbs of Boston
Summary statistics
More details are provided in the lab session