Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.


The Thalesians. It is easy for philosophers to be rich if they choose. Data Analysis and Machine Learning, Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods. Ivan Zhdankin, Guest Speaker, Imperial College London, 2017.11.15

Multicollinearity
Reminder from previous lectures: the linear regression model with an intercept, in matrix form. For $n, p \in \mathbb{N}$ and $i \in \{1, \dots, n\}$,

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i,$$

or, in matrix form, $y = X\beta + \epsilon$, where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

The OLS estimator and its variance are

$$\hat\beta = (X^\top X)^{-1} X^\top y, \qquad \operatorname{Var}[\hat\beta] = \sigma^2 (X^\top X)^{-1}.$$
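As a quick illustration of these formulas, here is a minimal R sketch (simulated data and made-up coefficients, not the lecture's example) that computes $\hat\beta$ and $\operatorname{Var}[\hat\beta]$ via the normal equations:

```r
# Minimal sketch (synthetic data): OLS coefficients and their variance via the normal equations
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))       # design matrix with an intercept column
beta <- c(2, 1, -1, 0.5)                        # hypothetical true coefficients
y <- X %*% beta + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - p - 1)
var_beta_hat <- sigma2_hat * solve(t(X) %*% X)  # Var[beta_hat] = sigma^2 (X'X)^{-1}
```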

Multicollinearity
Now suppose we have correlated regressors:

Perfect collinearity: a situation in which there are two (or more) regressors $X_k, X_l$ that are perfectly collinear, i.e. there exist constants $\lambda_0, \lambda_1$ such that $X_l = \lambda_0 + \lambda_1 X_k$. In this case $\hat\beta = (X^\top X)^{-1} X^\top y$ cannot be calculated, as $X^\top X$ is not invertible (Why?).

Collinearity: a situation in which two regressors $X_k, X_l$ have a high degree of correlation with each other, but are not perfectly collinear.

Multicollinearity: a situation in which two or more regressors have a high degree of correlation, but not a perfect one.

Since $\operatorname{Var}[\hat\beta] = \sigma^2 (X^\top X)^{-1}$, multicollinearity means a large amount of uncertainty around the coefficients, which means the estimators are not precise: wide confidence intervals and less accurate hypothesis testing.
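To see why $X^\top X$ is not invertible under perfect collinearity, here is a small sketch (synthetic data, hypothetical variable names): the rank of $X^\top X$ drops below the number of columns, so inversion fails.

```r
# Minimal sketch: with a perfectly collinear column, X'X is singular
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- 3 + 2 * x1                  # X2 = lambda0 + lambda1 * X1, exactly
X  <- cbind(1, x1, x2)
qr(t(X) %*% X)$rank               # rank 2 < 3, so (X'X)^{-1} does not exist
# solve(t(X) %*% X) would fail with "system is computationally singular"
```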

Multicollinearity
Multicollinearity leads to large variance of estimators (i). Let us show that for $j \in \{1, \dots, p\}$,

$$\operatorname{Var}[\hat\beta_j] = \frac{\sigma^2}{SST_j (1 - R_j^2)},$$

where $SST_j = \sum_{i=1}^n (x_{ij} - \bar x_j)^2$ is the total sample variation in $x_j$, and $R_j^2$ is the R-squared from regressing $x_j$ on all the other regressors, including an intercept.

From the OLS minimization problem we know that, for all $j \in \{1, \dots, p\}$,

$$\sum_{i=1}^n \hat u_i = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0, \qquad \sum_{i=1}^n x_{ij} \hat u_i = \sum_{i=1}^n x_{ij} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0.$$

Let us prove the equality for $\beta_1$: write $x_{i1}$ in terms of its fitted value and residual from the regression of $x_1$ on $x_2, x_3, \dots, x_p$: $x_{i1} = \hat x_{i1} + \hat r_{i1}$. Plugging in gives $\sum_{i=1}^n (\hat x_{i1} + \hat r_{i1})(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0$. Since $\hat x_{i1}$ is just a linear function of the explanatory variables, $\sum_{i=1}^n \hat x_{i1} \hat u_i = 0$. Thus we get

$$\sum_{i=1}^n \hat r_{i1} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0.$$

Multicollinearity
Multicollinearity leads to large variance of estimators (ii). As the $\hat r_{i1}$ are residuals from a regression that includes an intercept and $x_2, \dots, x_p$, we have $\sum_{i=1}^n \hat r_{i1} = 0$ and $\sum_{i=1}^n \hat r_{i1} x_{ij} = 0$ for $j \in \{2, \dots, p\}$. Hence the previous equation reduces to

$$\sum_{i=1}^n \hat r_{i1} (y_i - \hat\beta_1 x_{i1}) = 0.$$

Substituting $y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + u_i$ and rearranging (using $\sum_{i} \hat r_{i1} x_{i1} = \sum_{i} \hat r_{i1}^2$):

$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}.$$

Given a random sample, the $u_i$ are independent and the $\hat r_{i1}$ are non-random conditional on $X$, so

$$\operatorname{Var}[\hat\beta_1 \mid X] = \frac{\sum_{i=1}^n \hat r_{i1}^2 \operatorname{Var}[u_i \mid X]}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n \hat r_{i1}^2} = \frac{\sigma^2}{SST_1 (1 - R_1^2)},$$

where the last step uses $\sum_{i=1}^n \hat r_{i1}^2 = SST_1 (1 - R_1^2)$.
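A minimal R sketch of this formula on simulated data (the variable names and the correlation level are assumptions for illustration): the squared standard error reported by lm for $\hat\beta_1$ coincides with $\hat\sigma^2 / (SST_1 (1 - R_1^2))$, and $1/(1 - R_1^2)$ is the familiar variance inflation factor.

```r
# Minimal sketch (simulated data): Var[beta_1_hat] = sigma^2 / (SST_1 * (1 - R_1^2))
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.1 * rnorm(n)                  # x2 strongly correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)

fit    <- lm(y ~ x1 + x2)
r2_1   <- summary(lm(x1 ~ x2))$r.squared         # R_1^2 from regressing x1 on the other regressor
sst_1  <- sum((x1 - mean(x1))^2)                 # total sample variation in x1
sigma2 <- summary(fit)$sigma^2                   # estimate of the error variance

sigma2 / (sst_1 * (1 - r2_1))                    # variance formula from the slide
summary(fit)$coefficients["x1", "Std. Error"]^2  # matches lm's squared standard error
1 / (1 - r2_1)                                   # variance inflation factor for x1
```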

Multicollinearity
Consequences of multicollinearity:

Good news: multicollinearity has little impact on the overall predictability of the model and on the overall $R^2$.

Bad news: Some variables may be dropped from the model even though they are important in the population. The high variance of the coefficients reduces the precision of their estimation. Multicollinearity can result in coefficients appearing to have the wrong sign. Estimates of the coefficients may be sensitive to the particular sample of data. Overfitting issues.

Bias-Variance Trade-off
Overfitting vs Underfitting. When we deal with supervised machine learning, it is important to understand the errors of predictions, which can be decomposed into two main components:

Error due to Bias: the error due to bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Bias measures how far off, in general, these models' predictions are from the correct value. High bias can cause an algorithm to miss the relevant relations between features and target outputs: underfitting issues.

Error due to Variance: the error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs: overfitting issues.

Consider the general case: assume we want to predict $Y$ using $X$, and that there is a relationship $Y = f(X) + \epsilon$, where $\epsilon \sim (0, \sigma_\epsilon^2)$. We estimate the model $f(X)$ by $\hat f(X)$ and want to understand how well $\hat f(\cdot)$ fits some future random observation $(x_0, y_0)$. If $\hat f(X)$ is a good model, then $\hat f(x_0)$ should be close to $y_0$; this is the notion of prediction error. The prediction error is estimated as $PE(x_0) = E[(Y - \hat f(x_0))^2]$.

Bias-Variance Trade-off
Bias-Variance trade-off: derivation. Let us prove the bias-variance decomposition (using $E[Y] = f(x_0)$ at $x_0$ and the independence of the new noise from $\hat f(x_0)$):

$$\begin{aligned}
PE(x_0) &= E[(Y - \hat f(x_0))^2] = E[Y^2 + \hat f(x_0)^2 - 2 Y \hat f(x_0)] \\
&= E[Y^2] + E[\hat f(x_0)^2] - E[2 Y \hat f(x_0)] \\
&= \operatorname{Var}[Y] + (E[Y])^2 + \operatorname{Var}[\hat f(x_0)] + (E[\hat f(x_0)])^2 - 2 f(x_0) E[\hat f(x_0)] \\
&= \operatorname{Var}[Y] + \operatorname{Var}[\hat f(x_0)] + (f(x_0) - E[\hat f(x_0)])^2 \\
&= \sigma_\epsilon^2 + \text{Variance} + \text{Bias}^2.
\end{aligned}$$

$\sigma_\epsilon^2$ is the irreducible error: the noise term that cannot fundamentally be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and the variance terms to 0. However, in a world with imperfect models and finite data, there is a trade-off between minimizing the bias and minimizing the variance. As the model becomes more complex (more terms included), local structure/curvature can be picked up, but the coefficient estimates suffer from higher variance as more terms are included in the model.
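The decomposition can be checked numerically. Below is a minimal Monte Carlo sketch in R (the true function, noise level and polynomial degrees are assumptions chosen for illustration) that estimates the squared bias and the variance of $\hat f(x_0)$ for polynomial fits of increasing complexity:

```r
# Minimal Monte Carlo sketch (assumed true function, noise level and degrees):
# estimate bias^2 and variance of f_hat(x0) for polynomial fits of increasing complexity
set.seed(4)
f <- function(x) sin(2 * pi * x)          # hypothetical true regression function
x0 <- 0.5; sigma_eps <- 0.3; n <- 50; reps <- 500

pred_at_x0 <- function(degree) {
  replicate(reps, {
    d   <- data.frame(x = runif(n))
    d$y <- f(d$x) + rnorm(n, sd = sigma_eps)
    fit <- lm(y ~ poly(x, degree, raw = TRUE), data = d)
    predict(fit, newdata = data.frame(x = x0))
  })
}

for (deg in c(1, 3, 10)) {
  p <- pred_at_x0(deg)
  cat("degree", deg,
      ": bias^2 =", round((mean(p) - f(x0))^2, 4),
      ", variance =", round(var(p), 4), "\n")
}
```

Typically the simple model shows large bias and small variance, while the high-degree model shows the opposite, in line with the decomposition above.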

Cross-Validation
Cross-Validation. In statistics, a method that splits the data into training and validation sets is called cross-validation. A typical split for cross-validation could be 70%/30% or 80%/20%. When working with data, given we have enough of it, we can split the data into three sets:

Training sample: used to estimate or fit a (relatively small) set of models. For example, we can take a set of linear models with different numbers of features and estimate $\hat\beta_{OLS}$ for each.

Validation sample: used to pick the model that is best, based on how well it predicts the Y-values of the validation data, in order to determine the right level of complexity (number of regressors in a linear model) or structure (linear vs non-linear).

Test sample: once the model has been selected using the previous two samples, one can re-fit it on the combined training and validation samples and do a final check on the test sample.

There are several ways to do cross-validation:

Leave-N-out cross-validation uses N observations as the validation set and the remaining observations as the training set. This is repeated over all ways to split the original sample into a validation set of N observations and a training set (How many?).

K-fold cross-validation: the original sample is randomly partitioned into K equal-sized subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged to produce a single measure of fit.

The measure of fit for cross-validation could be: MSE, the mean squared error $\frac{1}{n}\sum_{i=1}^n (\hat Y_i - Y_i)^2$; RMSE, the root mean squared error $\sqrt{\frac{1}{n}\sum_{i=1}^n (\hat Y_i - Y_i)^2}$; or MAD, the median absolute deviation $\operatorname{median}|\hat Y_i - Y_i|$.
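For concreteness, here is a minimal R sketch of K-fold cross-validation with MSE as the measure of fit (the helper name, the simulated data and the cubic polynomial are assumptions for illustration, not the lecture's code):

```r
# Minimal sketch (assumed data frame `dat` with response column y): K-fold cross-validated MSE for lm
kfold_cv_mse <- function(dat, formula, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(dat)))  # random partition into K folds
  mse <- numeric(K)
  for (k in 1:K) {
    train <- dat[folds != k, ]
    valid <- dat[folds == k, ]
    fit   <- lm(formula, data = train)               # fit on the K-1 training folds
    pred  <- predict(fit, newdata = valid)           # predict the held-out fold
    mse[k] <- mean((valid$y - pred)^2)
  }
  mean(mse)                                          # average the K fold-wise MSEs
}

# Example usage on simulated data:
set.seed(5)
dat <- data.frame(x = runif(200)); dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)
kfold_cv_mse(dat, y ~ poly(x, 3), K = 10)
```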

Cross-Validation
Training and Validation Samples. On the training sample, the MSE always goes down as the model $\hat f(X)$ becomes more complex, i.e. as it has more parameters. The more flexibility you have in fitting the model, the closer you should be able to come to a perfect fit; the extreme case is $n = p$ for least squares estimation. On the validation sample, the MSE goes down up to some point as the model complexity increases; after that point the MSE starts to increase, because the model overfits the training sample when its complexity is too high.

Overfitting occurs when a model fit to a dataset has sufficient complexity to start fitting random features of the data. The bias of the model will typically go down, but the variance will increase because the model is fitting the noise in the data.

Very important idea: the MSE will decrease but then increase on the validation sample, while it will continue to decrease on the training data.

Cross-Validation Bias-Variance: Example (i)

Cross-Validation Bias-Variance: Example (ii)

Cross-Validation Bias-Variance: Example (iii): MSE for training and validation sets; coefficients have higher variance as complexity increases.

Ridge Regression
Ridge regression: $L_2$ regularization. Regularization is a method for solving problems of overfitting, or problems with large variance. The method involves introducing an additional penalty in the form of shrinkage of the coefficient estimates. Generally, an $L_p$ penalty is used: $L_p = \left(\sum_i |\beta_i|^p\right)^{1/p}$. For regularization it is important to have the data normalized, as we do not want to punish a coefficient just because the corresponding regressor is large in magnitude.

For Ridge regression the $L_2$ norm is used:

$$\text{minimize } \sum_{i=1}^n (y_i - \beta^\top x_i)^2 \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j^2 \le t,$$

or, equivalently, in Lagrangian form,

$$\text{minimize } (y - X\beta)^\top (y - X\beta) + \lambda \beta^\top \beta.$$

Ridge Regression
Derivation of the Ridge estimator.

Theorem: $\hat\beta_{Ridge} = (X^\top X + \lambda I_p)^{-1} X^\top y$.

Proof. Write the penalized residual sum of squares as

$$RSS(\lambda) = (y - X\beta_{Ridge})^\top (y - X\beta_{Ridge}) + \lambda \beta_{Ridge}^\top \beta_{Ridge} = y^\top y - 2 y^\top X \beta_{Ridge} + \beta_{Ridge}^\top X^\top X \beta_{Ridge} + \lambda \beta_{Ridge}^\top \beta_{Ridge}.$$

Setting the derivative with respect to $\beta_{Ridge}$ to zero,

$$\frac{\partial RSS(\lambda)}{\partial \beta_{Ridge}} = -2 X^\top y + 2 X^\top X \beta_{Ridge} + 2 \lambda I_p \beta_{Ridge} = 0 \quad \Rightarrow \quad \hat\beta_{Ridge} = (X^\top X + \lambda I_p)^{-1} X^\top y.$$

Several observations on $\lambda$, the shrinkage parameter: it controls the size of the coefficients and the amount of regularization. As $\lambda \to 0$ we get $\hat\beta_{Ridge} \to \hat\beta_{OLS}$; as $\lambda \to \infty$ we get $\hat\beta_{Ridge} \to 0$.
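A minimal R sketch of this closed form (the helper name is an assumption; inputs are standardized and the response centred, in line with the normalization remark above):

```r
# Minimal sketch: ridge estimator in closed form, (X'X + lambda I)^{-1} X'y, on standardized inputs
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                 # normalize regressors before penalizing
  yc <- y - mean(y)              # centre the response (intercept handled separately)
  p  <- ncol(Xs)
  solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% yc)
}
# ridge_coef(X, y, 0) reproduces the (standardized) OLS fit; large lambda shrinks towards 0
```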

Ridge Regression
Statistical properties of the Ridge estimator. The Ridge estimator is biased: let $A = X^\top X$, then

$$\hat\beta_{Ridge} = (X^\top X + \lambda I_p)^{-1} X^\top y = (A + \lambda I_p)^{-1} A (A^{-1} X^\top y) = [A(I_p + \lambda A^{-1})]^{-1} A [(X^\top X)^{-1} X^\top y] = (I_p + \lambda A^{-1})^{-1} A^{-1} A \hat\beta_{OLS} = (I_p + \lambda A^{-1})^{-1} \hat\beta_{OLS},$$

so that

$$E[\hat\beta_{Ridge}] = (I_p + \lambda A^{-1})^{-1} \beta.$$

Now let us find the variance of $\hat\beta_{Ridge}$: let $W = (I_p + \lambda A^{-1})^{-1}$, then

$$\operatorname{Var}[\hat\beta_{Ridge}] = \operatorname{Var}[W \hat\beta_{OLS}] = W \operatorname{Var}[\hat\beta_{OLS}] W^\top = \sigma^2 W (X^\top X)^{-1} W^\top = \sigma^2 [X^\top X + \lambda I_p]^{-1} X^\top X \left([X^\top X + \lambda I_p]^{-1}\right)^\top.$$

Recall the bias-variance formula for the MSE of an estimator (the cross term vanishes because $\Theta - E[\hat\Theta]$ is non-random and $E[\hat\Theta - E[\hat\Theta]] = 0$):

$$MSE = E[(\hat\Theta - \Theta)^2] = E[((\hat\Theta - E[\hat\Theta]) - (\Theta - E[\hat\Theta]))^2] = E[(\hat\Theta - E[\hat\Theta])^2] + E[(\Theta - E[\hat\Theta])^2] - 2 E[(\hat\Theta - E[\hat\Theta])(\Theta - E[\hat\Theta])] = \operatorname{Var}[\hat\Theta] + [\operatorname{Bias}(\hat\Theta)]^2.$$

It turns out that there exists $\lambda > 0$ such that $MSE[\hat\beta_{Ridge}] < MSE[\hat\beta_{OLS}]$ [Theobald, 1974]; this is the reason why Ridge regression is useful.
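The Theobald result can be illustrated numerically. Here is a minimal simulation sketch in R (the correlation level, $\lambda$ and sample size are arbitrary assumptions): with strongly correlated regressors, the ridge estimator's mean squared error around the true $\beta$ typically falls well below that of OLS.

```r
# Minimal sketch (simulated correlated design): MSE of ridge vs OLS estimators of beta
set.seed(6)
n <- 50; p <- 2; beta <- c(1, 1); reps <- 1000; lambda <- 5
mse_ols <- mse_ridge <- numeric(reps)
for (r in 1:reps) {
  x1 <- rnorm(n); x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)   # corr(x1, x2) ~ 0.95
  X  <- cbind(x1, x2)
  y  <- X %*% beta + rnorm(n)
  b_ols   <- solve(t(X) %*% X, t(X) %*% y)
  b_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
  mse_ols[r]   <- sum((b_ols - beta)^2)
  mse_ridge[r] <- sum((b_ridge - beta)^2)
}
c(OLS = mean(mse_ols), Ridge = mean(mse_ridge))   # ridge is typically smaller here
```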

Ridge Regression
Geometric interpretation (i). Assume the inputs are centred and consider their principal components: from the singular-value decomposition $X = U D V^\top$, the matrix of principal components is $\tilde X = (\tilde X_1, \dots, \tilde X_p) = (X v_1, \dots, X v_p) = U D$. Consider

$$\hat y = X \hat\beta_{Ridge} = X (X^\top X + \lambda I_p)^{-1} X^\top y = U D (D^2 + \lambda I_p)^{-1} D U^\top y = \sum_{j=1}^p u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^\top y,$$

where the $u_j$ are the (normalized) principal components of $X$ and the $d_j$ are its singular values (How do they relate to the eigenvalues of $X^\top X$?). Ridge regression shrinks the coordinates with respect to the orthogonal basis formed by the principal components.

From the above, the contribution of the $j$-th principal direction to the Ridge fit is $\frac{d_j^2}{d_j^2 + \lambda}\, u_j^\top y$, while the variance of the OLS coefficient on the $j$-th principal component is $\frac{\sigma^2}{d_j^2}$. The shrinkage factor is $\frac{d_j^2}{d_j^2 + \lambda}$: coordinates with respect to principal components with smaller variance are shrunk more (What does this mean with respect to multicollinearity?).
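A small R sketch of the shrinkage factors $d_j^2/(d_j^2 + \lambda)$ (random data and an arbitrary $\lambda$, purely for illustration):

```r
# Minimal sketch: shrinkage factors d_j^2 / (d_j^2 + lambda) from the SVD of a centred design matrix
set.seed(7)
X  <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)
sv <- svd(X)                    # X = U D V'
lambda <- 10
shrink <- sv$d^2 / (sv$d^2 + lambda)
round(shrink, 3)                # directions with small singular values are shrunk the most
```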

Ridge Regression Geometric interpretation (ii)

Ridge Regression Ridge regression: Example (i). Suppose we want to understand the curve in the tail of losses in order to calculate Expected Shortfall. We have the following in-sample observations and want to fit them as well as possible out-of-sample:

Ridge Regression Ridge regression: Example (ii). We can fit the curve by polynomial regression with some number of powers, in this case 25.

Ridge Regression Ridge regression: Example (iii). Apply the R package glmnet to perform Ridge regression and cv.glmnet to perform 10-fold cross-validation:
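The slides show glmnet output; a hedged sketch of what such a workflow might look like is below (the data objects train and test, the raw polynomial basis of degree 25 and the variable names are assumptions, not the lecture's actual code):

```r
# Minimal sketch, assuming data frames `train` and `test` with columns x and y
library(glmnet)

x_train <- as.matrix(poly(train$x, 25, raw = TRUE))              # degree-25 polynomial features
y_train <- train$y
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)  # alpha = 0 selects the L2 (ridge) penalty

x_test <- as.matrix(poly(test$x, 25, raw = TRUE))
pred_ridge <- predict(cv_ridge, newx = x_test, s = "lambda.min") # predictions at the CV-optimal lambda
mean((test$y - pred_ridge)^2)                                    # out-of-sample MSE
```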

Ridge Regression Ridge regression: Example (iv). Ridge regression improves the out-of-sample MSE by 30% as compared to OLS regression:

Lasso Regression
Lasso regression: $L_1$ regularization. For Lasso regression the $L_1$ norm is used:

$$\text{minimize } \sum_{i=1}^n (y_i - \beta^\top x_i)^2 \quad \text{s.t.} \quad \sum_{j=1}^p |\beta_j| \le t,$$

or, equivalently,

$$\text{minimize } (y - X\beta)^\top (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j|.$$

A large enough $\lambda$ will set some of the coefficients exactly to 0, so the Lasso performs model selection; it is said to produce sparse solutions. It is difficult to solve analytically, as the cost function contains absolute values.
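In glmnet the same workflow with alpha = 1 gives the Lasso; a minimal sketch continuing the assumed setup from the ridge example above:

```r
# Continues the hypothetical x_train / y_train / x_test objects from the ridge sketch
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)  # alpha = 1 selects the L1 (lasso) penalty
coef(cv_lasso, s = "lambda.min")      # many coefficients are exactly zero: built-in feature selection
pred_lasso <- predict(cv_lasso, newx = x_test, s = "lambda.min")
mean((test$y - pred_lasso)^2)                                    # out-of-sample MSE, comparable to ridge
```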

Lasso Regression Lasso regression: Example (i). Lasso regression sets some coefficients exactly to zero for large enough $\lambda$, whereas Ridge just shrinks them:

Lasso Regression Lasso regression: Example (ii). The overall out-of-sample performance of Lasso regression is comparable to that of Ridge regression:

Conclusion
Takeaways from the lecture:

Multicollinearity increases the variance of the coefficients, making them hard to interpret, but it does not impact the overall predictability of the model (the total $R^2$).

There is a bias-variance trade-off: the model can underfit, leading to high bias, when it is too simple; when it is too complex it can overfit, leading to high variance of the prediction errors.

With an increase in complexity, the prediction errors on the training set always decrease, but on the validation set they decrease only up to some point, after which they increase.

In machine learning one needs to use cross-validation techniques, as a model is of little use if it can only predict the past.

Ridge regression penalises the coefficients via the $L_2$ norm, shrinking them; this introduces bias into the coefficient estimators but decreases their variance.

There is an optimal $\lambda$ that allows Ridge to outperform least squares in terms of the MSE of the estimator.

Lasso regression penalises the coefficients via the $L_1$ norm, shrinking them and setting some of them exactly to zero, which allows it to perform feature selection.