Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.


The Thalesians. It is easy for philosophers to be rich if they choose. Data Analysis and Machine Learning, Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods. Ivan Zhdankin, Guest Speaker, Imperial College London, 2017.11.15

Multicollinearity
Reminder from previous lectures: the linear regression model with an intercept, in matrix form. For $n, p \in \mathbb{N}$ and $i \in \{1, \dots, n\}$,

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i,$$

or, in matrix form, $y = X\beta + \epsilon$, where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

The OLS estimator and its variance are

$$\hat\beta = (X^\top X)^{-1} X^\top y, \qquad \operatorname{Var}[\hat\beta] = \sigma^2 (X^\top X)^{-1}.$$
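As a quick illustration of these formulas, here is a minimal R sketch (simulated data and made-up coefficients, not the lecture's example) that computes $\hat\beta$ and $\operatorname{Var}[\hat\beta]$ via the normal equations:

```r
# Minimal sketch (synthetic data): OLS coefficients and their variance via the normal equations
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))       # design matrix with an intercept column
beta <- c(2, 1, -1, 0.5)                        # hypothetical true coefficients
y <- X %*% beta + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - p - 1)
var_beta_hat <- sigma2_hat * solve(t(X) %*% X)  # Var[beta_hat] = sigma^2 (X'X)^{-1}
```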

Multicollinearity
Now suppose we have correlated regressors:

Perfect collinearity: a situation in which there are two (or more) regressors $X_k, X_l$ that are perfectly collinear, i.e. there exist constants $\lambda_0, \lambda_1$ such that $X_l = \lambda_0 + \lambda_1 X_k$. In this case $\hat\beta = (X^\top X)^{-1} X^\top y$ cannot be calculated, as $X^\top X$ is not invertible (Why?).

Collinearity: a situation in which two regressors $X_k, X_l$ have a high degree of correlation with each other, but are not perfectly collinear.

Multicollinearity: a situation in which two or more regressors have a high degree of correlation, but not a perfect one.

Since $\operatorname{Var}[\hat\beta] = \sigma^2 (X^\top X)^{-1}$, multicollinearity means a large amount of uncertainty around the coefficients, which means the estimators are not precise: wide confidence intervals and less accurate hypothesis testing.
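To see why $X^\top X$ is not invertible under perfect collinearity, here is a small sketch (synthetic data, hypothetical variable names): the rank of $X^\top X$ drops below the number of columns, so inversion fails.

```r
# Minimal sketch: with a perfectly collinear column, X'X is singular
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- 3 + 2 * x1                  # X2 = lambda0 + lambda1 * X1, exactly
X  <- cbind(1, x1, x2)
qr(t(X) %*% X)$rank               # rank 2 < 3, so (X'X)^{-1} does not exist
# solve(t(X) %*% X) would fail with "system is computationally singular"
```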

Multicollinearity
Multicollinearity leads to large variance of estimators (i). Let us show that for $j \in \{1, \dots, p\}$,

$$\operatorname{Var}[\hat\beta_j] = \frac{\sigma^2}{SST_j (1 - R_j^2)},$$

where $SST_j = \sum_{i=1}^n (x_{ij} - \bar x_j)^2$ is the total sample variation in $x_j$, and $R_j^2$ is the R-squared from regressing $x_j$ on all the other regressors, including an intercept.

From the OLS minimization problem we know that, for all $j \in \{1, \dots, p\}$,

$$\sum_{i=1}^n \hat u_i = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0, \qquad \sum_{i=1}^n x_{ij} \hat u_i = \sum_{i=1}^n x_{ij} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0.$$

Let us prove the equality for $\beta_1$: write $x_{i1}$ in terms of its fitted value and residual from the regression of $x_1$ on $x_2, x_3, \dots, x_p$: $x_{i1} = \hat x_{i1} + \hat r_{i1}$. Plugging in gives $\sum_{i=1}^n (\hat x_{i1} + \hat r_{i1})(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0$. Since $\hat x_{i1}$ is just a linear function of the explanatory variables, $\sum_{i=1}^n \hat x_{i1} \hat u_i = 0$. Thus we get

$$\sum_{i=1}^n \hat r_{i1} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip}) = 0.$$

Multicollinearity
Multicollinearity leads to large variance of estimators (ii). As the $\hat r_{i1}$ are residuals from a regression that includes an intercept and $x_2, \dots, x_p$, we have $\sum_{i=1}^n \hat r_{i1} = 0$ and $\sum_{i=1}^n \hat r_{i1} x_{ij} = 0$ for $j \in \{2, \dots, p\}$. Hence the previous equation reduces to

$$\sum_{i=1}^n \hat r_{i1} (y_i - \hat\beta_1 x_{i1}) = 0.$$

Substituting $y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + u_i$ and rearranging (using $\sum_{i} \hat r_{i1} x_{i1} = \sum_{i} \hat r_{i1}^2$):

$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n \hat r_{i1} u_i}{\sum_{i=1}^n \hat r_{i1}^2}.$$

Given a random sample, the $u_i$ are independent and the $\hat r_{i1}$ are non-random conditional on $X$, so

$$\operatorname{Var}[\hat\beta_1 \mid X] = \frac{\sum_{i=1}^n \hat r_{i1}^2 \operatorname{Var}[u_i \mid X]}{\left(\sum_{i=1}^n \hat r_{i1}^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n \hat r_{i1}^2} = \frac{\sigma^2}{SST_1 (1 - R_1^2)},$$

where the last step uses $\sum_{i=1}^n \hat r_{i1}^2 = SST_1 (1 - R_1^2)$.
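A minimal R sketch of this formula on simulated data (the variable names and the correlation level are assumptions for illustration): the squared standard error reported by lm for $\hat\beta_1$ coincides with $\hat\sigma^2 / (SST_1 (1 - R_1^2))$, and $1/(1 - R_1^2)$ is the familiar variance inflation factor.

```r
# Minimal sketch (simulated data): Var[beta_1_hat] = sigma^2 / (SST_1 * (1 - R_1^2))
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.1 * rnorm(n)                  # x2 strongly correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)

fit    <- lm(y ~ x1 + x2)
r2_1   <- summary(lm(x1 ~ x2))$r.squared         # R_1^2 from regressing x1 on the other regressor
sst_1  <- sum((x1 - mean(x1))^2)                 # total sample variation in x1
sigma2 <- summary(fit)$sigma^2                   # estimate of the error variance

sigma2 / (sst_1 * (1 - r2_1))                    # variance formula from the slide
summary(fit)$coefficients["x1", "Std. Error"]^2  # matches lm's squared standard error
1 / (1 - r2_1)                                   # variance inflation factor for x1
```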

Multicollinearity
Consequences of multicollinearity:

Good news: multicollinearity has little impact on the overall predictability of the model and on the overall $R^2$.

Bad news: Some variables may be dropped from the model even though they are important in the population. The high variance of the coefficients reduces the precision of their estimation. Multicollinearity can result in coefficients appearing to have the wrong sign. Estimates of the coefficients may be sensitive to the particular sample of data. Overfitting issues.

Bias-Variance Trade-off
Overfitting vs Underfitting. When we deal with supervised machine learning, it is important to understand the errors of predictions, which can be decomposed into two main components:

Error due to Bias: the error due to bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Bias measures how far off, in general, these models' predictions are from the correct value. High bias can cause an algorithm to miss the relevant relations between features and target outputs: underfitting issues.

Error due to Variance: the error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs: overfitting issues.

Consider the general case: assume we want to predict $Y$ using $X$, and that there is a relationship $Y = f(X) + \epsilon$, where $\epsilon \sim (0, \sigma_\epsilon^2)$. We estimate the model $f(X)$ by $\hat f(X)$ and want to understand how well $\hat f(\cdot)$ fits some future random observation $(x_0, y_0)$. If $\hat f(X)$ is a good model, then $\hat f(x_0)$ should be close to $y_0$; this is the notion of prediction error. The prediction error is estimated as $PE(x_0) = E[(Y - \hat f(x_0))^2]$.

Bias-Variance Trade-off
Bias-Variance trade-off: derivation. Let us prove the bias-variance decomposition (using $E[Y] = f(x_0)$ at $x_0$ and the independence of the new noise from $\hat f(x_0)$):

$$\begin{aligned}
PE(x_0) &= E[(Y - \hat f(x_0))^2] = E[Y^2 + \hat f(x_0)^2 - 2 Y \hat f(x_0)] \\
&= E[Y^2] + E[\hat f(x_0)^2] - E[2 Y \hat f(x_0)] \\
&= \operatorname{Var}[Y] + (E[Y])^2 + \operatorname{Var}[\hat f(x_0)] + (E[\hat f(x_0)])^2 - 2 f(x_0) E[\hat f(x_0)] \\
&= \operatorname{Var}[Y] + \operatorname{Var}[\hat f(x_0)] + (f(x_0) - E[\hat f(x_0)])^2 \\
&= \sigma_\epsilon^2 + \text{Variance} + \text{Bias}^2.
\end{aligned}$$

$\sigma_\epsilon^2$ is the irreducible error: the noise term that cannot fundamentally be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and the variance terms to 0. However, in a world with imperfect models and finite data, there is a trade-off between minimizing the bias and minimizing the variance. As the model becomes more complex (more terms included), local structure/curvature can be picked up, but the coefficient estimates suffer from higher variance as more terms are included in the model.
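The decomposition can be checked numerically. Below is a minimal Monte Carlo sketch in R (the true function, noise level and polynomial degrees are assumptions chosen for illustration) that estimates the squared bias and the variance of $\hat f(x_0)$ for polynomial fits of increasing complexity:

```r
# Minimal Monte Carlo sketch (assumed true function, noise level and degrees):
# estimate bias^2 and variance of f_hat(x0) for polynomial fits of increasing complexity
set.seed(4)
f <- function(x) sin(2 * pi * x)          # hypothetical true regression function
x0 <- 0.5; sigma_eps <- 0.3; n <- 50; reps <- 500

pred_at_x0 <- function(degree) {
  replicate(reps, {
    d   <- data.frame(x = runif(n))
    d$y <- f(d$x) + rnorm(n, sd = sigma_eps)
    fit <- lm(y ~ poly(x, degree, raw = TRUE), data = d)
    predict(fit, newdata = data.frame(x = x0))
  })
}

for (deg in c(1, 3, 10)) {
  p <- pred_at_x0(deg)
  cat("degree", deg,
      ": bias^2 =", round((mean(p) - f(x0))^2, 4),
      ", variance =", round(var(p), 4), "\n")
}
```

Typically the simple model shows large bias and small variance, while the high-degree model shows the opposite, in line with the decomposition above.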

Cross-Validation
Cross-Validation. In statistics, a method that splits the data into training and validation sets is called cross-validation. A typical split for cross-validation could be 70%/30% or 80%/20%. When working with data, given we have enough of it, we can split the data into three sets:

Training sample: used to estimate or fit a (relatively small) set of models. For example, we can take a set of linear models with different numbers of features and estimate $\hat\beta_{OLS}$ for each.

Validation sample: used to pick the model that is best, based on how well it predicts the Y-values of the validation data, in order to determine the right level of complexity (number of regressors in a linear model) or structure (linear vs non-linear).

Test sample: once the model has been selected using the previous two samples, one can re-fit it on the combined training and validation samples and do a final check on the test sample.

There are several ways to do cross-validation:

Leave-N-out cross-validation uses N observations as the validation set and the remaining observations as the training set. This is repeated over all ways to split the original sample into a validation set of N observations and a training set (How many?).

K-fold cross-validation: the original sample is randomly partitioned into K equal-sized subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged to produce a single measure of fit.

The measure of fit for cross-validation could be: MSE, the mean squared error $\frac{1}{n}\sum_{i=1}^n (\hat Y_i - Y_i)^2$; RMSE, the root mean squared error $\sqrt{\frac{1}{n}\sum_{i=1}^n (\hat Y_i - Y_i)^2}$; or MAD, the median absolute deviation $\operatorname{median}|\hat Y_i - Y_i|$.
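For concreteness, here is a minimal R sketch of K-fold cross-validation with MSE as the measure of fit (the helper name, the simulated data and the cubic polynomial are assumptions for illustration, not the lecture's code):

```r
# Minimal sketch (assumed data frame `dat` with response column y): K-fold cross-validated MSE for lm
kfold_cv_mse <- function(dat, formula, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(dat)))  # random partition into K folds
  mse <- numeric(K)
  for (k in 1:K) {
    train <- dat[folds != k, ]
    valid <- dat[folds == k, ]
    fit   <- lm(formula, data = train)               # fit on the K-1 training folds
    pred  <- predict(fit, newdata = valid)           # predict the held-out fold
    mse[k] <- mean((valid$y - pred)^2)
  }
  mean(mse)                                          # average the K fold-wise MSEs
}

# Example usage on simulated data:
set.seed(5)
dat <- data.frame(x = runif(200)); dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)
kfold_cv_mse(dat, y ~ poly(x, 3), K = 10)
```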

Cross-Validation
Training and Validation Samples. On the training sample, the MSE always goes down as the model $\hat f(X)$ becomes more complex, i.e. as it has more parameters. The more flexibility you have in fitting the model, the closer you should be able to come to a perfect fit; the extreme case is $n = p$ for least squares estimation. On the validation sample, the MSE goes down up to some point as the model complexity increases; after that point the MSE starts to increase, because the model overfits the training sample when its complexity is too high.

Overfitting occurs when a model fit to a dataset has sufficient complexity to start fitting random features of the data. The bias of the model will typically go down, but the variance will increase because the model is fitting the noise in the data.

Very important idea: the MSE will decrease but then increase on the validation sample, while it will continue to decrease on the training data.

Cross-Validation Bias-Variance: Example (i)

Cross-Validation Bias-Variance: Example (ii)

Cross-Validation Bias-Variance: Example (iii): MSE for training and validation sets; coefficients have higher variance as complexity increases.

Ridge Regression
Ridge regression: $L_2$ regularization. Regularization is a method for solving problems of overfitting, or problems with large variance. The method involves introducing an additional penalty in the form of shrinkage of the coefficient estimates. Generally, an $L_p$ penalty is used: $L_p = \left(\sum_i |\beta_i|^p\right)^{1/p}$. For regularization it is important to have the data normalized, as we do not want to punish a coefficient just because the corresponding regressor is large in magnitude.

For Ridge regression the $L_2$ norm is used:

$$\text{minimize } \sum_{i=1}^n (y_i - \beta^\top x_i)^2 \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j^2 \le t,$$

or, equivalently, in Lagrangian form,

$$\text{minimize } (y - X\beta)^\top (y - X\beta) + \lambda \beta^\top \beta.$$

Ridge Regression
Derivation of the Ridge estimator.

Theorem: $\hat\beta_{Ridge} = (X^\top X + \lambda I_p)^{-1} X^\top y$.

Proof. Write the penalized residual sum of squares as

$$RSS(\lambda) = (y - X\beta_{Ridge})^\top (y - X\beta_{Ridge}) + \lambda \beta_{Ridge}^\top \beta_{Ridge} = y^\top y - 2 y^\top X \beta_{Ridge} + \beta_{Ridge}^\top X^\top X \beta_{Ridge} + \lambda \beta_{Ridge}^\top \beta_{Ridge}.$$

Setting the derivative with respect to $\beta_{Ridge}$ to zero,

$$\frac{\partial RSS(\lambda)}{\partial \beta_{Ridge}} = -2 X^\top y + 2 X^\top X \beta_{Ridge} + 2 \lambda I_p \beta_{Ridge} = 0 \quad \Rightarrow \quad \hat\beta_{Ridge} = (X^\top X + \lambda I_p)^{-1} X^\top y.$$

Several observations on $\lambda$, the shrinkage parameter: it controls the size of the coefficients and the amount of regularization. As $\lambda \to 0$ we get $\hat\beta_{Ridge} \to \hat\beta_{OLS}$; as $\lambda \to \infty$ we get $\hat\beta_{Ridge} \to 0$.
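A minimal R sketch of this closed form (the helper name is an assumption; inputs are standardized and the response centred, in line with the normalization remark above):

```r
# Minimal sketch: ridge estimator in closed form, (X'X + lambda I)^{-1} X'y, on standardized inputs
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                 # normalize regressors before penalizing
  yc <- y - mean(y)              # centre the response (intercept handled separately)
  p  <- ncol(Xs)
  solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% yc)
}
# ridge_coef(X, y, 0) reproduces the (standardized) OLS fit; large lambda shrinks towards 0
```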

Ridge Regression
Statistical properties of the Ridge estimator. The Ridge estimator is biased: let $A = X^\top X$, then

$$\hat\beta_{Ridge} = (X^\top X + \lambda I_p)^{-1} X^\top y = (A + \lambda I_p)^{-1} A (A^{-1} X^\top y) = [A(I_p + \lambda A^{-1})]^{-1} A [(X^\top X)^{-1} X^\top y] = (I_p + \lambda A^{-1})^{-1} A^{-1} A \hat\beta_{OLS} = (I_p + \lambda A^{-1})^{-1} \hat\beta_{OLS},$$

so that

$$E[\hat\beta_{Ridge}] = (I_p + \lambda A^{-1})^{-1} \beta.$$

Now let us find the variance of $\hat\beta_{Ridge}$: let $W = (I_p + \lambda A^{-1})^{-1}$, then

$$\operatorname{Var}[\hat\beta_{Ridge}] = \operatorname{Var}[W \hat\beta_{OLS}] = W \operatorname{Var}[\hat\beta_{OLS}] W^\top = \sigma^2 W (X^\top X)^{-1} W^\top = \sigma^2 [X^\top X + \lambda I_p]^{-1} X^\top X \left([X^\top X + \lambda I_p]^{-1}\right)^\top.$$

Recall the bias-variance formula for the MSE of an estimator (the cross term vanishes because $\Theta - E[\hat\Theta]$ is non-random and $E[\hat\Theta - E[\hat\Theta]] = 0$):

$$MSE = E[(\hat\Theta - \Theta)^2] = E[((\hat\Theta - E[\hat\Theta]) - (\Theta - E[\hat\Theta]))^2] = E[(\hat\Theta - E[\hat\Theta])^2] + E[(\Theta - E[\hat\Theta])^2] - 2 E[(\hat\Theta - E[\hat\Theta])(\Theta - E[\hat\Theta])] = \operatorname{Var}[\hat\Theta] + [\operatorname{Bias}(\hat\Theta)]^2.$$

It turns out that there exists $\lambda > 0$ such that $MSE[\hat\beta_{Ridge}] < MSE[\hat\beta_{OLS}]$ [Theobald, 1974]; this is the reason why Ridge regression is useful.
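The Theobald result can be illustrated numerically. Here is a minimal simulation sketch in R (the correlation level, $\lambda$ and sample size are arbitrary assumptions): with strongly correlated regressors, the ridge estimator's mean squared error around the true $\beta$ typically falls well below that of OLS.

```r
# Minimal sketch (simulated correlated design): MSE of ridge vs OLS estimators of beta
set.seed(6)
n <- 50; p <- 2; beta <- c(1, 1); reps <- 1000; lambda <- 5
mse_ols <- mse_ridge <- numeric(reps)
for (r in 1:reps) {
  x1 <- rnorm(n); x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)   # corr(x1, x2) ~ 0.95
  X  <- cbind(x1, x2)
  y  <- X %*% beta + rnorm(n)
  b_ols   <- solve(t(X) %*% X, t(X) %*% y)
  b_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
  mse_ols[r]   <- sum((b_ols - beta)^2)
  mse_ridge[r] <- sum((b_ridge - beta)^2)
}
c(OLS = mean(mse_ols), Ridge = mean(mse_ridge))   # ridge is typically smaller here
```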

Ridge Regression
Geometric interpretation (i). Assume the inputs are centred and consider their principal components: from the singular-value decomposition $X = U D V^\top$, the matrix of principal components is $\tilde X = (\tilde X_1, \dots, \tilde X_p) = (X v_1, \dots, X v_p) = U D$. Consider

$$\hat y = X \hat\beta_{Ridge} = X (X^\top X + \lambda I_p)^{-1} X^\top y = U D (D^2 + \lambda I_p)^{-1} D U^\top y = \sum_{j=1}^p u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^\top y,$$

where the $u_j$ are the (normalized) principal components of $X$ and the $d_j$ are its singular values (How do they relate to the eigenvalues of $X^\top X$?). Ridge regression shrinks the coordinates with respect to the orthogonal basis formed by the principal components.

From the above, the contribution of the $j$-th principal direction to the Ridge fit is $\frac{d_j^2}{d_j^2 + \lambda}\, u_j^\top y$, while the variance of the OLS coefficient on the $j$-th principal component is $\frac{\sigma^2}{d_j^2}$. The shrinkage factor is $\frac{d_j^2}{d_j^2 + \lambda}$: coordinates with respect to principal components with smaller variance are shrunk more (What does this mean with respect to multicollinearity?).
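A small R sketch of the shrinkage factors $d_j^2/(d_j^2 + \lambda)$ (random data and an arbitrary $\lambda$, purely for illustration):

```r
# Minimal sketch: shrinkage factors d_j^2 / (d_j^2 + lambda) from the SVD of a centred design matrix
set.seed(7)
X  <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)
sv <- svd(X)                    # X = U D V'
lambda <- 10
shrink <- sv$d^2 / (sv$d^2 + lambda)
round(shrink, 3)                # directions with small singular values are shrunk the most
```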

Ridge Regression Geometric interpretation (ii)

Ridge Regression Ridge regression: Example (i). Suppose we want to understand the curve in the tail of losses in order to calculate Expected Shortfall. We have the following in-sample observations and want to fit them as well as possible out-of-sample:

Ridge Regression Ridge regression: Example (ii). We can fit the curve by polynomial regression with some number of powers, in this case 25.

Ridge Regression Ridge regression: Example (iii). Apply the R package glmnet to perform Ridge regression and cv.glmnet to perform 10-fold cross-validation:
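The slides show glmnet output; a hedged sketch of what such a workflow might look like is below (the data objects train and test, the raw polynomial basis of degree 25 and the variable names are assumptions, not the lecture's actual code):

```r
# Minimal sketch, assuming data frames `train` and `test` with columns x and y
library(glmnet)

x_train <- as.matrix(poly(train$x, 25, raw = TRUE))              # degree-25 polynomial features
y_train <- train$y
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)  # alpha = 0 selects the L2 (ridge) penalty

x_test <- as.matrix(poly(test$x, 25, raw = TRUE))
pred_ridge <- predict(cv_ridge, newx = x_test, s = "lambda.min") # predictions at the CV-optimal lambda
mean((test$y - pred_ridge)^2)                                    # out-of-sample MSE
```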

Ridge Regression Ridge regression: Example (iv). Ridge regression improves the out-of-sample MSE by 30% as compared to OLS regression:

Lasso Regression
Lasso regression: $L_1$ regularization. For Lasso regression the $L_1$ norm is used:

$$\text{minimize } \sum_{i=1}^n (y_i - \beta^\top x_i)^2 \quad \text{s.t.} \quad \sum_{j=1}^p |\beta_j| \le t,$$

or, equivalently,

$$\text{minimize } (y - X\beta)^\top (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j|.$$

A large enough $\lambda$ will set some of the coefficients exactly to 0, so the Lasso performs model selection; it is said to produce sparse solutions. It is difficult to solve analytically, as the cost function contains absolute values.
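In glmnet the same workflow with alpha = 1 gives the Lasso; a minimal sketch continuing the assumed setup from the ridge example above:

```r
# Continues the hypothetical x_train / y_train / x_test objects from the ridge sketch
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)  # alpha = 1 selects the L1 (lasso) penalty
coef(cv_lasso, s = "lambda.min")      # many coefficients are exactly zero: built-in feature selection
pred_lasso <- predict(cv_lasso, newx = x_test, s = "lambda.min")
mean((test$y - pred_lasso)^2)                                    # out-of-sample MSE, comparable to ridge
```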

Lasso Regression Lasso regression: Example (i). Lasso regression sets some coefficients exactly to zero for large enough $\lambda$, whereas Ridge just shrinks them:

Lasso Regression Lasso regression: Example (ii). The overall out-of-sample performance of Lasso regression is comparable to that of Ridge regression:

Conclusion
Takeaways from the lecture:

Multicollinearity increases the variance of the coefficients, making them hard to interpret, but it does not impact the overall predictability of the model (the total $R^2$).

There is a bias-variance trade-off: the model can underfit, leading to high bias, when it is too simple; when it is too complex it can overfit, leading to high variance of the prediction errors.

With an increase in complexity, the prediction errors on the training set always decrease, but on the validation set they decrease only up to some point, after which they increase.

In machine learning one needs to use cross-validation techniques, as a model is of little use if it can only predict the past.

Ridge regression penalises the coefficients via the $L_2$ norm, shrinking them; this introduces bias into the coefficient estimators but decreases their variance.

There is an optimal $\lambda$ that allows Ridge to outperform least squares in terms of the MSE of the estimator.

Lasso regression penalises the coefficients via the $L_1$ norm, shrinking them and setting some of them exactly to zero, which allows it to perform feature selection.