STK-I4300 Statistical Learning Methods in Data Science
Riccardo De Bin
debin@math.uio.no

Outline of the lecture:
- Model Assessment and Selection: Cross-Validation; Bootstrap Methods
- Methods using Derived Input Directions: Principal Component Regression; Partial Least Squares
- Shrinkage Methods: Ridge Regression

Cross-Validation: k-fold cross-validation

Cross-validation aims at estimating the expected test error, Err = E[L(Y, \hat{f}(X))].
- With enough data, we could split them into a training set and a test set;
- since this is usually not the case, we mimic this split using the limited amount of data we have:
  - split the data into K folds F_1, ..., F_K of approximately the same size;
  - use, in turn, K - 1 folds to train the model (derive \hat{f}^{-k}(x));
  - evaluate the model on the remaining fold,
    CV(\hat{f}^{-k}) = (1/|F_k|) \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i));
  - estimate the expected test error as an average over the folds,
    CV(\hat{f}) = (1/K) \sum_{k=1}^{K} (1/|F_k|) \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i)).

Cross-Validation: k-fold cross-validation
(figure from http://qingkaikong.blogspot.com/2017/02/machine-learning-9-more-on-artificial.html)
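A minimal sketch of the K-fold procedure above, under squared-error loss; the simulated data, the linear model and the fold assignment are illustrative assumptions, not material from the lecture.

```r
## K-fold cross-validation sketch (illustrative data and model).
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)                      # assumed toy regression data
dat <- data.frame(x = x, y = y)

K <- 5
folds <- sample(rep(1:K, length.out = n))      # random fold assignment

cv_k <- sapply(1:K, function(k) {
  train <- folds != k
  fit <- lm(y ~ x, data = dat[train, ])        # f^{-k}, fitted without fold k
  pred <- predict(fit, newdata = dat[!train, ])
  mean((dat$y[!train] - pred)^2)               # CV(f^{-k}) on the held-out fold
})
mean(cv_k)                                     # CV(f): average over the K folds
```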
Cross-Validation: choice of K

How to choose K? There is no clear-cut answer.
- Bias-variance trade-off:
  - the smaller K, the smaller the variance (but the larger the bias);
  - the larger K, the smaller the bias (but the larger the variance);
- extreme cases:
  - K = 2: half of the observations for training, half for testing;
  - K = N: leave-one-out cross-validation (LOOCV);
- LOOCV estimates the expected test error approximately unbiasedly;
- LOOCV has very large variance (the training sets are very similar to one another);
- usual choices are K = 5 and K = 10.

Cross-Validation: further aspects

If we want to select a tuning parameter \alpha (e.g., the number of neighbours):
- train \hat{f}^{-k}(x, \alpha) for each \alpha;
- compute CV(\hat{f}, \alpha) = (1/K) \sum_{k=1}^{K} (1/|F_k|) \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i, \alpha));
- obtain \hat{\alpha} = argmin_\alpha CV(\hat{f}, \alpha).

The generalized cross-validation (GCV),
  GCV(\hat{f}) = (1/N) \sum_{i=1}^{N} [ (y_i - \hat{f}(x_i)) / (1 - trace(S)/N) ]^2,
- is a convenient approximation of LOOCV for linear fitting under squared-error loss;
- has computational advantages.

Cross-Validation: the wrong and the right way to do cross-validation

Consider the following procedure:
1. find a subset of good (= most correlated with the outcome) predictors;
2. use the selected predictors to build a classifier;
3. use cross-validation to compute the prediction error.

Practical example (see R file):
- generate X, an [N = 50] x [p = 5000] data matrix;
- generate y_i, i = 1, ..., 50, y_i \in {0, 1}, independently of X;
- the true test error is therefore 0.50;
- implement the procedure above (a sketch is given below). What happens?
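A minimal sketch of the practical example (an illustrative reconstruction, not the lecture's R file, using a simple linear classifier): screening the predictors on the full data before cross-validating leaks test information, while redoing the screening inside each fold does not.

```r
## Wrong vs. right cross-validation with predictor screening (illustrative).
set.seed(1)
n <- 50; p <- 5000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                      # independent of X: true error 0.50
folds <- sample(rep(1:5, length.out = n))

## WRONG: select the 20 most correlated predictors on ALL data, then cross-validate.
keep <- order(abs(cor(X, y)), decreasing = TRUE)[1:20]
wrong <- sapply(1:5, function(k) {
  tr  <- folds != k
  dat <- data.frame(y = y, X[, keep])
  fit <- lm(y ~ ., data = dat[tr, ])
  mean((predict(fit, dat[!tr, ]) > 0.5) != y[!tr])
})
mean(wrong)   # typically well below 0.5: the selection step saw the test folds

## RIGHT: redo the selection using only the K - 1 training folds.
right <- sapply(1:5, function(k) {
  tr     <- folds != k
  keep_k <- order(abs(cor(X[tr, ], y[tr])), decreasing = TRUE)[1:20]
  dat    <- data.frame(y = y, X[, keep_k])
  fit    <- lm(y ~ ., data = dat[tr, ])
  mean((predict(fit, dat[!tr, ]) > 0.5) != y[!tr])
})
mean(right)   # close to the true error 0.5
```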
Cross-Validation: the wrong and the right way to do cross-validation

Why is it not correct? Training and test sets are NOT independent: the observations in the test set are used twice (first to select the predictors, then to evaluate the classifier).

Correct way to proceed:
- divide the sample into K folds;
- perform both the variable selection and the construction of the classifier using only the observations from K - 1 folds (the possible choice of a tuning parameter included);
- compute the prediction error on the remaining fold.

Bootstrap Methods: bootstrap

IDEA: generate pseudo-samples from the empirical distribution function computed on the original sample:
- by sampling with replacement from the original dataset;
- this mimics new experiments.

Suppose Z = {z_1 = (x_1, y_1), ..., z_N = (x_N, y_N)} is the training set:
- by sampling with replacement, obtain Z^{*1} = {z^{*1}_1, ..., z^{*1}_N};
- ...
- by sampling with replacement, obtain Z^{*B} = {z^{*B}_1, ..., z^{*B}_N};
- use the B bootstrap samples Z^{*1}, ..., Z^{*B} to estimate any aspect of the distribution of a map S(Z).

Bootstrap Methods: bootstrap

For example, to estimate the variance of S(Z),
  \hat{Var}[S(Z)] = (1/(B - 1)) \sum_{b=1}^{B} (S(Z^{*b}) - \bar{S}^*)^2,
where \bar{S}^* = (1/B) \sum_{b=1}^{B} S(Z^{*b}).

Note that \hat{Var}[S(Z)] is the Monte Carlo estimate of Var[S(Z)] under sampling from the empirical distribution \hat{F}.
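A minimal sketch of the variance formula above; the statistic S(Z) (the sample median) and the simulated sample are illustrative assumptions.

```r
## Bootstrap estimate of Var[S(Z)], with S(Z) = sample median (illustrative).
set.seed(1)
z <- rexp(100)                                             # original sample, n = 100
B <- 1000
S_star <- replicate(B, median(sample(z, replace = TRUE)))  # S(Z^{*b}), b = 1, ..., B
var(S_star)    # (1/(B - 1)) * sum((S(Z^{*b}) - mean(S_star))^2)
```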
Bootstrap Methods: estimate prediction error

Very simple:
- generate B bootstrap samples Z^{*1}, ..., Z^{*B};
- apply the prediction rule to each bootstrap sample to derive the predictions \hat{f}^{*b}(x_i), b = 1, ..., B;
- compute the error for each point and take the average,
  \hat{Err}_{boot} = (1/B) \sum_{b=1}^{B} (1/N) \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)).

Is it correct? NO!!! Again, training and test sets are NOT independent!

Bootstrap Methods: example

Consider a classification problem:
- two classes with the same number of observations;
- predictors and class label independent => Err = 0.5.

Using the 1-nearest neighbour:
- if y_i \in Z^{*b} -> \hat{Err} = 0;
- if y_i \notin Z^{*b} -> \hat{Err} = 0.5.

Therefore,
  \hat{Err}_{boot} = 0 \times Pr[Y_i \in Z^{*b}] + 0.5 \times Pr[Y_i \notin Z^{*b}] = 0.5 \times 0.368 = 0.184.

Bootstrap Methods: why 0.368

Pr[observation i does not belong to the bootstrap sample b] = (1 - 1/N)^N.
Since Pr[Z^{*b}(s) \neq y_i] = 1 - 1/N holds for each of the N positions s, then
  Pr[Y_i \notin Z^{*b}] = (1 - 1/N)^N -> e^{-1} \approx 0.368 as N -> \infty.
Consequently, Pr[observation i is in the bootstrap sample b] \approx 0.632.

Bootstrap Methods: correct estimate of the prediction error

Note:
- each bootstrap sample has N observations;
- some of the original observations are included more than once;
- some of them (on average, 0.368 N) are not included at all;
- these are not used to compute the predictions;
- they can therefore be used as a test set,
  \hat{Err}^{(1)} = (1/N) \sum_{i=1}^{N} (1/|C^{-i}|) \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i)),
  where C^{-i} is the set of indices of the bootstrap samples which do not contain observation i, and |C^{-i}| denotes its cardinality.
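A minimal sketch of the leave-one-out bootstrap estimate \hat{Err}^{(1)}; the data, the prediction rule (a simple linear model) and squared-error loss are illustrative assumptions, in place of the classification example above.

```r
## Leave-one-out bootstrap error Err^(1) (illustrative regression example).
set.seed(1)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
B <- 200
loss <- matrix(NA, n, B)              # loss[i, b] filled only when i is not in sample b
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)    # indices of the b-th bootstrap sample
  fit <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  out <- setdiff(1:n, idx)            # observations not drawn into sample b
  loss[out, b] <- (y[out] - predict(fit, data.frame(x = x[out])))^2
}
## average over the bootstrap samples not containing i (the set C^{-i}), then over i
mean(rowMeans(loss, na.rm = TRUE))
```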
Bootstrap Methods: 0.632 bootstrap

Issue:
- the average number of distinct observations in a bootstrap sample is 0.632 N -> not so far from the 0.5 N of 2-fold CV;
- similar bias issues to those of 2-fold CV;
- \hat{Err}^{(1)} slightly overestimates the prediction error.

To solve this, the 0.632 bootstrap estimator has been developed,
  \hat{Err}^{(0.632)} = 0.368 \bar{err} + 0.632 \hat{Err}^{(1)},
where \bar{err} is the training error.
- In practice it works well;
- in case of strong overfitting, it can break down:
  - consider again the previous classification example;
  - with the 1-nearest neighbour, \bar{err} = 0;
  - \hat{Err}^{(0.632)} = 0.632 \hat{Err}^{(1)} = 0.632 \times 0.5 = 0.316, while the true error is 0.5.

Bootstrap Methods: 0.632+ bootstrap

Further improvement, the 0.632+ bootstrap:
- based on the no-information error rate \gamma;
- \gamma takes into account the amount of overfitting;
- \gamma is the error rate we would incur if predictors and response were independent;
- it is estimated by considering all combinations of x_{i'} and y_i,
  \hat{\gamma} = (1/N^2) \sum_{i=1}^{N} \sum_{i'=1}^{N} L(y_i, \hat{f}(x_{i'})).

Bootstrap Methods: 0.632+ bootstrap

The quantity \hat{\gamma} is used to estimate the relative overfitting rate,
  \hat{R} = (\hat{Err}^{(1)} - \bar{err}) / (\hat{\gamma} - \bar{err}),
which is then used in the 0.632+ bootstrap estimator,
  \hat{Err}^{(0.632+)} = (1 - \hat{w}) \bar{err} + \hat{w} \hat{Err}^{(1)}, where \hat{w} = 0.632 / (1 - 0.368 \hat{R}).
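A minimal sketch (assumed helper functions, not from the lecture) of how the training error, \hat{Err}^{(1)} and the no-information rate combine, checked on the overfitting 1-nearest-neighbour example above.

```r
## 0.632 and 0.632+ bootstrap estimators from their ingredients (illustrative helpers).
err632 <- function(err_bar, err1) 0.368 * err_bar + 0.632 * err1

err632plus <- function(err_bar, err1, gamma_hat) {
  R <- (err1 - err_bar) / (gamma_hat - err_bar)   # relative overfitting rate
  w <- 0.632 / (1 - 0.368 * R)
  (1 - w) * err_bar + w * err1
}

## 1-NN example from the slides: err_bar = 0, Err^(1) = 0.5, gamma = 0.5
err632(0, 0.5)            # 0.316, too optimistic
err632plus(0, 0.5, 0.5)   # 0.5, matching the true error
```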
Methods using Derived Input Directions: summary

- Principal Components Regression
- Partial Least Squares

Principal Component Regression: singular value decomposition

Consider the singular value decomposition (SVD) of the N \times p (standardized) input matrix X,
  X = U D V^T,
where:
- U is the N \times p orthogonal matrix whose columns span the column space of X;
- D is a p \times p diagonal matrix, whose diagonal entries d_1 \geq d_2 \geq ... \geq d_p \geq 0 are the singular values of X;
- V is the p \times p orthogonal matrix whose columns span the row space of X.

Principal Component Regression: principal components

Simple algebra leads to
  X^T X = V D^2 V^T,
the eigen decomposition of X^T X (and, up to a constant, of the sample covariance matrix S = X^T X / N).

Using the eigenvectors v_j (the columns of V), we can define the principal components of X, z_j = X v_j:
- the first principal component z_1 has the largest sample variance among all normalized linear combinations of the columns of X;
- Var(z_1) = Var(X v_1) = d_1^2 / N;
- since d_1 \geq ... \geq d_p \geq 0, it follows that Var(z_1) \geq ... \geq Var(z_p).

Principal Component Regression: principal components

Principal component regression (PCR):
- use M \leq p principal components as inputs;
- regress y on z_1, ..., z_M;
- since the principal components are orthogonal,
  \hat{y}_{pcr}(M) = \bar{y} 1 + \sum_{m=1}^{M} \hat{\theta}_m z_m,
  where \hat{\theta}_m = <z_m, y> / <z_m, z_m>;
- since the z_m are linear combinations of the original x_j, the solution can be expressed in terms of the original inputs,
  \hat{\beta}_{pcr}(M) = \sum_{m=1}^{M} \hat{\theta}_m v_m.
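A minimal sketch of PCR via the SVD of the standardized inputs; the simulated data and the choice M = 3 are illustrative assumptions.

```r
## Principal component regression via the SVD (illustrative data).
set.seed(1)
n <- 100; p <- 10; M <- 3
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - X[, 2] + rnorm(n)

Xs <- scale(X)                              # standardize the inputs
sv <- svd(Xs)                               # Xs = U D V^T
Z  <- Xs %*% sv$v[, 1:M]                    # principal components z_m = X v_m
theta <- crossprod(Z, y) / colSums(Z^2)     # theta_m = <z_m, y> / <z_m, z_m>
beta_pcr <- sv$v[, 1:M] %*% theta           # beta_pcr(M) = sum_m theta_m v_m
y_hat <- mean(y) + Z %*% theta              # fitted values y_pcr(M)
head(y_hat)
```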
Principal Component Regression: remarks

Note that:
- PCR can be used in high dimensions, as long as M < N;
- the idea is to remove the directions with less information;
- if M = p, \hat{\beta}_{pcr}(p) = \hat{\beta}_{OLS};
- M is a tuning parameter, which may be chosen via cross-validation;
- there is a shrinkage effect (this will become clearer later);
- the principal components are scale dependent, so it is important to standardize X!

Partial Least Squares: idea

Partial least squares (PLS) is based on an idea similar to PCR:
- construct a set of linear combinations of the columns of X;
- PCR only uses X, ignoring y;
- in PLS we also want to use the information contained in y;
- as for PCR, it is important to first standardize X.

Partial Least Squares: algorithm

1. standardize each x_j; set \hat{y}^{[0]} = \bar{y} 1 and x_j^{[0]} = x_j;
2. for m = 1, 2, ..., p:
   (a) z_m = \sum_{j=1}^{p} \hat{\varphi}_{mj} x_j^{[m-1]}, with \hat{\varphi}_{mj} = <x_j^{[m-1]}, y>;
   (b) \hat{\theta}_m = <z_m, y> / <z_m, z_m>;
   (c) \hat{y}^{[m]} = \hat{y}^{[m-1]} + \hat{\theta}_m z_m;
   (d) orthogonalize each x_j^{[m-1]} with respect to z_m:
       x_j^{[m]} = x_j^{[m-1]} - (<z_m, x_j^{[m-1]}> / <z_m, z_m>) z_m, j = 1, ..., p;
3. output the sequence of fitted vectors {\hat{y}^{[m]}}_{m=1}^{p}.
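A minimal sketch of the algorithm above, returning the fitted vectors after each of the first M PLS steps; the simulated data are an illustrative assumption.

```r
## Partial least squares following steps 1-3 above (illustrative data).
pls_fit <- function(X, y, M) {
  Xm    <- scale(X)                       # working copies x_j^{[m-1]}, standardized
  y_hat <- rep(mean(y), nrow(X))          # y_hat^{[0]} = ybar
  fits  <- matrix(NA, nrow(X), M)
  for (m in 1:M) {
    phi   <- crossprod(Xm, y)             # phi_mj = <x_j^{[m-1]}, y>
    z     <- Xm %*% phi                   # m-th PLS direction z_m
    theta <- sum(z * y) / sum(z * z)      # theta_m = <z_m, y> / <z_m, z_m>
    y_hat <- y_hat + theta * z            # y_hat^{[m]}
    fits[, m] <- y_hat
    Xm <- Xm - z %*% (crossprod(z, Xm) / sum(z * z))   # orthogonalize w.r.t. z_m
  }
  fits
}

set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)
fits <- pls_fit(X, y, M = 3)              # columns: y_hat^{[1]}, y_hat^{[2]}, y_hat^{[3]}
```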
Partial Least Squares: step by step

First step:
(a) compute the first PLS direction, z_1 = \sum_{j=1}^{p} \hat{\varphi}_{1j} x_j, based on the relation between each x_j and y: \hat{\varphi}_{1j} = <x_j, y>;
(b) estimate the related regression coefficient, \hat{\theta}_1 = <z_1, y> / <z_1, z_1>;
(c) model after the first iteration: \hat{y}^{[1]} = \bar{y} 1 + \hat{\theta}_1 z_1;
(d) orthogonalize x_1, ..., x_p w.r.t. z_1: x_j^{[2]} = x_j - (<z_1, x_j> / <z_1, z_1>) z_1.
We are now ready for the second step...

Partial Least Squares: step by step

... using x_j^{[2]} instead of x_j:
(a) compute the second PLS direction, z_2 = \sum_{j=1}^{p} \hat{\varphi}_{2j} x_j^{[2]}, based on the relation between each x_j^{[2]} and y: \hat{\varphi}_{2j} = <x_j^{[2]}, y>;
(b) estimate the related regression coefficient, \hat{\theta}_2 = <z_2, y> / <z_2, z_2>;
(c) model after the second iteration: \hat{y}^{[2]} = \bar{y} 1 + \hat{\theta}_1 z_1 + \hat{\theta}_2 z_2;
(d) orthogonalize x_1^{[2]}, ..., x_p^{[2]} w.r.t. z_2: x_j^{[3]} = x_j^{[2]} - (<z_2, x_j^{[2]}> / <z_2, z_2>) z_2;
and so on, until the M \leq p step -> M derived inputs.

Partial Least Squares: PLS versus PCR

Differences:
- PCR: the derived input directions are the principal components of X, constructed by looking only at the variability of X;
- PLS: the input directions take into consideration both the variability of X and the correlation between X and y.

Mathematically:
- PCR: max_\alpha Var(X\alpha), s.t. ||\alpha|| = 1 and \alpha^T S v_l = 0, l = 1, ..., M - 1;
- PLS: max_\alpha Cor^2(y, X\alpha) Var(X\alpha), s.t. ||\alpha|| = 1 and \alpha^T S \hat{\varphi}_l = 0 for all l < M.
In practice, the variance term tends to dominate -> the two methods give similar results!

Ridge Regression: historical notes

- When two predictors are strongly correlated -> collinearity;
- in the extreme case of linear dependency -> super-collinearity;
- in the case of super-collinearity, X^T X is not invertible (not full rank);
- Hoerl & Kennard (1970) proposed to replace X^T X with X^T X + \lambda I_p, where \lambda > 0 and I_p is the p \times p identity matrix;
- with \lambda > 0, (X^T X + \lambda I_p)^{-1} exists.

Ridge Regression: estimator

Substituting X^T X + \lambda I_p for X^T X in the least squares estimator,
  \hat{\beta}_{ridge}(\lambda) = (X^T X + \lambda I_p)^{-1} X^T y.
Alternatively, the ridge estimator can be seen as the minimizer of
  \sum_{i=1}^{N} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2, subject to \sum_{j=1}^{p} \beta_j^2 \leq t,
which is the same as
  \hat{\beta}_{ridge}(\lambda) = argmin_\beta { \sum_{i=1}^{N} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 }.
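A minimal sketch of the ridge estimator in its closed form, on standardized inputs and a centred response so that the intercept stays out of the penalty; the simulated data are an illustrative assumption.

```r
## Closed-form ridge estimator (illustrative data).
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                    # ridge is not scale equivariant: standardize first
  yc <- y - mean(y)                 # centre y, so no intercept enters the penalty
  solve(crossprod(Xs) + lambda * diag(ncol(X)), crossprod(Xs, yc))
}

set.seed(1)
X <- matrix(rnorm(50 * 5), 50, 5)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(50)
ridge_coef(X, y, lambda = 0)        # least squares solution on the standardized X
ridge_coef(X, y, lambda = 10)       # coefficients shrunk towards zero
```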
Ridge Regression: visually
(figure)

Ridge Regression: visually
(figure)

Ridge Regression: remarks

Note:
- the ridge solution is not equivariant under scaling -> X must be standardized before applying the minimizer;
- the intercept is not involved in the penalization;
- Bayesian interpretation:
  - Y_i ~ N(\beta_0 + x_i^T \beta, \sigma^2);
  - \beta_j ~ N(0, \tau^2);
  - \lambda = \sigma^2 / \tau^2;
  - \hat{\beta}_{ridge}(\lambda) is the posterior mean.

Ridge Regression: bias

E[\hat{\beta}_{ridge}(\lambda)] = E[(X^T X + \lambda I_p)^{-1} X^T y]
                               = E[(I_p + \lambda (X^T X)^{-1})^{-1} (X^T X)^{-1} X^T y]
                               = (I_p + \lambda (X^T X)^{-1})^{-1} E[\hat{\beta}_{LS}]
                               = w_\lambda \beta,
with w_\lambda = (I_p + \lambda (X^T X)^{-1})^{-1}, so E[\hat{\beta}_{ridge}(\lambda)] \neq \beta for \lambda > 0.
- \lambda \to 0: E[\hat{\beta}_{ridge}(\lambda)] \to \beta;
- \lambda \to \infty: E[\hat{\beta}_{ridge}(\lambda)] \to 0 (without intercept);
- due to correlation among the predictors, \lambda_a > \lambda_b does not necessarily imply that each component of \hat{\beta}_{ridge}(\lambda_a) is smaller in absolute value than the corresponding component of \hat{\beta}_{ridge}(\lambda_b).
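A minimal sketch evaluating the shrinkage matrix w_\lambda = (I_p + \lambda (X^T X)^{-1})^{-1} and the resulting expectation w_\lambda \beta; the data, the coefficient vector and the value of \lambda are illustrative assumptions.

```r
## Bias of the ridge estimator via w_lambda (illustrative data).
set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))
beta <- c(3, 1, -2)
lambda <- 5

w_lambda <- solve(diag(p) + lambda * solve(crossprod(X)))
w_lambda %*% beta      # E[beta_ridge(lambda)]: shrunk relative to beta
beta                   # recovered only as lambda -> 0
```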
Ridge Regression: variance

Consider the variance of the ridge estimator,
  Var[\hat{\beta}_{ridge}(\lambda)] = Var[w_\lambda \hat{\beta}_{LS}] = w_\lambda Var[\hat{\beta}_{LS}] w_\lambda^T = \sigma^2 w_\lambda (X^T X)^{-1} w_\lambda^T.
Then,
  Var[\hat{\beta}_{LS}] - Var[\hat{\beta}_{ridge}(\lambda)]
    = \sigma^2 [(X^T X)^{-1} - w_\lambda (X^T X)^{-1} w_\lambda^T]
    = \sigma^2 w_\lambda [(I_p + \lambda (X^T X)^{-1}) (X^T X)^{-1} (I_p + \lambda (X^T X)^{-1})^T - (X^T X)^{-1}] w_\lambda^T
    = \sigma^2 w_\lambda [(X^T X)^{-1} + 2\lambda (X^T X)^{-2} + \lambda^2 (X^T X)^{-3} - (X^T X)^{-1}] w_\lambda^T
    = \sigma^2 w_\lambda [2\lambda (X^T X)^{-2} + \lambda^2 (X^T X)^{-3}] w_\lambda^T > 0
(since all the remaining terms are quadratic and therefore positive),
  => Var[\hat{\beta}_{ridge}(\lambda)] \preceq Var[\hat{\beta}_{LS}].

Ridge Regression: degrees of freedom

Note that the ridge solution is a linear function of y, as is the least squares one:
  \hat{y}_{LS} = X (X^T X)^{-1} X^T y = H y -> df = trace(H) = p;
  \hat{y}_{ridge} = X (X^T X + \lambda I_p)^{-1} X^T y = H_\lambda y -> df(\lambda) = trace(H_\lambda);
- trace(H_\lambda) = \sum_{j=1}^{p} d_j^2 / (d_j^2 + \lambda);
- d_j is the j-th diagonal element of D in the SVD of X;
- \lambda \to 0: df(\lambda) \to p; \lambda \to \infty: df(\lambda) \to 0.

Ridge Regression: more about shrinkage

Recall the SVD X = U D V^T and the properties U^T U = I_p = V^T V.

\hat{\beta}_{LS} = (X^T X)^{-1} X^T y
               = (V D U^T U D V^T)^{-1} V D U^T y
               = (V D^2 V^T)^{-1} V D U^T y
               = V D^{-2} V^T V D U^T y
               = V D^{-1} U^T y,
so \hat{y}_{LS} = X \hat{\beta}_{LS} = U D V^T V D^{-1} U^T y = U U^T y.

\hat{\beta}_{ridge} = (X^T X + \lambda I_p)^{-1} X^T y
                   = (V D U^T U D V^T + \lambda I_p)^{-1} V D U^T y
                   = (V D^2 V^T + \lambda V V^T)^{-1} V D U^T y
                   = V (D^2 + \lambda I_p)^{-1} V^T V D U^T y
                   = V (D^2 + \lambda I_p)^{-1} D U^T y.
So:
  \hat{y}_{ridge} = X \hat{\beta}_{ridge} = U D V^T V (D^2 + \lambda I_p)^{-1} D U^T y
                 = U D^2 (D^2 + \lambda I_p)^{-1} U^T y
                 = \sum_{j=1}^{p} u_j [d_j^2 / (d_j^2 + \lambda)] u_j^T y.
- Small singular values d_j correspond to directions of the column space of X with low variance;
- ridge regression penalizes these directions the most.
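A minimal sketch of the shrinkage factors d_j^2 / (d_j^2 + \lambda) and the effective degrees of freedom df(\lambda); the simulated X is an illustrative assumption.

```r
## Shrinkage factors and effective degrees of freedom of ridge regression (illustrative).
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
d <- svd(X)$d                                     # singular values d_1 >= ... >= d_p

shrink <- function(lambda) d^2 / (d^2 + lambda)   # per-direction shrinkage factors
df_ridge <- function(lambda) sum(shrink(lambda))  # df(lambda) = trace(H_lambda)

df_ridge(0)       # = p, the least squares degrees of freedom
df_ridge(10)      # < p
df_ridge(1e6)     # -> 0 as lambda grows
shrink(10)        # directions with small d_j are shrunk the most
```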
Ridge Regression: more about shrinkage
(picture from https://onlinecourses.science.psu.edu/stat857/node/155/)

References

Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.