STK-IN4300 Statistical Learning Methods in Data Science

Riccardo De Bin (debin@math.uio.no)
Lecture 3

Outline of the lecture
- Model Assessment and Selection: Cross-Validation; Bootstrap Methods
- Methods using Derived Input Directions: Principal Component Regression; Partial Least Squares
- Shrinkage Methods: Ridge Regression

Cross-Validation: k-fold cross-validation

Cross-validation aims at estimating the expected test error, Err = E[L(Y, \hat{f}(X))]:
- with enough data, we could split them into a training and a test set;
- since this is usually not the case, we mimic the split using the limited amount of data we have:
  - split the data into K folds F_1, ..., F_K of approximately the same size;
  - use, in turn, K - 1 folds to train the model (derive \hat{f}^{-k}(x));
  - evaluate the model on the remaining fold,
    CV(\hat{f}^{-k}) = \frac{1}{|F_k|} \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i));
  - estimate the expected test error as the average,
    CV(\hat{f}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i)),
    which, for folds of equal size, equals \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-k(i)}(x_i)).

Cross-Validation: k-fold cross-validation
(figure from http://qingkaikong.blogspot.com/2017/02/machine-learning-9-more-on-artificial.html)
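As a concrete illustration of the scheme above, a minimal R sketch of K-fold cross-validation for a linear model under squared-error loss; the simulated data and the choice K = 10 are arbitrary and only serve the example.

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- drop(X %*% rnorm(p) + rnorm(n))
dat <- data.frame(y = y, X)

K <- 10
fold <- sample(rep(1:K, length.out = n))            # random assignment to folds F_1, ..., F_K

cv_k <- sapply(1:K, function(k) {
  fit  <- lm(y ~ ., data = dat[fold != k, ])        # train on the other K - 1 folds
  pred <- predict(fit, newdata = dat[fold == k, ])  # evaluate on fold F_k
  mean((dat$y[fold == k] - pred)^2)                 # CV(f^{-k}) under squared-error loss
})
mean(cv_k)                                          # CV estimate of the expected test error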

Cross-Validation: choice of K

How to choose K?
- there is no clear-cut answer;
- bias-variance trade-off:
  - the smaller K, the smaller the variance (but the larger the bias);
  - the larger K, the smaller the bias (but the larger the variance);
- extreme cases:
  - K = 2: half of the observations for training, half for testing;
  - K = N: leave-one-out cross-validation (LOOCV);
- LOOCV estimates the expected test error approximately unbiasedly;
- LOOCV has very large variance (the training sets are very similar to one another);
- usual choices are K = 5 and K = 10.

Cross-Validation: further aspects

If we want to select a tuning parameter α (e.g., the number of neighbours):
- train \hat{f}^{-k}(x, α) for each α;
- compute CV(\hat{f}, α) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i, α));
- obtain \hat{α} = argmin_α CV(\hat{f}, α).

The generalized cross-validation (GCV),
GCV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(S)/N} \right]^2,
- is a convenient approximation of LOOCV for linear fitting under squared-error loss;
- has computational advantages.

Cross-Validation: the wrong and the right way to do cross-validation

Consider the following procedure:
1. find a subset of good (= most correlated with the outcome) predictors;
2. use the selected predictors to build a classifier;
3. use cross-validation to compute the prediction error.

Practical example (see R file; a sketch of the experiment also follows below):
- generate X, an [N = 50] x [p = 5000] data matrix;
- generate y_i, i = 1, ..., 50, independently of X, with y_i in {0, 1};
- the true test error is therefore 0.50;
- implement the procedure above. What happens?
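A sketch of the "wrong" procedure in R. The data-generating setup follows the slide; the screening size of 20 predictors and the 1-nearest-neighbour classifier (knn() from the class package, shipped with standard R installations) are arbitrary illustrative choices, not taken from the lecture's R file.

set.seed(1)
n <- 50; p <- 5000
X <- matrix(rnorm(n * p), n, p)
y <- sample(0:1, n, replace = TRUE)        # outcome independent of X: true test error 0.50

# step 1 (done on ALL the data): select the predictors most correlated with y
sel <- order(abs(cor(X, y)), decreasing = TRUE)[1:20]

# steps 2-3: build a 1-NN classifier on the selected predictors and cross-validate it
fold <- sample(rep(1:5, length.out = n))
cv_err <- sapply(1:5, function(fld) {
  pred <- class::knn(train = X[fold != fld, sel], test = X[fold == fld, sel],
                     cl = factor(y[fold != fld]), k = 1)
  mean(pred != y[fold == fld])
})
mean(cv_err)   # far below the true 0.50: the selection step has already seen the test folds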

Cross-Validation: the wrong and the right way to do cross-validation

Why is the procedure not correct?
- training and test sets are NOT independent!
- observations in the test sets are used twice (once for the selection, once for the evaluation).

Correct way to proceed:
- divide the sample into K folds;
- perform both the variable selection and the construction of the classifier using only the observations from K - 1 folds (the possible choice of a tuning parameter included);
- compute the prediction error on the remaining fold.

Bootstrap Methods: bootstrap

IDEA: generate pseudo-samples from the empirical distribution function computed on the original sample:
- by sampling with replacement from the original dataset;
- mimicking new experiments.

Let Z = {(x_1, y_1), ..., (x_N, y_N)} = {z_1, ..., z_N} be the training set:
- by sampling with replacement, obtain Z^{*1} = {z_1^{*1}, ..., z_N^{*1}};
- ...
- by sampling with replacement, obtain Z^{*B} = {z_1^{*B}, ..., z_N^{*B}};
- use the B bootstrap samples Z^{*1}, ..., Z^{*B} to estimate any aspect of the distribution of a map S(Z).

Bootstrap Methods: bootstrap

For example, to estimate the variance of S(Z),
\widehat{Var}[S(Z)] = \frac{1}{B - 1} \sum_{b=1}^{B} \left( S(Z^{*b}) - \bar{S}^* \right)^2,
where \bar{S}^* = \frac{1}{B} \sum_{b=1}^{B} S(Z^{*b}).

Note that \widehat{Var}[S(Z)] is the Monte Carlo estimate of Var[S(Z)] under sampling from the empirical distribution \hat{F}.
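A minimal R sketch of the bootstrap variance estimate above, taking S(Z) to be the sample median (an arbitrary illustrative choice of statistic, not specified in the slides).

set.seed(1)
z <- rnorm(100, mean = 5, sd = 2)                      # original sample Z of size N = 100
S <- median                                            # the statistic S(Z) of interest

B <- 1000
S_star <- replicate(B, S(sample(z, replace = TRUE)))   # S(Z^{*1}), ..., S(Z^{*B})

var_boot <- sum((S_star - mean(S_star))^2) / (B - 1)   # Monte Carlo estimate of Var[S(Z)]
var_boot                                               # same as var(S_star)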

Bootstrap Methods: estimate prediction error

Very simple:
- generate B bootstrap samples Z^{*1}, ..., Z^{*B};
- apply the prediction rule to each bootstrap sample to derive the predictions \hat{f}^{*b}(x_i), b = 1, ..., B;
- compute the error for each point, and take the average,
  \widehat{Err}_{boot} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)).

Is it correct? NO!!! Again, training and test set are NOT independent!

Bootstrap Methods: example

Consider a classification problem:
- two classes with the same number of observations;
- predictors and class label independent ⟹ Err = 0.5.

Using the 1-nearest neighbour:
- if y_i ∈ Z^{*b}, then \widehat{Err} = 0;
- if y_i ∉ Z^{*b}, then \widehat{Err} = 0.5.

Therefore,
\widehat{Err}_{boot} = 0 × Pr[y_i ∈ Z^{*b}] + 0.5 × Pr[y_i ∉ Z^{*b}] = 0.5 × 0.368 = 0.184.

Bootstrap Methods: why 0.368

Since Pr[z_s^{*b} ≠ z_i] = 1 − 1/N holds for each position s of the bootstrap sample,
Pr[observation i does not belong to the bootstrap sample b] = (1 − 1/N)^N → e^{-1} ≈ 0.368 as N → ∞.
Consequently,
Pr[observation i is in the bootstrap sample b] ≈ 0.632.

Bootstrap Methods: correct estimate of the prediction error

Note:
- each bootstrap sample has N observations;
- some of the original observations are included more than once;
- some of them (on average, 36.8%) are not included at all;
- the latter are not used to compute the predictions;
- they can therefore be used as a test set,
  \widehat{Err}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i)),
  where C^{-i} is the set of indices of the bootstrap samples which do not contain observation i, and |C^{-i}| denotes its cardinality.
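A minimal R sketch of the leave-one-out bootstrap estimate \widehat{Err}^{(1)} above, here for a linear model under squared-error loss (an arbitrary regression setting chosen for brevity, different from the classification example on the slide).

set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- drop(X %*% c(1, -1, 0.5) + rnorm(n))
dat <- data.frame(y = y, X)

B <- 200
err_i <- matrix(NA, n, B)            # loss of observation i under bootstrap sample b
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)   # bootstrap sample Z^{*b}
  fit <- lm(y ~ ., data = dat[idx, ])
  oob <- setdiff(1:n, idx)           # observations not in Z^{*b} (about 36.8% of them)
  err_i[oob, b] <- (dat$y[oob] - predict(fit, newdata = dat[oob, ]))^2
}
# average, for each i, only over the bootstrap samples not containing i, then over i
err1 <- mean(rowMeans(err_i, na.rm = TRUE))
err1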

Bootstrap Methods: 0.632 bootstrap

Issue: the average fraction of distinct observations in each bootstrap sample is 0.632, not so far from the 0.5 of 2-fold CV:
- similar bias issues as 2-fold CV;
- \widehat{Err}^{(1)} slightly overestimates the prediction error.

To solve this, the 0.632 bootstrap estimator has been developed,
\widehat{Err}^{(0.632)} = 0.368 \, \overline{err} + 0.632 \, \widehat{Err}^{(1)},
where \overline{err} is the training error:
- in practice it works well;
- in case of strong overfitting, it can break down:
  - consider again the previous classification example;
  - with the 1-nearest neighbour, \overline{err} = 0;
  - \widehat{Err}^{(0.632)} = 0.632 \, \widehat{Err}^{(1)} = 0.632 × 0.5 = 0.316 ≠ 0.5.

Bootstrap Methods: 0.632+ bootstrap

Further improvement, the 0.632+ bootstrap:
- based on the no-information error rate γ;
- γ takes into account the amount of overfitting;
- γ is the error rate we would have if predictors and response were independent;
- it is computed by considering all combinations of x_{i'} and y_i,
  \hat{γ} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} L(y_i, \hat{f}(x_{i'})).

Bootstrap Methods: 0.632+ bootstrap

The quantity \hat{γ} is used to estimate the relative overfitting rate,
\hat{R} = \frac{\widehat{Err}^{(1)} - \overline{err}}{\hat{γ} - \overline{err}},
which is then used in the 0.632+ bootstrap estimator,
\widehat{Err}^{(0.632+)} = (1 - \hat{w}) \, \overline{err} + \hat{w} \, \widehat{Err}^{(1)},
where \hat{w} = \frac{0.632}{1 - 0.368 \hat{R}}.

Methods using Derived Input Directions: summary
- Principal Components Regression
- Partial Least Squares
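A sketch in R assembling the 0.632 and 0.632+ estimators for a 1-nearest-neighbour classifier with labels independent of the predictors, mirroring the example above; the sample size, number of predictors and B are arbitrary, and knn() from the class package is used again.

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- sample(0:1, n, replace = TRUE)            # labels independent of X: true error 0.50

B <- 200
err_i <- matrix(NA, n, B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  oob <- setdiff(1:n, idx)
  pred <- class::knn(X[idx, ], X[oob, ], cl = factor(y[idx]), k = 1)
  err_i[oob, b] <- as.numeric(pred != y[oob])
}
err1    <- mean(rowMeans(err_i, na.rm = TRUE))  # leave-one-out bootstrap, close to 0.50
errbar  <- 0                                    # training error of 1-NN (all x_i distinct)
gamma   <- mean(outer(y, y, "!="))              # no-information error rate, about 0.50
R       <- (err1 - errbar) / (gamma - errbar)   # relative overfitting rate (often truncated to [0, 1])
w       <- 0.632 / (1 - 0.368 * R)
err632  <- 0.368 * errbar + 0.632 * err1        # 0.632 estimator: about 0.316, too low
err632p <- (1 - w) * errbar + w * err1          # 0.632+ estimator: close to the true 0.50
c(err632, err632p)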

Principal Component Regression: singular value decomposition

Consider the singular value decomposition (SVD) of the N × p (standardized) input matrix X,
X = U D V^T,
where:
- U is the N × p orthogonal matrix whose columns span the column space of X;
- D is a p × p diagonal matrix, whose diagonal entries d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0 are the singular values of X;
- V is the p × p orthogonal matrix whose columns span the row space of X.

Principal Component Regression: principal components

Simple algebra leads to X^T X = V D^2 V^T, the eigendecomposition of X^T X (and, up to a constant, of the sample covariance matrix S = X^T X / N).

Using the eigenvectors v_j (columns of V), we can define the principal components of X, z_j = X v_j:
- the first principal component z_1 has the largest sample variance (among all normalized linear combinations of the columns of X);
- Var(z_1) = Var(X v_1) = d_1^2 / N;
- since d_1 ≥ ... ≥ d_p ≥ 0, then Var(z_1) ≥ ... ≥ Var(z_p).

Principal Component Regression: principal components

Principal component regression (PCR):
- use M ≤ p principal components as input;
- regress y on z_1, ..., z_M;
- since the principal components are orthogonal,
  \hat{y}^{pcr}(M) = \bar{y} + \sum_{m=1}^{M} \hat{θ}_m z_m,
  where \hat{θ}_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩;
- since the z_m are linear combinations of the x_j,
  \hat{β}^{pcr}(M) = \sum_{m=1}^{M} \hat{θ}_m v_m.
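A minimal R sketch of PCR via the SVD, following the formulas above; the simulated data and the choice M = 3 are arbitrary.

set.seed(1)
n <- 100; p <- 6; M <- 3
X <- scale(matrix(rnorm(n * p), n, p))   # standardized input matrix
y <- rnorm(n)

s <- svd(X)                              # X = U D V^T
Z <- X %*% s$v                           # principal components z_j = X v_j
theta <- sapply(1:M, function(m) sum(Z[, m] * y) / sum(Z[, m]^2))   # <z_m, y> / <z_m, z_m>
yhat_pcr <- mean(y) + Z[, 1:M] %*% theta # PCR fit with M components
beta_pcr <- s$v[, 1:M] %*% theta         # coefficients on the original (standardized) x_j

With M = p, beta_pcr reproduces the ordinary least-squares coefficients of the standardized predictors, in line with the remark on the next slide.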

Principal Component Regression: remarks

Note that:
- PCR can be used in high dimensions, as long as M < N;
- idea: remove the directions with less information;
- if M = p, \hat{β}^{pcr}(M) = \hat{β}^{OLS};
- M is a tuning parameter, and may be chosen via cross-validation;
- there is a shrinkage effect (clearer later);
- principal components are scale dependent: it is important to standardize X!

Partial Least Squares: idea

Partial least squares (PLS) is based on an idea similar to PCR:
- construct a set of linear combinations of X;
- PCR only uses X, ignoring y;
- in PLS we also want to use the information in y;
- as for PCR, it is important to first standardize X.

Partial Least Squares: algorithm

1. Standardize each x_j; set \hat{y}^{[0]} = \bar{y} and x_j^{[0]} = x_j.
2. For m = 1, 2, ..., p:
   (a) z_m = \sum_{j=1}^{p} \hat{φ}_{mj} x_j^{[m-1]}, with \hat{φ}_{mj} = ⟨x_j^{[m-1]}, y⟩;
   (b) \hat{θ}_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩;
   (c) \hat{y}^{[m]} = \hat{y}^{[m-1]} + \hat{θ}_m z_m;
   (d) orthogonalize each x_j^{[m-1]} with respect to z_m,
       x_j^{[m]} = x_j^{[m-1]} − (⟨z_m, x_j^{[m-1]}⟩ / ⟨z_m, z_m⟩) z_m, j = 1, ..., p.
3. Output the sequence of fitted vectors {\hat{y}^{[m]}}_{m=1}^{p}.

Partial Least Squares: step by step

First step:
(a) compute the first PLS direction, z_1 = \sum_{j=1}^{p} \hat{φ}_{1j} x_j, based on the relation between each x_j and y, \hat{φ}_{1j} = ⟨x_j, y⟩;
(b) estimate the related regression coefficient, \hat{θ}_1 = ⟨z_1, y⟩ / ⟨z_1, z_1⟩;
(c) model after the first iteration: \hat{y}^{[1]} = \bar{y} + \hat{θ}_1 z_1;
(d) orthogonalize x_1, ..., x_p w.r.t. z_1: x_j^{[1]} = x_j − (⟨z_1, x_j⟩ / ⟨z_1, z_1⟩) z_1.
We are now ready for the second step...
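A direct R transcription of the PLS algorithm above, stopped after M steps; the simulated standardized data and M = 2 are arbitrary choices.

set.seed(1)
n <- 100; p <- 5; M <- 2
X <- scale(matrix(rnorm(n * p), n, p))   # step 1: standardize each x_j
y <- rnorm(n)

yhat <- rep(mean(y), n)                  # yhat^[0] = ybar
Xcur <- X                                # x_j^[0] = x_j
Zpls <- matrix(0, n, M)
for (m in 1:M) {
  phi   <- drop(crossprod(Xcur, y))      # (a) phi_mj = <x_j^[m-1], y>
  z     <- drop(Xcur %*% phi)            #     z_m = sum_j phi_mj x_j^[m-1]
  theta <- sum(z * y) / sum(z^2)         # (b) theta_m = <z_m, y> / <z_m, z_m>
  yhat  <- yhat + theta * z              # (c) yhat^[m] = yhat^[m-1] + theta_m z_m
  proj  <- drop(crossprod(Xcur, z)) / sum(z^2)
  Xcur  <- Xcur - outer(z, proj)         # (d) orthogonalize each x_j^[m-1] w.r.t. z_m
  Zpls[, m] <- z
}
yhat[1:5]                                # fitted values after M PLS steps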

Partial Least Squares: step by step

Second step, using x_j^{[1]} instead of x_j:
(a) compute the second PLS direction, z_2 = \sum_{j=1}^{p} \hat{φ}_{2j} x_j^{[1]}, based on the relation between each x_j^{[1]} and y, \hat{φ}_{2j} = ⟨x_j^{[1]}, y⟩;
(b) estimate the related regression coefficient, \hat{θ}_2 = ⟨z_2, y⟩ / ⟨z_2, z_2⟩;
(c) model after the second iteration: \hat{y}^{[2]} = \bar{y} + \hat{θ}_1 z_1 + \hat{θ}_2 z_2;
(d) orthogonalize x_1^{[1]}, ..., x_p^{[1]} w.r.t. z_2: x_j^{[2]} = x_j^{[1]} − (⟨z_2, x_j^{[1]}⟩ / ⟨z_2, z_2⟩) z_2;
and so on, up to the M ≤ p step → M derived inputs.

Partial Least Squares: PLS versus PCR

Differences:
- PCR: the derived input directions are the principal components of X, constructed by looking at the variability of X alone;
- PLS: the input directions take into consideration both the variability of X and the correlation between X and y.

Mathematically:
- PCR: max_α Var(Xα), s.t. ||α|| = 1 and α^T S v_l = 0, l = 1, ..., M − 1;
- PLS: max_α Cor^2(y, Xα) Var(Xα), s.t. ||α|| = 1 and α^T S \hat{φ}_l = 0 for all l < M.

In practice, the variance tends to dominate → similar results!

Ridge Regression: historical notes

- When two predictors are strongly correlated → collinearity;
- in the extreme case of linear dependency → super-collinearity;
- in the case of super-collinearity, X^T X is not invertible (not full rank);
- Hoerl & Kennard (1970): replace X^T X by X^T X + λ I_p, where λ > 0 and I_p is the p × p identity matrix.
With λ > 0, (X^T X + λ I_p)^{-1} exists.

Ridge Regression: estimator

Substituting X^T X with X^T X + λ I_p in the LS estimator,
\hat{β}^{ridge}(λ) = (X^T X + λ I_p)^{-1} X^T y.

Alternatively, the ridge estimator can be seen as the minimizer of
\sum_{i=1}^{N} \left( y_i − β_0 − \sum_{j=1}^{p} β_j x_{ij} \right)^2, subject to \sum_{j=1}^{p} β_j^2 ≤ t,
which is the same as
\hat{β}^{ridge}(λ) = argmin_β \left\{ \sum_{i=1}^{N} \left( y_i − β_0 − \sum_{j=1}^{p} β_j x_{ij} \right)^2 + λ \sum_{j=1}^{p} β_j^2 \right\}.
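A minimal R sketch of the ridge estimator computed directly from the closed-form expression above; the simulated design, the true coefficients and λ = 10 are arbitrary illustrative choices.

set.seed(1)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))   # standardize: ridge is not scale equivariant
y <- drop(X %*% c(2, -1, 0, 0.5) + rnorm(n))
yc <- y - mean(y)                        # center y, so the intercept is left unpenalized

lambda <- 10
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, yc))
beta_ols   <- solve(crossprod(X), crossprod(X, yc))
round(cbind(ols = drop(beta_ols), ridge = drop(beta_ridge)), 3)   # ridge coefficients are shrunken towards 0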

Ridge Regression: visually
(two slides with figures only)

Ridge Regression: remarks

Note:
- the ridge solution is not equivariant under scaling → X must be standardized before applying the minimizer;
- the intercept is not involved in the penalization;
- Bayesian interpretation:
  - Y_i ~ N(β_0 + x_i^T β, σ^2);
  - β_j ~ N(0, τ^2);
  - λ = σ^2 / τ^2;
  - \hat{β}^{ridge}(λ) is the posterior mean.

Ridge Regression: bias

E[\hat{β}^{ridge}(λ)] = E[(X^T X + λ I_p)^{-1} X^T y]
                      = (I_p + λ (X^T X)^{-1})^{-1} E[(X^T X)^{-1} X^T y]
                      = \underbrace{(I_p + λ (X^T X)^{-1})^{-1}}_{w_λ} \underbrace{E[\hat{β}^{LS}]}_{β}
                      = w_λ β
⟹ E[\hat{β}^{ridge}(λ)] ≠ β for λ > 0.

- λ → 0: E[\hat{β}^{ridge}(λ)] → β;
- λ → ∞: E[\hat{β}^{ridge}(λ)] → 0 (without intercept);
- due to correlation among the predictors, λ_a > λ_b does not necessarily imply that each |\hat{β}_j^{ridge}(λ_a)| is smaller than |\hat{β}_j^{ridge}(λ_b)|: the individual coefficients need not shrink monotonically in λ.
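A small numerical check of the bias formula above: with an assumed design X and true coefficient vector β (both invented for illustration), the matrix w_λ shrinks the expected ridge coefficients towards zero as λ grows.

set.seed(1)
n <- 200; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))
beta <- c(1, 2, -1)

w_lambda <- function(lambda)
  solve(diag(p) + lambda * solve(crossprod(X)))   # w_lambda = (I_p + lambda (X^T X)^(-1))^(-1)

# E[beta_ridge(lambda)] = w_lambda %*% beta: shrinks towards 0 as lambda grows
drop(w_lambda(0)    %*% beta)   # lambda -> 0: equals beta (the OLS expectation)
drop(w_lambda(10)   %*% beta)
drop(w_lambda(1000) %*% beta)   # large lambda: expectations close to 0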

Ridge Regression: variance

Consider the variance of the ridge estimator,
Var[\hat{β}^{ridge}(λ)] = Var[w_λ \hat{β}^{LS}] = w_λ Var[\hat{β}^{LS}] w_λ^T = σ^2 w_λ (X^T X)^{-1} w_λ^T.

Then,
Var[\hat{β}^{LS}] − Var[\hat{β}^{ridge}(λ)]
  = σ^2 (X^T X)^{-1} − σ^2 w_λ (X^T X)^{-1} w_λ^T
  = σ^2 w_λ \left[ (I_p + λ(X^T X)^{-1}) (X^T X)^{-1} (I_p + λ(X^T X)^{-1})^T − (X^T X)^{-1} \right] w_λ^T
  = σ^2 w_λ \left[ (X^T X)^{-1} + 2λ(X^T X)^{-2} + λ^2 (X^T X)^{-3} − (X^T X)^{-1} \right] w_λ^T
  = σ^2 w_λ \left[ 2λ(X^T X)^{-2} + λ^2 (X^T X)^{-3} \right] w_λ^T
  ≻ 0 (since all the terms are quadratic and therefore positive definite)
⟹ Var[\hat{β}^{ridge}(λ)] ⪯ Var[\hat{β}^{LS}].

Ridge Regression: degrees of freedom

Note that the ridge solution is a linear combination of y, as the least squares one is:
- \hat{y}^{LS} = X (X^T X)^{-1} X^T y = H y → df = trace(H) = p;
- \hat{y}^{ridge} = X (X^T X + λ I_p)^{-1} X^T y = H_λ y → df(λ) = trace(H_λ);
- trace(H_λ) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + λ}, where d_j is the j-th diagonal element of D in the SVD of X;
- λ → 0: df(λ) → p; λ → ∞: df(λ) → 0.

Ridge Regression: more about shrinkage

Recall the SVD X = U D V^T and the properties U^T U = I_p = V^T V.

\hat{β}^{LS} = (X^T X)^{-1} X^T y
            = (V D U^T U D V^T)^{-1} V D U^T y
            = (V D^2 V^T)^{-1} V D U^T y
            = V D^{-2} V^T V D U^T y
            = V D^{-2} D U^T y

\hat{y}^{LS} = X \hat{β}^{LS} = U D V^T V D^{-2} D U^T y = U U^T y

\hat{β}^{ridge} = (X^T X + λ I_p)^{-1} X^T y
               = (V D U^T U D V^T + λ I_p)^{-1} V D U^T y
               = (V D^2 V^T + λ V V^T)^{-1} V D U^T y
               = V (D^2 + λ I_p)^{-1} V^T V D U^T y
               = V (D^2 + λ I_p)^{-1} D U^T y

Ridge Regression: more about shrinkage

So:
\hat{y}^{ridge} = X \hat{β}^{ridge}
               = U D V^T V (D^2 + λ I_p)^{-1} D U^T y
               = U D^2 (D^2 + λ I_p)^{-1} U^T y
               = \sum_{j=1}^{p} u_j \frac{d_j^2}{d_j^2 + λ} ⟨u_j, y⟩.

- small singular values d_j correspond to directions of the column space of X with low variance;
- ridge regression penalizes these directions the most.
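A minimal R sketch of the effective degrees of freedom and of the shrinkage factors d_j^2 / (d_j^2 + λ) derived above, using an arbitrary simulated design matrix.

set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
d <- svd(X)$d                                             # singular values d_1 >= ... >= d_p

df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))    # trace(H_lambda)

df_ridge(0)       # = p, the least-squares degrees of freedom
df_ridge(10)
df_ridge(1e6)     # -> 0 as lambda -> infinity

shrink <- d^2 / (d^2 + 10)   # shrinkage factors: smallest for the smallest d_j
shrink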

Ridge Regression: more about shrinkage
(picture from https://onlinecourses.science.psu.edu/stat857/node/155/)

References

Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.