COMS 4771 Lecture 6: 1. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso


Fixed-design linear regression

A simplified fixed-design setting: $x^{(1)}, x^{(2)}, \dots, x^{(n)} \in \mathbb{R}^p$ are assumed to be fixed, i.e., not random; $y^{(1)}, y^{(2)}, \dots, y^{(n)}$ are independent random variables with
$$\mathbb{E}(y^{(i)}) = \langle x^{(i)}, w^* \rangle, \qquad \operatorname{var}(y^{(i)}) = \sigma^2.$$
In matrix form,
$$\underbrace{\begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}}_{y \,\in\, \mathbb{R}^n}
= \underbrace{\begin{pmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(n)\top} \end{pmatrix}}_{X \,\in\, \mathbb{R}^{n \times p}}
\underbrace{\begin{pmatrix} w^*_1 \\ w^*_2 \\ \vdots \\ w^*_p \end{pmatrix}}_{w^* \,\in\, \mathbb{R}^p}
+ \underbrace{\begin{pmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(n)} \end{pmatrix}}_{\varepsilon \,\in\, \mathbb{R}^n}$$
where $\varepsilon^{(1)}, \varepsilon^{(2)}, \dots, \varepsilon^{(n)}$ are independent, $\mathbb{E}(\varepsilon^{(i)}) = 0$, and $\operatorname{var}(\varepsilon^{(i)}) = \sigma^2$.

Goal: find $\hat{w} \in \mathbb{R}^p$ based on $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})$ so that
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w} \|_2^2 \Big]$$
is small (e.g., $o(1)$ as a function of $n$).

Ordinary least squares

Recall: assuming $X^\top X$ is invertible,
$$\hat{w}_{\mathrm{ols}} := (X^\top X)^{-1} X^\top y.$$

Expressing $X \hat{w}_{\mathrm{ols}}$ in terms of $X w^*$:
$$X \hat{w}_{\mathrm{ols}}
= X (X^\top X)^{-1} X^\top (X w^* + \varepsilon)
= X w^* + X (X^\top X)^{-1} X^\top \varepsilon
= X w^* + \Pi \varepsilon,$$
where $\Pi \in \mathbb{R}^{n \times n}$ is the orthogonal projection operator for $\operatorname{ran}(X)$. Consequently,
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}_{\mathrm{ols}} \|_2^2 \Big] = \frac{\sigma^2 p}{n}.$$
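A minimal Monte Carlo sketch of this risk formula (not part of the lecture; all names and constants are illustrative): hold the design $X$ fixed, redraw the noise many times, and compare the averaged in-sample risk with $\sigma^2 p / n$.

    # Monte Carlo check of E[(1/n)||X w* - X w_ols||^2] = sigma^2 p / n.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 200, 10, 0.5
    X = rng.standard_normal((n, p))          # drawn once, then treated as fixed
    w_star = rng.standard_normal(p)

    risks = []
    for _ in range(2000):                    # fresh noise each trial, same X
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        risks.append(np.mean((X @ w_star - X @ w_ols) ** 2))

    print(np.mean(risks), sigma**2 * p / n)  # the two numbers should be close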

Ridge regression and principal components regression

Ridge regression

Ordinary least squares only applies if $X^\top X$ is invertible, which is not possible if $p > n$.

Ridge regression (Hoerl, 1962): for $\lambda > 0$,
$$\hat{w}_\lambda := \arg\min_{w \in \mathbb{R}^p} \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_2^2.$$

The gradient of the objective is zero when
$$(X^\top X + n \lambda I) w = X^\top y,$$
which always has a unique solution (since $\lambda > 0$):
$$\hat{w}_\lambda = (X^\top X + n \lambda I)^{-1} X^\top y
= \Big( \tfrac{1}{n} X^\top X + \lambda I \Big)^{-1} \Big( \tfrac{1}{n} X^\top y \Big).$$
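A short sketch of this closed form (illustrative names; assumes a design matrix X, response y, and regularization value lam > 0 are already defined):

    import numpy as np

    def ridge(X, y, lam):
        """Return the ridge estimator by solving (X^T X + n*lam*I) w = X^T y."""
        n, p = X.shape
        return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)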

Ridge regression: geometry

The ridge regression objective (as a function of $w$) can be written as
$$\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_2^2 + (\text{terms not depending on } w).$$

[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and circular level sets of $\lambda \| w \|_2^2$ centered at $0$.]

Ridge regression: eigendecomposition

Write the eigendecomposition of $\tfrac{1}{n} X^\top X$ as
$$\tfrac{1}{n} X^\top X = \sum_{j=1}^{p} \lambda_j v_j v_j^\top,$$
where $v_1, v_2, \dots, v_p \in \mathbb{R}^p$ are orthonormal eigenvectors with corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$.

The eigenvectors $v_1, v_2, \dots, v_p$ comprise an orthonormal basis for $\mathbb{R}^p$. We'll look at $\hat{w}_{\mathrm{ols}}$ and $w^*$ in this basis, e.g.,
$$\hat{w}_{\mathrm{ols}} = \sum_{j=1}^{p} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j.$$

The inverse of $\tfrac{1}{n} X^\top X + \lambda I$ has the form
$$\Big( \tfrac{1}{n} X^\top X + \lambda I \Big)^{-1} = \sum_{j=1}^{p} \frac{1}{\lambda_j + \lambda} \, v_j v_j^\top.$$

Ridge regression vs. ordinary least squares

If $\hat{w}_{\mathrm{ols}}$ exists, then
$$\begin{aligned}
\hat{w}_\lambda
&= \Big( \tfrac{1}{n} X^\top X + \lambda I \Big)^{-1} \Big( \tfrac{1}{n} X^\top X \Big) \hat{w}_{\mathrm{ols}} \\
&= \Big( \sum_{j=1}^{p} \tfrac{1}{\lambda_j + \lambda} v_j v_j^\top \Big) \Big( \sum_{j=1}^{p} \lambda_j v_j v_j^\top \Big) \hat{w}_{\mathrm{ols}} \\
&= \Big( \sum_{j=1}^{p} \tfrac{\lambda_j}{\lambda_j + \lambda} v_j v_j^\top \Big) \hat{w}_{\mathrm{ols}} \qquad \text{(by orthogonality)} \\
&= \sum_{j=1}^{p} \frac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j.
\end{aligned}$$

Interpretation: shrink $\hat{w}_{\mathrm{ols}}$ towards zero by a factor of $\frac{\lambda_j}{\lambda_j + \lambda}$ in the direction $v_j$.
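A quick numerical sketch of this identity (illustrative, not from the lecture): compute the ridge estimator directly and via per-direction shrinkage of $\hat{w}_{\mathrm{ols}}$, and confirm the two agree.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, lam = 100, 5, 0.1
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n)

    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

    evals, V = np.linalg.eigh(X.T @ X / n)            # columns of V are the v_j
    shrunk = V @ ((evals / (evals + lam)) * (V.T @ w_ols))

    print(np.allclose(w_ridge, shrunk))               # expect True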

Coefficient profile

Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient value in $\hat{w}_\lambda$ for eight different variables.

Ridge regression: theory

Theorem:
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \Big]
= \underbrace{\lambda \sum_{j=1}^{p} \frac{\lambda_j \lambda}{(\lambda_j + \lambda)^2} \langle v_j, w^* \rangle^2}_{\text{bias}}
+ \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^{p} \Big( \frac{\lambda_j}{\lambda_j + \lambda} \Big)^2}_{\text{variance}}.$$

Corollary: since $\lambda_j \lambda / (\lambda_j + \lambda)^2 \leq 1/2$ (indeed $(\lambda_j + \lambda)^2 \geq 4 \lambda_j \lambda$) and $\sum_{j=1}^{p} \lambda_j = \operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)$,
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \Big]
\leq \frac{\lambda \| w^* \|_2^2}{2} + \frac{\sigma^2 \operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)}{2 n \lambda}.$$

For instance, using $\lambda := \sqrt{\operatorname{tr}\big( \tfrac{1}{n} X^\top X \big) / n}$ guarantees
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \Big]
\leq \frac{\| w^* \|_2^2 + \sigma^2}{2} \sqrt{\frac{\operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)}{n}}.$$

Note there is no explicit dependence on $p$: the corollary can be meaningful even if $p = \infty$, as long as $\| w^* \|_2^2$ and $\operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)$ are finite.

Principal components ("keep-or-kill") regression

Ridge regression as shrinkage:
$$\hat{w}_\lambda = \sum_{j=1}^{p} \frac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j.$$

Another approach: principal components regression. Instead of shrinking $\hat{w}_{\mathrm{ols}}$ in all directions, either keep $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j \geq \lambda$), or kill $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j < \lambda$):
$$\hat{w}^{\mathrm{pc}}_\lambda := \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\} \, \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j.$$

[Figure: shrinkage factor as a function of $\lambda_j$, comparing ridge ($\lambda_j / (\lambda_j + \lambda)$, a smooth curve) with principal components regression (a 0/1 step at $\lambda_j = \lambda$).]
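A sketch of keep-or-kill principal components regression (illustrative names; assumes $X^\top X$ is invertible so the OLS solution exists):

    import numpy as np

    def pcr(X, y, lam):
        n, p = X.shape
        w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        evals, V = np.linalg.eigh(X.T @ X / n)   # eigenpairs of (1/n) X^T X
        keep = (evals >= lam).astype(float)      # 1 if lambda_j >= lam, else 0
        return V @ (keep * (V.T @ w_ols))        # keep or kill each direction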

Principal components regression: theory

Theorem:
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}^{\mathrm{pc}}_\lambda \|_2^2 \Big]
= \underbrace{\sum_{j=1}^{p} \mathbf{1}\{\lambda_j < \lambda\} \, \lambda_j \, \langle v_j, w^* \rangle^2}_{\text{bias}}
+ \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\}}_{\text{variance}}
\leq 4 \, \mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \Big].$$

Should pick $\lambda$ large enough so that $\sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\}$ (the effective dimension at cut-off point $\lambda$) is small relative to $n$.

Never (much) worse than ridge regression; often substantially better.

Sparse regression and Lasso

Sparsity

Another way to deal with $p > n$ is to only consider sparse $w$, i.e., $w$ with only a small number ($\ll p$) of non-zero entries.

Other advantages of sparsity (especially relative to ridge / principal components regression): sparse solutions are more interpretable, and it can be more efficient to evaluate $\langle x, w \rangle$ (both in terms of computing variable values and computing the inner product).

Sparse regression methods

For any $T \subseteq \{1, 2, \dots, p\}$, let $\hat{w}(T) :=$ the OLS solution using only the variables in $T$.

Subset selection (brute-force strategy). Pick the $T \subseteq \{1, 2, \dots, p\}$ of size $|T| = k$ for which $\| y - X \hat{w}(T) \|_2^2$ is minimal, and return $\hat{w}(T)$; a sketch follows the list below.

- Gives you exactly what you want (for a given value of $k$).
- Only feasible for very small $k$, since the complexity scales with $\binom{p}{k}$. (NP-hard optimization problem.)
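A brute-force sketch of subset selection for a given sparsity $k$ (only sensible for small $p$ and $k$; the helper names ols_on_subset and subset_selection are illustrative, not from the lecture):

    from itertools import combinations
    import numpy as np

    def ols_on_subset(X, y, T):
        """OLS using only the columns in T, embedded back into R^p."""
        w = np.zeros(X.shape[1])
        idx = sorted(T)
        w[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
        return w

    def subset_selection(X, y, k):
        best_rss, best_w = np.inf, None
        for T in combinations(range(X.shape[1]), k):   # all (p choose k) subsets
            w = ols_on_subset(X, y, T)
            rss = np.sum((y - X @ w) ** 2)
            if rss < best_rss:
                best_rss, best_w = rss, w
        return best_w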

Forward stepwise regression (greedy strategy). Starting with $T = \emptyset$, repeat until $|T| = k$: pick the $j \in \{1, 2, \dots, p\} \setminus T$ for which $\| y - X \hat{w}(T \cup \{j\}) \|_2^2$ is minimal, and add this $j$ to $T$. Return $\hat{w}(T)$; a sketch follows the list below.

- Gives you a $k$-sparse solution, with no shrinkage within $T$.
- Primarily only effective when the columns of $X$ are close to orthogonal.
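A greedy sketch of forward stepwise regression (illustrative names; the helper mirrors the one in the subset-selection sketch above):

    import numpy as np

    def ols_on_subset(X, y, T):
        """OLS using only the columns in T, embedded back into R^p."""
        w = np.zeros(X.shape[1])
        idx = sorted(T)
        w[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
        return w

    def forward_stepwise(X, y, k):
        T = set()
        for _ in range(k):
            # Greedily add the variable that most reduces the residual sum of squares.
            def rss(j):
                return np.sum((y - X @ ols_on_subset(X, y, T | {j})) ** 2)
            T.add(min(set(range(X.shape[1])) - T, key=rss))
        return ols_on_subset(X, y, T)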

Lasso (Tibshirani, 1994)

Lasso: least absolute shrinkage and selection operator.
$$\hat{w}^{\mathrm{lasso}}_\lambda := \arg\min_{w \in \mathbb{R}^p} \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_1.$$
(Convex, though not differentiable.)

The objective (as a function of $w$) can be written as
$$\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_1 + (\text{terms not depending on } w).$$

[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and diamond-shaped level sets of $\lambda \| w \|_1$ centered at $0$.]
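The lecture does not prescribe a solver; as one illustrative sketch (not the lecture's algorithm), proximal gradient descent (ISTA) minimizes this objective by alternating a gradient step on the smooth part with a soft-thresholding step. All names here are illustrative.

    import numpy as np

    def lasso_ista(X, y, lam, iters=2000):
        """ISTA for (1/n)||y - Xw||_2^2 + lam*||w||_1."""
        n, p = X.shape
        L = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
        eta = 1.0 / L                                     # step size
        w = np.zeros(p)
        for _ in range(iters):
            grad = (2.0 / n) * X.T @ (X @ w - y)          # gradient of the smooth part
            z = w - eta * grad                            # gradient step
            w = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # soft-threshold
        return w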

Coefficient profile

Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient value in $\hat{w}^{\mathrm{lasso}}_\lambda$ for eight different variables.

Lasso: theory

Many results, mostly roughly of the following flavor. Suppose:

- $w^*$ has $k$ non-zero entries;
- $X$ satisfies some special properties (typically not efficiently checkable);
- $\lambda \approx \sigma \sqrt{\frac{2 \log(p)}{n}}$;

then
$$\mathbb{E}\Big[ \tfrac{1}{n} \| X w^* - X \hat{w}^{\mathrm{lasso}}_\lambda \|_2^2 \Big] \leq O\Big( \frac{\sigma^2 k \log(p)}{n} \Big).$$

Very active subject of research; closely related to "compressed sensing"; intersects with the beautiful subject of high-dimensional convex geometry.

Recap

- Fixed-design setting for studying linear regression methods.
- Ridge and principal components regression make use of the eigenvectors of $\tfrac{1}{n} X^\top X$ ("principal component directions"), and also the corresponding eigenvalues.
- Ridge regression: shrink $\hat{w}_{\mathrm{ols}}$ along principal component directions by an amount related to the eigenvalue and $\lambda$.
- Principal components regression: keep-or-kill $\hat{w}_{\mathrm{ols}}$ along principal component directions, based on comparing the eigenvalue to $\lambda$.
- Sparse regression: intractable, but some greedy strategies work.
- Lasso: shrink coefficients towards zero in a way that tends to lead to sparse solutions.