COMS 4771 Lecture 6
1. Fixed-design linear regression
2. Ridge and principal components regression
3. Sparse regression and Lasso
Fixed-design linear regression
A simplified fixed-design setting:
- $x^{(1)}, x^{(2)}, \dots, x^{(n)} \in \mathbb{R}^p$ are assumed to be fixed, i.e., not random;
- $y^{(1)}, y^{(2)}, \dots, y^{(n)}$ are independent random variables, with $\mathbb{E}(y^{(i)}) = \langle x^{(i)}, w_\star \rangle$ and $\operatorname{var}(y^{(i)}) = \sigma^2$.

In matrix form,
\[
\underbrace{\begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}}_{y \,\in\, \mathbb{R}^n}
=
\underbrace{\begin{pmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(n)\top} \end{pmatrix}}_{X \,\in\, \mathbb{R}^{n \times p}}
\underbrace{\begin{pmatrix} w_{\star,1} \\ w_{\star,2} \\ \vdots \\ w_{\star,p} \end{pmatrix}}_{w_\star \,\in\, \mathbb{R}^p}
+
\underbrace{\begin{pmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(n)} \end{pmatrix}}_{\varepsilon \,\in\, \mathbb{R}^n}
\]
where $\varepsilon^{(1)}, \varepsilon^{(2)}, \dots, \varepsilon^{(n)}$ are independent, $\mathbb{E}(\varepsilon^{(i)}) = 0$, and $\operatorname{var}(\varepsilon^{(i)}) = \sigma^2$.

We want to find $\hat{w} \in \mathbb{R}^p$ based on $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})$ so that
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w} \|_2^2 \Big]
\]
is small (e.g., $o(1)$ as a function of $n$).
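To make the setting concrete, here is a minimal numpy simulation of this model; the particular values of $n$, $p$, $\sigma$ and all variable names are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 5, 0.5

X = rng.normal(size=(n, p))        # design matrix; treated as fixed from here on
w_star = rng.normal(size=p)        # the unknown coefficient vector w_star
eps = sigma * rng.normal(size=n)   # independent noise: mean 0, variance sigma^2
y = X @ w_star + eps               # responses: y = X w_star + eps
```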
Ordinary least squares

Recall: assuming $X^\top X$ is invertible,
\[
\hat{w}_{\mathrm{ols}} := (X^\top X)^{-1} X^\top y .
\]
Expressing $X \hat{w}_{\mathrm{ols}}$ in terms of $X w_\star$:
\[
X \hat{w}_{\mathrm{ols}}
= X (X^\top X)^{-1} X^\top (X w_\star + \varepsilon)
= X w_\star + X (X^\top X)^{-1} X^\top \varepsilon
= X w_\star + \Pi \varepsilon ,
\]
where $\Pi \in \mathbb{R}^{n \times n}$ is the orthogonal projection operator for $\operatorname{ran}(X)$. It follows that
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_{\mathrm{ols}} \|_2^2 \Big] = \frac{\sigma^2 p}{n} .
\]
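As a sanity check, a short Monte Carlo estimate of the left-hand side, reusing the simulated `X`, `w_star`, and `sigma` from above, should come out close to $\sigma^2 p / n$:

```python
# OLS fit; lstsq is more numerically stable than forming (X'X)^{-1} explicitly.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Monte Carlo estimate of E[(1/n)||X w_star - X w_ols||^2]: redraw only the
# noise (the design X stays fixed), refit, and average the squared error.
risks = []
for _ in range(2000):
    y_rep = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y_rep, rcond=None)[0]
    risks.append(np.mean((X @ w_star - X @ w_hat) ** 2))
print(np.mean(risks), sigma**2 * p / n)  # the two values should nearly agree
```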
Ridge regression and principal components regression
Ridge regression

Ordinary least squares only applies if $X^\top X$ is invertible, which is not possible if $p > n$.

Ridge regression (Hoerl, 1962): for $\lambda > 0$,
\[
\hat{w}_\lambda := \arg\min_{w \in \mathbb{R}^p} \ \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_2^2 .
\]
The gradient of the objective is zero when
\[
(X^\top X + n \lambda I) w = X^\top y ,
\]
which always has a unique solution (since $\lambda > 0$):
\[
\hat{w}_\lambda = (X^\top X + n \lambda I)^{-1} X^\top y
= \Big( \tfrac{1}{n} X^\top X + \lambda I \Big)^{-1} \Big( \tfrac{1}{n} X^\top y \Big) .
\]
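The closed form translates directly into code. A minimal sketch (the function name `ridge` is ours):

```python
def ridge(X, y, lam):
    """Ridge estimate: solve ((1/n) X'X + lam * I) w = (1/n) X'y."""
    n, p = X.shape
    A = X.T @ X / n + lam * np.eye(p)   # invertible for any lam > 0
    return np.linalg.solve(A, X.T @ y / n)

w_ridge = ridge(X, y, lam=0.1)          # lam is a tuning parameter
```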
Ridge regression: geometry

The ridge regression objective (as a function of $w$) can be written as
\[
\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_2^2 + (\text{terms not depending on } w) .
\]
[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and level sets of $\lambda \| w \|_2^2$ centered at $0$.]
Ridge regression: eigendecomposition

Write the eigendecomposition of $\tfrac{1}{n} X^\top X$ as
\[
\tfrac{1}{n} X^\top X = \sum_{j=1}^p \lambda_j v_j v_j^\top ,
\]
where $v_1, v_2, \dots, v_p \in \mathbb{R}^p$ are orthonormal eigenvectors with corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$.

The eigenvectors $v_1, v_2, \dots, v_p$ comprise an orthonormal basis for $\mathbb{R}^p$. We'll look at $\hat{w}_{\mathrm{ols}}$ and $w_\star$ in this basis: e.g.,
\[
\hat{w}_{\mathrm{ols}} = \sum_{j=1}^p \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j .
\]
The inverse of $\tfrac{1}{n} X^\top X + \lambda I$ has the form
\[
\Big( \tfrac{1}{n} X^\top X + \lambda I \Big)^{-1} = \sum_{j=1}^p \frac{1}{\lambda_j + \lambda} v_j v_j^\top .
\]
Ridge regression vs. ordinary least squares

If $\hat{w}_{\mathrm{ols}}$ exists, then
\[
\begin{aligned}
\hat{w}_\lambda
&= \Big( \tfrac{1}{n} X^\top X + \lambda I \Big)^{-1} \Big( \tfrac{1}{n} X^\top X \Big) \hat{w}_{\mathrm{ols}} \\
&= \Big( \sum_{j=1}^p \frac{1}{\lambda_j + \lambda} v_j v_j^\top \Big) \Big( \sum_{j=1}^p \lambda_j v_j v_j^\top \Big) \hat{w}_{\mathrm{ols}} \\
&= \Big( \sum_{j=1}^p \frac{\lambda_j}{\lambda_j + \lambda} v_j v_j^\top \Big) \hat{w}_{\mathrm{ols}} \qquad \text{(by orthogonality)} \\
&= \sum_{j=1}^p \frac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j .
\end{aligned}
\]
Interpretation: shrink $\hat{w}_{\mathrm{ols}}$ towards zero by a factor of $\frac{\lambda_j}{\lambda_j + \lambda}$ in direction $v_j$.
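The shrinkage formula can be checked numerically against the closed form; a sketch reusing `ridge` and `w_ols` from the snippets above:

```python
# Eigendecomposition of (1/n) X'X; eigh returns ascending eigenvalues, and
# the columns of V are the orthonormal eigenvectors v_j.
evals, V = np.linalg.eigh(X.T @ X / n)

lam = 0.1
shrink = evals / (evals + lam)          # per-direction factors lambda_j/(lambda_j+lam)

# sum_j shrink_j * <v_j, w_ols> * v_j, written as matrix products.
w_ridge_eig = V @ (shrink * (V.T @ w_ols))
print(np.allclose(w_ridge_eig, ridge(X, y, lam)))  # True: matches the closed form
```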
Coefficient profile

[Figure. Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient value in $\hat{w}_\lambda$ for eight different variables.]
Ridge regression: theory

Theorem:
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_\lambda \|_2^2 \Big]
= \underbrace{\lambda \sum_{j=1}^p \frac{\lambda_j \lambda}{(\lambda_j + \lambda)^2} \langle v_j, w_\star \rangle^2}_{\text{bias}}
+ \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^p \Big( \frac{\lambda_j}{\lambda_j + \lambda} \Big)^2}_{\text{variance}} .
\]
Corollary: since $\lambda_j \lambda / (\lambda_j + \lambda)^2 \le 1/2$ and $\sum_{j=1}^p \lambda_j = \operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)$,
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_\lambda \|_2^2 \Big]
\le \frac{\lambda \| w_\star \|_2^2}{2} + \frac{\sigma^2 \operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)}{2 n \lambda} .
\]
For instance, using $\lambda := \sqrt{ \sigma^2 \operatorname{tr}\big( \tfrac{1}{n} X^\top X \big) \big/ \big( n \, \| w_\star \|_2^2 \big) }$ guarantees
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_\lambda \|_2^2 \Big]
\le \sqrt{ \frac{ \| w_\star \|_2^2 \, \sigma^2 \operatorname{tr}\big( \tfrac{1}{n} X^\top X \big) }{ n } } .
\]
Note there is no explicit dependence on $p$: the corollary can be meaningful even if $p = \infty$, as long as $\| w_\star \|_2^2$ and $\operatorname{tr}\big( \tfrac{1}{n} X^\top X \big)$ are finite.
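The theorem's bias and variance terms, and the corollary's bound, can be evaluated directly from the eigendecomposition; a sketch reusing `evals`, `V`, `w_star`, `sigma`, `n`, and `lam` from the snippets above:

```python
# Exact risk from the theorem: bias + variance.
coords2 = (V.T @ w_star) ** 2                       # <v_j, w_star>^2
bias = lam * np.sum(evals * lam / (evals + lam) ** 2 * coords2)
variance = sigma**2 / n * np.sum((evals / (evals + lam)) ** 2)

# Corollary's upper bound, using tr((1/n) X'X) = sum_j lambda_j.
bound = lam * w_star @ w_star / 2 + sigma**2 * np.sum(evals) / (2 * n * lam)
print(bias + variance <= bound)                     # True: the bound dominates
```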
Principal components ("keep or kill") regression

Ridge regression as shrinkage:
\[
\hat{w}_\lambda = \sum_{j=1}^p \frac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j .
\]
Another approach: principal components regression. Instead of shrinking $\hat{w}_{\mathrm{ols}}$ in all directions, either keep $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j \ge \lambda$), or kill $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j < \lambda$):
\[
\hat{w}_\lambda^{\mathrm{pc}} := \sum_{j=1}^p \mathbb{1}\{\lambda_j \ge \lambda\} \, \langle v_j, \hat{w}_{\mathrm{ols}} \rangle \, v_j .
\]
[Figure: shrinkage factor as a function of $\lambda_j$, comparing ridge ($\lambda_j / (\lambda_j + \lambda)$, a smooth curve) with principal components regression ($\mathbb{1}\{\lambda_j \ge \lambda\}$, a step function).]
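A minimal keep-or-kill sketch (the name `pcr` is ours); like the formula above, it assumes $\hat{w}_{\mathrm{ols}}$ exists — in general one would instead regress $y$ onto the retained principal components directly:

```python
def pcr(X, y, lam):
    """Principal components regression: keep-or-kill w_ols across eigendirections."""
    n, p = X.shape
    evals, V = np.linalg.eigh(X.T @ X / n)
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    keep = (evals >= lam).astype(float)   # indicator 1{lambda_j >= lam}
    return V @ (keep * (V.T @ w_ols))

w_pc = pcr(X, y, lam=0.1)
```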
Principal components regression: theory

Theorem:
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_\lambda^{\mathrm{pc}} \|_2^2 \Big]
= \underbrace{\sum_{j=1}^p \mathbb{1}\{\lambda_j < \lambda\} \, \lambda_j \langle v_j, w_\star \rangle^2}_{\text{bias}}
+ \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^p \mathbb{1}\{\lambda_j \ge \lambda\}}_{\text{variance}}
\le 4 \, \mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_\lambda \|_2^2 \Big] .
\]
One should pick $\lambda$ large enough so that $\sum_{j=1}^p \mathbb{1}\{\lambda_j \ge \lambda\} \ll n$ (the sum is the effective dimension at cut-off point $\lambda$).

Principal components regression is never (much) worse than ridge regression, and often substantially better.
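The comparison with ridge can be checked by evaluating both theorems' exact risk expressions; a sketch reusing `evals`, `coords2`, `bias`, `variance`, `sigma`, `n`, and `lam` computed earlier:

```python
# Exact principal components regression risk from the theorem above.
keep = evals >= lam
pc_risk = np.sum(evals[~keep] * coords2[~keep]) + sigma**2 / n * np.sum(keep)

# Exact ridge risk: bias + variance from the previous theorem.
ridge_risk = bias + variance
print(pc_risk <= 4 * ridge_risk)  # True, as the theorem guarantees
```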
Sparse regression and Lasso
Sparsity

Another way to deal with $p > n$ is to only consider sparse $w$, i.e., $w$ with only a small number ($\ll p$) of non-zero entries.

Other advantages of sparsity (especially relative to ridge and principal components regression):
- Sparse solutions are more interpretable.
- It can be more efficient to evaluate $\langle x, w \rangle$ (both in terms of computing variable values and computing the inner product).
Sparse regression methods

For any $T \subseteq \{1, 2, \dots, p\}$, let $\hat{w}(T) :=$ the OLS solution using only the variables in $T$.

Subset selection (brute-force strategy). Pick the $T \subseteq \{1, 2, \dots, p\}$ of size $|T| = k$ for which
\[
\| y - X \hat{w}(T) \|_2^2
\]
is minimal, and return $\hat{w}(T)$.
- Gives you exactly what you want (for a given value $k$).
- Only feasible for very small $k$, since the complexity scales with $\binom{p}{k}$. (It is an NP-hard optimization problem.)
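A brute-force sketch using `itertools`; the helper names `ols_on_subset` and `subset_selection` are ours:

```python
from itertools import combinations

def ols_on_subset(X, y, T):
    """OLS restricted to the variables in T, embedded back into R^p."""
    T = list(T)
    w = np.zeros(X.shape[1])
    w[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    return w

def subset_selection(X, y, k):
    """Try all (p choose k) subsets; only feasible for very small k."""
    best = min(combinations(range(X.shape[1]), k),
               key=lambda T: np.sum((y - X @ ols_on_subset(X, y, T)) ** 2))
    return ols_on_subset(X, y, best)
```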
Forward stepwise regression (greedy strategy). Starting with $T = \emptyset$, repeat until $|T| = k$: pick the $j \in \{1, 2, \dots, p\} \setminus T$ for which
\[
\| y - X \hat{w}(T \cup \{j\}) \|_2^2
\]
is minimal, and add this $j$ to $T$. Return $\hat{w}(T)$.
- Gives you a $k$-sparse solution, with no shrinkage within $T$.
- Primarily effective only when the columns of $X$ are close to orthogonal.
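The greedy variant, as a sketch built on the same `ols_on_subset` helper; each round costs at most $p$ least-squares fits, so roughly $kp$ fits in total rather than $\binom{p}{k}$:

```python
def forward_stepwise(X, y, k):
    """Grow T one variable at a time, greedily minimizing residual error."""
    T = []
    for _ in range(k):
        rest = (j for j in range(X.shape[1]) if j not in T)
        j_best = min(rest, key=lambda j: np.sum(
            (y - X @ ols_on_subset(X, y, T + [j])) ** 2))
        T.append(j_best)
    return ols_on_subset(X, y, T)
```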
Lasso (Tibshirani, 1994)

Lasso: least absolute shrinkage and selection operator.
\[
\hat{w}_\lambda^{\mathrm{lasso}} := \arg\min_{w \in \mathbb{R}^p} \ \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_1 .
\]
(The objective is convex, though not differentiable.)

The objective (as a function of $w$) can be written as
\[
\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_1 + (\text{terms not depending on } w) .
\]
[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and diamond-shaped level sets of $\lambda \| w \|_1$ centered at $0$.]
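The lasso has no closed form, but since the objective is convex it can be minimized by proximal gradient descent (ISTA), where the $\ell_1$ term contributes a coordinate-wise soft-thresholding step. A minimal sketch (function names ours; in practice one would typically use a specialized solver such as coordinate descent):

```python
def soft_threshold(z, t):
    """Coordinate-wise soft-thresholding: the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=5000):
    """Proximal gradient (ISTA) on (1/n)||y - Xw||_2^2 + lam*||w||_1."""
    n, p = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y) / n    # gradient of the squared-error term
        w = soft_threshold(w - grad / L, lam / L)
    return w
```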
Coefficient profile

[Figure. Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient value in $\hat{w}_\lambda$ for eight different variables.]
Lasso: theory

Many results, mostly roughly of the following flavor. Suppose
- $w_\star$ has $k$ non-zero entries;
- $X$ satisfies some special properties (typically not efficiently checkable);
- $\lambda \propto \sigma \sqrt{\dfrac{2 \log(p)}{n}}$;

then
\[
\mathbb{E}\Big[ \tfrac{1}{n} \| X w_\star - X \hat{w}_\lambda^{\mathrm{lasso}} \|_2^2 \Big]
\le O\Big( \frac{\sigma^2 k \log(p)}{n} \Big) .
\]
This is a very active subject of research; it is closely related to "compressed sensing" and intersects with the beautiful subject of high-dimensional convex geometry.
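A small illustration of this regime under the simulation model from earlier; the instance sizes and threshold are arbitrary illustrative choices, and `lasso_ista` is the sketch above:

```python
# A sparse instance with p > n (sizes are illustrative).
n, p, k, sigma = 200, 500, 5, 0.5
X = rng.normal(size=(n, p))
w_star = np.zeros(p)
w_star[:k] = 1.0                             # k non-zero entries
y = X @ w_star + sigma * rng.normal(size=n)

lam = sigma * np.sqrt(2 * np.log(p) / n)     # lambda on the order the theory suggests
w_hat = lasso_ista(X, y, lam)
print(np.mean((X @ (w_star - w_hat)) ** 2))  # in-sample risk: small despite p > n
print(np.flatnonzero(np.abs(w_hat) > 1e-6))  # support is (mostly) the first k coords
```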
Recap

- Fixed-design setting for studying linear regression methods.
- Ridge and principal components regression make use of the eigenvectors of $\tfrac{1}{n} X^\top X$ (the "principal component directions"), along with the corresponding eigenvalues.
- Ridge regression: shrink $\hat{w}_{\mathrm{ols}}$ along principal component directions by an amount related to the eigenvalue and $\lambda$.
- Principal components regression: keep-or-kill $\hat{w}_{\mathrm{ols}}$ along principal component directions, based on comparing the eigenvalue to $\lambda$.
- Sparse regression: intractable in general, but some greedy strategies can work.
- Lasso: shrink coefficients towards zero in a way that tends to lead to sparse solutions.