COMS 4771 Lecture: 1. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso
Fixed-design linear regression
A simplified fixed-design setting: $x^{(1)}, x^{(2)}, \dots, x^{(n)} \in \mathbb{R}^p$ are assumed to be fixed, i.e., not random; $y^{(1)}, y^{(2)}, \dots, y^{(n)}$ are independent random variables, with $\mathbb{E}(y^{(i)}) = \langle x^{(i)}, w^* \rangle$ and $\operatorname{var}(y^{(i)}) = \sigma^2$.

In matrix form,
$y = X w^* + \varepsilon$,
where $y = (y^{(1)}, \dots, y^{(n)}) \in \mathbb{R}^n$, the rows of $X \in \mathbb{R}^{n \times p}$ are $x^{(1)}, \dots, x^{(n)}$, $w^* = (w^*_1, \dots, w^*_p) \in \mathbb{R}^p$, and $\varepsilon = (\varepsilon^{(1)}, \dots, \varepsilon^{(n)}) \in \mathbb{R}^n$ has independent entries with $\mathbb{E}(\varepsilon^{(i)}) = 0$ and $\operatorname{var}(\varepsilon^{(i)}) = \sigma^2$.

Want to find $\hat{w} \in \mathbb{R}^p$ based on $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})$ so that
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w} \|_2^2 \right]$
is small (e.g., $o(1)$ as a function of $n$).
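To make the setting concrete, here is a minimal simulation sketch of the fixed-design model; the sizes $n$, $p$, the noise level, and the target $w^*$ below are arbitrary illustration choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5           # arbitrary illustration values
X = rng.normal(size=(n, p))         # design matrix, treated as fixed from here on
w_star = rng.normal(size=p)         # unknown target coefficients w^*

def sample_y(X, w_star, sigma, rng):
    """Draw one response vector y = X w^* + eps with independent mean-zero noise."""
    eps = rng.normal(scale=sigma, size=X.shape[0])
    return X @ w_star + eps

y = sample_y(X, w_star, sigma, rng)
```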
Ordinary least squares

Recall: assuming $X^\top X$ is invertible,
$\hat{w}_{\mathrm{ols}} := (X^\top X)^{-1} X^\top y$.

Expressing $X \hat{w}_{\mathrm{ols}}$ in terms of $X w^*$:
$X \hat{w}_{\mathrm{ols}} = X (X^\top X)^{-1} X^\top (X w^* + \varepsilon) = X w^* + X (X^\top X)^{-1} X^\top \varepsilon = X w^* + \Pi \varepsilon$,
where $\Pi \in \mathbb{R}^{n \times n}$ is the orthogonal projection operator for $\operatorname{ran}(X)$.

Therefore
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_{\mathrm{ols}} \|_2^2 \right] = \frac{\sigma^2 p}{n}$.
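As a sanity check, the $\sigma^2 p / n$ risk can be verified numerically by averaging over repeated noise draws for a fixed design; this is an illustrative sketch with arbitrary sizes, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5
X = rng.normal(size=(n, p))                   # fixed design
w_star = rng.normal(size=p)

def ols(X, y):
    """Ordinary least squares via the normal equations (X^T X assumed invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Monte Carlo estimate of E[(1/n) ||X w^* - X w_ols||^2] over fresh noise draws.
risks = []
for _ in range(2000):
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    risks.append(np.mean((X @ (w_star - ols(X, y))) ** 2))

print(np.mean(risks), sigma**2 * p / n)       # the two values should be close
```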
Ridge regression and principal components regression
Ridge regression

Ordinary least squares only applies if $X^\top X$ is invertible, which is not possible if $p > n$.

Ridge regression (Hoerl, 1962): for $\lambda > 0$,
$\hat{w}_\lambda := \arg\min_{w \in \mathbb{R}^p} \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_2^2$.

The gradient of the objective is zero when
$(X^\top X + n \lambda I) w = X^\top y$,
which always has a unique solution (since $\lambda > 0$):
$\hat{w}_\lambda := (X^\top X + n \lambda I)^{-1} X^\top y = \left( \tfrac{1}{n} X^\top X + \lambda I \right)^{-1} \left( \tfrac{1}{n} X^\top y \right)$.
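A direct implementation of this closed form is short; the sketch below (with arbitrary sizes) also illustrates that the estimator is well defined even when $p > n$.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + n*lam*I)^{-1} X^T y; well-defined for any lam > 0."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 30, 100                       # p > n: X^T X is singular, so OLS is unavailable
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w_lam = ridge(X, y, lam=0.1)         # still a unique solution
```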
Ridge regression: geometry

The ridge regression objective (as a function of $w$) can be written as
$\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_2^2 + (\text{stuff not depending on } w)$.

[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and level sets of $\lambda \| w \|_2^2$ centered at $0$.]
Ridge regression: eigendecomposition

Write the eigendecomposition of $\tfrac{1}{n} X^\top X$ as
$\tfrac{1}{n} X^\top X = \sum_{j=1}^{p} \lambda_j v_j v_j^\top$,
where $v_1, v_2, \dots, v_p \in \mathbb{R}^p$ are orthonormal eigenvectors with corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$.

The eigenvectors $v_1, v_2, \dots, v_p$ comprise an orthonormal basis for $\mathbb{R}^p$. We'll look at $\hat{w}_{\mathrm{ols}}$ and $w^*$ in this basis: e.g.,
$\hat{w}_{\mathrm{ols}} = \sum_{j=1}^{p} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

The inverse of $\tfrac{1}{n} X^\top X + \lambda I$ has the form
$\left( \tfrac{1}{n} X^\top X + \lambda I \right)^{-1} = \sum_{j=1}^{p} \frac{1}{\lambda_j + \lambda} v_j v_j^\top$.
Ridge regression vs. ordinary least squares

If $\hat{w}_{\mathrm{ols}}$ exists, then
$\hat{w}_\lambda = \left( \tfrac{1}{n} X^\top X + \lambda I \right)^{-1} \left( \tfrac{1}{n} X^\top X \right) \hat{w}_{\mathrm{ols}}$
$= \left( \sum_{j=1}^{p} \tfrac{1}{\lambda_j + \lambda} v_j v_j^\top \right) \left( \sum_{j=1}^{p} \lambda_j v_j v_j^\top \right) \hat{w}_{\mathrm{ols}}$
$= \left( \sum_{j=1}^{p} \tfrac{\lambda_j}{\lambda_j + \lambda} v_j v_j^\top \right) \hat{w}_{\mathrm{ols}}$ (by orthogonality)
$= \sum_{j=1}^{p} \tfrac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

Interpretation: shrink $\hat{w}_{\mathrm{ols}}$ towards zero by a factor of $\frac{\lambda_j}{\lambda_j + \lambda}$ in direction $v_j$.
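The shrinkage formula can be checked numerically against the closed form; the following sketch (arbitrary synthetic data) computes $\hat{w}_\lambda$ both ways and confirms they match.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 0.1
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
evals, V = np.linalg.eigh(X.T @ X / n)           # columns of V are eigenvectors v_j

# Shrink w_ols by lambda_j / (lambda_j + lam) along each eigenvector direction.
shrink = evals / (evals + lam)
w_ridge_spectral = V @ (shrink * (V.T @ w_ols))

w_ridge_direct = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
print(np.allclose(w_ridge_spectral, w_ridge_direct))   # True
```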
Coefficient profile (ridge)

[Figure. Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient values in $\hat{w}_\lambda$ for eight different variables.]
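A profile like this can be traced by solving the ridge problem on a grid of $\lambda$ values; the sketch below uses an arbitrary synthetic design with eight variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lambdas = np.logspace(2, -4, 50)       # from large lambda to small lambda
path = np.array([np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
                 for lam in lambdas])  # one row of ridge coefficients per lambda
# Plotting each column of `path` against -log(lambda) reproduces a profile like the one above.
```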
Ridge regression: theory

Theorem:
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right] = \underbrace{\lambda \sum_{j=1}^{p} \frac{\lambda_j \lambda}{(\lambda_j + \lambda)^2} \langle v_j, w^* \rangle^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^{p} \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2}_{\text{variance}}$.

Corollary: since $\lambda_j \lambda / (\lambda_j + \lambda)^2 \leq 1/2$ and $\sum_{j=1}^{p} \lambda_j = \operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)$,
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right] \leq \frac{\lambda \| w^* \|_2^2}{2} + \frac{\sigma^2 \operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)}{2 n \lambda}$.

For instance, using $\lambda := \frac{\sigma}{\| w^* \|_2} \sqrt{\frac{\operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)}{n}}$ guarantees
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right] \leq \sigma \| w^* \|_2 \sqrt{\frac{\operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)}{n}}$.

No explicit dependence on $p$. The corollary can be meaningful even if $p = \infty$, as long as $\| w^* \|_2^2$ and $\operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)$ are finite.
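The bias and variance terms of the theorem can be evaluated exactly from the eigendecomposition and compared against a Monte Carlo estimate of the risk; this is an illustrative check with arbitrary sizes, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, lam = 200, 5, 0.5, 0.1
X = rng.normal(size=(n, p))
w_star = rng.normal(size=p)

evals, V = np.linalg.eigh(X.T @ X / n)
coords = V.T @ w_star                                   # <v_j, w^*> coordinates

bias = lam * np.sum(evals * lam / (evals + lam)**2 * coords**2)
var = sigma**2 / n * np.sum((evals / (evals + lam))**2)
exact_risk = bias + var

# Monte Carlo estimate of E[(1/n) ||X w^* - X w_lambda||^2] for comparison.
risks = []
for _ in range(2000):
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    w_lam = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
    risks.append(np.mean((X @ (w_star - w_lam)) ** 2))
print(exact_risk, np.mean(risks))                       # should roughly agree
```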
Principal components ("keep or kill") regression

Ridge regression as shrinkage:
$\hat{w}_\lambda = \sum_{j=1}^{p} \frac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

Another approach: principal components regression. Instead of shrinking $\hat{w}_{\mathrm{ols}}$ in all directions, either keep $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j \geq \lambda$) or kill $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j < \lambda$):
$\hat{w}_\lambda^{\mathrm{pc}} := \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

[Figure: shrinkage factor as a function of $\lambda_j$, for ridge and for p.c. regression.]
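A minimal keep-or-kill implementation, following the formula above and assuming $\hat{w}_{\mathrm{ols}}$ exists:

```python
import numpy as np

def pcr(X, y, lam):
    """Principal components ('keep or kill') regression at cut-off lam.

    Keeps the OLS component along eigenvectors of (1/n) X^T X whose eigenvalue is
    at least lam, and zeroes out the rest. Assumes X^T X is invertible.
    """
    n, p = X.shape
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    evals, V = np.linalg.eigh(X.T @ X / n)
    keep = (evals >= lam).astype(float)        # indicator 1{lambda_j >= lam}
    return V @ (keep * (V.T @ w_ols))
```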
Principal components regression: theory

Theorem:
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda^{\mathrm{pc}} \|_2^2 \right] = \underbrace{\sum_{j=1}^{p} \mathbf{1}\{\lambda_j < \lambda\} \, \lambda_j \langle v_j, w^* \rangle^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\}}_{\text{variance}} \;\leq\; 4 \, \mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right]$.

Should pick $\lambda$ large enough so that $\tfrac{1}{n} \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\} \ll 1$; here $\sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\}$ is the effective dimension at cut-off point $\lambda$.

Never (much) worse than ridge regression; often substantially better.
Sparse regression and Lasso
Sparsity

Another way to deal with $p > n$ is to only consider sparse $w$, i.e., $w$ with only a small number ($\ll p$) of non-zero entries.

Other advantages of sparsity (especially relative to ridge/p.c. regression):
- Sparse solutions are more interpretable.
- It can be more efficient to evaluate $\langle x, w \rangle$ (both in terms of computing variable values and computing the inner product).
Sparse regression methods

For any $T \subseteq \{1, 2, \dots, p\}$, let $\hat{w}(T) :=$ OLS only using the variables in $T$.

Subset selection (brute-force strategy; a sketch follows below). Pick the $T \subseteq \{1, 2, \dots, p\}$ of size $|T| = k$ for which $\| y - X \hat{w}(T) \|_2^2$ is minimal, and return $\hat{w}(T)$.
- Gives you exactly what you want (for a given value of $k$).
- Only feasible for very small $k$, since the complexity scales with $\binom{p}{k}$. (NP-hard optimization problem.)
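A brute-force sketch of subset selection for a given $k$, enumerating all size-$k$ subsets; illustrative only, since the cost grows as $\binom{p}{k}$ (helper names are arbitrary):

```python
import numpy as np
from itertools import combinations

def ols_on_subset(X, y, T):
    """OLS using only the columns indexed by T, returned as a full p-vector."""
    T = list(T)
    w = np.zeros(X.shape[1])
    w[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    return w

def best_subset(X, y, k):
    """Brute-force subset selection over all (p choose k) subsets of size k."""
    best_w, best_rss = None, np.inf
    for T in combinations(range(X.shape[1]), k):
        w = ols_on_subset(X, y, T)
        rss = np.sum((y - X @ w) ** 2)
        if rss < best_rss:
            best_w, best_rss = w, rss
    return best_w
```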
Sparse regression methods

Forward stepwise regression (greedy strategy; a sketch follows below). Starting with $T = \emptyset$, repeat until $|T| = k$: pick the $j \in \{1, 2, \dots, p\} \setminus T$ for which $\| y - X \hat{w}(T \cup \{j\}) \|_2^2$ is minimal, and add this $j$ to $T$. Return $\hat{w}(T)$.
- Gives you a $k$-sparse solution, with no shrinkage within $T$.
- Primarily only effective when the columns of $X$ are close to orthogonal.
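A greedy sketch of forward stepwise regression, following the description above (helper names are arbitrary):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily add the variable that most reduces the residual sum of squares."""
    T, remaining = [], set(range(X.shape[1]))
    while len(T) < k:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = T + [j]
            w = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        T.append(best_j)
        remaining.remove(best_j)
    w_full = np.zeros(X.shape[1])
    w_full[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    return w_full
```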
Lasso (Tibshirani, 1994)

Lasso: least absolute shrinkage and selection operator.
$\hat{w}_\lambda^{\mathrm{lasso}} := \arg\min_{w \in \mathbb{R}^p} \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_1$.
(Convex, though not differentiable.)

The objective (as a function of $w$) can be written as
$\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_1 + (\text{stuff not depending on } w)$.

[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and level sets of $\lambda \| w \|_1$ centered at $0$.]
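The lecture does not give an algorithm for the Lasso; one standard option is proximal gradient descent (ISTA) with entrywise soft-thresholding, sketched below for the objective $\tfrac{1}{n}\|y - Xw\|_2^2 + \lambda\|w\|_1$. If scikit-learn is available, `Lasso(alpha=lam/2, fit_intercept=False)` should find the same minimizer, since its objective puts a $\tfrac{1}{2n}$ factor on the squared loss.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Proximal gradient (ISTA) for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n    # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```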
Coefficient profile (lasso)

[Figure. Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient values in $\hat{w}_\lambda^{\mathrm{lasso}}$ for eight different variables.]
Lasso: theory

Many results, mostly roughly of the following flavor. Suppose
- $w^*$ has $k$ non-zero entries;
- $X$ satisfies some special properties (typically not efficiently checkable);
- $\lambda \approx \sigma \sqrt{\frac{2 \log(p)}{n}}$;
then
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda^{\mathrm{lasso}} \|_2^2 \right] \leq O\left( \frac{\sigma^2 k \log(p)}{n} \right)$.

Very active subject of research; closely related to "compressed sensing"; intersects with the beautiful subject of high-dimensional convex geometry.
Recap
- Fixed-design setting for studying linear regression methods.
- Ridge and principal components regression make use of the eigenvectors of $\tfrac{1}{n} X^\top X$ ("principal component directions"), and also the corresponding eigenvalues.
- Ridge regression: shrink $\hat{w}_{\mathrm{ols}}$ along the principal component directions by an amount related to the eigenvalue and $\lambda$.
- Principal components regression: keep-or-kill $\hat{w}_{\mathrm{ols}}$ along the principal component directions, based on comparing the eigenvalue to $\lambda$.
- Sparse regression: intractable, but some greedy strategies work.
- Lasso: shrink coefficients towards zero in a way that tends to lead to sparse solutions.