MATH 829: Introduction to Data Mining and Analysis Least angle regression
MATH 829: Introduction to Data Mining and Analysis
Least angle regression

Dominique Guillot
Department of Mathematical Sciences, University of Delaware
February 29, 2016
Least angle regression (LARS)

Recall the forward stagewise approach to linear regression:

1. Start with the intercept equal to ȳ, centered predictors, and all coefficients initially set to 0.
2. At each step, identify the variable most correlated with the current residual.
3. Compute the simple linear regression coefficient of the residual on this chosen variable, and add it to the current coefficient for that variable.
4. Continue until none of the variables have any correlation with the residuals.

This is a greedy approach. However, the solution often looks similar to the lasso solution. Is there a connection between the two methods?
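The stagewise procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from the course; the function name `forward_stagewise` and the stopping tolerance are my own choices, and the predictors are assumed centered and (approximately) standardized so that inner products play the role of correlations.

```python
import numpy as np

def forward_stagewise(X, y, n_steps=2000):
    """Forward stagewise linear regression on centered data.

    X : (n, p) matrix of centered predictors.
    y : centered response.
    Each step regresses the residual on the most-correlated predictor
    and adds the resulting simple-regression coefficient.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                       # residual, starts at the centered y
    for _ in range(n_steps):
        corr = X.T @ r                 # inner products with the residual
        j = np.argmax(np.abs(corr))    # most correlated variable
        if np.abs(corr[j]) < 1e-10:    # no remaining correlation: stop
            break
        # simple linear regression coefficient of r on x_j
        delta = corr[j] / (X[:, j] @ X[:, j])
        beta[j] += delta
        r -= delta * X[:, j]
    return beta
```

With the full simple-regression step taken at each iteration, this is a greedy coordinate descent on the squared loss, so on a well-conditioned problem it converges to the least-squares solution.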
Forward stagewise vs lasso

Example: prostate cancer data (see ESL). [Figure from Efron et al., 2003.]
LARS

Least angle regression (LARS) is similar to forward stagewise, but only enters "as much" of a predictor as it deserves (see ESL, Algorithm 3.2).
LARS (cont.)

Let A_k be the current active set, and let β_{A_k} be the coefficient vector at step k. Let r_k = y - X_{A_k} β_{A_k} denote the residual at step k. Then, at step k, we move the coefficients in the direction

δ_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k,

i.e., β_{A_k}(α) = β_{A_k} + α δ_k.
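The direction δ_k is just the least-squares fit of the current residual on the active predictors, so it can be computed without forming the normal equations explicitly. A minimal sketch (the helper name `lars_direction` is hypothetical):

```python
import numpy as np

def lars_direction(X_active, r):
    """delta_k = (X_A^T X_A)^{-1} X_A^T r_k, computed as the
    least-squares coefficients of the residual on the active predictors."""
    delta, *_ = np.linalg.lstsq(X_active, r, rcond=None)
    return delta
```

Using `lstsq` rather than inverting X_A^T X_A is the usual numerically stable choice when the active predictors are nearly collinear.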
Analysis of LARS

How does the correlation between the predictors and the residuals evolve along

δ_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k,    β_{A_k}(α) = β_{A_k} + α δ_k?

It remains the same for all predictors, and decreases monotonically. Indeed, suppose each predictor in a linear regression problem has equal correlation (in absolute value) with the response:

(1/n) ⟨x_j, y⟩ = λ,    j = 1, ..., p.

(Recall that we assume the predictors have been standardized.) Let β̂ be the least-squares coefficients of y on X, and let u(α) = α X β̂ for α ∈ [0, 1].
Analysis of LARS (cont.)

We have (1/n) ⟨x_j, y⟩ = λ and u(α) = α X β̂. Now,

((1/n) ⟨x_j, y - u(α)⟩)_{j=1}^p = (1/n) X^T (y - u(α))
                                = (1/n) X^T (y - α X (X^T X)^{-1} X^T y)
                                = (1/n) (1 - α) X^T y
                                = (1 - α) λ 1_p.

Therefore, the correlation between x_j and the residuals y - u(α) decreases linearly to 0. In LARS, the parameter α is increased until a new variable becomes equally correlated with the residuals y - u(α). The new variable is then added to the model, and a new direction is computed.
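The key identity X^T (y - u(α)) = (1 - α) X^T y follows from the normal equations X^T X β̂ = X^T y, and it holds for any full-rank design, not only under the equal-correlation hypothesis. A small numerical sanity check with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# least-squares fit and the path u(alpha) = alpha * X beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def resid_corr(alpha):
    """(1/n) X^T (y - u(alpha)): inner products of predictors with the residual."""
    u = alpha * X @ beta_hat
    return X.T @ (y - u) / n

# the inner products shrink linearly: equal to (1 - alpha) * (1/n) X^T y
checks = [np.allclose(resid_corr(a), (1 - a) * X.T @ y / n)
          for a in (0.0, 0.3, 0.7, 1.0)]
```

All four checks evaluate to True, confirming that every predictor's correlation with the residual is scaled by the same factor (1 - α).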
Analysis of LARS (cont.)

Example: Ĉ_k denotes the current maximal correlation. [Figure from Efron et al., 2003.]
Equiangular vector

Why "least angle" regression? Recall: β_{A_k}(α) = β_{A_k} + α δ_k. Thus,

ŷ(α) = X_{A_k} β_{A_k}(α) = X_{A_k} β_{A_k} + α X_{A_k} δ_k.

It is not hard to check that u_k := X_{A_k} δ_k makes equal angles with the predictors in A_k. Indeed,

X_{A_k}^T u_k = X_{A_k}^T X_{A_k} δ_k = X_{A_k}^T X_{A_k} (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k = X_{A_k}^T r_k.

The entries of the vector X_{A_k}^T r_k are all the same, since the predictors in A_k all have the same correlation with the residual r_k (by construction). Conclusion: u_k makes equal angles with the predictors in A_k.

Problem: in general, given v_1, ..., v_k ∈ R^n, how do we find a vector that makes equal angles with v_1, ..., v_k? When is this possible?
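A quick numerical illustration of the computation above: construct a residual whose inner products with a set of unit-norm predictors are all equal, and check that u_k = X_{A_k} δ_k then makes equal angles with them. The setup below is synthetic (random predictors, an arbitrary common inner product c = 0.5):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 60, 3
XA = rng.standard_normal((n, k))
XA /= np.linalg.norm(XA, axis=0)       # standardized (unit-norm) predictors

# build a residual with the same inner product c against every active predictor
c = 0.5
r = XA @ np.linalg.solve(XA.T @ XA, c * np.ones(k))

delta = np.linalg.solve(XA.T @ XA, XA.T @ r)   # LARS direction delta_k
u = XA @ delta                                  # u_k in response space

# since the columns have unit norm, these are the cosines of the angles
cosines = XA.T @ u / np.linalg.norm(u)
```

The vector `cosines` has identical entries, exactly as the identity X_{A_k}^T u_k = X_{A_k}^T r_k predicts.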
LARS and Lasso

LARS is closely related to stepwise regression. There is also a connection to the lasso. [Figure from ESL.] In that figure, the lasso coefficient profiles are almost identical to those of LARS in the left panel, and differ for the first time when the blue coefficient passes back through zero.
LARS and Lasso (cont.)

The previous observation suggests a modification of LARS (ESL, Algorithm 3.2a).

Theorem: The modified LARS (lasso) algorithm (Algorithm 3.2a) yields the solution of the lasso problem if variables appear/disappear one at a time.

See Efron et al., "Least angle regression", The Annals of Statistics, 2004.

Note: the theorem explains the piecewise linear nature of the lasso paths.
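One way to see the piecewise linearity concretely is the orthonormal-design special case, where the lasso solution is soft-thresholding of the least-squares coefficients. This is a standard fact used here purely as an illustration; the snippet below is not Algorithm 3.2a itself, and all numbers are arbitrary:

```python
import numpy as np

def soft_threshold(z, t):
    """Coordinatewise lasso solution when X has orthonormal columns."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
X = Q[:, :3]                           # orthonormal columns: X^T X = I
y = X @ np.array([3.0, -2.0, 0.5])

b_ols = X.T @ y                        # least-squares coefficients here
lambdas = np.linspace(0.0, 4.0, 9)
path = np.array([soft_threshold(b_ols, lam) for lam in lambdas])
```

Between the "kinks" at |b_ols_j|, each coordinate of the path is an affine function of the penalty level, and it is exactly zero beyond its kink; this is the piecewise linear behavior the theorem establishes in general.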
Consistency of the Lasso

Recall: we proved before that the least-squares estimator is consistent, i.e., β̂_n → β as the sample size n goes to infinity (under some assumptions). We now study analogous results for the lasso.

Assumptions:
- X_1, ..., X_p are (possibly dependent) random variables.
- |X_j| ≤ M almost surely for some M > 0 (j = 1, ..., p).
- Y = Σ_{j=1}^p β*_j X_j + ε for some (unknown) constants β*_j.
- ε ~ N(0, σ²) is independent of the X_j (σ² unknown).
- A sparsity assumption (specified later).
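These assumptions are easy to instantiate in a simulation. The concrete values of M, σ, and the sparse β* below are arbitrary choices for illustration, and uniform predictors are just one convenient way to get the boundedness |X_j| ≤ M:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, M, sigma = 500, 10, 1.0, 0.5
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.0, 0.5]             # sparse true coefficients

X = rng.uniform(-M, M, size=(n, p))          # |X_j| <= M almost surely
eps = rng.normal(0.0, sigma, size=n)         # Gaussian noise, independent of X
Y = X @ beta_star + eps                      # the assumed linear model
```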
Consistency of the Lasso (cont.)

We are given n iid observations Z_i = (Y_i, X_{i,1}, ..., X_{i,p}) of (Y, X_1, ..., X_p). Our goal is to recover β*_1, ..., β*_p as accurately as possible.

Let

Ŷ = Σ_{j=1}^p β*_j X_j,

the best predictor of Y if the true coefficients were known. Given coefficients β̃_1, ..., β̃_p, let

Ỹ = Σ_{j=1}^p β̃_j X_j.

Define the mean squared prediction error by MSPE(β̃) = E(Ŷ - Ỹ)². We will provide a bound on MSPE(β̃) when β̃ is the lasso solution.
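Since Ŷ - Ỹ = Σ_j (β*_j - β̃_j) X_j, the MSPE can be estimated by Monte Carlo and compared against the closed form (β* - β̃)^T E[X X^T] (β* - β̃). A sketch with independent Uniform(-1, 1) predictors, for which E[X X^T] = (1/3) I (the coefficient values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3
beta_star = np.array([1.0, 0.0, -2.0])       # "true" coefficients
beta_tilde = np.array([0.8, 0.1, -1.5])      # some candidate coefficients

# Monte Carlo estimate of MSPE = E[(Yhat - Ytilde)^2]
Xs = rng.uniform(-1, 1, size=(200_000, p))   # bounded predictors, |X_j| <= 1
mc = np.mean((Xs @ (beta_star - beta_tilde)) ** 2)

# closed form: d^T Cov(X) d with Cov(X) = (1/3) I for Uniform(-1, 1)
d = beta_star - beta_tilde
exact = d @ d / 3.0
```

With 200,000 draws the Monte Carlo estimate agrees with the exact value (here 0.1) to well within one percent.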
Consistency of the Lasso (cont.)

Given K > 0, let β̂^K = (β̂_1^K, ..., β̂_p^K) be the minimizer of

Σ_{i=1}^n (Y_i - β_1 X_{i,1} - ... - β_p X_{i,p})²

under the constraint Σ_{i=1}^p |β_i| ≤ K. (This problem is equivalent to the lasso.)

Theorem: Under the previous assumptions, and assuming Σ_{j=1}^p |β*_j| ≤ K for some K > 0 (the sparsity assumption), we have

MSPE(β̂^K) ≤ 2KMσ √(2 log(2p)/n) + 8K²M² √(2 log(2p²)/n).

See Chatterjee, "Assumptionless consistency of the Lasso", preprint, 2013.
MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates
More informationarxiv: v1 [math.st] 2 May 2007
arxiv:0705.0269v1 [math.st] 2 May 2007 Electronic Journal of Statistics Vol. 1 (2007) 1 29 ISSN: 1935-7524 DOI: 10.1214/07-EJS004 Forward stagewise regression and the monotone lasso Trevor Hastie Departments
More informationOn High-Dimensional Cross-Validation
On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING
More informationCOMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION
COMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION By Mazin Abdulrasool Hameed A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for
More informationStatistical View of Least Squares
May 23, 2006 Purpose of Regression Some Examples Least Squares Purpose of Regression Purpose of Regression Some Examples Least Squares Suppose we have two variables x and y Purpose of Regression Some Examples
More informationScatter plot of data from the study. Linear Regression
1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25
More informationLinear model selection and regularization
Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It
More informationECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes
ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes ED HERBST September 11, 2017 Background Hamilton, chapters 15-16 Trends vs Cycles A commond decomposition of macroeconomic time series
More informationMultiple linear regression S6
Basic medical statistics for clinical and experimental research Multiple linear regression S6 Katarzyna Jóźwiak k.jozwiak@nki.nl November 15, 2017 1/42 Introduction Two main motivations for doing multiple
More informationMATH 829: Introduction to Data Mining and Analysis Logistic regression
1/11 MATH 829: Introduction to Data Mining and Analysis Logistic regression Dominique Guillot Departments of Mathematical Sciences University of Delaware March 7, 2016 Logistic regression 2/11 Suppose
More informationThe lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding
Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty
More informationLecture 7: Modeling Krak(en)
Lecture 7: Modeling Krak(en) Variable selection Last In both time cases, we saw we two are left selection with selection problems problem -- How -- do How we pick do we either pick either subset the of
More informationMIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco
MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 08: Sparsity Based Regularization Lorenzo Rosasco Learning algorithms so far ERM + explicit l 2 penalty 1 min w R d n n l(y
More informationSummary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club
Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report
More informationLeast Squares Regression
Least Squares Regression Chemical Engineering 2450 - Numerical Methods Given N data points x i, y i, i 1 N, and a function that we wish to fit to these data points, fx, we define S as the sum of the squared
More informationOslo Class 6 Sparsity based regularization
RegML2017@SIMULA Oslo Class 6 Sparsity based regularization Lorenzo Rosasco UNIGE-MIT-IIT May 4, 2017 Learning from data Possible only under assumptions regularization min Ê(w) + λr(w) w Smoothness Sparsity
More informationScatter plot of data from the study. Linear Regression
1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models
1/13 MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models Dominique Guillot Departments of Mathematical Sciences University of Delaware May 4, 2016 Recall
More informationRidge and Lasso Regression
enote 8 1 enote 8 Ridge and Lasso Regression enote 8 INDHOLD 2 Indhold 8 Ridge and Lasso Regression 1 8.1 Reading material................................. 2 8.2 Presentation material...............................
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationCOMS 4771 Lecture Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso
COMS 477 Lecture 6. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso / 2 Fixed-design linear regression Fixed-design linear regression A simplified
More informationAPPROXIMATING THE COMPLEXITY MEASURE OF. Levent Tuncel. November 10, C&O Research Report: 98{51. Abstract
APPROXIMATING THE COMPLEXITY MEASURE OF VAVASIS-YE ALGORITHM IS NP-HARD Levent Tuncel November 0, 998 C&O Research Report: 98{5 Abstract Given an m n integer matrix A of full row rank, we consider the
More informationAssignment 3. Introduction to Machine Learning Prof. B. Ravindran
Assignment 3 Introduction to Machine Learning Prof. B. Ravindran 1. In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively
More informationS-PLUS and R package for Least Angle Regression
S-PLUS and R package for Least Angle Regression Tim Hesterberg, Chris Fraley Insightful Corp. Abstract Least Angle Regression is a promising technique for variable selection applications, offering a nice
More informationA New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives Paul Grigas May 25, 2016 1 Boosting Algorithms in Linear Regression Boosting [6, 9, 12, 15, 16] is an extremely
More informationx 1 + 4x 2 = 5, 7x 1 + 5x 2 + 2x 3 4,
LUNDS TEKNISKA HÖGSKOLA MATEMATIK LÖSNINGAR LINJÄR OCH KOMBINATORISK OPTIMERING 2018-03-16 1. a) The rst thing to do is to rewrite the problem so that the right hand side of all constraints are positive.
More informationRegression and Statistical Inference
Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF
More informationWeighted Least Squares
Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w
More informationLEAST ANGLE REGRESSION 469
LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of Lasso
More informationAMS 7 Correlation and Regression Lecture 8
AMS 7 Correlation and Regression Lecture 8 Department of Applied Mathematics and Statistics, University of California, Santa Cruz Suumer 2014 1 / 18 Correlation pairs of continuous observations. Correlation
More informationWithin Group Variable Selection through the Exclusive Lasso
Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,
More information