LARS/lasso

Statistics 312/612, fall 2010
Homework #9
Due: Monday 29 November

The questions on this sheet all refer to (a slight modification of) the lasso modification of the LARS algorithm, based on the papers by Efron, Hastie, Johnstone, and Tibshirani (2004) and by Rosset and Zhu (2007). I have made a few small changes to the algorithm described in the first paper. I hope I haven't broken anything.

The homework problems are embedded into an explanation of how and why the algorithm works. Look for the numbers [1], [2], ... in the left margin and the text enclosed in a box.

1 The Lasso problem

The problem is: given an $n \times 1$ vector $y$ and an $n \times p$ matrix $X$, find the $b(\lambda)$ that minimizes

$$L_\lambda(b) = \|y - Xb\|^2 + 2\lambda \sum_j |b_j|$$

for each $\lambda$. [The extra factor of 2 eliminates many factors of 2 in what follows.] The columns of $X$ will be assumed to be standardized to have zero means and $\|X_j\| = 1$.

Remark. I thought the vector $y$ was also supposed to have a zero mean. That is not the case for the diabetes data set in R, the data set used as an illustration in the LARS paper (Efron, Hastie, Johnstone, and Tibshirani 2004) where the LARS procedure was first analyzed.

The solution will turn out to be piecewise linear and continuous in $\lambda$. To test my modification I also used the diabetes data set:

    library(lars)
    data(diabetes)      # load the data set
    # ?diabetes         # for a description
    dd = diabetes
    LL = lars(diabetes$x, diabetes$y, type = "lasso")
    plot(LL, xvar = "step")   # the usual plot

My modified algorithm gets the same coefficients as lars() for this data set.
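As a small illustration of the setup in Section 1, the standardization of the columns of $X$ and the criterion $L_\lambda(b)$ might be coded as follows. This is only a sketch: the helper names `standardize` and `lasso_objective` and the simulated data are assumptions made for the example, not part of the handout.

```r
# Hypothetical helpers; the names are illustrative, not from the handout.

# Center each column of X and rescale it to unit Euclidean norm.
standardize <- function(X) {
  Xc <- sweep(X, 2, colMeans(X), "-")
  sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
}

# L_lambda(b) = ||y - X b||^2 + 2 * lambda * sum_j |b_j|
lasso_objective <- function(b, lambda, X, y) {
  sum((y - X %*% b)^2) + 2 * lambda * sum(abs(b))
}

set.seed(1)
X <- standardize(matrix(rnorm(60), nrow = 20, ncol = 3))
y <- rnorm(20) + 5                  # y is deliberately not centered
print(round(colMeans(X), 12))       # column means, zero up to rounding
print(colSums(X^2))                 # squared column norms, one up to rounding
print(lasso_objective(rep(0, 3), lambda = 1, X, y))   # equals sum(y^2) at b = 0
```

Because the columns of $X$ are centered, $X_j^\top y$ is unchanged when a constant is added to $y$, which is why an uncentered $y$ causes no trouble for the quantities used later.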

Remark. For the sake of exposition, the discussion below assumes that the predictors enter the active set in a way slightly different from what happens for diabetes. I will consider only the "one at a time" case (page 417 of the LARS paper), for which the active set of predictors $X_j$ changes only by either addition or deletion of a single predictor.

2 Directional derivative

The function $L_\lambda$ has derivative at $b$ in the direction $u$ defined by

<1> $$L_\lambda'(b, u) = \lim_{t \downarrow 0} \frac{L_\lambda(b + tu) - L_\lambda(b)}{t} = 2 \sum_j \Big( \lambda f(b_j, u_j) - (y - Xb)^\top X_j u_j \Big)$$

where

$$f(b_j, u_j) = u_j \mathbf{1}\{b_j > 0\} - u_j \mathbf{1}\{b_j < 0\} + |u_j| \mathbf{1}\{b_j = 0\}.$$

By convexity, a vector $b$ minimizes $L_\lambda$ if and only if $L_\lambda'(b, u) \ge 0$ for every $u$. Equivalently, for every $j$ and every $u_j$ the $j$th summand in <1> must be nonnegative. [Consider $u$ vectors with only one nonzero component to establish this equivalence.] That is, $b$ minimizes $L_\lambda$ if and only if

$$\lambda f(b_j, u_j) \ge (y - Xb)^\top X_j u_j \quad \text{for every } j \text{ and every } u_j.$$

When $b_j \ne 0$ the inequalities for $u_j = \pm 1$ imply an equality; for $b_j = 0$ we get only the inequality. Thus $b$ minimizes $L_\lambda$ if and only if

$$\begin{cases} \lambda = X_j^\top R & \text{if } b_j > 0 \\ \lambda = -X_j^\top R & \text{if } b_j < 0 \\ \lambda \ge |X_j^\top R| & \text{if } b_j = 0 \end{cases} \qquad \text{where } R := y - Xb.$$

The LARS/lasso algorithm recursively calculates a sequence of breakpoints $\infty > \lambda_1 > \lambda_2 > \cdots$ with $b(\lambda)$ linear for each interval $\lambda_{k+1} \le \lambda \le \lambda_k$. Define residual vector and "correlations"

$$R(\lambda) := y - X b(\lambda) \quad \text{and} \quad C_j(\lambda) := X_j^\top R(\lambda).$$

Remark. To get a true correlation we would have to divide by $\|R(\lambda)\|$, which would complicate the constraints.
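The optimality conditions of Section 2 are easy to check numerically for a candidate $b$. The sketch below does so under stated assumptions: the function name `is_lasso_minimizer` and its tolerance are hypothetical, and the test design has orthonormal columns, a special case in which the minimizer is known to be the soft-thresholding of $X^\top y$ at $\lambda$.

```r
# Check: lambda = X_j'R if b_j > 0, lambda = -X_j'R if b_j < 0,
#        lambda >= |X_j'R| if b_j = 0, where R = y - X b.
is_lasso_minimizer <- function(b, lambda, X, y, tol = 1e-8) {
  C <- drop(crossprod(X, y - X %*% b))   # C_j = X_j' R
  pos <- b > 0; neg <- b < 0; zero <- b == 0
  all(abs(C[pos] - lambda) <= tol) &&
    all(abs(C[neg] + lambda) <= tol) &&
    all(abs(C[zero]) <= lambda + tol)
}

# Toy design with orthonormal, zero-mean columns (an assumption made only
# so that the minimizer has a closed form).
set.seed(2)
M <- scale(matrix(rnorm(90), nrow = 30, ncol = 3), center = TRUE, scale = FALSE)
X <- qr.Q(qr(M))
y <- rnorm(30)
lambda <- 0.5
c0 <- drop(crossprod(X, y))
b_hat <- sign(c0) * pmax(abs(c0) - lambda, 0)   # soft-thresholding at lambda

print(is_lasso_minimizer(b_hat, lambda, X, y))         # TRUE
print(is_lasso_minimizer(b_hat + 0.1, lambda, X, y))   # FALSE
```

For a general (non-orthogonal) $X$ there is no closed form, but the same check applies to whatever $b(\lambda)$ an algorithm produces.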

<2> The algorithm will ensure that

$$\begin{aligned} \lambda &= C_j(\lambda) && \text{if } b_j(\lambda) > 0 && (\text{constraint } \oplus) \\ \lambda &= -C_j(\lambda) && \text{if } b_j(\lambda) < 0 && (\text{constraint } \ominus) \\ \lambda &\ge |C_j(\lambda)| && \text{if } b_j(\lambda) = 0 && (\text{constraint } \odot) \end{aligned}$$

That is, for the minimizing $b(\lambda)$ each $(\lambda, C_j(\lambda))$ needs to stay inside the region

$$\mathcal{R} = \{(\lambda, c) \in \mathbb{R}^+ \times \mathbb{R} : |c| \le \lambda\},$$

moving along the top boundary ($c = \lambda$) when $b_j(\lambda) > 0$ (constraint $\oplus$), along the lower boundary ($c = -\lambda$) when $b_j(\lambda) < 0$ (constraint $\ominus$), and being anywhere in $\mathcal{R}$ when $b_j(\lambda) = 0$ (constraint $\odot$).

[Figure: Modified lasso algorithm for diabetes data. Two panels, the bj's and the Cj's, with paths labelled by predictor (age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu). See the file lassofull.pdf in the Handouts directory for higher resolution.]

3 The algorithm

The solution $b(\lambda)$ is to be constructed in a sequence of steps, starting with large $\lambda$ and working towards $\lambda = 0$.

First step. Define $\lambda_1 = \max_j |X_j^\top y|$. For $\lambda \ge \lambda_1$ take $b(\lambda) = 0$, so that $|C_j(\lambda)| \le \lambda_1$. Constraint $\odot$ would be violated if we kept $b(\lambda)$ equal to zero for $\lambda < \lambda_1$; the $b(\lambda)$ must move away from zero as $\lambda$ decreases below $\lambda_1$. We must have $|C_j(\lambda_1)| = \lambda_1$ for at least one $j$. For convenience of exposition, suppose $C_1(\lambda_1) = \lambda_1 > |C_j(\lambda_1)|$ for all $j \ge 2$. The active set is now $A = \{1\}$.
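The first step can be sketched in a few lines of R. The simulated data and variable names are assumptions made for the example. Unlike the text, the sketch does not assume that predictor 1 enters first or that its correlation is positive; it uses the sign $s$ of the largest $|C_j|$ instead.

```r
set.seed(3)
n <- 40; p <- 5
X <- matrix(rnorm(n * p), n, p)
X <- sweep(X, 2, colMeans(X), "-")            # zero column means
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")     # unit column norms
y <- drop(X %*% c(3, -2, 0, 0, 1)) + rnorm(n)

C0 <- drop(crossprod(X, y))      # C_j(lambda) while b(lambda) = 0
lambda1 <- max(abs(C0))          # first breakpoint
first <- which.max(abs(C0))      # predictor that starts the active set
s <- sign(C0[first])

# b = 0 satisfies |C_j| <= lambda exactly when lambda >= lambda1
zero_is_optimal <- function(lambda) all(abs(C0) <= lambda)
print(zero_is_optimal(lambda1))          # TRUE
print(zero_is_optimal(0.99 * lambda1))   # FALSE

# Moving the active coefficient at unit rate keeps C_first on the boundary
lambda <- 0.9 * lambda1
b <- numeric(p)
b[first] <- s * (lambda1 - lambda)
C <- drop(crossprod(X, y - X %*% b))
print(all.equal(C[first], s * lambda))   # TRUE
```

The last check is pure algebra ($\|X_j\| = 1$), so it holds whether or not $0.9\lambda_1$ lies above the next breakpoint; only the constraints for the inactive predictors depend on that.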

For $\lambda_2 \le \lambda < \lambda_1$, with $\lambda_2$ to be specified soon, keep $b_j = 0$ for $j \ge 2$ but let $b_1(\lambda) = 0 + v_1(\lambda_1 - \lambda)$ for some constant $v_1$. We need $C_1(\lambda) = \lambda$ for a while, which forces $v_1 = 1$, because $\|X_1\| = 1$ gives $C_1(\lambda) = \lambda_1 - v_1(\lambda_1 - \lambda)$. That choice gives

$$R(\lambda) = y - X_1(\lambda_1 - \lambda) \quad \text{and} \quad C_j(\lambda) = C_j(\lambda_1) - X_j^\top X_1 (\lambda_1 - \lambda).$$

Notice that $b_1(\lambda) > 0$ so that $\oplus$ (the constraint for positive coefficients) is the relevant constraint for $b_1$. Let $\lambda_2$ be the largest $\lambda$ less than $\lambda_1$ for which $\max_{j \ge 2} |C_j(\lambda)| = \lambda$.

*[1] Find the value $\lambda_2$. That is, express $\lambda_2$ as a function of the $C_j$'s. Hint: Consider Efron, Hastie, Johnstone, and Tibshirani (2004, equation 2.13), with appropriate changes of notation.

Second step. We have $C_1(\lambda_2) = \lambda_2$, by construction. For convenience of exposition, suppose $C_2(\lambda_2) = \lambda_2 > |C_j(\lambda_2)|$ for all $j \ge 3$. The active set is now $A = \{1, 2\}$. For $\lambda_3 \le \lambda < \lambda_2$ choose a new $v_1$ and a $v_2$ then define

$$\begin{aligned} b_1(\lambda) &= b_1(\lambda_2) + (\lambda_2 - \lambda) v_1 \\ b_2(\lambda) &= 0 + (\lambda_2 - \lambda) v_2 \end{aligned}$$

with all other $b_j$'s still zero. Write $Z$ for $[X_1, X_2]$. The new $C_j$'s become

$$C_j(\lambda) = X_j^\top \big( y - X_1 b_1(\lambda) - X_2 b_2(\lambda) \big) = C_j(\lambda_2) - (\lambda_2 - \lambda) X_j^\top Z v$$

where $v = (v_1, v_2)^\top$.

*[2] Show that $C_1(\lambda) = C_2(\lambda) = \lambda$ if we choose $v$ to make $Z^\top Z v = (1, 1)^\top$. That is, $v = (Z^\top Z)^{-1} \mathbf{1}$ with $\mathbf{1} = (1, 1)^\top$. Of course we must assume that $X_1$ and $X_2$ are linearly independent for $Z^\top Z$ to have an inverse.

*[3] Show that $v_1$ and $v_2$ are both positive, so that $\oplus$ is the relevant constraint for both $b_1$ and $b_2$.

Let $\lambda_3$ be the largest $\lambda$ less than $\lambda_2$ for which $\max_{j \ge 3} |C_j(\lambda)| = \lambda$.

Third step. We have $C_1(\lambda_3) = C_2(\lambda_3) = \lambda_3$ and $\max_{j \ge 3} |C_j(\lambda_3)| = \lambda_3$. Consider now what happens if the last equality holds because a $C_j$ has hit the lower boundary of the region $\mathcal{R} = \{(\lambda, c) : |c| \le \lambda\}$: suppose $-C_3(\lambda_3) = \lambda_3 > \max_{j \ge 4} |C_j(\lambda_3)|$. The active set then becomes $A = \{1, 2, 3\}$. For $\lambda_4 \le \lambda < \lambda_3$ choose a new $v_1$, $v_2$ and a $v_3$ to make

$$\begin{aligned} b_1(\lambda) &= b_1(\lambda_3) + (\lambda_3 - \lambda) v_1 \\ b_2(\lambda) &= b_2(\lambda_3) + (\lambda_3 - \lambda) v_2 \\ b_3(\lambda) &= 0 - (\lambda_3 - \lambda) v_3 \end{aligned}$$

with all other $b_j$'s still zero. More concisely, with $s_j = \mathrm{sign}(C_j(\lambda_3))$,

$$b_j(\lambda) = b_j(\lambda_3) + (\lambda_3 - \lambda) v_j s_j \quad \text{for } j \in A.$$

For $\lambda_3 - \lambda$ small enough, we still have $b_1(\lambda) > 0$ and $b_2(\lambda) > 0$, keeping $\oplus$ as the relevant constraint for $b_1$ and $b_2$. To ensure that $b_3(\lambda) < 0$ we need $v_3 > 0$.

Define $Z = [X_1, X_2, -X_3]$.

*[4] For $\lambda_4 \le \lambda \le \lambda_3$ show that

$$R(\lambda) - R(\lambda_3) = -\sum_{j \in A} Z_j v_j (\lambda_3 - \lambda)$$

so that

$$C_j(\lambda) = C_j(\lambda_3) - (\lambda_3 - \lambda) X_j^\top Z v.$$

Show that $v = (Z^\top Z)^{-1} \mathbf{1}$ keeps $b_1$, $b_2$, and $b_3$ on the correct boundaries. How should we choose $\lambda_4$? Is it possible for some active $b_j$ to switch sign?

[5] Prove that $v_3 > 0$. I am not yet clear on why this must be true. Efron, Hastie, Johnstone, and Tibshirani (2004, Lemma 4) seems to contain the relevant argument.

The general step. Just before $\lambda = \lambda_{k+1}$ suppose the active set is $A$, so that $|C_j(\lambda)| = \lambda$ for each $j$ in $A$. For each $j$ in $A$ let $s_j = \mathrm{sign}(C_j(\lambda_k))$. For $\lambda_{k+1} \le \lambda \le \lambda_k$ define

$$b_j(\lambda) = b_j(\lambda_k) + (\lambda_k - \lambda) v_j s_j \quad \text{for } j \in A.$$

*[6] Define $Z$ as the matrix with $j$th column equal to $s_j X_j$ for each $j \in A$. For $\lambda_{k+1} \le \lambda \le \lambda_k$ show that

$$C_j(\lambda) = C_j(\lambda_k) - (\lambda_k - \lambda) X_j^\top Z v$$

so that the choice $v = (Z^\top Z)^{-1} \mathbf{1}$ keeps $C_j(\lambda) = s_j \lambda$ for $j \in A$.

Define $\lambda_{k+1}$ to be the largest $\lambda$ less than $\lambda_k$ for which either of these conditions holds:

(i) $\max_{j \notin A} |C_j(\lambda)| = \lambda$. In that case add the new $j \in A^c$ for which $|C_j(\lambda_{k+1})| = \lambda_{k+1}$ to the active set, then carry out another general step.

(ii) $b_j(\lambda) = 0$ for some $j \in A$. In that case, remove $j$ from the active set, then carry out another general step.

For diabetes, the second alternative caused the behavior shown below [see lassoending.pdf for higher resolution] for $1.3 \le \lambda \le 2.2$.

[Figure: Modified lasso algorithm for diabetes data. Two panels, the bj's and the Cj's, with paths labelled by predictor (age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu).]

*[7] Find $\lambda_{k+1}$. That is, write out what an R program would have to calculate.

Note that this step is slightly tricky. If predictor $j$ is removed from the active set because $b_j = 0$ then the corresponding $C_j$ will lie on the boundary of $\mathcal{R}$. You will need to interpret (i) with care to avoid $\lambda_{k+1} = \lambda_k$.

[8] Show that the coefficient $b_j(\lambda)$, if $j$ enters the active set, has the correct sign. Again Efron, Hastie, Johnstone, and Tibshirani (2004, Lemma 4) seems to contain the relevant argument.

If you think you understand the modified algorithm, as described above, try to submit your solution to the next problem to me by Friday 3 December.

[9] Write an R program (a script or a function) that implements the modified algorithm. Test it out on the diabetes data set.
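One possible reading of the general step, including the two stopping conditions (i) and (ii), is sketched below. It is not the author's program: the function name `lasso_path`, the relative tolerance, and the simulated data are all assumptions, and the sketch covers only the one-at-a-time case with a standardized $X$ of full column rank. Writing $d = \lambda_k - \lambda$ and $a_j = X_j^\top Z v$, an inactive $C_j$ reaches $+\lambda$ when $d = (\lambda_k - C_j)/(1 - a_j)$ and $-\lambda$ when $d = (\lambda_k + C_j)/(1 + a_j)$, while an active $b_j$ returns to zero when $d = -b_j/(v_j s_j)$. Candidates with $d$ not strictly positive are discarded, which is one way to avoid $\lambda_{k+1} = \lambda_k$ after a removal.

```r
# Sketch only: one-at-a-time case, X standardized and of full column rank.
lasso_path <- function(X, y, max_steps = 10 * ncol(X)) {
  p <- ncol(X)
  b <- numeric(p)
  C <- drop(crossprod(X, y))            # C_j(lambda) while b = 0
  lam <- max(abs(C))                    # lambda_1
  tol <- 1e-9 * lam                     # guards against zero-length steps
  A <- which.max(abs(C))                # initial active set
  path <- list(lambda = lam, b = matrix(b, nrow = 1))
  for (k in seq_len(max_steps)) {
    if (lam <= tol) break
    s <- sign(C[A])                     # s_j = sign(C_j(lambda_k)), j in A
    Z <- sweep(X[, A, drop = FALSE], 2, s, "*")   # columns s_j X_j
    v <- drop(solve(crossprod(Z), rep(1, length(A))))
    a <- drop(crossprod(X, Z %*% v))    # a_j = X_j' Z v
    out <- setdiff(seq_len(p), A)
    # Step lengths d = lambda_k - lambda at which an event would occur
    d <- c((lam - C[out]) / (1 - a[out]),   # (i) C_j reaches +lambda
           (lam + C[out]) / (1 + a[out]),   # (i) C_j reaches -lambda
           -b[A] / (v * s))                 # (ii) active b_j returns to 0
    j <- c(out, out, A)
    drop_event <- rep(c(FALSE, TRUE), c(2 * length(out), length(A)))
    ok <- is.finite(d) & d > tol & d < lam - tol
    d_step <- if (any(ok)) min(d[ok]) else lam   # no event: run to lambda = 0
    b[A] <- b[A] + d_step * v * s
    lam <- lam - d_step
    if (any(ok)) {
      hit <- which(ok)[which.min(d[ok])]
      if (drop_event[hit]) {
        b[j[hit]] <- 0                  # remove j from the active set
        A <- setdiff(A, j[hit])
      } else {
        A <- c(A, j[hit])               # add j to the active set
      }
    }
    C <- drop(crossprod(X, y - X %*% b))
    path$lambda <- c(path$lambda, lam)
    path$b <- rbind(path$b, b, deparse.level = 0)
  }
  path
}

set.seed(4)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)
X <- sweep(X, 2, colMeans(X), "-")
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
y <- drop(X %*% c(4, -3, 0, 2)) + rnorm(n)
path <- lasso_path(X, y)
print(round(cbind(lambda = path$lambda, path$b), 3))
```

Each row of the printed matrix is a breakpoint $\lambda_k$ followed by $b(\lambda_k)$; the last row, at $\lambda = 0$, should agree with the least squares coefficients. For the diabetes data the natural check is to compare these rows with the coefficients reported by lars().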

References

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2).

Rosset, S. and J. Zhu (2007). Piecewise linear regularized solution paths. The Annals of Statistics 35(3).
