LARS/lasso — Statistics 312/612, Fall 2010

Homework 9 stepped you through the lasso modification of the LARS algorithm, based on the papers by Efron, Hastie, Johnstone, and Tibshirani (2004) (= the LARS paper) and by Rosset and Zhu (2007). I made a few small changes to the algorithm in the LARS paper. I used the diabetes data set in R (the data used as an illustration in the LARS paper) to test my modification:

    library(lars)
    data(diabetes)                          # load the data set
    ll = lars(diabetes$x, diabetes$y, type="lasso")
    plot(ll, xvar="step")                   # one variation on the usual plot

[Figure: LASSO standardized coefficients plotted against step number.]

My modified algorithm getlasso() produces the same sequence of coefficients as the R function lars() when run on the diabetes data set:

    jj = getlasso(diabetes$y, diabetes$x)
    round(ll$beta[-1,] - t(jj$output["coef.end",,]), 6)   # gives all zeros

I am fairly confident that my changes lead to the same sequences of lasso fits. I find my plots, which are slightly different from those produced by the lars library, more helpful for understanding the algorithm:
    show2(jj)   # my plot

[Figure: "Modified lasso algorithm for diabetes data" — the upper panel traces the $\hat b_j$'s and the lower panel the $C_j$'s against $\lambda$, labeled by predictor (age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu). For higher resolution, see the pdf files attached to this handout.]

1 The Lasso problem

The problem is: given an $n \times 1$ vector $y$ and an $n \times p$ matrix $X$, find the $\hat b(\lambda)$ that minimizes
$$L_\lambda(b) = \|y - Xb\|^2 + 2\lambda \sum_j |b_j|$$
for each $\lambda \ge 0$. [The extra factor of 2 eliminates many factors of 2 in what follows.] The columns of $X$ will be assumed to be standardized to have zero means and $\|X_j\| = 1$. I also seem to need linear independence of various subsets of columns of $X$, which would be awkward if $p$ were larger than $n$.

Remark. I thought the vector $y$ was also supposed to have a zero mean. That is not the case for the diabetes data set in R. It does not seem to be needed for the algorithm to work.

I will consider only the "one at a time" case (page 417 of the LARS paper), for which the active set of predictors $X_j$ changes only by either addition or deletion of a single predictor.
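The handout's own code is in R. As a quick numeric companion to this section (a sketch of mine, not part of the handout; the helper names `standardize` and `lasso_objective` are invented), the objective and the assumed column standardization can be written in NumPy:

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit Euclidean norm,
    matching the handout's assumptions on the columns of X."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

def lasso_objective(b, X, y, lam):
    """L_lambda(b) = ||y - Xb||^2 + 2*lambda*sum_j |b_j|."""
    r = y - X @ b
    return r @ r + 2.0 * lam * np.abs(b).sum()

rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(50, 3)))
y = rng.normal(size=50)
print(np.abs(X.mean(axis=0)).max())    # ~0: columns have zero mean
print(np.linalg.norm(X, axis=0))       # ~[1 1 1]: columns have unit norm
```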
2 Directional derivative

<1> The function $L_\lambda$ has derivative at $b$ in the direction $u$ defined by
$$L'_\lambda(b, u) = \lim_{t \downarrow 0} \frac{L_\lambda(b + tu) - L_\lambda(b)}{t} = 2\sum_j \left[\lambda D(b_j, u_j) - (y - Xb)'X_j u_j\right]$$
where $D(b_j, u_j) = u_j\{b_j > 0\} - u_j\{b_j < 0\} + |u_j|\{b_j = 0\}$.

By convexity, a vector $b$ minimizes $L_\lambda$ if and only if $L'_\lambda(b, u) \ge 0$ for every $u$. Equivalently, for every $j$ and every $u_j$ the $j$th summand in <1> must be nonnegative. [Consider $u$ vectors with only one nonzero component to establish this equivalence.] That is, $b$ minimizes $L_\lambda$ if and only if
$$\lambda D(b_j, u_j) \ge (y - Xb)'X_j u_j \qquad \text{for every } j \text{ and every } u_j.$$
When $b_j \ne 0$ the inequalities for $u_j = \pm 1$ imply an equality; for $b_j = 0$ we get only the inequality. Thus $b$ minimizes $L_\lambda$ if and only if
$$\begin{aligned}
\lambda &= X_j'R  && \text{if } b_j > 0\\
-\lambda &= X_j'R && \text{if } b_j < 0\\
\lambda &\ge |X_j'R| && \text{if } b_j = 0
\end{aligned}
\qquad \text{where } R := y - Xb.$$

The LARS/lasso algorithm recursively calculates a sequence of breakpoints $\infty > \lambda_1 > \lambda_2 > \dots$ with $\hat b(\lambda)$ linear on each interval $\lambda_{k+1} \le \lambda \le \lambda_k$. Define the residual vector and correlations
$$R(\lambda) := y - X\hat b(\lambda) \qquad \text{and} \qquad C_j(\lambda) := X_j'R(\lambda).$$

Remark. To get a true correlation we would have to divide by $\|R(\lambda)\|$, which would complicate the constraints.

<2> The algorithm will ensure that
$$\begin{aligned}
\lambda &= C_j(\lambda)  && \text{if } \hat b_j(\lambda) > 0 && \text{(constraint +)}\\
-\lambda &= C_j(\lambda) && \text{if } \hat b_j(\lambda) < 0 && \text{(constraint −)}\\
\lambda &\ge |C_j(\lambda)| && \text{if } \hat b_j(\lambda) = 0 && \text{(constraint 0)}
\end{aligned}$$
That is, for the minimizing $\hat b(\lambda)$ each $(\lambda, C_j(\lambda))$ needs to stay inside the region $\mathcal{R} = \{(\lambda, c) \in \mathbb{R}^+ \times \mathbb{R} : |c| \le \lambda\}$, moving along the top boundary ($c = \lambda$) when $\hat b_j(\lambda) > 0$ (constraint +), along the lower boundary ($c = -\lambda$) when $\hat b_j(\lambda) < 0$ (constraint −), and being anywhere in $\mathcal{R}$ when $\hat b_j(\lambda) = 0$ (constraint 0).
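With a single standardized predictor the minimizing conditions can be checked directly: they are solved by soft thresholding, $\hat b = \operatorname{sign}(c)\max(|c| - \lambda, 0)$ with $c = X_1'y$. The following NumPy check (mine, not from the handout) verifies that the resulting correlation sits exactly on the $\operatorname{sign}(\hat b)\,\lambda$ boundary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
x = x - x.mean()
x /= np.linalg.norm(x)                 # standardized: zero mean, unit norm
y = rng.normal(size=100) + 3 * x

c = x @ y                              # X_1'y
lam = 0.5 * abs(c)                     # small enough that b_hat != 0
b_hat = np.sign(c) * max(abs(c) - lam, 0.0)   # soft thresholding

R = y - x * b_hat                      # residual at the minimizer
print(x @ R, np.sign(b_hat) * lam)     # equal: lambda = X_1'R when b_hat > 0
```

For $\lambda \ge |c|$ the minimizer is $\hat b = 0$ and only the inequality $\lambda \ge |X_1'R|$ holds, matching the third line of the display.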
3 The algorithm

The solution $\hat b(\lambda)$ is continuous in $\lambda$ and linear on intervals defined by change points $\infty = \lambda_0 > \lambda_1 > \lambda_2 > \dots > 0$. The construction proceeds in steps, starting with large $\lambda$ and working towards $\lambda = 0$. The vector of fitted values $\hat f(\lambda) = X\hat b(\lambda)$ is also piecewise linear. Within each interval $(\lambda_{k+1}, \lambda_k)$ only the active subset $A = A_k = \{j : \hat b_j(\lambda) \ne 0\}$ of the coefficients changes; the inactive coefficients stay fixed at zero.

3.1 Some illuminating special cases

It helped me to work explicitly through the first few steps before thinking about the equations that define a general step in the algorithm.

Start with $A = \emptyset$ and $\hat b(\lambda) = 0$ for $\lambda \ge \lambda_1 := \max_j |X_j'y|$. Constraint 0 is satisfied on $[\lambda_1, \infty)$.

Step 1. Constraint 0 would be violated if we kept $\hat b(\lambda)$ equal to zero for $\lambda < \lambda_1$, because we would have $\max_j |C_j(\lambda)| > \lambda$. The $\hat b(\lambda)$ must move away from zero as $\lambda$ decreases below $\lambda_1$.

We must have $|C_j(\lambda_1)| = \lambda_1$ for at least one $j$. For convenience of exposition, suppose $C_1(\lambda_1) = \lambda_1 > |C_j(\lambda_1)|$ for all $j \ge 2$. The active set now becomes $A = \{1\}$. For $\lambda_2 \le \lambda < \lambda_1$, with $\lambda_2$ to be specified soon, keep $\hat b_j(\lambda) = 0$ for $j \ge 2$ but let $\hat b_1(\lambda) = 0 + v_1(\lambda_1 - \lambda)$ for some constant $v_1$. To maintain the equality
$$\lambda = C_1(\lambda) = X_1'(y - X_1\hat b_1(\lambda)) = C_1(\lambda_1) - X_1'X_1\,v_1(\lambda_1 - \lambda) = \lambda_1 - v_1(\lambda_1 - \lambda)$$
we need $v_1 = 1$. This choice also ensures that $\hat b_1(\lambda) > 0$ for a while, so that + is the relevant constraint for $\hat b_1$.

For $\lambda < \lambda_1$, with $v_1 = 1$ we have $R(\lambda) = y - X_1(\lambda_1 - \lambda)$ and
$$C_j(\lambda) = C_j(\lambda_1) - a_j(\lambda_1 - \lambda) \qquad \text{where } a_j := X_j'X_1.$$
Notice that $|a_j| < 1$ unless $X_j = \pm X_1$. Also, as long as $\max_{j \ge 2} |C_j(\lambda)| \le \lambda$ the other $\hat b_j$'s still satisfy constraint 0.
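Numerically (my own check, not the handout's R code), the choice $v_1 = 1$ does keep the leading correlation on the boundary as $\lambda$ decreases; with a general sign $s_1 = \operatorname{sign}(C_{j_1}(\lambda_1))$ the step-1 coefficient is $\hat b_{j_1}(\lambda) = s_1(\lambda_1 - \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
y = rng.normal(size=60) + 2 * X[:, 0]

C0 = X.T @ y
lam1 = np.abs(C0).max()                 # lambda_1 = max_j |X_j'y|
j1 = int(np.argmax(np.abs(C0)))         # the first active predictor
s1 = np.sign(C0[j1])

lam = 0.9 * lam1                        # some lambda just below lambda_1
b = np.zeros(4)
b[j1] = s1 * (lam1 - lam)               # v_1 = 1
C = X.T @ (y - X @ b)
print(C[j1], s1 * lam)                  # equal: C_{j1}(lambda) tracks the boundary
```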
We need to end the first step at $\lambda_2$, the largest $\lambda$ less than $\lambda_1$ for which $\max_{j \ge 2} |C_j(\lambda)| = \lambda$. Solve $C_j(\lambda) = \pm\lambda$ for each fixed $j \ge 2$:
$$\begin{aligned}
\lambda = C_j(\lambda):\quad & \lambda_1 - (\lambda_1 - \lambda) = C_j(\lambda_1) - a_j(\lambda_1 - \lambda)
&&\iff \lambda_1 - \lambda = \frac{\lambda_1 - C_j(\lambda_1)}{1 - a_j}\\
-\lambda = C_j(\lambda):\quad & -\lambda_1 + (\lambda_1 - \lambda) = C_j(\lambda_1) - a_j(\lambda_1 - \lambda)
&&\iff \lambda_1 - \lambda = \frac{\lambda_1 + C_j(\lambda_1)}{1 + a_j}
\end{aligned}$$

<3> Both right-hand sides are strictly positive. Thus $\lambda_2 = \lambda_1 - \delta\lambda$ where
$$\delta\lambda := \min_{j \ge 2} \min\left(\frac{\lambda_1 - C_j(\lambda_1)}{1 - a_j},\; \frac{\lambda_1 + C_j(\lambda_1)}{1 + a_j}\right).$$

Second step. We have $C_1(\lambda_2) = \lambda_2 = \max_{j \ge 2} |C_j(\lambda_2)|$, by construction. For convenience of exposition, suppose $|C_2(\lambda_2)| = \lambda_2 > |C_j(\lambda_2)|$ for all $j \ge 3$. The active set now becomes $A = \{1, 2\}$.

To emphasize a subtle point it helps to consider separately two cases. Write $s_2$ for $\operatorname{sign}(C_2(\lambda_2))$, so that $C_2(\lambda_2) = s_2\lambda_2$.

Case $s_2 = +1$: For $\lambda_3 \le \lambda < \lambda_2$ and a new $v_1$ and $v_2$ (note the recycling of notation), define
$$\hat b_1(\lambda) = \hat b_1(\lambda_2) + (\lambda_2 - \lambda)v_1, \qquad \hat b_2(\lambda) = 0 + (\lambda_2 - \lambda)v_2,$$
with all other $\hat b_j$'s still zero. Write $Z$ for $[X_1, X_2]$. The new $C_j$'s become
$$C_j(\lambda) = X_j'\left(y - X_1\hat b_1(\lambda) - X_2\hat b_2(\lambda)\right) = C_j(\lambda_2) - (\lambda_2 - \lambda)X_j'Zv \qquad \text{where } v = (v_1, v_2)'.$$
We keep $C_1(\lambda) = C_2(\lambda) = \lambda$ if we choose $v$ to make $X_1'Zv = 1 = X_2'Zv$. That is, we need $v = (Z'Z)^{-1}\mathbf{1}$ with $\mathbf{1} = (1, 1)'$. Of course we must assume that $X_1$ and $X_2$ are linearly independent for $Z'Z$ to have an inverse.
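The breakpoint formula in <3> can be checked numerically (my own sketch; to cover a general leading sign $s_1$, the inner products here are $\tilde a_j = X_j'(s_1 X_{j_1})$). At $\lambda_2 = \lambda_1 - \delta\lambda$ a second correlation hits the $\pm\lambda$ boundary exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
y = rng.normal(size=60) + 2 * X[:, 0]

C0 = X.T @ y
j1 = int(np.argmax(np.abs(C0)))
s1 = np.sign(C0[j1])
lam1 = abs(C0[j1])

a = X.T @ (s1 * X[:, j1])              # a_j = X_j'(s1 X_{j1})
others = [j for j in range(4) if j != j1]
# the two candidate decreases from <3>, for C_j = +lambda and C_j = -lambda
dlam = min(min((lam1 - C0[j]) / (1 - a[j]),
               (lam1 + C0[j]) / (1 + a[j])) for j in others)
lam2 = lam1 - dlam

b = np.zeros(4)
b[j1] = s1 * dlam                      # b_{j1}(lambda_2), with v_1 = 1
C = X.T @ (y - X @ b)
print(np.abs(C[others]).max(), lam2)   # equal: the next predictor hits the boundary
```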
Case $s_2 = -1$: For $\lambda_3 \le \lambda < \lambda_2$ and a new $v_1$ and $v_2$, define
$$\hat b_1(\lambda) = \hat b_1(\lambda_2) + (\lambda_2 - \lambda)v_1, \qquad \hat b_2(\lambda) = -(\lambda_2 - \lambda)v_2 \quad \text{(note the change of sign)},$$
with all other $\hat b_j$'s still zero. Write $Z$ for $[X_1, -X_2]$. The tricky business with the signs ensures that
$$X\hat b(\lambda) - X\hat b(\lambda_2) = X_1(\lambda_2 - \lambda)v_1 - X_2(\lambda_2 - \lambda)v_2 = (\lambda_2 - \lambda)Zv.$$
The new $C_j$'s become
$$C_j(\lambda) = X_j'\left(y - X_1\hat b_1(\lambda) - X_2\hat b_2(\lambda)\right) = C_j(\lambda_2) - (\lambda_2 - \lambda)X_j'Zv.$$
We keep $C_1(\lambda) = -C_2(\lambda) = \lambda$ if we choose $v$ to make $X_1'Zv = 1 = -X_2'Zv$. That is, again we need $v = (Z'Z)^{-1}\mathbf{1}$.

Remark. If we had assumed $C_1(\lambda_1) = -\lambda_1$ then $X_1$ would be replaced by $-X_1$ in the $Z$ matrix; that is, $Z = [s_1X_1, s_2X_2]$ with $s_1 = \operatorname{sign}(C_1(\lambda_1))$ and $s_2 = \operatorname{sign}(C_2(\lambda_2))$.

Because $\hat b_1(\lambda_2) > 0$, the correlation $C_1(\lambda)$ stays on the correct boundary for the + constraint. If $s_2 = +1$ we need $v_2 > 0$ to keep $\hat b_2(\lambda) > 0$ and $C_2(\lambda) = \lambda$, satisfying +. If $s_2 = -1$ we also need $v_2 > 0$ to keep $\hat b_2(\lambda) < 0$ and $C_2(\lambda) = -\lambda$, satisfying −. That is, in both cases we need $v_2 > 0$.

Why do we get a strictly positive $v_2$? Write $\rho$ for $s_2 X_2'X_1$. As the $Z'Z$ matrix is nonsingular we must have $|\rho| < 1$, so that
$$v = \begin{pmatrix} 1 & \rho\\ \rho & 1 \end{pmatrix}^{-1}\begin{pmatrix}1\\1\end{pmatrix}
= (1 - \rho^2)^{-1}\begin{pmatrix}1 & -\rho\\ -\rho & 1\end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix}$$
and $v_2 = v_1 = (1 - \rho)/(1 - \rho^2) > 0$.

If no further $C_j(\lambda)$'s were to hit the $\pm\lambda$ boundary, step 2 could continue all the way to $\lambda = 0$. More typically, we would need to create a new active set at the largest $\lambda_3$ strictly smaller than $\lambda_2$ for which $\max_{j \ge 3} |C_j(\lambda)| = \lambda$. For the general step there is another possible event that would require a change to the active set: one of the $\hat b_j(\lambda)$'s in the active set might hit zero, threatening to change sign and leave the corresponding $C_j(\lambda)$ on the wrong boundary. I could pursue these special cases further, but it is better to start again for the generic step in the algorithm.
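A small numeric check (mine) of the $2 \times 2$ computation: with unit-norm columns, $Z'Z$ has the stated form and both coordinates of $v = (Z'Z)^{-1}\mathbf{1}$ equal $(1 - \rho)/(1 - \rho^2) > 0$, for either sign $s_2$:

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=50); x1 = x1 - x1.mean(); x1 /= np.linalg.norm(x1)
x2 = rng.normal(size=50); x2 = x2 - x2.mean(); x2 /= np.linalg.norm(x2)

checks = []
for s2 in (+1.0, -1.0):
    Z = np.column_stack([x1, s2 * x2])         # Z = [s1 X1, s2 X2], with s1 = +1
    v = np.linalg.solve(Z.T @ Z, np.ones(2))   # v = (Z'Z)^{-1} 1
    rho = s2 * (x2 @ x1)
    print(v, (1 - rho) / (1 - rho ** 2))       # v1 = v2 = (1-rho)/(1-rho^2) > 0
    checks.append(np.allclose(v, (1 - rho) / (1 - rho ** 2)) and v.min() > 0)
```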
3.2 The general algorithm

Once again, start with $A = \emptyset$ and $\hat b(\lambda) = 0$ for $\lambda \ge \lambda_1 := \max_j |X_j'y|$. Constraint 0 is satisfied on $[\lambda_1, \infty)$.

At each $\lambda_k$ a new active set $A_k$ is defined. During the $k$th step the parameter $\lambda$ decreases from $\lambda_k$ to $\lambda_{k+1}$. For all $j$'s in the active set $A_k$, the coefficients $\hat b_j(\lambda)$ change linearly and the $C_j(\lambda)$'s move along one of the boundaries of the feasible region: $C_j(\lambda) = \lambda$ if $\hat b_j(\lambda) > 0$ and $C_j(\lambda) = -\lambda$ if $\hat b_j(\lambda) < 0$. For each inactive $j$ the coefficient $\hat b_j(\lambda)$ remains zero throughout $[\lambda_{k+1}, \lambda_k]$. Step $k$ ends when either an inactive $C_j(\lambda)$ hits a $\pm\lambda$ boundary or an active $\hat b_j(\lambda)$ becomes zero: $\lambda_{k+1}$ is defined as the largest $\lambda$ less than $\lambda_k$ for which either of these conditions holds:

(i) $\max_{j \notin A_k} |C_j(\lambda)| = \lambda$. In that case add the new $j \in A_k^c$ for which $|C_j(\lambda_{k+1})| = \lambda_{k+1}$ to the active set, then proceed to step $k+1$.

(ii) $\hat b_j(\lambda) = 0$ for some $j \in A_k$. In that case, remove $j$ from the active set, then proceed to step $k+1$.

For the diabetes data, alternative (ii) caused the behavior shown below for $1.3 \le \lambda \le 2.2$.

[Figure: detail of the "Modified lasso algorithm for diabetes data" plot, showing an active coefficient hitting zero and its predictor being dropped from the active set.]

I will show how the $C_j(\lambda)$'s and the $\hat b_j(\lambda)$'s can be chosen so that the conditions <2> are always satisfied.
At the start of step $k$ (with $\lambda = \lambda_k$), define $Z = [s_jX_j : j \in A_k]$, where $s_j := \operatorname{sign}(C_j(\lambda_k))$. That is, $Z$ has a column for each active predictor $X_j$, with the sign flipped if the corresponding $C_j(\lambda)$ is moving along the $-\lambda$ boundary. Define $v := (Z'Z)^{-1}\mathbf{1}$. For $\lambda_{k+1} \le \lambda \le \lambda_k$ the active coefficients change linearly,

<4> $\hat b_j(\lambda) = \hat b_j(\lambda_k) + (\lambda_k - \lambda)v_js_j$ for $j \in A_k$.

The fitted vector,
$$\hat f(\lambda) := X\hat b(\lambda) = \hat f(\lambda_k) + \sum_{j \in A_k} X_j\left(\hat b_j(\lambda) - \hat b_j(\lambda_k)\right) = \hat f(\lambda_k) + (\lambda_k - \lambda)Zv,$$
the residuals,
$$R(\lambda) := y - \hat f(\lambda) = R(\lambda_k) - (\lambda_k - \lambda)Zv,$$
and the correlations,

<5> $C_j(\lambda) := X_j'R(\lambda) = C_j(\lambda_k) - (\lambda_k - \lambda)a_j$ where $a_j := X_j'Zv$,

also change linearly. In particular, for $j \in A_k$,
$$C_j(\lambda) = s_j\lambda_k - (\lambda_k - \lambda)s_jZ_j'Zv = s_j\lambda \qquad \text{because } Z_j'Z(Z'Z)^{-1}\mathbf{1} = 1,$$
where $Z_j = s_jX_j$ denotes the corresponding column of $Z$. The active $C_j(\lambda)$'s move along one of the boundaries $\pm\lambda$ of $\mathcal{R}$.

Remember that I am assuming the "one at a time" condition: the active set changes only by addition of one new predictor in case (i) or by dropping one predictor in case (ii). Suppose index $\alpha$ enters the active set via (i) when $\lambda$ equals $\lambda_{k+1}$ then leaves it via (ii) when $\lambda$ equals $\lambda_{l+1}$. (If $\alpha$ never leaves the active set put $\lambda_{l+1}$ equal to 0.) Throughout the interval $(\lambda_{l+1}, \lambda_{k+1})$ both $\operatorname{sign}(C_\alpha(\lambda))$ and $\operatorname{sign}(\hat b_\alpha(\lambda))$ stay constant. To ensure that all constraints are satisfied it is enough to prove:

(a) For $\lambda$ only slightly smaller than $\lambda_{k+1}$, both $C_\alpha(\lambda)$ and $\hat b_\alpha(\lambda)$ have the same sign.

(b) For $\lambda$ only slightly smaller than $\lambda_{l+1}$, constraint 0 is satisfied; that is, $|C_\alpha(\lambda)| \le \lambda$.
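The handout's implementation, getlasso(), is in R and is not reproduced here. As an independent sketch (mine: the name `lasso_path` and all details are my own translation of <4>–<5> and the stopping events (i)/(ii) into NumPy, assuming standardized columns and the one-at-a-time condition), the general step can be iterated down to $\lambda = 0$, where $\hat b$ should agree with least squares:

```python
import numpy as np

def lasso_path(X, y, tol=1e-10, max_steps=100):
    """Trace the LARS/lasso breakpoints (lambda_k, b_hat(lambda_k)),
    following the general step of the handout."""
    n, p = X.shape
    b = np.zeros(p)
    C = X.T @ y
    lam = np.abs(C).max()
    active = [int(np.argmax(np.abs(C)))]
    path = [(lam, b.copy())]
    for _ in range(max_steps):
        if lam <= tol:
            break
        s = np.sign(C[active])
        Z = X[:, active] * s                       # columns s_j X_j
        v = np.linalg.solve(Z.T @ Z, np.ones(len(active)))
        a = X.T @ (Z @ v)                          # a_j = X_j' Z v, as in <5>
        cand, events = [lam], [("end", -1)]        # lambda stops at 0
        for j in set(range(p)) - set(active):      # (i): C_j hits a boundary
            for d in ((lam - C[j]) / (1 - a[j]),   # solves C_j(lam - d) = +(lam - d)
                      (lam + C[j]) / (1 + a[j])):  # solves C_j(lam - d) = -(lam - d)
                if tol < d:
                    cand.append(d); events.append(("add", j))
        for idx, j in enumerate(active):           # (ii): b_j hits zero
            d = -b[j] / (v[idx] * s[idx])
            if tol < d:
                cand.append(d); events.append(("drop", j))
        k = int(np.argmin(cand))
        d, (kind, j) = cand[k], events[k]
        b[active] += d * v * s                     # <4>: linear move in lambda
        lam -= d
        C = X.T @ (y - X @ b)
        path.append((lam, b.copy()))
        if kind == "add":
            active.append(j)
        elif kind == "drop":
            active.remove(j)
            b[j] = 0.0
        else:
            break
    return path

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 5))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.5 * rng.normal(size=40)

lam0, b0 = lasso_path(X, y)[-1]
print(lam0)   # ~0: the path reaches lambda = 0, where b0 is the least-squares fit
```

At every breakpoint the constraints <2> hold: active correlations sit on $\pm\lambda$ and inactive ones lie strictly inside.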
The analyses are similar for the two cases. They both depend on a neat formula for inversion of symmetric block matrices. Suppose $A$ is an $m \times m$ nonsingular, symmetric matrix and $d$ is an $m \times 1$ vector for which $\kappa := 1 - d'A^{-1}d \ne 0$. Then

<6> $$\begin{pmatrix} A & d\\ d' & 1 \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + ww'/\kappa & -w/\kappa\\ -w'/\kappa & 1/\kappa \end{pmatrix} \qquad \text{where } w := A^{-1}d.$$

This formula captures the ideas involved in Lemma 4 of the LARS paper.

For simplicity of notation, I will act in both cases (a) and (b) as if $\alpha$ is larger than all the other $j$'s that were in the active set. (More formally, I could replace $X$ in what follows by $XE$ for a suitable permutation matrix $E$.) The old and new $Z$ matrices then have the block form shown in <6>.

Case (a): $\alpha$ enters the active set at $\lambda_{k+1}$. As in the analysis that led to the expression in <3>, the solutions of $\pm\lambda = C_j(\lambda)$ with $j \in A_k^c$ are given by
$$\begin{aligned}
(\lambda_k - \lambda)(1 - a_j) &= \lambda_k - C_j(\lambda_k) && \text{to get } \lambda = C_j(\lambda),\\
(\lambda_k - \lambda)(1 + a_j) &= \lambda_k + C_j(\lambda_k) && \text{to get } -\lambda = C_j(\lambda).
\end{aligned}$$
Index $\alpha$ is brought into the active set because $|C_\alpha(\lambda_{k+1})| = \lambda_{k+1}$. Thus
$$(\lambda_k - \lambda_{k+1})(1 - s_\alpha a_\alpha) = \lambda_k - s_\alpha C_\alpha(\lambda_k) \qquad \text{where } s_\alpha := \operatorname{sign}(C_\alpha(\lambda_{k+1})).$$
Note that $|C_\alpha(\lambda_k)| < \lambda_k$ because $\alpha$ was not active during step $k$. It follows that the right-hand side of the last equality is strictly positive, which implies

<7> $1 - s_\alpha a_\alpha > 0$.

Throughout a small neighborhood of $\lambda_{k+1}$ the sign $s_\alpha$ of $C_\alpha(\lambda)$ stays the same.

Continue to write $Z$ for the active matrix $[s_jX_j : j \in A_k]$ for $\lambda$ slightly larger than $\lambda_{k+1}$ and denote by $\widetilde Z = [s_jX_j : j \in A_{k+1}] = [Z, s_\alpha X_\alpha]$ the new active matrix for $\lambda$ slightly smaller than $\lambda_{k+1}$. Then
$$\widetilde Z'\widetilde Z = \begin{pmatrix} Z'Z & d\\ d' & 1 \end{pmatrix} \qquad \text{where } d := s_\alpha Z'X_\alpha.$$
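The block-inversion formula <6> is easy to verify numerically (my own check, for a random symmetric nonsingular $A$ and vector $d$):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4
A = rng.normal(size=(m, m))
A = A @ A.T + np.eye(m)                # symmetric, nonsingular
d = rng.normal(size=m)

w = np.linalg.solve(A, d)              # w = A^{-1} d
kappa = 1.0 - d @ w                    # kappa = 1 - d'A^{-1}d
M = np.block([[A, d[:, None]], [d[None, :], np.ones((1, 1))]])
Minv = np.block([[np.linalg.inv(A) + np.outer(w, w) / kappa, -w[:, None] / kappa],
                 [-w[None, :] / kappa, np.ones((1, 1)) / kappa]])
print(np.abs(M @ Minv - np.eye(m + 1)).max())   # ~0: Minv really is the inverse
```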
Notice that

<8> $1 - \kappa = d'A^{-1}d = X_\alpha'Z(Z'Z)^{-1}Z'X_\alpha = \|HX_\alpha\|^2$,

where $H = Z(Z'Z)^{-1}Z'$ is the matrix that projects vectors orthogonally onto $\operatorname{span}(Z)$. If $X_\alpha \notin \operatorname{span}(Z)$ then $\|HX_\alpha\| < \|X_\alpha\| = 1$, so that $\kappa > 0$. From <6>,
$$(\widetilde Z'\widetilde Z)^{-1} = \begin{pmatrix} (Z'Z)^{-1} + ww'/\kappa & -w/\kappa\\ -w'/\kappa & 1/\kappa \end{pmatrix} \qquad \text{where } w := s_\alpha(Z'Z)^{-1}Z'X_\alpha.$$
The $\alpha$th coordinate of the new $\widetilde v = (\widetilde Z'\widetilde Z)^{-1}\mathbf{1}$ equals $\widetilde v_\alpha = \kappa^{-1}(1 - w'\mathbf{1})$, which is strictly positive because
$$1 - w'\mathbf{1} = 1 - s_\alpha X_\alpha'Z(Z'Z)^{-1}\mathbf{1} = 1 - s_\alpha X_\alpha'Zv = 1 - s_\alpha a_\alpha \quad \text{by <5>,}$$
which is $> 0$ by <7>. By <4>, the new $\hat b_\alpha(\lambda) = 0 + (\lambda_{k+1} - \lambda)\widetilde v_\alpha s_\alpha$ has the same sign, $s_\alpha$, as $C_\alpha(\lambda)$.

Case (b): $\alpha$ leaves the active set at $\lambda = \lambda_{l+1}$. The roles of $Z = [s_jX_j : 1 \le j < \alpha]$ and $\widetilde Z = [Z, s_\alpha X_\alpha]$ are now reversed. For $\lambda$ slightly larger than $\lambda_{l+1}$ the active matrix is $\widetilde Z$, and both $C_\alpha(\lambda)$ and $\hat b_\alpha(\lambda)$ have sign $s_\alpha$. Index $\alpha$ leaves the active set because $\hat b_\alpha(\lambda_{l+1}) = 0$. Thus
$$0 = \hat b_\alpha(\lambda_l) + (\lambda_l - \lambda_{l+1})\widetilde v_\alpha s_\alpha \qquad \text{where } s_\alpha := \operatorname{sign}(C_\alpha(\lambda_l)) \text{ and } \widetilde v = (\widetilde Z'\widetilde Z)^{-1}\mathbf{1}.$$
The active $\hat b_\alpha(\lambda_l)$ also had sign $s_\alpha$. Consequently, we must have

<9> $\widetilde v_\alpha < 0$.

For $\lambda$ slightly smaller than $\lambda_{l+1}$ the active matrix is $Z$, and, by <5>,
$$s_\alpha C_\alpha(\lambda) = \lambda_{l+1} - (\lambda_{l+1} - \lambda)s_\alpha a_\alpha = \lambda + (\lambda_{l+1} - \lambda)(1 - s_\alpha a_\alpha) \qquad \text{where } a_\alpha := X_\alpha'Zv.$$
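A numerical check (mine) of the key identity for the entering coordinate: the last coordinate of $\widetilde v = (\widetilde Z'\widetilde Z)^{-1}\mathbf{1}$ equals $(1 - s_\alpha a_\alpha)/\kappa$, with $\kappa = 1 - \|HX_\alpha\|^2$ as in <8>:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
Z = rng.normal(size=(n, 3))                  # current active matrix
x = rng.normal(size=n)
x /= np.linalg.norm(x)                       # entering predictor, ||X_alpha|| = 1
s_a = 1.0                                    # sign of C_alpha at entry

Zt = np.column_stack([Z, s_a * x])           # new active matrix [Z, s_alpha X_alpha]
v = np.linalg.solve(Z.T @ Z, np.ones(3))     # old v = (Z'Z)^{-1} 1
vt = np.linalg.solve(Zt.T @ Zt, np.ones(4))  # new v-tilde

a_a = x @ (Z @ v)                            # a_alpha = X_alpha' Z v, as in <5>
Hx = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)   # projection of X_alpha onto span(Z)
kappa = 1.0 - Hx @ Hx                        # = 1 - ||H X_alpha||^2, as in <8>
print(vt[-1], (1.0 - s_a * a_a) / kappa)     # equal
```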
We also know that $0 > \kappa\widetilde v_\alpha = 1 - w'\mathbf{1} = 1 - s_\alpha a_\alpha$. Thus $s_\alpha C_\alpha(\lambda) < \lambda$, and hence $|C_\alpha(\lambda)| < \lambda$, for $\lambda$ slightly less than $\lambda_{l+1}$. The new $C_\alpha(\lambda)$ satisfies constraint 0 as it heads off towards the other boundary.

Remark. Clearly there is some sort of duality accounting for the similarities in the arguments for cases (a) and (b), with <6> as the shared mechanism. If we think of $\lambda$ as time, case (b) is a time reversal of case (a). For case (a) the defining condition ($C_\alpha$ hits the boundary) at $\lambda_{k+1}$ gives $1 - s_\alpha a_\alpha > 0$, which implies $\widetilde v_\alpha > 0$. For case (b), the defining condition ($\hat b_\alpha$ hits zero) at $\lambda_{l+1}$ gives $\widetilde v_\alpha < 0$, which implies $1 - s_\alpha a_\alpha < 0$. Is there some clever way to handle both cases by a duality argument? Maybe I should read more of the LARS paper.

4 My getlasso() function

I wrote the R code in the file lasso.r to help me understand the algorithm described in the LARS paper. The function is not particularly elegant or efficient. I wrote it in a way that made it easy to examine the output from each step. If verbose=T, lots of messages get written to the console and the function pauses after each step. I also used some calls to browser() while tracking down some annoying bugs related to case (b).

d.p. 1 December 2010

References

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407-499.

Rosset, S. and J. Zhu (2007). Piecewise linear regularized solution paths. The Annals of Statistics 35(3), 1012-1030.
[Attached: higher-resolution versions of the "Modified lasso algorithm for diabetes data" plots, one page per stage of the path; each upper panel traces the $\hat b_j$'s and each lower panel the $C_j$'s, labeled by predictor.]
More informationSUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION. University of Minnesota
Submitted to the Annals of Statistics arxiv: math.pr/0000000 SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION By Wei Liu and Yuhong Yang University of Minnesota In
More informationMS&E 318 (CME 338) Large-Scale Numerical Optimization. A Lasso Solver
Stanford University, Dept of Management Science and Engineering MS&E 318 (CME 338) Large-Scale Numerical Optimization Instructor: Michael Saunders Spring 2011 Final Project Due Friday June 10 A Lasso Solver
More information9. Numerical linear algebra background
Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization
More informationMissing dependent variables in panel data models
Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units
More informationStandard forms for writing numbers
Standard forms for writing numbers In order to relate the abstract mathematical descriptions of familiar number systems to the everyday descriptions of numbers by decimal expansions and similar means,
More informationEXAMPLES OF PROOFS BY INDUCTION
EXAMPLES OF PROOFS BY INDUCTION KEITH CONRAD 1. Introduction In this handout we illustrate proofs by induction from several areas of mathematics: linear algebra, polynomial algebra, and calculus. Becoming
More informationPathwise coordinate optimization
Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,
More informationCS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs
CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More informationKarush-Kuhn-Tucker Conditions. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Karush-Kuhn-Tucker Conditions Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given a minimization problem Last time: duality min x subject to f(x) h i (x) 0, i = 1,... m l j (x) = 0, j =
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More information5 + 9(10) + 3(100) + 0(1000) + 2(10000) =
Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,
More informationReview Solutions, Exam 2, Operations Research
Review Solutions, Exam 2, Operations Research 1. Prove the weak duality theorem: For any x feasible for the primal and y feasible for the dual, then... HINT: Consider the quantity y T Ax. SOLUTION: To
More informationFinal Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2
Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationLAGRANGE MULTIPLIERS
LAGRANGE MULTIPLIERS MATH 195, SECTION 59 (VIPUL NAIK) Corresponding material in the book: Section 14.8 What students should definitely get: The Lagrange multiplier condition (one constraint, two constraints
More informationCHAPTER 1: Functions
CHAPTER 1: Functions 1.1: Functions 1.2: Graphs of Functions 1.3: Basic Graphs and Symmetry 1.4: Transformations 1.5: Piecewise-Defined Functions; Limits and Continuity in Calculus 1.6: Combining Functions
More informationHere we used the multiindex notation:
Mathematics Department Stanford University Math 51H Distributions Distributions first arose in solving partial differential equations by duality arguments; a later related benefit was that one could always
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationIn particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with
Appendix: Matrix Estimates and the Perron-Frobenius Theorem. This Appendix will first present some well known estimates. For any m n matrix A = [a ij ] over the real or complex numbers, it will be convenient
More informationSergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada. and
NON-PLANAR EXTENSIONS OF SUBDIVISIONS OF PLANAR GRAPHS Sergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada and Robin Thomas 1 School of Mathematics
More informationFunctional Analysis MATH and MATH M6202
Functional Analysis MATH 36202 and MATH M6202 1 Inner Product Spaces and Normed Spaces Inner Product Spaces Functional analysis involves studying vector spaces where we additionally have the notion of
More informationGeneralized Elastic Net Regression
Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1
More informationSparsity and the Lasso
Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is
More informationLecture 3: Sizes of Infinity
Math/CS 20: Intro. to Math Professor: Padraic Bartlett Lecture 3: Sizes of Infinity Week 2 UCSB 204 Sizes of Infinity On one hand, we know that the real numbers contain more elements than the rational
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationON THE MATRIX EQUATION XA AX = X P
ON THE MATRIX EQUATION XA AX = X P DIETRICH BURDE Abstract We study the matrix equation XA AX = X p in M n (K) for 1 < p < n It is shown that every matrix solution X is nilpotent and that the generalized
More informationsatisfying ( i ; j ) = ij Here ij = if i = j and 0 otherwise The idea to use lattices is the following Suppose we are given a lattice L and a point ~x
Dual Vectors and Lower Bounds for the Nearest Lattice Point Problem Johan Hastad* MIT Abstract: We prove that given a point ~z outside a given lattice L then there is a dual vector which gives a fairly
More informationPage 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03
Page 5 Lecture : Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 008/10/0 Date Given: 008/10/0 Inner Product Spaces: Definitions Section. Mathematical Preliminaries: Inner
More informationLASSO Review, Fused LASSO, Parallel LASSO Solvers
Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable
More informationChapter 3, Operations Research (OR)
Chapter 3, Operations Research (OR) Kent Andersen February 7, 2007 1 Linear Programs (continued) In the last chapter, we introduced the general form of a linear program, which we denote (P) Minimize Z
More informationA Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)
A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical
More informationHomework 5. Convex Optimization /36-725
Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationWelsh s problem on the number of bases of matroids
Welsh s problem on the number of bases of matroids Edward S. T. Fan 1 and Tony W. H. Wong 2 1 Department of Mathematics, California Institute of Technology 2 Department of Mathematics, Kutztown University
More informationA NEW FACE METHOD FOR LINEAR PROGRAMMING. 1. Introduction We are concerned with the standard linear programming (LP) problem
A NEW FACE METHOD FOR LINEAR PROGRAMMING PING-QI PAN Abstract. An attractive feature of the face method [9 for solving LP problems is that it uses the orthogonal projection of the negative objective gradient
More information1 Matrices and Systems of Linear Equations. a 1n a 2n
March 31, 2013 16-1 16. Systems of Linear Equations 1 Matrices and Systems of Linear Equations An m n matrix is an array A = (a ij ) of the form a 11 a 21 a m1 a 1n a 2n... a mn where each a ij is a real
More informationSpring 2017 CO 250 Course Notes TABLE OF CONTENTS. richardwu.ca. CO 250 Course Notes. Introduction to Optimization
Spring 2017 CO 250 Course Notes TABLE OF CONTENTS richardwu.ca CO 250 Course Notes Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4, 2018 Table
More information