The lasso: an $\ell_1$ constraint in variable selection

1 The lasso algorithm is due to Tibshirani, who realised the possibility for variable selection of imposing an $\ell_1$-norm bound constraint on the variables in least squares models and then tuning the model estimation calculation using this bound. Considerable interest has been generated in this procedure by the discovery by Osborne, Presnell, and Turlach that the complete solution trajectory parametrised by this bound can be calculated very efficiently (the homotopy algorithm). This has resulted in the study both of the selection problem for different objective and constraint choices and of applications to such areas as data compression and the generation of sparse solutions of very under-determined systems. One class of generalisation is to piecewise linear systems; one example is quantile regression. In this case the selection problem can be formulated as a linear program and post-optimality procedures used to generate the solution trajectory. Our original continuation idea also extends, in an interesting two-phase procedure which has significant computational advantages over the LP approach. However, it is significantly less effective than the original homotopy algorithm for least squares objectives. The underlying problem is easier to state than to resolve. In contrast to the smooth objective case, a relatively efficient descent algorithm is available for fixed values of the constraint bound. This is joint work with Berwin Turlach.

2 Outline
Introduction; LSQ Descent; LSQ Homotopy; $\ell_1$ Descent; $\ell_1$ Homotopy; Results; Other; References.

3 Original formulation
Start with the linear model
$$r = y - X\beta,\qquad X : \mathbb{R}^p \to \mathbb{R}^n,\qquad \operatorname{rank} X = \min(p, n).$$
Problem: select a small subset of the columns of $X$ so that $\|r\|_2$ is small in an appropriate sense. Applications:
1. Exploratory data analysis ($y$ a signal observed in the presence of noise). Here the case of most interest corresponds to $p < n$.
2. Economising the representation of a sampled signal in a manner compatible with adequate image reconstruction. Here the case of interest corresponds to $p \gg n$. The aim is data compression.

6 Tibshirani
Add the $\ell_1$ constraint:
$$\min_\beta \tfrac{1}{2}\|r\|_2^2\qquad\text{subject to}\quad \|\beta\|_1 \le \kappa.$$
This can be written as a QP by introducing slack variables and positivity constraints,
$$\beta_i = u_i - v_i,\quad u_i, v_i \ge 0,\quad i = 1, 2, \ldots, p,\qquad \|\beta\|_1 = \sum_{i=1}^p (u_i + v_i).$$
Osborne, Presnell, and Turlach treat the constraint directly.
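To make the QP splitting concrete, here is a minimal numpy/scipy sketch that solves the constrained problem in the $(u, v)$ variables with a general-purpose SLSQP solver; the random design, response and the bound $\kappa$ are placeholder choices, and this is only a reference formulation, not the homotopy algorithm discussed below.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, kappa = 50, 8, 1.5                    # toy sizes and l1 bound (arbitrary)
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def objective(z):
    # z = (u, v) with beta = u - v and u, v >= 0
    u, v = z[:p], z[p:]
    r = y - X @ (u - v)
    return 0.5 * r @ r

def gradient(z):
    u, v = z[:p], z[p:]
    g = -X.T @ (y - X @ (u - v))            # gradient w.r.t. beta
    return np.concatenate([g, -g])          # chain rule for (u, v)

constraints = [{"type": "ineq",             # kappa - sum(u + v) >= 0
                "fun": lambda z: kappa - z.sum(),
                "jac": lambda z: -np.ones_like(z)}]
bounds = [(0, None)] * (2 * p)              # positivity of u and v
res = minimize(objective, np.zeros(2 * p), jac=gradient,
               bounds=bounds, constraints=constraints, method="SLSQP")
beta = res.x[:p] - res.x[p:]
print("||beta||_1 =", np.abs(beta).sum(), " objective =", res.fun)
```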

7 The lasso in variable selection

8 Necessary conditions
Let $\mu$ be the Lagrange multiplier for the $\ell_1$ constraint. Then
$$r^T X = \mu u^T,\qquad \mu \ge 0,\qquad u \in \partial\|\beta\|_1,\qquad \mu = \frac{r^T X\beta}{\|\beta\|_1}.$$
Note $\mu = 0$ if $\kappa \ge \|\beta_{LS}\|_1$. Introduce an index set $\psi$ pointing to the nonzero components of $\beta$ (the currently selected variables) and a permutation matrix $Q_\psi$ which collects together these nonzero components. Then
$$\beta = Q_\psi^T\begin{bmatrix}\beta_\psi \\ 0\end{bmatrix},\qquad u = Q_\psi^T\begin{bmatrix}\theta_\psi \\ u_2\end{bmatrix} \in \partial\|\beta\|_1,\qquad (\theta_\psi)_j = \operatorname{sgn}(\beta_{\psi(j)}),$$
$$-1 \le (u_2)_k \le 1,\ k \in \psi^c,\qquad \psi \cup \psi^c = \{1, 2, \ldots, p\},\qquad u^T\beta = \|\beta\|_1,\qquad \|u\|_\infty = 1.$$
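These conditions are easy to check numerically in the special case of orthonormal columns, where the penalised (Lagrangian) form of the lasso has the closed-form soft-thresholding solution. The sketch below, with an invented orthonormal design and an arbitrary value of $\mu$, verifies that $X^T r = \mu u$ with $u$ equal to $\operatorname{sgn}\beta_i$ on the selected set and bounded by one elsewhere, and that $\mu = r^T X\beta/\|\beta\|_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, mu = 40, 6, 0.8                              # toy sizes, illustrative multiplier
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + 0.05 * rng.standard_normal(n)

c = X.T @ y
beta = np.sign(c) * np.maximum(np.abs(c) - mu, 0.0)   # soft thresholding
r = y - X @ beta
g = X.T @ r                                        # should be mu * u, u in the subdifferential

active = beta != 0
print("selected variables:", np.where(active)[0])
print("g = mu*sgn(beta) on the selected set:",
      np.allclose(g[active], mu * np.sign(beta[active])))
print("|g| <= mu off the selected set:", np.all(np.abs(g[~active]) <= mu + 1e-10))
print("mu recovered from r^T X beta / ||beta||_1:", (r @ X @ beta) / np.abs(beta).sum())
```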

9 Matrix factorization
Partial orthogonal factorization helps to simplify the calculations:
$$X Q_\psi^T = S\begin{bmatrix} U_1 & U_{12} \\ 0 & B\end{bmatrix},\qquad S^T y = \begin{bmatrix} c_1 \\ c_2\end{bmatrix}.$$
The necessary conditions become
$$\begin{bmatrix} U_1^T & 0 \\ U_{12}^T & B^T\end{bmatrix}\left\{\begin{bmatrix} c_1 \\ c_2\end{bmatrix} - \begin{bmatrix} U_1 & U_{12} \\ 0 & B\end{bmatrix}\begin{bmatrix}\beta_\psi \\ 0\end{bmatrix}\right\} = \mu\begin{bmatrix}\theta_\psi \\ u_2\end{bmatrix}.$$
Solving gives
$$U_1\beta_\psi = c_1 - \mu w_\psi,\quad w_\psi = U_1^{-T}\theta_\psi,\qquad \mu u_2 = B^T c_2 + \mu U_{12}^T w_\psi,\qquad \kappa = w_\psi^T c_1 - \mu\|w_\psi\|_2^2.$$
Note the linear relation between $\kappa$ and $\mu$, and the condition $-1 \le (u_2)_i \le 1$.
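A small sketch of these relations, assuming the selected set $\psi$ and the signs $\theta_\psi$ are simply taken as given (it does not check the remaining optimality conditions): it forms the partial orthogonal factorization from a full QR of the selected columns, computes $w_\psi$, and evaluates the affine map between $\kappa$ and $\mu$ together with the resulting $\beta_\psi$. All concrete data here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

psi = np.array([0, 2, 3])                  # assumed selected variables
theta = np.array([1.0, -1.0, 1.0])         # assumed signs of beta_psi
m = len(psi)

S, R = np.linalg.qr(X[:, psi], mode="complete")   # X[:, psi] = S [U_1; 0]
U1 = R[:m, :]
c1 = (S.T @ y)[:m]

w = np.linalg.solve(U1.T, theta)           # w_psi = U_1^{-T} theta_psi
kappa = 0.7                                # illustrative bound
mu = (w @ c1 - kappa) / (w @ w)            # affine relation between kappa and mu
beta_psi = np.linalg.solve(U1, c1 - mu * w)

print("mu =", mu)
print("theta^T beta_psi =", theta @ beta_psi, " (should equal kappa =", kappa, ")")
print("d mu / d kappa =", -1.0 / (w @ w))
```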

10 Descent direction
The key feature is explicit treatment of the $\ell_1$ constraint, by insisting that $\theta_\psi$ continues to define the norm constraint. Start with feasible $\beta$ such that $\theta_\psi^T\beta_\psi = \kappa$. Find $\beta_1 = \beta_\psi + h$ by solving
$$\min_{h:\ \theta_\psi^T h = 0}\ \|r(\beta_\psi + h)\|_2^2.$$
The Kuhn-Tucker conditions give
$$X_1^T r(\beta_\psi + h) = \mu\theta_\psi,\qquad \theta_\psi^T(\beta_\psi + h) = \kappa.$$
$h \ne 0$ is a descent direction: a small enough displacement in the direction $h$ retains feasibility and reduces the objective, since
$$\nabla\left\{\tfrac{1}{2}\|r\|_2^2\right\}h = -h^T X_1^T r = -h^T\left(X_1^T X_1 h + \mu\theta_\psi\right) = -h^T X_1^T X_1 h < 0,\qquad h \ne 0.$$
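A minimal sketch of this descent step, assuming the selected columns $X_1$, a feasible $\beta_\psi$ with $\theta_\psi^T\beta_\psi = \kappa$ and the signs $\theta_\psi$ are given: it solves the equality-constrained least squares problem through its KKT system and checks that the step $h$ stays on the constraint and reduces the objective. The data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, kappa = 30, 4, 2.0
X1 = rng.standard_normal((n, m))
y = rng.standard_normal(n)

theta = np.array([1.0, 1.0, -1.0, 1.0])
beta = theta * (kappa / m)                  # feasible start: theta^T beta = kappa

# KKT system for min ||y - X1 b||^2 subject to theta^T b = kappa:
#   [X1^T X1  theta] [ b ]   [X1^T y]
#   [theta^T     0 ] [mu ] = [kappa ]
K = np.block([[X1.T @ X1, theta[:, None]],
              [theta[None, :], np.zeros((1, 1))]])
rhs = np.concatenate([X1.T @ y, [kappa]])
sol = np.linalg.solve(K, rhs)
b_new, mu = sol[:m], sol[m]
h = b_new - beta

print("theta^T h =", theta @ h)             # ~ 0: the step stays on the constraint
print("old objective:", 0.5 * np.sum((y - X1 @ beta) ** 2))
print("new objective:", 0.5 * np.sum((y - X1 @ b_new) ** 2))   # smaller when h != 0
```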

11 Optimality test
By construction $\beta_1 = \beta_\psi + h$ satisfies the first necessary condition. If $\beta_1$ is feasible then say it is sign feasible. Now $u_2$ can be tested to see if the second necessary conditions are satisfied. If not, then there is an $s$ with $|(u_2)_s| > 1$. This condition triggers variable addition as follows.
1. Select an infeasible multiplier, say $(u_2)_s$, subject to the constraint that $\|B_s\|_2$ is not too small.
2. Update:
$$\psi \leftarrow \psi \cup \{s\},\qquad \beta_\psi \leftarrow \begin{bmatrix}\beta_\psi \\ \beta_s = 0\end{bmatrix},\qquad \theta_\psi \leftarrow \begin{bmatrix}\theta_\psi \\ \theta_s\end{bmatrix},\quad \theta_s = \operatorname{sgn}((u_2)_s).$$
It can be shown that this choice of $\theta_\psi$ ensures $\operatorname{sgn} h_\psi = \theta_\psi$ in the next descent step.

12 Otherwise
If the incremented $\beta$ is not sign feasible, then move to the first new zero of $\beta$ in the direction defined by $h \ne 0$, i.e. to the smallest $\gamma$ with
$$\beta_{\psi(k)} = (\beta_\psi)_k + \gamma h_k = 0,\qquad 0 < \gamma < 1.$$
Now there are two possibilities.
1. Set $\theta_k \leftarrow -\theta_k$ and recompute $h$. If the new $h$ gives a descent direction consistent with the updated sign feasibility requirement, then continue. This step is relatively cheap.
2. Else reset $\psi \leftarrow \psi \setminus \{\psi(k)\}$, reset $\beta_\psi$, $\theta_\psi$, downdate the factorization, and recompute $h$. This is the backtrack step that derails greedy algorithms.

13 Piecewise linear solution trajectory
Need a key result: the minimum of a positive definite quadratic form subject to a bound on the $\ell_1$ norm of the variables is stable, in the sense that small perturbations in the data lead to small perturbations in the minimum. Start with the necessary conditions
$$U_1\beta_\psi = c_1 - \mu w_\psi,\qquad \mu u_2 = B^T c_2 + \mu U_{12}^T w_\psi,\qquad \mu = \frac{w_\psi^T c_1 - \kappa}{w_\psi^T w_\psi}.$$
If at the initial $\kappa$ both $\mu\|u_2\|_\infty < \mu$ and $|\beta_{\psi(i)}| > 0$, $i = 1, 2, \ldots, |\psi|$, then differentiating the necessary conditions gives
$$\frac{d\mu}{d\kappa} = -\frac{1}{w_\psi^T w_\psi},\qquad U_1\frac{d\beta_\psi}{d\kappa} = \frac{1}{w_\psi^T w_\psi}\,w_\psi,\qquad \frac{d(\mu u_2)}{d\kappa} = -\frac{1}{w_\psi^T w_\psi}\,U_{12}^T w_\psi.$$

14 Solution trajectory
The right hand side of these ODEs is independent of $\kappa$, so the solution trajectory is piecewise linear. This means it is a simple and effective computation to follow the solution trajectory until the basic assumptions break down! The continuity guaranteed by the perturbation result now shows how to restart at the breakpoints. This observation is the basis for the homotopy algorithm of Osborne, Presnell, and Turlach. It proves to be remarkably efficient, computing the entire solution trajectory in little more than the cost of solving the unconstrained problem and returning significant additional information. It links to the standard least squares solution algorithm based on orthogonal factorization by using standard stepwise updating techniques.

15 Breakpoints
The solution process breaks down at values of $\kappa$ for which either $\beta_{\psi(j)} = 0$ or $(u_2)_j = \pm 1$.
If $\beta_{\psi(i)} = 0$ then $\psi^c \leftarrow \psi^c \cup \{\psi(i)\}$. The corresponding component of $u_2$ is $\theta_i$; it must move into the interior of $[-1, 1]$ from its bound as $\kappa$ increases in order to preserve solution continuity. This step deletes a variable from the selection.
If $(u_2)_j = \pm 1$ then $\psi \leftarrow \psi \cup \{\psi^c(j)\}$. The corresponding component of $\beta$ must move away from $0$ as $\kappa$ increases. The rule $\theta_\psi = \operatorname{sgn}(u_2)_j$ applies as in the descent algorithm. This step adds a variable to the solution.

16 Properties
The number of piecewise linear pieces in the homotopy trajectory is finite: if $\psi$ repeats at $\kappa_1, \kappa_2$ with $\kappa_1 < \kappa_2$, then it holds for all $\kappa$ in between by linearity. $\|r\|_2^2$ is monotone decreasing as $\kappa < \kappa_{LS}$ increases:
$$\frac{1}{2}\frac{d\|r\|_2^2}{d\kappa} = -r^T X\frac{d\beta}{d\kappa} = -\mu u^T\frac{d\beta}{d\kappa} = -\mu\,\theta_\psi^T\,\frac{1}{\|w_\psi\|_2^2}U_1^{-1}w_\psi = -\mu < 0.$$
To start, note that if a unique maximum of $|X_i^T y|$ occurs when $i = s$, then the optimal solution for $\kappa$ small enough is
$$\psi = \{s\},\qquad \mu = |X_s^T y| - \kappa\|X_s\|_2^2,\qquad \beta_s = \theta_s\kappa,\quad \theta_s = \operatorname{sgn}(X_s^T y).$$
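The starting point translates directly into code. The sketch below (a fragment, not the full homotopy) picks the variable $s$ with the largest $|X_s^T y|$, follows the first linear segment $\beta_s = \theta_s\kappa$, $\mu = |X_s^T y| - \kappa\|X_s\|_2^2$, and locates the first breakpoint, the smallest $\kappa > 0$ at which another column attains $|X_j^T r| = \mu$ and so would enter the selection. The data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

c = X.T @ y
s = int(np.argmax(np.abs(c)))               # first selected variable
theta_s = np.sign(c[s])
norm2_s = X[:, s] @ X[:, s]

# On the first segment X_j^T r(kappa) = c_j - kappa * theta_s * X_j^T X_s and
# mu(kappa) = |c_s| - kappa * ||X_s||^2; the segment ends at the smallest
# kappa > 0 with |X_j^T r(kappa)| = mu(kappa) for some j != s.
candidates = []
for j in range(p):
    if j == s:
        continue
    a = theta_s * (X[:, j] @ X[:, s])
    for sign in (+1.0, -1.0):
        denom = norm2_s - sign * a
        numer = np.abs(c[s]) - sign * c[j]
        if denom > 1e-12 and numer > 1e-12:
            candidates.append((numer / denom, j))

kappa_break, j_enter = min(candidates)
print("first variable:", s, " breakpoint kappa:", kappa_break,
      " entering variable:", j_enter)

beta = np.zeros(p)
beta[s] = theta_s * kappa_break             # sanity check at the breakpoint
r = y - X @ beta
print("|X_j^T r| =", abs(X[:, j_enter] @ r),
      " mu =", np.abs(c[s]) - kappa_break * norm2_s)
```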

17 Extensions
It turns out that what is important is that the objective should be strictly convex, have degree no more than 2, and have a continuous first derivative. Cases considered include: piecewise quadratics with $C^1$ smoothness; variable selection using the Huber M-estimator, which gives a $C^1$ combination of quadratic and linear pieces; quadratic spline approximation of log likelihood functions, where $L = \sum_i L(r_i(\beta))$ has been considered. Sorting out the pieces adds an extra level of complexity, as breakpoints occur when pieces change.
More general constraints. For example, the signed rank objective $\sum_{i=1}^p w_i|\beta|_{\tau(i)}$, where $w_i \ge 0$ and $\tau(\cdot)$ ranks the variables in increasing order of magnitude. Turlach et al. consider simultaneous selection of a common set of predictor variables for several objectives.

18 $\ell_1$ objective
A number of applications which involve polyhedral objectives and lasso-like constraints have been considered. Perhaps variable selection in quantile regression has received most attention; the $\ell_1$ lasso corresponds to the quantile parameter set to $0.5$:
$$\min_\beta \|r\|_1,\qquad \|\beta\|_1 \le \kappa.$$
Need the Lagrangian form with multiplier $\lambda$,
$$\mathcal{L}(\beta, \lambda) = \|r\|_1 + \lambda\left\{\|\beta\|_1 - \kappa\right\},$$
which is convex if $\lambda \ge 0$. The necessary conditions give
$$0 \in \partial_\beta\mathcal{L}(\beta, \lambda) = \partial_\beta\|r\|_1 + \lambda\,\partial_\beta\|\beta\|_1.$$
This is the condition for the minimum of the $\ell_1$ minimization problem ($\lambda$ fixed)
$$\min_\beta\left\{\|r\|_1 + \lambda\|\beta\|_1\right\}.$$
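For a fixed $\lambda$ the Lagrangian form is a linear program. The sketch below sets it up with scipy's linprog by splitting $\beta$ and the residual into positive and negative parts, minimising $\sum(s^+ + s^-) + \lambda\sum(b^+ + b^-)$ subject to $X(b^+ - b^-) + s^+ - s^- = y$. It illustrates the plain LP formulation being contrasted with the homotopy here, not the two-phase algorithm itself; the data and $\lambda$ are placeholders.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, p, lam = 40, 6, 0.5
X = rng.standard_normal((n, p))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

# variables z = [b_plus (p), b_minus (p), s_plus (n), s_minus (n)], all >= 0
c = np.concatenate([lam * np.ones(2 * p), np.ones(2 * n)])
A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])     # X beta + s_plus - s_minus = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")

beta = res.x[:p] - res.x[p:2 * p]
r = y - X @ beta
print("||r||_1 + lam*||beta||_1 =", np.abs(r).sum() + lam * np.abs(beta).sum())
print("linprog optimal value    =", res.fun)
print("beta =", np.round(beta, 3))
```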

19 When LP won't do!
Basic LP: no line search possibility.
$\ell_1$: structurally different; line search is important.

20 Residual zeros are non-smooth points for $\ell_1$
To follow the zeros set $\sigma = \{i : r_i = 0\}$, $\psi = \{i : \beta_i \ne 0\}$. Define the set complements by $\sigma \cup \sigma^c = \{1, 2, \ldots, n\}$, $\psi \cup \psi^c = \{1, 2, \ldots, p\}$, and permutation matrices $P_\sigma : \mathbb{R}^n \to \mathbb{R}^n$, $Q_\psi : \mathbb{R}^p \to \mathbb{R}^p$ by
$$P_\sigma r = \begin{bmatrix} r_1 \\ r_2\end{bmatrix},\quad (r_1)_i = r_{\sigma^c(i)} \ne 0,\ i = 1, \ldots, n - |\sigma|,\quad (r_2)_i = r_{\sigma(i)} = 0,\ i = 1, \ldots, |\sigma|,$$
$$Q_\psi\beta = \begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix},\quad (\beta_1)_i = \beta_{\psi(i)} \ne 0,\ i = 1, \ldots, |\psi|,\quad (\beta_2)_i = \beta_{\psi^c(i)} = 0,\ i = 1, \ldots, p - |\psi|,$$
$$P_\sigma X Q_\psi^T = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22}\end{bmatrix},\qquad P_\sigma y = \begin{bmatrix} y_1 \\ y_2\end{bmatrix}.$$

21 Necessary conditions
Have subdifferential components for the permuted system,
$$\begin{bmatrix}\theta_\sigma^T & v_\sigma^T\end{bmatrix} \in \partial\|P_\sigma r\|_1,\qquad \begin{bmatrix}\theta_\psi^T & u_\psi^T\end{bmatrix} \in \partial\|Q_\psi\beta\|_1.$$
These permit the necessary conditions to be written
$$\begin{bmatrix}\theta_\sigma^T & v_\sigma^T\end{bmatrix}\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22}\end{bmatrix} = \lambda\begin{bmatrix}\theta_\psi^T & u_\psi^T\end{bmatrix},\qquad \lambda \ge 0,$$
$$-1 \le v_i \le 1,\ i = 1, \ldots, |\sigma|,\qquad -1 \le u_i \le 1,\ i = 1, \ldots, |\psi^c|,$$
$$\theta_\sigma^T r_1 = \begin{bmatrix}\theta_\sigma^T & v_\sigma^T\end{bmatrix}P_\sigma r = \|r\|_1,\qquad \theta_\psi^T\beta_1 = \begin{bmatrix}\theta_\psi^T & u_\psi^T\end{bmatrix}Q_\psi\beta = \|\beta\|_1 = \kappa.$$

22 Structure of the homotopy The new feature of the extension of the continuation algorithm to the non-smooth case is that it involves two distinct phases. The first uses essentially the constrained form of the problem which involves κ explicitly but not the Lagrange multiplier λ, while the second uses the Lagrangian form which involves the multiplier explicitly but not the constraint bound.

23 Varying $\kappa$: first homotopy phase
Start with $\kappa = \bar\kappa > 0$, $\bar\kappa$ in an open interval, with $\beta$ determined by the conditions $r_i = 0$, $i \in \sigma$, $\|\beta\|_1 = \kappa$. This gives the conditions
$$|\sigma| = |\psi| - 1,\qquad \theta_\psi^T\beta_1 = \kappa,\qquad X_{21}\beta_1 = y_2.$$
Note $X_{21}$ has full row rank $|\sigma|$. Differentiating gives
$$\theta_\psi^T\frac{d\beta_1}{d\kappa} = 1,\qquad X_{21}\frac{d\beta_1}{d\kappa} = 0,\qquad \frac{d\beta_1}{d\kappa} = \begin{bmatrix}\theta_\psi^T \\ X_{21}\end{bmatrix}^{-1}e_1.$$
So $d\beta/d\kappa$ is constant in a neighbourhood of $\bar\kappa$. It follows that $\beta$ is piecewise linear on intervals of increase of $\kappa$.
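A sketch of the $\kappa$-phase direction computation, with the index sets $\sigma$ (zero residuals) and $\psi$ (nonzero coefficients), $|\sigma| = |\psi| - 1$, and the signs $\theta_\psi$ simply assumed known (no optimality or multiplier conditions are checked): it forms the block $X_{21}$ from the partition of slide 20 and solves the square system for $d\beta_1/d\kappa$. The concrete indices and signs are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 10, 4
X = rng.standard_normal((n, p))

psi = np.array([0, 2, 3])              # assumed nonzero-coefficient variables
sigma = np.array([1, 5])               # assumed zero-residual observations, |sigma| = |psi| - 1
theta = np.array([1.0, -1.0, 1.0])     # assumed signs of beta_1

X21 = X[np.ix_(sigma, psi)]            # block of the permuted design

# [ theta^T ]                    [ 1 ]
# [  X21    ] dbeta_1/dkappa  =  [ 0 ]
M = np.vstack([theta[None, :], X21])
e1 = np.zeros(len(psi))
e1[0] = 1.0
dbeta1 = np.linalg.solve(M, e1)

print("dbeta_1/dkappa =", dbeta1)
print("theta^T dbeta_1/dkappa =", theta @ dbeta1)   # = 1
print("X21 dbeta_1/dkappa =", X21 @ dbeta1)         # = 0: residuals in sigma stay zero
```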

24 More from the necessary conditions
These give
$$\theta_\sigma^T X_{11} + v_\sigma^T X_{21} = \lambda\theta_\psi^T.$$
Differentiating,
$$\frac{dv_\sigma^T}{d\kappa}X_{21} = \frac{d\lambda}{d\kappa}\theta_\psi^T.$$
Post-multiplying by $d\beta_1/d\kappa$ gives $d\lambda/d\kappa = 0$. Similar arguments give $dv_\sigma/d\kappa = 0$ and $du_\psi/d\kappa = 0$. Thus $\lambda$, $v_\sigma$, $u_\psi$ are constant on intervals of increase of $\kappa$.

25 Properties
$d\beta/d\kappa$ is a descent direction for minimizing $\|r\|_1$:
$$\|r\|_1'\!\left(\beta;\frac{d\beta}{d\kappa}\right) = \sup_{z\in\partial\|r\|_1}\left(-z^T X\frac{d\beta}{d\kappa}\right) = -\theta_\sigma^T X_{11}\frac{d\beta_1}{d\kappa} = -\lambda\theta_\psi^T\frac{d\beta_1}{d\kappa} = -\lambda\theta_\psi^T\begin{bmatrix}\theta_\psi^T \\ X_{21}\end{bmatrix}^{-1}e_1 = -\lambda < 0.$$
There are two possibilities for terminating the $\kappa$ step:
1. A new zero residual occurs, corresponding to row $\sigma^c(k)$ of $X_{11}$. Actions: $\sigma^c(k) \to \sigma(1)$, $v_1(\lambda_0) = \operatorname{sgn}(r_{\sigma^c(k)})$.
2. $(\beta_1)_j = 0$. Actions: $\psi(j) \to \psi^c(1)$, $u_1 = \operatorname{sgn}((\beta_1)_j(\bar\kappa))$.
The sign conditions are necessary to preserve optimality.

26 Varying $\lambda$: second homotopy phase
Have made the $\kappa$-step $\kappa_0 \le \kappa \le \kappa_1$. Update possibilities:
1. $(r_1(\kappa_1))_k = 0$:
$$X_{21} \leftarrow \begin{bmatrix}(X_{11})_k \\ X_{21}\end{bmatrix},\qquad y_2 \leftarrow \begin{bmatrix}(y_1)_k \\ y_2\end{bmatrix}.$$
2. $(\beta_1(\kappa_1))_j = 0$. Action: remove column $j$ from $X_{21}$.
Now $X_{21}$ is full rank with $|\sigma| = |\psi|$, and $X_{21}\beta_1 = y_2$ fixes both $\beta_1(\kappa_1)$ and $\kappa_1$.

27 Governing DE for the $\lambda$ step
Differentiating the necessary conditions gives
$$\begin{bmatrix} 0 & \dfrac{dv_\sigma^T}{d\lambda}\end{bmatrix}\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22}\end{bmatrix} = \begin{bmatrix}\theta_\psi^T & \dfrac{d(\lambda u_\psi^T)}{d\lambda}\end{bmatrix}.$$
Thus
$$\frac{dv_\sigma^T}{d\lambda}X_{21} = \theta_\psi^T,\qquad \frac{d(\lambda u_\psi^T)}{d\lambda} = \frac{dv_\sigma^T}{d\lambda}X_{22}.$$
It follows that $dv_\sigma/d\lambda$ and $d(\lambda u_\psi)/d\lambda$ are constant.

28 Reducing $\lambda$
The necessary conditions continue to hold as $\lambda$ is reduced while $\kappa = \kappa_1$. Two cases determine how to terminate this phase.
1. A component of $u_\psi$ is first to reach a bound (say $u_q = e_q^T u_\psi$). Then
(a) $\psi \leftarrow \psi \cup \{\psi^c(q)\}$;
(b) the increase-$\kappa$ phase recommences;
(c) the corresponding component of $\beta$ moves away from $0$ with the sign of the bound.
2. A component of $v_\sigma$ is first to reach a bound. Then
(a) remove the corresponding index from $\sigma$: $\sigma \leftarrow \sigma \setminus \{\sigma(q)\}$;
(b) commence the next $\kappa$ phase;
(c) $r_q$ moves from zero with the sign of $v_q$.

29 Results: LSQ homotopy
Table: step counts for the homotopy algorithm with the least squares objective (columns $p$, $n$, XA, XD; data sets Hald, Iowa, diabetes, housing).
Here XA steps add a variable, $|\psi| \to |\psi| + 1$, while XD steps delete a variable, $|\psi| \to |\psi| - 1$. Variable addition is much the most common action. This explains the observed efficiency. Tibshirani noted that addition is the only action when the columns of the design are orthogonal.

30 Results: $\ell_1$ homotopy
Table: step counts for the homotopy algorithm with the $\ell_1$ objective (columns $p$, $n$, SASD, SAXA, XDXA, XDSD; data sets Hald, Iowa, diabetes, housing).
The new feature here is that residual sign changes trigger points of non-differentiability. SA and SD indicate addition and deletion of entries in $\sigma$. This is where the extra work is being done, as $r$ adapts to the required sign structure. Double entries (e.g. SA followed by SD) reflect the two phases at each step of the computation.

31 SASD breakdown
This table shows that consecutive SASD phases need not complete an $\ell_1$ minimisation in the subspace defined by the non-zero $\beta$ components.
Table: example of a backtrack step, diabetes data (columns $\kappa$, variables at 0, subspace, SASD steps).

32 Variable trajectories
Figure: the homotopy algorithm illustrated on the diabetes data, plotting the coefficients $\beta_i(\kappa)$ against $\kappa$. The left panel shows the complete homotopy; the numbers on the right of this panel label the solution components. The right panel is a magnification of the initial part of the homotopy, illustrating the large number of SASD steps that are taken.

33 $\ell_1$ descent calculations
Table: $\ell_1$ descent calculations on the diabetes data (columns $\lambda$, $\ell_1$ iterations, solution zeros, variables selected).
Random initialisation is used. Ten steps is the minimum needed for each $\lambda$. The total number of iterations is 283. The descent algorithm used a secant-based line search.

34 Two-class classification problem
The idea is: given training data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$, find a rule so that, given a new $x$, a class from $\{-1, 1\}$ can be assigned. Use the $\ell_1$-norm SVM:
$$\min_{\beta_0, \beta}\ \sum_{i=1}^n\left[1 - y_i\left(\beta_0 + \sum_{j=1}^p\beta_j h_j(x_i)\right)\right]_+\qquad\text{subject to}\quad \|\beta\|_1 \le \kappa.$$
The basic algorithm applies with very minor modifications to take account of the unconstrained variable $\beta_0$. The fitted model is
$$\hat f(x) = \hat\beta_0 + \sum_{j=1}^p\hat\beta_j h_j(x),$$
and the class assignment is given by $\operatorname{sgn}\hat f(x)$.
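The $\ell_1$-norm SVM is itself an LP, and a minimal linprog sketch is given below. It uses identity basis functions $h_j(x) = x_j$ and hinge-loss slack variables; the simulated two-class data and the bound $\kappa$ are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, p, kappa = 60, 4, 2.0
Xdat = rng.standard_normal((n, p))
ylab = np.sign(Xdat @ np.array([1.5, -1.0, 0.0, 0.0]) + 0.3 * rng.standard_normal(n))

# variables z = [beta0 (free), b_plus (p), b_minus (p), xi (n)]
c = np.concatenate([[0.0], np.zeros(2 * p), np.ones(n)])      # minimise sum of hinge slacks
# hinge constraints: 1 - y_i (beta0 + x_i^T beta) <= xi_i
#   ->  -y_i*beta0 - y_i*x_i^T (b_plus - b_minus) - xi_i <= -1
A_hinge = np.hstack([-ylab[:, None],
                     -ylab[:, None] * Xdat,
                      ylab[:, None] * Xdat,
                     -np.eye(n)])
A_l1 = np.concatenate([[0.0], np.ones(2 * p), np.zeros(n)])[None, :]   # sum(b+ + b-) <= kappa
A_ub = np.vstack([A_hinge, A_l1])
b_ub = np.concatenate([-np.ones(n), [kappa]])
bounds = [(None, None)] + [(0, None)] * (2 * p + n)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta0 = res.x[0]
beta = res.x[1:1 + p] - res.x[1 + p:1 + 2 * p]
pred = np.sign(beta0 + Xdat @ beta)
print("beta =", np.round(beta, 3), " training accuracy =", np.mean(pred == ylab))
```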

35 Dantzig selector [3]
Variable selection when $p \gg n$ and a local approximate orthogonality condition, called the uniform uncertainty principle, holds. The basic form is
$$\min_\beta \|\beta\|_1,\qquad \|X^T r\|_\infty \le (1 + t^{-1})\,\sigma\sqrt{2\log p},$$
where $\sigma$ is the noise standard deviation and $t > 0$ is a parameter whose choice affects the level of confidence in the results. This is equivalent to a problem of the form
$$\min_\beta \|X^T r\|_\infty,\qquad \|\beta\|_1 \le \kappa,$$
which has similar necessary conditions with $\lambda \to 1/\lambda$, so it fits the lasso framework. It is also trying to make the standard least squares criterion small, and normal errors are assumed.
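The Dantzig selector is likewise a linear program; a minimal linprog sketch with a user-chosen tolerance delta, standing in for $(1 + t^{-1})\sigma\sqrt{2\log p}$, is given below. The data and delta are placeholder choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(8)
n, p, delta = 30, 8, 1.0                 # delta stands in for (1 + 1/t) sigma sqrt(2 log p)
X = rng.standard_normal((n, p))
y = X @ np.concatenate([[3.0, -2.0], np.zeros(p - 2)]) + 0.1 * rng.standard_normal(n)

# variables z = [b_plus (p), b_minus (p)] >= 0; minimise ||beta||_1 = sum(z)
G = X.T @ X
c = np.ones(2 * p)
A_ub = np.vstack([np.hstack([ G, -G]),   #  X^T X beta <=  X^T y + delta
                  np.hstack([-G,  G])])  # -X^T X beta <= -X^T y + delta
b_ub = np.concatenate([X.T @ y + delta, delta - X.T @ y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")

beta = res.x[:p] - res.x[p:]
print("||X^T r||_inf =", np.max(np.abs(X.T @ (y - X @ beta))), " (<= delta =", delta, ")")
print("beta =", np.round(beta, 3))
```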

36 Another LP caution
[3] suggest that an LP be used to implement the Dantzig selector. There are two ways to pose the max-norm approximation problem as an LP.
Descent:
$$\min_{h,\beta}\ h\qquad\text{subject to}\quad -he \le X\beta - y \le he.$$
Ascent:
$$\max\ \begin{bmatrix} y^T & -y^T\end{bmatrix}u,\qquad u \ge 0,\qquad \begin{bmatrix} e^T & e^T \\ X^T & -X^T\end{bmatrix}u = e_1.$$
The ascent algorithm is identical to the first algorithm of Remes. It performs well with systematic data ($p$-step second order convergence), while the descent algorithm is $O(n^2)$.
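The two formulations can be checked against one another numerically. The sketch below solves both the descent (primal) and ascent (dual) forms of the $\ell_\infty$ approximation problem with linprog and prints the matching optimal values; it only illustrates the pair of LP formulations, not the Remes exchange algorithm or its convergence behaviour, and the data are placeholders.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(9)
n, p = 25, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Descent form: min h subject to -h e <= X beta - y <= h e
#   variables z = [beta (p, free), h (>= 0)]
c_d = np.concatenate([np.zeros(p), [1.0]])
A_ub = np.vstack([np.hstack([ X, -np.ones((n, 1))]),    #   X beta - y  <= h e
                  np.hstack([-X, -np.ones((n, 1))])])   # -(X beta - y) <= h e
b_ub = np.concatenate([y, -y])
primal = linprog(c_d, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(None, None)] * p + [(0, None)], method="highs")

# Ascent form: max [y^T -y^T] u subject to u >= 0,
#   [X^T -X^T] u = 0 and [e^T e^T] u = 1
c_a = -np.concatenate([y, -y])                          # linprog minimises
A_eq = np.vstack([np.hstack([X.T, -X.T]),
                  np.ones((1, 2 * n))])
b_eq = np.concatenate([np.zeros(p), [1.0]])
dual = linprog(c_a, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")

print("descent optimum (min h):", primal.fun)
print("ascent optimum         :", -dual.fun)            # equal by LP duality
```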

37 References: quadratic objective
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389-403, 2000.
B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349-363, 2005.
S. Rosset and J. Zhu. Piecewise linear regularised solution paths. Annals of Statistics, 35(3):1012-1030, 2007.

38 References: piecewise linear objective
M. R. Osborne. Simplicial Algorithms for Minimizing Polyhedral Functions. Cambridge University Press, 2001.
J. Zhu, T. Hastie, S. Rosset, and R. Tibshirani. $\ell_1$-norm support vector machines. Advances in Neural Information Processing Systems, 16:49-56.
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313-2351, 2007.
