ELE 538B: Sparsity, Structure and Inference
Lasso: Algorithms and Extensions
Yuxin Chen, Princeton University, Spring 2017
Outline
- Proximal operators
- Proximal gradient methods for lasso and its extensions
- Nesterov's accelerated algorithm
Proximal operators
Gradient descent

$\text{minimize}_{\beta \in \mathbb{R}^p} \quad f(\beta)$

where $f(\beta)$ is convex and differentiable.

Algorithm 4.1 (Gradient descent)
for $t = 0, 1, \ldots$:
    $\beta^{t+1} = \beta^t - \mu_t \nabla f(\beta^t)$
where $\mu_t$ is the step size / learning rate.
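For concreteness, here is a minimal NumPy sketch of gradient descent applied to the smooth least-squares objective $f(\beta) = \frac{1}{2}\|X\beta - y\|^2$ (the function name and the fixed step size $1/L$ with $L = \|X\|_{\mathrm{op}}^2$ are illustrative assumptions, not part of the slides):

    import numpy as np

    def gradient_descent(X, y, num_iters=500):
        """Minimize f(beta) = 0.5 * ||X beta - y||^2 by gradient descent."""
        beta = np.zeros(X.shape[1])
        # fixed step size 1/L, where L = ||X||_op^2 is the Lipschitz constant of the gradient
        mu = 1.0 / np.linalg.norm(X, 2) ** 2
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)   # gradient of f at the current iterate
            beta = beta - mu * grad
        return beta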
A proximal point of view of GD

$\beta^{t+1} = \arg\min_\beta \Big\{ \underbrace{f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle}_{\text{linear approximation}} + \underbrace{\tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2}_{\text{proximal term}} \Big\}$

When $\mu_t$ is small, $\beta^{t+1}$ tends to stay close to $\beta^t$.
Proximal operator

If we define the proximal operator

$\mathrm{prox}_h(b) := \arg\min_\beta \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$

for any convex function $h$, then one can write

$\beta^{t+1} = \mathrm{prox}_{\mu_t f_t}(\beta^t)$

where $f_t(\beta) := f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle$.
Why consider proximal operators?

$\mathrm{prox}_h(b) := \arg\min_\beta \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$

- It is well-defined under very general conditions (including for nonsmooth convex functions)
- The operator can be evaluated efficiently for many widely used functions (in particular, regularizers)
- This abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms
Example: characteristic functions

If $h$ is the characteristic function of a convex set $C$,

$h(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$

then

$\mathrm{prox}_h(b) = \arg\min_{\beta \in C} \|\beta - b\|^2$ (Euclidean projection onto $C$)
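As an illustrative sketch (the particular choices of $C$ below, the nonnegative orthant and an $\ell_2$ ball, are assumptions made for the example), both projections have simple closed forms:

    import numpy as np

    def project_nonnegative(b):
        """Euclidean projection onto C = {beta : beta >= 0}: clip negative entries to zero."""
        return np.maximum(b, 0.0)

    def project_l2_ball(b, radius=1.0):
        """Euclidean projection onto the l2 ball of the given radius."""
        norm_b = np.linalg.norm(b)
        return b if norm_b <= radius else (radius / norm_b) * b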
Example: $\ell_1$ norm

If $h(\beta) = \|\beta\|_1$, then

$\mathrm{prox}_{\lambda h}(b) = \psi_{\mathrm{st}}(b; \lambda)$

where the soft-thresholding operator $\psi_{\mathrm{st}}(\cdot;\lambda)$ is applied in an entrywise manner:
$\psi_{\mathrm{st}}(b;\lambda)_i = \mathrm{sign}(b_i)\max\{|b_i| - \lambda,\, 0\}$.
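A minimal NumPy sketch of entrywise soft thresholding (the function name is an illustrative assumption):

    import numpy as np

    def soft_threshold(b, lam):
        """prox of lam * ||.||_1: shrink each entry toward zero by lam, clipping at zero."""
        return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

    # e.g. soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0) gives [2.0, 0.0, 0.2]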
Example: $\ell_2$ norm

$\mathrm{prox}_h(b) := \arg\min_\beta \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$

If $h(\beta) = \|\beta\|$, then

$\mathrm{prox}_{\lambda h}(b) = \Big(1 - \tfrac{\lambda}{\|b\|}\Big)_+ b$

where $a_+ := \max\{a, 0\}$. This is called block soft thresholding.
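A sketch of block soft thresholding, treating the whole vector as one block (illustrative function name):

    import numpy as np

    def block_soft_threshold(b, lam):
        """prox of lam * ||.||_2: shrink the whole vector toward the origin,
        returning zero when its norm is at most lam."""
        norm_b = np.linalg.norm(b)
        if norm_b <= lam:
            return np.zeros_like(b)
        return (1.0 - lam / norm_b) * b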
Example: log barrier

$\mathrm{prox}_h(b) := \arg\min_\beta \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$

If $h(\beta) = -\sum_{i=1}^p \log \beta_i$, then

$\big(\mathrm{prox}_{\lambda h}(b)\big)_i = \frac{b_i + \sqrt{b_i^2 + 4\lambda}}{2}$
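This is the positive root of the entrywise optimality condition $(\beta_i - b_i) - \lambda/\beta_i = 0$; a minimal sketch (illustrative function name):

    import numpy as np

    def prox_log_barrier(b, lam):
        """prox of lam * h with h(beta) = -sum_i log(beta_i):
        entrywise positive root of beta^2 - b_i * beta - lam = 0."""
        return 0.5 * (b + np.sqrt(b ** 2 + 4.0 * lam))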
Nonexpansiveness of proximal operators

Recall that when $h(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$, $\mathrm{prox}_h(\beta)$ is the Euclidean projection $\mathcal{P}_C$ onto $C$, which is nonexpansive:

$\|\mathcal{P}_C(\beta_1) - \mathcal{P}_C(\beta_2)\| \le \|\beta_1 - \beta_2\|$
Nonexpansiveness of proximal operators

Nonexpansiveness is a property of the general operator $\mathrm{prox}_h(\cdot)$.

Fact 4.1 (Nonexpansiveness)
$\|\mathrm{prox}_h(\beta_1) - \mathrm{prox}_h(\beta_2)\| \le \|\beta_1 - \beta_2\|$

In some sense, the proximal operator behaves like a projection.
Proof of nonexpansiveness

Let $z_1 = \mathrm{prox}_h(\beta_1)$ and $z_2 = \mathrm{prox}_h(\beta_2)$. The subgradient characterizations of $z_1$ and $z_2$ read

$\beta_1 - z_1 \in \partial h(z_1)$ and $\beta_2 - z_2 \in \partial h(z_2)$.

The claim would follow (together with Cauchy-Schwarz) if

$(\beta_1 - \beta_2)^\top (z_1 - z_2) \ge \|z_1 - z_2\|^2 \iff (\beta_1 - z_1 - \beta_2 + z_2)^\top (z_1 - z_2) \ge 0$,

which is obtained by adding the two subgradient inequalities

$h(z_2) \ge h(z_1) + \langle \beta_1 - z_1,\, z_2 - z_1 \rangle$ (since $\beta_1 - z_1 \in \partial h(z_1)$)
$h(z_1) \ge h(z_2) + \langle \beta_2 - z_2,\, z_1 - z_2 \rangle$ (since $\beta_2 - z_2 \in \partial h(z_2)$).
Proximal gradient methods
Optimizing composite functions

(lasso) $\quad \text{minimize}_{\beta \in \mathbb{R}^p} \;\; \underbrace{\tfrac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{:=g(\beta)} \;=\; f(\beta) + g(\beta)$

where $f(\beta)$ is differentiable and $g(\beta)$ is non-smooth.

Since $g(\beta)$ is non-differentiable, we cannot run vanilla gradient descent.
Proximal gradient methods

One strategy: replace $f(\beta)$ with its linear approximation, and compute the proximal solution

$\beta^{t+1} = \arg\min_\beta \Big\{ f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle + g(\beta) + \tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2 \Big\}$

The optimality condition reads

$0 \in \nabla f(\beta^t) + \partial g(\beta^{t+1}) + \tfrac{1}{\mu_t}\big(\beta^{t+1} - \beta^t\big)$

which is equivalent to the optimality condition of

$\beta^{t+1} = \arg\min_\beta \Big\{ g(\beta) + \tfrac{1}{2\mu_t}\big\|\beta - \big(\beta^t - \mu_t \nabla f(\beta^t)\big)\big\|^2 \Big\} = \mathrm{prox}_{\mu_t g}\big(\beta^t - \mu_t \nabla f(\beta^t)\big)$
Proximal gradient methods

Alternate between gradient updates on $f$ and proximal minimization on $g$.

Algorithm 4.2 (Proximal gradient methods)
for $t = 0, 1, \ldots$:
    $\beta^{t+1} = \mathrm{prox}_{\mu_t g}\big(\beta^t - \mu_t \nabla f(\beta^t)\big)$
where $\mu_t$ is the step size / learning rate.
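A minimal sketch of the generic iteration (the interface, passing the gradient and the proximal operator as callables, is an illustrative assumption):

    def proximal_gradient(grad_f, prox_g, beta0, mu, num_iters=500):
        """Generic proximal gradient: beta <- prox_{mu*g}(beta - mu * grad_f(beta)).
        grad_f(beta) returns the gradient of the smooth part f;
        prox_g(b, s) returns prox_{s*g}(b)."""
        beta = beta0.copy()
        for _ in range(num_iters):
            beta = prox_g(beta - mu * grad_f(beta), mu)
        return beta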
Projected gradient methods

When $g(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$ (with $C$ convex) is a characteristic function:

$\beta^{t+1} = \mathcal{P}_C\big(\beta^t - \mu_t \nabla f(\beta^t)\big) := \arg\min_{\beta \in C} \big\|\beta - \big(\beta^t - \mu_t \nabla f(\beta^t)\big)\big\|$

This is a first-order method to solve the constrained optimization

$\text{minimize}_\beta \;\; f(\beta) \quad \text{s.t.} \;\; \beta \in C$
Proximal gradient methods for lasso

For lasso: $f(\beta) = \tfrac{1}{2}\|X\beta - y\|^2$ and $g(\beta) = \lambda\|\beta\|_1$, so

$\mathrm{prox}_g(\beta) = \arg\min_b \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + \lambda\|b\|_1 \Big\} = \psi_{\mathrm{st}}(\beta; \lambda)$

$\Longrightarrow \quad \beta^{t+1} = \psi_{\mathrm{st}}\big(\beta^t - \mu_t X^\top(X\beta^t - y);\; \mu_t\lambda\big)$ (iterative soft thresholding)
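A minimal NumPy sketch of this iterative soft-thresholding scheme (the fixed step size $1/L$ with $L = \|X\|_{\mathrm{op}}^2$ is an assumption consistent with the step-size condition in Theorem 4.2; the function name is illustrative):

    import numpy as np

    def soft_threshold(b, lam):
        return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

    def ista(X, y, lam, num_iters=500):
        """Proximal gradient (iterative soft thresholding) for
        0.5 * ||X beta - y||^2 + lam * ||beta||_1."""
        beta = np.zeros(X.shape[1])
        mu = 1.0 / np.linalg.norm(X, 2) ** 2   # step size mu = 1/L
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)
            beta = soft_threshold(beta - mu * grad, mu * lam)
        return beta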
Proximal gradient methods for group lasso

Sometimes variables have a natural group structure, and it is desirable to set all variables within a group to zero (or nonzero) simultaneously:

(group lasso) $\quad \underbrace{\tfrac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \underbrace{\lambda \sum_{j=1}^{k} \|\beta_j\|}_{:=g(\beta)}$

where $\beta_j \in \mathbb{R}^{p/k}$ and $\beta = [\beta_1^\top, \ldots, \beta_k^\top]^\top$.

$\mathrm{prox}_g(\beta) = \psi_{\mathrm{bst}}(\beta; \lambda) := \Big[ \big(1 - \tfrac{\lambda}{\|\beta_j\|}\big)_+ \beta_j \Big]_{1 \le j \le k}$

$\Longrightarrow \quad \beta^{t+1} = \psi_{\mathrm{bst}}\big(\beta^t - \mu_t X^\top(X\beta^t - y);\; \mu_t\lambda\big)$
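A sketch of the groupwise update (it assumes, as on the slide, $k$ equally sized groups so that $p$ is divisible by $k$; function names and the step size $1/L$ are illustrative choices):

    import numpy as np

    def block_soft_threshold_groups(b, lam, k):
        """Apply (1 - lam/||b_j||)_+ b_j to each of k equally sized blocks of b."""
        out = np.zeros_like(b)
        for block in np.split(np.arange(b.size), k):
            norm_j = np.linalg.norm(b[block])
            if norm_j > lam:
                out[block] = (1.0 - lam / norm_j) * b[block]
        return out

    def group_lasso_proximal_gradient(X, y, lam, k, num_iters=500):
        """Proximal gradient for the group lasso with k equal-size groups."""
        beta = np.zeros(X.shape[1])
        mu = 1.0 / np.linalg.norm(X, 2) ** 2   # step size mu = 1/L
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)
            beta = block_soft_threshold_groups(beta - mu * grad, mu * lam, k)
        return beta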
Proximal gradient methods for elastic net

Lasso does not handle highly correlated variables well: if there is a group of highly correlated variables, lasso often picks one from the group and ignores the rest. Sometimes we make a compromise between the lasso and $\ell_2$ penalties:

(elastic net) $\quad \underbrace{\tfrac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \lambda \underbrace{\big\{ \|\beta\|_1 + (\gamma/2)\|\beta\|_2^2 \big\}}_{:=g(\beta)}$

$\mathrm{prox}_{\lambda g}(\beta) = \frac{1}{1 + \lambda\gamma}\, \psi_{\mathrm{st}}(\beta; \lambda)$

$\Longrightarrow \quad \beta^{t+1} = \frac{1}{1 + \mu_t\lambda\gamma}\, \psi_{\mathrm{st}}\big(\beta^t - \mu_t X^\top(X\beta^t - y);\; \mu_t\lambda\big)$

i.e. soft thresholding followed by multiplicative shrinkage.
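A sketch of the elastic-net proximal step and the resulting iteration (illustrative function names; the step size $1/L$ is an assumption):

    import numpy as np

    def prox_elastic_net(b, lam, gamma):
        """prox of lam * (||.||_1 + (gamma/2) * ||.||_2^2):
        soft thresholding at lam, then multiplicative shrinkage by 1/(1 + lam*gamma)."""
        soft = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
        return soft / (1.0 + lam * gamma)

    def elastic_net_proximal_gradient(X, y, lam, gamma, num_iters=500):
        beta = np.zeros(X.shape[1])
        mu = 1.0 / np.linalg.norm(X, 2) ** 2   # step size mu = 1/L
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)
            beta = prox_elastic_net(beta - mu * grad, mu * lam, gamma)
        return beta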
Interpretation: majorization-minimization

$f_{\mu_t}(\beta, \beta^t) := \underbrace{f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle}_{\text{linearization}} + \underbrace{\tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2}_{\text{trust region penalty}}$

majorizes $f(\beta)$ if $0 < \mu_t < \tfrac{1}{L}$, where $L$ is the Lipschitz constant¹ of $\nabla f(\cdot)$.

Proximal gradient descent is a majorization-minimization algorithm:

$\beta^{t+1} = \arg\min_\beta \big\{ \underbrace{f_{\mu_t}(\beta, \beta^t)}_{\text{majorization}} + g(\beta) \big\}$ (minimization)

¹ This means $\|\nabla f(\beta) - \nabla f(b)\| \le L\|\beta - b\|$ for all $\beta$ and $b$.
Convergence rate of proximal gradient methods

Theorem 4.2 (fixed step size; Nesterov '07)
Suppose $g$ is convex, and $f$ is differentiable and convex with $L$-Lipschitz gradient. If $\mu_t \equiv \mu \in (0, 1/L)$, then

$f(\beta^t) + g(\beta^t) - \min_\beta \big\{ f(\beta) + g(\beta) \big\} \le O\Big(\frac{1}{t}\Big)$

- The step size requires an upper bound on $L$
- One may prefer backtracking line search to a fixed step size

Question: can we further improve the convergence rate?
Nesterov's accelerated gradient methods
Nesterov's accelerated method

- Problem of gradient descent: zigzagging
- Nesterov's idea: include a momentum term to avoid overshooting
Nesterov's accelerated method

Nesterov's idea: include a momentum term to avoid overshooting

$\beta^t = \mathrm{prox}_{\mu_t g}\big(b^{t-1} - \mu_t \nabla f(b^{t-1})\big)$
$b^t = \beta^t + \underbrace{\alpha_t\big(\beta^t - \beta^{t-1}\big)}_{\text{momentum term}}$ (extrapolation)

- A simple (but mysterious) choice of extrapolation parameter: $\alpha_t = \frac{t-1}{t+2}$
- Fixed step size $\mu_t \equiv \mu \in (0, 1/L)$ or backtracking line search
- Same computational cost per iteration as proximal gradient
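A minimal NumPy sketch of this accelerated scheme applied to the lasso (FISTA-style; the fixed step size $1/L$ and the function name are assumptions, while the extrapolation parameter follows the choice $\alpha_t = \frac{t-1}{t+2}$ above):

    import numpy as np

    def soft_threshold(b, lam):
        return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

    def accelerated_lasso(X, y, lam, num_iters=500):
        """Nesterov-accelerated proximal gradient for 0.5*||X beta - y||^2 + lam*||beta||_1."""
        p = X.shape[1]
        beta = beta_prev = np.zeros(p)
        b = np.zeros(p)                          # extrapolated point
        mu = 1.0 / np.linalg.norm(X, 2) ** 2     # fixed step size mu = 1/L
        for t in range(1, num_iters + 1):
            grad = X.T @ (X @ b - y)                        # gradient at the extrapolated point
            beta = soft_threshold(b - mu * grad, mu * lam)  # proximal (soft-thresholding) step
            alpha = (t - 1) / (t + 2)                       # extrapolation parameter
            b = beta + alpha * (beta - beta_prev)           # momentum / extrapolation
            beta_prev = beta
        return beta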
Convergence rate of Nesterov's accelerated method

Theorem 4.3 (Nesterov '83, Nesterov '07)
Suppose $f$ is differentiable and convex and $g$ is convex. If one takes $\alpha_t = \frac{t-1}{t+2}$ and a fixed step size $\mu_t \equiv \mu \in (0, 1/L)$, then

$f(\beta^t) + g(\beta^t) - \min_\beta \big\{ f(\beta) + g(\beta) \big\} \le O\Big(\frac{1}{t^2}\Big)$

In general, this rate cannot be improved if one only uses gradient information!
Numerical experiments (for lasso)

Figure credit: Hastie, Tibshirani, & Wainwright '15
Reference

[1] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[2] Convex optimization algorithms, D. Bertsekas, Athena Scientific, 2015.
[3] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[4] Statistical learning with sparsity: the lasso and generalizations, T. Hastie, R. Tibshirani, and M. Wainwright, 2015.
[5] Model selection and estimation in regression with grouped variables, M. Yuan and Y. Lin, Journal of the Royal Statistical Society, 2006.
[6] A method of solving a convex programming problem with convergence rate O(1/k^2), Y. Nesterov, Soviet Mathematics Doklady, 1983.
[7] Gradient methods for minimizing composite functions, Y. Nesterov, Technical Report, 2007.
[8] A fast iterative shrinkage-thresholding algorithm for linear inverse problems, A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.