Master 2 MathBigData
S. Gaïffas (CMAP, Ecole Polytechnique)
November 3, 2014
Outline
1 Supervised learning recap: introduction; loss functions, linearity
2 Penalization: introduction; Ridge; sparsity; Lasso
3 Some tools from convex optimization: quick recap; proximal operators; subdifferential, Fenchel conjugate
4 ISTA and FISTA: the general problem; gradient descent; ISTA; FISTA; linesearch
5 Duality gap: Fenchel duality; duality gap
Supervised learning: Setting

Data: $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$ for $i = 1, \ldots, n$
- $x_i$ is an input and $y_i$ is an output
- the $x_i$ are called features, and $x_i \in \mathcal{X} = \mathbb{R}^d$
- the $y_i$ are called labels, with
  - $\mathcal{Y} = \{-1, 1\}$ or $\mathcal{Y} = \{0, 1\}$ for binary classification
  - $\mathcal{Y} = \{1, \ldots, K\}$ for multiclass classification
  - $\mathcal{Y} = \mathbb{R}$ for regression

Goal: given a new $x$, predict $y$.
Supervised learning: Loss functions, linearity

What to do: minimize with respect to $f : \mathbb{R}^d \to \mathbb{R}$ the quantity
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)),$$
where
- $\ell$ is a loss function: $\ell(y_i, f(x_i))$ small means that $y_i$ is close to $f(x_i)$
- $R_n(f)$ is called the goodness-of-fit or empirical risk
- computing $\hat f$ is called the training or estimation step
Supervised learning: Loss functions, linearity

Hence:
- when $d$ is large, it is impossible to fit a complex function $f$ on the data
- when $n$ is large, training is too time-consuming for a complex function $f$

Choose a linear function $f$:
$$f(x) = \langle x, \theta \rangle = \sum_{j=1}^d x_j \theta_j,$$
for some parameter vector $\theta \in \mathbb{R}^d$ to be trained.

Remark: $f$ is linear with respect to $x$, but you can choose the features $x_i$ based on the data. Hence, the model need not be linear with respect to the original features.
Supervised learning: Loss functions, linearity

Training the model: compute
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} R_n(\theta), \quad \text{where} \quad R_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle).$$

Classical losses:
- $\ell(y, z) = \frac{1}{2}(y - z)^2$: least-squares loss, linear regression (label $y \in \mathbb{R}$)
- $\ell(y, z) = (1 - yz)_+$: hinge loss, or SVM loss (binary classification, label $y \in \{-1, 1\}$)
- $\ell(y, z) = \log(1 + e^{-yz})$: logistic loss (binary classification, label $y \in \{-1, 1\}$)
Supervised learning Loss functions, linearity l least sq (y, z) = 1 2 (y z)2 l hinge (y, z) = (1 yz) + l logistic (y, z) = log(1 + e yz )
Penalization: Introduction

You should never actually fit a model by minimizing only
$$\hat\theta_n \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle).$$
You should minimize instead
$$\hat\theta_n \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda\, \mathrm{pen}(\theta) \Big\},$$
where
- pen is a penalization function that encodes a prior assumption on $\theta$: it forbids $\theta$ from being too complex
- $\lambda > 0$ is a tuning or smoothing parameter that balances goodness-of-fit and penalization
Penalization: Introduction

Why use penalization?
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda\, \mathrm{pen}(\theta) \Big\}$$
Penalization, for a well-chosen $\lambda > 0$, makes it possible to avoid overfitting.
Penalization: Ridge

The most classical penalization is the Ridge penalization
$$\mathrm{pen}(\theta) = \|\theta\|_2^2 = \sum_{j=1}^d \theta_j^2.$$
It penalizes the energy of $\theta$, measured by the squared $\ell_2$-norm.

Sparsity-inducing penalization: it would be nice to find a model where $\hat\theta_j = 0$ for many coordinates $j$, so that
- few features are useful for prediction
- the model is simpler, with a smaller dimension
We say that such a $\hat\theta$ is sparse. How to do it?
Penalization: Sparsity

It is tempting to use
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda \|\theta\|_0 \Big\},$$
where $\|\theta\|_0 = \#\{j : \theta_j \neq 0\}$. But, to solve this exactly, you need to try all possible subsets of non-zero coordinates of $\theta$: $2^d$ possibilities. Impossible!
Penalization: Lasso

A solution: the Lasso penalization (least absolute shrinkage and selection operator)
$$\mathrm{pen}(\theta) = \|\theta\|_1 = \sum_{j=1}^d |\theta_j|.$$
This is penalization based on the $\ell_1$-norm.
- In a noiseless setting [compressed sensing, basis pursuit], in a certain regime, $\ell_1$-minimization gives the same solution as $\ell_0$-minimization
- And the Lasso-penalized problem is easy to compute
Why does $\ell_1$-penalization lead to sparsity?
Penalization: Lasso

Why does $\ell_2$ (Ridge) not induce sparsity?
Penalization: Lasso

Hence, a minimizer
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda \|\theta\|_1 \Big\}$$
is typically sparse ($\hat\theta_j = 0$ for many $j$):
- for $\lambda$ large (larger than some constant), $\hat\theta_j = 0$ for all $j$
- for $\lambda = 0$, there is no penalization
Between the two, the sparsity depends on the value of $\lambda$: once again, it is a regularization or penalization parameter.
Penalization: Lasso

For the least-squares loss,
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \frac{\lambda}{2} \|\theta\|_2^2 \Big\}$$
is called Ridge linear regression, and
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1 \Big\}$$
is called Lasso linear regression.
Penalization: Lasso

Consider the one-dimensional minimization problem, for $\lambda > 0$ and $b \in \mathbb{R}$:
$$\min_{a \in \mathbb{R}} \frac{1}{2}(a - b)^2 + \lambda |a|,$$
and let $a^\star$ be its solution.
- Derivative at $0^+$: $d^+ = \lambda - b$
- Derivative at $0^-$: $d^- = -\lambda - b$
Hence:
- $a^\star = 0$ iff $d^+ \geq 0$ and $d^- \leq 0$, namely $|b| \leq \lambda$
- $a^\star \geq 0$ iff $d^+ \leq 0$, namely $b \geq \lambda$, and then $a^\star = b - \lambda$
- $a^\star \leq 0$ iff $d^- \geq 0$, namely $b \leq -\lambda$, and then $a^\star = b + \lambda$
So
$$a^\star = \operatorname{sign}(b)(|b| - \lambda)_+,$$
where $a_+ = \max(0, a)$.
Penalization: Lasso

As a consequence, applying this coordinate by coordinate, we have
$$a^\star = \operatorname*{argmin}_{a \in \mathbb{R}^d} \frac{1}{2} \|a - b\|_2^2 + \lambda \|a\|_1 = S_\lambda(b),$$
where
$$S_\lambda(b) = \operatorname{sign}(b)(|b| - \lambda)_+$$
(applied entrywise) is the soft-thresholding operator.
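This translates directly into code; a minimal NumPy sketch (names are ours), with a quick check against the scalar derivation above:

```python
import numpy as np

def soft_thresholding(b, lam):
    # S_lam(b) = sign(b) * (|b| - lam)_+, applied entrywise
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# b = 3 gives b - lam = 2; |b| <= lam gives 0; b = -2 gives b + lam = -1
print(soft_thresholding(np.array([3.0, -0.5, -2.0]), 1.0))  # [ 2.  0. -1.]
```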
Quick recap

$f : \mathbb{R}^d \to [-\infty, +\infty]$ is
- convex if $f(tx + (1 - t)x') \leq t f(x) + (1 - t) f(x')$ for any $x, x' \in \mathbb{R}^d$, $t \in [0, 1]$
- proper if it never takes the value $-\infty$ and is not equal to $+\infty$ everywhere [note that a proper function is valued in $(-\infty, +\infty]$]
- lower-semicontinuous (l.s.c.) if and only if for any $x$ and any sequence $x_n \to x$ we have $f(x) \leq \liminf_n f(x_n)$

The set of convex, proper and l.s.c. functions is often denoted $\Gamma_0(\mathbb{R}^d)$ or $\Gamma_0$.
Proximal operator

For any $g$ convex, l.s.c. and any $y \in \mathbb{R}^d$, we define the proximal operator
$$\operatorname{prox}_g(y) = \operatorname*{argmin}_{x \in \mathbb{R}^d} \Big\{ \frac{1}{2} \|x - y\|_2^2 + g(x) \Big\}$$
(a strongly convex problem, hence a unique minimizer).

We have seen that soft-thresholding is the proximal operator of the $\ell_1$-norm:
$$\operatorname{prox}_{\lambda \|\cdot\|_1}(y) = S_\lambda(y) = \operatorname{sign}(y)(|y| - \lambda)_+.$$
Proximal operators and proximal algorithms are now fundamental tools for optimization in machine learning.
Examples of proximal operators
- If $g(x) = c$ for a constant $c$, then $\operatorname{prox}_g = \mathrm{Id}$
- If $C$ is a convex set and $g(x) = \delta_C(x)$, where $\delta_C(x) = 0$ if $x \in C$ and $+\infty$ if $x \notin C$, then $\operatorname{prox}_g = \operatorname{proj}_C$, the projection onto $C$
- If $g(x) = \langle b, x \rangle + c$, then $\operatorname{prox}_{\lambda g}(x) = x - \lambda b$
- If $g(x) = \frac{1}{2} x^\top A x + \langle b, x \rangle + c$ with $A$ symmetric positive semi-definite, then $\operatorname{prox}_{\lambda g}(x) = (I + \lambda A)^{-1}(x - \lambda b)$
Examples of proximal operators
- If $g(x) = \frac{1}{2} \|x\|_2^2$, then $\operatorname{prox}_{\lambda g}(x) = \frac{1}{1 + \lambda}\, x$: the shrinkage operator
- If $g(x) = -\log x$, then $\operatorname{prox}_{\lambda g}(x) = \frac{x + \sqrt{x^2 + 4\lambda}}{2}$
- If $g(x) = \|x\|_2$, then $\operatorname{prox}_{\lambda g}(x) = \Big(1 - \frac{\lambda}{\|x\|_2}\Big)_+ x$: the block soft-thresholding operator
Examples of proximal operators
- If $g(x) = \|x\|_1 + \frac{\gamma}{2} \|x\|_2^2$ (elastic-net) with $\gamma > 0$, then $\operatorname{prox}_{\lambda g}(x) = \frac{1}{1 + \lambda\gamma} \operatorname{prox}_{\lambda \|\cdot\|_1}(x)$
- If $g(x) = \sum_{g \in \mathcal{G}} \|x_g\|_2$, where $\mathcal{G}$ is a partition of $\{1, \ldots, d\}$, then
$$(\operatorname{prox}_{\lambda g}(x))_g = \Big(1 - \frac{\lambda}{\|x_g\|_2}\Big)_+ x_g \quad \text{for } g \in \mathcal{G}:$$
block soft-thresholding, used for the group-Lasso
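A minimal NumPy sketch of some of these operators (names are ours); each function computes $\operatorname{prox}_{\lambda g}(x)$ for the corresponding $g$:

```python
import numpy as np

def prox_squared_l2(x, lam):
    # g(x) = 0.5 * ||x||_2^2: shrinkage
    return x / (1.0 + lam)

def prox_l1(x, lam):
    # g(x) = ||x||_1: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l2_norm(x, lam):
    # g(x) = ||x||_2: block soft-thresholding
    norm = np.linalg.norm(x)
    return max(0.0, 1.0 - lam / norm) * x if norm > 0 else x

def prox_elastic_net(x, lam, gamma):
    # g(x) = ||x||_1 + (gamma / 2) * ||x||_2^2
    return prox_l1(x, lam) / (1.0 + lam * gamma)
```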
Subdifferential, Fenchel conjugate

The subdifferential of $f \in \Gamma_0$ at $x$ is the set
$$\partial f(x) = \big\{ g \in \mathbb{R}^d : f(y) \geq \langle g, y - x \rangle + f(x) \text{ for all } y \in \mathbb{R}^d \big\}.$$
Each of its elements is called a subgradient.
- Optimality criterion: $0 \in \partial f(x)$ iff $f(x) \leq f(y)$ for all $y$
- If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$
- Example: $\partial |\cdot|(0) = [-1, 1]$
Fenchel conjugate

The Fenchel conjugate of a function $f$ on $\mathbb{R}^d$ is given by
$$f^*(x) = \sup_{y \in \mathbb{R}^d} \big\{ \langle x, y \rangle - f(y) \big\}.$$
- $f^*$ is always a convex function (as a supremum of continuous, linear functions)
- $f^*(x)$ is the smallest constant $c$ such that the affine function $y \mapsto \langle y, x \rangle - c$ is below $f$
- Fenchel-Young inequality: we have $f(x) + f^*(y) \geq \langle x, y \rangle$ for any $x$ and $y$
Fenchel conjugate and subgradients

Legendre-Fenchel identity: if $f \in \Gamma_0$, we have
$$\langle x, y \rangle = f(x) + f^*(y) \iff y \in \partial f(x) \iff x \in \partial f^*(y).$$
Example: the Fenchel conjugate of a norm $\|\cdot\|$ is
$$\|\cdot\|^*(x) = \sup_{y \in \mathbb{R}^d} \big\{ \langle x, y \rangle - \|y\| \big\} = \delta_{\{x' \in \mathbb{R}^d : \|x'\|_* \leq 1\}}(x),$$
where
$$\|x\|_* = \max_{y : \|y\| \leq 1} \langle x, y \rangle$$
is the dual norm of $\|\cdot\|$ [recall that the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$ with $1/p + 1/q = 1$].
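As a quick sanity check of the $p = 1$, $q = \infty$ case (a worked example, not in the slides): for $f = \|\cdot\|_1$,
$$f^*(x) = \sup_{y} \big\{ \langle x, y \rangle - \|y\|_1 \big\}.$$
If $\|x\|_\infty \leq 1$, then $\langle x, y \rangle \leq \|x\|_\infty \|y\|_1 \leq \|y\|_1$, so the supremum is $0$ (attained at $y = 0$). If $|x_j| > 1$ for some $j$, taking $y = t \operatorname{sign}(x_j) e_j$ and letting $t \to +\infty$ gives $+\infty$. Hence $\|\cdot\|_1^* = \delta_{\{x : \|x\|_\infty \leq 1\}}$, as predicted with $\|\cdot\|_* = \|\cdot\|_\infty$.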
Some extras

- $f : \mathbb{R}^d \to [-\infty, +\infty]$ is $L$-smooth if it is continuously differentiable and $\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2$ for any $x, y \in \mathbb{R}^d$. When $f$ is twice continuously differentiable, this is equivalent to $H_f(x) \preceq L I_d$ for all $x$, where $H_f(x)$ is the Hessian at $x$ [i.e. $L I_d - H_f(x)$ is positive semi-definite]
- $f$ is $\mu$-strongly convex if $f(\cdot) - \frac{\mu}{2} \|\cdot\|_2^2$ is convex. Equivalent to $f(y) \geq f(x) + \langle g, y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2$ for any $g \in \partial f(x)$. Equivalent to $H_f(x) \succeq \mu I_d$ when $f$ is twice differentiable
- $f$ is $\mu$-strongly convex iff $f^*$ is $\frac{1}{\mu}$-smooth
The general problem we want to solve

How to solve
$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda\, \mathrm{pen}(\theta) \Big\}\ ???$$
Put for short
$$f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) \quad \text{and} \quad g(\theta) = \lambda\, \mathrm{pen}(\theta).$$
Assume that
- $f$ is convex and $L$-smooth
- $g$ is convex and continuous, but possibly non-smooth (for instance the $\ell_1$-penalization)
- $g$ is prox-capable: its proximal operator is not hard to compute
Examples

Smoothness of $f$:
- Least squares: $\nabla f(\theta) = \frac{1}{n} X^\top (X\theta - Y)$, with $L = \frac{\|X^\top X\|_{\mathrm{op}}}{n}$
- Logistic loss: $\nabla f(\theta) = -\frac{1}{n} \sum_{i=1}^n \frac{y_i}{1 + e^{y_i \langle x_i, \theta \rangle}}\, x_i$, with $L = \frac{\max_{i=1,\ldots,n} \|x_i\|_2^2}{4n}$

Prox-capability of $g$: we gave the explicit prox for many penalizations above.
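A minimal NumPy sketch of these gradients and the least-squares Lipschitz constant (names are ours), for $f(\theta) = \frac{1}{n} \sum_i \ell(y_i, \langle x_i, \theta \rangle)$ with labels $y_i \in \{-1, +1\}$ in the logistic case:

```python
import numpy as np

def grad_least_squares(theta, X, y):
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def lipschitz_least_squares(X):
    # L = ||X^T X||_op / n (operator norm = largest singular value)
    return np.linalg.norm(X.T @ X, 2) / X.shape[0]

def grad_logistic(theta, X, y):
    n = X.shape[0]
    z = y * (X @ theta)
    return -X.T @ (y / (1.0 + np.exp(z))) / n
```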
Gradient descent

Now, how do I minimize $f + g$? Key point: the descent lemma. If $f$ is convex and $L$-smooth, then for any $L' \geq L$:
$$f(\theta') \leq f(\theta) + \langle \nabla f(\theta), \theta' - \theta \rangle + \frac{L'}{2} \|\theta' - \theta\|_2^2$$
for any $\theta, \theta' \in \mathbb{R}^d$.

At iteration $k$, the current point is $\theta^k$. Use the descent lemma:
$$f(\theta) \leq f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2.$$
Gradient descent

Remark that
$$\operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2 \Big\} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\| \theta - \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big) \Big\|_2^2.$$
Hence, choose
$$\theta^{k+1} = \theta^k - \frac{1}{L} \nabla f(\theta^k).$$
This is the basic gradient descent algorithm [cf. previous lecture]. Gradient descent is based on a majorization-minimization principle, with a quadratic majorant given by the descent lemma. But we forgot about $g$...
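A minimal sketch of this update rule (names are ours; grad_f and L are assumed to be supplied by the caller, e.g. as in the examples above):

```python
import numpy as np

def gradient_descent(theta0, grad_f, L, n_iter=100):
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = theta - grad_f(theta) / L  # step size 1/L from the descent lemma
    return theta
```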
ISTA

Let's put back $g$:
$$f(\theta) + g(\theta) \leq f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2 + g(\theta),$$
and again
$$\begin{aligned}
&\operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2 + g(\theta) \Big\} \\
&\quad = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{L}{2} \Big\| \theta - \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big) \Big\|_2^2 + g(\theta) \Big\} \\
&\quad = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2} \Big\| \theta - \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big) \Big\|_2^2 + \frac{1}{L} g(\theta) \Big\} \\
&\quad = \operatorname{prox}_{g/L} \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big).
\end{aligned}$$
The prox operator naturally appears because of the descent lemma.
ISTA

Proximal gradient descent algorithm [also called ISTA]
- Input: starting point $\theta^0$, Lipschitz constant $L > 0$ for $\nabla f$
- For $k = 1, 2, \ldots$ until converged, do
$$\theta^k = \operatorname{prox}_{g/L} \Big( \theta^{k-1} - \frac{1}{L} \nabla f(\theta^{k-1}) \Big)$$
- Return the last $\theta^k$

Also called Forward-Backward splitting. For the Lasso with least-squares loss, the iteration is
$$\theta^k = S_{\lambda/L} \Big( \theta^{k-1} - \frac{1}{L} (X^\top X \theta^{k-1} - X^\top Y) \Big),$$
where $S_\lambda$ is the soft-thresholding operator.
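A minimal NumPy sketch of this Lasso iteration (names are ours), matching the scaling above, i.e. $f(\theta) = \frac{1}{2} \|Y - X\theta\|_2^2$ and $L = \|X^\top X\|_{\mathrm{op}}$:

```python
import numpy as np

def soft_thresholding(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ista_lasso(X, y, lam, n_iter=200):
    # minimizes 0.5 * ||y - X theta||_2^2 + lam * ||theta||_1
    L = np.linalg.norm(X.T @ X, 2)  # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = soft_thresholding(theta - grad / L, lam / L)
    return theta
```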
ISTA

Put for short $F = f + g$, and take any $\theta^\star \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} F(\theta)$.

Theorem (Beck and Teboulle (2009)). If the sequence $\{\theta^k\}$ is generated by ISTA, then
$$F(\theta^k) - F(\theta^\star) \leq \frac{L \|\theta^0 - \theta^\star\|_2^2}{2k}.$$
The convergence rate is $O(1/k)$. Is it possible to improve the $O(1/k)$ rate?
FISTA

Yes! Using accelerated proximal gradient descent (called FISTA; Nesterov '83, '04, Beck and Teboulle '09). Idea: to find $\theta^{k+1}$, use an interpolation between $\theta^k$ and $\theta^{k-1}$.

Accelerated proximal gradient descent algorithm [FISTA]
- Input: starting points $z^1 = \theta^0$, Lipschitz constant $L > 0$ for $\nabla f$, $t_1 = 1$
- For $k = 1, 2, \ldots$ until converged, do
  - $\theta^k = \operatorname{prox}_{g/L} \big( z^k - \frac{1}{L} \nabla f(z^k) \big)$
  - $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$
  - $z^{k+1} = \theta^k + \frac{t_k - 1}{t_{k+1}} (\theta^k - \theta^{k-1})$
- Return the last $\theta^k$
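A minimal sketch of FISTA (names are ours); prox_g(x, step) is assumed to compute $\operatorname{prox}_{\text{step} \cdot g}(x)$:

```python
import numpy as np

def fista(theta0, grad_f, prox_g, L, n_iter=200):
    theta = theta0.copy()
    z = theta0.copy()
    t = 1.0
    for _ in range(n_iter):
        theta_prev = theta
        theta = prox_g(z - grad_f(z) / L, 1.0 / L)  # proximal gradient step at z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = theta + ((t - 1.0) / t_next) * (theta - theta_prev)  # interpolation
        t = t_next
    return theta
```

For the Lasso with penalty $\lambda \|\cdot\|_1$, one would pass, e.g., prox_g = lambda x, s: np.sign(x) * np.maximum(np.abs(x) - lam * s, 0.0).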
FISTA

Theorem (Beck and Teboulle (2009)). If the sequence $\{\theta^k\}$ is generated by FISTA, then
$$F(\theta^k) - F(\theta^\star) \leq \frac{2 L \|\theta^0 - \theta^\star\|_2^2}{(k+1)^2}.$$
The convergence rate is $O(1/k^2)$. Is $O(1/k^2)$ the optimal rate in general?
FISTA

Yes. Put $g = 0$.

Theorem (Nesterov). For any optimization procedure satisfying
$$\theta^{k+1} \in \theta^1 + \operatorname{span}\big( \nabla f(\theta^1), \ldots, \nabla f(\theta^k) \big),$$
there is a convex and $L$-smooth function $f$ on $\mathbb{R}^d$ such that
$$\min_{1 \leq j \leq k} f(\theta^j) - f(\theta^\star) \geq \frac{3 L \|\theta^1 - \theta^\star\|_2^2}{32 (k+1)^2}$$
for any $1 \leq k \leq (d-1)/2$.
FISTA

Comparison of ISTA and FISTA: FISTA is not a descent algorithm, while ISTA is.
FISTA [Proof of convergence of FISTA on the blackboard]
Backtracking linesearch

What if I don't know $L > 0$? $\|X^\top X\|_{\mathrm{op}}$ can be long to compute, and letting $L$ evolve along the iterations $k$ generally improves convergence speed.

Backtracking linesearch. Idea:
- start from a very small Lipschitz constant $L$
- between iteration $k$ and $k+1$, choose the smallest $L$ satisfying the descent lemma at $z^k$
Backtracking linesearch

At iteration $k$ of FISTA, we have $z^k$ and a constant $L_k$:
1. Put $L \leftarrow L_k$
2. Do one step: $\theta \leftarrow \operatorname{prox}_{g/L} \big( z^k - \frac{1}{L} \nabla f(z^k) \big)$
3. Check if this step satisfies the descent lemma at $z^k$:
$$f(\theta) + g(\theta) \leq f(z^k) + \langle \nabla f(z^k), \theta - z^k \rangle + \frac{L}{2} \|\theta - z^k\|_2^2 + g(\theta)$$
4. If yes, then put $\theta^{k+1} \leftarrow \theta$ and $L_{k+1} \leftarrow L$, and continue FISTA
5. If not, then put $L \leftarrow 2L$ (say), and go back to step 2

The sequence $L_k$ is non-decreasing: between iteration $k$ and $k+1$, a tweak is to decrease it a little bit, to get (much) faster convergence.
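A minimal sketch of steps 2-5 (names are ours): double $L$ until the descent lemma holds at $z^k$; the $g(\theta)$ terms cancel on both sides of the check, so only $f$ is evaluated:

```python
import numpy as np

def backtracking_step(z, f, grad_f, prox_g, L):
    g_z = grad_f(z)
    while True:
        theta = prox_g(z - g_z / L, 1.0 / L)
        diff = theta - z
        # descent lemma at z: f(theta) <= f(z) + <grad f(z), diff> + L/2 ||diff||^2
        if f(theta) <= f(z) + g_z @ diff + 0.5 * L * (diff @ diff):
            return theta, L
        L *= 2.0
```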
Fenchel duality

How to stop an iterative optimization algorithm? If $F$ is the objective function, fix $\varepsilon > 0$ small and stop when
$$\frac{|F(\theta^{k+1}) - F(\theta^k)|}{F(\theta^k)} \leq \varepsilon \quad \text{or} \quad \|\nabla f(\theta^k)\|_2 \leq \varepsilon.$$
An alternative is to compute the duality gap.

Fenchel duality. Consider the problem
$$\min_\theta f(A\theta) + g(\theta)$$
with $f : \mathbb{R}^d \to \mathbb{R}$, $g : \mathbb{R}^p \to \mathbb{R}$ and a $d \times p$ matrix $A$. We have
$$\sup_u \big\{ -f^*(u) - g^*(-A^\top u) \big\} \leq \inf_\theta \big\{ f(A\theta) + g(\theta) \big\}.$$
Moreover, if $f$ and $g$ are convex, then under mild assumptions, equality of both sides holds (strong duality, no duality gap).
Fenchel duality

$$\sup_u \big\{ -f^*(u) - g^*(-A^\top u) \big\} \leq \inf_\theta \big\{ f(A\theta) + g(\theta) \big\}$$
- The right part is the primal problem
- The left part is a dual formulation of the primal problem
- If $\theta^\star$ is an optimum for the primal and $u^\star$ is an optimum for the dual, then
$$-f^*(u^\star) - g^*(-A^\top u^\star) = f(A\theta^\star) + g(\theta^\star).$$
When $g(\theta) = \lambda \|\theta\|$, where $\lambda > 0$ and $\|\cdot\|$ is a norm, this writes
$$\sup_{u : \|A^\top u\|_* \leq \lambda} -f^*(u) \leq \inf_\theta \big\{ f(A\theta) + \lambda \|\theta\| \big\}.$$
Duality gap

If $(\theta^\star, u^\star)$ is a pair of primal/dual solutions, then we have $u^\star \in \partial f(A\theta^\star)$, or $u^\star = \nabla f(A\theta^\star)$ if $f$ is differentiable. Namely, we have at the optimum
$$\|A^\top \nabla f(A\theta^\star)\|_* \leq \lambda \quad \text{and} \quad f(A\theta^\star) + \lambda \|\theta^\star\| + f^*(\nabla f(A\theta^\star)) = 0.$$
Natural stopping rule: imagine we are at iteration $k$ of an optimization algorithm, with current primal variable $\theta^k$. Define
$$u^k = u(\theta^k) = \min\Big( 1, \frac{\lambda}{\|A^\top \nabla f(A\theta^k)\|_*} \Big) \nabla f(A\theta^k)$$
and stop at iteration $k$ when, for a given small $\varepsilon > 0$,
$$f(A\theta^k) + \lambda \|\theta^k\| + f^*(u^k) \leq \varepsilon.$$
Duality gap

Back to machine learning:
$$\sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) = \sum_{i=1}^n \ell(y_i, (X\theta)_i) = f(X\theta)$$
with $f(z) = \sum_{i=1}^n \ell(y_i, z_i)$ for $z = [z_1 \cdots z_n]^\top$, and $X$ the matrix with rows $x_1^\top, \ldots, x_n^\top$. By the chain rule, the gradient of $\theta \mapsto f(X\theta)$ is
$$X^\top \nabla f(X\theta) = \sum_{i=1}^n \ell'(y_i, \langle x_i, \theta \rangle)\, x_i,$$
where $\ell'(y, z) = \partial \ell(y, z) / \partial z$. For the duality gap, we need to compute $f^*$.
Duality gap

For least squares, $f(z) = \frac{1}{2} \|y - z\|_2^2$, we have
$$f^*(u) = \frac{1}{2} \|u\|_2^2 + \langle u, y \rangle.$$
For logistic regression, $f(z) = \sum_{i=1}^n \log(1 + e^{-y_i z_i})$, we have
$$f^*(u) = \sum_{i=1}^n \big\{ (1 + u_i y_i) \log(1 + u_i y_i) - u_i y_i \log(-u_i y_i) \big\}$$
if $-u_i y_i \in (0, 1]$ for all $i = 1, \ldots, n$, and $f^*(u) = +\infty$ otherwise.
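As a quick check of the least-squares formula (a worked computation, not in the slides): the supremum defining $f^*$ is attained where the gradient in $z$ vanishes:
$$f^*(u) = \sup_z \big\{ \langle u, z \rangle - \tfrac{1}{2} \|y - z\|_2^2 \big\}, \qquad u + (y - z) = 0 \iff z = y + u,$$
so that
$$f^*(u) = \langle u, y + u \rangle - \tfrac{1}{2} \|u\|_2^2 = \tfrac{1}{2} \|u\|_2^2 + \langle u, y \rangle.$$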
Duality gap

Example: stopping criterion for the Lasso, based on the duality gap.
- Compute the residual $r^k = X\theta^k - y$
- Compute the dual variable
$$u^k = \min\Big( 1, \frac{\lambda}{\|X^\top r^k\|_\infty} \Big) r^k$$
- Stop if
$$\frac{1}{2} \|r^k\|_2^2 + \lambda \|\theta^k\|_1 + \frac{1}{2} \|u^k\|_2^2 + \langle u^k, y \rangle \leq \varepsilon$$
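A minimal NumPy sketch of this stopping criterion (names are ours), returning the duality gap for the objective $\frac{1}{2} \|y - X\theta\|_2^2 + \lambda \|\theta\|_1$:

```python
import numpy as np

def lasso_duality_gap(theta, X, y, lam):
    r = X @ theta - y                      # residual, equals grad f at X theta
    xr_inf = np.linalg.norm(X.T @ r, np.inf)
    scale = 1.0 if xr_inf == 0.0 else min(1.0, lam / xr_inf)
    u = scale * r                          # dual-feasible point
    primal = 0.5 * (r @ r) + lam * np.abs(theta).sum()
    dual = -(0.5 * (u @ u) + u @ y)        # -f*(u), with f*(u) = 0.5||u||^2 + <u, y>
    return primal - dual

# Stop the solver once lasso_duality_gap(theta_k, X, y, lam) <= eps.
```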