Master 2 MathBigData. S. Gaïffas, CMAP - Ecole Polytechnique. November 3, 2014.

Outline
1. Supervised learning recap: introduction; loss functions, linearity
2. Penalization: introduction; ridge; sparsity; Lasso
3. Some tools from convex optimization: quick recap; proximal operators; subdifferential, Fenchel conjugate
4. ISTA and FISTA: the general problem; gradient descent; ISTA; FISTA; linesearch
5. Duality gap: Fenchel duality; duality gap


Supervised learning: setting
Data $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ for $i = 1, \ldots, n$: $x_i$ is an input and $y_i$ is an output.
The $x_i$ are called features, with $\mathcal{X} = \mathbb{R}^d$.
The $y_i$ are called labels:
- $\mathcal{Y} = \{-1, 1\}$ or $\mathcal{Y} = \{0, 1\}$ for binary classification
- $\mathcal{Y} = \{1, \ldots, K\}$ for multiclass classification
- $\mathcal{Y} = \mathbb{R}$ for regression
Goal: given a new input $x$, predict the corresponding output $y$.

Supervised learning: loss functions, linearity
What to do: minimize, with respect to $f : \mathbb{R}^d \to \mathbb{R}$, the quantity
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)),$$
where $\ell$ is a loss function: $\ell(y_i, f(x_i))$ small means that $y_i$ is close to $f(x_i)$.
$R_n(f)$ is called the goodness-of-fit or empirical risk.
Computing $\hat f$ is called training, or the estimation step.

Supervised learning: loss functions, linearity
Hence:
- when $d$ is large, it is impossible to fit a complex function $f$ on the data;
- when $n$ is large, training is too time-consuming for a complex function $f$.
So choose a linear function $f$:
$$f(x) = \langle x, \theta \rangle = \sum_{j=1}^d x_j \theta_j,$$
for some parameter vector $\theta \in \mathbb{R}^d$ to be trained.
Remark: $f$ is linear with respect to $x_i$, but the $x_i$ can be built from the data, hence $f$ is not necessarily linear with respect to the original features.

Supervised learning: loss functions, linearity
Training the model: compute
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} R_n(\theta), \quad \text{where } R_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle).$$
Classical losses:
- $\ell(y, z) = \frac12 (y - z)^2$: least-squares loss, linear regression (label $y \in \mathbb{R}$)
- $\ell(y, z) = (1 - yz)_+$: hinge loss, or SVM loss (binary classification, label $y \in \{-1, 1\}$)
- $\ell(y, z) = \log(1 + e^{-yz})$: logistic loss (binary classification, label $y \in \{-1, 1\}$)

Supervised learning Loss functions, linearity l least sq (y, z) = 1 2 (y z)2 l hinge (y, z) = (1 yz) + l logistic (y, z) = log(1 + e yz )

2. Penalization

Penalization: introduction
You should never actually fit a model by minimizing only
$$\hat\theta_n \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle).$$
You should minimize instead
$$\hat\theta_n \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda\, \mathrm{pen}(\theta) \Big\},$$
where
- $\mathrm{pen}$ is a penalization function that encodes a prior assumption on $\theta$: it forbids $\theta$ from being too complex;
- $\lambda > 0$ is a tuning or smoothing parameter that balances goodness-of-fit and penalization.

Penalization: introduction
Why use penalization?
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda\, \mathrm{pen}(\theta) \Big\}$$
Penalization, for a well-chosen $\lambda > 0$, allows one to avoid overfitting.

Penalization: ridge
The most classical penalization is the ridge penalization
$$\mathrm{pen}(\theta) = \|\theta\|_2^2 = \sum_{j=1}^d \theta_j^2.$$
It penalizes the energy of $\theta$, measured by the squared $\ell_2$-norm.

Sparsity-inducing penalization. It would be nice to find a model where $\hat\theta_j = 0$ for many coordinates $j$: few features are useful for prediction, and the model is simpler, with a smaller dimension. We say that $\hat\theta$ is sparse. How to do it?

Penalization: sparsity
It is tempting to use
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda \|\theta\|_0 \Big\}, \quad \text{where } \|\theta\|_0 = \#\{j : \theta_j \neq 0\}.$$
But, to solve this exactly, you need to try all possible subsets of non-zero coordinates of $\theta$: $2^d$ possibilities. Impossible!

Penalization: Lasso
A solution: the Lasso penalization (least absolute shrinkage and selection operator)
$$\mathrm{pen}(\theta) = \|\theta\|_1 = \sum_{j=1}^d |\theta_j|.$$
This is a penalization based on the $\ell_1$-norm $\|\cdot\|_1$.
In a noiseless setting [compressed sensing, basis pursuit], in a certain regime, $\ell_1$-minimization gives the same solution as $\ell_0$-minimization, but the Lasso-penalized problem is easy to compute.
Why does $\ell_1$-penalization lead to sparsity?

Penalization: Lasso
Why does the $\ell_2$ (ridge) penalization not induce sparsity?

Penalization: Lasso
Hence, a minimizer
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda \|\theta\|_1 \Big\}$$
is typically sparse ($\hat\theta_j = 0$ for many $j$):
- for $\lambda$ large (larger than some constant), $\hat\theta_j = 0$ for all $j$;
- for $\lambda = 0$, there is no penalization.
Between the two, the sparsity depends on the value of $\lambda$: once again, it is a regularization or penalization parameter.

Penalization: Lasso
For the least-squares loss,
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \frac{\lambda}{2} \|\theta\|_2^2 \Big\}$$
is called ridge linear regression, and
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1 \Big\}$$
is called Lasso linear regression.
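
These two objectives are reused later (ISTA, duality gap). A minimal NumPy sketch for evaluating them, with function names of our own choosing:

```python
import numpy as np

def ridge_objective(theta, X, y, lam):
    # (1 / 2n) ||y - X theta||_2^2 + (lambda / 2) ||theta||_2^2
    n = X.shape[0]
    residual = y - X @ theta
    return residual @ residual / (2 * n) + 0.5 * lam * theta @ theta

def lasso_objective(theta, X, y, lam):
    # (1 / 2n) ||y - X theta||_2^2 + lambda ||theta||_1
    n = X.shape[0]
    residual = y - X @ theta
    return residual @ residual / (2 * n) + lam * np.abs(theta).sum()
```

Such a function can be used, for instance, to monitor the objective value along the iterations of the algorithms described below.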

Penalization: Lasso
Consider the minimization problem, for $\lambda > 0$ and $b \in \mathbb{R}$,
$$\min_{a \in \mathbb{R}} \frac12 (a - b)^2 + \lambda |a|.$$
Derivative at $0^+$: $d_+ = \lambda - b$. Derivative at $0^-$: $d_- = -\lambda - b$. Let $a^\star$ be the solution. Hence:
- $a^\star = 0$ iff $d_+ \geq 0$ and $d_- \leq 0$, namely $|b| \leq \lambda$;
- $a^\star \geq 0$ iff $d_+ \leq 0$, namely $b \geq \lambda$, and then $a^\star = b - \lambda$;
- $a^\star \leq 0$ iff $d_- \geq 0$, namely $b \leq -\lambda$, and then $a^\star = b + \lambda$.
Hence
$$a^\star = \mathrm{sign}(b)\,(|b| - \lambda)_+, \quad \text{where } a_+ = \max(0, a).$$

Penalization: Lasso
As a consequence, we have
$$a^\star = \mathrm{argmin}_{a \in \mathbb{R}^d} \Big\{ \frac12 \|a - b\|_2^2 + \lambda \|a\|_1 \Big\} = S_\lambda(b),$$
where
$$S_\lambda(b) = \mathrm{sign}(b)\,(|b| - \lambda)_+$$
(applied entrywise) is the soft-thresholding operator.
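
A minimal NumPy sketch of the soft-thresholding operator (the function name is ours):

```python
import numpy as np

def soft_thresholding(b, lam):
    # S_lambda(b) = sign(b) * (|b| - lambda)_+ applied entrywise
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Example: entries with |b_j| <= lambda are set exactly to zero
b = np.array([3.0, -0.5, 0.2, -2.0])
print(soft_thresholding(b, lam=1.0))   # [ 2. -0.  0. -1.]
```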

3. Some tools from convex optimization

Quick recap
A function $f : \mathbb{R}^d \to [-\infty, +\infty]$ is
- convex if $f(tx + (1-t)x') \leq t f(x) + (1-t) f(x')$ for any $x, x' \in \mathbb{R}^d$, $t \in [0, 1]$;
- proper if it never takes the value $-\infty$ and is not identically $+\infty$ [hence a proper function takes values in $(-\infty, +\infty]$];
- lower-semicontinuous (l.s.c.) if for any $x$ and any sequence $x_n \to x$ we have $f(x) \leq \liminf_n f(x_n)$.
The set of proper convex l.s.c. functions is often denoted $\Gamma_0(\mathbb{R}^d)$ or $\Gamma_0$.

Proximal operator
For any convex l.s.c. $g$ and any $y \in \mathbb{R}^d$, we define the proximal operator
$$\mathrm{prox}_g(y) = \mathrm{argmin}_{x \in \mathbb{R}^d} \Big\{ \frac12 \|x - y\|_2^2 + g(x) \Big\}$$
(a strongly convex problem, hence a unique minimizer).
We have seen that soft-thresholding is the proximal operator of the $\ell_1$-norm:
$$\mathrm{prox}_{\lambda \|\cdot\|_1}(y) = S_\lambda(y) = \mathrm{sign}(y)\,(|y| - \lambda)_+.$$
Proximal operators and proximal algorithms are now fundamental tools for optimization in machine learning.

Examples of proximal operators
- If $g(x) = c$ for a constant $c$, then $\mathrm{prox}_g = \mathrm{Id}$.
- If $C$ is a convex set and $g(x) = \delta_C(x) = 0$ if $x \in C$, $+\infty$ if $x \notin C$, then $\mathrm{prox}_g = \mathrm{proj}_C$, the projection onto $C$.
- If $g(x) = \langle b, x \rangle + c$, then $\mathrm{prox}_{\lambda g}(x) = x - \lambda b$.
- If $g(x) = \frac12 x^\top A x + \langle b, x \rangle + c$ with $A$ symmetric positive, then $\mathrm{prox}_{\lambda g}(x) = (I + \lambda A)^{-1}(x - \lambda b)$.

Examples of proximal operators
- If $g(x) = \frac12 \|x\|_2^2$, then $\mathrm{prox}_{\lambda g}(x) = \frac{1}{1 + \lambda} x$, the shrinkage operator.
- If $g(x) = -\log x$, then $\mathrm{prox}_{\lambda g}(x) = \frac{x + \sqrt{x^2 + 4\lambda}}{2}$.
- If $g(x) = \|x\|_2$, then $\mathrm{prox}_{\lambda g}(x) = \Big(1 - \frac{\lambda}{\|x\|_2}\Big)_+ x$, the block soft-thresholding operator.

Examples of proximal operators
- If $g(x) = \|x\|_1 + \frac{\gamma}{2} \|x\|_2^2$ (elastic-net), where $\gamma > 0$, then $\mathrm{prox}_{\lambda g}(x) = \frac{1}{1 + \lambda\gamma}\, \mathrm{prox}_{\lambda \|\cdot\|_1}(x)$.
- If $g(x) = \sum_{g \in \mathcal{G}} \|x_g\|_2$, where $\mathcal{G}$ is a partition of $\{1, \ldots, d\}$, then $(\mathrm{prox}_{\lambda g}(x))_g = \Big(1 - \frac{\lambda}{\|x_g\|_2}\Big)_+ x_g$ for $g \in \mathcal{G}$: blockwise soft-thresholding, used for the group-Lasso.
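
A minimal NumPy sketch of some of these proximal operators, following the formulas above (function names are ours):

```python
import numpy as np

def prox_l1(x, lam):
    # prox of lam * ||x||_1: entrywise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_squared_l2(x, lam):
    # prox of lam * (1/2) ||x||_2^2: shrinkage by 1 / (1 + lam)
    return x / (1.0 + lam)

def prox_l2(x, lam):
    # prox of lam * ||x||_2: block soft-thresholding
    norm = np.linalg.norm(x)
    return max(0.0, 1.0 - lam / norm) * x if norm > 0 else x

def prox_elastic_net(x, lam, gamma):
    # prox of lam * (||x||_1 + (gamma/2) ||x||_2^2)
    return prox_l1(x, lam) / (1.0 + lam * gamma)
```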

Subdifferential, Fenchel conjugate
The subdifferential of $f \in \Gamma_0$ at $x$ is the set
$$\partial f(x) = \big\{ g \in \mathbb{R}^d : f(y) \geq \langle g, y - x \rangle + f(x) \text{ for all } y \in \mathbb{R}^d \big\}.$$
Each element is called a subgradient.
Optimality criterion: $0 \in \partial f(x)$ iff $f(x) \leq f(y)$ for all $y$.
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.
Example: $\partial |\cdot|(0) = [-1, 1]$.

Fenchel conjugate
The Fenchel conjugate of a function $f$ on $\mathbb{R}^d$ is given by
$$f^*(x) = \sup_{y \in \mathbb{R}^d} \{ \langle x, y \rangle - f(y) \}.$$
It is always a convex function (as a supremum of continuous, affine functions). $f^*(x)$ is the smallest constant $c$ such that the affine function $y \mapsto \langle x, y \rangle - c$ is below $f$.
Fenchel-Young inequality: for any $x$ and $y$, we have
$$f(x) + f^*(y) \geq \langle x, y \rangle.$$

Fenchel conjugate and subgradients
Legendre-Fenchel identity: if $f \in \Gamma_0$, we have
$$\langle x, y \rangle = f(x) + f^*(y) \iff y \in \partial f(x) \iff x \in \partial f^*(y).$$
Example: the Fenchel conjugate of a norm $\|\cdot\|$ is
$$\|\cdot\|^*(x) = \sup_{y \in \mathbb{R}^d} \{ \langle x, y \rangle - \|y\| \} = \delta_{\{z \in \mathbb{R}^d : \|z\|_* \leq 1\}}(x),$$
where
$$\|x\|_* = \max_{y \in \mathbb{R}^d : \|y\| \leq 1} \langle x, y \rangle$$
is the dual norm of $\|\cdot\|$ [recall that $\|\cdot\|_p^* = \|\cdot\|_q$ with $1/p + 1/q = 1$].

Some extras
Let $f : \mathbb{R}^d \to [-\infty, +\infty]$.
- $f$ is $L$-smooth if it is continuously differentiable and $\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2$ for any $x, y \in \mathbb{R}^d$. When $f$ is twice continuously differentiable, this is equivalent to $H_f(x) \preceq L I_d$ for all $x$, where $H_f(x)$ is the Hessian at $x$ [i.e. $L I_d - H_f(x)$ is positive semi-definite].
- $f$ is $\mu$-strongly convex if $f(\cdot) - \frac{\mu}{2} \|\cdot\|_2^2$ is convex. Equivalent to $f(y) \geq f(x) + \langle g, y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2$ for $g \in \partial f(x)$. Equivalent to $H_f(x) \succeq \mu I_d$ when $f$ is twice differentiable.
- $f$ is $\mu$-strongly convex iff $f^*$ is $\frac{1}{\mu}$-smooth.

4. ISTA and FISTA

The general problem we want to solve
How to solve
$$\hat\theta \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) + \lambda\, \mathrm{pen}(\theta) \Big\}\ ?$$
Put for short
$$f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) \quad \text{and} \quad g(\theta) = \lambda\, \mathrm{pen}(\theta).$$
Assume that:
- $f$ is convex and $L$-smooth;
- $g$ is convex and continuous, but possibly non-smooth (for instance the $\ell_1$ penalization);
- $g$ is prox-capable: its proximal operator is not hard to compute.

Examples
Smoothness of $f$:
- Least-squares: $\nabla f(\theta) = \frac{1}{n} X^\top (X\theta - Y)$, with $L = \frac{\|X^\top X\|_{\mathrm{op}}}{n}$.
- Logistic loss: $\nabla f(\theta) = -\frac{1}{n} \sum_{i=1}^n \frac{y_i}{1 + e^{y_i \langle x_i, \theta \rangle}}\, x_i$, with $L = \frac{\max_{i=1,\ldots,n} \|x_i\|_2^2}{4n}$.
Prox-capability of $g$: we gave the explicit prox for many penalizations above.
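
A minimal NumPy sketch of these two gradients and of the least-squares Lipschitz constant (function names are ours):

```python
import numpy as np

def grad_least_squares(theta, X, y):
    # gradient of f(theta) = (1 / 2n) ||X theta - y||_2^2
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def lipschitz_least_squares(X):
    # L = ||X^T X||_op / n  (spectral norm of X^T X, divided by n)
    n = X.shape[0]
    return np.linalg.norm(X.T @ X, ord=2) / n

def grad_logistic(theta, X, y):
    # gradient of f(theta) = (1/n) sum_i log(1 + exp(-y_i <x_i, theta>)), y_i in {-1, +1}
    n = X.shape[0]
    margins = y * (X @ theta)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / n
```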

Gradient descent
Now, how do I minimize $f + g$?
Key point: the descent lemma. If $f$ is convex and $L$-smooth, then for any $\bar L \geq L$:
$$f(\theta') \leq f(\theta) + \langle \nabla f(\theta), \theta' - \theta \rangle + \frac{\bar L}{2} \|\theta' - \theta\|_2^2 \quad \text{for any } \theta, \theta' \in \mathbb{R}^d.$$
At iteration $k$, the current point is $\theta^k$. Using the descent lemma:
$$f(\theta) \leq f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2.$$

Gradient descent
Remark that
$$\mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2 \Big\} = \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\| \theta - \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big) \Big\|_2^2.$$
Hence, choose
$$\theta^{k+1} = \theta^k - \frac{1}{L} \nabla f(\theta^k).$$
This is the basic gradient descent algorithm [cf. previous lecture]. Gradient descent is based on a majorization-minimization principle, with a quadratic majorant given by the descent lemma. But we forgot about $g$...
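
Before putting $g$ back, a minimal sketch of plain gradient descent with the constant step size $1/L$ (not from the slides; names are ours):

```python
import numpy as np

def gradient_descent(grad_f, theta0, L, n_iter=100):
    # Basic gradient descent: theta^{k+1} = theta^k - (1/L) grad f(theta^k)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - grad_f(theta) / L
    return theta

# Example usage, with the least-squares gradient and Lipschitz constant sketched above:
# theta_hat = gradient_descent(lambda t: grad_least_squares(t, X, y),
#                              np.zeros(X.shape[1]), lipschitz_least_squares(X))
```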

ISTA
Let's put back $g$:
$$f(\theta) + g(\theta) \leq f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2 + g(\theta),$$
and again
$$\begin{aligned}
\mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ f(\theta^k) + \langle \nabla f(\theta^k), \theta - \theta^k \rangle + \frac{L}{2} \|\theta - \theta^k\|_2^2 + g(\theta) \Big\}
&= \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac{L}{2} \Big\| \theta - \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big) \Big\|_2^2 + g(\theta) \Big\} \\
&= \mathrm{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \frac12 \Big\| \theta - \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big) \Big\|_2^2 + \frac{1}{L} g(\theta) \Big\} \\
&= \mathrm{prox}_{g/L} \Big( \theta^k - \frac{1}{L} \nabla f(\theta^k) \Big).
\end{aligned}$$
The prox operator naturally appears because of the descent lemma.

ISTA
Proximal gradient descent algorithm [also called ISTA]
Input: starting point $\theta^0$, Lipschitz constant $L > 0$ of $\nabla f$
For $k = 1, 2, \ldots$ until converged, do
$$\theta^k = \mathrm{prox}_{g/L} \Big( \theta^{k-1} - \frac{1}{L} \nabla f(\theta^{k-1}) \Big)$$
Return the last $\theta^k$.
Also called forward-backward splitting. For the Lasso with least-squares loss, the iteration is
$$\theta^k = S_{\lambda/L} \Big( \theta^{k-1} - \frac{1}{L} \big( X^\top X \theta^{k-1} - X^\top Y \big) \Big),$$
where $S_\lambda$ is the soft-thresholding operator.
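
A minimal NumPy sketch of ISTA for the Lasso with least-squares loss, following the iteration above (a sketch under the assumption $f(\theta) = \frac{1}{2n}\|Y - X\theta\|_2^2$; names are ours):

```python
import numpy as np

def soft_thresholding(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ista_lasso(X, y, lam, n_iter=200):
    # ISTA for min_theta (1 / 2n) ||y - X theta||_2^2 + lam ||theta||_1
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n        # Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n           # gradient of the smooth part
        theta = soft_thresholding(theta - grad / L, lam / L)
    return theta
```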

ISTA
Put for short $F = f + g$, and take any $\theta^\star \in \mathrm{argmin}_{\theta \in \mathbb{R}^d} F(\theta)$.
Theorem (Beck and Teboulle, 2009). If the sequence $\{\theta^k\}$ is generated by ISTA, then
$$F(\theta^k) - F(\theta^\star) \leq \frac{L \|\theta^0 - \theta^\star\|_2^2}{2k}.$$
The convergence rate is $O(1/k)$. Is it possible to improve the $O(1/k)$ rate?

FISTA
Yes! Using accelerated proximal gradient descent (called FISTA; Nesterov 83, 04; Beck and Teboulle 09).
Idea: to find $\theta^{k+1}$, use a combination of $\theta^k$ and $\theta^{k-1}$.
Accelerated proximal gradient descent algorithm [FISTA]
Input: starting points $z^1 = \theta^0$, Lipschitz constant $L > 0$ of $\nabla f$, $t_1 = 1$
For $k = 1, 2, \ldots$ until converged, do
$$\theta^k = \mathrm{prox}_{g/L} \Big( z^k - \frac{1}{L} \nabla f(z^k) \Big), \quad
t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \quad
z^{k+1} = \theta^k + \frac{t_k - 1}{t_{k+1}} \big( \theta^k - \theta^{k-1} \big).$$
Return the last $\theta^k$.
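
A minimal NumPy sketch of FISTA for the same Lasso problem (same assumptions as the ISTA sketch above; names are ours):

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=200):
    # FISTA for min_theta (1 / 2n) ||y - X theta||_2^2 + lam ||theta||_1
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n
    theta_prev = np.zeros(d)                  # theta^0
    z = np.zeros(d)                           # z^1 = theta^0
    t = 1.0                                   # t_1 = 1
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n
        v = z - grad / L
        theta = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)   # prox_{g/L}
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = theta + ((t - 1.0) / t_next) * (theta - theta_prev)
        theta_prev, t = theta, t_next
    return theta_prev
```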

FISTA
Theorem (Beck and Teboulle, 2009). If the sequence $\{\theta^k\}$ is generated by FISTA, then
$$F(\theta^k) - F(\theta^\star) \leq \frac{2 L \|\theta^0 - \theta^\star\|_2^2}{(k + 1)^2}.$$
The convergence rate is $O(1/k^2)$. Is $O(1/k^2)$ the optimal rate in general?

FISTA
Yes. Put $g = 0$.
Theorem (Nesterov). For any optimization procedure satisfying
$$\theta^{k+1} \in \theta^1 + \mathrm{span}\big( \nabla f(\theta^1), \ldots, \nabla f(\theta^k) \big),$$
there is a convex and $L$-smooth function $f$ on $\mathbb{R}^d$ such that
$$\min_{1 \leq j \leq k} f(\theta^j) - f(\theta^\star) \geq \frac{3L}{32} \frac{\|\theta^1 - \theta^\star\|_2^2}{(k + 1)^2}$$
for any $1 \leq k \leq (d - 1)/2$.

FISTA
Comparison of ISTA and FISTA: FISTA is not a descent algorithm, while ISTA is.

FISTA [Proof of convergence of FISTA on the blackboard]

Backtracking linesearch
What if I don't know $L > 0$? $\|X^\top X\|_{\mathrm{op}}$ can be long to compute, and letting $L$ evolve along the iterations $k$ generally improves convergence speed.
Backtracking linesearch. Idea:
- start from a very small Lipschitz constant $L$;
- between iterations $k$ and $k+1$, choose the smallest $L$ satisfying the descent lemma at $z^k$.

Backtracking linesearch
At iteration $k$ of FISTA, we have $z^k$ and a constant $L_k$.
1. Put $L \leftarrow L_k$.
2. Do an iteration $\theta \leftarrow \mathrm{prox}_{g/L}\big( z^k - \frac{1}{L} \nabla f(z^k) \big)$.
3. Check whether this step satisfies the descent lemma at $z^k$:
$$f(\theta) + g(\theta) \leq f(z^k) + \langle \nabla f(z^k), \theta - z^k \rangle + \frac{L}{2} \|\theta - z^k\|_2^2 + g(\theta).$$
4. If yes, then $\theta^{k+1} \leftarrow \theta$, $L_{k+1} \leftarrow L$, and continue FISTA.
5. If not, then put $L \leftarrow 2L$ (say), and go back to step 2.
The sequence $L_k$ is non-decreasing; a tweak is to decrease it a little bit between iterations $k$ and $k+1$ to obtain (much) faster convergence.
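
A minimal sketch of one backtracking step (names are ours; the doubling factor 2 is the "say" from the slide, and since the $g(\theta)$ term appears on both sides of the inequality above, only $f$ needs to be checked). Here `prox_g(x, t)` is assumed to compute $\mathrm{prox}_{t g}(x)$:

```python
import numpy as np

def backtracking_prox_step(z, grad_f, f, prox_g, L_init):
    # Find, by doubling, an L such that the descent lemma holds at z,
    # and return the corresponding proximal gradient step together with the accepted L.
    L = L_init
    gz, fz = grad_f(z), f(z)
    while True:
        theta = prox_g(z - gz / L, 1.0 / L)        # prox_{g/L}(z - (1/L) grad f(z))
        diff = theta - z
        # descent lemma check: f(theta) <= f(z) + <grad f(z), theta - z> + (L/2)||theta - z||^2
        if f(theta) <= fz + gz @ diff + 0.5 * L * diff @ diff:
            return theta, L
        L *= 2.0                                    # not satisfied: increase L and retry
```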

5. Duality gap

Fenchel duality
How to stop an iterative optimization algorithm? If $F$ is the objective function, fix $\varepsilon > 0$ small and stop when
$$\frac{|F(\theta^{k+1}) - F(\theta^k)|}{|F(\theta^k)|} \leq \varepsilon \quad \text{or} \quad \|\nabla f(\theta^k)\|_2 \leq \varepsilon.$$
An alternative is to compute the duality gap.
Fenchel duality. Consider the problem
$$\min_\theta f(A\theta) + g(\theta)$$
with $f : \mathbb{R}^d \to \mathbb{R}$, $g : \mathbb{R}^p \to \mathbb{R}$ and a $d \times p$ matrix $A$. We have
$$\sup_u \{ -f^*(u) - g^*(-A^\top u) \} \leq \inf_\theta \{ f(A\theta) + g(\theta) \}.$$
Moreover, if $f$ and $g$ are convex, then under mild assumptions equality of both sides holds (strong duality, no duality gap).

Fenchel duality
$$\sup_u \{ -f^*(u) - g^*(-A^\top u) \} \leq \inf_\theta \{ f(A\theta) + g(\theta) \}$$
The right part is the primal problem; the left part is a dual formulation of the primal problem. If $\theta^\star$ is an optimum for the primal and $u^\star$ is an optimum for the dual, then
$$-f^*(u^\star) - g^*(-A^\top u^\star) = f(A\theta^\star) + g(\theta^\star).$$
When $g(\theta) = \lambda \|\theta\|$, where $\lambda > 0$ and $\|\cdot\|$ is a norm, this writes
$$\sup_{u : \|A^\top u\|_* \leq \lambda} -f^*(u) \leq \inf_\theta \{ f(A\theta) + \lambda \|\theta\| \}.$$

Duality gap
If $(\theta^\star, u^\star)$ is a pair of primal/dual solutions, then we have $u^\star \in \partial f(A\theta^\star)$, or $u^\star = \nabla f(A\theta^\star)$ if $f$ is differentiable. Namely, we have at the optimum
$$\|A^\top \nabla f(A\theta^\star)\|_* \leq \lambda \quad \text{and} \quad f(A\theta^\star) + \lambda \|\theta^\star\| + f^*\big(\nabla f(A\theta^\star)\big) = 0.$$
Natural stopping rule: imagine we are at iteration $k$ of an optimization algorithm, with current primal variable $\theta^k$. Define
$$u^k = u(\theta^k) = \min\Big( 1, \frac{\lambda}{\|A^\top \nabla f(A\theta^k)\|_*} \Big)\, \nabla f(A\theta^k)$$
and stop at iteration $k$ when, for a given small $\varepsilon > 0$,
$$f(A\theta^k) + \lambda \|\theta^k\| + f^*(u^k) \leq \varepsilon.$$

Duality gap
Back to machine learning:
$$\sum_{i=1}^n \ell(y_i, \langle x_i, \theta \rangle) = \sum_{i=1}^n \ell(y_i, (X\theta)_i) = f(X\theta),$$
with $f(z) = \sum_{i=1}^n \ell(y_i, z_i)$ for $z = [z_1 \cdots z_n]^\top$ and $X$ the matrix with rows $x_1^\top, \ldots, x_n^\top$. The gradient of $\theta \mapsto f(X\theta)$ is
$$X^\top \nabla f(X\theta) = \sum_{i=1}^n \ell'(y_i, \langle x_i, \theta \rangle)\, x_i,$$
where $\ell'(y, z) = \partial \ell(y, z) / \partial z$. For the duality gap, we need to compute $f^*$.

Duality gap
For least squares, $f(z) = \frac12 \|y - z\|_2^2$, and we have
$$f^*(u) = \frac12 \|u\|_2^2 + \langle u, y \rangle.$$
For logistic regression, $f(z) = \sum_{i=1}^n \log(1 + e^{-y_i z_i})$, and we have
$$f^*(u) = \sum_{i=1}^n \big[ (1 + u_i y_i) \log(1 + u_i y_i) - u_i y_i \log(-u_i y_i) \big]$$
if $-u_i y_i \in (0, 1]$ for all $i = 1, \ldots, n$, and $f^*(u) = +\infty$ otherwise.

Duality gap
Example. Stopping criterion for the Lasso based on the duality gap:
- Compute the residual $r^k = X\theta^k - y$.
- Compute the dual variable $u^k = \min\Big( 1, \frac{\lambda}{\|X^\top r^k\|_\infty} \Big)\, r^k$.
- Stop if $\frac12 \|r^k\|_2^2 + \lambda \|\theta^k\|_1 + \frac12 \|u^k\|_2^2 + \langle u^k, y \rangle \leq \varepsilon$.
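
A minimal NumPy sketch of this stopping criterion (names are ours; note that, following the slide, the primal objective here is the unnormalized $\frac12 \|X\theta - y\|_2^2 + \lambda \|\theta\|_1$):

```python
import numpy as np

def lasso_duality_gap(theta, X, y, lam):
    # Duality gap for min_theta (1/2) ||X theta - y||_2^2 + lam ||theta||_1
    r = X @ theta - y                               # residual r^k
    norm_inf = np.max(np.abs(X.T @ r))
    scale = 1.0 if norm_inf == 0 else min(1.0, lam / norm_inf)
    u = scale * r                                   # feasible dual variable u^k
    return 0.5 * r @ r + lam * np.abs(theta).sum() + 0.5 * u @ u + u @ y

# Example stopping rule inside an iterative solver (eps > 0 small):
# if lasso_duality_gap(theta, X, y, lam) <= eps: stop
```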