Proximal methods
S. Villa
October 7, 2014


1 Review of the basics

Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem of the form
\[
\min_{c \in \mathbb{R}^d} \|y - Kc\|^2,
\]
for various choices of the loss function. Another typical problem is the regularized one, e.g. Tikhonov regularization, where for linear kernels one looks for
\[
\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n V(\langle w, x_i\rangle, y_i) + \lambda R(w).
\]
The class of methods we will consider is suitable for problems involving a nonsmooth regularization term. In particular, we will be motivated by the following regularization terms:
(i) $\ell^1$ norm regularization
(ii) group $\ell^1$ norm regularization
(iii) matrix norm regularization
More generally, we are interested in solving a minimization problem
\[
\min_{w \in \mathbb{R}^d} F(w).
\]
We review the basic concepts that allow us to study this problem.

Existence of a minimizer. We will consider extended real valued functions $F \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$. The domain of $F$ is
\[
\operatorname{dom} F = \{w \in \mathbb{R}^d : F(w) < +\infty\}.
\]
We say that $F$ is proper if its domain is nonempty. It is useful to consider extended valued functions since they allow us to include constraints in the regularization. $F$ is lower semicontinuous if $\operatorname{epi} F$ is closed. $F$ is coercive if $\lim_{\|w\| \to +\infty} F(w) = +\infty$.

Theorem 1.1. If $F$ is lower semicontinuous and coercive, then there exists $w^\ast$ such that $F(w^\ast) = \min F$.

We will always assume that the functions we consider are lower semicontinuous.
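To fix ideas, here is a minimal numerical instance of the regularized problem above (a sketch only: the synthetic data, the square loss and the $\ell^1$ penalty are illustrative choices, not prescribed by the notes).

```python
import numpy as np

# Sketch of a regularized ERM objective: square loss plus an l1 penalty.
# All data here is synthetic and only meant to fix notation.
rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))        # inputs x_i stacked as rows
y = rng.standard_normal(n)             # outputs y_i
lam = 0.1                              # regularization parameter lambda

def objective(w):
    data_fit = 0.5 / n * np.linalg.norm(X @ w - y) ** 2   # (1/n) sum V(<w, x_i>, y_i)
    penalty = lam * np.abs(w).sum()                       # R(w) = ||w||_1
    return data_fit + penalty

print(objective(np.zeros(d)))
```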

1.1 Convexity concepts

Convexity. $F$ is convex if
\[
(\forall w, w' \in \operatorname{dom} F)(\forall \lambda \in [0,1]) \quad F(\lambda w + (1-\lambda) w') \le \lambda F(w) + (1-\lambda) F(w').
\]
If $F$ is differentiable, we can write an equivalent characterization of convexity based on the gradient:
\[
(\forall w, w' \in \mathbb{R}^d) \quad F(w') \ge F(w) + \langle \nabla F(w), w' - w\rangle.
\]
If $F$ is twice differentiable and $\nabla^2 F$ is the Hessian matrix, convexity is equivalent to $\nabla^2 F(w)$ being positive semidefinite for all $w \in \mathbb{R}^d$. If a function is convex and differentiable, then $\nabla F(w) = 0$ implies that $w$ is a global minimizer.

Strict convexity. $F$ is strictly convex if
\[
(\forall w, w' \in \operatorname{dom} F,\ w \neq w')(\forall \lambda \in (0,1)) \quad F(\lambda w + (1-\lambda) w') < \lambda F(w) + (1-\lambda) F(w').
\]
If $F$ is differentiable, an equivalent characterization of strict convexity based on the gradient is
\[
(\forall w, w' \in \mathbb{R}^d,\ w \neq w') \quad F(w') > F(w) + \langle \nabla F(w), w' - w\rangle.
\]
If $F$ is twice differentiable, strict convexity is implied by $\nabla^2 F(w)$ being positive definite for all $w \in \mathbb{R}^d$. The minimizer of a strictly convex function is unique (if it exists).

Strong convexity. $F$ is $\mu$-strongly convex if the function $F - \frac{\mu}{2}\|\cdot\|^2$ is convex, i.e.
\[
(\forall w, w' \in \operatorname{dom} F)(\forall \lambda \in [0,1]) \quad F(\lambda w + (1-\lambda) w') \le \lambda F(w) + (1-\lambda) F(w') - \frac{\mu}{2}\lambda(1-\lambda)\|w - w'\|^2.
\]
If $F$ is differentiable, then strong convexity is equivalent to
\[
(\forall w, w' \in \mathbb{R}^d) \quad F(w') \ge F(w) + \langle \nabla F(w), w' - w\rangle + \frac{\mu}{2}\|w' - w\|^2.
\]
If $F$ is twice differentiable, strong convexity is equivalent to $\nabla^2 F(w) \succeq \mu I$ for all $w \in \mathbb{R}^d$. If $F$ is strongly convex then it is coercive; therefore, if it is also lsc, it admits a unique minimizer $w^\ast$. Moreover,
\[
F(w) - F(w^\ast) \ge \frac{\mu}{2}\|w - w^\ast\|^2.
\]

Lipschitz continuity of the gradient. We will often assume Lipschitz continuity of the gradient,
\[
\|\nabla F(w) - \nabla F(w')\| \le L\, \|w - w'\|.
\]
This gives a useful quadratic upper bound on $F$:
\[
F(w') \le F(w) + \langle \nabla F(w), w' - w\rangle + \frac{L}{2}\|w' - w\|^2 \qquad (\forall w, w' \in \operatorname{dom} F). \tag{1}
\]
Moreover, for every $w \in \operatorname{dom} F$ and every minimizer $w^\ast$,
\[
\frac{1}{2L}\|\nabla F(w)\|^2 \le F(w) - F(w^\ast) \le \frac{L}{2}\|w - w^\ast\|^2.
\]
The second inequality follows by substituting $w = w^\ast$ and $w' = w$ in the quadratic upper bound (1); the first follows by substituting $w' = w - \frac{1}{L}\nabla F(w)$.
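As a concrete instance of these constants (a sketch with assumed synthetic data; the variable names below are not from the notes): for the least squares objective $F(w) = \frac{1}{2}\|Xw - y\|^2$, the gradient is $X^T(Xw - y)$ and the Hessian is $X^T X$, so $L$ and $\mu$ are its largest and smallest eigenvalues.

```python
import numpy as np

# Lipschitz and strong-convexity constants for least squares:
# F(w) = 0.5 * ||X w - y||^2, grad F(w) = X^T (X w - y), Hess F = X^T X.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # assumed synthetic data
y = rng.standard_normal(50)

H = X.T @ X                          # Hessian (constant in w)
eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()       # gradient Lipschitz constant and strong-convexity modulus

def grad_F(w):
    return X.T @ (X @ w - y)

F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2

# Quadratic upper bound (1): F(w') <= F(w) + <grad F(w), w'-w> + L/2 ||w'-w||^2
w, w2 = rng.standard_normal(10), rng.standard_normal(10)
assert F(w2) <= F(w) + grad_F(w) @ (w2 - w) + L / 2 * np.linalg.norm(w2 - w) ** 2
```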

2 Convergence of the gradient method with constant step-size

Assume $F$ to be convex and differentiable with $L$-Lipschitz continuous gradient, and that a minimizer exists. The first order necessary condition is $\nabla F(w^\ast) = 0$, therefore
\[
w^\ast - \alpha \nabla F(w^\ast) = w^\ast .
\]
This suggests an algorithm based on the fixed point iteration
\[
w_{k+1} = w_k - \alpha \nabla F(w_k).
\]
We want to study the convergence of this algorithm. Convergence can be intended in two senses, towards the minimum value or towards a minimizer; we start from the first one. There are different strategies to choose the stepsize; here we keep $\alpha$ fixed and determine a priori conditions guaranteeing convergence. From the quadratic upper bound (1) we get
\[
F(w_{k+1}) \le F(w_k) - \alpha \|\nabla F(w_k)\|^2 + \frac{L\alpha^2}{2}\|\nabla F(w_k)\|^2 = F(w_k) - \alpha\Big(1 - \frac{L\alpha}{2}\Big)\|\nabla F(w_k)\|^2 .
\]
If $0 < \alpha < 2/L$ the iteration decreases the function value. Choose $\alpha = 1/L$ (which gives the maximum decrease) and get
\[
\begin{aligned}
F(w_{k+1}) &\le F(w_k) - \frac{1}{2L}\|\nabla F(w_k)\|^2 \\
&\le F(w^\ast) + \langle \nabla F(w_k), w_k - w^\ast\rangle - \frac{1}{2L}\|\nabla F(w_k)\|^2 \\
&= F(w^\ast) + \frac{L}{2}\Big( \frac{2}{L}\langle \nabla F(w_k), w_k - w^\ast\rangle - \frac{1}{L^2}\|\nabla F(w_k)\|^2 - \|w_k - w^\ast\|^2 + \|w_k - w^\ast\|^2 \Big) \\
&= F(w^\ast) + \frac{L}{2}\Big( \|w_k - w^\ast\|^2 - \big\|w_k - \tfrac{1}{L}\nabla F(w_k) - w^\ast\big\|^2 \Big) \\
&= F(w^\ast) + \frac{L}{2}\Big( \|w_k - w^\ast\|^2 - \|w_{k+1} - w^\ast\|^2 \Big).
\end{aligned}
\]
Summing the above inequality for $k = 0, \ldots, K-1$ we get
\[
\sum_{k=0}^{K-1} \big( F(w_{k+1}) - F(w^\ast) \big) \le \frac{L}{2} \sum_{k=0}^{K-1} \big( \|w_k - w^\ast\|^2 - \|w_{k+1} - w^\ast\|^2 \big) \le \frac{L}{2}\|w_0 - w^\ast\|^2 .
\]
Noting that $F(w_k)$ is decreasing, $F(w_K) - F(w^\ast) \le F(w_{k+1}) - F(w^\ast)$ for every $k \le K-1$, therefore we obtain
\[
F(w_K) - F(w^\ast) \le \frac{L}{2K}\|w_0 - w^\ast\|^2 .
\]
This is called a sublinear rate of convergence. For strongly convex functions, it is possible to prove that the operator $I - \alpha \nabla F$ is a contraction, and therefore we get a linear convergence rate:
\[
\|w_K - w^\ast\|^2 \le \Big( \frac{L - \mu}{L + \mu} \Big)^{2K} \|w_0 - w^\ast\|^2 ,
\]
which gives, using the bound following (1),
\[
F(w_K) - F(w^\ast) \le \frac{L}{2}\Big( \frac{L - \mu}{L + \mu} \Big)^{2K} \|w_0 - w^\ast\|^2 ,
\]
which is much better.
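A minimal sketch of the constant step-size iteration above, run on a synthetic least squares problem (data, sizes and names are assumptions made for illustration):

```python
import numpy as np

# Gradient descent with constant step alpha = 1/L on F(w) = 0.5||Xw - y||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
L = np.linalg.eigvalsh(X.T @ X).max()

F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad_F = lambda w: X.T @ (X @ w - y)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # reference minimizer

K = 200
w = np.zeros(10)                                # w_0 = 0
for k in range(K):
    w = w - (1.0 / L) * grad_F(w)               # w_{k+1} = w_k - (1/L) grad F(w_k)

# The objective gap should sit below the L/(2K) ||w_0 - w*||^2 bound.
print(F(w) - F(w_star), L / (2 * K) * np.linalg.norm(w_star) ** 2)
```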

It is known that for general convex problems with Lipschitz continuous gradient, the worst-case performance of any first order method is lower bounded by a quantity of order $1/k^2$. Nesterov in 1983 devised an algorithm attaining this lower bound. The algorithm is called accelerated gradient descent and is very similar to the gradient method; it needs to store two iterates instead of only one. It is of the form
\[
\begin{aligned}
w_{k+1} &= u_k - \frac{1}{L}\nabla F(u_k) \\
u_{k+1} &= a_k w_k + b_k w_{k+1},
\end{aligned}
\]
for some $w_0 \in \operatorname{dom} F$, with $u_1 = w_0$ and a suitable (a priori determined) sequence of parameters $a_k$ and $b_k$. More precisely, choose $w_0 \in \operatorname{dom} F$, set $u_1 = w_0$ and $t_1 = 1$, and then define
\[
\begin{aligned}
w_{k+1} &= u_k - \frac{1}{L}\nabla F(u_k) \\
t_{k+1} &= \frac{1 + \sqrt{1 + 4 t_k^2}}{2} \\
u_{k+1} &= \Big( 1 + \frac{t_k - 1}{t_{k+1}} \Big) w_{k+1} - \frac{t_k - 1}{t_{k+1}}\, w_k .
\end{aligned}
\]
We obtain a rate of order $1/k^2$:
\[
F(w_k) - F(w^\ast) \le \frac{2 L \|w_0 - w^\ast\|^2}{(k+1)^2}.
\]
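A minimal sketch of the accelerated scheme just described, again on a synthetic least squares problem (data and names are assumptions; the update is exactly the $(w_k, u_k, t_k)$ recursion above):

```python
import numpy as np

# Accelerated (Nesterov / FISTA-type) gradient method on least squares.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
L = np.linalg.eigvalsh(X.T @ X).max()
grad_F = lambda w: X.T @ (X @ w - y)
F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2

w = np.zeros(10)      # w_0
u = w.copy()          # u_1 = w_0
t = 1.0               # t_1 = 1
for k in range(200):
    w_next = u - (1.0 / L) * grad_F(u)                # gradient step at the extrapolated point
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
    u = w_next + (t - 1.0) / t_next * (w_next - w)    # momentum / extrapolation step
    w, t = w_next, t_next

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(F(w) - F(w_star))   # should be far below the O(1/k^2) bound
```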

3 Regularized optimization

We often want to solve
\[
\min_{w \in \mathbb{R}^d} F(w) + R(w),
\]
where either $F$ is smooth (e.g. the square loss) and $R$ is convex and nonsmooth, or $R$ is smooth and $F$ is not (as for the SVM). We would like to write a condition similar to $\nabla F(w^\ast) = 0$ to characterize a minimizer. We use the subdifferential. Let $R$ be a convex, lsc, proper function. A vector $\eta \in \mathbb{R}^d$ is a subgradient of $R$ at $w$ if
\[
R(w') \ge R(w) + \langle \eta, w' - w\rangle \qquad \text{for all } w' \in \mathbb{R}^d .
\]
The subdifferential $\partial R(w)$ is the set of all subgradients of $R$ at $w$. It is easy to see that
\[
R(w^\ast) = \min R \iff 0 \in \partial R(w^\ast).
\]
If $R$ is differentiable, the subdifferential is a singleton and coincides with the gradient.

Example 3.1 (Subdifferential of the indicator function). Let $i_C$ be the indicator function of a convex set $C$ (constrained regularization). If $w \notin C$, then $\partial i_C(w) = \emptyset$. If $w \in C$, then $\eta \in \partial i_C(w)$ if and only if, for all $v \in C$,
\[
i_C(v) \ge i_C(w) + \langle \eta, v - w\rangle \iff 0 \ge \langle \eta, v - w\rangle .
\]
This set is the normal cone to $C$ at $w$.

Example 3.2 (Subdifferential of $R = \|\cdot\|_1$). By definition, $\eta \in \partial R(w)$ if and only if, for all $v$,
\[
\sum_{i=1}^d |v_i| \ge \sum_{i=1}^d |w_i| + \langle \eta, v - w\rangle .
\]
If, for every $i = 1, \ldots, d$ and every $v_i$, we have $|v_i| \ge |w_i| + \eta_i (v_i - w_i)$, then summing over $i$ gives $\eta \in \partial R(w)$. Vice versa, taking $v_j = w_j$ for all $j \neq i$, we get that $\eta \in \partial R(w)$ implies $|v_i| \ge |w_i| + \eta_i (v_i - w_i)$ for every $v_i$, and thus $\eta_i \in \partial|\cdot|(w_i)$. We therefore proved that
\[
\partial R(w) = \partial|\cdot|(w_1) \times \cdots \times \partial|\cdot|(w_d).
\]

Proximity operator. Let $R$ be lsc, convex and proper. Then
\[
\operatorname{prox}_R(v) = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Big\{ R(w) + \frac{1}{2}\|w - v\|^2 \Big\}
\]
is well-defined and unique. Imposing the first order necessary conditions, we get
\[
u = \operatorname{prox}_R(v) \iff 0 \in \partial R(u) + (u - v) \iff v - u \in \partial R(u) \iff u = (I + \partial R)^{-1}(v).
\]

Example 3.3.
(i) If $R = 0$, then $\operatorname{prox}_R(v) = v$.
(ii) If $R = i_C$, directly from the definition $\operatorname{prox}_R(v) = P_C(v)$.
(iii) Proximity operator of the $\ell^1$ norm. Let $\lambda > 0$ and set $R = \lambda\|\cdot\|_1$. Let $v \in \mathbb{R}^d$ and $u = \operatorname{prox}_R(v)$. Then $v - u \in \lambda\, \partial\|\cdot\|_1(u)$. Since the subdifferential can be computed componentwise, the same holds for the prox, and componentwise $u_i = (I + \lambda\,\partial|\cdot|)^{-1}(v_i)$. To compute this quantity, first note that
\[
\big( (I + \lambda\,\partial|\cdot|)(u) \big)_i = \begin{cases} u_i + \lambda & \text{if } u_i > 0 \\ [-\lambda, \lambda] & \text{if } u_i = 0 \\ u_i - \lambda & \text{if } u_i < 0 . \end{cases}
\]
Inverting the previous relationship we get the soft-thresholding operator
\[
\big( \operatorname{prox}_{\lambda\|\cdot\|_1}(v) \big)_i = \begin{cases} v_i - \lambda & \text{if } v_i > \lambda \\ 0 & \text{if } v_i \in [-\lambda, \lambda] \\ v_i + \lambda & \text{if } v_i < -\lambda . \end{cases}
\]
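The examples above translate directly into code. Below is a minimal sketch (function names are illustrative) of the proximity operators just computed: the identity for $R = 0$, a projection for $R = i_C$ (with $C$ a Euclidean ball, as an assumed example), and componentwise soft-thresholding for $R = \lambda\|\cdot\|_1$.

```python
import numpy as np

def prox_zero(v):
    # R = 0: the prox is the identity
    return v

def prox_indicator_ball(v, radius=1.0):
    # R = i_C with C the Euclidean ball of given radius: the prox is the projection P_C
    norm_v = np.linalg.norm(v)
    return v if norm_v <= radius else radius / norm_v * v

def prox_l1(v, lam):
    # R = lam * ||.||_1: componentwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([2.0, -0.3, 0.7, -1.5])
print(prox_l1(v, 0.5))   # each component is shrunk toward zero by 0.5 and small ones vanish
```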

4 Basic proximal algorithm (forward-backward splitting)

Assume that $F$ is convex and differentiable with Lipschitz continuous gradient. As for gradient descent, the idea is to start from a fixed point equation characterizing the minimizer. Writing the first order conditions, we get
\[
0 \in \nabla F(w^\ast) + \partial R(w^\ast) \iff -\alpha \nabla F(w^\ast) \in \alpha\, \partial R(w^\ast) \iff w^\ast - \alpha \nabla F(w^\ast) \in w^\ast + \alpha\, \partial R(w^\ast) \iff w^\ast = \operatorname{prox}_{\alpha R}\big( w^\ast - \alpha \nabla F(w^\ast) \big).
\]
We consider the corresponding fixed point iteration
\[
w_{k+1} = \operatorname{prox}_{\alpha_k R}\big( w_k - \alpha_k \nabla F(w_k) \big).
\]
Another interpretation:
\[
w_{k+1} = \operatorname*{argmin}_{w} \Big\{ \alpha_k R(w) + \frac{1}{2}\big\| w - \big( w_k - \alpha_k \nabla F(w_k) \big) \big\|^2 \Big\} = \operatorname*{argmin}_{w} \Big\{ R(w) + \frac{1}{2\alpha_k}\|w - w_k\|^2 + \langle w - w_k, \nabla F(w_k)\rangle + F(w_k) \Big\}.
\]
Special cases: $R = 0$ (gradient method) and $R = i_C$ (projected gradient method). The proof of convergence for the sequence of objective values with $\alpha_k = 1/L$ is similar to the proof of convergence in the differentiable case, and the rate of convergence is the same as in the differentiable case (this would not be the case if a subgradient method were used):
\[
F(w_k) - F(w^\ast) \le \frac{L\|w_0 - w^\ast\|^2}{2k}.
\]

Convergence proof. Set $\alpha_k = 1/L$ and define the gradient mapping as
\[
G_{1/L}(w) = L\Big( w - \operatorname{prox}_{R/L}\big( w - \tfrac{1}{L}\nabla F(w) \big) \Big).
\]
Then
\[
w_{k+1} = w_k - \frac{1}{L} G_{1/L}(w_k).
\]
Note that $G_{1/L}$ is not a gradient or a subgradient of $F + R$, but it is called the gradient mapping. By writing the first order condition for the prox operator, we get
\[
G_{1/L}(w) \in \nabla F(w) + \partial R\big( w - \tfrac{1}{L} G_{1/L}(w) \big).
\]
Recalling the upper bound (1), we obtain
\[
F\big( w - \tfrac{1}{L} G_{1/L}(w) \big) \le F(w) - \frac{1}{L}\langle \nabla F(w), G_{1/L}(w)\rangle + \frac{1}{2L}\|G_{1/L}(w)\|^2 . \tag{2}
\]
If inequality (2) holds, then for every $v \in \mathbb{R}^d$,
\[
(F + R)\big( w - \tfrac{1}{L} G_{1/L}(w) \big) \le (F + R)(v) + \langle G_{1/L}(w), w - v\rangle - \frac{1}{2L}\|G_{1/L}(w)\|^2,
\]
from which the rate above follows as in the smooth case.

Accelerated versions. Accelerated versions can be obtained as for the gradient method. The practical limitation is that the forward-backward algorithm is effective only when the prox is easy to compute. Note indeed that we replaced our original problem with a sequence of new minimization problems: they are strongly convex (therefore easier), but in general not solvable in closed form.
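A minimal sketch of the forward-backward (ISTA) iteration above applied to the lasso problem, with constant step $1/L$ and the soft-thresholding prox of Example 3.3 (synthetic data and names are assumptions):

```python
import numpy as np

# Forward-backward splitting (ISTA) for min_w 0.5||Xw - y||^2 + lam*||w||_1,
# with constant step alpha = 1/L.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.standard_normal(100)
lam = 0.1
L = np.linalg.eigvalsh(X.T @ X).max()       # Lipschitz constant of the smooth part

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(20)
for k in range(500):
    grad = X.T @ (X @ w - y)                # forward (gradient) step on F
    w = prox_l1(w - grad / L, lam / L)      # backward (proximal) step on alpha * R

print(np.count_nonzero(w), "nonzero coefficients")
```

Adding the momentum of the previous section to this iteration gives the accelerated (FISTA-type) variant mentioned above.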

5 Fenchel conjugate and Moreau decomposition

Fenchel conjugate. The Fenchel conjugate of $R$ is the function $R^\ast \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ defined as
\[
R^\ast(\eta) = \sup_{w \in \mathbb{R}^d} \{ \langle \eta, w\rangle - R(w) \}.
\]
$R^\ast$ is a convex function (even if $R$ is not), since it is the pointwise supremum of convex (affine) functions.

Examples.
1. Conjugate of an affine function. If $R(w) = \langle a, w\rangle + b$, then $R^\ast = i_{\{a\}} - b$, i.e. the conjugate is (up to a constant) an indicator function.
2. The conjugate of an indicator function $i_C$ is the support function $\sup_{u \in C}\langle u, \cdot\rangle$.
3. Conjugate of the norm $R(w) = \|w\|$. In this case define the dual norm $\|\eta\|_\ast = \sup_{\|w\| \le 1}\langle \eta, w\rangle$. Then $R^\ast = i_{B_\ast}$, where $B_\ast = \{\eta : \|\eta\|_\ast \le 1\}$ (for the $\ell^1$ norm, the dual norm is the $\ell^\infty$ norm). Indeed, for $\eta \in \mathbb{R}^d$,
\[
R^\ast(\eta) = \sup_{w \in \mathbb{R}^d} \{ \langle \eta, w\rangle - \|w\| \} = \sup_{t \in \mathbb{R}_+} \sup_{\|w\| = 1} \{ t\langle \eta, w\rangle - t \} = \sup_{t \in \mathbb{R}_+} t\big( \|\eta\|_\ast - 1 \big),
\]
and thus $R^\ast(\eta) = i_{B_\ast}(\eta)$.

By definition, $R^\ast(\eta) + R(w) \ge \langle \eta, w\rangle$ for all $\eta, w \in \mathbb{R}^d$ (Fenchel-Young inequality). Moreover,
\[
R(w) + R^\ast(\eta) = \langle \eta, w\rangle \iff \eta \in \partial R(w) \iff w \in \partial R^\ast(\eta).
\]
Indeed, $R^\ast(\eta) = \langle \eta, w\rangle - R(w)$ if and only if $\langle \eta, u\rangle - R(u) \le \langle \eta, w\rangle - R(w)$ for every $u$, if and only if $\eta \in \partial R(w)$. Moreover, from $R(w) + R^\ast(\eta) = \langle w, \eta\rangle$ we get, for every $\eta'$,
\[
R^\ast(\eta') = \sup_{u} \{ \langle \eta', u\rangle - R(u) \} \ge \langle \eta', w\rangle - R(w) = \langle \eta' - \eta, w\rangle + \langle \eta, w\rangle - R(w) = \langle \eta' - \eta, w\rangle + R^\ast(\eta),
\]
that is, $w \in \partial R^\ast(\eta)$. If $R$ is lsc and convex, then $R^{\ast\ast} = R$, which gives the remaining implication.

Moreau decomposition.
\[
w = \operatorname{prox}_R(w) + \operatorname{prox}_{R^\ast}(w).
\]
It follows from the properties of the subdifferential and of the conjugate stated above:
\[
u = \operatorname{prox}_R(w) \iff w - u \in \partial R(u) \iff u \in \partial R^\ast(w - u) \iff w - (w - u) \in \partial R^\ast(w - u) \iff w - u = \operatorname{prox}_{R^\ast}(w).
\]
Note that this is a generalization of the classical decomposition into orthogonal components: if $V$ is a linear subspace and $V^\perp$ is the orthogonal subspace, we know that $w = P_V(w) + P_{V^\perp}(w)$. This is the special case of the Moreau decomposition obtained by choosing $R = i_V$ (and noting that $R^\ast = i_{V^\perp}$).

Properties of the proximity operator.
Separable sum: if $R(w) = R_1(w_1) + R_2(w_2)$, then $\operatorname{prox}_R(w) = \big( \operatorname{prox}_{R_1}(w_1), \operatorname{prox}_{R_2}(w_2) \big)$.
Scaling: $\operatorname{prox}_{R + \frac{\mu}{2}\|\cdot\|^2}(v) = \operatorname{prox}_{\frac{1}{1+\mu} R}\Big( \dfrac{v}{1+\mu} \Big)$.
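Before generalizing, here is a quick numerical check of the Moreau decomposition (a sketch; the $\ell^1$/$\ell^\infty$ pair is an assumed example): with $R = \|\cdot\|_1$, whose conjugate is the indicator of the $\ell^\infty$ unit ball, $\operatorname{prox}_R(w) + \operatorname{prox}_{R^\ast}(w)$ should return $w$.

```python
import numpy as np

# Numerical check of w = prox_R(w) + prox_{R*}(w) for R = ||.||_1,
# whose conjugate is the indicator of the l_inf unit ball.
def prox_l1(v, t=1.0):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proj_linf_ball(v, radius=1.0):
    return np.clip(v, -radius, radius)     # prox of the indicator = projection

w = np.array([3.0, -0.4, 0.9, -2.2, 0.0])
assert np.allclose(prox_l1(w) + proj_linf_ball(w), w)
```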

Generalized Moreau decomposition: for every $\lambda > 0$,
\[
w = \operatorname{prox}_{\lambda R}(w) + \lambda\, \operatorname{prox}_{R^\ast/\lambda}(w/\lambda).
\]
Sometimes, the Moreau decomposition is useful to compute proximity operators. Let $R(w) = \lambda\|w\|$. We have seen that $R^\ast = i_{B_\ast(\lambda)}$, the indicator of the dual-norm ball of radius $\lambda$. Therefore, from the Moreau decomposition, we get
\[
\operatorname{prox}_R(w) = w - P_{B_\ast(\lambda)}(w).
\]
In particular, if $R = \lambda\|\cdot\|_1$, noting that the dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$, we obtain again the soft-thresholding formula seen before.

Elastic net. The scaling property above reduces the proximity operator of the elastic net penalty $\lambda\|\cdot\|_1 + \frac{\mu}{2}\|\cdot\|^2$ to a rescaled soft-thresholding.

Group lasso. Let $\mathcal{G} = \{G_1, \ldots, G_t\}$ be a partition of the indices $\{1, \ldots, d\}$. The following norm is called the group lasso penalty:
\[
R(w) = \sum_{i=1}^t \|w_{G_i}\|, \qquad \text{where } \|w_{G_i}\|^2 = \sum_{j \in G_i} w_j^2 .
\]
The dual norm is $\max_{j=1,\ldots,t} \|w_{G_j}\|$, and therefore
\[
\operatorname{prox}_R(w) = w - P_B(w), \qquad B = \{ w \in \mathbb{R}^d : \|w_{G_j}\| \le 1,\ j = 1, \ldots, t \}.
\]
The projection onto this set can be expressed blockwise as
\[
\big( P_B(w) \big)_{G_j} = \begin{cases} w_{G_j} & \text{if } \|w_{G_j}\| \le 1 \\ \dfrac{w_{G_j}}{\|w_{G_j}\|} & \text{otherwise.} \end{cases}
\]
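Putting the last formula to work, a sketch of the group lasso proximity operator obtained via the Moreau decomposition, $\operatorname{prox}_R(w) = w - P_B(w)$ (group structure and data are assumptions made for illustration):

```python
import numpy as np

# Prox of the group lasso penalty R(w) = sum_j ||w_{G_j}|| via the Moreau
# decomposition prox_R(w) = w - P_B(w), where B is the unit ball of the
# dual norm max_j ||w_{G_j}||.
def proj_dual_ball(w, groups):
    p = w.copy()
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        if norm_g > 1.0:
            p[g] = w[g] / norm_g
    return p

def prox_group_lasso(w, groups):
    return w - proj_dual_ball(w, groups)   # blockwise soft-thresholding

groups = [np.arange(0, 3), np.arange(3, 6)]
w = np.array([3.0, 0.0, 4.0, 0.1, 0.2, 0.1])
print(prox_group_lasso(w, groups))   # first block shrunk, second block set to zero
```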

6 Computation of the proximity operator of matrix functions

Let $W \in \mathbb{R}^{D \times T}$ and let $\sigma \colon \mathbb{R}^{D \times T} \to \mathbb{R}^q$, where $q = \min\{D, T\}$ and $\sigma_1(W) \ge \sigma_2(W) \ge \cdots \ge \sigma_q(W) \ge 0$ are the singular values of the matrix $W$. We consider regularization terms of the form
\[
R(W) = g(\sigma(W)),
\]
where $g$ is absolutely symmetric, namely $g(\alpha) = g(\hat{\alpha})$, where $\hat{\alpha}$ is the vector obtained by ordering the absolute values of the components of $\alpha$ in decreasing order. We will show that the computation of the proximity operator of $R$ can be reduced to the computation of that of the function $g$. Indeed, if $W = U \operatorname{diag}(\sigma(W)) V^T$ (with $U \in \mathbb{R}^{D \times D}$, $\operatorname{diag}(\sigma(W)) \in \mathbb{R}^{D \times T}$, and $V \in \mathbb{R}^{T \times T}$), it holds that
\[
\operatorname{prox}_{\lambda R}(W) = U \operatorname{diag}\big( \operatorname{prox}_{\lambda g}(\sigma(W)) \big) V^T .
\]
The proof of this statement is based on two results.

Theorem 6.1 (Von Neumann, 1937). For any $D \times T$ matrices $Z$ and $W$,
\[
\max\{ \langle U Z V^T, W\rangle : U \text{ and } V \text{ orthogonal} \} = \langle \sigma(Z), \sigma(W)\rangle,
\]
and hence $\langle Z, W\rangle \le \langle \sigma(Z), \sigma(W)\rangle$. Equality holds if and only if there exists a simultaneous SVD of $Z$ and $W$.

Theorem 6.2 (Conjugacy formula). If $g \colon \mathbb{R}^q \to \,]-\infty, +\infty]$ is absolutely symmetric, then $(g \circ \sigma)^\ast = g^\ast \circ \sigma$.

Proof. Let $W \in \mathbb{R}^{D \times T}$. Then
\[
\begin{aligned}
(g \circ \sigma)^\ast(W) &= \sup_{Z \in \mathbb{R}^{D \times T}} \{ \langle Z, W\rangle - g(\sigma(Z)) \} \\
&= \sup_{U, V \text{ orthogonal},\, A \in \mathbb{R}^{D \times T}} \{ \langle U A V^T, W\rangle - g(\sigma(A)) \} \\
&= \sup_{A \in \mathbb{R}^{D \times T}} \Big\{ \sup_{U, V \text{ orthogonal}} \langle U A V^T, W\rangle - g(\sigma(A)) \Big\} \\
&= \sup_{A \in \mathbb{R}^{D \times T}} \{ \langle \sigma(A), \sigma(W)\rangle - g(\sigma(A)) \} \\
&= \sup_{\alpha \in \mathbb{R}^q} \{ \langle \alpha, \sigma(W)\rangle - g(\alpha) \} = g^\ast(\sigma(W)).
\end{aligned}
\]

Using the two previous results it is easy to show a formula for the subdifferential of the function $R$.

Theorem 6.3. Let $g \colon \mathbb{R}^q \to \,]-\infty, +\infty]$ be absolutely symmetric. Then
\[
\partial (g \circ \sigma)(W) = \{ U \operatorname{diag}(\alpha) V^T : \alpha \in \partial g(\sigma(W)),\ W = U \operatorname{diag}(\sigma(W)) V^T,\ U \text{ and } V \text{ orthogonal} \}.
\]

Proof. By definition of the Fenchel conjugate and using the previous theorem,
\[
\partial (g \circ \sigma)(W) = \{ Z : (g \circ \sigma)(W) + (g \circ \sigma)^\ast(Z) = \langle W, Z\rangle \} = \{ Z : (g \circ \sigma)(W) + (g^\ast \circ \sigma)(Z) = \langle W, Z\rangle \}.
\]
Now, by the Fenchel-Young inequality and Theorem 6.1,
\[
(g \circ \sigma)(W) + (g^\ast \circ \sigma)(Z) \ge \langle \sigma(W), \sigma(Z)\rangle \ge \langle W, Z\rangle .
\]
Let $Z \in \partial (g \circ \sigma)(W)$. Then we must have $\langle \sigma(W), \sigma(Z)\rangle = \langle W, Z\rangle$ and hence, by Von Neumann's theorem, $Z$ and $W$ have a simultaneous singular value decomposition. Moreover, $g(\sigma(W)) + g^\ast(\sigma(Z)) = \langle \sigma(W), \sigma(Z)\rangle$, and therefore $\sigma(Z) \in \partial g(\sigma(W))$. So we proved one inclusion. Let us prove the other one. Take $\alpha \in \partial g(\sigma(W))$ and define $Z = U \operatorname{diag}(\alpha) V^T$, where $U$ and $V$ are orthogonal matrices such that $W = U \operatorname{diag}(\sigma(W)) V^T$. Then $W$ and $Z$ have a simultaneous singular value decomposition and hence
\[
(g \circ \sigma)(W) + (g \circ \sigma)^\ast(Z) = g(\sigma(W)) + g^\ast(\alpha) = \langle \sigma(W), \alpha\rangle = \langle \sigma(W), \sigma(Z)\rangle = \langle W, Z\rangle .
\]
This implies that $Z \in \partial (g \circ \sigma)(W)$.

Theorem 6.4. Let $R = g \circ \sigma$, with $g \colon \mathbb{R}^q \to \,]-\infty, +\infty]$ absolutely symmetric, let $W = U \operatorname{diag}(\sigma(W)) V^T$, with $U$ and $V$ orthogonal, and let $\lambda > 0$. Then
\[
\operatorname{prox}_{\lambda R}(W) = U \operatorname{diag}\big( \operatorname{prox}_{\lambda g}(\sigma(W)) \big) V^T .
\]

Proof. By definition, $\hat{Z} = \operatorname{prox}_{\lambda R}(W)$ is the unique minimizer of
\[
Z \mapsto R(Z) + \frac{1}{2\lambda}\|Z - W\|^2,
\]
and is thus the unique point in $\mathbb{R}^{D \times T}$ satisfying
\[
\frac{1}{\lambda}\big( W - \hat{Z} \big) \in \partial R(\hat{Z}).
\]
Now, let $\bar{Z} = U \operatorname{diag}\big( \operatorname{prox}_{\lambda g}(\sigma(W)) \big) V^T$, and let $\alpha = \operatorname{prox}_{\lambda g}(\sigma(W))$. By definition of $\operatorname{prox}_{\lambda g}$, we have
\[
\frac{1}{\lambda}\big( \sigma(W) - \alpha \big) \in \partial g(\alpha).
\]
Using the characterization of the subdifferential found in Theorem 6.3, we get
\[
U \operatorname{diag}\Big( \frac{1}{\lambda}\big( \sigma(W) - \alpha \big) \Big) V^T \in \partial (g \circ \sigma)(\bar{Z}),
\]
and the conclusion follows by noting that $\hat{Z} = \bar{Z}$, since the left hand side coincides with $(W - \bar{Z})/\lambda$.

Using these general results it is easy to compute the proximity operator of the nuclear norm (a.k.a. trace class norm, Schatten 1-norm):
\[
R(W) = \|W\|_1 = \sum_{i=1}^q \sigma_i(W).
\]
The function $\|\cdot\|_1$ is the composition of the absolutely symmetric function $g \colon \mathbb{R}^q \to \mathbb{R}$, $g(\alpha) = \sum_{i=1}^q |\alpha_i|$, with $\sigma$. The computation of the prox of $R$ is thus reduced to the computation of the prox of the $\ell^1$ norm in $\mathbb{R}^q$. We already know that $\operatorname{prox}_{\lambda g}(\alpha) = S_\lambda(\alpha)$, the soft-thresholding operator, and finally we get
\[
\operatorname{prox}_{\lambda R}(W) = U \operatorname{diag}\big( S_\lambda(\sigma(W)) \big) V^T, \qquad \text{where } W = U \operatorname{diag}(\sigma(W)) V^T .
\]
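As a concrete instance of Theorem 6.4, a sketch of the nuclear norm proximity operator obtained by soft-thresholding the singular values (the test matrix is arbitrary; `np.linalg.svd` returns the factors of $W = U \operatorname{diag}(\sigma(W)) V^T$):

```python
import numpy as np

# Prox of the nuclear norm lam * ||W||_* by soft-thresholding the singular
# values (Theorem 6.4 with g the l1 norm). Test matrix is arbitrary.
def prox_nuclear(W, lam):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_thresh = np.maximum(s - lam, 0.0)          # prox of lam * ||.||_1 on the singular values
    return U @ np.diag(s_thresh) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
Z = prox_nuclear(W, lam=1.0)
print(np.linalg.matrix_rank(Z), "<= rank of W =", np.linalg.matrix_rank(W))
```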

7 References

[A] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, 2009.
[B] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling and Simulation, 2005.
[C] A. S. Lewis, The convex analysis of unitarily invariant matrix functions, Journal of Convex Analysis, 1995.
[D] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 2004.
[E] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2009.