Proximal methods. S. Villa. October 7, 2014
1 Review of the basics

Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires solving a problem of the form

    min_{c ∈ ℝ^d} ‖y − Kc‖²

for various choices of the loss function. Another typical problem is the regularized one, e.g. Tikhonov regularization, where for linear kernels one looks for

    min_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n V(⟨w, x_i⟩, y_i) + λ R(w).

The class of methods we will consider is suitable for solving problems involving a nonsmooth regularization term. In particular, we will be motivated by the following regularization terms:

(i) ℓ1 norm regularization
(ii) group ℓ1 norm regularization
(iii) matrix norm regularization

More generally, we are interested in solving a minimization problem

    min_{w ∈ ℝ^d} F(w).

We review the basic concepts that allow us to study this problem.

Existence of a minimizer. We will consider extended real-valued functions F : ℝ^d → ℝ ∪ {+∞}. The domain of F is

    dom F = {w ∈ ℝ^d : F(w) < +∞}.

F is proper if its domain is nonempty. It is useful to consider extended-valued functions since they allow constraints to be included in the regularization term. F is lower semicontinuous (lsc) if its epigraph epi F is closed. F is coercive if lim_{‖w‖→+∞} F(w) = +∞.

Theorem 1.1. If F is lower semicontinuous and coercive, then there exists w* such that F(w*) = min F.

We will always assume that the functions we consider are lower semicontinuous.
1.1 Convexity concepts

Convexity. F is convex if

    (∀ w, w' ∈ dom F)(∀ λ ∈ [0, 1])   F(λw + (1−λ)w') ≤ λF(w) + (1−λ)F(w').

If F is differentiable, we can write an equivalent characterization of convexity based on the gradient:

    (∀ w, w' ∈ ℝ^d)   F(w') ≥ F(w) + ⟨∇F(w), w' − w⟩.

If F is twice differentiable and ∇²F is the Hessian matrix, convexity is equivalent to ∇²F(w) being positive semidefinite for all w ∈ ℝ^d. If a function is convex and differentiable, then ∇F(w) = 0 implies that w is a global minimizer.

Strict convexity. F is strictly convex if

    (∀ w, w' ∈ dom F, w ≠ w')(∀ λ ∈ (0, 1))   F(λw + (1−λ)w') < λF(w) + (1−λ)F(w').

If F is differentiable, we can write an equivalent characterization of strict convexity based on the gradient:

    (∀ w, w' ∈ ℝ^d, w ≠ w')   F(w') > F(w) + ⟨∇F(w), w' − w⟩.

If F is twice differentiable, strict convexity is implied by ∇²F(w) being positive definite for all w ∈ ℝ^d. The minimizer of a strictly convex function is unique (if it exists).

Strong convexity. F is μ-strongly convex if the function F − (μ/2)‖·‖² is convex, i.e.

    (∀ w, w' ∈ dom F)(∀ λ ∈ [0, 1])   F(λw + (1−λ)w') ≤ λF(w) + (1−λ)F(w') − (μ/2) λ(1−λ) ‖w − w'‖².

If F is differentiable, then strong convexity is equivalent to

    (∀ w, w' ∈ ℝ^d)   F(w') ≥ F(w) + ⟨∇F(w), w' − w⟩ + (μ/2) ‖w' − w‖².

If F is twice differentiable, strong convexity is equivalent to ∇²F(w) ⪰ μI for all w ∈ ℝ^d. If F is strongly convex then it is coercive; therefore, if it is also lsc, it admits a unique minimizer w*. Moreover,

    F(w) − F(w*) ≥ (μ/2) ‖w − w*‖².

Lipschitz continuity of the gradient. We will often assume that ∇F is L-Lipschitz continuous,

    ‖∇F(w) − ∇F(w')‖ ≤ L ‖w − w'‖.

This gives a useful quadratic upper bound on F:

    F(w') ≤ F(w) + ⟨∇F(w), w' − w⟩ + (L/2) ‖w' − w‖²   (∀ w, w' ∈ dom F).   (1)

Moreover, for every w ∈ dom F and every minimizer w*,

    (1/2L) ‖∇F(w)‖² ≤ F(w) − F(w*) ≤ (L/2) ‖w − w*‖².

The second inequality follows by substituting w = w* and w' = w in the quadratic upper bound (and using ∇F(w*) = 0). The first follows by substituting w' = w − (1/L)∇F(w).
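As a quick numerical sanity check of the quadratic upper bound (1), the following sketch uses the least-squares objective F(w) = ½‖Xw − y‖², whose gradient is Lipschitz with L = λ_max(XᵀX). The data and dimensions are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

def F(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

def gradF(w):
    return X.T @ (X @ w - y)

# Lipschitz constant of the gradient: largest eigenvalue of X^T X
L = np.linalg.eigvalsh(X.T @ X).max()

# Check F(w') <= F(w) + <gradF(w), w' - w> + (L/2)||w' - w||^2 at random pairs
gaps = []
for _ in range(100):
    w, wp = rng.standard_normal(5), rng.standard_normal(5)
    upper = F(w) + gradF(w) @ (wp - w) + 0.5 * L * np.sum((wp - w) ** 2)
    gaps.append(upper - F(wp))

print(min(gaps) >= -1e-9)  # the bound holds (up to rounding)
```

For a quadratic F the gap is exactly ½(w'−w)ᵀ(L·I − XᵀX)(w'−w) ≥ 0, so the check succeeds with margin.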
2 Convergence of the gradient method with constant step-size

Assume F to be convex and differentiable with L-Lipschitz continuous gradient, and that a minimizer w* exists. The first order necessary condition is ∇F(w*) = 0. Therefore, for any α > 0,

    w* − α∇F(w*) = w*.

This suggests an algorithm based on the fixed point iteration

    w_{k+1} = w_k − α∇F(w_k).

We want to study the convergence of this algorithm. Convergence can be intended in two senses: towards the minimum (convergence of the objective values) or towards a minimizer (convergence of the iterates). We start from the first one. There are different strategies to choose the stepsize; here we keep α fixed and determine a priori conditions guaranteeing convergence. From the quadratic upper bound (1) we get

    F(w_{k+1}) ≤ F(w_k) − α‖∇F(w_k)‖² + (Lα²/2)‖∇F(w_k)‖² = F(w_k) − α(1 − (L/2)α)‖∇F(w_k)‖².

If 0 < α < 2/L the iteration decreases the function value. Choose α = 1/L (which gives the maximum decrease of the bound) and get

    F(w_{k+1}) ≤ F(w_k) − (1/2L)‖∇F(w_k)‖²
              ≤ F(w*) + ⟨∇F(w_k), w_k − w*⟩ − (1/2L)‖∇F(w_k)‖²   (by convexity)
              = F(w*) + (L/2)( (2/L)⟨∇F(w_k), w_k − w*⟩ − (1/L²)‖∇F(w_k)‖² − ‖w_k − w*‖² + ‖w_k − w*‖² )
              = F(w*) + (L/2)( ‖w_k − w*‖² − ‖w_k − (1/L)∇F(w_k) − w*‖² )
              = F(w*) + (L/2)( ‖w_k − w*‖² − ‖w_{k+1} − w*‖² ).

Summing the above inequality for k = 0, ..., K−1, the right-hand side telescopes:

    Σ_{k=0}^{K−1} ( F(w_{k+1}) − F(w*) ) ≤ (L/2) Σ_{k=0}^{K−1} ( ‖w_k − w*‖² − ‖w_{k+1} − w*‖² ) ≤ (L/2) ‖w_0 − w*‖².

Noting that F(w_k) is decreasing, F(w_K) − F(w*) ≤ F(w_k) − F(w*) for every k ≤ K, therefore we obtain

    F(w_K) − F(w*) ≤ (L/2K) ‖w_0 − w*‖².

This is called a sublinear rate of convergence. For strongly convex functions, it is possible to prove that the operator I − α∇F is a contraction, and therefore we get a linear convergence rate:

    ‖w_K − w*‖² ≤ ((L − μ)/(L + μ))^{2K} ‖w_0 − w*‖²,
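The iteration with α = 1/L can be sketched on a least-squares problem and checked against the two facts just proved: monotone decrease of F(w_k) and the L‖w_0 − w*‖²/(2K) bound. Data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)

F = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
gradF = lambda w: X.T @ (X @ w - y)
L = np.linalg.eigvalsh(X.T @ X).max()

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # a minimizer
F_star = F(w_star)

w = np.zeros(4)  # w_0 = 0, so ||w_0 - w*||^2 = ||w*||^2
values = [F(w)]
K = 200
for _ in range(K):
    w = w - (1.0 / L) * gradF(w)   # w_{k+1} = w_k - (1/L) grad F(w_k)
    values.append(F(w))

# F(w_k) is non-increasing ...
assert all(values[k + 1] <= values[k] + 1e-10 for k in range(K))
# ... and satisfies the sublinear rate F(w_K) - F* <= L ||w_0 - w*||^2 / (2K)
assert values[-1] - F_star <= L * np.sum(w_star ** 2) / (2 * K) + 1e-10
```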
which gives, using the bound following (1),

    F(w_K) − F(w*) ≤ (L/2) ((L − μ)/(L + μ))^{2K} ‖w_0 − w*‖²,

which is much better.

It is known that for general convex problems with Lipschitz continuous gradient, the worst-case performance of any first order method is lower bounded by a quantity of order 1/k². Nesterov in 1983 devised an algorithm reaching this lower bound. The algorithm is called accelerated gradient descent and is very similar to the gradient method; it needs to store two iterates instead of only one. It is of the form

    w_{k+1} = u_k − (1/L)∇F(u_k)
    u_{k+1} = a_k w_k + b_k w_{k+1},

for some w_0 ∈ dom F, with u_1 = w_0 and a suitable (a priori determined) sequence of parameters a_k and b_k. More precisely, choose w_0 ∈ dom F, set u_1 = w_0 and t_1 = 1. Then define

    w_{k+1} = u_k − (1/L)∇F(u_k)
    t_{k+1} = (1 + √(1 + 4t_k²)) / 2
    u_{k+1} = w_{k+1} + ((t_k − 1)/t_{k+1}) (w_{k+1} − w_k).

We obtain

    F(w_k) − F(w*) ≤ (L ‖w_0 − w*‖²) / (2k²).

3 Regularized optimization

We often want to solve

    min_{w ∈ ℝ^d} F(w) + R(w),

where either F is smooth (e.g. the square loss) and R is convex and nonsmooth, or R is smooth and F is not (as for the SVM, where the hinge loss is nonsmooth). We would like to write a condition similar to ∇F(w*) = 0 to characterize a minimizer. We use the subdifferential. Let R be a convex, lsc, proper function. A vector η ∈ ℝ^d is a subgradient of R at w if

    R(w') ≥ R(w) + ⟨η, w' − w⟩   for all w' ∈ ℝ^d.

The subdifferential ∂R(w) is the set of all subgradients of R at w. It is easy to see that

    R(w*) = min R  ⟺  0 ∈ ∂R(w*).

If R is differentiable at w, the subdifferential is a singleton and coincides with the gradient.

Example 3.1 (Subdifferential of the indicator function). Let i_C be the indicator function of a convex set C (constrained regularization). If w ∉ C, then ∂i_C(w) = ∅. If w ∈ C, then η ∈ ∂i_C(w) if and only if, for all v ∈ C,

    i_C(v) − i_C(w) ≥ ⟨η, v − w⟩  ⟺  0 ≥ ⟨η, v − w⟩.

The set of such η is the normal cone to C at w.
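The accelerated scheme described earlier in this section can be sketched as follows on a least-squares problem. The constant in the 1/k² rate varies across references, so the check below uses the standard guarantee 2L‖w_0 − w*‖²/(k+1)²; data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
y = rng.standard_normal(50)

F = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
gradF = lambda w: X.T @ (X @ w - y)
L = np.linalg.eigvalsh(X.T @ X).max()
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
F_star = F(w_star)

w, u, t = np.zeros(8), np.zeros(8), 1.0   # w_0 = 0, u_1 = w_0, t_1 = 1
K = 100
for _ in range(K):
    w_next = u - gradF(u) / L                        # gradient step at the extrapolated point
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2        # t_{k+1}
    u = w_next + ((t - 1) / t_next) * (w_next - w)   # extrapolation u_{k+1}
    w, t = w_next, t_next

# standard O(1/k^2) guarantee (constants differ between references)
bound = 2 * L * np.sum(w_star ** 2) / (K + 1) ** 2
assert F(w) - F_star <= bound + 1e-9
```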
Example 3.2 (Subdifferential of R = ‖·‖₁). If η is such that, for all i = 1, ..., d,

    |v_i| ≥ |w_i| + η_i (v_i − w_i),

then summing over i gives

    Σ_{i=1}^d |v_i| ≥ Σ_{i=1}^d |w_i| + ⟨η, v − w⟩,

i.e. η ∈ ∂R(w). Vice versa, taking v_j = w_j for all j ≠ i, we get that η ∈ ∂R(w) implies |v_i| ≥ |w_i| + η_i (v_i − w_i), and thus η_i ∈ ∂|·|(w_i). We therefore proved that

    ∂R(w) = ∂|·|(w_1) × ... × ∂|·|(w_d).

Proximity operator. Let R be lsc, convex and proper. Then

    prox_R(v) = argmin_{w ∈ ℝ^d} { R(w) + (1/2)‖w − v‖² }

is well-defined: the minimizer exists and is unique. Imposing the first order necessary conditions, we get

    u = prox_R(v)  ⟺  0 ∈ ∂R(u) + (u − v)  ⟺  v − u ∈ ∂R(u)  ⟺  u = (I + ∂R)^{−1}(v).

Example 3.3.
i) If R = 0, then prox_R(v) = v.
ii) If R = i_C, directly from the definition, prox_R(v) = P_C(v), the projection onto C.
iii) Proximity operator of the ℓ1 norm. Let λ > 0 and set R = λ‖·‖₁. Let v ∈ ℝ^d and u = prox_R(v). Then v − u ∈ λ∂‖·‖₁(u). Since the subdifferential can be computed componentwise, the same holds for the prox: u_i = (I + λ∂|·|)^{−1}(v_i). To compute this quantity, first note that

    (I + λ∂|·|)(u_i) = { u_i + λ    if u_i > 0
                        [−λ, λ]    if u_i = 0
                        u_i − λ    if u_i < 0

Inverting the previous relationship we get

    (prox_{λ‖·‖₁}(v))_i = { v_i − λ   if v_i > λ
                            0         if v_i ∈ [−λ, λ]
                            v_i + λ   if v_i < −λ

4 Basic proximal algorithm (forward-backward splitting)

Assume that F is convex and differentiable with Lipschitz continuous gradient, and that R is convex, lsc and proper. As for gradient descent, the idea is to start from a fixed point equation characterizing the minimizers. Writing the first order conditions, for any α > 0,

    0 ∈ ∇F(w*) + ∂R(w*)
    ⟺ −α∇F(w*) ∈ α∂R(w*)
    ⟺ w* − α∇F(w*) ∈ w* + α∂R(w*)
    ⟺ w* = prox_{αR}(w* − α∇F(w*)).
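As a brief aside before using this fixed point equation: the ℓ1 prox derived above (soft-thresholding) is one of the few proxes available in closed form, and it can be verified directly against the optimality condition v − u ∈ λ∂‖·‖₁(u). A minimal sketch with illustrative data:

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: the componentwise prox of lam * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

lam = 1.0
rng = np.random.default_rng(0)
v = 2.0 * rng.standard_normal(6)
u = prox_l1(v, lam)

obj = lambda w: lam * np.abs(w).sum() + 0.5 * np.sum((w - v) ** 2)
# u beats random perturbations on the prox objective
assert all(obj(u) <= obj(u + 0.1 * rng.standard_normal(6)) + 1e-12 for _ in range(100))
# optimality condition: v - u lies in lam * subdiff ||u||_1, hence |v_i - u_i| <= lam
assert np.all(np.abs(v - u) <= lam + 1e-12)
```

For u_i ≠ 0 the residual v_i − u_i equals exactly λ·sign(u_i); for u_i = 0 it is v_i itself, with |v_i| ≤ λ.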
We consider the fixed point iteration

    w_{k+1} = prox_{α_k R}(w_k − α_k ∇F(w_k)).

Another interpretation:

    w_{k+1} = argmin_w { α_k R(w) + (1/2)‖w − (w_k − α_k ∇F(w_k))‖² }
            = argmin_w { R(w) + (1/2α_k)‖w − w_k‖² + ⟨w − w_k, ∇F(w_k)⟩ + F(w_k) },

i.e. at each step we minimize R plus a quadratic model of F around w_k. Special cases: R = 0 (gradient method), R = i_C (projected gradient method). The proof of convergence for the sequence of objective values with α_k = 1/L is similar to the proof of convergence for the differentiable case. The rate of convergence is the same as in the differentiable case (this would not be the case if a subgradient method was used, compare...):

    F(w_k) − F(w*) ≤ (L ‖w_0 − w*‖²) / (2k).

Convergence proof. Set α_k = 1/L and define the gradient mapping as

    G_{1/L}(w) = L ( w − prox_{R/L}( w − (1/L)∇F(w) ) ).

Then

    w_{k+1} = w_k − (1/L) G_{1/L}(w_k).

Note that G_{1/L} is not a gradient or a subgradient of F + R, but it is nonetheless called the gradient mapping. By writing the first order condition for the prox operator, we get

    G_{1/L}(w) − ∇F(w) ∈ ∂R( w − (1/L) G_{1/L}(w) ).

Recalling the upper bound (1), we obtain

    F( w − (1/L) G_{1/L}(w) ) ≤ F(w) − (1/L) ⟨∇F(w), G_{1/L}(w)⟩ + (1/2L) ‖G_{1/L}(w)‖².   (2)

If inequality (2) holds, then, combining it with the convexity of F and R, for every v ∈ ℝ^d: ...

    (F + R)( w − (1/L) G_{1/L}(w) ) ≤ (F + R)(v) + ⟨G_{1/L}(w), w − v⟩ − (1/2L) ‖G_{1/L}(w)‖².

Accelerated versions. These exist, as for the gradient method. The practical limitation is that the forward-backward algorithm is effective only when the prox is easy to compute. Note indeed that we replaced our original problem with a sequence of new minimization problems; they are strongly convex (therefore easier), but in general not solvable in closed form.

5 Fenchel conjugate and Moreau decomposition

Fenchel conjugate. The Fenchel conjugate of R is the function R* : ℝ^d → ℝ ∪ {+∞} defined as

    R*(η) = sup_{w ∈ ℝ^d} { ⟨η, w⟩ − R(w) }.

R* is a convex function (even if R is not), since it is the pointwise supremum of affine functions.

Examples.
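The fixed point iteration above, specialized to F = least squares and R = λ‖·‖₁, is the classical ISTA scheme for the lasso. A minimal sketch with illustrative data; the descent property of the objective values with step 1/L is checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]            # sparse ground truth (illustrative)
y = X @ w_true + 0.01 * rng.standard_normal(40)

lam = 0.5
F = lambda w: 0.5 * np.sum((X @ w - y) ** 2)          # smooth part
R = lambda w: lam * np.abs(w).sum()                   # nonsmooth part
gradF = lambda w: X.T @ (X @ w - y)
prox = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * lam, 0.0)

L = np.linalg.eigvalsh(X.T @ X).max()
w = np.zeros(10)
vals = [F(w) + R(w)]
for _ in range(500):
    # forward (gradient) step, then backward (prox) step
    w = prox(w - gradF(w) / L, 1.0 / L)
    vals.append(F(w) + R(w))

# forward-backward with step 1/L decreases F + R at every iteration
assert all(vals[k + 1] <= vals[k] + 1e-10 for k in range(500))
```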
1. Conjugate of an affine function. If R(w) = ⟨a, w⟩ + b, then R*(η) = i_{{a}}(η) − b: it is (a shifted) indicator function.

2. The conjugate of an indicator function i_C is the support function of C: (i_C)*(η) = sup_{u ∈ C} ⟨u, η⟩.

3. Conjugate of the norm R(w) = ‖w‖. In this case define the dual norm ‖η‖_* = sup_{‖w‖ ≤ 1} ⟨η, w⟩. Then R* = i_B, where B = {η : ‖η‖_* ≤ 1} (for the ℓ1 norm, the dual norm is the ℓ∞ norm). Indeed, let η ∈ ℝ^d:

    R*(η) = sup_{w ∈ ℝ^d} { ⟨η, w⟩ − ‖w‖ } = sup_{t ≥ 0} sup_{‖w‖ = 1} { t⟨η, w⟩ − t } = sup_{t ≥ 0} t(‖η‖_* − 1),

and thus R*(η) = 0 if ‖η‖_* ≤ 1 and +∞ otherwise, i.e. R* = i_B.

By definition,

    R(w) + R*(η) ≥ ⟨η, w⟩   for all η, w ∈ ℝ^d   (Fenchel-Young inequality).

Moreover,

    R(w) + R*(η) = ⟨η, w⟩  ⟺  η ∈ ∂R(w)  ⟺  w ∈ ∂R*(η).

Indeed, R*(η) = ⟨η, w⟩ − R(w) iff ⟨η, w'⟩ − R(w') ≤ ⟨η, w⟩ − R(w) for every w', iff η ∈ ∂R(w). Moreover, from R(w) + R*(η) = ⟨η, w⟩ we get, for every η',

    R*(η') = sup_u { ⟨η', u⟩ − R(u) } ≥ ⟨η', w⟩ − R(w) = ⟨η' − η, w⟩ + ⟨η, w⟩ − R(w) = R*(η) + ⟨η' − η, w⟩,

i.e. w ∈ ∂R*(η). If R is lsc and convex, then R** = R, which gives the other implication.

Moreau decomposition.

    w = prox_R(w) + prox_{R*}(w).

It follows from the properties of the subdifferential and of the conjugate stated above:

    u = prox_R(w)  ⟺  w − u ∈ ∂R(u)  ⟺  u ∈ ∂R*(w − u)  ⟺  w − (w − u) ∈ ∂R*(w − u)  ⟺  w − u = prox_{R*}(w).

Note that this is a generalization of the classical decomposition into orthogonal components. If V is a linear subspace and V⊥ is its orthogonal complement, we know that w = P_V(w) + P_{V⊥}(w). This is a special case of the Moreau decomposition obtained by choosing R = i_V (and noting that R* = i_{V⊥}).

Properties of the proximity operator.

Separable sum: if R(w) = R₁(w₁) + R₂(w₂), then prox_R(w) = (prox_{R₁}(w₁), prox_{R₂}(w₂)).

Scaling: prox_{R + (μ/2)‖·‖²}(v) = prox_{(1/(1+μ)) R}( v/(1+μ) ).
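The Moreau decomposition can be checked numerically for R = λ‖·‖₁, whose conjugate is the indicator of the λ-scaled ℓ∞ ball, so that prox_{R*} is a coordinatewise clip. A minimal sketch:

```python
import numpy as np

lam = 0.7
# prox of R = lam * ||.||_1: soft-thresholding
prox_l1 = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
# prox of R* = indicator of {||.||_inf <= lam}: projection, i.e. clipping
proj_linf = lambda v: np.clip(v, -lam, lam)

rng = np.random.default_rng(0)
w = 3.0 * rng.standard_normal(8)

# Moreau decomposition: w = prox_R(w) + prox_{R*}(w)
assert np.allclose(prox_l1(w) + proj_linf(w), w)
```

The identity holds exactly here because, for each coordinate, either the soft-threshold is zero (and the clip returns w_i) or the clip saturates at ±λ (and the soft-threshold returns the remainder).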
Generalized Moreau decomposition: for every λ > 0,

    w = prox_{λR}(w) + λ prox_{R*/λ}(w/λ).

Sometimes the Moreau decomposition is useful to compute proximity operators. Let R = λ‖·‖. We have seen that (‖·‖)* = i_B, where B is the dual norm ball; therefore, from the Moreau decomposition, we get

    prox_R(w) = w − P_{λB}(w).

In particular, if R = λ‖·‖₁, noting that the dual of the ℓ1 norm is the ℓ∞ norm, we obtain again the soft-thresholding formula seen before.

Elastic net: for the elastic net penalty R = λ‖·‖₁ + (μ/2)‖·‖², the scaling property above reduces the computation of the prox to soft-thresholding.

Group lasso. Let G = {G₁, ..., G_t} be a partition of the indices {1, ..., d}. The following norm is called the group lasso penalty:

    R(w) = Σ_{i=1}^t ‖w_{G_i}‖,   where ‖w_{G_i}‖² = Σ_{j ∈ G_i} w_j².

The dual norm is max_{j=1,...,t} ‖w_{G_j}‖, and therefore

    prox_R(w) = w − P_B(w),   where B = {w ∈ ℝ^d : ‖w_{G_j}‖ ≤ 1, j = 1, ..., t}.

The projection onto this set can be expressed groupwise as

    (P_B(w))_{G_j} = { w_{G_j}                if ‖w_{G_j}‖ ≤ 1
                       w_{G_j} / ‖w_{G_j}‖   otherwise

6 Computation of the proximity operator of matrix functions

Let W ∈ ℝ^{D×T} and let σ : ℝ^{D×T} → ℝ^q, where q = min{D, T} and σ₁(W) ≥ σ₂(W) ≥ ... ≥ σ_q(W) ≥ 0 are the singular values of the matrix W. We consider regularization terms of the form R(W) = g(σ(W)), where g is absolutely symmetric, namely g(α) = g(α̂), where α̂ is the vector obtained by ordering the absolute values of the components of α decreasingly. We will show that the computation of the proximity operator of R can be reduced to the computation of that of the function g. Indeed, if W = U diag(σ(W)) Vᵀ (with U ∈ ℝ^{D×D}, diag(σ(W)) ∈ ℝ^{D×T}, and V ∈ ℝ^{T×T}), it holds that

    prox_{λR}(W) = U diag( prox_{λg}(σ(W)) ) Vᵀ.

The proof of this statement is based on two results.

Theorem 6.1 (von Neumann, 1937). For any D×T matrices Z and W,

    max{ ⟨UZVᵀ, W⟩ : U and V orthogonal } = ⟨σ(Z), σ(W)⟩,

and hence ⟨Z, W⟩ ≤ ⟨σ(Z), σ(W)⟩. Equality holds if and only if there exists a simultaneous SVD of Z and W.
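Returning to the group lasso penalty above: the prox of λR is blockwise shrinkage, and the identity prox_{λR}(w) = w − P_{λB}(w) can be checked directly. A minimal sketch; the groups and data are illustrative.

```python
import numpy as np

def prox_group_lasso(w, groups, lam):
    """Blockwise shrinkage: prox of lam * sum_j ||w_{G_j}||_2."""
    u = w.copy()
    for G in groups:
        n = np.linalg.norm(w[G])
        u[G] = 0.0 if n <= lam else (1 - lam / n) * w[G]
    return u

def proj_dual_ball(w, groups, lam):
    """Projection onto lam*B = {v : ||v_{G_j}||_2 <= lam for all j}."""
    v = w.copy()
    for G in groups:
        n = np.linalg.norm(w[G])
        if n > lam:
            v[G] = lam * w[G] / n
    return v

groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
rng = np.random.default_rng(0)
w = 2.0 * rng.standard_normal(6)
lam = 1.0

# Moreau decomposition: prox_{lam R}(w) = w - P_{lam B}(w)
assert np.allclose(prox_group_lasso(w, groups, lam),
                   w - proj_dual_ball(w, groups, lam))
```

Per block, either ‖w_G‖ ≤ λ (shrinkage gives 0, projection returns w_G) or ‖w_G‖ > λ (shrinkage gives (1 − λ/‖w_G‖)w_G, projection returns λw_G/‖w_G‖), and the two sides agree exactly.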
Theorem 6.2 (Conjugacy formula). If g : ℝ^q → ]−∞, +∞] is absolutely symmetric, then

    (g ∘ σ)* = g* ∘ σ.

Proof. Let Z ∈ ℝ^{D×T}. Then

    (g ∘ σ)*(Z) = sup_{W ∈ ℝ^{D×T}} { ⟨Z, W⟩ − g(σ(W)) }
                = sup_{U, V orthogonal, A ∈ ℝ^{D×T}} { ⟨Z, UAVᵀ⟩ − g(σ(A)) }
                = sup_{A ∈ ℝ^{D×T}} { ⟨σ(Z), σ(A)⟩ − g(σ(A)) }
                = sup_{α ∈ ℝ^q} { ⟨α, σ(Z)⟩ − g(α) }
                = g*(σ(Z)),

where the third equality follows from von Neumann's theorem and the fourth from the absolute symmetry of g.

Using the two previous results it is easy to show a formula for the subdifferential of the function R.

Theorem 6.3. Let g : ℝ^q → ]−∞, +∞] be absolutely symmetric. Then

    ∂(g ∘ σ)(W) = { U diag(α) Vᵀ : α ∈ ∂g(σ(W)), W = U diag(σ(W)) Vᵀ, U and V orthogonal }.

Proof. By definition of the Fenchel conjugate and using the previous theorem,

    ∂(g ∘ σ)(W) = { Z : (g ∘ σ)(W) + (g ∘ σ)*(Z) = ⟨W, Z⟩ } = { Z : (g ∘ σ)(W) + (g* ∘ σ)(Z) = ⟨W, Z⟩ }.

Now, by the Fenchel-Young inequality and von Neumann's theorem,

    (g ∘ σ)(W) + (g* ∘ σ)(Z) ≥ ⟨σ(W), σ(Z)⟩ ≥ ⟨W, Z⟩.

Let Z ∈ ∂(g ∘ σ)(W). Then we must have ⟨σ(W), σ(Z)⟩ = ⟨W, Z⟩, and hence, by von Neumann's theorem, Z and W have a simultaneous singular value decomposition. Moreover,

    g(σ(W)) + g*(σ(Z)) = ⟨σ(W), σ(Z)⟩,

and therefore σ(Z) ∈ ∂g(σ(W)). So we proved one inclusion. Let us prove the other one. Take α ∈ ∂g(σ(W)) and define Z = U diag(α) Vᵀ, where U and V are orthogonal matrices such that W = U diag(σ(W)) Vᵀ. Then W and Z have a simultaneous singular value decomposition, and hence

    (g ∘ σ)(W) + (g ∘ σ)*(Z) = g(σ(W)) + g*(α) = ⟨σ(W), α⟩ = ⟨σ(W), σ(Z)⟩ = ⟨W, Z⟩.

This implies that Z ∈ ∂(g ∘ σ)(W).

Theorem 6.4. Let R = g ∘ σ, with g : ℝ^q → ]−∞, +∞] absolutely symmetric, let W = U diag(σ(W)) Vᵀ with U and V orthogonal, and let λ > 0. Then

    prox_{λR}(W) = U diag( prox_{λg}(σ(W)) ) Vᵀ.

Proof. By definition, Z = prox_{λR}(W) is the unique minimizer of

    Z ↦ R(Z) + (1/2λ) ‖Z − W‖²
and is thus the unique point Z ∈ ℝ^{D×T} satisfying

    (1/λ)(W − Z) ∈ ∂R(Z).

Now, let Z̄ = U diag(prox_{λg}(σ(W))) Vᵀ, and let α = prox_{λg}(σ(W)). By definition of prox_{λg}, we have

    (1/λ)(σ(W) − α) ∈ ∂g(α).

Using the characterization of the subdifferential we found in Theorem 6.3, we get

    U diag( (1/λ)(σ(W) − α) ) Vᵀ ∈ ∂(g ∘ σ)(Z̄),

and the conclusion follows by noting that Z = Z̄, since the left hand side coincides with (W − Z̄)/λ.

Using these general results it is easy to compute the proximity operator of the nuclear norm (a.k.a. trace class norm, Schatten 1-norm):

    R(W) = ‖W‖₁ = Σ_{i=1}^q σ_i(W).

The function ‖·‖₁ is the composition of the absolutely symmetric function g : ℝ^q → ℝ, g(α) = Σ_{i=1}^q |α_i|, with σ. The computation of the prox of R is thus reduced to the computation of the prox of the ℓ1 norm in ℝ^q. We already know that prox_{λg}(α) = S_λ(α), the soft-thresholding operator, and finally we get

    prox_{λR}(W) = U diag( S_λ(σ(W)) ) Vᵀ,   where W = U diag(σ(W)) Vᵀ.

7 References

[A] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, 2009.
[B] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling and Simulation, 2005.
[C] A. S. Lewis, The convex analysis of unitarily invariant matrix functions, 1995.
[D] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course.
[E] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences.
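As a final worked example of Theorem 6.4, the nuclear norm prox (singular value soft-thresholding) can be sketched as follows; the code and data are illustrative, not part of the notes.

```python
import numpy as np

def prox_nuclear(W, lam):
    """Prox of lam * ||.||_* : soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
Z = prox_nuclear(W, 0.5)

# the singular values of the result are the soft-thresholded ones
s_W = np.linalg.svd(W, compute_uv=False)
s_Z = np.linalg.svd(Z, compute_uv=False)
assert np.allclose(np.sort(s_Z), np.sort(np.maximum(s_W - 0.5, 0.0)), atol=1e-10)
```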
More informationDedicated to Michel Théra in honor of his 70th birthday
VARIATIONAL GEOMETRIC APPROACH TO GENERALIZED DIFFERENTIAL AND CONJUGATE CALCULI IN CONVEX ANALYSIS B. S. MORDUKHOVICH 1, N. M. NAM 2, R. B. RECTOR 3 and T. TRAN 4. Dedicated to Michel Théra in honor of
More informationLecture 15 Newton Method and Self-Concordance. October 23, 2008
Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications
More information9. Dual decomposition and dual algorithms
EE 546, Univ of Washington, Spring 2016 9. Dual decomposition and dual algorithms dual gradient ascent example: network rate control dual decomposition and the proximal gradient method examples with simple
More informationMath 273a: Optimization Convex Conjugacy
Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper
More informationConvergence of Fixed-Point Iterations
Convergence of Fixed-Point Iterations Instructor: Wotao Yin (UCLA Math) July 2016 1 / 30 Why study fixed-point iterations? Abstract many existing algorithms in optimization, numerical linear algebra, and
More informationIterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem
Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Charles Byrne (Charles Byrne@uml.edu) http://faculty.uml.edu/cbyrne/cbyrne.html Department of Mathematical Sciences
More informationProximal Methods for Optimization with Spasity-inducing Norms
Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationA user s guide to Lojasiewicz/KL inequalities
Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationBASICS OF CONVEX ANALYSIS
BASICS OF CONVEX ANALYSIS MARKUS GRASMAIR 1. Main Definitions We start with providing the central definitions of convex functions and convex sets. Definition 1. A function f : R n R + } is called convex,
More informationVariable Metric Forward-Backward Algorithm
Variable Metric Forward-Backward Algorithm 1/37 Variable Metric Forward-Backward Algorithm for minimizing the sum of a differentiable function and a convex function E. Chouzenoux in collaboration with
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationSplitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches
Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More information8 Numerical methods for unconstrained problems
8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields
More informationSparse Regularization via Convex Analysis
Sparse Regularization via Convex Analysis Ivan Selesnick Electrical and Computer Engineering Tandon School of Engineering New York University Brooklyn, New York, USA 29 / 66 Convex or non-convex: Which
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationSelf-dual Smooth Approximations of Convex Functions via the Proximal Average
Chapter Self-dual Smooth Approximations of Convex Functions via the Proximal Average Heinz H. Bauschke, Sarah M. Moffat, and Xianfu Wang Abstract The proximal average of two convex functions has proven
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationarxiv: v1 [cs.lg] 27 Dec 2015
New Perspectives on k-support and Cluster Norms New Perspectives on k-support and Cluster Norms arxiv:151.0804v1 [cs.lg] 7 Dec 015 Andrew M. McDonald Department of Computer Science University College London
More informationA Proximal Method for Identifying Active Manifolds
A Proximal Method for Identifying Active Manifolds W.L. Hare April 18, 2006 Abstract The minimization of an objective function over a constraint set can often be simplified if the active manifold of the
More informationConvex Optimization Notes
Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =
More informationRadial Subgradient Descent
Radial Subgradient Descent Benja Grimmer Abstract We present a subgradient method for imizing non-smooth, non-lipschitz convex optimization problems. The only structure assumed is that a strictly feasible
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationIn applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem
1 Conve Analsis Main references: Vandenberghe UCLA): EECS236C - Optimiation methods for large scale sstems, http://www.seas.ucla.edu/ vandenbe/ee236c.html Parikh and Bod, Proimal algorithms, slides and
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationRegML 2018 Class 2 Tikhonov regularization and kernels
RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,
More informationON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD
ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD TIM HOHEISEL, MAXIME LABORDE, AND ADAM OBERMAN Abstract. In this article we study the connection
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS
ALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS Mau Nam Nguyen (joint work with D. Giles and R. B. Rector) Fariborz Maseeh Department of Mathematics and Statistics Portland State
More informationConvergence rate estimates for the gradient differential inclusion
Convergence rate estimates for the gradient differential inclusion Osman Güler November 23 Abstract Let f : H R { } be a proper, lower semi continuous, convex function in a Hilbert space H. The gradient
More informationEpiconvergence and ε-subgradients of Convex Functions
Journal of Convex Analysis Volume 1 (1994), No.1, 87 100 Epiconvergence and ε-subgradients of Convex Functions Andrei Verona Department of Mathematics, California State University Los Angeles, CA 90032,
More informationAdaptive discretization and first-order methods for nonsmooth inverse problems for PDEs
Adaptive discretization and first-order methods for nonsmooth inverse problems for PDEs Christian Clason Faculty of Mathematics, Universität Duisburg-Essen joint work with Barbara Kaltenbacher, Tuomo Valkonen,
More informationConvex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version
Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com
More information