Mathematics of Data: From Theory to Computation

Mathematics of Data: From Theory to Computation
Prof. Volkan Cevher, volkan.cevher@epfl.ch
Lecture 8: Composite convex minimization I
Laboratory for Information and Inference Systems (LIONS), École Polytechnique Fédérale de Lausanne (EPFL)
EE-556 (Fall 2015)

License Information for Mathematics of Data Slides
This work is released under a Creative Commons License with the following terms:
Attribution: The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
Non-Commercial: The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes unless they get the licensor's permission.
Share Alike: The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor's work.
Full Text of the License

Outline
Today:
1. Composite convex minimization
2. Proximal operator and computational complexity
3. Proximal gradient methods
Next week:
1. Proximal Newton-type methods
2. Composite self-concordant minimization

Recommended reading material
- A. Beck and M. Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sciences, 2(1):183-202, 2009.
- Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., 103(1):127-152, 2005.
- Q. Tran-Dinh, A. Kyrillidis and V. Cevher, "Composite Self-Concordant Minimization," LIONS-EPFL Tech. Report, http://arxiv.org/abs/1308.2867, 2013.
- N. Parikh and S. Boyd, "Proximal Algorithms," Foundations and Trends in Optimization, 1(3):123-231, 2014.

Motivation
Data analytics problems in various disciplines can often be reduced to nonsmooth composite convex minimization problems. This lecture provides efficient numerical solution methods for such problems. Intriguingly, composite minimization problems are far from generic nonsmooth problems: we can exploit the individual structure of each function to obtain numerical solutions nearly as efficiently as if the problems were smooth.

Composite convex minimization
Problem (Unconstrained composite convex minimization)
F* := min_{x ∈ R^p} {F(x) := f(x) + g(x)}  (1)
- f and g are both proper, closed, and convex.
- dom(F) := dom(f) ∩ dom(g) ≠ ∅ and -∞ < F* < +∞.
- The solution set S* := {x* ∈ dom(F) : F(x*) = F*} is nonempty.
Two remarks
- Nonsmoothness: at least one of the two functions f and g is nonsmooth. General nonsmooth convex optimization methods (e.g., classical subgradient, level, or bundle methods) lack efficiency and numerical robustness: they require O(ε^{-2}) iterations to reach a point x_ε such that F(x_ε) - F* ≤ ε. Hence, to reach x_{0.01} such that F(x_{0.01}) - F* ≤ 0.01, we need O(10^4) iterations.
- Generality: the template covers a wider range of problems than smooth unconstrained minimization. E.g., in regularized M-estimation, f is a loss function (a data-fidelity or negative log-likelihood term), and g is a regularizer encouraging structure and/or constraints in the solution.

Example 1: Sparse regression in generalized linear models (GLMs)
Problem (Sparse regression in GLM)
Our goal is to estimate x♮ ∈ R^p given {b_i}_{i=1}^n and {a_i}_{i=1}^n, knowing that the likelihood of b_i given a_i and x♮ is L(b_i; a_i, x♮), and that x♮ is sparse.
[Figure: linear observation model b = A x♮ + w.]
Optimization formulation
min_{x ∈ R^p} { -Σ_{i=1}^n log L(b_i; a_i, x) + ρ_n ‖x‖_1 },  with f(x) := -Σ_{i=1}^n log L(b_i; a_i, x) and g(x) := ρ_n ‖x‖_1,
where ρ_n > 0 is a parameter which controls the strength of the sparsity regularization.
Theorem (cf. [4, 5, 6] for details)
Under some technical conditions, there exists a choice of {ρ_n} such that, with high probability,
‖x̂ - x♮‖_2^2 = O(s log p / n)  and  supp(x̂) = supp(x♮),
where s is the sparsity of x♮. Recall the maximum-likelihood estimator: ‖x̂_ML - x♮‖_2^2 = O(p/n).

Example 2: Image processing
Problem (Image denoising/deblurring)
Our goal is to obtain a clean image x given dirty observations b via b = A(x) + w, where A is a linear operator which, e.g., captures camera blur as well as image subsampling, and w models perturbations, such as Gaussian or Poisson noise.
Optimization formulation
Gaussian: min_x { (1/2)‖A(x) - b‖_2^2 + ρ‖x‖_TV },  with f(x) := (1/2)‖A(x) - b‖_2^2 and g(x) := ρ‖x‖_TV.
Poisson: min_x { (1/n) Σ_{i=1}^n [⟨a_i, x⟩ - b_i ln(⟨a_i, x⟩)] + ρ‖x‖_TV },  with f(x) := (1/n) Σ_{i=1}^n [⟨a_i, x⟩ - b_i ln(⟨a_i, x⟩)] and g(x) := ρ‖x‖_TV,
where ρ > 0 is a regularization parameter and ‖·‖_TV is the total variation (TV) norm:
‖x‖_TV := Σ_{i,j} |x_{i,j+1} - x_{i,j}| + |x_{i+1,j} - x_{i,j}|  (anisotropic case),
‖x‖_TV := Σ_{i,j} ( |x_{i,j+1} - x_{i,j}|^2 + |x_{i+1,j} - x_{i,j}|^2 )^{1/2}  (isotropic case).
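As a concrete illustration of the TV regularizer above, here is a minimal sketch (ours, not from the slides) that evaluates the anisotropic and isotropic TV norms of an image stored as a 2-D numpy array, using the forward differences in the definition; the function name is our own.

```python
import numpy as np

def tv_norm(x, isotropic=False):
    """Total variation of a 2-D image x (forward differences, zero at the border)."""
    dh = np.diff(x, axis=1)  # horizontal differences x[i, j+1] - x[i, j]
    dv = np.diff(x, axis=0)  # vertical differences   x[i+1, j] - x[i, j]
    if isotropic:
        # pad so both difference fields have the same shape before combining
        dh = np.pad(dh, ((0, 0), (0, 1)))
        dv = np.pad(dv, ((0, 1), (0, 0)))
        return np.sqrt(dh**2 + dv**2).sum()
    return np.abs(dh).sum() + np.abs(dv).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.standard_normal((64, 64))
    print(tv_norm(img), tv_norm(img, isotropic=True))
```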

Example 3: Confocal microscopy with camera blur and Poisson observations
[Figure: original image x♮; observed noisy, blurred image b; denoised estimate x̂.]

Example 4: Sparse inverse covariance estimation
Problem (Graphical model selection)
Given a data set D := {x^1, ..., x^N}, where each x^i is a Gaussian random variable, let Σ be the covariance matrix corresponding to the graphical model of the Gaussian Markov random field. Our goal is to learn a sparse precision matrix Θ (i.e., the inverse covariance matrix Σ^{-1}) that captures the Markov random field structure.
[Figure: graphical model over x_1, ..., x_5 and the sparsity pattern of the corresponding precision matrix.]
Optimization formulation
min_{Θ ≻ 0} { tr(ΣΘ) - log det(Θ) + λ‖vec(Θ)‖_1 },  with f(Θ) := tr(ΣΘ) - log det(Θ) and g(Θ) := λ‖vec(Θ)‖_1,  (2)
where Θ ≻ 0 means that Θ is symmetric and positive definite, λ > 0 is a regularization parameter, and vec is the vectorization operator.
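To make the objective in (2) concrete, here is a small sketch (ours, not from the slides) that evaluates the composite objective for a given positive definite Θ; the function name and the toy data are our own assumptions.

```python
import numpy as np

def graphical_lasso_objective(Theta, Sigma, lam):
    """F(Theta) = tr(Sigma Theta) - log det(Theta) + lam * ||vec(Theta)||_1, Theta assumed symmetric PD."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    f = np.trace(Sigma @ Theta) - logdet
    g = lam * np.abs(Theta).sum()
    return f + g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))      # N = 200 samples of a 5-dimensional Gaussian
    Sigma = np.cov(X, rowvar=False)        # empirical covariance
    Theta = np.linalg.inv(Sigma + 0.1 * np.eye(5))
    print(graphical_lasso_objective(Theta, Sigma, lam=0.1))
```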

Outline
Today:
1. Composite convex minimization
2. Proximal operator and computational complexity
3. Proximal gradient methods
Next week:
1. Proximal Newton-type methods
2. Composite self-concordant minimization

Question: How do we design algorithms for finding a solution x*?
Philosophy
We cannot immediately design algorithms based only on the original formulation
F* := min_{x ∈ R^p} {F(x) := f(x) + g(x)}.  (1)
We need intermediate tools to characterize the solutions x* of this problem. One key tool is the optimality condition.

Optimality condition
Theorem (Moreau-Rockafellar theorem [8])
Let ∂f and ∂g be the subdifferentials of f and g, respectively. If f, g ∈ F(R^p) and dom(f) ∩ dom(g) ≠ ∅, then ∂F = ∂(f + g) = ∂f + ∂g.
Note: dom(F) = dom(f) ∩ dom(g), and ∂f(x) is defined as (cf. Lecture 2):
∂f(x) := {w ∈ R^p : f(y) - f(x) ≥ w^T(y - x), ∀y ∈ R^p}.
Optimality condition
Generally, the optimality condition for (1) can be written as
0 ∈ ∂F(x*) ≡ ∂f(x*) + ∂g(x*).  (3)
If f ∈ F_L^{1,1}(R^p), then (3) features the gradient of f as opposed to the subdifferential:
0 ∈ ∇f(x*) + ∂g(x*).  (4)

Necessary and sufficient condition
Lemma (Necessary and sufficient condition)
A point x* ∈ dom(F) is a globally optimal solution to (1) (i.e., F* := min_{x ∈ R^p}{F(x) := f(x) + g(x)}) if and only if x* satisfies (3): 0 ∈ ∂f(x*) + ∂g(x*) (or (4): 0 ∈ ∇f(x*) + ∂g(x*) when f ∈ F_L^{1,1}(R^p)).
Sketch of the proof.
(⇐) By definition of ∂F: F(x) - F(x*) ≥ ξ^T(x - x*) for any ξ ∈ ∂F(x*) and x ∈ R^p. If (3) (or (4)) is satisfied, we may take ξ = 0, so F(x) - F(x*) ≥ 0, i.e., x* is a global solution to (1).
(⇒) If x* is a global solution of (1), then F(x) ≥ F(x*) for all x ∈ dom(F), i.e., F(x) - F(x*) ≥ 0^T(x - x*) for all x ∈ R^p. This means 0 ∈ ∂F(x*), i.e., (3) (or (4)) holds.

A short detour: proximal-point operators
Definition (Proximal operator [9])
Let g ∈ F(R^p) and x ∈ R^p. The proximal operator (or prox-operator) of g is defined as:
prox_g(x) ∈ arg min_{y ∈ R^p} {g(y) + (1/2)‖y - x‖_2^2}.  (5)
Numerical efficiency: why do we need the proximal operator?
For problem (1):
- For many well-known convex functions g, we can compute prox_g(x) analytically or very efficiently.
- If f ∈ F_L^{1,1}(R^p) and prox_g(x) is cheap to compute, then solving (1) is as efficient as solving min_{x ∈ R^p} f(x) in terms of complexity.
- If prox_f(x) and prox_g(x) are both cheap to compute, then convex splitting for (1) is also efficient (cf. Lecture 8).

A short detour: basic properties of the prox-operator
Property (Basic properties of the prox-operator)
1. prox_g(x) is well-defined and single-valued (i.e., the prox-operator (5) has a unique solution since g(·) + (1/2)‖· - x‖_2^2 is strongly convex).
2. Optimality condition: x ∈ prox_g(x) + ∂g(prox_g(x)), ∀x ∈ R^p.
3. x* is a fixed point of prox_g(·): 0 ∈ ∂g(x*) ⟺ x* = prox_g(x*).
4. Nonexpansiveness: ‖prox_g(x) - prox_g(x̃)‖_2 ≤ ‖x - x̃‖_2, ∀x, x̃ ∈ R^p.

Fixed-point characterization
Optimality condition as a fixed-point formulation
The optimality condition (3): 0 ∈ ∂f(x*) + ∂g(x*) is equivalent to
x* ∈ prox_{λg}(x* - λ∂f(x*)) := T_λ(x*), for any λ > 0.  (6)
The optimality condition (4): 0 ∈ ∇f(x*) + ∂g(x*) is equivalent to
x* = prox_{λg}(x* - λ∇f(x*)) := U_λ(x*), for any λ > 0.  (7)
T_λ is a set-valued operator and U_λ is a single-valued operator.
Proof.
We prove (7) ((6) is done similarly). (4) implies 0 ∈ λ∇f(x*) + λ∂g(x*), i.e., x* - λ∇f(x*) ∈ x* + λ∂g(x*) = (I + λ∂g)(x*). Using basic property 2 of prox_{λg}, we have x* ∈ prox_{λg}(x* - λ∇f(x*)). Since prox_{λg} and ∇f are single-valued, we obtain (7).

Choices of solution methods
F* = min_{x ∈ R^p} {F(x) := f(x) + g(x)}
- f ∈ F_L^{1,1}(R^p), g ∈ F_prox(R^p): [fast] proximal gradient methods.
- f ∈ F_2(dom(f)), g ∈ F_prox(R^p): proximal gradient/Newton methods.
- f smoothable, g ∈ F_prox(R^p): smoothing techniques.
- f ∈ F_prox(R^p), g ∈ F_prox(R^p): splitting methods/ADMM.
Here F_L^{1,1} and F_2 are the classes of convex functions with Lipschitz gradient and of self-concordant convex functions, respectively. F_prox is the class of convex functions with tractable proximity operator (defined in the next slides). "Smoothable" is defined in the next lectures.

Solution methods
Composite convex minimization
F* := min_{x ∈ R^p} {F(x) := f(x) + g(x)}.  (1)
Choice of numerical solution methods
Solve (1) = find x^k ∈ R^p such that F(x^k) - F* ≤ ε for a given tolerance ε > 0.
Oracles: we can use one of the following configurations (oracles):
1. ∂f(·) and ∂g(·) at any point x ∈ R^p.
2. ∇f(·) and prox_{λg}(·) at any point x ∈ R^p.
3. prox_{λf}(·) and prox_{λg}(·) at any point x ∈ R^p.
4. ∇f(·), the inverse of ∇²f(·), and prox_{λg}(·) at any point x ∈ R^p.
Using different oracles leads to different types of algorithms.

Tractable prox-operators
Processing the non-smooth term in (1)
We handle the nonsmooth term g in (1) using the proximal mapping principle. Computing the proximal operator prox_g of a general convex function g,
prox_g(x) ∈ arg min_{y ∈ R^p} {g(y) + (1/2)‖y - x‖_2^2},
can be computationally demanding. If we could efficiently compute prox_F, we could use the proximal-point algorithm (PPA) [3, 9] to solve (1). Unfortunately, PPA is hardly used in practice, since computing prox_{λF}(·) can be almost as hard as solving (1) itself.
Definition (Tractable proximity)
Given g ∈ F(R^p), we say that g is proximally tractable if prox_g defined by (5) can be computed efficiently. "Efficiently" = {closed-form solution, low-cost computation, polynomial time}. We denote by F_prox(R^p) the class of proximally tractable convex functions.

The proximal-point method
Problem (Unconstrained convex minimization)
Given F ∈ F(R^p), our goal is to solve
F* := min_{x ∈ R^p} F(x).  (8)
Proximal-point algorithm (PPA):
1. Choose x^0 ∈ R^p and a positive sequence {λ_k}_{k≥0} ⊂ R_{++}.
2. For k = 0, 1, ..., update: x^{k+1} := prox_{λ_k F}(x^k).
Theorem (Convergence [3])
Let {x^k}_{k≥0} be the sequence generated by PPA. If 0 < λ_k < +∞, then
F(x^k) - F* ≤ ‖x^0 - x*‖_2^2 / (2 Σ_{j=0}^{k} λ_j), ∀x* ∈ S*, k ≥ 0.
If λ_k ≡ λ > 0, then the convergence rate of PPA is O(1/k).
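To see PPA in action on a case where prox_{λF} is available in closed form, here is a tiny sketch (ours, not from the slides) applying PPA to F(x) = (1/2)‖Ax - b‖_2^2, whose prox is the linear solve discussed later in this lecture; the function name and data are our own.

```python
import numpy as np

def ppa_least_squares(A, b, lam=1.0, iters=50):
    """Proximal-point algorithm for F(x) = 0.5*||Ax - b||^2, using the closed-form prox of F."""
    p = A.shape[1]
    M = np.eye(p) + lam * A.T @ A                    # (I + lam A^T A), reused every iteration
    x = np.zeros(p)
    for _ in range(iters):
        x = np.linalg.solve(M, x + lam * A.T @ b)    # x^{k+1} = prox_{lam F}(x^k)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((40, 10)), rng.standard_normal(40)
    x_ppa = ppa_least_squares(A, b)
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(np.linalg.norm(x_ppa - x_ls))              # should be tiny
```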

Tractable prox-operators: examples
- For separable functions, the prox-operator can be efficient. For instance, for g(x) := ‖x‖_1 = Σ_{i=1}^p |x_i|, we have prox_{λg}(x) = sign(x) max{|x| - λ, 0} (soft-thresholding, applied elementwise).
- For convex quadratic functions, we can compute the prox-operator via basic linear algebra. For instance, for g(x) := (1/2)‖Ax - b‖_2^2, one has prox_{λg}(x) = (I + λA^T A)^{-1}(x + λA^T b).
- For the indicator function of a simple set, e.g., g(x) := δ_X(x), the prox-operator is the projection operator: prox_{λg}(x) := π_X(x), the projection of x onto X. For instance, when X = {x : ‖x‖_1 ≤ λ}, the projection can be obtained efficiently.
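A minimal sketch (ours, not from the slides) of the soft-thresholding operator prox_{λ‖·‖_1}, together with a brute-force check on a random instance that it beats nearby perturbations in the prox objective g(y) + (1/2)‖y - x‖_2^2; the function names are our own.

```python
import numpy as np

def soft_threshold(x, lam):
    """prox of lam * ||.||_1: sign(x) * max(|x| - lam, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_objective(y, x, lam):
    return lam * np.abs(y).sum() + 0.5 * np.sum((y - x) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, lam = rng.standard_normal(10), 0.5
    p = soft_threshold(x, lam)
    # the prox point should beat random nearby perturbations
    trials = [prox_objective(p + 1e-3 * rng.standard_normal(10), x, lam) for _ in range(100)]
    assert prox_objective(p, x, lam) <= min(trials)
    print(p)
```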

Computational efficiency - Example
Proximal operator of a quadratic function
The proximal operator of the quadratic function g(x) := (1/2)‖Ax - b‖_2^2 is defined as
prox_{λg}(x) := arg min_{y ∈ R^p} {(1/2)‖Ay - b‖_2^2 + (1/(2λ))‖y - x‖_2^2}.  (9)
How do we compute prox_{λg}(x)?
The optimality condition implies that the solution of (9) satisfies the linear system
A^T(Ay - b) + λ^{-1}(y - x) = 0.
As a result, we obtain prox_{λg}(x) = (I + λA^T A)^{-1}(x + λA^T b).
When A^T A is efficiently diagonalizable (e.g., U^T A^T A U = Λ, where U is a unitary matrix and Λ is a diagonal matrix), then prox_{λg}(x) can be cheap:
prox_{λg}(x) = U(I + λΛ)^{-1} U^T (x + λA^T b).
Matrices A such as TV operators with periodic boundary conditions, index subsampling operators (e.g., as in matrix completion), and circulant matrices (e.g., typical image blur operators) are efficiently diagonalizable with the fast Fourier transform U.
If AA^T is diagonalizable (e.g., A is a tight frame), then we can use the identity (I + λA^T A)^{-1} = I - A^T(λ^{-1}I + AA^T)^{-1}A.
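Here is a small numerical sketch (ours, assuming a dense A) of the closed-form prox of g(x) = (1/2)‖Ax - b‖_2^2 via a direct linear solve, with a check that the result satisfies the optimality condition of (9).

```python
import numpy as np

def prox_quadratic(x, A, b, lam):
    """prox_{lam*g}(x) for g(y) = 0.5*||Ay - b||^2, via (I + lam A^T A) y = x + lam A^T b."""
    p = x.shape[0]
    return np.linalg.solve(np.eye(p) + lam * A.T @ A, x + lam * A.T @ b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
    x, lam = rng.standard_normal(10), 0.7
    y = prox_quadratic(x, A, b, lam)
    # optimality condition of (9): A^T (A y - b) + (y - x) / lam = 0
    residual = A.T @ (A @ y - b) + (y - x) / lam
    print(np.linalg.norm(residual))  # ~ 1e-14
```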

A non-exhaustive list of proximally tractable functions
- ℓ1-norm: f(x) := ‖x‖_1; prox_{λf}(x) = sign(x) [|x| - λ]_+; complexity O(p).
- ℓ2-norm: f(x) := ‖x‖_2; prox_{λf}(x) = [1 - λ/‖x‖_2]_+ x; complexity O(p).
- Support function: f(x) := max_{y ∈ C} x^T y; prox_{λf}(x) = x - λ π_C(x/λ).
- Box indicator: f(x) := δ_{[a,b]}(x); prox_{λf}(x) = π_{[a,b]}(x); complexity O(p).
- Positive semidefinite cone indicator: f(X) := δ_{S_+^p}(X); prox_{λf}(X) = U[Σ]_+ U^T, where X = UΣU^T; complexity O(p^3).
- Hyperplane indicator: f(x) := δ_X(x), X := {x : a^T x = b}; prox_{λf}(x) = π_X(x) = x + ((b - a^T x)/‖a‖_2^2) a; complexity O(p).
- Simplex indicator: f(x) := δ_X(x), X := {x : x ≥ 0, 1^T x = 1}; prox_{λf}(x) = (x - ν1)_+ for some ν ∈ R, which can be efficiently calculated; complexity Õ(p).
- Convex quadratic: f(x) := (1/2)x^T Qx + q^T x; prox_{λf}(x) = (I + λQ)^{-1}(x - λq); complexity O(p log p) to O(p^3).
- Square ℓ2-norm: f(x) := (1/2)‖x‖_2^2; prox_{λf}(x) = (1/(1 + λ))x; complexity O(p).
- log-function: f(x) := -log(x); prox_{λf}(x) = ((x^2 + 4λ)^{1/2} + x)/2; complexity O(1).
- log det-function: f(X) := -log det(X); prox_{λf}(X) is the log-function prox applied to the individual eigenvalues of X; complexity O(p^3).
Here: [x]_+ := max{0, x}, δ_X is the indicator function of the convex set X, sign is the sign function, and S_+^p is the cone of symmetric positive semidefinite matrices. For more functions, see [2, 7].
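As one illustration of the table, here is a short sketch (ours) of the Euclidean projection onto the unit simplex, i.e., the prox of the simplex indicator; the sort-based routine for finding the shift ν is one standard way and is our own choice, not taken from the slides.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto {y : y >= 0, sum(y) = 1}, i.e. (x - nu)_+ for the right shift nu."""
    u = np.sort(x)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the simplex budget
    rho = np.nonzero(u - css / np.arange(1, len(x) + 1) > 0)[0][-1]
    nu = css[rho] / (rho + 1.0)
    return np.maximum(x - nu, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = project_simplex(rng.standard_normal(8))
    print(y, y.sum())                         # nonnegative entries summing to 1
```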

Outline
Today:
1. Composite convex minimization
2. Proximal operator and computational complexity
3. Proximal gradient methods
Next week:
1. Proximal Newton-type methods
2. Composite self-concordant minimization

Proximal-gradient method: a quadratic majorization perspective
Definition (Moreau proximal operator [9])
Let g ∈ F(R^p). The proximal operator (or prox-operator) of g is defined as:
prox_g(x) ∈ arg min_{y ∈ R^p} {g(y) + (1/2)‖y - x‖_2^2}.
Quadratic upper bound for f
For f ∈ F_L^{1,1}(R^p), we have, ∀x, y ∈ R^p,
f(x) ≤ f(y) + ∇f(y)^T(x - y) + (L/2)‖x - y‖_2^2 =: Q_L(x, y).
Quadratic majorizer for f + g
Of course, ∀x, y ∈ R^p, f(x) ≤ Q_L(x, y) implies f(x) + g(x) ≤ Q_L(x, y) + g(x) =: P_L(x, y).
Proximal-gradient from the majorize-minimize perspective
x^{k+1} = arg min_x P_L(x, x^k) = prox_{g/L}(x^k - ∇f(x^k)/L).

Geometric illustration
[Figure: the composite objective F(x) = f(x) + g(x), its quadratic majorizer P_L(x, x^k) := f(x^k) + ∇f(x^k)^T(x - x^k) + (L/2)‖x - x^k‖_2^2 + g(x), and the lower model f(x^k) + ∇f(x^k)^T(x - x^k) + g(x); minimizing the majorizer at x^k gives the next iterate x^{k+1}, which moves toward x*.]

Proximal-gradient algorithm
Basic proximal-gradient scheme (ISTA)
1. Choose x^0 ∈ dom(F) arbitrarily as a starting point.
2. For k = 0, 1, ..., generate a sequence {x^k}_{k≥0} as:
x^{k+1} := prox_{αg}(x^k - α∇f(x^k)), where α := 1/L.
Theorem (Convergence of ISTA [1])
Let {x^k} be generated by ISTA. Then:
F(x^k) - F* ≤ L_f ‖x^0 - x*‖_2^2 / (2(k + 1)).
The worst-case complexity to reach F(x^k) - F* ≤ ε with ISTA is O(L_f R_0^2 / ε), where R_0 := max_{x* ∈ S*} ‖x^0 - x*‖_2.
A line-search procedure can be used to estimate a local constant L_k in place of L, based on the condition (with 0 < c ≤ 1):
f(x^{k+1}) ≤ f(x^k) - (c/(2L_k))‖∇f(x^k)‖_2^2.
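To make the scheme concrete, here is a compact sketch (ours, not from the slides) of ISTA applied to the ℓ1-regularized least-squares problem of Example 1, with step size α = 1/L and L = ‖A^T A‖; names and data are our own.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, iters=500):
    """Basic proximal-gradient scheme for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)           # Lipschitz constant of the gradient of f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                            # gradient of f at x
        x = soft_threshold(x - alpha * grad, alpha * lam)   # prox of alpha*g
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))
    x_true = np.zeros(50); x_true[:5] = 1.0
    b = A @ x_true + 1e-3 * rng.standard_normal(100)
    print(np.round(ista(A, b, lam=0.1)[:10], 2))
```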

Fast proximal-gradient algorithm
Fast proximal-gradient scheme (FISTA)
1. Choose x^0 ∈ dom(F) arbitrarily as a starting point.
2. Set y^0 := x^0 and t_0 := 1.
3. For k = 0, 1, ..., generate two sequences {x^k}_{k≥0} and {y^k}_{k≥0} as:
x^{k+1} := prox_{αg}(y^k - α∇f(y^k)),
t_{k+1} := (1 + √(1 + 4t_k^2))/2,
y^{k+1} := x^{k+1} + ((t_k - 1)/t_{k+1})(x^{k+1} - x^k),
where α := 1/L.
From O(L_f R_0^2 / ε) to O(R_0 √(L_f / ε)) iterations at almost no additional cost!
Complexity per iteration
- One gradient ∇f(y^k) and one prox-operator of g;
- 8 arithmetic operations for t_{k+1} and the momentum coefficient γ_{k+1};
- 2 more vector additions and one scalar-vector multiplication.
The cost per iteration is almost the same as in the gradient scheme if the proximal operator of g is efficient.
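A minimal sketch (ours, not from the slides) of the FISTA iteration above, written generically in terms of a gradient oracle and a prox oracle; the interface is our own choice.

```python
import numpy as np

def fista(grad_f, prox_g, L, x0, iters=500):
    """Fast proximal-gradient scheme (FISTA) for min f(x) + g(x), given grad_f, prox_g(v, step) and L."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    alpha = 1.0 / L
    for _ in range(iters):
        x_next = prox_g(y - alpha * grad_f(y), alpha)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, lam = rng.standard_normal((100, 50)), 0.1
    x_true = np.zeros(50); x_true[:5] = 1.0
    b = A @ x_true + 1e-3 * rng.standard_normal(100)
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_g = lambda v, step: np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    x_hat = fista(grad_f, prox_g, L=np.linalg.norm(A.T @ A, 2), x0=np.zeros(50))
    print(np.round(x_hat[:10], 2))
```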

Example 1: ℓ1-regularized least squares
Problem (ℓ1-regularized least squares)
Given A ∈ R^{n×p} and b ∈ R^n, solve:
F* := min_{x ∈ R^p} {F(x) := (1/2)‖Ax - b‖_2^2 + λ‖x‖_1},  (10)
where λ > 0 is a regularization parameter.
Complexity per iteration
- Evaluating ∇f(x^k) = A^T(Ax^k - b) requires one Ax and one A^T y multiplication.
- One soft-thresholding operation prox_{λg}(x) = sign(x) max{|x| - λ, 0}.
- Optional: evaluating L = ‖A^T A‖ (spectral norm) via power iterations (e.g., 20 iterations, each requiring one Ax and one A^T y multiplication).
Synthetic data generation
- A := randn(n, p): standard Gaussian N(0, I).
- x♮ is a k-sparse vector generated randomly.
- b := Ax♮ + N(0, 10^{-3}).
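The optional Lipschitz-constant estimate above can be obtained with a few power iterations on A^T A; here is a short sketch (ours) of that estimate, matching the stated cost of one Ax and one A^T y per iteration.

```python
import numpy as np

def power_iteration_norm(A, iters=20, seed=0):
    """Estimate ||A^T A||_2 (the Lipschitz constant of x -> A^T(Ax - b)) by power iterations."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)          # one Ax and one A^T y per iteration
        v = w / np.linalg.norm(w)
    return v @ (A.T @ (A @ v))     # Rayleigh quotient ~ largest eigenvalue of A^T A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 100))
    print(power_iteration_norm(A), np.linalg.norm(A.T @ A, 2))
```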

Example 1: theoretical bounds vs. practical performance
Theoretical bounds: FISTA: 2L_f R_0^2/(k+2)^2; ISTA: L_f R_0^2/(2(k+2)).
[Figure: F(x) - F* in log scale vs. number of iterations for ISTA and FISTA, compared with the theoretical bounds; both methods converge much faster than their worst-case bounds. Inset: descent directions vs. restricted descent directions in the (x_1, x_2) plane.]
The ℓ1-regularized least-squares formulation has restricted strong convexity. The proximal-gradient method can automatically exploit this structure.

Adaptive restart
It is possible to preserve the O(1/k^2) convergence guarantee! One needs to slightly modify the algorithm as below.
Generalized fast proximal-gradient scheme
1. Choose x^0 = x^{-1} ∈ dom(F) arbitrarily as a starting point.
2. Set θ_0 = θ_{-1} = 1.
3. For k = 0, 1, ..., generate two sequences {x^k}_{k≥0} and {y^k}_{k≥0} as:
y^k := x^k + θ_k(θ_{k-1}^{-1} - 1)(x^k - x^{k-1}),
x^{k+1} := prox_{λg}(y^k - λ∇f(y^k));
if the restart test holds, instead set  (11)
y^k = x^k,
x^{k+1} := prox_{λg}(y^k - λ∇f(y^k)),
where λ := 1/L_f and θ_k is chosen so that it satisfies θ_k = (√(θ_{k-1}^4 + 4θ_{k-1}^2) - θ_{k-1}^2)/2 < 2/(k + 3).

Adaptive restart: guarantee
Theorem (Global complexity [10])
The sequence {x^k}_{k≥0} generated by the modified algorithm satisfies
F(x^k) - F* ≤ (2L_f/(k + 2)^2) (R_0^2 + Σ_{k_i ≤ k} (‖x* - x^{k_i}‖_2^2 - ‖x* - z^{k_i}‖_2^2)), ∀k ≥ 0,  (12)
where R_0 := min_{x* ∈ S*} ‖x^0 - x*‖, z^k = x^{k-1} + θ_{k-1}^{-1}(x^k - x^{k-1}), and k_i, i = 1, ..., are the iterations at which the restart test holds.
Various restart tests that aim to ensure ‖x* - x^{k_i}‖_2^2 ≤ ‖x* - z^{k_i}‖_2^2:
- Exact non-monotonicity test: F(x^{k+1}) - F(x^k) > 0.
- Non-monotonicity test: ⟨L_f(y^{k-1} - x^k), x^{k+1} - (1/2)(x^k + y^{k-1})⟩ > 0 (implies the exact non-monotonicity test and is advantageous when function evaluations are expensive).
- Gradient-mapping-based test: ⟨L_f(y^k - x^{k+1}), x^{k+1} - x^k⟩ > 0.
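A minimal sketch (ours, not from the slides) of FISTA with the exact non-monotonicity restart test above: whenever the objective increases, the momentum is reset; the interface and data are our own.

```python
import numpy as np

def fista_restart(grad_f, prox_g, obj, L, x0, iters=500):
    """FISTA with an objective-based adaptive restart: reset momentum when F increases."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    alpha, F_prev = 1.0 / L, np.inf
    for _ in range(iters):
        x_next = prox_g(y - alpha * grad_f(y), alpha)
        if obj(x_next) > F_prev:               # exact non-monotonicity test
            y, t = x.copy(), 1.0               # restart: drop momentum
            x_next = prox_g(y - alpha * grad_f(y), alpha)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t, F_prev = x_next, t_next, obj(x_next)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, lam = rng.standard_normal((100, 50)), 0.1
    b = A @ (np.arange(50) < 5).astype(float) + 1e-3 * rng.standard_normal(100)
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_g = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0.0)
    obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
    print(obj(fista_restart(grad_f, prox_g, obj, np.linalg.norm(A.T @ A, 2), np.zeros(50))))
```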

Example 2: Sparse logistic regression
Problem (Sparse logistic regression)
Given A ∈ R^{n×p} and b ∈ {-1, +1}^n, solve:
F* := min_{x ∈ R^p, β ∈ R} {F(x, β) := (1/n) Σ_{j=1}^n log(1 + exp(-b_j(a_j^T x + β))) + ρ‖x‖_1}.
Real data
- w8a with n = 49,749 data points, p = 300 features.
- Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
Parameters
- ρ = 10^{-4}.
- Number of iterations 5000, tolerance 10^{-7}.
- Ground truth: solve the problem up to 10^{-9} accuracy with TFOCS to get a high-accuracy approximation of x* and F*.
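For reference, a small sketch (ours, not from the slides) of the smooth part f and its gradient for this model, written in a numerically stable way; for brevity the intercept β is dropped, and the data here are synthetic.

```python
import numpy as np

def logistic_loss_grad(x, A, b):
    """f(x) = (1/n) * sum_j log(1 + exp(-b_j * a_j^T x)) and its gradient."""
    n = A.shape[0]
    z = -b * (A @ x)
    loss = np.mean(np.logaddexp(0.0, z))               # log(1 + exp(z)), numerically stable
    sigma = np.exp(z - np.logaddexp(0.0, z))            # derivative of log(1 + exp(z)) w.r.t. z
    grad = A.T @ (-b * sigma) / n
    return loss, grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 30))
    b = np.sign(A @ rng.standard_normal(30) + 0.1 * rng.standard_normal(200))
    print(logistic_loss_grad(np.zeros(30), A, b)[0])    # log(2) at x = 0
```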

Example 2: Sparse logistic regression - numerical results
[Figure: F(x) - F* in log scale vs. number of iterations (left) and vs. time in seconds (right) for ISTA, line-search ISTA, FISTA, FISTA with restart, line-search FISTA, and line-search FISTA with restart, together with the theoretical bounds of ISTA and FISTA.]

                            ISTA      LS-ISTA   FISTA    FISTA-R   LS-FISTA   LS-FISTA-R
Number of iterations        5000      5000      4046     2423      447        317
CPU time (s)                26.975    61.506    21.859   18.444    10.683     6.228
Solution error (x 10^-7)    29370     2.774     1.000    0.998     0.961      0.985

Strong convexity case: algorithms
Proximal-gradient scheme (ISTA-µ)
1. Given x^0 ∈ R^p as a starting point.
2. For k = 0, 1, ..., generate a sequence {x^k}_{k≥0} as:
x^{k+1} := prox_{α_k g}(x^k - α_k ∇f(x^k)),
where α_k := 2/(L_f + µ) is the optimal step size.
Fast proximal-gradient scheme (FISTA-µ)
1. Given x^0 ∈ R^p as a starting point. Set y^0 := x^0.
2. For k = 0, 1, ..., generate two sequences {x^k}_{k≥0} and {y^k}_{k≥0} as:
x^{k+1} := prox_{α_k g}(y^k - α_k ∇f(y^k)),
y^{k+1} := x^{k+1} + ((√c_f - 1)/(√c_f + 1))(x^{k+1} - x^k),
where α_k := 1/L_f is the optimal step size.

Strong convexity case: convergence
Assumption: f is strongly convex with parameter µ > 0, i.e., f ∈ F_{L,µ}^{1,1}(R^p). Condition number: c_f := L_f/µ ≥ 1.
Theorem (ISTA-µ [7])
F(x^k) - F* ≤ (L_f/2) ((c_f - 1)/(c_f + 1))^{2k} ‖x^0 - x*‖_2^2.
Convergence rate: linear, with contraction factor ω := ((c_f - 1)/(c_f + 1))^2 = ((L_f - µ)/(L_f + µ))^2.
Theorem (FISTA-µ [7])
F(x^k) - F* ≤ ((L_f + µ)/2) (1 - √(µ/L_f))^k ‖x^0 - x*‖_2^2.
Convergence rate: linear, with contraction factor ω_f = (√L_f - √µ)/√L_f < ω.

A practical issue
Stopping criterion
Fact: if PG_L(x*) = 0, then x* is optimal for (1), where
PG_L(x) = L(x - prox_{(1/L)g}(x - (1/L)∇f(x))).
Stopping criterion (relative solution change):
L_k ‖x^{k+1} - x^k‖_2 ≤ ε max{L_0 ‖x^1 - x^0‖_2, 1},
where ε is a given tolerance.
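A short sketch (ours, not from the slides) of the gradient-mapping quantity PG_L and of the relative-solution-change stopping test above, wired into a plain ISTA loop with a constant step; names and data are our own.

```python
import numpy as np

def gradient_mapping(x, grad_f, prox_g, L):
    """PG_L(x) = L * (x - prox_{g/L}(x - grad_f(x)/L)); zero exactly at solutions of (1)."""
    return L * (x - prox_g(x - grad_f(x) / L, 1.0 / L))

def ista_with_stopping(grad_f, prox_g, L, x0, eps=1e-6, max_iters=10000):
    x = x0.copy()
    x_next = prox_g(x - grad_f(x) / L, 1.0 / L)
    scale = max(L * np.linalg.norm(x_next - x), 1.0)         # L_0 * ||x^1 - x^0|| (here L_k = L)
    for _ in range(max_iters):
        x, x_next = x_next, prox_g(x_next - grad_f(x_next) / L, 1.0 / L)
        if L * np.linalg.norm(x_next - x) <= eps * scale:    # relative solution change
            break
    return x_next

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b, lam = rng.standard_normal((80, 40)), rng.standard_normal(80), 0.1
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_g = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0.0)
    L = np.linalg.norm(A.T @ A, 2)
    x_hat = ista_with_stopping(grad_f, prox_g, L, np.zeros(40))
    print(np.linalg.norm(gradient_mapping(x_hat, grad_f, prox_g, L)))
```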

Summary of the worst-case complexities
Software: TFOCS is a good software package to learn about first-order methods. http://cvxr.com/tfocs/
Comparison with the gradient scheme (to reach F(x^k) - F* ≤ ε):

Complexity               Proximal-gradient scheme           Fast proximal-gradient scheme
Complexity [µ = 0]       O(R_0^2 (L_f/ε))                   O(R_0 √(L_f/ε))
Per iteration            1 gradient, 1 prox, 1 sv, 1 v+     1 gradient, 1 prox, 2 sv, 3 v+
Complexity [µ > 0]       O(κ log(ε^{-1}))                   O(√κ log(ε^{-1}))
Per iteration            1 gradient, 1 prox, 1 sv, 1 v+     1 gradient, 1 prox, 1 sv, 2 v+

Here: sv = scalar-vector multiplication, v+ = vector addition, R_0 := max_{x* ∈ S*} ‖x^0 - x*‖, and κ = L_f/µ is the condition number.
Need alternatives when:
- f is only self-concordant;
- computing ∇f(x) is much costlier than computing prox_g.

Examples
Example (Sparse graphical model selection)
min_{Θ ≻ 0} { tr(ΣΘ) - log det(Θ) + ρ‖vec(Θ)‖_1 },  with f(Θ) := tr(ΣΘ) - log det(Θ) and g(Θ) := ρ‖vec(Θ)‖_1,
where Θ ≻ 0 means that Θ is symmetric and positive definite, ρ > 0 is a regularization parameter, and vec is the vectorization operator.
- Computing the gradient is expensive: ∇f(Θ) = Σ - Θ^{-1} requires a matrix inversion.
- f ∈ F_2 is self-concordant. However, if αI ⪯ Θ ⪯ βI, then f ∈ F_{L,µ}^{2,1} with L = p/α^2 and µ = (β^2 p)^{-1}.
Example (ℓ1-regularized least squares (Lasso))
min_x { (1/2)‖b - Ax‖_2^2 + ρ‖x‖_1 },  with f(x) := (1/2)‖b - Ax‖_2^2 and g(x) := ρ‖x‖_1,
where n ≥ p, A ∈ R^{n×p} is a full column-rank matrix, and ρ > 0 is a regularization parameter.
- f ∈ F_{L,µ}^{2,1} and computing the gradient is O(np).
[Figure: linear observation model b = Ax♮ + w with n ≥ p.]

References I
[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.
[2] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Heinz Bauschke, Regina S. Burachik, Patrick Combettes, Veit Elser, D. Russell Luke, and Henry Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, chapter 10. Springer, New York, NY, 2011.
[3] Osman Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim., 29(2):403-419, March 1991.
[4] Yen-Huan Li, Ya-Ping Hsieh, and Volkan Cevher. A geometric view on constrained M-estimators. EPFL-REPORT-205083, École Polytechnique Fédérale de Lausanne, 2015.
[5] Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, and Volkan Cevher. Sparsistency of l1-regularized M-estimators. In Proc. 18th Int. Conf. Artificial Intelligence and Statistics, pages 644-652, 2015.
[6] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci., 27(4):538-557, 2012.
[7] Yurii Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, MA, 2004.
[8] R. T. Rockafellar. On the maximal monotonicity of subdifferential mappings. Pac. J. Math., 33(1):209-216, 1970.
[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.

References II
[2] P. Combettes and J.-C. Pesquet. Signal recovery by proximal forward-backward splitting. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185-212. Springer-Verlag, 2011.
[3] O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim., 29(2):403-419, 1991.
[4] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
[5] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127-152, 2005.
[6] Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
[7] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123-231, 2013.
[8] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.
[9] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14:877-898, 1976.
[10] P. Giselsson and S. Boyd. Monotonicity and restart in fast gradient methods. In IEEE 53rd Annual Conference on Decision and Control, pages 5058-5063, 2014.

References III
[11] Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. Composite self-concordant minimization. Tech. report, Lab. for Information and Inference Systems (LIONS), EPFL, CH-1015 Lausanne, Switzerland, January 2013.