Mathematics of Data: From Theory to Computation


1 Mathematics of Data: From Theory to Computation Prof. Volkan Cevher Lecture 8: Composite convex minimization I Laboratory for Information and Inference Systems (LIONS) École Polytechnique Fédérale de Lausanne (EPFL) EE-556 (Fall 2015) Mathematics of Data Prof. Volkan Cevher, volkan.cevher@epfl.ch Slide 1/ 43

License Information for Mathematics of Data Slides

This work is released under a Creative Commons License with the following terms:

Attribution. The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.

Non-Commercial. The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes unless they get the licensor's permission.

Share Alike. The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor's work.

Full Text of the License

Outline

Today
1. Composite convex minimization
2. Proximal operator and computational complexity
3. Proximal gradient methods

Next week
1. Proximal Newton-type methods
2. Composite self-concordant minimization

Recommended reading material

A. Beck and M. Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sciences, 2(1):183-202, 2009.
Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., 103(1):127-152, 2005.
Q. Tran-Dinh, A. Kyrillidis and V. Cevher, "Composite Self-Concordant Minimization," LIONS-EPFL Tech. Report.
N. Parikh and S. Boyd, "Proximal Algorithms," Foundations and Trends in Optimization, 1(3):127-239, 2014.

Motivation

Data analytics problems in various disciplines can often be reduced to nonsmooth composite convex minimization problems. This lecture provides efficient numerical solution methods for such problems.

Intriguingly, composite minimization problems are far from generic nonsmooth problems: we can exploit the individual structure of each function to obtain numerical solutions nearly as efficiently as if the problems were smooth.


Composite convex minimization

Problem (Unconstrained composite convex minimization)

    $F^\star := \min_{x \in \mathbb{R}^p} \{ F(x) := f(x) + g(x) \}$    (1)

$f$ and $g$ are both proper, closed, and convex.
$\mathrm{dom}(F) := \mathrm{dom}(f) \cap \mathrm{dom}(g)$ and $-\infty < F^\star < +\infty$.
The solution set $\mathcal{S}^\star := \{x^\star \in \mathrm{dom}(F) : F(x^\star) = F^\star\}$ is nonempty.

Two remarks

Nonsmoothness: at least one of the two functions $f$ and $g$ is nonsmooth.
General nonsmooth convex optimization methods (e.g., classical subgradient, level, or bundle methods) lack efficiency and numerical robustness: they require $O(\epsilon^{-2})$ iterations to reach a point $x_\epsilon$ such that $F(x_\epsilon) - F^\star \leq \epsilon$. Hence, to reach $x_{0.01}$ such that $F(x_{0.01}) - F^\star \leq 0.01$, we need $O(10^4)$ iterations.

Generality: the template covers a wider range of problems than smooth unconstrained minimization. E.g., in regularized M-estimation,
$f$ is a loss function (a data-fidelity or negative log-likelihood term);
$g$ is a regularizer, encouraging structure and/or constraints in the solution.

Example 1: Sparse regression in generalized linear models (GLMs)

Problem (Sparse regression in GLM)
Our goal is to estimate $x^\natural \in \mathbb{R}^p$ given $\{b_i\}_{i=1}^n$ and $\{a_i\}_{i=1}^n$, knowing that the likelihood of $b_i$ given $a_i$ and $x^\natural$ is $L(b_i; \langle a_i, x^\natural\rangle)$, and that $x^\natural$ is sparse.

[Figure: observation model with data $b$, design matrix $A$, sparse parameter $x^\natural$, and noise $w$]

Optimization formulation

    $\min_{x\in\mathbb{R}^p} \Big\{ \underbrace{-\sum_{i=1}^n \log L(b_i; \langle a_i, x\rangle)}_{f(x)} + \underbrace{\rho_n \|x\|_1}_{g(x)} \Big\}$

where $\rho_n > 0$ is a parameter which controls the strength of the sparsity regularization.

Theorem (cf. [4, 5, 6] for details)
Under some technical conditions, there exists a choice of $\{\rho_n\}$ such that, with high probability,
    $\|x^\star - x^\natural\|_2^2 = O\!\left(\frac{s\log p}{n}\right)$,  $\mathrm{supp}(x^\star) = \mathrm{supp}(x^\natural)$.
Recall the maximum-likelihood estimator: $\|x_{\mathrm{ML}} - x^\natural\|_2^2 = O(p/n)$.

Example 2: Image processing

Problem (Image denoising/deblurring)
Our goal is to obtain a clean image $x$ given dirty observations $b \in \mathbb{R}^{n_1}$ via $b = \mathcal{A}(x) + w$, where $\mathcal{A}$ is a linear operator which, e.g., captures camera blur as well as image subsampling, and $w$ models perturbations, such as Gaussian or Poisson noise.

Optimization formulation

Gaussian noise:
    $\min_{x\in\mathbb{R}^{n\times p}} \big\{ \underbrace{\tfrac12\|\mathcal{A}(x) - b\|_2^2}_{f(x)} + \underbrace{\rho\|x\|_{\mathrm{TV}}}_{g(x)} \big\}$
Poisson noise:
    $\min_{x\in\mathbb{R}^{n\times p}} \Big\{ \underbrace{\tfrac1n\sum_{i=1}^n\big[\langle a_i, x\rangle - b_i\ln(\langle a_i, x\rangle)\big]}_{f(x)} + \underbrace{\rho\|x\|_{\mathrm{TV}}}_{g(x)} \Big\}$

where $\rho > 0$ is a regularization parameter and $\|\cdot\|_{\mathrm{TV}}$ is the total variation (TV) norm:
    $\|x\|_{\mathrm{TV}} := \sum_{i,j} |x_{i,j+1} - x_{i,j}| + |x_{i+1,j} - x_{i,j}|$   (anisotropic case),
    $\|x\|_{\mathrm{TV}} := \sum_{i,j} \sqrt{|x_{i,j+1} - x_{i,j}|^2 + |x_{i+1,j} - x_{i,j}|^2}$   (isotropic case).
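
To make the TV regularizer concrete, here is a minimal NumPy sketch of the two discrete TV seminorms defined above. The function name `tv_norm`, the forward-difference boundary handling, and the toy image are illustrative assumptions, not part of the slides.

```python
import numpy as np

def tv_norm(img, isotropic=False):
    """Discrete total variation of a 2-D array via forward differences.

    Dropping the last row/column difference is one common boundary
    convention; the slides do not fix a particular choice.
    """
    dx = img[:, 1:] - img[:, :-1]   # horizontal differences x_{i,j+1} - x_{i,j}
    dy = img[1:, :] - img[:-1, :]   # vertical differences   x_{i+1,j} - x_{i,j}
    if isotropic:
        # pair the two differences on the common (m-1) x (n-1) grid
        return np.sqrt(dx[:-1, :] ** 2 + dy[:, :-1] ** 2).sum()
    return np.abs(dx).sum() + np.abs(dy).sum()

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
print(tv_norm(img), tv_norm(img, isotropic=True))
```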

Example 3: Confocal microscopy with camera blur and Poisson observations

[Figure: three panels showing the original image $x^\natural$, the observed noisy input image $b$, and the denoised output estimate $\hat{x}$]

Example 4: Sparse inverse covariance estimation

Problem (Graphical model selection)
Given a data set $D := \{x_1, \ldots, x_N\}$, where each $x_i$ is a Gaussian random variable. Let $\Sigma$ be the covariance matrix corresponding to the graphical model of the Gaussian Markov random field. Our goal is to learn a sparse precision matrix $\Theta$ (i.e., the inverse covariance matrix $\Sigma^{-1}$) that captures the Markov random field structure.

[Figure: a five-node Gaussian graphical model $x_1, \ldots, x_5$ and the sparsity pattern of the corresponding precision matrix]

Optimization formulation

    $\min_{\Theta \succ 0} \big\{ \underbrace{\mathrm{tr}(\Sigma\Theta) - \log\det(\Theta)}_{f(\Theta)} + \underbrace{\lambda\|\mathrm{vec}(\Theta)\|_1}_{g(\Theta)} \big\}$    (2)

where $\Theta \succ 0$ means that $\Theta$ is symmetric and positive definite, $\lambda > 0$ is a regularization parameter, and $\mathrm{vec}$ is the vectorization operator.
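
As a small illustration of the objective in (2), the sketch below evaluates $F(\Theta) = f(\Theta) + g(\Theta)$ on a toy empirical covariance. The helper name `graphical_lasso_objective` and the synthetic data are assumptions made only for this example.

```python
import numpy as np

def graphical_lasso_objective(Theta, Sigma, lam):
    """F(Theta) = tr(Sigma Theta) - log det(Theta) + lam * ||vec(Theta)||_1,
    assuming Theta is symmetric positive definite."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return np.inf                              # outside the domain Theta > 0
    smooth = np.trace(Sigma @ Theta) - logdet      # f(Theta)
    nonsmooth = lam * np.abs(Theta).sum()          # g(Theta) = lam * ||vec(Theta)||_1
    return smooth + nonsmooth

# toy usage: empirical covariance of synthetic Gaussian data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Sigma = np.cov(X, rowvar=False)
print(graphical_lasso_objective(np.eye(5), Sigma, lam=0.1))
```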

Outline

Today
1. Composite convex minimization
2. Proximal operator and computational complexity
3. Proximal gradient methods

Next week
1. Proximal Newton-type methods
2. Composite self-concordant minimization

Question: How do we design algorithms for finding a solution $x^\star$?

Philosophy
We cannot immediately design algorithms just based on the original formulation
    $F^\star := \min_{x\in\mathbb{R}^p} \{ F(x) := f(x) + g(x) \}$.    (1)
We need intermediate tools to characterize the solutions $x^\star$ of this problem.
One key tool is called the optimality condition.

Optimality condition

Theorem (Moreau-Rockafellar theorem [8])
Let $\partial f$ and $\partial g$ be the subdifferentials of $f$ and $g$, respectively. If $f, g \in \mathcal{F}(\mathbb{R}^p)$ and $\mathrm{dom}(f) \cap \mathrm{dom}(g) \neq \emptyset$, then
    $\partial F \equiv \partial(f + g) = \partial f + \partial g$.
Note: $\mathrm{dom}(F) = \mathrm{dom}(f) \cap \mathrm{dom}(g)$, and $\partial f(x)$ is defined as (cf. Lecture 2):
    $\partial f(x) := \{ w \in \mathbb{R}^p : f(y) - f(x) \geq w^T(y - x), \ \forall y \in \mathbb{R}^p \}$.

Optimality condition
Generally, the optimality condition for (1) can be written as
    $0 \in \partial F(x^\star) \equiv \partial f(x^\star) + \partial g(x^\star)$.    (3)
If $f \in \mathcal{F}^{1,1}_L(\mathbb{R}^p)$, then (3) features the gradient of $f$ as opposed to the subdifferential:
    $0 \in \nabla f(x^\star) + \partial g(x^\star)$.    (4)

Necessary and sufficient condition

Lemma (Necessary and sufficient condition)
A point $x^\star \in \mathrm{dom}(F)$ is a globally optimal solution to (1) (i.e., to $F^\star := \min_{x\in\mathbb{R}^p}\{F(x) := f(x) + g(x)\}$) if and only if $x^\star$ satisfies (3): $0 \in \partial f(x^\star) + \partial g(x^\star)$ (or (4): $0 \in \nabla f(x^\star) + \partial g(x^\star)$ when $f \in \mathcal{F}^{1,1}_L(\mathbb{R}^p)$).

Sketch of the proof.
($\Leftarrow$) By definition of $\partial F$: $F(x) - F(x^\star) \geq \xi^T(x - x^\star)$ for any $\xi \in \partial F(x^\star)$ and $x \in \mathbb{R}^p$. If (3) (or (4)) is satisfied, we may take $\xi = 0$, so $F(x) - F(x^\star) \geq 0$, i.e., $x^\star$ is a global solution to (1).
($\Rightarrow$) If $x^\star$ is a global solution of (1), then $F(x) \geq F(x^\star)$ for all $x \in \mathrm{dom}(F)$, i.e., $F(x) - F(x^\star) \geq 0 = 0^T(x - x^\star)$ for all $x \in \mathbb{R}^p$. This means $0 \in \partial F(x^\star)$, which is (3) (or (4)).

A short detour: proximal-point operators

Definition (Proximal operator [9])
Let $g \in \mathcal{F}(\mathbb{R}^p)$ and $x \in \mathbb{R}^p$. The proximal operator (or prox-operator) of $g$ is defined as
    $\mathrm{prox}_g(x) := \arg\min_{y\in\mathbb{R}^p} \big\{ g(y) + \tfrac12\|y - x\|_2^2 \big\}$.    (5)

Numerical efficiency: why do we need the proximal operator? For problem (1):
For many well-known convex functions $g$, we can compute $\mathrm{prox}_g(x)$ analytically or very efficiently.
If $f \in \mathcal{F}^{1,1}_L(\mathbb{R}^p)$ and $\mathrm{prox}_g(x)$ is cheap to compute, then solving (1) is as efficient as solving $\min_{x\in\mathbb{R}^p} f(x)$ in terms of complexity.
If $\mathrm{prox}_f(x)$ and $\mathrm{prox}_g(x)$ are both cheap to compute, then convex splitting for (1) is also efficient (cf. Lecture 8).

A short detour: basic properties of the prox-operator

Property (Basic properties of the prox-operator)
1. $\mathrm{prox}_g(x)$ is well-defined and single-valued (i.e., the prox-operator (5) has a unique solution, since $g(\cdot) + \tfrac12\|\cdot - x\|_2^2$ is strongly convex).
2. Optimality condition: $x \in \mathrm{prox}_g(x) + \partial g(\mathrm{prox}_g(x))$, for all $x \in \mathbb{R}^p$.
3. $x^\star$ is a fixed point of $\mathrm{prox}_g(\cdot)$: $0 \in \partial g(x^\star) \iff x^\star = \mathrm{prox}_g(x^\star)$.
4. Nonexpansiveness: $\|\mathrm{prox}_g(x) - \mathrm{prox}_g(\tilde{x})\|_2 \leq \|x - \tilde{x}\|_2$, for all $x, \tilde{x} \in \mathbb{R}^p$.
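
These properties are easy to check numerically for a proximally tractable $g$. Below is a minimal sketch for $g = \lambda\|\cdot\|_1$, whose prox is entrywise soft-thresholding (see the table a few slides ahead); the helper name and the random test points are illustrative assumptions.

```python
import numpy as np

def prox_l1(x, lam):
    """prox of g = lam * ||.||_1: entrywise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
x, z = rng.standard_normal(10), rng.standard_normal(10)
lam = 0.5

# Property 4 (nonexpansiveness): ||prox(x) - prox(z)|| <= ||x - z||
assert np.linalg.norm(prox_l1(x, lam) - prox_l1(z, lam)) <= np.linalg.norm(x - z)

# Property 3 (fixed point): x* = 0 minimizes lam * ||.||_1 and the prox leaves it unchanged
assert np.allclose(prox_l1(np.zeros(10), lam), 0.0)
print("basic prox properties verified on a random example")
```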

Fixed-point characterization

Optimality condition as a fixed-point formulation
The optimality condition (3): $0 \in \partial f(x^\star) + \partial g(x^\star)$ is equivalent to
    $x^\star \in \mathrm{prox}_{\lambda g}(x^\star - \lambda\,\partial f(x^\star)) := T_\lambda(x^\star)$, for any $\lambda > 0$.    (6)
The optimality condition (4): $0 \in \nabla f(x^\star) + \partial g(x^\star)$ is equivalent to
    $x^\star = \mathrm{prox}_{\lambda g}(x^\star - \lambda\nabla f(x^\star)) := U_\lambda(x^\star)$, for any $\lambda > 0$.    (7)
$T_\lambda$ is a set-valued operator and $U_\lambda$ is a single-valued operator.

Proof.
We prove (7) ((6) is proved similarly). (4) implies
    $0 \in \lambda\nabla f(x^\star) + \lambda\,\partial g(x^\star) \iff x^\star - \lambda\nabla f(x^\star) \in x^\star + \lambda\,\partial g(x^\star) = (I + \lambda\,\partial g)(x^\star)$.
Using basic property 2 of $\mathrm{prox}_{\lambda g}$, we have $x^\star \in \mathrm{prox}_{\lambda g}(x^\star - \lambda\nabla f(x^\star))$. Since $\mathrm{prox}_{\lambda g}$ and $\nabla f$ are single-valued, we obtain (7).

Choices of solution methods

    $F^\star = \min_{x\in\mathbb{R}^p} \{ F(x) := f(x) + g(x) \}$

$f \in \mathcal{F}^{1,1}_L(\mathbb{R}^p)$, $g \in \mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$: [fast] proximal gradient methods.
$f \in \mathcal{F}_2(\mathrm{dom}(f))$, $g \in \mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$: proximal gradient/Newton methods.
$f$ is smoothable, $g \in \mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$: smoothing techniques.
$f \in \mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$, $g \in \mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$: splitting methods/ADMM.

$\mathcal{F}^{1,1}_L$ and $\mathcal{F}_2$ are the classes of convex functions with Lipschitz gradient and of self-concordant convex functions, respectively. $\mathcal{F}_{\mathrm{prox}}$ is the class of convex functions with tractable proximity operator (defined in the next slides). "Smoothable" is defined in the next lectures.

Solution methods

Composite convex minimization
    $F^\star := \min_{x\in\mathbb{R}^p} \{ F(x) := f(x) + g(x) \}$.    (1)

Choice of numerical solution methods
Solve (1) = find $x^k \in \mathbb{R}^p$ such that $F(x^k) - F^\star \leq \varepsilon$ for a given tolerance $\varepsilon > 0$.

Oracles: We can use one of the following configurations (oracles):
1. $\partial f(\cdot)$ and $\partial g(\cdot)$ at any point $x \in \mathbb{R}^p$.
2. $\nabla f(\cdot)$ and $\mathrm{prox}_{\lambda g}(\cdot)$ at any point $x \in \mathbb{R}^p$.
3. $\mathrm{prox}_{\lambda f}(\cdot)$ and $\mathrm{prox}_{\lambda g}(\cdot)$ at any point $x \in \mathbb{R}^p$.
4. $\nabla f(\cdot)$, the inverse of $\nabla^2 f(\cdot)$, and $\mathrm{prox}_{\lambda g}(\cdot)$ at any point $x \in \mathbb{R}^p$.

Using different oracles leads to different types of algorithms.

Tractable prox-operators

Processing the nonsmooth term in (1)
We handle the nonsmooth term $g$ in (1) using the proximal mapping principle. Computing the proximal operator $\mathrm{prox}_g$ of a general convex function $g$,
    $\mathrm{prox}_g(x) := \arg\min_{y\in\mathbb{R}^p} \{ g(y) + \tfrac12\|y - x\|_2^2 \}$,
can be computationally demanding.
If we can efficiently compute $\mathrm{prox}_F$, we can use the proximal-point algorithm (PPA) [3, 9] to solve (1). Unfortunately, PPA is hardly used in practice to solve (8), since computing $\mathrm{prox}_{\lambda F}(\cdot)$ can be almost as hard as solving (1) itself.

Definition (Tractable proximity)
Given $g \in \mathcal{F}(\mathbb{R}^p)$, we say that $g$ is proximally tractable if $\mathrm{prox}_g$ defined by (5) can be computed efficiently. "Efficiently" = {closed-form solution, low-cost computation, polynomial time}. We denote by $\mathcal{F}_{\mathrm{prox}}(\mathbb{R}^p)$ the class of proximally tractable convex functions.

The proximal-point method

Problem (Unconstrained convex minimization)
Given $F \in \mathcal{F}(\mathbb{R}^p)$, our goal is to solve
    $F^\star := \min_{x\in\mathbb{R}^p} F(x)$.    (8)

Proximal-point algorithm (PPA):
1. Choose $x^0 \in \mathbb{R}^p$ and a positive sequence $\{\lambda_k\}_{k\geq 0} \subset \mathbb{R}_{++}$.
2. For $k = 0, 1, \ldots$, update: $x^{k+1} := \mathrm{prox}_{\lambda_k F}(x^k)$.

Theorem (Convergence [3])
Let $\{x^k\}_{k\geq 0}$ be a sequence generated by PPA. If $0 < \lambda_k < +\infty$, then
    $F(x^k) - F^\star \leq \frac{\|x^0 - x^\star\|_2^2}{2\sum_{j=0}^{k}\lambda_j}$, for all $x^\star \in \mathcal{S}^\star$, $k \geq 0$.
If $\lambda_k \equiv \lambda > 0$, then the convergence rate of PPA is $O(1/k)$.
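
As a toy illustration of PPA, the sketch below applies it to $F(x) = \|x\|_1$, whose prox is soft-thresholding, with a constant step size $\lambda_k \equiv \lambda$. The specific choice of $F$ and of $\lambda$ are illustrative assumptions; for a general $F$, $\mathrm{prox}_{\lambda_k F}$ is usually not available in closed form, which is exactly the point made on the previous slide.

```python
import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# PPA on F(x) = ||x||_1, whose prox is soft-thresholding; the minimizer is x* = 0.
x = np.array([3.0, -2.0, 0.5])
lam = 0.25                          # constant step size lambda_k = lambda
for k in range(40):
    x = prox_l1(x, lam)             # x^{k+1} = prox_{lambda_k F}(x^k)
print(x, np.abs(x).sum())           # converges to the minimizer 0, with F = 0
```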

Tractable prox-operators

Example
For separable functions, the prox-operator can be efficient. For instance, for $g(x) := \|x\|_1 = \sum_{i=1}^p |x_i|$, we have
    $\mathrm{prox}_{\lambda g}(x) = \mathrm{sign}(x) \odot \max\{|x| - \lambda, 0\}$.
For some smooth functions, we can compute the prox-operator via basic linear algebra. For instance, for $g(x) := \tfrac12\|Ax - b\|_2^2$, one has
    $\mathrm{prox}_{\lambda g}(x) = (I + \lambda A^TA)^{-1}(x + \lambda A^Tb)$.
For the indicator function of a simple set, e.g., $g(x) := \delta_{\mathcal{X}}(x)$, the prox-operator is the projection operator: $\mathrm{prox}_{\lambda g}(x) = \pi_{\mathcal{X}}(x)$, the projection of $x$ onto $\mathcal{X}$. For instance, when $\mathcal{X} = \{x : \|x\|_1 \leq \lambda\}$, the projection can be obtained efficiently (see the sketch below).
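
The slide only asserts that the projection onto the $\ell_1$-ball is efficient; one standard $O(p\log p)$ sorting-based routine is sketched below. The particular algorithm and the helper name are assumptions, since the slides do not specify which method is meant.

```python
import numpy as np

def project_l1_ball(x, radius):
    """Euclidean projection of x onto {y : ||y||_1 <= radius}, via sorting."""
    if np.abs(x).sum() <= radius:
        return x.copy()                            # already feasible
    u = np.sort(np.abs(x))[::-1]                   # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, x.size + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)      # optimal shrinkage threshold
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

x = np.array([1.5, -0.8, 0.3, 2.0])
p = project_l1_ball(x, radius=1.0)
print(p, np.abs(p).sum())                          # the l1 norm of the projection is 1.0
```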

Computational efficiency: example

Proximal operator of a quadratic function
The proximal operator of a quadratic function $g(x) := \tfrac12\|Ax - b\|_2^2$ is defined as
    $\mathrm{prox}_{\lambda g}(x) := \arg\min_{y\in\mathbb{R}^p} \big\{ \tfrac12\|Ay - b\|_2^2 + \tfrac{1}{2\lambda}\|y - x\|_2^2 \big\}$.    (9)

How to compute $\mathrm{prox}_{\lambda g}(x)$?
The optimality condition implies that the solution of (9) should satisfy the linear system
    $A^T(Ay - b) + \lambda^{-1}(y - x) = 0$.
As a result, we obtain $\mathrm{prox}_{\lambda g}(x) = (I + \lambda A^TA)^{-1}(x + \lambda A^Tb)$.
When $A^TA$ is efficiently diagonalizable (e.g., $U^TA^TAU = \Lambda$, where $U$ is a unitary matrix and $\Lambda$ is a diagonal matrix), then $\mathrm{prox}_{\lambda g}(x)$ can be cheap:
    $\mathrm{prox}_{\lambda g}(x) = U(I + \lambda\Lambda)^{-1}U^T(x + \lambda A^Tb)$.
Matrices $A$ such as the TV operator with periodic boundary conditions, index-subsampling operators (e.g., as in matrix completion), and circulant matrices (e.g., typical image blur operators) are efficiently diagonalizable with the fast Fourier transform $U$.
If $AA^T$ is diagonalizable (e.g., $A$ is a tight frame), then we can use the identity $(I + \lambda A^TA)^{-1} = I - A^T(\lambda^{-1}I + AA^T)^{-1}A$.
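
A minimal numerical sketch of the formula above: form the right-hand side $x + \lambda A^Tb$, solve the linear system with $I + \lambda A^TA$, and sanity-check that the result minimizes the prox objective (9). The dimensions and random data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 8, 0.7
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
x = rng.standard_normal(p)

# prox_{lam*g}(x) with g(y) = 0.5*||Ay - b||^2: solve (I + lam*A^T A) y = x + lam*A^T b
y = np.linalg.solve(np.eye(p) + lam * A.T @ A, x + lam * A.T @ b)

# sanity check: the prox objective at y is no larger than at nearby perturbations
obj = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + np.sum((z - x) ** 2) / (2 * lam)
assert all(obj(y) <= obj(y + 1e-3 * rng.standard_normal(p)) for _ in range(5))
print(y)
```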

A non-exhaustive list of proximally tractable functions

Each entry lists: name; function; proximal operator; complexity.

$\ell_1$-norm; $f(x) := \|x\|_1$; $\mathrm{prox}_{\lambda f}(x) = \mathrm{sign}(x) \odot [|x| - \lambda]_+$; $O(p)$.
$\ell_2$-norm; $f(x) := \|x\|_2$; $\mathrm{prox}_{\lambda f}(x) = [1 - \lambda/\|x\|_2]_+\, x$; $O(p)$.
Support function; $f(x) := \max_{y\in\mathcal{C}} x^Ty$; $\mathrm{prox}_{\lambda f}(x) = x - \lambda\,\pi_{\mathcal{C}}(x/\lambda)$.
Box indicator; $f(x) := \delta_{[a,b]}(x)$; $\mathrm{prox}_{\lambda f}(x) = \pi_{[a,b]}(x)$; $O(p)$.
Positive semidefinite cone indicator; $f(X) := \delta_{\mathcal{S}^p_+}(X)$; $\mathrm{prox}_{\lambda f}(X) = U[\Sigma]_+U^T$, where $X = U\Sigma U^T$; $O(p^3)$.
Hyperplane indicator; $f(x) := \delta_{\mathcal{X}}(x)$, $\mathcal{X} := \{x : a^Tx = b\}$; $\mathrm{prox}_{\lambda f}(x) = \pi_{\mathcal{X}}(x) = x + \big(\frac{b - a^Tx}{\|a\|^2}\big)a$; $O(p)$.
Simplex indicator; $f(x) := \delta_{\mathcal{X}}(x)$, $\mathcal{X} := \{x : x \geq 0, \mathbf{1}^Tx = 1\}$; $\mathrm{prox}_{\lambda f}(x) = [x - \nu\mathbf{1}]_+$ for some $\nu \in \mathbb{R}$, which can be efficiently calculated; $\tilde{O}(p)$.
Convex quadratic; $f(x) := \tfrac12 x^TQx + q^Tx$; $\mathrm{prox}_{\lambda f}(x) = (I + \lambda Q)^{-1}(x - \lambda q)$; $O(p\log p)$ to $O(p^3)$.
Squared $\ell_2$-norm; $f(x) := \tfrac12\|x\|_2^2$; $\mathrm{prox}_{\lambda f}(x) = \frac{1}{1+\lambda}\,x$; $O(p)$.
log-function; $f(x) := -\log(x)$; $\mathrm{prox}_{\lambda f}(x) = \big((x^2 + 4\lambda)^{1/2} + x\big)/2$; $O(1)$.
log det-function; $f(X) := -\log\det(X)$; $\mathrm{prox}_{\lambda f}(X)$ is the log-function prox applied to the individual eigenvalues of $X$; $O(p^3)$.

Here $[x]_+ := \max\{0, x\}$, $\delta_{\mathcal{X}}$ is the indicator function of the convex set $\mathcal{X}$, $\mathrm{sign}$ is the sign function, and $\mathcal{S}^p_+$ is the cone of symmetric positive semidefinite matrices. For more functions, see [2, 7].
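
Two entries of the table are easy to verify directly. The sketch below implements the prox of $f(t) = -\log(t)$ and of $f(x) = \tfrac12\|x\|_2^2$ and checks the scalar optimality condition of the first; the helper names and test values are illustrative assumptions.

```python
import numpy as np

def prox_neg_log(x, lam):
    """prox of f(t) = -log(t): ((x^2 + 4*lam)^0.5 + x) / 2, applied entrywise."""
    return (np.sqrt(x ** 2 + 4 * lam) + x) / 2.0

def prox_sq_l2(x, lam):
    """prox of f(x) = 0.5*||x||_2^2: x / (1 + lam)."""
    return x / (1.0 + lam)

x, lam = np.array([0.3, -1.2, 2.0]), 0.5
y = prox_neg_log(x, lam)
# optimality of the scalar prox: -lam/y + (y - x) = 0 at the minimizer
print(np.max(np.abs(-lam / y + (y - x))))   # ~1e-16
print(prox_sq_l2(x, lam))
```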

Outline

Today
1. Composite convex minimization
2. Proximal operator and computational complexity
3. Proximal gradient methods

Next week
1. Proximal Newton-type methods
2. Composite self-concordant minimization

Proximal-gradient method: a quadratic majorization perspective

Definition (Moreau proximal operator [?, 9])
Let $g \in \mathcal{F}(\mathbb{R}^p)$. The proximal operator (or prox-operator) of $g$ is defined as
    $\mathrm{prox}_g(x) := \arg\min_{y\in\mathbb{R}^p} \{ g(y) + \tfrac12\|y - x\|_2^2 \}$.

Quadratic upper bound for $f$
For $f \in \mathcal{F}^{1,1}_L(\mathbb{R}^p)$ we have, for all $x, y \in \mathbb{R}^p$,
    $f(x) \leq f(y) + \nabla f(y)^T(x - y) + \tfrac{L}{2}\|x - y\|_2^2 =: Q_L(x, y)$.

Quadratic majorizer for $f + g$ [?]
Of course, for all $x, y \in \mathbb{R}^p$, $f(x) \leq Q_L(x, y)$ implies
    $f(x) + g(x) \leq Q_L(x, y) + g(x) =: P_L(x, y)$.

Proximal gradient from the majorize-minimize perspective [?]
    $x^{k+1} = \arg\min_x P_L(x, x^k) = \mathrm{prox}_{g/L}\big(x^k - \tfrac1L\nabla f(x^k)\big)$.

Geometric illustration

[Figure: the composite objective $F(x) = f(x) + g(x)$ together with its quadratic majorizer at $x^k$, $P_L(x, x^k) := f(x^k) + \nabla f(x^k)^T(x - x^k) + \tfrac{L}{2}\|x - x^k\|_2^2 + g(x)$, and the partial linearization $f(x^k) + \nabla f(x^k)^T(x - x^k) + g(x)$; minimizing the majorizer yields the next iterate $x^{k+1}$, which moves from $x^k$ toward $x^\star$.]

Proximal-gradient algorithm

Basic proximal-gradient scheme (ISTA) [?, ?]
1. Choose $x^0 \in \mathrm{dom}(F)$ arbitrarily as a starting point.
2. For $k = 0, 1, \ldots$, generate a sequence $\{x^k\}_{k\geq 0}$ as
    $x^{k+1} := \mathrm{prox}_{\alpha g}\big(x^k - \alpha\nabla f(x^k)\big)$,
where $\alpha := 1/L$.

Theorem (Convergence of ISTA [1])
Let $\{x^k\}$ be generated by ISTA. Then
    $F(x^k) - F^\star \leq \frac{L_f\|x^0 - x^\star\|_2^2}{2(k+1)}$.
The worst-case complexity of ISTA to reach $F(x^k) - F^\star \leq \varepsilon$ is $O\big(\frac{L_f R_0^2}{\varepsilon}\big)$, where $R_0 := \max_{x^\star\in\mathcal{S}^\star}\|x^0 - x^\star\|_2$.

A line-search procedure can be used to estimate $L_k$ in place of $L$, based on the condition (with $0 < c \leq 1$)
    $f(x^{k+1}) \leq f(x^k) - \frac{c}{2L_k}\|\nabla f(x^k)\|^2$.
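
A minimal NumPy sketch of ISTA for the $\ell_1$-regularized least-squares instance used later in this lecture ($f(x) = \tfrac12\|Ax - b\|_2^2$, $g(x) = \lambda\|x\|_1$). Taking the Lipschitz constant as $\|A^TA\|$ and the synthetic data below are illustrative assumptions.

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """Basic proximal-gradient (ISTA) for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)        # gradient of the smooth part f
        z = x - alpha * grad            # forward (gradient) step
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # prox_{alpha*g}
    return x

rng = np.random.default_rng(0)
A, xtrue = rng.standard_normal((100, 50)), np.zeros(50)
xtrue[:5] = rng.standard_normal(5)
b = A @ xtrue + 1e-3 * rng.standard_normal(100)
print(np.round(ista(A, b, lam=0.1), 3)[:8])
```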

Fast proximal-gradient algorithm

Fast proximal-gradient scheme (FISTA)
1. Choose $x^0 \in \mathrm{dom}(F)$ arbitrarily as a starting point.
2. Set $y^0 := x^0$ and $t_0 := 1$.
3. For $k = 0, 1, \ldots$, generate two sequences $\{x^k\}_{k\geq 0}$ and $\{y^k\}_{k\geq 0}$ as
    $x^{k+1} := \mathrm{prox}_{\alpha g}\big(y^k - \alpha\nabla f(y^k)\big)$,
    $t_{k+1} := \big(1 + \sqrt{1 + 4t_k^2}\big)/2$,
    $y^{k+1} := x^{k+1} + \frac{t_k - 1}{t_{k+1}}(x^{k+1} - x^k)$,
where $\alpha := 1/L$.

From $O\big(\frac{L_f R_0^2}{\epsilon}\big)$ to $O\big(R_0\sqrt{L_f/\epsilon}\big)$ iterations at almost no additional cost!

Complexity per iteration
One gradient $\nabla f(y^k)$ and one prox-operator of $g$;
8 arithmetic operations for $t_{k+1}$ and $\gamma_{k+1}$;
2 more vector additions, and one scalar-vector multiplication.
The cost per iteration is almost the same as in the gradient scheme if the proximal operator of $g$ is efficient.
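
The same template with the momentum sequence $t_k$ gives FISTA. The sketch below mirrors the ISTA sketch above and only adds the two extra lines for $t_{k+1}$ and $y^{k+1}$; data and parameters are again illustrative assumptions.

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """FISTA for min 0.5*||Ax - b||^2 + lam*||x||_1 (same template as the ISTA sketch)."""
    L = np.linalg.norm(A.T @ A, 2)
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        z = y - alpha * grad
        x_next = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum step
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
b = A @ np.concatenate([rng.standard_normal(5), np.zeros(45)]) + 1e-3 * rng.standard_normal(100)
print(np.round(fista(A, b, lam=0.1), 3)[:8])
```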

Example 1: $\ell_1$-regularized least squares

Problem ($\ell_1$-regularized least squares)
Given $A \in \mathbb{R}^{n\times p}$ and $b \in \mathbb{R}^n$, solve
    $F^\star := \min_{x\in\mathbb{R}^p} \big\{ F(x) := \tfrac12\|Ax - b\|_2^2 + \lambda\|x\|_1 \big\}$,    (10)
where $\lambda > 0$ is a regularization parameter.

Complexity per iteration
Evaluating $\nabla f(x^k) = A^T(Ax^k - b)$ requires one $Ax$ and one $A^Ty$ multiplication.
One soft-thresholding operation $\mathrm{prox}_{\lambda g}(x) = \mathrm{sign}(x) \odot \max\{|x| - \lambda, 0\}$.
Optional: evaluating $L = \|A^TA\|$ (spectral norm) via power iterations (e.g., 20 iterations, each requiring one $Ax$ and one $A^Ty$).

Synthetic data generation
$A :=$ randn(n, p), standard Gaussian $\mathcal{N}(0, I)$.
$x^\natural$ is a $k$-sparse vector generated randomly.
$b := Ax^\natural + w$ with $w \sim \mathcal{N}(0, 10^{-3})$.
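
The optional step of estimating $L = \|A^TA\|$ by power iterations, using only $Ax$ and $A^Ty$ products as stated above, can be sketched as follows; the iteration count and random initialization are illustrative choices.

```python
import numpy as np

def estimate_lipschitz(A, iters=20, seed=0):
    """Estimate L = ||A^T A||_2 by power iteration, using only A x and A^T y products."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)               # one Ax and one A^T y per iteration
        v = w / np.linalg.norm(w)
    return v @ (A.T @ (A @ v))          # Rayleigh quotient ~ largest eigenvalue of A^T A

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 100))
print(estimate_lipschitz(A), np.linalg.norm(A.T @ A, 2))   # the two values should be close
```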

Example 1: Theoretical bounds vs. practical performance

Theoretical bounds: FISTA: $\frac{2L_fR_0^2}{(k+2)^2}$; ISTA: $\frac{L_fR_0^2}{2(k+2)}$.

[Figure: $F(x) - F^\star$ in log scale versus the number of iterations for ISTA and FISTA, compared with the theoretical bounds; a second panel illustrates descent directions versus restricted descent directions.]

The $\ell_1$-regularized least-squares formulation has restricted strong convexity. The proximal-gradient method can automatically exploit this structure.

Adaptive restart

It is possible to preserve the $O(1/k^2)$ convergence guarantee! One only needs to slightly modify the algorithm as below.

Generalized fast proximal-gradient scheme
1. Choose $x^0 = x^{-1} \in \mathrm{dom}(F)$ arbitrarily as a starting point.
2. Set $\theta_0 = \theta_{-1} = 1$.
3. For $k = 0, 1, \ldots$, generate two sequences $\{x^k\}_{k\geq 0}$ and $\{y^k\}_{k\geq 0}$ as
    $y^k := x^k + \theta_k(\theta_{k-1}^{-1} - 1)(x^k - x^{k-1})$,
    $x^{k+1} := \mathrm{prox}_{\lambda g}\big(y^k - \lambda\nabla f(y^k)\big)$;
    if the restart test holds:    (11)
        $y^k = x^k$,
        $x^{k+1} := \mathrm{prox}_{\lambda g}\big(y^k - \lambda\nabla f(y^k)\big)$,
where $\lambda := 1/L_f$ and $\theta_k$ is chosen so that it satisfies
    $\theta_k = \frac{\sqrt{\theta_{k-1}^4 + 4\theta_{k-1}^2} - \theta_{k-1}^2}{2} < \frac{2}{k+3}$.

Adaptive restart: guarantee

Theorem (Global complexity [10])
The sequence $\{x^k\}_{k\geq 0}$ generated by the modified algorithm satisfies
    $F(x^k) - F^\star \leq \frac{2L_f}{(k+2)^2}\Big(R_0^2 + \sum_{k_i \leq k}\big(\|x^\star - x^{k_i}\|_2^2 - \|x^\star - z^{k_i}\|_2^2\big)\Big)$, for all $k \geq 0$,    (12)
where $R_0 := \min_{x^\star\in\mathcal{S}^\star}\|x^0 - x^\star\|$, $z^k = x^{k-1} + \theta_{k-1}^{-1}(x^k - x^{k-1})$, and $k_i$, $i = 1, \ldots$ are the iterations for which the restart test holds.

Various restart tests that might coincide with $\|x^\star - x^{k_i}\|_2^2 \leq \|x^\star - z^{k_i}\|_2^2$:
Exact non-monotonicity test: $F(x^{k+1}) - F(x^k) > 0$.
Non-monotonicity test: $\big\langle L_f(y^{k-1} - x^k),\ \tfrac12(x^k + y^{k-1}) - x^{k-1}\big\rangle > 0$ (implies exact non-monotonicity and is advantageous when function evaluations are expensive).
Gradient-mapping based test: $\big\langle L_f(y^k - x^{k+1}),\ x^{k+1} - x^k\big\rangle > 0$.
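
Of the tests above, the exact non-monotonicity test is the simplest to implement. The sketch below adds it to the FISTA template from earlier, restarting the momentum ($t \leftarrow 1$, $y \leftarrow x^k$) whenever $F(x^{k+1}) > F(x^k)$; this bookkeeping follows the spirit of (11) but is a simplified illustration, not the exact scheme of the slide.

```python
import numpy as np

def fista_restart(A, b, lam, iters=500):
    """FISTA with the exact non-monotonicity restart test F(x^{k+1}) > F(x^k)."""
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
    L = np.linalg.norm(A.T @ A, 2)
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        z = y - alpha * (A.T @ (A @ y - b))
        x_next = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
        if F(x_next) > F(x):                 # restart: drop momentum and redo the step
            y, t = x.copy(), 1.0
            z = y - alpha * (A.T @ (A @ y - b))
            x_next = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
b = A @ np.concatenate([np.ones(5), np.zeros(45)]) + 1e-3 * rng.standard_normal(100)
print(np.round(fista_restart(A, b, lam=0.1), 2)[:8])
```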

Example 2: Sparse logistic regression

Problem (Sparse logistic regression [?])
Given $A \in \mathbb{R}^{n\times p}$ and $b \in \{-1, +1\}^n$, solve
    $F^\star := \min_{x\in\mathbb{R}^p,\ \beta\in\mathbb{R}} \Big\{ F(x) := \frac1n\sum_{j=1}^n \log\big(1 + \exp(-b_j(a_j^Tx + \beta))\big) + \rho\|x\|_1 \Big\}$.

Real data: w8a, with $n$ data points and $p = 300$ features, available online.
Parameters: regularization parameter $\rho$, 5000 iterations, and a fixed tolerance.
Ground truth: solve the problem up to $10^{-9}$ accuracy with TFOCS to get a high-accuracy approximation of $x^\star$ and $F^\star$.
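
For sparse logistic regression, only the smooth part changes relative to the least-squares examples. The sketch below evaluates $f$ and $\nabla f$ so that they can be plugged into the ISTA/FISTA sketches above, with soft-thresholding as $\mathrm{prox}_{\rho\|\cdot\|_1}$; omitting the intercept $\beta$ is a simplification of the slide's formulation, not part of it.

```python
import numpy as np

def logistic_loss_grad(x, A, b):
    """f(x) = (1/n) * sum_j log(1 + exp(-b_j * a_j^T x)) and its gradient.

    The intercept beta is omitted here for simplicity (an assumption; the
    slide's formulation also optimizes over beta)."""
    n = A.shape[0]
    margins = -b * (A @ x)
    loss = np.logaddexp(0.0, margins).mean()          # stable log(1 + exp(m))
    sigma = np.exp(-np.logaddexp(0.0, -margins))      # stable sigmoid of the margins
    grad = A.T @ (-b * sigma) / n
    return loss, grad

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 30))
b = np.sign(A @ rng.standard_normal(30) + 0.1 * rng.standard_normal(200))
print(logistic_loss_grad(np.zeros(30), A, b)[0])      # equals log(2) at x = 0
```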

Example 2: Sparse logistic regression - numerical results

[Figure: $F(x) - F^\star$ (log scale) versus the number of iterations and versus CPU time for ISTA, line-search ISTA, FISTA, FISTA with restart, line-search FISTA, and line-search FISTA with restart, compared with the theoretical bounds of ISTA and FISTA; a table reports the number of iterations, CPU time (s), and solution error ($\times 10^{-7}$) for ISTA, LS-ISTA, FISTA, FISTA-R, LS-FISTA, and LS-FISTA-R.]

Strong convexity case: algorithms

Proximal-gradient scheme (ISTA-$\mu$)
1. Given $x^0 \in \mathbb{R}^p$ as a starting point.
2. For $k = 0, 1, \ldots$, generate a sequence $\{x^k\}_{k\geq 0}$ as
    $x^{k+1} := \mathrm{prox}_{\alpha_k g}\big(x^k - \alpha_k\nabla f(x^k)\big)$,
where $\alpha_k := 2/(L_f + \mu)$ is the optimal step size.

Fast proximal-gradient scheme (FISTA-$\mu$)
1. Given $x^0 \in \mathbb{R}^p$ as a starting point. Set $y^0 := x^0$.
2. For $k = 0, 1, \ldots$, generate two sequences $\{x^k\}_{k\geq 0}$ and $\{y^k\}_{k\geq 0}$ as
    $x^{k+1} := \mathrm{prox}_{\alpha_k g}\big(y^k - \alpha_k\nabla f(y^k)\big)$,
    $y^{k+1} := x^{k+1} + \frac{\sqrt{c_f} - 1}{\sqrt{c_f} + 1}(x^{k+1} - x^k)$,
where $\alpha_k := 1/L_f$ is the optimal step size.

Strong convexity case: convergence

Assumption: $f$ is strongly convex with parameter $\mu > 0$, i.e., $f \in \mathcal{F}^{1,1}_{L,\mu}(\mathbb{R}^p)$. Condition number: $c_f := L_f/\mu$.

Theorem (ISTA-$\mu$ [7])
    $F(x^k) - F^\star \leq \frac{L_f}{2}\Big(\frac{c_f - 1}{c_f + 1}\Big)^{2k}\|x^0 - x^\star\|_2^2$.
Convergence rate: linear, with contraction factor $\omega := \Big(\frac{c_f - 1}{c_f + 1}\Big)^2 = \Big(\frac{L_f - \mu}{L_f + \mu}\Big)^2$.

Theorem (FISTA-$\mu$ [7])
    $F(x^k) - F^\star \leq \frac{L_f + \mu}{2}\Big(1 - \sqrt{\frac{\mu}{L_f}}\Big)^k\|x^0 - x^\star\|_2^2$.
Convergence rate: linear, with contraction factor $\omega_f = \frac{\sqrt{L_f} - \sqrt{\mu}}{\sqrt{L_f}} < \omega$.

A practical issue

Stopping criterion
Fact: if $\mathrm{PG}_L(x^\star) = 0$, then $x^\star$ is optimal for (1), where
    $\mathrm{PG}_L(x) = L\big(x - \mathrm{prox}_{(1/L)g}(x - (1/L)\nabla f(x))\big)$.
Stopping criterion (relative solution change):
    $L_k\|x^{k+1} - x^k\|_2 \leq \varepsilon\,\max\{L_0\|x^1 - x^0\|_2, 1\}$,
where $\varepsilon$ is a given tolerance.

Summary of the worst-case complexities

Software
TFOCS is a good software package to learn about first-order methods.

Comparison with the gradient scheme (to reach $F(x^k) - F^\star \leq \varepsilon$)

Proximal-gradient scheme:
complexity $O\big(R_0^2(L_f/\varepsilon)\big)$ for $\mu = 0$ and $O\big(\kappa\log(\varepsilon^{-1})\big)$ for $\mu > 0$;
per iteration: 1 gradient, 1 prox, 1 scalar-vector multiplication, 1 vector addition.

Fast proximal-gradient scheme:
complexity $O\big(R_0\sqrt{L_f/\varepsilon}\big)$ for $\mu = 0$ and $O\big(\sqrt{\kappa}\log(\varepsilon^{-1})\big)$ for $\mu > 0$;
per iteration: 1 gradient, 1 prox, 2 scalar-vector multiplications, 3 vector additions ($\mu = 0$); 1 gradient, 1 prox, 1 scalar-vector multiplication, 2 vector additions ($\mu > 0$).

Here $R_0 := \max_{x^\star\in\mathcal{S}^\star}\|x^0 - x^\star\|$ and $\kappa = L_f/\mu_f$ is the condition number.

Need alternatives when
$f$ is only self-concordant;
computing $\nabla f(x)$ is much costlier than computing $\mathrm{prox}_g$.

Examples

Example (Sparse graphical model selection)
    $\min_{\Theta\succ 0} \big\{ \underbrace{\mathrm{tr}(\Sigma\Theta) - \log\det(\Theta)}_{f(\Theta)} + \underbrace{\rho\|\mathrm{vec}(\Theta)\|_1}_{g(\Theta)} \big\}$
where $\Theta \succ 0$ means that $\Theta$ is symmetric and positive definite, $\rho > 0$ is a regularization parameter, and $\mathrm{vec}$ is the vectorization operator.
Computing the gradient is expensive: $\nabla f(\Theta) = \Sigma - \Theta^{-1}$.
$f \in \mathcal{F}_2$ is self-concordant. However, if $\alpha I \preceq \Theta \preceq \beta I$, then $f \in \mathcal{F}^{2,1}_{L,\mu}$ with $L = p/\alpha^2$ and $\mu = (\beta^2 p)^{-1}$.

Example ($\ell_1$-regularized Lasso)
    $\min_{x} \underbrace{\tfrac12\|b - Ax\|_2^2}_{f(x)} + \underbrace{\rho\|x\|_1}_{g(x)}$
where $n \geq p$, $A \in \mathbb{R}^{n\times p}$ is a full column-rank matrix, and $\rho > 0$ is a regularization parameter.
$f \in \mathcal{F}^{2,1}_{L,\mu}$ and computing the gradient is $O(np)$.

[Figure: observation model $b = Ax^\natural + w$ with $A \in \mathbb{R}^{n\times p}$, $n \geq p$]

References I

[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.
[2] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Heinz Bauschke, Regina S. Burachik, Patrick Combettes, Veit Elser, D. Russell Luke, and Henry Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, chapter 10. Springer, New York, NY, 2011.
[3] Osman Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim., 29(2):403-419, 1991.
[4] Yen-Huan Li, Ya-Ping Hsieh, and Volkan Cevher. A geometric view on constrained M-estimators. EPFL report, École Polytechnique Fédérale de Lausanne.
[5] Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, and Volkan Cevher. Sparsistency of l1-regularized M-estimators. In Proc. 18th Int. Conf. Artificial Intelligence and Statistics, 2015.
[6] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci., 27(4):538-557, 2012.
[7] Yurii Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, MA, 2004.
[8] R. T. Rockafellar. On the maximal monotonicity of subdifferential mappings. Pac. J. Math., 33(1):209-216, 1970.
[1] A. Beck and M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.

References II

[2] P. L. Combettes and J.-C. Pesquet. Signal recovery by proximal forward-backward splitting. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag, 2011.
[3] O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim., 29(2):403-419, 1991.
[4] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
[5] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127-152, 2005.
[6] Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
[7] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.
[8] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.
[9] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14:877-898, 1976.
[10] P. Giselsson and S. Boyd. Monotonicity and restart in fast gradient methods. In IEEE 53rd Annual Conference on Decision and Control, 2014.

References III

[11] Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. Composite self-concordant minimization. Tech. report, Laboratory for Information and Inference Systems (LIONS), EPFL, CH-1015 Lausanne, Switzerland, January.


More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

More information

Variable Metric Forward-Backward Algorithm

Variable Metric Forward-Backward Algorithm Variable Metric Forward-Backward Algorithm 1/37 Variable Metric Forward-Backward Algorithm for minimizing the sum of a differentiable function and a convex function E. Chouzenoux in collaboration with

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

2 Regularized Image Reconstruction for Compressive Imaging and Beyond

2 Regularized Image Reconstruction for Compressive Imaging and Beyond EE 367 / CS 448I Computational Imaging and Display Notes: Compressive Imaging and Regularized Image Reconstruction (lecture ) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

arxiv: v1 [math.oc] 13 Dec 2018

arxiv: v1 [math.oc] 13 Dec 2018 A NEW HOMOTOPY PROXIMAL VARIABLE-METRIC FRAMEWORK FOR COMPOSITE CONVEX MINIMIZATION QUOC TRAN-DINH, LIANG LING, AND KIM-CHUAN TOH arxiv:8205243v [mathoc] 3 Dec 208 Abstract This paper suggests two novel

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem

Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Charles Byrne (Charles Byrne@uml.edu) http://faculty.uml.edu/cbyrne/cbyrne.html Department of Mathematical Sciences

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725

Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725 Gradient Descent Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh Convex Optimization 10-725/36-725 Based on slides from Vandenberghe, Tibshirani Gradient Descent Consider unconstrained, smooth convex

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

About Split Proximal Algorithms for the Q-Lasso

About Split Proximal Algorithms for the Q-Lasso Thai Journal of Mathematics Volume 5 (207) Number : 7 http://thaijmath.in.cmu.ac.th ISSN 686-0209 About Split Proximal Algorithms for the Q-Lasso Abdellatif Moudafi Aix Marseille Université, CNRS-L.S.I.S

More information

An adaptive accelerated first-order method for convex optimization

An adaptive accelerated first-order method for convex optimization An adaptive accelerated first-order method for convex optimization Renato D.C Monteiro Camilo Ortiz Benar F. Svaiter July 3, 22 (Revised: May 4, 24) Abstract This paper presents a new accelerated variant

More information

Lecture 3: Huge-scale optimization problems

Lecture 3: Huge-scale optimization problems Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1

More information

A Dykstra-like algorithm for two monotone operators

A Dykstra-like algorithm for two monotone operators A Dykstra-like algorithm for two monotone operators Heinz H. Bauschke and Patrick L. Combettes Abstract Dykstra s algorithm employs the projectors onto two closed convex sets in a Hilbert space to construct

More information

On the Convergence of the Iterative Shrinkage/Thresholding Algorithm With a Weakly Convex Penalty

On the Convergence of the Iterative Shrinkage/Thresholding Algorithm With a Weakly Convex Penalty On the Convergence of the Iterative Shrinkage/Thresholding Algorithm With a Weakly Convex Penalty İlker Bayram arxiv:5.78v [math.oc] Nov 5 Abstract We consider the iterative shrinkage/thresholding algorithm

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

A Tutorial on Primal-Dual Algorithm

A Tutorial on Primal-Dual Algorithm A Tutorial on Primal-Dual Algorithm Shenlong Wang University of Toronto March 31, 2016 1 / 34 Energy minimization MAP Inference for MRFs Typical energies consist of a regularization term and a data term.

More information

Lecture 17: October 27

Lecture 17: October 27 0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Sparse Optimization Lecture: Dual Methods, Part I

Sparse Optimization Lecture: Dual Methods, Part I Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration

More information

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs)

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs) ORF 523 Lecture 8 Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Any typos should be emailed to a a a@princeton.edu. 1 Outline Convexity-preserving operations Convex envelopes, cardinality

More information

Sparse One Hidden Layer MLPs

Sparse One Hidden Layer MLPs Sparse One Hidden Layer MLPs Alberto Torres, David Díaz and José R. Dorronsoro Universidad Autónoma de Madrid - Departamento de Ingeniería Informática Tomás y Valiente 11, 8049 Madrid - Spain Abstract.

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.18

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.18 XVIII - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No18 Linearized alternating direction method with Gaussian back substitution for separable convex optimization

More information

Accelerated gradient methods

Accelerated gradient methods ELE 538B: Large-Scale Optimization for Data Science Accelerated gradient methods Yuxin Chen Princeton University, Spring 018 Outline Heavy-ball methods Nesterov s accelerated gradient methods Accelerated

More information

Lecture 6 : Projected Gradient Descent

Lecture 6 : Projected Gradient Descent Lecture 6 : Projected Gradient Descent EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Consider the following update. x l+1 = Π C (x l α f(x l )) Theorem Say f : R d R is (m, M)-strongly

More information