Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
|
|
- Tobias Russell
- 5 years ago
- Views:
Transcription
1 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra
2 Optimal gradient methods 2 / 27
3 Optimal gradient methods We saw following efficiency estimates for the gradient method f CL 1 : f(x k ) f 2L x0 x 2 2 k + 4 ( ) L µ 2k f SL,µ 1 : f(x k ) f L 2 x 0 x 2 L + µ 2. 3 / 27
4 Optimal gradient methods We saw following efficiency estimates for the gradient method f CL 1 : f(x k ) f 2L x0 x 2 2 k + 4 ( ) L µ 2k f SL,µ 1 : f(x k ) f L 2 x 0 x 2 L + µ 2. We also saw lower complexity bounds f CL 1 : f(x k ) f(x ) 3L x0 x (k + 1) 2 ( fsl,µ : f(x k ) f(x ) µ ) 2k L µ 2 x 0 x 2 2. L + µ 3 / 27
5 Optimal gradient methods We saw following efficiency estimates for the gradient method f CL 1 : f(x k ) f 2L x0 x 2 2 k + 4 ( ) L µ 2k f SL,µ 1 : f(x k ) f L 2 x 0 x 2 L + µ 2. We also saw lower complexity bounds f CL 1 : f(x k ) f(x ) 3L x0 x (k + 1) 2 ( fsl,µ : f(x k ) f(x ) µ ) 2k L µ 2 x 0 x 2 2. L + µ Can we close the gap? 3 / 27
6 Gradient with momentum Polyak s method (aka heavy-ball) for f S 1 L,µ x k+1 = x k α k f(x k ) + β k (x k x k 1 ) 4 / 27
7 Gradient with momentum Polyak s method (aka heavy-ball) for f S 1 L,µ x k+1 = x k α k f(x k ) + β k (x k x k 1 ) Converges (locally, i.e., for x 0 x 2 ɛ) as x k x 2 2 ( L µ L + µ ) 2k x 0 x 2 2, ( ) 2 4 for α k = ( L+ and β L µ µ) 2 k = L+ µ 4 / 27
8 Nesterov s optimal gradient method min x f(x), where S 1 L,µ with µ 0 5 / 27
9 Nesterov s optimal gradient method 1. Choose x 0 R n, α 0 (0, 1) 2. Let y 0 x 0 ; set q = µ/l min x f(x), where S 1 L,µ with µ 0 5 / 27
10 Nesterov s optimal gradient method min x f(x), where S 1 L,µ with µ 0 1. Choose x 0 R n, α 0 (0, 1) 2. Let y 0 x 0 ; set q = µ/l 3. k-th iteration (k 0): a). Compute f(y k ) and f(y k ); update primary solution x k+1 = y k 1 L f(yk ) 5 / 27
11 Nesterov s optimal gradient method min x f(x), where S 1 L,µ with µ 0 1. Choose x 0 R n, α 0 (0, 1) 2. Let y 0 x 0 ; set q = µ/l 3. k-th iteration (k 0): a). Compute f(y k ) and f(y k ); update primary solution x k+1 = y k 1 L f(yk ) b). Compute stepsize α k+1 by solving α 2 k+1 = (1 α k+1 )α 2 k + qα k+1 5 / 27
12 Nesterov s optimal gradient method min x f(x), where S 1 L,µ with µ 0 1. Choose x 0 R n, α 0 (0, 1) 2. Let y 0 x 0 ; set q = µ/l 3. k-th iteration (k 0): a). Compute f(y k ) and f(y k ); update primary solution x k+1 = y k 1 L f(yk ) b). Compute stepsize α k+1 by solving α 2 k+1 = (1 α k+1 )α 2 k + qα k+1 c). Set β k = α k (1 α k )/(α 2 k + α k+1) d). Update secondary solution y k+1 = x k+1 + β k (x k+1 x k ) 5 / 27
13 Optimal gradient method rate Theorem Let { x k} be sequence generated by above algorithm. If α 0 µ/l, then f(x k ) f(x ) c 1 min { ( 1 where constants c 1, c 2 depend on α 0, L, µ. Proof: Somewhat involved; see notes. µ ) } k, 4L L (2, L + c 2 k) 2 6 / 27
14 Strongly convex case simplification If µ > 0, select α 0 = µ/l. The two main steps get simplified: 1. Set β k = α k (1 α k )/(α 2 k + α k+1) 2. y k+1 = x k+1 + β k (x k+1 x k ) α k = µ L µ L β k =, k 0. L + µ 7 / 27
15 Strongly convex case simplification If µ > 0, select α 0 = µ/l. The two main steps get simplified: 1. Set β k = α k (1 α k )/(α 2 k + α k+1) 2. y k+1 = x k+1 + β k (x k+1 x k ) α k = µ L µ L β k =, k 0. L + µ Optimal method simplifies to 1. Choose y 0 = x 0 R n 2. k-th iteration (k 0): 7 / 27
16 Strongly convex case simplification If µ > 0, select α 0 = µ/l. The two main steps get simplified: 1. Set β k = α k (1 α k )/(α 2 k + α k+1) 2. y k+1 = x k+1 + β k (x k+1 x k ) α k = µ L µ L β k =, k 0. L + µ Optimal method simplifies to 1. Choose y 0 = x 0 R n 2. k-th iteration (k 0): a). x k+1 = y k 1 L f(yk ) b). y k+1 = x k+1 + β(x k+1 x k ) 7 / 27
17 Strongly convex case simplification If µ > 0, select α 0 = µ/l. The two main steps get simplified: 1. Set β k = α k (1 α k )/(α 2 k + α k+1) 2. y k+1 = x k+1 + β k (x k+1 x k ) α k = µ L µ L β k =, k 0. L + µ Optimal method simplifies to 1. Choose y 0 = x 0 R n 2. k-th iteration (k 0): a). x k+1 = y k 1 L f(yk ) b). y k+1 = x k+1 + β(x k+1 x k ) Notice similarity to Polyak s method! 7 / 27
18 Summary so far Convex f(x) with f G subgradient method Differentiable f CL 1 using gradient methods Rate of convergence for smooth convex problems Faster rate of convergence for smooth, strongly convex Constrained gradient methods Frank-Wolfe method Constrained gradient methods gradient projection Nesterov s optimal gradient methods (smooth) 8 / 27
19 Summary so far Convex f(x) with f G subgradient method Differentiable f CL 1 using gradient methods Rate of convergence for smooth convex problems Faster rate of convergence for smooth, strongly convex Constrained gradient methods Frank-Wolfe method Constrained gradient methods gradient projection Nesterov s optimal gradient methods (smooth) Gap between lower and upper bounds O(1/ t) convex (subgradient method); O(1/t 2 ) for CL 1 ; linear for smoooth, strongly convex 8 / 27
20 Nonsmooth optimization Unconstrained problem: min f(x), where x R n f convex on R n, and Lipschitz cont. on bounded set f(x) f(y) L x y 2, x, y X. 9 / 27
21 Nonsmooth optimization Unconstrained problem: min f(x), where x R n f convex on R n, and Lipschitz cont. on bounded set f(x) f(y) L x y 2, x, y X. At each step, we access to f(x) and g f(x) 9 / 27
22 Nonsmooth optimization Unconstrained problem: min f(x), where x R n f convex on R n, and Lipschitz cont. on bounded set f(x) f(y) L x y 2, x, y X. At each step, we access to f(x) and g f(x) Find x k R n such that f(x k ) f ɛ 9 / 27
23 Nonsmooth optimization Unconstrained problem: min f(x), where x R n f convex on R n, and Lipschitz cont. on bounded set f(x) f(y) L x y 2, x, y X. At each step, we access to f(x) and g f(x) Find x k R n such that f(x k ) f ɛ First-order methods: x k x 0 + span { g 0,..., g k 1} 9 / 27
24 Nonsmooth optimization EXAMPLE Let φ(x) = x for x R 10 / 27
25 Nonsmooth optimization EXAMPLE Let φ(x) = x for x R Subgradient method x k+1 = x k α k g k, where g k x k. 10 / 27
26 Nonsmooth optimization EXAMPLE Let φ(x) = x for x R Subgradient method x k+1 = x k α k g k, where g k x k. If x 0 = 1 and α k = 1 k k+2 (this stepsize is known to be optimal), then x k = 1 k+1 10 / 27
27 Nonsmooth optimization EXAMPLE Let φ(x) = x for x R Subgradient method x k+1 = x k α k g k, where g k x k. If x 0 = 1 and α k = 1 k k+2 (this stepsize is known to be optimal), then x k = 1 k+1 Thus, O( 1 ɛ 2 ) iterations are needed to obtain ɛ-accuracy. 10 / 27
28 Nonsmooth optimization EXAMPLE Let φ(x) = x for x R Subgradient method x k+1 = x k α k g k, where g k x k. If x 0 = 1 and α k = 1 k k+2 (this stepsize is known to be optimal), then x k = 1 k+1 Thus, O( 1 ɛ 2 ) iterations are needed to obtain ɛ-accuracy. This behavior typical for the subgradient method which exhibits O(1/ k) convergence in general 10 / 27
29 Nonsmooth optimization EXAMPLE Let φ(x) = x for x R Subgradient method x k+1 = x k α k g k, where g k x k. If x 0 = 1 and α k = 1 k k+2 (this stepsize is known to be optimal), then x k = 1 k+1 Thus, O( 1 ɛ 2 ) iterations are needed to obtain ɛ-accuracy. This behavior typical for the subgradient method which exhibits O(1/ k) convergence in general Can we do better in general? 10 / 27
30 Nonsmooth optimization Nope! 11 / 27
31 Nonsmooth optimization Nope! Theorem (Nesterov.) Let B = { x x x 0 2 D }. Assume, x B. There exists a convex function f in CL 0 (B) (with L > 0), such that for 0 k n 1, the lower-bound f(x k ) f(x ) LD 2(1+ k+1), holds for any algorithm that generates x k by linearly combining the previous iterates and subgradients. 11 / 27
32 Nonsmooth optimization Nope! Theorem (Nesterov.) Let B = { x x x 0 2 D }. Assume, x B. There exists a convex function f in CL 0 (B) (with L > 0), such that for 0 k n 1, the lower-bound f(x k ) f(x ) LD 2(1+ k+1), holds for any algorithm that generates x k by linearly combining the previous iterates and subgradients. Should we give up? 11 / 27
33 Nonsmooth optimization Nope! Theorem (Nesterov.) Let B = { x x x 0 2 D }. Assume, x B. There exists a convex function f in CL 0 (B) (with L > 0), such that for 0 k n 1, the lower-bound f(x k ) f(x ) LD 2(1+ k+1), holds for any algorithm that generates x k by linearly combining the previous iterates and subgradients. Should we give up? No! 11 / 27
34 Nonsmooth optimization Nope! Theorem (Nesterov.) Let B = { x x x 0 2 D }. Assume, x B. There exists a convex function f in CL 0 (B) (with L > 0), such that for 0 k n 1, the lower-bound f(x k ) f(x ) LD 2(1+ k+1), holds for any algorithm that generates x k by linearly combining the previous iterates and subgradients. Should we give up? No! Several possibilities remain! 11 / 27
35 Nonsmooth optimization Blackbox too pessimistic 12 / 27
36 Nonsmooth optimization Blackbox too pessimistic Nesterov s breakthroughs Excessive gap technique Composite objective minimization 12 / 27
37 Nonsmooth optimization Blackbox too pessimistic Nesterov s breakthroughs Excessive gap technique Composite objective minimization Nemirovski s workshorse of general convex optimization Mirror-descent, NERML Mirror-prox 12 / 27
38 Nonsmooth optimization Blackbox too pessimistic Nesterov s breakthroughs Excessive gap technique Composite objective minimization Nemirovski s workshorse of general convex optimization Mirror-descent, NERML Mirror-prox Other techniques, problem classes, etc. 12 / 27
39 Proximal splitting 13 / 27
40 Composite objectives Frequently nonsmooth problems take the form minimize f(x) := l(x) + r(x) 14 / 27
41 Composite objectives Frequently nonsmooth problems take the form minimize f(x) := l(x) + r(x) l + r 14 / 27
42 Composite objectives Frequently nonsmooth problems take the form minimize f(x) := l(x) + r(x) l + r Example: l(x) = 1 2 Ax b 2 and r(x) = λ x 1 Lasso, L1-LS, compressed sensing 14 / 27
43 Composite objectives Frequently nonsmooth problems take the form minimize f(x) := l(x) + r(x) l + r Example: l(x) = 1 2 Ax b 2 and r(x) = λ x 1 Lasso, L1-LS, compressed sensing Example: l(x) : Logistic loss, and r(x) = λ x 1 L1-Logistic regression, sparse LR 14 / 27
44 Composite objective minimization minimize f(x) := l(x) + r(x) subgradient: x k+1 = x k α k g k, g k f(x k ) 15 / 27
45 Composite objective minimization minimize f(x) := l(x) + r(x) subgradient: x k+1 = x k α k g k, g k f(x k ) subgradient: converges slowly at rate O(1/ k) 15 / 27
46 Composite objective minimization minimize f(x) := l(x) + r(x) subgradient: x k+1 = x k α k g k, g k f(x k ) subgradient: converges slowly at rate O(1/ k) but: f is smooth plus nonsmooth we should exploit: smoothness of l for better method! 15 / 27
47 Projections: another view Let I X be the indicator function for closed, cvx X, defined as: { 0 if x X, I X (x) := otherwise. 16 / 27
48 Projections: another view Let I X be the indicator function for closed, cvx X, defined as: { 0 if x X, I X (x) := otherwise. Recall orthogonal projection P X (y) P X (y) := argmin 1 2 x y 2 2 s.t. x X. 16 / 27
49 Projections: another view Let I X be the indicator function for closed, cvx X, defined as: { 0 if x X, I X (x) := otherwise. Recall orthogonal projection P X (y) P X (y) := argmin 1 2 x y 2 2 s.t. x X. Rewrite orthogonal projection P X (y) as P X (y) := argmin x R n 1 2 x y I X (x). 16 / 27
50 Generalizing projections proximity Projection P X (y) := argmin x R n 1 2 x y I X (x) 17 / 27
51 Generalizing projections proximity Projection P X (y) := argmin x R n 1 2 x y I X (x) Proximity: Replace I X by some convex function! prox r (y) := argmin x R n 1 2 x y r(x) 17 / 27
52 Generalizing projections proximity Projection P X (y) := argmin x R n 1 2 x y I X (x) Proximity: Replace I X by some convex function! prox r (y) := argmin x R n 1 2 x y r(x) Def. prox R : R n R n is called a proximity operator 17 / 27
53 Proximity operator l 1 -norm ball of radius ρ(λ) 18 / 27
54 Proximity operators Exercise: Let r(x) = x 1. Solve prox λr (y). 1 min x R n 2 x y λ x 1. Hint 1: The above problem decomposes into n independent subproblems of the form min x R 1 2 (x y)2 + λ x. Hint 2: Consider the two cases separately: either x = 0 or x 0 19 / 27
55 Proximity operators Exercise: Let r(x) = x 1. Solve prox λr (y). 1 min x R n 2 x y λ x 1. Hint 1: The above problem decomposes into n independent subproblems of the form min x R 1 2 (x y)2 + λ x. Hint 2: Consider the two cases separately: either x = 0 or x 0 Aka: Soft-thresholding operator 19 / 27
56 Basics of proximal splitting Recall Gradient projection for solving min X f(x) for f C 1 L : x k+1 = P X (x k α k f(x k )) 20 / 27
57 Basics of proximal splitting Recall Gradient projection for solving min X f(x) for f C 1 L : x k+1 = P X (x k α k f(x k )) Proximal gradient method solves min l(x) + r(x) x k+1 = prox αk r(x k α k f(x k )). 20 / 27
58 Basics of proximal splitting Recall Gradient projection for solving min X f(x) for f C 1 L : x k+1 = P X (x k α k f(x k )) Proximal gradient method solves min l(x) + r(x) x k+1 = prox αk r(x k α k f(x k )). This method aka: Forward-backward splitting (FBS) Forward step: The gradient-descent step Backward step: The prox-operator 20 / 27
59 FBS example Lasso / L1-LS min 1 2 Ax b λ x / 27
60 FBS example Lasso / L1-LS min 1 2 Ax b λ x 1. prox λ x 1 y = sgn(y) max( y λ, 0) x k+1 = prox αk λ 1 (x k α k A T (Ax k b)). so-called iterative soft-thresholding algorithm! 21 / 27
61 FBS example Lasso / L1-LS min 1 2 Ax b λ x 1. prox λ x 1 y = sgn(y) max( y λ, 0) x k+1 = prox αk λ 1 (x k α k A T (Ax k b)). so-called iterative soft-thresholding algorithm! Exercise: Try solving the problem: min 1 2 Ax b λ x / 27
62 Exercise Recall our older example: 1 2 DT x b 2 2. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve min 1 2 DT x b λ x / 27
63 Exercise Recall our older example: 1 2 DT x b 2 2. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve Use FBS as shown above min 1 2 DT x b λ x 1. Try different values of λ > 0 in your code 22 / 27
64 Exercise Recall our older example: 1 2 DT x b 2 2. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve Use FBS as shown above min 1 2 DT x b λ x 1. Try different values of λ > 0 in your code Use different choices of b 22 / 27
65 Exercise Recall our older example: 1 2 DT x b 2 2. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve Use FBS as shown above min 1 2 DT x b λ x 1. Try different values of λ > 0 in your code Use different choices of b Show a choice of b for which x = 0 (zero vector) is optimal 22 / 27
66 Exercise Recall our older example: 1 2 DT x b 2 2. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve Use FBS as shown above min 1 2 DT x b λ x 1. Try different values of λ > 0 in your code Use different choices of b Show a choice of b for which x = 0 (zero vector) is optimal Experiment with different stepsizes (try α k = 1/4 and if that does not work try smaller values that might work). Do not expect monotonic descent 22 / 27
67 Exercise Recall our older example: 1 2 DT x b 2 2. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve Use FBS as shown above min 1 2 DT x b λ x 1. Try different values of λ > 0 in your code Use different choices of b Show a choice of b for which x = 0 (zero vector) is optimal Experiment with different stepsizes (try α k = 1/4 and if that does not work try smaller values that might work). Do not expect monotonic descent Compare with versions of the subgradient method 22 / 27
68 Proximity operators 23 / 27
69 Proximity operators prox r has several nice properties Read / Skim the paper: Proximal Splitting Methods in Signal Processing, by Combettes and Pesquet (2010). Theorem The operator prox r is firmly nonexpansive (FNE) prox r x prox r y 2 2 prox r x prox r y, x y Proof: (blackboard) Corollary. The operator prox r is nonexpansive Proof: apply Cauchy-Schwarz to FNE. 24 / 27
70 Consequence of FNE Gradient projection x k+1 = P X (x k α k f(x k )) Proximal gradients / FBS x k+1 = prox αk r(x k α k f(x k )) Same convergence theory goes through! Exercise: Try extending proof of gradient-projection convergence to convergence for FBS. Hint: First show that at x, the fixed-point equation x = prox αr (x α f(x )), α > 0 25 / 27
71 Moreau Decomposition Aim: Compute prox r y Sometimes it is easier to compute prox r y r (u) := sup u T x r(x) x Moreau decomposition: y = prox R y + prox R y 26 / 27
72 Moreau decomposition Proof sketch: Consider min 1 2 x y r(x) 27 / 27
73 Moreau decomposition Proof sketch: Consider min 1 2 x y r(x) Introduce new variable z = x, to get prox r y := 1 2 x y r(z), s.t. x = z 27 / 27
74 Moreau decomposition Proof sketch: Consider min 1 2 x y r(x) Introduce new variable z = x, to get prox r y := 1 2 x y r(z), s.t. x = z Derive Lagrangian dual for this 27 / 27
75 Moreau decomposition Proof sketch: Consider min 1 2 x y r(x) Introduce new variable z = x, to get prox r y := 1 2 x y r(z), s.t. x = z Derive Lagrangian dual for this Simplify, and conclude! 27 / 27
Optimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationCoordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationProximal methods. S. Villa. October 7, 2014
Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationNOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained
NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationProximal Methods for Optimization with Spasity-inducing Norms
Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology
More informationEE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1
EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationPrimal-dual Subgradient Method for Convex Problems with Functional Constraints
Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual
More informationDual Proximal Gradient Method
Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method
More informationGradient Methods Using Momentum and Memory
Chapter 3 Gradient Methods Using Momentum and Memory The steepest descent method described in Chapter always steps in the negative gradient direction, which is orthogonal to the boundary of the level set
More informationAccelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)
Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationOptimization for Learning and Big Data
Optimization for Learning and Big Data Donald Goldfarb Department of IEOR Columbia University Department of Mathematics Distinguished Lecture Series May 17-19, 2016. Lecture 1. First-Order Methods for
More informationOslo Class 6 Sparsity based regularization
RegML2017@SIMULA Oslo Class 6 Sparsity based regularization Lorenzo Rosasco UNIGE-MIT-IIT May 4, 2017 Learning from data Possible only under assumptions regularization min Ê(w) + λr(w) w Smoothness Sparsity
More informationDual Decomposition.
1/34 Dual Decomposition http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/34 1 Conjugate function 2 introduction:
More informationMATH 829: Introduction to Data Mining and Analysis Computing the lasso solution
1/16 MATH 829: Introduction to Data Mining and Analysis Computing the lasso solution Dominique Guillot Departments of Mathematical Sciences University of Delaware February 26, 2016 Computing the lasso
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationThis can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization
This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationConvergence of Fixed-Point Iterations
Convergence of Fixed-Point Iterations Instructor: Wotao Yin (UCLA Math) July 2016 1 / 30 Why study fixed-point iterations? Abstract many existing algorithms in optimization, numerical linear algebra, and
More informationSparse Optimization Lecture: Dual Methods, Part I
Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration
More informationComposite nonlinear models at scale
Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationSmoothing Proximal Gradient Method. General Structured Sparse Regression
for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:
More informationSplitting methods for decomposing separable convex programs
Splitting methods for decomposing separable convex programs Philippe Mahey LIMOS - ISIMA - Université Blaise Pascal PGMO, ENSTA 2013 October 4, 2013 1 / 30 Plan 1 Max Monotone Operators Proximal techniques
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationr=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J
7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured
More informationThe Frank-Wolfe Algorithm:
The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology
More informationLecture 23: November 19
10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 4. Suvrit Sra. (Conjugates, subdifferentials) 31 Jan, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 4 (Conjugates, subdifferentials) 31 Jan, 2013 Suvrit Sra Organizational HW1 due: 14th Feb 2013 in class. Please L A TEX your solutions (contact TA if this
More informationI P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION
I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1
More informationAdaptive Restarting for First Order Optimization Methods
Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation
More informationThe Proximal Gradient Method
Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,
More informationGradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationSelected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018
Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected
More informationPrimal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions
Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function
More informationJournal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19
Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal
More informationDual and primal-dual methods
ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationConvex Optimization. Newton s method. ENSAE: Optimisation 1/44
Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)
More informationComposite Objective Mirror Descent
Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research
More informationDual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationDescent methods. min x. f(x)
Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained
More informationNonlinear Optimization for Optimal Control
Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]
More informationProximal gradient methods
ELE 538B: Large-Scale Optimization for Data Science Proximal gradient methods Yuxin Chen Princeton University, Spring 08 Outline Proximal gradient descent for composite functions Proximal mapping / operator
More informationTight Rates and Equivalence Results of Operator Splitting Schemes
Tight Rates and Equivalence Results of Operator Splitting Schemes Wotao Yin (UCLA Math) Workshop on Optimization for Modern Computing Joint w Damek Davis and Ming Yan UCLA CAM 14-51, 14-58, and 14-59 1
More informationLecture: Smoothing.
Lecture: Smoothing http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Smoothing 2/26 introduction smoothing via conjugate
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationPrimal-dual coordinate descent
Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x
More informationAbout Split Proximal Algorithms for the Q-Lasso
Thai Journal of Mathematics Volume 5 (207) Number : 7 http://thaijmath.in.cmu.ac.th ISSN 686-0209 About Split Proximal Algorithms for the Q-Lasso Abdellatif Moudafi Aix Marseille Université, CNRS-L.S.I.S
More informationLecture 14: October 17
1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationNon-negative Matrix Factorization via accelerated Projected Gradient Descent
Non-negative Matrix Factorization via accelerated Projected Gradient Descent Andersen Ang Mathématique et recherche opérationnelle UMONS, Belgium Email: manshun.ang@umons.ac.be Homepage: angms.science
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationORIE 4741: Learning with Big Messy Data. Proximal Gradient Method
ORIE 4741: Learning with Big Messy Data Proximal Gradient Method Professor Udell Operations Research and Information Engineering Cornell November 13, 2017 1 / 31 Announcements Be a TA for CS/ORIE 1380:
More informationMath 273a: Optimization Overview of First-Order Optimization Algorithms
Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization
More informationConvex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization
Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationLecture 17: October 27
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More information5. Subgradient method
L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize
More informationThe Alternating Direction Method of Multipliers
The Alternating Direction Method of Multipliers With Adaptive Step Size Selection Peter Sutor, Jr. Project Advisor: Professor Tom Goldstein December 2, 2015 1 / 25 Background The Dual Problem Consider
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More information8 Numerical methods for unconstrained problems
8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields
More informationSIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University
SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More informationOne Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,
More informationAlternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization
Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote
More informationGeneralized Conditional Gradient and Its Applications
Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized
More informationMIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco
MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 08: Sparsity Based Regularization Lorenzo Rosasco Learning algorithms so far ERM + explicit l 2 penalty 1 min w R d n n l(y
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationCoordinate descent methods
Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5
More informationA GENERAL FRAMEWORK FOR A CLASS OF FIRST ORDER PRIMAL-DUAL ALGORITHMS FOR TV MINIMIZATION
A GENERAL FRAMEWORK FOR A CLASS OF FIRST ORDER PRIMAL-DUAL ALGORITHMS FOR TV MINIMIZATION ERNIE ESSER XIAOQUN ZHANG TONY CHAN Abstract. We generalize the primal-dual hybrid gradient (PDHG) algorithm proposed
More informationLecture 15 Newton Method and Self-Concordance. October 23, 2008
Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications
More informationDouglas-Rachford Splitting: Complexity Estimates and Accelerated Variants
53rd IEEE Conference on Decision and Control December 5-7, 204. Los Angeles, California, USA Douglas-Rachford Splitting: Complexity Estimates and Accelerated Variants Panagiotis Patrinos and Lorenzo Stella
More information10. Unconstrained minimization
Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More information