Primal and Dual Variables Decomposition Methods in Convex Optimization


1 Primal and Dual Variables Decomposition Methods in Convex Optimization. Amir Beck, Technion - Israel Institute of Technology, Haifa, Israel. Based on joint works with Edouard Pauwels, Shoham Sabach, Luba Tetruashvili and Yakov Vaisbourd. Optimization in Machine Learning, Vision and Image Processing, Toulouse, France, 6-7 October 2015.

2 A (too?) General Model. (P) min H(x_1, x_2, ..., x_p) s.t. x_i ∈ X_i, where X_i ⊆ R^{n_i} so that n = Σ_{i=1}^p n_i. At each iteration of a block variables decomposition method, an operation involving only one of the block variables x_1, x_2, ..., x_p is performed.

3 A (too?) General Model. (P) min H(x_1, x_2, ..., x_p) s.t. x_i ∈ X_i, where X_i ⊆ R^{n_i} so that n = Σ_{i=1}^p n_i. The Alternating Minimization (AM) method sequentially minimizes H w.r.t. each component in a cyclic manner. Alternating Minimization: at step k, given x^k, the next iterate x^{k+1} is computed as follows:
x_1^{k+1} ∈ argmin_{x_1} H(x_1, x_2^k, ..., x_p^k),
x_2^{k+1} ∈ argmin_{x_2} H(x_1^{k+1}, x_2, x_3^k, ..., x_p^k),
...
x_p^{k+1} ∈ argmin_{x_p} H(x_1^{k+1}, ..., x_{p-1}^{k+1}, x_p).
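As a small illustration (not from the talk), the AM scheme above can be sketched in a few lines; the quadratic objective H below is an assumed toy example chosen so that both block minimizers are available in closed form:

```python
def H(x, y):
    # toy coupled convex objective (an illustrative stand-in for the model's H)
    return (x - y) ** 2 + (x - 1) ** 2 + (y + 1) ** 2

def alternating_minimization(x0=0.0, y0=0.0, iters=100):
    x, y = x0, y0
    for _ in range(iters):
        x = (y + 1) / 2.0    # argmin_x H(x, y), from setting dH/dx = 0
        y = (x - 1) / 2.0    # argmin_y H(x, y), from setting dH/dy = 0
    return x, y

x, y = alternating_minimization()   # converges to the minimizer (1/3, -1/3)
```

Each block update here is a contraction, so the iterates converge; the examples that follow show this is not automatic in general.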

5 Block Descent Methods. The AM method is just one example of a block descent (or variables decomposition) method. Other variants replace, for example, the exact minimization step with some kind of descent operator. General Block Descent Method: for i = 1 : p,
x_i^{k+1} = T_i(x_1^{k+1}, ..., x_{i-1}^{k+1}, x_i^k, x_{i+1}^k, ..., x_p^k),
where T_i : R^n → R^{n_i} is a descent operator (such as one step of a minimization method). Additional variants of the method consider index selection strategies other than cyclic (essentially cyclic, Gauss-Southwell), and deterministic index selection strategies can be replaced by randomized ones.

7 Example 1 of AM: IRLS - Iteratively Reweighted Least Squares. The model:
(N) min s(y) + Σ_{i=1}^m ‖A_i y + b_i‖_2 s.t. y ∈ X,
where A_i ∈ R^{k_i × n}, b_i ∈ R^{k_i}, i = 1, 2, ..., m, and s is continuously differentiable over the closed and convex set X ⊆ R^n. Examples:
l_1-norm linear regression: min ‖By − c‖_1.
Fermat-Weber problem: (FW) min Σ_{i=1}^m ω_i ‖y − a_i‖.
l_1-regularized least squares: min ‖By − c‖² + λ‖Dy‖_1.

9 IRLS - Iteratively Reweighted Least Squares. Wishful thinking... Initialization: y^0 ∈ X. General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈X} { s(y) + (1/2) Σ_{i=1}^m ‖A_i y + b_i‖² / ‖A_i y^k + b_i‖ }.
Practical method: η-IRLS. Input: η > 0 - a given parameter. Initialization: y^0 ∈ X. General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈X} { s(y) + (1/2) Σ_{i=1}^m ‖A_i y + b_i‖² / √(‖A_i y^k + b_i‖² + η²) }.
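A minimal sketch of the η-IRLS step for the unconstrained l_1-regression instance (s ≡ 0, X = R^n, the A_i being the rows of B): every iteration is a weighted least-squares solve whose weights come from the current residuals. The data B, c below are hypothetical:

```python
import numpy as np

def irls_l1(B, c, eta=1e-6, iters=50):
    # eta-IRLS sketch for min ||B y - c||_1:
    # each iteration solves a weighted least-squares problem whose weights
    # are recomputed from the current residuals (smoothed by eta)
    y = np.zeros(B.shape[1])
    for _ in range(iters):
        r = B @ y - c
        w = 1.0 / np.sqrt(r ** 2 + eta ** 2)   # smoothed reweighting
        Bw = B * w[:, None]
        y = np.linalg.solve(B.T @ Bw, Bw.T @ c)
    return y

# hypothetical data: with B a column of ones, the l1 solution is the median of c
B = np.ones((5, 1))
c = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y = irls_l1(B, c)
```

Note the robustness to the outlier 100: the l_1 fit stays near the median, unlike a plain least-squares fit.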

10 IRLS - Literature. Popular approach in robust regression (McCullagh & Nelder '83); applications in sparse recovery (Daubechies et al. '10); same as Weiszfeld's method (from 1937) for solving the Fermat-Weber problem (η = 0?). Convergence results are known only for very specific instances [Bissantz et al. '08 (specific unconstrained model, asymptotic linear rate of convergence), Daubechies et al. '10 (asymptotic linear rate, basis pursuit problem)].

11 IRLS AM. Auxiliary problem:
(A) min h_η(y, z) ≡ s(y) + (1/2) Σ_{i=1}^m [ (‖A_i y + b_i‖² + η²)/z_i + z_i ] s.t. y ∈ X, z ∈ [η/2, ∞)^m.
Minimizing w.r.t. z implies that the problem is a smoothed version of (N):
(N_η) min s(y) + Σ_{i=1}^m √(‖A_i y + b_i‖² + η²) s.t. y ∈ X.

14 IRLS AM. (y^k, z^k) - the k-th iterate of the AM method. The z-step in AM: z_i = √(‖A_i y^k + b_i‖² + η²). The y-step in AM:
y^{k+1} ∈ argmin_{y∈X} { s(y) + (1/2) Σ_{i=1}^m ‖A_i y + b_i‖² / √(‖A_i y^k + b_i‖² + η²) }.
The two methods are equivalent given that the initial z^0 is chosen as [z^0]_i = √(‖A_i y^0 + b_i‖² + η²).

16 Example 2 of AM: A Composite Model.
T* = min { T(y) ≡ q(y) + r(Ay) },
q : R^n → R ∪ {∞} closed, proper, convex; r : R^m → R a real-valued convex function. A popular penalized approach:
Consider the problem min_{y,z} { q(y) + r(z) : z = Ay }.
Write a penalized version: (C) T_ρ* = min_{y,z} { T_ρ(y, z) = q(y) + r(z) + (ρ/2)‖z − Ay‖² }.
Employ the AM method.

18 AM for solving (C). Alternating Minimization for Solving (C). Input: ρ > 0 - a given parameter. Initialization: y^0 ∈ R^n, z^0 ∈ argmin { r(z) + (ρ/2)‖z − Ay^0‖² }. General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈R^n} { q(y) + (ρ/2)‖z^k − Ay‖² },
z^{k+1} = argmin_{z∈R^m} { r(z) + (ρ/2)‖z − Ay^{k+1}‖² }.
Implementable in several important examples, e.g., min ‖Cx − d‖² + λ‖Lx‖_1 (prox of l_1 + solution of a linear system).
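A sketch of this AM scheme on the example min 0.5‖Cx − d‖² + λ‖Lx‖_1 (the penalty parameter ρ and the data below are illustrative assumptions): the x-step is a linear solve with a fixed system matrix, the z-step a soft-thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def am_penalized(C, d, L, lam, rho, iters=200):
    # AM sketch for the penalized problem
    #   min_{x,z} 0.5||Cx - d||^2 + lam||z||_1 + (rho/2)||z - Lx||^2,
    # a surrogate for min 0.5||Cx - d||^2 + lam||Lx||_1:
    # x-step = linear solve, z-step = soft-thresholding (prox of l1)
    x = np.zeros(C.shape[1])
    z = np.zeros(L.shape[0])
    M = C.T @ C + rho * L.T @ L      # the x-step system matrix is fixed
    for _ in range(iters):
        x = np.linalg.solve(M, C.T @ d + rho * L.T @ z)
        z = soft_threshold(L @ x, lam / rho)
    return x, z

# illustrative data: with C = L = I the x-limit is a soft-thresholded d
C = L = np.eye(3)
d = np.array([1.0, 0.0, 2.0])
x, z = am_penalized(C, d, L, lam=0.5, rho=10.0)
```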

19 Example 3 of AM: k-means method in clustering. Input: n points A = {a_1, a_2, ..., a_n} ⊆ R^d and k - the number of clusters. The idea is to partition the data A into k subsets A_1, ..., A_k, called clusters. For each l ∈ {1, ..., k}, the cluster A_l is represented by its so-called center x_l. The clustering problem: determine k cluster centers x_1, x_2, ..., x_k such that the sum of squared distances from each point a_i to a nearest cluster center is minimized:
min_{x_1,...,x_k} Σ_{i=1}^n min_{l=1,...,k} ‖a_i − x_l‖².

21 Clustering: AM = k-means. Using the trick min{b_1, ..., b_k} = min{ b^T y : y ∈ Δ_k }, where Δ_k = { y ∈ R^k : e^T y = 1, y ≥ 0 }, we can reformulate:
min Σ_{i=1}^n Σ_{l=1}^k y_l^i ‖a_i − x_l‖² s.t. x_1, ..., x_k ∈ R^d, y^1, ..., y^n ∈ Δ_k.
k-means. Repeat:
Assignment step: assign each point a_i to a closest cluster center: A_l = { i : ‖a_i − x_l‖ ≤ ‖a_i − x_j‖ for all j = 1, ..., k }, l = 1, 2, ..., k.
Update step: cluster centers are averages: x_l = (1/|A_l|) Σ_{i∈A_l} a_i.
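The two AM blocks above map directly onto the familiar two-step loop; a minimal sketch (initializing the centers by sampling k data points is an assumption, not part of the slide):

```python
import numpy as np

def kmeans(A, k, iters=50, seed=0):
    # k-means as AM: assignment step (the y-block), then center update (the x-block)
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=k, replace=False)].astype(float)
    labels = np.zeros(len(A), dtype=int)
    for _ in range(iters):
        # assignment step: nearest center for every point
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        for l in range(k):
            if np.any(labels == l):
                centers[l] = A[labels == l].mean(axis=0)
    return centers, labels

# two well-separated (hypothetical) groups of points
A = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.1]])
centers, labels = kmeans(A, 2)
```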

22 Why Block Descent Methods? Simple and cheap updates at each iteration - suitable for large-scale applications. Allow larger step-sizes at each iteration. In some nonconvex settings - yield better-quality solutions.

23 Convergence?? Well... Convergence of the AM method is not always guaranteed. Powell's example ('73):
φ(x, y, z) = −xy − yz − zx + [x − 1]_+² + [−x − 1]_+² + [y − 1]_+² + [−y − 1]_+² + [z − 1]_+² + [−z − 1]_+².
φ is differentiable and nonconvex. Fixing y, z, it is easy to show that
argmin_x φ(x, y, z) = sgn(y + z)(1 + (1/2)|y + z|) if y + z ≠ 0, and any point of [−1, 1] if y + z = 0.
Similar formulas hold for minimizing w.r.t. y and z.

27 Powell's Example. Start with (−1 − ε, 1 + ε/2, −1 − ε/4). First six iterations:
(1 + ε/8, 1 + ε/2, −1 − ε/4)
(1 + ε/8, −1 − ε/16, −1 − ε/4)
(1 + ε/8, −1 − ε/16, 1 + ε/32)
(−1 − ε/64, −1 − ε/16, 1 + ε/32)
(−1 − ε/64, 1 + ε/128, 1 + ε/32)
(−1 − ε/64, 1 + ε/128, −1 − ε/256)
We are essentially back at the first point, but with ε/64 instead of ε. The process will continue to cycle around the 6 non-stationary points (1, 1, −1), (1, −1, −1), (1, −1, 1), (−1, −1, 1), (−1, 1, 1), (−1, 1, −1). Can be rectified if certain uniqueness/convexity/boundedness assumptions are made [Zadeh '70, Grippo & Sciandrone '00, Bertsekas & Tsitsiklis '89, Luo & Tseng '93, ...]

28 The Failure of Convexity. Unfortunately, in the absence of differentiability, even convexity is not enough to guarantee convergence. Example: f(x_1, x_2) = |3x_1 + 4x_2| + |x_1 + 2x_2|. The point (−2, 1.5) is a coordinatewise minimum, but not the global minimum (which is (0, 0)). Any block descent method might converge to this non-global point.

30 The Composite Model. We should thus consider a combination of: a smooth function (not necessarily convex, but with some boundedness assumptions); a nonsmooth separable convex function. The composite model:
(P) min H(x_1, ..., x_p) ≡ f(x_1, x_2, ..., x_p) + g(x), where g(x) = Σ_{i=1}^p g_i(x_i),
f : R^n → R - continuously differentiable with bounded level sets; g_i : R^{n_i} → (−∞, ∞] - closed, proper, convex. A coordinatewise minimum of (P) is a stationary point:
−∇_i f(x) ∈ ∂g_i(x_i), i = 1, 2, ..., p.

31 Basic Notation.
1. x^{k,i} = (x_1^{k+1}, ..., x_i^{k+1}, x_{i+1}^k, ..., x_p^k), so that x^{k,0} = x^k and x^{k,p} = x^{k+1}.
2. U_i ∈ R^{n × n_i}, i = 1, ..., p, are such that (U_1, U_2, ..., U_p) = I_n. Thus x_i = U_i^T x and x = Σ_{i=1}^p U_i x_i.

33 Two Types of Inexact Descent Methods. Moreau's proximal mapping: prox_h(x) = argmin_u { h(u) + (1/2)‖u − x‖² }.
Block Proximal Gradient. For i = 1 : p:
x_i^{k+1} = prox_{α_i^k g_i}( x_i^k − α_i^k ∇_i f(x^{k,i−1}) ).
g_i ≡ 0 - block gradient descent; g_i = δ_{X_i} - block projected gradient.
Block Conditional Gradient. For i = 1 : p:
p_i^k ∈ argmin_{p_i ∈ dom g_i} { ⟨∇_i f(x^{k,i−1}), p_i⟩ + g_i(p_i) },
x_i^{k+1} = x_i^k + α_i^k (p_i^k − x_i^k).
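A sketch of the block proximal gradient update with one-dimensional blocks on the lasso instance f(x) = 0.5‖Ax − b‖², g_i(x_i) = λ|x_i| (this concrete instance and the constant steps α_i^k = 1/L_i are illustrative assumptions):

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def bcpg_lasso(A, b, lam, iters=100):
    # cyclic block proximal gradient with 1-D blocks on
    #   f(x) = 0.5||Ax - b||^2,  g_i(x_i) = lam|x_i|,
    # with steps 1/L_i, L_i = ||A_i||^2 (the block Lipschitz constants)
    n = A.shape[1]
    x = np.zeros(n)
    L = (A ** 2).sum(axis=0)
    r = A @ x - b                    # residual, updated after each block step
    for _ in range(iters):
        for i in range(n):
            g = A[:, i] @ r          # i-th partial gradient at the current point
            xi = soft(x[i] - g / L[i], lam / L[i])
            r += A[:, i] * (xi - x[i])
            x[i] = xi
    return x

# illustrative data: for A = I the method returns soft(b, lam)
A = np.eye(3)
b = np.array([1.0, 0.1, 2.0])
x = bcpg_lasso(A, b, lam=0.5)
```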

35 Rates of Convergence. Main objective: study of nonasymptotic rates of convergence of block descent methods.
Usually, results on the convergence of BCGD and alternating minimization require assumptions such as strict/strong convexity or uniqueness of the minimum w.r.t. each block (Auslender '76, Bertsekas '99, Grippo & Sciandrone '99). Luo & Tseng ('93) showed that under strong convexity w.r.t. each block and the existence of an error bound, a linear rate of convergence can be obtained.
Randomized BCGD: Nesterov (2012) - O(1/k) rate of convergence in expectation for RBCGD; a result on function values (in probability). Richtarik & Takac (2012) - extension to composite functions (sum of a smooth and a block-separable nonsmooth function); improved bounds. Fercoq & Richtarik ['13] - accelerated and parallelized. Shalev-Shwartz & Tewari (2011) - stochastic coordinate descent applied to the l_1-regularized loss minimization problem; O(1/ε) result for expected values.

36 Results on BCPG. A nonasymptotic O(1/k) rate of convergence of cyclic BCGD/BCPG (g_i = δ_{X_i}) under general convexity and smoothness assumptions. An accelerated O(1/k²) version of BCGD.

37 Underlying Assumptions.
f is convex.
f is block-coordinate-wise Lipschitz continuous with local Lipschitz constants L_i: ‖∇_i f(x + U_i h_i) − ∇_i f(x)‖ ≤ L_i ‖h_i‖ for every h_i ∈ R^{n_i}.
(Consequently) ∇f is Lipschitz continuous; its constant is denoted by L: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for every x, y ∈ R^n.
S = {x : f(x) ≤ f(x^0)} is compact, and we denote R(x^0) ≡ max_{x∈R^n} max_{x*∈X*} { ‖x − x*‖ : f(x) ≤ f(x^0) }.

39 Sublinear Rate of Convergence. Theorem [sublinear rate of convergence of the BCGD method].
f(x^k) − f* ≤ 4 L_max (1 + p L² / L_min²) R²(x^0) / (k + 8/p), k = 0, 1, ...
The case p = 1: taking L_1 = L, we obtain (using Fejér monotonicity)
f(x^k) − f* ≤ 8 L ‖x* − x^0‖² / (k + 8).
For the gradient descent method the following slightly better bound is known:
f(x^k) − f* ≤ L ‖x* − x^0‖² / (2k).

40 Linear Rate in the Strongly Convex Case. Theorem [linear rate of convergence under strong convexity]. Suppose that f is also strongly convex with parameter σ > 0. Then
f(x^k) − f* ≤ ( 1 − σ / (2 L_max (1 + p L² / L_min²)) )^k (f(x^0) − f*), k = 0, 1, ...

41 Acceleration of BCGD. Nesterov's optimal scheme. Input: M > 0. Initialization: x^0 ∈ R^n, v^0 = x^0, γ_0 = M. General step (k = 0, 1, ...):
Compute α_k ∈ (0, 1) as the solution of the quadratic equation M α_k² = (1 − α_k) γ_k.
Set γ_{k+1} = (1 − α_k) γ_k and y^k = α_k v^k + (1 − α_k) x^k.
Core step: find x^{k+1} satisfying f(x^{k+1}) ≤ f(y^k) − (1/(2M)) ‖∇f(y^k)‖².
Set v^{k+1} = v^k − (α_k / γ_{k+1}) ∇f(y^k).

43 The Core Step. Usually x^{k+1} is chosen as y^k − t∇f(y^k). But we can replace it by:
Core step for the Accelerated BCGD Method (input: y^k, output: x^{k+1}). Set y_k^0 = y^k and define recursively
y_k^i = y_k^{i−1} − (1/L_i) U_i ∇_i f(y_k^{i−1}), i = 1, ..., p,
x^{k+1} = y_k^p.
Thanks to the sufficient decrease lemma we can write: Theorem. Let {x^k}_{k≥0} be the sequence generated by the optimal scheme with the BCGD core step. Then
f(x^k) − f* ≤ 8 L_max (1 + p L² / L_min²) ‖x* − x^0‖² / (k + 2)², k = 0, 1, ...
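A sketch of the optimal scheme above with the plain gradient core step x^{k+1} = y^k − (1/M)∇f(y^k); the slide's variant would replace that single line by one BCGD sweep over the blocks. The quadratic test function and the choice M = L are illustrative assumptions:

```python
import numpy as np

def accelerated_scheme(grad, M, x0, iters=500):
    # Nesterov's optimal scheme from the slide; the core step here is the
    # plain gradient step, which guarantees the required sufficient decrease
    x, v, gamma = x0.copy(), x0.copy(), M
    for _ in range(iters):
        # alpha_k in (0, 1) solving M alpha^2 = (1 - alpha) gamma_k
        alpha = (-gamma + np.sqrt(gamma ** 2 + 4 * M * gamma)) / (2 * M)
        gamma_next = (1 - alpha) * gamma
        y = alpha * v + (1 - alpha) * x
        g = grad(y)
        x = y - g / M                        # core step (sufficient decrease)
        v = v - (alpha / gamma_next) * g
        gamma = gamma_next
    return x

# toy quadratic f(x) = 0.5 x^T Q x with minimizer 0; M = L = lambda_max(Q)
Q = np.diag([1.0, 10.0, 100.0])
x_acc = accelerated_scheme(lambda u: Q @ u, M=100.0, x0=np.ones(3))
```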

45 The Constrained Problem. min{ f(x) : x ∈ X }, X = X_1 × X_2 × ... × X_p, with X_i ⊆ R^{n_i} closed and convex (i = 1, 2, ..., p). The Block Coordinate Projected Gradient (BCPG) method can also be shown to have an O(1/k) rate of convergence. Unknown Lipschitz constants: possible to incorporate a backtracking procedure; all rate-of-convergence results remain valid (up to slight changes in the constants). Shefi & Teboulle ['14] - generalization of the analysis to general proper, closed, convex functions g_i.

47 Results on the AM method with p = 2. More difficult to analyze in the absence of strong convexity - distances between consecutive iterates cannot be controlled. On the other hand, logic dictates that, if possible, exact minimization is better. Can theory substantiate this intuition? Yes, at least for p = 2...

48 The General Minimization Problem.
(P): min { H(y, z) ≡ f(y, z) + g_1(y) + g_2(z) : y ∈ R^{n_1}, z ∈ R^{n_2} }
A. g_1 : R^{n_1} → (−∞, ∞], g_2 : R^{n_2} → (−∞, ∞] are closed, proper and convex.
B. f - a convex and continuously differentiable function over dom g_1 × dom g_2.

49 Block Notation. x = (y, z). g : R^{n_1} × R^{n_2} → (−∞, ∞] is defined by g(x) = g(y, z) ≡ g_1(y) + g_2(z). In this notation: H(x) = f(x) + g(x). ∇_1 f(x) - gradient of f w.r.t. y; ∇_2 f(x) - gradient of f w.r.t. z.

50 Further Assumptions.
C. ∇_1 f is (uniformly) Lipschitz continuous w.r.t. y over dom g_1 with constant L_1 ∈ (0, ∞):
‖∇_1 f(y + d_1, z) − ∇_1 f(y, z)‖ ≤ L_1 ‖d_1‖ for all y, y + d_1 ∈ dom g_1, z ∈ dom g_2.
D. ∇_2 f is (uniformly) Lipschitz continuous w.r.t. z over dom g_2 with constant L_2 ∈ (0, ∞]:
‖∇_2 f(y, z + d_2) − ∇_2 f(y, z)‖ ≤ L_2 ‖d_2‖ for all y ∈ dom g_1, z, z + d_2 ∈ dom g_2.
When L_2 = ∞, [D] is meaningless!

51 The Alternating Minimization Method. Initialization: y^0 ∈ dom g_1 and z^0 ∈ argmin_{z∈R^{n_2}} f(y^0, z) + g_2(z). General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈R^{n_1}} f(y, z^k) + g_1(y),
z^{k+1} ∈ argmin_{z∈R^{n_2}} f(y^{k+1}, z) + g_2(z).
E. The optimal set of (P), denoted X*, is nonempty, and the minimization problems
min_{z∈R^{n_2}} f(ỹ, z) + g_2(z), min_{y∈R^{n_1}} f(y, z̃) + g_1(y)
have minimizers for any ỹ ∈ dom g_1, z̃ ∈ dom g_2.
Note: a half step is performed before invoking the method.

52 The Sequence "In Between". The sequence: x^k = (y^k, z^k); the sequence "in between": x^{k+1/2} = (y^{k+1}, z^k). Monotonicity of AM:
H(x^0) ≥ H(x^{1/2}) ≥ H(x^1) ≥ H(x^{3/2}) ≥ ...

53 Sublinear Rate of Convergence of AM. Theorem. For all n ≥ 2,
H(x^n) − H* ≤ max { (1/2)^{(n−1)/2} (H(x^0) − H*), 8 min{L_1, L_2} R² / (n − 1) },
where R = max_x max_{x*∈X*} { ‖x − x*‖ : H(x) ≤ H(x^0) }.
An ε-optimal solution is obtained after at most
max { (2/ln 2)(ln(H(x^0) − H*) + ln(1/ε)), 8 min{L_1, L_2} R² / ε } + 2
iterations. The constant depends on min{L_1, L_2} - an optimistic result: the rate is dictated by the better of the two blocks, with only weak dependence on global Lipschitz constants.

56 IRLS - Sublinear Rate of Convergence.
(A) min h_η(y, z) ≡ s(y) + (1/2) Σ_{i=1}^m [ (‖A_i y + b_i‖² + η²)/z_i + z_i ] s.t. y ∈ X, z ∈ [η/2, ∞)^m.
Here L_1 = L_s + (1/η) λ_max( Σ_{i=1}^m A_i^T A_i ) and L_2 = ∞.
Sublinear rate of convergence of IRLS:
S_η(y^n) − S_η* ≤ max { (1/2)^{(n−1)/2} (S_η(y^0) − S_η*), 8 L_1 R² / (n − 1) }.

57 Asymptotic Rate of Convergence. Theorem. There exists K > 0 such that for all n ≥ K + 1,
S_η(y^n) − S_η* ≤ 48 R² / (η (n − K)).
The rate does not depend on the data (s, A_i, b_i). Possibly explains the fast empirical convergence of IRLS.

58 Conditional Gradient. In some cases, prox/projection operators are more difficult to compute than linear oracles. Frank & Wolfe [1956] - a method for minimizing a convex smooth function over a compact polyhedral set. Problem: min{ f(x) : x ∈ C }. Conditional Gradient Method:
x^{k+1} = x^k + α_k (p(x^k) − x^k), where p(x^k) ∈ argmin{ ⟨∇f(x^k), p⟩ : p ∈ C }.
O(1/k) rate of convergence in the original Frank and Wolfe paper. Levitin & Polyak ['66] - extension to arbitrary compact convex sets. Canon & Cullum ['68] showed that the O(1/k) rate is tight even for strongly convex functions.

59 Conditional Gradient. Several variants, e.g., incorporating away steps, can be shown to converge linearly under strong convexity. Bach ['15] showed that the analysis of CG can be extended to the composite model, resulting in the generalized conditional gradient. Problem: min{ f(x) + g(x) : x ∈ C }. Generalized Conditional Gradient Method:
x^{k+1} = x^k + α_k (p(x^k) − x^k), where p(x^k) ∈ argmin{ ⟨∇f(x^k), p⟩ + g(p) : p ∈ C }.
Lacoste-Julien, Jaggi, Schmidt and Pletscher ['13] showed that a randomized block version converges at a rate of O(1/k) (in terms of expected values of the objective function). What about the deterministic case?

61 The Cyclic Block Conditional Gradient Method. General step. For k = 1, 2, ...:
(i) For i = 1, 2, ..., p, compute
p_i^k ∈ argmin_{p_i∈X_i} { ⟨∇_i f(x^{k,i−1}), p_i⟩ + g_i(p_i) },
and then x^{k,i} = x^{k,i−1} + α_i^k U_i (p_i^k − x_i^k).
(ii) Update x^{k+1} = x^{k,p}.
Stepsize strategies:
Predefined diminishing stepsize: α_i^k = α_k = 2/(k + 2).
Adaptive stepsize: α_i^k = min { S_i(x^{k,i−1}) / (L_i ‖p_i^k − x_i^k‖²), 1 }.
Exact stepsize: α_i^k ∈ argmin_{α∈[0,1]} f(x^{k,i−1} + α U_i (p_i^k − x_i^k)).
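A sketch of the CBCG loop with the predefined stepsize α_k = 2/(k + 2) on a toy separable problem (the box-constrained quadratic and the block partition are illustrative assumptions; for a box, the linear oracle returns a vertex):

```python
import numpy as np

def cbcg(c, block_sizes, iters=2000):
    # cyclic block conditional gradient for the toy problem
    #   min 0.5||x - c||^2  s.t. each block x_i in the box [-1, 1]^{n_i}
    x = np.zeros(int(np.sum(block_sizes)))
    offs = np.concatenate(([0], np.cumsum(block_sizes)))
    for k in range(iters):
        alpha = 2.0 / (k + 2.0)           # predefined diminishing stepsize
        for i in range(len(block_sizes)):
            s = slice(offs[i], offs[i + 1])
            grad_i = x[s] - c[s]          # block gradient of 0.5||x - c||^2
            p_i = -np.sign(grad_i)        # linear oracle over the box (a vertex)
            x[s] += alpha * (p_i - x[s])
    return x

c = np.array([0.5, -2.0, 3.0, 0.0])
x_cb = cbcg(c, [2, 2])   # approaches the projection of c onto the box
```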

64 The Optimality Measure. The analysis of conditional gradient methods is based on the following optimality measure:
S(x) ≡ ⟨∇f(x), x − p(x)⟩ + g(x) − g(p(x)), where p(x) ∈ argmin_{p∈X} { ⟨∇f(x), p⟩ + g(p) }.
The analysis of the CBCG method calls for the definition of partial optimality measures:
S_i(x) ≡ max_{p_i∈X_i} { ⟨∇_i f(x), x_i − p_i⟩ + g_i(x_i) − g_i(p_i) }, attained at p_i(x) ∈ argmin_{p_i∈X_i} { ⟨∇_i f(x), p_i⟩ + g_i(p_i) }.
By comparison, convergence of proximal gradient methods is based on the gradient mapping:
G_L(x) ≡ L [ x − prox_{(1/L)g}( x − (1/L) ∇f(x) ) ].

65 Sublinear Rates of Convergence. For all three stepsize rules it can be shown that H(x^k) − H* = O(1/k). The analyses of the three stepsize rules are very different. The result for the exact stepsize rule holds only for quadratic f. The result also holds if after each pass a different order (permutation) of the blocks is picked (cyclic-shuffle). The rate of convergence of S(x^k) to 0 can also be established: min_{n=0,1,2,...,k} S(x^n) = O(1/k).

66 Numerical Results on Synthetic Data. We solve the problem
min_x (1/2)(x − y)^T Q(x − y),
where y ∈ R^100 and Q ∈ R^{100×100}. Generation of Q: Q = X^T X, where each component of X ∈ R^{100×100} is generated by N(0, 1). The entries of y are generated by N(0, 1). We compare the Conditional Gradient (CG), its random block version (RBCG) and its cyclic and cyclic-shuffle block versions (CBCG-C, CBCG-P) with the three different stepsize strategies, based on 1000 randomly generated instances of the problem. The central line is the median over the 1000 runs and the ribbons show the 98%, 90%, 80%, 60% and 40% quantiles. k - number of effective passes through all the coordinates.

67 Comparison of Stepsize Rules and Methods. [Figure: objective accuracy vs. k for CBCG-P, CBCG-C, RBCG and CG under the predefined, adaptive, backtracking and line-search stepsize rules.]

68 Results on Structural SVM. [Figure: duality gap vs. k for CBCG-C, CBCG-P and RBCG, for several values of λ (including 0.01 and 0.1).]

69 Duality? Main question: what does a dual-based variables decomposition method look like?

70 The Model.
min_{x∈E} { f(x) + Σ_{i=1}^p ψ_i(x) },
f : E → (−∞, ∞] - a closed, proper, extended-valued, σ-strongly convex function.
ψ_i : E → (−∞, ∞] (i = 1, 2, ..., p) - closed, proper, extended-real-valued convex functions.
ri(dom f) ∩ ( ∩_{i=1}^p ri(dom ψ_i) ) ≠ ∅.

73 Functional Decomposition - the Idea. At each iteration of a functional decomposition method, an operation involving at most one of the functions ψ_i is performed. Suppose that either problems of the form min_x f(x) + ψ_i(x) + ⟨a, x⟩ can be easily solved, or prox_{ψ_i} can be easily computed. Examples of functional decomposition methods are incremental (sub)gradient methods (Kibardin ['80], Luo & Tseng ['94], Grippo ['94], Bertsekas ['97], Solodov ['98], Nedic & Bertsekas ['00, '01, '10]), incremental subgradient-proximal methods (Bertsekas ['10]) and certain variants of ADMM (Gabay & Mercier) and dual ADMM.

75 Example: 1D Total Variation Denoising. Given a noisy measurements vector y, we want to find a smooth vector x which is the solution to
min_x (1/2)‖x − y‖² + λ Σ_{i=1}^{n−1} |x_i − x_{i+1}|,
where the second term is ψ(x). Equivalent to finding the prox of the TV function. ψ has no useful separability properties. However, we can decompose ψ as ψ = ψ_1 + ψ_2, where
ψ_1(x) = λ Σ_{i=1}^{⌊n/2⌋} |x_{2i−1} − x_{2i}|, ψ_2(x) = λ Σ_{i=1}^{⌊(n−1)/2⌋} |x_{2i} − x_{2i+1}|.
prox_{ψ_1} and prox_{ψ_2} can be easily computed since they are separable w.r.t. pairs of variables (e.g., ψ_1 is separable w.r.t. {x_1, x_2}, {x_3, x_4}, ...).
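A sketch of prox_{ψ_1} and prox_{ψ_2} using this pairwise separability. For a single pair, minimizing 0.5(a − u)² + 0.5(b − v)² + λ|a − b| keeps the pair midpoint and soft-thresholds half the difference; the test vectors below are illustrative:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def prox_pairwise_tv(u, lam, offset=0):
    # prox of lam * sum of |u_j - u_{j+1}| over disjoint pairs
    # (offset=0 gives psi_1, offset=1 gives psi_2); each pair is solved in
    # closed form: midpoint preserved, half-difference soft-thresholded
    x = u.astype(float).copy()
    for i in range(offset, len(u) - 1, 2):
        m = 0.5 * (u[i] + u[i + 1])
        d = soft(0.5 * (u[i] - u[i + 1]), lam)
        x[i], x[i + 1] = m + d, m - d
    return x

x1 = prox_pairwise_tv(np.array([0.0, 1.0, 5.0, 5.0]), lam=0.2)            # psi_1
x2 = prox_pairwise_tv(np.array([0.0, 0.0, 4.0, 0.0]), lam=1.0, offset=1)  # psi_2
```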

77 The Dual Problem. The dual problem of (P) is
(D) max { q(y) ≡ −f*( −Σ_{j=1}^p y_j ) − Σ_{j=1}^p ψ_j*(y_j) }.
In minimization form:
min_{y∈E^p} { H(y) ≡ F(y) + Σ_{i=1}^p Ψ_i(y_i) },
with F(y) ≡ f*( −Σ_{j=1}^p y_j ) - a convex C^{1,1}_{p/σ} function; Ψ_j(y_j) ≡ ψ_j*(y_j) - closed, proper, convex.
Dual block variables decomposition = primal functional decomposition.

78 Two Dual Block Steps. Given ȳ_1, ..., ȳ_p, the objective is to compute ȳ_i^new - a new value of the i-th component - by employing one of the following steps.
Dual exact minimization step:
y_i^new ∈ argmin_{y_i} { f*( −Σ_{j≠i} ȳ_j − y_i ) + ψ_i*(y_i) }
(the value of ȳ_i is not being used).
Dual proximal gradient step:
y_i^new = prox_{σψ_i*}( ȳ_i + σ ∇f*( −Σ_{j=1}^p ȳ_j ) ).

79 Primal Representations of the Dual Block Steps. Using the Moreau decomposition and some conjugate/prox calculus...
Primal representation of the dual exact minimization step:
ỹ_i = Σ_{j≠i} ȳ_j,
x̄ ∈ argmin_{x∈E} { f(x) + ψ_i(x) + ⟨ỹ_i, x⟩ },
y_i^new ∈ ∂ψ_i(x̄).
Primal representation of the dual proximal gradient step:
x̄ = argmin_{x∈E} { f(x) + ⟨Σ_{j=1}^p ȳ_j, x⟩ },
y_i^new = ȳ_i + σ x̄ − σ prox_{ψ_i/σ}( ȳ_i/σ + x̄ ).

80 Dual Cyclic Alternating Minimization Method (DAM-C). Initialization: y^0 = (y_1^0, ..., y_p^0) ∈ E^p. General step (k = 0, 1, 2, ...): set y^{k,0} = y^k. For i = 0, 1, ..., p − 1, define y^{k,i+1} as follows:
x^{k,i} ∈ argmin_{x∈E} { f(x) + ψ_{i+1}(x) + ⟨ Σ_{j≠i+1} y_j^{k,i}, x ⟩ },
y_j^{k,i+1} ∈ ∂ψ_{i+1}(x^{k,i}) if j = i + 1; y_j^{k,i+1} = y_j^{k,i} if j ≠ i + 1.
Set y^{k+1} = y^{k,p} and x^k = x^{k,0}.
If f ∈ C^1, then the update rule for y^{k,i+1} can be replaced by
y_j^{k,i+1} = −∇f(x^{k,i}) − Σ_{j'≠i+1} y_{j'}^{k,i} if j = i + 1; y_j^{k,i+1} = y_j^{k,i} if j ≠ i + 1.

81 Dual Cyclic Block Proximal Gradient Method (DBPG-C). Initialization: y^0 = (y_1^0, ..., y_p^0) ∈ E^p. General step (k = 0, 1, 2, ...): set y^{k,0} = y^k. For i = 0, 1, ..., p − 1, define y^{k,i+1} as follows:
x^{k,i} = argmin_{x∈E} { f(x) + ⟨ Σ_{j=1}^p y_j^{k,i}, x ⟩ },
y_j^{k,i+1} = y_{i+1}^{k,i} + σ x^{k,i} − σ prox_{ψ_{i+1}/σ}( y_{i+1}^{k,i}/σ + x^{k,i} ) if j = i + 1; y_j^{k,i+1} = y_j^{k,i} if j ≠ i + 1.
Set y^{k+1} = y^{k,p} and x^k = x^{k,0}.

84 Rate of Convergence of the Primal Sequence. The rates of convergence of the dual objective function are already known. Do they imply corresponding rates of convergence of the primal sequence? YES!
The primal-dual relation. Let ȳ satisfy ȳ_j ∈ dom ψ_j* for any j ∈ {1, 2, ..., p}. Let x̄ be defined by either
x̄ = argmin_x { f(x) + ⟨ Σ_{j=1}^p ȳ_j, x ⟩ }
or
x̄ ∈ argmin_x { f(x) + ψ_i(x) + ⟨ Σ_{j≠i} ȳ_j, x ⟩ }
for some i ∈ {1, 2, ..., p}. Then
‖x̄ − x*‖² ≤ (2/σ) (q_opt − q(ȳ)).
First implication by [B. & Teboulle '14].

85 Rates of Convergence of Functional Decomposition Methods.
method | complexity result | remarks
DBPG-C | ‖x^k − x*‖² ≤ 2C_1/(σ(k+1)) | Ψ_i indicators, general p
DBPG-C | ‖x^k − x*‖² ≤ 2C_2/(σ(k+1)) | general Ψ_i and p
DBPG-R | E(‖x^k − x*‖²) ≤ 2p C_3/(σ(p+k)) | general Ψ_i and p
DAM-C | ‖x^k − x*‖² ≤ 2C_4/(σk) | p = 2
DAM-C | ‖x^k − x*‖² ≤ 2C_5/(σ(k+1)) | general p and ψ_i
Here C_1, ..., C_5 are explicit constants depending only on p, R, σ, G_max and the initial dual gap q_opt − q(y^0); for example,
C_1 = 2p[(2p + 1)R + σp]²/σ,
C_3 = (1/(2σ)) min_{y*∈Y*} ‖y^0 − y*‖² + q_opt − q(y^0),
C_4 = 3 max{ q_opt − q(y^0), R²/σ }.

86 Numerical Example: Isotropic 2D TV Denoising. TV denoising:
min_{x∈R^{m×n}} (1/2)‖x − b‖_F² + θ TV_I(x).
Isotropic TV:
TV_I(x) = Σ_{i=1}^{m−1} Σ_{j=1}^{n−1} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² ) + Σ_{j=1}^{n−1} |x_{m,j} − x_{m,j+1}| + Σ_{i=1}^{m−1} |x_{i,n} − x_{i+1,n}|.
Chambolle & Pock ['15]: anisotropic (l_1-l_1) TV, decomposition into rows and columns (two functions).

87 Decomposition of Isotropic TV.
TV_I(x) = Σ_{i,j} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
= Σ_{k∈K_1} Σ_{(i,j)∈D_k} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
+ Σ_{k∈K_2} Σ_{(i,j)∈D_k} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
+ Σ_{k∈K_3} Σ_{(i,j)∈D_k} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
= ψ_1(x) + ψ_2(x) + ψ_3(x),
where D_k is the set of indices of the k-th diagonal and K_i = { k ∈ {−(m−1), ..., n−1} : (k + 1 − i) mod 3 = 0 }, i = 1, 2, 3. Using the separability of the ψ_i, computation of prox_{ψ_i} is simple.

88 Illustration. [Figure: the three groups of diagonals defining ψ_1, ψ_2 and ψ_3.]

89 Numerical Comparison. [Figure: function value vs. iteration for DAM, ADMM, dual ADMM and FISTA.] In the first 100 iterations DAM-C is better than FISTA. However, after enough iterations, FISTA wins...

90 References.
Amir Beck and Luba Tetruashvili, On the Convergence of Block Coordinate Descent Type Methods, SIAM J. Optim. 23 (2013), no. 4.
Amir Beck, On the Convergence of Alternating Minimization for Convex Programming with Applications to Iteratively Reweighted Least Squares and Decomposition Schemes, SIAM J. Optim. 25 (2015), no. 1.
Amir Beck, Edouard Pauwels and Shoham Sabach, The Cyclic Block Conditional Gradient Method for Convex Optimization Problems, SIAM J. Optim. (2015), to appear.
Amir Beck, Luba Tetruashvili and Yakov Vaisbourd, Rate of Convergence Analysis of Dual-Based Variables Decomposition Methods.

91 Last Slide THANK YOU FOR YOUR ATTENTION


Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Sequential convex programming,: value function and convergence

Sequential convex programming,: value function and convergence Sequential convex programming,: value function and convergence Edouard Pauwels joint work with Jérôme Bolte Journées MODE Toulouse March 23 2016 1 / 16 Introduction Local search methods for finite dimensional

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Optimization for Learning and Big Data

Optimization for Learning and Big Data Optimization for Learning and Big Data Donald Goldfarb Department of IEOR Columbia University Department of Mathematics Distinguished Lecture Series May 17-19, 2016. Lecture 1. First-Order Methods for

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Coordinate Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization / Slides adapted from Tibshirani

Coordinate Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization / Slides adapted from Tibshirani Coordinate Descent Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh Convex Optimization 10-725/36-725 Slides adapted from Tibshirani Coordinate descent We ve seen some pretty sophisticated methods

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

A user s guide to Lojasiewicz/KL inequalities

A user s guide to Lojasiewicz/KL inequalities Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Sparse and Regularized Optimization

Sparse and Regularized Optimization Sparse and Regularized Optimization In many applications, we seek not an exact minimizer of the underlying objective, but rather an approximate minimizer that satisfies certain desirable properties: sparsity

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

Convergence of Fixed-Point Iterations

Convergence of Fixed-Point Iterations Convergence of Fixed-Point Iterations Instructor: Wotao Yin (UCLA Math) July 2016 1 / 30 Why study fixed-point iterations? Abstract many existing algorithms in optimization, numerical linear algebra, and

More information

A first-order primal-dual algorithm with linesearch

A first-order primal-dual algorithm with linesearch A first-order primal-dual algorithm with linesearch Yura Malitsky Thomas Pock arxiv:608.08883v2 [math.oc] 23 Mar 208 Abstract The paper proposes a linesearch for a primal-dual method. Each iteration of

More information

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches

Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

arxiv: v3 [math.oc] 29 Jun 2016

arxiv: v3 [math.oc] 29 Jun 2016 MOCCA: mirrored convex/concave optimization for nonconvex composite functions Rina Foygel Barber and Emil Y. Sidky arxiv:50.0884v3 [math.oc] 9 Jun 06 04.3.6 Abstract Many optimization problems arising

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Expanding the reach of optimal methods

Expanding the reach of optimal methods Expanding the reach of optimal methods Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW) BURKAPALOOZA! WCOM

More information

NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization

NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization ESTT: A onconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization Davood Hajinezhad, Mingyi Hong Tuo Zhao Zhaoran Wang Abstract We study a stochastic and distributed algorithm for

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018 Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected

More information

A Tutorial on Primal-Dual Algorithm

A Tutorial on Primal-Dual Algorithm A Tutorial on Primal-Dual Algorithm Shenlong Wang University of Toronto March 31, 2016 1 / 34 Energy minimization MAP Inference for MRFs Typical energies consist of a regularization term and a data term.

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

First-order methods for structured nonsmooth optimization

First-order methods for structured nonsmooth optimization First-order methods for structured nonsmooth optimization Sangwoon Yun Department of Mathematics Education Sungkyunkwan University Oct 19, 2016 Center for Mathematical Analysis & Computation, Yonsei University

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems

The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems J Optim Theory Appl 2014) 162:892 919 DOI 10.1007/s10957-013-0491-5 The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems Amir Beck Received: 11 November 2012

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

9. Dual decomposition and dual algorithms

9. Dual decomposition and dual algorithms EE 546, Univ of Washington, Spring 2016 9. Dual decomposition and dual algorithms dual gradient ascent example: network rate control dual decomposition and the proximal gradient method examples with simple

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems Amir Beck and Shoham Sabach July 6, 2011 Abstract We consider a general class of convex optimization problems

More information

On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems

On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems Radu Ioan Boţ Ernö Robert Csetnek August 5, 014 Abstract. In this paper we analyze the

More information