Primal and Dual Variables Decomposition Methods in Convex Optimization


1 Primal and Dual Variables Decomposition Methods in Convex Optimization. Amir Beck, Technion - Israel Institute of Technology, Haifa, Israel. Based on joint works with Edouard Pauwels, Shoham Sabach, Luba Tetruashvili and Yakov Vaisbourd. Optimization in Machine Learning, Vision and Image Processing, Toulouse, France, 6-7 October 2015.

2 A (too?) General Model. (P) min H(x_1, x_2, ..., x_p) s.t. x_i ∈ X_i, where X_i ⊆ R^{n_i} so that n = Σ_{i=1}^p n_i. At each iteration of a block variables decomposition method, an operation involving only one of the block variables x_1, x_2, ..., x_p is performed.

3 A (too?) General Model. (P) min H(x_1, x_2, ..., x_p) s.t. x_i ∈ X_i, where X_i ⊆ R^{n_i} so that n = Σ_{i=1}^p n_i. The Alternating Minimization (AM) method sequentially minimizes H w.r.t. each component in a cyclic manner. Alternating Minimization: at step k, given x^k, the next iterate x^{k+1} is computed as follows:
x_1^{k+1} ∈ argmin_{x_1} H(x_1, x_2^k, ..., x_p^k),
x_2^{k+1} ∈ argmin_{x_2} H(x_1^{k+1}, x_2, x_3^k, ..., x_p^k),
...
x_p^{k+1} ∈ argmin_{x_p} H(x_1^{k+1}, ..., x_{p-1}^{k+1}, x_p).
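As a small illustration (not from the talk), the AM scheme above can be sketched in a few lines; the quadratic objective H below is an assumed toy example chosen so that both block minimizers are available in closed form:

```python
def H(x, y):
    # toy coupled convex objective (an illustrative stand-in for the model's H)
    return (x - y) ** 2 + (x - 1) ** 2 + (y + 1) ** 2

def alternating_minimization(x0=0.0, y0=0.0, iters=100):
    x, y = x0, y0
    for _ in range(iters):
        x = (y + 1) / 2.0    # argmin_x H(x, y), from setting dH/dx = 0
        y = (x - 1) / 2.0    # argmin_y H(x, y), from setting dH/dy = 0
    return x, y

x, y = alternating_minimization()   # converges to the minimizer (1/3, -1/3)
```

Each block update here is a contraction, so the iterates converge; the examples that follow show this is not automatic in general.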

5 Block Descent Methods. The AM method is just one example of a block descent (or variables decomposition) method. Other variants replace, for example, the exact minimization step with some kind of descent operator. General Block Descent Method: for i = 1 : p,
x_i^{k+1} = T_i(x_1^{k+1}, ..., x_{i-1}^{k+1}, x_i^k, x_{i+1}^k, ..., x_p^k),
where T_i : R^n → R^{n_i} is a descent operator (such as one step of a minimization method). Additional variants of the method consider index selection strategies other than cyclic (essentially cyclic, Gauss-Southwell), and deterministic index selection strategies can be replaced by randomized ones.

7 Example 1 of AM: IRLS - Iteratively Reweighted Least Squares. The model:
(N) min s(y) + Σ_{i=1}^m ‖A_i y + b_i‖_2 s.t. y ∈ X,
where A_i ∈ R^{k_i × n}, b_i ∈ R^{k_i}, i = 1, 2, ..., m, and s is continuously differentiable over the closed and convex set X ⊆ R^n. Examples:
l_1-norm linear regression: min ‖By − c‖_1.
Fermat-Weber problem: (FW) min Σ_{i=1}^m ω_i ‖y − a_i‖.
l_1-regularized least squares: min ‖By − c‖² + λ‖Dy‖_1.

9 IRLS - Iteratively Reweighted Least Squares. Wishful thinking... Initialization: y^0 ∈ X. General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈X} { s(y) + (1/2) Σ_{i=1}^m ‖A_i y + b_i‖² / ‖A_i y^k + b_i‖ }.
Practical method: η-IRLS. Input: η > 0 - a given parameter. Initialization: y^0 ∈ X. General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈X} { s(y) + (1/2) Σ_{i=1}^m ‖A_i y + b_i‖² / √(‖A_i y^k + b_i‖² + η²) }.
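A minimal sketch of the η-IRLS step for the unconstrained l_1-regression instance (s ≡ 0, X = R^n, the A_i being the rows of B): every iteration is a weighted least-squares solve whose weights come from the current residuals. The data B, c below are hypothetical:

```python
import numpy as np

def irls_l1(B, c, eta=1e-6, iters=50):
    # eta-IRLS sketch for min ||B y - c||_1:
    # each iteration solves a weighted least-squares problem whose weights
    # are recomputed from the current residuals (smoothed by eta)
    y = np.zeros(B.shape[1])
    for _ in range(iters):
        r = B @ y - c
        w = 1.0 / np.sqrt(r ** 2 + eta ** 2)   # smoothed reweighting
        Bw = B * w[:, None]
        y = np.linalg.solve(B.T @ Bw, Bw.T @ c)
    return y

# hypothetical data: with B a column of ones, the l1 solution is the median of c
B = np.ones((5, 1))
c = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y = irls_l1(B, c)
```

Note the robustness to the outlier 100: the l_1 fit stays near the median, unlike a plain least-squares fit.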

10 IRLS - Literature. Popular approach in robust regression (McCullagh & Nelder '83); applications in sparse recovery (Daubechies et al. '10); same as Weiszfeld's method (from 1937) for solving the Fermat-Weber problem (η = 0?). Convergence results are known only for very specific instances [Bissantz et al. '08 (specific unconstrained model, asymptotic linear rate of convergence), Daubechies et al. '10 (asymptotic linear rate, basis pursuit problem)].

11 IRLS AM. Auxiliary problem:
(A) min h_η(y, z) ≡ s(y) + (1/2) Σ_{i=1}^m [ (‖A_i y + b_i‖² + η²)/z_i + z_i ] s.t. y ∈ X, z ∈ [η/2, ∞)^m.
Minimizing w.r.t. z implies that the problem is a smoothed version of (N):
(N_η) min s(y) + Σ_{i=1}^m √(‖A_i y + b_i‖² + η²) s.t. y ∈ X.

14 IRLS AM. (y^k, z^k) - the k-th iterate of the AM method. The z-step in AM: z_i = √(‖A_i y^k + b_i‖² + η²). The y-step in AM:
y^{k+1} ∈ argmin_{y∈X} { s(y) + (1/2) Σ_{i=1}^m ‖A_i y + b_i‖² / √(‖A_i y^k + b_i‖² + η²) }.
The two methods are equivalent given that the initial z^0 is chosen as [z^0]_i = √(‖A_i y^0 + b_i‖² + η²).

16 Example 2 of AM: A Composite Model.
T* = min { T(y) ≡ q(y) + r(Ay) },
q : R^n → R ∪ {∞} closed, proper, convex; r : R^m → R a real-valued convex function. A popular penalized approach:
Consider the problem min_{y,z} { q(y) + r(z) : z = Ay }.
Write a penalized version: (C) T_ρ* = min_{y,z} { T_ρ(y, z) = q(y) + r(z) + (ρ/2)‖z − Ay‖² }.
Employ the AM method.

18 AM for solving (C). Alternating Minimization for Solving (C). Input: ρ > 0 - a given parameter. Initialization: y^0 ∈ R^n, z^0 ∈ argmin { r(z) + (ρ/2)‖z − Ay^0‖² }. General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈R^n} { q(y) + (ρ/2)‖z^k − Ay‖² },
z^{k+1} = argmin_{z∈R^m} { r(z) + (ρ/2)‖z − Ay^{k+1}‖² }.
Implementable in several important examples, e.g., min ‖Cx − d‖² + λ‖Lx‖_1 (prox of l_1 + solution of a linear system).
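A sketch of this AM scheme on the example min 0.5‖Cx − d‖² + λ‖Lx‖_1 (the penalty parameter ρ and the data below are illustrative assumptions): the x-step is a linear solve with a fixed system matrix, the z-step a soft-thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def am_penalized(C, d, L, lam, rho, iters=200):
    # AM sketch for the penalized problem
    #   min_{x,z} 0.5||Cx - d||^2 + lam||z||_1 + (rho/2)||z - Lx||^2,
    # a surrogate for min 0.5||Cx - d||^2 + lam||Lx||_1:
    # x-step = linear solve, z-step = soft-thresholding (prox of l1)
    x = np.zeros(C.shape[1])
    z = np.zeros(L.shape[0])
    M = C.T @ C + rho * L.T @ L      # the x-step system matrix is fixed
    for _ in range(iters):
        x = np.linalg.solve(M, C.T @ d + rho * L.T @ z)
        z = soft_threshold(L @ x, lam / rho)
    return x, z

# illustrative data: with C = L = I the x-limit is a soft-thresholded d
C = L = np.eye(3)
d = np.array([1.0, 0.0, 2.0])
x, z = am_penalized(C, d, L, lam=0.5, rho=10.0)
```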

19 Example 3 of AM: k-means method in clustering. Input: n points A = {a_1, a_2, ..., a_n} ⊆ R^d and k - the number of clusters. The idea is to partition the data A into k subsets A_1, ..., A_k, called clusters. For each l ∈ {1, ..., k}, the cluster A_l is represented by its so-called center x_l. The clustering problem: determine k cluster centers x_1, x_2, ..., x_k such that the sum of squared distances from each point a_i to a nearest cluster center is minimized:
min_{x_1,...,x_k} Σ_{i=1}^n min_{l=1,...,k} ‖a_i − x_l‖².

21 Clustering: AM = k-means. Using the trick min{b_1, ..., b_k} = min{ b^T y : y ∈ Δ_k }, where Δ_k = { y ∈ R^k : e^T y = 1, y ≥ 0 }, we can reformulate:
min Σ_{i=1}^n Σ_{l=1}^k y_l^i ‖a_i − x_l‖² s.t. x_1, ..., x_k ∈ R^d, y^1, ..., y^n ∈ Δ_k.
k-means. Repeat:
Assignment step: assign each point a_i to a closest cluster center: A_l = { i : ‖a_i − x_l‖ ≤ ‖a_i − x_j‖ for all j = 1, ..., k }, l = 1, 2, ..., k.
Update step: cluster centers are averages: x_l = (1/|A_l|) Σ_{i∈A_l} a_i.
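The two AM blocks above map directly onto the familiar two-step loop; a minimal sketch (initializing the centers by sampling k data points is an assumption, not part of the slide):

```python
import numpy as np

def kmeans(A, k, iters=50, seed=0):
    # k-means as AM: assignment step (the y-block), then center update (the x-block)
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=k, replace=False)].astype(float)
    labels = np.zeros(len(A), dtype=int)
    for _ in range(iters):
        # assignment step: nearest center for every point
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        for l in range(k):
            if np.any(labels == l):
                centers[l] = A[labels == l].mean(axis=0)
    return centers, labels

# two well-separated (hypothetical) groups of points
A = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.1]])
centers, labels = kmeans(A, 2)
```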

22 Why Block Descent Methods? Simple and cheap updates at each iteration - suitable for large-scale applications. Allow larger step-sizes at each iteration. In some nonconvex settings - yield better-quality solutions.

23 Convergence?? Well... Convergence of the AM method is not always guaranteed. Powell's example ('73):
φ(x, y, z) = −xy − yz − zx + [x − 1]_+² + [−x − 1]_+² + [y − 1]_+² + [−y − 1]_+² + [z − 1]_+² + [−z − 1]_+².
φ is differentiable and nonconvex. Fixing y, z, it is easy to show that
argmin_x φ(x, y, z) = sgn(y + z)(1 + (1/2)|y + z|) if y + z ≠ 0, and any point of [−1, 1] if y + z = 0.
Similar formulas hold for minimizing w.r.t. y and z.

27 Powell's Example. Start with (−1 − ε, 1 + ε/2, −1 − ε/4). First six iterations:
(1 + ε/8, 1 + ε/2, −1 − ε/4)
(1 + ε/8, −1 − ε/16, −1 − ε/4)
(1 + ε/8, −1 − ε/16, 1 + ε/32)
(−1 − ε/64, −1 − ε/16, 1 + ε/32)
(−1 − ε/64, 1 + ε/128, 1 + ε/32)
(−1 − ε/64, 1 + ε/128, −1 − ε/256)
We are essentially back at the first point, but with ε/64 instead of ε. The process will continue to cycle around the 6 non-stationary points (1, 1, −1), (1, −1, −1), (1, −1, 1), (−1, −1, 1), (−1, 1, 1), (−1, 1, −1). Can be rectified if certain uniqueness/convexity/boundedness assumptions are made [Zadeh '70, Grippo & Sciandrone '00, Bertsekas & Tsitsiklis '89, Luo & Tseng '93, ...]

28 The Failure of Convexity. Unfortunately, in the absence of differentiability, even convexity is not enough to guarantee convergence. Example: f(x_1, x_2) = |3x_1 + 4x_2| + |x_1 + 2x_2|. The point (−2, 1.5) is a coordinatewise minimum, but not the global minimum (which is (0, 0)). Any block descent method might converge to this non-global point.

30 The Composite Model. We should thus consider a combination of: a smooth function (not necessarily convex, but with some boundedness assumptions); a nonsmooth separable convex function. The composite model:
(P) min H(x_1, ..., x_p) ≡ f(x_1, x_2, ..., x_p) + g(x), where g(x) = Σ_{i=1}^p g_i(x_i),
f : R^n → R - continuously differentiable with bounded level sets; g_i : R^{n_i} → (−∞, ∞] - closed, proper, convex. A coordinatewise minimum of (P) is a stationary point:
−∇_i f(x) ∈ ∂g_i(x_i), i = 1, 2, ..., p.

31 Basic Notation.
1. x^{k,i} = (x_1^{k+1}, ..., x_i^{k+1}, x_{i+1}^k, ..., x_p^k), so that x^{k,0} = x^k and x^{k,p} = x^{k+1}.
2. U_i ∈ R^{n × n_i}, i = 1, ..., p, are such that (U_1, U_2, ..., U_p) = I_n. Thus x_i = U_i^T x and x = Σ_{i=1}^p U_i x_i.

33 Two Types of Inexact Descent Methods. Moreau's proximal mapping: prox_h(x) = argmin_u { h(u) + (1/2)‖u − x‖² }.
Block Proximal Gradient. For i = 1 : p:
x_i^{k+1} = prox_{α_i^k g_i}( x_i^k − α_i^k ∇_i f(x^{k,i−1}) ).
g_i ≡ 0 - block gradient descent; g_i = δ_{X_i} - block projected gradient.
Block Conditional Gradient. For i = 1 : p:
p_i^k ∈ argmin_{p_i ∈ dom g_i} { ⟨∇_i f(x^{k,i−1}), p_i⟩ + g_i(p_i) },
x_i^{k+1} = x_i^k + α_i^k (p_i^k − x_i^k).
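A sketch of the block proximal gradient update with one-dimensional blocks on the lasso instance f(x) = 0.5‖Ax − b‖², g_i(x_i) = λ|x_i| (this concrete instance and the constant steps α_i^k = 1/L_i are illustrative assumptions):

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def bcpg_lasso(A, b, lam, iters=100):
    # cyclic block proximal gradient with 1-D blocks on
    #   f(x) = 0.5||Ax - b||^2,  g_i(x_i) = lam|x_i|,
    # with steps 1/L_i, L_i = ||A_i||^2 (the block Lipschitz constants)
    n = A.shape[1]
    x = np.zeros(n)
    L = (A ** 2).sum(axis=0)
    r = A @ x - b                    # residual, updated after each block step
    for _ in range(iters):
        for i in range(n):
            g = A[:, i] @ r          # i-th partial gradient at the current point
            xi = soft(x[i] - g / L[i], lam / L[i])
            r += A[:, i] * (xi - x[i])
            x[i] = xi
    return x

# illustrative data: for A = I the method returns soft(b, lam)
A = np.eye(3)
b = np.array([1.0, 0.1, 2.0])
x = bcpg_lasso(A, b, lam=0.5)
```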

35 Rates of Convergence. Main objective: study of nonasymptotic rates of convergence of block descent methods.
Usually, results on the convergence of BCGD and alternating minimization require assumptions such as strict/strong convexity or uniqueness of the minimum w.r.t. each block (Auslender '76, Bertsekas '99, Grippo & Sciandrone '99). Luo & Tseng ('93) showed that under strong convexity w.r.t. each block and the existence of an error bound, a linear rate of convergence can be obtained.
Randomized BCGD: Nesterov (2012) - O(1/k) rate of convergence in expectation for RBCGD; a result on function values (in probability). Richtarik & Takac (2012) - extension to composite functions (sum of a smooth and a block-separable nonsmooth function); improved bounds. Fercoq & Richtarik ['13] - accelerated and parallelized. Shalev-Shwartz & Tewari (2011) - stochastic coordinate descent applied to the l_1-regularized loss minimization problem; O(1/ε) result for expected values.

36 Results on BCPG. A nonasymptotic O(1/k) rate of convergence of cyclic BCGD/BCPG (g_i = δ_{X_i}) under general convexity and smoothness assumptions. An accelerated O(1/k²) version of BCGD.

37 Underlying Assumptions.
f is convex.
f is block-coordinate-wise Lipschitz continuous with local Lipschitz constants L_i: ‖∇_i f(x + U_i h_i) − ∇_i f(x)‖ ≤ L_i ‖h_i‖ for every h_i ∈ R^{n_i}.
(Consequently) ∇f is Lipschitz continuous; its constant is denoted by L: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for every x, y ∈ R^n.
S = {x : f(x) ≤ f(x^0)} is compact, and we denote R(x^0) ≡ max_{x∈R^n} max_{x*∈X*} { ‖x − x*‖ : f(x) ≤ f(x^0) }.

39 Sublinear Rate of Convergence. Theorem [sublinear rate of convergence of the BCGD method].
f(x^k) − f* ≤ 4 L_max (1 + p L² / L_min²) R²(x^0) / (k + 8/p), k = 0, 1, ...
The case p = 1: taking L_1 = L, we obtain (using Fejér monotonicity)
f(x^k) − f* ≤ 8 L ‖x* − x^0‖² / (k + 8).
For the gradient descent method the following slightly better bound is known:
f(x^k) − f* ≤ L ‖x* − x^0‖² / (2k).

40 Linear Rate in the Strongly Convex Case. Theorem [linear rate of convergence under strong convexity]. Suppose that f is also strongly convex with parameter σ > 0. Then
f(x^k) − f* ≤ ( 1 − σ / (2 L_max (1 + p L² / L_min²)) )^k (f(x^0) − f*), k = 0, 1, ...

41 Acceleration of BCGD. Nesterov's optimal scheme. Input: M > 0. Initialization: x^0 ∈ R^n, v^0 = x^0, γ_0 = M. General step (k = 0, 1, ...):
Compute α_k ∈ (0, 1) as the solution of the quadratic equation M α_k² = (1 − α_k) γ_k.
Set γ_{k+1} = (1 − α_k) γ_k and y^k = α_k v^k + (1 − α_k) x^k.
Core step: find x^{k+1} satisfying f(x^{k+1}) ≤ f(y^k) − (1/(2M)) ‖∇f(y^k)‖².
Set v^{k+1} = v^k − (α_k / γ_{k+1}) ∇f(y^k).

43 The Core Step. Usually x^{k+1} is chosen as y^k − t∇f(y^k). But we can replace it by:
Core step for the Accelerated BCGD Method (input: y^k, output: x^{k+1}). Set y_k^0 = y^k and define recursively
y_k^i = y_k^{i−1} − (1/L_i) U_i ∇_i f(y_k^{i−1}), i = 1, ..., p,
x^{k+1} = y_k^p.
Thanks to the sufficient decrease lemma we can write: Theorem. Let {x^k}_{k≥0} be the sequence generated by the optimal scheme with the BCGD core step. Then
f(x^k) − f* ≤ 8 L_max (1 + p L² / L_min²) ‖x* − x^0‖² / (k + 2)², k = 0, 1, ...
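A sketch of the optimal scheme above with the plain gradient core step x^{k+1} = y^k − (1/M)∇f(y^k); the slide's variant would replace that single line by one BCGD sweep over the blocks. The quadratic test function and the choice M = L are illustrative assumptions:

```python
import numpy as np

def accelerated_scheme(grad, M, x0, iters=500):
    # Nesterov's optimal scheme from the slide; the core step here is the
    # plain gradient step, which guarantees the required sufficient decrease
    x, v, gamma = x0.copy(), x0.copy(), M
    for _ in range(iters):
        # alpha_k in (0, 1) solving M alpha^2 = (1 - alpha) gamma_k
        alpha = (-gamma + np.sqrt(gamma ** 2 + 4 * M * gamma)) / (2 * M)
        gamma_next = (1 - alpha) * gamma
        y = alpha * v + (1 - alpha) * x
        g = grad(y)
        x = y - g / M                        # core step (sufficient decrease)
        v = v - (alpha / gamma_next) * g
        gamma = gamma_next
    return x

# toy quadratic f(x) = 0.5 x^T Q x with minimizer 0; M = L = lambda_max(Q)
Q = np.diag([1.0, 10.0, 100.0])
x_acc = accelerated_scheme(lambda u: Q @ u, M=100.0, x0=np.ones(3))
```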

45 The Constrained Problem. min{ f(x) : x ∈ X }, X = X_1 × X_2 × ... × X_p, with X_i ⊆ R^{n_i} closed and convex (i = 1, 2, ..., p). The Block Coordinate Projected Gradient (BCPG) method can also be shown to have an O(1/k) rate of convergence. Unknown Lipschitz constants: possible to incorporate a backtracking procedure; all rate-of-convergence results remain valid (up to slight changes in the constants). Shefi & Teboulle ['14] - generalization of the analysis to general proper, closed, convex functions g_i.

47 Results on the AM method with p = 2. More difficult to analyze in the absence of strong convexity - distances between consecutive iterates cannot be controlled. On the other hand, logic dictates that, if possible, exact minimization is better. Can theory substantiate this intuition? Yes, at least for p = 2...

48 The General Minimization Problem.
(P): min { H(y, z) ≡ f(y, z) + g_1(y) + g_2(z) : y ∈ R^{n_1}, z ∈ R^{n_2} }
A. g_1 : R^{n_1} → (−∞, ∞], g_2 : R^{n_2} → (−∞, ∞] are closed, proper and convex.
B. f - a convex and continuously differentiable function over dom g_1 × dom g_2.

49 Block Notation. x = (y, z). g : R^{n_1} × R^{n_2} → (−∞, ∞] is defined by g(x) = g(y, z) ≡ g_1(y) + g_2(z). In this notation: H(x) = f(x) + g(x). ∇_1 f(x) - gradient of f w.r.t. y; ∇_2 f(x) - gradient of f w.r.t. z.

50 Further Assumptions.
C. ∇_1 f is (uniformly) Lipschitz continuous w.r.t. y over dom g_1 with constant L_1 ∈ (0, ∞):
‖∇_1 f(y + d_1, z) − ∇_1 f(y, z)‖ ≤ L_1 ‖d_1‖ for all y, y + d_1 ∈ dom g_1, z ∈ dom g_2.
D. ∇_2 f is (uniformly) Lipschitz continuous w.r.t. z over dom g_2 with constant L_2 ∈ (0, ∞]:
‖∇_2 f(y, z + d_2) − ∇_2 f(y, z)‖ ≤ L_2 ‖d_2‖ for all y ∈ dom g_1, z, z + d_2 ∈ dom g_2.
When L_2 = ∞, [D] is meaningless!

51 The Alternating Minimization Method. Initialization: y^0 ∈ dom g_1 and z^0 ∈ argmin_{z∈R^{n_2}} f(y^0, z) + g_2(z). General step (k = 0, 1, ...):
y^{k+1} ∈ argmin_{y∈R^{n_1}} f(y, z^k) + g_1(y),
z^{k+1} ∈ argmin_{z∈R^{n_2}} f(y^{k+1}, z) + g_2(z).
E. The optimal set of (P), denoted X*, is nonempty, and the minimization problems
min_{z∈R^{n_2}} f(ỹ, z) + g_2(z), min_{y∈R^{n_1}} f(y, z̃) + g_1(y)
have minimizers for any ỹ ∈ dom g_1, z̃ ∈ dom g_2.
Note: a half step is performed before invoking the method.

52 The Sequence "In Between". The sequence: x^k = (y^k, z^k); the sequence "in between": x^{k+1/2} = (y^{k+1}, z^k). Monotonicity of AM:
H(x^0) ≥ H(x^{1/2}) ≥ H(x^1) ≥ H(x^{3/2}) ≥ ...

53 Sublinear Rate of Convergence of AM. Theorem. For all n ≥ 2,
H(x^n) − H* ≤ max { (1/2)^{(n−1)/2} (H(x^0) − H*), 8 min{L_1, L_2} R² / (n − 1) },
where R = max_x max_{x*∈X*} { ‖x − x*‖ : H(x) ≤ H(x^0) }.
An ε-optimal solution is obtained after at most
max { (2/ln 2)(ln(H(x^0) − H*) + ln(1/ε)), 8 min{L_1, L_2} R² / ε } + 2
iterations. The constant depends on min{L_1, L_2} - an optimistic result: the rate is dictated by the better of the two blocks, with only weak dependence on global Lipschitz constants.

56 IRLS - Sublinear Rate of Convergence.
(A) min h_η(y, z) ≡ s(y) + (1/2) Σ_{i=1}^m [ (‖A_i y + b_i‖² + η²)/z_i + z_i ] s.t. y ∈ X, z ∈ [η/2, ∞)^m.
Here L_1 = L_s + (1/η) λ_max( Σ_{i=1}^m A_i^T A_i ) and L_2 = ∞.
Sublinear rate of convergence of IRLS:
S_η(y^n) − S_η* ≤ max { (1/2)^{(n−1)/2} (S_η(y^0) − S_η*), 8 L_1 R² / (n − 1) }.

57 Asymptotic Rate of Convergence. Theorem. There exists K > 0 such that for all n ≥ K + 1,
S_η(y^n) − S_η* ≤ 48 R² / (η (n − K)).
The rate does not depend on the data (s, A_i, b_i). Possibly explains the fast empirical convergence of IRLS.

58 Conditional Gradient. In some cases, prox/projection operators are more difficult to compute than linear oracles. Frank & Wolfe [1956] - a method for minimizing a convex smooth function over a compact polyhedral set. Problem: min{ f(x) : x ∈ C }. Conditional Gradient Method:
x^{k+1} = x^k + α_k (p(x^k) − x^k), where p(x^k) ∈ argmin{ ⟨∇f(x^k), p⟩ : p ∈ C }.
O(1/k) rate of convergence in the original Frank and Wolfe paper. Levitin & Polyak ['66] - extension to arbitrary compact convex sets. Canon & Cullum ['68] showed that the O(1/k) rate is tight even for strongly convex functions.

59 Conditional Gradient. Several variants, e.g., incorporating away steps, can be shown to converge linearly under strong convexity. Bach ['15] showed that the analysis of CG can be extended to the composite model, resulting in the generalized conditional gradient. Problem: min{ f(x) + g(x) : x ∈ C }. Generalized Conditional Gradient Method:
x^{k+1} = x^k + α_k (p(x^k) − x^k), where p(x^k) ∈ argmin{ ⟨∇f(x^k), p⟩ + g(p) : p ∈ C }.
Lacoste-Julien, Jaggi, Schmidt and Pletscher ['13] showed that a randomized block version converges at a rate of O(1/k) (in terms of expected values of the objective function). What about the deterministic case?

61 The Cyclic Block Conditional Gradient Method. General step. For k = 1, 2, ...:
(i) For i = 1, 2, ..., p, compute
p_i^k ∈ argmin_{p_i∈X_i} { ⟨∇_i f(x^{k,i−1}), p_i⟩ + g_i(p_i) },
and then x^{k,i} = x^{k,i−1} + α_i^k U_i (p_i^k − x_i^k).
(ii) Update x^{k+1} = x^{k,p}.
Stepsize strategies:
Predefined diminishing stepsize: α_i^k = α_k = 2/(k + 2).
Adaptive stepsize: α_i^k = min { S_i(x^{k,i−1}) / (L_i ‖p_i^k − x_i^k‖²), 1 }.
Exact stepsize: α_i^k ∈ argmin_{α∈[0,1]} f(x^{k,i−1} + α U_i (p_i^k − x_i^k)).
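A sketch of the CBCG loop with the predefined stepsize α_k = 2/(k + 2) on a toy separable problem (the box-constrained quadratic and the block partition are illustrative assumptions; for a box, the linear oracle returns a vertex):

```python
import numpy as np

def cbcg(c, block_sizes, iters=2000):
    # cyclic block conditional gradient for the toy problem
    #   min 0.5||x - c||^2  s.t. each block x_i in the box [-1, 1]^{n_i}
    x = np.zeros(int(np.sum(block_sizes)))
    offs = np.concatenate(([0], np.cumsum(block_sizes)))
    for k in range(iters):
        alpha = 2.0 / (k + 2.0)           # predefined diminishing stepsize
        for i in range(len(block_sizes)):
            s = slice(offs[i], offs[i + 1])
            grad_i = x[s] - c[s]          # block gradient of 0.5||x - c||^2
            p_i = -np.sign(grad_i)        # linear oracle over the box (a vertex)
            x[s] += alpha * (p_i - x[s])
    return x

c = np.array([0.5, -2.0, 3.0, 0.0])
x_cb = cbcg(c, [2, 2])   # approaches the projection of c onto the box
```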

64 The Optimality Measure. The analysis of conditional gradient methods is based on the following optimality measure:
S(x) ≡ ⟨∇f(x), x − p(x)⟩ + g(x) − g(p(x)), where p(x) ∈ argmin_{p∈X} { ⟨∇f(x), p⟩ + g(p) }.
The analysis of the CBCG method calls for the definition of partial optimality measures:
S_i(x) ≡ max_{p_i∈X_i} { ⟨∇_i f(x), x_i − p_i⟩ + g_i(x_i) − g_i(p_i) }, attained at p_i(x) ∈ argmin_{p_i∈X_i} { ⟨∇_i f(x), p_i⟩ + g_i(p_i) }.
By comparison, convergence of proximal gradient methods is based on the gradient mapping:
G_L(x) ≡ L [ x − prox_{(1/L)g}( x − (1/L) ∇f(x) ) ].

65 Sublinear Rates of Convergence. For all three stepsize rules it can be shown that H(x^k) − H* = O(1/k). The analyses of the three stepsize rules are very different. The result for the exact stepsize rule holds only for quadratic f. The result also holds if after each pass a different order (permutation) of the blocks is picked (cyclic-shuffle). The rate of convergence of S(x^k) to 0 can also be established: min_{n=0,1,2,...,k} S(x^n) = O(1/k).

66 Numerical Results on Synthetic Data. We solve the problem
min_x (1/2)(x − y)^T Q(x − y),
where y ∈ R^100 and Q ∈ R^{100×100}. Generation of Q: Q = X^T X, where each component of X ∈ R^{100×100} is generated by N(0, 1). The entries of y are generated by N(0, 1). We compare the Conditional Gradient (CG), its random block version (RBCG) and its cyclic and cyclic-shuffle block versions (CBCG-C, CBCG-P) with the three different stepsize strategies, based on 1000 randomly generated instances of the problem. The central line is the median over the 1000 runs and the ribbons show the 98%, 90%, 80%, 60% and 40% quantiles. k - number of effective passes through all the coordinates.

67 Comparison of Stepsize Rules and Methods. [Figure: objective accuracy vs. k for CBCG-P, CBCG-C, RBCG and CG under the predefined, adaptive, backtracking and line-search stepsize rules.]

68 Results on Structural SVM. [Figure: duality gap vs. k for CBCG-C, CBCG-P and RBCG, for several values of λ (including 0.01 and 0.1).]

69 Duality? Main question: what does a dual-based variables decomposition method look like?

70 The Model.
min_{x∈E} { f(x) + Σ_{i=1}^p ψ_i(x) },
f : E → (−∞, ∞] - a closed, proper, extended-valued, σ-strongly convex function.
ψ_i : E → (−∞, ∞] (i = 1, 2, ..., p) - closed, proper, extended-real-valued convex functions.
ri(dom f) ∩ ( ∩_{i=1}^p ri(dom ψ_i) ) ≠ ∅.

73 Functional Decomposition - the Idea. At each iteration of a functional decomposition method, an operation involving at most one of the functions ψ_i is performed. Suppose that either problems of the form min_x f(x) + ψ_i(x) + ⟨a, x⟩ can be easily solved, or prox_{ψ_i} can be easily computed. Examples of functional decomposition methods are incremental (sub)gradient methods (Kibardin ['80], Luo & Tseng ['94], Grippo ['94], Bertsekas ['97], Solodov ['98], Nedic & Bertsekas ['00, '01, '10]), incremental subgradient-proximal methods (Bertsekas ['10]) and certain variants of ADMM (Gabay & Mercier) and dual ADMM.

75 Example: 1D Total Variation Denoising. Given a noisy measurements vector y, we want to find a smooth vector x which is the solution to
min_x (1/2)‖x − y‖² + λ Σ_{i=1}^{n−1} |x_i − x_{i+1}|,
where the second term is ψ(x). Equivalent to finding the prox of the TV function. ψ has no useful separability properties. However, we can decompose ψ as ψ = ψ_1 + ψ_2, where
ψ_1(x) = λ Σ_{i=1}^{⌊n/2⌋} |x_{2i−1} − x_{2i}|, ψ_2(x) = λ Σ_{i=1}^{⌊(n−1)/2⌋} |x_{2i} − x_{2i+1}|.
prox_{ψ_1} and prox_{ψ_2} can be easily computed since they are separable w.r.t. pairs of variables (e.g., ψ_1 is separable w.r.t. {x_1, x_2}, {x_3, x_4}, ...).
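A sketch of prox_{ψ_1} and prox_{ψ_2} using this pairwise separability. For a single pair, minimizing 0.5(a − u)² + 0.5(b − v)² + λ|a − b| keeps the pair midpoint and soft-thresholds half the difference; the test vectors below are illustrative:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def prox_pairwise_tv(u, lam, offset=0):
    # prox of lam * sum of |u_j - u_{j+1}| over disjoint pairs
    # (offset=0 gives psi_1, offset=1 gives psi_2); each pair is solved in
    # closed form: midpoint preserved, half-difference soft-thresholded
    x = u.astype(float).copy()
    for i in range(offset, len(u) - 1, 2):
        m = 0.5 * (u[i] + u[i + 1])
        d = soft(0.5 * (u[i] - u[i + 1]), lam)
        x[i], x[i + 1] = m + d, m - d
    return x

x1 = prox_pairwise_tv(np.array([0.0, 1.0, 5.0, 5.0]), lam=0.2)            # psi_1
x2 = prox_pairwise_tv(np.array([0.0, 0.0, 4.0, 0.0]), lam=1.0, offset=1)  # psi_2
```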

77 The Dual Problem. The dual problem of (P) is
(D) max { q(y) ≡ −f*( −Σ_{j=1}^p y_j ) − Σ_{j=1}^p ψ_j*(y_j) }.
In minimization form:
min_{y∈E^p} { H(y) ≡ F(y) + Σ_{i=1}^p Ψ_i(y_i) },
with F(y) ≡ f*( −Σ_{j=1}^p y_j ) - a convex C^{1,1}_{p/σ} function; Ψ_j(y_j) ≡ ψ_j*(y_j) - closed, proper, convex.
Dual block variables decomposition = primal functional decomposition.

78 Two Dual Block Steps. Given ȳ_1, ..., ȳ_p, the objective is to compute ȳ_i^new - a new value of the i-th component - by employing one of the following steps.
Dual exact minimization step:
y_i^new ∈ argmin_{y_i} { f*( −Σ_{j≠i} ȳ_j − y_i ) + ψ_i*(y_i) }
(the value of ȳ_i is not being used).
Dual proximal gradient step:
y_i^new = prox_{σψ_i*}( ȳ_i + σ ∇f*( −Σ_{j=1}^p ȳ_j ) ).

79 Primal Representations of the Dual Block Steps. Using the Moreau decomposition and some conjugate/prox calculus...
Primal representation of the dual exact minimization step:
ỹ_i = Σ_{j≠i} ȳ_j,
x̄ ∈ argmin_{x∈E} { f(x) + ψ_i(x) + ⟨ỹ_i, x⟩ },
y_i^new ∈ ∂ψ_i(x̄).
Primal representation of the dual proximal gradient step:
x̄ = argmin_{x∈E} { f(x) + ⟨Σ_{j=1}^p ȳ_j, x⟩ },
y_i^new = ȳ_i + σ x̄ − σ prox_{ψ_i/σ}( ȳ_i/σ + x̄ ).

80 Dual Cyclic Alternating Minimization Method (DAM-C). Initialization: y^0 = (y_1^0, ..., y_p^0) ∈ E^p. General step (k = 0, 1, 2, ...): set y^{k,0} = y^k. For i = 0, 1, ..., p − 1, define y^{k,i+1} as follows:
x^{k,i} ∈ argmin_{x∈E} { f(x) + ψ_{i+1}(x) + ⟨ Σ_{j≠i+1} y_j^{k,i}, x ⟩ },
y_j^{k,i+1} ∈ ∂ψ_{i+1}(x^{k,i}) if j = i + 1; y_j^{k,i+1} = y_j^{k,i} if j ≠ i + 1.
Set y^{k+1} = y^{k,p} and x^k = x^{k,0}.
If f ∈ C^1, then the update rule for y^{k,i+1} can be replaced by
y_j^{k,i+1} = −∇f(x^{k,i}) − Σ_{j'≠i+1} y_{j'}^{k,i} if j = i + 1; y_j^{k,i+1} = y_j^{k,i} if j ≠ i + 1.

81 Dual Cyclic Block Proximal Gradient Method (DBPG-C). Initialization: y^0 = (y_1^0, ..., y_p^0) ∈ E^p. General step (k = 0, 1, 2, ...): set y^{k,0} = y^k. For i = 0, 1, ..., p − 1, define y^{k,i+1} as follows:
x^{k,i} = argmin_{x∈E} { f(x) + ⟨ Σ_{j=1}^p y_j^{k,i}, x ⟩ },
y_j^{k,i+1} = y_{i+1}^{k,i} + σ x^{k,i} − σ prox_{ψ_{i+1}/σ}( y_{i+1}^{k,i}/σ + x^{k,i} ) if j = i + 1; y_j^{k,i+1} = y_j^{k,i} if j ≠ i + 1.
Set y^{k+1} = y^{k,p} and x^k = x^{k,0}.

84 Rate of Convergence of the Primal Sequence. The rates of convergence of the dual objective function are already known. Do they imply corresponding rates of convergence of the primal sequence? YES!
The primal-dual relation. Let ȳ satisfy ȳ_j ∈ dom ψ_j* for any j ∈ {1, 2, ..., p}. Let x̄ be defined by either
x̄ = argmin_x { f(x) + ⟨ Σ_{j=1}^p ȳ_j, x ⟩ }
or
x̄ ∈ argmin_x { f(x) + ψ_i(x) + ⟨ Σ_{j≠i} ȳ_j, x ⟩ }
for some i ∈ {1, 2, ..., p}. Then
‖x̄ − x*‖² ≤ (2/σ) (q_opt − q(ȳ)).
First implication by [B. & Teboulle '14].

85 Rates of Convergence of Functional Decomposition Methods.
method | complexity result | remarks
DBPG-C | ‖x^k − x*‖² ≤ 2C_1/(σ(k+1)) | Ψ_i indicators, general p
DBPG-C | ‖x^k − x*‖² ≤ 2C_2/(σ(k+1)) | general Ψ_i and p
DBPG-R | E(‖x^k − x*‖²) ≤ 2p C_3/(σ(p+k)) | general Ψ_i and p
DAM-C | ‖x^k − x*‖² ≤ 2C_4/(σk) | p = 2
DAM-C | ‖x^k − x*‖² ≤ 2C_5/(σ(k+1)) | general p and ψ_i
Here C_1, ..., C_5 are explicit constants depending only on p, R, σ, G_max and the initial dual gap q_opt − q(y^0); for example,
C_1 = 2p[(2p + 1)R + σp]²/σ,
C_3 = (1/(2σ)) min_{y*∈Y*} ‖y^0 − y*‖² + q_opt − q(y^0),
C_4 = 3 max{ q_opt − q(y^0), R²/σ }.

86 Numerical Example: Isotropic 2D TV Denoising. TV denoising:
min_{x∈R^{m×n}} (1/2)‖x − b‖_F² + θ TV_I(x).
Isotropic TV:
TV_I(x) = Σ_{i=1}^{m−1} Σ_{j=1}^{n−1} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² ) + Σ_{j=1}^{n−1} |x_{m,j} − x_{m,j+1}| + Σ_{i=1}^{m−1} |x_{i,n} − x_{i+1,n}|.
Chambolle & Pock ['15]: anisotropic (l_1-l_1) TV, decomposition into rows and columns (two functions).

87 Decomposition of Isotropic TV.
TV_I(x) = Σ_{i,j} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
= Σ_{k∈K_1} Σ_{(i,j)∈D_k} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
+ Σ_{k∈K_2} Σ_{(i,j)∈D_k} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
+ Σ_{k∈K_3} Σ_{(i,j)∈D_k} √( (x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})² )
= ψ_1(x) + ψ_2(x) + ψ_3(x),
where D_k is the set of indices of the k-th diagonal and K_i = { k ∈ {−(m−1), ..., n−1} : (k + 1 − i) mod 3 = 0 }, i = 1, 2, 3. Using the separability of the ψ_i, computation of prox_{ψ_i} is simple.

88 Illustration. [Figure: the three groups of diagonals defining ψ_1, ψ_2 and ψ_3.]

89 Numerical Comparison. [Figure: function value vs. iteration for DAM, ADMM, dual ADMM and FISTA.] In the first 100 iterations DAM-C is better than FISTA. However, after enough iterations, FISTA wins...

90 References.
Amir Beck and Luba Tetruashvili, On the Convergence of Block Coordinate Descent Type Methods, SIAM J. Optim. 23 (2013), no. 4.
Amir Beck, On the Convergence of Alternating Minimization for Convex Programming with Applications to Iteratively Reweighted Least Squares and Decomposition Schemes, SIAM J. Optim. 25 (2015), no. 1.
Amir Beck, Edouard Pauwels and Shoham Sabach, The Cyclic Block Conditional Gradient Method for Convex Optimization Problems, SIAM J. Optim. (2015), to appear.
Amir Beck, Luba Tetruashvili and Yakov Vaisbourd, Rate of Convergence Analysis of Dual-Based Variables Decomposition Methods.

91 Last Slide THANK YOU FOR YOUR ATTENTION


Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Sequential convex programming,: value function and convergence

Sequential convex programming,: value function and convergence Sequential convex programming,: value function and convergence Edouard Pauwels joint work with Jérôme Bolte Journées MODE Toulouse March 23 2016 1 / 16 Introduction Local search methods for finite dimensional

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Optimization for Learning and Big Data

Optimization for Learning and Big Data Optimization for Learning and Big Data Donald Goldfarb Department of IEOR Columbia University Department of Mathematics Distinguished Lecture Series May 17-19, 2016. Lecture 1. First-Order Methods for

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Coordinate Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization / Slides adapted from Tibshirani

Coordinate Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization / Slides adapted from Tibshirani Coordinate Descent Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh Convex Optimization 10-725/36-725 Slides adapted from Tibshirani Coordinate descent We ve seen some pretty sophisticated methods

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

A user s guide to Lojasiewicz/KL inequalities

A user s guide to Lojasiewicz/KL inequalities Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Sparse and Regularized Optimization

Sparse and Regularized Optimization Sparse and Regularized Optimization In many applications, we seek not an exact minimizer of the underlying objective, but rather an approximate minimizer that satisfies certain desirable properties: sparsity

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

Convergence of Fixed-Point Iterations

Convergence of Fixed-Point Iterations Convergence of Fixed-Point Iterations Instructor: Wotao Yin (UCLA Math) July 2016 1 / 30 Why study fixed-point iterations? Abstract many existing algorithms in optimization, numerical linear algebra, and

More information

A first-order primal-dual algorithm with linesearch

A first-order primal-dual algorithm with linesearch A first-order primal-dual algorithm with linesearch Yura Malitsky Thomas Pock arxiv:608.08883v2 [math.oc] 23 Mar 208 Abstract The paper proposes a linesearch for a primal-dual method. Each iteration of

More information

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches

Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

arxiv: v3 [math.oc] 29 Jun 2016

arxiv: v3 [math.oc] 29 Jun 2016 MOCCA: mirrored convex/concave optimization for nonconvex composite functions Rina Foygel Barber and Emil Y. Sidky arxiv:50.0884v3 [math.oc] 9 Jun 06 04.3.6 Abstract Many optimization problems arising

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Expanding the reach of optimal methods

Expanding the reach of optimal methods Expanding the reach of optimal methods Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW) BURKAPALOOZA! WCOM

More information

NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization

NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization ESTT: A onconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization Davood Hajinezhad, Mingyi Hong Tuo Zhao Zhaoran Wang Abstract We study a stochastic and distributed algorithm for

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018 Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected

More information

A Tutorial on Primal-Dual Algorithm

A Tutorial on Primal-Dual Algorithm A Tutorial on Primal-Dual Algorithm Shenlong Wang University of Toronto March 31, 2016 1 / 34 Energy minimization MAP Inference for MRFs Typical energies consist of a regularization term and a data term.

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

First-order methods for structured nonsmooth optimization

First-order methods for structured nonsmooth optimization First-order methods for structured nonsmooth optimization Sangwoon Yun Department of Mathematics Education Sungkyunkwan University Oct 19, 2016 Center for Mathematical Analysis & Computation, Yonsei University

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems

The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems J Optim Theory Appl 2014) 162:892 919 DOI 10.1007/s10957-013-0491-5 The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems Amir Beck Received: 11 November 2012

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

9. Dual decomposition and dual algorithms

9. Dual decomposition and dual algorithms EE 546, Univ of Washington, Spring 2016 9. Dual decomposition and dual algorithms dual gradient ascent example: network rate control dual decomposition and the proximal gradient method examples with simple

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems Amir Beck and Shoham Sabach July 6, 2011 Abstract We consider a general class of convex optimization problems

More information

On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems

On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems Radu Ioan Boţ Ernö Robert Csetnek August 5, 014 Abstract. In this paper we analyze the

More information