Tight Rates and Equivalence Results of Operator Splitting Schemes


1 Tight Rates and Equivalence Results of Operator Splitting Schemes. Wotao Yin (UCLA Math). Workshop on Optimization for Modern Computing. Joint work with Damek Davis and Ming Yan. UCLA CAM reports 14-51, 14-58, and ... 1 / 45

2 Operator splitting methods

They are methods for solving problems such as
  minimize_x f(x) + g(x),
  minimize_{x,y} f(x) + g(y) subject to Ax + By = b,
  find x ∈ C_1 ∩ C_2,
by iteratively performing simple operations.

Algorithms: alternating projection, forward-backward splitting (FBS), Douglas-Rachford splitting (DRS), Peaceman-Rachford splitting (PRS), ADMM, etc.

Most of them can be written as x^{k+1} = T(x^k), where
  x* = T(x*) iff x* is a solution;
  T is nonexpansive; in particular, ‖T(x^k) − x*‖² ≤ ‖x^k − x*‖²;
  T is composed of I − γ∇h, prox_{γh}, and refl_{γh}.

2 / 45

3 This talk

Reviews some examples of prox operators and splitting algorithms.
Establishes new convergence results, many of which are tight.
Argues that the convergence of DRS, PRS, and ADMM automatically improves under better regularity properties.
Shows that DRS, PRS, and ADMM are self-dual primal-dual algorithms.

3 / 45

4 Proximal operator

Unlike operators given by explicit formulas, evaluating the prox operator requires solving an optimization problem:
  prox_{λf}(v) := argmin_x f(x) + (1/(2λ)) ‖x − v‖².

Examples:
  f = ι_C: Euclidean projection, prox_f(v) = Proj_C(v);
  closed-form formulas exist for norms and many separable functions.

Relation to the resolvent: prox_{λf} = (I + λ∂f)^{-1}, where f is proper, closed, and convex.
S maximally monotone ⇒ (I + λS)^{-1} is a point-to-point (single-valued) mapping.
Proximal point algorithm (PPA): x^{k+1} = (I + λS)^{-1}(x^k).

4 / 45
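
For concreteness, here is a minimal NumPy sketch (my own illustration, not from the slides) of two prox operators with well-known closed forms: soft-thresholding for λ‖·‖₁ and clipping for the indicator of a box.

```python
import numpy as np

def prox_l1(v, lam):
    """prox of lam*||.||_1: soft-thresholding, a standard closed form."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def proj_box(v, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n, i.e., prox of its indicator."""
    return np.clip(v, lo, hi)

v = np.array([3.0, -0.2, 0.5])
print(prox_l1(v, 1.0))        # [2. 0. 0.]
print(proj_box(v, 0.0, 1.0))  # [1.  0.  0.5]
```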

5 Properties of prox_{λf}

Fixed points are optimal: f(x*) = min_x f(x) ⇔ x* = prox_{γf}(x*).

T = prox_{λf} is firmly nonexpansive, i.e.,
  ‖T(x) − T(y)‖² ≤ ‖x − y‖² − ‖(x − T(x)) − (y − T(y))‖²,
which gives weak convergence in Hilbert space and a rate for the fixed-point residual.

Interpretation: backward Euler / implicit (sub)gradient step:
  x^{k+1} = prox_{λf}(x^k) ⇔ x^{k+1} = (I + λ∂f)^{-1}(x^k) ⇔ x^k ∈ x^{k+1} + λ∂f(x^{k+1}) ⇔ x^{k+1} = x^k − λ∇̃f(x^{k+1}).
(We write ∇̃f for the particular subgradient of f uniquely determined by prox_{λf}.)

Moreau decomposition: x = prox_f(x) + prox_{f*}(x).
For a linear subspace S and f = ι_S, this reduces to x = Proj_S(x) + Proj_{S⊥}(x).

5 / 45

6 Forward-backward splitting (FBS)

  minimize_x r(x) + f(x)

Suppose A = ∂r and B = ∇f (f is differentiable). The optimality condition has the operator form
  0 ∈ (∂r + ∇f)x ⇔ 0 ∈ (A + B)x ⇔ (I − γB)x ∈ (I + γA)x ⇔ x = (I + γA)^{-1}(I − γB)x
(a backward step applied after a forward step).

Prox-gradient (prox-linear) iteration: x^{k+1} = prox_{γr}(x^k − γ∇f(x^k)).
(Sub)gradient form: x^{k+1} = x^k − γ∇̃r(x^{k+1}) − γ∇f(x^k).

6 / 45
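
A minimal sketch of the prox-gradient iteration above, applied to a toy LASSO instance of my own choosing; the problem data and the step-size rule are assumptions for illustration, not from the talk.

```python
import numpy as np

def fbs(grad_f, prox_r, x0, gamma, iters=500):
    """Forward-backward splitting: x <- prox_{gamma r}(x - gamma * grad f(x))."""
    x = x0.copy()
    for _ in range(iters):
        x = prox_r(x - gamma * grad_f(x), gamma)
    return x

# Toy LASSO instance (illustrative data): minimize 0.5*||Ax - b||^2 + mu*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
mu = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_r = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g * mu, 0.0)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L with L = ||A||_2^2
x_hat = fbs(grad_f, prox_r, np.zeros(50), gamma)
```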

7 Reflection operator and averaged operator

Definition: refl_f := 2 prox_f − I.

Subgradient form:
  x_f^k = prox_f(z^k) = z^k − ∇̃f(x_f^k),
  z^{k+1} = refl_f(z^k) = z^k − 2∇̃f(x_f^k).

refl_f is nonexpansive, but not firmly nonexpansive.

Averaged operator: a weighted average of I and a nonexpansive T,
  T_λ := (1 − λ)I + λT, λ ∈ (0, 1].
So prox_f = (refl_f)_{1/2}.

Property: for λ ∈ (0, 1] and all x, y,
  ‖T_λ(x) − T_λ(y)‖² ≤ ‖x − y‖² − ((1 − λ)/λ) ‖(x − T_λ(x)) − (y − T_λ(y))‖².

7 / 45

8 Peaceman-Rachford splitting (PRS)

  minimize_z f(z) + g(z)

Iteration: z^{k+1} = T_PRS(z^k) := refl_{γf} ∘ refl_{γg}(z^k).
Subgradient form: z^{k+1} = z^k − 2γ∇̃f(x_f^k) − 2γ∇̃g(x_g^k).

Diagram (figure): z^k → x_g^k = prox_{γg}(z^k) → refl_{γg}(z^k) → x_f^k = prox_{γf}(refl_{γg}(z^k)) → T_PRS(z^k).

8 / 45

9 Peaceman-Rachford splitting (PRS)

Same diagram as the previous slide, with each arrow labeled by its (sub)gradient step:
  z^k → x_g^k = prox_{γg}(z^k)              (step −γ∇̃g(x_g^k))
      → refl_{γg}(z^k)                       (another −γ∇̃g(x_g^k))
      → x_f^k = prox_{γf}(refl_{γg}(z^k))    (step −γ∇̃f(x_f^k))
      → T_PRS(z^k)                           (another −γ∇̃f(x_f^k)).

9 / 45
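
One PRS step written directly as "reflect, then reflect", following the diagram above; a minimal sketch with generic prox callables (the interface and names are mine).

```python
def prs_step(z, prox_f, prox_g, gamma):
    """One Peaceman-Rachford step: z+ = refl_{gamma f}(refl_{gamma g}(z))."""
    xg = prox_g(z, gamma)         # x_g = prox_{gamma g}(z)
    rg = 2 * xg - z               # refl_{gamma g}(z)
    xf = prox_f(rg, gamma)        # x_f = prox_{gamma f}(refl_{gamma g}(z))
    return 2 * xf - rg, xf, xg    # T_PRS(z), plus the two prox points
```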

10 PRS iteration may not converge

Example: let C_1 be the x_1-axis and C_2 be the x_2-axis, and
  minimize ι_{C_1}(x) + ι_{C_2}(x).
(Figure: the iterates z^even and z^odd alternate between two points around the solution x*.)

PRS converges if one of the two functions is strongly convex.
The most well-known special case of PRS: the method of alternating projection.

10 / 45

11 Douglas-Rachford splitting (DRS) and relaxed PRS

Relaxed PRS: fix z^0, γ > 0, and relaxation parameters (λ_j)_{j≥0} ⊂ (0, 1];
  z^{k+1} = (T_PRS)_{λ_k}(z^k).

DRS corresponds to λ_k ≡ 1/2; it always converges weakly whenever a solution exists.¹
(T_PRS)_{λ_k}: reflect, reflect, λ_k-average.
Fixed points of T_PRS correspond to minimizers of f + g.
prox_{γg}(z^k) converges to a minimizer (proved in 2011 in Banach space).²

¹ Eckstein and Bertsekas, On the Douglas-Rachford Splitting Method and the Proximal Point Algorithm for Maximal Monotone Operators.
² Svaiter, On weak convergence of the Douglas-Rachford method.

11 / 45
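
The relaxed iteration is just a λ_k-average of the PRS step with the current point; a minimal sketch with generic prox callables (names and interface are mine), where λ = 1/2 gives DRS.

```python
def relaxed_prs_step(z, prox_f, prox_g, gamma, lam):
    """z+ = (1 - lam)*z + lam*T_PRS(z); lam = 1/2 is Douglas-Rachford (DRS)."""
    xg = prox_g(z, gamma)             # prox_{gamma g}(z)
    rg = 2 * xg - z                   # refl_{gamma g}(z)
    xf = prox_f(rg, gamma)            # prox_{gamma f}(refl_{gamma g}(z))
    t = 2 * xf - rg                   # T_PRS(z)
    return (1 - lam) * z + lam * t
```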

12 First-order algorithms: subgradient forms

  minimize_x f(x) + g(x)

(Sub)gradient descent:                     z^{k+1} = z^k − γ∇̃f(z^k) − γ∇̃g(z^k).
Proximal point algorithm (PPA):            z^{k+1} = z^k − γ∇̃f(z^{k+1}) − γ∇̃g(z^{k+1}).
Forward-backward splitting (FBS):          z^{k+1} = z^k − γ∇̃f(z^{k+1}) − γ∇g(z^k).
Relaxed Peaceman-Rachford splitting (PRS): z^{k+1} = z^k − 2λ_k (γ∇̃f(x_f^k) + γ∇̃g(x_g^k)).

12 / 45

13 ADMM

  minimize_{x,y} f(x) + g(y)  subject to  Ax + By = b

ADMM iteration:
  1. x^{k+1} = argmin_x f(x) + (w^k)^T Ax + (γ/2)‖Ax + By^k − b‖²;
  2. y^{k+1} = argmin_y g(y) + (w^k)^T By + (γ/2)‖Ax^{k+1} + By − b‖²;
  3. w^{k+1} = w^k + γ(Ax^{k+1} + By^{k+1} − b).

Equivalent to DRS applied to the dual problem.³
Lagrangian: L(x, y; w) = [f(x) + w^T Ax] + [g(y) + w^T By − w^T b] =: L_1(x; w) + L_2(y; w).
Define d_1(w) := −min_x L_1(x; w) and d_2(w) := −min_y L_2(y; w).
Dual problem: minimize_w d_1(w) + d_2(w).

³ Gabay 83.

13 / 45
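
A generic sketch of the three ADMM steps above; the two argmin subroutines are assumed to be supplied by the caller, since their form depends on f, g, A, and B (all names and the interface are mine).

```python
import numpy as np

def admm(argmin_x, argmin_y, A, B, b, gamma, y0, w0, iters=200):
    """ADMM sketch for: minimize f(x) + g(y) subject to Ax + By = b.
    argmin_x(w, y) and argmin_y(w, x) are assumed to solve steps 1 and 2."""
    y, w = y0.copy(), w0.copy()
    for _ in range(iters):
        x = argmin_x(w, y)                       # step 1: x-update
        y = argmin_y(w, x)                       # step 2: y-update
        w = w + gamma * (A @ x + B @ y - b)      # step 3: dual (multiplier) update
    return x, y, w
```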

14 Diagram of ADMM

(Figure: the dual-DRS picture of one ADMM step, showing z^k, w^k = prox_{γd_1}(z^k), refl_{γd_1}(z^k), and z^{k+1} on the segment toward T_PRS(z^k).)

14 / 45

15 Diagram of ADMM

(Same figure, with the arrows annotated by the dual (sub)gradient steps γAx^k and γ(By^{k+1} − b).)

15 / 45

16 Diagram of ADMM

(Same figure, further annotated with w^{k+1} = prox_{γd_1}(z^{k+1}) and the step γAx^{k+1}.)

16 / 45

17 Diagram of ADMM

(Same figure, completed: the step from w^k to w^{k+1} = prox_{γd_1}(z^{k+1}) is γ(Ax^{k+1} + By^{k+1} − b), matching the ADMM multiplier update.)

17 / 45

18 Generally: the Krasnosel'skiĭ-Mann (KM) iteration⁴ ⁵

Definitions: H a Hilbert space; T : H → H nonexpansive; fixed points: z ∈ H such that Tz = z.

Averaged iteration of T (aka KM iteration):
  z^{k+1} = T_{λ_k}(z^k) := (1 − λ_k)z^k + λ_k Tz^k.

Convergence:
  Converges weakly to a fixed point if (λ_k) is bounded away from 0 and 1.
  If T has no fixed point and (λ_k) is bounded away from 0, the sequence (z^j)_{j≥0} is unbounded (Browder-Göhde-Kirk fixed-point theorem).

Special cases: DRS, PRS, ADMM, FBS, PPA, ...

⁴ Krasnosel'skiĭ, Two remarks on the method of successive approximations (1955).
⁵ Mann, Mean value methods in iteration (1953).

18 / 45
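
A direct transcription of the averaged iteration into code (my own helper names); the operator T and the relaxation sequence are supplied by the caller, and the usage example is illustrative.

```python
import numpy as np

def km(T, z0, lambdas, tol=1e-12):
    """Krasnosel'skii-Mann iteration: z^{k+1} = (1 - lam_k) z^k + lam_k T(z^k)."""
    z = np.asarray(z0, dtype=float)
    for lam in lambdas:
        Tz = T(z)
        if np.linalg.norm(Tz - z) ** 2 < tol:   # fixed-point residual ||Tz - z||^2
            break
        z = (1 - lam) * z + lam * Tz
    return z

# Tiny usage example: T is the projection onto the set {z : z <= 1} (nonexpansive).
T = lambda z: np.minimum(z, 1.0)
z_star = km(T, np.array([5.0, -3.0]), [0.5] * 100)   # converges to [1, -3]
```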

19 Part 2: Convergence rates. Fixed-point residual

The fixed-point residual (FPR) of the KM iteration:
  ‖Tz^k − z^k‖² = (1/λ_k²) ‖z^{k+1} − z^k‖².

Tz − z = 0 often means z is optimal; a small FPR implies Tz^k ≈ z^k.
The property Tz^k − z^k → 0 is called asymptotic regularity.
In general, convergence of z^k → z* can be arbitrarily slow.
In optimization, Tz^k − z^k is usually some sort of gradient or subgradient, so it is a dual measure of optimality.
The rate of ‖Tz^k − z^k‖² controls the progress of convergence.
In ADMM: Tz^k − z^k = 2γ(Ax^k + By^k − b).

19 / 45

20 History of FPR

1978 (λ = 1/2): Brézis and Lions⁶ show the FPR satisfies
  ‖Tz^k − z^k‖² = O(1/(k + 1)).
If T = prox_{γf}, then
  ‖Tz^k − z^k‖² = O(1/(k + 1)²).

(General λ): Baillon and Bruck⁷ conjecture O(1/(k + 1)) for nonexpansive maps on Banach spaces.
(General λ): Cominetti, Soto, and Vaisman⁸ prove the conjecture of Baillon and Bruck.

⁶ Produits infinis de résolvantes.
⁷ The rate of asymptotic regularity is O(1/√k).
⁸ On the rate of convergence of Krasnosel'skiĭ-Mann iterations and their connection with sums of Bernoullis.

20 / 45

21 Convergence rates: objective error

(Non-ergodic) error: for minimizing h(x) with minimizer x*, measure
  h(x^k) − h(x*).
Its convergence to zero does not imply strong convergence.
Useful as a filter through which we view the distance to the solution.

Ergodic error: define the ergodic iterates
  x̄^k = (1/Λ_k) Σ_{i=0}^k λ_i x_g^i,  where Λ_k = Σ_{i=0}^k λ_i,
and measure the quantity
  h(x̄^k) − h(x*).

21 / 45

22 History of objective error

1967: Polyak proved the subgradient method achieves O(1/√(k + 1)).
1980s: Nemirovsky and Yudin show a lower complexity bound of Ω(1/√(k + 1)) for a general class of subgradient methods.
1980s(?): gradient descent shown to achieve O(1/(k + 1)).
Nesterov proposed accelerated gradient descent achieving O(1/(k + 1)²).
Güler proved O(1/(k + 1)) convergence for PPA.
Beck and Teboulle proved O(1/(k + 1)) for FBS and proposed an accelerated variant achieving O(1/(k + 1)²).
Goldstein, O'Donoghue, and Setzer proved O(1/(k + 1)) for ADMM when both primal objectives are strongly convex.
Wei and Ozdaglar showed O(1/(k + 1)) ergodic convergence of ADMM with specific binary matrices A and B.
He and Yuan showed O(1/(k + 1)) for a VI-based measure of optimality violation.
Recently: Boţ, Chambolle, Deng, Fadili, Lai, Ma, Monteiro, Peyré, Pock, Svaiter, Zhang, ... on violation of VI and Lagrangian optimality, and the duality gap.

22 / 45

23 Contributions on rates (with Damek Davis)

KM iteration: FPR o(1/k), tight; improved from O to o.
PPA based on prox_f: FPR o(1/k²), tight (by an example in Brezis-Lions '78), improved; objective o(1/k), tight (by an infinite-dimensional example).
FBS based on I − γ∇g and prox_{γf}: same rates as PPA, tight.

23 / 45

24 Relaxed PRS (including DRS and, for some results, also PRS): all results are new

FPR: o(1/k), tight (by an infinite-dimensional example).
Ergodic squared feasibility: O(1/k²), tight (by a 2D example).

Lipschitz f or g:
  ergodic objective: o(1/k), tight (by a 1D example);
  objective: o(1/√k), tight (by an infinite-dimensional example).

Strongly convex f or g: strong sequence convergence; best sequence error o(1/k); ergodic error O(1/k).

Gradient-Lipschitz f or g: best objective o(1/k); with γ chosen properly: objective o(1/k) and FPR o(1/k²).

Strongly convex + gradient Lipschitz (applied to either the same or different functions): all rates (FPR, objective, sequence) are linear.

24 / 45

25 ADMM (as dual DRS)

Dual subdifferentials: ∂d_f = −A ∘ (∂f)^{-1} ∘ (−A^T) and ∂d_g = −B ∘ (∂g)^{-1} ∘ (−B^T).

f strongly convex ⇒ d_f is differentiable with Lipschitz gradient (same for g).
f differentiable and AA^T full-rank ⇒ d_f is strongly convex (same for g).

Translate the results from relaxed PRS to ADMM:
  general case:
    ergodic squared constraint feasibility: O(1/k²);
    squared constraint feasibility: o(1/k);
    ergodic objective: O(1/k);
    objective: o(1/√k);
  strongly convex f or g: squared feasibility o(1/k²), objective o(1/k);
  strongly convex function + gradient-Lipschitz function + full-rank matrix: linear convergence of everything.

Note: results in Deng-Yin 2012 cover more cases.

25 / 45

26 Method of alternating projection for finding x ∈ C_1 ∩ C_2:
  linear regularity: a special case of PRS, all rates linear;
  in general: same rates as relaxed PRS with gradient-Lipschitz objectives;
  results extend to x ∈ C_1 ∩ ... ∩ C_n;
  when C_1 ∩ C_2 = ∅: convergence to the shortest line segment between the two sets.

DRS ("reflect, reflect, average") for finding x ∈ C_1 ∩ C_2:
  in general: produces a sequence of points in each set;
  distance to the other set: general o(1/k), ergodic O(1/k²).

26 / 45
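
A small sketch of the method of alternating projections for two sets; the two sets below (a coordinate hyperplane and an affine constraint) are my own illustrative choices, not from the slides.

```python
import numpy as np

def alternating_projection(proj1, proj2, x0, iters=100):
    """Method of alternating projections for finding a point in C1 ∩ C2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = proj1(proj2(x))
    return x

# Illustrative sets: C1 = {x : x[0] = 0}, C2 = {x : sum(x) = 1}.
proj1 = lambda x: np.concatenate(([0.0], x[1:]))      # zero out the first coordinate
proj2 = lambda x: x + (1.0 - x.sum()) / x.size        # shift onto the hyperplane 1^T x = 1
x = alternating_projection(proj1, proj2, np.array([2.0, -1.0, 3.0]))
```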

27 Results in a nutshell

Essentially tight upper and lower bounds on the fixed-point residual (FPR) for KM iterations.
The relaxed PRS point sequence can converge strongly yet arbitrarily slowly.
Objective convergence:
  On average, relaxed PRS performs as well as PPA.
  In the worst case, relaxed PRS performs nearly as slowly as the subgradient method.
  When g is Lipschitz, DRS performs as well as FBS, yet no knowledge of the Lipschitz constant is needed.

27 / 45

28 Results in a nutshell (continued)

The relaxed PRS algorithm converges linearly whenever one of the objectives is strongly convex and one has a Lipschitz derivative; these can be the same or different functions.
For feasibility problems, relaxed PRS converges linearly under regularity assumptions on the intersection.
For feasibility problems with no regularity, we can generate a point in each set and bound their distance to each other.
ADMM produces similar rates for the objective and the feasibility separately.

28 / 45

29 Part 3: Basic lemma for summable and monotonic sequences

Lemma. Suppose the nonnegative scalar sequences (λ_j)_{j≥0} and (a_j)_{j≥0} satisfy Σ_{i=0}^∞ λ_i a_i < ∞. Let Λ_k := Σ_{i=0}^k λ_i for k ≥ 0.

1. If (a_j)_{j≥0} is monotonically nonincreasing, then
     a_k ≤ (1/Λ_k) Σ_{i=0}^k λ_i a_i   and   a_k = o(1/(Λ_k − Λ_{⌈k/2⌉})).     (1)
   1.1 If (λ_j)_{j≥0} is bounded away from 0 and ∞, then a_k = o(1/(k + 1)).
   1.2 If λ_k = (k + 1)^p for p ≥ 0 and all k ≥ 1, then a_k = o(1/(k + 1)^{p+1}).

2. Suppose that the nonnegative scalar sequence (b_j)_{j≥0} is monotonically nonincreasing and satisfies b_k ≤ λ_k a_k − λ_{k+1} a_{k+1}. Then for all k ≥ 0,
     b_k ≤ (2/((k + 1)(k + 2))) Σ_{i=0}^k λ_i a_i   and   b_k = o(1/(k + 1)²).     (2)

29 / 45
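
A quick numerical sanity check of Part 1.1 with a toy sequence of my own choosing: λ_k ≡ 1 is bounded away from 0 and ∞, a_k = 1/(k+1)² is nonincreasing and summable, and (k+1)·a_k indeed tends to 0, consistent with a_k = o(1/(k+1)).

```python
import numpy as np

k = np.arange(0, 100000)
lam = np.ones_like(k, dtype=float)        # lambda_k = 1
a = 1.0 / (k + 1.0) ** 2                  # nonincreasing, sum(lam * a) < infinity
print(np.sum(lam * a))                    # ~ pi^2 / 6, finite
print(((k + 1) * a)[[10, 1000, 99999]])   # (k+1)*a_k -> 0, i.e., a_k = o(1/(k+1))
```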

30 Intuitions

Every convergence rate follows from this lemma.
The sequence (1/(j + 1))_{j≥0} is not summable, so a_k must decrease faster than 1/(k + 1).
The little-o rate follows because the tail sums Σ_{i=⌈k/2⌉}^{k} λ_i a_i → 0.

Extensions:
  Same assumptions except quasi-monotonicity, a_{k+1} ≤ a_k + e_k: then e_k has to converge one order faster than a_k to preserve the rate.
  Same assumptions but no monotonicity: let k_best := argmin_i {a_i : i = 0, ..., k}; then all the rates hold for a_{k_best} instead of a_k.

30 / 45

31 The idea of the FPR convergence rate

In general:
  The term ‖Tz^k − z^k‖² = (1/λ_k²) ‖z^k − z^{k+1}‖² is monotonic (nonincreasing).
  Furthermore, Σ_{k=0}^∞ λ_k(1 − λ_k) ‖Tz^k − z^k‖² < ∞.
  Thus, the rate is controlled by Σ_{i=0}^k λ_i(1 − λ_i).

When g is Lipschitz (PPA, FBS, DRS):
  We still have monotonicity, but now Σ_{i=0}^∞ (i + 1) ‖Tz^i − z^i‖² < ∞.
  This requires information about the objective functions.

31 / 45

32 Example: PPA convergence rate

  minimize_x h(x)

PPA iteration: z^{k+1} = z^k − γ∇̃h(z^{k+1}).
Minimizer: z*.
Objective error sequence: a_k = h(z^{k+1}) − h(z*).
FPR sequence: b_k = (1/γ) ‖z^{k+2} − z^{k+1}‖².

For any z,
  h(z^{k+1}) − h(z) ≤ ⟨z^{k+1} − z, ∇̃h(z^{k+1})⟩                  ((sub)gradient inequality)
                    = (1/γ) ⟨z^{k+1} − z, z^k − z^{k+1}⟩
                    = (1/(2γ)) (‖z^k − z‖² − ‖z^{k+1} − z‖² − ‖z^{k+1} − z^k‖²).

32 / 45

33

  h(z^{k+1}) − h(z) ≤ (1/(2γ)) (‖z^k − z‖² − ‖z^{k+1} − z‖² − ‖z^{k+1} − z^k‖²).

Nonnegativity: obvious.
Summability: at z = z*,
  a_k ≤ (1/(2γ)) (‖z^k − z*‖² − ‖z^{k+1} − z*‖² − ‖z^{k+1} − z^k‖²)  ⇒  Σ_{k=0}^∞ a_k ≤ (1/(2γ)) ‖z^0 − z*‖².
Monotonicity: at z = z^{k+1} (with the inequality applied at iteration k + 1),
  0 ≤ b_k = (1/γ) ‖z^{k+2} − z^{k+1}‖² ≤ h(z^{k+1}) − h(z^{k+2}) = a_k − a_{k+1}.

By the lemma: a_k = o(1/(k + 1)).
Also, b_k (= FPR) is monotonic ⇒ b_k = o(1/(k + 1)²).

33 / 45

34 A fundamental inequality

Proposition. If z⁺ = (T_PRS)_λ(z), then for all x ∈ dom(f) ∩ dom(g),
  4γλ (f(x_f) + g(x_g) − f(x) − g(x)) ≤ ‖z − x‖² − ‖z⁺ − x‖² + (1 − 1/λ)‖z⁺ − z‖²
                                      = 2⟨z⁺ − x, z − z⁺⟩ + (2 − 1/λ)‖z⁺ − z‖².

Non-ergodic rate: use Cauchy-Schwarz on the inner product.
  The objective error involves both x_f and x_g; it can be negative.
  The inequality also has the other side, i.e., a lower bound.
  Additional regularity properties enable a same-point objective error.
Ergodic rate: sum both sides, divide by Λ_j, and use Jensen's inequality.

34 / 45

35 The other cases

If f or g is Lipschitz, then
  Σ_{k=0}^∞ λ_k (f(x^k) + g(x^k) − f(x*) − g(x*)) < ∞  ⇒  best-point convergence rates.

When λ_k ≡ 1/2 and g is Lipschitz, we construct an auxiliary monotonic sequence that dominates the objective.

Under strong convexity,
  Σ_{k=0}^∞ λ_k ‖x^k − x*‖² < ∞  ⇒  running-best convergence rates.

The feasibility problem and the linear convergence result use the same fundamental inequality.

35 / 45

36 Other applications

More applications in the paper: feasibility; parallelized model fitting; linear programming (linear convergence); semidefinite programming.

36 / 45

37 Part 4: Primal-dual equivalence (with Ming Yan)

Definition: applying the same algorithm to both the primal and the dual problems, with proper initialization and parameters, the iterates of one can be explicitly reconstructed from those of the other.

Eckstein⁹ shows DRS is equivalent to DRS on the dual, for a special case.
Eckstein and Fukushima¹⁰ show ADMM is equivalent to ADMM on the dual, for the special case AA^T = I.
This equivalence is rarely mentioned in the literature.

We extend the result to ADMM and relaxed PRS (including DRS and PRS) for general cases, assuming only convexity and the existence of primal-dual solutions.
We introduce an equivalent primal-dual algorithm for the saddle-point problem.
We establish conditions for the equivalence between ADMMs with swapped orders of subproblems.

⁹ Eckstein, Splitting methods for monotone operators with applications to parallel optimization, PhD thesis.
¹⁰ Eckstein and Fukushima, Some reformulations and applications of the alternating direction, 1994.

37 / 45

38 Remarks

Different splittings lead to different ADMM iterates. Specifically, we consider
  minimize_{x,y} f(x) + g(y)  subject to  Ax + By = b,     (P1)
and its dual
  minimize_v f*(−A^T v) + g*(−B^T v) + ⟨v, b⟩.

ADMM can be applied to (P1) or to the reformulated dual problem
  minimize_{u,v} f*(−A^T u) + (g*(−B^T v) + ⟨v, b⟩)  subject to  u − v = 0.     (D1)

Examples: the YALL1 package¹¹, the l1-l1 model¹², traffic equilibrium¹³, dual alternating projection.

¹¹ J. Yang and Y. Zhang, Alternating direction algorithms for l1-problems in compressive sensing.
¹² Y. Xiao, H. Zhu, S.-Y. Wu, Primal and dual alternating direction algorithms for l1-l1-norm minimization problems in compressive sensing.
¹³ Primal: Fukushima 96; dual: Gabay 83.

38 / 45

39 Remarks

The penalty parameter λ in the primal ADMM becomes λ^{-1} in the dual ADMM; it balances primal and dual progress.
The perfect symmetry between primal and dual ADMMs suggests that ADMM is a primal-dual algorithm applied to a saddle-point formulation.

39 / 45

40 Saddle-point formulation and its algorithm

The original problem (P1) is equivalent to
  min_y max_u g(y) + ⟨u, By − b⟩ − f*(−A^T u).

Primal-Dual Algorithm: initialize u^0, u^{-1}, y^0, λ > 0; for k = 0, 1, ..., do:
  ū^k = 2u^k − u^{k−1};
  y^{k+1} = argmin_y g(y) + (2λ)^{-1} ‖By − By^k + λū^k‖²_2;
  u^{k+1} = argmin_u f*(−A^T u) + (λ/2) ‖u − u^k − λ^{-1}(By^{k+1} − b)‖²_2.

Remarks:
  If B = I, it is equivalent to Chambolle-Pock, whose paper also noted the equivalence between it and ADMM.
  ADMM and PD take the same number of iterations but different flops per iteration.

40 / 45
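
A structural sketch of this primal-dual loop; the two argmin subroutines (whose exact form depends on g, f*, A, and B) are assumed to be supplied by the caller, and all names and the interface are mine.

```python
import numpy as np

def primal_dual(argmin_y, argmin_u, B, b, lam, y0, u0, iters=200):
    """Sketch of the slide's primal-dual iteration.
    argmin_y(v): assumed to return argmin_y g(y) + (2*lam)^{-1} ||B y - v||^2.
    argmin_u(v): assumed to return argmin_u f*(-A^T u) + (lam/2) ||u - v||^2."""
    y, u, u_prev = y0.copy(), u0.copy(), u0.copy()
    for _ in range(iters):
        u_bar = 2 * u - u_prev                    # extrapolated dual point
        y = argmin_y(B @ y - lam * u_bar)         # y-update
        u_prev = u
        u = argmin_u(u + (B @ y - b) / lam)       # u-update (uses the new y)
    return y, u
```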

41 Application: extended monotropic programming

  minimize_{x_1, ..., x_N} Σ_{i=1}^N f_i(x_i)  subject to  Σ_{i=1}^N A_i x_i = b.

Convert the problem into the following ADMM-ready formulation:
  minimize_{{x_i},{y_i}} Σ_{i=1}^N f_i(x_i) + ι_{{y: Σ_{i=1}^N y_i = b}}(y)  subject to  A_i x_i − y_i = 0, i = 1, ..., N.

ADMM: iteratively update {x_i}, {y_i}, {u_i}.
Primal-Dual: iteratively update {y_i}, {u_i}, and at the end recover {x_i}.

41 / 45

42 Assumption: f*(−A^T u) has an easy form, for example when f_i(·) = (1/2)‖·‖², A_i ∈ R^{m×n_i}, and A_i A_i^T = I.

For each iteration k and block i:
  ADMM: 10m + 2m n_i flops;
  PD: 10m flops, due to the hiding of x_i.

Pre/post-processing:
  ADMM has a pre-step of m n_i flops for each i;
  PD has a post-step of m n_i flops for each i.

Distributed computing:
  Same communication for ADMM and PD;
  PD has better load balance since its per-iteration flop count is independent of n_i.

42 / 45

43 Swap the x/y-update order

Two similar ADMMs on the same problem:
  ADMM 1 updates y, then x, then the dual variable z;
  ADMM 2 updates x, then y, then the dual variable z.
In general they produce different iterates, but there are exceptions. Define
  F(s) := min_x f(x) + ι_{{x: Ax = s}}(x),          (3a)
  G(t) := min_y g(y) + ι_{{y: By = b − t}}(y).       (3b)

Theorem.
  1. Assume prox_G is affine. Given the iterates of ADMM 2, if z_2^0 ∈ ∂G(b − By_2^0), then the iterates of ADMM 1 can be recovered as
       x_1^k = x_2^{k+1},  z_1^k = z_2^k + λ^{-1}(Ax_2^k + By_2^k − b).
  2. Assume prox_F is affine. Given the iterates of ADMM 1, if z_1^0 ∈ ∂G(Ax_1^0), then the iterates of ADMM 2 can be recovered as
       y_2^k = y_1^{k+1},  z_2^k = z_1^k + λ^{-1}(Ax_1^k + By_1^{k+1} − b).

43 / 45

44 Affine proximal mapping

Definition. A mapping T is affine if, for any r_1 and r_2,
  T((1/2) r_1 + (1/2) r_2) = (1/2) T r_1 + (1/2) T r_2.

Proposition. Let G be a proper, closed, convex function. The following statements are equivalent:
  1. prox_G(·) is affine;
  2. prox_{λG}(·) is affine for λ > 0;
  3. a·prox_G(·) + b·I + c is affine for any scalars a, b, and c;
  4. prox_{G*}(·) is affine;
  5. G is convex quadratic (or affine, or constant) and has an affine domain (either the whole space or an intersection of hyperplanes).

If the function g obeys Part 5, then G defined in (3b) satisfies Part 5, too.

44 / 45
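
A quick numerical check (my own example) of the direction "convex quadratic implies affine prox": for G(x) = (1/2) x^T Q x, prox_G(v) = (I + Q)^{-1} v, which is linear and hence affine.

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])                              # convex quadratic G(x) = 0.5 x^T Q x
prox_G = lambda v: np.linalg.solve(np.eye(2) + Q, v)    # prox_G(v) = (I + Q)^{-1} v

r1, r2 = np.array([1.0, -2.0]), np.array([0.5, 3.0])
# Affine (here, linear) mappings preserve midpoints:
assert np.allclose(prox_G(0.5 * r1 + 0.5 * r2),
                   0.5 * prox_G(r1) + 0.5 * prox_G(r2))
```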

45 Conclusion

Our work:
  Analyzed relaxed PRS, ADMM, and KM iterations.
  Provided worst-case non-asymptotic convergence analysis.
  Provided lower complexity bounds for the basic rates.
  Showed the limitations of the methods.
  Established primal-dual equivalence and conditions for order-swapping equivalence.

Reflections:
  The methods are essentially nonexpansive operator splitting iterations applied to the optimality conditions of the original problem.
  When a splitting method uses points other than z^k, it lacks an objective function to decrease monotonically or to exploit for acceleration.
  Splitting methods based on implicit steps automatically adjust to the regularity properties present. (That is why they are fast in practice.)

45 / 45
