arxiv: v1 [math.oc] 21 Apr 2016


Accelerated Douglas–Rachford methods for the solution of convex-concave saddle-point problems

Kristian Bredies, Hongpeng Sun

April 2016

Abstract

We study acceleration and preconditioning strategies for a class of Douglas–Rachford methods aiming at the solution of convex-concave saddle-point problems associated with Fenchel–Rockafellar duality. While the basic iteration converges weakly in Hilbert space with $O(1/k)$ ergodic convergence of restricted primal-dual gaps, acceleration can be achieved under strong-convexity assumptions. Namely, if either the primal or dual functional in the saddle-point formulation is strongly convex, then the method can be modified to yield $O(1/k^2)$ ergodic convergence. In case of both functionals being strongly convex, similar modifications lead to an asymptotic convergence of $O(\vartheta^k)$ for some $0 < \vartheta < 1$. All methods allow in particular for preconditioning, i.e., the inexact solution of the implicit linear step in terms of linear splitting methods with all convergence rates being maintained. The efficiency of the proposed methods is verified and compared numerically, especially showing competitiveness with respect to state-of-the-art accelerated algorithms.

Key words. Saddle-point problems, Douglas–Rachford splitting, acceleration strategies, linear preconditioners, convergence analysis, primal-dual gap.

AMS subject classifications. 65K10, 49K35, 90C25, 65F08.

1 Introduction

In this work we are concerned with Douglas–Rachford-type algorithms for solving the saddle-point problem
\[ \min_{x \in \operatorname{dom} F} \ \max_{y \in \operatorname{dom} G} \ \langle Kx, y \rangle + F(x) - G(y) \tag{1} \]
where $F : X \to {]{-\infty}, \infty]}$, $G : Y \to {]{-\infty}, \infty]}$ are proper, convex and lower semi-continuous functionals on the real Hilbert spaces $X$ and $Y$, respectively, and $K : X \to Y$ is a linear and continuous mapping. Problems of this type commonly occur as primal-dual formulations in the context of Fenchel–Rockafellar duality.
In case the latter holds, solutions of (1) are exactly solutions of the primal-dual problem
\[ \min_{x \in X} \ F(x) + G^*(Kx), \qquad \max_{y \in Y} \ -F^*(-K^*y) - G(y) \tag{2} \]
with $F^*$ and $G^*$ denoting the Fenchel conjugate functionals of $F$ and $G$, respectively. We propose and study a class of Douglas–Rachford iterative algorithms for the solution of (1) with a focus on acceleration and preconditioning strategies.

Kristian Bredies: Institute for Mathematics and Scientific Computing, University of Graz, Heinrichstraße 36, A-8010 Graz, Austria. kristian.bredies@uni-graz.at. The Institute for Mathematics and Scientific Computing is a member of NAWI Graz.
Hongpeng Sun: Institute for Mathematical Sciences, Renmin University of China, No. 59, Zhongguancun Street, Haidian District, 100872 Beijing, People's Republic of China. hpsun@amss.ac.cn.

The main aspects are threefold:

First, the algorithms only utilize specific implicit steps, the proximal mappings of $F$ and $G$ as well as the solution of linear equations. In particular, they converge independently of the choice of step-sizes associated with these operations. Here, convergence of the iterates is in the weak sense and we have an $O(1/k)$ rate for restricted primal-dual gaps evaluated at associated ergodic sequences.

Second, in case of strong convexity of $F$ or $G$, the iteration can be modified to yield accelerated convergence. Specifically, in case of strongly convex $F$, the associated primal sequence $\{x^k\}$ converges strongly at rate $O(1/k^2)$ in terms of the quadratic norm distance while there is weak convergence for the associated dual sequence $\{y^k\}$. The rate $O(1/k^2)$ also follows for restricted primal-dual gaps evaluated at associated ergodic sequences. Moreover, in case of $F$ and $G$ both being strongly convex, another accelerating modification yields strongly convergent primal and dual sequences with a rate of $o(\vartheta^k)$ in terms of squared norms for some $0 < \vartheta < 1$. Analogously, the full primal-dual gap converges with the rate $O(\vartheta^k)$ when evaluated at associated ergodic sequences.

Third, both the basic and accelerated versions of the Douglas–Rachford iteration may be amended by preconditioning allowing for inexact solution of the implicit linear step. Here, preconditioning is incorporated by performing one or finitely many steps of a linear operator splitting method instead of solving the full implicit equation in each iteration step. For appropriate preconditioners, the Douglas–Rachford method as well as its accelerated variants converge with the same asymptotic rates.

To the best knowledge of the authors, this is the first time that accelerated Douglas–Rachford-type methods for the solution of (1) are proposed for which the optimal rate of $O(1/k^2)$ can be proven.
Nevertheless, a lot of related research has already been carried out regarding the solution of (1) or (2), the design and convergence analysis of Douglas–Rachford-type methods as well as acceleration and preconditioning strategies. For the solution of (1), first-order primal-dual methods are commonly used, such as the popular primal-dual iteration of [9] which performs explicit steps involving the operator $K$ and its adjoint as well as proximal steps with respect to $F$ and $G$. Likewise, a commonly-used approach for solving the primal problem in (2) is the alternating direction method of multipliers (ADMM) in which the arguments of $F$ and $G$ are treated independently and coupled by linear constraints involving the operator $K$, see [3]. The latter is often identified as a Douglas–Rachford iteration on the dual problem in (2), see [3, 4]; however, as this iteration works, in its most general form, for maximally monotone operators (see [3]), it may also be applied directly to the optimality conditions for (1), see [7], as is also done in this paper. Besides this application, Douglas–Rachford-type algorithms are applied for the minimization of sums of convex functionals (see [], for instance), potentially with additional structure (see [, ], for instance). Regarding acceleration strategies, the work [7] first introduced approaches to minimize a class of possibly non-smooth convex functionals with optimal complexity $O(1/k^2)$ on the functional error. Since then, several accelerated algorithms have been designed, most notably FISTA [] which is applicable for the primal problem in (2) when $G$ is smooth, and again, the acceleration strategy presented in [9, 0], which is applicable when $F$ is strongly convex and yields, as in this paper, optimal rates for restricted primal-dual gaps evaluated at ergodic sequences.
Recently, an accelerated Douglas–Rachford method for the minimization of the primal problem in (2) was proposed in [0], which, however, requires $G$ to be quadratic and involves step-size constraints. In contrast, the acceleration strategies for the solution of (1) in the present paper only require strong convexity of the primal (or dual) functional, which are the same assumptions as in [9]. We also refer to [6, 9] for accelerated ADMM-type algorithms as well as to [8, ] for a further overview on acceleration strategies. Finally, preconditioning techniques for the solution of (1) or (2) can, for instance, be found in [] and are based on modifying the Hilbert-space norms, leading to different proximal mappings. In particular, for non-linear proximal mappings, this approach can be quite limiting with respect to the choice of preconditioners and often, only diagonal preconditioners are applicable in practice.

The situation is different for linear mappings where a variety of preconditioners is applicable. This circumstance has been exploited in [4, 5] where Douglas–Rachford methods for the solution of (1) have been introduced that realize one or finitely many iteration steps of classical linear operator splitting methods, as is also done in the present work. The latter allows in particular for the inexact solution of the implicit linear step without losing convergence properties or introducing any kind of error control. Besides these approaches, the Douglas–Rachford iteration and related methods may be preconditioned by so-called metric selection, see [5]. That work focuses, however, on obtaining linear convergence only and needs restrictive strong convexity and smoothness assumptions on the involved functionals. In contrast, the preconditioning approaches in the present work are applicable for the proposed Douglas–Rachford methods for the solution of (1) and, in particular, for the accelerated variants without any additional assumptions.

The outline of the paper is as follows. First, in Section 2, the Douglas–Rachford-type algorithm is derived from the optimality conditions for (1) which serves as a basis for acceleration. We prove weak convergence to saddle-points, an asymptotic ergodic rate of $O(1/k)$ for restricted primal-dual gaps and briefly discuss preconditioning. This iteration scheme is then modified, in Section 3, to yield accelerated iterations in case of strong convexity of the involved functionals. In particular, we obtain the ergodic rates $O(1/k^2)$ and $O(\vartheta^k)$, respectively, in case that one or both functionals are strongly convex. For these accelerated schemes, preconditioning is again discussed. Numerical experiments and comparisons are then shown in Section 4 for the accelerated schemes and variational image denoising with quadratic discrepancy and total variation (TV) as well as Huber-regularized total variation.
We conclude with a summary of the convergence results and an outlook in Section 5.

2 The basic Douglas–Rachford iteration

To set up for the accelerated Douglas–Rachford methods, let us first discuss the basic solution strategy for problem (1). We proceed by writing the iteration scheme as a proximal point algorithm associated with a degenerate metric, similar to [4, 5]. Here, however, the reformulation differs in the way the linear and non-linear terms are treated and by the fact that step-size operators are introduced. The latter allow, in particular, for different primal and dual step-sizes, a feature that is crucial for acceleration, as will turn out in Section 3.

The proposed Douglas–Rachford iteration is derived as follows. First observe that in terms of subdifferentials, saddle-point pairs $(x^*, y^*) \in X \times Y$ can be characterized by the inclusion relation
\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix} \in \begin{pmatrix} 0 & K^* \\ -K & 0 \end{pmatrix} \begin{pmatrix} x^* \\ y^* \end{pmatrix} + \begin{pmatrix} \partial F & 0 \\ 0 & \partial G \end{pmatrix} \begin{pmatrix} x^* \\ y^* \end{pmatrix}. \tag{3} \]
This may, in turn, be written as finding a root of the sum of two monotone operators: Denoting $Z = X \times Y$, $z = (x, y)$, we find the equivalent representation
\[ 0 \in Az + Bz \]
with the data
\[ Az = \begin{pmatrix} 0 & K^* \\ -K & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad Bz = \begin{pmatrix} \partial F & 0 \\ 0 & \partial G \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}. \tag{4} \]
Both $A$ and $B$ are maximally monotone operators on $Z$. Since $A$ has full domain, also $A + B$ is maximally monotone [6]. Among the many possibilities for solving the root-finding problem $0 \in Az + Bz$, the Douglas–Rachford method is fully implicit, i.e., a reformulation in terms of the proximal point iteration suggests itself. This can indeed be achieved with a degenerate metric on $Z \times Z$. Introduce a step-size operator $\Sigma : Z \to Z$ which is assumed to be a continuous, symmetric, positive operator.

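Then, setting $\Sigma^{-1} \tilde z \in Az$, we see that
\[ 0 \in Az + Bz \quad \Leftrightarrow \quad \begin{cases} 0 \in Bz + \Sigma^{-1} \tilde z, \\ \Sigma^{-1} \tilde z \in Az \end{cases} \quad \Leftrightarrow \quad \begin{cases} 0 \in Bz + \Sigma^{-1} \tilde z, \\ 0 \in -\Sigma^{-1} z + \Sigma^{-1} A^{-1} \Sigma^{-1} \tilde z. \end{cases} \]
Hence, one arrives at the equivalent problem
\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix} \in \mathcal{A} \begin{pmatrix} z \\ \tilde z \end{pmatrix}, \qquad \mathcal{A} = \begin{pmatrix} B & \Sigma^{-1} \\ -\Sigma^{-1} & \Sigma^{-1} A^{-1} \Sigma^{-1} \end{pmatrix}. \tag{5} \]
We employ the proximal point algorithm with respect to the degenerate metric
\[ M = \begin{pmatrix} \Sigma^{-1} & -\Sigma^{-1} \\ -\Sigma^{-1} & \Sigma^{-1} \end{pmatrix}, \]
which corresponds to
\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix} \in M \begin{pmatrix} z^{k+1} - z^k \\ \tilde z^{k+1} - \tilde z^k \end{pmatrix} + \mathcal{A} \begin{pmatrix} z^{k+1} \\ \tilde z^{k+1} \end{pmatrix}, \tag{6} \]
or, equivalently,
\[ \begin{cases} z^{k+1} = (\mathrm{id} + \Sigma B)^{-1} (z^k - \tilde z^k), \\ \tilde z^{k+1} = (\mathrm{id} + A^{-1} \Sigma^{-1})^{-1} (2 z^{k+1} - z^k + \tilde z^k). \end{cases} \]
Using Moreau's identity and substituting $\bar z^k = z^k - \tilde z^k$, one indeed arrives at the Douglas–Rachford iteration
\[ \begin{cases} z^{k+1} = (\mathrm{id} + \Sigma B)^{-1} (\bar z^k), \\ \bar z^{k+1} = \bar z^k + (\mathrm{id} + \Sigma A)^{-1} (2 z^{k+1} - \bar z^k) - z^{k+1}. \end{cases} \tag{7} \]
Plugging in the data (4) as well as setting, for $\sigma, \tau > 0$,
\[ \Sigma = \begin{pmatrix} \sigma\, \mathrm{id} & 0 \\ 0 & \tau\, \mathrm{id} \end{pmatrix}, \tag{8} \]
the iterative scheme (7) turns out to be equivalent to
\[ \begin{cases} x^{k+1} = (\mathrm{id} + \sigma \partial F)^{-1} (\bar x^k), \\ y^{k+1} = (\mathrm{id} + \tau \partial G)^{-1} (\bar y^k), \\ \bar x^{k+1} = x^{k+1} - \sigma K^* (y^{k+1} + \bar y^{k+1} - \bar y^k), \\ \bar y^{k+1} = y^{k+1} + \tau K (x^{k+1} + \bar x^{k+1} - \bar x^k). \end{cases} \tag{9} \]
The latter two equations are coupled but can easily be decoupled, resulting, e.g., with denoting $d^{k+1} = x^{k+1} + \bar x^{k+1} - \bar x^k$, in an update scheme as shown in Table 1. With the knowledge of the resolvents $(\mathrm{id} + \sigma \partial F)^{-1}$, $(\mathrm{id} + \tau \partial G)^{-1}$, $(\mathrm{id} + \sigma\tau K^* K)^{-1}$, the iteration can be computed.

2.1 Convergence analysis

Regarding convergence of the iteration (9), introduce the Lagrangian $L : \operatorname{dom} F \times \operatorname{dom} G \to \mathbf{R}$ associated with the saddle-point problem (1):
\[ L(x, y) = \langle Kx, y \rangle + F(x) - G(y). \]
If Fenchel–Rockafellar duality holds, then we may consider restricted primal-dual gaps $\mathfrak{G}_{X_0 \times Y_0} : X \times Y \to {]{-\infty}, \infty]}$ associated with $X_0 \times Y_0 \subset \operatorname{dom} F \times \operatorname{dom} G$ given by
\[ \mathfrak{G}_{X_0 \times Y_0}(x, y) = \sup_{(x', y') \in X_0 \times Y_0} L(x, y') - L(x', y). \]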
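The decoupled form of iteration (9) can be sketched in a few lines for scalar data. The following is only an illustration under assumptions that are not part of the paper: $X = Y = \mathbf{R}$, $K$ is multiplication by a number `kappa`, $F(x) = \frac12 (x - f)^2$ and $G$ is the indicator functional of $[-\alpha, \alpha]$, so that both resolvents have closed forms and $(\mathrm{id} + \sigma\tau K^*K)^{-1}$ reduces to a division. All function and parameter names are hypothetical.

```python
def prox_quadratic(v, sigma, f):
    """Resolvent (id + sigma dF)^{-1}(v) for the assumed F(x) = 0.5 * (x - f)**2."""
    return (v + sigma * f) / (1.0 + sigma)


def prox_box(v, alpha):
    """Resolvent (id + tau dG)^{-1}(v) for the assumed G = indicator of [-alpha, alpha]."""
    return max(-alpha, min(alpha, v))


def douglas_rachford(kappa, f, alpha, sigma=1.0, tau=1.0, iters=500):
    """Run the decoupled form of iteration (9) for scalar data (illustrative sketch)."""
    x_bar, y_bar = 0.0, 0.0  # initial guess
    x, y = 0.0, 0.0
    for _ in range(iters):
        x = prox_quadratic(x_bar, sigma, f)
        y = prox_box(y_bar, alpha)
        # right-hand side of the implicit linear step
        b = (2.0 * x - x_bar) - sigma * kappa * (2.0 * y - y_bar)
        # (id + sigma*tau*K^*K)^{-1} b is a plain division for scalar K
        d = b / (1.0 + sigma * tau * kappa * kappa)
        x_bar = x_bar - x + d
        y_bar = y + tau * kappa * d
    return x, y, x_bar, y_bar
```

For `kappa = 2`, `f = 3`, `alpha = 0.5`, the primal problem is $\min_x \frac12 (x-3)^2 + |x|$, whose minimizer is obtained by soft-thresholding, so the iterates can be compared against a known saddle-point.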

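Table 1: The basic Douglas–Rachford iteration for the solution of convex-concave saddle-point problems of type (1).

    DR
    Objective:       Solve $\min_{x \in \operatorname{dom} F} \max_{y \in \operatorname{dom} G} \ \langle Kx, y \rangle + F(x) - G(y)$
    Initialization:  $(\bar x^0, \bar y^0) \in X \times Y$ initial guess, $\sigma > 0$, $\tau > 0$ step sizes
    Iteration:       $x^{k+1} = (\mathrm{id} + \sigma \partial F)^{-1} (\bar x^k)$
                     $y^{k+1} = (\mathrm{id} + \tau \partial G)^{-1} (\bar y^k)$
                     $b^{k+1} = (2 x^{k+1} - \bar x^k) - \sigma K^* (2 y^{k+1} - \bar y^k)$
                     $d^{k+1} = (\mathrm{id} + \sigma\tau K^* K)^{-1} b^{k+1}$
                     $\bar x^{k+1} = \bar x^k - x^{k+1} + d^{k+1}$
                     $\bar y^{k+1} = y^{k+1} + \tau K d^{k+1}$
    Output:          $\{(x^k, y^k)\}$ primal-dual sequence

Letting $(x^*, y^*)$ be a primal-dual solution pair of (1), we furthermore consider the following restricted primal and dual error functionals
\[ \mathfrak{E}^p_{Y_0}(x) = \Bigl[ F(x) + \sup_{y' \in Y_0} \langle Kx, y' \rangle - G(y') \Bigr] - \Bigl[ F(x^*) + \sup_{y' \in Y_0} \langle Kx^*, y' \rangle - G(y') \Bigr], \]
\[ \mathfrak{E}^d_{X_0}(y) = \Bigl[ G(y) + \sup_{x' \in X_0} \langle -K^* y, x' \rangle - F(x') \Bigr] - \Bigl[ G(y^*) + \sup_{x' \in X_0} \langle -K^* y^*, x' \rangle - F(x') \Bigr]. \]
The special case $X_0 = \operatorname{dom} F$, $Y_0 = \operatorname{dom} G$ gives the well-known primal-dual gap as well as the primal and dual error, i.e.,
\[ \mathfrak{G}(x, y) = F(x) + G^*(Kx) + G(y) + F^*(-K^* y), \]
\[ \mathfrak{E}^p(x) = F(x) + G^*(Kx) - \bigl[ F(x^*) + G^*(Kx^*) \bigr], \qquad \mathfrak{E}^d(y) = G(y) + F^*(-K^* y) - \bigl[ G(y^*) + F^*(-K^* y^*) \bigr], \]
where $F^*$ and $G^*$ are again the Fenchel conjugates. For this reason, the study of differences of the Lagrangian is appropriate for convergence analysis. We will later find situations where $\mathfrak{G}_{X_0 \times Y_0}$, $\mathfrak{E}^p_{Y_0}$ and $\mathfrak{E}^d_{X_0}$ coincide, for the iterates of the proposed methods, with $\mathfrak{G}$, $\mathfrak{E}^p$ and $\mathfrak{E}^d$, respectively, for $X_0$ and $Y_0$ not corresponding to the full domains of $F$ and $G$. However, the restricted variants are already suitable for measuring optimality, as the following proposition shows.

Proposition 1. Let $X_0 \times Y_0 \subset \operatorname{dom} F \times \operatorname{dom} G$ contain a saddle-point $(x^*, y^*)$ of (1). Then, $\mathfrak{G}_{X_0 \times Y_0}$, $\mathfrak{E}^p_{Y_0}$ and $\mathfrak{E}^d_{X_0}$ are non-negative and we have
\[ \mathfrak{G}_{X_0 \times Y_0}(x, y) = \mathfrak{E}^p_{Y_0}(x) + \mathfrak{E}^d_{X_0}(y), \qquad \mathfrak{E}^p_{Y_0}(x) \le \mathfrak{G}_{\{x^*\} \times Y_0}(x, y), \qquad \mathfrak{E}^d_{X_0}(y) \le \mathfrak{G}_{X_0 \times \{y^*\}}(x, y) \]
for each $(x, y) \in X \times Y$.

Proof. We first show the statements for $\mathfrak{G}_{X_0 \times Y_0}$. As $(x^*, y^*)$ is a saddle-point, $L(x, y^*) - L(x^*, y) \ge 0$ for any $(x, y) \in X \times Y$, hence $\mathfrak{G}_{X_0 \times Y_0}(x, y) \ge 0$.
By optimality of saddle-points, it follows that
\[ G^*(Kx^*) = \langle Kx^*, y^* \rangle - G(y^*) \le \sup_{y' \in Y_0} \langle Kx^*, y' \rangle - G(y') \le \langle Kx^*, y^* \rangle - G(y^*), \]
so equality holds throughout, and analogously, $F^*(-K^* y^*) = \sup_{x' \in X_0} \langle -K^* y^*, x' \rangle - F(x')$. By Fenchel–Rockafellar duality, $F(x^*) + G^*(Kx^*) + G(y^*) + F^*(-K^* y^*) = 0$, hence
\[ \mathfrak{E}^p_{Y_0}(x) + \mathfrak{E}^d_{X_0}(y) = F(x) + \sup_{y' \in Y_0} \bigl[ \langle Kx, y' \rangle - G(y') \bigr] + G(y) + \sup_{x' \in X_0} \bigl[ \langle -K^* y, x' \rangle - F(x') \bigr] = \sup_{(x', y') \in X_0 \times Y_0} L(x, y') - L(x', y) = \mathfrak{G}_{X_0 \times Y_0}(x, y). \]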

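Next, observe that optimality of $(x^*, y^*)$ and the subgradient inequality imply
\[ \mathfrak{E}^p_{Y_0}(x) \ge F(x) + \langle Kx, y^* \rangle - G(y^*) - F(x^*) - \langle Kx^*, y^* \rangle + G(y^*) = F(x) - F(x^*) - \langle -K^* y^*, x - x^* \rangle \ge 0. \]
Analogously, one sees that $\mathfrak{E}^d_{X_0} \ge 0$. Using the non-negativity and the already-proven identity for $\mathfrak{G}_{X_0 \times Y_0}$ yields
\[ \mathfrak{E}^p_{Y_0}(x) \le \mathfrak{E}^p_{Y_0}(x) + \mathfrak{E}^d_{\{x^*\}}(y) = \mathfrak{G}_{\{x^*\} \times Y_0}(x, y) \]
and the analogous estimate $\mathfrak{E}^d_{X_0}(y) \le \mathfrak{G}_{X_0 \times \{y^*\}}(x, y)$.

We are particularly interested in the restricted error functionals associated with bounded sets containing a saddle-point. These can, however, coincide with the full functionals under mild assumptions.

Lemma 1. Suppose a saddle point $(x^*, y^*)$ of $L$ exists.
1. If $F$ is strongly coercive, i.e., $F(x)/\|x\| \to \infty$ if $\|x\| \to \infty$, then for each bounded $Y_0 \subset \operatorname{dom} G$ there is a bounded $X_0 \subset \operatorname{dom} F$ such that $\mathfrak{E}^d = \mathfrak{E}^d_{X_0}$ on $Y_0$.
2. If $G$ is strongly coercive, then for each bounded $X_0 \subset \operatorname{dom} F$ there is a bounded $Y_0 \subset \operatorname{dom} G$ such that $\mathfrak{E}^p = \mathfrak{E}^p_{Y_0}$ on $X_0$.
3. If both $F$ and $G$ are strongly coercive, then for bounded $X_0 \times Y_0 \subset \operatorname{dom} F \times \operatorname{dom} G$ there exist bounded $X_0' \times Y_0' \subset \operatorname{dom} F \times \operatorname{dom} G$ such that $\mathfrak{G} = \mathfrak{G}_{X_0' \times Y_0'}$ on $X_0 \times Y_0$.

Proof. Clearly, it suffices to show the first statement as the second follows by analogy and the third by combining the first and the second. For this purpose, let $Y_0 \subset \operatorname{dom} G$ be bounded, i.e., $\|y\| \le C$ for each $y \in Y_0 \cup \{y^*\}$ and some $C > 0$ independent of $y$. Pick $x_0 \in \operatorname{dom} F$ and choose $M \ge \|x_0\|$ such that $F(x') \ge C \|K\| (\|x'\| + \|x_0\|) + F(x_0)$ for each $\|x'\| > M$, which is possible by strong coercivity of $F$. Then, for all $y \in Y_0 \cup \{y^*\}$ and $\|x'\| > M$,
\[ \langle -K^* y, x' \rangle - F(x') \le \|K\| \|x'\| \|y\| - C \|K\| \|x'\| - C \|K\| \|x_0\| - F(x_0) \le \langle -K^* y, x_0 \rangle - F(x_0), \]
hence the supremum $F^*(-K^* y) = \sup_{x' \in \operatorname{dom} F} \langle -K^* y, x' \rangle - F(x')$ is attained on the bounded set $X_0 = \{ \|x'\| \le M \}$. This shows $\mathfrak{E}^d = \mathfrak{E}^d_{X_0}$ on $Y_0$.

As the next step preparing the convergence analysis, let us denote the symmetric bilinear form associated with the positive semi-definite operator $M$:
\[ \langle (z, \tilde z), (z', \tilde z') \rangle_M = \Bigl\langle M \begin{pmatrix} z \\ \tilde z \end{pmatrix}, \begin{pmatrix} z' \\ \tilde z' \end{pmatrix} \Bigr\rangle = \langle \Sigma^{-1} (z - \tilde z), z' - \tilde z' \rangle. \]
With the choice (8) and the substitution $\bar z = z - \tilde z = (\bar x, \bar y)$ for $(z, \tilde z) = (x, y, \tilde x, \tilde y)$, this becomes, with a slight abuse of notation,
\[ \langle (z, \tilde z), (z', \tilde z') \rangle_M = \langle \bar z, \bar z' \rangle_M = \frac{1}{\sigma} \langle \bar x, \bar x' \rangle + \frac{1}{\tau} \langle \bar y, \bar y' \rangle. \]
Naturally, the induced squared norm will be denoted by $\|(z, \tilde z)\|_M^2 = \|\bar z\|_M^2 = \frac{1}{\sigma} \|\bar x\|^2 + \frac{1}{\tau} \|\bar y\|^2$.

Lemma 2. For each $k \in \mathbf{N}$ and $z = (x, y) \in \operatorname{dom} F \times \operatorname{dom} G$, iteration (9) satisfies
\[ \begin{aligned} L(x^{k+1}, y) - L(x, y^{k+1}) &\le \frac{\|\bar z^k - (\mathrm{id} - \Sigma A) z\|_M^2}{2} - \frac{\|\bar z^{k+1} - (\mathrm{id} - \Sigma A) z\|_M^2}{2} - \frac{\|\bar z^k - \bar z^{k+1}\|_M^2}{2} \\ &= \frac{1}{\sigma} \Bigl( \frac{\|\bar x^k - x + \sigma K^* y\|^2}{2} - \frac{\|\bar x^{k+1} - x + \sigma K^* y\|^2}{2} - \frac{\|\bar x^k - \bar x^{k+1}\|^2}{2} \Bigr) \\ &\quad + \frac{1}{\tau} \Bigl( \frac{\|\bar y^k - y - \tau K x\|^2}{2} - \frac{\|\bar y^{k+1} - y - \tau K x\|^2}{2} - \frac{\|\bar y^k - \bar y^{k+1}\|^2}{2} \Bigr). \end{aligned} \tag{10} \]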

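Proof. The iteration (9) tells that $\frac{1}{\sigma} (\bar x^k - x^{k+1}) \in \partial F(x^{k+1})$ and $\frac{1}{\tau} (\bar y^k - y^{k+1}) \in \partial G(y^{k+1})$, meaning, in particular, that
\[ F(x^{k+1}) - F(x) \le \frac{1}{\sigma} \langle \bar x^k - x^{k+1}, x^{k+1} - x \rangle, \qquad G(y^{k+1}) - G(y) \le \frac{1}{\tau} \langle \bar y^k - y^{k+1}, y^{k+1} - y \rangle. \]
With the identity
\[ \frac{1}{\sigma} \langle \bar x^k - x^{k+1}, x^{k+1} - x \rangle = \frac{1}{\sigma} \langle \bar x^k - \bar x^{k+1}, \bar x^{k+1} - x \rangle + \frac{1}{\sigma} \langle \bar x^{k+1} - x^{k+1}, x^{k+1} + \bar x^{k+1} - \bar x^k - x \rangle \]
and the update rule for $\bar x^{k+1}$ in (9) in the difference $\bar x^{k+1} - x^{k+1}$, we get
\[ \begin{aligned} \frac{1}{\sigma} \langle \bar x^k - x^{k+1}, x^{k+1} - x \rangle - \langle Kx, y^{k+1} \rangle &= \frac{1}{\sigma} \langle \bar x^k - \bar x^{k+1}, \bar x^{k+1} - x \rangle - \langle K (x^{k+1} + \bar x^{k+1} - \bar x^k - x), y^{k+1} + \bar y^{k+1} - \bar y^k \rangle - \langle Kx, y^{k+1} \rangle \\ &= \frac{1}{\sigma} \langle \bar x^k - \bar x^{k+1}, \bar x^{k+1} - x \rangle - \langle Kx, \bar y^k - \bar y^{k+1} \rangle - \langle K (x^{k+1} + \bar x^{k+1} - \bar x^k), y^{k+1} + \bar y^{k+1} - \bar y^k \rangle. \end{aligned} \]
Analogously, we may arrive at
\[ \frac{1}{\tau} \langle \bar y^k - y^{k+1}, y^{k+1} - y \rangle + \langle K x^{k+1}, y \rangle = \frac{1}{\tau} \langle \bar y^k - \bar y^{k+1}, \bar y^{k+1} - y \rangle + \langle \bar x^k - \bar x^{k+1}, K^* y \rangle + \langle K (x^{k+1} + \bar x^{k+1} - \bar x^k), y^{k+1} + \bar y^{k+1} - \bar y^k \rangle. \]
Consequently, the estimate
\[ \begin{aligned} L(x^{k+1}, y) - L(x, y^{k+1}) &\le \frac{1}{\sigma} \langle \bar x^k - x^{k+1}, x^{k+1} - x \rangle - \langle Kx, y^{k+1} \rangle + \frac{1}{\tau} \langle \bar y^k - y^{k+1}, y^{k+1} - y \rangle + \langle K x^{k+1}, y \rangle \\ &= \frac{1}{\sigma} \langle \bar x^k - \bar x^{k+1}, \bar x^{k+1} - x + \sigma K^* y \rangle + \frac{1}{\tau} \langle \bar y^k - \bar y^{k+1}, \bar y^{k+1} - y - \tau K x \rangle \\ &= \langle \bar z^k - \bar z^{k+1}, \bar z^{k+1} - (\mathrm{id} - \Sigma A) z \rangle_M \end{aligned} \tag{11} \]
follows, remembering the definitions (4) and (8). Employing the Hilbert space identity $\langle u, v \rangle = \frac12 \|u + v\|^2 - \frac12 \|u\|^2 - \frac12 \|v\|^2$ leads us to (10).

Lemma 3. Iteration (9) possesses the following properties:
1. A point $(x^*, y^*, \bar x^*, \bar y^*)$ is a fixed point if and only if $(x^*, y^*)$ is a saddle point for (1) and $x^* - \sigma K^* y^* = \bar x^*$, $y^* + \tau K x^* = \bar y^*$.
2. If $\operatorname{w-lim}_{i \to \infty} (\bar x^{k_i}, \bar y^{k_i}) = (\bar x^*, \bar y^*)$ and $\lim_{i \to \infty} (\bar x^{k_i} - \bar x^{k_i+1}, \bar y^{k_i} - \bar y^{k_i+1}) = (0, 0)$, then $(x^{k_i+1}, y^{k_i+1}, \bar x^{k_i+1}, \bar y^{k_i+1})$ converges weakly to a fixed point.

Proof. Regarding the first point, observe that as (9) is equivalent to (6) with the choice (8) and $\bar z^k = z^k - \tilde z^k$ for all $k$, fixed-points $(z^*, \bar z^*)$ are exactly the solutions to (5) of the form $(z^*, \tilde z^*)$, $\bar z^* = z^* - \tilde z^*$. With $A$ and $B$ according to (4), fixed-points are equivalent to solutions of (5) which obey $0 \in A z^* + B z^*$ and $z^* - \Sigma A z^* = \bar z^*$, which means that $z^* = (x^*, y^*)$ is a solution of (1) and $\bar z^* = (\bar x^*, \bar y^*)$ satisfies $\bar x^* = x^* - \sigma K^* y^*$ and $\bar y^* = y^* + \tau K x^*$.

Next, suppose that some subsequence $(\bar x^{k_i}, \bar y^{k_i})$ converges weakly to $(\bar x^*, \bar y^*)$ and $\lim_{i \to \infty} (\bar x^{k_i+1} - \bar x^{k_i}, \bar y^{k_i+1} - \bar y^{k_i}) = (0, 0)$. The iteration (9) then gives
\[ \begin{pmatrix} \mathrm{id} & -\sigma K^* \\ \tau K & \mathrm{id} \end{pmatrix} \begin{pmatrix} x^{k_i+1} \\ y^{k_i+1} \end{pmatrix} = \begin{pmatrix} \bar x^{k_i+1} + \sigma K^* (\bar y^{k_i+1} - \bar y^{k_i}) \\ \bar y^{k_i+1} - \tau K (\bar x^{k_i+1} - \bar x^{k_i}) \end{pmatrix}, \]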

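so the sequence $(x^{k_i+1}, y^{k_i+1})$ converges weakly to $(x^*, y^*)$ with $x^* - \sigma K^* y^* = \bar x^*$ and $y^* + \tau K x^* = \bar y^*$ by weak sequential continuity of continuous linear operators. Now, iteration (9) can also be written as
\[ \begin{cases} \frac{1}{\sigma} (\bar x^{k_i} - \bar x^{k_i+1}) - K^* (\bar y^{k_i+1} - \bar y^{k_i}) \in K^* y^{k_i+1} + \partial F(x^{k_i+1}), \\ \frac{1}{\tau} (\bar y^{k_i} - \bar y^{k_i+1}) + K (\bar x^{k_i+1} - \bar x^{k_i}) \in -K x^{k_i+1} + \partial G(y^{k_i+1}), \end{cases} \]
where the left-hand side converges strongly to $(0, 0)$. As $(x, y) \mapsto \bigl( K^* y + \partial F(x), -K x + \partial G(y) \bigr)$ is maximally monotone, it is in particular weak-strong closed. Consequently, $(0, 0) \in \bigl( K^* y^* + \partial F(x^*), -K x^* + \partial G(y^*) \bigr)$, so $(x^*, y^*)$ is a saddle-point of (1). In particular, $x^* - \sigma K^* y^* = \bar x^*$ and $y^* + \tau K x^* = \bar y^*$, hence, the weak limit of $\{(x^{k_i+1}, y^{k_i+1}, \bar x^{k_i+1}, \bar y^{k_i+1})\}$ is a fixed-point of the iteration.

Proposition 2. If (1) possesses a solution, then iteration (DR) converges weakly to a $(x^*, y^*, \bar x^*, \bar y^*)$ for which $(x^*, y^*)$ is a solution of (1) and $(\bar x^*, \bar y^*) = (x^* - \sigma K^* y^*, y^* + \tau K x^*)$.

Proof. Let $(x', y')$ be a solution of (1). Then, $0 \le L(x^{k+1}, y') - L(x', y^{k+1})$ for each $k$, so by (10) and recursion we have for $k_0 \le k$ that
\[ \frac{\|\bar x^k - x' + \sigma K^* y'\|^2}{2\sigma} + \frac{\|\bar y^k - y' - \tau K x'\|^2}{2\tau} + \sum_{k' = k_0}^{k-1} \Bigl[ \frac{\|\bar x^{k'} - \bar x^{k'+1}\|^2}{2\sigma} + \frac{\|\bar y^{k'} - \bar y^{k'+1}\|^2}{2\tau} \Bigr] \le \frac{\|\bar x^{k_0} - x' + \sigma K^* y'\|^2}{2\sigma} + \frac{\|\bar y^{k_0} - y' - \tau K x'\|^2}{2\tau}. \tag{12} \]
With $\bar x' = x' - \sigma K^* y'$ and $\bar y' = y' + \tau K x'$ and $d_k = \frac{1}{2\sigma} \|\bar x^k - \bar x'\|^2 + \frac{1}{2\tau} \|\bar y^k - \bar y'\|^2$, this implies that $\{d_k\}$ is a non-increasing sequence with limit $d^*$. Furthermore, $\bar x^k - \bar x^{k+1} \to 0$ as well as $\bar y^k - \bar y^{k+1} \to 0$ as $k \to \infty$. Now, (12) says that $\{(\bar x^k, \bar y^k)\}$ is bounded whenever (1) has a solution, hence there exists a weak accumulation point $(\bar x^*, \bar y^*)$. Setting $(x^*, y^*) \in Z$ as the unique solution to the equation
\[ \begin{cases} x^* - \sigma K^* y^* = \bar x^*, \\ y^* + \tau K x^* = \bar y^*, \end{cases} \tag{13} \]
gives, by virtue of Lemma 3, that $(x^*, y^*, \bar x^*, \bar y^*)$ is a fixed-point of the iteration with $(x^*, y^*)$ being a solution of (1) and an accumulation point of $\{(x^k, y^k)\}$. Suppose that $(\bar x^{**}, \bar y^{**})$ is another weak accumulation point of the same sequence and choose the corresponding $(x^{**}, y^{**})$ according to (13). Then,
\[ \begin{aligned} \frac{1}{\sigma} \langle \bar x^k, \bar x^{**} - \bar x^* \rangle + \frac{1}{\tau} \langle \bar y^k, \bar y^{**} - \bar y^* \rangle &= \frac{\|\bar x^k - x^* + \sigma K^* y^*\|^2}{2\sigma} + \frac{\|\bar y^k - y^* - \tau K x^*\|^2}{2\tau} - \frac{\|\bar x^k - x^{**} + \sigma K^* y^{**}\|^2}{2\sigma} - \frac{\|\bar y^k - y^{**} - \tau K x^{**}\|^2}{2\tau} \\ &\quad - \frac{\|\bar x^*\|^2}{2\sigma} - \frac{\|\bar y^*\|^2}{2\tau} + \frac{\|\bar x^{**}\|^2}{2\sigma} + \frac{\|\bar y^{**}\|^2}{2\tau}. \end{aligned} \]
As both $(x^*, y^*)$ and $(x^{**}, y^{**})$ are solutions to (1), the right-hand side converges as a consequence of (12).
Plugging in the subsequences weakly converging to $(\bar x^*, \bar y^*)$ and $(\bar x^{**}, \bar y^{**})$, respectively, on the left-hand side, we see that both limits must coincide:
\[ \frac{1}{\sigma} \langle \bar x^*, \bar x^{**} - \bar x^* \rangle + \frac{1}{\tau} \langle \bar y^*, \bar y^{**} - \bar y^* \rangle = \frac{1}{\sigma} \langle \bar x^{**}, \bar x^{**} - \bar x^* \rangle + \frac{1}{\tau} \langle \bar y^{**}, \bar y^{**} - \bar y^* \rangle, \]
hence $\bar x^{**} = \bar x^*$ and $\bar y^{**} = \bar y^*$, implying that the whole sequence $\{(\bar x^k, \bar y^k)\}$ converges weakly to $(\bar x^*, \bar y^*)$. Inspecting the iteration (9) we see that $\operatorname{w-lim}_{k \to \infty} (x^k - \sigma K^* y^k, y^k + \tau K x^k) = (\bar x^*, \bar y^*)$ and as solving for $(x^k, y^k)$ constitutes a continuous linear operator, also $\operatorname{w-lim}_{k \to \infty} (x^k, y^k) = (x^*, y^*)$, which remained to be shown.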

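In particular, restricted primal-dual gaps vanish in the limit for the iteration and can thus be used as a stopping criterion.

Corollary 1. In the situation of Proposition 2, for $X_0 \times Y_0 \subset \operatorname{dom} F \times \operatorname{dom} G$ bounded and containing a saddle-point, the associated restricted primal-dual gap obeys
\[ \mathfrak{G}_{X_0 \times Y_0}(x^k, y^k) \ge 0 \ \text{ for each } k, \qquad \lim_{k \to \infty} \mathfrak{G}_{X_0 \times Y_0}(x^k, y^k) = 0. \]

Proof. The convergence follows from applying the Cauchy–Schwarz inequality to (11), taking the supremum and noting that $\|\bar z^k - \bar z^{k+1}\|_M \to 0$ as $k \to \infty$ as shown in Proposition 2.

This convergence comes, however, without a rate, making it necessary to switch to ergodic sequences. Here, the usual $O(1/k)$ convergence speed follows almost immediately.

Theorem 1. Let $X_0 \subset \operatorname{dom} F$, $Y_0 \subset \operatorname{dom} G$ be bounded and such that $X_0 \times Y_0$ contains a saddle-point, i.e., a solution of (1). Then, in addition to $\operatorname{w-lim}_{k \to \infty} (x^k, y^k) = (x^*, y^*)$, the ergodic sequences
\[ x^k_{\mathrm{erg}} = \frac{1}{k} \sum_{k'=1}^{k} x^{k'}, \qquad y^k_{\mathrm{erg}} = \frac{1}{k} \sum_{k'=1}^{k} y^{k'} \tag{14} \]
converge weakly to $x^*$ and $y^*$, respectively, and the restricted primal-dual gap obeys
\[ \mathfrak{G}_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}}) \le \frac{1}{k} \sup_{(x, y) \in X_0 \times Y_0} \Bigl[ \frac{\|\bar x^0 - x + \sigma K^* y\|^2}{2\sigma} + \frac{\|\bar y^0 - y - \tau K x\|^2}{2\tau} \Bigr] = O(1/k). \tag{15} \]

Proof. First of all, the assumptions state the existence of a saddle-point, so by Proposition 2, $\{(x^k, y^k)\}$ converges weakly to a solution $(x^*, y^*)$ of (1). To obtain the weak convergence of $\{x^k_{\mathrm{erg}}\}$, test with an $x$ and utilize the Stolz–Cesàro theorem to obtain
\[ \lim_{k \to \infty} \langle x^k_{\mathrm{erg}}, x \rangle = \lim_{k \to \infty} \frac{\sum_{k'=1}^{k} \langle x^{k'}, x \rangle}{\sum_{k'=1}^{k} 1} = \lim_{k \to \infty} \langle x^k, x \rangle = \langle x^*, x \rangle. \]
The property $\operatorname{w-lim}_{k \to \infty} y^k_{\mathrm{erg}} = y^*$ can be proven analogously. Finally, in order to show the estimate on $\mathfrak{G}_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}})$, observe that for each $(x, y) \in X_0 \times Y_0$, the function $(x', y') \mapsto L(x', y) - L(x, y')$ is convex. Thus, using (10) gives
\[ \begin{aligned} L(x^k_{\mathrm{erg}}, y) - L(x, y^k_{\mathrm{erg}}) &\le \frac{1}{k} \sum_{k'=0}^{k-1} L(x^{k'+1}, y) - L(x, y^{k'+1}) \\ &\le \frac{1}{k} \sum_{k'=0}^{k-1} \frac{1}{\sigma} \Bigl( \frac{\|\bar x^{k'} - x + \sigma K^* y\|^2}{2} - \frac{\|\bar x^{k'+1} - x + \sigma K^* y\|^2}{2} \Bigr) + \frac{1}{\tau} \Bigl( \frac{\|\bar y^{k'} - y - \tau K x\|^2}{2} - \frac{\|\bar y^{k'+1} - y - \tau K x\|^2}{2} \Bigr) \\ &\le \frac{1}{k} \Bigl( \frac{\|\bar x^0 - x + \sigma K^* y\|^2}{2\sigma} + \frac{\|\bar y^0 - y - \tau K x\|^2}{2\tau} \Bigr). \end{aligned} \]
Taking the supremum over all $(x, y) \in X_0 \times Y_0$ gives $\mathfrak{G}_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}})$ on the left-hand side and its estimate from above in (15).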
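The ergodic estimate (15) can be checked numerically on a toy problem. The sketch below is an illustration under assumptions that are not taken from the paper: scalar data with $K = \kappa$, $F(x) = \frac12 (x - f)^2$, $G$ the indicator of $[-\alpha, \alpha]$, and the hypothetical choice $X_0 = [0, 4]$, $Y_0 = [-\alpha, \alpha]$, which contains the saddle-point $(2, 0.5)$ for the default data. For this setting the restricted gap has a closed form, and the supremum on the right-hand side of (15) is attained at a corner of the box because the bracket is a convex function of $(x, y)$.

```python
def ergodic_gap_check(kappa=2.0, f=3.0, alpha=0.5, sigma=1.0, tau=1.0, iters=200):
    """Return rows (k, restricted gap at the ergodic pair, right-hand side of (15))."""
    X0 = (0.0, 4.0)        # assumed bounded set containing the primal solution
    Y0 = (-alpha, alpha)   # equals dom G for the assumed indicator functional

    def gap(x, y):
        # sup over y' in Y0 of L(x, y'), with G = 0 on Y0
        sup_part = alpha * abs(kappa * x) + 0.5 * (x - f) ** 2
        # inf over x' in X0 of L(x', y): clamp the unconstrained minimizer
        xp = max(X0[0], min(X0[1], f - kappa * y))
        inf_part = kappa * xp * y + 0.5 * (xp - f) ** 2
        return sup_part - inf_part

    x_bar, y_bar = 0.0, 0.0
    # convex quadratic in (x, y): its maximum over the box sits at a corner
    bound0 = max(
        (x_bar - x + sigma * kappa * y) ** 2 / (2.0 * sigma)
        + (y_bar - y - tau * kappa * x) ** 2 / (2.0 * tau)
        for x in X0 for y in Y0
    )
    sum_x, sum_y, rows = 0.0, 0.0, []
    for k in range(1, iters + 1):
        # one step of the basic Douglas-Rachford iteration
        x = (x_bar + sigma * f) / (1.0 + sigma)
        y = max(-alpha, min(alpha, y_bar))
        b = (2.0 * x - x_bar) - sigma * kappa * (2.0 * y - y_bar)
        d = b / (1.0 + sigma * tau * kappa ** 2)
        x_bar, y_bar = x_bar - x + d, y + tau * kappa * d
        # running sums give the ergodic averages (14)
        sum_x, sum_y = sum_x + x, sum_y + y
        rows.append((k, gap(sum_x / k, sum_y / k), bound0 / k))
    return rows
```

Each returned row pairs the gap at $(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}})$ with the bound, so the two columns can be compared or plotted against $k$.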

2.2 Preconditioning

The iteration (DR) can now be preconditioned as follows. We amend the dual variable $y$ by a preconditioning variable $x_p \in X$ and the linear operator $K$ by $H : X \to X$ linear, continuous and consider the saddle-point problem
\[ \min_{x \in \operatorname{dom} F} \ \max_{y \in \operatorname{dom} G, \ x_p = 0} \ \langle Kx, y \rangle + \langle Hx, x_p \rangle + F(x) - G(y) - I_{\{0\}}(x_p) \tag{16} \]
whose solutions have exactly the form $(x^*, y^*, 0)$ with $(x^*, y^*)$ being a saddle-point of (1). Plugging this into the Douglas–Rachford iteration for (16) yields by construction that $x_p^{k+1} = 0$ as well as $\bar x_p^{k+1} = \tau H d^{k+1}$ for each $k$, so no new quantities have to be introduced into the iteration. Introducing the notation $T = \mathrm{id} + \sigma\tau K^* K$ and $M = \mathrm{id} + \sigma\tau (K^* K + H^* H)$, we see that the step $d^{k+1} = T^{-1} b^{k+1}$ in (DR) is just replaced by
\[ d^{k+1} = d^k + M^{-1} (b^{k+1} - T d^k), \]
which can be interpreted as one iteration step of an operator splitting method for the solution of $T d^{k+1} = b^{k+1}$ with respect to the splitting $T = M - (M - T)$. The full iteration is shown in Table 2 (the term feasible preconditioner is defined after the next paragraph).

Table 2: The preconditioned Douglas–Rachford iteration for the solution of convex-concave saddle-point problems of type (1).

    pDR
    Objective:       Solve $\min_{x \in \operatorname{dom} F} \max_{y \in \operatorname{dom} G} \ \langle Kx, y \rangle + F(x) - G(y)$
    Initialization:  $(\bar x^0, \bar y^0, d^0) \in X \times Y \times X$ initial guess, $\sigma > 0$, $\tau > 0$ step sizes,
                     $M$ feasible preconditioner for $T = \mathrm{id} + \sigma\tau K^* K$
    Iteration:       $x^{k+1} = (\mathrm{id} + \sigma \partial F)^{-1} (\bar x^k)$
                     $y^{k+1} = (\mathrm{id} + \tau \partial G)^{-1} (\bar y^k)$
                     $b^{k+1} = (2 x^{k+1} - \bar x^k) - \sigma K^* (2 y^{k+1} - \bar y^k)$
                     $d^{k+1} = d^k + M^{-1} (b^{k+1} - T d^k)$
                     $\bar x^{k+1} = \bar x^k - x^{k+1} + d^{k+1}$
                     $\bar y^{k+1} = y^{k+1} + \tau K d^{k+1}$
    Output:          $\{(x^k, y^k)\}$ primal-dual sequence

Given $T = \mathrm{id} + \sigma\tau K^* K$, one usually aims at choosing $M$ such that it corresponds to well-known solution techniques such as the Gauss–Seidel or successive over-relaxation procedure in the finite-dimensional case, for instance. For this purpose, one has to ensure that an $H : X \to X$ linear and continuous can be found for given $M$. The following definition from [5] gives a necessary and sufficient criterion.

Definition 1. Let $M, T : X \to X$ be linear, self-adjoint, continuous and positive semi-definite. Then, $M$ is called a feasible preconditioner for $T$ if $M$ is boundedly invertible and $M - T$ is positive semi-definite.

Indeed, given $H : X \to X$ linear and continuous, it is clear from the definition of $M$ and $T$ that $M$ is a feasible preconditioner. The other way around, if $M$ is a feasible preconditioner for $T$, then one can take the square root of the operator $M - T$ and consequently, there is a linear, continuous and self-adjoint $H : X \to X$ such that $H^2 = \frac{1}{\sigma\tau} (M - T)$, which implies $M = \mathrm{id} + \sigma\tau (K^* K + H^* H)$. Summarized, the preconditioned Douglas–Rachford iteration based on (16) exactly corresponds to (pDR) with $M$ being a feasible preconditioner. Its convergence then follows from Proposition 2 and Theorem 1, with weak convergence and asymptotic rate being maintained.
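To illustrate how the exact linear solve is replaced by a single splitting step, the following sketch runs the preconditioned iteration for scalar data with the Richardson-type choice $M = \lambda\, \mathrm{id}$. This is an illustration under assumptions not taken from the paper: $K = \kappa$ is a number, $F(x) = \frac12 (x - f)^2$, $G$ is the indicator of $[-\alpha, \alpha]$, and the names `preconditioned_dr` and `lam` are hypothetical. In the scalar case, feasibility of $M$ for $T$ simply means $\lambda \ge T = 1 + \sigma\tau\kappa^2$.

```python
def preconditioned_dr(kappa, f, alpha, lam, sigma=1.0, tau=1.0, iters=2000):
    """Preconditioned iteration with M = lam * id for scalar data (illustrative sketch)."""
    T = 1.0 + sigma * tau * kappa * kappa  # T = id + sigma*tau*K^*K
    if lam < T:
        raise ValueError("M = lam*id is feasible only if lam >= T")
    x_bar, y_bar, d = 0.0, 0.0, 0.0  # initial guess, including d^0
    x, y = 0.0, 0.0
    for _ in range(iters):
        x = (x_bar + sigma * f) / (1.0 + sigma)   # resolvent of the assumed F
        y = max(-alpha, min(alpha, y_bar))        # resolvent of the assumed G
        b = (2.0 * x - x_bar) - sigma * kappa * (2.0 * y - y_bar)
        # one splitting step d + M^{-1}(b - T d) instead of the exact solve T^{-1} b
        d = d + (b - T * d) / lam
        x_bar = x_bar - x + d
        y_bar = y + tau * kappa * d
    return x, y, d
```

With `lam` equal to `T` the update reduces to the exact solve, so the unpreconditioned iteration is recovered; larger values of `lam` give a cheaper but less accurate inner step per iteration.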

Theorem 2. Let $X_0 \subset \operatorname{dom} F$, $Y_0 \subset \operatorname{dom} G$ be bounded and such that $X_0 \times Y_0$ contains a saddle-point, i.e., a solution of (1). Further, let $M$ be a feasible preconditioner for $T = \mathrm{id} + \sigma\tau K^* K$. Then, for the iteration generated by (pDR), it holds that $\operatorname{w-lim}_{k \to \infty} (x^k, y^k) = (x^*, y^*)$ with $(x^*, y^*)$ being a saddle-point of (1). Additionally, the ergodic sequence $\{(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}})\}$ given by (14) converges weakly to $(x^*, y^*)$ and the restricted primal-dual gap obeys
\[ \mathfrak{G}_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}}) \le \frac{1}{k} \sup_{(x, y) \in X_0 \times Y_0} \Bigl[ \frac{\|\bar x^0 - x + \sigma K^* y\|^2}{2\sigma} + \frac{\|d^0 - x\|_{M - T}^2}{2\sigma} + \frac{\|\bar y^0 - y - \tau K x\|^2}{2\tau} \Bigr] \tag{17} \]
where $\|x\|_{M - T}^2 = \langle (M - T) x, x \rangle$.

Proof. The above considerations show that one can find an $H : X \to X$ such that the iteration (pDR) corresponds to the Douglas–Rachford iteration for the solution of (16). Proposition 2 then already yields weak convergence and Theorem 1 gives the estimate (15) on the restricted primal-dual gap associated with (16) (note that we apply it for $X_0 \times (Y_0 \times \{0\})$). Now, observe that as $x_p = 0$ must hold, the latter coincides with the restricted primal-dual gap associated with (1). Plugging in $\bar x_p^0 = \tau H d^0$ and $M - T = \sigma\tau H^* H$ into (15) finally yields (17).

The condition that $M$ should be feasible for $T$ allows for a great flexibility in preconditioning. Basically, classical splitting methods such as symmetric Gauss–Seidel or symmetric successive over-relaxation are feasible, as is an $n$-fold application thereof. We summarize the most important statements regarding feasibility of classical operator splittings in Table 3 and some general results in the following propositions and refer to [4, 5] for further details.

Table 3: Summary of common preconditioners and possible conditions for feasibility.

    Preconditioner    $T$                $\lambda\, \mathrm{id}$    $(\lambda + 1) D$                   $M_{\mathrm{SGS}}$         $M_{\mathrm{SSOR}}$
    Conditions        (none)             $\lambda \ge \|T\|$        $\lambda \ge \lambda_{\max}(T - D)$   (none)                     $\omega \in (0, 2)$
    Iteration type    Douglas–Rachford   Richardson                 Damped Jacobi                       Symmetric Gauss–Seidel     Symmetric SOR

    $D = \operatorname{diag}(T)$, $T = D - E - E^*$, $E$ lower triangular,
    $M_{\mathrm{SGS}} = (D - E) D^{-1} (D - E^*)$, $M_{\mathrm{SSOR}} = (\tfrac{1}{\omega} D - E) (\tfrac{2 - \omega}{\omega} D)^{-1} (\tfrac{1}{\omega} D - E^*)$

Proposition 3. Let $T : X \to X$ linear, continuous, symmetric and positive definite be given. Then, the following update schemes correspond to feasible preconditioners.
1. For $M_0 : X \to X$ linear, continuous such that $M_0 - \frac12 T$ is positive definite,
\[ \begin{cases} d^{k+1/2} = d^k + M_0^{-1} (b^{k+1} - T d^k), \\ d^{k+1} = d^{k+1/2} + M_0^{-*} (b^{k+1} - T d^{k+1/2}). \end{cases} \tag{18} \]
2. For $M : X \to X$ feasible for $T$ and $n \ge 1$,
\[ d^{k+(i+1)/n} = d^{k+i/n} + M^{-1} (b^{k+1} - T d^{k+i/n}), \qquad i = 0, \ldots, n-1. \tag{19} \]
3. For $T = T_1 + T_2$, with $T_1, T_2 : X \to X$ linear, continuous, symmetric, $T_1$ positive definite, $T_2$ positive semi-definite and $M : X \to X$ feasible for $T_1$, $M - T_2$ boundedly invertible,
\[ \begin{cases} d^{k+1/2} = d^k + M^{-1} \bigl( (b^{k+1} - T_2 d^k) - T_1 d^k \bigr), \\ d^{k+1} = d^k + M^{-1} \bigl( (b^{k+1} - T_2 d^{k+1/2}) - T_1 d^k \bigr). \end{cases} \tag{20} \]
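For a finite-dimensional $T$, the symmetric Gauss–Seidel update $d + M_{\mathrm{SGS}}^{-1}(b - Td)$ never needs $M_{\mathrm{SGS}}$ explicitly: it is the classical forward sweep followed by a backward sweep. The sketch below illustrates this together with the $n$-fold application from (19). It assumes a small dense symmetric positive definite matrix stored as a list of rows; the names `sgs_step` and `n_fold` are hypothetical.

```python
def sgs_step(T, b, d):
    """One symmetric Gauss-Seidel step d + M_SGS^{-1}(b - T d), returned as a new list."""
    n = len(b)
    d = list(d)
    # forward sweep over i = 0..n-1, then backward sweep over i = n-1..0
    for i in list(range(n)) + list(range(n - 1, -1, -1)):
        s = sum(T[i][j] * d[j] for j in range(n) if j != i)
        d[i] = (b[i] - s) / T[i][i]
    return d


def n_fold(T, b, d, n):
    """Apply the preconditioned update n times, as in scheme (19)."""
    for _ in range(n):
        d = sgs_step(T, b, d)
    return d


def residual(T, b, d):
    """Largest absolute entry of b - T d."""
    return max(abs(b[i] - sum(T[i][j] * d[j] for j in range(len(b))))
               for i in range(len(b)))
```

Inside the preconditioned iteration only one or a few such steps would be taken per outer iteration; applying many steps, as `n_fold` allows, approaches the exact solve $T^{-1} b$.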

The update scheme (18) is useful if one has a non-symmetric feasible preconditioner $M_0$ for $T$ as then, the concatenation with the adjoint preconditioner will be symmetric and feasible. Likewise, (19) corresponds to the $n$-fold application of a feasible preconditioner which is again feasible. Finally, (20) is useful if $T$ can be split into $T_1 + T_2$ for which $T_1$ can be easily preconditioned by $M$. Then, one obtains a feasible preconditioner for $T$ only using forward evaluation of $T_2$.

3 Acceleration strategies

We now turn to the main results of the paper. As already mentioned in the introduction, in the case that the functionals $F$ and $G$ possess strong convexity properties, the iteration (DR) can be modified such that the asymptotic ergodic rates for the restricted primal-dual gap can further be improved.

Definition 2. A convex functional $F : X \to {]{-\infty}, \infty]}$ admits the modulus of strong convexity $\gamma \ge 0$ if for each $x, \bar x \in \operatorname{dom} F$ and $\lambda \in [0, 1]$ we have
\[ F\bigl( \lambda x + (1 - \lambda) \bar x \bigr) \le \lambda F(x) + (1 - \lambda) F(\bar x) - \frac{\gamma}{2} \lambda (1 - \lambda) \|x - \bar x\|^2. \]
Likewise, a concave functional $F$ admits the modulus of strong concavity $\gamma$ if $-F$ admits the modulus of strong convexity $\gamma$.

Note that for $x \in \operatorname{dom} \partial F$, $\xi \in \partial F(x)$ and $\bar x \in \operatorname{dom} F$ it follows that
\[ F(x) + \langle \xi, \bar x - x \rangle + \frac{\gamma}{2} \|x - \bar x\|^2 \le F(\bar x). \]
The basic idea for acceleration is now, in case the modulus of strong convexity is positive, to use the additional quadratic terms to obtain estimates of the type
\[ \begin{aligned} \rho \bigl( L(x^{k+1}, y) - L(x, y^{k+1}) \bigr) + \operatorname{res}(\hat x^k, \hat y^k, x^{k+1}, y^{k+1}, x_\rho, y_\rho) &\le \frac{\|\hat x^k - x_\rho + \sigma K^* y_\rho\|^2}{2\sigma} + \frac{\|\hat y^k - y_\rho - \tau K x_\rho\|^2}{2\tau} \\ &\quad - \frac{1}{\vartheta} \Bigl[ \frac{\|\hat x^{k+1} - x_\rho + \sigma' K^* y_\rho\|^2}{2\sigma'} + \frac{\|\hat y^{k+1} - y_\rho - \tau' K x_\rho\|^2}{2\tau'} \Bigr] \end{aligned} \]
for a modified iteration based on (DR). Here, the convex combination $(x_\rho, y_\rho) = (1 - \rho)(x', y') + \rho (x, y)$ of $(x, y) \in \operatorname{dom} F \times \operatorname{dom} G$ and a saddle-point $(x', y')$ is taken for $\rho \in [0, 1]$, $(\hat x^k, \hat y^k)$ denotes an additional primal-dual pair that is updated during the iteration, $\operatorname{res}$ a non-negative residual functional that will become apparent later, $0 < \vartheta < 1$ an acceleration factor and $\sigma' > 0$, $\tau' > 0$ step-sizes that are possibly different from $\sigma$ and $\tau$.

In order to achieve this, we consider a pair $(\hat x^k, \hat y^k) \in X \times Y$ and set
\[ \begin{cases} x^{k+1} = (\mathrm{id} + \sigma \partial F)^{-1} (\hat x^k), \\ y^{k+1} = (\mathrm{id} + \tau \partial G)^{-1} (\hat y^k), \end{cases} \tag{21} \]
as well as, with $\vartheta_p, \vartheta_d > 0$,
\[ \begin{cases} \tilde x^{k+1} = \sigma K^* [y^{k+1} + \vartheta_p (\bar y^{k+1} - \hat y^k)], \\ \tilde y^{k+1} = -\tau K [x^{k+1} + \vartheta_d (\bar x^{k+1} - \hat x^k)], \end{cases} \tag{22} \]
where, throughout this section, we always denote $\tilde x^{k+1} = x^{k+1} - \bar x^{k+1}$ and $\tilde y^{k+1} = y^{k+1} - \bar y^{k+1}$. Observe that for $\vartheta_p = \vartheta_d = 1$ and $(\hat x^{k+1}, \hat y^{k+1}) = (\bar x^{k+1}, \bar y^{k+1})$, the iteration (9) is recovered which corresponds to (DR). In the following, however, we will choose these parameters differently depending on the type of strong convexity of $F$ and $G$. To analyze the steps (21) and (22), let us first provide an estimate analogous to (10) in Lemma 2.
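The coupled equations (22) can be solved for $(\bar x^{k+1}, \bar y^{k+1})$ by elimination, in the same way as the last two lines of (9) are decoupled. The sketch below does this for scalar data; it is an illustration rather than the update scheme of the paper. It assumes that $K$ is multiplication by a number `kappa`, and the elimination formula in the code is derived here by substituting the first equation of (22) into the second. The resolvents are passed in as callables, and all names are hypothetical.

```python
def generalized_step(x_hat, y_hat, sigma, tau, theta_p, theta_d, kappa, prox_f, prox_g):
    """One step of (21)-(22) for scalar K = kappa (illustrative sketch).

    prox_f(v, sigma) and prox_g(v, tau) are assumed to evaluate the resolvents
    (id + sigma dF)^{-1} and (id + tau dG)^{-1}.
    """
    # step (21): resolvent evaluations at the auxiliary pair
    x = prox_f(x_hat, sigma)
    y = prox_g(y_hat, tau)
    # step (22): eliminate x_bar from the two coupled equations and solve for y_bar
    rhs = y + tau * kappa * (
        x + theta_d * (x - x_hat - sigma * kappa * (y - theta_p * y_hat))
    )
    y_bar = rhs / (1.0 + sigma * tau * kappa * kappa * theta_p * theta_d)
    # back-substitute into the first equation of (22)
    x_bar = x - sigma * kappa * (y + theta_p * (y_bar - y_hat))
    return x, y, x_bar, y_bar
```

For `theta_p = theta_d = 1` and `(x_hat, y_hat)` equal to the previous `(x_bar, y_bar)`, this reproduces one step of the basic iteration, in line with the observation following (22).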

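Lemma 4. Let $\rho \in [0, 1]$, denote by $\gamma_1, \gamma_2 \ge 0$ the moduli of strong convexity for $F$ and $G$, respectively, and let $(x', y') \in \operatorname{dom} F \times \operatorname{dom} G$ be a saddle-point of $L$. For $(\hat x^k, \hat y^k) \in X \times Y$ as well as $(x, y) \in \operatorname{dom} F \times \operatorname{dom} G$, the equations (21) and (22) imply
\[ \begin{aligned} &\rho \bigl( L(x^{k+1}, y) - L(x, y^{k+1}) \bigr) \\ &\quad + \frac{1}{2\sigma} \|\bar x^{k+1} - x_\rho + \sigma K^* y_\rho\|^2 + \frac{\gamma_1}{1 + \rho} \|x^{k+1} - x_\rho\|^2 + \frac{1}{2\sigma} \|\hat x^k - \bar x^{k+1}\|^2 \\ &\quad + (1 - \vartheta_p) \langle \hat y^k - \bar y^{k+1}, K[x^{k+1} + \bar x^{k+1} - \hat x^k - x_\rho] \rangle \\ &\quad + \frac{1}{2\tau} \|\bar y^{k+1} - y_\rho - \tau K x_\rho\|^2 + \frac{\gamma_2}{1 + \rho} \|y^{k+1} - y_\rho\|^2 + \frac{1}{2\tau} \|\hat y^k - \bar y^{k+1}\|^2 \\ &\quad - (1 - \vartheta_d) \langle y^{k+1} + \bar y^{k+1} - \hat y^k - y_\rho, K[\hat x^k - \bar x^{k+1}] \rangle \\ &\le \frac{1}{2\sigma} \|\hat x^k - x_\rho + \sigma K^* y_\rho\|^2 + \frac{1}{2\tau} \|\hat y^k - y_\rho - \tau K x_\rho\|^2 \end{aligned} \tag{23} \]
where $x_\rho = (1 - \rho) x' + \rho x$ and $y_\rho = (1 - \rho) y' + \rho y$.

Proof. First, optimality of $(x', y')$ according to (3), strong convexity and the subgradient inequality give
\[ \frac{\gamma_1}{2} \|x^{k+1} - x'\|^2 + \frac{\gamma_2}{2} \|y^{k+1} - y'\|^2 \le L(x^{k+1}, y') - L(x', y^{k+1}). \]
Combined with strong concavity of $x \mapsto -L(x, y^{k+1})$ and $y \mapsto L(x^{k+1}, y)$ with moduli $\gamma_1$ and $\gamma_2$, respectively, this yields for the convex combinations $x_\rho$ and $y_\rho$ that
\[ (1 - \rho) \frac{\gamma_1}{2} \bigl[ \|x^{k+1} - x'\|^2 + \rho \|x - x'\|^2 \bigr] + (1 - \rho) \frac{\gamma_2}{2} \bigl[ \|y^{k+1} - y'\|^2 + \rho \|y - y'\|^2 \bigr] + \rho \bigl( L(x^{k+1}, y) - L(x, y^{k+1}) \bigr) \le L(x^{k+1}, y_\rho) - L(x_\rho, y^{k+1}). \]
Employing convexity once more, we arrive at
\[ \frac{1 - \rho}{1 + \rho} \frac{\gamma_1}{2} \|x^{k+1} - x_\rho\|^2 \le (1 - \rho) \frac{\gamma_1}{2} \bigl[ \|x^{k+1} - x'\|^2 + \rho \|x - x'\|^2 \bigr] \]
and the analogous statement for the dual variables. Putting things together yields
\[ \frac{1 - \rho}{1 + \rho} \Bigl[ \frac{\gamma_1}{2} \|x^{k+1} - x_\rho\|^2 + \frac{\gamma_2}{2} \|y^{k+1} - y_\rho\|^2 \Bigr] + \rho \bigl( L(x^{k+1}, y) - L(x, y^{k+1}) \bigr) \le L(x^{k+1}, y_\rho) - L(x_\rho, y^{k+1}). \tag{24} \]
The next steps aim at estimating the right-hand side of (24). Using again the strong convexity of $F$ and $G$ and the subgradient inequality for (21), we obtain
\[ \begin{aligned} \frac{\gamma_1}{2} \|x^{k+1} - x_\rho\|^2 + F(x^{k+1}) - F(x_\rho) &\le \frac{1}{\sigma} \langle \hat x^k - x^{k+1}, x^{k+1} - x_\rho \rangle \\ &= \frac{1}{\sigma} \langle \hat x^k - x^{k+1} + \tilde x^{k+1}, x^{k+1} - \tilde x^{k+1} - x_\rho \rangle + \frac{1}{\sigma} \langle \hat x^k + \tilde x^{k+1} - 2 x^{k+1} + x_\rho, \tilde x^{k+1} \rangle \\ &= \frac{1}{\sigma} \langle \hat x^k - \bar x^{k+1}, \bar x^{k+1} - x_\rho \rangle - \frac{1}{\sigma} \langle x^{k+1} + \bar x^{k+1} - \hat x^k - x_\rho, \tilde x^{k+1} \rangle \end{aligned} \]
as well as
\[ \frac{\gamma_2}{2} \|y^{k+1} - y_\rho\|^2 + G(y^{k+1}) - G(y_\rho) \le \frac{1}{\tau} \langle \hat y^k - \bar y^{k+1}, \bar y^{k+1} - y_\rho \rangle - \frac{1}{\tau} \langle y^{k+1} + \bar y^{k+1} - \hat y^k - y_\rho, \tilde y^{k+1} \rangle. \]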

14 Collecting the terms involving x k+ and ỹ k+, adding the primal-dual coupling terms Kx k+, y ρ Kx ρ, y k+ as well as employing () gives σ xk+ + x k+ ˆx k x ρ, x k+ τ yk+ + ȳ k+ ŷ k y ρ, ỹ k+ It then follows that + Kx k+, y ρ Kx ρ, y k+ = y k+ + ȳ k+ ŷ k y ρ, K[x k+ + x k+ ˆx k ] y k+ + ȳ k+ ŷ k, K[x k+ + x k+ ˆx k x ρ ] + ( ϑ p ) ȳ k+ ŷ k, K[x k+ + x k+ ˆx k x ρ ] ( ϑ d ) y k+ + ȳ k+ ŷ k y ρ, K[ x k+ ˆx k ] + Kx k+, y ρ Kx ρ, y k+ = ( ϑ p ) ȳ k+ ŷ k, K[x k+ + x k+ ˆx k x ρ ] ( ϑ d ) y k+ + ȳ k+ ŷ k y ρ, K[ x k+ ˆx k ] + y ρ, K[ˆx k x k+ ] ŷ k ȳ k+, Kx ρ. F(x k+ ) F(x ρ ) + G(y k+ ) G(y ρ ) + Kx k+, y ρ Kx ρ, y k+ σ ˆxk x k+, x k+ x ρ + σk y ρ + τ ŷk ȳ k+, ȳ k+ y ρ τkx ρ γ xk+ x ρ + ( ϑ p ) ȳ k+ ŷ k, K[x k+ + x k+ ˆx k x ρ ] γ yk+ y ρ ( ϑ d ) y k+ + ȳ k+ ŷ k y ρ, K[ x k+ ˆx k ]. (5) As before, employing the Hilbert-space identity u, v = u + v u v, plugging the result into (4) and rearranging finally yields (3). 3. Strong convexity of F Now, assume that γ > 0 while for γ, no restrictions are made, such that we set γ = 0. In this case, we can rewrite the quadratic terms involving x k+ and ȳ k+ in (3) as follows. Lemma 5. In the situation of Lemma 4 and for γ > 0 we have where σ xk+ x ρ +σk y ρ + γ xk+ x ρ + τ ȳk+ y ρ τkx ρ = [ ϑ σ ˆxk+ x ρ + σ K y ρ + τ ŷk+ y ρ τ Kx ρ ] + ϑ [ ϑ τ yk+ y ρ, τkx ρ + ỹ k+ σ xk+ x ρ, σk y ρ x ] k+ σγ τ τkx ρ + ỹ k+ (6) ϑ = + σγ, { σ = ϑσ, τ = ϑ τ, { ˆx k+ = x k+ ϑ x k+, ŷ k+ = y k+ ϑ ỹ k+. Proof. This is a result of straightforward computations which we present for completeness and (7) 4

15 reader s convenience. Computing σ xk+ x ρ + σk y ρ + γ xk+ x ρ = + σγ σ xk+ x ρ + + σγ x k+ x ρ, σ + + σγ σ + σγ [σk y ρ x k+ ] + σγ [σk y ρ x k+ ] σ ( + σγ ) x k+ x ρ, σk y ρ x k+ = + σγ + σγ x k+ x ρ + [σk y ρ x k+ ] σ + σγ σ ( + σγ ) x k+ x ρ, σk y ρ x k+ as well as τ ȳk+ y ρ τkx ρ = τ yk+ y ρ τ yk+ y ρ, + σγ[τkx ρ + ỹ k+ ] + τ + σγ[τkx ρ + ỹ k+ ] + τ ( + σγ ) y k+ y ρ, τkx ρ + ỹ k+ σγ τ τkx ρ + ỹ k+ + σγ = τ + σγ yk+ y ρ + σγ[τkx ρ + ỹ k+ ] + τ ( + σγ ) y k+ y ρ, τkx ρ + ỹ k+ σγ τ τkx ρ + ỹ k+ and plugging in the definitions (7) already gives (6). As the next step, we aim at combining the error terms on the right-hand side of (6) and left-hand side of (3) such that they become non-negative. One important point here is to choose ϑ p and ϑ d such that possibly negative terms involving y k+ y ρ cancel out. Lemma 6. In the situation of Lemma 5, with ϑ p = ϑ and ϑ d = ϑ, we have ( ϑ p ) ŷ k ȳ k+, K[x k+ + x k+ ˆx k x ρ ] ( ϑ d ) y k+ + ȳ k+ ŷ k y ρ, K[ˆx k x k+ ] + ϑ [ ϑ τ yk+ y ρ, τkx ρ + ỹ k+ σ xk+ x ρ, σk y ρ x ] k+ σγ τ τkx ρ + ỹ k+ c σ ˆxk x k+ c τ ŷk ȳ k+ whenever σ γτ K < c < holds for some c > 0. c( + σγ)σγτ K (c σ γτ K ) xk+ x ρ (8) Proof. We start with rewriting the scalar-product terms from the right-hand side of (6) by plugging in () as follows τ yk+ y ρ,τkx ρ + ỹ k+ σ xk+ x ρ, σk y ρ x k+ = y k+ y ρ, K[x ρ x k+ + x k+ ˆx k ] y k+ y ρ + ȳ k+ ŷ k, K[x ρ x k+ ] ( + ϑ d ) y k+ y ρ, K[ x k+ ˆx k ] + ( ϑ p ) ȳ k+ ŷ k, K[x ρ x k+ ] = ŷ k ȳ k+, K[x ρ x k+ + x k+ ˆx k ] y k+ y ρ + ȳ k+ ŷ k, K[ˆx k x k+ ] + ( + ϑ d ) y k+ y ρ, K[ˆx k x k+ ] + ( ϑ p ) ȳ k+ ŷ k, K[x ρ x k+ ] = ŷ k ȳ k+, K[ϑ p (x ρ x k+ ) + x k+ ˆx k ] ϑ d (y k+ y ρ ) + ȳ k+ ŷ k, K[ˆx k x k+ ]. 5

16 Incorporating the scalar-product terms from the left-hand side of (3), the choice of ϑ p and ϑ d yields ( ϑ p ) ŷ k ȳ k+, K[x k+ + x k+ ˆx k x ρ ] ( ϑ d ) y k+ + ȳ k+ ŷ k y ρ, K[ˆx k x k+ ] ϑ ϑ ϑ d(y k+ y ρ ) + ȳ k+ ŷ k, K[ˆx k x k+ ] + ϑ ϑ ŷk ȳ k+, K[ϑ p (x ρ x k+ ) + x k+ ˆx k ] = ŷ k ȳ k+, K[( ϑ p)(x k+ x ρ ) + (ϑ d ϑ p )( x k+ ˆx k )] = ( ϑ p) ŷ k ȳ k+, K[(x k+ x ρ ) + ϑ d ( x k+ ˆx k )]. Using Young s inequality, this allows to estimate, as c > 0, ( ϑ p) ŷ k ȳ k+, K[(x k+ x ρ ) + ϑ d ( x k+ ˆx k )] c τ ŷk ȳ k+ ( ϑ p) τ K[(x k+ x ρ ) + ϑ d ( x k+ ˆx k )]. c Taking the missing quadratic term into account, using that c < and once again the definitions of ϑ, ϑ p and ϑ d, gives, for ε > 0, the estimate ( ϑ p) τ K[(x k+ x ρ ) + ϑ d ( x k+ ˆx k )] + σγ c τ τkx ρ + ỹ k+ ( (ϑ p ) = + ϑ p ) τ K[(x k+ x ρ ) + ϑ d ( x k+ ˆx k )] c We would like to set which is equivalent to ϑ4 p ϑ p τ K[(x k+ x ρ ) + ϑ d ( x k+ ˆx k )] c ( + ε) ϑ p τ K x k+ ˆx k + c ( + ε ( + ε) ϑ p τ K = c c σ, ε = c σ γτ K σ γτ K ) ϑ 4 p ϑ p τ K x k+ x ρ. c and leading to ε > 0 since σ γτ K < c by assumption. Plugged into the last term in the above estimate, we get ( + ) ϑ 4 p ϑ p τ K x k+ x ρ = ε c Putting all estimates together then yields (8). = c ϑ p c σ γτ K c ϑ pτ K x k+ x ρ c( + σγ)σγτ K (c σ γτ K ) xk+ x ρ. In view of combining (3), (6) and (8), the factor in front of x k+ x ρ in (8) should not be too large. Choosing γ > 0 small enough, one can indeed control this quantity. Doing so, one arrives at the following result. Proposition 4. With the definitions in (7) and γ chosen such that 0 < γ < γ + ρ + ( + ρ + σγ )στ K (9) 6

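there is a $0 < c < 1$ and a $c' > 0$ such that the following estimate holds:
\[
\begin{aligned}
&\rho\bigl(L(x^{k+1}, y) - L(x, y^{k+1})\bigr) + \frac{1}{\vartheta}\Bigl[\frac{1}{2\sigma'}\|\hat x^{k+1} - x_\rho + \sigma' K^* y_\rho\|^2 + \frac{1}{2\tau'}\|\hat y^{k+1} - y_\rho - \tau' K x_\rho\|^2\Bigr] \\
&\quad + (1 - c)\Bigl[\frac{1}{2\sigma}\|\hat x^k - \bar x^{k+1}\|^2 + \frac{1}{2\tau}\|\hat y^k - \bar y^{k+1}\|^2\Bigr] + c'\|x^{k+1} - x_\rho\|^2 \\
&\le \frac{1}{2\sigma}\|\hat x^k - x_\rho + \sigma K^* y_\rho\|^2 + \frac{1}{2\tau}\|\hat y^k - y_\rho - \tau K x_\rho\|^2.
\end{aligned} \tag{30}
\]

Proof. Note that (29) implies that $\sigma^2\gamma\tau\|K\|^2 < 1$ and
\[
\frac{(1 + \sigma\gamma)\sigma\gamma\tau\|K\|^2}{2(1 - \sigma^2\gamma\tau\|K\|^2)} < \frac{\gamma_1}{1 + \rho} - \frac{\gamma}{2},
\]
so by continuity of $c \mapsto c(1 + \sigma\gamma)\sigma\gamma\tau\|K\|^2 / \bigl(2(c^2 - \sigma^2\gamma\tau\|K\|^2)\bigr)$ in $c = 1$, one can find a $c < 1$ with $c^2 > \sigma^2\gamma\tau\|K\|^2$ such that
\[
c' = \frac{\gamma_1}{1 + \rho} - \frac{\gamma}{2} - \frac{c(1 + \sigma\gamma)\sigma\gamma\tau\|K\|^2}{2(c^2 - \sigma^2\gamma\tau\|K\|^2)} > 0.
\]
The prerequisites for Lemma 6 are satisfied, hence one can combine (23), (26) and (28) in order to get (30).

Remark 1. From the proof it is also immediate that if (29) holds for a $\sigma_0 > 0$ instead of $\sigma$, the estimate (30) will still hold for all $0 < \sigma \le \sigma_0$ and $\tau > 0$ such that $\sigma\tau = \sigma_0\tau_0$, with $c$ and $c'$ independent of $\sigma$.

The estimate (30) suggests adapting the step-sizes $(\sigma, \tau) \to (\sigma', \tau')$ in each iteration step. This yields the accelerated Douglas–Rachford iteration which obeys the following recursion:
\[
\begin{cases}
\vartheta_k = 1/\sqrt{1 + \sigma_k\gamma}, \\
x^{k+1} = (\mathrm{id} + \sigma_k\partial F)^{-1}(\hat x^k), \\
y^{k+1} = (\mathrm{id} + \tau_k\partial G)^{-1}(\hat y^k), \\
\tilde x^{k+1} = \sigma_k K^*[y^{k+1} + \vartheta_k^{-1}(\bar y^{k+1} - \hat y^k)], \\
\tilde y^{k+1} = \tau_k K[x^{k+1} + \vartheta_k(\bar x^{k+1} - \hat x^k)], \\
\hat x^{k+1} = x^{k+1} - \vartheta_k \tilde x^{k+1}, \\
\hat y^{k+1} = y^{k+1} + \vartheta_k^{-1} \tilde y^{k+1}, \\
\sigma_{k+1} = \vartheta_k\sigma_k, \quad \tau_{k+1} = \vartheta_k^{-1}\tau_k.
\end{cases}
\]
As before, the iteration can be written down explicitly and in a simplified manner since, for instance, the product satisfies $\sigma_k\tau_k = \sigma_0\tau_0$ and some auxiliary variables can be omitted. Such a version is summarized in Table 4 along with conditions we will need in the following convergence analysis. One sees in particular that (aDR) requires only negligibly more computational and implementational effort than (DR).

Remark 2. Of course, the iteration (aDR) can easily be adapted to involve $(\mathrm{id} + \sigma_0\tau_0 KK^*)^{-1}$ instead of $(\mathrm{id} + \sigma_0\tau_0 K^*K)^{-1}$, in case the former can be computed more efficiently, for instance. A straightforward application of Woodbury's formula, however, would introduce an additional evaluation of $K$ and $K^*$, respectively, and hence, potentially higher computational effort.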
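As, in this situation, the issue cannot be resolved by simply interchanging primal and dual variable, we explicitly state the necessary modifications in order to maintain one evaluation of $K$ and $K^*$ in each iteration step:
\[
\begin{cases}
b^{k+1} = \bigl((1 + \vartheta_k^{-1})y^{k+1} - \vartheta_k^{-1}\hat y^k\bigr) + \tau_k K\bigl((1 + \vartheta_k^{-1})x^{k+1} - \hat x^k\bigr), \\
d^{k+1} = (\mathrm{id} + \sigma_0\tau_0 KK^*)^{-1} b^{k+1}, \\
\hat x^{k+1} = x^{k+1} - \vartheta_k\sigma_k K^* d^{k+1}, \\
\hat y^{k+1} = \vartheta_k^{-1}(\hat y^k - y^{k+1}) + d^{k+1}.
\end{cases} \tag{31}
\]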
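As a purely illustrative sketch, the accelerated recursion can be tried on a small quadratic model problem. The NumPy code below assumes $F(x) = \frac{\gamma_1}{2}\|x - f\|^2$ and $G(y) = \frac{1}{2}\|y\|^2 - \langle g, y\rangle$, for which both resolvents are explicit and the primal solution solves $(\gamma_1\,\mathrm{id} + K^*K)x = \gamma_1 f - K^*g$. The function names, the random test data and the parameter values are hypothetical choices made for this example only. The update is written with two auxiliary variables `b` and `d`, obtained by eliminating $\tilde x^{k+1}$ and $\tilde y^{k+1}$ from the recursion so that only $(\mathrm{id} + \sigma_0\tau_0 K^*K)^{-1}$ has to be applied.

```python
import numpy as np


def adr_quadratic(K, f, g, gamma1, sigma0, tau0, gamma, iters):
    """Accelerated Douglas-Rachford sketch for the assumed model
    F(x) = gamma1/2 * |x - f|^2,  G(y) = 1/2 * |y|^2 - <g, y>."""
    m, n = K.shape
    x_hat, y_hat = np.zeros(n), np.zeros(m)
    sigma, tau = sigma0, tau0
    # sigma_k * tau_k stays equal to sigma0 * tau0, so T never changes.
    T = np.eye(n) + sigma0 * tau0 * (K.T @ K)
    for _ in range(iters):
        theta = 1.0 / np.sqrt(1.0 + sigma * gamma)
        # Resolvents of F and G, explicit for the quadratic model.
        x = (x_hat + sigma * gamma1 * f) / (1.0 + sigma * gamma1)
        y = (y_hat + tau * g) / (1.0 + tau)
        # Linear step after eliminating the auxiliary variables.
        b = ((1 + theta) * x - theta * x_hat) \
            - sigma * (K.T @ ((1 + theta) * y - y_hat))
        d = np.linalg.solve(T, b)
        x_hat = theta * (x_hat - x) + d
        y_hat = y + (tau / theta) * (K @ d)
        # Step-size update: sigma shrinks, tau grows, product is constant.
        sigma, tau = theta * sigma, tau / theta
    return x, y


def model_problem(seed=0, m=4, n=5):
    """Hypothetical test data with operator norm |K| = 1."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((m, n))
    K /= np.linalg.norm(K, 2)
    return K, rng.standard_normal(n), rng.standard_normal(m)


def reference_solution(K, f, g, gamma1):
    """Saddle point of the model problem from its optimality system."""
    n = K.shape[1]
    x = np.linalg.solve(gamma1 * np.eye(n) + K.T @ K, gamma1 * f - K.T @ g)
    return x, K @ x + g
```

With $\gamma_1 = 1$, $\sigma_0 = \tau_0 = 1$ and $\|K\| = 1$, the value $\gamma = 0.25$ used in the check below satisfies the step-size restriction already for the initial step sizes; this is one admissible choice, not a recommendation.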

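(aDR)
Objective: Solve $\min_{x \in \operatorname{dom} F} \max_{y \in \operatorname{dom} G} \ \langle Kx, y\rangle + F(x) - G(y)$
Prerequisites: $F$ is strongly convex with modulus $\gamma_1 > 0$
Initialization: $(\hat x^0, \hat y^0) \in X \times Y$ initial guess, $\sigma_0 > 0$, $\tau_0 > 0$ initial step sizes, $0 < \gamma < \frac{2\gamma_1}{1 + \sigma_0\tau_0\|K\|^2}$ acceleration factor, $\vartheta_0 = \frac{1}{\sqrt{1 + \sigma_0\gamma}}$
Iteration:
  $x^{k+1} = (\mathrm{id} + \sigma_k\partial F)^{-1}(\hat x^k)$
  $y^{k+1} = (\mathrm{id} + \tau_k\partial G)^{-1}(\hat y^k)$
  $b^{k+1} = \bigl((1 + \vartheta_k)x^{k+1} - \vartheta_k\hat x^k\bigr) - \sigma_k K^*\bigl((1 + \vartheta_k)y^{k+1} - \hat y^k\bigr)$
  $d^{k+1} = (\mathrm{id} + \sigma_0\tau_0 K^*K)^{-1} b^{k+1}$
  $\hat x^{k+1} = \vartheta_k(\hat x^k - x^{k+1}) + d^{k+1}$
  $\hat y^{k+1} = y^{k+1} + \vartheta_k^{-1}\tau_k K d^{k+1}$
  $\sigma_{k+1} = \vartheta_k\sigma_k$, $\tau_{k+1} = \vartheta_k^{-1}\tau_k$, $\vartheta_{k+1} = \frac{1}{\sqrt{1 + \sigma_{k+1}\gamma}}$
Output: $\{(x^k, y^k)\}$ primal-dual sequence

Table 4: The accelerated Douglas–Rachford iteration for the solution of convex-concave saddle-point problems of type (1).

Now, as $F$ is strongly convex, we might hope for improved convergence properties for $\{x^k\}$ compared to (DR). This is indeed the case. For the convergence analysis, we introduce the following quantities:
\[
\lambda_k = \prod_{k'=0}^{k-1} \frac{1}{\vartheta_{k'}}, \qquad \nu_k = \Bigl[\sum_{k'=0}^{k-1} \lambda_{k'}\Bigr]^{-1}. \tag{32}
\]

Lemma 7. The sequences $\{\lambda_k\}$ and $\{\nu_k\}$ obey, for all $k \ge 1$,
\[
1 + \frac{k\sigma_0\gamma}{\sqrt{1 + \sigma_0\gamma} + 1} \le \lambda_k \le 1 + \frac{k\sigma_0\gamma}{2}, \qquad \nu_k \le \Bigl[k + \frac{(k - 1)k\sigma_0\gamma}{2(\sqrt{1 + \sigma_0\gamma} + 1)}\Bigr]^{-1} = O(1/k^2). \tag{33}
\]

Proof. Observing that $\lambda_{k+1} = \vartheta_k^{-1}\lambda_k$ as well as $\sigma_k = \sigma_0/\lambda_k$, we find that
\[
\lambda_{k+1} - \lambda_k = \lambda_k\bigl(\sqrt{1 + \sigma_k\gamma} - 1\bigr) = \frac{\lambda_k\sigma_k\gamma}{\sqrt{1 + \sigma_k\gamma} + 1} = \frac{\sigma_0\gamma}{\sqrt{1 + \sigma_k\gamma} + 1}
\]
for all $k \ge 0$. The sequence $\{\sigma_k\}$ is monotonically decreasing and positive, so estimating the denominator accordingly gives the bounds on $\lambda_k$. Summing up yields the estimate on $\nu_k$; in particular, for $k \ge 2$, we have $\nu_k \le \frac{4(\sqrt{1 + \sigma_0\gamma} + 1)}{\sigma_0\gamma k^2}$, i.e., $\nu_k = O(1/k^2)$.

Proposition 5. If (1) possesses a solution, then the iteration (aDR) converges to a saddle-point $(x^*, y^*)$ of (1) in the following sense:
\[
\lim_{k\to\infty} x^k = x^* \ \text{ with } \ \|x^k - x^*\|^2 = O(1/k^2), \qquad \operatorname*{w-lim}_{k\to\infty} y^k = y^*.
\]
In particular, each saddle-point $(x', y')$ of (1) satisfies $x' = x^*$.

Proof. Observe that since (aDR) requires $\gamma < 2\gamma_1/(1 + \sigma_0\tau_0\|K\|^2)$, and since $\sigma_k = \sigma_0/\lambda_k$ as well as Lemma 7 implies $\lim_{k\to\infty} \sigma_k = 0$, there is a $k_0 \ge 0$ such that (29) is satisfied for $\rho = 0$ and $\sigma_k$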

19 for all k k 0. Letting (x, y ) be a saddle-point, applying (30) recursively gives λ k [ σ k ˆx k x + σ k K y + τ k ŷ k y τ k Kx ] + k k =k 0 λ k [ c σ k ˆx k x k + + c ŷ k ȳ k + + c + τ k xk x ] λ k0 [ σ k0 ˆx k 0 x + σ k0 K y + τ k0 ŷ k 0 y τ k0 Kx ]. (34) This implies, on the one hand, the convergence of the series [ λ k ˆx k x k+ + ŷ k ȳ k+ + σ k τ k xk+ x ] <. k=0 As λ k = σ 0 /σ k = τ k /τ 0, we have in particular that lim k σ k (ˆx k x k+ ) = 0 and lim k (ŷ k ȳ k+ ) = 0. Furthermore, defining d k (x, y ) = λ k [ σ k ˆx k x + σ k K y + τ k ŷ k y τ k Kx ], the limit lim k d k (x, y ) = d (x, y ) has to exist: On the one hand, the sequence is bounded from below and admits a finite limes inferior. On the other hand, traversing with k 0 a subsequence that converges to the limes inferior, the estimate (34) yields that limes superior and limes inferior have to coincide, hence, the sequence is convergent. Denoting by ξ k = ˆx k x + σ k K y, ζ k = ŷ k y τ k Kx, we further conclude that { σ k ξ k } as well as {ζ k } are bounded. Plugging in the iteration (3) gives the identities { x k x σ k K (y k y ) = ξ k + ϑ k σ kk (ȳ k ŷ k ), y k y + τ k K(x k x ) = ζ k ϑ k τ k K( x k ˆx k (35) ), which can be solved with respect to x k x and y k y yielding { x k x = (id +σ 0 τ 0 K K) [ ξ k + σ k K ( ζ k + ϑ k (ȳk ŷ k ) ) σ 0 τ 0 ϑ k K K( x k ˆx k ) ], y k y = (id +σ 0 τ 0 KK ) [ ζ k τ k K ( ξ k + ϑ k ( x k ˆx k ) ) σ 0 τ 0 ϑ k KK (ȳ k ŷ k ) ]. Regarding the norm of x k x, the boundedness and convergence properties of the involved terms allow to conclude that x k x Cσ k = O(/k ) and, in particular, lim k x k = x as well as coincidence of the primal part for each saddle-point. Further, choosing a subsequence associated with the indices {k i } such that w-lim i σ ki ξ k i = ξ and w-lim i ζ k i = ζ (such a subsequence must exist), we see with τ k = σ 0 τ 0 /σ k that w-lim i σ ki (x k i x ) = (id +σ 0 τ 0 K K) (ξ + K ζ ), w-lim i (yk i y ) = (id +σ 0 τ 0 KK ) (ζ σ 0 τ 0 Kξ ). 
Weakening the statements, we conclude that w-lim i (x k i, y k i ) = (x, y ) for some y Y. Looking at the iteration (3) again yields σ ki τ ki (ˆx k i x k i ) ϑ k i K (ȳ k i ŷ k i ) K y k i + F(x k i ), (ŷ k i ȳ k i ) + ϑ ki K( x k i ˆx k i ) Kx k i + G(y k i ), 9

20 so by weak-strong closedness of maximally monotone operators, it follows that (0, 0) ( K y + F(x ), Kx + G(y ) ), i.e., (x, y ) is a saddle-point. As the subsequence was arbitrary, each weak accumulation point of {(x k, y k )} is a saddle-point. Suppose that (q, y ) and (q, y ) are both weak accumulation points of { ( σ k (x k x ), y k) }, ) = (q, y ) and w-lim i ( σ ki (x k i x ), y k i) = (q, y ) for some ( i.e., w-lim i σ ki (x k i x ), y k i {k i } and {k i }, respectively. Denote by (ξk, ζ k ) as above but with y replaced by y. Without loss of generality, we may assume that w-lim ( i σ ki ξ k i, ζ k i ) = (ξ, ζ ), w-lim ( i σ ξ k i k, ζ k i ) = (ξ, ζ ) i for some (ξ, ζ ), (ξ, ζ ) X Y. Dividing the first identity in (35) by σ k and passing both identities to the respective weak subsequential limits yields ξ ξ = q q K (y y ), ζ ζ = y y + σ 0 τ 0 K(q q ). Now, plugging in the definitions, we arrive at d k (x, y ) d k (x, y ) = σ 0 σ k ξ k + K (y y ), K (y y ) + τ 0 ζ k + y y, y y so that rearranging implies σ 0 σ k ξ k, K (y y ) + τ 0 ζ k, y y = d k (x, y ) d k (x, y ) + σ 0 K (y y ) + τ 0 y y. The right-hand side converges for the whole sequence, so passing the left-hand side to the respective subsequential weak limits and using the identities for ξ ξ and ζ ζ gives 0 = σ 0 ξ ξ, K (y y ) + τ 0 ζ ζ, y y = σ 0 K (y y ) + τ 0 y y, and hence, y = y. Consequently, y k y as k what was left to show. In addition to the convergence speed O(/k ) for x k x, restricted primal-dual gaps also converge and a restricted primal error of energy also admits a rate. Corollary. Let X 0 Y 0 dom F dom G be bounded and contain a saddle-point. Then, for the sequence generated by (adr), it holds that G X0 Y 0 (x k, y k ) 0 as k, E p Y 0 (x k ) = o(/k). Proof. 
From the estimate (5) in the proof of Lemma 4 with ρ = as well as the estimates ϑ k σ k γ and /ϑ k σ k γ we infer for (x, y) X 0 Y 0 that L(x k+, y) L(x, y k+ ) σ k σ k (ˆx k x k+ ) + σ k (ˆx k x k+ ) ˆx k x + σ k K y + σ k σ 0 τ 0 [ ŷ k ȳ k+ + ŷ k ȳ k+ ŷ k y τ k Kx ] + σ k γ K [ ȳ k+ ŷ k x k+ + x k+ ˆx k x + x k+ ˆx k y k+ + ȳ k+ ŷ k y ]. (36) 0

21 In the proof of Proposition 5 we have seen that σ k ˆx k x k+ 0 and ŷ k ȳ k+ 0 as k. Moreover, denoting by (x, y ) X 0 Y 0 a saddle-point of () and using the notation and results from the proof of Proposition 5, it follows that ˆx k x + σ k K y ξ k + sup x X 0 x x + σ 0 K sup y Y 0 y y C, σ k ŷ k y τ k Kx σ 0 ζ k + σ 0 sup y Y 0 y y + σ 0 τ 0 K sup x X 0 x x C for a suitable C > 0, so we see together with the boundedness of {(x k, y k )} that the right-hand side of (36) tends to zero as k, giving the estimate on G X0 Y 0. Regarding the rate on E p Y 0 (x k ), choosing X 0 = {x } and employing the boundedness of { σ k ξ k }, the refined estimates ˆx k x + σ k K y σ k [ C + K sup y Y 0 y y ], σ k ŷ k y τ k Kx σ k [ C + sup y Y 0 y y ] follow for some C > 0. Plugged into (36), one obtains with σ k = O(/k) the rate G {x } Y 0 (x k+, y k+ ) = sup y Y 0 Proposition then yields the desired estimate for E p Y 0. L(x k+, y) L(x, y k+ ) = o(/k). Now, switching again to ergodic sequences, the optimal rate O(/k ) can be obtained with the accelerated iteration. Theorem 3. For X 0 Y 0 dom F dom G bounded and containing a saddle-point (x, y ), the ergodic sequences according to k x k erg = ν k k =0 λ k x k +, k yerg k = ν k k =0 λ k y k +, where {(x k, y k )} is generated by (adr) and λ k, ν k are given by (3), converge to the same limit (x, y ) as {(x k, y k )}, in the sense that x k erg x = O(/k ) and w-lim k yerg k = y. The associated restricted primal-dual gap satisfies, for some k 0 0 and ρ > 0, the estimate G X0 Y 0 (x k erg, y k erg) ν k sup (x,y) X 0 Y 0 [ σ0 ρσ k 0 ˆx k 0 x ρ + σ k0 K y ρ + ρτ 0 ŷ k 0 y ρ τ k0 Kx ρ λ k ( L(x k +, y) L(x, y k + ) )] = O(/k ), k 0 + k =0 where again x ρ = ρx + ( ρ)x and y ρ = ρy + ( ρ)y. Proof. As each x k erg is a convex combination, the convergence x k x = O(/k) implies, together with the estimates of Lemma 7, that k x k erg x ν k k =0 k λ k x k + x Cν k k =0 λ k k + C ν k k = O(/k) for appropriate C, C > 0. 
The weak convergence of {yerg} k follows again by the Stolz Cesàro theorem: Indeed, for y Y, we have k lim k yk k erg, y = lim =0 λ k yk +, y λ k y k, y k k k =0 λ = lim = y, y. k k λ k (37)

22 For proving the estimate on the restricted primal-dual gap, first observe that as (adr) requires γ < γ /( + σ 0 τ 0 K ) we also have γ < γ /( + ρ + ( + ρ)σ 0 τ 0 K ) for some ρ > 0 and we can choose again k 0 such that Proposition 4 can be applied for k k 0. For (x, y) X 0 Y 0, convexity of (x, y ) L(x, y) L(x, y ) and estimate (30) yield ρ ( L(x k erg, y) L(x, yerg) k ) k ν k k 0 ν k k =0 k + ν k k 0 = ν k k =0 k =0 λ k ρ ( L(x k +, y) L(x, y k + ) ) k =k 0 λ k [ σ k [ λ k + λ k ρ ( L(x k +, y) L(x, y k + ) ) ˆx k x ρ + σ k K y ρ + ŷ k y ρ τ k Kx ρ ] τ k σ k + ˆx k + x ρ + σ k +K y ρ + ŷ k + y ρ τ k τ +Kx ρ ] k + λ k ρ ( L(x k +, y) L(x, y k + ) ) + ν k λ k0 [ σ k0 ˆx k 0 x ρ + σ k0 K y ρ + τ k0 ŷ k 0 y ρ τ k0 Kx ρ ] ν k λ k [ σ k ˆx k x ρ + σ k K y ρ + τ k ŷ k y ρ τ k Kx ρ ]. The estimate from above for the restricted primal-dual gap in (34) is thus valid. Moreover, the rate follows by Lemma 7 as ν k = O(/k ) and the fact that the supremum on the right-hand side of (37) is bounded. Indeed, the latter follows, from applying (5) for k = 0,..., k 0 as well as taking the boundedness of X 0 and Y 0 into account. Corollary 3. The full dual error obeys E d (y k erg) = O(/k ). If, moreover, G is strongly coercive, then additionally, the primal error satisfies E p (x k erg) = O(/k ). Proof. Employing Lemma on a bounded Y 0 dom G containing y k erg for all k and, subsequently, Proposition, it is sufficient for the estimate on E d that F is strongly coercive. This follows immediately from the strong convexity. The analogous argument can be used in order to obtain the rate for E p, however, we have to assume strong coercivity of G. In particular, we have an optimization scheme for the dual problem with the optimal rate as well as weak convergence, comparable to the modified FISTA method presented in [8]. Moreover, in many situations, even the primal problem obeys the optimal rate. Remark 3. 
If (29) is already satisfied for $\sigma_0$, $\tau_0$ and $\rho > 0$, which is possible for fixed $\sigma_0\tau_0$ by choosing $\sigma_0$ and $\rho > 0$ small enough, then one may choose $k_0 = 0$ such that the estimate on the restricted primal-dual gap reads as
\[
G_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}}) \le \nu_k \sup_{(x,y) \in X_0 \times Y_0} \Bigl[\frac{\|\hat x^0 - x_\rho + \sigma_0 K^* y_\rho\|^2}{2\rho\sigma_0} + \frac{\|\hat y^0 - y_\rho - \tau_0 K x_\rho\|^2}{2\rho\tau_0}\Bigr], \tag{38}
\]
which has the same structure as (5), the corresponding estimate for the basic Douglas–Rachford iteration. However, for small $\sigma_0\gamma$, see (33), the smallest constant $C > 0$ such that $\nu_k \le C/k^2$ for all $k \ge 1$ might become large. This potentially leads to (38) being worse than (37) for a greater $\sigma_0$. Thus, choosing $\sigma_0$ sufficiently small such that (38) holds does not offer a clear advantage.
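The dependence of the constant in $\nu_k \le C/k^2$ on $\sigma_0\gamma$ can be explored numerically. The following sketch, under the assumption that $\lambda_k$ and $\nu_k$ are defined as the product of $1/\vartheta_{k'}$ and the reciprocal of the partial sums of $\lambda_{k'}$ as in the convergence analysis of the accelerated iteration, evaluates both sequences and the empirical value of $\max_k \nu_k k^2$ over a finite range. The function names and the truncation at `k_max` are hypothetical choices for illustration.

```python
import math


def lambda_nu(sigma0, gamma, k_max):
    """Return lists with lam[k] = lambda_k and nu[k] = nu_k, k = 0..k_max."""
    lam = [1.0]        # lambda_0 is the empty product
    nu = [math.inf]    # nu_0 is an empty sum inverted; placeholder only
    sigma, partial_sum = sigma0, 0.0
    for _ in range(k_max):
        theta = 1.0 / math.sqrt(1.0 + sigma * gamma)
        partial_sum += lam[-1]          # sum of lambda_0 .. lambda_{k-1}
        nu.append(1.0 / partial_sum)    # nu_k
        lam.append(lam[-1] / theta)     # lambda_{k+1} = lambda_k / theta_k
        sigma *= theta                  # sigma_{k+1} = theta_k * sigma_k
    return lam, nu


def empirical_rate_constant(sigma0, gamma, k_max=2000):
    """Largest value of nu_k * k^2 observed for 1 <= k <= k_max."""
    _, nu = lambda_nu(sigma0, gamma, k_max)
    return max(nu[k] * k * k for k in range(1, k_max + 1))
```

Comparing, for instance, `empirical_rate_constant(0.01, 0.5)` with `empirical_rate_constant(1.0, 0.5)` illustrates how a small initial step size inflates the constant, in line with the remark above.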

23 3. Strong convexity of F and G Assume in the following that both F and G are strongly convex, i.e., γ > 0 and γ > 0. We obtain an analogous identity to (6), however, with different ϑ and without changing the step-sizes σ and τ. Lemma 8. Let, in the situation of Lemma 4, the values γ > 0 and γ > 0 satisfy σγ = τγ. Then, for σ xk+ x ρ + σk y ρ + γ xk+ x ρ + τ ȳk+ y ρ τkx ρ + γ yk+ y ρ = [ ϑ σ xk+ x ρ + σk y ρ + τ ȳk+ y ρ τkx ρ ] + ϑ [ ϑ τ yk+ y ρ, τkx ρ + ỹ k+ σ xk+ x ρ, σk y ρ x ] k+ γ σk y ρ x k+ γ τkx ρ + ỹ k+ (39) ϑ = + σγ = + τγ. Proof. This follows immediately from the identities σ xk+ x ρ + σk y ρ + γ xk+ x ρ as well as = + σγ σ xk+ x ρ + σk y ρ γ x k+ x ρ, σk y ρ x k+ γ σk y ρ x k+ τ ȳk+ y ρ τkx ρ + γ yk+ y ρ = + τγ ȳ k+ y ρ τkx ρ + γ y k+ y ρ, τkx ρ + ỹ k+ γ τ τkx ρ + ỹ k+. Proposition 6. In the situation of Lemma 8, with ϑ p = ϑ d = ϑ, ρ [0, ] and γ > 0, γ > 0 chosen such that γ γ + ρ + ( + ρ + σγ )στ K, γ = γγ (40) γ there is a 0 < c < and a c > 0 such that the following estimate holds: ρ ( L(x k+, y) L(x, y k+ ) ) + ϑ [ + ( c) [ σ xk+ x ρ + σk y ρ + τ ȳk+ y ρ τkx ρ ] σ ˆxk x k+ + τ ŷk ȳ k+ ] + c xk+ x ρ + c yk+ y ρ σ ˆxk x ρ + σk y ρ + τ ŷk y ρ τkx ρ. (4) Proof. Plugging in the definitions (), the scalar-product terms in (39) can be written as ϑ [ ϑ τ yk+ y ρ, τkx ρ + ỹ k+ σ xk+ x ρ, σk y ρ x ] k+ = ( ϑ) [ K[x k+ x ρ ], ȳ k+ ŷ k y k+ y ρ, K[ x k+ ˆx k ] ]. 3

24 Adding the scalar-product terms in (3), we arrive at ( ϑ) ŷ k ȳ k+, K[x k+ + x k+ ˆx k x ρ ] ( ϑ) y k+ + ȳ k+ ŷ k y ρ, K[ˆx k x k+ ] + ( ϑ) K[x k+ x ρ ], ȳ k+ ŷ k ( ϑ) y k+ y ρ, K[ x k+ ˆx k ] = ( ϑ) ŷ k ȳ k+, K[ x k+ ˆx k ] ( ϑ) ȳ k+ ŷ k, K[ˆx k x k+ ] = 0, so one only needs to estimate the squared-norm terms on the right-hand side of (39). Doing so, employing the restriction on γ yields σγστ K <, so choosing c according to allows to obtain as well as ( max σγστ K ), ( + σγ) < c < γ σk y ρ x k+ = σ γ K [y k+ y ρ + ϑ(ȳ k+ ŷ k )] c τ ȳk+ ŷ k c( + σγ) σ γ K + (c( + σγ) σ τγ K ) yk+ y ρ, γ τkx ρ + ỹ k+ c σ xk+ ˆx k c( + τγ ) τ γ K + (c( + τγ ) στ γ K ) xk+ x ρ. As before, we would like to have that the factors in front of x k+ x ρ and y k+ y ρ are strictly below γ +ρ γ and γ +ρ γ, respectively. Indeed, with the restriction on γ, γ = σ τ γ, γ = σ τ γ, strict monotonicity as well as c( + σγ) >, one sees that σ γ K σγστ K + ρ γ γ and c( + σγ) σ γ K (c( + σγ) σ τγ K ) γ + ρ γ c for a c > 0. Analogously, again with γ = σ τ γ, it also follows that τ γ K τγ στ K + ρ γ γ and c( + τγ ) τ γ K (c( + τγ ) στ γ K ) γ + ρ γ c for a c > 0. Choosing c = min(c, c ) and combining (3), (39) with the above estimates thus yields (4). Letting ˆx k+ = x k+ and ŷ k+ = ȳ k+, we arrive at the accelerated Douglas Rachford method for strongly convex-concave saddle-point problems x k+ = (id +σ F) ( x k ), y k+ = (id +τ G) (ȳ k ), x k+ = x k+ σk [y k+ + ϑ(ȳ k+ ȳ k (4) )], ȳ k+ = y k+ + τk[x k+ + ϑ( x k+ x k )], which, together with the restrictions σγ = τγ and (40), is also summarized in Table 5. As we will see in the following, the estimate (4) implies R-linear convergence of the sequences {(x k, y k )} and {( x k, ȳ k )}, i.e., an exponential rate. 4
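For illustration, the iteration for the strongly convex-concave case can be sketched on a quadratic model problem. The code below assumes $F(x) = \frac{\gamma_1}{2}\|x - f\|^2$ and $G(y) = \frac{\gamma_2}{2}\|y - g\|^2$, takes $\tau = \sigma\gamma_1/\gamma_2$ so that $\sigma\gamma = \tau\gamma'$ holds, and uses the largest $\gamma$ admitted by the restriction on the acceleration factor for $\rho = 0$. The coupled update of $\bar x^{k+1}$ and $\bar y^{k+1}$ is resolved through two auxiliary variables `b` and `d`, which is one way of eliminating the implicit dependence; all names, data and parameter values are hypothetical.

```python
import numpy as np


def adr_sc_parameters(norm_K, gamma1, gamma2, sigma):
    """Step size tau, acceleration factor gamma and theta for given sigma."""
    tau = sigma * gamma1 / gamma2
    s = sigma * tau * norm_K ** 2
    gamma = 2.0 * gamma1 / (1.0 + (1.0 + 2.0 * sigma * gamma1) * s)
    theta = 1.0 / (1.0 + sigma * gamma)
    return tau, gamma, theta


def adr_sc_quadratic(K, f, g, gamma1, gamma2, sigma, iters):
    """Sketch of the accelerated iteration for the assumed model
    F(x) = gamma1/2 * |x - f|^2,  G(y) = gamma2/2 * |y - g|^2."""
    m, n = K.shape
    tau, gamma, theta = adr_sc_parameters(
        np.linalg.norm(K, 2), gamma1, gamma2, sigma)
    T = np.eye(n) + theta ** 2 * sigma * tau * (K.T @ K)
    x_bar, y_bar = np.zeros(n), np.zeros(m)
    r = (1.0 + theta) / theta
    for _ in range(iters):
        # Resolvents of F and G, explicit for the quadratic model.
        x = (x_bar + sigma * gamma1 * f) / (1.0 + sigma * gamma1)
        y = (y_bar + tau * gamma2 * g) / (1.0 + tau * gamma2)
        # Linear step that resolves the coupled (x_bar, y_bar) update.
        b = (r * x - x_bar) - theta * sigma * (K.T @ (r * y - y_bar))
        d = np.linalg.solve(T, b)
        x_bar = x_bar - x / theta + d
        y_bar = y + theta * tau * (K @ d)
    return x, y, theta


def reference_solution_sc(K, f, g, gamma1, gamma2):
    """Saddle point of the model problem from its optimality system."""
    n = K.shape[1]
    x = np.linalg.solve(gamma1 * np.eye(n) + (K.T @ K) / gamma2,
                        gamma1 * f - K.T @ g)
    return x, g + (K @ x) / gamma2
```

Since the contraction factor $\vartheta$ is fixed here, a few hundred iterations should already bring the iterates of this small example close to machine precision.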

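(aDR$_{\mathrm{sc}}$)
Objective: Solve $\min_{x \in \operatorname{dom} F} \max_{y \in \operatorname{dom} G} \ \langle Kx, y\rangle + F(x) - G(y)$
Prerequisites: $F$, $G$ are strongly convex with respective moduli $\gamma_1, \gamma_2 > 0$
Initialization: $(\bar x^0, \bar y^0) \in X \times Y$ initial guess, $\sigma, \tau > 0$ step sizes with $\sigma\gamma_1 = \tau\gamma_2$, $\gamma \le \frac{2\gamma_1}{1 + (1 + 2\sigma\gamma_1)\sigma\tau\|K\|^2}$ acceleration factor, $\vartheta = \frac{1}{1 + \sigma\gamma}$
Iteration:
  $x^{k+1} = (\mathrm{id} + \sigma\partial F)^{-1}(\bar x^k)$
  $y^{k+1} = (\mathrm{id} + \tau\partial G)^{-1}(\bar y^k)$
  $b^{k+1} = \bigl(\tfrac{1 + \vartheta}{\vartheta}x^{k+1} - \bar x^k\bigr) - \vartheta\sigma K^*\bigl(\tfrac{1 + \vartheta}{\vartheta}y^{k+1} - \bar y^k\bigr)$
  $d^{k+1} = (\mathrm{id} + \vartheta^2\sigma\tau K^*K)^{-1} b^{k+1}$
  $\bar x^{k+1} = \bar x^k - \vartheta^{-1}x^{k+1} + d^{k+1}$
  $\bar y^{k+1} = y^{k+1} + \vartheta\tau K d^{k+1}$
Output: $\{(x^k, y^k)\}$ primal-dual sequence

Table 5: The accelerated Douglas–Rachford iteration for the solution of strongly convex-concave saddle-point problems of type (1).

Theorem 4. The iteration (aDR$_{\mathrm{sc}}$) converges R-linearly to the unique saddle-point $(x^*, y^*)$ in the following sense:
\[
\lim_{k\to\infty} (x^k, y^k) = (x^*, y^*), \qquad \|x^k - x^*\|^2 = o(\vartheta^k), \qquad \|y^k - y^*\|^2 = o(\vartheta^k).
\]
Furthermore, denoting $\bar x^* = x^* - \sigma K^* y^*$ and $\bar y^* = y^* + \tau K x^*$, we have
\[
\lim_{k\to\infty} (\bar x^k, \bar y^k) = (\bar x^*, \bar y^*), \qquad \|\bar x^k - \bar x^*\|^2 = o(\vartheta^k), \qquad \|\bar y^k - \bar y^*\|^2 = o(\vartheta^k).
\]

Proof. As $F$ and $G$ are strongly convex, a unique saddle-point $(x^*, y^*)$ must exist. Letting $(x, y) = (x^*, y^*)$, the choice of $\gamma$ in (aDR$_{\mathrm{sc}}$) leads to Proposition 6 being applicable for $\rho = 0$, such that (41) yields
\[
\begin{aligned}
&\frac{1}{\vartheta^k}\Bigl[\frac{1}{2\sigma}\|\bar x^k - x^* + \sigma K^* y^*\|^2 + \frac{1}{2\tau}\|\bar y^k - y^* - \tau K x^*\|^2\Bigr] \\
&\quad + \sum_{k'=0}^{k-1} \frac{1}{\vartheta^{k'}}\Bigl[(1 - c)\Bigl[\frac{\|\bar x^{k'} - \bar x^{k'+1}\|^2}{2\sigma} + \frac{\|\bar y^{k'} - \bar y^{k'+1}\|^2}{2\tau}\Bigr] + c'\bigl[\|x^{k'+1} - x^*\|^2 + \|y^{k'+1} - y^*\|^2\bigr]\Bigr] \\
&\le \frac{1}{2\sigma}\|\bar x^0 - x^* + \sigma K^* y^*\|^2 + \frac{1}{2\tau}\|\bar y^0 - y^* - \tau K x^*\|^2
\end{aligned}
\]
for each $k$. Consequently, the sequences $\{\vartheta^{-k}\|\bar x^k - \bar x^{k+1}\|^2\}$, $\{\vartheta^{-k}\|\bar y^k - \bar y^{k+1}\|^2\}$ as well as $\{\vartheta^{-k}\|x^{k+1} - x^*\|^2\}$, $\{\vartheta^{-k}\|y^{k+1} - y^*\|^2\}$ are summable. This implies the stated rates for $\{x^k\}$ and $\{y^k\}$ as well as $\|\bar x^{k+1} - \bar x^k\|^2 = o(\vartheta^k)$ and $\|\bar y^{k+1} - \bar y^k\|^2 = o(\vartheta^k)$. The convergence statements and rates for $\{\bar x^k\}$ and $\{\bar y^k\}$ are then a consequence of (42).

Corollary 4. The primal-dual gap obeys $G(x^k, y^k) = o(\vartheta^{k/2})$.

Proof. As $F$ and $G$ are strongly convex, these functionals are also strongly coercive and Lemma yields that $G(x^k, y^k) = G_{X_0 \times Y_0}(x^k, y^k)$ for some bounded $X_0 \times Y_0 \subset \operatorname{dom} F \times \operatorname{dom} G$ containing the saddle-point $(x^*, y^*)$ and all $k$.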
For (x, y) X 0 Y 0, the estimate (5) gives, for ρ =, ] L(x k+, y) L(x, y k+ ) x k x [ k+ σ xk+ x + σk y + K (y k+ + ȳ k+ ȳ k y) ] + ȳ k ȳ [ k+ τ ȳk+ y τkx + K(x k+ + x k+ x k x), 5

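so by boundedness, $G_{X_0 \times Y_0}(x^{k+1}, y^{k+1}) \le C\bigl(\|\bar x^k - \bar x^{k+1}\| + \|\bar y^k - \bar y^{k+1}\|\bigr) = o(\vartheta^{k/2})$.

Remark 4. Considering the ergodic sequences
\[
x^k_{\mathrm{erg}} = \nu_k \sum_{k'=0}^{k-1} \frac{1}{\vartheta^{k'}} x^{k'+1}, \qquad y^k_{\mathrm{erg}} = \nu_k \sum_{k'=0}^{k-1} \frac{1}{\vartheta^{k'}} y^{k'+1}, \qquad \nu_k = \Bigl[\sum_{k'=0}^{k-1} \frac{1}{\vartheta^{k'}}\Bigr]^{-1} = \frac{(1 - \vartheta)\vartheta^{k-1}}{1 - \vartheta^k},
\]
gives the slightly worse rates $\|x^k_{\mathrm{erg}} - x^*\|^2 = O(\vartheta^k)$ and $\|y^k_{\mathrm{erg}} - y^*\|^2 = O(\vartheta^k)$. Moreover, if the restriction on $\gamma$ in (aDR$_{\mathrm{sc}}$) is not sharp, i.e., $\gamma < \frac{2\gamma_1}{1 + (1 + 2\sigma\gamma_1)\sigma\tau\|K\|^2}$, then (40) holds for some $\rho > 0$ and the estimate (41) also implies
\[
\begin{aligned}
\rho\bigl(L(x^k_{\mathrm{erg}}, y) - L(x, y^k_{\mathrm{erg}})\bigr) \le{}& \nu_k\Bigl[\frac{1}{2\sigma}\|\bar x^0 - x_\rho + \sigma K^* y_\rho\|^2 + \frac{1}{2\tau}\|\bar y^0 - y_\rho - \tau K x_\rho\|^2\Bigr] \\
&- \frac{\nu_k}{\vartheta^k}\Bigl[\frac{1}{2\sigma}\|\bar x^k - x_\rho + \sigma K^* y_\rho\|^2 + \frac{1}{2\tau}\|\bar y^k - y_\rho - \tau K x_\rho\|^2\Bigr].
\end{aligned}
\]
This leads to the following estimate on the restricted primal-dual gap
\[
G_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}}) \le \nu_k \sup_{(x,y) \in X_0 \times Y_0} \Bigl[\frac{1}{2\rho\sigma}\|\bar x^0 - x_\rho + \sigma K^* y_\rho\|^2 + \frac{1}{2\rho\tau}\|\bar y^0 - y_\rho - \tau K x_\rho\|^2\Bigr],
\]
and, choosing a bounded $X_0 \times Y_0$ such that $G(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}}) = G_{X_0 \times Y_0}(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}})$ for all $k$, to the rate $G(x^k_{\mathrm{erg}}, y^k_{\mathrm{erg}}) = O(\vartheta^k)$. Thus, while having roughly the same qualitative convergence behavior, the ergodic sequences admit a better rate in the exponential decay.

Remark 5. With the choice $\sigma\gamma = \tau\gamma'$, the best $\vartheta$ with respect to the restrictions (40) for fixed $\sigma, \tau > 0$ is given by
\[
\vartheta = \frac{1}{1 + \sigma\gamma}, \qquad \gamma = \frac{2\gamma_1}{1 + (1 + 2\sigma\gamma_1)\sigma\tau\|K\|^2}.
\]
For varying $\sigma$ and $\tau$, this choice of $\vartheta$ becomes minimal when $\sigma\gamma$ becomes maximal. Performing the maximization in the non-trivial case $K \ne 0$ leads to the problem of finding the positive root of the third-order polynomial $\sigma \mapsto \gamma_2 - \gamma_1\|K\|^2\sigma^2 - 4\gamma_1^2\|K\|^2\sigma^3$. A very rough approximation of this root can be obtained by dropping the cubic term, leading to the easily computable parameters
\[
\sigma = \frac{1}{\|K\|}\sqrt{\frac{\gamma_2}{\gamma_1}}, \qquad \tau = \frac{1}{\|K\|}\sqrt{\frac{\gamma_1}{\gamma_2}}, \qquad \gamma = \frac{\gamma_1\|K\|}{\|K\| + \sqrt{\gamma_1\gamma_2}}, \qquad \gamma' = \frac{\gamma_2\|K\|}{\|K\| + \sqrt{\gamma_1\gamma_2}},
\]
which, in turn, can be expected to give reasonable convergence properties.

3.3 Preconditioning

The iterations (aDR) and (aDR$_{\mathrm{sc}}$) can now be preconditioned with the same technique as for the basic iteration (see Subsection.), i.e., based on the amended saddle-point problem (6).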
Let us first discuss preconditioning of (aDR): With $F$ being strongly convex, the primal functional in (6) is strongly convex, so we may indeed apply the accelerated Douglas–Rachford iteration to this problem and obtain, in particular, an ergodic convergence rate of $O(1/k^2)$. Taking the explicit structure into account similar to Subsection., one arrives at the preconditioned accelerated Douglas–Rachford method according to Table 6. In particular, the update step for $d^{k+1}$ again corresponds to an iteration step for the solution of $T d^{k+1} = b^{k+1}$ with respect to the matrix splitting $T = M + (T - M)$. In order to obtain convergence, one needs that $M$ is a feasible preconditioner for $T = \mathrm{id} + \sigma_0\tau_0 K^*K$ and has to choose the parameter $\gamma$ according to
\[
\gamma < \frac{2\gamma_1}{1 + \sigma_0\tau_0\|K^*K + H^*H\|} = \frac{2\gamma_1}{\|M\|}.
\]
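To make the role of the matrix splitting concrete, the following sketch iterates the update $d \leftarrow d + M^{-1}(b - Td)$ on its own for a fixed right-hand side. It assumes the simplest choice $M = \lambda\,\mathrm{id}$ with $\lambda \ge \|T\|$ (a Richardson-type preconditioner) and a small dense matrix $T = \mathrm{id} + \sigma_0\tau_0 K^\top K$; the function names and data are hypothetical. In the preconditioned method only one such step is performed per outer iteration, whereas here the step is repeated merely to show that it is a convergent scheme for $Td = b$.

```python
import numpy as np


def splitting_steps(T, apply_M_inv, b, d0, steps):
    """Repeat d <- d + M^{-1}(b - T d), the update used for d^{k+1}."""
    d = d0.copy()
    for _ in range(steps):
        d = d + apply_M_inv(b - T @ d)
    return d


def richardson(T):
    """Return a function applying M^{-1} for M = lam * id, lam = |T|."""
    lam = np.linalg.norm(T, 2)
    return lambda residual: residual / lam
```

For a symmetric positive definite $T$ with smallest eigenvalue $\mu$, each step of this particular choice contracts the error by at most the factor $1 - \mu/\lambda$, so better preconditioners correspond to $M$ closer to $T$.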

(pADR)
Objective: Solve $\min_{x \in \operatorname{dom} F} \max_{y \in \operatorname{dom} G} \ \langle Kx, y\rangle + F(x) - G(y)$
Prerequisites: $F$ is strongly convex with modulus $\gamma_1 > 0$
Initialization: $(\hat x^0, \hat y^0) \in X \times Y$ initial guess, $\sigma_0 > 0$, $\tau_0 > 0$ initial step sizes, $M$ feasible preconditioner for $T = \mathrm{id} + \sigma_0\tau_0 K^*K$, $0 < \gamma < \frac{2\gamma_1}{\|M\|}$ acceleration factor, $\vartheta_0 = \frac{1}{\sqrt{1 + \sigma_0\gamma}}$
Iteration:
  $x^{k+1} = (\mathrm{id} + \sigma_k\partial F)^{-1}(\hat x^k)$
  $y^{k+1} = (\mathrm{id} + \tau_k\partial G)^{-1}(\hat y^k)$
  $b^{k+1} = \bigl((1 + \vartheta_k)x^{k+1} - \vartheta_k\hat x^k\bigr) - \sigma_k K^*\bigl((1 + \vartheta_k)y^{k+1} - \hat y^k\bigr)$
  $d^{k+1} = d^k + M^{-1}(b^{k+1} - T d^k)$
  $\hat x^{k+1} = \vartheta_k(\hat x^k - x^{k+1}) + d^{k+1}$
  $\hat y^{k+1} = y^{k+1} + \vartheta_k^{-1}\tau_k K d^{k+1}$
  $\sigma_{k+1} = \vartheta_k\sigma_k$, $\tau_{k+1} = \vartheta_k^{-1}\tau_k$, $\vartheta_{k+1} = \frac{1}{\sqrt{1 + \sigma_{k+1}\gamma}}$
Output: $\{(x^k, y^k)\}$ primal-dual sequence

Table 6: The preconditioned accelerated Douglas–Rachford iteration for the solution of convex-concave saddle-point problems of type (1).

The latter condition, however, requires an estimate on $\|M\|$ which usually has to be obtained case by case. One observes, nevertheless, the following properties for the feasible preconditioner $M_n$ corresponding to the $n$-fold application of the feasible preconditioner $M$. In particular, it will turn out that applying a preconditioner multiple times only requires an estimate on $\|M\|$. Furthermore, the norms $\|M_n\|$ approach $\|T\|$ as $n \to \infty$.

Proposition 7. Let $M$ be a feasible preconditioner for $T$. For each $n \ge 1$, denote by $M_n$ the $n$-fold application of $M$, i.e., such that $d^{k+1} = d^k + M_n^{-1}(b^{k+1} - T d^k)$ holds for $d^{k+1}$ in (9) of Proposition 3. Then, the norm of the $M_n$ is monotonically decreasing for increasing $n$, i.e.,
\[
\|M_n\| \le \|M_{n-1}\| \le \ldots \le \|M_1\| = \|M\|.
\]
Furthermore, if $T$ is invertible, the spectral radius of $\mathrm{id} - M^{-1}T$ obeys $\rho(\mathrm{id} - M^{-1}T) < 1$ and
\[
\|M_n\| \le \frac{\|T\|}{1 - \|M^{-1}\|\,\|M\|\,\rho(\mathrm{id} - M^{-1}T)^n} \tag{43}
\]
whenever the expression on the right-hand side is non-negative.

Proof. The proof is adapted from the proof of Proposition.5 of [5] from which we need the following two conclusions. One is that for any linear, self-adjoint and positive definite pair of operators $B$ and $S$ in the Hilbert space $X$, the following three statements are equivalent: $B - S \ge 0$, $\mathrm{id} - B^{-1/2}SB^{-1/2} \ge 0$ and $\sigma(\mathrm{id} - B^{-1}S) \subset [0, 1[$, where $\sigma$ denotes the spectrum.
Here, B / is the square root of B, and B / is its inverse. Actually, we also have σ(id B S) = σ ( B(id B S)B ) = σ(id SB ) [0, [, while σ(id SB ) [0, [ also means S B. The other conclusion from Proposition.5 of [5] is the recursion relation for M n according to Mn+ = M n + (id M T ) n M = Mn + M / (id M / T M / ) n M / 7

28 with id M / T M / being a positive semidefinite operator. Actually, since M / is also self-adjoint, we have that id M / T M / is also self-adjoint, and it follows that M / (id M / T M / ) n M / 0 M n+ M n. By the above equivalence, this leads to M n+ M n, and M n+ M n from which the monotonicity behavior follows by induction. For the estimate on M n in case T is invertible, introduce an equivalent norm on X associated with the M-scalar product x, x M = Mx, x. Then, id Mn T is self-adjoint with respect to the M-scalar product: (id M n T )x, x M = M(id M T ) n x, x = (id T M ) n Mx, x = Mx, (id M T ) n x = x, (id Mn T )x M for x, x X. Let 0 q and suppose that x 0 is chosen such that (id M T )x M q x M. Then, ( q ) Mx, x = ( q ) x M T x, x + (id T / M T / )T / x, T / x T x, x, as id T / M T / is positive semi-definite due to σ(id T / M T / ) = σ(id M T ) [0, [. Consequently, since T is continuously invertible, T x ( q ) M x q T M. Hence, q must be bounded away from and, consequently, ρ(id M T ) = id M T M < as the contrary would result in a contradiction. ρ(id M T ) n as well as Now, id M n T M id M T n M = id M n Eventually, estimating T = M / (id Mn T )M / M M / Mρ(id M T ) n = M ρ(id M T ) n. M n T + M n id Mn T T + M n M ρ(id M T ) n gives, if M ρ(id M T ) n <, the desired estimate (43). Remark 6. Let us shortly discuss possible ways to estimate M for the classical preconditioners in Table 3. For the Richardson preconditioner, it is clear that M = λ with λ T being an estimate on T. For the damped Jacobi preconditioner, one has to choose λ λ max (T D) where λ max denotes the greatest eigenvalue, which results in M = (λ + ) D. The norm of the diagonal matrix D is easily obtainable as the maximum of the entries. 8

29 For symmetric SOR preconditioners, which include the symmetric Gauss Seidel case, we can express M T as M T = ω ( ω ω ω D + E) D ( ω ω D + E ) (44) for T = D E E, D diagonal and E lower triangular matrix of T, respectively, as well as ω ]0, [. Hence, M T can be estimated by estimating the squared norm of ω ω D/ + D / E. This can, for instance, be done as follows. Denoting by t ij the entries of T and performing some computational steps, we get for fixed i that ( ω ( ω ω D/ x + D / E x) i = ii x i ) t ij x j leading to ( ω ω ω t/ t / ii j>i [ ω ( t ii + t ) ij ω t ii j>i ] max t ij x j>i D/ + D / E ) ω ω D + diag(d E ) max t ij = K. j i where is the maximum row-sum norm and symmetry of T has been used. This results in M T + ω ω K. In particular, in case that T is weakly diagonally dominant, then D E such that for the Gauss Seidel preconditioner one obtains the easily computable estimate M T + max j i t ij. Let us next discuss preconditioning of (adr sc ). It is immediate that for strongly convex F and G, the acceleration strategy in (4) may be employed for the modified saddle-point problem (6). As before, this results in a preconditioned version of (adr sc ) with M = id +ϑ στ(k K + H H) being a feasible preconditioner for T = id +ϑ στk K. However, as one usually constructs M for given T, it will in this case depend on ϑ and, consequently, on the restriction on γ which reads in this context as γ γ + ( + σγ )ϑ ( M ). (45) As ϑ = +σγ, the condition on γ is implicit and it is not clear whether it will be satisfied for a given M. Nevertheless, if the latter is the case, preconditioning yields the iteration scheme (padr sc ) in Table 7 which inherits all convergence properties of the unpreconditioned iteration. So, in order to satisfy the conditions on M and γ, we introduce and discuss a notion that leads to sufficient conditions. Definition 3. Let (M ϑ ) ϑ, ϑ [0, ] be a family of feasible preconditioners for T ϑ = id +ϑ T where T 0. 
We call this family ϑ-norm-monotone with respect to T, if ϑ ϑ implies ϑ ( M ϑ ) (ϑ ) ( M ϑ ) for ϑ, ϑ [0, ]. Lemma 9. If (M ϑ ) ϑ is ϑ-norm-monotone with respect to στk K and γ 0 > 0 satisfies, for some ϑ [0, ], γ 0 < γ 0 + ( + σγ )(ϑ ) ( M ϑ ), ϑ, + σγ 0 then, γ = γ 0 satisfies (45) for M = M ϑ with ϑ = implication γ 0 γ γ + ( + σγ )ϑ ( M ϑ ) +σγ 0 and we have, for each γ > 0, the + σγ ϑ. 9

30 padr sc Objective: Solve min x dom F max y dom G Kx, y + F(x) G(y) Prerequisites: F, G are strongly convex with respective moduli γ, γ, > 0 Initialization: ( x 0, ȳ 0 ) X Y initial guess, σ, τ > 0 step sizes with σγ = τγ, γ acceleration factor, M feasible preconditioner for T = id +ϑ στk K γ and such that γ +(+σγ )ϑ ( M ) where ϑ = +σγ Iteration: x k+ = (id +σ F) ( x k ) Output: y k+ = (id +τ G) (ȳ k ) b k+ = ( +ϑ ϑ xk+ x k ) ϑσk ( +ϑ ϑ yk+ ȳ k ) d k+ = d k + M (b k+ T d k ) x k+ = x k ϑ x k+ + d k+ ȳ k+ = y k+ + ϑτkd k+ {(x k, y k )} primal-dual sequence (padr sc ) Table 7: The preconditioned accelerated Douglas Rachford iteration for the solution of strongly convex-concave saddle-point problems of type (). Proof. By the choice of γ 0 and ϑ-norm-monotonicity we immediately see that γ 0 γ / ( + ( + σγ )ϑ ( M ϑ ) ) for ϑ = ( + σγ 0 ), giving (45) as stated. The remaining conclusion is just a consequence of the monotonicity of γ ( + σγ). Remark 7. The result of Lemma 9 can be used as follows. Choose a family of feasible preconditioners (M ϑ ) ϑ for id +ϑ στk K which is ϑ-norm-monotone with respect to στk K. Starting with ϑ = and M, choosing γ 0 according to the lemma already establishes (45) for γ = γ 0. However, the corresponding M ϑ yields an improved bound for γ 0 and a greater γ 0 can be used instead without violating (45). This procedure may be iterated in order to obtain a good acceleration factor γ. Observe that the argumentation remains valid if the norm M ϑ is replaced by some estimate K ϑ M ϑ in Definition 3 and Lemma 9. Remark 8. Let us discuss feasible preconditioners for T ϑ = id +ϑ T, T 0 that are ϑ-normmonotone with respect to T. For the Richardson preconditioner, choose λ : [0, ] R for which ϑ ϑ λ(ϑ) is monotonically increasing and λ(ϑ) ϑ T for each ϑ [0, ]. Then, M ϑ = (λ(ϑ) + ) id defines a family of feasible preconditioners for T ϑ which is ϑ-norm-monotone as ϑ ( M ϑ ) = ϑ λ(ϑ) for each ϑ ]0, ]. 
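For the damped Jacobi preconditioner, we see that for $D_\vartheta$ the diagonal matrix of $T_\vartheta$ we have $D_\vartheta = \mathrm{id} + \vartheta^2 D'$ where $D'$ is the diagonal matrix of $T'$. Furthermore, $T_\vartheta - D_\vartheta = \vartheta^2 (T' - D')$, so choosing $\lambda : [0, 1] \to \mathbb{R}$ such that $\vartheta \mapsto \vartheta^{-2}\lambda(\vartheta)$ is monotonically increasing and $\lambda(\vartheta) \geq \vartheta^2 \|T' - D'\|$ yields feasible $M_\vartheta = (\lambda(\vartheta) + 1) D_\vartheta$. This family is ϑ-norm-monotone with respect to $T'$ as $\vartheta^{-2}(\|M_\vartheta\| - 1) = \vartheta^{-2}\lambda(\vartheta) + (\lambda(\vartheta) + 1)\|D'\|$ is monotonically increasing.

For the symmetric Gauss-Seidel preconditioner, we have $M_\vartheta = (D_\vartheta - E_\vartheta) D_\vartheta^{-1} (D_\vartheta - E_\vartheta^*) = T_\vartheta + E_\vartheta D_\vartheta^{-1} E_\vartheta^*$ where $T_\vartheta = D_\vartheta - E_\vartheta - E_\vartheta^*$, $D_\vartheta$ is the diagonal and $E_\vartheta$ lower diagonal matrix of $T_\vartheta$, respectively. Obviously, $D_\vartheta$ and $E_\vartheta$ admit the form $D_\vartheta = \mathrm{id} + \vartheta^2 D'$ and $E_\vartheta = \vartheta^2 E'$ for respective $D'$ and $E'$. Observe that $\vartheta \mapsto \vartheta^2/(1 + \vartheta^2 d)$ is monotonically increasing for each $d \geq 0$. Consequently, it follows from $\vartheta \leq \vartheta'$ that $\vartheta^2 D_\vartheta^{-1} \leq (\vartheta')^2 D_{\vartheta'}^{-1}$ and, further that
\[ \vartheta^{-2} E_\vartheta D_\vartheta^{-1} E_\vartheta^* = E' \vartheta^2 D_\vartheta^{-1} (E')^* \leq E' (\vartheta')^2 D_{\vartheta'}^{-1} (E')^* = (\vartheta')^{-2} E_{\vartheta'} D_{\vartheta'}^{-1} E_{\vartheta'}^*. \]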

As $\vartheta^{-2}(T_\vartheta - \mathrm{id})$ is independent from $\vartheta$, we get $\vartheta^{-2}(M_\vartheta - \mathrm{id}) \leq (\vartheta')^{-2}(M_{\vartheta'} - \mathrm{id})$. In particular, $\vartheta^{-2}(\|M_\vartheta\| - 1) \leq (\vartheta')^{-2}(\|M_{\vartheta'}\| - 1)$ which shows that the symmetric Gauss-Seidel preconditioners $(M_\vartheta)_\vartheta$ constitute a ϑ-norm-monotone family. The SSOR methods are in general not ϑ-norm-monotone, except for the symmetric Gauss-Seidel case $\omega = 1$, as discussed above.

4 Numerical experiments

In order to assess the performance of the discussed Douglas-Rachford algorithms, we performed numerical experiments for two specific convex optimization problems in imaging: image denoising with $L^2$-discrepancy and TV regularization as well as TV-Huber approximation, respectively.

4.1 $L^2$-TV denoising

We tested the basic and the accelerated Douglas-Rachford iteration for a discrete $L^2$-TV-denoising problem (also called ROF model) according to
\[ \min_{u \in X} \ \frac{\|u - f\|_2^2}{2} + \alpha \|\nabla u\|_1, \tag{46} \]
with $X$ the space of functions on a 2D regular rectangular grid and $\nabla : X \to Y$ a forward finite-difference gradient operator with homogeneous Neumann conditions. The minimization problem may be written as the saddle-point problem (1) with primal variable $x = u \in X$, dual variable $y = p \in Y$, $F(u) = \|u - f\|_2^2/2$, $G(p) = \mathcal{I}_{\{\|p\|_\infty \leq \alpha\}}(p)$ and $K = \nabla$. The associated dual problem then reads as
\[ \min_{p \in Y} \ \frac{\|\operatorname{div} p + f\|_2^2}{2} - \frac{\|f\|_2^2}{2} + \mathcal{I}_{\{\|p\|_\infty \leq \alpha\}}(p), \tag{47} \]
such that the primal-dual gap, which is used as the stopping criterion, becomes
\[ \mathcal{G}_{L^2\text{-TV}}(u, p) = \frac{\|u - f\|_2^2}{2} + \alpha \|\nabla u\|_1 + \frac{\|\operatorname{div} p + f\|_2^2}{2} - \frac{\|f\|_2^2}{2} + \mathcal{I}_{\{\|p\|_\infty \leq \alpha\}}(p). \tag{48} \]
As $F$ is strongly convex, the accelerated methods (adr) and (padr) proposed in this paper are applicable for solving $L^2$-TV denoising problems, in addition to the basic methods (DR) and (pdr). The computational building blocks for the respective implementations are well-known, see [4, 5, 9], for instance.

We compare with first-order methods which also achieve $O(1/k^2)$ convergence rate in some sense, namely ALG2 from [9], FISTA [] and the fast Douglas-Rachford iteration in [0].
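The primal-dual gap (48) is cheap to evaluate once the discrete gradient and divergence are available. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes images stored as lists of rows, the isotropic convention in which the 1-norm sums and the ∞-norm maximizes the pixelwise Euclidean norm of the two gradient components, and a small tolerance `tol` in the feasibility test. All function names are hypothetical.

```python
import math


def grad(u):
    """Forward differences with homogeneous Neumann boundary conditions."""
    n, m = len(u), len(u[0])
    gx = [[u[i + 1][j] - u[i][j] if i < n - 1 else 0.0 for j in range(m)] for i in range(n)]
    gy = [[u[i][j + 1] - u[i][j] if j < m - 1 else 0.0 for j in range(m)] for i in range(n)]
    return gx, gy


def div(p):
    """Discrete divergence, the negative adjoint of grad."""
    px, py = p
    n, m = len(px), len(px[0])
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i < n - 1:
                d[i][j] += px[i][j]
            if i > 0:
                d[i][j] -= px[i - 1][j]
            if j < m - 1:
                d[i][j] += py[i][j]
            if j > 0:
                d[i][j] -= py[i][j - 1]
    return d


def rof_gap(u, p, f, alpha, tol=1e-12):
    """Primal-dual gap (48); returns math.inf if p violates the constraint."""
    px, py = p
    n, m = len(u), len(u[0])
    cells = [(i, j) for i in range(n) for j in range(m)]
    if max(math.hypot(px[i][j], py[i][j]) for i, j in cells) > alpha + tol:
        return math.inf
    gx, gy = grad(u)
    tv = sum(math.hypot(gx[i][j], gy[i][j]) for i, j in cells)
    d = div(p)
    fidelity = 0.5 * sum((u[i][j] - f[i][j]) ** 2 for i, j in cells)
    dual = 0.5 * sum((d[i][j] + f[i][j]) ** 2 - f[i][j] ** 2 for i, j in cells)
    return fidelity + alpha * tv + dual


# Hypothetical 3x3 example: u = f and p = 0 give the gap alpha * TV(f).
f = [[0.0, 1.0, 0.5], [0.2, 0.8, 0.1], [0.9, 0.3, 0.4]]
zero = [[0.0] * 3 for _ in range(3)]
example_gap = rof_gap(f, (zero, zero), f, 0.1)
```

Dividing the returned value by the number of pixels gives the normalized gap used as stopping criterion in the experiments.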
The parameter settings for the $O(1/k^2)$-methods are as follows:

ALG2: Accelerated primal-dual algorithm introduced in [9] with parameters $\tau_0 = 1/L$, $\tau_k \sigma_k L^2 = 1$, $\gamma = 0.35$ and $L = \sqrt{8}$.

FISTA: Fast iterative shrinkage thresholding algorithm [] on the dual problem (47). The Lipschitz constant estimate $L$ as in [] here is chosen as $L = 8$.

FDR: Fast Douglas-Rachford splitting method proposed in [0] on the dual problem (47) with parameters $L_f = 8$, $\gamma = (\sqrt{2} - 1)/L_f$, $\lambda_k = \lambda = \frac{1 - \gamma L_f}{1 + \gamma L_f}$, $\beta_0 = 0$, and $\beta_k = \frac{k - 1}{k + 2}$ for $k > 0$.

adr: Accelerated Douglas-Rachford method as in Table 4. Here, the discrete cosine transform (DCT) is used for solving the discrete elliptic equation with operator $\mathrm{id} - \sigma_0 \tau_0 \Delta$. The parameters read as $L = \sqrt{8}$, σ 0 =, τ 0 = 5/σ 0, γ = /( + σ 0 τ 0 L ).

padr: Preconditioned accelerated Douglas-Rachford method as in Table 6 with two steps of the symmetric Red-Black Gauss-Seidel iteration as preconditioner and parameters σ0 =, τ0 = 5/σ0, $\gamma_1 = 1$, $L = \sqrt{8}$. The norm $\|M\|$ is estimated as $\|M\| < \|T\|_{\mathrm{est}} + \|M - T\|_{\mathrm{est}}$ with $\|T\|_{\mathrm{est}} = 1 + \sigma_0 \tau_0 L^2$ and $\|M - T\|_{\mathrm{est}} = 4(\sigma_0 \tau_0)^2/(1 + 4\sigma_0 \tau_0)$. The parameter γ is set as γ = γ /($\|T\|_{\mathrm{est}} + \|M - T\|_{\mathrm{est}}$).

We also tested the $O(1/k)$ basic Douglas-Rachford algorithms using the following parameters.

DR: Douglas-Rachford method as in Table 1. Again, the discrete cosine transform (DCT) is used for applying the inverse of the elliptic operator $\mathrm{id} - \sigma\tau\Delta$. Here, σ =, τ = 5/σ.

pdr: Preconditioned Douglas-Rachford method as in Table 2. Again, the preconditioner is given by two steps of the symmetric Red-Black Gauss-Seidel iteration. The step sizes are chosen as σ =, τ = 5/σ.

Figure 1: Results for variational $L^2$-TV denoising. Panels: (a) Original image, (b) Noise level: σ = 0., (c) α = 0., (d) α = 0.5. All denoised images are obtained with the padr algorithm which was stopped once the primal-dual gap normalized by the number of pixels became less than $10^{-7}$. (a) is the original image: Taj Mahal ( pixels, gray). (b) shows a noise-perturbed version of (a) (additive Gaussian noise, standard deviation 0.), (c) and (d) are the denoised images with α = 0. and α = 0.5, respectively.

Computations were performed for the image Taj Mahal (size pixels), additive Gaussian noise (noise level 0.) and different regularization parameters α in (46) using an Intel Xeon CPU (E5-2690v3, 2.6 GHz, 12 cores). Figure 1 includes the original image, the noisy image and denoised images with different regularization parameters. It can be seen from Table 8 as well as Figure 2 that the proposed algorithms pdr and padr are competitive both in terms of iteration number and computation time, especially padr. While in terms of iteration numbers, DR and pdr as well as adr and padr perform roughly equally well, the preconditioned variants massively benefit from the preconditioner in terms of computation time with each iteration taking only a fraction compared to the respective unpreconditioned variants.

α = 0. ε= ALG FISTA FDR adr padr DR pdr 0 5 α = 0.5 ε= 0 7 ε= 0 5 ε = (.5s) (.80s) 730 (74.39s) 68 (6.00s) 75 (.30s) 80 (8.77s) 659 (0.75s) 497 (494.95s) 357 (3.s) 369 (6.5s) 368 (3.99s) 8 (0.4s) 80 (.76s) 04 (9.4s) 8 (.5s) 74 (3.7s) 4534 (56.7s) 879 (45.9s) 757 (66.s) 83 (3.70s) 57 (5.4s) 65 (.07s) 596 (54.45s) 638 (0.06s) 33 (.50s) 4 (.6s) 3067 (88.7s) 307 (47.87s)

Table 8: Numerical results for the $L^2$-TV image denoising (ROF) problem (46) with noise level 0. and regularization parameters α = 0., α = 0.5. For all algorithms, we use the pair $k(t)$ to denote the iteration number $k$ and CPU time cost $t$. The iteration is performed until the primal-dual gap (48) normalized by the number of pixels is below ε.

Figure 2: Numerical convergence rate compared with iteration number and iteration time for the ROF model with regularization parameter α = 0.5. (a) Convergence with respect to iteration number (normalized primal-dual gap versus iterations for ALG2, FISTA, FDR, adr and padr, with an $O(1/k^2)$ reference line). (b) Convergence with respect to computation time (normalized primal-dual gap versus CPU time in seconds for the same algorithms). The plot in (a) utilizes a double-logarithmic scale while the plot in (b) is semi-logarithmic.

4.2 $L^2$-Huber-TV denoising

There are several ways to approximate the non-smooth 1-norm term in (46) by a smooth functional. One possibility is employing the so-called Huber loss function instead of the Euclidean vector norm, resulting in
\[ \|p\|_{\alpha,\lambda} = \sum_{i \in \text{pixels}} |p_i|_{\alpha,\lambda}, \qquad |p_i|_{\alpha,\lambda} = \begin{cases} \dfrac{|p_i|^2}{2\lambda} & \text{if } |p_i| \leq \alpha\lambda, \\[1ex] \alpha |p_i| - \dfrac{\lambda\alpha^2}{2} & \text{if } |p_i| > \alpha\lambda, \end{cases} \]
and the associated $L^2$-Huber-TV minimization problem
\[ \min_{u \in X} \ \frac{\|u - f\|_2^2}{2} + \|\nabla u\|_{\alpha,\lambda}. \tag{49} \]
An appropriate saddle-point formulation can be obtained from the saddle-point formulation for the ROF model by replacing $G$ by the functional $G(p) = \mathcal{I}_{\{\|p\|_\infty \leq \alpha\}}(p) + \frac{\lambda}{2}\|p\|_2^2$. The primal-dual


gap consequently reads as
\[ \mathcal{G}_{L^2\text{-Huber-TV}}(u, p) = \frac{\|u - f\|_2^2}{2} + \|\nabla u\|_{\alpha,\lambda} + \frac{\|\operatorname{div} p + f\|_2^2}{2} - \frac{\|f\|_2^2}{2} + \mathcal{I}_{\{\|p\|_\infty \leq \alpha\}}(p) + \frac{\lambda}{2}\|p\|_2^2. \tag{50} \]
Since both $F$ and $G$ are strongly convex functions with respective moduli $\gamma_1 = 1$ and $\gamma_2 = \lambda$, we can use (adrsc) and (padrsc) to solve the saddle-point problem. Additionally, we compare to ALG3 from [9], a solution algorithm for the same class of strongly convex saddle-point problems. All algorithms admit an $O(\vartheta^k)$-convergence rate for the primal-dual gap.

ALG3: Accelerated primal-dual algorithm ALG3 for strongly convex saddle-point problems as in [9], using parameters $\gamma = 1$, $\delta = \lambda$, $\mu = 2\sqrt{\gamma\delta}/L$, $L = \sqrt{8}$, $\vartheta = 1/(1 + \mu)$.

adrsc: Accelerated Douglas-Rachford method for strongly convex saddle-point problems as in Table 5, using parameters $\gamma_1 = 1$, $\gamma_2 = \lambda$, σ = 0., $\tau = \sigma\gamma_1/\gamma_2$, γ = γ / + ( + σγ )στ L with $L = \sqrt{8}$.

padrsc: Preconditioned accelerated Douglas-Rachford method for strongly convex saddle-point problems as in Table 7. Three steps of the symmetric Red-Black Gauss-Seidel method are employed as preconditioner. The parameters are chosen as: $\gamma_1 = 1$, $\gamma_2 = \lambda$, σ = 0.5, $\tau = \sigma\gamma_1/\gamma_2$, $L = \sqrt{8}$. The values ϑ and γ are obtained by the procedure outlined in Remark 7: Setting $\vartheta = 1$ first, then performing 0 times the update $\|T\|_{\mathrm{est}} \leftarrow 1 + \vartheta^2 \sigma\tau L^2$, $\|M - T\|_{\mathrm{est}} \leftarrow 4\vartheta^4 (\sigma\tau)^2/(1 + 4\vartheta^2 \sigma\tau)$, γ γ / +(+σγ )ϑ ($\|T\|_{\mathrm{est}} + \|M - T\|_{\mathrm{est}}$ ), $\vartheta \leftarrow 1/(1 + \sigma\gamma)$.

Figure 3: Results for variational $L^2$-Huber-TV denoising. Panels: (a) Original image, (b) Noise level: σ = 0.5, (c) ALG3, (d) padrsc. (a) is the original image: Man (1024 × 1024 pixels, gray). (b) shows a noise-perturbed version of (a) (additive Gaussian noise, standard deviation 0.5), (c) and (d) are the denoised images with α =.0 and λ = 0.05 obtained with algorithm ALG3 and padrsc, respectively. The iteration was stopped as soon as the primal-dual gap normalized by the number of pixels was less than 0 4.
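The Huber term in (49) and the functional $G$ of the saddle-point formulation are linked by convex conjugation: for a single pixel, the Huber value of $t = |p_i|$ equals the supremum of $ts - \lambda s^2/2$ over $0 \leq s \leq \alpha$. The sketch below checks this numerically. It is an illustration only: the function names are hypothetical and the values `alpha = 1.0`, `lam = 0.05` are picked for demonstration, not taken from the experiments.

```python
import math


def huber(t, alpha, lam):
    """Huber value for a pixel with Euclidean norm t >= 0."""
    if t <= alpha * lam:
        return t * t / (2.0 * lam)
    return alpha * t - lam * alpha * alpha / 2.0


def huber_via_conjugate(t, alpha, lam):
    """sup of t*s - lam*s^2/2 over 0 <= s <= alpha, attained at s = min(t/lam, alpha)."""
    s = min(t / lam, alpha)
    return t * s - lam * s * s / 2.0


def huber_norm(p, alpha, lam):
    """Sum of pixelwise Huber values for p given as a list of (p1, p2) pairs."""
    return sum(huber(math.hypot(a, b), alpha, lam) for a, b in p)


# Illustrative parameters (assumptions, see the lead-in).
alpha, lam = 1.0, 0.05
samples = [0.0, 0.01, 0.05, 0.2, 3.0]
max_mismatch = max(abs(huber(t, alpha, lam) - huber_via_conjugate(t, alpha, lam)) for t in samples)
```

The quadratic branch is active for small gradients and the affine branch for large ones, and both branches agree at the threshold $|p_i| = \alpha\lambda$, so the function is continuous.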
Numerical experiments have been performed for the image Man (1024 × 1024 pixels, gray) which has been contaminated with additive Gaussian noise (noise level 0.5) using the same CPU as for the $L^2$-TV-denoising experiments. The results for the parameters α =.0 and λ = 0.05 are depicted in Figure 3. The table (a) in Figure 4 indicates that padrsc is competitive in comparison to a well-established fast algorithm suitable for the same class of saddle-point problems with speed benefits again coming from the preconditioner. Figure 4 (b) moreover shows that the observed geometric reduction factor for the primal-dual gap is lower for adrsc and padrsc than for ALG3, with the former being roughly equal.
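An observed geometric reduction factor of this kind can be read off a recorded sequence of primal-dual gaps by fitting a line to the logarithm of the gap. The following is a minimal sketch of one such estimate; the least-squares fit and the synthetic gap sequence are assumptions made for illustration and are not data or procedures from the experiments.

```python
import math


def estimate_reduction_factor(gaps):
    """Least-squares slope of log(gap_k) against k, returned as exp(slope)."""
    n = len(gaps)
    logs = [math.log(g) for g in gaps]
    k_mean = (n - 1) / 2.0
    log_mean = sum(logs) / n
    num = sum((k - k_mean) * (v - log_mean) for k, v in enumerate(logs))
    den = sum((k - k_mean) ** 2 for k in range(n))
    return math.exp(num / den)


# Synthetic gap sequence decaying exactly like 3 * 0.9^k (illustration only).
synthetic_gaps = [3.0 * 0.9 ** k for k in range(50)]
factor = estimate_reduction_factor(synthetic_gaps)
```

A smaller estimated factor corresponds to a steeper line in a semi-logarithmic plot of the gap against the iteration number.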
