A first-order primal-dual algorithm with linesearch


A first-order primal-dual algorithm with linesearch

Yura Malitsky, Thomas Pock

arXiv v2 [math.OC] 23 Mar 2018

Abstract. The paper proposes a linesearch for a primal-dual method. Each iteration of the linesearch requires an update of only the dual (or primal) variable. For many problems, in particular for regularized least squares, the linesearch does not require any additional matrix-vector multiplications. We prove convergence of the proposed method under standard assumptions. We also show an ergodic O(1/N) rate of convergence for our method. In case one or both of the prox-functions are strongly convex, we modify our basic method to obtain a better convergence rate. Finally, we propose a linesearch for a saddle-point problem with an additional smooth term. Several numerical experiments confirm the efficiency of the proposed methods.

2010 Mathematics Subject Classification: 49M29, 65K10, 65Y20, 90C25

Keywords: saddle-point problems, first-order algorithms, primal-dual algorithms, linesearch, convergence rates, backtracking

In this work we propose a linesearch procedure for the primal-dual algorithm (PDA) that was introduced in [4]. It is a simple first-order method that is widely used for solving nonsmooth composite minimization problems. Recently, connections of PDA with proximal point algorithms [12] and with ADMM [6] were established. Some generalizations of the method were considered in [5, 7, 10]. A survey of possible applications of the algorithm can be found in [6, 13].

The basic form of PDA uses fixed step sizes during all iterations. This has several drawbacks. First, we have to compute the norm of the operator, which may be quite expensive for large-scale dense matrices. Second, even if we know this norm, one can often use larger steps, which usually yields faster convergence. As a remedy for the first issue one can use a diagonal preconditioner [16], but there is still no strong evidence that such a preconditioner improves, or at least does not worsen, the speed of convergence of PDA. Regarding the second issue, as we will see in our experiments, the speed improvement gained by using the linesearch can sometimes be significant.

Our proposed analysis of PDA exploits the idea of the recent works [14, 15], where several algorithms are proposed for solving a monotone variational inequality. Those algorithms are different from PDA; however, they also use a similar extrapolation step. Although our analysis of the primal-dual method is not as elegant as, for example, that in [12], it gives a simple and cheap way to incorporate a linesearch for defining the step sizes. Each inner iteration of the linesearch requires updating only the dual variables. Moreover,

Institute for Computer Graphics and Vision, Graz University of Technology, 8010 Graz, Austria. y.malitsky@gmail.com, pock@icg.tugraz.at

the step sizes may increase from iteration to iteration. We prove the convergence of the algorithm under quite general assumptions. We also show that in many important cases PDAL (the primal-dual algorithm with linesearch) preserves the per-iteration complexity of PDA. In particular, our method, applied to regularized least-squares problems, uses the same number of matrix-vector multiplications per iteration as the forward-backward method or FISTA [3] (both with fixed step size), but it does not require knowledge of the matrix norm and, in addition, uses adaptive steps.

For the case when the primal or the dual objective is strongly convex, we modify our linesearch procedure in order to construct accelerated versions of PDA. This is done in a similar way as in [4]. The obtained algorithms share the same per-iteration complexity as PDAL, but in many cases they substantially outperform PDA and PDAL. We also consider a more general primal-dual problem which involves an additional smooth function with Lipschitz-continuous gradient (see [7]). For this case we generalize our linesearch so that knowledge of that Lipschitz constant is not needed.

The authors in [9, 10] also proposed a linesearch for the primal-dual method, with the goal of varying the ratio between the primal and dual steps such that the primal and dual residuals remain roughly of the same size. The same idea was used in [11] for the ADMM method. However, we should highlight that this principle is just a heuristic, as it is not clear that it in fact improves the speed of convergence. The linesearch proposed in [9, 10] requires an update of both primal and dual variables, which may make the algorithm much more expensive than the basic PDA. Also, the authors proved convergence of the iterates only in the case when one of the sequences (x^k) or (y^k) is bounded. Although this is often the case, there are many problems which cannot be encompassed by this assumption. Finally, it is not clear how to derive accelerated versions of that algorithm. As a byproduct of our analysis, we show how one can use a variable ratio between the primal and dual steps and under which circumstances we can guarantee convergence. However, it was not the purpose of this paper to develop new strategies for varying such a ratio during the iterations.

The paper is organized as follows. In the next section we introduce the notation and recall some useful facts. Section 2 presents our basic primal-dual algorithm with linesearch. We prove its convergence, establish its ergodic convergence rate, and consider some particular examples of how the linesearch works. In Section 3 we propose accelerated versions of PDAL under the assumption that the primal or the dual problem is strongly convex. Section 4 deals with more general saddle-point problems which involve an additional smooth function. In Section 5 we illustrate the efficiency of our methods on several typical problems.

1 Preliminaries

Let X, Y be two finite-dimensional real vector spaces equipped with an inner product ⟨·,·⟩ and a norm ‖·‖ = √⟨·,·⟩. We are focusing on the following problem:

    min_{x ∈ X} max_{y ∈ Y} ⟨Kx, y⟩ + g(x) − f*(y),   (1)

where

K : X → Y is a bounded linear operator with operator norm L = ‖K‖; g : X → (−∞, +∞] and f* : Y → (−∞, +∞] are proper lower semicontinuous (l.s.c.) convex functions; and problem (1) has a saddle point. Note that f* denotes the Legendre–Fenchel conjugate of a convex l.s.c. function f. By a slight (but common) abuse of notation, we write K* to denote the adjoint of the operator K.

Under the above assumptions, (1) is equivalent to the primal problem

    min_{x ∈ X} f(Kx) + g(x)   (2)

and the dual problem

    min_{y ∈ Y} f*(y) + g*(−K*y).   (3)

Recall that for a proper l.s.c. convex function h : X → (−∞, +∞] the proximal operator prox_h is defined as

    prox_h : X → X : x ↦ argmin_z { h(z) + ½‖z − x‖² }.

The following important characteristic property of the proximal operator is well known:

    x̄ = prox_h x  ⟺  ⟨x̄ − x, y − x̄⟩ ≥ h(x̄) − h(y)  for all y ∈ X.   (4)

We will often use the following identity (cosine rule):

    2⟨a − b, c − a⟩ = ‖b − c‖² − ‖a − b‖² − ‖a − c‖²  for all a, b, c ∈ X.   (5)

Let (x̂, ŷ) be a saddle point of problem (1). Then, by the definition of a saddle point, we have

    P_{x̂,ŷ}(x) := g(x) − g(x̂) + ⟨K*ŷ, x − x̂⟩ ≥ 0  for all x ∈ X,   (6)
    D_{x̂,ŷ}(y) := f*(y) − f*(ŷ) − ⟨Kx̂, y − ŷ⟩ ≥ 0  for all y ∈ Y.   (7)

The expression G_{x̂,ŷ}(x, y) = P_{x̂,ŷ}(x) + D_{x̂,ŷ}(y) is known as a primal-dual gap. In cases where it is clear which saddle point is considered, we will omit the subscripts in P, D, and G. It is also important to highlight that for a fixed (x̂, ŷ) the functions P(·), D(·), and G(·,·) are convex.

Consider the original primal-dual method:

    y^{k+1} = prox_{σ f*}(y^k + σ K x̄^k),
    x^{k+1} = prox_{τ g}(x^k − τ K* y^{k+1}),
    x̄^{k+1} = x^{k+1} + θ (x^{k+1} − x^k).

In [4] its convergence was proved under the assumptions θ = 1, τ, σ > 0, and τσL² < 1. In the next section we will show how to incorporate a linesearch into this method.
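To make the scheme above concrete, here is a short Python sketch of the fixed-step PDA for problem (1). It is only a sketch under our own naming: prox_g(v, t) and prox_fs(v, s) stand for user-supplied proximal operators of g and f*, K is a matrix, and the step sizes are assumed to satisfy τσ‖K‖² < 1.

import numpy as np

def pda(K, prox_g, prox_fs, x0, y0, tau, sigma, theta=1.0, iters=100):
    # Fixed-step primal-dual algorithm for min_x max_y <Kx, y> + g(x) - f*(y).
    x, y = x0.copy(), y0.copy()
    x_bar = x.copy()
    for _ in range(iters):
        y = prox_fs(y + sigma * (K @ x_bar), sigma)   # dual update
        x_new = prox_g(x - tau * (K.T @ y), tau)      # primal update
        x_bar = x_new + theta * (x_new - x)           # extrapolation
        x = x_new
    return x, y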

Algorithm 1 Primal-dual algorithm with linesearch

Initialization: choose x^0 ∈ X, y^1 ∈ Y, τ_0 > 0, μ ∈ (0, 1), δ ∈ (0, 1), and β > 0. Set θ_0 = 1.
Main iteration:
1. Compute
       x^k = prox_{τ_{k−1} g}(x^{k−1} − τ_{k−1} K* y^k).
2. Choose any τ_k ∈ [τ_{k−1}, τ_{k−1} √(1 + θ_{k−1})] and run
   Linesearch:
   2.a. Compute
        θ_k = τ_k / τ_{k−1},
        x̄^k = x^k + θ_k (x^k − x^{k−1}),
        y^{k+1} = prox_{β τ_k f*}(y^k + β τ_k K x̄^k).
   2.b. Break the linesearch if
        √β τ_k ‖K* y^{k+1} − K* y^k‖ ≤ δ ‖y^{k+1} − y^k‖.   (8)
        Otherwise, set τ_k := τ_k μ and go to 2.a.
   End of linesearch

2 Linesearch

The primal-dual algorithm with linesearch (PDAL) is summarized in Algorithm 1. Given all information from the current iterate, x^k, y^k, τ_{k−1}, and θ_{k−1}, we first choose some trial step τ_k ∈ [τ_{k−1}, τ_{k−1} √(1 + θ_{k−1})]; then during every iteration of the linesearch it is decreased by the factor μ ∈ (0, 1). At the end of the linesearch we obtain a new iterate y^{k+1} and a step size τ_k which will be used to compute the next iterate x^{k+1}. In Algorithm 1 there are two opposite options: always start the linesearch from the largest possible step τ_k = τ_{k−1} √(1 + θ_{k−1}) or, on the contrary, never increase τ_k. Step 2 also allows us to choose any compromise between them.

Note that in PDAL the parameter β plays the role of the ratio σ/τ between the fixed steps in PDA. Thus, we can rewrite the stopping criterion (8) as

    τ_k σ_k ‖K* y^{k+1} − K* y^k‖² ≤ δ² ‖y^{k+1} − y^k‖²,  where σ_k = β τ_k.

Of course, in PDAL we can always choose fixed steps τ_k, σ_k with τ_k σ_k ≤ δ²/L² and set θ_k = 1. In this case PDAL coincides with PDA, though our proposed analysis seems to be new. The parameter δ is mainly used for theoretical purposes: in order to be able to control ‖y^{k+1} − y^k‖², we need δ < 1. In the experiments δ should be chosen very close to 1.

Remark 1. It is clear that each iteration of the linesearch requires a computation of prox_{β τ_k f*}(·) and K* y^{k+1}. Since in problem (1) we can always exchange the primal and dual variables, it makes sense to choose as the dual variable in PDAL the one for which the respective prox-operator is simpler to compute. Note that during the linesearch we need to compute Kx^k only once and then use that K x̄^k = (1 + θ_k) Kx^k − θ_k Kx^{k−1}.
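As an illustration of how Algorithm 1 can be organized in code, here is a minimal Python sketch of one PDAL iteration. It is a sketch under our own naming: prox_g(v, t) and prox_fs(v, s) are user-supplied proximal operators of g and f*, and the caching of matrix-vector products mentioned in Remark 1 is only indicated in the comments.

import numpy as np

def pdal_step(K, prox_g, prox_fs, x, y, tau_prev, theta_prev,
              beta=1.0, mu=0.7, delta=0.99):
    # One iteration of PDAL: x plays the role of x^{k-1}, y of y^k.
    # Returns (x^k, y^{k+1}, tau_k, theta_k).
    # Step 1: primal update with the step size of the previous iteration.
    x_new = prox_g(x - tau_prev * (K.T @ y), tau_prev)
    # Step 2: trial step; here we always try the largest allowed value.
    tau = tau_prev * np.sqrt(1.0 + theta_prev)
    Kx, Kx_new = K @ x, K @ x_new   # K x^{k-1} could be cached from the previous iteration
    Kty = K.T @ y                   # likewise reusable
    while True:
        theta = tau / tau_prev
        Kx_bar = (1.0 + theta) * Kx_new - theta * Kx   # = K(x^k + theta*(x^k - x^{k-1}))
        y_new = prox_fs(y + beta * tau * Kx_bar, beta * tau)
        Kty_new = K.T @ y_new
        # breaking condition (8)
        if np.sqrt(beta) * tau * np.linalg.norm(Kty_new - Kty) <= delta * np.linalg.norm(y_new - y):
            return x_new, y_new, tau, theta
        tau *= mu   # shrink the step and redo only the dual update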

Remark 2. Note that when prox_{σ f*} is a linear (or, more generally, affine) operator, the linesearch becomes extremely simple: it does not require any additional matrix-vector multiplications. We itemize some examples below.

1. f*(y) = ⟨c, y⟩. Then it is easy to verify that prox_{σ f*} u = u − σc. Thus, we have
       y^{k+1} = prox_{σ_k f*}(y^k + σ_k K x̄^k) = y^k + σ_k K x̄^k − σ_k c,
       K* y^{k+1} = K* y^k + σ_k (K*K x̄^k − K*c).

2. f*(y) = ½‖y − b‖². Then prox_{σ f*} u = (u + σb)/(1 + σ), and we obtain
       y^{k+1} = prox_{σ_k f*}(y^k + σ_k K x̄^k) = (y^k + σ_k (K x̄^k + b))/(1 + σ_k),
       K* y^{k+1} = (K* y^k + σ_k (K*K x̄^k + K*b))/(1 + σ_k).

3. f*(y) = δ_H(y), the indicator function of the hyperplane H = {u : ⟨u, a⟩ = b}. Then prox_{σ f*} u = P_H u = u + ((b − ⟨u, a⟩)/‖a‖²) a, and hence
       y^{k+1} = prox_{σ_k f*}(y^k + σ_k K x̄^k) = y^k + σ_k K x̄^k + ((b − ⟨a, y^k + σ_k K x̄^k⟩)/‖a‖²) a,
       K* y^{k+1} = K* y^k + σ_k K*K x̄^k + ((b − ⟨a, y^k + σ_k K x̄^k⟩)/‖a‖²) K*a.

Evidently, each iteration of PDAL in all of the cases above requires only two matrix-vector multiplications. Indeed, before the linesearch starts we have to compute Kx^k and K*Kx^k, and then during the linesearch we should use the relations

    K x̄^k = (1 + θ_k) Kx^k − θ_k Kx^{k−1},
    K*K x̄^k = (1 + θ_k) K*Kx^k − θ_k K*Kx^{k−1},

where Kx^{k−1} and K*Kx^{k−1} should be reused from the previous iteration. All other operations are comparably cheap and, hence, the costs per iteration of PDA and PDAL are almost the same. However, this might not be true if the matrix K is very sparse. For example, for a finite-differences operator the cost of a matrix-vector multiplication is comparable to the cost of a few vector-vector additions.

One simple implication of these facts is that for the regularized least-squares problem

    min_x ½‖Ax − b‖² + g(x),   (9)

our method does not require knowledge of ‖A‖, but at the same time it does not require any additional matrix-vector multiplications, as would be the case for standard first-order methods with backtracking (e.g., the proximal gradient method or FISTA). In fact, we can rewrite (9) as

    min_x max_y g(x) + ⟨Ax, y⟩ − ½‖y + b‖².   (10)

Here f*(y) = ½‖y + b‖², and hence we are in the situation where the operator prox_{f*} is affine.
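For the regularized least-squares setting (9)-(10) the remark can be turned into code directly. The sketch below (our own naming; prox_g(v, t) is the user-supplied prox of g) performs one PDAL iteration using only the two matrix-vector products A x^k and A^T(A x^k) computed before the inner loop; inside the loop every quantity is obtained by cheap vector operations, because the prox of f*(y) = ½‖y + b‖² is affine.

import numpy as np

def pdal_ls_step(A, b, prox_g, x, y, ATy, cache, tau_prev, theta_prev,
                 beta=1.0, mu=0.7, delta=0.99):
    # One PDAL iteration for min_x 0.5*||Ax - b||^2 + g(x) (sketch).
    # cache = (A @ x, A.T @ (A @ x)) from the previous iteration, ATy = A.T @ y.
    ATb = A.T @ b                                  # could also be precomputed once
    x_new = prox_g(x - tau_prev * ATy, tau_prev)   # primal step
    Ax_new = A @ x_new                             # 1st matrix-vector product
    ATAx_new = A.T @ Ax_new                        # 2nd matrix-vector product
    Ax_prev, ATAx_prev = cache
    tau = tau_prev * np.sqrt(1.0 + theta_prev)
    while True:
        theta = tau / tau_prev
        sigma = beta * tau
        Ax_bar = (1 + theta) * Ax_new - theta * Ax_prev
        ATAx_bar = (1 + theta) * ATAx_new - theta * ATAx_prev
        y_new = (y + sigma * (Ax_bar - b)) / (1 + sigma)          # affine prox of f*
        ATy_new = (ATy + sigma * (ATAx_bar - ATb)) / (1 + sigma)  # no extra matvec needed
        if np.sqrt(beta) * tau * np.linalg.norm(ATy_new - ATy) <= delta * np.linalg.norm(y_new - y):
            return x_new, y_new, ATy_new, (Ax_new, ATAx_new), tau, theta
        tau *= mu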

By construction of the algorithm we simply have the following.

Lemma 1. (i) The linesearch in PDAL always terminates.
(ii) There exists τ > 0 such that τ_k > τ for all k ≥ 0.
(iii) There exists θ > 0 such that θ_k ≤ θ for all k ≥ 0.

Proof. (i) In each iteration of the linesearch, τ_k is multiplied by the factor μ < 1. Since (8) is satisfied for any τ_k ≤ δ/(√β L), the inner loop cannot run indefinitely.

(ii) Without loss of generality, assume that τ_0 > δμ/(√β L). Our goal is to show that τ_{k−1} > δμ/(√β L) implies τ_k > δμ/(√β L). Suppose that τ_k = τ_{k−1} √(1 + θ_{k−1}) μ^i for some i ∈ Z_+. If i = 0, then τ_k ≥ τ_{k−1} > δμ/(√β L). If i > 0, then τ_k/μ = τ_{k−1} √(1 + θ_{k−1}) μ^{i−1} does not satisfy (8). Thus, τ_k/μ > δ/(√β L) and hence τ_k > δμ/(√β L).

(iii) From τ_k ≤ τ_{k−1} √(1 + θ_{k−1}) and θ_k = τ_k/τ_{k−1} it follows that θ_k ≤ √(1 + θ_{k−1}). From this it can easily be concluded that θ_k ≤ (1 + √5)/2 for all k ∈ Z_+. In fact, assume the contrary, and let r be the smallest index such that θ_r > (1 + √5)/2. Since θ_0 = 1, we have r ≥ 1, and hence θ_{r−1} ≥ θ_r² − 1 > (1 + √5)/2. This yields a contradiction.

One might be interested in how many iterations the linesearch needs in order to terminate. Of course, we can give only a priori upper bounds. Assume that (τ_k) is bounded from above by τ_max. Consider in step 2 of Algorithm 1 the two opposite ways of choosing the trial step τ_k: maximally increase it, that is, always set τ_k = τ_{k−1} √(1 + θ_{k−1}), or never increase it and set τ_k = τ_{k−1}. In the former case, after i iterations of the linesearch we have τ_k ≤ τ_{k−1} √(1 + θ_{k−1}) μ^i ≤ τ_max (1 + √5)/2 · μ^i, where we have used the bound from Lemma 1(iii). Hence, if i ≥ log_μ (2δ/(√β L (1 + √5) τ_max)), then τ_k ≤ δ/(√β L) and the linesearch procedure must terminate. Similarly, in the latter case, after at most 1 + log_μ (δ/(√β L τ_max)) iterations the linesearch stops. Moreover, because in this case (τ_k) is nonincreasing, this number is an upper bound for the total number of linesearch iterations.

The following is our main convergence result.

Theorem 1. Let (x^k, y^k) be a sequence generated by PDAL. Then it is a bounded sequence in X × Y and all its cluster points are solutions of (1). Moreover, if g restricted to dom g is continuous and (τ_k) is bounded from above, then the whole sequence (x^k, y^k) converges to a solution of (1).

The condition that g restricted to dom g be continuous is not restrictive: it holds for any g with open dom g (this includes all finite-valued functions) or for the indicator δ_C of any closed convex set C. It also holds for any separable convex l.s.c. function (Corollary 9.5, [2]). The boundedness of (τ_k) from above is rather a theoretical requirement: clearly, we can easily enforce such a bound in PDAL.

7 Proof. Let (ˆx, ŷ) be any saddle point of (). From (4) it follows that x k+ x k + τ k K y k+, ˆx x k+ τ k (g(x k+ ) g(ˆx)) () β (yk+ y k ) τ k K x k, ŷ y k+ τ k (f (y k+ ) f (ŷ)). (2) Since x k = prox τk g(x k τ k K y k ), we have again by (4) that for all x X x k x k + τ k K y k, x x k τ k (g(x k ) g(x)). After substitution in the last inequality x = x k+ and x = x k we get x k x k + τ k K y k, x k+ x k τ k (g(x k ) g(x k+ )), (3) x k x k + τ k K y k, x k x k τ k (g(x k ) g(x k )). (4) Adding (3), multiplied by θ k = τ k τ k, and (4), multiplied by θ 2 k, we obtain x k x k + τ k K y k, x k+ x k τ k (( + θ k )g(x k ) g(x k+ ) θ k g(x k )), (5) where we have used that x k = x k + θ k (x k x k ). Consider the following identity: τ k K y k+ K ŷ, x k ˆx τ k K x k K ˆx, y k+ ŷ = 0. (6) Summing (), (2), (5), and (6), we get x k+ x k, ˆx x k+ + β yk+ y k, ŷ y k+ + x k x k, x k+ x k + τ k K y k+ K y k, x k x k+ τ k K y, x k ˆx + τ k K ˆx, y k+ y τ k ( f (y k+ ) f (ŷ) + ( + θ k )g(x k ) θ k g(x k ) g(ˆx) ). (7) Using that and f (y k+ ) f (ŷ) K ˆx, y k+ ŷ = Dˆx,ŷ (y k+ ) (8) (+θ k )g(x k ) θ k g(x k ) g(ˆx) + K y, x k ˆx = ( + θ k ) ( g(x k ) g(ˆx) + K ŷ, x k ˆx ) θ k ( g(x k ) g(ˆx) + K ŷ, x k ˆx ) = ( + θ k )Pˆx,ŷ (x k ) θ k Pˆx,ŷ (x k ), (9) we can rewrite (7) as (we will not henceforth write the subscripts for P and D unless it is unclear) x k+ x k, ˆx x k+ + β yk+ y k, ŷ y k+ + x k x k, x k+ x k + τ k K y k+ K y k, x k x k+ τ k (( + θ k )P (x k ) θ k P (x k ) + D(y k+ )). (20) 7

8 Let ε k denotes the right-hand side of (20). Using the cosine rule for every item in the first line of (20), we obtain 2 ( xk ˆx 2 x k+ ˆx 2 x k+ x k 2 ) + 2β ( yk ŷ 2 y k+ ŷ 2 y k+ y k 2 ) + 2 ( xk+ x k 2 x k x k 2 x k+ x k 2 ) By (8), Cauchy Schwarz, and Cauchy s inequalities we get τ k K y k+ K y k, x k x k+ from which we derive that + τ k K y k+ K y k, x k x k+ ε k. (2) δ β x k+ x k y k+ y k 2 xk+ x k 2 + δ2 2β yk+ y k 2, 2 ( xk ˆx 2 x k+ ˆx 2 ) + 2β ( yk ŷ 2 y k+ ŷ 2 ) 2 xk x k 2 δ2 2β yk+ y k 2 ε k. (22) Since (ˆx, ŷ) is a saddle point, D(y k ) 0 and P (x k ) 0 and hence (22) yields 2 ( xk ˆx 2 x k+ ˆx 2 ) + 2β ( yk ŷ 2 y k+ ŷ 2 ) 2 xk x k 2 δ2 2β yk+ y k 2 τ k (( + θ k )P (x k ) θ k P (x k )) (23) or, taking into account θ k τ k ( + θ k )τ k, 2 xk+ ˆx 2 + 2β yk+ ŷ 2 + τ k ( + θ k )P (x k ) 2 xk ˆx 2 + 2β yk ŷ 2 + τ k ( + θ k )P (x k ) 2 xk x k 2 δ2 2β yk+ y k 2 (24) From this we deduce that (x k ), (y k ) are bounded sequences and lim k x k x k = 0, lim k y k y k = 0. Also notice that x k+ x k τ k = xk+ x k+ τ k+ 0 as k, 8

9 where the latter holds because τ k is separated from 0 by Lemma. Let (x k i, y k i ) be a subsequence that converges to some cluster point (x, y ). Passing to the limit in (x ki+ x k i ) + K y k i, x x ki+ g(x ki+ ) g(x) τ ki x X, (y ki+ y k i ) K x k i, y y ki+ f (y ki+ ) f (y) βτ ki y Y, we obtain that (x, y ) is a saddle point of (). When g dom g is continuous, g(x k i ) g(x ), and hence, P x,y (xk i ) 0. From (24) it follows that the sequence a k = 2 xk+ x 2 + 2β yk+ y 2 + τ k ( + θ k )P x,y (xk ) is monotone. Taking into account boundedness of (τ k ) and (θ k ), we obtain which means that x k x, y k y. lim a k = lim a ki = 0, k i Theorem 2 (ergodic convergence). Let (x k, y k ) be a sequence generated by PDAL and (ˆx, ŷ) be any saddle point of (). Then for the ergodic sequence (X N, Y N ) it holds that Gˆx,ŷ (X N, Y N ) s N ( 2 x ˆx 2 + ) 2β y ŷ 2 + τ θ Pˆx,ŷ (x 0 ), where s N = N k= τ k, X N = τ θ x 0 + N k= τ k x k τ θ + s N, Y N = Proof. Summing up (22) from k = to N, we get Nk= τ k y k s N. 2 ( x ˆx 2 x N+ ˆx 2 ) + 2β ( y ŷ 2 y N+ ŷ 2 ) The right-hand side in (25) can be expressed as By convexity of P, ε k = τ N ( + θ N )P (x N ) + [( + θ k )τ k θ k τ k )]P (x k ) k= k=2 θ τ P (x 0 ) + τ k D(y k+ ) k= ε k (25) k= τ N ( + θ N )P (x N )+ [( + θ k )τ k θ k τ k )]P (x k ) (26) k=2 (τ θ + s N )P ( τ ( + θ )x + N k=2 τ k x k ) (27) τ θ + s N = (τ θ + s N )P ( τ θ x 0 + N k= τ k x k τ θ + s N ) s N P (X N ), (28) 9

10 where s N = N k= τ k. Similarly, Hence, and we conclude k= G(X N, Y N ) = P (X N ) + D(Y N ) s N Nk= τ k D(y k τ k y k ) s N D( ) = s N D(Y N ). (29) s N ε k s N (P (X N ) + D(Y N )) τ θ P (x 0 ) k= ( 2 x ˆx 2 + ) 2β y ŷ 2 + τ θ P (x 0 ). Clearly, we have the same O(/N) rate of convergence as in [4, 5, 9], though with the ergodic sequence (X N, Y N ) defined in a different way. Analysing the proof of Theorems and 2, the reader may find out that our proof does not rely on the proximal interpretation of the PDA [2]. An obvious shortcoming of our approach is that deriving new extensions of the proposed method, like inertial or relaxed versions [5], still requires some nontrivial efforts. It would be interesting to obtain a general approach for the proposed method and its possible extensions. It is well known that in many cases the speed of convergence of PDA crucially depends on the ratio between primal and dual steps β = σ. Motivated by this, paper [0] proposed τ an adaptive strategy for how to choose β in every iteration. Although it is not the goal of this work to study the strategies for defining β, we show that the analysis of PDAL allows us to incorporate such strategies in a very natural way. Theorem 3. Let (β k ) (β min, β max ) be a monotone sequence with β min, β max > 0 and (x k, y k ) be a sequence generated by PDAL with variable (β k ). Then the statement of Theorem holds. Proof. Let (β k ) be nondecreasing. Then using β k β k, we get from (24) that 2 xk+ ˆx 2 + y k+ ŷ 2 + τ k ( + θ k )P (x k ) 2β k 2 xk ˆx 2 + y k ŷ 2 + τ k ( + θ k )P (x k ) 2β k 2 xk x k 2 δ2 2β k y k+ y k 2. (30) Since (β k ) is bounded from above, the conclusion in Theorem simply follows. If (β k ) is decreasing, then the above arguments should be modified. Consider (23), multiplied by β k, β k 2 ( xk ˆx 2 x k+ ˆx 2 ) + 2 ( yk ŷ 2 y k+ ŷ 2 ) β k 2 xk x k 2 δ2 y k+ y k 2 β k τ k (( + θ k )P (x k ) θ k P (x k )). (3) 2 0

11 As β k < β k, we have θ k β k τ k ( + θ k )β k τ k, which in turn implies β k 2 xk+ ˆx yk+ ŷ 2 + τ k β k ( + θ k )P (x k ) β k 2 xk ˆx yk ŷ 2 + τ k β k ( + θ k )P (x k ) Due to the given properties of (β k ), the rest is trivial. β k 2 xk x k 2 δ2 y k+ y k 2. (32) 2 It is natural to ask if it is possible to use nonmonotone (β k ). To answer this question, we can easily use the strategies from [], where an ADMM with variable steps was proposed. One way is to use any β k during a finite number of iterations and then switch to monotone (β k ). Another way is to relax the monotonicity of (β k ) to the following: there exists (ρ k ) R + such that: k ρ k < and β k β k ( + ρ k ) k N or β k β k ( + ρ k ) k N. We believe that it should be quite straightforward to prove convergence of PDAL with the latter strategy. 3 Acceleration It has been shown [4] that in the case when g or f is strongly convex, one can modify the primal-dual algorithm and derive a better convergence rate. We show that the same holds for PDAL. The main difference of the accelerated variant APDAL from the basic PDAL is that now we have to vary β in every iteration. Of course due to the symmetry of the primal and dual variables in (), we can always assume that the primal objective is strongly convex. However, from the computational point of view it might not be desirable to exchange g and f in the PDAL because of Remark 2. Therefore, we discuss the two cases separately. Also notice that both accelerated algorithms below coincide with PDAL when the parameter of strong convexity γ = g is strongly convex Assume that g is γ strongly convex, i.e., g(x 2 ) g(x ) u, x 2 x + γ 2 x 2 x 2 x, x 2 X, u g(x ). Below we assume that the parameter γ is known. The following algorithm (APDAL) exploits the strong convexity of g: Note that in contrast to PDAL, we set δ =, as in any case we will not be able to prove convergence of (y k ).

12 Algorithm 2 Accelerated primal-dual algorithm with linesearch: g is strongly convex Initialization: Choose x 0 X, y Y, µ (0, ), τ 0 > 0, β 0 > 0. Set θ 0 =. Main iteration:. Compute 2. Choose any τ k [τ βk k β k Linesearch: 2.a. Compute x k = prox τk g(x k τ k K y k ) β k = β k ( + γτ k ), τ βk k β k ( + θ k )] and run τ k θ k = τ k x k = x k + θ k (x k x k ) y k+ = prox βk τ k f (yk + β k τ k K x k ) 2.b. Break linesearch if β k τ k K y k+ K y k y k+ y k (33) Otherwise, set τ k := τ k µ and go to 2.a. End of linesearch Instead of (), now one can use the stronger inequality x k+ x k + τ k K y k+, ˆx x k+ τ k (g(x k+ ) g(ˆx) + γ 2 xk+ ˆx 2 ). (34) In turn, (34) yields a stronger version of (22) (also with β k instead of β): 2 ( xk ˆx 2 x k+ ˆx 2 ) + ( y k ŷ 2 y k+ ŷ 2 ) 2β k or, alternatively, 2 xk ˆx 2 + y k ŷ 2 2β k 2 xk x k 2 Note that the algorithm provides that β k+ β k Then from (36) follows 2 xk x k 2 ε k + γτ k 2 xk+ ˆx 2 (35) ε k + + γτ k x k+ ˆx 2 + β k+ y k+ ŷ 2. (36) 2 β k 2β k+ = + γτ k. For brevity let A k = 2 xk ˆx 2 + 2β k y k ŷ 2. β k+ A k+ + ε k A k β k 2

13 or β k+ A k+ + β k ε k β k A k. Thus, summing the above from k = to N, we get β N+ A N+ + β k ε k β A. (37) k= Using the convexity in the same way as in (26), (29), we obtain β k ε k = β N τ N ( + θ N )P (x N ) + [( + θ k )β k τ k θ k β k τ k )]P (x k ) k= k=2 θ β τ P (x 0 ) + β k τ k D(y k+ ) s N (P (X N ) + D(Y N )) θ σ P (x 0 ), (38) k= where Hence, σ k = β k τ k s N = σ k k= X N = σ θ x 0 + N k= σ k x k Nk= Y N σ k y k+ = σ θ + s N s N β N+ A N+ + s N G(X N, Y N ) β A + θ σ P (x 0 ). (39) From this we deduce that the sequence ( y k ŷ ) is bounded and G(X N, Y N ) s N (β A + θ σ P (x 0 )) x N+ ˆx 2 β N+ (β A + θ σ P (x 0 )). Our next goal is to derive asymptotics for β N and s N. Obviously, (33) holds for any τ k, β k such that τ k β k L. Since in each iteration of the linesearch we decrease τ k by factor of µ, τ k can not be less than µ. Hence, we have β k L µ β k+ = β k ( + γτ k ) β k ( + γ L ) = β k + γµ β k L β k. (40) By induction, one can show that there exists C > 0 such that β k Ck 2 for all k > 0. Then for some constant C > 0 we have x N+ ˆx 2 C (N + ) 2. From (40) it follows that β k+ β k γµ L Ck. As σk = β k+ β k γ, we obtain σ k µ L Ck and thus s N = N k= σ k = O(N 2 ). This means that for some constant C 2 > 0 G(X N, Y N ) C 2 N 2. We have shown the following result: Theorem 4. Let (x k, y k ) be a sequence generated by Algorithm 2. Then x N ˆx = O(/N) and G(X N, Y N ) = O(/N 2 ). 3
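Reading the update rules of Algorithm 2 off the statement above, a Python sketch of one of its iterations could look as follows. This is only our reading (prox_g and prox_fs are user-supplied prox operators of g and f*; the names are ours): β_k grows as β_{k−1}(1 + γτ_{k−1}), the interval for the trial step is rescaled by √(β_{k−1}/β_k), and the breaking condition is used with δ = 1.

import numpy as np

def apdal_step(K, prox_g, prox_fs, x, y, tau_prev, theta_prev, beta_prev,
               gamma, mu=0.7):
    # One iteration of the accelerated PDAL (sketch), g assumed gamma-strongly convex.
    x_new = prox_g(x - tau_prev * (K.T @ y), tau_prev)
    beta = beta_prev * (1.0 + gamma * tau_prev)          # beta_k grows with k
    scale = np.sqrt(beta_prev / beta)
    tau = tau_prev * scale * np.sqrt(1.0 + theta_prev)   # largest allowed trial step
    Kx, Kx_new, Kty = K @ x, K @ x_new, K.T @ y
    while True:
        theta = tau / tau_prev
        Kx_bar = (1.0 + theta) * Kx_new - theta * Kx
        y_new = prox_fs(y + beta * tau * Kx_bar, beta * tau)
        Kty_new = K.T @ y_new
        # breaking condition (33), with delta = 1
        if np.sqrt(beta) * tau * np.linalg.norm(Kty_new - Kty) <= np.linalg.norm(y_new - y):
            return x_new, y_new, tau, theta, beta
        tau *= mu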

14 3.2 f is strongly convex The case when f is γ strongly convex can be treated in a similar way. Algorithm 3 Accelerated primal-dual algorithm with linesearch: f is strongly convex Initialization: Choose x 0 X, y Y, µ (0, ), τ 0 > 0, β 0 > 0. Set θ 0 =. Main iteration:. Compute x k = prox τk g(x k τ k K y k ) β k β k = + γβ k τ k 2. Choose any τ k [τ k, τ k + θk ] and run Linesearch: 2.a. Compute θ k = τ k, σ k = β k τ k τ k x k = x k + θ k (x k x k ) y k+ = prox σk f (yk + σ k K x k ) 2.b. Break linesearch if β k τ k K y k+ K y k y k+ y k Otherwise, set τ k := τ k µ and go to 2.a. End of linesearch Note that again we set δ =. Dividing (22) over τ k and taking into account the strong convexity of f, which has to be used in (2), we deduce ( x k ˆx 2 x k+ ˆx 2 ) + ( y k ŷ 2 y k+ ŷ 2 ) 2τ k 2σ k which can be rewritten as 2τ k x k ˆx 2 + 2σ k y k ŷ 2 2τ k x k x k 2 2τ k x k x k 2 ε k τ k + γ 2 yk+ ŷ 2, (4) ε k + τ k+ x k+ ˆx 2 + σ k+ ( + γσ k ) y k+ ŷ 2, (42) τ k τ k 2τ k+ σ k 2σ k+ Note that by construction of (β k ) in Algorithm 3, we have τ k+ = σ k+ ( + γσ k ). τ k σ k Let A k = 2τ k x k ˆx 2 + 2σ k y k ŷ 2. Then (42) is equivalent to τ k+ A k+ + ε k A k x k x k 2 τ k τ k 2τ k 4

15 or τ k+ A k+ + ε k τ k A k 2 xk x k 2. Finally, summing the above from k = to N, we get τ N+ A N+ + ε k τ A k= 2 x k x k 2 (43) k= From this we conclude that the sequence (x k ) is bounded, lim k x k x k = 0, and G(X N, Y N ) s N (τ A + θ τ P (x 0 )), y N+ ŷ 2 σ N+ τ N+ (τ A + θ τ P (x 0 )) = β N+ (τ A + θ τ P (x 0 )), where X N, Y N, s N are the same as in Theorem 2. Let us turn to the derivation of the asymptotics of (τ N ) and (s N ). Analogously, we have that τ k β µ and thus k L β k+ = β k + γβ k τ k β k + γ µ. βk L Again it is not difficult to show by induction that β k C for some constant C > 0. In k 2 β fact, as φ(β) = is increasing, it is sufficient to show that +γ µ L β β k+ β k + γ µ βk L C k 2 + γ µ L C k 2 C (k + ) 2. The latter inequality is equivalent to C (2 + k ) L, which obviously holds for C large γµ enough (of course C must also satisfy the induction basis). The obtained asymptotics for (β k ) yields τ k µ βk L µk CL, from which we deduce s N = N k= τ k µ CL Nk= k. Finally, we obtain the following result: Theorem 5. Let (x k, y k ) be a sequence generated by Algorithm 3. Then y N ŷ = O(/N) and G(X N, Y N ) = O(/N 2 ). Remark 3. In the case when both g and f are strongly convex, one can derive a new algorithm, combining the ideas of Algorithms 2 and 3 (see [4] for more details). 5

16 4 A more general problem In this section we show how to apply the linesearch procedure to the more general problem min max Kx, y + g(x) f (y) h(y), (44) x X y Y where in addition to the previous assumptions, we suppose that h: Y R is a smooth convex function with L h Lipschitz continuous gradient h. Using the idea of [2], Condat and Vũ in [7, 8] proposed an extension of the primal-dual method to solve (44): y k+ = prox σf (y k + σ(k x k h(y k ))) x k+ = prox τg (x k τk y k+ ) x k+ = 2x k+ x k. This scheme was proved to converge under the condition τσ K 2 σl h. Originally the smooth function was added to the primal part and not to the dual as in (44). For us it is more convenient to consider precisely that form due to the nonsymmetry of the proposed linesearch procedure. However, simply exchanging max and min in (44), we recover the form which was considered in [7]. In addition to the issues related to the operator norm of K which motivated us to derive the PDAL, here we also have to know the Lipschitz constant L h of h. This has several drawbacks. First, its computation might be expensive. Second, our estimation of L h might be very conservative and will result in smaller steps. Third, using local information about h instead of global L h often allows the use of larger steps. Therefore, the introduction of a linesearch to the algorithm above is of great practical interest. The algorithm below exploits the same idea as Algorithm does. However, its stopping criterion is more involved. The interested reader may identify it as a combination of the stopping criterion (8) and the descent lemma for smooth functions h. Note that in the case h 0, Algorithm 4 corresponds exactly to Algorithm. We briefly sketch the proof of convergence. Let (ˆx, ŷ) be any saddle point of (). Similarly to (), (2), and (5), we get x k+ x k + τ k K y k+, ˆx x k+ τ k (g(x k+ ) g(ˆx)) β (yk+ y k ) τ k K x k + τ k h(y k ), ŷ y k+ τ k (f (y k+ ) f (ŷ)). θ k (x k x k ) + τ k K y k, x k+ x k τ k (( + θ k )g(x k ) g(x k+ ) θ k g(x k )). Summation of the three inequalities above and identity (6) yields x k+ x k, ˆx x k+ + β yk+ y k, ŷ y k+ + θ k x k x k, x k+ x k + τ k K y k+ K y k, x k x k+ τ k K y, x k ˆx + τ k K ˆx, y k+ y + τ k h(y k ), ŷ y k+ τ k ( f (y k+ ) f (ŷ) + ( + θ k )g(x k ) θ k g(x k ) g(ˆx) ). (46) 6

17 Algorithm 4 General primal-dual algorithm with linesearch Initialization: Choose x 0 X, y Y, τ 0 > 0, µ (0, ), δ (0, ) and β 0 > 0. Set θ 0 =. Main iteration:. Compute x k = prox τk g(x k τ k K y k ). 2. Choose any τ k [τ k, τ k + θk ] and run Linesearch: 2.a. Compute θ k = τ k, σ k = βτ k τ k 2.b. Break linesearch if x k = x k + θ k (x k x k ) y k+ = prox σk f (yk + σ k (K x k h(y k ))) τ k σ k K y k+ K y k 2 + 2σ k [h(y k+ ) h(y k ) h(y k ), y k+ y k ] Otherwise, set τ k := τ k µ and go to 2.a. End of linesearch δ y k+ y k 2 (45) By convexity of h, we have τ k (h(ŷ) h(y k ) h(y k ), ŷ y k ) 0. Combining it with inequality (45), divided over 2β, we get δ 2β yk+ y k 2 τ 2 k 2 K y k+ K y k 2 τ k [h(y k+ ) h(ŷ) h(y k ), y k+ ŷ ]. Adding the above inequality to (46) gives us x k+ x k, ˆx x k+ + β yk+ y k, ŷ y k+ + θ k x k x k, x k+ x k + τ k K y k+ K y k, x k x k+ τ k K y, x k ˆx + τ k K ˆx, y k+ y + δ 2β yk+ y k 2 τ 2 k 2 K y k+ K y k 2 τ k ( (f + h)(y k+ ) (f + h)(ŷ) + ( + θ k )g(x k ) θ k g(x k ) g(ˆx) ). (47) Note that for problem (44) instead of (7) we have to use Dˆx,ŷ (y) := f (y) + h(y) f (ŷ) h(ŷ) K ˆx, y ŷ 0 y Y, (48) 7

18 which is true by definition of (ˆx, ŷ). Now we can rewrite (47) as x k+ x k, ˆx x k+ + β yk+ y k, ŷ y k+ + θ k x k x k, x k+ x k + τ k K y k+ K y k, x k x k+ + δ 2β yk+ y k 2 τ 2 k 2 K y k+ K y k 2 τ k ( ( + θk )P (x k ) θ k P (x k ) + D(y k+ ) ). (49) Applying cosine rules for all inner products in the first line in (49) and using that θ k (x k x k ) = x k x k, we obtain 2 ( xk ˆx 2 x k+ ˆx 2 x k+ x k 2 )+ 2β ( yk ŷ 2 y k+ ŷ 2 y k+ y k 2 ) + 2 ( xk+ x k 2 x k x k 2 x k+ x k 2 ) + τ k K y k+ K y k x k+ x k + δ 2β yk+ y k 2 τ 2 k 2 K y k+ K y k 2 τ k ( ( + θk )P (x k ) θ k P (x k ) + D(y k+ ) ). (50) Finally, applying Cauchy s inequality in (50) and using that τ k θ k τ k ( + θ k ), we get 2 xk+ ˆx 2 + 2β yk+ ŷ 2 + τ k ( + θ k )P (x k ) 2 xk ˆx 2 + 2β yk ŷ 2 + τ k ( + θ k )P (x k ) 2 xk x k 2 δ 2β yk+ y k 2, (5) from which the convergence of (x k ) and (y k ) to a saddle point of (44) can be derived in a similar way as in Theorem. 5 Numerical Experiments This section collects several numerical tests that will illustrate the performance of the proposed methods. Computations were performed using Python 3 on an Intel Core i3-2350m CPU 2.30GHz running 64-bit Debian Linux 8.7. For PDAL and APDAL we initialize the input data as µ = 0.7, δ = 0.99, τ 0 = min {m,n} A F. The latter is easy to compute and it is an upper bound of. The parameter A β for PDAL is always taken as σ in PDA with fixed steps σ and τ. A trial step τ τ k in Step 2 is always chosen as τ k = τ k + θk. Codes can be found on 8
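One natural reading of the initialization above is τ_0 = √(min{m, n})/‖A‖_F, which is indeed an upper bound of 1/‖A‖ because ‖A‖_F ≤ √(min{m, n}) ‖A‖. A quick Python check of this cheap starting step (our own snippet):

import numpy as np

def initial_step(A):
    # tau_0 = sqrt(min(m, n)) / ||A||_F, a cheap upper bound of 1 / ||A||_2.
    m, n = A.shape
    return np.sqrt(min(m, n)) / np.linalg.norm(A, 'fro')

A = np.random.randn(200, 1000)
assert initial_step(A) >= 1.0 / np.linalg.norm(A, 2)   # always holds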

5.1 Matrix game

We are interested in the following min-max matrix game:

    min_{x ∈ Δ_n} max_{y ∈ Δ_m} ⟨Ax, y⟩,   (52)

where x ∈ R^n, y ∈ R^m, A ∈ R^{m×n}, and Δ_n, Δ_m denote the standard unit simplices in R^n and R^m, respectively. For this problem we study the performance of PDA, PDAL (Algorithm 1), Tseng's FBF method [17], and PEGM [15]. For comparison we use the primal-dual gap G(x, y), which can be easily computed for a feasible pair (x, y) as

    G(x, y) = max_i (Ax)_i − min_j (A^T y)_j.

Since iterates obtained by Tseng's method may be infeasible, we used an auxiliary point (see [17]) to compute the primal-dual gap. The initial point in all cases was chosen as x^0 = (1/n, ..., 1/n) and y^0 = (1/m, ..., 1/m). In order to compute the projection onto the unit simplex we used the algorithm from [8]. For PDA we use τ = σ = 1/‖A‖ = 1/√(λ_max(A^T A)), which we compute in advance. The input data for FBF and PEGM are the same as in [15]. Note that these methods also use a linesearch.

We consider four differently generated samples of the matrix A ∈ R^{m×n}:

1. m = n = 100. All entries of A are generated independently from the uniform distribution in [−1, 1].
2. m = n = 100. All entries of A are generated independently from the normal distribution N(0, 1).
3. m = 500, n = 100. All entries of A are generated independently from the normal distribution N(0, 1).
4. m = 1000, n = 2000. The matrix A is sparse with 10% nonzero elements generated independently from the uniform distribution in [0, 1].

For every case we report the primal-dual gap G(x^k, y^k) computed in every iteration vs. CPU time. The results are presented in Figure 1.
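For a feasible pair the primal-dual gap of the matrix game is a one-line computation, and the simplex projection can be done by a standard sorting-based routine in the spirit of the method referenced in [8]. The Python sketch below is our own code, not necessarily the exact routines used in the experiments.

import numpy as np

def pd_gap(A, x, y):
    # G(x, y) = max_i (Ax)_i - min_j (A^T y)_j for feasible x, y.
    return np.max(A @ x) - np.min(A.T @ y)

def project_simplex(v):
    # Euclidean projection of v onto the unit simplex (sorting-based, O(n log n)).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)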

Figure 1: Convergence plots for problem (52): primal-dual gap G(x^k, y^k) vs. CPU time (in seconds) for Examples 1-4 and the methods PDA, PDAL, FBF, and PEGM.

5.2 l1-regularized least squares

We study the following l1-regularized problem:

    min_x φ(x) := ½‖Ax − b‖² + λ‖x‖_1,   (53)

where A ∈ R^{m×n}, x ∈ R^n, b ∈ R^m. Let g(x) = λ‖x‖_1 and f(p) = ½‖p − b‖². Analogously to (9) and (10), we can rewrite (53) as

    min_x max_y g(x) + ⟨Ax, y⟩ − f*(y),

where f*(y) = ½‖y‖² + ⟨b, y⟩ = ½‖y + b‖² − ½‖b‖². Clearly, the last term does not have any impact on the prox-term, and we can conclude that prox_{σ f*}(y) = (y − σb)/(1 + σ). This means that the linesearch in Algorithm 1 does not require any additional matrix-vector multiplications (see Remark 2).

We generate four instances of problem (53), on which we compare the performance of PDA, PDAL, APDA (the accelerated primal-dual algorithm), APDAL, FISTA [3], and SpaRSA [19]. The latter method is a variant of the proximal gradient method with an adaptive linesearch. All methods except PDAL and SpaRSA require predefined step sizes; for this we compute in advance ‖A‖ = √(λ_max(A^T A)). For all instances below we use the following parameters:

PDA: σ = 1/(20‖A‖), τ = 20/‖A‖;
PDAL: β = 1/400;
APDA, APDAL: β = 1, γ = 0.1;
FISTA: α = 1/‖A‖²;
SpaRSA: in the first iteration we run a standard Armijo linesearch procedure to define α_0, and then we run SpaRSA with the parameters described in [19]: M = 5, σ = 0.01, α_max = 1/α_min.
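The two prox operators needed to run PDAL on (53) are short enough to state in code: soft-thresholding for g(x) = λ‖x‖_1 and the affine prox of f*(y) = ½‖y‖² + ⟨b, y⟩. This is a sketch with our own naming; it plugs directly into a PDAL routine of the kind sketched after Algorithm 1.

import numpy as np

lam = 0.1   # regularization parameter, as in the experiments

def prox_g(v, t):
    # prox of t * lam * ||.||_1: componentwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

def make_prox_fs(b):
    # prox of s * f* with f*(y) = 0.5 * ||y||^2 + <b, y>, which is affine in y.
    def prox_fs(v, s):
        return (v - s * b) / (1.0 + s)
    return prox_fs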

For all cases below we generate a random w ∈ R^n in which s random coordinates are drawn from the uniform distribution in [−10, 10] and the rest are zeros. Then we generate ν ∈ R^m with entries drawn from N(0, 0.1) and set b = Aw + ν. The parameter λ = 0.1 for all examples, and the initial points are x^0 = (0, ..., 0), y^0 = Ax^0 − b. The matrix A ∈ R^{m×n} is constructed in one of the following ways:

1. n = 1000, m = 200, s = 10. All entries of A are generated independently from N(0, 1).
2. n = 2000, m = 1000, s = 100. All entries of A are generated independently from N(0, 1).
3. n = 5000, m = 1000, s = 50. First, we generate a matrix B with entries from N(0, 1). Then for a given p ∈ (0, 1) we construct the matrix A by columns A_j, j = 1, ..., n, as follows: A_1 = B_1/√(1 − p²), A_j = p A_{j−1} + B_j for j = 2, ..., n. As p increases, A becomes more ill-conditioned (see [1], where this example was considered).
4. The same as the previous example, but with p = 0.9.

Figure 2 collects the convergence results with φ(x^k) − φ_* vs. CPU time. Since the value φ_* is in fact unknown, we instead run our algorithms for sufficiently many iterations to obtain a ground-truth solution x_*. Then we simply set φ_* = φ(x_*).

5.3 Nonnegative least squares

Next, we consider another regularized least-squares problem:

    min_{x ≥ 0} φ(x) := ½‖Ax − b‖²,   (54)

where A ∈ R^{m×n}, x ∈ R^n, b ∈ R^m. Similarly as before, we can express it as

    min_x max_y g(x) + ⟨Ax, y⟩ − f*(y),   (55)

where g(x) = δ_{R^n_+}(x) and f*(y) = ½‖y + b‖². For all cases below we generate a random matrix A ∈ R^{m×n} with density d ∈ (0, 1]. In order to make the optimal value φ_* = 0, we always generate w as a sparse vector in R^n whose s nonzero entries are drawn from the uniform distribution in [0, 100]. Then we set b = Aw.

We test the performance of the same algorithms as in the previous example. For FISTA and SpaRSA we use the same parameters as before. For every instance of the problem, PDA and PDAL use the same β. For APDA and APDAL we always set β = 1 and γ = 0.1. The initial points are x^0 = (0, ..., 0) and y^0 = Ax^0 − b = −b. We generate our data as follows:

1. m = 2000, n = 4000, d = 1, s = 1000; the entries of A are generated independently from the uniform distribution in [−1, 1]; β = 25.

Figure 2: Convergence plots for problem (53): φ(x^k) − φ_* vs. CPU time (in seconds) for Examples 1-4 and the methods PDA, PDAL, APDA, APDAL, FISTA, and SpaRSA.

2. m = 1000, n = 2000, d = 0.5, s = 100; the nonzero entries of A are generated independently from the uniform distribution in [0, 1].
3. m = 3000, n = 5000, d = 0.1, s = 100; the nonzero entries of A are generated independently from the uniform distribution in [0, 1].
4. m = 10000, n = 20000, d = 0.01, s = 500; the nonzero entries of A are generated independently from the normal distribution N(0, 1).

As (54) is again just a regularized least-squares problem, the linesearch does not require any additional matrix-vector multiplications. The results with φ(x^k) − φ_* vs. CPU time are presented in Figure 3. Although the primal-dual methods may converge faster, they require tuning the ratio β = σ/τ. It is also interesting to highlight that sometimes non-accelerated methods with a properly chosen ratio σ/τ can be faster than their accelerated variants.

6 Conclusion

In this work we have presented several primal-dual algorithms with linesearch. On the one hand, this allows us to avoid the evaluation of the operator norm; on the other hand, it allows us to make larger steps.

The proposed linesearch is very simple, and in many important cases it does not require any additional expensive operations (such as matrix-vector multiplications or prox-operators). For all methods we have proved convergence. Our experiments confirm the numerical efficiency of the proposed methods.

Figure 3: Convergence plots for problem (54): φ(x^k) − φ_* vs. CPU time (in seconds) for Examples 1-4 and the methods PDA, PDAL, APDA, APDAL, FISTA, and SpaRSA.

Acknowledgements: The work is supported by the Austrian Science Fund (FWF) under the project "Efficient Algorithms for Nonsmooth Optimization in Imaging" (EANOI) No. I1148. The authors would also like to thank the referees and the SIOPT editor Wotao Yin for their careful reading of the manuscript and their numerous helpful comments.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37-45, 2010.

[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2011.

[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.

[4] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120-145, 2011.

[5] A. Chambolle and T. Pock. On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program., pages 1-35, 2015.

[6] A. Chambolle and T. Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161-319, 2016.

[7] L. Condat. A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl., 158(2):460-479, 2013.

[8] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[9] T. Goldstein, M. Li, and X. Yuan. Adaptive primal-dual splitting methods for statistical learning and image processing. In Advances in Neural Information Processing Systems, 2015.

[10] T. Goldstein, M. Li, X. Yuan, E. Esser, and R. Baraniuk. Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint, 2013.

[11] B. He, H. Yang, and S. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. J. Optim. Theory Appl., 106(2), 2000.

[12] B. He and X. Yuan. Convergence analysis of primal-dual algorithms for a saddle-point problem: From contraction perspective. SIAM J. Imaging Sci., 5(1):119-149, 2012.

[13] N. Komodakis and J.-C. Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine, 32(6):31-54, 2015.

[14] Y. Malitsky. Reflected projected gradient method for solving monotone variational inequalities. SIAM J. Optim., 25(1):502-520, 2015.

[15] Y. Malitsky. Proximal extrapolated gradient methods for variational inequalities. Optimization Methods and Software, 33(1):140-164, 2018.

[16] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.

[17] P. Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim., 38:431-446, 2000.

[18] B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667-681, 2013.

[19] S. J. Wright, R. D. Nowak, and M. A. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.


Second order forward-backward dynamical systems for monotone inclusion problems Second order forward-backward dynamical systems for monotone inclusion problems Radu Ioan Boţ Ernö Robert Csetnek March 6, 25 Abstract. We begin by considering second order dynamical systems of the from

More information

On the equivalence of the primal-dual hybrid gradient method and Douglas Rachford splitting

On the equivalence of the primal-dual hybrid gradient method and Douglas Rachford splitting Mathematical Programming manuscript No. (will be inserted by the editor) On the equivalence of the primal-dual hybrid gradient method and Douglas Rachford splitting Daniel O Connor Lieven Vandenberghe

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

A Parallel Block-Coordinate Approach for Primal-Dual Splitting with Arbitrary Random Block Selection

A Parallel Block-Coordinate Approach for Primal-Dual Splitting with Arbitrary Random Block Selection EUSIPCO 2015 1/19 A Parallel Block-Coordinate Approach for Primal-Dual Splitting with Arbitrary Random Block Selection Jean-Christophe Pesquet Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est

More information

Primal-dual algorithms for the sum of two and three functions 1

Primal-dual algorithms for the sum of two and three functions 1 Primal-dual algorithms for the sum of two and three functions 1 Ming Yan Michigan State University, CMSE/Mathematics 1 This works is partially supported by NSF. optimization problems for primal-dual algorithms

More information

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Peter Ochs, Jalal Fadili, and Thomas Brox Saarland University, Saarbrücken, Germany Normandie Univ, ENSICAEN, CNRS, GREYC, France

More information

On the order of the operators in the Douglas Rachford algorithm

On the order of the operators in the Douglas Rachford algorithm On the order of the operators in the Douglas Rachford algorithm Heinz H. Bauschke and Walaa M. Moursi June 11, 2015 Abstract The Douglas Rachford algorithm is a popular method for finding zeros of sums

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Sparse Approximation via Penalty Decomposition Methods

Sparse Approximation via Penalty Decomposition Methods Sparse Approximation via Penalty Decomposition Methods Zhaosong Lu Yong Zhang February 19, 2012 Abstract In this paper we consider sparse approximation problems, that is, general l 0 minimization problems

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

arxiv: v2 [math.oc] 21 Nov 2017

arxiv: v2 [math.oc] 21 Nov 2017 Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems arxiv:1711.06831v [math.oc] 1 Nov 017 Lei Yang Abstract In this paper, we consider a class of

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

Compressed Sensing: a Subgradient Descent Method for Missing Data Problems

Compressed Sensing: a Subgradient Descent Method for Missing Data Problems Compressed Sensing: a Subgradient Descent Method for Missing Data Problems ANZIAM, Jan 30 Feb 3, 2011 Jonathan M. Borwein Jointly with D. Russell Luke, University of Goettingen FRSC FAAAS FBAS FAA Director,

More information

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities Caihua Chen Xiaoling Fu Bingsheng He Xiaoming Yuan January 13, 2015 Abstract. Projection type methods

More information

Convergence rate estimates for the gradient differential inclusion

Convergence rate estimates for the gradient differential inclusion Convergence rate estimates for the gradient differential inclusion Osman Güler November 23 Abstract Let f : H R { } be a proper, lower semi continuous, convex function in a Hilbert space H. The gradient

More information

arxiv: v7 [math.oc] 22 Feb 2018

arxiv: v7 [math.oc] 22 Feb 2018 A SMOOTH PRIMAL-DUAL OPTIMIZATION FRAMEWORK FOR NONSMOOTH COMPOSITE CONVEX MINIMIZATION QUOC TRAN-DINH, OLIVIER FERCOQ, AND VOLKAN CEVHER arxiv:1507.06243v7 [math.oc] 22 Feb 2018 Abstract. We propose a

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm

More information

A New Primal Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator

A New Primal Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator https://doi.org/10.1007/s10915-018-0680-3 A New Primal Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator Ming Yan 1,2 Received: 22 January 2018 / Accepted: 22 February 2018

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

ADMM for monotone operators: convergence analysis and rates

ADMM for monotone operators: convergence analysis and rates ADMM for monotone operators: convergence analysis and rates Radu Ioan Boţ Ernö Robert Csetne May 4, 07 Abstract. We propose in this paper a unifying scheme for several algorithms from the literature dedicated

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Lecture: Algorithms for Compressed Sensing

Lecture: Algorithms for Compressed Sensing 1/56 Lecture: Algorithms for Compressed Sensing Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:

More information

In collaboration with J.-C. Pesquet A. Repetti EC (UPE) IFPEN 16 Dec / 29

In collaboration with J.-C. Pesquet A. Repetti EC (UPE) IFPEN 16 Dec / 29 A Random block-coordinate primal-dual proximal algorithm with application to 3D mesh denoising Emilie CHOUZENOUX Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est, France Horizon Maths 2014

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Primal and Dual Predicted Decrease Approximation Methods

Primal and Dual Predicted Decrease Approximation Methods Primal and Dual Predicted Decrease Approximation Methods Amir Beck Edouard Pauwels Shoham Sabach March 22, 2017 Abstract We introduce the notion of predicted decrease approximation (PDA) for constrained

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

Universal Gradient Methods for Convex Optimization Problems

Universal Gradient Methods for Convex Optimization Problems CORE DISCUSSION PAPER 203/26 Universal Gradient Methods for Convex Optimization Problems Yu. Nesterov April 8, 203; revised June 2, 203 Abstract In this paper, we present new methods for black-box convex

More information

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE Convex Analysis Notes Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE These are notes from ORIE 6328, Convex Analysis, as taught by Prof. Adrian Lewis at Cornell University in the

More information

Convergence rate of inexact proximal point methods with relative error criteria for convex optimization

Convergence rate of inexact proximal point methods with relative error criteria for convex optimization Convergence rate of inexact proximal point methods with relative error criteria for convex optimization Renato D. C. Monteiro B. F. Svaiter August, 010 Revised: December 1, 011) Abstract In this paper,

More information

Positive semidefinite matrix approximation with a trace constraint

Positive semidefinite matrix approximation with a trace constraint Positive semidefinite matrix approximation with a trace constraint Kouhei Harada August 8, 208 We propose an efficient algorithm to solve positive a semidefinite matrix approximation problem with a trace

More information

M. Marques Alves Marina Geremia. November 30, 2017

M. Marques Alves Marina Geremia. November 30, 2017 Iteration complexity of an inexact Douglas-Rachford method and of a Douglas-Rachford-Tseng s F-B four-operator splitting method for solving monotone inclusions M. Marques Alves Marina Geremia November

More information

An inexact subgradient algorithm for Equilibrium Problems

An inexact subgradient algorithm for Equilibrium Problems Volume 30, N. 1, pp. 91 107, 2011 Copyright 2011 SBMAC ISSN 0101-8205 www.scielo.br/cam An inexact subgradient algorithm for Equilibrium Problems PAULO SANTOS 1 and SUSANA SCHEIMBERG 2 1 DM, UFPI, Teresina,

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

A double projection method for solving variational inequalities without monotonicity

A double projection method for solving variational inequalities without monotonicity A double projection method for solving variational inequalities without monotonicity Minglu Ye Yiran He Accepted by Computational Optimization and Applications, DOI: 10.1007/s10589-014-9659-7,Apr 05, 2014

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

INERTIAL ACCELERATED ALGORITHMS FOR SOLVING SPLIT FEASIBILITY PROBLEMS. Yazheng Dang. Jie Sun. Honglei Xu

INERTIAL ACCELERATED ALGORITHMS FOR SOLVING SPLIT FEASIBILITY PROBLEMS. Yazheng Dang. Jie Sun. Honglei Xu Manuscript submitted to AIMS Journals Volume X, Number 0X, XX 200X doi:10.3934/xx.xx.xx.xx pp. X XX INERTIAL ACCELERATED ALGORITHMS FOR SOLVING SPLIT FEASIBILITY PROBLEMS Yazheng Dang School of Management

More information

Inertial Douglas-Rachford splitting for monotone inclusion problems

Inertial Douglas-Rachford splitting for monotone inclusion problems Inertial Douglas-Rachford splitting for monotone inclusion problems Radu Ioan Boţ Ernö Robert Csetnek Christopher Hendrich January 5, 2015 Abstract. We propose an inertial Douglas-Rachford splitting algorithm

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in

More information

An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization

An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization H. Mansouri M. Zangiabadi Y. Bai C. Roos Department of Mathematical Science, Shahrekord University, P.O. Box 115, Shahrekord,

More information

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms JOTA manuscript No. (will be inserted by the editor) Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms Peter Ochs Jalal Fadili Thomas Brox Received: date / Accepted: date Abstract

More information