A Linearly Convergent First-order Algorithm for Total Variation Minimization in Image Processing


Cong D. Dang, Kaiyu Dai, Guanghui Lan

(Cong D. Dang: Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, congdd@ufl.edu; partially supported by a doctoral fellowship from the Vietnam International Education Development program and an NSF CMMI grant. Kaiyu Dai: Software School, Fudan University, Shanghai, China, kydai@fudan.edu.cn; partially supported by a visiting scholar fellowship from the China Scholarship Council. Guanghui Lan: Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, glan@ise.ufl.edu; partially supported by an NSF CMMI grant.)

Abstract. We introduce a new formulation for total variation minimization in image denoising. We also present a linearly convergent first-order method for solving this reformulated problem and show that it possesses a nearly dimension-independent iteration complexity bound.

Keywords: Image denoising, Total variation, First-order methods, Complexity, Linear rate of convergence

1 Introduction

The restoration of images contaminated by noise is a fundamental problem in biomedical image processing and plays an important role in diagnostic techniques such as magnetic resonance imaging (MRI) and functional magnetic resonance imaging (fMRI). In 1992, Rudin, Osher and Fatemi (ROF) [17] proposed an influential optimization approach for image denoising based on minimizing the total variation (TV). The ROF model preserves edges and important features of the original image.

In this paper we propose an alternative formulation (or relaxation) for minimizing the total variation, which leads to denoising quality comparable to that of the classical ROF model. Moreover, we show that the relaxed model can be solved very efficiently. In particular, we present a linearly convergent first-order algorithm for solving this new model, and demonstrate that it possesses an O(ln(1/ɛ)) iteration complexity for achieving a target accuracy ɛ. Since this iteration complexity bound is almost dimension-independent and the cost per iteration depends only linearly on the dimension, the total arithmetic complexity of our algorithm is O(N² ln(1/ɛ)) for processing an N × N image. Hence, our approach is scalable to very large-scale image denoising problems. By contrast, most existing approaches for solving the original ROF model are based on an equivalent dual or primal-dual formulation (see, e.g., Chan et al. [5], Chambolle [3], Beck and Teboulle [1, 2],

Chambolle and Pock [4]). All these algorithms converge sublinearly for solving the ROF model, and the best iteration complexity bound, given in [1, 4], is O(1/√ɛ). Moreover, these complexity results depend heavily on the dimension of the problem, the selection of the initial point and certain regularization parameters. We also refer to other algorithms recently developed for solving the ROF model (e.g., [6, 8, 7, 9, 12, 13, 19]) and the references therein.

This paper is organized as follows. We review the classical ROF model and present a new strongly convex composite reformulation (or relaxation) of the ROF model in Section 2. An efficient algorithm for solving the reformulation is presented and analyzed in Section 3. We then report some promising numerical results and biomedical applications in Section 4. All the technical proofs are given in the Appendix.

2 A strongly convex composite reformulation for total variation minimization

In this section, we review the classical ROF model for image denoising and present a novel reformulation of it. We also show how these two TV-based models for image denoising are related.

For the sake of simplicity, let us assume that the images are 2-dimensional with N × N pixels. For any image u ∈ R^{N×N}, the discretized gradient operator ∇u is defined as

(∇u)_{i,j} := ((∇u)^1_{i,j}, (∇u)^2_{i,j}), i, j = 1, ..., N, (2.1)

where

(∇u)^1_{i,j} := { u_{i+1,j} − u_{i,j}, i < N; 0, i = N }  and  (∇u)^2_{i,j} := { u_{i,j+1} − u_{i,j}, j < N; 0, j = N }.

Then, the classical total variation minimization problem is given by

ū = argmin_u { φ(u) := T(u) + (λ/2) ‖u − f‖² }, (2.2)

where λ > 0 is a user-defined parameter, f is the observed noisy image and

T(u) = Σ_{i,j} ‖((∇u)^1_{i,j}, (∇u)^2_{i,j})‖.

Observe that the norm ‖·‖ in the definition of T(·) (and hereafter) can be either the l1 or the l2 norm. If ‖·‖ = ‖·‖_2, then problem (2.2) is exactly the original ROF model. It can be easily seen that the objective function φ(u) in (2.2) is a nonsmooth strongly convex function. It is known that oracle-based convex optimization techniques would require O(1/ɛ) iterations to find an ɛ-solution of (2.2), i.e., a point û such that φ(û) − φ* ≤ ɛ (see [10]). It has recently been shown that the above iteration complexity can be significantly improved to O(1/√ɛ) by using a dual or saddle point reformulation of (2.2) (e.g., [1, 4, 15]). Note, however, that all these algorithms converge sublinearly, and that their performance also heavily depends on the dimension N and the selection of starting points. In order to address these issues, we consider an alternative formulation of problem (2.2).
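As a concrete illustration of the discretization in (2.1)-(2.2), the following minimal NumPy sketch evaluates the forward-difference gradient and the ROF objective for the isotropic (l2) choice of the norm. The code and its function names are ours, not part of the paper, and assume the image is stored as a 2-D floating-point array.

import numpy as np

def grad(u):
    """Forward-difference gradient (2.1): returns (g1, g2) with zero last row/column."""
    g1 = np.zeros_like(u)              # (grad u)^1_{i,j} = u_{i+1,j} - u_{i,j} for i < N
    g1[:-1, :] = u[1:, :] - u[:-1, :]
    g2 = np.zeros_like(u)              # (grad u)^2_{i,j} = u_{i,j+1} - u_{i,j} for j < N
    g2[:, :-1] = u[:, 1:] - u[:, :-1]
    return g1, g2

def rof_objective(u, f, lam):
    """phi(u) = T(u) + (lam/2)*||u - f||^2 with the isotropic (l2) norm in T."""
    g1, g2 = grad(u)
    tv = np.sum(np.sqrt(g1**2 + g2**2))        # T(u) = sum_{i,j} ||((grad u)^1, (grad u)^2)||_2
    return tv + 0.5 * lam * np.sum((u - f)**2)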

The basic idea is to introduce an extra variable d ∈ R^{2N(N−1)}, which corresponds to the nonzero components of the gradient operator ∇u, and then to impose the following set of constraints:

d^1_{i,j} = u_{i+1,j} − u_{i,j}, i = 1, ..., N−1; j = 1, ..., N,
d^2_{i,j} = u_{i,j+1} − u_{i,j}, i = 1, ..., N; j = 1, ..., N−1.

Observe that the above constraints can be written in matrix form as

Eu + d = 0, (2.3)

where Eᵀ is a network flow matrix with N² nodes and 2N(N−1) arcs, each node having degree at most 4; E is a sparse finite-difference matrix whose building blocks are described in Section A.3. Now, denoting

T̃(d) := Σ_{i,j} ‖(d^1_{i,j}, d^2_{i,j})‖,

we consider the following optimization problem:

(u*, d*) = argmin_{u,d} { φ̃(u, d) := T̃(d) + (λ/2)[ ‖u − f‖² + q ‖Eu + d‖² ] } (2.4)

for some parameters λ, q > 0. Similar to φ(u) in (2.2), the new objective function φ̃(u, d) is also a nonsmooth strongly convex function. While the non-separable and nonsmooth term T(·) makes problem (2.2) difficult to solve, the nonsmooth term T̃(·) in (2.4) is separable with respect to the pairs (d^1_{i,j}, d^2_{i,j}). This fact will enable us to design a very efficient algorithm for solving problem (2.4) (see Section 3.2).

We would also like to provide some intuitive explanation of the reformulation given in (2.4). Observe that both terms T̃(d) and ‖Eu + d‖² can be viewed as certain regularization terms. While the first term T̃(d) enforces the sparsity of the vector d, i.e., of the estimated gradient vector, and thus helps to smooth the recovered image, the latter term ‖Eu + d‖² essentially takes into account that the computation of d is not exact because of the stochastic noise. Introducing this extra regularization term into the optimization problem protects the image from being oversmoothed, as may happen with the original formulation (2.2) (see Section 4).

It is interesting to observe some relations between problems (2.2) and (2.4).

Proposition 1 Let φ* and φ̃* be the optimal values of (2.2) and (2.4), respectively. We have

φ̃* ≤ φ* ≤ φ̃* + N²/(2λq). (2.5)

It follows from Proposition 1 that, for given λ and N, the parameter q in (2.4) should be big enough in order to approximately solve the original problem (2.2). Observe, however, that our goal is not to solve problem (2.2), but to recover the contaminated image. Due to the aforementioned role that the extra regularization term ‖Eu + d‖² plays in (2.4), we argue that it is not necessary to choose a very large q. Indeed, we observe from our computational experiments that q can be set to 2 or 4 in most cases, and that selecting a much larger value of q seems to be actually harmful to image denoising.
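The operator E is simply a (negated) sparse finite-difference matrix, so it can be built once and applied in O(N²) time. The sketch below uses our own sign and vectorization conventions (column-major stacking of u, with d = (d^1; d^2)) so that Eu + d = 0 holds when d collects the forward differences, and evaluates the relaxed objective (2.4); the function names are ours and the layout is an assumption, not the paper's.

import numpy as np
import scipy.sparse as sp

def difference_operator(N):
    # E such that Eu + d = 0 in (2.3) when d stacks the forward differences
    # (d^1; d^2) of the column-major vectorization of u.
    I = sp.identity(N, format="csr")
    D = sp.diags([-np.ones(N - 1), np.ones(N - 1)], [0, 1], shape=(N - 1, N))
    D1 = sp.kron(I, D)        # d^1_{ij} = u_{i+1,j} - u_{i,j}
    D2 = sp.kron(D, I)        # d^2_{ij} = u_{i,j+1} - u_{i,j}
    return (-sp.vstack([D1, D2])).tocsr()

def relaxed_objective(u, d, f, lam, q, E):
    # phi_tilde(u, d) = T_tilde(d) + (lam/2)[ ||u-f||^2 + q*||Eu + d||^2 ], as in (2.4).
    N = u.shape[0]
    d1 = d[:N * (N - 1)].reshape((N - 1, N), order="F")
    d2 = d[N * (N - 1):].reshape((N, N - 1), order="F")
    g1, g2 = np.zeros((N, N)), np.zeros((N, N))
    g1[:-1, :], g2[:, :-1] = d1, d2
    tv = np.sum(np.sqrt(g1**2 + g2**2))        # T_tilde(d), boundary pairs padded with zeros
    r = E @ u.flatten(order="F") + d
    return tv + 0.5 * lam * (np.sum((u - f)**2) + q * np.sum(r**2))

Since E has only two nonzeros per row, both Eu and Eᵀv cost O(N²) operations, which is what drives the per-iteration cost quoted later.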

3 A linearly convergent algorithm for TV denoising

In the previous section, we reformulated the TV minimization problem as minimizing the sum of a relatively simple nonsmooth convex function and a smooth strongly convex function. Our goal in this section is to show that such a reformulation can be solved efficiently. More specifically, we first present an accelerated gradient descent (AC-GD) method based on Nesterov's smooth optimal method [14, 16] for solving a general class of strongly convex composite optimization problems. Then, we show that, by using this algorithm, one can solve problem (2.4) in O(√q ln(1/ɛ)) iterations. Our algorithm can be viewed as a variant of the well-known FISTA algorithm by Beck and Teboulle [1, 2]. However, since FISTA does not take advantage of the strong convexity of the problem, it possesses a much worse performance guarantee than the one mentioned above.

3.1 The accelerated gradient descent (AC-GD) algorithm

Consider the following general composite problem:

Ψ* := min_{x∈X} { Ψ(x) := ψ(x) + 𝒳(x) }, (3.6)

where X ⊆ R^n is a closed convex set, 𝒳 : X → R is a simple convex function, and ψ : X → R is smooth and strongly convex with Lipschitz continuous gradient, i.e., there exist L ≥ 0 and µ ≥ 0 such that

(µ/2) ‖y − x‖² ≤ ψ(y) − ψ(x) − ⟨ψ'(x), y − x⟩ ≤ (L/2) ‖y − x‖², ∀ x, y ∈ X. (3.7)

The following AC-GD algorithm for solving (3.6) maintains the updating of three intertwined sequences, namely {x_t}, {x^ag_t} and {x^md_t}, at each iteration t. All these types of multi-step gradient algorithms originate from Nesterov's seminal work [14] (see Tseng [18] for a summary). However, very few of these algorithms can make use of the special strongly convex composite structure in (3.6), except those in [10, 11].

The AC-GD method for strongly convex composite optimization.

Input: x_0 ∈ X, step-size parameters {α_t}_{t≥1} and {γ_t}_{t≥1} such that α_1 = 1, α_t ∈ (0, 1) for any t ≥ 2, and γ_t ≥ 0 for any t ≥ 1.

0) Set the initial point x^ag_0 = x_0 and t = 1;

1) Set

x^md_t = [(1 − α_t)(µ + γ_t) / (γ_t + µ(1 − α_t²))] x^ag_{t−1} + [α_t (µ(1 − α_t) + γ_t) / (γ_t + µ(1 − α_t²))] x_{t−1}; (3.8)

2) Set

x^+_t = [α_t µ / (µ + γ_t)] x^md_t + [((1 − α_t)µ + γ_t) / (µ + γ_t)] x_{t−1}, (3.9)

x_t = argmin_{x∈X} { α_t [⟨ψ'(x^md_t), x⟩ + 𝒳(x)] + ((µ + γ_t)/2) ‖x^+_t − x‖² }, (3.10)

x^ag_t = α_t x_t + (1 − α_t) x^ag_{t−1}; (3.11)

3) Set t ← t + 1 and go to step 1.
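To make the iteration concrete, here is a compact NumPy sketch of steps (3.8)-(3.11) using the step-size policy of Corollary 3 below (as we reconstruct it). The names acgd, grad_psi and prox_X are ours; prox_X is assumed to return the exact minimizer of the composite subproblem, which for problem (2.4) is available in closed form (see Section 3.2).

import numpy as np

def acgd(grad_psi, prox_X, x0, mu, L, n_iters):
    # AC-GD for Psi(x) = psi(x) + X(x), following (3.8)-(3.11), with stepsizes (3.16).
    # prox_X(c, z, rho) must return argmin_x { <c, x> + X(x) + (rho/2)*||x - z||^2 };
    # this is subproblem (3.10) after dividing its objective by alpha_t,
    # which does not change the minimizer.
    x, x_ag, Gamma = x0.copy(), x0.copy(), 1.0
    for t in range(1, n_iters + 1):
        alpha = max(np.sqrt(mu / L), 2.0 / (t + 1))          # equals 1 at t = 1 since mu <= L
        Gamma = 1.0 if t == 1 else (1.0 - alpha) * Gamma     # (3.14)
        gamma = 2.0 * L * Gamma                              # (3.16)
        den = gamma + mu * (1.0 - alpha**2)
        x_md = ((1 - alpha) * (mu + gamma) / den) * x_ag \
             + (alpha * (mu * (1 - alpha) + gamma) / den) * x            # (3.8)
        x_plus = (alpha * mu / (mu + gamma)) * x_md \
               + (((1 - alpha) * mu + gamma) / (mu + gamma)) * x         # (3.9)
        x = prox_X(grad_psi(x_md), x_plus, (mu + gamma) / alpha)         # (3.10)
        x_ag = alpha * x + (1 - alpha) * x_ag                            # (3.11)
    return x_ag

For problem (2.4), grad_psi is the gradient of ψ in (3.18), an O(N²) sparse matrix-vector product, and prox_X is the closed-form update (3.22)-(3.23).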

The AC-GD algorithm differs from the related accelerated stochastic approximation (AC-SA) algorithm for solving strongly convex composite optimization problems [10, 11] in the following aspects. Firstly, the above algorithm is deterministic, while the one in [10] is stochastic. Secondly, the subproblem used to define x_t in (3.10) is much simpler than the corresponding one in [10]. Finally, we show that the above simple AC-GD algorithm can achieve the optimal rate of convergence for solving strongly convex composite problems possessed by a more involved multi-stage algorithm in [11].

Theorem 2 below describes the main convergence properties of the above AC-GD algorithm.

Theorem 2 Assume that {α_t}_{t≥1} and {γ_t}_{t≥1} in the AC-GD algorithm are chosen such that

µ + γ_t ≥ L α_t², (3.12)

γ_1/Γ_1 = γ_2/Γ_2 = ···, (3.13)

where

Γ_t := { 1, t = 1; (1 − α_t) Γ_{t−1}, t ≥ 2. } (3.14)

Then, we have for any t ≥ 1,

Ψ(x^ag_t) − Ψ* ≤ (Γ_t γ_1 / 2) ‖x_0 − x*‖², (3.15)

where x* is an optimal solution of (3.6).

By properly choosing the stepsize parameters α_t and γ_t, we show that the above AC-GD algorithm can achieve the optimal rate of convergence for solving problem (3.6).

Corollary 3 Let {x^ag_t}_{t≥1} be computed by the AC-GD algorithm with

α_t = max{ √(µ/L), 2/(t+1) }  and  γ_t = 2 L Γ_t, (3.16)

where Γ_t is defined in (3.14). Then we have

Ψ(x^ag_t) − Ψ* ≤ min{ (1 − √(µ/L))^{t−1}, 2/(t(t+1)) } L ‖x_0 − x*‖², ∀ t ≥ 1. (3.17)

Proof. The result follows by plugging the values of α_t and γ_t into (3.15) and noting that

Γ_t ≤ min{ (1 − √(µ/L))^{t−1}, Π_{τ=2}^t (1 − 2/(τ+1)) } = min{ (1 − √(µ/L))^{t−1}, 2/(t(t+1)) }.

3.2 The AC-GD algorithm for total variation minimization

In this subsection, we discuss how to apply the above AC-GD algorithm to the reformulated TV minimization problem (2.4). First, observe that the objective function φ̃(·) in (2.4) can be written in the composite form φ̃(u, d) = T̃(d) + ψ(u, d), where ψ(u, d) is given by

ψ(u, d) := (λ/2) ‖ A (u; d) − (f; 0) ‖²,  A := [ I_u  0 ; √q E  √q I_d ]. (3.18)

Here I_u ∈ R^{N²×N²} and I_d ∈ R^{2N(N−1)×2N(N−1)} are identity matrices. Proposition 4 below summarizes some properties of ψ(·).

Proposition 4 The function ψ(·) in (3.18) is strongly convex with modulus

µ_ψ ≥ λ / (10 + 1/q). (3.19)

Moreover, its gradient is Lipschitz continuous with constant

L_ψ ≤ λ (1 + 12q). (3.20)

In view of the composite structure of φ̃(·) and Proposition 4, we can apply the AC-GD algorithm to problem (2.4). Moreover, since A is very sparse, the computation of the gradient of ψ(·) takes only O(N²) arithmetic operations.

Second, it is worth noting that the subproblem (3.10) arising from the AC-GD method applied to problem (2.4) is easy to solve. Indeed, the subproblem (3.10) takes the form

(u, d)_t = argmin_{u,d} { ⟨c_1, d⟩ + T̃(d) + (p/2) ‖d − d^+_t‖² + ⟨c_2, u⟩ + (p/2) ‖u − u^+_t‖² } (3.21)

for some p > 0, c_1 ∈ R^{2N(N−1)} and c_2 ∈ R^{N²}. Suppose that the norm ‖·‖ in the definition of T̃(·) is the l2 norm. By examining the optimality condition of problem (3.21), we have the following explicit formulas (see Section A.4 for more details):

u_t = (1/p) (p u^+_t − c_2), (3.22)

and

d_{t,ij} = 0, if ‖p d^+_{t,ij} − c_{1,ij}‖ ≤ 1;
d_{t,ij} = [(‖p d^+_{t,ij} − c_{1,ij}‖ − 1) / (p ‖p d^+_{t,ij} − c_{1,ij}‖)] (p d^+_{t,ij} − c_{1,ij}), if ‖p d^+_{t,ij} − c_{1,ij}‖ > 1; (3.23)

where d_{t,ij} = ((d_t)^1_{ij}, (d_t)^2_{ij}), d^+_{t,ij} = ((d^+_t)^1_{ij}, (d^+_t)^2_{ij}) and c_{1,ij} = (c^1_{1,ij}, c^2_{1,ij}). Also note that one can write explicit solutions of (3.21) if ‖·‖ = ‖·‖_1. In both cases, solving the subproblem (3.10) requires only O(N²) arithmetic operations.
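The closed-form solution (3.22)-(3.23) amounts to a simple componentwise update for u and a group soft-thresholding of the pairs (d^1_{ij}, d^2_{ij}). A minimal NumPy sketch is given below; the function name and the layout of the pairs as rows of a 2 × m array are our own conventions, not the paper's.

import numpy as np

def solve_subproblem(c1, c2, d_plus, u_plus, p):
    # Closed-form solution of (3.21) for the isotropic (l2) norm.
    # c1, d_plus: arrays of shape (2, m) holding the pairs (d^1_{ij}, d^2_{ij});
    # c2, u_plus: arrays of shape (N, N).  Returns (u_t, d_t) as in (3.22)-(3.23).
    u_t = u_plus - c2 / p                       # (3.22): u_t = (1/p)(p*u_plus - c2)
    w = p * d_plus - c1                         # p*d^+_{t,ij} - c_{1,ij}, shape (2, m)
    norms = np.sqrt(np.sum(w**2, axis=0))       # ||p*d^+_{t,ij} - c_{1,ij}||
    scale = np.maximum(norms - 1.0, 0.0) / (p * np.maximum(norms, 1e-12))
    d_t = scale * w                             # (3.23): group soft-thresholding per pair
    return u_t, d_t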

We are now ready to state our main results.

Theorem 5 Let (u_0, d_0) be an initial point of the AC-GD algorithm applied to problem (2.4), and let D_0 := ‖(u_0, d_0) − (u*, d*)‖². Also assume that the parameter q ≥ 1 and that the stepsize parameters (α_t, γ_t), t ≥ 1, are set according to (3.16). Then, the iteration complexity of the AC-GD algorithm for finding an ɛ-solution of (2.4) can be bounded by

O( min{ √q ln(λ q D_0 / ɛ), √(λ q D_0 / ɛ) } ). (3.24)

Moreover, its arithmetic complexity can be bounded by

O( N² min{ √q ln(λ q D_0 / ɛ), √(λ q D_0 / ɛ) } ). (3.25)

Proof. The bound (3.24) follows immediately from Corollary 3, Proposition 4 and the observation that L_ψ/µ_ψ = O(q) when q ≥ 1. The bound (3.25) follows from (3.24) and the fact that the number of arithmetic operations in each iteration of the algorithm is bounded by O(N²).

Observe that the complexity of FISTA applied to problem (2.4) is O(√(λ q D_0 / ɛ)) (see [1, 2]), which is strictly worse than the bound in (3.24). In particular, if q is a given constant then, in view of Theorem 5, the complexity of the AC-GD algorithm depends only weakly on the accuracy ɛ, the parameter λ, and the distance D_0 (and thus on the dimension of the problem). Moreover, its total arithmetic complexity is polynomial, with a mild linear dependence on the problem dimension N².

4 Numerical Results and Biomedical Applications

In this section, we report our preliminary computational results, in which we compare our reformulation (RTVM) in (2.4) with the original TVM (OTVM) model in (2.2) for image denoising. We also compare the performance of two first-order algorithms for composite optimization, FISTA and AC-GD, applied to our reformulation. Furthermore, we discuss the application of the developed techniques to certain biomedical image denoising problems.

4.1 Numerical Study on General Image Denoising Problems

In this subsection, we conduct numerical experiments on a few classical image denoising problems. In our first experiment, we show that the reformulated TVM model is comparable to the original model in terms of the quality of the denoised images. Two image instance sets were used in this experiment. In the first instance set, we take the Lena test image, whose pixels were scaled between 0 and 1. The noisy image is obtained by adding white Gaussian noise with zero mean and various standard deviations (σ). In the second one, we use the Peppers test images with different sizes, and the noise is added in the same way as in the first set, with a fixed standard deviation. The original and noisy images for both instance sets are given in Figure 1.

We set the parameter λ to 6 for both formulations (2.2) and (2.4). We solve the original TVM model by using an efficient primal-dual algorithm [4], and we apply the AC-GD algorithm to the reformulated TVM model with different values of q. We then report the best (largest) value of the Peak Signal-to-Noise Ratio (PSNR) obtained after 100 iterations for both approaches. The results for these two instance sets are reported in Table 1 and Table 2, respectively. Moreover, Figure 2 and Figure 3, respectively, show the denoised Lena and Peppers images obtained from solving the original TVM model and the reformulated TVM model with q = 2 and q = 4.

Table 1: PSNR of the denoised Lena images obtained by the OTVM and RTVM formulations, for several noise levels σ; the RTVM results are reported for q = 0.5, 1, 2, 4, 8 and 16.
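The PSNR values in the tables are presumably the standard peak signal-to-noise ratio for images scaled to [0, 1]; a minimal sketch of this metric (our own helper, not taken from the paper):

import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((x - ref)**2)
    return 10.0 * np.log10(peak**2 / mse)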

Figure 1: Original and noisy Lena and Peppers test images.

Figure 2: Denoised Lena images obtained by the original TVM model (λ = 6) and by the reformulated TVM model (λ = 6, with q = 2 and q = 4).

Table 2: PSNR of the denoised Peppers images obtained by the OTVM and RTVM formulations, for several image sizes N; the RTVM results are reported for q = 0.5, 1, 2, 4, 8 and 16.

Figure 3: Denoised Peppers images obtained by the original TVM model (λ = 6) and by the reformulated TVM model (λ = 6, with q = 2 and q = 4).

It can be seen from Tables 1 and 2 that the PSNR values obtained from the reformulated TVM model are not too sensitive to the choice of the parameter q under different selections of the noise level σ and the image size N. We can set q = 2 or q = 4 in practice to achieve reasonably good solution quality. We also observe that the quality of the denoised images obtained by the reformulated TVM model is comparable to that obtained by the original model. In fact, at first glance, the denoised Lena image obtained with the original TVM model seems cleaner than those obtained with the reformulated model. However, a closer examination reveals that some undesirable oversmoothing effects, e.g., the disappearance of the texture on the hat and a few extra lines at the nose of the Lena image in Figure 2, were introduced by the original TVM model. On the other hand, these oversmoothing effects were not apparent in the denoised images obtained with the reformulated model. Moreover, no significant differences could be observed between the denoised Peppers images obtained with the original and the reformulated TVM models.

In our second experiment, we demonstrate that AC-GD is faster than FISTA for solving the composite minimization problem (2.4). By our discussion in Section 3.2, the convergence rate of AC-GD always dominates that of FISTA for solving strongly convex composite optimization problems. Our goal here is to verify this claim numerically. Figure 4 shows the convergence behavior of AC-GD and FISTA applied to the Lena image. More specifically, we report the optimality gap φ̃(u_k, d_k) − φ̃* for both algorithms, where the optimal value φ̃* was estimated by running FISTA for 10,000 iterations. As shown in Figure 4, after only 50 iterations the AC-GD method already achieves a very small optimality gap. It can also be easily seen from Figure 4 that AC-GD converges linearly while FISTA converges sublinearly. This indeed reflects the difference between the theoretical convergence rates of the two algorithms applied to problem (2.4).

Figure 4: AC-GD vs. FISTA applied to denoising the Lena image: optimality gap φ̃(u_k, d_k) − φ̃* versus the iteration count k.

4.2 Applications in Biomedical Image Denoising

In this subsection, we apply the developed reformulation for TVM to magnetic resonance imaging (MRI), which provides detailed information about internal structures of the body. In comparison with other medical imaging techniques, MRI is most useful for brain and muscle imaging. Two image instance sets were used in this experiment. In the first instance set, we use the Brain MRI test image, and the noisy images are obtained by adding white Gaussian noise with zero mean and various standard deviations (σ). In the second one, we use the Knee MRI test images with different sizes and a fixed noise level. We applied the AC-GD algorithm with the same settings as in the previous experiment (λ = 6 and q = 0.5, 1, 2, 4, 8, 16) to solve the RTVM model for these two instance sets. As shown in Figures 5 and 6 and Tables 3 and 4, the results obtained for the MRI images are consistent with those in Section 4.1.

Table 3: PSNR of the denoised Brain MRIs obtained by the OTVM and RTVM formulations, for several noise levels σ; the RTVM results are reported for q = 0.5, 1, 2, 4, 8 and 16.

Figure 5: Original, noisy and denoised Knee MRI (λ = 6, with q = 2 and q = 4).

Figure 6: Original, noisy and denoised Brain MRI (λ = 6, with q = 2 and q = 4).

Table 4: PSNR of the denoised Knee MRIs obtained by the OTVM and RTVM formulations, for several image sizes N; the RTVM results are reported for q = 0.5, 1, 2, 4, 8 and 16.

5 Conclusions

In this paper, we introduced a strongly convex composite reformulation of the ROF model, a well-known approach in biomedical image processing. We showed that this reformulation is comparable with the original ROF model in terms of the quality of the denoised images. We presented a first-order algorithm that possesses a linear rate of convergence for solving the reformulated problem and is scalable to large-scale image denoising problems. We demonstrated through our numerical experiments that the developed algorithm, when applied to the reformulated model, compares favorably with existing first-order algorithms applied to the original model. In the future, we would like to generalize these reformulations to other image processing problems such as image deconvolution and image zooming.

References

[1] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, 2009.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] A. Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20:89–97, 2004.

[4] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.

[5] T.F. Chan, G.H. Golub, and P. Mulet. A nonlinear primal-dual method for total variation-based image restoration. SIAM Journal on Scientific Computing, 20(6):1964–1977, 1999.

[6] P.L. Combettes and J. Luo. An adaptive level set method for nondifferentiable constrained image recovery. IEEE Transactions on Image Processing, 2002.

[7] J. Dahl, P.C. Hansen, S.H. Jensen, and T.L. Jensen. Algorithms and software for total variation image reconstruction via first-order methods. Numerical Algorithms, 53:67–92, 2010.

[8] F. Dibos and G. Koepfler. Global total variation minimization. SIAM Journal on Numerical Analysis, 37(2):646–664, 2000.

[9] H.Y. Fu, M.K. Ng, M. Nikolova, and J.L. Barlow. Efficient minimization methods of mixed l2-l1 and l1-l1 norms for image restoration. SIAM Journal on Scientific Computing, 27(6):1881–1902, 2006.

[10] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. Manuscript, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA, 2010. Submitted to SIAM Journal on Optimization.

[11] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. Manuscript, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA, 2010. Submitted to SIAM Journal on Optimization.

[12] D. Goldfarb and W. Yin. Second-order cone programming methods for total variation-based image restoration. SIAM Journal on Scientific Computing, 27:622–645, 2005.

[13] T. Goldstein and S. Osher. The split Bregman algorithm for l1 regularized problems. UCLA CAM Report 08-29, April 2008.

[14] Y.E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269:543–547, 1983.

[15] Y.E. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16:235–249, 2005.

[16] Y.E. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.

[17] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.

[18] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Manuscript, University of Washington, Seattle, May 2008.

[19] Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.

Appendix

We provide the proofs of Proposition 1, Theorem 2 and Proposition 4 in Sections A.1, A.2 and A.3, respectively. We also discuss how to solve the subproblems arising from the AC-GD algorithm applied to problem (2.4) in Section A.4.

A.1 Relation between the two formulations (Proposition 1)

In this subsection, we provide the proof of Proposition 1, which shows how the optimal values of problems (2.2) and (2.4) are related.

Proof of Proposition 1: First note that, by the definitions of T(·) and T̃(·), we have

T̃(d) = T(u), if d = −Eu. (5.26)

Let ū be an optimal solution of (2.2) and d̄ = −Eū. Using (2.2), (2.4) and (5.26), we have

φ̃* ≤ φ̃(ū, d̄) = T̃(−Eū) + (λ/2)[ ‖ū − f‖² + q ‖Eū − Eū‖² ] = T(ū) + (λ/2) ‖ū − f‖² = φ*.

We now show the second relation in (2.5). Let (u*, d*) be an optimal solution of (2.4); then

φ* ≤ φ(u*) = T(u*) + (λ/2) ‖u* − f‖².

Observe that, by the definition of T̃(·), we have

T̃(d + δ) = Σ_{i,j} ‖d_{ij} + δ_{ij}‖ ≤ Σ_{i,j} ‖d_{ij}‖ + Σ_{i,j} ‖δ_{ij}‖ ≤ T̃(d) + N ‖δ‖,

where the last inequality follows from the Cauchy–Schwarz inequality. It then follows from (5.26) and the above conclusion that

T(u*) = T̃(−Eu*) = T̃(d* − (d* + Eu*)) ≤ T̃(d*) + N ‖d* + Eu*‖.

Therefore,

φ* ≤ T̃(d*) + N ‖d* + Eu*‖ + (λ/2) ‖u* − f‖²
 = T̃(d*) + (λ/2)[ ‖u* − f‖² + q ‖d* + Eu*‖² ] + N ‖d* + Eu*‖ − (λq/2) ‖d* + Eu*‖²
 = φ̃(u*, d*) + N ‖d* + Eu*‖ − (λq/2) ‖d* + Eu*‖²
 ≤ φ̃* + N²/(2λq),

where the last relation follows from Young's inequality.
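For completeness, the last step can be spelled out as the following elementary quadratic-completion bound, with a := ‖d* + Eu*‖; the constant matches (2.5) as reconstructed above.

\[
  N a - \tfrac{\lambda q}{2}\, a^{2}
  \;=\; \tfrac{N^{2}}{2\lambda q} - \tfrac{\lambda q}{2}\Bigl(a - \tfrac{N}{\lambda q}\Bigr)^{2}
  \;\le\; \tfrac{N^{2}}{2\lambda q},
  \qquad a \ge 0 .
\]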

A.2 Convergence analysis for the AC-GD algorithm (Theorem 2)

Our main goal in this subsection is to prove the convergence results of the AC-GD algorithm described in Theorem 2. We first establish two technical results: Lemma 6 states some properties of the subproblem (3.10), and Lemma 7 establishes an important recursion of the AC-GD algorithm. Theorem 2 then follows directly from Lemma 7.

The first technical result below characterizes the solution of the projection step (3.10).

Lemma 6 Let X be a convex set and p : X → R be a convex function. Assume that û is an optimal solution of min{ p(u) + (µ̃/2) ‖x̃ − u‖² : u ∈ X }, where x̃ ∈ X and µ̃ > 0 are given. Then, for any u ∈ X,

p(û) + (µ̃/2) ‖x̃ − û‖² + (µ̃/2) ‖û − u‖² ≤ p(u) + (µ̃/2) ‖x̃ − u‖². (5.27)

Proof. Denote g(u) := p(u) + (µ̃/2) ‖x̃ − u‖². The result immediately follows from the strong convexity of g(u) and the optimality condition ⟨g'(û), u − û⟩ ≥ 0 for any u ∈ X.

The following lemma establishes an important recursion for the AC-GD algorithm.

Lemma 7 Let (x_{t−1}, x^ag_{t−1}) ∈ X × X be given. Also let (x^md_t, x_t, x^ag_t) ∈ X × X × X be computed according to (3.8)–(3.11), and suppose that (3.12) holds for the given γ_t and α_t. Then, for any x ∈ X, we have

Ψ(x^ag_t) + (µ/2) ‖x_t − x‖² ≤ (1 − α_t)[ Ψ(x^ag_{t−1}) + (µ/2) ‖x_{t−1} − x‖² ] + α_t Ψ(x) + (γ_t/2)[ ‖x_{t−1} − x‖² − ‖x_t − x‖² ]. (5.28)

Proof. We first establish some basic relations among the search points x^ag_{t−1}, x^md_t, x_t and x^+_t. Denote d_t := x^ag_t − x^md_t. It follows from (3.8), (3.11) and (3.9) that

d_t = α_t x_t + (1 − α_t) x^ag_{t−1} − x^md_t
 = α_t [ x_t − (α_t µ / (µ + γ_t)) x^md_t − (((1 − α_t)µ + γ_t) / (µ + γ_t)) x_{t−1} ]
 = α_t (x_t − x^+_t). (5.29)

Using the above result and the convexity of ψ, we have

ψ(x^md_t) + ⟨ψ'(x^md_t), d_t⟩
 = ψ(x^md_t) + ⟨ψ'(x^md_t), α_t x_t + (1 − α_t) x^ag_{t−1} − x^md_t⟩
 = (1 − α_t)[ ψ(x^md_t) + ⟨ψ'(x^md_t), x^ag_{t−1} − x^md_t⟩ ] + α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x_t − x^md_t⟩ ]
 ≤ (1 − α_t) ψ(x^ag_{t−1}) + α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x_t − x^md_t⟩ ]. (5.30)

It then follows from the previous two observations, (3.7), (3.11) and the convexity of 𝒳(x) that

Ψ(x^ag_t) = ψ(x^ag_t) + 𝒳(x^ag_t)
 ≤ ψ(x^md_t) + ⟨ψ'(x^md_t), d_t⟩ + (L/2) ‖d_t‖² + (1 − α_t) 𝒳(x^ag_{t−1}) + α_t 𝒳(x_t)
 ≤ (1 − α_t) Ψ(x^ag_{t−1}) + α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x_t − x^md_t⟩ + 𝒳(x_t) ] + (L/2) ‖d_t‖²
 = (1 − α_t) Ψ(x^ag_{t−1}) + α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x_t − x^md_t⟩ + 𝒳(x_t) ]
   + [(µ + γ_t)/(2α_t²)] ‖d_t‖² − [(µ + γ_t − Lα_t²)/(2α_t²)] ‖d_t‖². (5.31)

Now let us apply the result regarding the projection step (3.10). Specifically, by using Lemma 6 with p(·) = α_t [⟨ψ'(x^md_t), ·⟩ + 𝒳(·)], û = x_t, and x̃ = x^+_t, we have, for any x ∈ X,

[(µ + γ_t)/2] ‖x_t − x‖² + α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x_t − x^md_t⟩ + 𝒳(x_t) ] + [(µ + γ_t)/2] ‖x_t − x^+_t‖²
 ≤ α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x − x^md_t⟩ + 𝒳(x) ] + [(µ + γ_t)/2] ‖x − x^+_t‖²
 ≤ α_t [ ψ(x^md_t) + ⟨ψ'(x^md_t), x − x^md_t⟩ + 𝒳(x) ] + (α_t µ/2) ‖x − x^md_t‖² + [((1 − α_t)µ + γ_t)/2] ‖x − x_{t−1}‖²
 ≤ α_t Ψ(x) + [((1 − α_t)µ + γ_t)/2] ‖x − x_{t−1}‖², (5.32)

where the second inequality follows from (3.9) and the convexity of ‖·‖², and the last inequality follows from the strong convexity of ψ(·). Combining (5.31) and (5.32), we obtain

Ψ(x^ag_t) ≤ (1 − α_t)[ Ψ(x^ag_{t−1}) + (µ/2) ‖x − x_{t−1}‖² ] + α_t Ψ(x) + (γ_t/2)[ ‖x − x_{t−1}‖² − ‖x − x_t‖² ]
   − (µ/2) ‖x − x_t‖² − [(µ + γ_t − Lα_t²)/(2α_t²)] ‖d_t‖²,

which clearly implies (5.28) in view of the assumption (3.12).

We are now ready to prove Theorem 2.

Proof of Theorem 2: Dividing both sides of (5.28) by Γ_t, and using (3.14) and the fact that α_1 = 1, we have

[ Ψ(x^ag_t) + (µ/2) ‖x − x_t‖² ] / Γ_t ≤ [ Ψ(x^ag_{t−1}) + (µ/2) ‖x − x_{t−1}‖² ] / Γ_{t−1} + (α_t/Γ_t) Ψ(x) + [γ_t/(2Γ_t)] [ ‖x − x_{t−1}‖² − ‖x − x_t‖² ], t ≥ 2,

and

[ Ψ(x^ag_1) + (µ/2) ‖x − x_1‖² ] / Γ_1 ≤ (α_1/Γ_1) Ψ(x) + [γ_1/(2Γ_1)] [ ‖x − x_0‖² − ‖x − x_1‖² ].

Summing up the above inequalities, we obtain

[ Ψ(x^ag_t) + (µ/2) ‖x − x_t‖² ] / Γ_t ≤ Σ_{τ=1}^t (α_τ/Γ_τ) Ψ(x) + Σ_{τ=1}^t [γ_τ/(2Γ_τ)] [ ‖x − x_{τ−1}‖² − ‖x − x_τ‖² ]. (5.33)

Note that by (3.14) and the fact that α_1 = 1, we have

Σ_{τ=1}^t α_τ/Γ_τ = 1/Γ_1 + Σ_{τ=2}^t (1/Γ_τ)(1 − Γ_τ/Γ_{τ−1}) = 1/Γ_1 + Σ_{τ=2}^t (1/Γ_τ − 1/Γ_{τ−1}) = 1/Γ_t. (5.34)

Using the above two relations, condition (3.13) and the fact that Γ_1 = 1, we have

[ Ψ(x^ag_t) + (µ/2) ‖x − x_t‖² ] / Γ_t ≤ Ψ(x)/Γ_t + [γ_1/(2Γ_1)] [ ‖x − x_0‖² − ‖x − x_t‖² ] ≤ Ψ(x)/Γ_t + (γ_1/2) ‖x − x_0‖²,

which clearly implies (3.15).

A.3 Properties of the composite function (Proposition 4)

In this subsection, we provide the proof of Proposition 4, which gives estimates of the two crucial parameters µ_ψ and L_ψ for the smooth component ψ(·) of the composite function φ̃(·).

Proof of Proposition 4: Denote the maximum and minimum eigenvalues of M := AᵀA by λ_max and λ_min, respectively. Then, it suffices to show that

λ_max ≤ 1 + 12q (5.35)

and

λ_min ≥ 1/(10 + 1/q). (5.36)

We bound the eigenvalues of M by using Gershgorin's theorem. Observe that the network flow matrix E in (2.3) can be written explicitly in block form in terms of the following matrices: e ∈ R^{N−1} is the unit vector (1, 0, ..., 0)ᵀ; K ∈ R^{(N−1)×(N−1)} is the two-diagonal lower triangular matrix with main diagonal entries equal to 1 and sub-diagonal entries equal to −1; P_i, 1 ≤ i ≤ N, is the matrix whose ith column equals e and whose other entries equal 0; and L_{i,j} ∈ R^{(N−1)×(N−1)}, 1 ≤ i, j ≤ N, is the matrix whose ith column equals the jth column of K and whose other entries equal 0.

First, we derive the upper bound on the maximum eigenvalue of M. It is easy to see that a row of M corresponding to an interior pixel has the largest sum of absolute values of its entries, and for such a row j

Σ_{i=1}^{N²+2N(N−1)} |M_{j,i}| = 1 + 12q,

which, in view of Gershgorin's theorem, clearly implies (5.35).

Second, we derive the lower bound on the minimum eigenvalue of M. Since λ is an eigenvalue of M if and only if 1/λ is an eigenvalue of M⁻¹, we bound the maximum eigenvalue of M⁻¹ instead of the minimum eigenvalue of M.

Note that M⁻¹ = A⁻¹(A⁻¹)ᵀ and that, by applying Gauss–Jordan elimination, we easily obtain

A⁻¹ = [ I_u  0 ; −E  (1/√q) I_d ].

It is easy to see that a row of M⁻¹ corresponding to an interior arc has the largest sum of absolute values of its entries. In particular, for such a row j we have

Σ_{i=1}^{N²+2N(N−1)} |M⁻¹_{j,i}| = 10 + 1/q,

which, in view of Gershgorin's theorem, clearly implies that λ_max(M⁻¹) ≤ 10 + 1/q. Using the above observation and the fact that λ_min(M) = 1/λ_max(M⁻¹), we obtain (5.36).
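These two Gershgorin bounds are easy to check numerically on small instances. The sketch below is our own code, with the constants as reconstructed in (5.35)-(5.36) and our own sign and vectorization convention for E; it simply prints the extreme eigenvalues of AᵀA next to the claimed bounds.

import numpy as np
import scipy.sparse as sp

def check_eigenvalue_bounds(N=8, q=2.0):
    # Build E as in (2.3) for a small N x N grid, then A as in (3.18) with lambda = 1.
    I = sp.identity(N, format="csr")
    D = sp.diags([-np.ones(N - 1), np.ones(N - 1)], [0, 1], shape=(N - 1, N))
    E = -sp.vstack([sp.kron(I, D), sp.kron(D, I)])
    m = E.shape[0]                                   # 2N(N-1) arcs
    A = sp.bmat([[sp.identity(N * N), None],
                 [np.sqrt(q) * E, np.sqrt(q) * sp.identity(m)]])
    M = (A.T @ A).toarray()
    eigs = np.linalg.eigvalsh(M)
    print("lambda_max =", eigs.max(), "  claimed bound 1 + 12q =", 1 + 12 * q)
    print("lambda_min =", eigs.min(), "  claimed bound 1/(10 + 1/q) =", 1 / (10 + 1 / q))

check_eigenvalue_bounds()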

A.4 Solving the subproblem (3.21) in the AC-GD algorithm

In this subsection, we derive the explicit solution of the subproblem (3.21) in step 2 of the AC-GD algorithm. It is easy to see that (3.22) holds. By the definition of T̃(d), d_{t,ij} is the solution of

min_{d_{ij}} ⟨c_{1,ij}, d_{ij}⟩ + ‖d_{ij}‖ + (p/2) ‖d_{ij} − d^+_{t,ij}‖²,

where d_{ij} = (d^1_{ij}, d^2_{ij}), d^+_{t,ij} = ((d^+_t)^1_{ij}, (d^+_t)^2_{ij}) and c_{1,ij} = (c^1_{1,ij}, c^2_{1,ij}). For notational convenience, let us consider the following problem:

min_y ⟨r, y⟩ + ‖y‖ + (p/2) ‖y − q‖², (5.37)

where y, q, r ∈ R², p ∈ R and p > 0. We consider two cases.

Case 1: ‖pq − r‖ ≤ 1. We have

⟨r, y⟩ + ‖y‖ + (p/2) ‖y − q‖²
 = ‖y‖ + (p/2) ‖y‖² + (p/2) ‖q‖² − ⟨pq, y⟩ + ⟨r, y⟩
 = ‖y‖ + (p/2) ‖y‖² + (p/2) ‖q‖² − ⟨pq − r, y⟩
 ≥ ‖y‖ + (p/2) ‖y‖² + (p/2) ‖q‖² − ‖pq − r‖ ‖y‖
 ≥ (p/2) ‖y‖² + (p/2) ‖q‖²,

which implies that y = 0 is the solution of (5.37) when ‖pq − r‖ ≤ 1.

Case 2: ‖pq − r‖ > 1. By the optimality condition of (5.37), we have

y/‖y‖ + p y − p q + r = 0,

which means

y^1/‖y‖ + p y^1 − p q^1 + r^1 = 0,  y^2/‖y‖ + p y^2 − p q^2 + r^2 = 0.

Denoting t = ‖y‖, we then have

y^1 (1/t + p) = p q^1 − r^1,  y^2 (1/t + p) = p q^2 − r^2,

or equivalently,

y^1 = [t/(1 + tp)] (p q^1 − r^1),  y^2 = [t/(1 + tp)] (p q^2 − r^2).

Combining the above relation with t = ‖y‖, we have

[t/(1 + tp)] ‖pq − r‖ = t,  or equivalently,  t = (1/p)(‖pq − r‖ − 1),

which immediately implies that the optimal solution of (5.37) is given by

y = [(‖pq − r‖ − 1)/(p ‖pq − r‖)] (pq − r).

Replacing y, q and r by d_{ij}, d^+_{t,ij} and c_{1,ij}, respectively, we obtain (3.23). It is worth noting that this formula still holds in the case y, q, r ∈ R.
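As a quick sanity check of the closed-form solution of (5.37), the short sketch below (our own code; the function shrink and the random test instance are not from the paper) compares it against a general-purpose numerical minimizer.

import numpy as np
from scipy.optimize import minimize

def shrink(r, q, p):
    """Closed-form minimizer of <r,y> + ||y|| + (p/2)||y - q||^2 derived above."""
    w = p * q - r
    nw = np.linalg.norm(w)
    return np.zeros_like(q) if nw <= 1 else (nw - 1) / (p * nw) * w

# Compare against a general-purpose solver on a random instance.
rng = np.random.default_rng(0)
r, q, p = rng.normal(size=2), rng.normal(size=2), 1.5
obj = lambda y: r @ y + np.linalg.norm(y) + 0.5 * p * np.sum((y - q)**2)
y_num = minimize(obj, x0=np.zeros(2), method="Nelder-Mead",
                 options={"xatol": 1e-10, "fatol": 1e-12}).x
print(shrink(r, q, p), y_num)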


NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Adaptive Primal-Dual Hybrid Gradient Methods for Saddle-Point Problems

Adaptive Primal-Dual Hybrid Gradient Methods for Saddle-Point Problems Adaptive Primal-Dual Hybrid Gradient Methods for Saddle-Point Problems Tom Goldstein, Min Li, Xiaoming Yuan, Ernie Esser, Richard Baraniuk arxiv:3050546v2 [mathna] 24 Mar 205 Abstract The Primal-Dual hybrid

More information

arxiv: v1 [math.oc] 3 Jul 2014

arxiv: v1 [math.oc] 3 Jul 2014 SIAM J. IMAGING SCIENCES Vol. xx, pp. x c xxxx Society for Industrial and Applied Mathematics x x Solving QVIs for Image Restoration with Adaptive Constraint Sets F. Lenzen, J. Lellmann, F. Becker, and

More information

A primal-dual fixed point algorithm for multi-block convex minimization *

A primal-dual fixed point algorithm for multi-block convex minimization * Journal of Computational Mathematics Vol.xx, No.x, 201x, 1 16. http://www.global-sci.org/jcm doi:?? A primal-dual fixed point algorithm for multi-block convex minimization * Peijun Chen School of Mathematical

More information

arxiv: v4 [math.oc] 29 Jan 2018

arxiv: v4 [math.oc] 29 Jan 2018 Noname manuscript No. (will be inserted by the editor A new primal-dual algorithm for minimizing the sum of three functions with a linear operator Ming Yan arxiv:1611.09805v4 [math.oc] 29 Jan 2018 Received:

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Primal-dual first-order methods with O(1/ɛ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ɛ) iteration-complexity for cone programming Math. Program., Ser. A (2011) 126:1 29 DOI 10.1007/s10107-008-0261-6 FULL LENGTH PAPER Primal-dual first-order methods with O(1/ɛ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato

More information

A Variational Approach to Reconstructing Images Corrupted by Poisson Noise

A Variational Approach to Reconstructing Images Corrupted by Poisson Noise J Math Imaging Vis c 27 Springer Science + Business Media, LLC. Manufactured in The Netherlands. DOI: 1.7/s1851-7-652-y A Variational Approach to Reconstructing Images Corrupted by Poisson Noise TRIET

More information

An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems

An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems Kim-Chuan Toh Sangwoon Yun March 27, 2009; Revised, Nov 11, 2009 Abstract The affine rank minimization

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Image Cartoon-Texture Decomposition and Feature Selection using the Total Variation Regularized L 1 Functional

Image Cartoon-Texture Decomposition and Feature Selection using the Total Variation Regularized L 1 Functional Image Cartoon-Texture Decomposition and Feature Selection using the Total Variation Regularized L 1 Functional Wotao Yin 1, Donald Goldfarb 1, and Stanley Osher 2 1 Department of Industrial Engineering

More information

Math 273a: Optimization Overview of First-Order Optimization Algorithms

Math 273a: Optimization Overview of First-Order Optimization Algorithms Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization

More information

UPRE Method for Total Variation Parameter Selection

UPRE Method for Total Variation Parameter Selection UPRE Method for Total Variation Parameter Selection Youzuo Lin School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85287 USA. Brendt Wohlberg 1, T-5, Los Alamos National

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Sparse Optimization Lecture: Dual Certificate in l 1 Minimization

Sparse Optimization Lecture: Dual Certificate in l 1 Minimization Sparse Optimization Lecture: Dual Certificate in l 1 Minimization Instructor: Wotao Yin July 2013 Note scriber: Zheng Sun Those who complete this lecture will know what is a dual certificate for l 1 minimization

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Strengthened Sobolev inequalities for a random subspace of functions

Strengthened Sobolev inequalities for a random subspace of functions Strengthened Sobolev inequalities for a random subspace of functions Rachel Ward University of Texas at Austin April 2013 2 Discrete Sobolev inequalities Proposition (Sobolev inequality for discrete images)

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease

More information

Iteration-complexity of first-order augmented Lagrangian methods for convex programming

Iteration-complexity of first-order augmented Lagrangian methods for convex programming Math. Program., Ser. A 016 155:511 547 DOI 10.1007/s10107-015-0861-x FULL LENGTH PAPER Iteration-complexity of first-order augmented Lagrangian methods for convex programming Guanghui Lan Renato D. C.

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

Convex Hodge Decomposition of Image Flows

Convex Hodge Decomposition of Image Flows Convex Hodge Decomposition of Image Flows Jing Yuan 1, Gabriele Steidl 2, Christoph Schnörr 1 1 Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing, University of Heidelberg,

More information

Uniqueness Conditions for A Class of l 0 -Minimization Problems

Uniqueness Conditions for A Class of l 0 -Minimization Problems Uniqueness Conditions for A Class of l 0 -Minimization Problems Chunlei Xu and Yun-Bin Zhao October, 03, Revised January 04 Abstract. We consider a class of l 0 -minimization problems, which is to search

More information