The Direct Extension of ADMM for Multi-block Convex Minimization Problems is Not Necessarily Convergent


Caihua Chen¹   Bingsheng He²   Yinyu Ye³   Xiaoming Yuan⁴

¹ International Center of Management Science and Engineering, School of Management and Engineering, Nanjing University, China. This author was supported in part by the Natural Science Foundation of Jiangsu Province and by an NSFC grant. chchen@nju.edu.cn.
² International Centre of Management Science and Engineering, School of Management and Engineering, and Department of Mathematics, Nanjing University, China. This author was supported by an NSFC grant and an MOEC fund. hebma@nju.edu.cn.
³ Department of Management Science and Engineering, School of Engineering, Stanford University, USA; and International Center of Management Science and Engineering, School of Management and Engineering, Nanjing University. This author was supported by an AFOSR grant. yyye@stanford.edu.
⁴ Department of Mathematics, Hong Kong Baptist University, Hong Kong, China. This author was supported partially by the General Research Fund from the Hong Kong Research Grants Council. xmyuan@hkbu.edu.hk.

December, 2013

Abstract. The alternating direction method of multipliers (ADMM) is now widely used in many fields, and its convergence was proved when two blocks of variables are alternately updated. It is strongly desirable and practically valuable to extend ADMM directly to the case of a multi-block convex minimization problem whose objective function is the sum of more than two separable convex functions. However, the convergence of this extension has been unresolved for a long time: neither an affirmative proof of convergence nor a counterexample showing its failure of convergence is known in the literature. In this paper we answer this long-standing open question: the direct extension of ADMM is not necessarily convergent. We present an example showing its failure of convergence, and a sufficient condition ensuring its convergence.

Keywords. Alternating direction method of multipliers, Convergence analysis, Convex programming, Sparse optimization, Low-rank optimization, Image processing, Statistical learning, Computer vision

1 Introduction

We consider the convex minimization model with linear constraints and an objective function which is the sum of three functions without coupled variables:

    min  θ_1(x_1) + θ_2(x_2) + θ_3(x_3)
    s.t. A_1 x_1 + A_2 x_2 + A_3 x_3 = b,                                                  (1.1)
         x_1 ∈ X_1,  x_2 ∈ X_2,  x_3 ∈ X_3,

where A_i ∈ R^{p×n_i} (i = 1, 2, 3), b ∈ R^p, X_i ⊆ R^{n_i} (i = 1, 2, 3) are closed convex sets, and θ_i : R^{n_i} → R (i = 1, 2, 3) are closed convex but not necessarily smooth functions. The solution set of (1.1) is assumed to be nonempty. The abstract model (1.1) captures many applications in diversified areas; see, e.g., the image alignment problem in [18], the robust principal component analysis model with noisy and incomplete data in [20], the latent variable Gaussian graphical model selection in [4, 17], and the quadratic discriminant analysis model in [16].
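To make the three-block structure of (1.1) concrete, consider a noisy low-rank-plus-sparse decomposition in the spirit of the model studied in [20] (the formulation below is a simplified sketch added here for illustration; the exact constraint sets in [20] may differ):

    min  ||X_1||_* + τ ||X_2||_1
    s.t. X_1 + X_2 + X_3 = C,    X_3 ∈ { Z : ||Z||_F ≤ δ },

where C is the observed data matrix. After vectorization this is an instance of (1.1) with θ_1 = ||·||_*, θ_2 = τ||·||_1, θ_3 ≡ 0, A_1 = A_2 = A_3 = I, b = vec(C), and X_3 the Frobenius-norm ball of radius δ. Each θ_i has a simple proximal mapping, and exploiting this per-block structure is precisely the motivation discussed next.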

Our discussion is inspired by the scenario where each function θ_i may have some specific properties and it is worth exploiting them in algorithmic design. This is often encountered in sparse and low-rank optimization models, such as the just-mentioned applications of (1.1). We thus do not consider the generic treatment in which the sum of the three functions is regarded as one general function and the advantageous properties of each individual θ_i are ignored or not fully used.

The alternating direction method of multipliers (ADMM) was originally proposed in [9] (see also [3, 7]), and it is now a benchmark for the following convex minimization model analogous to (1.1) but with only two blocks of functions and variables:

    min  θ_1(x_1) + θ_2(x_2)
    s.t. A_1 x_1 + A_2 x_2 = b,   x_1 ∈ X_1,  x_2 ∈ X_2.                                   (1.2)

Let

    L_A(x_1, x_2, λ) = θ_1(x_1) + θ_2(x_2) - λ^T (A_1 x_1 + A_2 x_2 - b) + (β/2) ||A_1 x_1 + A_2 x_2 - b||²      (1.3)

be the augmented Lagrangian function of (1.2), with λ ∈ R^p the Lagrange multiplier and β > 0 a penalty parameter. Then the iterative scheme of ADMM for (1.2) is

    (ADMM)  x_1^{k+1} = Argmin{ L_A(x_1, x_2^k, λ^k)  |  x_1 ∈ X_1 },                      (1.4a)
            x_2^{k+1} = Argmin{ L_A(x_1^{k+1}, x_2, λ^k)  |  x_2 ∈ X_2 },                  (1.4b)
            λ^{k+1} = λ^k - β (A_1 x_1^{k+1} + A_2 x_2^{k+1} - b).                         (1.4c)

The iterative scheme of ADMM embeds a Gauss-Seidel decomposition into each iteration of the augmented Lagrangian method in [14, 19]; thus the functions θ_1 and θ_2 are treated individually, and easier subproblems can be generated. This feature is very advantageous for a broad spectrum of applications such as partial differential equations, mechanics, image processing, statistical learning, computer vision, and so on. In fact, ADMM has recently witnessed a renaissance in many application domains after a long period without much attention. We refer to [2, 5, 8] for some review papers on ADMM.

With the same philosophy as ADMM of taking advantage of each θ_i's properties individually, it is natural to extend the original ADMM (1.4) for (1.2) directly to (1.1) and obtain the scheme

    (Extended ADMM)  x_1^{k+1} = Argmin{ L_A(x_1, x_2^k, x_3^k, λ^k)  |  x_1 ∈ X_1 },                  (1.5a)
                     x_2^{k+1} = Argmin{ L_A(x_1^{k+1}, x_2, x_3^k, λ^k)  |  x_2 ∈ X_2 },              (1.5b)
                     x_3^{k+1} = Argmin{ L_A(x_1^{k+1}, x_2^{k+1}, x_3, λ^k)  |  x_3 ∈ X_3 },          (1.5c)
                     λ^{k+1} = λ^k - β (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b),            (1.5d)

where

    L_A(x_1, x_2, x_3, λ) = Σ_{i=1}^{3} θ_i(x_i) - λ^T (A_1 x_1 + A_2 x_2 + A_3 x_3 - b) + (β/2) ||A_1 x_1 + A_2 x_2 + A_3 x_3 - b||²      (1.6)

is the augmented Lagrangian function of (1.1). This direct extension of ADMM is strongly desired and practically used by many users, see e.g. [18, 20]. The convergence of (1.5), however, has been ambiguous for a long time: there is neither an affirmative convergence proof nor a counterexample showing its failure of convergence in the literature.
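For concreteness, the following Python sketch spells out one pass of (1.5a)-(1.5d). It is not taken from the paper: it assumes quadratic terms θ_i(x_i) = (1/2)||x_i - c_i||² and X_i = R^{n_i}, so that every subproblem is an unconstrained linear solve; all names are illustrative.

    import numpy as np

    def extended_admm(A, c, b, beta=1.0, iters=100):
        """Direct extension of ADMM (1.5) for three quadratic blocks (a sketch)."""
        A1, A2, A3 = A
        c1, c2, c3 = c
        x1, x2, x3 = np.zeros_like(c1), np.zeros_like(c2), np.zeros_like(c3)
        lam = np.zeros_like(b)

        def block_min(Ai, ci, r):
            # argmin_x 0.5*||x - ci||^2 - lam^T(Ai x + r - b) + (beta/2)*||Ai x + r - b||^2
            H = np.eye(len(ci)) + beta * Ai.T @ Ai
            g = ci + Ai.T @ (lam - beta * (r - b))
            return np.linalg.solve(H, g)

        for _ in range(iters):
            x1 = block_min(A1, c1, A2 @ x2 + A3 @ x3)             # (1.5a)
            x2 = block_min(A2, c2, A1 @ x1 + A3 @ x3)             # (1.5b)
            x3 = block_min(A3, c3, A1 @ x1 + A2 @ x2)             # (1.5c)
            lam = lam - beta * (A1 @ x1 + A2 @ x2 + A3 @ x3 - b)  # (1.5d)
        return x1, x2, x3, lam

Each block update uses the most recent values of the other blocks, exactly as in (1.5); only the form of the subproblem solver depends on the assumed quadratic θ_i.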

This convergence ambiguity has inspired an active research topic of developing algorithms that are slightly twisted versions of (1.5) but with provable convergence, competitive numerical efficiency and iteration simplicity; see, e.g., [11, 12, 15]. Since the extended ADMM scheme (1.5) does work well for some applications (e.g. [18, 20]), users are inclined to believe that this scheme is convergent even though they are perplexed by the lack of a rigorous proof. In the literature there is very little hint at the difficulty of proving convergence for (1.5); see [5] for an insightful explanation. The main purpose of this paper is to answer this long-standing open question negatively: the extended ADMM scheme (1.5) is not necessarily convergent. We will give an example (and a strategy for constructing such an example) to demonstrate its failure of convergence, and show that the convergence of (1.5) can be guaranteed when any two coefficient matrices in (1.1) are orthogonal.

2 A Sufficient Condition Ensuring the Convergence of (1.5)

We first study a condition that can ensure the convergence of the direct extension of ADMM (1.5). Our methodology for constructing a counterexample showing the failure of convergence of (1.5) will also become clear through this study. Our claim is that the convergence of (1.5) is guaranteed when any two coefficient matrices in (1.1) are orthogonal. We thus discuss the cases A_1^T A_2 = 0, A_2^T A_3 = 0 and A_1^T A_3 = 0. This new condition does not impose any strong convexity on the objective function in (1.1), and it only requires checking the orthogonality of the coefficient matrices. It is therefore more easily verifiable than conditions in the literature such as the one in [10], which requires strong convexity of all functions in the objective and restricts the penalty parameter β to a specific range, and the one in [15], which requires attaching a sufficiently small shrinkage factor to the Lagrange-multiplier updating step (1.5d) such that a certain error-bound condition is satisfied.

2.1 Case 1: A_1^T A_2 = 0 or A_2^T A_3 = 0

We remark that if two coefficient matrices of (1.1) in consecutive order are orthogonal, i.e., A_1^T A_2 = 0 or A_2^T A_3 = 0, then the direct extension of ADMM (1.5) reduces to a special case of the original ADMM (1.4). Thus the convergence of (1.5) under this condition is implied by well-known results in the ADMM literature. To see this, let us first assume A_1^T A_2 = 0. According to the first-order optimality conditions of the minimization problems in (1.5), we have x_i^{k+1} ∈ X_i (i = 1, 2, 3) and

    θ_1(x_1) - θ_1(x_1^{k+1}) + (x_1 - x_1^{k+1})^T { -A_1^T [ λ^k - β(A_1 x_1^{k+1} + A_2 x_2^k + A_3 x_3^k - b) ] } ≥ 0,  ∀ x_1 ∈ X_1,            (2.1a)
    θ_2(x_2) - θ_2(x_2^{k+1}) + (x_2 - x_2^{k+1})^T { -A_2^T [ λ^k - β(A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^k - b) ] } ≥ 0,  ∀ x_2 ∈ X_2,         (2.1b)
    θ_3(x_3) - θ_3(x_3^{k+1}) + (x_3 - x_3^{k+1})^T { -A_3^T [ λ^k - β(A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b) ] } ≥ 0,  ∀ x_3 ∈ X_3.      (2.1c)

Then, because A_1^T A_2 = 0, it follows from (2.1) that

    θ_1(x_1) - θ_1(x_1^{k+1}) + (x_1 - x_1^{k+1})^T { -A_1^T [ λ^k - β(A_1 x_1^{k+1} + A_3 x_3^k - b) ] } ≥ 0,  ∀ x_1 ∈ X_1,                         (2.2a)
    θ_2(x_2) - θ_2(x_2^{k+1}) + (x_2 - x_2^{k+1})^T { -A_2^T [ λ^k - β(A_2 x_2^{k+1} + A_3 x_3^k - b) ] } ≥ 0,  ∀ x_2 ∈ X_2,                         (2.2b)
    θ_3(x_3) - θ_3(x_3^{k+1}) + (x_3 - x_3^{k+1})^T { -A_3^T [ λ^k - β(A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b) ] } ≥ 0,  ∀ x_3 ∈ X_3,      (2.2c)

which is also the first-order optimality condition of the scheme

    (x_1^{k+1}, x_2^{k+1}) = Argmin{ θ_1(x_1) + θ_2(x_2) - (λ^k)^T (A_1 x_1 + A_2 x_2) + (β/2) ||A_1 x_1 + A_2 x_2 + A_3 x_3^k - b||²  |  x_1 ∈ X_1, x_2 ∈ X_2 },      (2.3a)
    x_3^{k+1} = Argmin{ θ_3(x_3) - (λ^k)^T A_3 x_3 + (β/2) ||A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3 - b||²  |  x_3 ∈ X_3 },                                           (2.3b)
    λ^{k+1} = λ^k - β (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b).                                                                                             (2.3c)

Clearly, (2.3) is a specific application of the original ADMM (1.4) to (1.1), obtained by regarding (x_1, x_2) as one variable, [A_1, A_2] as one matrix and θ_1(x_1) + θ_2(x_2) as one function. Note that in (2.3) neither x_1^k nor x_2^k is required to generate the (k+1)-th iterate under the orthogonality condition A_1^T A_2 = 0. Existing convergence results for the original ADMM, such as those in [6, 13], thus hold for the special case of (1.5) with the orthogonality condition A_1^T A_2 = 0. A similar discussion can be carried out under the orthogonality condition A_2^T A_3 = 0.
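The mechanism behind Case 1 can also be seen numerically. The following small Python illustration (a sketch added here, not part of the paper) takes θ_2 ≡ 0 and X_2 = R^{n_2}, builds data with A_1^T A_2 = 0, and checks that the x_2-subproblem of (1.5b) is unaffected by the freshly computed x_1, which is exactly why (1.5) collapses into the grouped two-block scheme (2.3):

    import numpy as np

    rng = np.random.default_rng(1)
    A1 = rng.standard_normal((6, 2))
    A2 = rng.standard_normal((6, 2))
    A2 -= A1 @ np.linalg.lstsq(A1, A2, rcond=None)[0]   # project so that A1^T A2 = 0
    A3 = rng.standard_normal((6, 2))
    b, lam = rng.standard_normal(6), rng.standard_normal(6)
    x3, beta = rng.standard_normal(2), 1.0

    def x2_update(x1):
        # argmin_{x2} -lam^T A2 x2 + (beta/2) * ||A1 x1 + A2 x2 + A3 x3 - b||^2
        rhs = A2.T @ (lam - beta * (A1 @ x1 + A3 @ x3 - b))
        return np.linalg.solve(beta * A2.T @ A2, rhs)

    # Two unrelated x1 inputs give the same x2: A2^T annihilates the A1 x1 term.
    print(np.allclose(x2_update(rng.standard_normal(2)),
                      x2_update(rng.standard_normal(2))))   # prints True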

2.2 Case 2: A_1^T A_3 = 0

In the last subsection we discussed the cases where two consecutive coefficient matrices are orthogonal. Now we turn to the case where A_1^T A_3 = 0 and show that it can also ensure the convergence of (1.5). To prepare for the proof, we need to make a few things clear.

First, note that the update order of (1.5) at each iteration is x_1 → x_2 → x_3 → λ, and then it repeats cyclically. We can thus rewrite (1.5) in the form

    x_2^{k+1} = Argmin{ L_A(x_1^{k+1}, x_2, x_3^k, λ^k)  |  x_2 ∈ X_2 },                               (2.4a)
    x_3^{k+1} = Argmin{ L_A(x_1^{k+1}, x_2^{k+1}, x_3, λ^k)  |  x_3 ∈ X_3 },                           (2.4b)
    λ^{k+1} = λ^k - β (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b),                             (2.4c)
    x_1^{k+2} = Argmin{ L_A(x_1, x_2^{k+1}, x_3^{k+1}, λ^{k+1})  |  x_1 ∈ X_1 }.                       (2.4d)

According to (2.4), there is an update of the variable λ between the updates of x_3 and x_1. Thus the case A_1^T A_3 = 0 requires a discussion different from that in the last subsection. We will focus on this representation of (1.5) within this subsection.

Second, it is worth mentioning that the variable x_2 is not involved as an input of the iteration (2.4): the scheme (2.4) generates a new iterate based only on (x_1^{k+1}, x_3^k, λ^k). We thus follow the terminology in [11] and call x_2 an intermediate variable; correspondingly, we call (x_1, x_3, λ) essential variables because they are really necessary for executing the iteration (2.4). Accordingly, we use the notations

    w^k = (x_1^{k+1}, x_2^k, x_3^k, λ^k),    u^k = w^k \ λ^k = (x_1^{k+1}, x_2^k, x_3^k),    v^k = w^k \ x_2^k = (x_1^{k+1}, x_3^k, λ^k),
    v = w \ x_2 = (x_1, x_3, λ),    V = X_1 × X_3 × R^p,    V* := { v* = (x_1*, x_3*, λ*)  |  w* = (x_1*, x_2*, x_3*, λ*) ∈ Ω* }.

Note that the first element in w^k, u^k or v^k is x_1^{k+1} rather than x_1^k.

Third, it is useful to characterize the model (1.1) by a variational inequality. More specifically, finding a saddle point of the Lagrange function of (1.1) is equivalent to solving the following variational inequality problem: finding w* ∈ Ω such that

    VI(Ω, F, θ):   θ(u) - θ(u*) + (w - w*)^T F(w*) ≥ 0,   ∀ w ∈ Ω,                                     (2.5a)

where

    u = (x_1; x_2; x_3),    w = (x_1; x_2; x_3; λ),    θ(u) = θ_1(x_1) + θ_2(x_2) + θ_3(x_3),          (2.5b)

    F(w) = ( -A_1^T λ;  -A_2^T λ;  -A_3^T λ;  A_1 x_1 + A_2 x_2 + A_3 x_3 - b ),                       (2.5c)

Ω = X_1 × X_2 × X_3 × R^p, and Ω* denotes the solution set of VI(Ω, F, θ). Obviously, the mapping F(·) defined in (2.5c) is monotone because it is affine with a skew-symmetric matrix.
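To spell out that last observation (this short verification is added here for completeness and is not part of the original argument; the symbols S and q are introduced only for this remark): writing F(w) = S w + q with

    S = [  0     0     0   -A_1^T ]         [  0 ]
        [  0     0     0   -A_2^T ]    q =  [  0 ]
        [  0     0     0   -A_3^T ]         [  0 ]
        [ A_1   A_2   A_3     0   ],        [ -b ],

we have S^T = -S, and hence for any w and w̄,

    (w - w̄)^T ( F(w) - F(w̄) ) = (w - w̄)^T S (w - w̄) = 0,

so the monotonicity inequality (w - w̄)^T (F(w) - F(w̄)) ≥ 0 holds trivially.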

Last, let us take a deeper look at the output of (2.4) and investigate some of its properties. Deriving the first-order optimality conditions of the minimization problems in (2.4) and rewriting (2.4c) appropriately, we obtain

    θ_2(x_2) - θ_2(x_2^{k+1}) + (x_2 - x_2^{k+1})^T { -A_2^T [ λ^k - β(A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^k - b) ] } ≥ 0,  ∀ x_2 ∈ X_2,          (2.6a)
    θ_3(x_3) - θ_3(x_3^{k+1}) + (x_3 - x_3^{k+1})^T { -A_3^T [ λ^k - β(A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b) ] } ≥ 0,  ∀ x_3 ∈ X_3,       (2.6b)
    (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b) + (1/β)(λ^{k+1} - λ^k) = 0,                                                                   (2.6c)
    θ_1(x_1) - θ_1(x_1^{k+2}) + (x_1 - x_1^{k+2})^T { -A_1^T [ λ^{k+1} - β(A_1 x_1^{k+2} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b) ] } ≥ 0,  ∀ x_1 ∈ X_1.   (2.6d)

Then, substituting (2.6c) into (2.6a), (2.6b) and (2.6d), and using A_1^T A_3 = 0, we get

    θ_2(x_2) - θ_2(x_2^{k+1}) + (x_2 - x_2^{k+1})^T { -A_2^T λ^{k+1} + β A_2^T A_3 (x_3^k - x_3^{k+1}) } ≥ 0,  ∀ x_2 ∈ X_2,                           (2.7a)
    θ_3(x_3) - θ_3(x_3^{k+1}) + (x_3 - x_3^{k+1})^T { -A_3^T λ^{k+1} } ≥ 0,  ∀ x_3 ∈ X_3,                                                             (2.7b)
    (A_1 x_1^{k+2} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b) + A_1 (x_1^{k+1} - x_1^{k+2}) - (1/β)(λ^k - λ^{k+1}) = 0,                                      (2.7c)
    θ_1(x_1) - θ_1(x_1^{k+2}) + (x_1 - x_1^{k+2})^T { -A_1^T λ^{k+1} - β A_1^T A_1 (x_1^{k+1} - x_1^{k+2}) + A_1^T (λ^k - λ^{k+1}) } ≥ 0,  ∀ x_1 ∈ X_1.  (2.7d)

With the definitions of θ, F, Ω, u^k and v^k, we can rewrite (2.7) in a compact form. We summarize it in the next lemma and omit its proof, as it is just a compact reformulation of (2.7).

Lemma 2.1. Let x_2^{k+1}, x_3^{k+1}, λ^{k+1} and x_1^{k+2} be generated by (2.4) from a given v^k = (x_1^{k+1}, x_3^k, λ^k), and set w^{k+1} = (x_1^{k+2}, x_2^{k+1}, x_3^{k+1}, λ^{k+1}). Then we have w^{k+1} ∈ Ω and

    θ(u) - θ(u^{k+1}) + (w - w^{k+1})^T { F(w^{k+1}) + Q (v^k - v^{k+1}) } ≥ 0,   ∀ w ∈ Ω,             (2.8)

where

        [ -β A_1^T A_1      0           A_1^T   ]
    Q = [       0       β A_2^T A_3       0     ]                                                      (2.9)
        [       0            0            0     ]
        [      A_1           0       -(1/β) I   ].

Note that the assertion (2.8) is useful for quantifying the accuracy of w^{k+1} as an approximate solution of VI(Ω, F, θ), because of the variational inequality reformulation (2.5) of (1.1). Now we are ready to prove the convergence of the direct extension of ADMM under the condition A_1^T A_3 = 0. We first refine the assertion (2.8) under this additional condition.

Lemma 2.2. Let x_2^{k+1}, x_3^{k+1}, λ^{k+1} and x_1^{k+2} be generated by (2.4) from a given v^k = (x_1^{k+1}, x_3^k, λ^k), and set w^{k+1} = (x_1^{k+2}, x_2^{k+1}, x_3^{k+1}, λ^{k+1}). If A_1^T A_3 = 0, then we have w^{k+1} ∈ Ω and

    θ(u) - θ(u^{k+1}) + (w - w^{k+1})^T { F(w^{k+1}) + β P A_3 (x_3^k - x_3^{k+1}) } ≥ (v - v^{k+1})^T H (v^k - v^{k+1}),   ∀ w ∈ Ω,        (2.10)

where

        [ A_1^T ]          [ x_1 ]          [ β A_1^T A_1        0         -A_1^T   ]
    P = [ A_2^T ],    v =  [ x_3 ],    H =  [      0        β A_3^T A_3       0     ]
        [ A_3^T ]          [  λ  ]          [    -A_1             0       (1/β) I   ].
        [   0   ]

Proof. Since A_1^T A_3 = 0, the following is an identity:

    [ x_1 - x_1^{k+2} ]^T [ β A_1^T A_1   β A_1^T A_3   -A_1^T  ] [ x_1^{k+1} - x_1^{k+2} ]
    [ x_3 - x_3^{k+1} ]   [ β A_3^T A_1   β A_3^T A_3      0    ] [ x_3^k - x_3^{k+1}     ]  =  (v - v^{k+1})^T H (v^k - v^{k+1}).          (2.11)
    [ λ - λ^{k+1}     ]   [    -A_1            0       (1/β) I  ] [ λ^k - λ^{k+1}         ]

Adding this identity to both sides of (2.8) and using the notations of v and H, we obtain

    w^{k+1} ∈ Ω,   θ(u) - θ(u^{k+1}) + (w - w^{k+1})^T { F(w^{k+1}) + Q̄ (v^k - v^{k+1}) } ≥ (v - v^{k+1})^T H (v^k - v^{k+1}),   ∀ w ∈ Ω,   (2.12)

where (see Q in (2.9))

              [ β A_1^T A_1   β A_1^T A_3   -A_1^T  ]     [ 0   β A_1^T A_3   0 ]
    Q̄ = Q  +  [      0             0           0    ]  =  [ 0   β A_2^T A_3   0 ]
              [ β A_3^T A_1   β A_3^T A_3      0    ]     [ 0   β A_3^T A_3   0 ]
              [    -A_1             0       (1/β) I ]     [ 0        0        0 ],

in which A_1^T A_3 = A_3^T A_1 = 0 has been used to cancel the remaining terms. By using the structures of the matrices Q̄ and P (see (2.10)) and the vector v, we have

    (w - w^{k+1})^T Q̄ (v^k - v^{k+1}) = (w - w^{k+1})^T β P A_3 (x_3^k - x_3^{k+1}).

The assertion (2.10) is proved.

Let us define two auxiliary sequences, which will only serve to simplify the notation in the convergence analysis:

    w̃^k = ( x̃_1^k; x̃_2^k; x̃_3^k; λ̃^k ) = ( x_1^{k+2};  x_2^{k+1};  x_3^{k+1};  λ^{k+1} - β A_3 (x_3^k - x_3^{k+1}) )    and    ũ^k = ( x̃_1^k; x̃_2^k; x̃_3^k ),      (2.13)

where {x_1^{k+2}, x_2^{k+1}, x_3^{k+1}, λ^{k+1}} is generated by (2.4).
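Since ||·||_H appears throughout the remaining analysis, it is worth noting that H is indeed positive semidefinite, with v^T H v = β||A_1 x_1 - λ/β||² + β||A_3 x_3||². The following Python sketch (added here for illustration; it is not part of the paper, and the data are arbitrary) verifies both facts numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    p, n1, n3, beta = 5, 2, 3, 0.7
    A1, A3 = rng.standard_normal((p, n1)), rng.standard_normal((p, n3))
    H = np.block([
        [beta * A1.T @ A1,    np.zeros((n1, n3)),  -A1.T             ],
        [np.zeros((n3, n1)),  beta * A3.T @ A3,    np.zeros((n3, p)) ],
        [-A1,                 np.zeros((p, n3)),   np.eye(p) / beta  ],
    ])
    print(np.all(np.linalg.eigvalsh(H) >= -1e-10))   # True: H is positive semidefinite

    # Quadratic form identity used later (e.g. in the proof of Theorem 2.4(ii)).
    a, c, l = rng.standard_normal(n1), rng.standard_normal(n3), rng.standard_normal(p)
    v = np.r_[a, c, l]
    print(np.isclose(v @ H @ v,
                     beta * np.sum((A1 @ a - l / beta)**2) + beta * np.sum((A3 @ c)**2)))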

In the next lemma we establish an important inequality based on the assertion of Lemma 2.2; it will play a vital role in the convergence analysis.

Lemma 2.3. Let x_2^{k+1}, x_3^{k+1}, λ^{k+1} and x_1^{k+2} be generated by (2.4) from a given v^k = (x_1^{k+1}, x_3^k, λ^k). If A_1^T A_3 = 0, then we have w̃^k ∈ Ω and

    θ(u) - θ(ũ^k) + (w - w̃^k)^T F(w̃^k) ≥ (1/2)( ||v - v^{k+1}||_H² - ||v - v^k||_H² ) + (1/2) ||v^k - v^{k+1}||_H²,   ∀ w ∈ Ω,      (2.14)

where w̃^k and ũ^k are defined in (2.13).

Proof. According to the definition of w̃^k, the assertion (2.10) can be rewritten as

    w̃^k ∈ Ω,   θ(u) - θ(ũ^k) + (w - w̃^k)^T F(w̃^k) ≥ β (A_1 x_1^{k+2} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b)^T A_3 (x_3^k - x_3^{k+1}) + (v - v^{k+1})^T H (v^k - v^{k+1}),   ∀ w ∈ Ω.      (2.15)

Setting x_3 = x_3^k in (2.7b), we obtain

    θ_3(x_3^k) - θ_3(x_3^{k+1}) + (x_3^k - x_3^{k+1})^T { -A_3^T λ^{k+1} } ≥ 0.                        (2.16)

Note that (2.7b) also holds at the (k-1)-th iteration, i.e., θ_3(x_3) - θ_3(x_3^k) + (x_3 - x_3^k)^T { -A_3^T λ^k } ≥ 0 for all x_3 ∈ X_3. Setting x_3 = x_3^{k+1} in this inequality, we obtain

    θ_3(x_3^{k+1}) - θ_3(x_3^k) + (x_3^{k+1} - x_3^k)^T { -A_3^T λ^k } ≥ 0,                            (2.17)

which together with (2.16) yields

    (λ^k - λ^{k+1})^T A_3 (x_3^k - x_3^{k+1}) ≥ 0,   ∀ k ≥ 1.                                          (2.18)

By using the fact λ^k - λ^{k+1} = β (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b), we get immediately that

    β (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} - b)^T A_3 (x_3^k - x_3^{k+1}) ≥ 0.               (2.19)

Since A_1^T A_3 = 0, the terms (A_1 x_1^{k+1})^T A_3 (x_3^k - x_3^{k+1}) and (A_1 x_1^{k+2})^T A_3 (x_3^k - x_3^{k+1}) both vanish, so (2.19) implies that the first term on the right-hand side of (2.15) is nonnegative, and hence

    w̃^k ∈ Ω,   θ(u) - θ(ũ^k) + (w - w̃^k)^T F(w̃^k) ≥ (v - v^{k+1})^T H (v^k - v^{k+1}),   ∀ w ∈ Ω.      (2.20)

By substituting the identity

    (v - v^{k+1})^T H (v^k - v^{k+1}) = (1/2)( ||v - v^{k+1}||_H² - ||v - v^k||_H² ) + (1/2) ||v^k - v^{k+1}||_H²

into the right-hand side of (2.20), we obtain (2.14).

Now we are able to establish the contraction property, with respect to the solution set of VI(Ω, F, θ), of the sequence {v^k} generated by (2.4); from this property the convergence of (2.4) can be easily established.

Theorem 2.4. Assume A_1^T A_3 = 0 for the model (1.1), and let {(x_1^{k+1}, x_2^k, x_3^k, λ^k)} be the sequence generated by the direct extension of ADMM in the form (2.4). Then we have:

(i) The sequence {v^k := (x_1^{k+1}, x_3^k, λ^k)} is contractive with respect to the solution set of VI(Ω, F, θ), i.e.,

    ||v^{k+1} - v*||_H² ≤ ||v^k - v*||_H² - ||v^k - v^{k+1}||_H².                                      (2.21)

(ii) If the matrices [A_1, A_2] and A_3 are further assumed to be of full column rank, then the sequence {w^k} converges to a KKT point of the model (1.1).

Proof. (i) The first assertion follows directly from (2.14). Setting w = w* in (2.14), we get

    (1/2)( ||v^k - v*||_H² - ||v^{k+1} - v*||_H² ) - (1/2) ||v^k - v^{k+1}||_H² ≥ θ(ũ^k) - θ(u*) + (w̃^k - w*)^T F(w̃^k).

From the monotonicity of F and (2.5), it follows that

    θ(ũ^k) - θ(u*) + (w̃^k - w*)^T F(w̃^k) ≥ θ(ũ^k) - θ(u*) + (w̃^k - w*)^T F(w*) ≥ 0,

and thus (2.21) is proved. Clearly, (2.21) indicates that the sequence {v^k} is contractive with respect to the solution set of VI(Ω, F, θ); see, e.g., [1].

(ii) By the inequality (2.21), the sequences {A_1 x_1^{k+1} - (1/β) λ^k} and {A_3 x_3^k} are both bounded. Since A_3 has full column rank, we deduce that {x_3^k} is bounded. Note that

    A_1 x_1^{k+1} + A_2 x_2^k = ( A_1 x_1^{k+1} - (1/β) λ^k ) - ( A_1 x_1^k - (1/β) λ^{k-1} ) - A_3 x_3^k + b.      (2.22)

Hence {A_1 x_1^{k+1} + A_2 x_2^k} is bounded. Together with the assumption that [A_1, A_2] has full column rank, we conclude that the sequences {x_1^{k+1}}, {x_2^k} and {λ^k} are all bounded. Therefore, there exists a subsequence {(x_1^{n_k+1}, x_2^{n_k}, x_3^{n_k}, λ^{n_k})} that converges to a limit point, say (x_1^∞, x_2^∞, x_3^∞, λ^∞). Moreover, from (2.21) we see immediately that

    Σ_{k=0}^{∞} ||v^k - v^{k+1}||_H² < +∞,                                                             (2.23)

which shows

    lim_{k→∞} H (v^k - v^{k+1}) = 0,                                                                   (2.24)

and thus

    lim_{k→∞} Q (v^k - v^{k+1}) = 0.                                                                   (2.25)

Then, taking limits on both sides of (2.8) along the subsequence and using (2.25), one can immediately write

    w^∞ ∈ Ω,   θ(u) - θ(u^∞) + (w - w^∞)^T F(w^∞) ≥ 0,   ∀ w ∈ Ω,                                       (2.26)

which means that w^∞ = (x_1^∞, x_2^∞, x_3^∞, λ^∞) is a KKT point of (1.1). Hence the inequality (2.21) remains valid if (x_1*, x_2*, x_3*, λ*) is replaced by (x_1^∞, x_2^∞, x_3^∞, λ^∞). Then it holds that

    ||v^{k+1} - v^∞||_H ≤ ||v^k - v^∞||_H,                                                             (2.27)

which, since a subsequence of {||v^k - v^∞||_H} tends to zero, implies that

    lim_{k→∞} [ A_1 (x_1^{k+1} - x_1^∞) - (1/β)(λ^k - λ^∞) ] = 0    and    lim_{k→∞} A_3 (x_3^k - x_3^∞) = 0.      (2.28)

By taking limits in (2.22), using (2.28) and the full-column-rank assumptions, we know that

    lim_{k→∞} x_1^{k+1} = x_1^∞,    lim_{k→∞} x_2^k = x_2^∞,    lim_{k→∞} x_3^k = x_3^∞,    lim_{k→∞} λ^k = λ^∞,      (2.29)

which completes the proof of this theorem.

Inspired by [13], we can also establish a worst-case convergence rate measured by the iteration complexity in the ergodic sense for the direct extension of ADMM (2.4). This is summarized in the following theorem.

Theorem 2.5. Assume A_1^T A_3 = 0 for the model (1.1). Let {(x_1^k, x_2^k, x_3^k, λ^k)} be the sequence generated by the direct extension of ADMM in the form (2.4), and let w̃^k be defined in (2.13). After t iterations of (2.4), take

    w̃_t = (1/(t+1)) Σ_{k=0}^{t} w̃^k.                                                                  (2.30)

Then w̃_t ∈ Ω and it satisfies

    θ(ũ_t) - θ(u) + (w̃_t - w)^T F(w) ≤ (1/(2(t+1))) ||v - v^0||_H²,   ∀ w ∈ Ω.                          (2.31)

Proof. By the monotonicity of F and (2.14), it follows that

    w̃^k ∈ Ω,   θ(u) - θ(ũ^k) + (w - w̃^k)^T F(w) + (1/2) ||v - v^k||_H² ≥ (1/2) ||v - v^{k+1}||_H²,   ∀ w ∈ Ω.      (2.32)

Together with the convexity of X_1, X_2 and X_3, (2.32) implies that w̃_t ∈ Ω. Summing the inequality (2.32) over k = 0, 1, ..., t, we obtain

    (t+1) θ(u) - Σ_{k=0}^{t} θ(ũ^k) + ( (t+1) w - Σ_{k=0}^{t} w̃^k )^T F(w) + (1/2) ||v - v^0||_H² ≥ 0,   ∀ w ∈ Ω.

Using the notation of w̃_t, this can be written as

    (1/(t+1)) Σ_{k=0}^{t} θ(ũ^k) - θ(u) + (w̃_t - w)^T F(w) ≤ (1/(2(t+1))) ||v - v^0||_H²,   ∀ w ∈ Ω.      (2.33)

Since θ(u) is convex and ũ_t = (1/(t+1)) Σ_{k=0}^{t} ũ^k, we have θ(ũ_t) ≤ (1/(t+1)) Σ_{k=0}^{t} θ(ũ^k). Substituting this into (2.33), the assertion of the theorem follows directly.

Remark 2.6. For an arbitrarily given compact set D ⊂ Ω, let d = sup{ ||v - v^0||_H²  |  v = w \ x_2,  w ∈ D }, where v^0 = (x_1^1, x_3^0, λ^0). Then, after t iterations of the extended ADMM (2.4), the point w̃_t defined in (2.30) satisfies

    sup_{w ∈ D} { θ(ũ_t) - θ(u) + (w̃_t - w)^T F(w) } ≤ d / (2(t+1)),                                   (2.34)

which, according to the definition (2.5), means that w̃_t is an approximate solution of VI(Ω, F, θ) with an accuracy of O(1/t). Thus a worst-case O(1/t) convergence rate in the ergodic sense is established for the direct extension of ADMM (2.4).

3 An Example Showing the Failure of Convergence of (1.5)

In the last section we have shown that if, additionally, any two coefficient matrices in (1.1) are assumed to be orthogonal, then the direct extension of ADMM is convergent. In this section we give an example showing the failure of convergence when (1.5) is applied to solve (1.1) without such an orthogonality assumption. More specifically, we consider the following linear homogeneous equation with three variables:

    x_1 A_1 + x_2 A_2 + x_3 A_3 = 0,                                                                   (3.1)

where A_i ∈ R^3 (i = 1, 2, 3) are all column vectors and [A_1, A_2, A_3] is assumed to be nonsingular, and x_i ∈ R (i = 1, 2, 3). The unique solution of (3.1) is thus x_1 = x_2 = x_3 = 0. Clearly, (3.1) is a special case of (1.1) in which the objective function is zero, b is the zero vector in R^3 and X_i = R for i = 1, 2, 3. The direct extension of ADMM (1.5) is thus applicable, and the corresponding optimal Lagrange multiplier is λ* = 0.

3.1 The Iterative Scheme of (1.5) for (3.1)

Now we elucidate the iterative scheme resulting from applying the direct extension of ADMM (1.5) to the linear equation (3.1). In fact, as we will show, it can be represented as a matrix recursion. Specifying the scheme (1.5), with β = 1, to the particular setting of (3.1), we obtain

    -A_1^T λ^k + A_1^T (A_1 x_1^{k+1} + A_2 x_2^k + A_3 x_3^k) = 0,                                    (3.2a)
    -A_2^T λ^k + A_2^T (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^k) = 0,                                (3.2b)
    -A_3^T λ^k + A_3^T (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1}) = 0,                            (3.2c)
    (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1}) + λ^{k+1} - λ^k = 0.                               (3.2d)

It follows from the first equation in (3.2) that

    x_1^{k+1} = (A_1^T A_1)^{-1} ( -A_1^T A_2 x_2^k - A_1^T A_3 x_3^k + A_1^T λ^k ).                   (3.3)

Substituting (3.3) into (3.2b), (3.2c) and (3.2d), we obtain a reformulation of (3.2):

    [ A_2^T A_2      0        0 ] [ x_2^{k+1} ]     [ 0   -A_2^T A_3   A_2^T ] [ x_2^k ]     [ A_2^T A_1 ]
    [ A_3^T A_2   A_3^T A_3   0 ] [ x_3^{k+1} ]  =  [ 0        0       A_3^T ] [ x_3^k ]  -  [ A_3^T A_1 ] (A_1^T A_1)^{-1} ( -A_1^T A_2 x_2^k - A_1^T A_3 x_3^k + A_1^T λ^k ).      (3.4)
    [   A_2         A_3       I ] [ λ^{k+1}  ]     [ 0        0         I   ] [ λ^k  ]     [    A_1    ]

Let

    L = [ A_2^T A_2      0        0 ]
        [ A_3^T A_2   A_3^T A_3   0 ]                                                                  (3.5)
        [   A_2         A_3       I ]

and

    R = [ 0   -A_2^T A_3   A_2^T ]     [ A_2^T A_1 ]
        [ 0        0       A_3^T ]  -  [ A_3^T A_1 ] (A_1^T A_1)^{-1} ( -A_1^T A_2,  -A_1^T A_3,  A_1^T ).                                  (3.6)
        [ 0        0         I   ]     [    A_1    ]

Then the iterative formula (3.4) can be rewritten in the following matrix recursion form:

    ( x_2^{k+1};  x_3^{k+1};  λ^{k+1} ) = M ( x_2^k;  x_3^k;  λ^k ) = ⋯ = M^{k+1} ( x_2^0;  x_3^0;  λ^0 ),                                   (3.7)

with

    M = L^{-1} R.                                                                                                                            (3.8)

Note that for R defined in (3.6) we have R (0, 0, A_1^T)^T = 0. Thus, as will be seen in the next subsection, 0 is an eigenvalue of the matrix M.

3.2 A Concrete Example Showing the Divergence of (1.5)

Now we are ready to construct a concrete example showing the divergence of the direct extension of ADMM (1.5) when it is applied to solve the model (3.1). Our previous analysis in Section 2 has shown that the scheme (1.5) is convergent whenever any two of the coefficient matrices are orthogonal. Thus, to show the failure of convergence of (1.5) for (3.1), the columns A_1, A_2 and A_3 in (3.1) should be chosen such that no two of them are orthogonal. Moreover, notice that the sequence generated by (3.7) is divergent if ρ(M) > 1. We thus construct the matrix A as follows:

    A = ( A_1, A_2, A_3 ) = [ 1  1  1 ]
                            [ 1  1  2 ]                                                                                                      (3.9)
                            [ 1  2  2 ].

Given this matrix A, the system of linear equations (3.4) can be specified as

    [ 6  0  0  0  0 ] [ x_2^{k+1} ]     (  [ 0  -7  1  1  2 ]     [ 4 ]                            )  [ x_2^k ]
    [ 7  9  0  0  0 ] [ x_3^{k+1} ]     (  [ 0   0  1  2  2 ]     [ 5 ]                            )  [ x_3^k ]
    [ 1  1  1  0  0 ] [ λ_1^{k+1} ]  =  (  [ 0   0  1  0  0 ]  -  [ 1 ] (1/3) ( -4, -5, 1, 1, 1 )  )  [ λ_1^k ].
    [ 1  2  0  1  0 ] [ λ_2^{k+1} ]     (  [ 0   0  0  1  0 ]     [ 1 ]                            )  [ λ_2^k ]
    [ 2  2  0  0  1 ] [ λ_3^{k+1} ]     (  [ 0   0  0  0  1 ]     [ 1 ]                            )  [ λ_3^k ]

Note that with the specification in (3.9), the matrices L in (3.5) and R in (3.6) reduce to

    L = [ 6  0  0  0  0 ]
        [ 7  9  0  0  0 ]
        [ 1  1  1  0  0 ]
        [ 1  2  0  1  0 ]
        [ 2  2  0  0  1 ]

and

    R = (1/3) [ 16  -1  -1  -1   2 ]
              [ 20  25  -2   1   1 ]
              [  4   5   2  -1  -1 ]
              [  4   5  -1   2  -1 ]
              [  4   5  -1  -1   2 ].

Thus we have

    M = L^{-1} R = (1/162) [ 144   -9   -9   -9   18 ]
                           [   8  157   -5   13   -8 ]
                           [  64  122  122  -58  -64 ]
                           [  56  -35  -35   91  -56 ]
                           [ -88  -26  -26  -62   88 ].

By direct computation, we know that M admits the eigenvalue decomposition

    M = V Diag(d) V^{-1},                                                                              (3.10)

where d ∈ C^5 collects the eigenvalues of M and the columns of V ∈ C^{5×5} are the corresponding eigenvectors, both obtained numerically (in particular, 0 is an eigenvalue of M, with eigenvector proportional to (0, 0, A_1^T)^T, as noted above). The important fact regarding d is that its first two entries form a complex-conjugate pair with

    |d_1| = |d_2| > 1,

which offers us the opportunity to construct a divergent sequence {x_2^k, x_3^k, λ_1^k, λ_2^k, λ_3^k}.
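The claim |d_1| = |d_2| > 1 is easy to confirm numerically. The following Python sketch (added here for illustration; it is not part of the paper) implements one sweep of (1.5) for (3.1) with the data (3.9) and β = 1, assembles the induced linear map, i.e., the matrix M of (3.8), column by column, and inspects its spectrum:

    import numpy as np

    # Counterexample data (3.9): zero objective, b = 0, beta = 1.
    A1 = np.array([1., 1., 1.])
    A2 = np.array([1., 1., 2.])
    A3 = np.array([1., 2., 2.])
    beta = 1.0

    def sweep(x2, x3, lam):
        """One cycle of the direct extension (1.5) specialized to (3.1), cf. (3.2)."""
        x1 = (A1 @ lam / beta - A1 @ (A2 * x2 + A3 * x3)) / (A1 @ A1)   # (3.2a)
        x2 = (A2 @ lam / beta - A2 @ (A1 * x1 + A3 * x3)) / (A2 @ A2)   # (3.2b)
        x3 = (A3 @ lam / beta - A3 @ (A1 * x1 + A2 * x2)) / (A3 @ A3)   # (3.2c)
        lam = lam - beta * (A1 * x1 + A2 * x2 + A3 * x3)                # (3.2d)
        return x2, x3, lam

    # The sweep is linear in (x2, x3, lam) because b = 0 and the objective is zero,
    # so applying it to the unit vectors recovers the matrix M of (3.7)-(3.8).
    M = np.zeros((5, 5))
    for j in range(5):
        e = np.zeros(5)
        e[j] = 1.0
        y2, y3, ylam = sweep(e[0], e[1], e[2:])
        M[:, j] = np.r_[y2, y3, ylam]

    rho = max(abs(np.linalg.eigvals(M)))
    print("spectral radius of M:", rho)        # exceeds 1 for this data

    # Since rho > 1, the powers M^{k+1} in (3.7) cannot converge: iterates started
    # along (the real part of) a dominant eigenvector grow without bound.
    vals, vecs = np.linalg.eig(M)
    v0 = np.real(vecs[:, np.argmax(abs(vals))])
    for k in (10, 100, 1000):
        print(k, np.linalg.norm(np.linalg.matrix_power(M, k) @ v0))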

In fact, let us choose the initial point (x_2^0, x_3^0, λ_1^0, λ_2^0, λ_3^0)^T as V(:,1) + V(:,2). It is clear that the vector V(:,1) is the complex conjugate of the vector V(:,2); thus the starting point is real. Then, since V^{-1} (x_2^0; x_3^0; λ^0) = (1, 1, 0, 0, 0)^T, we know from (3.7) and (3.10) that

    ( x_2^{k+1};  x_3^{k+1};  λ^{k+1} ) = V Diag(d^{k+1}) V^{-1} ( x_2^0;  x_3^0;  λ^0 ) = V Diag(d^{k+1}) (1, 1, 0, 0, 0)^T = V ( d_1^{k+1},  d_2^{k+1},  0,  0,  0 )^T,

which is definitely divergent because |d_1| = |d_2| > 1, and there is no way for it to converge to the solution point of (3.1).

4 Conclusions

We showed by an example that the direct extension of the alternating direction method of multipliers (ADMM) is not necessarily convergent for solving a multi-block convex minimization model with linear constraints and an objective function that is the sum of more than two convex functions without coupled variables; a long-standing open problem is thus answered. Based on our strategy for constructing this example, it is easy to find more such examples. The negative answer to this open problem also verifies the rationale of the algorithmic design in our recent work such as [11, 12], where it was suggested to combine some correction steps with the output of the direct extension of ADMM in order to produce splitting algorithms with rigorous convergence under mild assumptions for multi-block convex minimization models.

We also studied a condition that can guarantee the convergence of the direct extension of ADMM. This condition is significantly different from those in the literature, which often require strong convexity of the objective functions and/or restrictive choices of the penalty parameter. Instead, the new condition simply depends on the orthogonality of the given coefficient matrices in the model and poses no restriction on how to choose the penalty parameter in the algorithmic implementation.

Our analysis focused on the model (1.1) with three convex functions in its objective, because this case already represents a wide range of applications and it is easier to demonstrate our main idea in this special case. The analysis can be easily extended to the more general multi-block convex minimization model with m > 3 convex functions in its objective. Moreover, based on our strategy of constructing the example showing the divergence of (1.5), it is easy to show that the direct extension of ADMM with some given value of the penalty parameter is still divergent even if the functions θ_i in (1.1) are all further assumed to be strongly convex. We omit the details of these extensions for succinctness.

References

[1] E. Blum and W. Oettli, Mathematische Optimierung: Grundlagen und Verfahren, Ökonometrie und Unternehmensforschung, Springer-Verlag, Berlin-Heidelberg-New York, 1975.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn., 3 (2011), pp. 1-122.
[3] T. F. Chan and R. Glowinski, Finite element approximation and iterative solution of a class of mildly non-linear elliptic equations, technical report, Stanford University, 1978.
[4] V. Chandrasekaran, P. A. Parrilo and A. S. Willsky, Latent variable graphical model selection via convex optimization, Ann. Stat., 40 (2012), pp. 1935-1967.
[5] J. Eckstein, Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results, manuscript, 2012.
[6] M. Fortin and R. Glowinski, On decomposition-coordination methods using an augmented Lagrangian, in M. Fortin and R. Glowinski, eds., Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems, North-Holland, Amsterdam, 1983.
[7] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17-40.
[8] R. Glowinski, On alternating direction methods of multipliers: a historical perspective, Springer Proceedings of a Conference Dedicated to J. Periaux, to appear.
[9] R. Glowinski and A. Marrocco, Approximation par éléments finis d'ordre un et résolution, par pénalisation-dualité, d'une classe de problèmes non linéaires, R.A.I.R.O., R-2 (1975), pp. 41-76.
[10] D. R. Han and X. M. Yuan, A note on the alternating direction method of multipliers, J. Optim. Theory Appl., 155 (2012), pp. 227-238.
[11] B. S. He, M. Tao and X. M. Yuan, Alternating direction method with Gaussian back substitution for separable convex programming, SIAM J. Optim., 22 (2012), pp. 313-340.
[12] B. S. He, M. Tao and X. M. Yuan, Convergence rate and iteration complexity on the alternating direction method of multipliers with a substitution procedure for separable convex programming, Math. Oper. Res., under revision.
[13] B. S. He and X. M. Yuan, On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method, SIAM J. Numer. Anal., 50 (2012), pp. 700-709.
[14] M. R. Hestenes, Multiplier and gradient methods, J. Optim. Theory Appl., 4 (1969), pp. 303-320.
[15] M. Hong and Z. Q. Luo, On the linear convergence of the alternating direction method of multipliers, manuscript, August 2012.
[16] G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, volume 544, Wiley-Interscience, 2004.

[17] K. Mohan, P. London, M. Fazel, D. Witten and S. Lee, Node-based learning of multiple Gaussian graphical models, arXiv:1303.5145, 2013.
[18] Y. G. Peng, A. Ganesh, J. Wright, W. L. Xu and Y. Ma, Robust alignment by sparse and low-rank decomposition for linearly correlated images, IEEE Trans. Pattern Anal. Mach. Intell., 34 (2012), pp. 2233-2246.
[19] M. J. D. Powell, A method for nonlinear constraints in minimization problems, in Optimization, R. Fletcher, ed., Academic Press, New York, 1969, pp. 283-298.
[20] M. Tao and X. M. Yuan, Recovering low-rank and sparse components of matrices from incomplete and noisy observations, SIAM J. Optim., 21 (2011), pp. 57-81.


More information

Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems?

Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems? Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems? Mingyi Hong IMSE and ECpE Department Iowa State University ICCOPT, Tokyo, August 2016 Mingyi Hong (Iowa State University)

More information

LINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION

LINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION LINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION JUNFENG YANG AND XIAOMING YUAN Abstract. The nuclear norm is widely used to induce low-rank solutions for

More information

The Multi-Path Utility Maximization Problem

The Multi-Path Utility Maximization Problem The Multi-Path Utility Maximization Problem Xiaojun Lin and Ness B. Shroff School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 47906 {linx,shroff}@ecn.purdue.edu Abstract

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION JOEL A. TROPP Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional

More information

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH Jie Sun 1 Department of Decision Sciences National University of Singapore, Republic of Singapore Jiapu Zhang 2 Department of Mathematics

More information

The antitriangular factorisation of saddle point matrices

The antitriangular factorisation of saddle point matrices The antitriangular factorisation of saddle point matrices J. Pestana and A. J. Wathen August 29, 2013 Abstract Mastronardi and Van Dooren [this journal, 34 (2013) pp. 173 196] recently introduced the block

More information

A Power Penalty Method for a Bounded Nonlinear Complementarity Problem

A Power Penalty Method for a Bounded Nonlinear Complementarity Problem A Power Penalty Method for a Bounded Nonlinear Complementarity Problem Song Wang 1 and Xiaoqi Yang 2 1 Department of Mathematics & Statistics Curtin University, GPO Box U1987, Perth WA 6845, Australia

More information

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES Fenghui Wang Department of Mathematics, Luoyang Normal University, Luoyang 470, P.R. China E-mail: wfenghui@63.com ABSTRACT.

More information

ADMM and Accelerated ADMM as Continuous Dynamical Systems

ADMM and Accelerated ADMM as Continuous Dynamical Systems Guilherme França 1 Daniel P. Robinson 1 René Vidal 1 Abstract Recently, there has been an increasing interest in using tools from dynamical systems to analyze the behavior of simple optimization algorithms

More information

20 J.-S. CHEN, C.-H. KO AND X.-R. WU. : R 2 R is given by. Recently, the generalized Fischer-Burmeister function ϕ p : R2 R, which includes

20 J.-S. CHEN, C.-H. KO AND X.-R. WU. : R 2 R is given by. Recently, the generalized Fischer-Burmeister function ϕ p : R2 R, which includes 016 0 J.-S. CHEN, C.-H. KO AND X.-R. WU whereas the natural residual function ϕ : R R is given by ϕ (a, b) = a (a b) + = min{a, b}. Recently, the generalized Fischer-Burmeister function ϕ p : R R, which

More information

An Extended Alternating Direction Method for Three-Block Separable Convex Programming

An Extended Alternating Direction Method for Three-Block Separable Convex Programming An Etended Alternating Direction Method for Three-Bloc Separable Conve Programming Jianchao Bai Jicheng Li Hongchao Zhang School of Mathematics and Statistics, Xi an Jiaotong University, Xi an 70049, China

More information

On the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities

On the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities STATISTICS,OPTIMIZATION AND INFORMATION COMPUTING Stat., Optim. Inf. Comput., Vol. 2, June 204, pp 05 3. Published online in International Academic Press (www.iapress.org) On the Convergence and O(/N)

More information

Stochastic dynamical modeling:

Stochastic dynamical modeling: Stochastic dynamical modeling: Structured matrix completion of partially available statistics Armin Zare www-bcf.usc.edu/ arminzar Joint work with: Yongxin Chen Mihailo R. Jovanovic Tryphon T. Georgiou

More information

Approximation algorithms for nonnegative polynomial optimization problems over unit spheres

Approximation algorithms for nonnegative polynomial optimization problems over unit spheres Front. Math. China 2017, 12(6): 1409 1426 https://doi.org/10.1007/s11464-017-0644-1 Approximation algorithms for nonnegative polynomial optimization problems over unit spheres Xinzhen ZHANG 1, Guanglu

More information

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization Iranian Journal of Operations Research Vol. 4, No. 1, 2013, pp. 88-107 Research Note A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization B. Kheirfam We

More information

Relaxed linearized algorithms for faster X-ray CT image reconstruction

Relaxed linearized algorithms for faster X-ray CT image reconstruction Relaxed linearized algorithms for faster X-ray CT image reconstruction Hung Nien and Jeffrey A. Fessler University of Michigan, Ann Arbor The 13th Fully 3D Meeting June 2, 2015 1/20 Statistical image reconstruction

More information

Key words. alternating direction method of multipliers, convex composite optimization, indefinite proximal terms, majorization, iteration-complexity

Key words. alternating direction method of multipliers, convex composite optimization, indefinite proximal terms, majorization, iteration-complexity A MAJORIZED ADMM WITH INDEFINITE PROXIMAL TERMS FOR LINEARLY CONSTRAINED CONVEX COMPOSITE OPTIMIZATION MIN LI, DEFENG SUN, AND KIM-CHUAN TOH Abstract. This paper presents a majorized alternating direction

More information