Adaptive Stochastic Alternating Direction Method of Multipliers


Peilin Zhao, Jinwei Yang, Tong Zhang, Ping Li

Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore; Department of Mathematics, Rutgers University, and Department of Mathematics, University of Notre Dame, USA; Department of Statistics & Biostatistics, Rutgers University, USA, and Big Data Lab, Baidu Research, China; Department of Statistics & Biostatistics and Department of Computer Science, Rutgers University, and Baidu Research, USA

Abstract

The Alternating Direction Method of Multipliers (ADMM) has been studied for years. Traditional ADMM algorithms need to compute, at each iteration, an (empirical) expected loss function on all training examples, resulting in a computational complexity proportional to the number of training examples. To reduce the complexity, stochastic ADMM algorithms were proposed to replace the expected loss function with a random loss function associated with one uniformly drawn example plus a Bregman divergence term. The Bregman divergence, however, is derived from a simple second-order proximal function, i.e., the half squared norm, which could be a suboptimal choice. In this paper, we present a new family of stochastic ADMM algorithms with optimal second-order proximal functions, which produce a new family of adaptive stochastic ADMM methods. We theoretically prove that the regret bounds are as good as the bounds which could be achieved by the best proximal function that can be chosen in hindsight. Encouraging empirical results on a variety of real-world datasets confirm the effectiveness and efficiency of the proposed algorithms.

1. Introduction

Originally introduced in (Glowinski & Marroco, 1975; Gabay & Mercier, 1976), the offline/batch Alternating Direction Method of Multipliers (ADMM) stemmed from the augmented Lagrangian method, with its global convergence property established in (Gabay, 1983; Glowinski & Le Tallec, 1989; Eckstein & Bertsekas, 1992). Recent studies have shown that ADMM achieves a convergence rate of O(1/T) (Monteiro & Svaiter, 2013; He & Yuan, 2012), where T is the number of iterations of ADMM, when the objective function is generally convex. Furthermore, ADMM enjoys a convergence rate of O(α^T), for some α ∈ (0, 1), when the objective function is strongly convex and smooth (Luo, 2012; Deng & Yin, 2012). ADMM has shown attractive performance in a wide range of real-world problems such as compressed sensing (Yang & Zhang, 2011), image restoration (Goldstein & Osher, 2009), video processing, and matrix completion (Goldfarb et al., 2013).

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

From the computational perspective, one drawback of ADMM is that, at every iteration, the method needs to compute an (empirical) expected loss function on all the training examples. The computational complexity is proportional to the number of training examples, which makes the original ADMM unsuitable for solving large-scale learning and big data mining problems. The online ADMM (OADMM) algorithm (Wang & Banerjee, 2012) was proposed to tackle this computational challenge. For OADMM, the objective function is replaced with an online function at every step, which only depends on a single training example. OADMM can achieve an average regret bound of O(1/sqrt(T)) for convex objective functions and O(log(T)/T) for strongly convex objective functions. Interestingly, although the optimization of the loss function is assumed to be easy in the analysis of (Wang & Banerjee, 2012), this step is actually not necessarily easy in practice. To address this issue, the stochastic ADMM algorithm was proposed, by linearizing the online loss function (Ouyang et al., 2013; Suzuki, 2013).
In stochastic ADMM algorithms, the online loss function is first uniformly drawn from all the loss functions associated with all the training examples. Then the loss function is replaced with its first-order expansion at the current solution, plus the Bregman divergence from the current solution.

The Bregman divergence is based on a simple proximal function, the half squared norm, so that the Bregman divergence is the half squared distance. In this way, the optimization of the loss function enjoys a closed-form solution. The stochastic ADMM achieves convergence rates similar to OADMM. Using the half squared norm as the proximal function, however, may be a suboptimal choice. Our paper will address this issue. We should mention that other strategies have recently been adopted to accelerate stochastic ADMM, including stochastic average gradient (Zhong & Kwok, 2014) and dual coordinate ascent (Suzuki, 2014), which are still based on the half squared distance.

Our contribution. In the previous work (Ouyang et al., 2013; Suzuki, 2013), the Bregman divergence is derived from a simple second-order function, i.e., the half squared norm, which could be a suboptimal choice (Duchi et al., 2011). In this paper, we present a new family of stochastic ADMM algorithms with adaptive proximal functions, which can accelerate stochastic ADMM by using adaptive regularization. We theoretically prove that the regret bounds of our methods are as good as those achieved by stochastic ADMM with the best proximal function that can be chosen in hindsight. The effectiveness and efficiency of the proposed algorithms are confirmed by encouraging empirical evaluations on several real-world datasets.

2. Adaptive Stochastic Alternating Direction Method of Multipliers

2.1. Problem Formulation

In this paper, we study a family of convex optimization problems with composite objective functions. Specifically, we are interested in the following equality-constrained optimization task:

  min_{w ∈ W, v ∈ V}  f(w, v) := E_ξ l(w, ξ) + φ(v),   s.t.  Aw + Bv = b,   (1)

where w ∈ R^{d_1}, v ∈ R^{d_2}, A ∈ R^{m×d_1}, B ∈ R^{m×d_2}, b ∈ R^m, and W and V are convex sets. For simplicity, the notation l is used for both the instance function value l(w, ξ) and its expectation l(w) = E_ξ l(w, ξ). It is assumed that a sequence of independent and identically distributed (i.i.d.) observations can be drawn from the random vector ξ, which follows a fixed but unknown distribution. When ξ is deterministic, the above optimization becomes the traditional formulation of ADMM (Boyd et al., 2011). In this paper, we assume the functions l and φ are convex but not necessarily continuously differentiable. In addition, we denote the optimal solution of (1) as (w*, v*).

Before presenting the proposed algorithm, we first introduce some notation. For a positive definite matrix G ∈ R^{d_1×d_1}, we define the G-norm of a vector w as ||w||_G := sqrt(w^T G w). When there is no ambiguity, we often use ||·|| to denote the Euclidean norm. We use ⟨·,·⟩ to denote the inner product in a finite dimensional Euclidean space. Let H_t be a positive definite matrix for t ∈ N. Set the proximal function ϕ_t(·) as ϕ_t(w) = (1/2)||w||^2_{H_t} = (1/2)⟨w, H_t w⟩. Then the corresponding Bregman divergence for ϕ_t(w) is defined as

  B_{ϕ_t}(w, u) = ϕ_t(w) − ϕ_t(u) − ⟨∇ϕ_t(u), w − u⟩ = (1/2)||w − u||^2_{H_t}.

2.2. Algorithm

To solve problem (1), a popular method is the Alternating Direction Method of Multipliers (ADMM). ADMM splits the optimization with respect to w and v by minimizing the augmented Lagrangian:

  min_{w,v}  L_β(w, v, θ) := l(w) + φ(v) − ⟨θ, Aw + Bv − b⟩ + (β/2)||Aw + Bv − b||^2,

where β > 0 is a pre-defined penalty. Specifically, the ADMM algorithm minimizes L_β as follows:

  w_{t+1} = argmin_w L_β(w, v_t, θ_t),
  v_{t+1} = argmin_v L_β(w_{t+1}, v, θ_t),
  θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b).

At each step, however, ADMM requires calculating the expectation E_ξ l(w, ξ), which may be computationally too expensive, since we may only have an unbiased estimate of l(w), or the expectation E_ξ l(w, ξ) is an empirical one for big data problems.
To address this issue, we propose to minimize the following stochastic approximation:

  L_{β,t}(w, v, θ) = ⟨g_t, w⟩ + φ(v) − ⟨θ, Aw + Bv − b⟩ + (β/2)||Aw + Bv − b||^2 + (1/η) B_{ϕ_t}(w, w_t),

where g_t = l'(w_t, ξ_t), and H_t for ϕ_t(w) = (1/2)||w||^2_{H_t} will be specified later. This objective linearizes l(w, ξ_t) and adopts a dynamic Bregman divergence function to keep the new model close to the previous one. It is easy to see that this approximation includes the one proposed by (Ouyang et al., 2013) as a special case when H_t = I. To minimize the above function, we follow the ADMM algorithm and optimize over w, v, and θ sequentially, fixing the others. In addition, we also need to update H_t for B_{ϕ_t} at every step, which will be specified later. Finally, the proposed adaptive stochastic ADMM (Ada-SADMM) is summarized in Algorithm 1.
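To make the scheme concrete before its formal statement, the following minimal Python sketch illustrates the generic update loop. The helper callables `stochastic_subgrad`, `update_H`, `solve_w_subproblem`, and `solve_v_subproblem` are hypothetical placeholders standing in for the subgradient oracle, the metric update, and the two subproblem solvers; this is a sketch of the loop structure, not an implementation from the paper.

```python
import numpy as np

def ada_sadmm(stochastic_subgrad, update_H, solve_w_subproblem, solve_v_subproblem,
              A, B, b, d1, d2, T, beta=1.0, eta=0.1, a=1.0):
    """Generic adaptive stochastic ADMM loop (sketch; helper callables are hypothetical)."""
    m = b.shape[0]
    w, v, theta = np.zeros(d1), np.zeros(d2), np.zeros(m)
    H = a * np.eye(d1)                                  # H_1 = a * I
    for t in range(1, T + 1):
        g = stochastic_subgrad(w, t)                    # g_t = l'(w_t, xi_t) for one drawn example
        H = update_H(H, g)                              # adaptive metric (Sections 2.4 / 2.5)
        # w-step: minimize <g, w> - <theta, Aw + Bv - b> + (beta/2)||Aw + Bv - b||^2
        #         + (1/eta) * B_{phi_t}(w, w_t), with v and theta fixed
        w = solve_w_subproblem(g, w, v, theta, H, beta, eta)
        # v-step: minimize phi(v) - <theta, Aw + Bv - b> + (beta/2)||Aw + Bv - b||^2
        v = solve_v_subproblem(w, theta, beta)
        theta = theta - beta * (A @ w + B @ v - b)      # dual update
    return w, v
```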

Algorithm 1 Adaptive Stochastic Alternating Direction Method of Multipliers (Ada-SADMM)
  Initialize: w_1 = 0, v_1 = 0, θ_1 = 0, H_1 = aI, and a > 0.
  for t = 1, 2, ..., T do
    Compute g_t = l'(w_t, ξ_t);
    Update H_t and compute B_{ϕ_t};
    w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
    v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
  end for

2.3. Analysis

This subsection is devoted to analyzing the expected convergence rate of the iterative solutions of the proposed algorithm for general H_t, t = 1, ..., T, where the proof techniques in (Ouyang et al., 2013) and (Duchi et al., 2011) are adopted. To accomplish this, we begin with a technical lemma, which will facilitate the later analysis.

Lemma 1. Let l(w, ξ_t) and φ(v) be convex functions and let H_t be positive definite, for t ≥ 1. For Algorithm 1, we have the following inequality:

  l(w_t) + φ(v_{t+1}) − l(w) − φ(v) + (z_{t+1} − z)^T F(z_{t+1})
   ≤ (η/2)||g_t||^2_{H_t^{-1}} + (1/η)[B_{ϕ_t}(w_t, w) − B_{ϕ_t}(w_{t+1}, w)]
     + (β/2)(||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + ⟨δ_t, w − w_t⟩
     + (1/(2β))(||θ − θ_t||^2 − ||θ − θ_{t+1}||^2),

where z_t = (w_t, v_t, θ_t), z = (w, v, θ), δ_t = g_t − l'(w_t), and F(z) = (−A^T θ, −B^T θ, Aw + Bv − b).

The proof is in the appendix. We now analyze the convergence behavior of Algorithm 1 and provide an upper bound on the objective value and the feasibility violation.

Theorem 1. Let l(w, ξ_t) and φ(v) be convex functions and let H_t be positive definite, for t ≥ 1. For Algorithm 1, we have the following inequality for any T ≥ 1 and ρ > 0:

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) { E[ Σ_{t=1}^T ( (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} ) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) },   (2)

where ū_T = ( (1/T)Σ_{t=1}^T w_t, (1/T)Σ_{t=1}^T v_{t+1} ), u* = (w*, v*), (w̄_T, v̄_T, θ̄_T) = (1/T)Σ_{t=1}^T (w_{t+1}, v_{t+1}, θ_{t+1}), and D_{v*,b} = ||Aw* + Bv_1 − b|| = ||B(v_1 − v*)||.

Proof. For convenience, we denote u = (w, v) and z̄_T = (w̄_T, v̄_T, θ̄_T). With these notations, using the convexity of l(w) and φ(v) and the monotonicity of the operator F(·), we have for any z:

  f(ū_T) − f(u) + (z̄_T − z)^T F(z̄_T)
   ≤ (1/T) Σ_{t=1}^T [ f((w_t, v_{t+1})) − f(u) + (z_{t+1} − z)^T F(z_{t+1}) ]
   = (1/T) Σ_{t=1}^T [ l(w_t) + φ(v_{t+1}) − l(w) − φ(v) + (z_{t+1} − z)^T F(z_{t+1}) ].

Combining this inequality with Lemma 1 at the optimal solution (w, v) = (w*, v*), we can derive

  f(ū_T) − f(u*) + (z̄_T − z)^T F(z̄_T)
   ≤ (1/T) Σ_{t=1}^T { (1/η)[B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)] + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩
       + (β/2)(||Aw* + Bv_t − b||^2 − ||Aw* + Bv_{t+1} − b||^2) + (1/(2β))(||θ − θ_t||^2 − ||θ − θ_{t+1}||^2) }
   ≤ (1/T) { Σ_{t=1}^T [ (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩ ]
       + (β/2)||Aw* + Bv_1 − b||^2 + (1/(2β))||θ − θ_1||^2 }
   ≤ (1/T) { Σ_{t=1}^T [ (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩ ]
       + (β/2) D^2_{v*,b} + (1/(2β))||θ||^2 }.

As the above inequality is valid for any θ, it also holds on the ball B_ρ = {θ : ||θ|| ≤ ρ}. Combining this with the fact that the optimal solution must also be feasible, it follows that

  max_{θ ∈ B_ρ} { f(ū_T) − f(u*) + (z̄_T − z)^T F(z̄_T) }
   = max_{θ ∈ B_ρ} { f(ū_T) − f(u*) + θ̄_T^T(Aw* + Bv* − b) − θ^T(Aw̄_T + Bv̄_T − b) }
   = max_{θ ∈ B_ρ} { f(ū_T) − f(u*) − θ^T(Aw̄_T + Bv̄_T − b) }
   = f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b||.

Utilizing the above two inequalities, we obtain

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) E{ Σ_{t=1}^T [ (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩ ] + (β/2) D^2_{v*,b} + (1/(2β))||θ||^2 }
   ≤ (1/T) { E[ Σ_{t=1}^T ( (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} ) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) }.

Note that E[δ_t] = 0 was used in the last step. This completes the proof.

The above theorem allows us to derive regret bounds for a family of algorithms that iteratively modify the proximal functions ϕ_t in an attempt to lower the regret bounds. Since the rate of convergence still depends on H_t and η, we next choose an appropriate positive definite matrix H_t and constant η to optimize the rate of convergence.

2.4. Diagonal Matrix Proximal Functions

In this subsection, we restrict H_t to be a diagonal matrix, for two important reasons: (i) the diagonal matrix provides results that are easier to understand than those for a general matrix; (ii) for high-dimensional problems a general matrix may result in prohibitively expensive computational cost, which is not desirable.

First, we notice that the upper bound in Theorem 1 relies on Σ_{t=1}^T ||g_t||^2_{H_t^{-1}}. If we assumed all the g_t's were known in advance, we could minimize this term by setting H_t = diag(s) for all t. We shall use the following proposition.

Proposition 1. For any g_1, g_2, ..., g_T ∈ R^{d_1}, we have

  min_{s ⪰ 0, ⟨s, 1⟩ ≤ c} Σ_{t=1}^T ||g_t||^2_{diag(s)^{-1}} = (1/c) ( Σ_{i=1}^{d_1} ||g_{1:T,i}|| )^2,

where g_{1:T,i} = (g_{1,i}, ..., g_{T,i})^T and the minimum is attained at s_i = c ||g_{1:T,i}|| / Σ_{j=1}^{d_1} ||g_{1:T,j}||.

We omit the proof of this proposition, since it is easy to derive. Since we do not have all the g_t's in advance, we receive the stochastic (sub)gradients g_t sequentially instead. As a result, we propose to update H_t incrementally as H_t = aI + diag(s_t), where s_{t,i} = ||g_{1:t,i}|| and a ≥ 0. For these H_t's, we have the following inequality:

  Σ_{t=1}^T ||g_t||^2_{H_t^{-1}} = Σ_{t=1}^T ⟨g_t, (aI + diag(s_t))^{-1} g_t⟩ ≤ Σ_{t=1}^T ⟨g_t, diag(s_t)^{-1} g_t⟩ ≤ 2 Σ_{i=1}^{d_1} ||g_{1:T,i}||,   (3)

where the last inequality used Lemma 4 in (Duchi et al., 2011), which implies this update is a nearly optimal update method for the diagonal matrix case. Finally, the adaptive stochastic ADMM with diagonal matrix update (Ada-SADMM-diag) is summarized in Algorithm 2.

Algorithm 2 Adaptive Stochastic ADMM with Diagonal Matrix Update (Ada-SADMM-diag)
  Initialize: w_1 = 0, v_1 = 0, θ_1 = 0, and a > 0.
  for t = 1, 2, ..., T do
    Compute g_t = l'(w_t, ξ_t);
    Update H_t = aI + diag(s_t), where s_{t,i} = ||g_{1:t,i}||;
    w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
    v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
  end for

For the convergence rate of the proposed Algorithm 2, we have the following specific theorem.

Theorem 2. Let l(w, ξ_t) and φ(v) be convex functions for any t ≥ 1. Then for Algorithm 2, we have the following inequality for any T ≥ 1 and ρ > 0:

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) { E[ η Σ_{i=1}^{d_1} ||g_{1:T,i}|| + (1/(2η)) max_t ||w_t − w*||^2_∞ Σ_{i=1}^{d_1} ||g_{1:T,i}|| ] + (β/2) D^2_{v*,b} + ρ^2/(2β) }.

If we further set η = D_{w,∞}/sqrt(2), where D_{w,∞} = max_{w,w' ∈ W} ||w − w'||_∞, then we have

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ] ≤ (1/T) ( sqrt(2) E[ D_{w,∞} Σ_{i=1}^{d_1} ||g_{1:T,i}|| ] + (β/2) D^2_{v*,b} + ρ^2/(2β) ).

The proof is in the appendix.
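As a concrete illustration of the diagonal rule above, the sketch below maintains the per-coordinate accumulator s_{t,i} = ||g_{1:t,i}|| and returns the diagonal of H_t = aI + diag(s_t). The class name and interface are illustrative assumptions, not code from the paper.

```python
import numpy as np

class DiagonalMetric:
    """Maintains H_t = a*I + diag(s_t), with s_{t,i} = ||g_{1:t,i}||_2 (Ada-SADMM-diag)."""

    def __init__(self, dim, a=1.0):
        self.a = a
        self.grad_sq_sum = np.zeros(dim)     # running sum of squared gradient coordinates

    def update(self, g):
        self.grad_sq_sum += g * g            # accumulate g_{t,i}^2
        s = np.sqrt(self.grad_sq_sum)        # s_{t,i} = ||g_{1:t,i}||_2
        return self.a + s                    # diagonal of H_t, stored as a vector

    def bregman(self, w, u, h_diag):
        # B_{phi_t}(w, u) = 0.5 * ||w - u||_{H_t}^2 for the diagonal metric
        diff = w - u
        return 0.5 * np.dot(h_diag * diff, diff)
```

Because the metric is diagonal, forming and applying H_t costs only O(d_1) extra work per iteration, which is what keeps this variant practical in high dimensions; the full-matrix variant of the next subsection replaces diag(s_t) with the matrix square root of the accumulated outer products.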

2.5. Full Matrix Proximal Functions

In this subsection, we derive and analyze new updates in which we estimate a full matrix H_t for the proximal function instead of a diagonal one. Although full matrix computation may not be attractive for high-dimensional problems, it may be helpful for tasks with low dimensions. Furthermore, it provides us with a more complete insight. Similar to the analysis for the diagonal case, we first introduce the following proposition (Lemma 15 in (Duchi et al., 2011)).

Proposition 2. For any g_1, g_2, ..., g_T ∈ R^{d_1}, we have

  min_{S ⪰ 0, tr(S) ≤ c} Σ_{t=1}^T ||g_t||^2_{S^{-1}} = (1/c) tr(G_T^{1/2})^2,

where G_T = Σ_{t=1}^T g_t g_t^T and the minimizer is attained at S = c G_T^{1/2} / tr(G_T^{1/2}). If G_T is not of full rank, then we use its pseudo-inverse to replace its inverse in the minimization problem.

Because the (sub)gradients are received sequentially, we propose to update H_t incrementally as

  H_t = aI + G_t^{1/2},   G_t = Σ_{i=1}^t g_i g_i^T,   t = 1, 2, ....

For these H_t's, we have the following inequality:

  Σ_{t=1}^T ||g_t||^2_{H_t^{-1}} ≤ Σ_{t=1}^T ||g_t||^2_{S_t^{-1}} ≤ 2 tr(G_T^{1/2}),   (4)

where S_t = G_t^{1/2} and the last inequality again follows from (Duchi et al., 2011), which implies this update is a nearly optimal update method for the full matrix case. Finally, the adaptive stochastic ADMM with full matrix update (Ada-SADMM-full) is summarized in Algorithm 3.

Algorithm 3 Adaptive Stochastic ADMM with Full Matrix Update (Ada-SADMM-full)
  Initialize: w_1 = 0, v_1 = 0, θ_1 = 0, G_0 = 0, and a > 0.
  for t = 1, 2, ..., T do
    Compute g_t = l'(w_t, ξ_t);
    Update G_t = G_{t−1} + g_t g_t^T;
    Update H_t = aI + S_t, where S_t = G_t^{1/2};
    w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
    v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
  end for

For the convergence rate of the above proposed Algorithm 3, we have the following specific theorem.

Theorem 3. Let l(w, ξ_t) and φ(v) be convex functions for any t ≥ 1. Then for Algorithm 3, we have the following inequality for any T ≥ 1 and ρ > 0:

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) { E[ η tr(G_T^{1/2}) + (1/(2η)) max_t ||w* − w_t||^2 tr(G_T^{1/2}) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) }.

Furthermore, if we set η = D_{w,2}/sqrt(2), where D_{w,2} = max_{w,w' ∈ W} ||w − w'||, then we have

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ] ≤ (1/T) ( sqrt(2) E[ D_{w,2} tr(G_T^{1/2}) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) ).

The proof is in the appendix.

3. Experiments

In this section, we evaluate the empirical performance of the proposed adaptive stochastic ADMM algorithms for solving Graph-Guided SVM (GGSVM) tasks, which are formulated as the following problem (Ouyang et al., 2013):

  min_{w,v} (1/n) Σ_{i=1}^n [1 − y_i x_i^T w]_+ + (γ/2)||w||^2 + ν||v||_1,   s.t.  Fw − v = 0,

where [z]_+ = max(0, z) and the matrix F is constructed based on a graph G = {V, E}. For this graph, V = {w_1, ..., w_d} is the set of variables and E = {e_1, ..., e_{|E|}}, where each edge e_k = {i, j} is assigned a weight α_ij, and the corresponding row of F has the form F_{ki} = α_ij and F_{kj} = −α_ij. To construct a graph for a given dataset, we adopt sparse inverse covariance estimation (Friedman et al., 2008) and determine the sparsity pattern of the inverse covariance matrix Σ^{-1}. Based on the inverse covariance matrix, we connect all index pairs (i, j) with Σ^{-1}_{ij} ≠ 0 and assign α_ij = 1.
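The penalty matrix F is straightforward to assemble once the edge set has been read off the estimated inverse covariance; a small sketch is given below. The helper names are assumptions, and unit weights α_ij = 1 are used as in the setup described above.

```python
import numpy as np
from scipy.sparse import lil_matrix

def edges_from_precision(Sigma_inv, tol=1e-8):
    """Index pairs (i, j), i < j, with a nonzero entry in the estimated inverse covariance."""
    d = Sigma_inv.shape[0]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(Sigma_inv[i, j]) > tol]

def build_graph_guided_F(edges, d, alpha=1.0):
    """One row per edge e_k = {i, j}: F[k, i] = alpha_ij and F[k, j] = -alpha_ij."""
    F = lil_matrix((len(edges), d))
    for k, (i, j) in enumerate(edges):
        F[k, i] = alpha
        F[k, j] = -alpha
    return F.tocsr()
```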
3.1. Experimental Testbed and Setup

To examine the performance, we test all the algorithms on six real-world datasets from web machine learning repositories, which are listed in Table 1. The news dataset was downloaded from www.cs.nyu.edu/~roweis/data.html; all other datasets were downloaded from the LIBSVM website. For each dataset, we randomly divide it into two folds: a training set with 80% of the examples and a test set with the rest.

Table 1. Several real-world datasets in our experiments.

  Dataset     # examples   # features
  a9a         48,842       123
  mushrooms   8,124        112
  news        16,242       100
  splice      3,175        60
  svmguide3   1,284        21
  w8a         64,700       300

To make a fair comparison, all algorithms adopt the same experimental setup. In particular, we set the penalty parameters γ = ν = 1/n, where n is the number of training examples, and fix the trade-off parameter β. In addition, we set the step size parameter η_t = 1/(γt) for SADMM according to the theorem in (Ouyang et al., 2013). Finally, the smoothing parameter a is fixed, and the step sizes for the adaptive stochastic ADMM algorithms are searched over exponents in [−5 : 5] using cross validation. All experiments were conducted with 5 different random seeds and the same number of epochs (n iterations per epoch) for each dataset, and all results were reported by averaging over these 5 runs. We evaluate the learning performance by measuring objective values, i.e., f(u_t), and test error rates on the test datasets. In addition, we also evaluate the computational efficiency of all the algorithms by their running time. All experiments were run in Matlab on a machine with a 3.4GHz CPU.

3.2. Performance Evaluation

Figure 1 shows the performance of all the algorithms in comparison over the trials, from which we can draw several observations.

Figure 1. Comparison between SADMM, Ada-SADMM-diag (Ada-diag), and Ada-SADMM-full (Ada-full) on the 6 real-world datasets (a9a, mushrooms, news, splice, svmguide3, w8a). Epoch on the horizontal axis is the number of iterations divided by the dataset size. Left panels: average objective values. Middle panels: average test error rates. Right panels: average time costs (in seconds).

Firstly, the left column shows the objective values of the three algorithms. We can observe that the two adaptive stochastic ADMM algorithms converge much faster than SADMM, which shows the effectiveness of exploiting adaptive regularization (Bregman divergence) to accelerate stochastic ADMM. Secondly, compared with Ada-SADMM-diag, Ada-SADMM-full achieves slightly smaller objective values on most of the datasets, which indicates that the full matrix is slightly more informative than the diagonal one; this should be due to the low dimensionality of these datasets. Thirdly, the central column provides the test error rates of the three algorithms, where we observe that the two adaptive algorithms reach, within a fraction of the epochs, test error rates that are significantly smaller than or comparable to those of SADMM at the final epoch. This observation indicates that we can terminate the two adaptive algorithms earlier to save time and, at the same time, achieve performance similar to SADMM. Finally, the right column shows the running time of the three algorithms: during the learning process, Ada-SADMM-full is significantly slower, while Ada-SADMM-diag is overall efficient compared with SADMM. In summary, the Ada-SADMM-diag algorithm achieves a good trade-off between efficiency and effectiveness.

Table 2. Evaluation of the stochastic ADMM algorithms (SADMM, Ada-SADMM-diag, and Ada-SADMM-full) on the real-world datasets. For each dataset, the objective value, test error rate, and time (s) of each algorithm are reported, averaged over the 5 runs.

Table 2 summarizes the performance of all the compared algorithms over the six datasets, from which we can make similar observations. This again verifies the effectiveness of the proposed algorithms.

4. Conclusion

ADMM is a popular technique in machine learning. This paper studied a method to accelerate stochastic ADMM with adaptive regularization, by replacing the fixed proximal function with adaptive proximal functions. Compared with traditional stochastic ADMM, we show that the proposed adaptive algorithms converge significantly faster through the proposed adaptive strategies. Promising experimental results on a variety of real-world datasets further validate the effectiveness of our techniques.

Acknowledgments

The research of Peilin Zhao and Tong Zhang is partially supported by NSF-IIS-1407939 and NSF-IIS-1250985. The research of Jinwei Yang and Ping Li is partially supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137.

Appendix: Some Proofs

Proof of Lemma 1

Proof. Firstly, using the convexity of l and the definition of δ_t, we can obtain

  l(w_t) − l(w) ≤ ⟨l'(w_t), w_t − w⟩ = ⟨g_t, w_{t+1} − w⟩ + ⟨δ_t, w − w_t⟩ + ⟨g_t, w_t − w_{t+1}⟩.

Combining the above inequality with the relation between θ_t and θ_{t+1} yields

  l(w_t) − l(w) + ⟨w_{t+1} − w, −A^T θ_{t+1}⟩
   ≤ ⟨g_t, w_{t+1} − w⟩ + ⟨δ_t, w − w_t⟩ + ⟨g_t, w_t − w_{t+1}⟩ + ⟨w_{t+1} − w, A^T[β(Aw_{t+1} + Bv_{t+1} − b) − θ_t]⟩
   = ⟨g_t + A^T[β(Aw_{t+1} + Bv_t − b) − θ_t], w_{t+1} − w⟩  (=: L_t)
     + ⟨w − w_{t+1}, βA^T B(v_t − v_{t+1})⟩  (=: M_t)
     + ⟨δ_t, w − w_t⟩
     + ⟨g_t, w_t − w_{t+1}⟩  (=: N_t).


To provide an upper bound for the first term L_t, taking D(u, v) = B_{ϕ_t}(u, v) = (1/2)||u − v||^2_{H_t} and applying the corresponding lemma in (Ouyang et al., 2013) to the step of obtaining w_{t+1} in Algorithm 1, we have

  ⟨g_t + A^T[β(Aw_{t+1} + Bv_t − b) − θ_t], w_{t+1} − w⟩ ≤ (1/η)[B_{ϕ_t}(w_t, w) − B_{ϕ_t}(w_{t+1}, w) − B_{ϕ_t}(w_{t+1}, w_t)].

To provide an upper bound for the second term M_t, we can derive as follows:

  ⟨w − w_{t+1}, βA^T B(v_t − v_{t+1})⟩ = β⟨Aw − Aw_{t+1}, Bv_t − Bv_{t+1}⟩
   = (β/2)[ (||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + (||Aw_{t+1} + Bv_{t+1} − b||^2 − ||Aw_{t+1} + Bv_t − b||^2) ]
   ≤ (β/2)(||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + (1/(2β))||θ_{t+1} − θ_t||^2.

To derive an upper bound for the final term N_t, we can use Young's inequality to get

  ⟨g_t, w_t − w_{t+1}⟩ ≤ (η/2)||g_t||^2_{H_t^{-1}} + (1/(2η))||w_t − w_{t+1}||^2_{H_t} = (η/2)||g_t||^2_{H_t^{-1}} + (1/η) B_{ϕ_t}(w_t, w_{t+1}).

Replacing the terms L_t, M_t, and N_t with their upper bounds, we get

  l(w_t) − l(w) + ⟨w_{t+1} − w, −A^T θ_{t+1}⟩
   ≤ (1/η)[B_{ϕ_t}(w_t, w) − B_{ϕ_t}(w_{t+1}, w)] + (η/2)||g_t||^2_{H_t^{-1}}
     + (β/2)(||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + (1/(2β))||θ_{t+1} − θ_t||^2 + ⟨δ_t, w − w_t⟩.

Due to the optimality condition of the step of updating v in Algorithm 1, i.e., 0 ∈ ∂_v L_{β,t}(w_{t+1}, v_{t+1}, θ_t), and the convexity of φ, we have

  φ(v_{t+1}) − φ(v) + ⟨v_{t+1} − v, −B^T θ_{t+1}⟩ ≤ 0.

Using the fact that Aw_{t+1} + Bv_{t+1} − b = (θ_t − θ_{t+1})/β, we have

  ⟨θ_{t+1} − θ, Aw_{t+1} + Bv_{t+1} − b⟩ = (1/(2β))( ||θ − θ_t||^2 − ||θ − θ_{t+1}||^2 − ||θ_{t+1} − θ_t||^2 ).

Combining the above three inequalities and re-arranging the terms concludes the proof.

Proof of Theorem 2

Proof. We have the following inequality:

  2 Σ_{t=1}^T [B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)]
   = Σ_{t=1}^T ( ||w_t − w*||^2_{H_t} − ||w_{t+1} − w*||^2_{H_t} )
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ( ||w_{t+1} − w*||^2_{H_{t+1}} − ||w_{t+1} − w*||^2_{H_t} )
   = ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2_{H_{t+1} − H_t}
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} max_i (w_{t+1,i} − w*_i)^2 ⟨s_{t+1} − s_t, 1⟩
   = ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2_∞ ⟨s_{t+1} − s_t, 1⟩
   ≤ ||w_1 − w*||^2_{H_1} + max_t ||w_t − w*||^2_∞ ⟨s_T − s_1, 1⟩
   ≤ max_t ||w_t − w*||^2_∞ Σ_{i=1}^{d_1} ||g_{1:T,i}||,

where the last inequality used ⟨s_T, 1⟩ = Σ_{i=1}^{d_1} ||g_{1:T,i}|| and ||w_1 − w*||^2_{H_1} ≤ max_t ||w_t − w*||^2_∞ ⟨s_1, 1⟩. Plugging the above inequality and the inequality (3) into the inequality (2) concludes the first part of the theorem; the second part is then trivial to derive.

Proof of Theorem 3

Proof. We consider the sum of the differences:

  2 Σ_{t=1}^T [B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)]
   = Σ_{t=1}^T ( ||w_t − w*||^2_{H_t} − ||w_{t+1} − w*||^2_{H_t} )
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ( ||w_{t+1} − w*||^2_{H_{t+1}} − ||w_{t+1} − w*||^2_{H_t} )
   = ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2_{G_{t+1}^{1/2} − G_t^{1/2}}
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2 λ_max(G_{t+1}^{1/2} − G_t^{1/2})
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2 tr(G_{t+1}^{1/2} − G_t^{1/2})
   ≤ ||w_1 − w*||^2_{H_1} + max_t ||w_t − w*||^2 tr(G_T^{1/2} − G_1^{1/2})
   ≤ ||w_1 − w*||^2 tr(G_1^{1/2}) + max_t ||w_t − w*||^2 tr(G_T^{1/2} − G_1^{1/2})
   ≤ max_t ||w_t − w*||^2 tr(G_T^{1/2}).

Plugging the above inequality and the inequality (4) into the inequality (2) concludes the first part of the theorem; the second part is then trivial to derive.

References

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Deng, Wei and Yin, Wotao. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, DTIC Document, 2012.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Eckstein, Jonathan and Bertsekas, Dimitri P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Gabay, Daniel. Chapter IX: Applications of the method of multipliers to variational inequalities. Studies in Mathematics and its Applications, 15:299–331, 1983.

Gabay, Daniel and Mercier, Bertrand. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

Glowinski, Roland and Le Tallec, Patrick. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, volume 9. SIAM, 1989.

Glowinski, R. and Marroco, A. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique et Recherche Opérationnelle, 1975.

Goldfarb, Donald, Ma, Shiqian, and Scheinberg, Katya. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349–382, 2013.

Goldstein, Tom and Osher, Stanley. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.

He, Bingsheng and Yuan, Xiaoming. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

Luo, Zhi-Quan. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

Monteiro, Renato D. C. and Svaiter, Benar Fux. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.

Ouyang, Hua, He, Niao, Tran, Long, and Gray, Alexander G. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 80–88, 2013.

Suzuki, Taiji. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 392–400, 2013.

Suzuki, Taiji. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 736–744, 2014.

Wang, Huahua and Banerjee, Arindam. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1119–1126, 2012.

Yang, Junfeng and Zhang, Yin. Alternating direction algorithms for l1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250–278, 2011.

Zhong, Wenliang and Kwok, James Tin-yau. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 46–54, 2014.


More information

A Primal-dual Three-operator Splitting Scheme

A Primal-dual Three-operator Splitting Scheme Noname manuscript No. (will be inserted by the editor) A Primal-dual Three-operator Splitting Scheme Ming Yan Received: date / Accepted: date Abstract In this paper, we propose a new primal-dual algorithm

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

ADMM and Accelerated ADMM as Continuous Dynamical Systems

ADMM and Accelerated ADMM as Continuous Dynamical Systems Guilherme França 1 Daniel P. Robinson 1 René Vidal 1 Abstract Recently, there has been an increasing interest in using tools from dynamical systems to analyze the behavior of simple optimization algorithms

More information

On the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities

On the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities STATISTICS,OPTIMIZATION AND INFORMATION COMPUTING Stat., Optim. Inf. Comput., Vol. 2, June 204, pp 05 3. Published online in International Academic Press (www.iapress.org) On the Convergence and O(/N)

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information