Adaptive Stochastic Alternating Direction Method of Multipliers


Peilin Zhao, Jinwei Yang, Tong Zhang, Ping Li

Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore; Department of Mathematics, Rutgers University, and Department of Mathematics, University of Notre Dame, USA; Department of Statistics & Biostatistics, Rutgers University, USA, and Big Data Lab, Baidu Research, China; Department of Statistics & Biostatistics and Department of Computer Science, Rutgers University, and Baidu Research, USA

Abstract

The Alternating Direction Method of Multipliers (ADMM) has been studied for years. Traditional ADMM algorithms need to compute, at each iteration, an (empirical) expected loss function on all training examples, resulting in a computational complexity proportional to the number of training examples. To reduce the complexity, stochastic ADMM algorithms were proposed to replace the expected loss function with a random loss function associated with one uniformly drawn example plus a Bregman divergence term. The Bregman divergence, however, is derived from a simple second-order proximal function, i.e., the half squared norm, which could be a suboptimal choice. In this paper, we present a new family of stochastic ADMM algorithms with optimal second-order proximal functions, which produce a new family of adaptive stochastic ADMM methods. We theoretically prove that the regret bounds are as good as the bounds which could be achieved by the best proximal function that can be chosen in hindsight. Encouraging empirical results on a variety of real-world datasets confirm the effectiveness and efficiency of the proposed algorithms.

1. Introduction

Originally introduced in (Glowinski & Marroco, 1975; Gabay & Mercier, 1976), the offline/batch Alternating Direction Method of Multipliers (ADMM) stemmed from the augmented Lagrangian method, with its global convergence property established in (Gabay, 1983; Glowinski & Le Tallec, 1989; Eckstein & Bertsekas, 1992). Recent studies have shown that ADMM achieves a convergence rate of O(1/T) (Monteiro & Svaiter, 2013; He & Yuan, 2012), where T is the number of iterations of ADMM, when the objective function is generally convex. Furthermore, ADMM enjoys a convergence rate of O(α^T), for some α ∈ (0, 1), when the objective function is strongly convex and smooth (Luo, 2012; Deng & Yin, 2012). ADMM has shown attractive performance in a wide range of real-world problems such as compressed sensing (Yang & Zhang, 2011), image restoration (Goldstein & Osher, 2009), video processing, and matrix completion (Goldfarb et al., 2013).

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

From the computational perspective, one drawback of ADMM is that, at every iteration, the method needs to compute an (empirical) expected loss function on all the training examples. The computational complexity is proportional to the number of training examples, which makes the original ADMM unsuitable for solving large-scale learning and big data mining problems. The online ADMM (OADMM) algorithm (Wang & Banerjee, 2012) was proposed to tackle this computational challenge. For OADMM, the objective function is replaced with an online function at every step, which only depends on a single training example. OADMM can achieve an average regret bound of O(1/sqrt(T)) for convex objective functions and O(log(T)/T) for strongly convex objective functions. Interestingly, although the optimization of the loss function is assumed to be easy in the analysis of (Wang & Banerjee, 2012), this step is actually not necessarily easy in practice. To address this issue, the stochastic ADMM algorithm was proposed, by linearizing the online loss function (Ouyang et al., 2013; Suzuki, 2013).
In stochastic ADMM algorithms, the online loss function is first uniformly drawn from all the loss functions associated with all the training examples. Then the loss function is replaced with its first-order expansion at the current solution, plus the Bregman divergence from the current solution.

The Bregman divergence is based on a simple proximal function, the half squared norm, so that the Bregman divergence is the half squared distance. In this way, the optimization of the loss function enjoys a closed-form solution. The stochastic ADMM achieves convergence rates similar to OADMM. Using the half squared norm as the proximal function, however, may be a suboptimal choice. Our paper will address this issue. We should mention that other strategies have recently been adopted to accelerate stochastic ADMM, including stochastic average gradient (Zhong & Kwok, 2014) and dual coordinate ascent (Suzuki, 2014), which are still based on the half squared distance.

Our contribution. In the previous work (Ouyang et al., 2013; Suzuki, 2013), the Bregman divergence is derived from a simple second-order function, i.e., the half squared norm, which could be a suboptimal choice (Duchi et al., 2011). In this paper, we present a new family of stochastic ADMM algorithms with adaptive proximal functions, which can accelerate stochastic ADMM by using adaptive regularization. We theoretically prove that the regret bounds of our methods are as good as those achieved by stochastic ADMM with the best proximal function that can be chosen in hindsight. The effectiveness and efficiency of the proposed algorithms are confirmed by encouraging empirical evaluations on several real-world datasets.

2. Adaptive Stochastic Alternating Direction Method of Multipliers

2.1. Problem Formulation

In this paper, we study a family of convex optimization problems with composite objective functions. Specifically, we are interested in the following equality-constrained optimization task:

  min_{w ∈ W, v ∈ V}  f(w, v) := E_ξ l(w, ξ) + φ(v),   s.t.  Aw + Bv = b,   (1)

where w ∈ R^{d_1}, v ∈ R^{d_2}, A ∈ R^{m×d_1}, B ∈ R^{m×d_2}, b ∈ R^m, and W and V are convex sets. For simplicity, the notation l is used for both the instance function value l(w, ξ) and its expectation l(w) = E_ξ l(w, ξ). It is assumed that a sequence of independent and identically distributed (i.i.d.) observations can be drawn from the random vector ξ, which follows a fixed but unknown distribution. When ξ is deterministic, the above optimization becomes the traditional formulation of ADMM (Boyd et al., 2011). In this paper, we assume the functions l and φ are convex but not necessarily continuously differentiable. In addition, we denote the optimal solution of (1) as (w*, v*).

Before presenting the proposed algorithm, we first introduce some notation. For a positive definite matrix G ∈ R^{d_1×d_1}, we define the G-norm of a vector w as ||w||_G := sqrt(w^T G w). When there is no ambiguity, we often use ||·|| to denote the Euclidean norm. We use ⟨·,·⟩ to denote the inner product in a finite dimensional Euclidean space. Let H_t be a positive definite matrix for t ∈ N. Set the proximal function ϕ_t(·) as ϕ_t(w) = (1/2)||w||^2_{H_t} = (1/2)⟨w, H_t w⟩. Then the corresponding Bregman divergence for ϕ_t(w) is defined as

  B_{ϕ_t}(w, u) = ϕ_t(w) − ϕ_t(u) − ⟨∇ϕ_t(u), w − u⟩ = (1/2)||w − u||^2_{H_t}.

2.2. Algorithm

To solve problem (1), a popular method is the Alternating Direction Method of Multipliers (ADMM). ADMM splits the optimization with respect to w and v by minimizing the augmented Lagrangian:

  min_{w,v}  L_β(w, v, θ) := l(w) + φ(v) − ⟨θ, Aw + Bv − b⟩ + (β/2)||Aw + Bv − b||^2,

where β > 0 is a pre-defined penalty. Specifically, the ADMM algorithm minimizes L_β as follows:

  w_{t+1} = argmin_w L_β(w, v_t, θ_t),
  v_{t+1} = argmin_v L_β(w_{t+1}, v, θ_t),
  θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b).

At each step, however, ADMM requires calculating the expectation E_ξ l(w, ξ), which may be computationally too expensive, since we may only have an unbiased estimate of l(w), or the expectation E_ξ l(w, ξ) is an empirical one for big data problems.
To address this issue, we propose to minimize the following stochastic approximation:

  L_{β,t}(w, v, θ) = ⟨g_t, w⟩ + φ(v) − ⟨θ, Aw + Bv − b⟩ + (β/2)||Aw + Bv − b||^2 + (1/η) B_{ϕ_t}(w, w_t),

where g_t = l'(w_t, ξ_t), and H_t for ϕ_t(w) = (1/2)||w||^2_{H_t} will be specified later. This objective linearizes l(w, ξ_t) and adopts a dynamic Bregman divergence function to keep the new model close to the previous one. It is easy to see that this approximation includes the one proposed by (Ouyang et al., 2013) as a special case when H_t = I. To minimize the above function, we follow the ADMM algorithm and optimize over w, v, and θ sequentially, fixing the others. In addition, we also need to update H_t for B_{ϕ_t} at every step, which will be specified later. Finally, the proposed adaptive stochastic ADMM (Ada-SADMM) is summarized in Algorithm 1.
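To make the scheme concrete before its formal statement, the following minimal Python sketch illustrates the generic update loop. The helper callables `stochastic_subgrad`, `update_H`, `solve_w_subproblem`, and `solve_v_subproblem` are hypothetical placeholders standing in for the subgradient oracle, the metric update, and the two subproblem solvers; this is a sketch of the loop structure, not an implementation from the paper.

```python
import numpy as np

def ada_sadmm(stochastic_subgrad, update_H, solve_w_subproblem, solve_v_subproblem,
              A, B, b, d1, d2, T, beta=1.0, eta=0.1, a=1.0):
    """Generic adaptive stochastic ADMM loop (sketch; helper callables are hypothetical)."""
    m = b.shape[0]
    w, v, theta = np.zeros(d1), np.zeros(d2), np.zeros(m)
    H = a * np.eye(d1)                                  # H_1 = a * I
    for t in range(1, T + 1):
        g = stochastic_subgrad(w, t)                    # g_t = l'(w_t, xi_t) for one drawn example
        H = update_H(H, g)                              # adaptive metric (Sections 2.4 / 2.5)
        # w-step: minimize <g, w> - <theta, Aw + Bv - b> + (beta/2)||Aw + Bv - b||^2
        #         + (1/eta) * B_{phi_t}(w, w_t), with v and theta fixed
        w = solve_w_subproblem(g, w, v, theta, H, beta, eta)
        # v-step: minimize phi(v) - <theta, Aw + Bv - b> + (beta/2)||Aw + Bv - b||^2
        v = solve_v_subproblem(w, theta, beta)
        theta = theta - beta * (A @ w + B @ v - b)      # dual update
    return w, v
```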

Algorithm 1 Adaptive Stochastic Alternating Direction Method of Multipliers (Ada-SADMM)
  Initialize: w_1 = 0, v_1 = 0, θ_1 = 0, H_1 = aI, and a > 0.
  for t = 1, 2, ..., T do
    Compute g_t = l'(w_t, ξ_t);
    Update H_t and compute B_{ϕ_t};
    w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
    v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
  end for

2.3. Analysis

This subsection is devoted to analyzing the expected convergence rate of the iterative solutions of the proposed algorithm for general H_t, t = 1, ..., T, where the proof techniques in (Ouyang et al., 2013) and (Duchi et al., 2011) are adopted. To accomplish this, we begin with a technical lemma, which will facilitate the later analysis.

Lemma 1. Let l(w, ξ_t) and φ(v) be convex functions and let H_t be positive definite, for t ≥ 1. For Algorithm 1, we have the following inequality:

  l(w_t) + φ(v_{t+1}) − l(w) − φ(v) + (z_{t+1} − z)^T F(z_{t+1})
   ≤ (η/2)||g_t||^2_{H_t^{-1}} + (1/η)[B_{ϕ_t}(w_t, w) − B_{ϕ_t}(w_{t+1}, w)]
     + (β/2)(||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + ⟨δ_t, w − w_t⟩
     + (1/(2β))(||θ − θ_t||^2 − ||θ − θ_{t+1}||^2),

where z_t = (w_t, v_t, θ_t), z = (w, v, θ), δ_t = g_t − l'(w_t), and F(z) = (−A^T θ, −B^T θ, Aw + Bv − b).

The proof is in the appendix. We now analyze the convergence behavior of Algorithm 1 and provide an upper bound on the objective value and the feasibility violation.

Theorem 1. Let l(w, ξ_t) and φ(v) be convex functions and let H_t be positive definite, for t ≥ 1. For Algorithm 1, we have the following inequality for any T ≥ 1 and ρ > 0:

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) { E[ Σ_{t=1}^T ( (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} ) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) },   (2)

where ū_T = ( (1/T)Σ_{t=1}^T w_t, (1/T)Σ_{t=1}^T v_{t+1} ), u* = (w*, v*), (w̄_T, v̄_T, θ̄_T) = (1/T)Σ_{t=1}^T (w_{t+1}, v_{t+1}, θ_{t+1}), and D_{v*,b} = ||Aw* + Bv_1 − b|| = ||B(v_1 − v*)||.

Proof. For convenience, we denote u = (w, v) and z̄_T = (w̄_T, v̄_T, θ̄_T). With these notations, using the convexity of l(w) and φ(v) and the monotonicity of the operator F(·), we have for any z:

  f(ū_T) − f(u) + (z̄_T − z)^T F(z̄_T)
   ≤ (1/T) Σ_{t=1}^T [ f((w_t, v_{t+1})) − f(u) + (z_{t+1} − z)^T F(z_{t+1}) ]
   = (1/T) Σ_{t=1}^T [ l(w_t) + φ(v_{t+1}) − l(w) − φ(v) + (z_{t+1} − z)^T F(z_{t+1}) ].

Combining this inequality with Lemma 1 at the optimal solution (w, v) = (w*, v*), we can derive

  f(ū_T) − f(u*) + (z̄_T − z)^T F(z̄_T)
   ≤ (1/T) Σ_{t=1}^T { (1/η)[B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)] + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩
       + (β/2)(||Aw* + Bv_t − b||^2 − ||Aw* + Bv_{t+1} − b||^2) + (1/(2β))(||θ − θ_t||^2 − ||θ − θ_{t+1}||^2) }
   ≤ (1/T) { Σ_{t=1}^T [ (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩ ]
       + (β/2)||Aw* + Bv_1 − b||^2 + (1/(2β))||θ − θ_1||^2 }
   ≤ (1/T) { Σ_{t=1}^T [ (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩ ]
       + (β/2) D^2_{v*,b} + (1/(2β))||θ||^2 }.

As the above inequality is valid for any θ, it also holds on the ball B_ρ = {θ : ||θ|| ≤ ρ}. Combining this with the fact that the optimal solution must also be feasible, it follows that

  max_{θ ∈ B_ρ} { f(ū_T) − f(u*) + (z̄_T − z)^T F(z̄_T) }
   = max_{θ ∈ B_ρ} { f(ū_T) − f(u*) + θ̄_T^T(Aw* + Bv* − b) − θ^T(Aw̄_T + Bv̄_T − b) }
   = max_{θ ∈ B_ρ} { f(ū_T) − f(u*) − θ^T(Aw̄_T + Bv̄_T − b) }
   = f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b||.

Utilizing the above two inequalities, we obtain

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) E{ Σ_{t=1}^T [ (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} + ⟨δ_t, w* − w_t⟩ ] + (β/2) D^2_{v*,b} + (1/(2β))||θ||^2 }
   ≤ (1/T) { E[ Σ_{t=1}^T ( (1/η)(B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)) + (η/2)||g_t||^2_{H_t^{-1}} ) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) }.

Note that E[δ_t] = 0 was used in the last step. This completes the proof.

The above theorem allows us to derive regret bounds for a family of algorithms that iteratively modify the proximal functions ϕ_t in an attempt to lower the regret bounds. Since the rate of convergence still depends on H_t and η, we next choose an appropriate positive definite matrix H_t and constant η to optimize the rate of convergence.

2.4. Diagonal Matrix Proximal Functions

In this subsection, we restrict H_t to be a diagonal matrix, for two important reasons: (i) the diagonal matrix provides results that are easier to understand than those for a general matrix; (ii) for high-dimensional problems a general matrix may result in prohibitively expensive computational cost, which is not desirable.

First, we notice that the upper bound in Theorem 1 relies on Σ_{t=1}^T ||g_t||^2_{H_t^{-1}}. If we assumed all the g_t's were known in advance, we could minimize this term by setting H_t = diag(s) for all t. We shall use the following proposition.

Proposition 1. For any g_1, g_2, ..., g_T ∈ R^{d_1}, we have

  min_{s ⪰ 0, ⟨s, 1⟩ ≤ c} Σ_{t=1}^T ||g_t||^2_{diag(s)^{-1}} = (1/c) ( Σ_{i=1}^{d_1} ||g_{1:T,i}|| )^2,

where g_{1:T,i} = (g_{1,i}, ..., g_{T,i})^T and the minimum is attained at s_i = c ||g_{1:T,i}|| / Σ_{j=1}^{d_1} ||g_{1:T,j}||.

We omit the proof of this proposition, since it is easy to derive. Since we do not have all the g_t's in advance, we receive the stochastic (sub)gradients g_t sequentially instead. As a result, we propose to update H_t incrementally as H_t = aI + diag(s_t), where s_{t,i} = ||g_{1:t,i}|| and a ≥ 0. For these H_t's, we have the following inequality:

  Σ_{t=1}^T ||g_t||^2_{H_t^{-1}} = Σ_{t=1}^T ⟨g_t, (aI + diag(s_t))^{-1} g_t⟩ ≤ Σ_{t=1}^T ⟨g_t, diag(s_t)^{-1} g_t⟩ ≤ 2 Σ_{i=1}^{d_1} ||g_{1:T,i}||,   (3)

where the last inequality used Lemma 4 in (Duchi et al., 2011), which implies this update is a nearly optimal update method for the diagonal matrix case. Finally, the adaptive stochastic ADMM with diagonal matrix update (Ada-SADMM-diag) is summarized in Algorithm 2.

Algorithm 2 Adaptive Stochastic ADMM with Diagonal Matrix Update (Ada-SADMM-diag)
  Initialize: w_1 = 0, v_1 = 0, θ_1 = 0, and a > 0.
  for t = 1, 2, ..., T do
    Compute g_t = l'(w_t, ξ_t);
    Update H_t = aI + diag(s_t), where s_{t,i} = ||g_{1:t,i}||;
    w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
    v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
  end for

For the convergence rate of the proposed Algorithm 2, we have the following specific theorem.

Theorem 2. Let l(w, ξ_t) and φ(v) be convex functions for any t ≥ 1. Then for Algorithm 2, we have the following inequality for any T ≥ 1 and ρ > 0:

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) { E[ η Σ_{i=1}^{d_1} ||g_{1:T,i}|| + (1/(2η)) max_t ||w_t − w*||^2_∞ Σ_{i=1}^{d_1} ||g_{1:T,i}|| ] + (β/2) D^2_{v*,b} + ρ^2/(2β) }.

If we further set η = D_{w,∞}/sqrt(2), where D_{w,∞} = max_{w,w' ∈ W} ||w − w'||_∞, then we have

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ] ≤ (1/T) ( sqrt(2) E[ D_{w,∞} Σ_{i=1}^{d_1} ||g_{1:T,i}|| ] + (β/2) D^2_{v*,b} + ρ^2/(2β) ).

The proof is in the appendix.
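As a concrete illustration of the diagonal rule above, the sketch below maintains the per-coordinate accumulator s_{t,i} = ||g_{1:t,i}|| and returns the diagonal of H_t = aI + diag(s_t). The class name and interface are illustrative assumptions, not code from the paper.

```python
import numpy as np

class DiagonalMetric:
    """Maintains H_t = a*I + diag(s_t), with s_{t,i} = ||g_{1:t,i}||_2 (Ada-SADMM-diag)."""

    def __init__(self, dim, a=1.0):
        self.a = a
        self.grad_sq_sum = np.zeros(dim)     # running sum of squared gradient coordinates

    def update(self, g):
        self.grad_sq_sum += g * g            # accumulate g_{t,i}^2
        s = np.sqrt(self.grad_sq_sum)        # s_{t,i} = ||g_{1:t,i}||_2
        return self.a + s                    # diagonal of H_t, stored as a vector

    def bregman(self, w, u, h_diag):
        # B_{phi_t}(w, u) = 0.5 * ||w - u||_{H_t}^2 for the diagonal metric
        diff = w - u
        return 0.5 * np.dot(h_diag * diff, diff)
```

Because the metric is diagonal, forming and applying H_t costs only O(d_1) extra work per iteration, which is what keeps this variant practical in high dimensions; the full-matrix variant of the next subsection replaces diag(s_t) with the matrix square root of the accumulated outer products.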

2.5. Full Matrix Proximal Functions

In this subsection, we derive and analyze new updates in which we estimate a full matrix H_t for the proximal function instead of a diagonal one. Although full matrix computation may not be attractive for high-dimensional problems, it may be helpful for tasks with low dimensions. Furthermore, it provides us with a more complete insight. Similar to the analysis for the diagonal case, we first introduce the following proposition (Lemma 15 in (Duchi et al., 2011)).

Proposition 2. For any g_1, g_2, ..., g_T ∈ R^{d_1}, we have

  min_{S ⪰ 0, tr(S) ≤ c} Σ_{t=1}^T ||g_t||^2_{S^{-1}} = (1/c) tr(G_T^{1/2})^2,

where G_T = Σ_{t=1}^T g_t g_t^T and the minimizer is attained at S = c G_T^{1/2} / tr(G_T^{1/2}). If G_T is not of full rank, then we use its pseudo-inverse to replace its inverse in the minimization problem.

Because the (sub)gradients are received sequentially, we propose to update H_t incrementally as

  H_t = aI + G_t^{1/2},   G_t = Σ_{i=1}^t g_i g_i^T,   t = 1, 2, ....

For these H_t's, we have the following inequality:

  Σ_{t=1}^T ||g_t||^2_{H_t^{-1}} ≤ Σ_{t=1}^T ||g_t||^2_{S_t^{-1}} ≤ 2 tr(G_T^{1/2}),   (4)

where S_t = G_t^{1/2} and the last inequality again follows from (Duchi et al., 2011), which implies this update is a nearly optimal update method for the full matrix case. Finally, the adaptive stochastic ADMM with full matrix update (Ada-SADMM-full) is summarized in Algorithm 3.

Algorithm 3 Adaptive Stochastic ADMM with Full Matrix Update (Ada-SADMM-full)
  Initialize: w_1 = 0, v_1 = 0, θ_1 = 0, G_0 = 0, and a > 0.
  for t = 1, 2, ..., T do
    Compute g_t = l'(w_t, ξ_t);
    Update G_t = G_{t−1} + g_t g_t^T;
    Update H_t = aI + S_t, where S_t = G_t^{1/2};
    w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
    v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
  end for

For the convergence rate of the above proposed Algorithm 3, we have the following specific theorem.

Theorem 3. Let l(w, ξ_t) and φ(v) be convex functions for any t ≥ 1. Then for Algorithm 3, we have the following inequality for any T ≥ 1 and ρ > 0:

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ]
   ≤ (1/T) { E[ η tr(G_T^{1/2}) + (1/(2η)) max_t ||w* − w_t||^2 tr(G_T^{1/2}) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) }.

Furthermore, if we set η = D_{w,2}/sqrt(2), where D_{w,2} = max_{w,w' ∈ W} ||w − w'||, then we have

  E[ f(ū_T) − f(u*) + ρ||Aw̄_T + Bv̄_T − b|| ] ≤ (1/T) ( sqrt(2) E[ D_{w,2} tr(G_T^{1/2}) ] + (β/2) D^2_{v*,b} + ρ^2/(2β) ).

The proof is in the appendix.

3. Experiments

In this section, we evaluate the empirical performance of the proposed adaptive stochastic ADMM algorithms for solving Graph-Guided SVM (GGSVM) tasks, which are formulated as the following problem (Ouyang et al., 2013):

  min_{w,v} (1/n) Σ_{i=1}^n [1 − y_i x_i^T w]_+ + (γ/2)||w||^2 + ν||v||_1,   s.t.  Fw − v = 0,

where [z]_+ = max(0, z) and the matrix F is constructed based on a graph G = {V, E}. For this graph, V = {w_1, ..., w_d} is the set of variables and E = {e_1, ..., e_{|E|}}, where each edge e_k = {i, j} is assigned a weight α_ij, and the corresponding row of F has the form F_{ki} = α_ij and F_{kj} = −α_ij. To construct a graph for a given dataset, we adopt sparse inverse covariance estimation (Friedman et al., 2008) and determine the sparsity pattern of the inverse covariance matrix Σ^{-1}. Based on the inverse covariance matrix, we connect all index pairs (i, j) with Σ^{-1}_{ij} ≠ 0 and assign α_ij = 1.
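The penalty matrix F is straightforward to assemble once the edge set has been read off the estimated inverse covariance; a small sketch is given below. The helper names are assumptions, and unit weights α_ij = 1 are used as in the setup described above.

```python
import numpy as np
from scipy.sparse import lil_matrix

def edges_from_precision(Sigma_inv, tol=1e-8):
    """Index pairs (i, j), i < j, with a nonzero entry in the estimated inverse covariance."""
    d = Sigma_inv.shape[0]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(Sigma_inv[i, j]) > tol]

def build_graph_guided_F(edges, d, alpha=1.0):
    """One row per edge e_k = {i, j}: F[k, i] = alpha_ij and F[k, j] = -alpha_ij."""
    F = lil_matrix((len(edges), d))
    for k, (i, j) in enumerate(edges):
        F[k, i] = alpha
        F[k, j] = -alpha
    return F.tocsr()
```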
3.1. Experimental Testbed and Setup

To examine the performance, we test all the algorithms on six real-world datasets from web machine learning repositories, which are listed in Table 1. The news dataset was downloaded from www.cs.nyu.edu/~roweis/data.html; all other datasets were downloaded from the LIBSVM website. For each dataset, we randomly divide it into two folds: a training set with 80% of the examples and a test set with the rest.

Table 1. Several real-world datasets in our experiments.

  Dataset     # examples   # features
  a9a         48,842       123
  mushrooms   8,124        112
  news        16,242       100
  splice      3,175        60
  svmguide3   1,284        21
  w8a         64,700       300

To make a fair comparison, all algorithms adopt the same experimental setup. In particular, we set the penalty parameters γ = ν = 1/n, where n is the number of training examples, and fix the trade-off parameter β. In addition, we set the step size parameter η_t = 1/(γt) for SADMM according to the theorem in (Ouyang et al., 2013). Finally, the smoothing parameter a is fixed, and the step sizes for the adaptive stochastic ADMM algorithms are searched over exponents in [−5 : 5] using cross validation. All experiments were conducted with 5 different random seeds and the same number of epochs (n iterations per epoch) for each dataset, and all results were reported by averaging over these 5 runs. We evaluate the learning performance by measuring objective values, i.e., f(u_t), and test error rates on the test datasets. In addition, we also evaluate the computational efficiency of all the algorithms by their running time. All experiments were run in Matlab on a machine with a 3.4GHz CPU.

3.2. Performance Evaluation

Figure 1 shows the performance of all the algorithms in comparison over the trials, from which we can draw several observations.

Figure 1. Comparison between SADMM, Ada-SADMM-diag (Ada-diag), and Ada-SADMM-full (Ada-full) on the 6 real-world datasets (a9a, mushrooms, news, splice, svmguide3, w8a). Epoch on the horizontal axis is the number of iterations divided by the dataset size. Left panels: average objective values. Middle panels: average test error rates. Right panels: average time costs (in seconds).

Firstly, the left column shows the objective values of the three algorithms. We can observe that the two adaptive stochastic ADMM algorithms converge much faster than SADMM, which shows the effectiveness of exploiting adaptive regularization (Bregman divergence) to accelerate stochastic ADMM. Secondly, compared with Ada-SADMM-diag, Ada-SADMM-full achieves slightly smaller objective values on most of the datasets, which indicates that the full matrix is slightly more informative than the diagonal one; this should be due to the low dimensionality of these datasets. Thirdly, the central column provides the test error rates of the three algorithms, where we observe that the two adaptive algorithms reach, within a fraction of the epochs, test error rates that are significantly smaller than or comparable to those of SADMM at the final epoch. This observation indicates that we can terminate the two adaptive algorithms earlier to save time and, at the same time, achieve performance similar to SADMM. Finally, the right column shows the running time of the three algorithms: during the learning process, Ada-SADMM-full is significantly slower, while Ada-SADMM-diag is overall efficient compared with SADMM. In summary, the Ada-SADMM-diag algorithm achieves a good trade-off between efficiency and effectiveness.

Table 2. Evaluation of the stochastic ADMM algorithms (SADMM, Ada-SADMM-diag, and Ada-SADMM-full) on the real-world datasets. For each dataset, the objective value, test error rate, and time (s) of each algorithm are reported, averaged over the 5 runs.

Table 2 summarizes the performance of all the compared algorithms over the six datasets, from which we can make similar observations. This again verifies the effectiveness of the proposed algorithms.

4. Conclusion

ADMM is a popular technique in machine learning. This paper studied a method to accelerate stochastic ADMM with adaptive regularization, by replacing the fixed proximal function with adaptive proximal functions. Compared with traditional stochastic ADMM, we show that the proposed adaptive algorithms converge significantly faster through the proposed adaptive strategies. Promising experimental results on a variety of real-world datasets further validate the effectiveness of our techniques.

Acknowledgments

The research of Peilin Zhao and Tong Zhang is partially supported by NSF-IIS-1407939 and NSF-IIS-1250985. The research of Jinwei Yang and Ping Li is partially supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137.

Appendix: Some Proofs

Proof of Lemma 1

Proof. Firstly, using the convexity of l and the definition of δ_t, we can obtain

  l(w_t) − l(w) ≤ ⟨l'(w_t), w_t − w⟩ = ⟨g_t, w_{t+1} − w⟩ + ⟨δ_t, w − w_t⟩ + ⟨g_t, w_t − w_{t+1}⟩.

Combining the above inequality with the relation between θ_t and θ_{t+1} yields

  l(w_t) − l(w) + ⟨w_{t+1} − w, −A^T θ_{t+1}⟩
   ≤ ⟨g_t, w_{t+1} − w⟩ + ⟨δ_t, w − w_t⟩ + ⟨g_t, w_t − w_{t+1}⟩ + ⟨w_{t+1} − w, A^T[β(Aw_{t+1} + Bv_{t+1} − b) − θ_t]⟩
   = ⟨g_t + A^T[β(Aw_{t+1} + Bv_t − b) − θ_t], w_{t+1} − w⟩  (=: L_t)
     + ⟨w − w_{t+1}, βA^T B(v_t − v_{t+1})⟩  (=: M_t)
     + ⟨δ_t, w − w_t⟩
     + ⟨g_t, w_t − w_{t+1}⟩  (=: N_t).


To provide an upper bound for the first term L_t, taking D(u, v) = B_{ϕ_t}(u, v) = (1/2)||u − v||^2_{H_t} and applying the corresponding lemma in (Ouyang et al., 2013) to the step of obtaining w_{t+1} in Algorithm 1, we have

  ⟨g_t + A^T[β(Aw_{t+1} + Bv_t − b) − θ_t], w_{t+1} − w⟩ ≤ (1/η)[B_{ϕ_t}(w_t, w) − B_{ϕ_t}(w_{t+1}, w) − B_{ϕ_t}(w_{t+1}, w_t)].

To provide an upper bound for the second term M_t, we can derive as follows:

  ⟨w − w_{t+1}, βA^T B(v_t − v_{t+1})⟩ = β⟨Aw − Aw_{t+1}, Bv_t − Bv_{t+1}⟩
   = (β/2)[ (||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + (||Aw_{t+1} + Bv_{t+1} − b||^2 − ||Aw_{t+1} + Bv_t − b||^2) ]
   ≤ (β/2)(||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + (1/(2β))||θ_{t+1} − θ_t||^2.

To derive an upper bound for the final term N_t, we can use Young's inequality to get

  ⟨g_t, w_t − w_{t+1}⟩ ≤ (η/2)||g_t||^2_{H_t^{-1}} + (1/(2η))||w_t − w_{t+1}||^2_{H_t} = (η/2)||g_t||^2_{H_t^{-1}} + (1/η) B_{ϕ_t}(w_t, w_{t+1}).

Replacing the terms L_t, M_t, and N_t with their upper bounds, we get

  l(w_t) − l(w) + ⟨w_{t+1} − w, −A^T θ_{t+1}⟩
   ≤ (1/η)[B_{ϕ_t}(w_t, w) − B_{ϕ_t}(w_{t+1}, w)] + (η/2)||g_t||^2_{H_t^{-1}}
     + (β/2)(||Aw + Bv_t − b||^2 − ||Aw + Bv_{t+1} − b||^2) + (1/(2β))||θ_{t+1} − θ_t||^2 + ⟨δ_t, w − w_t⟩.

Due to the optimality condition of the step of updating v in Algorithm 1, i.e., 0 ∈ ∂_v L_{β,t}(w_{t+1}, v_{t+1}, θ_t), and the convexity of φ, we have

  φ(v_{t+1}) − φ(v) + ⟨v_{t+1} − v, −B^T θ_{t+1}⟩ ≤ 0.

Using the fact that Aw_{t+1} + Bv_{t+1} − b = (θ_t − θ_{t+1})/β, we have

  ⟨θ_{t+1} − θ, Aw_{t+1} + Bv_{t+1} − b⟩ = (1/(2β))( ||θ − θ_t||^2 − ||θ − θ_{t+1}||^2 − ||θ_{t+1} − θ_t||^2 ).

Combining the above three inequalities and re-arranging the terms concludes the proof.

Proof of Theorem 2

Proof. We have the following inequality:

  2 Σ_{t=1}^T [B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)]
   = Σ_{t=1}^T ( ||w_t − w*||^2_{H_t} − ||w_{t+1} − w*||^2_{H_t} )
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ( ||w_{t+1} − w*||^2_{H_{t+1}} − ||w_{t+1} − w*||^2_{H_t} )
   = ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2_{H_{t+1} − H_t}
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} max_i (w_{t+1,i} − w*_i)^2 ⟨s_{t+1} − s_t, 1⟩
   = ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2_∞ ⟨s_{t+1} − s_t, 1⟩
   ≤ ||w_1 − w*||^2_{H_1} + max_t ||w_t − w*||^2_∞ ⟨s_T − s_1, 1⟩
   ≤ max_t ||w_t − w*||^2_∞ Σ_{i=1}^{d_1} ||g_{1:T,i}||,

where the last inequality used ⟨s_T, 1⟩ = Σ_{i=1}^{d_1} ||g_{1:T,i}|| and ||w_1 − w*||^2_{H_1} ≤ max_t ||w_t − w*||^2_∞ ⟨s_1, 1⟩. Plugging the above inequality and the inequality (3) into the inequality (2) concludes the first part of the theorem; the second part is then trivial to derive.

Proof of Theorem 3

Proof. We consider the sum of the differences:

  2 Σ_{t=1}^T [B_{ϕ_t}(w_t, w*) − B_{ϕ_t}(w_{t+1}, w*)]
   = Σ_{t=1}^T ( ||w_t − w*||^2_{H_t} − ||w_{t+1} − w*||^2_{H_t} )
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ( ||w_{t+1} − w*||^2_{H_{t+1}} − ||w_{t+1} − w*||^2_{H_t} )
   = ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2_{G_{t+1}^{1/2} − G_t^{1/2}}
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2 λ_max(G_{t+1}^{1/2} − G_t^{1/2})
   ≤ ||w_1 − w*||^2_{H_1} + Σ_{t=1}^{T−1} ||w_{t+1} − w*||^2 tr(G_{t+1}^{1/2} − G_t^{1/2})
   ≤ ||w_1 − w*||^2_{H_1} + max_t ||w_t − w*||^2 tr(G_T^{1/2} − G_1^{1/2})
   ≤ ||w_1 − w*||^2 tr(G_1^{1/2}) + max_t ||w_t − w*||^2 tr(G_T^{1/2} − G_1^{1/2})
   ≤ max_t ||w_t − w*||^2 tr(G_T^{1/2}).

Plugging the above inequality and the inequality (4) into the inequality (2) concludes the first part of the theorem; the second part is then trivial to derive.

References

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Deng, Wei and Yin, Wotao. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, DTIC Document, 2012.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Eckstein, Jonathan and Bertsekas, Dimitri P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Gabay, Daniel. Chapter IX: Applications of the method of multipliers to variational inequalities. Studies in Mathematics and its Applications, 15:299–331, 1983.

Gabay, Daniel and Mercier, Bertrand. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

Glowinski, Roland and Le Tallec, Patrick. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, volume 9. SIAM, 1989.

Glowinski, R. and Marroco, A. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique et Recherche Opérationnelle, 1975.

Goldfarb, Donald, Ma, Shiqian, and Scheinberg, Katya. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349–382, 2013.

Goldstein, Tom and Osher, Stanley. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.

He, Bingsheng and Yuan, Xiaoming. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

Luo, Zhi-Quan. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

Monteiro, Renato D. C. and Svaiter, Benar Fux. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.

Ouyang, Hua, He, Niao, Tran, Long, and Gray, Alexander G. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 80–88, 2013.

Suzuki, Taiji. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 392–400, 2013.

Suzuki, Taiji. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 736–744, 2014.

Wang, Huahua and Banerjee, Arindam. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1119–1126, 2012.

Yang, Junfeng and Zhang, Yin. Alternating direction algorithms for l1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250–278, 2011.

Zhong, Wenliang and Kwok, James Tin-yau. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 46–54, 2014.


More information

A Primal-dual Three-operator Splitting Scheme

A Primal-dual Three-operator Splitting Scheme Noname manuscript No. (will be inserted by the editor) A Primal-dual Three-operator Splitting Scheme Ming Yan Received: date / Accepted: date Abstract In this paper, we propose a new primal-dual algorithm

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

ADMM and Accelerated ADMM as Continuous Dynamical Systems

ADMM and Accelerated ADMM as Continuous Dynamical Systems Guilherme França 1 Daniel P. Robinson 1 René Vidal 1 Abstract Recently, there has been an increasing interest in using tools from dynamical systems to analyze the behavior of simple optimization algorithms

More information

On the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities

On the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities STATISTICS,OPTIMIZATION AND INFORMATION COMPUTING Stat., Optim. Inf. Comput., Vol. 2, June 204, pp 05 3. Published online in International Academic Press (www.iapress.org) On the Convergence and O(/N)

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information