Adaptive Stochastic Alternating Direction Method of Multipliers
Peilin Zhao (Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore), Jinwei Yang (Department of Mathematics, Rutgers University; and Department of Mathematics, University of Notre Dame, USA), Tong Zhang (Department of Statistics & Biostatistics, Rutgers University, USA; and Big Data Lab, Baidu Research, China), Ping Li (Department of Statistics & Biostatistics, Department of Computer Science, Rutgers University; and Baidu Research, USA)

Abstract

The Alternating Direction Method of Multipliers (ADMM) has been studied for years. Traditional ADMM algorithms need to compute, at each iteration, an (empirical) expected loss function on all training examples, resulting in a computational complexity proportional to the number of training examples. To reduce the complexity, stochastic ADMM algorithms were proposed to replace the expected loss function with a random loss function associated with one uniformly drawn example plus a Bregman divergence term. The Bregman divergence, however, is derived from a simple 2nd-order proximal function, i.e., the half squared norm, which could be a suboptimal choice. In this paper, we present a new family of stochastic ADMM algorithms with optimal 2nd-order proximal functions, which produce a new family of adaptive stochastic ADMM methods. We theoretically prove that the regret bounds are as good as the bounds which could be achieved by the best proximal function that can be chosen in hindsight. Encouraging empirical results on a variety of real-world datasets confirm the effectiveness and efficiency of the proposed algorithms.

1. Introduction

Originally introduced in (Glowinski & Marroco, 1975; Gabay & Mercier, 1976), the offline/batch Alternating Direction Method of Multipliers (ADMM) stemmed from the augmented Lagrangian method, with its global convergence property established in (Gabay, 1983; Glowinski & Le Tallec, 1989; Eckstein & Bertsekas, 1992). Recent studies have shown that ADMM achieves a convergence rate of O(1/T) (Monteiro & Svaiter, 2013; He & Yuan, 2012), where T is the number of iterations of ADMM, when the objective function is generally convex. Furthermore, ADMM enjoys a convergence rate of O(α^T), for some α ∈ (0, 1), when the objective function is strongly convex and smooth (Luo, 2012; Deng & Yin, 2012). ADMM has shown attractive performance in a wide range of real-world problems, such as compressed sensing (Yang & Zhang, 2011), image restoration (Goldstein & Osher, 2009), video processing, and matrix completion (Goldfarb et al., 2013).

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

From the computational perspective, one drawback of ADMM is that, at every iteration, the method needs to compute an (empirical) expected loss function on all the training examples. The computational complexity is proportional to the number of training examples, which makes the original ADMM unsuitable for solving large-scale learning and big data mining problems. The online ADMM (OADMM) algorithm (Wang & Banerjee, 2012) was proposed to tackle this computational challenge. For OADMM, the objective function is replaced with an online function at every step, which only depends on a single training example. OADMM can achieve an average regret bound of O(1/√T) for convex objective functions and O(log(T)/T) for strongly convex objective functions. Interestingly, although the optimization of the loss function is assumed to be easy in the analysis of (Wang & Banerjee, 2012), this step is actually not necessarily easy in practice. To address this issue, the stochastic ADMM algorithm was proposed, by linearizing the online loss function (Ouyang et al., 2013; Suzuki, 2013). In stochastic ADMM algorithms, the online loss function is first uniformly drawn from all the loss functions associated with all the training examples. Then the
loss function is replaced with its first-order expansion at the current solution plus the Bregman divergence from the current solution. The Bregman divergence is based on a simple proximal function, the half squared norm, so that the Bregman divergence is the half squared distance. In this way, the optimization of the loss function enjoys a closed-form solution. The stochastic ADMM achieves similar convergence rates as OADMM. Using the half squared norm as proximal function, however, may be a suboptimal choice. Our paper will address this issue. We should mention that other strategies have recently been adopted to accelerate stochastic ADMM, including stochastic average gradient (Zhong & Kwok, 2014) and dual coordinate ascent (Suzuki, 2014), which are still based on the half squared distance.

Our contribution. In the previous work (Ouyang et al., 2013; Suzuki, 2013), the Bregman divergence is derived from a simple second-order function, i.e., the half squared norm, which could be a suboptimal choice (Duchi et al., 2011). In this paper, we present a new family of stochastic ADMM algorithms with adaptive proximal functions, which can accelerate stochastic ADMM by using adaptive regularization. We theoretically prove that the regret bounds of our methods are as good as those achieved by stochastic ADMM with the best proximal function that can be chosen in hindsight. The effectiveness and efficiency of the proposed algorithms are confirmed by encouraging empirical evaluations on several real-world datasets.

2. Adaptive Stochastic Alternating Direction Method of Multipliers

2.1. Problem Formulation

In this paper, we will study a family of convex optimization problems, where our objective functions are composite. Specifically, we are interested in the following equality-constrained optimization task:

    min_{w ∈ W, v ∈ V}  f(w, v) := E_ξ l(w, ξ) + φ(v),   s.t.  Aw + Bv = b,    (1)

where w ∈ R^{d₁}, v ∈ R^{d₂}, A ∈ R^{m×d₁}, B ∈ R^{m×d₂}, b ∈ R^m, and W and V are convex sets. For simplicity, the notation l is used for both the instance function value l(w, ξ) and its expectation l(w) = E_ξ l(w, ξ). It is assumed that a sequence of independent and identically distributed (i.i.d.) observations can be drawn from the random vector ξ, which follows a fixed but unknown distribution. When ξ is deterministic, the above optimization becomes the traditional formulation of ADMM (Boyd et al., 2011). In this paper, we will assume the functions l and φ are convex but not necessarily continuously differentiable. In addition, we denote the optimal solution of (1) as (w*, v*).

Before presenting the proposed algorithm, we first introduce some notation. For a positive definite matrix G ∈ R^{d₁×d₁}, we define the G-norm of a vector w as ‖w‖_G := √(wᵀGw). When there is no ambiguity, we often use ‖·‖ to denote the Euclidean norm. We use ⟨·, ·⟩ to denote the inner product in a finite-dimensional Euclidean space. Let H_t be a positive definite matrix for t ∈ N. Set the proximal function φ_t(·) as φ_t(w) = (1/2)‖w‖²_{H_t} = (1/2)⟨w, H_t w⟩. Then the corresponding Bregman divergence for φ_t(w) is defined as

    B_{φ_t}(w, u) = φ_t(w) − φ_t(u) − ⟨∇φ_t(u), w − u⟩ = (1/2)‖w − u‖²_{H_t}.

2.2. Algorithm

To solve problem (1), a popular method is the Alternating Direction Method of Multipliers (ADMM). ADMM splits the optimization with respect to w and v by minimizing the augmented Lagrangian:

    min_{w,v} L_β(w, v, θ) := l(w) + φ(v) − ⟨θ, Aw + Bv − b⟩ + (β/2)‖Aw + Bv − b‖²,

where β > 0 is a pre-defined penalty. Specifically, the ADMM algorithm minimizes L_β as follows:

    w_{t+1} = argmin_w L_β(w, v_t, θ_t),
    v_{t+1} = argmin_v L_β(w_{t+1}, v, θ_t),
    θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b).

At each step, however, ADMM requires calculating the expectation E_ξ l(w, ξ), which may be computationally too expensive, since we may only have an unbiased estimate of l(w), or the expectation E_ξ l(w, ξ) is an empirical one for big data problems. To solve this issue, we propose to minimize the following stochastic approximation:

    L_{β,t}(w, v, θ) = ⟨g_t, w⟩ + φ(v) − ⟨θ, Aw + Bv − b⟩ + (β/2)‖Aw + Bv − b‖² + (1/η) B_{φ_t}(w, w_t),

where g_t = l′(w_t, ξ_t), and H_t for φ_t(w) = (1/2)‖w‖²_{H_t} will be specified later. This objective linearizes l(w, ξ_t) and adopts a dynamic Bregman divergence function to
keep the new model close to the previous one. It is easy to see that this proposed approximation includes the one proposed by (Ouyang et al., 2013) as a special case when H_t = I. To minimize the above function, we follow the ADMM algorithm and optimize over w, v, θ sequentially, fixing the others. In addition, we also need to update H_t for B_{φ_t} at every step, which will be specified later. Finally, the proposed method (Ada-SADMM) is summarized in Algorithm 1.
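As a concrete illustration of the three batch ADMM updates above, the following sketch runs them on a hypothetical toy instance of problem (1): l(w) = (1/2)‖w − c‖² (deterministic), φ(v) = (λ/2)‖v‖², and constraint w − v = 0 (A = I, B = −I, b = 0), so that every subproblem has a closed form. The concrete loss and all parameter values are our own illustrative choices, not from the paper.

```python
import numpy as np

def admm_toy(c, lam=1.0, beta=1.0, iters=200):
    """Batch ADMM for min 0.5||w-c||^2 + 0.5*lam*||v||^2 s.t. w - v = 0."""
    d = c.shape[0]
    w, v, theta = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(iters):
        # w-step: argmin_w 0.5||w-c||^2 - <theta, w - v> + (beta/2)||w - v||^2
        w = (c + theta + beta * v) / (1.0 + beta)
        # v-step: argmin_v 0.5*lam*||v||^2 + <theta, v> + (beta/2)||w - v||^2
        v = (beta * w - theta) / (lam + beta)
        # dual update: theta <- theta - beta*(A w + B v - b) with A=I, B=-I, b=0
        theta = theta - beta * (w - v)
    return w, v
```

At the fixed point the constraint forces w = v, and the iterates solve min_w (1/2)‖w − c‖² + (λ/2)‖w‖², whose solution is w = c/(1 + λ).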
Algorithm 1: Adaptive Stochastic Alternating Direction Method of Multipliers (Ada-SADMM)
Initialize: w₁ = 0, v₁ = 0, θ₁ = 0, H₁ = aI, with a > 0.
for t = 1, 2, ..., T do
  Compute g_t = l′(w_t, ξ_t);
  Update H_t and compute B_{φ_t};
  w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
  v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
  θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
end for

2.3. Analysis

This subsection is devoted to analyzing the expected convergence rate of the iterative solutions of the proposed algorithm for general H_t, t = 1, ..., T, where the proof techniques in (Ouyang et al., 2013) and (Duchi et al., 2011) are adopted. To accomplish this, we begin with a technical lemma, which will facilitate the later analysis.

Lemma 1. Let l(w, ξ_t) and φ(w) be convex functions and let H_t be positive definite, for t ≥ 1. For Algorithm 1, we have the following inequality:

    l(w_t) + φ(v_{t+1}) − l(w) − φ(v) + (z_{t+1} − z)ᵀ F(z_{t+1})
    ≤ (η/2)‖g_t‖²_{H_t⁻¹} + (1/η)[B_{φ_t}(w_t, w) − B_{φ_t}(w_{t+1}, w)]
      + (β/2)(‖Aw + Bv_t − b‖² − ‖Aw + Bv_{t+1} − b‖²)
      + ⟨δ_t, w − w_t⟩ + (1/(2β))(‖θ − θ_t‖² − ‖θ − θ_{t+1}‖²),

where z_t = (w_t, v_t, θ_t), z = (w, v, θ), δ_t = g_t − l′(w_t), and F(z) = (−Aᵀθ, −Bᵀθ, Aw + Bv − b). The proof is in the appendix.

We now analyze the convergence behavior of Algorithm 1 and provide an upper bound on the objective value and the feasibility violation.

Theorem 1. Let l(w, ξ_t) and φ(w) be convex functions and let H_t be positive definite, for t ≥ 1. For Algorithm 1, we have the following inequality for any T and any ρ > 0:

    E[f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖]
    ≤ (1/T) E{ Σ_{t=1}^T [ (1/η)(B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)) + (η/2)‖g_t‖²_{H_t⁻¹} ]
      + (β/2) D²_{v*,B} + ρ²/(2β) },    (2)

where w̄_T = (1/T) Σ_{t=1}^T w_t, v̄_T = (1/T) Σ_{t=1}^T v_{t+1}, ū_T = (w̄_T, v̄_T), u* = (w*, v*), and D_{v*,B} = ‖B(v₁ − v*)‖.

Proof. For convenience, we denote u = (w, v), θ̄_T = (1/T) Σ_{t=1}^T θ_{t+1}, and z̄_T = (w̄_T, v̄_T, θ̄_T). With these notations, using the convexity of l(w) and φ(v) and the monotonicity of the operator F(·), we have for any z:

    f(ū_T) − f(u) + (z̄_T − z)ᵀ F(z̄_T)
    ≤ (1/T) Σ_{t=1}^T [ f(w_t, v_{t+1}) − f(u) + (z_{t+1} − z)ᵀ F(z_{t+1}) ]
    = (1/T) Σ_{t=1}^T [ l(w_t) + φ(v_{t+1}) − l(w) − φ(v) + (z_{t+1} − z)ᵀ F(z_{t+1}) ].

Combining this inequality with Lemma 1 at the optimal solution (w, v) = (w*, v*), we can derive

    f(ū_T) − f(u*) + (z̄_T − z)ᵀ F(z̄_T)
    ≤ (1/T) Σ_{t=1}^T { (1/η)[B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)] + (η/2)‖g_t‖²_{H_t⁻¹} + ⟨δ_t, w* − w_t⟩
      + (β/2)(‖Aw* + Bv_t − b‖² − ‖Aw* + Bv_{t+1} − b‖²) + (1/(2β))(‖θ − θ_t‖² − ‖θ − θ_{t+1}‖²) }
    ≤ (1/T) { Σ_{t=1}^T [ (1/η)(B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)) + (η/2)‖g_t‖²_{H_t⁻¹} + ⟨δ_t, w* − w_t⟩ ]
      + (β/2)‖Aw* + Bv₁ − b‖² + (1/(2β))‖θ − θ₁‖² }
    ≤ (1/T) { Σ_{t=1}^T [ (1/η)(B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)) + (η/2)‖g_t‖²_{H_t⁻¹} + ⟨δ_t, w* − w_t⟩ ]
      + (β/2) D²_{v*,B} + (1/(2β))‖θ‖² },

where the second inequality telescopes the two difference sums, and the last step uses Aw* + Bv* = b and θ₁ = 0. As the above inequality is valid for any θ, it also holds for the maximum over the ball B_ρ = {θ : ‖θ‖ ≤ ρ}. Combining this with the fact that the optimal solution must also be feasible, it follows that

    max_{θ ∈ B_ρ} { f(ū_T) − f(u*) + (z̄_T − z)ᵀ F(z̄_T) }
    = max_{θ ∈ B_ρ} { f(ū_T) − f(u*) + θ̄_Tᵀ(Aw* + Bv* − b) − θᵀ(A w̄_T + B v̄_T − b) }
    = max_{θ ∈ B_ρ} { f(ū_T) − f(u*) − θᵀ(A w̄_T + B v̄_T − b) }
    = f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖.

Utilizing the above two inequalities and taking expectation, we obtain

    E[f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖]
    ≤ (1/T) E{ Σ_{t=1}^T [ (1/η)(B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)) + (η/2)‖g_t‖²_{H_t⁻¹} ]
      + (β/2) D²_{v*,B} + ρ²/(2β) }.

Note that E[δ_t] = 0 in the last step. This completes the proof.
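When W = R^{d₁}, the w-update of Algorithm 1 is an unconstrained quadratic minimization and reduces to a single linear solve. The sketch below illustrates that one step; the helper name and the generic positive definite H are our own (the adaptive choices of H_t are specified in the following subsections).

```python
import numpy as np

def w_update(g, w, v, theta, A, B, b, H, beta, eta):
    """One linearized w-step:
         argmin_w <g, w> - <theta, A w> + (beta/2)||A w + B v - b||^2
                  + (1/(2*eta)) * (w - w_t)^T H (w - w_t).
       Setting the gradient to zero yields the linear system solved below."""
    lhs = beta * A.T @ A + H / eta
    rhs = -g + A.T @ theta - beta * A.T @ (B @ v - b) + H @ w / eta
    return np.linalg.solve(lhs, rhs)
```

A quick sanity check is that the gradient of the subproblem vanishes at the returned point.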
The above theorem allows us to derive regret bounds for a family of algorithms that iteratively modify the proximal functions φ_t in an attempt to lower the regret bounds. Since the rate of convergence still depends on H_t and η, we next choose appropriate positive definite matrices H_t and the constant η to optimize the rate of convergence.

2.4. Diagonal Matrix Proximal Functions

In this subsection, we restrict H_t to be a diagonal matrix, for two important reasons: (i) the diagonal case provides results that are easier to understand than those for a general matrix; (ii) for high-dimensional problems, a general matrix may incur prohibitively expensive computational cost, which is not desirable.

Firstly, we notice that the upper bound in Theorem 1 relies on Σ_{t=1}^T ‖g_t‖²_{H_t⁻¹}. If we assumed all the g_t's were known in advance, we could minimize this term by setting H_t = diag(s). We shall use the following proposition.

Proposition 1. For any g₁, g₂, ..., g_T ∈ R^{d₁}, we have

    min_{s ≥ 0, ⟨1, s⟩ ≤ c} Σ_{t=1}^T ‖g_t‖²_{diag(s)⁻¹} = (1/c) ( Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂ )²,

where g_{1:T,i} = (g_{1,i}, ..., g_{T,i})ᵀ and the minimum is attained at s_i = c ‖g_{1:T,i}‖₂ / Σ_{j=1}^{d₁} ‖g_{1:T,j}‖₂. We omit the proof of this proposition, since it is easy to derive.

Since we do not have all the g_t's in advance, we receive the stochastic (sub)gradients g_t sequentially instead. As a result, we propose to update H_t incrementally as H_t = aI + diag(s_t), where s_{t,i} = ‖g_{1:t,i}‖₂ and a ≥ 0. For these H_t's, we have the following inequality:

    Σ_{t=1}^T ‖g_t‖²_{H_t⁻¹} = Σ_{t=1}^T ⟨g_t, (aI + diag(s_t))⁻¹ g_t⟩
    ≤ Σ_{t=1}^T ⟨g_t, diag(s_t)⁻¹ g_t⟩ ≤ 2 Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂,    (3)

where the last inequality uses Lemma 4 in (Duchi et al., 2011), which implies this update is a nearly optimal update method for the diagonal matrix case. Finally, the adaptive stochastic ADMM with diagonal matrix update (Ada-SADMM_diag) is summarized in Algorithm 2.

Algorithm 2: Adaptive Stochastic ADMM with Diagonal Matrix Update (Ada-SADMM_diag)
Initialize: w₁ = 0, v₁ = 0, θ₁ = 0, and a > 0.
for t = 1, 2, ..., T do
  Compute g_t = l′(w_t, ξ_t);
  Update H_t = aI + diag(s_t), where s_{t,i} = ‖g_{1:t,i}‖₂;
  w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
  v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
  θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
end for

For the convergence rate of Algorithm 2, we have the following specific theorem.

Theorem 2. Let l(w, ξ_t) and φ(w) be convex functions for any t ≥ 1. Then for Algorithm 2, we have the following inequality for any T and any ρ > 0:

    E[f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖]
    ≤ (1/T) { E[ η Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂ + (1/(2η)) max_t ‖w_t − w*‖²_∞ Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂ ]
      + (β/2) D²_{v*,B} + ρ²/(2β) }.

If we further set η = D_{w,∞}/√2, where D_{w,∞} = max_{w,w′ ∈ W} ‖w − w′‖_∞, then we have

    E[f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖]
    ≤ (1/T) { √2 D_{w,∞} E[ Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂ ] + (β/2) D²_{v*,B} + ρ²/(2β) }.

The proof is in the appendix.

2.5. Full Matrix Proximal Functions

In this subsection, we derive and analyze new updates where we estimate a full matrix H_t for the proximal function instead of a diagonal one. Although full matrix computation may not be attractive for high-dimensional problems, it may be helpful for tasks with low dimension. Furthermore, it will provide us with a more complete insight. Similar to the analysis for the diagonal case, we first introduce the following proposition (Lemma 15 in (Duchi et al., 2011)).

Proposition 2. For any g₁, g₂, ..., g_T ∈ R^{d₁}, we have

    min_{S ≻ 0, tr(S) ≤ c} Σ_{t=1}^T ‖g_t‖²_{S⁻¹} = (1/c) tr(G_T^{1/2})²,

where G_T = Σ_{t=1}^T g_t g_tᵀ and the minimizer is attained at S = c G_T^{1/2} / tr(G_T^{1/2}). If G_T is not of full rank, we use its pseudo-inverse to replace its inverse in the minimization problem.

Because the (sub)gradients are received sequentially, we propose to update H_t incrementally as H_t = aI + G_t^{1/2}, with G_t = Σ_{i=1}^t g_i g_iᵀ, t = 1, 2, .... For these H_t's, we have the following inequalities:

    Σ_{t=1}^T ‖g_t‖²_{H_t⁻¹} ≤ Σ_{t=1}^T ‖g_t‖²_{S_t⁻¹} ≤ 2 tr(G_T^{1/2}),    (4)

where S_t = G_t^{1/2} and the last inequality uses Lemma 10 in (Duchi et al., 2011), which implies this update is a nearly optimal update method for the full matrix case. Finally, the adaptive stochastic ADMM with full matrix update is summarized in Algorithm 3.

Algorithm 3: Adaptive Stochastic ADMM with Full Matrix Update (Ada-SADMM_full)
Initialize: w₁ = 0, v₁ = 0, θ₁ = 0, G₀ = 0, and a > 0.
for t = 1, 2, ..., T do
  Compute g_t = l′(w_t, ξ_t);
  Update G_t = G_{t−1} + g_t g_tᵀ;
  Update H_t = aI + S_t, where S_t = G_t^{1/2};
  w_{t+1} = argmin_{w ∈ W} L_{β,t}(w, v_t, θ_t);
  v_{t+1} = argmin_{v ∈ V} L_{β,t}(w_{t+1}, v, θ_t);
  θ_{t+1} = θ_t − β(Aw_{t+1} + Bv_{t+1} − b);
end for

For the convergence rate of Algorithm 3, we have the following specific theorem.

Theorem 3. Let l(w, ξ_t) and φ(w) be convex functions for any t ≥ 1. Then for Algorithm 3, we have the following inequality for any T and any ρ > 0:

    E[f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖]
    ≤ (1/T) { E[ η tr(G_T^{1/2}) + (1/(2η)) max_t ‖w* − w_t‖² tr(G_T^{1/2}) ] + (β/2) D²_{v*,B} + ρ²/(2β) }.

Furthermore, if we set η = D_{w,2}/√2, where D_{w,2} = max_{w,w′ ∈ W} ‖w − w′‖₂, then we have

    E[f(ū_T) − f(u*) + ρ‖A w̄_T + B v̄_T − b‖]
    ≤ (1/T) { √2 D_{w,2} E[ tr(G_T^{1/2}) ] + (β/2) D²_{v*,B} + ρ²/(2β) }.

The proof is in the appendix.

3. Experiments

In this section, we evaluate the empirical performance of the proposed adaptive stochastic ADMM algorithms for solving Graph-Guided SVM (GGSVM) tasks, which are formulated as the following problem (Ouyang et al., 2013):

    min_{w,v} (1/n) Σ_{i=1}^n [1 − y_i x_iᵀ w]₊ + (γ/2)‖w‖² + ν‖v‖₁,   s.t.  F w − v = 0,

where [z]₊ = max(0, z) and the matrix F is constructed based on a graph G = {V, E}. For this graph, V = {w₁, ..., w_d} is the set of variables and E = {e₁, ..., e_{|E|}}, where each edge e_k = {i, j} is assigned a weight α_{ij}; the corresponding rows of F have the form F_{k,i} = α_{ij} and F_{k,j} = −α_{ij}. To construct a graph for a given dataset, we adopt sparse inverse covariance estimation (Friedman et al., 2008) and determine the sparsity pattern of the inverse covariance matrix Σ⁻¹. Based on the inverse covariance matrix, we connect all index pairs (i, j) with nonzero Σ⁻¹_{ij} and assign α_{ij} = 1.

3.1. Experimental Testbed and Setup

To examine the performance, we test all the
algorithms on six real-world datasets from web machine learning repositories, which are listed in the able he news dataset was downloaded from wwwcsnyuedu/ roweis/datahtml All other datasets were downloaded from the LIBSVM website For each dataset, we randomly divide it into two folds: training set with 8% of examples and test set with the rest able Several real-world datasets in our experiments Dataset # examples # features a9a 48,84 3 mushrooms 8,4 news 6,4 splice 3,75 6 svmguide3,84 w8a 64,7 3 o make a fair comparison, all algorithms adopt the same experimental setup In particular, we set the penalty parameter γ = ν = /n, where n is the number of training examples, and the trade-off parameter β = In addition, we set the step size parameter η t = /(γt for according to the theorem in (Ouyang et al, 3 Finally, the smooth parameter a is set as, and the step size for adaptive stochastic ADMM algorithms are searched from [ 5:5] using cross validation All the experiments were conducted with 5 different random seeds and epochs (n iterations for each dataset All the result were reported by averaging over these 5 runs We evaluated the learning performance by measuring objective values, ie, f(u, and test error rates on the test datasets In addition, we also evaluate computational efficiency of all the algorithms by their running time All experiments were run in Matlab over a machine of 34GHz CPU 3 Performance Evaluation he figure shows the performance of all the algorithms in comparison over trials, from which we can draw several observations Firstly, the left column shows the objective
6 able Evaluation of stochastic ADMM algorithms on the real-world data sets Algorithm a9a mushrooms Objective value est error rate ime (s Objective value est error rate ime (s 6 ± ± ± 4 35 ± Ada- diag 355 ± 5 ± ± 5 6 ± 3355 Ada- full 3545 ± 498 ± ± ± Algorithm news splice Objective value est error rate ime (s Objective value est error rate ime (s 565 ± ± ± ± 3 98 Ada- diag 339 ± 3 8 ± ± ± Ada- full 34 ± 7 84 ± ± 4 55 ± Algorithm svmguide3 w8a Objective value est error rate ime (s Objective value est error rate ime (s 643 ± 33 6 ± ± ± Ada- diag 563 ± ± ± 93 ± Ada- full 53 ± 44 ± ± 6 99 ± values of the three algorithms We can observe that the t- wo adaptive stochastic ADMM algorithms converge much faster than, which shows the effectiveness of exploration of adaptive regularization (Bregman Divergence to accelerate stochastic ADMM Secondly, compared with Ada- diag, Ada- full achieves slightly s- maller objective values on most of the datasets, which indicates full matrix is slightly more informative than the diagonal one his should be due to the lower dimensions of these datasets hirdly, the central column provides test error rates of three algorithms, where we observe that the two adaptive algorithms achieve significantly smaller or comparable test error rates at 5-th epoch than at -th epoch his observation indicates that we can terminate the two adaptive algorithms earlier to save time and at the same time achieve similar performance compared with Finally, the right column shows the running time of three algorithms, which shows that during the learning process, the Ada- full is significantly s- lower while the Ada- diag is overall efficient compared with In summary, the Ada- diag algorithm achieves a good trade-off between efficiency and effectiveness able summarizes the performance of all the compared algorithms over the six datasets, from which we can make similar observations his again verifies the effectiveness of the proposed algorithms 4 Conclusion ADMM is a 
popular technique in machine learning. This paper studied a method to accelerate stochastic ADMM with adaptive regularization, by replacing the fixed proximal function with an adaptive proximal function. Compared with traditional stochastic ADMM, we show that the proposed adaptive algorithms converge significantly faster through the proposed adaptive strategies. Promising experimental results on a variety of real-world datasets further validate the effectiveness of our techniques.

Acknowledgments

The research of Peilin Zhao and Tong Zhang is partially supported by NSF-IIS and NSF-IIS 5985. The research of Jinwei Yang and Ping Li is partially supported by NSF-III-3697, NSF-Bigdata-49, ONR-N, and AFOSR-FA.

Appendix: Some Proofs

Proof of Lemma 1.

Proof. Firstly, using the convexity of l and the definition of δ_t, we can obtain

    l(w_t) − l(w) ≤ ⟨l′(w_t), w_t − w⟩ = ⟨g_t, w_{t+1} − w⟩ + ⟨δ_t, w − w_t⟩ + ⟨g_t, w_t − w_{t+1}⟩.

Combining the above inequality with the relation between θ_t and θ_{t+1} derives

    l(w_t) − l(w) + ⟨w_{t+1} − w, −Aᵀθ_{t+1}⟩
    ≤ ⟨g_t, w_{t+1} − w⟩ + ⟨δ_t, w − w_t⟩ + ⟨g_t, w_t − w_{t+1}⟩ + ⟨w_{t+1} − w, Aᵀ[β(Aw_{t+1} + Bv_{t+1} − b) − θ_t]⟩
    = ⟨g_t + Aᵀ[β(Aw_{t+1} + Bv_t − b) − θ_t], w_{t+1} − w⟩   (=: L_t)
      + ⟨w − w_{t+1}, βAᵀB(v_t − v_{t+1})⟩   (=: M_t)
      + ⟨δ_t, w − w_t⟩
      + ⟨g_t, w_t − w_{t+1}⟩.   (=: N_t)
Figure 1. Comparison between SADMM, Ada-SADMM_diag (Ada-diag), and Ada-SADMM_full (Ada-full) on 6 real-world datasets (a9a, mushrooms, news20, splice, svmguide3, w8a). Epoch for the horizontal axis is the number of iterations divided by dataset size. Left Panels: average objective values. Middle Panels: average test error rates. Right Panels: average time costs (in seconds).
To provide an upper bound for the first term L_t, taking D(u, v) = B_{φ_t}(u, v) = (1/2)‖u − v‖²_{H_t} and applying Lemma 1 in (Ouyang et al., 2013) to the step of obtaining w_{t+1} in Algorithm 1, we have

    ⟨g_t + Aᵀ[β(Aw_{t+1} + Bv_t − b) − θ_t], w_{t+1} − w⟩
    ≤ (1/η)[B_{φ_t}(w_t, w) − B_{φ_t}(w_{t+1}, w) − B_{φ_t}(w_{t+1}, w_t)].

To provide an upper bound for the second term M_t, we can derive as follows:

    ⟨w − w_{t+1}, βAᵀB(v_t − v_{t+1})⟩ = β⟨Aw − Aw_{t+1}, Bv_t − Bv_{t+1}⟩
    = (β/2)[(‖Aw + Bv_t − b‖² − ‖Aw + Bv_{t+1} − b‖²) + (‖Aw_{t+1} + Bv_{t+1} − b‖² − ‖Aw_{t+1} + Bv_t − b‖²)]
    ≤ (β/2)(‖Aw + Bv_t − b‖² − ‖Aw + Bv_{t+1} − b‖²) + (1/(2β))‖θ_{t+1} − θ_t‖².

To derive an upper bound for the final term N_t, we can use Young's inequality to get

    ⟨g_t, w_t − w_{t+1}⟩ ≤ (η/2)‖g_t‖²_{H_t⁻¹} + (1/(2η))‖w_t − w_{t+1}‖²_{H_t}
    = (η/2)‖g_t‖²_{H_t⁻¹} + (1/η)B_{φ_t}(w_t, w_{t+1}).

Replacing the terms L_t, M_t, and N_t with their upper bounds (the two B_{φ_t}(w_{t+1}, w_t) = B_{φ_t}(w_t, w_{t+1}) terms cancel), we get

    l(w_t) − l(w) + ⟨w_{t+1} − w, −Aᵀθ_{t+1}⟩
    ≤ (1/η)[B_{φ_t}(w_t, w) − B_{φ_t}(w_{t+1}, w)] + (η/2)‖g_t‖²_{H_t⁻¹}
      + (β/2)(‖Aw + Bv_t − b‖² − ‖Aw + Bv_{t+1} − b‖²) + (1/(2β))‖θ_{t+1} − θ_t‖² + ⟨δ_t, w − w_t⟩.

Due to the optimality condition of the step of updating v in Algorithm 1, i.e., 0 ∈ ∂_v L_{β,t}(w_{t+1}, v_{t+1}, θ_t), and the convexity of φ, we have

    φ(v_{t+1}) − φ(v) + ⟨v_{t+1} − v, −Bᵀθ_{t+1}⟩ ≤ 0.

Using the fact Aw_{t+1} + Bv_{t+1} − b = (θ_t − θ_{t+1})/β, we have

    ⟨θ_{t+1} − θ, Aw_{t+1} + Bv_{t+1} − b⟩ = (1/(2β))(‖θ − θ_t‖² − ‖θ − θ_{t+1}‖² − ‖θ_{t+1} − θ_t‖²).

Combining the above three inequalities and rearranging the terms concludes the proof.

Proof of Theorem 2.

Proof. We have the following inequality:

    Σ_{t=1}^T [B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)]
    = (1/2) Σ_{t=1}^T (‖w_t − w*‖²_{H_t} − ‖w_{t+1} − w*‖²_{H_t})
    ≤ (1/2)‖w₁ − w*‖²_{H₁} + (1/2) Σ_{t=1}^{T−1} (‖w_{t+1} − w*‖²_{H_{t+1}} − ‖w_{t+1} − w*‖²_{H_t})
    = (1/2)‖w₁ − w*‖²_{H₁} + (1/2) Σ_{t=1}^{T−1} ⟨w_{t+1} − w*, diag(s_{t+1} − s_t)(w_{t+1} − w*)⟩
    ≤ (1/2)‖w₁ − w*‖²_{H₁} + (1/2) max_t ‖w_t − w*‖²_∞ ⟨1, s_T − s₁⟩
    ≤ (1/2) max_t ‖w_t − w*‖²_∞ Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂,

where the last inequality used ⟨1, s_T⟩ = Σ_{i=1}^{d₁} ‖g_{1:T,i}‖₂ and ‖w₁ − w*‖²_{H₁} ≤ max_t ‖w_t − w*‖²_∞ ⟨1, s₁⟩. Plugging the above inequality and inequality (3) into inequality (2) concludes the first part of the theorem; the second part is then trivial to derive.

Proof of Theorem 3.

Proof. We consider the sum of the differences:

    Σ_{t=1}^T [B_{φ_t}(w_t, w*) − B_{φ_t}(w_{t+1}, w*)]
    ≤ (1/2)‖w₁ − w*‖²_{H₁} + (1/2) Σ_{t=1}^{T−1} ⟨w_{t+1} − w*, (G_{t+1}^{1/2} − G_t^{1/2})(w_{t+1} − w*)⟩
    ≤ (1/2)‖w₁ − w*‖²_{H₁} + (1/2) Σ_{t=1}^{T−1} ‖w_{t+1} − w*‖² λ_max(G_{t+1}^{1/2} − G_t^{1/2})
    ≤ (1/2)‖w₁ − w*‖²_{H₁} + (1/2) Σ_{t=1}^{T−1} ‖w_{t+1} − w*‖² tr(G_{t+1}^{1/2} − G_t^{1/2})
    ≤ (1/2)‖w₁ − w*‖²_{H₁} + (1/2) max_t ‖w_t − w*‖² tr(G_T^{1/2} − G₁^{1/2})
    ≤ (1/2) max_t ‖w_t − w*‖² tr(G_T^{1/2}),

where λ_max ≤ tr holds since G_{t+1}^{1/2} − G_t^{1/2} is positive semidefinite. Plugging the above inequality and inequality (4) into inequality (2) concludes the first part of the theorem; the second part is then trivial to derive.
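The diagonal update of Algorithm 2 keeps a running per-coordinate sum of squared gradients, from which s_{t,i} = ‖g_{1:t,i}‖₂ is recovered by a square root (an AdaGrad-style accumulator). The following minimal sketch maintains the diagonal of H_t = aI + diag(s_t); the class name and interface are our own illustrative choices.

```python
import numpy as np

class DiagProximal:
    """Maintains the diagonal of H_t = a*I + diag(s_t), s_{t,i} = ||g_{1:t,i}||_2."""
    def __init__(self, dim, a=1.0):
        self.a = a
        self.sq_sum = np.zeros(dim)   # accumulates sum over t of g_{t,i}^2

    def update(self, g):
        self.sq_sum += g * g
        s = np.sqrt(self.sq_sum)      # s_{t,i} = ||g_{1:t,i}||_2
        return self.a + s             # diagonal entries of H_t
```

Each call costs O(d₁) time and memory, which is the reason the diagonal variant remains efficient in high dimensions, in contrast to the O(d₁²) cost of maintaining the full matrix G_t.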
References

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

Deng, Wei and Yin, Wotao. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, DTIC Document, 2012.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

Eckstein, Jonathan and Bertsekas, Dimitri P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293-318, 1992.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

Gabay, Daniel. Chapter IX: Applications of the method of multipliers to variational inequalities. Studies in Mathematics and its Applications, 15:299-331, 1983.

Gabay, Daniel and Mercier, Bertrand. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17-40, 1976.

Glowinski, R. and Marroco, A. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique et Recherche Opérationnelle, 1975.

Glowinski, Roland and Le Tallec, Patrick. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, volume 9. SIAM, 1989.

Goldfarb, Donald, Ma, Shiqian, and Scheinberg, Katya. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349-382, 2013.

Goldstein, Tom and Osher, Stanley. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323-343, 2009.

He, Bingsheng and Yuan, Xiaoming. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.

Luo, Zhi-Quan. On the linear convergence of the alternating direction method of multipliers. arXiv preprint, 2012.

Monteiro, Renato D. C. and Svaiter, Benar Fux. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475-507, 2013.

Ouyang, Hua, He, Niao, Tran, Long, and Gray, Alexander G. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 80-88, 2013.

Suzuki, Taiji. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 392-400, 2013.

Suzuki, Taiji. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21-26 June 2014, pp. 736-744, 2014.

Wang, Huahua and Banerjee, Arindam. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1119-1126, 2012.

Yang, Junfeng and Zhang, Yin. Alternating direction algorithms for l1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250-278, 2011.

Zhong, Wenliang and Kwok, James Tin-Yau. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21-26 June 2014, pp. 46-54, 2014.
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationConvergence rate of SGD
Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected
More informationFast-and-Light Stochastic ADMM
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-6) Fast-and-Light Stochastic ADMM Shuai Zheng James T. Kwok Department of Computer Science and Engineering
More informationSparse inverse covariance estimation with the lasso
Sparse inverse covariance estimation with the lasso Jerome Friedman Trevor Hastie and Robert Tibshirani November 8, 2007 Abstract We consider the problem of estimating sparse graphs by a lasso penalty
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationContraction Methods for Convex Optimization and Monotone Variational Inequalities No.18
XVIII - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No18 Linearized alternating direction method with Gaussian back substitution for separable convex optimization
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationThe Direct Extension of ADMM for Multi-block Convex Minimization Problems is Not Necessarily Convergent
The Direct Extension of ADMM for Multi-block Convex Minimization Problems is Not Necessarily Convergent Caihua Chen Bingsheng He Yinyu Ye Xiaoming Yuan 4 December, Abstract. The alternating direction method
More informationDealing with Constraints via Random Permutation
Dealing with Constraints via Random Permutation Ruoyu Sun UIUC Joint work with Zhi-Quan Luo (U of Minnesota and CUHK (SZ)) and Yinyu Ye (Stanford) Simons Institute Workshop on Fast Iterative Methods in
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More information2 Regularized Image Reconstruction for Compressive Imaging and Beyond
EE 367 / CS 448I Computational Imaging and Display Notes: Compressive Imaging and Regularized Image Reconstruction (lecture ) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationOptimal Regularized Dual Averaging Methods for Stochastic Optimization
Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business
More informationEE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)
EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in
More informationConvergence of Bregman Alternating Direction Method with Multipliers for Nonconvex Composite Problems
Convergence of Bregman Alternating Direction Method with Multipliers for Nonconvex Composite Problems Fenghui Wang, Zongben Xu, and Hong-Kun Xu arxiv:40.8625v3 [math.oc] 5 Dec 204 Abstract The alternating
More informationDistributed Optimization and Statistics via Alternating Direction Method of Multipliers
Distributed Optimization and Statistics via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University Stanford Statistics Seminar, September 2010
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationFast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More informationStochastic Gradient Descent with Only One Projection
Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine
More informationThe Alternating Direction Method of Multipliers
The Alternating Direction Method of Multipliers With Adaptive Step Size Selection Peter Sutor, Jr. Project Advisor: Professor Tom Goldstein December 2, 2015 1 / 25 Background The Dual Problem Consider
More informationMultivariate Normal Models
Case Study 3: fmri Prediction Coping with Large Covariances: Latent Factor Models, Graphical Models, Graphical LASSO Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February
More informationOn convergence rate of the Douglas-Rachford operator splitting method
On convergence rate of the Douglas-Rachford operator splitting method Bingsheng He and Xiaoming Yuan 2 Abstract. This note provides a simple proof on a O(/k) convergence rate for the Douglas- Rachford
More informationPreconditioning via Diagonal Scaling
Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationEfficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization
Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More informationDivide-and-combine Strategies in Statistical Modeling for Massive Data
Divide-and-combine Strategies in Statistical Modeling for Massive Data Liqun Yu Washington University in St. Louis March 30, 2017 Liqun Yu (WUSTL) D&C Statistical Modeling for Massive Data March 30, 2017
More informationProximal ADMM with larger step size for two-block separable convex programming and its application to the correlation matrices calibrating problems
Available online at www.isr-publications.com/jnsa J. Nonlinear Sci. Appl., (7), 538 55 Research Article Journal Homepage: www.tjnsa.com - www.isr-publications.com/jnsa Proximal ADMM with larger step size
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More informationBregman Alternating Direction Method of Multipliers
Bregman Alternating Direction Method of Multipliers Huahua Wang, Arindam Banerjee Dept of Computer Science & Engg, University of Minnesota, Twin Cities {huwang,banerjee}@cs.umn.edu Abstract The mirror
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationIntroduction to Alternating Direction Method of Multipliers
Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationStochastic Semi-Proximal Mirror-Prox
Stochastic Semi-Proximal Mirror-Prox Niao He Georgia Institute of echnology nhe6@gatech.edu Zaid Harchaoui NYU, Inria firstname.lastname@nyu.edu Abstract We present a direct extension of the Semi-Proximal
More informationACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING
ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)
More informationDual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Dual Methods Lecturer: Ryan Tibshirani Conve Optimization 10-725/36-725 1 Last time: proimal Newton method Consider the problem min g() + h() where g, h are conve, g is twice differentiable, and h is simple.
More informationFast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationOptimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections
Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Jianhui Chen 1, Tianbao Yang 2, Qihang Lin 2, Lijun Zhang 3, and Yi Chang 4 July 18, 2016 Yahoo Research 1, The
More informationAdaptive Gradient Methods AdaGrad / Adam
Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)
More informationSparse and Regularized Optimization
Sparse and Regularized Optimization In many applications, we seek not an exact minimizer of the underlying objective, but rather an approximate minimizer that satisfies certain desirable properties: sparsity
More informationDual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)
More informationMath 273a: Optimization Overview of First-Order Optimization Algorithms
Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization
More informationSplitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches
Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationOptimal Linearized Alternating Direction Method of Multipliers for Convex Programming 1
Optimal Linearized Alternating Direction Method of Multipliers for Convex Programming Bingsheng He 2 Feng Ma 3 Xiaoming Yuan 4 October 4, 207 Abstract. The alternating direction method of multipliers ADMM
More informationA Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem
A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem Kangkang Deng, Zheng Peng Abstract: The main task of genetic regulatory networks is to construct a
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationALTERNATING DIRECTION METHOD OF MULTIPLIERS WITH VARIABLE STEP SIZES
ALTERNATING DIRECTION METHOD OF MULTIPLIERS WITH VARIABLE STEP SIZES Abstract. The alternating direction method of multipliers (ADMM) is a flexible method to solve a large class of convex minimization
More informationMultivariate Normal Models
Case Study 3: fmri Prediction Graphical LASSO Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox February 26 th, 2013 Emily Fox 2013 1 Multivariate Normal Models
More informationSupport Vector Machine
Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationRandomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming
Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone
More informationLogarithmic Regret Algorithms for Strongly Convex Repeated Games
Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationDistributed Smooth and Strongly Convex Optimization with Inexact Dual Methods
Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods Mahyar Fazlyab, Santiago Paternain, Alejandro Ribeiro and Victor M. Preciado Abstract In this paper, we consider a class of
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationA Primal-dual Three-operator Splitting Scheme
Noname manuscript No. (will be inserted by the editor) A Primal-dual Three-operator Splitting Scheme Ming Yan Received: date / Accepted: date Abstract In this paper, we propose a new primal-dual algorithm
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationThe picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R
The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework
More informationGaussian Graphical Models and Graphical Lasso
ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationADMM and Accelerated ADMM as Continuous Dynamical Systems
Guilherme França 1 Daniel P. Robinson 1 René Vidal 1 Abstract Recently, there has been an increasing interest in using tools from dynamical systems to analyze the behavior of simple optimization algorithms
More informationOn the Convergence and O(1/N) Complexity of a Class of Nonlinear Proximal Point Algorithms for Monotonic Variational Inequalities
STATISTICS,OPTIMIZATION AND INFORMATION COMPUTING Stat., Optim. Inf. Comput., Vol. 2, June 204, pp 05 3. Published online in International Academic Press (www.iapress.org) On the Convergence and O(/N)
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationConvex Repeated Games and Fenchel Duality
Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,
More information