Coordinate descent methods

Master Mathematics for data science and big data
Olivier Fercoq
November 3, 2015

Contents

1 Exact coordinate descent
2 Coordinate gradient descent
3 Proximal coordinate descent
4 Stochastic dual coordinate ascent for support vector machines

1 Exact coordinate descent

The idea of coordinate descent is to decompose a large optimisation problem into a sequence of one-dimensional optimisation problems. The algorithm was first described for the minimisation of quadratic functions by Gauss and Seidel in Seidel (1874). Coordinate descent methods have become unavoidable in machine learning because they are very efficient for key problems, namely the Lasso, logistic regression and support vector machines. Moreover, the decomposition into small subproblems means that only a small part of the data is processed at each iteration, which makes coordinate descent easily scalable to high dimensions.

We first decompose the space of optimisation variables $X$ into blocks $X_1 \times \dots \times X_n = X$. A classical choice when $X = \mathbb{R}^n$ is to take $X_1 = \dots = X_n = \mathbb{R}$. We denote by $U_i$ the canonical injection from $X_i$ to $X$, that is, $U_i$ is such that for all $h \in X_i$,
$U_i h = (\underbrace{0, \dots, 0}_{i-1 \text{ zeros}}, h, \underbrace{0, \dots, 0}_{n-i \text{ zeros}}) \in X.$

For a function $f : X_1 \times \dots \times X_n \to \mathbb{R}$, we define the following algorithm.

Algorithm 1: Exact coordinate descent
Start at $x_0 \in X$. At iteration $k$, choose $l = (k \bmod n) + 1$ (cyclic rule) and define $x_{k+1} \in X$ by
$x_{k+1}^{(i)} = \arg\min_{z \in X_l} f(x_k^{(1)}, \dots, x_k^{(l-1)}, z, x_k^{(l+1)}, \dots, x_k^{(n)})$ if $i = l$,
$x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq l$.
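To fix ideas, here is a minimal Python sketch of Algorithm 1 with the cyclic rule for the case $X_1 = \dots = X_n = \mathbb{R}$. It is an illustration, not a reference implementation: the one-dimensional subproblems are solved numerically with scipy's `minimize_scalar`, and the search interval `bounds` is an assumption of this sketch (the subproblems in the notes are unconstrained).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_coordinate_descent(f, x0, n_iters, bounds=(-10.0, 10.0)):
    """Exact coordinate descent (Algorithm 1) with the cyclic rule.

    f      : callable taking a 1-D numpy array and returning a float
    x0     : starting point (1-D numpy array)
    bounds : interval on which each 1-D subproblem is solved (assumption of
             this sketch; the subproblems in the notes are unconstrained).
    """
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for k in range(n_iters):
        l = k % n  # cyclic rule: l = (k mod n) + 1, written with 0-based indexing
        # Minimise f along the l-th coordinate, all other coordinates fixed.
        def f_1d(z, l=l):
            y = x.copy()
            y[l] = z
            return f(y)
        res = minimize_scalar(f_1d, bounds=bounds, method='bounded')
        x[l] = res.x
    return x

# Example: minimise f(x) = ||Ax - b||^2 / 2 on a small random instance.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
print(exact_coordinate_descent(f, np.zeros(5), n_iters=200))
```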

[Figure 1: The successive iterates of the coordinate descent method on a 2D example. The function being minimised is represented by its level sets: the bluer the circle, the lower the function value.]

Proposition 1 (Warga (1963)). If $f$ is continuously differentiable and strictly convex and there exists $x_* = \arg\min_{x\in X} f(x)$, then the exact coordinate descent method (Alg. 1) converges to $x_*$.

Example 1 (least squares). $f(x) = \frac12\|Ax - b\|^2 = \frac12\sum_{j=1}^m (a_j^\top x - b_j)^2$, where $a_j^\top$ denotes the $j$-th row of $A$.
At each iteration, we need to solve in $z$ the 1D equation
$\frac{\partial f}{\partial x^{(l)}}(x_k^{(1)}, \dots, x_k^{(l-1)}, z, x_k^{(l+1)}, \dots, x_k^{(n)}) = 0.$
For all $x \in \mathbb{R}^n$,
$\frac{\partial f}{\partial x^{(l)}}(x) = (Ae_l)^\top(Ax - b) = (Ae_l)^\top(Ae_l)\,x^{(l)} + (Ae_l)^\top\Big(\sum_{j\neq l}(Ae_j)\,x^{(j)}\Big) - (Ae_l)^\top b,$
so we get
$z = x_{k+1}^{(l)} = \frac{1}{(Ae_l)^\top(Ae_l)}\Big((Ae_l)^\top b - (Ae_l)^\top\sum_{j\neq l}(Ae_j)\,x_k^{(j)}\Big) = x_k^{(l)} - \frac{(Ae_l)^\top(Ax_k - b)}{(Ae_l)^\top(Ae_l)}.$
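For Example 1 the coordinate update has the closed form above, so no inner solver is needed. The following sketch (an illustration, with a random instance chosen only to check the result against `numpy.linalg.lstsq`) keeps the residual $r = Ax - b$ up to date so that each update only reads one column of $A$.

```python
import numpy as np

def exact_cd_least_squares(A, b, n_epochs=100):
    """Cyclic exact coordinate descent for f(x) = ||Ax - b||^2 / 2 (Example 1)."""
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                        # residual r = Ax - b
    col_sq_norms = (A ** 2).sum(axis=0)  # (A e_l)^T (A e_l) for each column
    for _ in range(n_epochs):
        for l in range(n):
            # closed-form coordinate update: x^(l) <- x^(l) - (A e_l)^T r / ||A e_l||^2
            delta = -(A[:, l] @ r) / col_sq_norms[l]
            x[l] += delta
            r += delta * A[:, l]         # keep the residual consistent with x
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((30, 8)), rng.standard_normal(30)
x_cd = exact_cd_least_squares(A, b)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_cd - x_ls))       # should be small after enough epochs
```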

Example 2 (non-differentiable function). $f(x^{(1)}, x^{(2)}) = |x^{(1)} - x^{(2)}| - \min(x^{(1)}, x^{(2)}) + I_{[0,1]^2}(x).$
$f$ is convex but not differentiable. If we nevertheless try to run exact coordinate descent, the algorithm proceeds as $x_1^{(1)} = \arg\min_z f(z, x_0^{(2)}) = x_0^{(2)}$, $x_1^{(2)} = \arg\min_z f(x_1^{(1)}, z) = x_0^{(2)}$, and so on. Thus exact coordinate descent converges in two iterations to $(x_0^{(2)}, x_0^{(2)})$: the algorithm is stuck at a point of non-differentiability on the line $\{x^{(1)} = x^{(2)}\}$ and does not reach the minimiser $(1, 1)$.

[Figure 2: The function of Example 2 (labels in the figure: $x_0$ and $x_1 = x_2$).]

Example 3 (non-convex differentiable function). $f(x^{(1)}, x^{(2)}, x^{(3)}) = -(x^{(1)}x^{(2)} + x^{(2)}x^{(3)} + x^{(3)}x^{(1)}) + \sum_{i=1}^3 \max(0, |x^{(i)}| - 1)^2.$
As shown by Powell (1976), exact coordinate descent on this function started at the initial point $x_0 = (-1-\epsilon,\, 1+\epsilon/2,\, -1-\epsilon/4)$ has a limit cycle around the 6 corners of the cube that are not minimisers and avoids the corners that are minimisers.
Exercise: Show Powell's result.

Example 4 (Adaboost). The Adaboost algorithm (Collins et al. (2002)) was designed to minimise the exponential loss given by
$f(x) = \sum_{j=1}^m \exp(-y_j h_j^\top x).$
At each iteration, we select the variable $l$ such that $l = \arg\max_i |\nabla_i f(x_k)|$ and we perform an exact coordinate descent step along this coordinate. This variable selection rule is called the greedy or Gauss-Southwell rule. Like the cyclic rule, it leads to a converging algorithm but requires computing the full gradient at each iteration. Greedy coordinate descent is interesting in the case of the exponential loss because the gradient of the function has a few very large coefficients and many negligible ones.
Exercise: Suppose that $y_j \in \{-1, 1\}$ and $h_{j,i} \in \{-1, 0, 1\}$ for all $j, i$. Give explicit formulas for $\nabla_i f(x_k)$ and for the next update $x_{k+1}$ knowing $x_k$.
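As an illustration of the greedy rule on the exponential loss, here is one possible sketch in Python. The closed-form coordinate step $t = \frac12\log(W_+/W_-)$ and the small constant `eps` guarding against empty sums are choices of this sketch (they can be recovered by solving the exercise above), not formulas taken verbatim from the notes.

```python
import numpy as np

def greedy_cd_exp_loss(H, y, n_iters=100, eps=1e-12):
    """Greedy (Gauss-Southwell) exact coordinate descent for
    f(x) = sum_j exp(-y_j h_j^T x), with y_j in {-1,1} and h_{j,i} in {-1,0,1}."""
    m, n = H.shape
    x = np.zeros(n)
    for _ in range(n_iters):
        w = np.exp(-y * (H @ x))          # w_j = exp(-y_j h_j^T x)
        grad = -H.T @ (y * w)             # partial derivatives of f at x
        i = int(np.argmax(np.abs(grad)))  # greedy / Gauss-Southwell rule
        s = y * H[:, i]                   # s_j = y_j h_{j,i}, in {-1, 0, 1}
        W_plus = w[s == 1].sum() + eps
        W_minus = w[s == -1].sum() + eps
        x[i] += 0.5 * np.log(W_plus / W_minus)  # exact 1-D minimisation along coordinate i
    return x

rng = np.random.default_rng(2)
H = rng.integers(-1, 2, size=(50, 10)).astype(float)
y = rng.choice([-1.0, 1.0], size=50)
x = greedy_cd_exp_loss(H, y)
print(np.sum(np.exp(-y * (H @ x))))       # exponential loss after the greedy passes
```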

2 Coordinate gradient descent

Solving a one-dimensional optimisation problem is generally easy and the solution can be approximated very well by algorithms like the bisection method. However, for the exact coordinate descent method, one needs to solve a huge number of one-dimensional problems and the expense quickly becomes prohibitive. Moreover, why should we solve the 1-dimensional problem to high accuracy if this solution is destroyed at the next iteration? The idea of coordinate gradient descent is to perform one iteration of gradient descent on the 1-dimensional problem $\min_{z\in X_l} f(x_k^{(1)},\dots,x_k^{(l-1)}, z, x_k^{(l+1)},\dots,x_k^{(n)})$ instead of solving it completely. In general, this drastically reduces the cost of each iteration while keeping the same convergence behaviour.

When choosing the cyclic or greedy rule, the algorithm does converge for any convex function $f$ that has a Lipschitz-continuous gradient and such that $\arg\min_x f(x) \neq \emptyset$. In fact we will assume that we actually know the coordinate-wise Lipschitz constants of the gradient of $f$, namely the Lipschitz constants of the functions
$g_{i,x}: X_i \to \mathbb{R},\quad h \mapsto f(x + U_i h) = f(x^{(1)},\dots,x^{(i-1)}, x^{(i)}+h, x^{(i+1)},\dots,x^{(n)}).$

Algorithm 2: Coordinate gradient descent
Start at $x_0$. At iteration $k$, choose $i_{k+1} \in \{1,\dots,n\}$ and define $x_{k+1}$ by
$x_{k+1}^{(i)} = x_k^{(i)} - \gamma_i \nabla_i f(x_k)$ if $i = i_{k+1}$,
$x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq i_{k+1}$.

We will denote by $L_i = L(\nabla g_{i,x})$ this Lipschitz constant. Written in terms of $f$, this means that
$\forall x \in X,\ \forall i \in \{1,\dots,n\},\ \forall h \in X_i,\qquad \|\nabla_i f(x + U_i h) - \nabla_i f(x)\| \le L_i \|h\|.$

Lemma 1. If $f$ has a coordinate-wise Lipschitz gradient with constants $L_1,\dots,L_n$, then
$\forall x \in X,\ \forall i \in \{1,\dots,n\},\ \forall h \in X_i,\qquad f(x + U_i h) \le f(x) + \langle \nabla_i f(x), h\rangle + \frac{L_i}{2}\|h\|^2.$
Proof. This is Taylor's inequality applied to $g_{i,x}$. Note that we do not require the function to be twice differentiable.

Proposition 2 (Beck and Tetruashvili (2013)). Assume that $f$ is convex, $\nabla f$ is Lipschitz continuous and $\arg\min_{x\in X} f(x) \neq \emptyset$. If $i_{k+1}$ is chosen with the cyclic rule $i_{k+1} = (k \bmod n)+1$ and $\gamma_i = 1/L_i$ for all $i$, then the coordinate gradient descent method (Alg. 2) satisfies
$f(x_{k+1}) - f(x_*) \le \frac{4 L_{\max}\big(1 + n^3 L_{\max}/L_{\min}\big) R^2(x_0)}{k + 8/n}$
where $R(x_0) = \max_{x,y\in X}\{\|x-y\| : f(y) \le f(x) \le f(x_0)\}$, $L_{\max} = \max_i L_i$ and $L_{\min} = \min_i L_i$.

The proof of this result is quite technical and in fact the bound is much more pessimistic than what is observed in practice ($n^3$ is very large if $n$ is large). This is due to the fact that the cyclic rule behaves particularly badly on some extreme examples. To avoid such traps, it has been suggested to randomise the coordinate selection process.

Proposition 3 (Nesterov (2012)). Assume that $f$ is convex, $\nabla f$ is Lipschitz continuous and $\arg\min_{x\in X} f(x) \neq \emptyset$. If $i_{k+1}$ is randomly generated, independently of $i_1,\dots,i_k$, with $P(i_{k+1}=i) = \frac1n$ for all $i\in\{1,\dots,n\}$, and $\gamma_i = 1/L_i$, then the coordinate gradient descent method (Alg. 2) satisfies, for all $x_* \in \arg\min_x f(x)$,
$E[f(x_{k+1}) - f(x_*)] \le \frac{n}{k+n}\Big(\big(1-\tfrac1n\big)(f(x_0) - f(x_*)) + \tfrac12\|x_* - x_0\|_L^2\Big)$
where $\|x\|_L^2 = \sum_{i=1}^n L_i |x^{(i)}|^2$.
Proof. This is a particular case of the method developed in the next section.
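A minimal sketch of Algorithm 2 with the random selection rule of Proposition 3 and step sizes $\gamma_i = 1/L_i$ is given below; the partial-derivative callable `grad_i` and the array of coordinate-wise Lipschitz constants `L` are assumptions of the sketch, to be supplied by the user for their own problem.

```python
import numpy as np

def coordinate_gradient_descent(grad_i, L, x0, n_iters, rng=None):
    """Coordinate gradient descent (Algorithm 2), random rule, gamma_i = 1/L_i.

    grad_i : callable (x, i) -> i-th partial derivative of f at x
    L      : array of coordinate-wise Lipschitz constants L_1, ..., L_n
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for _ in range(n_iters):
        i = rng.integers(n)              # P(i_{k+1} = i) = 1/n
        x[i] -= grad_i(x, i) / L[i]      # x^(i) <- x^(i) - (1/L_i) * grad_i f(x)
    return x

# Example: f(x) = ||Ax - b||^2 / 2, for which L_i = ||A e_i||^2.
rng = np.random.default_rng(3)
A, b = rng.standard_normal((40, 10)), rng.standard_normal(40)
L = (A ** 2).sum(axis=0)
grad_i = lambda x, i: A[:, i] @ (A @ x - b)
print(coordinate_gradient_descent(grad_i, L, np.zeros(10), n_iters=2000, rng=rng))
```

As written, `grad_i` recomputes $Ax$ at every call; the residual update discussed in the comparison with gradient descent below removes this overhead.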

Comparison with gradient descent. The iteration complexity of the gradient descent method is
$f(x_{k+1}) - f(x_*) \le \frac{L(\nabla f)}{2(k+1)}\|x_* - x_0\|^2.$
This means that to get an $\epsilon$-solution (i.e. $x_k$ such that $f(x_k) - f(x_*) \le \epsilon$), we need at most $\frac{L(\nabla f)}{2\epsilon}\|x_* - x_0\|^2$ iterations. What is most expensive in gradient descent is the evaluation of the gradient $\nabla f(x)$, with a cost $C$, so the total cost of the method is
$C_{\mathrm{grad}} = C\,\frac{L(\nabla f)\|x_* - x_0\|^2}{2\epsilon}.$
Neglecting the effect of randomisation, we usually have an $\epsilon$-solution with coordinate descent in
$\frac{n}{\epsilon}\Big(\big(1-\tfrac1n\big)(f(x_0)-f(x_*)) + \tfrac12\|x_*-x_0\|_L^2\Big)$
iterations. The cost of one iteration of coordinate descent is of the order of the cost of evaluating one partial derivative $\nabla_i f(x)$, with a cost $c$, so the total cost of the method is
$C_{\mathrm{cd}} = c\,\frac{n}{\epsilon}\Big(\big(1-\tfrac1n\big)(f(x_0)-f(x_*)) + \tfrac12\|x_*-x_0\|_L^2\Big).$
How do these two quantities compare? Let us consider the case where $f(x) = \frac12\|Ax-b\|^2$. Computing $\nabla f(x) = A^\top(Ax-b)$ amounts to updating the residuals $r_k = Ax_k - b$ (one matrix-vector product and a sum) and computing one matrix-vector product. We thus have $C = O(\mathrm{nnz}(A))$. Computing $\nabla_i f(x) = e_i^\top A^\top(Ax - b)$ amounts to
1. updating the residuals $r_k = Ax_k - b$: one scalar-vector product and a sum, since we have $r_{k+1} = r_k + (x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})})\,Ae_{i_{k+1}}$;
2. computing one vector-vector product (the $i_{k+1}$-th column of $A$ against the residuals).
Thus $c = O(\mathrm{nnz}(Ae_{i_{k+1}})) = O(\mathrm{nnz}(A)/n) = C/n$ if the columns of $A$ are equally sparse.
Moreover, we always have $f(x_0) - f(x_*) \le \frac{L(\nabla f)}{2}\|x_0 - x_*\|^2$, and it may happen that $f(x_0) - f(x_*) \ll L(\nabla f)\|x_0 - x_*\|^2$. Finally, $L(\nabla f) = \lambda_{\max}(A^\top A)$ and $L_i = a_i^\top a_i$ with $a_i = Ae_i$: we always have $L_i \le L(\nabla f)$ and it may happen that $L_i = O(L(\nabla f)/n)$. To conclude, in the quadratic case, $C_{\mathrm{cd}} \lesssim C_{\mathrm{grad}}$ and we may have $C_{\mathrm{cd}} = O(C_{\mathrm{grad}}/n)$.
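The bookkeeping behind the estimate $c = O(\mathrm{nnz}(A)/n)$ can be made concrete. The sketch below is an illustration with a dense array for simplicity; with sparse data one would store $A$ in a column-compressed format so that column access stays cheap.

```python
import numpy as np

def cd_quadratic_with_residuals(A, b, n_iters, rng=None):
    """Randomised coordinate gradient descent on f(x) = ||Ax - b||^2 / 2.

    The residual r = Ax - b is kept up to date, so one iteration only reads one
    column of A -- this is the per-iteration cost discussed above."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    L = (A ** 2).sum(axis=0)              # L_i = ||A e_i||^2
    x = np.zeros(n)
    r = -b.astype(float).copy()           # r = Ax - b with x = 0
    for _ in range(n_iters):
        i = rng.integers(n)
        g_i = A[:, i] @ r                 # grad_i f(x) = e_i^T A^T (Ax - b)
        delta = -g_i / L[i]
        x[i] += delta
        r += delta * A[:, i]              # r_{k+1} = r_k + (x_{k+1}^(i) - x_k^(i)) A e_i
    return x
```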

3 Proximal coordinate descent

We are often interested in solving problems of the type
$\min_{x\in X} F(x) = \min_{x\in X} f(x) + g(x) \qquad (3.1)$
where $f$ and $g$ are convex, so that $F = f + g$ is convex, $f$ has a Lipschitz-continuous gradient and $g$ may be nonsmooth but is separable. This means that for all $x \in X = X_1\times\dots\times X_n$, $g(x) = \sum_{i=1}^n g_i(x^{(i)})$. We can solve this kind of problem with the proximal coordinate descent method (Alg. 3, Tseng (2001)), which also uses the coordinate-wise Lipschitz constants. For this algorithm to be practical, we need to be able to compute efficiently
$\mathrm{prox}_{\gamma,g}(y) = \arg\min_{x\in X}\ g(x) + \tfrac12\|x - y\|_{\gamma^{-1}}^2,$
the proximal operator of $g$ (remember that $\|x\|_\gamma^2 = \sum_{i=1}^n \gamma_i|x^{(i)}|^2$, so that $\|x\|_{\gamma^{-1}}^2 = \sum_{i=1}^n |x^{(i)}|^2/\gamma_i$).

Algorithm 3: Proximal coordinate descent
Start at $x_0 \in X$. At iteration $k$, choose $i_{k+1}\in\{1,\dots,n\}$ and define $x_{k+1}\in X$ by
$x_{k+1}^{(i)} = \arg\min_{x\in X_i}\ g_i(x) + f(x_k) + \langle\nabla_i f(x_k), x - x_k^{(i)}\rangle + \frac{L_i}{2}\|x - x_k^{(i)}\|^2$ if $i = i_{k+1}$,
$x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq i_{k+1}$.

Example 5 (Simple proximal operators).
Indicator of a box: if $g(x) = I_{[a,b]}(x)$, then $\mathrm{prox}_{\gamma,g}(y) = \max(a, \min(y, b))$. This is the projection onto $[a,b]$ (it does not depend on $\gamma$).
Absolute value: if $g(x) = \lambda|x|$, then $\mathrm{prox}_{\gamma,g}(y) = \mathrm{sign}(y)\max(0, |y| - \gamma\lambda)$. This is the soft-thresholding operator.
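The proximal operators of Example 5, together with the row-wise group version used for the multi-task problems later in this section, are one-liners; the code below is a small sketch of them (the function names are chosen for this sketch only).

```python
import numpy as np

def prox_box(y, a, b):
    """prox of the indicator of [a, b]: projection onto the box (independent of gamma)."""
    return np.clip(y, a, b)

def prox_abs(y, gamma, lam):
    """prox of g(x) = lam * |x| with parameter gamma: soft-thresholding at gamma*lam."""
    return np.sign(y) * np.maximum(0.0, np.abs(y) - gamma * lam)

def prox_group(y_block, gamma, lam):
    """prox of g(x) = lam * ||x||_2 on a block: group soft-thresholding
    (useful for the l1/l2 regularisers of Examples 7 and 8 below)."""
    norm = np.linalg.norm(y_block)
    if norm <= gamma * lam:
        return np.zeros_like(y_block)
    return (1.0 - gamma * lam / norm) * y_block

print(prox_abs(np.array([-3.0, 0.2, 1.5]), gamma=1.0, lam=0.5))  # -> [-2.5, 0., 1.]
```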

We define
$\bar x_{k+1} = \mathrm{prox}_{L^{-1},g}\big(x_k - L^{-1}\nabla f(x_k)\big) = \arg\min_{x\in X}\ g(x) + f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x - x_k\|_L^2,$
so that
$x_{k+1}^{(i)} = \bar x_{k+1}^{(i)}$ if $i = i_{k+1}$, and $x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq i_{k+1}$.

Lemma 2. Let $\gamma\in\mathbb{R}^n_+$ and let $\bar x$ be the minimiser of $x\mapsto g(x) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x-x_k\|_\gamma^2$. Then for all $x\in X$,
$g(\bar x) + \langle\nabla f(x_k), \bar x - x_k\rangle + \tfrac12\|\bar x - x_k\|_\gamma^2 \le g(x) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x - x_k\|_\gamma^2 - \tfrac12\|\bar x - x\|_\gamma^2.$
In particular, this holds with $\gamma = L$ and $\bar x = \bar x_{k+1}$.
Proof. The function $\psi: x\mapsto g(x) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x - x_k\|_\gamma^2$ is strongly convex with constant 1 with respect to the norm $\|\cdot\|_\gamma$ and its minimiser is $\bar x$. The inequality is just the strong convexity inequality for this function, applied at $x$ and $\bar x$.

Theorem 1 (Richtárik and Takáč (2014)). The proximal coordinate descent method (Alg. 3) with the random selection rule applied to Problem (3.1) satisfies, for all $x_*\in\arg\min_x F(x)$,
$E[F(x_{k+1}) - F(x_*)] \le \frac{n}{k+n}\Big(\big(1-\tfrac1n\big)(F(x_0)-F(x_*)) + \tfrac12\|x_*-x_0\|_L^2\Big)$
where $\|x\|_L^2 = \sum_{i=1}^n L_i|x^{(i)}|^2$.

Proof. By definition of the algorithm, $x_{k+1} - x_k = U_{i_{k+1}}(x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})})$, so by Lemma 1,
$f(x_{k+1}) \le f(x_k) + \langle\nabla_{i_{k+1}} f(x_k), x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})}\rangle + \frac{L_{i_{k+1}}}{2}\|x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})}\|^2 = f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \tfrac12\|x_{k+1} - x_k\|_L^2. \qquad (3.2)$

Using the conditional expectation knowing $\mathcal F_k = (i_1,\dots,i_k)$, we get
$E[x_{k+1}^{(i)}\mid\mathcal F_k] = P(i_{k+1}=i)\,\bar x_{k+1}^{(i)} + P(i_{k+1}\neq i)\,x_k^{(i)} = \tfrac1n\bar x_{k+1}^{(i)} + \big(1-\tfrac1n\big)x_k^{(i)},$
$E[\langle\nabla_i f(x_k), x_{k+1}^{(i)} - x_k^{(i)}\rangle\mid\mathcal F_k] = P(i_{k+1}=i)\,\langle\nabla_i f(x_k), \bar x_{k+1}^{(i)} - x_k^{(i)}\rangle + P(i_{k+1}\neq i)\,\langle\nabla_i f(x_k), x_k^{(i)} - x_k^{(i)}\rangle = \tfrac1n\langle\nabla_i f(x_k), \bar x_{k+1}^{(i)} - x_k^{(i)}\rangle,$
and hence
$E[\langle\nabla f(x_k), x_{k+1} - x_k\rangle\mid\mathcal F_k] = \sum_{i=1}^n E[\langle\nabla_i f(x_k), x_{k+1}^{(i)} - x_k^{(i)}\rangle\mid\mathcal F_k] = \tfrac1n\langle\nabla f(x_k), \bar x_{k+1} - x_k\rangle, \qquad (3.3)$
$E[\|x_{k+1} - x_k\|_L^2\mid\mathcal F_k] = \sum_{i=1}^n E[L_i\|x_{k+1}^{(i)} - x_k^{(i)}\|^2\mid\mathcal F_k] = \tfrac1n\|\bar x_{k+1} - x_k\|_L^2, \qquad (3.4)$
$E[g(x_{k+1}) - g(x_k)\mid\mathcal F_k] = \sum_{i=1}^n E[g_i(x_{k+1}^{(i)}) - g_i(x_k^{(i)})\mid\mathcal F_k] = \tfrac1n\big(g(\bar x_{k+1}) - g(x_k)\big). \qquad (3.5)$

Combining (3.3), (3.4) and (3.5) with (3.2), we get
$E[g(x_{k+1}) + f(x_{k+1})\mid\mathcal F_k] \le E[g(x_{k+1})\mid\mathcal F_k] + E\big[f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \tfrac12\|x_{k+1} - x_k\|_L^2\ \big|\ \mathcal F_k\big]$
$= \big(1-\tfrac1n\big)g(x_k) + \tfrac1n g(\bar x_{k+1}) + f(x_k) + \tfrac1n\langle\nabla f(x_k), \bar x_{k+1} - x_k\rangle + \tfrac1{2n}\|\bar x_{k+1} - x_k\|_L^2.$

Using Lemma 2 with $x = x_k$, we get
$E[F(x_{k+1})\mid\mathcal F_k] = E[g(x_{k+1}) + f(x_{k+1})\mid\mathcal F_k] \le g(x_k) + f(x_k) - \tfrac1{2n}\|\bar x_{k+1} - x_k\|_L^2 \le g(x_k) + f(x_k) = F(x_k). \qquad (3.6)$

Then, using Lemma 2 again, with $x = x_*$, we get
$E[F(x_{k+1})\mid\mathcal F_k] \le \big(1-\tfrac1n\big)g(x_k) + f(x_k) + \tfrac1n g(x_*) + \tfrac1n\langle\nabla f(x_k), x_* - x_k\rangle + \tfrac1{2n}\|x_* - x_k\|_L^2 - \tfrac1{2n}\|x_* - \bar x_{k+1}\|_L^2.$
We remark that
$E\big[\tfrac12\|x_* - x_k\|_L^2 - \tfrac12\|x_* - x_{k+1}\|_L^2\ \big|\ \mathcal F_k\big] = \tfrac1{2n}\|x_* - x_k\|_L^2 - \tfrac1{2n}\|x_* - \bar x_{k+1}\|_L^2,$
so that
$E[F(x_{k+1})\mid\mathcal F_k] \le \big(1-\tfrac1n\big)g(x_k) + f(x_k) + \tfrac1n g(x_*) + \tfrac1n\langle\nabla f(x_k), x_* - x_k\rangle + \tfrac12\|x_* - x_k\|_L^2 - E\big[\tfrac12\|x_* - x_{k+1}\|_L^2\mid\mathcal F_k\big].$
We use the convexity of $f$:
$E[F(x_{k+1})\mid\mathcal F_k] \le \big(1-\tfrac1n\big)(g(x_k) + f(x_k)) + \tfrac1n(g(x_*) + f(x_*)) + \tfrac12\|x_* - x_k\|_L^2 - E\big[\tfrac12\|x_* - x_{k+1}\|_L^2\mid\mathcal F_k\big].$
We rearrange and apply the total expectation:
$E\big[F(x_{k+1}) - F(x_*) + \tfrac12\|x_* - x_{k+1}\|_L^2\big] \le E\big[\big(1-\tfrac1n\big)(F(x_k) - F(x_*)) + \tfrac12\|x_* - x_k\|_L^2\big].$
Summing this inequality for $k$ from $0$ to $K$ yields
$E[F(x_{K+1}) - F(x_*)] + \tfrac12 E\big[\|x_* - x_{K+1}\|_L^2\big] + \tfrac1n\sum_{k=1}^{K} E[F(x_k) - F(x_*)] \le \big(1-\tfrac1n\big)(F(x_0) - F(x_*)) + \tfrac12\|x_* - x_0\|_L^2.$
Using (3.6), which implies that $E[F(x_k) - F(x_*)] \ge E[F(x_{K+1}) - F(x_*)]$ for all $k\le K$, and the fact that $E[\|x_* - x_{K+1}\|_L^2] \ge 0$, we obtain
$\big(1 + \tfrac{K}{n}\big)\,E[F(x_{K+1}) - F(x_*)] \le \big(1-\tfrac1n\big)(F(x_0) - F(x_*)) + \tfrac12\|x_* - x_0\|_L^2.$
We just need to divide by $1 + \tfrac{K}{n} = \tfrac{K+n}{n}$ to conclude.

Example 6 (Lasso). Proximal coordinate descent is widely used to solve the Lasso problem, given by
$\min_{x\in\mathbb{R}^p}\ \tfrac12\|y - Zx\|^2 + \lambda\|x\|_1.$
Here, $f(x) = \tfrac12\|y - Zx\|^2$ is differentiable, while $g(x) = \lambda\|x\|_1$ is a non-differentiable function whose proximal operator is the soft-thresholding operator.

Example 7 (Multi-task Lasso). In the multi-task framework, the Lasso problem can be generalised as
$\min_{x\in\mathbb{R}^{p\times q}}\ \tfrac12\|Y - Zx\|_F^2 + \lambda\sum_{j=1}^p\|x_{j,:}\|_2.$
Here, the optimisation variable is a $p\times q$ matrix. One can see that the nonsmooth part of the objective is $g(x) = \lambda\sum_{j=1}^p\|x_{j,:}\|_2$. This function is not separable when we consider the entries of $x$ one by one, but it is separable if we group these entries row-wise. Hence, we can consider block coordinate descent with $p$ blocks of size $q$ for the resolution of the multi-task Lasso problem.

Example 8 ($\ell_1/\ell_2$-regularised multinomial logistic regression). Logistic regression is famous for classification problems. One observes for each $i\in[n]$ a class label $c_i\in\{1,\dots,q\}$ and a vector of features $z_i\in\mathbb{R}^p$. This information can be recast into a matrix $Y\in\mathbb{R}^{n\times q}$ filled with 0's and 1's: $Y_{i,k} = 1_{\{c_i = k\}}$. A matrix $B\in\mathbb{R}^{p\times q}$ is formed by $q$ vectors encoding the hyperplanes of the linear classification. The multinomial $\ell_1/\ell_2$-regularised regression reads
$\min_{B\in\mathbb{R}^{p\times q}}\ \sum_{i=1}^n\Big(-\sum_{k=1}^q Y_{i,k}\,z_i^\top B_{:,k} + \log\Big(\sum_{k=1}^q \exp(z_i^\top B_{:,k})\Big)\Big) + \lambda\sum_{j=1}^p\|B_{j,:}\|_2.$
Like the multi-task Lasso problem, this problem can be solved with proximal coordinate descent as long as we consider blocks of variables corresponding to the rows of $B$ rather than single variables.

Exercise:
1. Find the proximal operator of the non-smooth function $g(B) = \lambda\sum_{j=1}^p\|B_{j,:}\|_2$.
2. Give the expression of the partial derivatives of the smooth function $f(B) = \sum_{i=1}^n\big(-\sum_{k=1}^q Y_{i,k}\,z_i^\top B_{:,k} + \log\big(\sum_{k=1}^q\exp(z_i^\top B_{:,k})\big)\big)$.
3. Give an estimate of the $p$ block-wise Lipschitz constants of $\nabla f$.
4. Write the proximal coordinate descent method for $\ell_1/\ell_2$-regularised multinomial logistic regression.
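Returning to Example 6, here is a sketch of proximal coordinate descent for the Lasso, combining Algorithm 3 with the soft-thresholding operator and the residual trick of Section 2; the step sizes use $L_i = \|Ze_i\|^2$. The random instance at the end is only there to illustrate the sparsity of the output.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(0.0, np.abs(v) - t)

def lasso_prox_cd(Z, y, lam, n_epochs=100, rng=None):
    """Proximal coordinate descent (Algorithm 3) for
    min_x 0.5*||y - Zx||^2 + lam*||x||_1, with random coordinate selection and
    the residual r = Zx - y maintained incrementally."""
    rng = np.random.default_rng() if rng is None else rng
    m, p = Z.shape
    L = (Z ** 2).sum(axis=0)              # L_i = ||Z e_i||^2
    x = np.zeros(p)
    r = -y.astype(float).copy()           # r = Zx - y with x = 0
    for _ in range(n_epochs * p):
        i = rng.integers(p)
        g_i = Z[:, i] @ r                 # grad_i f(x)
        x_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])
        r += (x_new - x[i]) * Z[:, i]
        x[i] = x_new
    return x

rng = np.random.default_rng(4)
Z = rng.standard_normal((100, 20))
y = Z[:, :3] @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.standard_normal(100)
print(lasso_prox_cd(Z, y, lam=5.0))       # most coordinates end up exactly zero
```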

4 Stochastic dual coordinate ascent for support vector machines

In this section, we focus on the linear Support Vector Machine (SVM) problem
$\min_{w\in\mathbb{R}^p}\ C\sum_{i=1}^n\max(0,\, 1 - y_i z_i^\top w) + \tfrac12\|w\|^2$
where $C$ is a positive real number, $y\in\mathbb{R}^n$ and $z_i\in\mathbb{R}^p$ for all $i$. Note that we consider the formulation without intercept. The objective function contains a non-smooth and non-separable term, so we cannot apply coordinate descent to it directly. However, a dual formulation of the SVM problem is given by
$\max_{\alpha\in\mathbb{R}^n}\ -\tfrac12\sum_{j=1}^p\Big(\sum_{i=1}^n Z_{i,j}\,y_i\,\alpha^{(i)}\Big)^2 + \sum_{i=1}^n\alpha^{(i)} - I_{[0,C]^n}(\alpha),$
where $Z\in\mathbb{R}^{n\times p}$ is the matrix whose $i$-th row is $z_i^\top$. The objective function of this problem does decompose into a differentiable concave function $f(\alpha) = -\tfrac12\sum_{j=1}^p\big(\sum_{i=1}^n Z_{i,j}y_i\alpha^{(i)}\big)^2 + \sum_{i=1}^n\alpha^{(i)}$ and a nonsmooth, concave and separable function $g(\alpha) = -I_{[0,C]^n}(\alpha)$. Stochastic Dual Coordinate Ascent (SDCA) is proximal coordinate ascent (the version of coordinate descent for concave functions) applied to this problem.

Exercise: Write an implementation of SDCA. It may be useful to maintain residuals $w_k$ defined by $w_k^{(j)} = \sum_{i=1}^n Z_{i,j}\,y_i\,\alpha_k^{(i)}$ for all $j\in\{1,\dots,p\}$.
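A possible answer to the exercise is sketched below (with the usual hedges: the coordinate is sampled uniformly, the concave 1-D dual subproblem is solved in closed form and clipped to $[0,C]$, and the residuals $w^{(j)} = \sum_i Z_{i,j} y_i \alpha^{(i)}$ are updated incrementally; the random instance at the end is only an illustration).

```python
import numpy as np

def sdca_svm(Z, y, C, n_epochs=20, rng=None):
    """Stochastic Dual Coordinate Ascent for the linear SVM without intercept.

    Maximises -0.5*||Z^T Diag(y) alpha||^2 + sum_i alpha_i over alpha in [0, C]^n.
    Returns the dual variable alpha and the residuals w = Z^T Diag(y) alpha."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = Z.shape
    sq_norms = (Z ** 2).sum(axis=1)       # ||z_i||^2
    alpha = np.zeros(n)
    w = np.zeros(p)                       # w^(j) = sum_i Z_ij y_i alpha_i
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        # closed-form maximiser of the concave 1-D dual subproblem, clipped to [0, C]
        step = (1.0 - y[i] * (Z[i] @ w)) / (y[i] ** 2 * sq_norms[i])
        a_new = np.clip(alpha[i] + step, 0.0, C)
        w += (a_new - alpha[i]) * y[i] * Z[i]
        alpha[i] = a_new
    return alpha, w

rng = np.random.default_rng(5)
w_true = rng.standard_normal(5)
Z = rng.standard_normal((200, 5))
y = np.sign(Z @ w_true)
alpha, w = sdca_svm(Z, y, C=1.0)
print(np.mean(np.sign(Z @ w) == y))       # training accuracy of the recovered primal point
```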

Even if we are running the algorithm in the dual, we are interested in the primal problem. The following result shows that we can recover a good primal solution from the dual iterates and gives theoretical guarantees for the convergence in the primal.
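In practice, the guarantee below can be monitored directly: the primal point $w_k = Z^\top\mathrm{Diag}(y)\alpha_k$ is exactly the vector of residuals maintained by SDCA, and the duality gap $P(w_k) - D(\alpha_k)$ is cheap to evaluate. The short sketch below (helper names chosen for this sketch) shows one way to do it.

```python
import numpy as np

def primal_value(Z, y, C, w):
    """P(w) = C * sum_i max(0, 1 - y_i z_i^T w) + 0.5*||w||^2."""
    return C * np.maximum(0.0, 1.0 - y * (Z @ w)).sum() + 0.5 * w @ w

def dual_value(Z, y, alpha):
    """D(alpha) = -0.5*||Z^T Diag(y) alpha||^2 + sum_i alpha_i (alpha assumed feasible)."""
    w = Z.T @ (y * alpha)
    return -0.5 * w @ w + alpha.sum()

def duality_gap(Z, y, C, alpha):
    w = Z.T @ (y * alpha)                 # primal point recovered from the dual variable
    return primal_value(Z, y, C, w) - dual_value(Z, y, alpha)
```

With the `sdca_svm` sketch above, `duality_gap(Z, y, 1.0, alpha)` should be close to zero after a few epochs.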

Theorem 2 (Shalev-Shwartz and Zhang (2013)). Let us define a primal point $w_k = Z^\top\mathrm{Diag}(y)\,\alpha_k$, where $(\alpha_k)_{k\ge0}$ is generated by SDCA, and let $\alpha_*$ be a maximiser of the dual objective $D$. The duality gap satisfies, for all $K\ge n$,
$E\Big[\frac1K\sum_{k=K+1}^{2K}\big(P(w_k) - D(\alpha_k)\big)\Big] \le \frac{n}{K+n}\Big(\big(1-\tfrac1n\big)\big(D(\alpha_*) - D(\alpha_0)\big) + \tfrac12\|\alpha_* - \alpha_0\|_L^2\Big) + \frac{nC^2}{2K}\sum_{i=1}^n\|y_iz_i\|^2,$
where the primal value is $P(w_k) = C\sum_{i=1}^n\max(0,\,1 - y_iz_i^\top w_k) + \tfrac12\|w_k\|^2$, the dual value is $D(\alpha_k) = -\tfrac12\|Z^\top\mathrm{Diag}(y)\alpha_k\|^2 + \sum_{i=1}^n\alpha_k^{(i)} - I_{[0,C]^n}(\alpha_k)$, and $L_i = y_i^2\|z_i\|^2$ for all $i$.

Proof. As SDCA solves the dual problem with coordinate ascent, Theorem 1 gives
$E[D(\alpha_*) - D(\alpha_{k+1})] \le \frac{n}{k+n}\Big(\big(1-\tfrac1n\big)(D(\alpha_*) - D(\alpha_0)) + \tfrac12\|\alpha_* - \alpha_0\|_L^2\Big).$
The goal of the theorem is to upper bound $E[P(w_k) - D(\alpha_k)]$ by quantities involving $E[D(\alpha_*) - D(\alpha_{k+1})]$. Note that by weak duality, $P(w_k) - D(\alpha_k) \ge D(\alpha_*) - D(\alpha_k) \ge E[D(\alpha_*) - D(\alpha_{k+1})\mid\mathcal F_k]$, but what we need is an inequality in the other direction. For this, we will use the fact that $(\alpha_k)_{k\ge0}$ is generated by the coordinate ascent method.

Using the feasibility of $\alpha_{k+1}$ and the definition of $w_{k+1}$, we can simplify $D(\alpha_{k+1})$ as
$D(\alpha_{k+1}) = -\tfrac12\|Z^\top\mathrm{Diag}(y)\alpha_{k+1}\|^2 + \sum_{i=1}^n\alpha_{k+1}^{(i)} - I_{[0,C]^n}(\alpha_{k+1}) = -\tfrac12\|w_{k+1}\|^2 + e^\top\alpha_{k+1},$
where $e$ denotes the all-ones vector. As $\alpha_{k+1} = \alpha_k + U_{i_{k+1}}\big(\bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}\big)$, we have $w_{k+1} = w_k + z_{i_{k+1}}y_{i_{k+1}}\big(\bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}\big)$ and
$D(\alpha_{k+1}) = -\tfrac12\big\|w_k + z_{i_{k+1}}y_{i_{k+1}}\big(\bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}\big)\big\|^2 + e^\top\alpha_k + \bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}.$
To simplify the notation, we write here $i = i_{k+1}$. Note that
$\bar\alpha_{k+1}^{(i)} = \arg\max_{a\in[0,C]}\ \big(1 - y_iz_i^\top Z^\top\mathrm{Diag}(y)\alpha_k\big)(a - \alpha_k^{(i)}) - \frac{\|y_iz_i\|^2}{2}(a - \alpha_k^{(i)})^2 = \arg\max_{a\in[0,C]}\ -\tfrac12\|w_k + z_iy_i(a - \alpha_k^{(i)})\|^2 + e^\top\alpha_k + a - \alpha_k^{(i)}.$
So let us consider $\phi: x\mapsto C\max(0,\,1 - x)$, $u\in-\partial\phi(y_iz_i^\top w_k)\subset[0,C]$ and $s\in[0,1]$. Then
$D(\alpha_{k+1}) = \max_{a\in[0,C]}\ -\tfrac12\|w_k + z_iy_i(a - \alpha_k^{(i)})\|^2 + e^\top\alpha_k + a - \alpha_k^{(i)}$
$\ge -\tfrac12\|w_k + z_iy_i\big((su + (1-s)\alpha_k^{(i)}) - \alpha_k^{(i)}\big)\|^2 + e^\top\alpha_k + \big(su + (1-s)\alpha_k^{(i)}\big) - \alpha_k^{(i)}$
$= -\tfrac12\|w_k + z_iy_i\,s(u - \alpha_k^{(i)})\|^2 + e^\top\alpha_k + s(u - \alpha_k^{(i)})$
$= -\tfrac12\|w_k\|^2 - \tfrac{s^2}{2}\|z_iy_i(u - \alpha_k^{(i)})\|^2 - s(u - \alpha_k^{(i)})\,y_iz_i^\top w_k + e^\top\alpha_k + s(u - \alpha_k^{(i)})$
$= D(\alpha_k) - \tfrac{s^2}{2}\|z_iy_i(u - \alpha_k^{(i)})\|^2 - s(u - \alpha_k^{(i)})\,y_iz_i^\top w_k + s(u - \alpha_k^{(i)}).$
As $u\in-\partial\phi(y_iz_i^\top w_k)\subset[0,C]$ and $\phi^*(q) = q + I_{[-C,0]}(q)$, the Fenchel-Young equality leads to $\phi(y_iz_i^\top w_k) - u = -u\,y_iz_i^\top w_k$. Hence,
$D(\alpha_{k+1}) \ge D(\alpha_k) - \tfrac{s^2}{2}\|z_iy_i(u - \alpha_k^{(i)})\|^2 + s\,\phi(y_iz_i^\top w_k) + s\,\alpha_k^{(i)}y_iz_i^\top w_k - s\,\alpha_k^{(i)}.$
Applying the conditional expectation (with, for each $i$, a subgradient element $u_i\in-\partial\phi(y_iz_i^\top w_k)\subset[0,C]$), we get
$E[D(\alpha_{k+1})\mid\mathcal F_k] \ge D(\alpha_k) - \frac{s^2}{2n}\sum_{i=1}^n\|z_iy_i(u_i - \alpha_k^{(i)})\|^2 + \frac{s}{n}\sum_{i=1}^n\big(\phi(y_iz_i^\top w_k) + \alpha_k^{(i)}y_iz_i^\top w_k - \alpha_k^{(i)}\big).$
Now,
$P(w_k) - D(\alpha_k) = C\sum_{i=1}^n\max(0,\,1 - y_iz_i^\top w_k) + \tfrac12\|w_k\|^2 - \big(-\tfrac12\|w_k\|^2 + e^\top\alpha_k\big) = \sum_{i=1}^n\big(\phi(y_iz_i^\top w_k) + \alpha_k^{(i)}y_iz_i^\top w_k - \alpha_k^{(i)}\big),$
where we used $\|w_k\|^2 = w_k^\top Z^\top\mathrm{Diag}(y)\alpha_k = \sum_{i=1}^n\alpha_k^{(i)}y_iz_i^\top w_k$. So that
$E[D(\alpha_{k+1})\mid\mathcal F_k] \ge D(\alpha_k) - \frac{s^2}{2n}\sum_{i=1}^n\|z_iy_i(u_i - \alpha_k^{(i)})\|^2 + \frac{s}{n}\big(P(w_k) - D(\alpha_k)\big) \ge D(\alpha_k) - \frac{s^2}{2n}\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)C^2 + \frac{s}{n}\big(P(w_k) - D(\alpha_k)\big),$
where the last inequality derives from $\alpha_k^{(i)}\in[0,C]$ and $u_i\in[0,C]$, so that $|u_i - \alpha_k^{(i)}|\le C$. We apply the total expectation, rearrange, and sum for $k$ from $K_1$ to $K_2 - 1$:
$\frac{s}{n}\sum_{k=K_1}^{K_2-1} E[P(w_k) - D(\alpha_k)] \le E[D(\alpha_{K_2})] - E[D(\alpha_{K_1})] + \frac{s^2}{2n}C^2\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)(K_2 - K_1)$
$\le D(\alpha_*) - E[D(\alpha_{K_1})] + \frac{s^2}{2n}C^2\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)(K_2 - K_1)$
$\le \frac{c_0\,n}{K_1 - 1 + n} + \frac{s^2}{2n}C^2\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)(K_2 - K_1),$
where $c_0 = \big(1-\tfrac1n\big)(D(\alpha_*) - D(\alpha_0)) + \tfrac12\|\alpha_* - \alpha_0\|_L^2$. Choosing $K_1 = K+1$, $K_2 = 2K+1$ and $s = \tfrac{n}{K}$, we obtain, for $K\ge n$ (because we need $s\le1$),
$\frac1K\sum_{k=K+1}^{2K} E[P(w_k) - D(\alpha_k)] \le \frac{c_0\,n}{K+n} + \frac{nC^2}{2K}\sum_{i=1}^n\|z_iy_i\|^2.$

References

Beck, A. and Tetruashvili, L. (2013). On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4).

Collins, M., Schapire, R. E., and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3).

Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2).

Powell, M. (1976). Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear Programming, 9:53-72.

Richtárik, P. and Takáč, M. (2014). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2).

Seidel, L. (1874). Ueber ein Verfahren, die Gleichungen, auf welche die Methode der kleinsten Quadrate führt, sowie lineäre Gleichungen überhaupt, durch successive Annäherung aufzulösen (Aus den Abhandl. d. Bayer. Akademie II. Cl. XI. Bd. III. Abth.). Verlag der Akademie.

Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1).

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3).

Warga, J. (1963). Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3).

Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3-34.
