Coordinate descent methods

Master Mathematics for data science and big data
Olivier Fercoq
November 3, 2015

Contents

1 Exact coordinate descent
2 Coordinate gradient descent
3 Proximal coordinate descent
4 Stochastic dual coordinate ascent for support vector machines

1 Exact coordinate descent

The idea of coordinate descent is to decompose a large optimisation problem into a sequence of one-dimensional optimisation problems. The algorithm was first described for the minimisation of quadratic functions by Gauss and Seidel in Seidel (1874). Coordinate descent methods have become unavoidable in machine learning because they are very efficient for key problems, namely the Lasso, logistic regression and support vector machines. Moreover, the decomposition into small subproblems means that only a small part of the data is processed at each iteration, which makes coordinate descent easily scalable to high dimensions.

We first decompose the space of optimisation variables $X$ into blocks $X_1 \times \dots \times X_n = X$. A classical choice when $X = \mathbb{R}^n$ is to take $X_1 = \dots = X_n = \mathbb{R}$. We denote by $U_i$ the canonical injection from $X_i$ to $X$, that is, $U_i$ is such that for all $h \in X_i$,
$U_i h = (\underbrace{0, \dots, 0}_{i-1 \text{ zeros}}, h, \underbrace{0, \dots, 0}_{n-i \text{ zeros}}) \in X.$

For a function $f : X_1 \times \dots \times X_n \to \mathbb{R}$, we define the following algorithm.

Algorithm 1: Exact coordinate descent
Start at $x_0 \in X$. At iteration $k$, choose $l = (k \bmod n) + 1$ (cyclic rule) and define $x_{k+1} \in X$ by
$x_{k+1}^{(i)} = \arg\min_{z \in X_l} f(x_k^{(1)}, \dots, x_k^{(l-1)}, z, x_k^{(l+1)}, \dots, x_k^{(n)})$ if $i = l$,
$x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq l$.
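To fix ideas, here is a minimal Python sketch of Algorithm 1 with the cyclic rule for the case $X_1 = \dots = X_n = \mathbb{R}$. It is an illustration, not a reference implementation: the one-dimensional subproblems are solved numerically with scipy's `minimize_scalar`, and the search interval `bounds` is an assumption of this sketch (the subproblems in the notes are unconstrained).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_coordinate_descent(f, x0, n_iters, bounds=(-10.0, 10.0)):
    """Exact coordinate descent (Algorithm 1) with the cyclic rule.

    f      : callable taking a 1-D numpy array and returning a float
    x0     : starting point (1-D numpy array)
    bounds : interval on which each 1-D subproblem is solved (assumption of
             this sketch; the subproblems in the notes are unconstrained).
    """
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for k in range(n_iters):
        l = k % n  # cyclic rule: l = (k mod n) + 1, written with 0-based indexing
        # Minimise f along the l-th coordinate, all other coordinates fixed.
        def f_1d(z, l=l):
            y = x.copy()
            y[l] = z
            return f(y)
        res = minimize_scalar(f_1d, bounds=bounds, method='bounded')
        x[l] = res.x
    return x

# Example: minimise f(x) = ||Ax - b||^2 / 2 on a small random instance.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
print(exact_coordinate_descent(f, np.zeros(5), n_iters=200))
```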

[Figure 1: The successive iterates of the coordinate descent method on a 2D example. The function being minimised is represented by its level sets: the bluer the circle, the lower the function value.]

Proposition 1 (Warga (1963)). If $f$ is continuously differentiable and strictly convex and there exists $x_* = \arg\min_{x\in X} f(x)$, then the exact coordinate descent method (Alg. 1) converges to $x_*$.

Example 1 (least squares). $f(x) = \frac12\|Ax - b\|^2 = \frac12\sum_{j=1}^m (a_j^\top x - b_j)^2$, where $a_j^\top$ denotes the $j$-th row of $A$.
At each iteration, we need to solve in $z$ the 1D equation
$\frac{\partial f}{\partial x^{(l)}}(x_k^{(1)}, \dots, x_k^{(l-1)}, z, x_k^{(l+1)}, \dots, x_k^{(n)}) = 0.$
For all $x \in \mathbb{R}^n$,
$\frac{\partial f}{\partial x^{(l)}}(x) = (Ae_l)^\top(Ax - b) = (Ae_l)^\top(Ae_l)\,x^{(l)} + (Ae_l)^\top\Big(\sum_{j\neq l}(Ae_j)\,x^{(j)}\Big) - (Ae_l)^\top b,$
so we get
$z = x_{k+1}^{(l)} = \frac{1}{(Ae_l)^\top(Ae_l)}\Big((Ae_l)^\top b - (Ae_l)^\top\sum_{j\neq l}(Ae_j)\,x_k^{(j)}\Big) = x_k^{(l)} - \frac{(Ae_l)^\top(Ax_k - b)}{(Ae_l)^\top(Ae_l)}.$
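For Example 1 the coordinate update has the closed form above, so no inner solver is needed. The following sketch (an illustration, with a random instance chosen only to check the result against `numpy.linalg.lstsq`) keeps the residual $r = Ax - b$ up to date so that each update only reads one column of $A$.

```python
import numpy as np

def exact_cd_least_squares(A, b, n_epochs=100):
    """Cyclic exact coordinate descent for f(x) = ||Ax - b||^2 / 2 (Example 1)."""
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                        # residual r = Ax - b
    col_sq_norms = (A ** 2).sum(axis=0)  # (A e_l)^T (A e_l) for each column
    for _ in range(n_epochs):
        for l in range(n):
            # closed-form coordinate update: x^(l) <- x^(l) - (A e_l)^T r / ||A e_l||^2
            delta = -(A[:, l] @ r) / col_sq_norms[l]
            x[l] += delta
            r += delta * A[:, l]         # keep the residual consistent with x
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((30, 8)), rng.standard_normal(30)
x_cd = exact_cd_least_squares(A, b)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_cd - x_ls))       # should be small after enough epochs
```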

Example 2 (non-differentiable function). $f(x^{(1)}, x^{(2)}) = |x^{(1)} - x^{(2)}| - \min(x^{(1)}, x^{(2)}) + I_{[0,1]^2}(x).$
$f$ is convex but not differentiable. If we nevertheless try to run exact coordinate descent, the algorithm proceeds as $x_1^{(1)} = \arg\min_z f(z, x_0^{(2)}) = x_0^{(2)}$, $x_1^{(2)} = \arg\min_z f(x_1^{(1)}, z) = x_0^{(2)}$, and so on. Thus exact coordinate descent converges in two iterations to $(x_0^{(2)}, x_0^{(2)})$: the algorithm is stuck at a point of non-differentiability on the line $\{x^{(1)} = x^{(2)}\}$ and does not reach the minimiser $(1, 1)$.

[Figure 2: The function of Example 2 (labels in the figure: $x_0$ and $x_1 = x_2$).]

Example 3 (non-convex differentiable function). $f(x^{(1)}, x^{(2)}, x^{(3)}) = -(x^{(1)}x^{(2)} + x^{(2)}x^{(3)} + x^{(3)}x^{(1)}) + \sum_{i=1}^3 \max(0, |x^{(i)}| - 1)^2.$
As shown by Powell (1976), exact coordinate descent on this function started at the initial point $x_0 = (-1-\epsilon,\, 1+\epsilon/2,\, -1-\epsilon/4)$ has a limit cycle around the 6 corners of the cube that are not minimisers and avoids the corners that are minimisers.
Exercise: Show Powell's result.

Example 4 (Adaboost). The Adaboost algorithm (Collins et al. (2002)) was designed to minimise the exponential loss given by
$f(x) = \sum_{j=1}^m \exp(-y_j h_j^\top x).$
At each iteration, we select the variable $l$ such that $l = \arg\max_i |\nabla_i f(x_k)|$ and we perform an exact coordinate descent step along this coordinate. This variable selection rule is called the greedy or Gauss-Southwell rule. Like the cyclic rule, it leads to a converging algorithm but requires computing the full gradient at each iteration. Greedy coordinate descent is interesting in the case of the exponential loss because the gradient of the function has a few very large coefficients and many negligible ones.
Exercise: Suppose that $y_j \in \{-1, 1\}$ and $h_{j,i} \in \{-1, 0, 1\}$ for all $j, i$. Give explicit formulas for $\nabla_i f(x_k)$ and for the next update $x_{k+1}$ knowing $x_k$.
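As an illustration of the greedy rule on the exponential loss, here is one possible sketch in Python. The closed-form coordinate step $t = \frac12\log(W_+/W_-)$ and the small constant `eps` guarding against empty sums are choices of this sketch (they can be recovered by solving the exercise above), not formulas taken verbatim from the notes.

```python
import numpy as np

def greedy_cd_exp_loss(H, y, n_iters=100, eps=1e-12):
    """Greedy (Gauss-Southwell) exact coordinate descent for
    f(x) = sum_j exp(-y_j h_j^T x), with y_j in {-1,1} and h_{j,i} in {-1,0,1}."""
    m, n = H.shape
    x = np.zeros(n)
    for _ in range(n_iters):
        w = np.exp(-y * (H @ x))          # w_j = exp(-y_j h_j^T x)
        grad = -H.T @ (y * w)             # partial derivatives of f at x
        i = int(np.argmax(np.abs(grad)))  # greedy / Gauss-Southwell rule
        s = y * H[:, i]                   # s_j = y_j h_{j,i}, in {-1, 0, 1}
        W_plus = w[s == 1].sum() + eps
        W_minus = w[s == -1].sum() + eps
        x[i] += 0.5 * np.log(W_plus / W_minus)  # exact 1-D minimisation along coordinate i
    return x

rng = np.random.default_rng(2)
H = rng.integers(-1, 2, size=(50, 10)).astype(float)
y = rng.choice([-1.0, 1.0], size=50)
x = greedy_cd_exp_loss(H, y)
print(np.sum(np.exp(-y * (H @ x))))       # exponential loss after the greedy passes
```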

2 Coordinate gradient descent

Solving a one-dimensional optimisation problem is generally easy and the solution can be approximated very well by algorithms like the bisection method. However, for the exact coordinate descent method, one needs to solve a huge number of one-dimensional problems and the expense quickly becomes prohibitive. Moreover, why should we solve the 1-dimensional problem to high accuracy if this solution is destroyed at the next iteration? The idea of coordinate gradient descent is to perform one iteration of gradient descent on the 1-dimensional problem $\min_{z\in X_l} f(x_k^{(1)},\dots,x_k^{(l-1)}, z, x_k^{(l+1)},\dots,x_k^{(n)})$ instead of solving it completely. In general, this drastically reduces the cost of each iteration while keeping the same convergence behaviour.

When choosing the cyclic or greedy rule, the algorithm does converge for any convex function $f$ that has a Lipschitz-continuous gradient and such that $\arg\min_x f(x) \neq \emptyset$. In fact we will assume that we actually know the coordinate-wise Lipschitz constants of the gradient of $f$, namely the Lipschitz constants of the functions
$g_{i,x}: X_i \to \mathbb{R},\quad h \mapsto f(x + U_i h) = f(x^{(1)},\dots,x^{(i-1)}, x^{(i)}+h, x^{(i+1)},\dots,x^{(n)}).$

Algorithm 2: Coordinate gradient descent
Start at $x_0$. At iteration $k$, choose $i_{k+1} \in \{1,\dots,n\}$ and define $x_{k+1}$ by
$x_{k+1}^{(i)} = x_k^{(i)} - \gamma_i \nabla_i f(x_k)$ if $i = i_{k+1}$,
$x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq i_{k+1}$.

We will denote by $L_i = L(\nabla g_{i,x})$ this Lipschitz constant. Written in terms of $f$, this means that
$\forall x \in X,\ \forall i \in \{1,\dots,n\},\ \forall h \in X_i,\qquad \|\nabla_i f(x + U_i h) - \nabla_i f(x)\| \le L_i \|h\|.$

Lemma 1. If $f$ has a coordinate-wise Lipschitz gradient with constants $L_1,\dots,L_n$, then
$\forall x \in X,\ \forall i \in \{1,\dots,n\},\ \forall h \in X_i,\qquad f(x + U_i h) \le f(x) + \langle \nabla_i f(x), h\rangle + \frac{L_i}{2}\|h\|^2.$
Proof. This is Taylor's inequality applied to $g_{i,x}$. Note that we do not require the function to be twice differentiable.

Proposition 2 (Beck and Tetruashvili (2013)). Assume that $f$ is convex, $\nabla f$ is Lipschitz continuous and $\arg\min_{x\in X} f(x) \neq \emptyset$. If $i_{k+1}$ is chosen with the cyclic rule $i_{k+1} = (k \bmod n)+1$ and $\gamma_i = 1/L_i$ for all $i$, then the coordinate gradient descent method (Alg. 2) satisfies
$f(x_{k+1}) - f(x_*) \le \frac{4 L_{\max}\big(1 + n^3 L_{\max}/L_{\min}\big) R^2(x_0)}{k + 8/n}$
where $R(x_0) = \max_{x,y\in X}\{\|x-y\| : f(y) \le f(x) \le f(x_0)\}$, $L_{\max} = \max_i L_i$ and $L_{\min} = \min_i L_i$.

The proof of this result is quite technical and in fact the bound is much more pessimistic than what is observed in practice ($n^3$ is very large if $n$ is large). This is due to the fact that the cyclic rule behaves particularly badly on some extreme examples. To avoid such traps, it has been suggested to randomise the coordinate selection process.

Proposition 3 (Nesterov (2012)). Assume that $f$ is convex, $\nabla f$ is Lipschitz continuous and $\arg\min_{x\in X} f(x) \neq \emptyset$. If $i_{k+1}$ is randomly generated, independently of $i_1,\dots,i_k$, with $P(i_{k+1}=i) = \frac1n$ for all $i\in\{1,\dots,n\}$, and $\gamma_i = 1/L_i$, then the coordinate gradient descent method (Alg. 2) satisfies, for all $x_* \in \arg\min_x f(x)$,
$E[f(x_{k+1}) - f(x_*)] \le \frac{n}{k+n}\Big(\big(1-\tfrac1n\big)(f(x_0) - f(x_*)) + \tfrac12\|x_* - x_0\|_L^2\Big)$
where $\|x\|_L^2 = \sum_{i=1}^n L_i |x^{(i)}|^2$.
Proof. This is a particular case of the method developed in the next section.
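A minimal sketch of Algorithm 2 with the random selection rule of Proposition 3 and step sizes $\gamma_i = 1/L_i$ is given below; the partial-derivative callable `grad_i` and the array of coordinate-wise Lipschitz constants `L` are assumptions of the sketch, to be supplied by the user for their own problem.

```python
import numpy as np

def coordinate_gradient_descent(grad_i, L, x0, n_iters, rng=None):
    """Coordinate gradient descent (Algorithm 2), random rule, gamma_i = 1/L_i.

    grad_i : callable (x, i) -> i-th partial derivative of f at x
    L      : array of coordinate-wise Lipschitz constants L_1, ..., L_n
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for _ in range(n_iters):
        i = rng.integers(n)              # P(i_{k+1} = i) = 1/n
        x[i] -= grad_i(x, i) / L[i]      # x^(i) <- x^(i) - (1/L_i) * grad_i f(x)
    return x

# Example: f(x) = ||Ax - b||^2 / 2, for which L_i = ||A e_i||^2.
rng = np.random.default_rng(3)
A, b = rng.standard_normal((40, 10)), rng.standard_normal(40)
L = (A ** 2).sum(axis=0)
grad_i = lambda x, i: A[:, i] @ (A @ x - b)
print(coordinate_gradient_descent(grad_i, L, np.zeros(10), n_iters=2000, rng=rng))
```

As written, `grad_i` recomputes $Ax$ at every call; the residual update discussed in the comparison with gradient descent below removes this overhead.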

Comparison with gradient descent. The iteration complexity of the gradient descent method is
$f(x_{k+1}) - f(x_*) \le \frac{L(\nabla f)}{2(k+1)}\|x_* - x_0\|^2.$
This means that to get an $\epsilon$-solution (i.e. $x_k$ such that $f(x_k) - f(x_*) \le \epsilon$), we need at most $\frac{L(\nabla f)}{2\epsilon}\|x_* - x_0\|^2$ iterations. What is most expensive in gradient descent is the evaluation of the gradient $\nabla f(x)$, with a cost $C$, so the total cost of the method is
$C_{\mathrm{grad}} = C\,\frac{L(\nabla f)\|x_* - x_0\|^2}{2\epsilon}.$
Neglecting the effect of randomisation, we usually have an $\epsilon$-solution with coordinate descent in
$\frac{n}{\epsilon}\Big(\big(1-\tfrac1n\big)(f(x_0)-f(x_*)) + \tfrac12\|x_*-x_0\|_L^2\Big)$
iterations. The cost of one iteration of coordinate descent is of the order of the cost of evaluating one partial derivative $\nabla_i f(x)$, with a cost $c$, so the total cost of the method is
$C_{\mathrm{cd}} = c\,\frac{n}{\epsilon}\Big(\big(1-\tfrac1n\big)(f(x_0)-f(x_*)) + \tfrac12\|x_*-x_0\|_L^2\Big).$
How do these two quantities compare? Let us consider the case where $f(x) = \frac12\|Ax-b\|^2$. Computing $\nabla f(x) = A^\top(Ax-b)$ amounts to updating the residuals $r_k = Ax_k - b$ (one matrix-vector product and a sum) and computing one matrix-vector product. We thus have $C = O(\mathrm{nnz}(A))$. Computing $\nabla_i f(x) = e_i^\top A^\top(Ax - b)$ amounts to
1. updating the residuals $r_k = Ax_k - b$: one scalar-vector product and a sum, since we have $r_{k+1} = r_k + (x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})})\,Ae_{i_{k+1}}$;
2. computing one vector-vector product (the $i_{k+1}$-th column of $A$ against the residuals).
Thus $c = O(\mathrm{nnz}(Ae_{i_{k+1}})) = O(\mathrm{nnz}(A)/n) = C/n$ if the columns of $A$ are equally sparse.
Moreover, we always have $f(x_0) - f(x_*) \le \frac{L(\nabla f)}{2}\|x_0 - x_*\|^2$, and it may happen that $f(x_0) - f(x_*) \ll L(\nabla f)\|x_0 - x_*\|^2$. Finally, $L(\nabla f) = \lambda_{\max}(A^\top A)$ and $L_i = a_i^\top a_i$ with $a_i = Ae_i$: we always have $L_i \le L(\nabla f)$ and it may happen that $L_i = O(L(\nabla f)/n)$. To conclude, in the quadratic case, $C_{\mathrm{cd}} \lesssim C_{\mathrm{grad}}$ and we may have $C_{\mathrm{cd}} = O(C_{\mathrm{grad}}/n)$.
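The bookkeeping behind the estimate $c = O(\mathrm{nnz}(A)/n)$ can be made concrete. The sketch below is an illustration with a dense array for simplicity; with sparse data one would store $A$ in a column-compressed format so that column access stays cheap.

```python
import numpy as np

def cd_quadratic_with_residuals(A, b, n_iters, rng=None):
    """Randomised coordinate gradient descent on f(x) = ||Ax - b||^2 / 2.

    The residual r = Ax - b is kept up to date, so one iteration only reads one
    column of A -- this is the per-iteration cost discussed above."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    L = (A ** 2).sum(axis=0)              # L_i = ||A e_i||^2
    x = np.zeros(n)
    r = -b.astype(float).copy()           # r = Ax - b with x = 0
    for _ in range(n_iters):
        i = rng.integers(n)
        g_i = A[:, i] @ r                 # grad_i f(x) = e_i^T A^T (Ax - b)
        delta = -g_i / L[i]
        x[i] += delta
        r += delta * A[:, i]              # r_{k+1} = r_k + (x_{k+1}^(i) - x_k^(i)) A e_i
    return x
```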

3 Proximal coordinate descent

We are often interested in solving problems of the type
$\min_{x\in X} F(x) = \min_{x\in X} f(x) + g(x) \qquad (3.1)$
where $f$ and $g$ are convex, so that $F = f + g$ is convex, $f$ has a Lipschitz-continuous gradient and $g$ may be nonsmooth but is separable. This means that for all $x \in X = X_1\times\dots\times X_n$, $g(x) = \sum_{i=1}^n g_i(x^{(i)})$. We can solve this kind of problem with the proximal coordinate descent method (Alg. 3, Tseng (2001)), which also uses the coordinate-wise Lipschitz constants. For this algorithm to be practical, we need to be able to compute efficiently
$\mathrm{prox}_{\gamma,g}(y) = \arg\min_{x\in X}\ g(x) + \tfrac12\|x - y\|_{\gamma^{-1}}^2,$
the proximal operator of $g$ (remember that $\|x\|_\gamma^2 = \sum_{i=1}^n \gamma_i|x^{(i)}|^2$, so that $\|x\|_{\gamma^{-1}}^2 = \sum_{i=1}^n |x^{(i)}|^2/\gamma_i$).

Algorithm 3: Proximal coordinate descent
Start at $x_0 \in X$. At iteration $k$, choose $i_{k+1}\in\{1,\dots,n\}$ and define $x_{k+1}\in X$ by
$x_{k+1}^{(i)} = \arg\min_{x\in X_i}\ g_i(x) + f(x_k) + \langle\nabla_i f(x_k), x - x_k^{(i)}\rangle + \frac{L_i}{2}\|x - x_k^{(i)}\|^2$ if $i = i_{k+1}$,
$x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq i_{k+1}$.

Example 5 (Simple proximal operators).
Indicator of a box: if $g(x) = I_{[a,b]}(x)$, then $\mathrm{prox}_{\gamma,g}(y) = \max(a, \min(y, b))$. This is the projection onto $[a,b]$ (it does not depend on $\gamma$).
Absolute value: if $g(x) = \lambda|x|$, then $\mathrm{prox}_{\gamma,g}(y) = \mathrm{sign}(y)\max(0, |y| - \gamma\lambda)$. This is the soft-thresholding operator.
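The proximal operators of Example 5, together with the row-wise group version used for the multi-task problems later in this section, are one-liners; the code below is a small sketch of them (the function names are chosen for this sketch only).

```python
import numpy as np

def prox_box(y, a, b):
    """prox of the indicator of [a, b]: projection onto the box (independent of gamma)."""
    return np.clip(y, a, b)

def prox_abs(y, gamma, lam):
    """prox of g(x) = lam * |x| with parameter gamma: soft-thresholding at gamma*lam."""
    return np.sign(y) * np.maximum(0.0, np.abs(y) - gamma * lam)

def prox_group(y_block, gamma, lam):
    """prox of g(x) = lam * ||x||_2 on a block: group soft-thresholding
    (useful for the l1/l2 regularisers of Examples 7 and 8 below)."""
    norm = np.linalg.norm(y_block)
    if norm <= gamma * lam:
        return np.zeros_like(y_block)
    return (1.0 - gamma * lam / norm) * y_block

print(prox_abs(np.array([-3.0, 0.2, 1.5]), gamma=1.0, lam=0.5))  # -> [-2.5, 0., 1.]
```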

We define
$\bar x_{k+1} = \mathrm{prox}_{L^{-1},g}\big(x_k - L^{-1}\nabla f(x_k)\big) = \arg\min_{x\in X}\ g(x) + f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x - x_k\|_L^2,$
so that
$x_{k+1}^{(i)} = \bar x_{k+1}^{(i)}$ if $i = i_{k+1}$, and $x_{k+1}^{(i)} = x_k^{(i)}$ if $i \neq i_{k+1}$.

Lemma 2. Let $\gamma\in\mathbb{R}^n_+$ and let $\bar x$ be the minimiser of $x\mapsto g(x) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x-x_k\|_\gamma^2$. Then for all $x\in X$,
$g(\bar x) + \langle\nabla f(x_k), \bar x - x_k\rangle + \tfrac12\|\bar x - x_k\|_\gamma^2 \le g(x) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x - x_k\|_\gamma^2 - \tfrac12\|\bar x - x\|_\gamma^2.$
In particular, this holds with $\gamma = L$ and $\bar x = \bar x_{k+1}$.
Proof. The function $\psi: x\mapsto g(x) + \langle\nabla f(x_k), x - x_k\rangle + \tfrac12\|x - x_k\|_\gamma^2$ is strongly convex with constant 1 with respect to the norm $\|\cdot\|_\gamma$ and its minimiser is $\bar x$. The inequality is just the strong convexity inequality for this function, applied at $x$ and $\bar x$.

Theorem 1 (Richtárik and Takáč (2014)). The proximal coordinate descent method (Alg. 3) with the random selection rule applied to Problem (3.1) satisfies, for all $x_*\in\arg\min_x F(x)$,
$E[F(x_{k+1}) - F(x_*)] \le \frac{n}{k+n}\Big(\big(1-\tfrac1n\big)(F(x_0)-F(x_*)) + \tfrac12\|x_*-x_0\|_L^2\Big)$
where $\|x\|_L^2 = \sum_{i=1}^n L_i|x^{(i)}|^2$.

Proof. By definition of the algorithm, $x_{k+1} - x_k = U_{i_{k+1}}(x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})})$, so by Lemma 1,
$f(x_{k+1}) \le f(x_k) + \langle\nabla_{i_{k+1}} f(x_k), x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})}\rangle + \frac{L_{i_{k+1}}}{2}\|x_{k+1}^{(i_{k+1})} - x_k^{(i_{k+1})}\|^2 = f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \tfrac12\|x_{k+1} - x_k\|_L^2. \qquad (3.2)$

Using the conditional expectation knowing $\mathcal F_k = (i_1,\dots,i_k)$, we get
$E[x_{k+1}^{(i)}\mid\mathcal F_k] = P(i_{k+1}=i)\,\bar x_{k+1}^{(i)} + P(i_{k+1}\neq i)\,x_k^{(i)} = \tfrac1n\bar x_{k+1}^{(i)} + \big(1-\tfrac1n\big)x_k^{(i)},$
$E[\langle\nabla_i f(x_k), x_{k+1}^{(i)} - x_k^{(i)}\rangle\mid\mathcal F_k] = P(i_{k+1}=i)\,\langle\nabla_i f(x_k), \bar x_{k+1}^{(i)} - x_k^{(i)}\rangle + P(i_{k+1}\neq i)\,\langle\nabla_i f(x_k), x_k^{(i)} - x_k^{(i)}\rangle = \tfrac1n\langle\nabla_i f(x_k), \bar x_{k+1}^{(i)} - x_k^{(i)}\rangle,$
and hence
$E[\langle\nabla f(x_k), x_{k+1} - x_k\rangle\mid\mathcal F_k] = \sum_{i=1}^n E[\langle\nabla_i f(x_k), x_{k+1}^{(i)} - x_k^{(i)}\rangle\mid\mathcal F_k] = \tfrac1n\langle\nabla f(x_k), \bar x_{k+1} - x_k\rangle, \qquad (3.3)$
$E[\|x_{k+1} - x_k\|_L^2\mid\mathcal F_k] = \sum_{i=1}^n E[L_i\|x_{k+1}^{(i)} - x_k^{(i)}\|^2\mid\mathcal F_k] = \tfrac1n\|\bar x_{k+1} - x_k\|_L^2, \qquad (3.4)$
$E[g(x_{k+1}) - g(x_k)\mid\mathcal F_k] = \sum_{i=1}^n E[g_i(x_{k+1}^{(i)}) - g_i(x_k^{(i)})\mid\mathcal F_k] = \tfrac1n\big(g(\bar x_{k+1}) - g(x_k)\big). \qquad (3.5)$

Combining (3.3), (3.4) and (3.5) with (3.2), we get
$E[g(x_{k+1}) + f(x_{k+1})\mid\mathcal F_k] \le E[g(x_{k+1})\mid\mathcal F_k] + E\big[f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \tfrac12\|x_{k+1} - x_k\|_L^2\ \big|\ \mathcal F_k\big]$
$= \big(1-\tfrac1n\big)g(x_k) + \tfrac1n g(\bar x_{k+1}) + f(x_k) + \tfrac1n\langle\nabla f(x_k), \bar x_{k+1} - x_k\rangle + \tfrac1{2n}\|\bar x_{k+1} - x_k\|_L^2.$

Using Lemma 2 with $x = x_k$, we get
$E[F(x_{k+1})\mid\mathcal F_k] = E[g(x_{k+1}) + f(x_{k+1})\mid\mathcal F_k] \le g(x_k) + f(x_k) - \tfrac1{2n}\|\bar x_{k+1} - x_k\|_L^2 \le g(x_k) + f(x_k) = F(x_k). \qquad (3.6)$

Then, using Lemma 2 again, with $x = x_*$, we get
$E[F(x_{k+1})\mid\mathcal F_k] \le \big(1-\tfrac1n\big)g(x_k) + f(x_k) + \tfrac1n g(x_*) + \tfrac1n\langle\nabla f(x_k), x_* - x_k\rangle + \tfrac1{2n}\|x_* - x_k\|_L^2 - \tfrac1{2n}\|x_* - \bar x_{k+1}\|_L^2.$
We remark that
$E\big[\tfrac12\|x_* - x_k\|_L^2 - \tfrac12\|x_* - x_{k+1}\|_L^2\ \big|\ \mathcal F_k\big] = \tfrac1{2n}\|x_* - x_k\|_L^2 - \tfrac1{2n}\|x_* - \bar x_{k+1}\|_L^2,$
so that
$E[F(x_{k+1})\mid\mathcal F_k] \le \big(1-\tfrac1n\big)g(x_k) + f(x_k) + \tfrac1n g(x_*) + \tfrac1n\langle\nabla f(x_k), x_* - x_k\rangle + \tfrac12\|x_* - x_k\|_L^2 - E\big[\tfrac12\|x_* - x_{k+1}\|_L^2\mid\mathcal F_k\big].$
We use the convexity of $f$:
$E[F(x_{k+1})\mid\mathcal F_k] \le \big(1-\tfrac1n\big)(g(x_k) + f(x_k)) + \tfrac1n(g(x_*) + f(x_*)) + \tfrac12\|x_* - x_k\|_L^2 - E\big[\tfrac12\|x_* - x_{k+1}\|_L^2\mid\mathcal F_k\big].$
We rearrange and apply the total expectation:
$E\big[F(x_{k+1}) - F(x_*) + \tfrac12\|x_* - x_{k+1}\|_L^2\big] \le E\big[\big(1-\tfrac1n\big)(F(x_k) - F(x_*)) + \tfrac12\|x_* - x_k\|_L^2\big].$
Summing this inequality for $k$ from $0$ to $K$ yields
$E[F(x_{K+1}) - F(x_*)] + \tfrac12 E\big[\|x_* - x_{K+1}\|_L^2\big] + \tfrac1n\sum_{k=1}^{K} E[F(x_k) - F(x_*)] \le \big(1-\tfrac1n\big)(F(x_0) - F(x_*)) + \tfrac12\|x_* - x_0\|_L^2.$
Using (3.6), which implies that $E[F(x_k) - F(x_*)] \ge E[F(x_{K+1}) - F(x_*)]$ for all $k\le K$, and the fact that $E[\|x_* - x_{K+1}\|_L^2] \ge 0$, we obtain
$\big(1 + \tfrac{K}{n}\big)\,E[F(x_{K+1}) - F(x_*)] \le \big(1-\tfrac1n\big)(F(x_0) - F(x_*)) + \tfrac12\|x_* - x_0\|_L^2.$
We just need to divide by $1 + \tfrac{K}{n} = \tfrac{K+n}{n}$ to conclude.

Example 6 (Lasso). Proximal coordinate descent is widely used to solve the Lasso problem, given by
$\min_{x\in\mathbb{R}^p}\ \tfrac12\|y - Zx\|^2 + \lambda\|x\|_1.$
Here, $f(x) = \tfrac12\|y - Zx\|^2$ is differentiable, while $g(x) = \lambda\|x\|_1$ is a non-differentiable function whose proximal operator is the soft-thresholding operator.

Example 7 (Multi-task Lasso). In the multi-task framework, the Lasso problem can be generalised as
$\min_{x\in\mathbb{R}^{p\times q}}\ \tfrac12\|Y - Zx\|_F^2 + \lambda\sum_{j=1}^p\|x_{j,:}\|_2.$
Here, the optimisation variable is a $p\times q$ matrix. One can see that the nonsmooth part of the objective is $g(x) = \lambda\sum_{j=1}^p\|x_{j,:}\|_2$. This function is not separable when we consider the entries of $x$ one by one, but it is separable if we group these entries row-wise. Hence, we can consider block coordinate descent with $p$ blocks of size $q$ for the resolution of the multi-task Lasso problem.

Example 8 ($\ell_1/\ell_2$-regularised multinomial logistic regression). Logistic regression is famous for classification problems. One observes for each $i\in[n]$ a class label $c_i\in\{1,\dots,q\}$ and a vector of features $z_i\in\mathbb{R}^p$. This information can be recast into a matrix $Y\in\mathbb{R}^{n\times q}$ filled with 0's and 1's: $Y_{i,k} = 1_{\{c_i = k\}}$. A matrix $B\in\mathbb{R}^{p\times q}$ is formed by $q$ vectors encoding the hyperplanes of the linear classification. The multinomial $\ell_1/\ell_2$-regularised regression reads
$\min_{B\in\mathbb{R}^{p\times q}}\ \sum_{i=1}^n\Big(-\sum_{k=1}^q Y_{i,k}\,z_i^\top B_{:,k} + \log\Big(\sum_{k=1}^q \exp(z_i^\top B_{:,k})\Big)\Big) + \lambda\sum_{j=1}^p\|B_{j,:}\|_2.$
Like the multi-task Lasso problem, this problem can be solved with proximal coordinate descent as long as we consider blocks of variables corresponding to the rows of $B$ rather than single variables.

Exercise:
1. Find the proximal operator of the non-smooth function $g(B) = \lambda\sum_{j=1}^p\|B_{j,:}\|_2$.
2. Give the expression of the partial derivatives of the smooth function $f(B) = \sum_{i=1}^n\big(-\sum_{k=1}^q Y_{i,k}\,z_i^\top B_{:,k} + \log\big(\sum_{k=1}^q\exp(z_i^\top B_{:,k})\big)\big)$.
3. Give an estimate of the $p$ block-wise Lipschitz constants of $\nabla f$.
4. Write the proximal coordinate descent method for $\ell_1/\ell_2$-regularised multinomial logistic regression.
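Returning to Example 6, here is a sketch of proximal coordinate descent for the Lasso, combining Algorithm 3 with the soft-thresholding operator and the residual trick of Section 2; the step sizes use $L_i = \|Ze_i\|^2$. The random instance at the end is only there to illustrate the sparsity of the output.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(0.0, np.abs(v) - t)

def lasso_prox_cd(Z, y, lam, n_epochs=100, rng=None):
    """Proximal coordinate descent (Algorithm 3) for
    min_x 0.5*||y - Zx||^2 + lam*||x||_1, with random coordinate selection and
    the residual r = Zx - y maintained incrementally."""
    rng = np.random.default_rng() if rng is None else rng
    m, p = Z.shape
    L = (Z ** 2).sum(axis=0)              # L_i = ||Z e_i||^2
    x = np.zeros(p)
    r = -y.astype(float).copy()           # r = Zx - y with x = 0
    for _ in range(n_epochs * p):
        i = rng.integers(p)
        g_i = Z[:, i] @ r                 # grad_i f(x)
        x_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])
        r += (x_new - x[i]) * Z[:, i]
        x[i] = x_new
    return x

rng = np.random.default_rng(4)
Z = rng.standard_normal((100, 20))
y = Z[:, :3] @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.standard_normal(100)
print(lasso_prox_cd(Z, y, lam=5.0))       # most coordinates end up exactly zero
```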

4 Stochastic dual coordinate ascent for support vector machines

In this section, we focus on the linear Support Vector Machine (SVM) problem
$\min_{w\in\mathbb{R}^p}\ C\sum_{i=1}^n\max(0,\, 1 - y_i z_i^\top w) + \tfrac12\|w\|^2$
where $C$ is a positive real number, $y\in\mathbb{R}^n$ and $z_i\in\mathbb{R}^p$ for all $i$. Note that we consider the formulation without intercept. The objective function contains a non-smooth and non-separable term, so we cannot apply coordinate descent to it directly. However, a dual formulation of the SVM problem is given by
$\max_{\alpha\in\mathbb{R}^n}\ -\tfrac12\sum_{j=1}^p\Big(\sum_{i=1}^n Z_{i,j}\,y_i\,\alpha^{(i)}\Big)^2 + \sum_{i=1}^n\alpha^{(i)} - I_{[0,C]^n}(\alpha),$
where $Z\in\mathbb{R}^{n\times p}$ is the matrix whose $i$-th row is $z_i^\top$. The objective function of this problem does decompose into a differentiable concave function $f(\alpha) = -\tfrac12\sum_{j=1}^p\big(\sum_{i=1}^n Z_{i,j}y_i\alpha^{(i)}\big)^2 + \sum_{i=1}^n\alpha^{(i)}$ and a nonsmooth, concave and separable function $g(\alpha) = -I_{[0,C]^n}(\alpha)$. Stochastic Dual Coordinate Ascent (SDCA) is proximal coordinate ascent (the version of coordinate descent for concave functions) applied to this problem.

Exercise: Write an implementation of SDCA. It may be useful to maintain residuals $w_k$ defined by $w_k^{(j)} = \sum_{i=1}^n Z_{i,j}\,y_i\,\alpha_k^{(i)}$ for all $j\in\{1,\dots,p\}$.
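A possible answer to the exercise is sketched below (with the usual hedges: the coordinate is sampled uniformly, the concave 1-D dual subproblem is solved in closed form and clipped to $[0,C]$, and the residuals $w^{(j)} = \sum_i Z_{i,j} y_i \alpha^{(i)}$ are updated incrementally; the random instance at the end is only an illustration).

```python
import numpy as np

def sdca_svm(Z, y, C, n_epochs=20, rng=None):
    """Stochastic Dual Coordinate Ascent for the linear SVM without intercept.

    Maximises -0.5*||Z^T Diag(y) alpha||^2 + sum_i alpha_i over alpha in [0, C]^n.
    Returns the dual variable alpha and the residuals w = Z^T Diag(y) alpha."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = Z.shape
    sq_norms = (Z ** 2).sum(axis=1)       # ||z_i||^2
    alpha = np.zeros(n)
    w = np.zeros(p)                       # w^(j) = sum_i Z_ij y_i alpha_i
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        # closed-form maximiser of the concave 1-D dual subproblem, clipped to [0, C]
        step = (1.0 - y[i] * (Z[i] @ w)) / (y[i] ** 2 * sq_norms[i])
        a_new = np.clip(alpha[i] + step, 0.0, C)
        w += (a_new - alpha[i]) * y[i] * Z[i]
        alpha[i] = a_new
    return alpha, w

rng = np.random.default_rng(5)
w_true = rng.standard_normal(5)
Z = rng.standard_normal((200, 5))
y = np.sign(Z @ w_true)
alpha, w = sdca_svm(Z, y, C=1.0)
print(np.mean(np.sign(Z @ w) == y))       # training accuracy of the recovered primal point
```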

Even if we are running the algorithm in the dual, we are interested in the primal problem. The following result shows that we can recover a good primal solution from the dual iterates and gives theoretical guarantees for the convergence in the primal.
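In practice, the guarantee below can be monitored directly: the primal point $w_k = Z^\top\mathrm{Diag}(y)\alpha_k$ is exactly the vector of residuals maintained by SDCA, and the duality gap $P(w_k) - D(\alpha_k)$ is cheap to evaluate. The short sketch below (helper names chosen for this sketch) shows one way to do it.

```python
import numpy as np

def primal_value(Z, y, C, w):
    """P(w) = C * sum_i max(0, 1 - y_i z_i^T w) + 0.5*||w||^2."""
    return C * np.maximum(0.0, 1.0 - y * (Z @ w)).sum() + 0.5 * w @ w

def dual_value(Z, y, alpha):
    """D(alpha) = -0.5*||Z^T Diag(y) alpha||^2 + sum_i alpha_i (alpha assumed feasible)."""
    w = Z.T @ (y * alpha)
    return -0.5 * w @ w + alpha.sum()

def duality_gap(Z, y, C, alpha):
    w = Z.T @ (y * alpha)                 # primal point recovered from the dual variable
    return primal_value(Z, y, C, w) - dual_value(Z, y, alpha)
```

With the `sdca_svm` sketch above, `duality_gap(Z, y, 1.0, alpha)` should be close to zero after a few epochs.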

Theorem 2 (Shalev-Shwartz and Zhang (2013)). Let us define a primal point $w_k = Z^\top\mathrm{Diag}(y)\,\alpha_k$, where $(\alpha_k)_{k\ge0}$ is generated by SDCA, and let $\alpha_*$ be a maximiser of the dual objective $D$. The duality gap satisfies, for all $K\ge n$,
$E\Big[\frac1K\sum_{k=K+1}^{2K}\big(P(w_k) - D(\alpha_k)\big)\Big] \le \frac{n}{K+n}\Big(\big(1-\tfrac1n\big)\big(D(\alpha_*) - D(\alpha_0)\big) + \tfrac12\|\alpha_* - \alpha_0\|_L^2\Big) + \frac{nC^2}{2K}\sum_{i=1}^n\|y_iz_i\|^2,$
where the primal value is $P(w_k) = C\sum_{i=1}^n\max(0,\,1 - y_iz_i^\top w_k) + \tfrac12\|w_k\|^2$, the dual value is $D(\alpha_k) = -\tfrac12\|Z^\top\mathrm{Diag}(y)\alpha_k\|^2 + \sum_{i=1}^n\alpha_k^{(i)} - I_{[0,C]^n}(\alpha_k)$, and $L_i = y_i^2\|z_i\|^2$ for all $i$.

Proof. As SDCA solves the dual problem with coordinate ascent, Theorem 1 gives
$E[D(\alpha_*) - D(\alpha_{k+1})] \le \frac{n}{k+n}\Big(\big(1-\tfrac1n\big)(D(\alpha_*) - D(\alpha_0)) + \tfrac12\|\alpha_* - \alpha_0\|_L^2\Big).$
The goal of the theorem is to upper bound $E[P(w_k) - D(\alpha_k)]$ by quantities involving $E[D(\alpha_*) - D(\alpha_{k+1})]$. Note that by weak duality, $P(w_k) - D(\alpha_k) \ge D(\alpha_*) - D(\alpha_k) \ge E[D(\alpha_*) - D(\alpha_{k+1})\mid\mathcal F_k]$, but what we need is an inequality in the other direction. For this, we will use the fact that $(\alpha_k)_{k\ge0}$ is generated by the coordinate ascent method.

Using the feasibility of $\alpha_{k+1}$ and the definition of $w_{k+1}$, we can simplify $D(\alpha_{k+1})$ as
$D(\alpha_{k+1}) = -\tfrac12\|Z^\top\mathrm{Diag}(y)\alpha_{k+1}\|^2 + \sum_{i=1}^n\alpha_{k+1}^{(i)} - I_{[0,C]^n}(\alpha_{k+1}) = -\tfrac12\|w_{k+1}\|^2 + e^\top\alpha_{k+1},$
where $e$ denotes the all-ones vector. As $\alpha_{k+1} = \alpha_k + U_{i_{k+1}}\big(\bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}\big)$, we have $w_{k+1} = w_k + z_{i_{k+1}}y_{i_{k+1}}\big(\bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}\big)$ and
$D(\alpha_{k+1}) = -\tfrac12\big\|w_k + z_{i_{k+1}}y_{i_{k+1}}\big(\bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}\big)\big\|^2 + e^\top\alpha_k + \bar\alpha_{k+1}^{(i_{k+1})} - \alpha_k^{(i_{k+1})}.$
To simplify the notation, we write here $i = i_{k+1}$. Note that
$\bar\alpha_{k+1}^{(i)} = \arg\max_{a\in[0,C]}\ \big(1 - y_iz_i^\top Z^\top\mathrm{Diag}(y)\alpha_k\big)(a - \alpha_k^{(i)}) - \frac{\|y_iz_i\|^2}{2}(a - \alpha_k^{(i)})^2 = \arg\max_{a\in[0,C]}\ -\tfrac12\|w_k + z_iy_i(a - \alpha_k^{(i)})\|^2 + e^\top\alpha_k + a - \alpha_k^{(i)}.$
So let us consider $\phi: x\mapsto C\max(0,\,1 - x)$, $u\in-\partial\phi(y_iz_i^\top w_k)\subset[0,C]$ and $s\in[0,1]$. Then
$D(\alpha_{k+1}) = \max_{a\in[0,C]}\ -\tfrac12\|w_k + z_iy_i(a - \alpha_k^{(i)})\|^2 + e^\top\alpha_k + a - \alpha_k^{(i)}$
$\ge -\tfrac12\|w_k + z_iy_i\big((su + (1-s)\alpha_k^{(i)}) - \alpha_k^{(i)}\big)\|^2 + e^\top\alpha_k + \big(su + (1-s)\alpha_k^{(i)}\big) - \alpha_k^{(i)}$
$= -\tfrac12\|w_k + z_iy_i\,s(u - \alpha_k^{(i)})\|^2 + e^\top\alpha_k + s(u - \alpha_k^{(i)})$
$= -\tfrac12\|w_k\|^2 - \tfrac{s^2}{2}\|z_iy_i(u - \alpha_k^{(i)})\|^2 - s(u - \alpha_k^{(i)})\,y_iz_i^\top w_k + e^\top\alpha_k + s(u - \alpha_k^{(i)})$
$= D(\alpha_k) - \tfrac{s^2}{2}\|z_iy_i(u - \alpha_k^{(i)})\|^2 - s(u - \alpha_k^{(i)})\,y_iz_i^\top w_k + s(u - \alpha_k^{(i)}).$
As $u\in-\partial\phi(y_iz_i^\top w_k)\subset[0,C]$ and $\phi^*(q) = q + I_{[-C,0]}(q)$, the Fenchel-Young equality leads to $\phi(y_iz_i^\top w_k) - u = -u\,y_iz_i^\top w_k$. Hence,
$D(\alpha_{k+1}) \ge D(\alpha_k) - \tfrac{s^2}{2}\|z_iy_i(u - \alpha_k^{(i)})\|^2 + s\,\phi(y_iz_i^\top w_k) + s\,\alpha_k^{(i)}y_iz_i^\top w_k - s\,\alpha_k^{(i)}.$
Applying the conditional expectation (with, for each $i$, a subgradient element $u_i\in-\partial\phi(y_iz_i^\top w_k)\subset[0,C]$), we get
$E[D(\alpha_{k+1})\mid\mathcal F_k] \ge D(\alpha_k) - \frac{s^2}{2n}\sum_{i=1}^n\|z_iy_i(u_i - \alpha_k^{(i)})\|^2 + \frac{s}{n}\sum_{i=1}^n\big(\phi(y_iz_i^\top w_k) + \alpha_k^{(i)}y_iz_i^\top w_k - \alpha_k^{(i)}\big).$
Now,
$P(w_k) - D(\alpha_k) = C\sum_{i=1}^n\max(0,\,1 - y_iz_i^\top w_k) + \tfrac12\|w_k\|^2 - \big(-\tfrac12\|w_k\|^2 + e^\top\alpha_k\big) = \sum_{i=1}^n\big(\phi(y_iz_i^\top w_k) + \alpha_k^{(i)}y_iz_i^\top w_k - \alpha_k^{(i)}\big),$
where we used $\|w_k\|^2 = w_k^\top Z^\top\mathrm{Diag}(y)\alpha_k = \sum_{i=1}^n\alpha_k^{(i)}y_iz_i^\top w_k$. So that
$E[D(\alpha_{k+1})\mid\mathcal F_k] \ge D(\alpha_k) - \frac{s^2}{2n}\sum_{i=1}^n\|z_iy_i(u_i - \alpha_k^{(i)})\|^2 + \frac{s}{n}\big(P(w_k) - D(\alpha_k)\big) \ge D(\alpha_k) - \frac{s^2}{2n}\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)C^2 + \frac{s}{n}\big(P(w_k) - D(\alpha_k)\big),$
where the last inequality derives from $\alpha_k^{(i)}\in[0,C]$ and $u_i\in[0,C]$, so that $|u_i - \alpha_k^{(i)}|\le C$. We apply the total expectation, rearrange, and sum for $k$ from $K_1$ to $K_2 - 1$:
$\frac{s}{n}\sum_{k=K_1}^{K_2-1} E[P(w_k) - D(\alpha_k)] \le E[D(\alpha_{K_2})] - E[D(\alpha_{K_1})] + \frac{s^2}{2n}C^2\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)(K_2 - K_1)$
$\le D(\alpha_*) - E[D(\alpha_{K_1})] + \frac{s^2}{2n}C^2\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)(K_2 - K_1)$
$\le \frac{c_0\,n}{K_1 - 1 + n} + \frac{s^2}{2n}C^2\Big(\sum_{i=1}^n\|z_iy_i\|^2\Big)(K_2 - K_1),$
where $c_0 = \big(1-\tfrac1n\big)(D(\alpha_*) - D(\alpha_0)) + \tfrac12\|\alpha_* - \alpha_0\|_L^2$. Choosing $K_1 = K+1$, $K_2 = 2K+1$ and $s = \tfrac{n}{K}$, we obtain, for $K\ge n$ (because we need $s\le1$),
$\frac1K\sum_{k=K+1}^{2K} E[P(w_k) - D(\alpha_k)] \le \frac{c_0\,n}{K+n} + \frac{nC^2}{2K}\sum_{i=1}^n\|z_iy_i\|^2.$

References

Beck, A. and Tetruashvili, L. (2013). On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4).

Collins, M., Schapire, R. E., and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3).

Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2).

Powell, M. (1976). Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear Programming, 9:53-72.

Richtárik, P. and Takáč, M. (2014). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2).

Seidel, L. (1874). Ueber ein Verfahren, die Gleichungen, auf welche die Methode der kleinsten Quadrate führt, sowie lineäre Gleichungen überhaupt, durch successive Annäherung aufzulösen (Aus den Abhandl. d. Bayer. Akademie II. Cl. XI. Bd. III. Abth.). Verlag der Akademie.

Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1).

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3).

Warga, J. (1963). Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3).

Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3-34.
