Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition


Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

Hamed Karimi, Julie Nutini, and Mark Schmidt
Department of Computer Science, University of British Columbia, Vancouver, British Columbia, Canada

Abstract. In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Łojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of randomized and greedy coordinate descent methods, sign-based gradient descent methods, and stochastic gradient methods in the classic setting (with decreasing or constant step-sizes) as well as the variance-reduced setting. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence of these methods. Along the way, we give simple convergence results for a wide variety of problems in machine learning: least squares, logistic regression, boosting, resilient backpropagation, L1-regularization, support vector machines, stochastic dual coordinate ascent, and stochastic variance-reduced gradient methods.

1 Introduction

Fitting most machine learning models involves solving some sort of optimization problem. Gradient descent, and variants of it like coordinate descent and stochastic gradient, are the workhorse tools used by the field to solve very large instances of these problems. In this work we consider the basic problem of minimizing a smooth function and the convergence rate of gradient descent methods. It is well-known that if f is strongly-convex, then gradient descent achieves a global linear convergence rate for this problem [Nesterov, 2004]. However, many of the fundamental models in machine learning like least squares and logistic regression yield objective functions that are convex but not strongly-convex. Further, if f is only convex, then gradient descent only achieves a sub-linear rate.

This situation has motivated a variety of alternatives to strong convexity (SC) in the literature, in order to show that we can obtain linear convergence rates for problems like least squares and logistic regression. One of the oldest of these conditions is the error bounds (EB) of Luo and Tseng [1993], but four other recently-considered conditions are essential strong convexity (ESC) [Liu et al., 2014], weak strong convexity (WSC) [Necoara et al., 2015], the restricted secant inequality (RSI) [Zhang and Yin, 2013], and the quadratic growth (QG) condition [Anitescu, 2000]. Some of these conditions have different names in the special case of convex functions. For example, a convex function satisfying RSI is said to satisfy restricted strong convexity (RSC) [Zhang and Yin, 2013]. Names describing convex functions satisfying QG include optimal strong convexity (OSC) [Liu and Wright, 2015], semi-strong convexity (SSC) [Gong and Ye, 2014], and (confusingly) WSC [Ma et al., 2015]. The proofs of linear convergence under all of these relaxations are typically not straightforward, and it is rarely discussed how these conditions relate to each other. In this work, we consider a much older condition that we refer to as the Polyak-Łojasiewicz (PL) inequality.
This inequality was originally introduced by Polyak [1963], who showed that it is a sufficient condition for gradient descent to achieve a linear convergence rate. We describe it as the PL inequality because it is also a special case of the inequality introduced in the same year by Łojasiewicz [1963]. We review the PL inequality in the next section and show how it leads to a trivial proof of the linear convergence rate of gradient descent. Next, in terms of showing a global linear convergence rate to the optimal solution, we show that the PL inequality is weaker than all of the more recent conditions discussed in the previous paragraph. This suggests that we can replace the long and complicated proofs under any of the conditions above with simpler proofs based on the PL inequality.

Subsequently, we show how this result implies that gradient descent achieves linear rates for standard problems in machine learning like least squares and logistic regression that are not necessarily SC, and even for some non-convex problems (Section 2.3). In Section 3 we use the PL inequality to give new convergence rates for randomized and greedy coordinate descent (implying a new convergence rate for certain variants of boosting), sign-based gradient descent methods, and stochastic gradient methods in either the classical or variance-reduced setting. Next we turn to the problem of minimizing the sum of a smooth function and a simple non-smooth function. We propose a generalization of the PL inequality that allows us to show linear convergence rates for proximal-gradient methods without SC. In this setting, the new condition is equivalent to the well-known Kurdyka-Łojasiewicz (KL) condition, which has been used to show linear convergence of proximal-gradient methods for certain problems like support vector machines and l1-regularized least squares [Bolte et al., 2015]. But this new alternate generalization of the PL inequality leads to shorter and simpler proofs in these cases.

2 Polyak-Łojasiewicz Inequality

We first focus on the basic unconstrained optimization problem

$$\mathop{\mathrm{argmin}}_{x \in \mathbb{R}^d} f(x), \qquad (1)$$

and we assume that the first derivative of f is L-Lipschitz continuous. This means that

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \qquad (2)$$

for all x and y. For twice-differentiable objectives this assumption means that the eigenvalues of ∇²f(x) are bounded above by some L, which is typically a reasonable assumption. We also assume that the optimization problem has a non-empty solution set X*, and we use f* to denote the corresponding optimal function value. We will say that a function satisfies the PL inequality if the following holds for some µ > 0,

$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\left(f(x) - f^*\right), \qquad \forall x. \qquad (3)$$

This inequality simply requires that the gradient grows faster than a quadratic function as we move away from the optimal function value. Note that this inequality implies that every stationary point is a global minimum. But unlike SC, it does not imply that there is a unique solution. Linear convergence of gradient descent under these assumptions was first proved by Polyak [1963]. Below we give a simple proof of this result when using a step-size of 1/L.

Theorem 1. Consider problem (1), where f has an L-Lipschitz continuous gradient (2), a non-empty solution set X*, and satisfies the PL inequality (3). Then the gradient method with a step-size of 1/L,

$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k), \qquad (4)$$

has a global linear convergence rate,

$$f(x_k) - f^* \le \left(1 - \frac{\mu}{L}\right)^k \left(f(x_0) - f^*\right).$$

Proof. By using update rule (4) in the Lipschitz inequality condition (2) we have

$$f(x_{k+1}) - f(x_k) \le -\frac{1}{2L}\|\nabla f(x_k)\|^2.$$

Now by using the PL inequality (3) we get

$$f(x_{k+1}) - f(x_k) \le -\frac{\mu}{L}\left(f(x_k) - f^*\right).$$

Re-arranging and subtracting f* from both sides gives us f(x_{k+1}) − f* ≤ (1 − µ/L)(f(x_k) − f*). Applying this inequality recursively gives the result.
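As a concrete numerical illustration of Theorem 1 (not part of the original analysis; the problem data, dimensions, and tolerances below are assumed purely for the example), the following Python sketch runs the gradient method (4) on a rank-deficient least squares problem. The objective is convex but not strongly convex, yet it satisfies the PL inequality with µ equal to the smallest non-zero eigenvalue of AᵀA, so the sub-optimality should decrease at least as fast as (1 − µ/L)^k.

```python
# Illustrative sketch (assumed setup, not from the paper): gradient descent with
# step-size 1/L on f(x) = 0.5*||Ax - b||^2 where A is rank-deficient, so f is
# convex but not strongly convex, yet satisfies the PL inequality.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 50, 20, 5                        # r < d makes A rank-deficient
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
b = A @ rng.standard_normal(d)             # consistent system, so f* = 0

H = A.T @ A
eigs = np.linalg.eigvalsh(H)
L = eigs.max()                              # Lipschitz constant of the gradient
mu = eigs[eigs > 1e-8 * eigs.max()].min()   # PL constant: smallest non-zero eigenvalue

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(d)
gap0 = f(x)                                 # f(x_0) - f*, with f* = 0 here
for k in range(1, 201):
    x = x - grad(x) / L                     # update (4)
    bound = (1 - mu / L) ** k * gap0        # rate guaranteed by Theorem 1
    assert f(x) <= bound * (1 + 1e-9) + 1e-12
print("final sub-optimality:", f(x))
```

The assertion checks the linear rate directly; the small additive and relative slacks only guard against floating-point rounding.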

Note that the above result also holds if we use the optimal step-size at each iteration, because under this choice we have

$$f(x_{k+1}) = \min_{\alpha}\left\{f\left(x_k - \alpha \nabla f(x_k)\right)\right\} \le f\left(x_k - \frac{1}{L}\nabla f(x_k)\right).$$

A beautiful aspect of this proof is its simplicity; in fact it is simpler than the proof of the same fact under the usual SC assumption. It is certainly simpler than typical proofs which rely on the other conditions mentioned in Section 1. Further, it is worth noting that the proof does not assume convexity of f. Thus, this is one of the few general results we have for global linear convergence on non-convex problems.

2.1 Relationships Between Conditions

As mentioned in Section 1, several other assumptions have been explored over the last 25 years in order to show that gradient descent achieves a linear convergence rate. These typically assume that f is convex, and lead to more complicated proofs than the one above. However, it is rarely discussed how the conditions relate to each other. Indeed, all of the relationships that have been explored have only been in the context of convex functions [Bolte et al., 2015, Liu and Wright, 2015, Necoara et al., 2015, Zhang, 2015]. In Appendix A, we give the precise definitions of all conditions and also prove the result below giving relationships between the conditions.

Theorem 2. For a function f with a Lipschitz-continuous gradient, the following implications hold:

$$(SC) \rightarrow (ESC) \rightarrow (WSC) \rightarrow (RSI) \rightarrow (EB) \equiv (PL) \rightarrow (QG).$$

If we further assume that f is convex then we have

$$(RSI) \equiv (EB) \equiv (PL) \equiv (QG).$$

Note the equivalence between EB and PL is a special case of a more general result by Bolte et al. [2015, Theorem 5], while Zhang [2016] independently also recently gave the relationships between RSI, EB, PL, and QG. This result shows that QG is the weakest assumption among those considered. However, QG allows non-global local minima so it is not enough to guarantee that gradient descent finds a global minimizer. This means that, among those considered above, PL and the equivalent EB are the most general conditions that allow linear convergence to a global minimizer. Note that in the convex case QG is called OSC or SSC, but the result above shows that in the convex case it is also equivalent to EB and PL (as well as RSI, which is known as RSC in this case).

2.2 Invex and Non-Convex Functions

While the PL inequality does not imply convexity of f, it does imply the weaker condition of invexity. A function is invex if it is differentiable and there exists a vector-valued function η such that for any x and y in R^n, the following inequality holds:

$$f(y) \ge f(x) + \nabla f(x)^T \eta(x, y).$$

We obtain convex functions as the special case where η(x, y) = y − x. Invexity was first introduced by Hanson [1981], and has been used in the context of learning output kernels [Dinuzzo et al., 2011]. Craven and Glover [1985] show that a smooth f is invex if and only if every stationary point of f is a global minimum. Since the PL inequality implies that all stationary points are global minimizers, functions satisfying the PL inequality must be invex. It is easy to see this by noting that at any stationary point x̄ we have ∇f(x̄) = 0, so we have

$$0 = \frac{1}{2}\|\nabla f(\bar{x})\|^2 \ge \mu\left(f(\bar{x}) - f^*\right) \ge 0,$$

where the last inequality holds because µ > 0 and f(x) ≥ f* for all x. This implies that f(x̄) = f* and thus any stationary point must be a global minimum. Drusvyatskiy and Lewis [2016] is a recent work discussing the relationships among many of these conditions for non-smooth functions.

Theorem 2 shows that all of the previous conditions (except QG) imply invexity. The function f(x) = x² + 3 sin²(x) is an example of an invex but non-convex function satisfying the PL inequality (with µ = 1/32). Thus, Theorem 1 implies gradient descent obtains a global linear convergence rate on this function.

Unfortunately, many complicated models have non-optimal stationary points. For example, typical deep feed-forward neural networks have sub-optimal stationary points and are thus not invex. A classic way to analyze functions like this is to consider a global convergence phase and a local convergence phase. The global convergence phase is the time spent to get close to a local minimum, and then once we are close to a local minimum the local convergence phase characterizes the convergence rate of the method. Usually, the local convergence phase starts to apply once we are locally SC around the minimizer. But this means that the local convergence phase may be arbitrarily small: for example, for f(x) = x² + 3 sin²(x) the local convergence rate would not even apply over the interval x ∈ [−1, 1]. If we instead defined the local convergence phase in terms of locally satisfying the PL inequality, then we see that it can be much larger (x ∈ R for this example).

2.3 Relevant Problems

If f is µ-SC, then it also satisfies the PL inequality with the same µ (see Appendix B). Further, by Theorem 2, f satisfies the PL inequality if it satisfies any of ESC, WSC, RSI, or EB (while for convex f, QG is also sufficient). Although it is hard to precisely characterize the general class of functions for which the PL inequality is satisfied, we note one important special case below.

Strongly-convex composed with linear: This is the case where f has the form f(x) = g(Ax) for some σ-SC function g and some matrix A. In Appendix B, we show that this class of functions satisfies the PL inequality, and we note that this form frequently arises in machine learning. For example, least squares problems have the form

$$f(x) = \|Ax - b\|^2,$$

and by noting that g(z) ≜ ‖z − b‖² is SC we see that least squares falls into this category. Indeed, this class includes all convex quadratic functions. In the case of logistic regression we have

$$f(x) = \sum_{i=1}^n \log\left(1 + \exp(-b_i a_i^T x)\right).$$

This can be written in the form g(Ax), where g is strictly convex but not SC. In cases like this where g is only strictly convex, the PL inequality will still be satisfied over any compact set. Thus, if the iterations of gradient descent remain bounded, the linear convergence result still applies. It is reasonable to assume that the iterates remain bounded when the set of solutions is finite, since each step must decrease the objective function. Thus, for practical purposes, we can relax the above condition to "strictly-convex composed with linear" and the PL inequality implies a linear convergence rate for logistic regression.

3 Convergence of Huge-Scale Methods

In this section, we use the PL inequality to analyze several variants of two of the most widely-used techniques for handling large-scale machine learning problems: coordinate descent and stochastic gradient methods. In particular, the PL inequality yields very simple analyses of these methods that apply to more general classes of functions than previously analyzed. We also note that the PL inequality has recently been used by Garber and Hazan [2015a] to analyze the Frank-Wolfe algorithm. Further, inspired by the resilient backpropagation (RPROP) algorithm of Riedmiller and Braun [1992], in Appendix C we also give a convergence rate analysis for a sign-based gradient descent method.
3.1 Randomized Coordinate Descent

Nesterov [2012] shows that randomized coordinate descent achieves a faster convergence rate than gradient descent for problems where we have d variables and it is d times cheaper to update one coordinate than it is to compute the entire gradient.

The expected linear convergence rates in this previous work rely on SC, but in this section we show that randomized coordinate descent achieves an expected linear convergence rate if we only assume that the PL inequality holds. To analyze coordinate descent methods, we assume that the gradient is coordinate-wise Lipschitz continuous, meaning that we have

$$f(x + \alpha e_i) \le f(x) + \alpha \nabla_i f(x) + \frac{L}{2}\alpha^2, \qquad \forall \alpha \in \mathbb{R},\ x \in \mathbb{R}^d, \qquad (5)$$

for any coordinate i, and where e_i is the i-th unit vector.

Theorem 3. Consider problem (1), where f has a coordinate-wise L-Lipschitz continuous gradient (5), a non-empty solution set X*, and satisfies the PL inequality (3). Consider the coordinate descent method with a step-size of 1/L,

$$x_{k+1} = x_k - \frac{1}{L}\nabla_{i_k} f(x_k)\, e_{i_k}. \qquad (6)$$

If we choose the variable to update i_k uniformly at random, then the algorithm has an expected linear convergence rate of

$$\mathbb{E}\left[f(x_k) - f^*\right] \le \left(1 - \frac{\mu}{dL}\right)^k \left[f(x_0) - f^*\right].$$

Proof. By using the update rule (6) in the Lipschitz condition (5) we have

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\left|\nabla_{i_k} f(x_k)\right|^2.$$

By taking the expectation of both sides with respect to i_k we have

$$\mathbb{E}\left[f(x_{k+1})\right] \le f(x_k) - \frac{1}{2L}\,\mathbb{E}_{i_k}\!\left[\left|\nabla_{i_k} f(x_k)\right|^2\right] = f(x_k) - \frac{1}{2L}\sum_i \frac{1}{d}\left|\nabla_i f(x_k)\right|^2 = f(x_k) - \frac{1}{2dL}\|\nabla f(x_k)\|^2.$$

By using the PL inequality (3) and subtracting f* from both sides, we get

$$\mathbb{E}\left[f(x_{k+1}) - f^*\right] \le \left(1 - \frac{\mu}{dL}\right)\left[f(x_k) - f^*\right].$$

Applying this recursively and using iterated expectations yields the result.

As before, instead of using 1/L we could perform exact coordinate optimization and the result would still hold. If we have a Lipschitz constant L_i for each coordinate and sample proportional to the L_i as suggested by Nesterov [2012], then the above argument (using a step-size of 1/L_{i_k}) can be used to show that we obtain a faster rate of

$$\mathbb{E}\left[f(x_k) - f^*\right] \le \left(1 - \frac{\mu}{d\bar{L}}\right)^k \left[f(x_0) - f^*\right],$$

where $\bar{L} = \frac{1}{d}\sum_{j=1}^d L_j$.

3.2 Greedy Coordinate Descent

Nutini et al. [2015] have recently analyzed coordinate descent under the greedy Gauss-Southwell (GS) rule, and argued that this rule may be suitable for problems with a large degree of sparsity. The GS rule chooses i_k according to the rule

$$i_k = \mathop{\mathrm{argmax}}_j \left|\nabla_j f(x_k)\right|.$$

Using the fact that

$$\max_i \left|\nabla_i f(x_k)\right|^2 \ge \frac{1}{d}\sum_{i=1}^d \left|\nabla_i f(x_k)\right|^2,$$

it is straightforward to show that the GS rule satisfies the rate above for the randomized method. However, Nutini et al. [2015] show that a faster convergence rate can be obtained for the GS rule by measuring SC in the 1-norm. Since the PL inequality is defined on the dual (gradient) space, in order to derive an analogous result we could measure the PL inequality in the ∞-norm,

$$\frac{1}{2}\|\nabla f(x)\|_\infty^2 \ge \mu_1\left(f(x) - f^*\right).$$

Because of the equivalence between norms, this is not introducing any additional assumptions beyond the PL inequality being satisfied. Further, if f is µ₁-SC in the 1-norm, then it satisfies the PL inequality in the ∞-norm with the same constant µ₁. By using that |∇_{i_k} f(x_k)| = ‖∇f(x_k)‖_∞ when the GS rule is used, the above argument can be used to show that coordinate descent with the GS rule achieves a convergence rate of

$$f(x_k) - f^* \le \left(1 - \frac{\mu_1}{L}\right)^k \left[f(x_0) - f^*\right],$$

when the function satisfies the PL inequality in the ∞-norm with a constant of µ₁. By the equivalence between norms we have that µ/d ≤ µ₁, so this is faster than the rate with random selection.

Meir and Rätsch [2003] show that we can view some variants of boosting algorithms as implementations of coordinate descent with the GS rule. They use the error bound property to argue that these methods achieve a linear convergence rate, but this property does not lead to an explicit rate. Our simple result above thus provides the first explicit convergence rate for these variants of boosting.

3.3 Stochastic Gradient Methods

Stochastic gradient (SG) methods apply to the general stochastic optimization problem

$$\mathop{\mathrm{argmin}}_{x \in \mathbb{R}^d} f(x) = \mathbb{E}\left[f_i(x)\right], \qquad (7)$$

where the expectation is taken with respect to i. These methods are typically used to optimize finite sums,

$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x). \qquad (8)$$

Here, each f_i typically represents the fit of a model on an individual training example. SG methods are suitable for cases where the number of training examples n is so large that it is infeasible to compute the gradient of all n examples more than a few times. Stochastic gradient methods use the iteration

$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k), \qquad (9)$$

where α_k is the step size and i_k is a sample from the distribution over i so that $\mathbb{E}[\nabla f_{i_k}(x_k)] = \nabla f(x_k)$. Below, we analyze the convergence rate of stochastic gradient methods under standard assumptions on f, and under both a decreasing and a constant step-size scheme.

Theorem 4. Consider problem (7). Assume that each f_i has an L-Lipschitz continuous gradient (2), f has a non-empty solution set X*, f satisfies the PL inequality (3), and $\mathbb{E}[\|\nabla f_i(x_k)\|^2] \le C^2$ for all x_k and some C. If we use the SG algorithm (9) with $\alpha_k = \frac{2k+1}{2\mu(k+1)^2}$, then we get a convergence rate of

$$\mathbb{E}\left[f(x_k) - f^*\right] \le \frac{LC^2}{2k\mu^2}.$$

If instead we use a constant $\alpha_k = \alpha < \frac{1}{2\mu}$, then we obtain a linear convergence rate up to a solution level that is proportional to α,

$$\mathbb{E}\left[f(x_k) - f^*\right] \le (1 - 2\mu\alpha)^k \left[f(x_0) - f^*\right] + \frac{LC^2\alpha}{4\mu}.$$

Proof. By using the update rule (9) inside the Lipschitz condition (2), we have

$$f(x_{k+1}) \le f(x_k) - \alpha_k \langle \nabla f(x_k), \nabla f_{i_k}(x_k)\rangle + \frac{L\alpha_k^2}{2}\|\nabla f_{i_k}(x_k)\|^2.$$

Taking the expectation of both sides with respect to i_k we have

$$\mathbb{E}\left[f(x_{k+1})\right] \le f(x_k) - \alpha_k \langle \nabla f(x_k), \mathbb{E}[\nabla f_{i_k}(x_k)]\rangle + \frac{L\alpha_k^2}{2}\mathbb{E}\left[\|\nabla f_{i_k}(x_k)\|^2\right]$$
$$\le f(x_k) - \alpha_k \|\nabla f(x_k)\|^2 + \frac{LC^2\alpha_k^2}{2}$$
$$\le f(x_k) - 2\mu\alpha_k\left(f(x_k) - f^*\right) + \frac{LC^2\alpha_k^2}{2},$$

where the second line uses that $\mathbb{E}[\nabla f_{i_k}(x_k)] = \nabla f(x_k)$ and $\mathbb{E}[\|\nabla f_i(x_k)\|^2] \le C^2$, and the third line uses the PL inequality. Subtracting f* from both sides yields

$$\mathbb{E}\left[f(x_{k+1}) - f^*\right] \le (1 - 2\alpha_k\mu)\left[f(x_k) - f^*\right] + \frac{LC^2\alpha_k^2}{2}. \qquad (10)$$

Decreasing step size: With $\alpha_k = \frac{2k+1}{2\mu(k+1)^2}$ in (10) we obtain

$$\mathbb{E}\left[f(x_{k+1}) - f^*\right] \le \frac{k^2}{(k+1)^2}\left[f(x_k) - f^*\right] + \frac{LC^2(2k+1)^2}{8\mu^2(k+1)^4}.$$

Multiplying both sides by (k+1)² and letting $\delta_f(k) \equiv k^2\,\mathbb{E}[f(x_k) - f^*]$ we get

$$\delta_f(k+1) \le \delta_f(k) + \frac{LC^2(2k+1)^2}{8\mu^2(k+1)^2} \le \delta_f(k) + \frac{LC^2}{2\mu^2},$$

where the second inequality follows from $\frac{2k+1}{k+1} < 2$. Summing up this inequality from k = 0 to k and using the fact that δ_f(0) = 0 we get

$$\delta_f(k+1) \le \delta_f(0) + \frac{LC^2}{2\mu^2}(k+1) \quad\Longrightarrow\quad (k+1)^2\,\mathbb{E}\left[f(x_{k+1}) - f^*\right] \le \frac{LC^2(k+1)}{2\mu^2},$$

which gives the stated rate.

Constant step size: Choosing α_k = α for any α < 1/(2µ) and applying (10) recursively yields

$$\mathbb{E}\left[f(x_{k+1}) - f^*\right] \le (1 - 2\alpha\mu)^{k+1}\left[f(x_0) - f^*\right] + \frac{LC^2\alpha^2}{2}\sum_{i=0}^{k}(1 - 2\alpha\mu)^i$$
$$\le (1 - 2\alpha\mu)^{k+1}\left[f(x_0) - f^*\right] + \frac{LC^2\alpha^2}{2}\sum_{i=0}^{\infty}(1 - 2\alpha\mu)^i = (1 - 2\alpha\mu)^{k+1}\left[f(x_0) - f^*\right] + \frac{LC^2\alpha}{4\mu},$$

where the last line uses that α < 1/(2µ) and the limit of the geometric series.

The O(1/k) rate for a decreasing step size matches the convergence rate of stochastic gradient methods under SC [Nemirovski et al., 2009]. It was recently shown using a non-trivial analysis that a stochastic Newton method could achieve an O(1/k) rate for least squares problems [Bach and Moulines, 2013], but our result above shows that the basic stochastic gradient method already achieves this property (although the constants are worse than for this Newton-like method). Further, our result does not rely on convexity. Note that if we are happy with a solution of fixed accuracy, then the result with a constant step-size is perhaps the more useful strategy in practice: it supports the often-used empirical strategy of using a constant step-size for a long time, then halving the step-size if the algorithm appears to have stalled (the above result indicates that halving the step-size will at least halve the sub-optimality).
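The following sketch contrasts the two step-size schemes of Theorem 4 on a small finite-sum least squares problem. It is an illustration under assumed problem data (dimensions, noise level, and the step-size cap are choices made for the example, not prescribed by the paper): the constant step-size drives the sub-optimality quickly down to a noise floor proportional to α, while the decreasing schedule keeps improving at the slower O(1/k) rate.

```python
# Illustrative sketch (assumed setup): SGD on f(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2
# with the constant and decreasing step-size schemes analyzed in Theorem 4.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # noisy targets

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f_star = f(x_star)

mu = np.linalg.eigvalsh(A.T @ A / n).min()     # here f is strongly convex, hence PL
L_max = np.sum(A ** 2, axis=1).max()           # largest per-example Lipschitz constant

def sgd(step_rule, iters=20000):
    x = np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]           # stochastic gradient, update (9)
        x = x - step_rule(k) * g
    return f(x) - f_star

def decreasing_step(k):
    # schedule from Theorem 4, capped in the first iterations for numerical stability
    return min(1.0 / L_max, (2 * k + 1) / (2 * mu * (k + 1) ** 2))

print("constant step sub-optimality:  ", sgd(lambda k: 1e-3))
print("decreasing step sub-optimality:", sgd(decreasing_step))
```

The cap on the early step-sizes is a practical safeguard only; the theoretical schedule α_k = (2k+1)/(2µ(k+1)²) is what the bound in Theorem 4 analyzes.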

3.4 Finite Sum Methods

In the setting of (8) where we are minimizing a finite sum, it has recently been shown that there are methods that have the low iteration cost of stochastic gradient methods but that still have linear convergence rates for SC functions [Le Roux et al., 2012]. While the first methods that achieved this remarkable property required a memory of previous gradient values, the stochastic variance-reduced gradient (SVRG) method of Johnson and Zhang [2013] does not have this drawback. Gong and Ye [2014] show that SVRG has a linear convergence rate without SC under the weaker assumption of QG plus convexity (where QG is equivalent to PL). We review how the analysis of Johnson and Zhang [2013] can be easily modified to give a similar result in Appendix D. A related result appears in Garber and Hazan [2015b], who assume that f is SC but do not assume that the individual functions are convex. More recent analyses by Reddi et al. [2016a,b] have considered these types of methods under the PL inequality without convexity assumptions.

4 Proximal-Gradient Generalization

A generalization of the PL inequality for non-smooth optimization is the KL inequality [Kurdyka, 1998, Bolte et al., 2008]. The KL inequality has been used to analyze the convergence of the classic proximal-point algorithm [Attouch and Bolte, 2009] as well as a variety of other optimization methods [Attouch et al., 2013]. In machine learning, a popular generalization of gradient descent is proximal-gradient methods. Bolte et al. [2015] show that the proximal-gradient method has a linear convergence rate for functions satisfying the KL inequality, while Li and Pong [2016] give a related result. The set of problems satisfying the KL inequality notably includes problems like support vector machines and l1-regularized least squares, implying that the algorithm has a linear convergence rate for these problems. In this section we propose a different generalization of the PL inequality that leads to a simpler linear convergence rate analysis for the proximal-gradient method as well as its coordinate-wise variant.

Proximal-gradient methods apply to problems of the form

$$\mathop{\mathrm{argmin}}_{x\in\mathbb{R}^d} F(x) = f(x) + g(x), \qquad (11)$$

where f is a differentiable function with an L-Lipschitz continuous gradient and g is a simple but potentially non-smooth convex function. Typical examples of simple functions g include a scaled l1-norm of the parameter vector, g(x) = λ‖x‖₁, and indicator functions that are zero if x lies in a simple convex set and are infinity otherwise. In order to analyze proximal-gradient algorithms, a natural (though not particularly intuitive) generalization of the PL inequality is that there exists a µ > 0 satisfying

$$\frac{1}{2}\mathcal{D}_g(x, L) \ge \mu\left(F(x) - F^*\right), \qquad (12)$$

where

$$\mathcal{D}_g(x, \alpha) \equiv -2\alpha \min_y \left[\langle \nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y - x\|^2 + g(y) - g(x)\right]. \qquad (13)$$

We call this the proximal-PL inequality, and we note that if g is constant (or linear) then it reduces to the standard PL inequality. Below we show that this inequality is sufficient for the proximal-gradient method to achieve a global linear convergence rate.

Theorem 5. Consider problem (11), where f has an L-Lipschitz continuous gradient (2), F has a non-empty solution set X*, g is convex, and F satisfies the proximal-PL inequality (12). Then the proximal-gradient method with a step-size of 1/L,

$$x_{k+1} = \mathop{\mathrm{argmin}}_y \left[\langle \nabla f(x_k), y - x_k\rangle + \frac{L}{2}\|y - x_k\|^2 + g(y) - g(x_k)\right], \qquad (14)$$

converges linearly to the optimal value F*,

$$F(x_k) - F^* \le \left(1 - \frac{\mu}{L}\right)^k \left[F(x_0) - F^*\right].$$

Proof. By using Lipschitz continuity of the gradient of f we have

$$F(x_{k+1}) = f(x_{k+1}) + g(x_k) + g(x_{k+1}) - g(x_k)$$
$$\le F(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + g(x_{k+1}) - g(x_k)$$
$$\le F(x_k) - \frac{1}{2L}\mathcal{D}_g(x_k, L) \le F(x_k) - \frac{\mu}{L}\left[F(x_k) - F^*\right],$$

which uses the definition of x_{k+1} and D_g followed by the proximal-PL inequality (12). This subsequently implies that

$$F(x_{k+1}) - F^* \le \left(1 - \frac{\mu}{L}\right)\left[F(x_k) - F^*\right], \qquad (15)$$

which applied recursively gives the result.

While other conditions have been proposed to show linear convergence rates of proximal-gradient methods without SC [Kadkhodaie et al., 2014, Bolte et al., 2015, Zhang, 2015, Li and Pong, 2016], their analyses tend to be more complicated than the above. Further, in Appendix G we show that the proximal-PL condition is in fact equivalent to the KL condition, which itself is known to be equivalent to a proximal-gradient variant of the EB condition [Bolte et al., 2015]. Thus, the proximal-PL inequality includes the standard scenarios where existing conditions apply.

4.1 Relevant Problems

As with the PL inequality, we now list several important function classes that satisfy the proximal-PL inequality (12). We give proofs that these classes satisfy the inequality in Appendices F and G.

1. The inequality is satisfied if f satisfies the PL inequality and g is constant. Thus, the above result generalizes Theorem 1.
2. The inequality is satisfied if f is SC. This is the usual assumption used to show a linear convergence rate for the proximal-gradient algorithm [Schmidt et al., 2011], although we note that the above analysis is much simpler than standard arguments.
3. The inequality is satisfied if f has the form f(x) = h(Ax) for a SC function h and a matrix A, while g is an indicator function for a polyhedral set.
4. The inequality is satisfied if F is convex and satisfies the QG property.
5. The inequality is satisfied if F satisfies the proximal-EB condition or the KL inequality.

By the equivalence shown in Appendix G, the proximal-PL inequality also holds for other problems where a linear convergence rate has been shown, like group L1-regularization [Tseng, 2010], sparse group L1-regularization [Zhang et al., 2013], nuclear-norm regularization [Hou et al., 2013], and other classes of functions [Zhou and So, 2015, Drusvyatskiy and Lewis, 2016].

4.2 Least Squares with L1-Regularization

Perhaps the most interesting example of problem (11) is the l1-regularized least squares problem,

$$\mathop{\mathrm{argmin}}_{x\in\mathbb{R}^d} \|Ax - b\|^2 + \lambda\|x\|_1,$$

where λ > 0 is the regularization parameter. This problem has been studied extensively in machine learning, signal processing, and statistics. This problem structure seems well-suited to using proximal-gradient methods, but the first works analyzing proximal-gradient methods for this problem only showed sub-linear convergence rates [Beck and Teboulle, 2009]. Subsequent works show that linear convergence rates can be achieved under additional assumptions. For example, Gu et al. [2013] prove that their algorithm achieves a linear convergence rate if A satisfies a restricted isometry property (RIP) and the solution is sufficiently sparse. Xiao and Zhang [2013] also assume the RIP property and show linear convergence using a homotopy method that slowly decreases the value of λ.

Agarwal et al. [2012] give a linear convergence rate under a modified restricted strong convexity and modified restricted smoothness assumption. But these problems have also been shown to satisfy proximal variants of the KL and EB conditions [Tseng, 2010, Bolte et al., 2015, Necoara and Clipici, 2016], and Bolte et al. [2015] in particular analyze the proximal-gradient method under KL while giving explicit bounds on the constant. This means any L1-regularized least squares problem also satisfies the proximal-PL inequality. Thus, Theorem 5 gives a simple proof of global linear convergence for these problems without making additional assumptions or making any modifications to the algorithm.

4.3 Proximal Coordinate Descent

It is also possible to adapt our results on coordinate descent and proximal-gradient methods in order to give a linear convergence rate for coordinate-wise proximal-gradient methods for problem (11). To do this, we require the extra assumption that g is a separable function. This means that g(x) = Σ_i g_i(x_i) for a set of univariate functions g_i. The update rule for the coordinate-wise proximal-gradient method is

$$x_{k+1} = x_k + \left[\mathop{\mathrm{argmin}}_{\alpha}\; \alpha\,\nabla_{i_k} f(x_k) + \frac{L}{2}\alpha^2 + g_{i_k}(x_{i_k} + \alpha) - g_{i_k}(x_{i_k})\right] e_{i_k}. \qquad (16)$$

We state the convergence rate result below.

Theorem 6. Assume the setup of Theorem 5 and that g is a separable function g(x) = Σ_i g_i(x_i), where each g_i is convex. Then the coordinate-wise proximal-gradient update rule (16) achieves a convergence rate

$$\mathbb{E}\left[F(x_k) - F^*\right] \le \left(1 - \frac{\mu}{dL}\right)^k \left[F(x_0) - F^*\right], \qquad (17)$$

when i_k is selected uniformly at random.

The proof is given in Appendix H and although it is more complicated than the proofs of Theorems 4 and 5, it is arguably still simpler than existing proofs for proximal coordinate descent under SC [Richtárik and Takáč, 2014], KL [Attouch et al., 2013], or QG [Zhang, 2016]. It is also possible to analyze stochastic proximal-gradient algorithms, and indeed Reddi et al. [2016c] use the proximal-PL inequality to analyze finite-sum methods in the proximal stochastic case.

4.4 Support Vector Machines

Another important model problem that arises in machine learning is support vector machines,

$$\mathop{\mathrm{argmin}}_{x\in\mathbb{R}^d}\; \frac{\lambda}{2} x^T x + \sum_{i=1}^n \max\left(0,\, 1 - b_i x^T a_i\right), \qquad (18)$$

where (a_i, b_i) are the labelled training set with a_i ∈ R^d and b_i ∈ {−1, 1}. We often solve this problem by performing coordinate optimization on its Fenchel dual, which has the form

$$\min_{\bar{w}}\; f(\bar{w}) = \frac{1}{2}\bar{w}^T M \bar{w} - \sum_i \bar{w}_i, \qquad \bar{w}_i \in [0, U], \qquad (19)$$

for a particular positive semi-definite matrix M and constant U. This convex function satisfies the QG property and thus Theorem 6 implies that coordinate optimization achieves a linear convergence rate in terms of optimizing the dual objective. Further, note that Hush et al. [2006] show that we can obtain an ε-accurate solution to the primal problem with an O(ε²)-accurate solution to the dual problem. Thus this result also implies we can obtain a linear convergence rate on the primal problem by showing that stochastic dual coordinate ascent has a linear convergence rate on the dual problem. Global linear convergence rates for SVMs have also been shown by others [Tseng and Yun, 2009, Wang and Lin, 2014, Ma et al., 2015], but again we note that these works lead to more complicated analyses. Although the constants in these convergence rates may be quite bad (depending on the smallest non-zero singular value of the Gram matrix), we note that the existing sublinear rates still apply in the early iterations while, as the algorithm begins to identify support vectors, the constants improve (depending on the smallest non-zero singular value of the block of the Gram matrix corresponding to the support vectors).
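As an illustration of the dual coordinate optimization discussed above, the following sketch runs randomized exact coordinate minimization (with clipping to the box) on a quadratic of the form (19). The matrix M, the bound U, and the iteration counts are assumed purely for the example; this is not the paper's implementation, only a minimal sketch of the kind of method Theorem 6 covers.

```python
# Illustrative sketch (assumed data): randomized exact coordinate minimization on a
# box-constrained quadratic  min_w 0.5*w^T M w - sum(w)  s.t.  0 <= w_i <= U  (cf. Eq. (19)).
import numpy as np

rng = np.random.default_rng(2)
n, U = 100, 1.0
B = rng.standard_normal((n, 20))
M = B @ B.T                                   # positive semi-definite Gram-like matrix

def dual_obj(w):
    return 0.5 * w @ M @ w - w.sum()

w = np.zeros(n)
objs = []
for t in range(50 * n):                       # 50 "epochs" of single-coordinate steps
    i = rng.integers(n)                       # coordinate sampled uniformly at random
    if M[i, i] > 0:
        # exact minimization over coordinate i of the quadratic, then clip to [0, U]
        w[i] = np.clip(w[i] - (M[i] @ w - 1.0) / M[i, i], 0.0, U)
    if (t + 1) % n == 0:
        objs.append(dual_obj(w))
print("dual objective over the last 5 epochs:", objs[-5:])
```

Each coordinate step solves the one-dimensional quadratic exactly and then projects onto [0, U], which is the exact coordinate-wise minimizer for this separable constraint.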

The result of the previous subsection is not only restricted to SVMs. Indeed, it implies a linear convergence rate for many l2-regularized linear prediction problems, the framework considered in the stochastic dual coordinate ascent (SDCA) work of Shalev-Shwartz and Zhang [2013]. While Shalev-Shwartz and Zhang [2013] show that this is true when the primal is smooth, our result gives linear rates in many cases where the primal is non-smooth.

5 Discussion

We believe that this work provides a unifying and simplifying view of a variety of optimization and convergence rate issues in machine learning. Indeed, we have shown that many of the assumptions used to achieve linear convergence rates can be replaced by the PL inequality and its proximal generalization. While we have focused on sufficient conditions for linear convergence, another recent work has turned to the question of necessary conditions for convergence [Zhang, 2016]. Further, while we have focused on non-accelerated methods, Zhang [2016] has recently analyzed Nesterov's accelerated gradient method without strong convexity. We also note that, while we have focused on first-order methods, Nesterov and Polyak [2006] have used the PL inequality to analyze a second-order Newton-style method with cubic regularization. They also consider a generalization of the inequality under the name gradient-dominated functions. Throughout the paper, we have pointed out how our analyses imply convergence rates for a variety of machine learning models and algorithms. Some of these were previously known, typically under stronger assumptions or with more complicated proofs, but many of these are novel. Note that we have not provided any experimental results in this work, since the main contributions of this work are showing that existing algorithms actually work better on standard problems than we previously thought. We expect that going forward efficiency will no longer be decided by the issue of whether functions are SC, but rather by whether they satisfy a variant of the PL inequality.

Acknowledgments. We would like to thank Simon Lacoste-Julien, Martin Takáč, Ruoyu Sun, Hui Zhang, and Dmitriy Drusvyatskiy for valuable discussions. We would like to thank Ting Kei Pong and Zirui Zhou for pointing out an error in the first version of this paper, Ting Kei Pong for discussions that led to the addition of Appendix G, Jérôme Bolte for an informative discussion about the KL inequality and pointing us to related results that we had missed, and Boris Polyak for providing an English translation of his original work. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC RGPIN-668-5). Julie Nutini is funded by a UBC Four Year Doctoral Fellowship (4YF) and Hamed Karimi is supported by a Mathematics of Information Technology and Complex Systems (MITACS) Elevate Fellowship.

Appendix A Relationships Between Conditions

We start by stating the different conditions. All of these definitions involve some constant µ > 0 (which may not be the same across conditions), and we'll use the convention that x_p is the projection of x onto the solution set X*.

1. Strong Convexity (SC): For all x and y we have

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2.$$

2. Essential Strong Convexity (ESC): For all x and y such that x_p = y_p we have

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2.$$

3. Weak Strong Convexity (WSC): For all x we have

$$f^* \ge f(x) + \langle\nabla f(x), x_p - x\rangle + \frac{\mu}{2}\|x_p - x\|^2.$$

4. Restricted Secant Inequality (RSI): For all x we have

$$\langle\nabla f(x), x - x_p\rangle \ge \mu\|x_p - x\|^2.$$

If the function f is also convex it is called restricted strong convexity (RSC).

5. Error Bound (EB): For all x we have

$$\|\nabla f(x)\| \ge \mu\|x_p - x\|.$$

6. Polyak-Łojasiewicz (PL): For all x we have

$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\left(f(x) - f^*\right).$$

7. Quadratic Growth (QG): For all x we have

$$f(x) - f^* \ge \frac{\mu}{2}\|x_p - x\|^2.$$

If the function f is also convex it is called optimal strong convexity (OSC) or semi-strong convexity or sometimes WSC (but we'll reserve the expression WSC for the definition above).

Below we prove a subset of the implications in Theorem 2. The remaining relationships in Theorem 2 follow from these results and transitivity.

SC → ESC: The SC assumption implies that the ESC inequality is satisfied for all x and y, so it is also satisfied under the constraint x_p = y_p.

ESC → WSC: Take y = x_p in the ESC inequality (which clearly has the same projection as x) to get WSC with the same µ as a special case.

WSC → RSI: Re-arrange the WSC inequality to

$$\langle\nabla f(x), x - x_p\rangle \ge f(x) - f^* + \frac{\mu}{2}\|x_p - x\|^2.$$

Since f(x) ≥ f*, we have RSI with µ/2.

RSI → EB: Using Cauchy-Schwarz on the RSI we have

$$\|\nabla f(x)\|\,\|x - x_p\| \ge \langle\nabla f(x), x - x_p\rangle \ge \mu\|x_p - x\|^2,$$

and dividing both sides by ‖x − x_p‖ (assuming x ≠ x_p) gives EB with the same µ (while EB clearly holds if x = x_p).

EB → PL: By Lipschitz continuity we have

$$f(x) \le f(x_p) + \langle\nabla f(x_p), x - x_p\rangle + \frac{L}{2}\|x_p - x\|^2,$$

and using EB along with f(x_p) = f* and ∇f(x_p) = 0 we have

$$f(x) - f^* \le \frac{L}{2}\|x_p - x\|^2 \le \frac{L}{2\mu^2}\|\nabla f(x)\|^2,$$

which is the PL inequality with constant µ²/L.

PL → EB: Below we show that PL implies QG. Using this result, while denoting the PL constant with µ_p and the QG constant with µ_q, we get

$$\|\nabla f(x)\|^2 \ge 2\mu_p\left(f(x) - f^*\right) \ge \mu_p\mu_q\|x - x_p\|^2,$$

which implies that EB holds with constant $\sqrt{\mu_p\mu_q}$.

QG + Convex → RSI: By convexity we have

$$f(x_p) \ge f(x) + \langle\nabla f(x), x_p - x\rangle.$$

Re-arranging and using QG we get

$$\langle\nabla f(x), x - x_p\rangle \ge f(x) - f^* \ge \frac{\mu}{2}\|x_p - x\|^2,$$

which is RSI with constant µ/2.

PL → QG: Our argument that this implication holds is similar to the argument used in related works [Bolte et al., 2015, Zhang, 2015]. Define the function

$$g(x) = \sqrt{f(x) - f^*}.$$

If we assume that f satisfies the PL inequality, then for any x ∉ X* we have

$$\|\nabla g(x)\| = \frac{\|\nabla f(x)\|}{2\sqrt{f(x) - f^*}} \ge \sqrt{\frac{\mu}{2}}. \qquad (20)$$

By the definition of g, to show QG it is sufficient to show that

$$g(x) \ge \sqrt{\frac{\mu}{2}}\,\|x - x_p\|. \qquad (21)$$

As f is assumed to satisfy the PL inequality, f is an invex function and thus by definition g is a non-negative invex function (g(x) ≥ 0) with a closed optimal solution set X* such that for all y ∈ X*, g(y) = 0. For any point x_0 ∉ X*, consider solving the following differential equation:

$$\frac{dx(t)}{dt} = -\nabla g(x(t)), \qquad x(t = 0) = x_0, \qquad (22)$$

for x(t) ∉ X*. (This is a flow orbit starting at x_0 and flowing along the negative gradient of g.) By (20), ‖∇g‖ is bounded away from zero outside X*, and as g is a non-negative invex function, g is bounded from below. Thus, by moving along the path defined by (22) we are sufficiently reducing the function and will eventually reach the optimal set. Thus there exists a T such that x(T) ∈ X* (and at this point the differential equation ceases to be defined). We can show this by using the steps

$$g(x_0) - g(x_T) = -\int_{x_0}^{x_T}\langle\nabla g(x), dx\rangle \qquad\text{(gradient theorem for line integrals)}$$
$$(*)\qquad = \int_0^T \left\langle\nabla g(x(t)), -\frac{dx(t)}{dt}\right\rangle dt \qquad\text{(reparameterization)}$$
$$= \int_0^T \|\nabla g(x(t))\|^2\, dt \qquad\text{(from (22))}$$
$$\ge \int_0^T \frac{\mu}{2}\, dt \qquad\text{(from (20))}$$
$$= \frac{\mu}{2}\,T.$$

As g(x_T) ≥ 0, this shows we need to have T ≤ 2g(x_0)/µ, so there must be a T with x(T) ∈ X*. The length of the orbit x(t) starting at x_0, which we'll denote by L(x_0), is given by

$$L(x_0) = \int_0^T \left\|\frac{dx(t)}{dt}\right\| dt = \int_0^T \|\nabla g(x(t))\|\, dt \ge \|x_0 - x_p\|, \qquad (23)$$

where x_p is the projection of x_0 onto X* and the inequality follows because the orbit is a path from x_0 to a point in X* (and thus it must be at least as long as the projection distance). Starting from the line marked (*) above we have

$$g(x_0) - g(x_T) = \int_0^T \|\nabla g(x(t))\|^2\, dt \ge \sqrt{\frac{\mu}{2}}\int_0^T \|\nabla g(x(t))\|\, dt \qquad\text{(by the PL inequality variation in (20))}$$
$$\ge \sqrt{\frac{\mu}{2}}\,\|x_0 - x_p\|. \qquad\text{(by (23))}$$

As g(x_T) = 0, this yields our result (21), or equivalently

$$f(x) - f^* \ge \frac{\mu}{2}\|x - x_p\|^2,$$

which is QG with a different constant.

Appendix B Relevant Problems

Strongly-convex: By minimizing both sides of the SC inequality with respect to y we get

$$f(x^*) \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,$$

which implies the PL inequality holds with the same value µ. Thus, Theorem 1 exactly matches the known rate for gradient descent with a step-size of 1/L for a µ-SC function.

Strongly-convex composed with linear: To show that this class of functions satisfies the PL inequality, we first define f(x) := g(Ax) for a σ-strongly convex function g. For arbitrary x and y, we define u := Ax and v := Ay. By the strong convexity of g, we have

$$g(v) \ge g(u) + \nabla g(u)^T(v - u) + \frac{\sigma}{2}\|v - u\|^2.$$

By our definitions of u and v, we get

$$g(Ay) \ge g(Ax) + \nabla g(Ax)^T(Ay - Ax) + \frac{\sigma}{2}\|Ay - Ax\|^2,$$

where we can write the middle term as (Aᵀ∇g(Ax))ᵀ(y − x). By the definition of f, and its gradient being ∇f(x) = Aᵀ∇g(Ax) by the multivariate chain rule, we obtain

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\sigma}{2}\|A(y - x)\|^2.$$

Using x_p to denote the projection of x onto the optimal solution set X*, we have

$$f(x_p) \ge f(x) + \langle\nabla f(x), x_p - x\rangle + \frac{\sigma}{2}\|A(x_p - x)\|^2$$
$$\ge f(x) + \langle\nabla f(x), x_p - x\rangle + \frac{\sigma\theta(A)}{2}\|x_p - x\|^2$$
$$\ge f(x) + \min_y\left[\langle\nabla f(x), y - x\rangle + \frac{\sigma\theta(A)}{2}\|y - x\|^2\right]$$
$$= f(x) - \frac{1}{2\theta(A)\sigma}\|\nabla f(x)\|^2.$$

In the second line we use that X* is polyhedral, and use the theorem of Hoffman [1952] to obtain a bound in terms of θ(A) (the smallest non-zero singular value of A). This derivation implies that the PL inequality is satisfied with µ = σθ(A).

Appendix C Sign-Based Gradient Methods

The learning heuristic RPROP (resilient backpropagation) is a classic iterative method used for supervised learning problems in feedforward neural networks [Riedmiller and Braun, 1992]. The general update for some vector of step sizes α_k ∈ R^d is given by

$$x_{k+1} = x_k - \alpha_k \circ \mathrm{sign}\left(\nabla f(x_k)\right),$$

where the ∘ operator indicates coordinate-wise multiplication. Although this method has been used for many years in the machine learning community, we are not aware of any previous convergence rate analysis of such a method.

Here we give a convergence rate when the individual step-sizes α_i^k are chosen proportional to 1/√L_i, where the L_i are constants such that the gradient is 1-Lipschitz continuous when gradients are measured in the norm

$$\|z\|_{[L]}^* \equiv \sum_i \frac{|z_i|}{\sqrt{L_i}}.$$

Formally, we assume that the L_i are set so that for all x and y we have

$$\|\nabla f(y) - \nabla f(x)\|_{[L]}^* \le \|y - x\|_{[L]},$$

where the dual of the [L]* norm above is the [L] norm,

$$\|z\|_{[L]} \equiv \max_i \sqrt{L_i}\,|z_i|.$$

We note that such L_i always exist if the gradient is Lipschitz continuous, so this is not adding any assumptions on the function f. The particular choice of the step-sizes α_i^k that we will analyze is

$$\alpha_i^k = \frac{\|\nabla f(x_k)\|_{[L]}^*}{\sqrt{L_i}},$$

which yields a linear convergence rate for problems where the PL inequality is satisfied. The coordinate-wise iteration update under this choice of α_i^k is given by

$$x_i^{k+1} = x_i^k - \frac{\|\nabla f(x_k)\|_{[L]}^*}{\sqrt{L_i}}\,\mathrm{sign}\left(\nabla_i f(x_k)\right).$$

Defining a diagonal matrix Λ with 1/√L_i along the diagonal, the update can be written as

$$x_{k+1} = x_k - \|\nabla f(x_k)\|_{[L]}^*\,\Lambda\,\mathrm{sign}\left(\nabla f(x_k)\right).$$

Consider the function g(τ) = f(x + τ(y − x)) with τ ∈ R. Then

$$f(y) - f(x) - \langle\nabla f(x), y - x\rangle = g(1) - g(0) - \langle\nabla f(x), y - x\rangle$$
$$= \int_0^1 \frac{dg}{d\tau}(\tau)\, d\tau - \langle\nabla f(x), y - x\rangle$$
$$= \int_0^1 \langle\nabla f(x + \tau(y - x)), y - x\rangle\, d\tau - \langle\nabla f(x), y - x\rangle$$
$$= \int_0^1 \langle\nabla f(x + \tau(y - x)) - \nabla f(x), y - x\rangle\, d\tau$$
$$\le \int_0^1 \|\nabla f(x + \tau(y - x)) - \nabla f(x)\|_{[L]}^*\,\|y - x\|_{[L]}\, d\tau$$
$$\le \int_0^1 \tau\,\|y - x\|_{[L]}^2\, d\tau = \frac{1}{2}\|y - x\|_{[L]}^2,$$

where the second inequality uses the Lipschitz assumption, and in the first inequality we've used the Hölder inequality and that the dual of the [L] norm is the [L]* norm. The above gives an upper bound on the function in terms of the [L]-norm,

$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{1}{2}\|y - x\|_{[L]}^2.$$

Plugging in our iteration update we have

$$f(x_{k+1}) \le f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \frac{1}{2}\|x_{k+1} - x_k\|_{[L]}^2$$
$$= f(x_k) - \|\nabla f(x_k)\|_{[L]}^*\,\langle\nabla f(x_k), \Lambda\,\mathrm{sign}(\nabla f(x_k))\rangle + \frac{\left(\|\nabla f(x_k)\|_{[L]}^*\right)^2}{2}\,\|\Lambda\,\mathrm{sign}(\nabla f(x_k))\|_{[L]}^2$$
$$= f(x_k) - \left(\|\nabla f(x_k)\|_{[L]}^*\right)^2 + \frac{\left(\|\nabla f(x_k)\|_{[L]}^*\right)^2}{2}\left(\max_i \sqrt{L_i}\,\frac{|\mathrm{sign}(\nabla_i f(x_k))|}{\sqrt{L_i}}\right)^2$$
$$= f(x_k) - \frac{\left(\|\nabla f(x_k)\|_{[L]}^*\right)^2}{2}.$$

Subtracting f* from both sides yields

$$f(x_{k+1}) - f^* \le f(x_k) - f^* - \frac{\left(\|\nabla f(x_k)\|_{[L]}^*\right)^2}{2}.$$

Applying the PL inequality with respect to the [L]*-norm (which, if the PL inequality is satisfied, holds for some µ_{[L]} by the equivalence between norms),

$$\frac{1}{2}\left(\|\nabla f(x_k)\|_{[L]}^*\right)^2 \ge \mu_{[L]}\left(f(x_k) - f^*\right),$$

we have

$$f(x_{k+1}) - f^* \le \left(1 - \mu_{[L]}\right)\left(f(x_k) - f^*\right).$$

Appendix D Linear Convergence Rate of SVRG Method

In this section, we look at the SVRG method for the finite-sum optimization problem

$$f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w). \qquad (24)$$

To minimize functions of this form, the SVRG algorithm of Johnson and Zhang [2013] uses iterations of the form

$$x_t = x_{t-1} - \alpha\left[\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\tilde{x}_s) + \tilde{\mu}_s\right], \qquad (25)$$

where i_t is chosen uniformly from {1, 2, ..., n} and we assume the step-size satisfies α < 1/(2L). In this algorithm we start with some x_0 and initially set µ̃_0 = ∇f(x_0) and x̃_0 = x_0, but after every m steps we set x̃_{s+1} to a random x_t for t ∈ {ms + 1, ..., m(s + 1)}, then replace µ̃_s with ∇f(x̃_s) and x_t with x̃_{s+1}. Analogous to Johnson and Zhang [2013] for the SC case, we now show that SVRG has a linear convergence rate if each f_i is a convex function with a Lipschitz-continuous gradient and f satisfies the PL inequality. Following the same argument as Johnson and Zhang [2013], for any solution x* the assumptions on the f_i mean that the outer SVRG iterations x̃_s satisfy

$$2\alpha(1 - 2L\alpha)m\,\mathbb{E}\left[f(\tilde{x}_s) - f^*\right] \le \mathbb{E}\left[\|\tilde{x}_{s-1} - x^*\|^2\right] + 4L\alpha^2 m\,\mathbb{E}\left[f(\tilde{x}_{s-1}) - f^*\right].$$

Choosing the particular x* that is the projection of x̃_{s-1} onto the solution set and using QG (which is equivalent to PL in this convex setting) we have

$$2\alpha(1 - 2L\alpha)m\,\mathbb{E}\left[f(\tilde{x}_s) - f^*\right] \le \frac{2}{\mu}\,\mathbb{E}\left[f(\tilde{x}_{s-1}) - f^*\right] + 4L\alpha^2 m\,\mathbb{E}\left[f(\tilde{x}_{s-1}) - f^*\right].$$

Dividing both sides by 2α(1 − 2Lα)m we get

$$\mathbb{E}\left[f(\tilde{x}_s) - f^*\right] \le \frac{1}{1 - 2\alpha L}\left(\frac{1}{m\mu\alpha} + 2L\alpha\right)\mathbb{E}\left[f(\tilde{x}_{s-1}) - f^*\right],$$

which is a linear convergence rate for sufficiently large m and sufficiently small α.
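The following sketch shows the inner/outer structure of the SVRG iteration (25) on a finite-sum least squares problem. The problem data, step-size, and inner-loop length are assumed for the example, and the snapshot is taken as the last inner iterate (a common practical simplification of the random choice used in the analysis); it is not the authors' code.

```python
# Illustrative sketch (assumed setup): SVRG on f(w) = (1/n) sum_i 0.5*(a_i^T w - b_i)^2,
# following update (25); a full gradient is recomputed at each snapshot and used to
# correct the stochastic gradients of the inner loop.
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                   # consistent system, so f* = 0

grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]   # gradient of one term
full_grad = lambda w: A.T @ (A @ w - b) / n
f = lambda w: 0.5 * np.mean((A @ w - b) ** 2)

L = np.sum(A ** 2, axis=1).max()                 # per-example Lipschitz constant
alpha, m = 0.1 / L, 2 * n                        # step-size well below 1/(2L), inner length

w_snap = np.zeros(d)
for s in range(30):                              # outer iterations
    mu_snap = full_grad(w_snap)                  # snapshot full gradient
    w = w_snap.copy()
    for t in range(m):                           # inner loop, update (25)
        i = rng.integers(n)
        w = w - alpha * (grad_i(w, i) - grad_i(w_snap, i) + mu_snap)
    w_snap = w                                   # practical snapshot choice (last iterate)
print("final sub-optimality:", f(w_snap))
```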

Appendix E Proximal-PL Lemma

In this section we give a useful property of the function D_g.

Lemma 1. For any differentiable function f and any convex function g, given µ₁ ≥ µ₂ > 0 we have D_g(x, µ₁) ≥ D_g(x, µ₂).

We'll prove Lemma 1 as a corollary of a related result. We first restate the definition

$$\mathcal{D}_g(x, \lambda) = -2\lambda \min_y \left[\langle\nabla f(x), y - x\rangle + \frac{\lambda}{2}\|y - x\|^2 + g(y) - g(x)\right], \qquad (26)$$

and we note that we require λ > 0. By completing the square, we have

$$\mathcal{D}_g(x, \lambda) = -2\lambda\min_y\left[-\frac{1}{2\lambda}\|\nabla f(x)\|^2 + \frac{1}{2\lambda}\|\nabla f(x) + \lambda(y - x)\|^2 + g(y) - g(x)\right]$$
$$= \|\nabla f(x)\|^2 - \min_y\left[\|\lambda(y - x) + \nabla f(x)\|^2 + 2\lambda\left(g(y) - g(x)\right)\right].$$

Notice that if g ≡ 0, then D_g(x, λ) = ‖∇f(x)‖² and the proximal-PL inequality reduces to the PL inequality. We'll define the proximal residual function as the second part of the above equality,

$$\mathcal{R}_g(\lambda, x, a) \equiv \min_y\left[\|\lambda(y - x) + a\|^2 + 2\lambda\left(g(y) - g(x)\right)\right]. \qquad (27)$$

Lemma 2. If g is convex then for any x and a, and for 0 < λ₁ ≤ λ₂, we have

$$\mathcal{R}_g(\lambda_1, x, a) \ge \mathcal{R}_g(\lambda_2, x, a). \qquad (28)$$

Proof. Without loss of generality, assume x = 0. Then we have

$$\mathcal{R}_g(\lambda, 0, a) = \min_y\left[\|\lambda y + a\|^2 + 2\lambda\left(g(y) - g(0)\right)\right] = \min_{\bar{y}}\left[\|\bar{y} + a\|^2 + 2\lambda\left(g(\bar{y}/\lambda) - g(0)\right)\right], \qquad (29)$$

where in the second equality we used the change of variables ȳ = λy (note that we are minimizing over the whole space R^n). By the convexity of g, for any α ∈ [0, 1] and z ∈ R^n we have

$$g(\alpha z) \le \alpha g(z) + (1 - \alpha)g(0) \quad\Longleftrightarrow\quad g(\alpha z) - g(0) \le \alpha\left(g(z) - g(0)\right). \qquad (30)$$

By using 0 < λ₁/λ₂ ≤ 1 and using the choices α = λ₁/λ₂ and z = ȳ/λ₁, we have

$$g(\bar{y}/\lambda_2) - g(0) \le \frac{\lambda_1}{\lambda_2}\left(g(\bar{y}/\lambda_1) - g(0)\right) \quad\Longrightarrow\quad 2\lambda_2\left(g(\bar{y}/\lambda_2) - g(0)\right) \le 2\lambda_1\left(g(\bar{y}/\lambda_1) - g(0)\right). \qquad (31)$$

Adding ‖ȳ + a‖² to both sides, we get

$$\|\bar{y} + a\|^2 + 2\lambda_2\left(g(\bar{y}/\lambda_2) - g(0)\right) \le \|\bar{y} + a\|^2 + 2\lambda_1\left(g(\bar{y}/\lambda_1) - g(0)\right). \qquad (32)$$

Taking the minimum over both sides with respect to ȳ yields Lemma 2 due to (29).

Corollary 1. For any differentiable function f and convex function g, given λ₁ ≥ λ₂ > 0, we have

$$\mathcal{D}_g(x, \lambda_1) \ge \mathcal{D}_g(x, \lambda_2). \qquad (33)$$

By using D_g(x, λ) = ‖∇f(x)‖² − R_g(λ, x, ∇f(x)), Corollary 1 is exactly Lemma 1.
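For a one-dimensional problem with g(x) = |x|, the inner minimization defining D_g has a closed form via soft-thresholding, so the monotonicity stated in Lemma 1 can be checked directly. The sketch below does this on random instances; the ranges and tolerances are assumed purely for illustration and the check is in no way a substitute for the proof above.

```python
# Illustrative sketch: numerical check of Lemma 1 in one dimension with g(x) = |x|.
# The minimizer of  grad*(y - x) + (lam/2)*(y - x)^2 + |y| - |x|  is a soft-threshold step.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def D_g(x, grad, lam):
    y = soft_threshold(x - grad / lam, 1.0 / lam)     # argmin of the inner problem
    inner = grad * (y - x) + 0.5 * lam * (y - x) ** 2 + abs(y) - abs(x)
    return -2.0 * lam * inner                         # definition (26)

rng = np.random.default_rng(4)
for _ in range(1000):
    x, grad = rng.normal(), rng.normal()
    lams = np.sort(rng.uniform(0.1, 10.0, size=5))
    vals = [D_g(x, grad, lam) for lam in lams]
    # Lemma 1: D_g is non-decreasing in lambda (small slack for rounding)
    assert all(v2 >= v1 - 1e-12 for v1, v2 in zip(vals, vals[1:]))
print("Lemma 1 monotonicity verified on random instances")
```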

Appendix F Relevant Problems

In this section we prove that the function classes listed in Section 4.1 satisfy the proximal-PL inequality condition. Note that while we prove these hold for D_g(x, λ) for λ ≤ L, by Lemma 1 above they also hold for D_g(x, L).

1. f(x) satisfies the PL inequality and g is constant: As g is assumed to be constant, we have g(y) − g(x) = 0 and the left-hand side of the proximal-PL inequality simplifies through

$$\mathcal{D}_g(x, \mu) = -2\mu\min_y\left\{\langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2\right\} = -2\mu\left(-\frac{1}{2\mu}\right)\|\nabla f(x)\|^2 = \|\nabla f(x)\|^2.$$

Thus, the proximal-PL inequality simplifies to f satisfying the PL inequality, as we assumed.

2. F(x) = f(x) + g(x) and f is strongly convex: By the strong convexity of f we have

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2, \qquad (34)$$

which leads to

$$F(y) \ge F(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 + g(y) - g(x). \qquad (35)$$

Minimizing both sides with respect to y,

$$F^* \ge F(x) + \min_y\left[\langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 + g(y) - g(x)\right] = F(x) - \frac{1}{2\mu}\mathcal{D}_g(x, \mu). \qquad (36)$$

Rearranging, we have our result.

3. F(x) = f(Ax) + g(x), where f is strongly convex, g is the indicator function for a polyhedral set X, and A is a linear transformation: By defining f̃(x) = f(Ax) and using strong convexity of f, we have

$$\tilde{f}(y) \ge \tilde{f}(x) + \langle\nabla\tilde{f}(x), y - x\rangle + \frac{\mu}{2}\|A(y - x)\|^2, \qquad (37)$$

which leads to

$$F(y) \ge F(x) + \langle\nabla\tilde{f}(x), y - x\rangle + \frac{\mu}{2}\|A(y - x)\|^2 + g(y) - g(x). \qquad (38)$$

Since X is polyhedral, it can be written as a set {x : Bx ≤ c} for a matrix B and a vector c. As before, assume that x_p is the projection of x onto the optimal solution set X*, which in this case is the set

{x : Bx ≤ c, Ax = z*} for some z*. Then

$$F^* = F(x_p) \ge F(x) + \langle\nabla\tilde{f}(x), x_p - x\rangle + \frac{\mu}{2}\|A(x - x_p)\|^2 + g(x_p) - g(x)$$
$$= F(x) + \langle\nabla\tilde{f}(x), x_p - x\rangle + \frac{\mu}{2}\|Ax - z^*\|^2 + g(x_p) - g(x)$$
$$= F(x) + \langle\nabla\tilde{f}(x), x_p - x\rangle + \frac{\mu}{2}\left(\|\{Ax - z^*\}_+\|^2 + \|\{-Ax + z^*\}_+\|^2 + \|\{Bx - c\}_+\|^2\right) + g(x_p) - g(x)$$
$$\ge F(x) + \langle\nabla\tilde{f}(x), x_p - x\rangle + \frac{\mu\,\theta(A, B)}{2}\|x - x_p\|^2 + g(x_p) - g(x)$$
$$\ge F(x) + \min_y\left[\langle\nabla\tilde{f}(x), y - x\rangle + \frac{\mu\,\theta(A, B)}{2}\|y - x\|^2 + g(y) - g(x)\right]$$
$$= F(x) - \frac{1}{2\mu\,\theta(A, B)}\mathcal{D}_g\left(x, \mu\,\theta(A, B)\right), \qquad (39)$$

where we use the notation {·}_+ = max{·, 0}, the third line writes the equality constraint Ax = z* as two inequalities and uses that Bx ≤ c (since x is feasible, its constraint residuals vanish), and the following inequality applies Hoffman's bound [Hoffman, 1952] to the polyhedral system defined by A and B, with Hoffman constant θ(A, B). Rearranging, F satisfies the proximal-PL inequality with constant µθ(A, B).

4. F(x) = f(x) + g(x), f is convex, and F satisfies the quadratic growth (QG) condition: A function F satisfies the QG condition if

$$F(x) - F^* \ge \frac{\mu}{2}\|x - x_p\|^2. \qquad (40)$$

For any λ > 0 we have

$$\min_y\left[\langle\nabla f(x), y - x\rangle + \frac{\lambda}{2}\|y - x\|^2 + g(y) - g(x)\right]$$
$$\le \langle\nabla f(x), x_p - x\rangle + \frac{\lambda}{2}\|x_p - x\|^2 + g(x_p) - g(x)$$
$$\le f(x_p) - f(x) + \frac{\lambda}{2}\|x_p - x\|^2 + g(x_p) - g(x)$$
$$= \frac{\lambda}{2}\|x_p - x\|^2 + F^* - F(x)$$
$$\le \left(1 - \frac{\lambda}{\mu}\right)\left(F^* - F(x)\right). \qquad (41)$$

The third line follows from the convexity of f, and the last inequality uses the QG condition of F. Multiplying both sides by −2λ, we have

$$\mathcal{D}_g(x, \lambda) = -2\lambda\min_y\left[\langle\nabla f(x), y - x\rangle + \frac{\lambda}{2}\|y - x\|^2 + g(y) - g(x)\right] \ge 2\lambda\left(1 - \frac{\lambda}{\mu}\right)\left(F(x) - F^*\right). \qquad (42)$$

This is true for any λ > 0, and by choosing λ = µ/2 we have

$$\frac{1}{2}\mathcal{D}_g(x, \mu/2) \ge \frac{\mu}{4}\left(F(x) - F^*\right). \qquad (43)$$

5. F satisfies the KL inequality or the proximal-EB inequality: In the next section we show that these are equivalent to the proximal-PL inequality.

Appendix G Equivalence of Proximal-PL with KL and EB

The equivalence of the KL condition and the proximal-gradient variant of the Luo-Tseng EB condition is known for convex f; see [Drusvyatskiy and Lewis, 2016, Corollary 3.6] and the proof of [Bolte et al., 2015, Theorem 5]. Here we prove the equivalence of these conditions with the proximal-PL inequality for non-convex f. First we review the definitions of the three conditions:

1. Proximal-PL: There exists a µ > 0 such that

$$\frac{1}{2}\mathcal{D}_g(x, L) \ge \mu\left(F(x) - F^*\right),$$

where

$$\mathcal{D}_g(x, L) = -2L\min_y\left\{\langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2 + g(y) - g(x)\right\}.$$

2. Proximal-EB: There exists a c > 0 such that we have

$$\|x - x_p\| \le c\,\left\|x - \mathrm{prox}_{\frac{1}{L}g}\!\left(x - \tfrac{1}{L}\nabla f(x)\right)\right\|. \qquad (44)$$

3. Kurdyka-Łojasiewicz: The KL condition with exponent 1/2 holds if there exists a µ > 0 such that

$$\min_{s\in\partial F(x)}\|s\|^2 \ge 2\mu\left(F(x) - F^*\right), \qquad (45)$$

where ∂F(x) is the Frechet subdifferential. In particular, if F : H → R is a real-valued function then we say that s ∈ H is a Frechet subdifferential of F at x ∈ dom F if

$$\liminf_{y\to x,\ y\ne x}\;\frac{F(y) - F(x) - \langle s, y - x\rangle}{\|y - x\|} \ge 0. \qquad (46)$$

Note that for a differentiable function the Frechet subdifferential only contains the gradient, ∇f(x). In our case, where F(x) = f(x) + g(x) with a differentiable f and a convex g, we have ∂F(x) = {∇f(x) + ξ : ξ ∈ ∂g(x)}. The KL inequality is an intuitive generalization of the PL inequality since, analogous to the gradient vector in the smooth case, the negation of the quantity argmin_{s∈∂F(x)}‖s‖ points in the direction of steepest descent; see [Bertsekas et al., 2003, Section 8.4].

We first derive an alternative representation of D_g(x, L) in terms of the so-called forward-backward envelope F_{1/L} of F; see [Stella et al., 2016]. Indeed,

$$\mathcal{D}_g(x, L) = -2L\min_y\left\{\langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2 + g(y) - g(x)\right\}$$
$$= -2L\left[\min_y\left\{f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2 + g(y)\right\} - f(x) - g(x)\right] \qquad (47)$$
$$= -2L\left[F_{1/L}(x) - F(x)\right] = 2L\left[F(x) - F_{1/L}(x)\right].$$

It follows from the definition of F_{1/L}(x) that for any x' we have

$$F_{1/L}(x) - F(x') = \min_y\left\{f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2 + g(y)\right\} - f(x') - g(x')$$
$$\le f(x) + \langle\nabla f(x), x' - x\rangle + \frac{L}{2}\|x' - x\|^2 + g(x') - f(x') - g(x')$$
$$= f(x) - f(x') + \langle\nabla f(x), x' - x\rangle + \frac{L}{2}\|x' - x\|^2, \qquad (48)$$

where the inequality uses that we are taking the minimum over y. The remaining terms can be bounded using the Lipschitz continuity of ∇f, which gives

$$f(x) - f(x') + \langle\nabla f(x), x' - x\rangle \le \frac{L}{2}\|x' - x\|^2, \qquad (49)$$

so that F_{1/L}(x) − F(x') ≤ L‖x' − x‖².


More information

Parallel Coordinate Optimization

Parallel Coordinate Optimization 1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Primal-dual coordinate descent

Primal-dual coordinate descent Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Gradient Descent, Newton-like Methods Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them in class/help-session/tutorial this

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Convex Optimization. Convex Analysis - Functions

Convex Optimization. Convex Analysis - Functions Convex Optimization Convex Analsis - Functions p. 1 A function f : K R n R is convex, if K is a convex set and x, K,x, λ (,1) we have f(λx+(1 λ)) λf(x)+(1 λ)f(). (x, f(x)) (,f()) x - strictl convex,

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Feedforward Neural Networks

Feedforward Neural Networks Chapter 4 Feedforward Neural Networks 4. Motivation Let s start with our logistic regression model from before: P(k d) = softma k =k ( λ(k ) + w d λ(k, w) ). (4.) Recall that this model gives us a lot

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Stability Analysis for Linear Systems under State Constraints

Stability Analysis for Linear Systems under State Constraints Stabilit Analsis for Linear Sstems under State Constraints Haijun Fang Abstract This paper revisits the problem of stabilit analsis for linear sstems under state constraints New and less conservative sufficient

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Lecture 17: October 27

Lecture 17: October 27 0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Robust Nash Equilibria and Second-Order Cone Complementarity Problems

Robust Nash Equilibria and Second-Order Cone Complementarity Problems Robust Nash Equilibria and Second-Order Cone Complementarit Problems Shunsuke Haashi, Nobuo Yamashita, Masao Fukushima Department of Applied Mathematics and Phsics, Graduate School of Informatics, Koto

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Greed is Good. Greedy Optimization Methods for Large-Scale Structured Problems. Julie Nutini

Greed is Good. Greedy Optimization Methods for Large-Scale Structured Problems. Julie Nutini Greed is Good Greedy Optimization Methods for Large-Scale Structured Problems by Julie Nutini B.Sc., The University of British Columbia (Okanagan), 2010 M.Sc., The University of British Columbia (Okanagan),

More information

Convergence of Fixed-Point Iterations

Convergence of Fixed-Point Iterations Convergence of Fixed-Point Iterations Instructor: Wotao Yin (UCLA Math) July 2016 1 / 30 Why study fixed-point iterations? Abstract many existing algorithms in optimization, numerical linear algebra, and

More information

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

PICONE S IDENTITY FOR A SYSTEM OF FIRST-ORDER NONLINEAR PARTIAL DIFFERENTIAL EQUATIONS

PICONE S IDENTITY FOR A SYSTEM OF FIRST-ORDER NONLINEAR PARTIAL DIFFERENTIAL EQUATIONS Electronic Journal of Differential Equations, Vol. 2013 (2013), No. 143, pp. 1 7. ISSN: 1072-6691. URL: http://ejde.math.txstate.edu or http://ejde.math.unt.edu ftp ejde.math.txstate.edu PICONE S IDENTITY

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Ordinal One-Switch Utility Functions

Ordinal One-Switch Utility Functions Ordinal One-Switch Utilit Functions Ali E. Abbas Universit of Southern California, Los Angeles, California 90089, aliabbas@usc.edu David E. Bell Harvard Business School, Boston, Massachusetts 0163, dbell@hbs.edu

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Non-Smooth, Non-Finite, and Non-Convex Optimization

Non-Smooth, Non-Finite, and Non-Convex Optimization Non-Smooth, Non-Finite, and Non-Convex Optimization Deep Learning Summer School Mark Schmidt University of British Columbia August 2015 Complex-Step Derivative Using complex number to compute directional

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information