arxiv: v4 [math.oc] 5 Jan PDF Free Download

Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The University of Iowa, Iowa City, IA 54 tianbao-yang@uiowa.edu, qihang-lin@uiowa.edu Abstract In this paper, we show that a restarted stochastic (sub)gradient descent (RSGD) can find an -optimal solution for a broad class of non-smooth and/or non-strongly convex optimization problems with a lower iteration complexity than vanilla stochastic subgradient descent (SGD). Of foremost interest, we show that for nonsmooth and non-strongly convex optimization, RSGD reduces the dependence of SGD s iteration complexity on the distance of initial solution to the optimal set to the dependence on the maximum distance of points in the -level set to the optimal set. This result of RSGD for non-smooth and non-strongly convex problems complements our prior work [7] that establishes a linear convergence rate for RSGD for a special family of non-smooth and non-strongly convex optimization problems whose epigraph is a polyhedron. Besides that, we also obtain some interesting results of RSGD for locally semi-strongly convex problems and smooth optimization problems that have a Łojasiewicz property. In particular, (i) for locally semi-strongly convex functions, RSGD has an O( 1 log( 1 )) iteration complexity, which in the strongly convex scenario is similar to if not better than that of SGD; (ii) for smooth optimization problems that have a Łojasiewicz property with a power constant of β (0, 1), RSGD 1 has an O( 1 log( 1 β )) iteration complexity, which is better that that of SGD for such smooth optimization problems whose iteration complexity is O( 1 ). The novelty of our analysis lies at exploiting the lower bound of the subgradient or gradient of the objection function at the -level set. 1 Introduction In this section, we present the problem and some necessary notations and basic results. We consider the following generic unconstrained optimization problem without imposing additional assumptions on the structure of the objective function: min f(w) (1) w R d where f(w) is a non-smooth and/or non-strongly convex function. We abuse the notation f(w) to denote the set of subgradients of a non-smooth function or the gradient of a smooth function. In the sequel, we will use the term subgradient generally without explicitly referring to the gradient for a smooth function. Let g(w) denote a stochastic subgradient of f(w) that depends on a random variable ξ such that E[g(w)] f(w). Since the objective function is not necessarily stronlgy convex, thus the optimal solution is not necessarily unique. We use Ω to denote the optimal solution set and use f to denote the unique optimal objective value. 1 In this case, RSGD refers to restarted stochastic gradient descent. 1

Figure 1: An illustration of Lemma 1 for a non-smooth and non-stronlgy convex problem. The value of κ is a constant (just the slope of the linear line), where w (red point) is the closest -optimal solution to w (blue point). Throughout the paper, we make the following assumptions unless stated otherwise. Assumption 1. For a convex minimization problem (1), we assume a. There exist known w 0 Ω and 0 0 such that f(w 0 ) min w Ω f(w) 0. b. There exists a known constant G such that g(w) G almost surely for any w R d. c. Ω is a non-empty convex compact set. Below, we first introduce some notations. We denote by the Euclidean norm. Let w denote the closest optimal solution in Ω to w measured in terms of norm, i.e., w = arg min u Ω u w. Since Ω is a non-empty convex compact set, w is unique. We denote by L the -level set of f(w) and by S the -sublevel set of f(w), respectively, i.e., L = {w R d : f(w) = f + }, S = {w R d : f(w) f + } Since Ω is bounded and f(x) is convex, it follows from [5] (Corollary 8.7.1) that the sublevel set S = {w R d : f(w) f + } is bounded for any 0 and so as the level set L. Define B to be the maximum distance of points on the level set L to the optimal set Ω, i.e., B = max min w u = max w w. () u Ω w L w L Due to that L and Ω are bounded, B is also bounded. For any > 0, we define κ B. Let w to denote the closest point in the -sublevel set to w, i.e.., w = arg min u S u w = arg min u R d u w, s.t. f(u) f +. (3) It is easy to show that w L when w / S. In the sequel, we let f(w) be defined as f(w) = inf v f(w) v Given the notations above, we provide the following lemma which is the key to our analysis.

Lemma 1. For any w, we have w w 1 (f(w) f(w κ )). In general, when f(w) ρ for any w L for a constant ρ > 0, we have w w 1 (f(w) f(w ρ )). Proof. First we note that if w S then w = w and the lemma holds trivially. Next we consider w S. Then w L, i.e., f(w ) = f +. By the definition of w, there exists λ 0 and a subgradient v f(w ) such that By the convexity of f( ) we have w w + λv = 0. f(w) f(w ) (w w ) v = w w v (4) where the equality is due to that w w is in the same direction of v. From the above inequality, the second part of the Lemma is proved due to the assumption f(w) ρ for any w L. To prove the first part, we use the convexity of f( ) again to obtain f(w ) f(w ) (w w ) f(w ) where w is the closest point in Ω to w. To proceed, we have w w f(w ) (w w ) f(w ) f(w ) f(w ) = f(w ) f = (5) By the definition of B, we have B f(w ) = f(w ) B Combining the above inequality with that in (4), we have which completes the proof. w w B (f(w) f(w )) Remark 1: It is important to note that the above results are deterministic independent of any stochastic or randomized optimization algorithms. Although we focus on SGD-type algorithms in this work, it might be of further interest to emploit the above results for the analysis of other types of stochastic or randomized algorithms. For example, it is easy to generalize the analysis and algorithms to restarted stochastic mirror descent as long as the Euclidean norm for w is replaced by a general norm w and the Euclidean norm of subgradient f(w) is replaced by a dual norm f(w). In Section, we leverage the first part of Lemma 1 to analyze the convergence for non-smooth non-strongly convex optimization problems. In Section 3, we leverage the second part to analyze the convergence for other types of optimization problems, including problems with a weaker notion of strong convexity and a Łojasiewicz property, where both types of problems enjoy a lower bound of the subgradients or gradient at the -level set. Indeed, we note from the proof above that the key to prove the first part is also to establish a lower bound of the subgradients at points in the - level set. This observation might be extended to many special problems. For the problem shown in Figure 1, the subgradient f(w + ) is a constant, i.e., B = c where c is a constant, implying that κ is a constant. More generally, when f(x) is polyhedral function as considered in [7], κ κ when 0, implying B = Θ(). We emphasize that the second part of Lemma 1 could be of independent interest for further research. 3

Algorithm 1 SGD: ŵ T = SGD(w 1, η, T ) 1: Input: a step size η, the number of iterations T, and the initial solution w 1, : for t = 1,..., T do 3: compute a stochastic subgradient of f(w) at w t denoted by g t 4: update w t+1 = w t ηg t 5: end for 6: Output: ŵ T = T w t t=1 T Algorithm RSGD 1: Input: the number of epochs K and the number of iterations t per-epoch, w 0 Ω and 0 as in Assumption 1.a : for k = 1,..., K do 3: Set η k = ( k 1 + (a k 1 1))/(G ) // where a k = k 1. k 4: Run SGD to obtain w k = SGD(w k 1, η k, t) 5: Set k = k 1 / 6: end for 7: Output: w K RSGD for Non-smooth Non-strongly Convex Optimization Our goal is to find an -optimal solution ŵ. Algorithm 1 describes the vanilla stochastic subgradient descent method that serves as a subroutine in the proposed restarted stochastic subgradient method (RSGD) shown in Algorithm. We note that the restarting scheme is similar to RSGD proposed in [7]. However, the step size is slightly different. The following lemma [8] provides guarantee on the convergence of SGD methods. Lemma. Let SGD run T iterations. Then for any w we have for using subgradients, and E [f(ŵ T ) f(w)] G η + E[ w 1 w ]. ηt Theorem 1. Suppose Assumption 1 is satisfied. Let Algorithm run for t iterations per-epoch with t 4G κ. With at most K = log ( 0 ) epochs, we have E[f(w K ) f ]. Hence, the total number of iterations for achieving an -suboptimal solution is at most T = tk = t log ( 0 ). Remark : Compare with vanilla SGD. Since the number of iterations per-epoch t is in the order of 4G B, the overall iteration complexity is 4G B log( 0 /). Compared to traditional SGD whose iteration complexity is G w 0 w 0 by setting the optimal step size η, RSGD s iteration complexity depends on B instead of w 0 w0 and only has a logarithmic dependence on 0, the upper bound of f(w 0 ) f. When the initial solution is far from the optimal set, or B is very small, the proposed RSGD could be much faster. In some special cases, e.g., the problem shown in Figure 1, B = Θ(), RSGD only needs a constant number of iterations. Remark 3: Compare with []. Freund & Lu [] proposed new subgradient descent methods with a new step-size rule by leveraging the lower bound of the objective function and a function growth condition, which could lead to improved iteration complexity for both non-smooth and smooth optimizations. For non-smooth optimization, their method achieves an iteration complexity of O(G C ( log H + 1 ) for a solution with a relative accuracy (defined shortly) of. Here, H = f(w0) f f f slb, f slb is a lower bound f and C is called growth constant. We emphasize that there are several key differences between our work and []:(i) The convergence results in [] are established based on finding an solution ŵ T with a relative accuracy of, i.e., f(ŵ T ) f (f f slb ), The constant is set artificially for ease of presentation. 4

while we consider absolute accuracy. (ii) The methods in [] are the deterministic first-order methods while we focus on stochastic first-order methods. (iii) Their version of subgradient descent method keeps track of the best solution up to the current iteration while our method maintains the average of all solutions within each epoch. (iv) Their convergence results depend on the growth constant C while ours depends on the quantity B. Proof. We assume that the epoch solutions w k S for k = 1,..., K 1 and induce how many epochs are sufficient for reaching to a point in the -sublevel set in the worst case. Let w k, denote the closest point to w k in the sublevel set. Since S S, therefore w k S and consequentially f(w k, ) = f +. We first define a k = k 1 1 so that a k k = a k 1+1 for k = 0, 1,.... According to Algorithm, we have k = k 1. We will show by induction that E[f(w k) f ] k + a k for k = 0, 1,... which leads to our conclusion when k = K. The inequality holds obviously for k = 0. Assuming E[f(w k 1 ) f ] k 1 + a k 1, we need to show that E[f(w k ) f ] k + a k. We apply Lemma to the k-th epoch of Algorithm and get By Lemma 1, we have E[f(w k ) f(w k 1, )] G η k + E[ w k 1 w k 1, ]. η k t E[ w k 1 w k 1, ] 1 κ E[(f(w k 1 ) f(w k 1, ))] k 1 + (a k 1 1) κ where the second inequality is from the fact that E[f(w k 1 ) f(w k 1, )] E[f(w k 1) f + (f f(w k 1, ))] k 1 + (a k 1 1) Choosing η k = k 1 + (a k 1 1) G E[f(w k ) f(w k 1, )] k 1 + (a k 1 1) 4 and t 4G κ, we have + k 1 + (a k 1 1) 4 which, together with the fact that f(w k 1, ) = f +, implies E[f(w k ) f ] + k 1 + (a k 1 1) Therefore by induction, we have = k + (a k 1 + 1) E[f(w K ) f ] K + a K = 0 K + K 1 K where the last inequality is due to the value of K. k 1 + (a k 1 1), = k + a k Before ending this section, we would like to emphasize that the same analysis can be employed to analyze the iteration complexity of restarted version of other stochastic/randomized algorithms. For example, it is not difficult to obtain similar results for randomized coordinate subgradient descent and stochastic dual averaging methods, which have been analyzed in the previous work [7]. 3 RSGD for other classes of problems In this section, we show some interesting convergence results of RSGD for (weaker) strongly convex optimization problems and smooth optimization problems. First, we consider a weaker notation of strong convexity f(w). A function f(w) is said to be (λ, )-locally semi-strongly convex, iff for any w S it satisfies w w λ (f(w) f(w )), (6) 5

Figure : An illustration of weaker classes of strong convexity and the applicable representative algorithms. where w is the closest point to w in the optimal set. If the inequality (6) holds for any w R d, we refer to it as semi-strong convexity. The above inequality implies w w for any w L. Therefore, in the proof of Lemma 1 f(w ) which, together with (4), further leads to w w λ λ as a result of (5), λ (f(w) f(w )) for any w. By choosing t 8G in Algorithm and following a similar analysis as the proof of Theorem 1 λ we can obtain an O( G λ log( 1 ) iteration complexity of Algorithm for (λ, )-locally semi-strongly convex optimization. This result is summarized in the following corollary. Corollary. Suppose Assumption 1 is satisfied and f(w) is (λ, )-locally semistrongly convex such that the inequality in (6) holds for any w S. Let Algorithm run for t iterations per-epoch with t 8G λ. With at most K = log ( 0 ) epochs, we have E[f(w) f ]. In comparison, to the best of our knowledge the vanilla SGD can only take advantage of strong convexity for obtaining Õ(1/) iteration complexity where Õ( ) suppresses constant and logarithmic terms. A function f(w) is said to be λ-strongly convex if it satisfies the following inequality for any w, u and any α [0, 1] f(αw + (1 α)w) αf(w) + (1 α)f(u) λ α(1 α) w u which is stronger than (6). In particular, λ-strong convexity implies (λ, )-local semi-strong convexity [3] but not vice versa, which is illustrated in Figure. The convergence rate of SGD for λ-strongly convex optimization is O( G log T λt ). It is worth noting that a different variant of SGD (namely Epoch-SGD) proposed in [3] applies to semi-strong convexity such that the inequality in (6) holds for any w and enjoys an O( G λt ) convergence rate. Nonetheless, RSGD is still attractive when only local semi-strong convexity holds, which is discussed further below for some examples. Locally strongly convex functions include, e.g., f(w) = e w z and f(w) = log(1+exp(w z)) that cover some commonly used loss functions in machine learning. A family of functions that admit (6) is in the form f(w) = h(aw) where h( ) is a strongly convex function (e.g., h(x) = x y ). By exploiting the Hoffman s bound, it can be shown that the f function in this form satisfies (6) for a constant λ > 0 3. As a result, for any locally strongly convex function h( ), the composition h(aw) is locally semi-strongly convex. Next, we consider a family of smooth optimization with an objective function satisfying the Łojasiewicz inequality, i.e., for every w there is an open neighborhood U of w and constants β (0, 1) and c > 0 such that for all u U f(u) f(w) β c f(u) 3 Indeed, λ is the minimum non-zero eigen-value of A A. 6

Algorithm 3 Multi-stage RSGD 1: Input: the accuracy parameter 1 for the first stage and the associated number iterations t 1 per-epoch, a decreasing power parameter r N + : Initialization: w 0 Ω and 0 as in Assumption 1.a 3: for s = 1,..., do 4: Run RSGD with the initial solution w s 1 and K s = log ( s 1 / s ) epochs and t s iterations per-epoch, and obtain a solution w s 5: Evaluate w s according to the earlier stopping criterion 6: If the criterion is not satisfied, update s+1 = s / r and t s+1 accordingly 7: Otherwise terminate the for loop 8: end for In particular, for any w L we assume that the Łojasiewicz inequality holds for w Ω and a neighborhood U of w such that w U. Then f(w) β c, w L The same analysis in Section for RSGD yields an O( G log( 1 β )) iteration complexity. In comparison, stochastic gradient descent for smooth optimization only has an O( 1 ) iteration complexity. Therefore the smaller the power constant β, the smaller the iteration complexity of RSGD. It is worth mentioning that the most notable class of functions that have the Łojasiewicz property at any point are real-analytic functions [6]. Recently, many works study when a function admits the Łojasiewicz property or the Kurdyka - Łojasiewicz property [1]. However, research about this issue is beyond the scope of this paper. The corollary below summarizes the iteration complexity of RSGD for minimizing a smooth function that has that the Łojasiewicz property. Corollary 3. Suppose Assumption 1 is satisfied and f(w) has a Łojasiewicz property - for any w L there exist constants β (0, 1) and c > 0 and an open neighborhood of w Ω that subsumes w such that f(w) f(w ) β c f(w) where w is the closest point to w in Ω. Let Algorithm run for t iterations per-epoch with t 4G c. With at most K = log β ( 0 ) epochs, we have E[f(w) f ]. 4 Implementation Tricks for Machine Learning Problems In this section, we discuss some technical issues related to implementation. First, we note that the RSGD algorithm presented in Algorithm is a unified framework for non-smooth and non-strongly convex problems, locally semi-stronlgy convex problems, and smooth problems of interest. There is only one parameter in RSGD needed to be tuned in practice, i.e., the number of iterations perepoch. The step size η is determined automatically given the accuracy. Although the value of is usually given by the user, in machine learning applications people may find it inconvenient if no prior knowledge about that gives good prediction performance is known. In particular, if is set too small the total number of iterations becomes unnecessarily large, which could harm the prediction performance on testing data as well. On the other hand, if is not sufficiently large, people would worry about that whether the best prediction performance is obtained. In contrast, SGD can be run with infinite number of iterations to get the optimal solution. To solve this issue, we present a multi-stage implementation scheme of RSGD guided by an earlier stopping criterion. The earlier stopping criterion is based on the prediction performance on a validation data, which is usually a common practice in big data learning [4]. We let the first stage start from a randomly chosen initial solution w 0 with a target accuracy of reasonably small 1 and a tuned value of t 1 per-iteration. After K 1 = log ( 0 / 1 ) epochs, we obtain an 1 -optimal model w 1 and evaluate the earlier stopping criterion to decided whether to stop or not. If the decision is to continue, we then decrease 1 by a constant number of times (e.g., ) and increase t 1 accordingly to obtain = 1 / and t. In particular, to get t we can increase t 1 by 4 times if according to the theory the number 7

of iterations per-epoch t depends on 1/, or times if t depends on 1/. Then we start the second stage of RSGD from the last solution w 1 with a target accuracy of and t iterations per-peoch. Note that since the new initial solution w 1 is 1 -optimal, therefore according to our theory, we can stop the second stage in K = log ( 1 / ) epochs and the resulting model w is guaranteed to be -optimal solution. Then we repeat the above process till that the earlier stopping criterion is satisfied, e.g, there is no change in the validation error or overfitting occurs 4. The multi-stage scheme is presented in Algorithm 3. 5 Conclusion In this work, we have analyzed the iteration complexity of a novel restarted stochastic subgradient descent method for non-smooth and/or non-strongly convex optimization for obtaining an - suboptimal solution. By leveraging the maximum distance of points in the -level set to the optimal set, we establish a lower bound of the subgradient of points in the -level set, which together with a novel analysis leads to an iteration complexity that could be better than that of the vanilla SGD for non-smooth and non-strongly convex optimization problems. Our novel analysis also leads to improved iteration complexities of RSGD for locally semi-strongly convex optimization problems, and smooth problems that have a Łojasiewicz property. It remains an interesting question to explore RSGD for solving non-convex problems. References [1] J. Bolte, A. Daniilidis, A. S. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18:556 57, 007. [] R. M. Freund and H. Lu. New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure. ArXiv e-prints, 015. [3] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 4th Annual Conference on Learning Theory (COLT), pages 41 436, 011. [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1106 1114, 01. [5] R. Rockafellar. Convex Analysis. Princeton mathematical series. Princeton University Press, 1970. [6] R. Schneider and A. Uschmajew. Convergence results for projected line-search methods on varieties of low-rank matrices via łojasiewicz inequality. CoRR, abs/140.584, 014. [7] T. Yang and Q. Lin. Stochastic subgradient methods with linear convergence for polyhedral convex optimization. arxiv:1510.01444 cs.lg, 015. [8] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML), pages 98 936, 003. 4 In the later case, the solution from the previous stage may be used as the final solution. 8

arxiv: v4 [math.oc] 5 Jan 2016