A Theoretical Analysis of a Warm Start Technique
A Theoretical Analysis of a Warm Start Technique

Martin A. Zinkevich
Yahoo! Labs
701 First Avenue
Sunnyvale, CA

Abstract

Batch gradient descent looks at every data point for every step, which is wasteful for early steps where the current position is nowhere near optimal. There has been a lot of interest in warm-start approaches to gradient descent techniques, but little analysis. In this paper, we formally analyze a method of warm-starting batch gradient descent using small batch sizes. We argue that this approach is fundamentally different than mini-batch, in that after an initial shuffle, it requires only sequential passes over the data, improving performance on datasets stored on a disk drive.

1 Support Vector Machine Problems

Suppose you have a set of examples (x_1, y_1), ..., (x_m, y_m), drawn from a distribution D, and a regularization parameter λ. For each example i, there is some convex loss L_i : R^d → R over the parameter space associated with example i, such that for any S ⊆ {1...m}, we have a cost function f_S : R^d → R:

    f_S(w) = λ‖w‖² + (1/|S|) Σ_{i∈S} L_i(w).    (1)

Also, we can have a distribution D over loss functions. Using this we can define:

    f_D(w) = λ‖w‖² + ∫ L(w) D(dL).    (2)

Here, we will focus on finding a point w such that f_D(w) is as small as possible. This is often done by minimizing f_{1...m}. As an intermediate step, we focus on Euclidean distance to w* = argmin_w f_D(w), as many times this is a symmetric problem. Define G to be a bound on the magnitude of the gradient of functions in D. Define w_{1...m} ∈ R^d such that w_{1...m} = argmin_w f_{1...m}(w). We will first bound ‖∇f_{1...m}(w*)‖ with high probability, and then connect this to the distance between w_{1...m} and w*. A quick theorem, in the vein of statistical learning theory:

Lemma 1 With probability at least 1 − δ,

    ‖w_{1...m} − w*‖ ≤ √( (ln(2d) + ln(1/δ)) 8dG² / (λ²m) ).

The proof is in the appendix. In [1], they argue that with access to an infinite amount of data, it makes sense to balance the loss due to optimization error and the loss due to estimation error.
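To make the cost in equation (1) concrete, it can be sketched in NumPy. The logistic loss and all variable names below are illustrative assumptions, not part of the paper:

```python
import numpy as np

def f_S(w, X, y, lam):
    """Regularized cost f_S(w) = lam*||w||^2 + (1/|S|) sum_{i in S} L_i(w),
    using the logistic loss L_i(w) = log(1 + exp(-y_i * <w, x_i>)) as an
    example convex loss."""
    margins = y * (X @ w)                   # y_i * <w, x_i> for each i in S
    losses = np.log1p(np.exp(-margins))     # per-example logistic loss
    return lam * (w @ w) + losses.mean()

def grad_f_S(w, X, y, lam):
    """Gradient of f_S: 2*lam*w from the regularizer, plus the average of
    the per-example loss gradients."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))    # dL_i/d(<w, x_i>)
    return 2 * lam * w + (X * coeff[:, None]).mean(axis=0)
```

A finite-difference comparison of `grad_f_S` against `f_S` is a quick way to check the 2λw regularizer term.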
Here, we use this balance not just to tune the algorithm but to alter its structure. We modify batch gradient descent by having multiple epochs, doubling the size of the batch each epoch. One key practical element that we want to emphasize is that random access has a high cost, especially on hard disk drives. For example, a modern disk drive can read or write about 100 megabytes per second of data written in sequence in a file; however, if you want a random chunk off the disk, it can take about 10 milliseconds for the disk to spin to a specific location and for the head to move
to access that one byte. Therefore, if your examples are 1 kilobyte in size, then you can either read 100,000 sequentially per second or 100 randomly per second. On more expensive solid state drives, the gap is smaller, but this can make storage prohibitively expensive. However, special tricks can be applied to shuffle data (see below). Thus, stochastic gradient descent [2] and mini-batch [7, 5] can be significantly more expensive than one might expect by just looking at the number of iterations. In practical implementations such as [6], the stochastic aspect of the stochastic gradient descent is ignored, and the examples are passed over sequentially:

Algorithm 1 Shuffled Stochastic Gradient Descent
  Input: cost functions L_1 ... L_m, epochs T, step size η, initial w ∈ R^n.
  If not already shuffled, shuffle L_1 ... L_m.
  for t = 1 to T do
    for i = 1 to m do
      w ← w − η∇f_{{i}}(w)

Although in practice this algorithm works fine and you can prove that it converges, getting practical convergence rates is incredibly difficult. This represents a huge discrepancy between theory and practice in the field. One critical issue is that you must shuffle the data: if the data was generated independently and identically (each example drawn at random from the same distribution), then no shuffling is necessary. On the other hand, if it was then reordered (for example, the data was ordered by date, or label), then it must be shuffled. However, oddly, shuffling the data does not require much random access. For example, we can take a dataset, and add a random key to each entry. Then, using fast methods for large scale sorting such as mergesort, we can sort by the random key, effecting a shuffle in O(n log n) time. Therefore, the question we ask in this paper is: given a dataset that is shuffled once and accessed sequentially thereafter, what theoretical guarantees can be obtained? Although we do not have a tight analysis of Shuffled Stochastic Gradient Descent, in this paper we present another practical algorithm which is easier to analyze.
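Algorithm 1 might be rendered as the following sketch (pure Python; supplying the per-example loss gradients as a list of callables is an illustrative assumption):

```python
import random

def shuffled_sgd(grad_L, w, eta, lam, epochs, shuffled=False):
    """Shuffled Stochastic Gradient Descent (Algorithm 1): shuffle once,
    then make purely sequential passes over the examples.

    grad_L: list of callables; grad_L[i](w) returns the gradient of L_i.
    Each step descends f_{{i}}(w) = lam*||w||^2 + L_i(w), whose gradient
    is 2*lam*w + grad_L[i](w).
    """
    if not shuffled:
        random.shuffle(grad_L)      # the single non-sequential pass
    for _ in range(epochs):         # T sequential epochs over the data
        for g in grad_L:            # i = 1 ... m, in shuffled order
            step = g(w)
            w = [wj - eta * (2 * lam * wj + sj) for wj, sj in zip(w, step)]
    return w
```

With data already stored on disk in shuffled order, every pass after the first is a sequential scan, which is the point the paper emphasizes.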
Algorithm 2 Doubling Batch Gradient Descent
  Input: cost functions L_1 ... L_m, initial batch size m_initial, iterations T per size, step size η, initial w ∈ R^n, iterations T_0 in the first phase.
  If not already shuffled, shuffle L_1 ... L_m.
  m' ← m_initial.
  {The first epoch}
  for t = 1 to T_0 do
    w ← w − η∇f_{1...m'}(w)
  while m' < m do
    {Each iteration of the while loop is another epoch}
    m' ← min(2m', m).
    for t = 1 to T do
      w ← w − η∇f_{1...m'}(w)
  end while

2 Theoretical Analysis of Doubling Batch Gradient Descent

In this section we show how to set the parameters to prove a significant improvement over traditional analysis. Define

    C = √( (ln(2d) + ln(1/δ) + ln(log₂ m)) 8dG² / λ² )

(see Lemma 3 for why).
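The following NumPy sketch of Algorithm 2 uses a least-squares loss purely for illustration; the loss choice and the closed-form solution used to check it are assumptions, not the paper's:

```python
import numpy as np

def doubling_batch_gd(X, y, w, eta, lam, m_initial, T0, T):
    """Doubling Batch Gradient Descent (Algorithm 2), with the example loss
    L_i(w) = (w . x_i - y_i)^2, so that
    f_{1...m'}(w) = lam*||w||^2 + (1/m') * sum_{i <= m'} (w . x_i - y_i)^2.
    X, y are assumed to be already shuffled."""
    def grad_prefix(w, m_cur):
        # Gradient of f_{1...m_cur}: regularizer plus average loss gradient.
        Xc, yc = X[:m_cur], y[:m_cur]
        resid = Xc @ w - yc
        return 2 * lam * w + 2 * (Xc.T @ resid) / m_cur

    m, m_cur = len(X), m_initial
    for _ in range(T0):                  # the first epoch
        w = w - eta * grad_prefix(w, m_cur)
    while m_cur < m:                     # each pass of the loop is an epoch
        m_cur = min(2 * m_cur, m)        # double the prefix that is read
        for _ in range(T):
            w = w - eta * grad_prefix(w, m_cur)
    return w
```

Each epoch reads only a prefix of the shuffled data, so all I/O stays sequential.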
Theorem 2 Running Algorithm 2 with m_initial = 1, T_0 = T = ⌈(κ/2 + 1) ln(1 + 2√2)⌉, η = 1/(H + 2λ) achieves an error of 2(H + 2λ)C²/m.

Before we prove this, we need two lemmas. The parameters T_0 and T are selected such that at the end of each epoch, the bound on ‖w − w_{1...m'}‖ is roughly equal to the bound on ‖w_{1...m'} − w*‖.

Lemma 3 With probability at least 1 − δ, for all m' used, ‖w_{1...m'} − w*‖ ≤ C/√m'.

Proof: This follows from union bounds over the log₂ m bounds, using Lemma 1, substituting δ/log₂ m for δ.

Suppose that there is an upper bound H on the Lipschitz constant of the gradient of the functions L_1 ... L_m. Then, for all S ⊆ {1...m}, we have a bound H + 2λ on the Lipschitz constant of the gradient of f_S. Define d(x, y) = ‖x − y‖. Therefore, we can use a slight tweak on Lemma 16 from the appendix of [8], similar to the exponential convergence in cost described in [3], chapter 9:

Lemma 4 Given a convex function L where ∇L is Lipschitz continuous with constant H, define c(x) = λ‖x‖² + L(x), with minimizer w*. If η(2λ + H) ≤ 1, and η < 1, then for all w:

    d(w − η∇c(w), w*) ≤ (1 − 2ηλ) d(w, w*)    (3)

The proof is mostly in [8], but can be slightly generalized, making strict inequality conditions simply inequalities.

Proof of Theorem 2: Thus, if we set η = 1/(H + 2λ), then we know that we move exponentially fast. Define κ = H/λ to be the conditioning number of the system.² Define w_0 to be the initial position at the beginning of a phase. After T steps:

    d(w, w_{1...m'}) ≤ (1 − 2λ/(H + 2λ))^T d(w_0, w_{1...m'})    (4)
    d(w, w_{1...m'}) ≤ exp(−2λT/(H + 2λ)) d(w_0, w_{1...m'})    (5)

Suppose that we choose T_0 and the initial sample size m_initial such that d(w, w_{1...m'}) ≤ C/√m' after the first epoch. Then, since d(w_0, w_{1...m'}) ≤ C/√m' and d(w_{1...m'}, w*) ≤ C/√m' at the start of the epoch that grows the batch from m' to 2m':

    d(w_0, w_{1...2m'}) ≤ d(w_0, w_{1...m'}) + d(w_{1...m'}, w*) + d(w*, w_{1...2m'})    (6)
                       ≤ C/√m' + C/√m' + C/√(2m')    (7)
                       = (2 + 1/√2) C/√m'    (8)

Therefore, if T = ⌈(κ/2 + 1) ln(1 + 2√2)⌉ ≤ (κ/2 + 1) ln(1 + 2√2) + 1 < .68κ + 2.4, then at the end of the epoch, d(w, w_{1...2m'}) ≤ C/√(2m'), since exp(−2λT/(H + 2λ)) = exp(−T/(κ/2 + 1)) ≤ 1/(1 + 2√2) and (2 + 1/√2)/(1 + 2√2) = 1/√2.

Here is the key element: note that T only has a linear dependence on κ, and is independent of m, or even the final error of the algorithm. Moreover, a single iteration has a cost of m'd, and so a single epoch (except the first) has a cost of ⌈(κ/2 + 1) ln(1 + 2√2)⌉ m'd. Note that C > G/λ, so if m_initial = 1 and w_0 = 0, then ‖w_0 − w_{1...m_initial}‖ ≤
C/√m_initial. Therefore, we can choose T_0 < T, and push m_initial to be significantly higher than 1. So, the first epoch will not take longer than the others.
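The two numeric facts the proof leans on, that T = ⌈(κ/2 + 1) ln(1 + 2√2)⌉ steps contract the distance by at least the factor 1 + 2√2 ≈ 3.83, and that T < .68κ + 2.4, can be checked directly (a sketch under the step-size setting above):

```python
import math

def epoch_length(kappa):
    """T = ceil((kappa/2 + 1) * ln(1 + 2*sqrt(2))), as in Theorem 2."""
    return math.ceil((kappa / 2 + 1) * math.log(1 + 2 * math.sqrt(2)))

def contraction_after(kappa, T):
    """Distance contraction after T steps: with eta = 1/(H + 2*lam), each
    step contracts by (1 - 2*lam/(H + 2*lam)) = 1 - 1/(kappa/2 + 1)."""
    return (1 - 1 / (kappa / 2 + 1)) ** T
```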
Assume that m/m_initial is a power of 2. The computational cost of the algorithm will be:

    Cost ≤ (m_initial + 2m_initial + 4m_initial + ... + m)(.68κ + 2.4)d    (9)
         ≤ Σ_{t=0}^{∞} m 2^{−t} (.68κ + 2.4)d    (10)
         ≤ 2m(.68κ + 2.4)d.    (11)

This algorithm achieves a parameter distance of 2C/√m from the optimal. Using the bound on the Lipschitz constant of the gradient, this achieves a bound in cost¹ of 2(H + 2λ)C²/m. Note that we did not change m_initial and T_0 from the natural defaults of 1 and T. One other reason for increasing m_initial is to affect the conditioning number of the resulting problem. As the loss functions from many examples are averaged, the problem can become more well-conditioned. Moreover, in certain cases, one can empirically bound the value of H by one pass over the data where you average the examples, allowing you to adjust the number of iterations per epoch accordingly. See Appendix B. Running the batch gradient method to obtain this optimization error, referring to [1], or using the analysis here, requires at least on the order of κmd log(m/(4C²(H + 2λ))) (barring varying definitions of the conditioning number). Therefore, by scaling up the data set size, we have eliminated this last term from our bounds.

3 Conclusion

Most discussions of large-scale learning focus on parallelization across different machines. However, another direction is to push learning deeper into the memory hierarchy. This paper has argued that deeper in the memory hierarchy, the difference between sequential access time and random access time is very significant, making stochastic gradient and mini-batch approaches less attractive than they would otherwise be. One very powerful argument against batch is the overkill involved in looking at the gradient of every example at the initial point: this paper has to some degree addressed this issue. However, barring the issues of random access, stochastic gradient descent and mini-batch have very different properties, not to mention second order methods.
Overall, the larger question is: given a dataset that is shuffled once and accessed sequentially thereafter, what theoretical guarantees can be obtained? In other words, what guarantees can be obtained with pseudo-stochastic gradient descent, where the data is shuffled once, but then passed over multiple times? What if mini-batch is used? Theoretically speaking, Newton's method basically takes six steps once it has reached its asymptotic stage [3]. However, it can take a while to reach that phase: can this analysis apply in that phase? One reasonable argument for ignoring some of the practical issues with stochastic methods is the arrival of solid state drives, which have radically different performance characteristics than disk drives. But for the foreseeable future, a machine focused on processing data streaming from a disk will be a very cheap option. Secondly, the memory hierarchy itself presents interesting opportunities and challenges: it is conceivable that data could be accessed multiple times in memory while new data is being drawn from disk. These types of approaches are being enabled with tools like Torch [4], but there is little theoretical understanding. We make no claims that this is the first example of shuffling once and running many times. In fact, this represents a standard trick (e.g., it can be applied using Vowpal Wabbit [6]), but one with little theory supporting it. If the theory behind shuffling algorithms and pseudo-stochastic algorithms can be expanded, it will better connect the theory of machine learning optimization and its practice.

¹ Note that this is a slightly different bound than in [1]: note that C has a 1/λ term, so this bound is neither strictly better nor worse than the bounds in that paper.
² In this work, we define the conditioning number to be the ratio of the Lipschitz constant of the gradient to the strong convexity: in [1], they focus on a similar ratio in the vicinity of the optimal point, whereas [3] uses a measure similar to the one in this paper.
References

[1] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, 2008.
[2] Léon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors, Advanced Lectures on Machine Learning, 2004.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.
[4] R. Collobert, S. Bengio, L. Bottou, J. Weston, and I. Melvin. Torch 5.
[5] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.
[6] J. Langford. Vowpal wabbit.
[7] N. Schraudolph and T. Graepel. Combining conjugate direction methods with stochastic approximation of gradients. In C. Bishop and B. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[8] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems.
A Non-Asymptotic Estimation Error

Lemma 5 (Hoeffding's Inequality) If for all i in 1 ... n, Pr(X_i ∈ [a_i, b_i]) = 1, and X̄ = (X_1 + ... + X_n)/n, then

    Pr(|X̄ − E[X̄]| ≥ t) ≤ 2 exp(−2n²t² / Σ_{i=1}^{n} (b_i − a_i)²).    (12)

We will also use a basic result from strong convexity:

Lemma 6 ([3], Equation 9.11) If x* is the minimum of f, and f is h-strongly convex, then ‖x − x*‖ ≤ (2/h)‖∇f(x)‖.

An interesting issue is establishing how accurate we are. Suppose that we know that, regardless of the dataset sampled, we have a lower bound on the curvature. Define f_{1...m} : R^d → R to be the cost function with m samples, as above.

Lemma 1 With probability at least 1 − δ,

    ‖w_{1...m} − w*‖ ≤ √( (ln(2d) + ln(1/δ)) 8dG² / (λ²m) ).

Proof: Since w* minimizes f_D, ∇f_D(w*) = 0. Therefore, if L is drawn from D, then the expected gradient of λ‖w‖² + L(w) at w* is zero. Thus, the ith component of the gradient, 2λw*_i + ∇L(w*)_i, has expected value zero, and is between −(G + 2λ|w*_i|) and G + 2λ|w*_i|; since the expectation is zero, 2λ|w*_i| ≤ G, so each component lies in an interval of width at most 4G. Thus, we can use Hoeffding bounds on the average, for any dimension i:

    Pr(∇f_{1...m}(w*)_i ≥ t) ≤ exp(−mt²/(8G²))    (13)

Then, using union bounds:

    Pr(|∇f_{1...m}(w*)_i| ≥ t) ≤ 2 exp(−mt²/(8G²))    (14)
    Pr(∃i, |∇f_{1...m}(w*)_i| ≥ t) ≤ 2d exp(−mt²/(8G²))    (15)
    Pr(‖∇f_{1...m}(w*)‖ ≥ t√d) ≤ 2d exp(−mt²/(8G²))    (16)

Now, from Lemma 6, since f_{1...m} is 2λ-strongly convex, we can use ‖∇f_{1...m}(w*)‖ to bound the distance between w* and the optimal solution of f_{1...m}. Formally,

    ‖w* − w_{1...m}‖ ≤ (1/λ)‖∇f_{1...m}(w*)‖    (17)

Therefore, we push this into our previous results:

    Pr((1/λ)‖∇f_{1...m}(w*)‖ ≥ t√d/λ) ≤ 2d exp(−mt²/(8G²))    (18)
    Pr(‖w* − w_{1...m}‖ ≥ t√d/λ) ≤ 2d exp(−mt²/(8G²))    (19)

Setting ɛ = t√d/λ, and substituting ɛλ/√d for t:

    Pr(‖w* − w_{1...m}‖ ≥ ɛ) ≤ 2d exp(−mɛ²λ²/(8dG²))    (20)

Therefore, setting

    δ = 2d exp(−mɛ²λ²/(8dG²)),    (21)

we now have established:

    Pr(‖w* − w_{1...m}‖ ≥ ɛ) ≤ δ    (22)
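As an aside, Lemma 5's tail bound can be sanity-checked empirically (illustrative only, not part of the proof):

```python
import math
import random

def hoeffding_bound(n, t, width):
    """Right-hand side of Lemma 5 when each X_i lies in an interval of the
    given width: 2 * exp(-2 * n * t^2 / width^2)."""
    return 2 * math.exp(-2 * n * t * t / (width * width))

def empirical_tail(n, t, trials, seed=0):
    """Fraction of trials in which |mean of n Uniform(0,1) draws - 1/2| >= t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        hits += abs(mean - 0.5) >= t
    return hits / trials
```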
Therefore, solving for ɛ:

    ln δ = ln 2 + ln d − mɛ²λ²/(8dG²)    (23)
    mɛ²λ²/(8dG²) = ln(2d) + ln(1/δ)    (24)
    ɛ² = (ln(2d) + ln(1/δ)) 8dG²/(λ²m)    (25)
    ɛ = √( (ln(2d) + ln(1/δ)) 8dG² / (λ²m) )    (26)

B Bounding H

One drawback of this approach is that it requires deep knowledge of the parameters of the system, such as the maximum gradient and the Lipschitz constant of the derivative of the average of the losses. It turns out, it is fairly easy to bound H for the whole dataset in a single pass. Suppose that you consider the maximum curvature of the average of a bunch of loss functions. Suppose that these loss functions are of the form:

    L_i(w) = l(w · x_i, y_i).    (27)

We assume that the Lipschitz constant of the derivative of the loss function l in the first term is bounded above by H, and for all i, ‖x_i‖ = 1. First, we are trying to bound the curvature of:

    L_{1...m}(w) = (1/m) Σ_{i=1}^{m} L_i(w)    (28)

Define x̄ = (1/m) Σ_{i=1}^{m} x_i.

Lemma 7 If for all i, for all j, (x_i)_j ≥ 0, then the Lipschitz constant of ∇L_{1...m} is bounded above by H‖x̄‖.

Proof: We know that:

    ∇L_i(w) = l'(w · x_i, y_i) x_i    (29)

where l' denotes the derivative of l with respect to its first argument. Therefore:

    ‖∇L_i(w) − ∇L_i(w')‖ = |l'(w · x_i, y_i) − l'(w' · x_i, y_i)| ‖x_i‖    (30)

Using the Lipschitz bound on the derivative of l and the bound ‖x_i‖ = 1 yields:

    ‖∇L_i(w) − ∇L_i(w')‖ ≤ H |w · x_i − w' · x_i| = H |(w − w') · x_i|    (31)

Summing over all i:

    ‖∇L_{1...m}(w) − ∇L_{1...m}(w')‖ ≤ (1/m) Σ_{i=1}^{m} ‖∇L_i(w) − ∇L_i(w')‖ ≤ (H/m) Σ_{i=1}^{m} |(w − w') · x_i|    (32)

Therefore, the Lipschitz constant of ∇L_{1...m} is at most:

    max_{w ≠ w'} (H/m) Σ_{i=1}^{m} |(w − w') · x_i| / ‖w − w'‖    (33)

Suppose now that for all i and j, (x_i)_j ≥ 0. Then, if we wish to upper bound this quantity, we can assume without loss of generality that w_j − w'_j ≥ 0 for all j, and remove the absolute values. Replacing w − w' with a vector v, and using Cauchy-Schwarz:

    max_{v ≠ 0} (H/m) Σ_{i=1}^{m} v · x_i / ‖v‖ = max_{v ≠ 0} H (v · x̄) / ‖v‖ ≤ H‖x̄‖    (34)

C One Shuffle Mini-Batch

The impractical aspect of applying mini-batch on a disk drive is grabbing a bunch of independent elements. If one first divides the data into mini-batches, and these mini-batches are sufficiently large that seek time does not dominate the time to read the batch data, then random access at the granularity of whole mini-batches is no longer prohibitive.

Algorithm 3 One Shuffle Mini-Batch
  Input: cost functions L_1 ... L_m, mini-batch size k, iterations T, step size η, initial w ∈ R^n.
  If not already shuffled, shuffle L_1 ... L_m.
  for t = 1 to T do
    Draw i from {0, ..., m/k − 1}.
    w ← w − η∇f_{ik+1...(i+1)k}(w)

The average of the gradients of the mini-batches is still correct, so running stochastic gradient descent on them with a sufficiently small step size will work well. However, what is difficult is understanding what kind of win is obtained in terms of the variances of the gradients; moreover, it may be necessary to bound this over the entire domain of the parameters, which might prove difficult.
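Algorithm 3 might be sketched as follows in NumPy; the least-squares loss and the batch-size symbol k are illustrative assumptions:

```python
import numpy as np

def one_shuffle_minibatch(X, y, w, eta, lam, k, T, seed=0):
    """One Shuffle Mini-Batch (Algorithm 3): shuffle once, then repeatedly
    draw whole contiguous mini-batches at random.

    Each step draws a batch index i and descends f_{ik+1...(i+1)k}, i.e.
    lam*||w||^2 plus the average loss over that batch. Because batches are
    contiguous, reads within a batch are sequential, so disk seek time is
    paid once per batch rather than once per example."""
    rng = np.random.default_rng(seed)
    m = len(X)
    perm = rng.permutation(m)          # the single shuffle
    X, y = X[perm], y[perm]
    for _ in range(T):
        i = rng.integers(m // k)       # draw i from {0, ..., m/k - 1}
        Xb, yb = X[i * k:(i + 1) * k], y[i * k:(i + 1) * k]
        resid = Xb @ w - yb            # least-squares loss, for illustration
        w = w - eta * (2 * lam * w + 2 * (Xb.T @ resid) / k)
    return w
```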
More informationSupplement to: Subsampling Methods for Persistent Homology
Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation
More informationOn the Impact of Kernel Approximation on Learning Accuracy
On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu
More informationMeasuring Temperature with a Silicon Diode
Measuring Teperature with a Silicon Diode Due to the high sensitivity, nearly linear response, and easy availability, we will use a 1N4148 diode for the teperature transducer in our easureents 10 Analysis
More informationWhen Short Runs Beat Long Runs
When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long
More informationIEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 60, NO. 2, FEBRUARY ETSP stands for the Euclidean traveling salesman problem.
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 60, NO., FEBRUARY 015 37 Target Assignent in Robotic Networks: Distance Optiality Guarantees and Hierarchical Strategies Jingjin Yu, Meber, IEEE, Soon-Jo Chung,
More informationPattern Recognition and Machine Learning. Artificial Neural networks
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial
More informationPolygonal Designs: Existence and Construction
Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G
More informationSupport Vector Machines. Maximizing the Margin
Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the
More informationThe Weierstrass Approximation Theorem
36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined
More informationarxiv: v3 [cs.lg] 7 Jan 2016
Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co
More informationA Sequential Dual Method for Large Scale Multi-Class Linear SVMs
A Sequential Dual Method for Large Scale Multi-Class Linear SVMs Kai-Wei Chang Dept. of Coputer Science National Taiwan University Taipei 106, Taiwan b92084@csie.ntu.edu.tw S. Sathiya Keerthi Yahoo! Research
More informationBayes Decision Rule and Naïve Bayes Classifier
Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.
More informationtime time δ jobs jobs
Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor
More informationAsynchronous Gossip Algorithms for Stochastic Optimization
Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu
More informationInteractive Markov Models of Evolutionary Algorithms
Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary
More informationSupport Vector Machines. Goals for the lecture
Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed
More informationma x = -bv x + F rod.
Notes on Dynaical Systes Dynaics is the study of change. The priary ingredients of a dynaical syste are its state and its rule of change (also soeties called the dynaic). Dynaical systes can be continuous
More informationSupplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data
Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse
More informationASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical
IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul
More informationA Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science
A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest
More informationMSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE
Proceeding of the ASME 9 International Manufacturing Science and Engineering Conference MSEC9 October 4-7, 9, West Lafayette, Indiana, USA MSEC9-8466 MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution
More informationConsistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material
Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable
More informationRobustness and Regularization of Support Vector Machines
Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA
More informationDEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS
ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved
More informationFairness via priority scheduling
Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation
More informationA Note on the Applied Use of MDL Approximations
A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention
More informationOn Conditions for Linearity of Optimal Estimation
On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at
More informationApproximation in Stochastic Scheduling: The Power of LP-Based Priority Policies
Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing
More informationCHAPTER 19: Single-Loop IMC Control
When I coplete this chapter, I want to be able to do the following. Recognize that other feedback algoriths are possible Understand the IMC structure and how it provides the essential control features
More informationSome Perspective. Forces and Newton s Laws
Soe Perspective The language of Kineatics provides us with an efficient ethod for describing the otion of aterial objects, and we ll continue to ake refineents to it as we introduce additional types of
More informationSupport Vector Machines MIT Course Notes Cynthia Rudin
Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance
More informationEstimating Parameters for a Gaussian pdf
Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3
More informationDetection and Estimation Theory
ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer
More informationEnsemble Based on Data Envelopment Analysis
Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807
More informationIn this chapter, we consider several graph-theoretic and probabilistic models
THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions
More informationLecture October 23. Scribes: Ruixin Qiang and Alana Shine
CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single
More information