On the Generalization Ability of Online Strongly Convex Programming Algorithms

Sham M. Kakade
TTI Chicago
Chicago, IL 60637
sham@tti-c.org

Ambuj Tewari
TTI Chicago
Chicago, IL 60637
tewari@tti-c.org

Abstract

This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of PEGASOS (with high probability), a recently proposed method for solving the SVM optimization problem.

1 Introduction

Online regret minimizing algorithms provide some of the most successful algorithms for many machine learning problems, both in terms of the speed of optimization and the quality of generalization. Notable examples include efficient learning algorithms for structured prediction [Collins, 2002] (an algorithm now widely used) and for ranking problems [Crammer et al., 2006] (providing competitive results with a fast implementation).

Online convex optimization is a sequential paradigm in which at each round, the learner predicts a vector $w_t \in S \subset \mathbb{R}^n$, nature responds with a convex loss function, $\ell_t$, and the learner suffers loss $\ell_t(w_t)$. In this setting, the goal of the learner is to minimize the regret:

$$\sum_{t=1}^{T} \ell_t(w_t) - \min_{w \in S} \sum_{t=1}^{T} \ell_t(w)$$

which is the difference between his cumulative loss and the cumulative loss of the optimal fixed vector. Typically, these algorithms are used to train a learning algorithm incrementally, by sequentially feeding the algorithm a data sequence, $(X_1, Y_1), \ldots, (X_T, Y_T)$ (generated in an i.i.d. manner).
In essence, the loss function used in the above paradigm at time $t$ is $\ell(w; (X_t, Y_t))$, and this leads to a guaranteed bound on the regret:

$$\mathrm{Reg}_T = \sum_{t=1}^{T} \ell(w_t; (X_t, Y_t)) - \min_{w \in S} \sum_{t=1}^{T} \ell(w; (X_t, Y_t)).$$

However, in the batch setting, we are typically interested in finding a parameter $\hat{w}$ with good generalization ability, i.e. we would like

$$R(\hat{w}) - \min_{w \in S} R(w)$$

to be small, where $R(w) := E[\ell(w; (X, Y))]$ is the risk.

Intuitively, it seems plausible that low regret on an i.i.d. sequence should imply good generalization performance. In fact, for most of the empirically successful online algorithms, we have a set of techniques to understand the generalization performance of these algorithms on new data via online to batch conversions: the conversions relate the regret of the algorithm (on past data) to the generalization performance (on future data). These include cases which are tailored to general convex functions [Cesa-Bianchi et al., 2004] (whose regret is $O(\sqrt{T})$) and mistake bound settings [Cesa-Bianchi and Gentile, 2008] (where the regret could be $O(1)$ under separability assumptions). In these conversions, we typically choose $\hat{w}$ to be the average of the $w_t$ produced by our online algorithm.

Recently, there has been a growing body of work providing online algorithms for strongly convex loss functions (i.e. $\ell_t$ is strongly convex), with regret guarantees that are merely $O(\ln T)$. Such algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex, either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. $L_2$ or KL based regularization). Note that in the latter case, the loss function itself may only be convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem; e.g. the SVM optimization problem uses the hinge loss with $L_2$ regularization. In fact, for this case, the PEGASOS algorithm of Shalev-Shwartz et al. [2007], based on the online strongly convex programming algorithm of Hazan et al. [2006], is a state-of-the-art SVM solver. Also, Ratliff et al. [2007] provide a similar subgradient method for max-margin based structured prediction, which also has favorable empirical performance.
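To make the online-to-batch idea concrete, the following is a minimal sketch of projected online gradient descent with step size $1/(\nu t)$, the step-size schedule commonly associated with logarithmic-regret methods for strongly convex losses, followed by averaging of the iterates. The one-dimensional quadratic loss $f(w; z) = \frac{\nu}{2}(w - z)^2$, the interval domain, and all function names are illustrative assumptions and are not taken from the paper.

```python
import random


def project(w: float, radius: float) -> float:
    """Project w onto the interval [-radius, radius] (the assumed domain S)."""
    return max(-radius, min(radius, w))


def online_gradient_descent(zs, nu: float = 1.0, radius: float = 1.0):
    """Projected OGD with step 1/(nu*t) on the assumed loss (nu/2)*(w - z)^2.

    Returns the list of iterates w_1, ..., w_T and their average.
    """
    w = 0.0
    iterates = []
    for t, z in enumerate(zs, start=1):
        iterates.append(w)            # w_t is committed before z_t is revealed
        grad = nu * (w - z)           # gradient of the quadratic loss at w_t
        w = project(w - grad / (nu * t), radius)
    w_bar = sum(iterates) / len(iterates)
    return iterates, w_bar


def cumulative_regret(iterates, zs, nu: float = 1.0, radius: float = 1.0) -> float:
    """Cumulative loss of the iterates minus that of the best fixed point in S."""
    def loss(w, z):
        return 0.5 * nu * (w - z) ** 2

    alg_loss = sum(loss(w, z) for w, z in zip(iterates, zs))
    # For this 1-D quadratic loss the best fixed point is the projected sample mean.
    w_best = project(sum(zs) / len(zs), radius)
    return alg_loss - sum(loss(w_best, z) for z in zs)


rng = random.Random(0)
data = [rng.uniform(0.0, 0.6) for _ in range(2000)]
ws, w_avg = online_gradient_descent(data)
print("averaged iterate:", w_avg, "regret:", cumulative_regret(ws, data))
```

The averaged iterate plays the role of $\hat{w}$ in the conversions discussed above; the regret is computed against the best fixed point in hindsight, which for this toy loss is available in closed form.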
The aim of this paper is to examine the generalization properties of online convex programming algorithms when the loss function is strongly convex (where strong convexity can be defined in a general sense, with respect to some arbitrary norm $\|\cdot\|$). Suppose we have an online algorithm which has some guaranteed cumulative regret bound $\mathrm{Reg}_T$ (e.g. say $\mathrm{Reg}_T \le \ln T$ with $T$ samples). Then a corollary of our main result shows that with probability greater than $1 - \delta \ln T$, we obtain a parameter $\hat{w}$ from our online algorithm such that:

$$R(\hat{w}) - \min_{w} R(w) \le \frac{\mathrm{Reg}_T}{T} + O\left( \frac{\sqrt{\mathrm{Reg}_T \ln\frac{1}{\delta}}}{T} + \frac{\ln\frac{1}{\delta}}{T} \right).$$

Here, the constants hidden in the $O$-notation are determined by the Lipschitz constant and the strong convexity parameter of the loss $\ell$. Importantly, note that the correction term is of lower order than the regret: if the regret is $\ln T$ then the additional penalty is $O(\sqrt{\ln T}/T)$. If one naively uses the Hoeffding-Azuma methods in Cesa-Bianchi et al. [2004], one would obtain a significantly worse penalty of $O(1/\sqrt{T})$.

This result solves an open problem in Shalev-Shwartz et al. [2007], which was on characterizing the convergence rate of the PEGASOS algorithm, with high probability. PEGASOS is an online strongly convex programming algorithm for the SVM objective function; it repeatedly (and randomly) subsamples the training set in order to minimize the empirical SVM objective function. A corollary to this work essentially shows that the convergence rate of PEGASOS (as a randomized optimization algorithm) is concentrated rather sharply.

Ratliff et al. [2007] also provide an online algorithm (based on Hazan et al. [2006]) for max-margin based structured prediction. Our results are also directly applicable in providing a sharper concentration result in their setting (in particular, see the regret bound in their Equation 5, to which our results can be applied).
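As a rough numerical illustration of the rates mentioned above, the sketch below compares the average regret $\ln T / T$, the lower-order correction $\sqrt{\ln T}/T$, and the $1/\sqrt{T}$ penalty of a naive Hoeffding-Azuma argument. All constants are set to 1, which is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math


def average_regret(T: int) -> float:
    """Leading term Reg_T / T when Reg_T = ln T (constants assumed to be 1)."""
    return math.log(T) / T


def penalty_sharp(T: int) -> float:
    """Correction of order sqrt(ln T) / T (constants assumed to be 1)."""
    return math.sqrt(math.log(T)) / T


def penalty_hoeffding_azuma(T: int) -> float:
    """Correction of order 1 / sqrt(T) (constants assumed to be 1)."""
    return 1.0 / math.sqrt(T)


for T in (10**2, 10**4, 10**6):
    print(T, average_regret(T), penalty_sharp(T), penalty_hoeffding_azuma(T))
```

For every $T$ in the loop the sharper correction is smaller than the average regret itself, while the $1/\sqrt{T}$ penalty dominates both, which is the qualitative point of the comparison.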
This paper continues the line of research initiated by several researchers [Littlestone, 1989, Cesa-Bianchi et al., 2004, Zhang, 2005, Cesa-Bianchi and Gentile, 2008] which looks at how to convert online algorithms into batch algorithms with provable guarantees. Cesa-Bianchi and Gentile [2008] prove faster rates in the case when the cumulative loss of the online algorithm is small. Here, we are interested in the case where the cumulative regret is small. The work of Zhang [2005] is closest to ours. Zhang [2005] explicitly goes via the exponential moment method to derive sharper concentration results. In particular, for the regression problem with squared loss, Zhang [2005] gives a result similar to ours (see Theorem 8 therein). The present work can also be seen as generalizing his result to the case where we have strong convexity with respect to a general norm. Coupled with recent advances in low regret algorithms in this setting, we are able to provide a result that holds more generally.

Our key technical tool is a probabilistic inequality due to Freedman [Freedman, 1975]. This, combined with a variance bound (Lemma 1) that follows from our assumptions about the loss function, allows us to derive our main result (Theorem 2). We then apply it to statistical learning with bounded loss, and to PEGASOS in Section 4.

2 Setting

Fix a compact convex subset $S$ of some space equipped with a norm $\|\cdot\|$. Let $\|\cdot\|_*$ be the dual norm defined by $\|v\|_* := \sup_{w : \|w\| \le 1} v \cdot w$. Let $Z$ be a random variable taking values in some space $\mathcal{Z}$. Our goal is to minimize $F(w) := E[f(w; Z)]$ over $w \in S$. Here, $f : S \times \mathcal{Z} \to [0, B]$ is some function satisfying the following assumption.

Assumption LIST. (LIpschitz and STrongly convex assumption) For all $z \in \mathcal{Z}$, the function $f_z(w) = f(w; z)$ is convex in $w$ and satisfies:

1. $f_z$ has Lipschitz constant $L$ w.r.t. the norm $\|\cdot\|$, i.e. $\forall w \in S$, $\forall g \in \partial f_z(w)$ ($\partial f_z$ denotes the subdifferential of $f_z$), $\|g\|_* \le L$. Note that this assumption implies $\forall w, w' \in S$, $|f_z(w) - f_z(w')| \le L \|w - w'\|$.

2. $f_z$ is $\nu$-strongly convex w.r.t. $\|\cdot\|$, i.e. $\forall \theta \in [0, 1]$, $\forall w, w' \in S$,

$$f_z(\theta w + (1 - \theta) w') \le \theta f_z(w) + (1 - \theta) f_z(w') - \frac{\nu}{2}\, \theta (1 - \theta) \|w - w'\|^2.$$

Denote the minimizer of $F$ by $w^*$, $w^* := \arg\min_{w \in S} F(w)$. We consider an online setting in which independent (but not necessarily identically distributed) random variables $Z_1, \ldots, Z_T$ become available to us in that order. These have the property that $\forall t$, $\forall w \in S$, $E[f(w; Z_t)] = F(w)$.

Now consider an algorithm that starts out with some $w_1$ and at time $t$, having seen $Z_t$, updates the parameter $w_t$ to $w_{t+1}$. Let $E_{t-1}[\cdot]$ denote conditional expectation w.r.t. $Z_1, \ldots, Z_{t-1}$. Note that $w_t$ is measurable w.r.t. $Z_1, \ldots, Z_{t-1}$ and hence $E_{t-1}[f(w_t; Z_t)] = F(w_t)$.

Define the statistics,

$$\mathrm{Reg}_T := \sum_{t=1}^{T} f(w_t; Z_t) - \min_{w \in S} \sum_{t=1}^{T} f(w; Z_t),$$

$$\mathrm{Diff}_T := \sum_{t=1}^{T} \left( F(w_t) - F(w^*) \right) = \sum_{t=1}^{T} F(w_t) - T F(w^*).$$

Define the sequence of random variables

$$\xi_t := F(w_t) - F(w^*) - \left( f(w_t; Z_t) - f(w^*; Z_t) \right). \qquad (1)$$
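Since $E_{t-1}[f(w_t; Z_t)] = F(w_t)$ and $E_{t-1}[f(w^*; Z_t)] = F(w^*)$, $\xi_t$ is a martingale difference sequence. This definition needs some explanation as it is important to look at the right martingale difference sequence to derive the results we want. Even under assumption LIST, $\frac{1}{T}\sum_t f(w_t; Z_t)$ and $\frac{1}{T}\sum_t f(w^*; Z_t)$ will not be concentrated around $\frac{1}{T}\sum_t F(w_t)$ and $F(w^*)$ respectively at a rate better than $O(1/\sqrt{T})$ in general. But if we look at the difference, we are able to get sharper concentration.

3 A General Online to Batch Conversion

The following simple lemma is crucial for us. It says that under assumption LIST, the variance of the increment in the regret $f(w_t; Z_t) - f(w^*; Z_t)$ is bounded, up to the factor $4L^2/\nu$, by its (conditional) expectation $F(w_t) - F(w^*)$. Such a control on the variance is often the main ingredient in obtaining sharper concentration results.

Lemma 1. Suppose assumption LIST holds and let $\xi_t$ be the martingale difference sequence defined in (1). Let $\mathrm{Var}_{t-1}\, \xi_t := E_{t-1}[\xi_t^2]$ be the conditional variance of $\xi_t$ given $Z_1, \ldots, Z_{t-1}$. Then, under assumption LIST, we have,

$$\mathrm{Var}_{t-1}\, \xi_t \le \frac{4L^2}{\nu} \left( F(w_t) - F(w^*) \right).$$

The variance bound given by the above lemma allows us to prove our main theorem.

Theorem 2. Under assumption LIST, we have, with probability at least $1 - 4\ln(T)\delta$,

$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^*) \le \frac{\mathrm{Reg}_T}{T} + 4\sqrt{\frac{L^2 \ln(1/\delta)}{\nu}}\, \frac{\sqrt{\mathrm{Reg}_T}}{T} + \max\left\{ \frac{16L^2}{\nu},\, 6B \right\} \frac{\ln(1/\delta)}{T}.$$

Further, using Jensen's inequality, $\frac{1}{T}\sum_t F(w_t)$ can be replaced by $F(\bar{w})$ where $\bar{w} := \frac{1}{T}\sum_t w_t$.

3.1 Proofs

Proof of Lemma 1. The conditional variance of $\xi_t$ equals that of $f(w_t; Z_t) - f(w^*; Z_t)$, which is at most its conditional second moment. We therefore have,

$$\mathrm{Var}_{t-1}\, \xi_t \le E_{t-1}\left[ \left( f(w_t; Z_t) - f(w^*; Z_t) \right)^2 \right] \le E_{t-1}\left[ L^2 \|w_t - w^*\|^2 \right] = L^2 \|w_t - w^*\|^2, \qquad (2)$$

where the second inequality uses part 1 of assumption LIST. On the other hand, using part 2 of assumption LIST (with $\theta = 1/2$), we also have for any $w, w' \in S$,

$$\frac{f(w; Z) + f(w'; Z)}{2} \ge f\left( \frac{w + w'}{2}; Z \right) + \frac{\nu}{8} \|w - w'\|^2.$$

Taking expectation this gives, for any $w, w' \in S$,

$$\frac{F(w) + F(w')}{2} \ge F\left( \frac{w + w'}{2} \right) + \frac{\nu}{8} \|w - w'\|^2.$$

Now using this with $w = w_t$, $w' = w^*$, we get

$$\frac{F(w_t) + F(w^*)}{2} \ge F\left( \frac{w_t + w^*}{2} \right) + \frac{\nu}{8} \|w_t - w^*\|^2 \ge F(w^*) + \frac{\nu}{8} \|w_t - w^*\|^2,$$

where the last step holds because $w^*$ minimizes $F$. This implies that

$$\|w_t - w^*\|^2 \le \frac{4 \left( F(w_t) - F(w^*) \right)}{\nu}. \qquad (3)$$

Combining (2) and (3) we get,

$$\mathrm{Var}_{t-1}\, \xi_t \le \frac{4L^2}{\nu} \left( F(w_t) - F(w^*) \right).$$

The proof of Theorem 2 relies on Lemma 3 below, an inequality for martingales which is an easy consequence of Freedman's inequality [Freedman, 1975, Theorem 1.6]. The proof of Lemma 3 can be found in the appendix.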
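To see how the terms of the Theorem 2 bound scale, the following sketch evaluates its right-hand side and the associated failure probability $4\ln(T)\delta$ for given values of $\mathrm{Reg}_T$, $T$, $L$, $\nu$, $B$ and $\delta$. The function name and the input checks ($\delta < 1/e$ and $T \ge 3$, the conditions under which Lemma 3 is stated) are illustrative choices, not part of the paper.

```python
import math


def theorem2_bound(reg_T: float, T: int, L: float, nu: float, B: float, delta: float):
    """Evaluate the right-hand side of the Theorem 2 excess-risk bound.

    Returns (bound, failure_probability), where the bound holds with
    probability at least 1 - failure_probability.
    """
    if T < 3 or not (0.0 < delta < 1.0 / math.e):
        raise ValueError("requires T >= 3 and 0 < delta < 1/e")
    log_term = math.log(1.0 / delta)
    leading = reg_T / T
    cross = 4.0 * math.sqrt(L ** 2 * log_term / nu) * math.sqrt(reg_T) / T
    tail = max(16.0 * L ** 2 / nu, 6.0 * B) * log_term / T
    return leading + cross + tail, 4.0 * math.log(T) * delta


T = 10_000
bound, fail = theorem2_bound(reg_T=math.log(T), T=T, L=1.0, nu=1.0, B=1.0, delta=1e-6)
print(f"excess risk <= {bound:.5f} with probability >= {1.0 - fail:.6f}")
```

With a logarithmic regret, all three terms decay like $1/T$ up to logarithmic factors, so the bound above is far below the $1/\sqrt{T} = 0.01$ scale at $T = 10^4$ only once $T$ grows further; the constants $16L^2/\nu$ and $6B$ dominate at moderate sample sizes.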

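Lemma 3. Suppose $X_1, \ldots, X_T$ is a martingale difference sequence with $|X_t| \le b$. Let $\mathrm{Var}_{t-1} X_t = \mathrm{Var}(X_t \mid X_1, \ldots, X_{t-1})$. Let $V = \sum_{t=1}^{T} \mathrm{Var}_{t-1} X_t$ be the sum of conditional variances of the $X_t$'s. Further, let $\sigma = \sqrt{V}$. Then we have, for any $\delta < 1/e$ and $T \ge 3$,

$$\mathrm{Prob}\left( \sum_{t=1}^{T} X_t > \max\left\{ 2\sigma,\, 3b\sqrt{\ln(1/\delta)} \right\} \sqrt{\ln(1/\delta)} \right) \le 4\ln(T)\delta.$$

Proof of Theorem 2. By Lemma 1, we have

$$\sigma := \sqrt{\sum_{t=1}^{T} \mathrm{Var}_{t-1}\, \xi_t} \le \sqrt{\frac{4L^2}{\nu}\, \mathrm{Diff}_T}.$$

Note that $|\xi_t| \le 2B$ because our $f$ has range $[0, B]$. Therefore, Lemma 3 gives us that with probability at least $1 - 4\ln(T)\delta$, we have

$$\sum_{t=1}^{T} \xi_t \le \max\left\{ 2\sigma,\, 6B\sqrt{\ln(1/\delta)} \right\} \sqrt{\ln(1/\delta)}.$$

By definition of $\mathrm{Reg}_T$,

$$\mathrm{Diff}_T - \mathrm{Reg}_T \le \sum_{t=1}^{T} \xi_t$$

and therefore, with probability at least $1 - 4\ln(T)\delta$, we have

$$\mathrm{Diff}_T - \mathrm{Reg}_T \le \max\left\{ 4\sqrt{\frac{L^2}{\nu}\, \mathrm{Diff}_T},\, 6B\sqrt{\ln(1/\delta)} \right\} \sqrt{\ln(1/\delta)}.$$

Using Lemma 4 below (with $s = \mathrm{Diff}_T$, $r = \mathrm{Reg}_T$, $d = L^2\ln(1/\delta)/\nu$, $b = B$ and $\Delta = \ln(1/\delta)$) to solve the above quadratic inequality for $\mathrm{Diff}_T$, and dividing by $T$, gives

$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^*) \le \frac{\mathrm{Reg}_T}{T} + 4\sqrt{\frac{L^2 \ln(1/\delta)}{\nu}}\, \frac{\sqrt{\mathrm{Reg}_T}}{T} + \max\left\{ \frac{16L^2}{\nu},\, 6B \right\} \frac{\ln(1/\delta)}{T}.$$

The following elementary lemma was required to solve a recursive inequality in the proof of the above theorem. Its proof can be found in the appendix.

Lemma 4. Suppose $s, r, d, b, \Delta \ge 0$ and we have

$$s - r \le \max\{ 4\sqrt{ds},\, 6b\Delta \}.$$

Then, it follows that

$$s \le r + 4\sqrt{dr} + \max\{ 16d,\, 6b\Delta \}.$$

4 Applications

4.1 Online to Batch Conversion for Learning with Bounded Loss

Suppose $(X_1, Y_1), \ldots, (X_T, Y_T)$ are drawn i.i.d. from a distribution. The pairs $(X_i, Y_i)$ belong to $\mathcal{X} \times \mathcal{Y}$ and our algorithm is allowed to make predictions in a space $D \supseteq \mathcal{Y}$. A loss function $\ell : D \times \mathcal{Y} \to [0, 1]$ measures the quality of predictions. Fix a convex set $S$ of some normed space and a function $h : \mathcal{X} \times S \to D$. Let our hypothesis class be $\{ x \mapsto h(x; w) \mid w \in S \}$.

On input $x$, the hypothesis parameterized by $w$ predicts $h(x; w)$ and incurs loss $\ell(h(x; w), y)$ if the correct prediction is $y$. The risk of $w$ is defined by

$$R(w) := E[\ell(h(X; w), Y)]$$

and let $w^* := \arg\min_{w \in S} R(w)$ denote the (parameter for) the hypothesis with minimum risk. It is easy to see that this setting falls under the general framework given above by thinking of the pair $(X, Y)$ as $Z$ and setting $f(w; Z) = f(w; (X, Y))$ to be $\ell(h(X; w), Y)$. Note that $F(w)$ becomes the risk $R(w)$. The range of $f$ is $[0, 1]$ by our assumption about the loss function, so $B = 1$.

Suppose we run an online algorithm on our data that generates a sequence of hypotheses $w_0, \ldots, w_T$ such that $w_t$ is measurable w.r.t. $X_{<t}, Y_{<t}$. Define the statistics,

$$\mathrm{Reg}_T := \sum_{t=1}^{T} \ell(h(X_t; w_t), Y_t) - \min_{w \in S} \sum_{t=1}^{T} \ell(h(X_t; w), Y_t),$$

$$\mathrm{Diff}_T := \sum_{t=1}^{T} \left( R(w_t) - R(w^*) \right) = \sum_{t=1}^{T} R(w_t) - T R(w^*).$$

At the end, we output $\bar{w} := \left( \sum_{t=1}^{T} w_t \right) / T$. The following corollary then follows immediately from Theorem 2. It bounds the excess risk $R(\bar{w}) - R(w^*)$.

Corollary 5. Suppose assumption LIST is satisfied for $f(w; (x, y)) := \ell(h(x; w), y)$. Then we have, with probability at least $1 - 4\ln(T)\delta$,

$$R(\bar{w}) - R(w^*) \le \frac{\mathrm{Reg}_T}{T} + 4\sqrt{\frac{L^2 \ln(1/\delta)}{\nu}}\, \frac{\sqrt{\mathrm{Reg}_T}}{T} + \max\left\{ \frac{16L^2}{\nu},\, 6 \right\} \frac{\ln(1/\delta)}{T}.$$

Recently, it has been proved [Kakade and Shalev-Shwartz, 2008] that if assumption LIST is satisfied for $w \mapsto \ell(h(x; w), y)$ then there is an online algorithm that generates $w_1, \ldots, w_T$ such that

$$\mathrm{Reg}_T \le \frac{L^2 (1 + \ln T)}{2\nu}.$$

Plugging it in the corollary above (and using $1 + \ln T \le 2\ln T$ for $T \ge 3$) gives the following result.

Corollary 6. Suppose assumption LIST is satisfied for $f(w; (x, y)) := \ell(h(x; w), y)$. Then there is an online algorithm that generates $w_1, \ldots, w_T$ and in the end outputs $\bar{w}$ such that, with probability at least $1 - 4\ln(T)\delta$,

$$R(\bar{w}) - R(w^*) \le \frac{L^2 \ln T}{\nu T} + \frac{4L^2 \sqrt{\ln T}}{\nu T} \sqrt{\ln\left(\frac{1}{\delta}\right)} + \max\left\{ \frac{16L^2}{\nu},\, 6 \right\} \frac{\ln(1/\delta)}{T},$$

for any $T \ge 3$.

4.2 High Probability Bound for PEGASOS

PEGASOS [Shalev-Shwartz et al., 2007] is a recently proposed method for solving the primal SVM problem. Recall that in the SVM optimization problem we are given $m$ example, label pairs $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$. Assume that $\|x_i\| \le R$ for all $i$ where $\|\cdot\|$ is the standard $L_2$ norm. Let

$$F(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \ell(w; (x_i, y_i)) \qquad (4)$$

be the SVM objective function. The loss function $\ell(w; (x, y)) = [1 - y(w \cdot x)]_+$ is the hinge loss.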
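The SVM objective (4) and the hinge loss can be written down directly. The sketch below is a plain-Python illustration using lists for vectors; the function names and the parameter name `lam` (standing for the regularization parameter $\lambda$) are assumptions made for this example.

```python
def hinge_loss(w, x, y):
    """Hinge loss [1 - y (w . x)]_+ for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)


def svm_objective(w, data, lam):
    """Regularized objective: (lam / 2) * ||w||^2 + average hinge loss over data."""
    regularizer = 0.5 * lam * sum(wi * wi for wi in w)
    average_loss = sum(hinge_loss(w, x, y) for x, y in data) / len(data)
    return regularizer + average_loss


toy_data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
print(svm_objective([0.0, 0.0], toy_data, lam=0.1))   # zero vector: every hinge term is 1
print(svm_objective([1.0, -1.0], toy_data, lam=0.1))  # both margins equal 1: only the regularizer remains
```

At $w = 0$ each example contributes a hinge loss of exactly 1, so the objective is 1 regardless of $\lambda$; this is also why the minimizer's objective value, and hence its norm, can be controlled in terms of $\lambda$.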
At time $t$, PEGASOS takes a (random) approximation

$$f(w; Z_t) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{k} \sum_{(x, y) \in Z_t} \ell(w; (x, y)),$$

of the SVM objective function to estimate the gradient and updates the current weight vector $w_t$ to $w_{t+1}$. Here $Z_t$ is a random subset of the data set of size $k$. Note that $F(w)$ can be written as

$$F(w) = E\left[ \frac{\lambda}{2} \|w\|^2 + \ell(w; Z) \right]$$

where $Z$ is an example $(x_i, y_i)$ drawn uniformly at random from the $m$ data points. It is also easy to verify that $\forall w$, $E[f(w; Z_t)] = F(w)$.

It can be shown that $w^* := \arg\min F(w)$ will satisfy $\|w^*\| \le 1/\sqrt{\lambda}$ so we set

$$S = \left\{ w \in \mathbb{R}^d : \|w\| \le \frac{1}{\sqrt{\lambda}} \right\}.$$

For any $z$ that is a subset of the data set, the function

$$w \mapsto f(w; z) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{|z|} \sum_{(x, y) \in z} \ell(w; (x, y))$$

is Lipschitz on $S$ with Lipschitz constant $L = \sqrt{\lambda} + R$ and is $\lambda$-strongly convex. Also $f(w; z) \in [0, 3/2 + R/\sqrt{\lambda}]$. So, the PEGASOS setting falls under our general framework and satisfies assumption LIST.

Theorem 1 in Shalev-Shwartz et al. [2007] says, for any $w$, $T \ge 3$,

$$\sum_{t=1}^{T} f(w_t; Z_t) \le \sum_{t=1}^{T} f(w; Z_t) + \frac{L^2 \ln T}{\lambda}, \qquad (5)$$

where $L = \sqrt{\lambda} + R$. It was noted in that paper that plugging in $w = w^*$ and taking expectations, we easily get

$$E_{Z_1, \ldots, Z_T}\left[ \sum_{t=1}^{T} F(w_t) \right] \le T F(w^*) + \frac{L^2 \ln T}{\lambda}.$$

Here we use Theorem 2 to prove an inequality that holds with high probability, not just in expectation.

Corollary 7. Let $F$ be the SVM objective function defined in (4) and $w_1, \ldots, w_T$ be the sequence of weight vectors generated by the PEGASOS algorithm. Further, let $w^*$ denote the minimizer of the SVM objective. Then, with probability $1 - 4\delta\ln(T)$, we have

$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^*) \le \frac{L^2 \ln T}{\lambda T} + \frac{4L^2 \sqrt{\ln T}}{\lambda T} \sqrt{\ln\left(\frac{1}{\delta}\right)} + \max\left\{ \frac{16L^2}{\lambda},\, 9 + \frac{6R}{\sqrt{\lambda}} \right\} \frac{\ln\left(\frac{1}{\delta}\right)}{T}, \qquad (6)$$

for any $T \ge 3$. Therefore, assuming $R = 1$, we have, for $\lambda$ small enough, with probability at least $1 - \delta$,

$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^*) = O\left( \frac{\ln\frac{T}{\delta}}{\lambda T} \right).$$

Proof. Note that (5) implies that $\mathrm{Reg}_T \le \frac{L^2 \ln T}{\lambda}$. The corollary then follows immediately from Theorem 2 by plugging in $\nu = \lambda$ and $B = 3/2 + R/\sqrt{\lambda}$.

References

N. Cesa-Bianchi and C. Gentile. Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386-390, 2008.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, September 2004.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 7:551-585, Mar 2006.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, Feb 1975.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

S. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. Advances in Neural Information Processing Systems, 2008.

N. Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, U. C. Santa Cruz, March 1989.

Nathan Ratliff, James (Drew) Bagnell, and Martin Zinkevich. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats), March 2007.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), pages 807-814, 2007.

T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the Eighteenth Annual Conference on Computational Learning Theory, pages 173-187, 2005.

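Appendix

Proof of Lemma 3. Note that a crude upper bound on $\mathrm{Var}_{t-1} X_t$ is $b^2$. Thus, $\sigma \le b\sqrt{T}$. We choose a discretization $0 = \alpha_{-1} < \alpha_0 < \ldots < \alpha_l$ such that $\alpha_{i+1} = r\alpha_i$ for $i \ge 0$ and $\alpha_l \ge b\sqrt{T}$. We will specify the choice of $\alpha_0$ and $r$ shortly. We then have, for any $c > 0$,

$$\begin{aligned}
\mathrm{Prob}&\left( \sum_t X_t > c \max\{ r\sigma, \alpha_0 \} \sqrt{\ln(1/\delta)} \right) \\
&= \sum_{j=0}^{l} \mathrm{Prob}\left( \sum_t X_t > c \max\{ r\sigma, \alpha_0 \} \sqrt{\ln(1/\delta)} \;\&\; \alpha_{j-1} < \sigma \le \alpha_j \right) \\
&\le \sum_{j=0}^{l} \mathrm{Prob}\left( \sum_t X_t > c\, \alpha_j \sqrt{\ln(1/\delta)} \;\&\; \alpha_{j-1}^2 < V \le \alpha_j^2 \right) \\
&\le \sum_{j=0}^{l} \mathrm{Prob}\left( \sum_t X_t > c\, \alpha_j \sqrt{\ln(1/\delta)} \;\&\; V \le \alpha_j^2 \right) \\
&\overset{(*)}{\le} \sum_{j=0}^{l} \exp\left( \frac{-c^2 \alpha_j^2 \ln(1/\delta)}{2\alpha_j^2 + \frac{2}{3}\, c\, \alpha_j \sqrt{\ln(1/\delta)}\, b} \right) \\
&= \sum_{j=0}^{l} \exp\left( \frac{-c^2 \alpha_j \ln(1/\delta)}{2\alpha_j + \frac{2}{3}\, c \sqrt{\ln(1/\delta)}\, b} \right)
\end{aligned}$$

where the inequality $(*)$ follows from Freedman's inequality. If we now choose $\alpha_0 = bc\sqrt{\ln(1/\delta)}$ then $\alpha_j \ge bc\sqrt{\ln(1/\delta)}$ for all $j$ and hence every term in the above summation is bounded by $\exp\left( \frac{-c^2 \ln(1/\delta)}{2 + 2/3} \right)$, which is less than $\delta$ if we choose $c = 5/3$. Set $r = 2/c = 6/5$. We want $\alpha_0 r^l \ge b\sqrt{T}$. Since $c\sqrt{\ln(1/\delta)} \ge 1$, choosing $l = \log_r(\sqrt{T})$ ensures that. Thus we have

$$\begin{aligned}
\mathrm{Prob}&\left( \sum_t X_t > \frac{5}{3} \max\left\{ \frac{6}{5}\sigma,\, \frac{5}{3} b \sqrt{\ln(1/\delta)} \right\} \sqrt{\ln(1/\delta)} \right) \\
&= \mathrm{Prob}\left( \sum_t X_t > c \max\{ r\sigma, \alpha_0 \} \sqrt{\ln(1/\delta)} \right) \\
&\le (l + 1)\delta = \left( \log_{6/5}(\sqrt{T}) + 1 \right)\delta \\
&\le \left( 6\ln(\sqrt{T}) + 1 \right)\delta \\
&\le 4\ln(T)\delta. \qquad (\because T \ge 3)
\end{aligned}$$

Since $\frac{5}{3} \cdot \frac{6}{5} = 2$ and $\left(\frac{5}{3}\right)^2 \le 3$, the event in the statement of the lemma is contained in the event above, which finishes the proof.

Proof of Lemma 4. The assumption of the lemma implies that one of the following inequalities holds:

$$s - r \le 6b\Delta, \qquad s - r \le 4\sqrt{ds}. \qquad (7)$$

In the second case, we have

$$\left( \sqrt{s} \right)^2 - \left( 4\sqrt{d} \right) \sqrt{s} - r \le 0$$

which means that $\sqrt{s}$ should be smaller than the larger root of the above quadratic. This gives us,

$$\begin{aligned}
s = \left( \sqrt{s} \right)^2 &\le \left( 2\sqrt{d} + \sqrt{4d + r} \right)^2 \\
&= 4d + 4d + r + 4\sqrt{4d^2 + dr} \\
&\le 8d + r + 8d + 4\sqrt{dr} \qquad [\because \sqrt{x + y} \le \sqrt{x} + \sqrt{y}] \\
&= r + 4\sqrt{dr} + 16d. \qquad (8)
\end{aligned}$$

Combining (7) and (8) finishes the proof.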
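The elementary claims used in this appendix can be spot-checked numerically. The sketch below tests the implication of Lemma 4 on randomly drawn non-negative inputs and checks the constants used in the proof of Lemma 3 ($c = 5/3$, $r = 2/c$). It is a sanity check under assumed sampling ranges, not a proof, and the function names and the small floating-point tolerance are illustrative.

```python
import math
import random


def lemma4_holds(s, r, d, b, Delta, tol=1e-9):
    """Check that the hypothesis of Lemma 4 implies its conclusion for one input."""
    hypothesis = s - r <= max(4.0 * math.sqrt(d * s), 6.0 * b * Delta)
    conclusion = s <= r + 4.0 * math.sqrt(d * r) + max(16.0 * d, 6.0 * b * Delta) + tol
    return (not hypothesis) or conclusion


def spot_check_lemma4(trials=100_000, seed=0):
    """Test Lemma 4 on random non-negative inputs drawn uniformly from [0, 10]."""
    rng = random.Random(seed)
    for _ in range(trials):
        s, r, d, b, Delta = (rng.uniform(0.0, 10.0) for _ in range(5))
        if not lemma4_holds(s, r, d, b, Delta):
            return False
    return True


def lemma3_constants_ok():
    """Check the numerical facts behind the choices c = 5/3 and r = 2/c."""
    c = 5.0 / 3.0
    r = 2.0 / c
    each_term_below_delta = c * c / (2.0 + 2.0 / 3.0) > 1.0   # exponent exceeds ln(1/delta)
    log_base_bound = 1.0 / math.log(r) <= 6.0                 # log_{6/5}(x) <= 6 ln(x) for x >= 1
    threshold_ok = c * c <= 3.0                               # (5/3)^2 b <= 3 b
    return each_term_below_delta and log_base_bound and threshold_ok


print(spot_check_lemma4(), lemma3_constants_ok())
```

A random search of this kind can only fail to find a counterexample; the algebra above remains the actual argument.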