On the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms

Sham M. Kakade
TTI Chicago
Chicago, IL
sham@tti-c.org

Ambuj Tewari
TTI Chicago
Chicago, IL
tewari@tti-c.org

Abstract

This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of PEGASOS (with high probability), a recently proposed method for solving the SVM optimization problem.

1 Introduction

Online regret minimizing algorithms provide some of the most successful algorithms for many machine learning problems, both in terms of the speed of optimization and the quality of generalization. Notable examples include efficient learning algorithms for structured prediction [Collins, 2002] (an algorithm now widely used) and for ranking problems [Crammer et al., 2006] (providing competitive results with a fast implementation).

Online convex optimization is a sequential paradigm in which, at each round, the learner predicts a vector w_t ∈ S ⊂ ℝ^n, nature responds with a convex loss function l_t, and the learner suffers loss l_t(w_t). In this setting, the goal of the learner is to minimize the regret:

    Σ_{t=1}^T l_t(w_t) − min_{w∈S} Σ_{t=1}^T l_t(w),

which is the difference between his cumulative loss and the cumulative loss of the optimal fixed vector. Typically, these algorithms are used to train a learning algorithm incrementally, by sequentially feeding the algorithm a data sequence (X_1, Y_1), ..., (X_T, Y_T) (generated in an i.i.d. manner).
In essence, the loss function used in the above paradigm at time t is l(w; (X_t, Y_t)), and this leads to a guaranteed bound on the regret:

    Reg := Σ_{t=1}^T l(w_t; (X_t, Y_t)) − min_{w∈S} Σ_{t=1}^T l(w; (X_t, Y_t)).

However, in the batch setting, we are typically interested in finding a parameter ŵ with good generalization ability, i.e. we would like the excess risk

    R(ŵ) − min_{w∈S} R(w)

to be small, where R(w) := E[l(w; (X, Y))] is the risk.
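To make these quantities concrete, here is a small self-contained sketch (our illustration, not from the paper) of projected online subgradient descent on a one-dimensional stream of squared losses, with the regret computed against the best fixed comparator found by a grid search. The data, the O(1/√t) step size, and the decision set S = [−1, 1] are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-d stream: squared losses l_t(w) = (w * x_t - y_t)^2 on S = [-1, 1].
T = 200
x = rng.normal(size=T)
y = 0.5 * x + 0.1 * rng.normal(size=T)

def loss(w, t):
    return (w * x[t] - y[t]) ** 2

def grad(w, t):
    return 2.0 * x[t] * (w * x[t] - y[t])

w, cum_loss = 0.0, 0.0
for t in range(T):
    cum_loss += loss(w, t)
    w -= grad(w, t) / np.sqrt(t + 1)  # O(1/sqrt(t)) step size for general convex losses
    w = float(np.clip(w, -1.0, 1.0))  # project back onto S

# Cumulative loss of the best fixed comparator in hindsight (grid search, illustration only).
grid = np.linspace(-1.0, 1.0, 2001)
best = ((grid[:, None] * x[None, :] - y[None, :]) ** 2).sum(axis=1).min()
regret = cum_loss - best
print(f"average regret Reg/T = {regret / T:.4f}")
```

For ν-strongly convex losses, replacing the 1/√t step size by 1/(νt) is what yields the logarithmic cumulative regret guarantees that this paper exploits.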
Intuitively, it seems plausible that low regret on an i.i.d. sequence should imply good generalization performance. In fact, for most of the empirically successful online algorithms, we have a set of techniques to understand the generalization performance of these algorithms on new data via "online to batch" conversions: the conversions relate the regret of the algorithm (on past data) to the generalization performance (on future data). These include cases which are tailored to general convex functions [Cesa-Bianchi et al., 2004] (whose regret is O(√T)) and mistake bound settings [Cesa-Bianchi and Gentile, 2008] (where the regret could be O(1) under separability assumptions). In these conversions, we typically choose ŵ to be the average of the w_t produced by our online algorithm.

Recently, there has been a growing body of work providing online algorithms for strongly convex loss functions (i.e. l_t is strongly convex), with regret guarantees that are merely O(ln T). Such algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex, either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. L_2 or KL based regularization). Note that in the latter case the loss function itself may be only convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem; e.g. the SVM optimization problem uses the hinge loss with L_2 regularization. In fact, for this case, the PEGASOS algorithm of Shalev-Shwartz et al. [2007], based on the online strongly convex programming algorithm of Hazan et al. [2006], is a state-of-the-art SVM solver. Also, Ratliff et al. [2007] provide a similar subgradient method for max-margin based structured prediction, which also has favorable empirical performance.
The aim of this paper is to examine the generalization properties of online convex programming algorithms when the loss function is strongly convex (where strong convexity can be defined in a general sense, with respect to some arbitrary norm ‖·‖). Suppose we have an online algorithm which has some guaranteed cumulative regret bound Reg (e.g. say Reg ≤ ln T with T samples). Then a corollary of our main result shows that, with probability greater than 1 − δ ln T, we obtain a parameter ŵ from our online algorithm such that:

    R(ŵ) − min_{w∈S} R(w) ≤ Reg/T + O( √(Reg · ln(1/δ))/T + ln(1/δ)/T ).

Here, the constants hidden in the O-notation are determined by the Lipschitz constant and the strong convexity parameter of the loss l. Importantly, note that the correction term is of lower order than the regret: if the regret is ln T, then the additional penalty is O(√(ln T)/T). If one naively uses the Hoeffding-Azuma methods in Cesa-Bianchi et al. [2004], one would obtain a significantly worse penalty of O(1/√T).

This result solves an open problem in Shalev-Shwartz et al. [2007], which was on characterizing the convergence rate of the PEGASOS algorithm with high probability. PEGASOS is an online strongly convex programming algorithm for the SVM objective function; it repeatedly (and randomly) subsamples the training set in order to minimize the empirical SVM objective function. A corollary of this work essentially shows that the convergence rate of PEGASOS (as a randomized optimization algorithm) is concentrated rather sharply. Ratliff et al. [2007] also provide an online algorithm (based on Hazan et al. [2006]) for max-margin based structured prediction. Our results are also directly applicable in providing a sharper concentration result in their setting (in particular, see the regret bound in their Equation 5, to which our results can be applied).
This paper continues the line of research initiated by several researchers [Littlestone, 1989, Cesa-Bianchi et al., 2004, Zhang, 2005, Cesa-Bianchi and Gentile, 2008] which looks at how to convert online algorithms into batch algorithms with provable guarantees. Cesa-Bianchi and Gentile [2008] prove faster rates in the case when the cumulative loss of the online algorithm is small. Here, we are interested in the case where the cumulative regret is small. The work of Zhang [2005] is closest to ours. Zhang [2005] explicitly goes via the exponential moment method to derive sharper concentration results. In particular, for the regression problem with squared loss, Zhang [2005] gives a result similar to ours (see Theorem 8 therein). The present work can also be seen as generalizing his result to the case where we have strong convexity with respect to a general norm. Coupled with recent advances in low regret algorithms in this setting, we are able to provide a result that holds more generally.

Our key technical tool is a probabilistic inequality due to Freedman [Freedman, 1975]. This, combined with a variance bound (Lemma 1) that follows from our assumptions about the loss function, allows us to derive our main result (Theorem 2). We then apply it to statistical learning with bounded loss, and to PEGASOS, in Section 4.

2 Setting

Fix a compact convex subset S of some space equipped with a norm ‖·‖. Let ‖·‖_* be the dual norm, defined by ‖v‖_* := sup_{w : ‖w‖ ≤ 1} v·w. Let Z be a random variable taking values in some space 𝒵. Our goal is to minimize F(w) := E[f(w; Z)] over w ∈ S. Here, f : S × 𝒵 → [0, B] is some function satisfying the following assumption.

Assumption LIST. (LIpschitz and STrongly convex assumption) For all z ∈ 𝒵, the function f_z(w) := f(w; z) is convex in w and satisfies:

1. f_z has Lipschitz constant L w.r.t. the norm ‖·‖, i.e. for all w ∈ S and all ∇ ∈ ∂f_z(w) (where ∂f_z denotes the subdifferential of f_z), ‖∇‖_* ≤ L. Note that this assumption implies that for all w, w′ ∈ S, |f_z(w) − f_z(w′)| ≤ L ‖w − w′‖.

2. f_z is ν-strongly convex w.r.t. ‖·‖, i.e. for all θ ∈ [0, 1] and all w, w′ ∈ S,

       f_z(θw + (1−θ)w′) ≤ θ f_z(w) + (1−θ) f_z(w′) − (ν/2) θ(1−θ) ‖w − w′‖².

Denote the minimizer of F by w*, i.e. w* := arg min_{w∈S} F(w). We consider an online setting in which independent (but not necessarily identically distributed) random variables Z_1, ..., Z_T become available to us in that order, with the property that for all t and all w ∈ S, E[f(w; Z_t)] = F(w). Now consider an algorithm that starts out with some w_1 and, at time t, having seen Z_t, updates the parameter w_t to w_{t+1}. Let E_t[·] denote conditional expectation w.r.t. Z_1, ..., Z_{t−1}. Note that w_t is measurable w.r.t. Z_1, ..., Z_{t−1} and hence E_t[f(w_t; Z_t)] = F(w_t). Define the statistics

    Reg := Σ_{t=1}^T f(w_t; Z_t) − min_{w∈S} Σ_{t=1}^T f(w; Z_t),
    Diff := Σ_{t=1}^T (F(w_t) − F(w*)),

and define the sequence of random variables

    ξ_t := F(w_t) − F(w*) − (f(w_t; Z_t) − f(w*; Z_t)).    (1)
Since E_t[f(w_t; Z_t)] = F(w_t) and E_t[f(w*; Z_t)] = F(w*), ξ_t is a martingale difference sequence. This definition needs some explanation, as it is important to look at the right martingale difference sequence to derive the results we want. Even under assumption LIST, Σ_t f(w_t; Z_t) and Σ_t f(w*; Z_t) will not be concentrated around Σ_t F(w_t) and T·F(w*), respectively, at a rate better than O(1/√T) in general. But if we look at the difference, we are able to get sharper concentration.

3 A General Online to Batch Conversion

The following simple lemma is crucial for us. It says that under assumption LIST, the variance of the increment in the regret, f(w_t; Z_t) − f(w*; Z_t), is bounded, up to a multiplicative constant, by its conditional expectation F(w_t) − F(w*). Such a control on the variance is often the main ingredient in obtaining sharper concentration results.

Lemma 1. Suppose assumption LIST holds and let ξ_t be the martingale difference sequence defined in (1). Let Var_t ξ_t := E_t[ξ_t²] be the conditional variance of ξ_t given Z_1, ..., Z_{t−1}. Then, under assumption LIST, we have

    Var_t ξ_t ≤ (4L²/ν) (F(w_t) − F(w*)).

The variance bound given by the above lemma allows us to prove our main theorem.

Theorem 2. Under assumption LIST, we have, with probability at least 1 − 4 ln(T)δ,

    (1/T) Σ_{t=1}^T F(w_t) − F(w*) ≤ Reg/T + (4/T) √( (L²/ν) ln(1/δ) · Reg ) + max{16L²/ν, 6B} · ln(1/δ)/T.

Further, using Jensen's inequality, (1/T) Σ_t F(w_t) can be replaced by F(w̄), where w̄ := (1/T) Σ_t w_t.

3.1 Proofs

Proof of Lemma 1. We have

    Var_t ξ_t ≤ E_t[(f(w_t; Z_t) − f(w*; Z_t))²] ≤ E_t[L² ‖w_t − w*‖²] = L² ‖w_t − w*‖²,    (2)

where the second inequality uses part 1 of assumption LIST. On the other hand, using part 2 of assumption LIST (with θ = 1/2), we also have, for any w, w′ ∈ S,

    (f(w; Z) + f(w′; Z))/2 ≥ f((w + w′)/2; Z) + (ν/8) ‖w − w′‖².

Taking expectations, this gives, for any w, w′ ∈ S,

    (F(w) + F(w′))/2 ≥ F((w + w′)/2) + (ν/8) ‖w − w′‖².

Now using this with w = w_t and w′ = w*, we get

    (F(w_t) + F(w*))/2 ≥ F((w_t + w*)/2) + (ν/8) ‖w_t − w*‖² ≥ F(w*) + (ν/8) ‖w_t − w*‖²,

where the last inequality holds because w* minimizes F. This implies that

    ‖w_t − w*‖² ≤ 4 (F(w_t) − F(w*))/ν.    (3)

Combining (2) and (3), we get

    Var_t ξ_t ≤ (4L²/ν) (F(w_t) − F(w*)).

The proof of Theorem 2 relies on the following inequality for martingales, which is an easy consequence of Freedman's inequality [Freedman, 1975, Theorem 1.6]. The proof of this lemma can be found in the appendix.
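As a quick numerical sanity check (our illustration, not part of the paper), the variance inequality of Lemma 1 can be verified on the toy loss f(w; z) = (w − z)² with z uniform on [−1, 1], which is ν = 2 strongly convex and, on S = [−1, 1], Lipschitz with L = 4:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance: f(w; z) = (w - z)^2, z ~ Uniform[-1, 1], S = [-1, 1].
# Then f is nu = 2 strongly convex and L = 4 Lipschitz on S.
L, nu = 4.0, 2.0
z = rng.uniform(-1.0, 1.0, size=100_000)
w_star = z.mean()  # empirical minimizer of F(w) = E[(w - z)^2]

def F(w):
    return np.mean((w - z) ** 2)

for w in np.linspace(-1.0, 1.0, 21):
    increments = (w - z) ** 2 - (w_star - z) ** 2  # f(w; z) - f(w*; z)
    var = increments.var()
    bound = (4.0 * L ** 2 / nu) * (F(w) - F(w_star))
    assert var <= bound + 1e-9  # Lemma 1: Var <= (4 L^2 / nu) (F(w) - F(w*))
print("variance bound holds on a grid of w")
```

For this toy loss the variance is roughly (4/3)(w − w*)², comfortably inside the bound 32(w − w*)², illustrating why the martingale increments shrink as w_t approaches w*.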
Lemma 3. Suppose X_1, ..., X_T is a martingale difference sequence with |X_t| ≤ b. Let Var_t X_t := Var(X_t | X_1, ..., X_{t−1}) and let V := Σ_{t=1}^T Var_t X_t be the sum of conditional variances of the X_t's. Further, let σ := √V. Then we have, for any δ < 1/e and T ≥ 3,

    Prob( Σ_{t=1}^T X_t > max{2σ, 3b √(ln(1/δ))} · √(ln(1/δ)) ) ≤ 4 ln(T) δ.

Proof of Theorem 2. By Lemma 1, we have

    σ² := Σ_{t=1}^T Var_t ξ_t ≤ (4L²/ν) · Diff.

Note that |ξ_t| ≤ 2B because f has range [0, B]. Therefore, Lemma 3 gives us that, with probability at least 1 − 4 ln(T)δ,

    Σ_{t=1}^T ξ_t ≤ max{2σ, 6B √(ln(1/δ))} · √(ln(1/δ)).

By definition of Reg, Σ_{t=1}^T ξ_t ≥ Diff − Reg, and therefore, with probability at least 1 − 4 ln(T)δ, we have

    Diff ≤ Reg + max{ 4 √( (L²/ν) Diff ), 6B √(ln(1/δ)) } · √(ln(1/δ)).

Using Lemma 4 below to solve the above quadratic inequality for Diff gives

    Diff = Σ_{t=1}^T F(w_t) − T·F(w*) ≤ Reg + 4 √( (L²/ν) ln(1/δ) · Reg ) + max{16L²/ν, 6B} · ln(1/δ),

and dividing by T proves the theorem.

The following elementary lemma was required to solve a recursive inequality in the proof of the above theorem. Its proof can be found in the appendix.

Lemma 4. Suppose s, r, d, b ≥ 0 and

    s ≤ r + max{4 √(ds), 6b}.

Then it follows that

    s ≤ r + 4 √(dr) + max{16d, 6b}.

4 Applications

4.1 Online to Batch Conversion for Learning with Bounded Loss

Suppose (X_1, Y_1), ..., (X_T, Y_T) are drawn i.i.d. from a distribution. The pairs (X_i, Y_i) belong to 𝒳 × 𝒴, and our algorithms are allowed to make predictions in a space 𝒟 ⊇ 𝒴. A loss function l : 𝒟 × 𝒴 → [0, 1] measures the quality of predictions. Fix a convex set S of some normed space and a function h : 𝒳 × S → 𝒟. Let our hypothesis class be {x ↦ h(x; w) | w ∈ S}. On input x, the hypothesis parameterized by w predicts h(x; w) and incurs loss l(h(x; w), y) if the correct prediction is y. The risk of w is defined by

    R(w) := E[l(h(X; w), Y)],

and let w* := arg min_{w∈S} R(w) denote (the parameter of) the hypothesis with minimum risk. It is easy to see that this setting falls under the general framework given above by thinking of the pair (X, Y) as Z and setting f(w; Z) = f(w; (X, Y)) := l(h(X; w), Y). Note that F(w) becomes the risk R(w). The range of f is [0, 1] by our assumption about the loss function, so B = 1.

Suppose we run an online algorithm on our data that generates a sequence of hypotheses w_1, ..., w_T such that w_t is measurable w.r.t. X_{<t}, Y_{<t}. Define the statistics

    Reg := Σ_{t=1}^T l(h(X_t; w_t), Y_t) − min_{w∈S} Σ_{t=1}^T l(h(X_t; w), Y_t),
    Diff := Σ_{t=1}^T (R(w_t) − R(w*)).

At the end, we output w̄ := (Σ_{t=1}^T w_t)/T. The following corollary then follows immediately from Theorem 2. It bounds the excess risk R(w̄) − R(w*).

Corollary 5. Suppose assumption LIST is satisfied for f(w; (x, y)) := l(h(x; w), y). Then we have, with probability at least 1 − 4 ln(T)δ,

    R(w̄) − R(w*) ≤ Reg/T + (4/T) √( (L²/ν) ln(1/δ) · Reg ) + max{16L²/ν, 6} · ln(1/δ)/T.

Recently, it has been proved [Kakade and Shalev-Shwartz, 2008] that if assumption LIST is satisfied for w ↦ l(h(x; w), y), then there is an online algorithm that generates w_1, ..., w_T with

    Reg ≤ (L²/ν)(1 + ln T).

Plugging this into the corollary above gives the following result.

Corollary 6. Suppose assumption LIST is satisfied for f(w; (x, y)) := l(h(x; w), y). Then there is an online algorithm that generates w_1, ..., w_T and in the end outputs w̄ such that, with probability at least 1 − 4 ln(T)δ, for any T ≥ 3,

    R(w̄) − R(w*) ≤ (L²(1 + ln T))/(νT) + (4L²/(νT)) √((1 + ln T) ln(1/δ)) + max{16L²/ν, 6} · ln(1/δ)/T.

4.2 High Probability Bound for PEGASOS

PEGASOS [Shalev-Shwartz et al., 2007] is a recently proposed method for solving the primal SVM problem. Recall that in the SVM optimization problem we are given m example-label pairs (x_i, y_i) ∈ ℝ^d × {±1}. Assume that ‖x_i‖ ≤ R for all i, where ‖·‖ is the standard L_2 norm. Let

    F(w) := (λ/2) ‖w‖² + (1/m) Σ_{i=1}^m l(w; (x_i, y_i))    (4)

be the SVM objective function, where l(w; (x, y)) = [1 − y(w·x)]_+ is the hinge loss.
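The PEGASOS update itself is a projected stochastic subgradient step on this objective. The following sketch follows the description in Shalev-Shwartz et al. [2007] (step size 1/(λt) and projection onto the ball of radius 1/√λ); the synthetic data and the particular values of m, d, λ, k are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic linearly separable data (illustrative choices of m, d, lambda, k).
m, d, lam, k, T = 500, 5, 0.1, 10, 2000
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = np.sign(X @ w_true)

def svm_objective(w):
    return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()

w = np.zeros(d)
for t in range(1, T + 1):
    idx = rng.choice(m, size=k, replace=False)  # random subset Z_t of size k
    Xb, yb = X[idx], y[idx]
    active = yb * (Xb @ w) < 1.0                # examples with a nonzero hinge subgradient
    grad = lam * w - (yb[active][:, None] * Xb[active]).sum(axis=0) / k
    w -= grad / (lam * t)                       # step size 1 / (lambda * t)
    norm = np.linalg.norm(w)
    radius = 1.0 / np.sqrt(lam)
    if norm > radius:                           # project back onto S
        w *= radius / norm

print(f"objective: start {svm_objective(np.zeros(d)):.3f}, end {svm_objective(w):.3f}")
```

Each iteration touches only the k sampled examples, which is why the method scales well in m; the analysis below treats exactly this randomized sequence of updates.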
At time t, PEGASOS takes a (random) approximation

    f(w; Z_t) = (λ/2) ‖w‖² + (1/k) Σ_{(x,y)∈Z_t} l(w; (x, y))

of the SVM objective function to estimate the gradient and updates the current weight vector w_t to w_{t+1}. Here Z_t is a random subset of the data set of size k. Note that F(w) can be written as

    F(w) = E[ (λ/2) ‖w‖² + l(w; Z) ],

where Z is an example (x_i, y_i) drawn uniformly at random from the m data points. It is also easy to verify that, for all w, E[f(w; Z_t)] = F(w). It can be shown that w* := arg min_w F(w) satisfies ‖w*‖ ≤ 1/√λ, so we set

    S := { w ∈ ℝ^d : ‖w‖ ≤ 1/√λ }.

For any z that is a subset of the data set, the function

    f(w; z) = (λ/2) ‖w‖² + (1/|z|) Σ_{(x,y)∈z} l(w; (x, y))

is Lipschitz on S with Lipschitz constant L = √λ + R and is λ-strongly convex. Also, f(w; z) ∈ [0, 3/2 + R/√λ]. So the PEGASOS setting falls under our general framework and satisfies assumption LIST.

Theorem 1 in Shalev-Shwartz et al. [2007] says that, for any w and T ≥ 3,

    Σ_{t=1}^T f(w_t; Z_t) ≤ Σ_{t=1}^T f(w; Z_t) + (L²/λ) ln T,    (5)

where L = √λ + R. It was noted in that paper that, plugging in w = w* and taking expectations, we easily get

    E_{Z_1,...,Z_T}[ (1/T) Σ_{t=1}^T F(w_t) ] ≤ F(w*) + (L² ln T)/(λT).

Here we use Theorem 2 to prove an inequality that holds with high probability, not just in expectation.

Corollary 7. Let F be the SVM objective function defined in (4) and let w_1, ..., w_T be the sequence of weight vectors generated by the PEGASOS algorithm. Further, let w* denote the minimizer of the SVM objective. Then, with probability at least 1 − 4δ ln(T), for any T ≥ 3,

    (1/T) Σ_{t=1}^T F(w_t) − F(w*) ≤ (L² ln T)/(λT) + (4L²/(λT)) √(ln T · ln(1/δ)) + max{16L²/λ, 9 + 6R/√λ} · ln(1/δ)/T.    (6)

Therefore, assuming R = 1, we have, for small enough λ, with probability at least 1 − δ,

    (1/T) Σ_{t=1}^T F(w_t) − F(w*) = O( ln(T/δ) / (λT) ).

Proof. Note that (5) implies that Reg ≤ (L²/λ) ln T. The corollary then follows immediately from Theorem 2 by plugging in ν = λ and B = 3/2 + R/√λ.

References

N. Cesa-Bianchi and C. Gentile. Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386–390, 2008.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, March 2006.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, February 1975.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

S. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.

N. Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, U. C. Santa Cruz, March 1989.

Nathan Ratliff, James (Drew) Bagnell, and Martin Zinkevich. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats), March 2007.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), pages 807–814, 2007.

T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the Eighteenth Annual Conference on Computational Learning Theory, pages 173–187, 2005.
Appendix

Proof of Lemma 3. Note that a crude upper bound on Var_t X_t is b². Thus, σ ≤ b√T. We choose a discretization 0 = α_{−1} < α_0 < ... < α_l such that α_{i+1} = r·α_i for i ≥ 0 and α_l ≥ b√T. We will specify the choice of α_0 and r shortly. We then have, for any c > 0,

    Prob( Σ_t X_t > c·max{rσ, α_0}·√(ln(1/δ)) )
      = Σ_{j=0}^l Prob( Σ_t X_t > c·max{rσ, α_0}·√(ln(1/δ))  and  α_{j−1} < σ ≤ α_j )
      ≤ Σ_{j=0}^l Prob( Σ_t X_t > c·α_j·√(ln(1/δ))  and  V ≤ α_j² )
      ≤ Σ_{j=0}^l exp( − (c² α_j² ln(1/δ)) / (2α_j² + (2/3)·c·α_j·b·√(ln(1/δ))) ),    (*)

where the inequality (*) follows from Freedman's inequality. If we now choose α_0 = b·c·√(ln(1/δ)), then α_j ≥ b·c·√(ln(1/δ)) for all j, and hence every term in the above summation is bounded by exp( −(c² ln(1/δ))/(2 + 2/3) ), which is less than δ if we choose c = 5/3. Set r = 2/c = 6/5. We want α_0·r^l ≥ b√T; since c·√(ln(1/δ)) ≥ 1, taking l = ⌈log_r(√T/(c√(ln(1/δ))))⌉ ensures this. Since (5/3)·max{(6/5)σ, (5/3)·b·√(ln(1/δ))} ≤ max{2σ, 3b·√(ln(1/δ))}, we thus have

    Prob( Σ_t X_t > max{2σ, 3b·√(ln(1/δ))}·√(ln(1/δ)) )
      ≤ Prob( Σ_t X_t > c·max{rσ, α_0}·√(ln(1/δ)) )
      ≤ (l + 1)δ ≤ 4 ln(T)δ

for T ≥ 3.

Proof of Lemma 4. The assumption of the lemma implies that one of the following two inequalities holds:

    s ≤ r + 6b,    (7)
    s ≤ r + 4√(ds).

In the second case, we have

    (√s)² − (4√d)·√s − r ≤ 0,

which means that √s must be smaller than the larger root of this quadratic in √s. This gives us

    √s ≤ 2√d + √(4d + r),

and hence

    s ≤ (2√d + √(4d + r))² = 4d + 4√d·√(4d + r) + 4d + r
      ≤ 8d + r + 4√d·(2√d + √r)    [using √(x + y) ≤ √x + √y]
      = r + 4√(dr) + 16d
      ≤ r + 4√(dr) + max{16d, 6b}.    (8)

Combining (7) and (8) finishes the proof.
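Lemma 4 is elementary, and its conclusion can be spot-checked numerically (our illustration, not part of the paper): draw random nonnegative (s, r, d, b), keep those satisfying the hypothesis, and confirm the conclusion on each.

```python
import numpy as np

rng = np.random.default_rng(3)

checked = 0
for _ in range(20_000):
    s, r, d, b = rng.uniform(0.0, 10.0, size=4)
    if s <= r + max(4.0 * np.sqrt(d * s), 6.0 * b):      # hypothesis of Lemma 4
        # conclusion: s <= r + 4 sqrt(d r) + max(16 d, 6 b)
        assert s <= r + 4.0 * np.sqrt(d * r) + max(16.0 * d, 6.0 * b) + 1e-9
        checked += 1
print(f"conclusion verified on {checked} random instances satisfying the hypothesis")
```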
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationLower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization
JMLR: Workshop and Conference Proceedings vol 40:1 16, 2015 28th Annual Conference on Learning Theory Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization Mehrdad
More informationDynamic Regret of Strongly Adaptive Methods
Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationAdaptive Online Learning in Dynamic Environments
Adaptive Online Learning in Dynamic Environments Lijun Zhang, Shiyin Lu, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Nanjing University, Nanjing 210023, China {zhanglj, lusy, zhouzh}@lamda.nju.edu.cn
More informationTutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.
Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research
More informationAgnostic Online learnability
Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationMaking Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania
More informationBetter Algorithms for Selective Sampling
Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized
More informationAdvanced Machine Learning
Advanced Machine Learning Follow-he-Perturbed Leader MEHRYAR MOHRI MOHRI@ COURAN INSIUE & GOOGLE RESEARCH. General Ideas Linear loss: decomposition as a sum along substructures. sum of edge losses in a
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationAdvanced Machine Learning
Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.
More informationOptimal Regularized Dual Averaging Methods for Stochastic Optimization
Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business
More informationOn the tradeoff between computational complexity and sample complexity in learning
On the tradeoff between computational complexity and sample complexity in learning Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Joint work with Sham
More informationLecture 23: Online convex optimization Online convex optimization: generalization of several algorithms
EECS 598-005: heoretical Foundations of Machine Learning Fall 2015 Lecture 23: Online convex optimization Lecturer: Jacob Abernethy Scribes: Vikas Dhiman Disclaimer: hese notes have not been subjected
More informationThe Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm
he Algorithmic Foundations of Adaptive Data Analysis November, 207 Lecture 5-6 Lecturer: Aaron Roth Scribe: Aaron Roth he Multiplicative Weights Algorithm In this lecture, we define and analyze a classic,
More informationThe FTRL Algorithm with Strongly Convex Regularizers
CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked
More informationConvergence rate of SGD
Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected
More informationWorst-Case Analysis of the Perceptron and Exponentiated Update Algorithms
Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationThis article was published in an Elsevier journal. The attached copy is furnished to the author for non-commercial research and education use, including for instruction at the author s institution, sharing
More informationOptimistic Rates Nati Srebro
Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik
More informationThe Moment Method; Convex Duality; and Large/Medium/Small Deviations
Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 2018
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 208 Review of Game heory he games we will discuss are two-player games that can be modeled by a game matrix
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationStochastic and Adversarial Online Learning without Hyperparameters
Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationConstrained Stochastic Gradient Descent for Large-scale Least Squares Problem
Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem ABSRAC Yang Mu University of Massachusetts Boston 100 Morrissey Boulevard Boston, MA, US 015 yangmu@cs.umb.edu ianyi Zhou University
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationAn Online Convex Optimization Approach to Blackwell s Approachability
Journal of Machine Learning Research 17 (2016) 1-23 Submitted 7/15; Revised 6/16; Published 8/16 An Online Convex Optimization Approach to Blackwell s Approachability Nahum Shimkin Faculty of Electrical
More informationImitation Learning by Coaching. Abstract
Imitation Learning by Coaching He He Hal Daumé III Department of Computer Science University of Maryland College Park, MD 20740 {hhe,hal}@cs.umd.edu Abstract Jason Eisner Department of Computer Science
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationA survey: The convex optimization approach to regret minimization
A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationFast Stochastic AUC Maximization with O(1/n)-Convergence Rate
Fast Stochastic Maximization with O1/n)-Convergence Rate Mingrui Liu 1 Xiaoxuan Zhang 1 Zaiyi Chen Xiaoyu Wang 3 ianbao Yang 1 Abstract In this paper we consider statistical learning with area under ROC
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationTime Series Prediction & Online Learning
Time Series Prediction & Online Learning Joint work with Vitaly Kuznetsov (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Motivation Time series prediction: stock values. earthquakes.
More informationExponential Weights on the Hypercube in Polynomial Time
European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts
More informationIndividual Sequence Prediction using Memory-efficient Context Trees
Individual Sequence Prediction using Memory-efficient Context rees Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer Abstract Context trees are a popular and effective tool for tasks such as compression,
More information