On the Generalization Ability of Online Strongly Convex Programming Algorithms


Sham M. Kakade
TTI Chicago
Chicago, IL
sham@tti-c.org

Ambuj Tewari
TTI Chicago
Chicago, IL
tewari@tti-c.org

Abstract

This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of PEGASOS (with high probability), a recently proposed method for solving the SVM optimization problem.

1 Introduction

Online regret minimizing algorithms provide some of the most successful algorithms for many machine learning problems, both in terms of the speed of optimization and the quality of generalization. Notable examples include efficient learning algorithms for structured prediction [Collins, 2002] (an algorithm now widely used) and for ranking problems [Crammer et al., 2006] (providing competitive results with a fast implementation).

Online convex optimization is a sequential paradigm in which, at each round, the learner predicts a vector $w_t \in S \subseteq \mathbb{R}^n$, nature responds with a convex loss function $\ell_t$, and the learner suffers loss $\ell_t(w_t)$. In this setting, the goal of the learner is to minimize the regret
$$\sum_{t=1}^{T} \ell_t(w_t) - \min_{w \in S} \sum_{t=1}^{T} \ell_t(w),$$
which is the difference between his cumulative loss and the cumulative loss of the optimal fixed vector.

Typically, these algorithms are used to train a learning algorithm incrementally, by sequentially feeding the algorithm a data sequence $(X_1, Y_1), \ldots, (X_T, Y_T)$ (generated in an i.i.d. manner). In essence, the loss function used in the above paradigm at time $t$ is $\ell(w; (X_t, Y_t))$, and this leads to a guaranteed bound on the regret:
$$\mathrm{Reg}_T = \sum_{t=1}^{T} \ell(w_t; (X_t, Y_t)) - \min_{w \in S} \sum_{t=1}^{T} \ell(w; (X_t, Y_t)).$$
However, in the batch setting, we are typically interested in finding a parameter $\hat{w}$ with good generalization ability, i.e. we would like
$$R(\hat{w}) - \min_{w \in S} R(w)$$
to be small, where $R(w) := \mathbb{E}[\ell(w; (X, Y))]$ is the risk.
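The online-to-batch recipe described above is easy to state operationally. The following sketch is not from the paper: the learner (projected online gradient descent), the loss (regularized squared loss), and all constants are illustrative assumptions. It feeds an i.i.d. stream to an online learner, records the cumulative regret against the best fixed parameter in hindsight, and returns the averaged iterate as the batch output $\hat{w}$.

```python
import numpy as np

def online_to_batch(X, Y, lam=0.1, radius=10.0):
    """Run an online learner on an i.i.d. stream; return averaged iterate and regret."""
    T, d = X.shape
    w = np.zeros(d)
    iterates, losses = [], []
    for t in range(T):
        # loss of the current iterate w_t on the example revealed at time t
        loss = 0.5 * (X[t] @ w - Y[t]) ** 2 + 0.5 * lam * w @ w
        grad = (X[t] @ w - Y[t]) * X[t] + lam * w
        losses.append(loss)
        iterates.append(w.copy())
        w = w - grad / (lam * (t + 1))        # step size 1/(lam*t), t counted from 1
        norm = np.linalg.norm(w)
        if norm > radius:                     # projection back onto the ball S
            w *= radius / norm
    # cumulative loss of the best fixed parameter in hindsight (ridge solution)
    w_best = np.linalg.solve(X.T @ X + T * lam * np.eye(d), X.T @ Y)
    best_loss = sum(0.5 * (X[t] @ w_best - Y[t]) ** 2 + 0.5 * lam * w_best @ w_best
                    for t in range(T))
    regret = sum(losses) - best_loss
    w_hat = np.mean(iterates, axis=0)         # averaged iterate, the batch output
    return w_hat, regret

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(2000)
w_hat, regret = online_to_batch(X, Y)
print("averaged parameter:", np.round(w_hat, 2), " cumulative regret:", round(float(regret), 2))
```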

Intuitively, it seems plausible that low regret on an i.i.d. sequence should imply good generalization performance. In fact, for most of the empirically successful online algorithms, we have a set of techniques to understand the generalization performance of these algorithms on new data via online to batch conversions: the conversions relate the regret of the algorithm (on past data) to the generalization performance (on future data). These include cases which are tailored to general convex functions [Cesa-Bianchi et al., 2004] (whose regret is $O(\sqrt{T})$) and mistake bound settings [Cesa-Bianchi and Gentile, 2008] (where the regret could be $O(1)$ under separability assumptions). In these conversions, we typically choose $\hat{w}$ to be the average of the $w_t$ produced by our online algorithm.

Recently, there has been a growing body of work providing online algorithms for strongly convex loss functions (i.e. $\ell_t$ is strongly convex), with regret guarantees that are merely $O(\ln T)$. Such algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex, either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. $L_2$ or KL based regularization). Note that in the latter case the loss function itself may only be just convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem; e.g. the SVM optimization problem uses the hinge loss with $L_2$ regularization. In fact, for this case, the PEGASOS algorithm of Shalev-Shwartz et al. [2007] (based on the online strongly convex programming algorithm of Hazan et al. [2006]) is a state-of-the-art SVM solver. Also, Ratliff et al. [2007] provide a similar subgradient method for max-margin based structured prediction, which also has favorable empirical performance.

The aim of this paper is to examine the generalization properties of online convex programming algorithms when the loss function is strongly convex (where strong convexity can be defined in a general sense, with respect to some arbitrary norm $\|\cdot\|$). Suppose we have an online algorithm which has some guaranteed cumulative regret bound $\mathrm{Reg}_T$ (e.g. say $\mathrm{Reg}_T \le \ln T$ with $T$ samples). Then a corollary of our main result shows that, with probability greater than $1 - \delta\ln T$, we obtain a parameter $\hat{w}$ from our online algorithm such that
$$R(\hat{w}) - \min_{w \in S} R(w) \le \frac{\mathrm{Reg}_T}{T} + O\!\left(\frac{\sqrt{\mathrm{Reg}_T\,\ln\frac{1}{\delta}}}{T} + \frac{\ln\frac{1}{\delta}}{T}\right).$$
Here, the constants hidden in the $O$-notation are determined by the Lipschitz constant and the strong convexity parameter of the loss $\ell$. Importantly, note that the correction term is of lower order than the regret term: if the regret is $\ln T$ then the additional penalty is $O(\sqrt{\ln T}/T)$. If one naively uses the Hoeffding-Azuma methods in Cesa-Bianchi et al. [2004], one would obtain a significantly worse penalty of $O(1/\sqrt{T})$.
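To make the comparison concrete, the following snippet (purely illustrative; constants and the $\ln(1/\delta)$ factor are dropped) tabulates the three scales that appear above: the regret contribution $\ln T / T$, the lower-order correction $\sqrt{\ln T}/T$, and the $1/\sqrt{T}$ penalty that a Hoeffding-Azuma argument would incur.

```python
import math

# Compare the orders of magnitude of the regret term, the correction term,
# and the naive Hoeffding-Azuma penalty for a few sample sizes T.
for T in (10**3, 10**5, 10**7):
    print(f"T={T:>10,}   ln(T)/T={math.log(T)/T:.2e}   "
          f"sqrt(ln T)/T={math.sqrt(math.log(T))/T:.2e}   "
          f"1/sqrt(T)={1/math.sqrt(T):.2e}")
```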
This result solves an open problem in Shalev-Shwartz et al. [2007], which was on characterizing the convergence rate of the PEGASOS algorithm with high probability. PEGASOS is an online strongly convex programming algorithm for the SVM objective function; it repeatedly (and randomly) subsamples the training set in order to minimize the empirical SVM objective function. A corollary to this work essentially shows that the convergence rate of PEGASOS (as a randomized optimization algorithm) is concentrated rather sharply. Ratliff et al. [2007] also provide an online algorithm (based on Hazan et al. [2006]) for max-margin based structured prediction. Our results are also directly applicable in providing a sharper concentration result in their setting (in particular, see the regret bound in Equation 5, to which our results can be applied).

This paper continues the line of research initiated by several researchers [Littlestone, 1989, Cesa-Bianchi et al., 2004, Zhang, 2005, Cesa-Bianchi and Gentile, 2008] which looks at how to convert online algorithms into batch algorithms with provable guarantees. Cesa-Bianchi and Gentile [2008] prove faster rates in the case when the cumulative loss of the online algorithm is small. Here, we are interested in the case where the cumulative regret is small. The work of Zhang [2005] is closest to ours. Zhang [2005] explicitly goes via the exponential moment method to derive sharper concentration results. In particular, for the regression problem with squared loss, Zhang [2005] gives a result similar to ours (see Theorem 8 therein). The present work can also be seen as generalizing his result to the case where we have strong convexity with respect to a general norm. Coupled with

recent advances in low regret algorithms in this setting, we are able to provide a result that holds more generally.

Our key technical tool is a probabilistic inequality due to Freedman [Freedman, 1975]. This, combined with a variance bound (Lemma 1) that follows from our assumptions about the loss function, allows us to derive our main result (Theorem 2). We then apply it to statistical learning with bounded loss, and to PEGASOS in Section 4.

2 Setting

Fix a compact convex subset $S$ of some space equipped with a norm $\|\cdot\|$. Let $\|\cdot\|_\star$ be the dual norm, defined by $\|v\|_\star := \sup_{w:\|w\|\le 1} v \cdot w$. Let $Z$ be a random variable taking values in some space $\mathcal{Z}$. Our goal is to minimize $F(w) := \mathbb{E}[f(w; Z)]$ over $w \in S$. Here, $f : S \times \mathcal{Z} \to [0, B]$ is some function satisfying the following assumption.

Assumption LIST. (LIpschitz and STrongly convex assumption) For all $z \in \mathcal{Z}$, the function $f_z(w) = f(w; z)$ is convex in $w$ and satisfies:

1. $f_z$ has Lipschitz constant $L$ w.r.t. the norm $\|\cdot\|$, i.e. $\forall w \in S$, $\forall g \in \partial f_z(w)$ ($\partial f_z$ denotes the subdifferential of $f_z$), $\|g\|_\star \le L$. Note that this assumption implies $\forall w, w' \in S$, $|f_z(w) - f_z(w')| \le L\|w - w'\|$.

2. $f_z$ is $\nu$-strongly convex w.r.t. $\|\cdot\|$, i.e. $\forall \theta \in [0, 1]$, $\forall w, w' \in S$,
$$f_z\left(\theta w + (1-\theta)w'\right) \le \theta f_z(w) + (1-\theta)f_z(w') - \frac{\nu}{2}\,\theta(1-\theta)\|w - w'\|^2.$$

Denote the minimizer of $F$ by $w^\star$, i.e. $w^\star := \arg\min_{w \in S} F(w)$.

We consider an online setting in which independent (but not necessarily identically distributed) random variables $Z_1, \ldots, Z_T$ become available to us in that order. These have the property that $\forall t$, $\forall w \in S$, $\mathbb{E}[f(w; Z_t)] = F(w)$. Now consider an algorithm that starts out with some $w_1$ and at time $t$, having seen $Z_t$, updates the parameter $w_t$ to $w_{t+1}$. Let $\mathbb{E}_t[\cdot]$ denote conditional expectation w.r.t. $Z_1, \ldots, Z_{t-1}$. Note that $w_t$ is measurable w.r.t. $Z_1, \ldots, Z_{t-1}$ and hence $\mathbb{E}_t[f(w_t; Z_t)] = F(w_t)$. Define the statistics
$$\mathrm{Reg}_T := \sum_{t=1}^{T} f(w_t; Z_t) - \min_{w \in S}\sum_{t=1}^{T} f(w; Z_t), \qquad \mathrm{Diff}_T := \sum_{t=1}^{T}\left(F(w_t) - F(w^\star)\right).$$
Define the sequence of random variables
$$\xi_t := F(w_t) - F(w^\star) - \left(f(w_t; Z_t) - f(w^\star; Z_t)\right). \tag{1}$$
Since $\mathbb{E}_t[f(w_t; Z_t)] = F(w_t)$ and $\mathbb{E}_t[f(w^\star; Z_t)] = F(w^\star)$, $\xi_t$ is a martingale difference sequence. This definition needs some explanation, as it is important to look at the right martingale difference sequence to derive the results we want. Even under assumption LIST, $\sum_t f(w_t; Z_t)$ and $\sum_t f(w^\star; Z_t)$ will not be concentrated around $\sum_t F(w_t)$ and $T F(w^\star)$, respectively, at a rate better than $O(1/\sqrt{T})$ in general. But if we look at the difference, we are able to get sharper concentration.
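A small simulation can make this point concrete. In the sketch below (a toy example, not taken from the paper: the loss $f(w; z) = (w - z)^2/2$ with $z \sim N(0,1)$, the projected online gradient descent learner, and all constants are assumptions of this illustration), the fluctuations of $\sum_t f(w^\star; Z_t)$ around $T F(w^\star)$ grow like $\sqrt{T}$, while $\sum_t \xi_t$ stays far smaller because the conditional variance of $\xi_t$ shrinks as $w_t$ approaches $w^\star$.

```python
import numpy as np

# Toy check: for f(w; z) = (w - z)^2 / 2 with z ~ N(0,1) we have
# F(w) = (w^2 + 1)/2, w* = 0 and F(w*) = 1/2.  Compare the spread of
# sum_t (f(w*;Z_t) - F(w*)) with the spread of sum_t xi_t from (1).
rng = np.random.default_rng(0)
T, n_runs = 2_000, 200
dev_star, dev_xi = [], []

for _ in range(n_runs):
    z = rng.standard_normal(T)
    w, sum_xi, sum_star = 1.0, 0.0, 0.0
    for t in range(T):
        F_gap = 0.5 * w ** 2                                  # F(w_t) - F(w*)
        f_gap = 0.5 * (w - z[t]) ** 2 - 0.5 * z[t] ** 2       # f(w_t;Z_t) - f(w*;Z_t)
        sum_xi += F_gap - f_gap                               # xi_t
        sum_star += 0.5 * z[t] ** 2 - 0.5                     # f(w*;Z_t) - F(w*)
        # online gradient descent step with step size 1/t, projected onto [-1, 1]
        w = float(np.clip(w - (w - z[t]) / (t + 1), -1.0, 1.0))
    dev_xi.append(sum_xi)
    dev_star.append(sum_star)

print("std of sum_t (f(w*;Z_t) - F(w*)) :", np.std(dev_star))   # grows like sqrt(T)
print("std of sum_t xi_t                :", np.std(dev_xi))     # much smaller
```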

3 A General Online to Batch Conversion

The following simple lemma is crucial for us. It says that under assumption LIST, the variance of the increment in the regret, $f(w_t; Z_t) - f(w^\star; Z_t)$, is bounded by its (conditional) expectation $F(w_t) - F(w^\star)$, up to a factor of $4L^2/\nu$. Such a control on the variance is often the main ingredient in obtaining sharper concentration results.

Lemma 1. Suppose assumption LIST holds and let $\xi_t$ be the martingale difference sequence defined in (1). Let $\mathrm{Var}_t\,\xi_t := \mathbb{E}_t[\xi_t^2]$ be the conditional variance of $\xi_t$ given $Z_1, \ldots, Z_{t-1}$. Then, under assumption LIST, we have
$$\mathrm{Var}_t\,\xi_t \le \frac{4L^2}{\nu}\left(F(w_t) - F(w^\star)\right).$$

The variance bound given by the above lemma allows us to prove our main theorem.

Theorem 2. Under assumption LIST, we have, with probability at least $1 - 4\ln(T)\delta$,
$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^\star) \le \frac{\mathrm{Reg}_T}{T} + \frac{4}{T}\sqrt{\frac{L^2\,\mathrm{Reg}_T\ln(1/\delta)}{\nu}} + \max\left\{\frac{16L^2}{\nu},\ 6B\right\}\frac{\ln(1/\delta)}{T}.$$
Further, using Jensen's inequality, $\frac{1}{T}\sum_{t=1}^{T} F(w_t)$ can be replaced by $F(\bar{w})$ where $\bar{w} := \frac{1}{T}\sum_{t=1}^{T} w_t$.

3.1 Proofs

Proof of Lemma 1. We have
$$\mathrm{Var}_t\,\xi_t \le \mathbb{E}_t\left[\left(f(w_t; Z_t) - f(w^\star; Z_t)\right)^2\right] \le \mathbb{E}_t\left[L^2\|w_t - w^\star\|^2\right] = L^2\|w_t - w^\star\|^2, \tag{2}$$
where the second inequality uses part 1 of assumption LIST. On the other hand, using part 2 of assumption LIST, we also have for any $w, w' \in S$,
$$\frac{f(w; Z) + f(w'; Z)}{2} \ge f\!\left(\frac{w + w'}{2}; Z\right) + \frac{\nu}{8}\|w - w'\|^2.$$
Taking expectations, this gives, for any $w, w' \in S$,
$$\frac{F(w) + F(w')}{2} \ge F\!\left(\frac{w + w'}{2}\right) + \frac{\nu}{8}\|w - w'\|^2.$$
Now using this with $w = w_t$, $w' = w^\star$, we get
$$\frac{F(w_t) + F(w^\star)}{2} \ge F\!\left(\frac{w_t + w^\star}{2}\right) + \frac{\nu}{8}\|w_t - w^\star\|^2 \ge F(w^\star) + \frac{\nu}{8}\|w_t - w^\star\|^2,$$
where the last step holds because $w^\star$ minimizes $F$. This implies that
$$\|w_t - w^\star\|^2 \le \frac{4\left(F(w_t) - F(w^\star)\right)}{\nu}. \tag{3}$$
Combining (2) and (3) we get
$$\mathrm{Var}_t\,\xi_t \le \frac{4L^2}{\nu}\left(F(w_t) - F(w^\star)\right).$$

The proof of Theorem 2 relies on the following inequality for martingales, which is an easy consequence of Freedman's inequality [Freedman, 1975, Theorem 1.6]. The proof of this lemma can be found in the appendix.

Lemma 3. Suppose $X_1, \ldots, X_T$ is a martingale difference sequence with $|X_t| \le b$. Let
$$\mathrm{Var}_t X_t = \mathrm{Var}\left(X_t \mid X_1, \ldots, X_{t-1}\right).$$
Let $V = \sum_{t=1}^{T}\mathrm{Var}_t X_t$ be the sum of conditional variances of the $X_t$'s. Further, let $\sigma = \sqrt{V}$. Then we have, for any $\delta < 1/e$ and $T \ge 3$,
$$\mathrm{Prob}\left(\sum_{t=1}^{T} X_t > \max\left\{2\sigma,\ 3b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)}\right) \le 4\ln(T)\,\delta.$$

Proof of Theorem 2. By Lemma 1, we have
$$\sigma^2 := \sum_{t=1}^{T}\mathrm{Var}_t\,\xi_t \le \frac{4L^2}{\nu}\,\mathrm{Diff}_T.$$
Note that $|\xi_t| \le 2B$ because our $f$ has range $[0, B]$. Therefore, Lemma 3 gives us that, with probability at least $1 - 4\ln(T)\delta$, we have
$$\sum_{t=1}^{T}\xi_t \le \max\left\{2\sigma,\ 6B\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)}.$$
By definition of $\mathrm{Reg}_T$, $\mathrm{Diff}_T \le \mathrm{Reg}_T + \sum_{t=1}^{T}\xi_t$, and therefore, with probability at least $1 - 4\ln(T)\delta$, we have
$$\mathrm{Diff}_T \le \mathrm{Reg}_T + \max\left\{4\sqrt{\frac{L^2}{\nu}\,\mathrm{Diff}_T},\ 6B\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)}.$$
Using Lemma 4 below to solve the above quadratic inequality for $\mathrm{Diff}_T$ gives
$$\mathrm{Diff}_T = \sum_{t=1}^{T} F(w_t) - T\,F(w^\star) \le \mathrm{Reg}_T + 4\sqrt{\frac{L^2\,\mathrm{Reg}_T\ln(1/\delta)}{\nu}} + \max\left\{\frac{16L^2}{\nu},\ 6B\right\}\ln(1/\delta).$$
Dividing both sides by $T$ completes the proof.

The following elementary lemma was required to solve a recursive inequality in the proof of the above theorem. Its proof can be found in the appendix.

Lemma 4. Suppose $s, r, d, b \ge 0$ and we have
$$s \le r + \max\left\{4\sqrt{ds},\ 6b\right\}.$$
Then, it follows that
$$s \le r + 4\sqrt{dr} + \max\{16d,\ 6b\}.$$

4 Applications

4.1 Online to Batch Conversion for Learning with Bounded Loss

Suppose $(X_1, Y_1), \ldots, (X_T, Y_T)$ are drawn i.i.d. from a distribution. The pairs $(X_i, Y_i)$ belong to $\mathcal{X} \times \mathcal{Y}$ and our algorithm is allowed to make predictions in a space $\mathcal{D} \supseteq \mathcal{Y}$. A loss function $\ell : \mathcal{D} \times \mathcal{Y} \to [0, 1]$ measures the quality of predictions. Fix a convex set $S$ of some normed space and a function $h : \mathcal{X} \times S \to \mathcal{D}$. Let our hypothesis class be $\{x \mapsto h(x; w) \mid w \in S\}$. On input $x$, the hypothesis parameterized by $w$ predicts $h(x; w)$ and incurs loss $\ell(h(x; w), y)$ if the correct prediction is $y$. The risk of $w$ is defined by
$$R(w) := \mathbb{E}\left[\ell(h(X; w), Y)\right],$$

and let $w^\star := \arg\min_{w \in S} R(w)$ denote (the parameter for) the hypothesis with minimum risk. It is easy to see that this setting falls under the general framework given above by thinking of the pair $(X, Y)$ as $Z$ and setting $f(w; Z) = f(w; (X, Y))$ to be $\ell(h(X; w), Y)$. Note that $F(w)$ becomes the risk $R(w)$. The range of $f$ is $[0, 1]$ by our assumption about the loss function, so $B = 1$.

Suppose we run an online algorithm on our data that generates a sequence of hypotheses $w_1, \ldots, w_T$ such that $w_t$ is measurable w.r.t. $X_{<t}, Y_{<t}$. Define the statistics
$$\mathrm{Reg}_T := \sum_{t=1}^{T}\ell(h(X_t; w_t), Y_t) - \min_{w \in S}\sum_{t=1}^{T}\ell(h(X_t; w), Y_t), \qquad \mathrm{Diff}_T := \sum_{t=1}^{T}\left(R(w_t) - R(w^\star)\right).$$
At the end, we output $\bar{w} := \left(\sum_{t=1}^{T} w_t\right)/T$. The following corollary then follows immediately from Theorem 2. It bounds the excess risk $R(\bar{w}) - R(w^\star)$.

Corollary 5. Suppose assumption LIST is satisfied for $f(w; (x, y)) := \ell(h(x; w), y)$. Then we have, with probability at least $1 - 4\ln(T)\delta$,
$$R(\bar{w}) - R(w^\star) \le \frac{\mathrm{Reg}_T}{T} + \frac{4}{T}\sqrt{\frac{L^2\,\mathrm{Reg}_T\ln(1/\delta)}{\nu}} + \max\left\{\frac{16L^2}{\nu},\ 6\right\}\frac{\ln(1/\delta)}{T}.$$

Recently, it has been proved [Kakade and Shalev-Shwartz, 2008] that if assumption LIST is satisfied for $w \mapsto \ell(h(x; w), y)$ then there is an online algorithm that generates $w_1, \ldots, w_T$ such that
$$\mathrm{Reg}_T \le \frac{L^2(1 + \ln T)}{\nu}.$$
Plugging this into the corollary above gives the following result.

Corollary 6. Suppose assumption LIST is satisfied for $f(w; (x, y)) := \ell(h(x; w), y)$. Then there is an online algorithm that generates $w_1, \ldots, w_T$ and in the end outputs $\bar{w}$ such that, with probability at least $1 - 4\ln(T)\delta$,
$$R(\bar{w}) - R(w^\star) \le \frac{L^2(1 + \ln T)}{\nu T} + \frac{4L^2\sqrt{(1 + \ln T)\ln(1/\delta)}}{\nu T} + \max\left\{\frac{16L^2}{\nu},\ 6\right\}\frac{\ln(1/\delta)}{T},$$
for any $T \ge 3$.

4.2 High Probability Bound for PEGASOS

PEGASOS [Shalev-Shwartz et al., 2007] is a recently proposed method for solving the primal SVM problem. Recall that in the SVM optimization problem we are given $m$ example-label pairs $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$. Assume that $\|x_i\| \le R$ for all $i$, where $\|\cdot\|$ is the standard $L_2$ norm. Let
$$F(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\ell(w; (x_i, y_i)) \tag{4}$$
be the SVM objective function. The loss function $\ell(w; (x, y)) = [1 - y(w \cdot x)]_+$ is the hinge loss. At time $t$, PEGASOS takes a (random) approximation
$$f(w; Z_t) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{k}\sum_{(x,y) \in Z_t}\ell(w; (x, y))$$
of the SVM objective function to estimate the gradient and updates the current weight vector $w_t$ to $w_{t+1}$. Here $Z_t$ is a random subset of the data set of size $k$. Note that $F(w)$ can be written as
$$F(w) = \mathbb{E}\left[\frac{\lambda}{2}\|w\|^2 + \ell(w; Z)\right],$$
where $Z$ is an example $(x_i, y_i)$ drawn uniformly at random from the $m$ data points.
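For concreteness, here is a minimal sketch of a PEGASOS-style update for the objective in (4). The step size $1/(\lambda t)$, the projection onto $S = \{w : \|w\| \le 1/\sqrt{\lambda}\}$, and the averaging of the iterates follow the description in Shalev-Shwartz et al. [2007] and the framework above, but the code itself (including the toy data and all parameter values) is an illustrative assumption, not the implementation analyzed in this paper.

```python
import numpy as np

def pegasos(X, y, lam=0.1, k=8, T=10_000, seed=0):
    """Stochastic subgradient steps on mini-batch approximations f(w; Z_t) of (4)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    radius = 1.0 / np.sqrt(lam)
    for t in range(1, T + 1):
        idx = rng.choice(m, size=k, replace=False)          # mini-batch Z_t of size k
        Xb, yb = X[idx], y[idx]
        active = yb * (Xb @ w) < 1.0                        # examples with nonzero hinge loss
        grad = lam * w - (yb[active, None] * Xb[active]).sum(axis=0) / k
        w = w - grad / (lam * t)                            # step size 1/(lam * t)
        norm = np.linalg.norm(w)
        if norm > radius:                                   # project back onto S
            w *= radius / norm
        w_sum += w
    return w_sum / T                                        # averaged iterate

# toy data: two Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (200, 2)), rng.normal(-1, 1, (200, 2))])
y = np.hstack([np.ones(200), -np.ones(200)])
print("averaged weight vector:", np.round(pegasos(X, y), 3))
```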

It is also easy to verify that $\forall w$, $\mathbb{E}[f(w; Z_t)] = F(w)$. It can be shown that $w^\star := \arg\min_w F(w)$ will satisfy $\|w^\star\| \le 1/\sqrt{\lambda}$, so we set
$$S = \left\{w \in \mathbb{R}^d : \|w\| \le 1/\sqrt{\lambda}\right\}.$$
For any $z$ that is a subset of the data set, the function
$$w \mapsto f(w; z) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{|z|}\sum_{(x,y) \in z}\ell(w; (x, y))$$
is Lipschitz on $S$ with Lipschitz constant $L = \sqrt{\lambda} + R$ and is $\lambda$-strongly convex. Also, $f(w; z) \in [0,\ 3/2 + R/\sqrt{\lambda}]$. So, the PEGASOS setting falls under our general framework and satisfies assumption LIST.

Theorem 1 in Shalev-Shwartz et al. [2007] says, for any $w$ and $T \ge 3$,
$$\sum_{t=1}^{T} f(w_t; Z_t) \le \sum_{t=1}^{T} f(w; Z_t) + \frac{L^2\ln T}{\lambda}, \tag{5}$$
where $L = \sqrt{\lambda} + R$. It was noted in that paper that plugging in $w = w^\star$ and taking expectations, we easily get
$$\mathbb{E}_{Z_1, \ldots, Z_T}\left[\frac{1}{T}\sum_{t=1}^{T} F(w_t)\right] \le F(w^\star) + \frac{L^2\ln T}{\lambda T}.$$
Here we use Theorem 2 to prove an inequality that holds with high probability, not just in expectation.

Corollary 7. Let $F$ be the SVM objective function defined in (4) and $w_1, \ldots, w_T$ be the sequence of weight vectors generated by the PEGASOS algorithm. Further, let $w^\star$ denote the minimizer of the SVM objective. Then, with probability at least $1 - 4\delta\ln(T)$, we have
$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^\star) \le \frac{L^2\ln T}{\lambda T} + \frac{4L^2\sqrt{\ln T\,\ln(1/\delta)}}{\lambda T} + \max\left\{\frac{16L^2}{\lambda},\ 9 + \frac{6R}{\sqrt{\lambda}}\right\}\frac{\ln(1/\delta)}{T}, \tag{6}$$
for any $T \ge 3$. Therefore, assuming $R = 1$, we have, for $\lambda$ small enough, with probability at least $1 - \delta$,
$$\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^\star) = O\!\left(\frac{\ln\frac{T}{\delta}}{\lambda T}\right).$$

Proof. Note that (5) implies that $\mathrm{Reg}_T \le \frac{L^2\ln T}{\lambda}$. The corollary then follows immediately from Theorem 2 by plugging in $\nu = \lambda$ and $B = 3/2 + R/\sqrt{\lambda}$.
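As a convenience for getting a feel for the rate in Corollary 7, the following helper (not part of the paper's analysis; the parameter values below are arbitrary) evaluates the right-hand side of (6) for given $T$, $\lambda$, $\delta$ and $R$. Recall that the bound fails with probability at most $4\delta\ln(T)$.

```python
import math

def pegasos_bound(T, lam, delta, R=1.0):
    # Right-hand side of (6) with L = sqrt(lam) + R, nu = lam, B = 3/2 + R/sqrt(lam).
    L = math.sqrt(lam) + R
    term1 = L ** 2 * math.log(T) / (lam * T)
    term2 = 4 * L ** 2 * math.sqrt(math.log(T) * math.log(1 / delta)) / (lam * T)
    term3 = max(16 * L ** 2 / lam, 9 + 6 * R / math.sqrt(lam)) * math.log(1 / delta) / T
    return term1 + term2 + term3

for T in (10 ** 4, 10 ** 6, 10 ** 8):
    print(f"T={T:>12,}   bound={pegasos_bound(T, lam=1e-3, delta=0.01):.5f}")
```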

References

N. Cesa-Bianchi and C. Gentile. Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386–390, 2008.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, March 2006.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, February 1975.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

S. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.

N. Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, U. C. Santa Cruz, March 1989.

Nathan Ratliff, James (Drew) Bagnell, and Martin Zinkevich. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats), March 2007.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), pages 807–814, 2007.

T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the Eighteenth Annual Conference on Computational Learning Theory, pages 173–187, 2005.

Appendix

Proof of Lemma 3. Note that a crude upper bound on $\mathrm{Var}_t X_t$ is $b^2$. Thus, $\sigma \le b\sqrt{T}$. We choose a discretization $0 = \alpha_{-1} < \alpha_0 < \ldots < \alpha_l$ such that $\alpha_{i+1} = r\alpha_i$ for $i \ge 0$ and $\alpha_l \ge b\sqrt{T}$. We will specify the choice of $\alpha_0$ and $r$ shortly. We then have, for any $c > 0$,
$$\begin{aligned}
\mathrm{Prob}\left(\sum_t X_t > c\max\{r\sigma, \alpha_0\}\sqrt{\ln(1/\delta)}\right)
&= \sum_{j=0}^{l}\mathrm{Prob}\left(\sum_t X_t > c\max\{r\sigma, \alpha_0\}\sqrt{\ln(1/\delta)}\ \ \&\ \ \alpha_{j-1} < \sigma \le \alpha_j\right)\\
&\le \sum_{j=0}^{l}\mathrm{Prob}\left(\sum_t X_t > c\,\alpha_j\sqrt{\ln(1/\delta)}\ \ \&\ \ V \le \alpha_j^2\right)\\
&\overset{(\star)}{\le} \sum_{j=0}^{l}\exp\left(\frac{-c^2\alpha_j^2\ln(1/\delta)}{2\alpha_j^2 + \tfrac{2}{3}c\,\alpha_j b\sqrt{\ln(1/\delta)}}\right),
\end{aligned}$$
where the inequality $(\star)$ follows from Freedman's inequality. If we now choose $\alpha_0 = bc\sqrt{\ln(1/\delta)}$ then $\alpha_j \ge bc\sqrt{\ln(1/\delta)}$ for all $j$, and hence every term in the above summation is bounded by
$$\exp\left(\frac{-c^2\ln(1/\delta)}{2 + 2/3}\right),$$
which is less than $\delta$ if we choose $c = 5/3$. Set $r = 2/c = 6/5$. We want $\alpha_0 r^l \ge b\sqrt{T}$. Since $c\sqrt{\ln(1/\delta)} \ge 1$ and $c/r \ge 1$, choosing $l = \lfloor\log_r(\sqrt{T})\rfloor$ ensures that. Thus we have
$$\mathrm{Prob}\left(\sum_t X_t > \tfrac{5}{3}\max\left\{\tfrac{6}{5}\sigma,\ \tfrac{5}{3}b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)}\right) = \mathrm{Prob}\left(\sum_t X_t > c\max\{r\sigma, \alpha_0\}\sqrt{\ln(1/\delta)}\right) \le (l+1)\,\delta \le \left(\log_{6/5}(\sqrt{T}) + 1\right)\delta \le 4\ln(T)\,\delta$$
for $T \ge 3$. Since $\frac{5}{3}\max\{\frac{6}{5}\sigma, \frac{5}{3}b\sqrt{\ln(1/\delta)}\} \le \max\{2\sigma, 3b\sqrt{\ln(1/\delta)}\}$, this proves the lemma.

Proof of Lemma 4. The assumption of the lemma implies that one of the following two inequalities holds:
$$s - r \le 6b \qquad \text{or} \qquad s - r \le 4\sqrt{ds}. \tag{7}$$
In the second case, we have
$$\left(\sqrt{s}\right)^2 - \left(4\sqrt{d}\right)\sqrt{s} - r \le 0,$$
which means that $\sqrt{s}$ should be smaller than the larger root of the above quadratic. This gives us
$$\sqrt{s} \le \frac{4\sqrt{d} + \sqrt{16d + 4r}}{2} = 2\sqrt{d} + \sqrt{4d + r},$$
and therefore
$$s = \left(\sqrt{s}\right)^2 \le 4d + 4\sqrt{d}\sqrt{4d + r} + 4d + r \le 8d + r + 4\sqrt{d}\left(2\sqrt{d} + \sqrt{r}\right) \qquad \left[\sqrt{x + y} \le \sqrt{x} + \sqrt{y}\right]$$
$$= r + 4\sqrt{dr} + 16d. \tag{8}$$
Combining (7) and (8) finishes the proof.
