Perceptron Mistake Bounds
Mehryar Mohri and Afshin Rostamizadeh
Google Research
Courant Institute of Mathematical Sciences

Abstract. We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm. Our novel bounds generalize beyond standard margin-loss type bounds, allow for any convex and Lipschitz loss function, and admit a very simple proof.

1 Introduction

The Perceptron algorithm belongs to the broad family of on-line learning algorithms (see Cesa-Bianchi and Lugosi [2006] for a survey) and admits a large number of variants. The algorithm learns a linear separator by processing the training sample in an on-line fashion, examining a single example at each iteration [Rosenblatt, 1958]. At each round, the current hypothesis is updated if it makes a mistake, that is, if it incorrectly classifies the new training point processed. The full pseudocode of the algorithm is provided in Figure 1. In what follows, we will assume that $w_0 = 0$ and $\eta = 1$ for simplicity of presentation; the more general case admits similar guarantees, which can be derived following the same methods we present.

This paper briefly surveys some existing mistake bounds for the Perceptron algorithm and introduces new ones which can be used to derive generalization bounds in a stochastic setting. A mistake bound is an upper bound on the number of updates, or the number of mistakes, made by the Perceptron algorithm when processing a sequence of training examples. Here, the bounds will be expressed in terms of the performance of any linear separator, including the best. Such mistake bounds can be directly used to derive generalization guarantees for a combined hypothesis, using existing on-line-to-batch techniques.
2 Separable case

The seminal work of Novikoff [1962] gave the first margin-based bound for the Perceptron algorithm, one of the early results in learning theory and probably one of the first based on the notion of margin. Assuming that the data is separable with some margin $\rho$, Novikoff showed that the number of mistakes made by the Perceptron algorithm can be bounded as a function of the normalized margin $\rho/r$, where $r$ is the radius of the sphere containing the training instances. We start with a lemma that can be used to prove Novikoff's theorem and that will be used throughout.
Perceptron($w_0$)
 1  $w_1 \leftarrow w_0$            ▷ typically $w_0 = 0$
 2  for $t \leftarrow 1$ to $T$ do
 3      Receive($x_t$)
 4      $\hat{y}_t \leftarrow \mathrm{sgn}(w_t \cdot x_t)$
 5      Receive($y_t$)
 6      if ($\hat{y}_t \neq y_t$) then
 7          $w_{t+1} \leftarrow w_t + y_t x_t$    ▷ more generally $w_t + \eta y_t x_t$, $\eta > 0$
 8      else $w_{t+1} \leftarrow w_t$
 9  return $w_{T+1}$

Fig. 1. Perceptron algorithm [Rosenblatt, 1958].

Lemma 1. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing a sequence of training instances $x_1, \ldots, x_T \in \mathbb{R}^N$. Then, the following inequality holds:
$$\Big\| \sum_{t \in I} y_t x_t \Big\| \leq \sqrt{\sum_{t \in I} \|x_t\|^2}.$$

Proof. The inequality holds using the following sequence of observations:
$$\Big\| \sum_{t \in I} y_t x_t \Big\|^2 = \Big\| \sum_{t \in I} (w_{t+1} - w_t) \Big\|^2 \qquad \text{(definition of updates)}$$
$$= \|w_{T+1}\|^2 \qquad \text{(telescoping sum, } w_0 = 0\text{)}$$
$$= \sum_{t \in I} \big( \|w_{t+1}\|^2 - \|w_t\|^2 \big) \qquad \text{(telescoping sum, } w_0 = 0\text{)}$$
$$= \sum_{t \in I} \big( \|w_t + y_t x_t\|^2 - \|w_t\|^2 \big) \qquad \text{(definition of updates)}$$
$$= \sum_{t \in I} \big( \underbrace{2 y_t (w_t \cdot x_t)}_{\leq 0} + \|x_t\|^2 \big) \leq \sum_{t \in I} \|x_t\|^2.$$
The final inequality uses the fact that an update is made at round $t$ only when the current hypothesis makes a mistake, that is, $y_t (w_t \cdot x_t) \leq 0$.

The lemma can be used straightforwardly to derive the following mistake bound for the separable setting.

Theorem 1 ([Novikoff, 1962]). Let $x_1, \ldots, x_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$. Assume that there exist $\rho > 0$ and $v \in \mathbb{R}^N$, $v \neq 0$, such that for all $t \in [1, T]$,
$$\frac{y_t (v \cdot x_t)}{\|v\|} \geq \rho.$$
Then, the number of updates made by the Perceptron algorithm when processing $x_1, \ldots, x_T$ is bounded by $r^2/\rho^2$.

Proof. Let I denote the subset of the $T$ rounds at which there is an update, and let $M$ be the total number of updates, i.e., $|I| = M$. Summing up the margin inequalities yields:
$$M\rho \leq \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|} \leq \Big\| \sum_{t \in I} y_t x_t \Big\| \leq \sqrt{\sum_{t \in I} \|x_t\|^2} \leq \sqrt{M}\, r,$$
where the second inequality holds by the Cauchy-Schwarz inequality, the third by Lemma 1, and the final one by assumption. Comparing the left- and right-hand sides gives $\sqrt{M} \leq r/\rho$, that is, $M \leq r^2/\rho^2$.

3 Non-separable case

In real-world problems, the training sample processed by the Perceptron algorithm is typically not linearly separable. Nevertheless, it is possible to give a margin-based mistake bound in that general case in terms of the radius of the sphere containing the sample and the margin-based loss of an arbitrary weight vector. We present two different types of bounds: first, a bound that depends on the $L_1$-norm of the vector of $\rho$-margin hinge losses, or of the vector of more general losses that we will describe; next, a bound that depends on the $L_2$-norm of the vector of margin losses, which extends the original results presented by Freund and Schapire [1999].

3.1 $L_1$-norm mistake bounds

We first present a simple proof of a mistake bound for the Perceptron algorithm that depends on the $L_1$-norm of the losses incurred by an arbitrary weight vector, for a general definition of the loss function that covers the $\rho$-margin hinge loss. The family of admissible loss functions is quite general and defined as follows.

Definition 1 ($\gamma$-admissible loss function). A $\gamma$-admissible loss function $\phi_\gamma \colon \mathbb{R} \to \mathbb{R}$ satisfies the following conditions:
1. The function $\phi_\gamma$ is convex.
2. $\phi_\gamma$ is non-negative: $\forall x \in \mathbb{R},\ \phi_\gamma(x) \geq 0$.
3. At zero, $\phi_\gamma$ is strictly positive: $\phi_\gamma(0) > 0$.
4. $\phi_\gamma$ is $\gamma$-Lipschitz: $|\phi_\gamma(x) - \phi_\gamma(y)| \leq \gamma |x - y|$, for some $\gamma > 0$.
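As an illustration, Figure 1 (with $w_0 = 0$ and $\eta = 1$) can be sketched in a few lines of Python; the function names below are ours, not from the paper, and the update test uses the mistake condition $y_t(w_t \cdot x_t) \leq 0$ appearing in the proofs (so a point on the decision boundary counts as a mistake).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(points, epochs=1):
    """Perceptron of Figure 1 with w0 = 0, eta = 1.

    Returns the final weight vector and the list of update rounds I,
    numbered consecutively over all epochs."""
    w = [0.0] * len(points[0][0])
    updates = []  # rounds at which an update (mistake) occurred
    for e in range(epochs):
        for t, (x, y) in enumerate(points):
            # update whenever y (w . x) <= 0, i.e. the point is misclassified
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates.append(e * len(points) + t)
    return w, updates
```

On a sample separable by $v = (1, 0)$ with margin $\rho = 1$ and radius $r = 2$, one can check numerically that Lemma 1's inequality holds (with $w_0 = 0$, the final weight vector equals $\sum_{t \in I} y_t x_t$) and that the number of updates never exceeds Novikoff's bound $r^2/\rho^2 = 4$.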
These are mild conditions satisfied by many loss functions, including the hinge loss, the squared hinge loss, the Huber loss, and general $p$-norm losses over bounded domains.

Theorem 2. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any vector $u \in \mathbb{R}^N$ with $\|u\| \leq 1$ and any $\gamma$-admissible loss function $\phi_\gamma$, consider the vector of losses incurred by $u$: $L_{\phi_\gamma}(u) = [\phi_\gamma(y_t(u \cdot x_t))]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\gamma > 0, \|u\| \leq 1} \frac{\|L_{\phi_\gamma}(u)\|_1}{\phi_\gamma(0)} + \frac{\gamma}{\phi_\gamma(0)} \sqrt{\sum_{t \in I} \|x_t\|^2}. \qquad (1)$$
If we further assume that $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$, this implies
$$M_T \leq \inf_{\gamma > 0, \|u\| \leq 1} \left( \frac{\gamma r}{2 \phi_\gamma(0)} + \sqrt{ \frac{\gamma^2 r^2}{4 \phi_\gamma(0)^2} + \frac{\|L_{\phi_\gamma}(u)\|_1}{\phi_\gamma(0)} } \right)^2. \qquad (2)$$

Proof. For all $\gamma > 0$ and $u$ with $\|u\| \leq 1$, the following statements hold. By convexity of $\phi_\gamma$, we have $\frac{1}{M_T} \sum_{t \in I} \phi_\gamma(y_t u \cdot x_t) \geq \phi_\gamma(u \cdot z)$, where $z = \frac{1}{M_T} \sum_{t \in I} y_t x_t$. Then, by the Lipschitz property of $\phi_\gamma$,
$$\phi_\gamma(u \cdot z) = \phi_\gamma(u \cdot z) - \phi_\gamma(0) + \phi_\gamma(0) \geq \phi_\gamma(0) - \gamma |u \cdot z|.$$
Combining the two inequalities above and multiplying both sides by $M_T$ implies
$$M_T \phi_\gamma(0) \leq \sum_{t \in I} \phi_\gamma(y_t u \cdot x_t) + \gamma \Big| \sum_{t \in I} y_t u \cdot x_t \Big|.$$
Finally, using the Cauchy-Schwarz inequality and Lemma 1 yields
$$\Big| \sum_{t \in I} y_t u \cdot x_t \Big| = \Big| u \cdot \sum_{t \in I} y_t x_t \Big| \leq \|u\| \Big\| \sum_{t \in I} y_t x_t \Big\| \leq \sqrt{\sum_{t \in I} \|x_t\|^2},$$
which completes the proof of the first statement after re-arranging terms. If it is further assumed that $\|x_t\| \leq r$ for all $t \in I$, then this implies $M_T \phi_\gamma(0) - \gamma r \sqrt{M_T} - \|L_{\phi_\gamma}(u)\|_1 \leq 0$. Solving this quadratic expression in terms of $\sqrt{M_T}$ proves the second statement.

It is straightforward to see that the $\rho$-margin hinge loss $\phi_\rho(x) = (1 - x/\rho)_+$ is $(1/\rho)$-admissible with $\phi_\rho(0) = 1$ for all $\rho > 0$, which gives the following corollary.

Corollary 1. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+\big]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \|L_\rho(u)\|_1 + \frac{1}{\rho} \sqrt{\sum_{t \in I} \|x_t\|^2}. \qquad (3)$$
If we further assume that $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$, this implies
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{r}{2\rho} + \sqrt{ \frac{r^2}{4\rho^2} + \|L_\rho(u)\|_1 } \right)^2. \qquad (4)$$
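For a concrete check of bound (3), the sketch below runs one pass of the Perceptron ($w_0 = 0$, $\eta = 1$) on a sample that is not linearly separable through the origin, then evaluates the right-hand side of (3) for one competitor $u$ and margin $\rho$; by Corollary 1 the bound must dominate the number of updates for any such choice. The function names are ours, not from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_perceptron(points):
    """Single pass; returns the list I of rounds at which an update occurred."""
    w = [0.0] * len(points[0][0])
    I = []
    for t, (x, y) in enumerate(points):
        if y * dot(w, x) <= 0:  # mistake condition used in the proofs
            w = [wi + y * xi for wi, xi in zip(w, x)]
            I.append(t)
    return I

def l1_hinge_bound(points, I, u, rho):
    """Right-hand side of bound (3): ||L_rho(u)||_1 + sqrt(sum ||x_t||^2) / rho,
    with both the hinge losses and the radius term taken over t in I."""
    hinge = sum(max(0.0, 1.0 - points[t][1] * dot(u, points[t][0]) / rho)
                for t in I)
    radius = math.sqrt(sum(dot(points[t][0], points[t][0]) for t in I))
    return hinge + radius / rho
```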
The mistake bound (3) appears already in Cesa-Bianchi et al. [2004], but we could not find its proof either in that paper or in those it references for this bound.

Another application of Theorem 2 is to the $\rho$-margin squared hinge loss $\phi_\rho(x) = (1 - x/\rho)_+^2$. Assume that $\|x\| \leq r$; then the inequality $|y(u \cdot x)| \leq \|u\| \|x\| \leq r$ implies that the derivative of the squared hinge loss is also bounded, achieving a maximum absolute value of $|\phi'_\rho(-r)| = \frac{2}{\rho}\big(1 + \frac{r}{\rho}\big) \leq \frac{4r}{\rho^2}$ for $\rho \leq r$. Thus, the $\rho$-margin squared hinge loss is $(4r/\rho^2)$-admissible with $\phi_\rho(0) = 1$ for all $\rho > 0$. This leads to the following corollary.

Corollary 2. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$ with $\|x_t\| \leq r$ for all $t \in [1, T]$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-margin squared hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+^2\big]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \|L_\rho(u)\|_1 + \frac{4r}{\rho^2} \sqrt{\sum_{t \in I} \|x_t\|^2}. \qquad (5)$$
This also implies
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{2r^2}{\rho^2} + \sqrt{ \frac{4r^4}{\rho^4} + \|L_\rho(u)\|_1 } \right)^2. \qquad (6)$$
Theorem 2 can be similarly used to derive mistake bounds in terms of other admissible losses.

3.2 $L_2$-norm mistake bounds

The original results of this section are due to Freund and Schapire [1999]. Here, we extend their proof to derive finer mistake bounds for the Perceptron algorithm in terms of the $L_2$-norm of the vector of hinge losses of an arbitrary weight vector at points where an update is made.

Theorem 3. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+\big]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{\|L_\rho(u)\|_2}{2} + \sqrt{ \frac{\|L_\rho(u)\|_2^2}{4} + \frac{1}{\rho} \sqrt{\sum_{t \in I} \|x_t\|^2} } \right)^2. \qquad (7)$$
If we further assume that $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$, this implies
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{r}{\rho} + \|L_\rho(u)\|_2 \right)^2. \qquad (8)$$
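Before turning to the proof, bound (8) can be checked numerically in the same style as the $L_1$-norm bound: the helper below computes its right-hand side $(r/\rho + \|L_\rho(u)\|_2)^2$ for a candidate $u$ and $\rho$, which must dominate the observed number of updates. This is a sketch with our own names, not code from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_updates(points):
    """Single pass of the Perceptron (w0 = 0, eta = 1); returns update rounds I."""
    w = [0.0] * len(points[0][0])
    I = []
    for t, (x, y) in enumerate(points):
        if y * dot(w, x) <= 0:  # mistake condition used in the proofs
            w = [wi + y * xi for wi, xi in zip(w, x)]
            I.append(t)
    return I

def l2_hinge_bound(points, I, u, rho):
    """Right-hand side of bound (8): (r / rho + ||L_rho(u)||_2)^2,
    with r the radius of the sample and the hinge losses taken over t in I."""
    r = max(math.sqrt(dot(x, x)) for x, _ in points)
    l2 = math.sqrt(sum(
        max(0.0, 1.0 - points[t][1] * dot(u, points[t][0]) / rho) ** 2
        for t in I))
    return (r / rho + l2) ** 2
```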
Proof. We first reduce the problem to the separable case by mapping each input vector $x_t \in \mathbb{R}^N$ to a vector $x'_t \in \mathbb{R}^{N+T}$ as follows:
$$x_t = [x_{t,1}, \ldots, x_{t,N}]^\top \;\mapsto\; x'_t = [x_{t,1}, \ldots, x_{t,N}, 0, \ldots, 0, \underbrace{\Delta}_{(N+t)\text{th component}}, 0, \ldots, 0]^\top,$$
where the first $N$ components of $x'_t$ coincide with those of $x_t$ and the only other non-zero component is the $(N+t)$th component, which is set to $\Delta$, a parameter whose value will be determined later. Define $l_t$ by $l_t = \big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+$. Then, the vector $u$ is replaced by the vector $u'$ defined by
$$u' = \Big[ \frac{u_1}{Z}, \ldots, \frac{u_N}{Z}, \frac{y_1 l_1 \rho}{\Delta Z}, \ldots, \frac{y_T l_T \rho}{\Delta Z} \Big]^\top.$$
The first $N$ components of $u'$ are equal to the components of $u/Z$ and the remaining $T$ components are functions of the labels and hinge losses. The normalization factor $Z$ is chosen to guarantee that $\|u'\| = 1$: $Z = \sqrt{1 + \rho^2 \|L_\rho(u)\|_2^2 / \Delta^2}$.

Since the additional coordinates of the instances are non-zero exactly once, the predictions made by the Perceptron algorithm for $x'_t$, $t \in [1, T]$, coincide with those made in the original space for $x_t$, $t \in [1, T]$. In particular, a change made to the additional coordinates of $w$ does not affect any subsequent prediction. Furthermore, by definition of $u'$ and $x'_t$, we can write for any $t \in I$:
$$y_t (u' \cdot x'_t) = \frac{y_t (u \cdot x_t)}{Z} + \frac{l_t \rho}{Z} \geq \frac{y_t (u \cdot x_t)}{Z} + \frac{\rho - y_t(u \cdot x_t)}{Z} = \frac{\rho}{Z},$$
where the inequality results from the definition of $l_t$. Summing up these inequalities for all $t \in I$ and using Lemma 1 yields
$$M_T \frac{\rho}{Z} \leq \Big\| \sum_{t \in I} y_t x'_t \Big\| \leq \sqrt{\sum_{t \in I} \|x'_t\|^2} = \sqrt{R^2 + M_T \Delta^2},$$
where $R^2 = \sum_{t \in I} \|x_t\|^2$. Substituting the value of $Z$ and squaring both sides implies
$$M_T^2 \rho^2 \leq \Big(1 + \frac{\rho^2 \|L_\rho(u)\|_2^2}{\Delta^2}\Big)\big(R^2 + M_T \Delta^2\big) = R^2 + M_T \Delta^2 + \frac{\rho^2 \|L_\rho(u)\|_2^2 R^2}{\Delta^2} + M_T \rho^2 \|L_\rho(u)\|_2^2.$$
Now, choosing $\Delta^2 = \rho \|L_\rho(u)\|_2 R / \sqrt{M_T}$ to minimize this bound further simplifies it to
$$M_T^2 \rho^2 \leq \big( R + \sqrt{M_T}\, \rho \|L_\rho(u)\|_2 \big)^2, \quad \text{that is,} \quad M_T - \sqrt{M_T}\, \|L_\rho(u)\|_2 - \frac{R}{\rho} \leq 0.$$
Solving this second-degree inequality in $\sqrt{M_T}$ proves the first statement of the theorem. The second statement is obtained by first bounding $R$ by $r\sqrt{M_T}$ and then solving the resulting second-degree inequality.

3.3 Discussion

One natural question this survey raises is the respective quality of the $L_1$- and $L_2$-norm bounds. The comparison of (4) and (8) for the $\rho$-margin hinge loss shows that, for a fixed $\rho$, the bounds differ only in the following two quantities:
$$\min_{\|u\| \leq 1} \|L_\rho(u)\|_1 = \min_{\|u\| \leq 1} \sum_{t \in I} \Big(1 - \frac{y_t(u \cdot x_t)}{\rho}\Big)_+ \quad \text{versus} \quad \min_{\|u\| \leq 1} \|L_\rho(u)\|_2^2 = \min_{\|u\| \leq 1} \sum_{t \in I} \Big(1 - \frac{y_t(u \cdot x_t)}{\rho}\Big)_+^2.$$
These two quantities are data-dependent and in general not comparable. For a vector $u$ for which the individual losses $\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+$ are all at most one, we have $\|L_\rho(u)\|_2^2 \leq \|L_\rho(u)\|_1$, while the contrary holds if the individual losses are all larger than one.

4 Generalization Bounds

In this section, we consider the case where the training sample processed is drawn according to some distribution D. Under some mild conditions on the loss function, the hypotheses returned by an on-line learning algorithm can then be combined to define a hypothesis whose generalization error can be bounded in terms of its regret. Such a hypothesis can be determined via cross-validation [Littlestone, 1989] or using the online-to-batch theorem of Cesa-Bianchi et al. [2004]. The latter can be combined with any of the mistake bounds presented in the previous section to derive generalization bounds for the Perceptron predictor. Given $\delta > 0$, a sequence of labeled examples $(x_1, y_1), \ldots, (x_T, y_T)$, a sequence of hypotheses $h_1, \ldots, h_T$, and a loss function L, define the penalized risk minimizing hypothesis as $\hat{h} = h_{\hat{\imath}}$ with
$$\hat{\imath} = \operatorname*{argmin}_{i \in [1, T]} \frac{1}{T - i + 1} \sum_{t=i}^{T} L(y_t h_i(x_t)) + \sqrt{\frac{1}{2(T - i + 1)} \log \frac{T(T+1)}{\delta}}.$$
The following theorem gives a bound on the expected loss of $\hat{h}$ on future examples.

Theorem 4 (Cesa-Bianchi et al. [2004]). Let S be a labeled sample $((x_1, y_1), \ldots, (x_T, y_T))$ drawn i.i.d.
according to D, L a loss function bounded by one, and $h_1, \ldots, h_T$ the sequence of hypotheses generated by an on-line algorithm A sequentially processing S. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:
$$\operatorname*{E}_{(x,y) \sim D} [L(y \hat{h}(x))] \leq \frac{1}{T} \sum_{i=1}^{T} L(y_i h_i(x_i)) + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}. \qquad (9)$$
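The selection of the penalized risk minimizing hypothesis can be sketched as follows. The shape of the confidence penalty is our reconstruction of the Cesa-Bianchi et al. [2004] term from the definition above, and the function name and loss-matrix layout are ours.

```python
import math

def penalized_risk_index(losses, delta):
    """Select the penalized risk minimizing hypothesis index.

    losses[i][t] holds L(y_{t+1} h_{i+1}(x_{t+1})) with values in [0, 1];
    hypothesis h_{i+1} is scored on the remaining examples t = i, ..., T-1
    (0-indexed), i.e. those it was not trained on."""
    T = len(losses)
    best, best_val = 0, float("inf")
    for i in range(T):
        n = T - i  # number of held-out evaluation points for h_{i+1}
        emp = sum(losses[i][i:]) / n
        # confidence penalty: grows as fewer evaluation points remain
        pen = math.sqrt(math.log(T * (T + 1) / delta) / (2 * n))
        if emp + pen < best_val:
            best, best_val = i, emp + pen
    return best
```

The penalty term trades off empirical accuracy against the reliability of the estimate: a late hypothesis is evaluated on few points, so it must beat earlier hypotheses by a margin to be selected.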
KernelPerceptron($\alpha_0$)
 1  $\alpha \leftarrow \alpha_0$            ▷ typically $\alpha_0 = 0$
 2  for $t \leftarrow 1$ to $T$ do
 3      Receive($x_t$)
 4      $\hat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s K(x_s, x_t)\big)$
 5      Receive($y_t$)
 6      if ($\hat{y}_t \neq y_t$) then
 7          $\alpha_t \leftarrow \alpha_t + 1$
 8  return $\alpha$

Fig. 2. Kernel Perceptron algorithm for PDS kernel K.

Note that this theorem does not require the loss function to be convex. Thus, if L is the zero-one loss, then the empirical loss term is precisely the average number of mistakes made by the algorithm. Plugging in any of the mistake bounds from the previous sections then gives us a learning guarantee with respect to the performance of the best hypothesis as measured by a margin loss (or any $\gamma$-admissible loss if using Theorem 2). Let $\hat{w}$ denote the weight vector corresponding to the penalized risk minimizing Perceptron hypothesis chosen from all the intermediate hypotheses generated by the algorithm. Then, in view of Theorem 2, the following corollary holds.

Corollary 3. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any vector $u \in \mathbb{R}^N$ with $\|u\| \leq 1$ and any $\gamma$-admissible loss function $\phi_\gamma$, consider the vector of losses incurred by $u$: $L_{\phi_\gamma}(u) = [\phi_\gamma(y_t(u \cdot x_t))]_{t \in I}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following generalization bound holds for the penalized risk minimizing Perceptron hypothesis $\hat{w}$:
$$\Pr_{(x,y) \sim D} [y(\hat{w} \cdot x) < 0] \leq \inf_{\gamma > 0, \|u\| \leq 1} \frac{\|L_{\phi_\gamma}(u)\|_1}{\phi_\gamma(0)\, T} + \frac{\gamma}{\phi_\gamma(0)\, T} \sqrt{\sum_{t \in I} \|x_t\|^2} + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.$$
Any $\gamma$-admissible loss can be used to derive a more explicit form of this bound in special cases, in particular the hinge loss or the squared hinge loss. Using Theorem 3, we obtain the following $L_2$-norm generalization bound.

Corollary 4. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+\big]_{t \in I}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following generalization bound holds for the penalized risk minimizing Perceptron hypothesis $\hat{w}$:
$$\Pr_{(x,y) \sim D} [y(\hat{w} \cdot x) < 0] \leq \inf_{\rho > 0, \|u\| \leq 1} \frac{1}{T} \left( \frac{\|L_\rho(u)\|_2}{2} + \sqrt{ \frac{\|L_\rho(u)\|_2^2}{4} + \frac{1}{\rho} \sqrt{\sum_{t \in I} \|x_t\|^2} } \right)^2 + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.$$

5 Kernel Perceptron algorithm

The Perceptron algorithm of Figure 1 can be straightforwardly extended to define a non-linear separator using a positive definite symmetric kernel K [Aizerman et al., 1964]. Figure 2 gives the pseudocode of that algorithm, known as the kernel Perceptron algorithm. The classifier $\mathrm{sgn}(h)$ learned by the algorithm is defined by $h \colon x \mapsto \sum_{t=1}^{T} \alpha_t y_t K(x_t, x)$. The results of the previous sections apply similarly to the kernel Perceptron algorithm, with $\|x_t\|^2$ replaced by $K(x_t, x_t)$. In particular, the quantity $\sum_{t \in I} \|x_t\|^2$ appearing in several of the learning guarantees can be replaced with the familiar trace $\mathrm{Tr}[K]$ of the kernel matrix $K = [K(x_i, x_j)]_{i,j \in I}$ over the set of points at which an update is made, a standard term appearing in margin bounds for kernel-based hypothesis sets.
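A direct transcription of Figure 2 in Python might look as follows (our naming; single pass, $\alpha_0 = 0$, with $\mathrm{sgn}(0)$ taken as $+1$). With a linear kernel it reduces to the Perceptron of Figure 1, which makes for an easy sanity check.

```python
def kernel_perceptron(data, K):
    """Kernel Perceptron of Figure 2 with alpha0 = 0.

    data is a list of (x_t, y_t) pairs; K is a PDS kernel.
    Returns the vector alpha of per-example update counts."""
    T = len(data)
    alpha = [0.0] * T
    for t, (x_t, y_t) in enumerate(data):
        # only alpha_s with s < t can be non-zero at this point in the pass
        score = sum(alpha[s] * data[s][1] * K(data[s][0], x_t) for s in range(T))
        y_hat = 1 if score >= 0 else -1  # sgn, with sgn(0) taken as +1
        if y_hat != y_t:
            alpha[t] += 1.0
    return alpha

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))
```

The learned classifier is then $x \mapsto \mathrm{sgn}\big(\sum_t \alpha_t y_t K(x_t, x)\big)$; substituting `linear_kernel` recovers exactly the mistakes the primal algorithm of Figure 1 would make on the same sequence.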
Bibliography

Mark A. Aizerman, E. M. Braverman, and Lev I. Rozonoèr. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.
Nick Littlestone. From on-line to batch learning. In COLT, pages 269-284, 1989.
Albert B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622, 1962.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationLogistic regression and linear classifiers COMS 4771
Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random
More informationCOMS 4771 Lecture Boosting 1 / 16
COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?
More informationAdvanced Machine Learning
Advanced Machine Learning Learning with Large Expert Spaces MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Problem Learning guarantees: R T = O( p T log N). informative even for N very large.
More informationMachine Learning: The Perceptron. Lecture 06
Machine Learning: he Perceptron Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu 1 McCulloch-Pitts Neuron Function 0 1 w 0 activation / output function 1 w 1 w w
More informationLearning with Large Number of Experts: Component Hedge Algorithm
Learning with Large Number of Experts: Component Hedge Algorithm Giulia DeSalvo and Vitaly Kuznetsov Courant Institute March 24th, 215 1 / 3 Learning with Large Number of Experts Regret of RWM is O( T
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationWorst-Case Analysis of the Perceptron and Exponentiated Update Algorithms
Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x))
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard and Mitch Marcus (and lots original slides by
More informationPerceptron (Theory) + Linear Regression
10601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron (Theory) Linear Regression Matt Gormley Lecture 6 Feb. 5, 2018 1 Q&A
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationExponential Weights on the Hypercube in Polynomial Time
European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts
More informationAdvanced Machine Learning
Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:
More informationLecture 16: Perceptron and Exponential Weights Algorithm
EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew
More informationOnline Learning with Abstention. Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 2016a).
+ + + + + + Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 206a). A. Further Related Work Learning with abstention is a useful paradigm in applications where the cost
More informationDeep Boosting. Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) COURANT INSTITUTE & GOOGLE RESEARCH.
Deep Boosting Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Ensemble Methods in ML Combining several base classifiers
More informationTutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning
Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret
More informationName (NetID): (1 Point)
CS446: Machine Learning (D) Spring 2017 March 16 th, 2017 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationBoosting Ensembles of Structured Prediction Rules
Boosting Ensembles of Structured Prediction Rules Corinna Cortes Google Research 76 Ninth Avenue New York, NY 10011 corinna@google.com Vitaly Kuznetsov Courant Institute 251 Mercer Street New York, NY
More informationOn a Theory of Learning with Similarity Functions
Maria-Florina Balcan NINAMF@CS.CMU.EDU Avrim Blum AVRIM@CS.CMU.EDU School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891 Abstract Kernel functions have become
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More informationName (NetID): (1 Point)
CS446: Machine Learning Fall 2016 October 25 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains four
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationLearning Methods for Linear Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationOnline prediction with expert advise
Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationThe FTRL Algorithm with Strongly Convex Regularizers
CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked
More informationOnline Learning with Abstention
Corinna Cortes 1 Giulia DeSalvo 1 Claudio Gentile 1 2 Mehryar Mohri 3 1 Scott Yang 4 Abstract We present an extensive study of a key problem in online learning where the learner can opt to abstain from
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October
More informationRegression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011!
Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011! 1 Todayʼs topics" Learning from Examples: brief review! Univariate Linear Regression! Batch gradient descent! Stochastic gradient
More informationOptimal Teaching for Online Perceptrons
Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,
More information