Perceptron Mistake Bounds

Mehryar Mohri and Afshin Rostamizadeh
Google Research and Courant Institute of Mathematical Sciences

Abstract. We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm. Our novel bounds generalize beyond standard margin-loss type bounds, allow for any convex and Lipschitz loss function, and admit a very simple proof.

1 Introduction

The Perceptron algorithm belongs to the broad family of on-line learning algorithms (see Cesa-Bianchi and Lugosi [2006] for a survey) and admits a large number of variants. The algorithm learns a linear separator by processing the training sample in an on-line fashion, examining a single example at each iteration [Rosenblatt, 1958]. At each round, the current hypothesis is updated if it makes a mistake, that is, if it incorrectly classifies the new training point processed. The full pseudocode of the algorithm is provided in Figure 1. In what follows, we assume that w_0 = 0 and η = 1 for simplicity of presentation; the more general case admits similar guarantees, which can be derived following the same methods presented here.

This paper briefly surveys some existing mistake bounds for the Perceptron algorithm and introduces new ones which can be used to derive generalization bounds in a stochastic setting. A mistake bound is an upper bound on the number of updates, or equivalently the number of mistakes, made by the Perceptron algorithm when processing a sequence of training examples. Here, the bound will be expressed in terms of the performance of any linear separator, including the best. Such mistake bounds can be directly used to derive generalization guarantees for a combined hypothesis, using existing on-line-to-batch techniques.

2 Separable case

The seminal work of Novikoff [1962] gave the first margin-based bound for the Perceptron algorithm, one of the early results in learning theory and probably one of the first based on the notion of margin. Assuming that the data is separable with some margin ρ > 0, Novikoff showed that the number of mistakes made by the Perceptron algorithm can be bounded as a function of the normalized margin ρ/r, where r is the radius of the sphere containing the training instances. We start with a lemma that can be used to prove Novikoff's theorem and that will be used throughout.

Perceptron(w_0)
 1  w_1 ← w_0                          ▷ typically w_0 = 0
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          w_{t+1} ← w_t + y_t x_t    ▷ more generally w_t + η y_t x_t, with η > 0
 8      else w_{t+1} ← w_t
 9  return w_{T+1}

Fig. 1. Perceptron algorithm [Rosenblatt, 1958].

Lemma 1. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing a sequence of training instances x_1, ..., x_T ∈ R^N. Then, the following inequality holds:

    ‖ ∑_{t∈I} y_t x_t ‖ ≤ √( ∑_{t∈I} ‖x_t‖² ).

Proof. The inequality holds using the following sequence of observations:

    ‖ ∑_{t∈I} y_t x_t ‖ = ‖ ∑_{t∈I} (w_{t+1} − w_t) ‖                      (definition of updates)
                        = ‖ w_{T+1} ‖                                      (telescoping sum, w_0 = 0)
                        = √( ∑_{t∈I} (‖w_{t+1}‖² − ‖w_t‖²) )               (telescoping sum, w_0 = 0)
                        = √( ∑_{t∈I} (‖w_t + y_t x_t‖² − ‖w_t‖²) )         (definition of updates)
                        = √( ∑_{t∈I} (2 y_t (w_t · x_t) + ‖x_t‖²) )
                        ≤ √( ∑_{t∈I} ‖x_t‖² ).

The final inequality uses the fact that an update is made at round t only when the current hypothesis makes a mistake, that is, when y_t (w_t · x_t) ≤ 0.
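For concreteness, the following Python sketch (not part of the original paper) implements the Perceptron of Figure 1 with w_0 = 0 and η = 1 and records the set I of update rounds; a small randomly generated sample is then used to check the inequality of Lemma 1 numerically. The data-generating choices are illustrative assumptions only:

    import numpy as np

    def perceptron(X, y, eta=1.0):
        """Perceptron of Figure 1 (w_0 = 0). Returns the final weights and the update rounds I."""
        T, N = X.shape
        w = np.zeros(N)
        I = []                                          # rounds at which an update is made
        for t in range(T):
            y_hat = 1.0 if w @ X[t] >= 0 else -1.0      # sgn(w_t . x_t); the sign at 0 is a convention
            if y_hat != y[t]:                           # mistake: w_{t+1} <- w_t + eta y_t x_t
                w = w + eta * y[t] * X[t]
                I.append(t)
        return w, I

    # Numerical check of Lemma 1 on a random sample (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) >= 0, 1.0, -1.0)
    w, I = perceptron(X, y)
    lhs = np.linalg.norm(X[I].T @ y[I])                 # || sum_{t in I} y_t x_t ||
    rhs = np.sqrt((np.linalg.norm(X[I], axis=1) ** 2).sum())
    assert lhs <= rhs + 1e-9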

The lemma can be used straightforwardly to derive the following mistake bound for the separable setting.

Theorem 1 ([Novikoff, 1962]). Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0. Assume that there exist ρ > 0 and v ∈ R^N, v ≠ 0, such that for all t ∈ [1, T],

    y_t (v · x_t) / ‖v‖ ≥ ρ.

Then, the number of updates made by the Perceptron algorithm when processing x_1, ..., x_T is bounded by r²/ρ².

Proof. Let I denote the subset of the T rounds at which there is an update, and let M be the total number of updates, i.e., |I| = M. Summing up the margin inequalities over I yields:

    M ρ ≤ (v / ‖v‖) · ∑_{t∈I} y_t x_t ≤ ‖ ∑_{t∈I} y_t x_t ‖ ≤ √( ∑_{t∈I} ‖x_t‖² ) ≤ √M r,

where the second inequality holds by the Cauchy-Schwarz inequality, the third by Lemma 1, and the final one by the assumption ‖x_t‖ ≤ r. Comparing the left- and right-hand sides gives √M ≤ r/ρ, that is, M ≤ r²/ρ².
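As a quick sanity check of Theorem 1 (not part of the original paper), the sketch below runs the perceptron helper defined after Figure 1 on a synthetic sample that is separable with margin at least ρ and compares the number of updates with r²/ρ²; the separating direction v and the margin threshold are arbitrary illustrative choices:

    import numpy as np

    # Assumes the perceptron() helper sketched after Figure 1.
    rng = np.random.default_rng(1)
    v = np.array([2.0, -1.0, 0.5])                       # target direction (unknown to the learner)
    X = rng.normal(size=(1000, 3))
    margins = X @ v / np.linalg.norm(v)
    mask = np.abs(margins) >= 0.2                        # enforce a margin of at least 0.2
    X, y = X[mask], np.sign(margins[mask])

    _, I = perceptron(X, y)
    r = np.linalg.norm(X, axis=1).max()
    rho = np.abs(X @ v / np.linalg.norm(v)).min()
    assert len(I) <= (r / rho) ** 2                      # Theorem 1: M <= r^2 / rho^2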

3 Non-separable case

In real-world problems, the training sample processed by the Perceptron algorithm is typically not linearly separable. Nevertheless, it is possible to give a margin-based mistake bound in that general case in terms of the radius of the sphere containing the sample and the margin-based loss of an arbitrary weight vector. We present two different types of bounds: first, a bound that depends on the L1-norm of the vector of ρ-margin hinge losses (or of the more general losses described below); next, a bound that depends on the L2-norm of the vector of margin losses, which extends the original results presented by Freund and Schapire [1999].

3.1 L1-norm mistake bounds

We first present a simple proof of a mistake bound for the Perceptron algorithm that depends on the L1-norm of the losses incurred by an arbitrary weight vector, for a general definition of the loss function that covers the ρ-margin hinge loss. The family of admissible loss functions is quite general and defined as follows.

Definition 1 (γ-admissible loss function). A γ-admissible loss function φ_γ: R → R satisfies the following conditions:
1. The function φ_γ is convex.
2. φ_γ is non-negative: for all x ∈ R, φ_γ(x) ≥ 0.
3. At zero, φ_γ is strictly positive: φ_γ(0) > 0.
4. φ_γ is γ-Lipschitz: |φ_γ(x) − φ_γ(y)| ≤ γ |x − y|, for some γ > 0.

These are mild conditions satisfied by many loss functions, including the hinge loss, the squared hinge loss, the Huber loss, and general p-norm losses over bounded domains.

Theorem 2. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T ∈ R^N. For any vector u ∈ R^N with ‖u‖ ≤ 1 and any γ-admissible loss function φ_γ, consider the vector of losses incurred by u: L_{φ_γ}(u) = [φ_γ(y_t (u · x_t))]_{t∈I}. Then, the number of updates M_T = |I| made by the Perceptron algorithm can be bounded as follows:

    M_T ≤ inf_{γ>0, ‖u‖≤1} [ ‖L_{φ_γ}(u)‖_1 / φ_γ(0) + (γ / φ_γ(0)) √( ∑_{t∈I} ‖x_t‖² ) ].    (1)

If we further assume that ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0, this implies

    M_T ≤ inf_{γ>0, ‖u‖≤1} [ γ r / (2 φ_γ(0)) + √( (γ r / (2 φ_γ(0)))² + ‖L_{φ_γ}(u)‖_1 / φ_γ(0) ) ]².    (2)

Proof. For all γ > 0 and u with ‖u‖ ≤ 1, the following statements hold. By convexity of φ_γ, we have

    (1/M_T) ∑_{t∈I} φ_γ(y_t (u · x_t)) ≥ φ_γ(u · z),  where z = (1/M_T) ∑_{t∈I} y_t x_t.

Then, by the Lipschitz property of φ_γ,

    φ_γ(u · z) = φ_γ(0) + (φ_γ(u · z) − φ_γ(0)) ≥ φ_γ(0) − |φ_γ(u · z) − φ_γ(0)| ≥ φ_γ(0) − γ |u · z|.

Combining the two inequalities above and multiplying both sides by M_T implies

    M_T φ_γ(0) ≤ ∑_{t∈I} φ_γ(y_t (u · x_t)) + γ | ∑_{t∈I} y_t (u · x_t) |.

Finally, using the Cauchy-Schwarz inequality and Lemma 1 yields

    | ∑_{t∈I} y_t (u · x_t) | = | u · ∑_{t∈I} y_t x_t | ≤ ‖u‖ ‖ ∑_{t∈I} y_t x_t ‖ ≤ √( ∑_{t∈I} ‖x_t‖² ),

which completes the proof of the first statement after re-arranging terms. If it is further assumed that ‖x_t‖ ≤ r for all t ∈ I, then this implies

    M_T φ_γ(0) − γ r √(M_T) − ∑_{t∈I} φ_γ(y_t (u · x_t)) ≤ 0.

Solving this quadratic expression in terms of √(M_T) proves the second statement.

It is straightforward to see that the ρ-margin hinge loss φ_ρ(x) = (1 − x/ρ)_+ is (1/ρ)-admissible with φ_ρ(0) = 1 for all ρ > 0, which gives the following corollary.

Corollary 1. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T ∈ R^N. For any ρ > 0 and any u ∈ R^N with ‖u‖ ≤ 1, consider the vector of ρ-margin hinge losses incurred by u: L_ρ(u) = [(1 − y_t (u · x_t)/ρ)_+]_{t∈I}. Then, the number of updates M_T = |I| made by the Perceptron algorithm can be bounded as follows:

    M_T ≤ inf_{ρ>0, ‖u‖≤1} [ ‖L_ρ(u)‖_1 + (1/ρ) √( ∑_{t∈I} ‖x_t‖² ) ].    (3)

If we further assume that ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0, this implies

    M_T ≤ inf_{ρ>0, ‖u‖≤1} [ r / (2ρ) + √( (r / (2ρ))² + ‖L_ρ(u)‖_1 ) ]².    (4)
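The right-hand side of (3) is straightforward to evaluate for any particular choice of u and ρ, which is useful for comparing the bound with the observed number of updates. The sketch below (an illustration, not from the paper) computes it using the perceptron helper introduced earlier; the specific u and ρ passed in are arbitrary, whereas the bound holds with the infimum over all of them:

    import numpy as np

    def l1_hinge_bound(X, y, I, u, rho):
        """Right-hand side of bound (3) for a fixed u (with ||u|| <= 1) and rho > 0."""
        XI, yI = X[I], np.asarray(y)[I]
        hinge = np.maximum(0.0, 1.0 - yI * (XI @ u) / rho)          # rho-margin hinge losses on I
        return hinge.sum() + np.sqrt((np.linalg.norm(XI, axis=1) ** 2).sum()) / rho

    # Usage: len(I) <= l1_hinge_bound(X, y, I, u, rho) for any ||u|| <= 1 and rho > 0,
    # where I is the list of update rounds returned by perceptron(X, y).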

The mistake bound (3) appears already in Cesa-Bianchi et al. [2004], but we could not find its proof either in that paper or in those it references for this bound. Another application of Theorem 2 is to the ρ-margin squared hinge loss φ_ρ(x) = ((1 − x/ρ)_+)². Assume that ‖x‖ ≤ r. Then the inequality |y (u · x)| ≤ ‖u‖ ‖x‖ ≤ r implies that the derivative of the squared hinge loss is also bounded on the relevant domain, achieving a maximum absolute value of (2/ρ)(1 + r/ρ) ≤ 4r/ρ² (for ρ ≤ r). Thus, the ρ-margin squared hinge loss is (4r/ρ²)-admissible with φ_ρ(0) = 1 for all ρ. This leads to the following corollary.

Corollary 2. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T ∈ R^N with ‖x_t‖ ≤ r for all t ∈ [1, T]. For any ρ > 0 and any u ∈ R^N with ‖u‖ ≤ 1, consider the vector of ρ-margin squared hinge losses incurred by u: L_ρ(u) = [((1 − y_t (u · x_t)/ρ)_+)²]_{t∈I}. Then, the number of updates M_T = |I| made by the Perceptron algorithm can be bounded as follows:

    M_T ≤ inf_{ρ>0, ‖u‖≤1} [ ‖L_ρ(u)‖_1 + (4r/ρ²) √( ∑_{t∈I} ‖x_t‖² ) ].    (5)

This also implies

    M_T ≤ inf_{ρ>0, ‖u‖≤1} [ 2r²/ρ² + √( (2r²/ρ²)² + ‖L_ρ(u)‖_1 ) ]².    (6)

Theorem 2 can be similarly used to derive mistake bounds in terms of other admissible losses.
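As a small check of the Lipschitz constants invoked above (added here for illustration; the constants are as reconstructed in the preceding paragraph), the following sketch verifies numerically that on [−r, r] the ρ-margin hinge loss is (1/ρ)-Lipschitz and the ρ-margin squared hinge loss is (4r/ρ²)-Lipschitz when ρ ≤ r:

    import numpy as np

    r, rho = 2.0, 0.5                                         # illustrative values with rho <= r
    x = np.linspace(-r, r, 200001)
    hinge = np.maximum(0.0, 1.0 - x / rho)                    # rho-margin hinge loss
    sq_hinge = hinge ** 2                                     # rho-margin squared hinge loss
    max_slope = lambda f: np.max(np.abs(np.diff(f) / np.diff(x)))
    assert max_slope(hinge) <= 1.0 / rho + 1e-6               # (1/rho)-Lipschitz
    assert max_slope(sq_hinge) <= 4.0 * r / rho ** 2 + 1e-6   # (4r/rho^2)-Lipschitz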

3.2 L2-norm mistake bounds

The original results of this section are due to Freund and Schapire [1999]. Here, we extend their proof to derive finer mistake bounds for the Perceptron algorithm in terms of the L2-norm of the vector of hinge losses of an arbitrary weight vector at the points where an update is made.

Theorem 3. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T ∈ R^N. For any ρ > 0 and any u ∈ R^N with ‖u‖ ≤ 1, consider the vector of ρ-margin hinge losses incurred by u: L_ρ(u) = [(1 − y_t (u · x_t)/ρ)_+]_{t∈I}. Then, the number of updates M_T = |I| made by the Perceptron algorithm can be bounded as follows:

    M_T ≤ inf_{ρ>0, ‖u‖≤1} [ ‖L_ρ(u)‖_2 / 2 + √( ‖L_ρ(u)‖_2² / 4 + (1/ρ) √( ∑_{t∈I} ‖x_t‖² ) ) ]².    (7)

If we further assume that ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0, this implies

    M_T ≤ inf_{ρ>0, ‖u‖≤1} [ r/ρ + ‖L_ρ(u)‖_2 ]².    (8)

Proof. We first reduce the problem to the separable case by mapping each input vector x_t ∈ R^N to a vector x'_t ∈ R^{N+T} defined by

    x'_t = (x_{t,1}, ..., x_{t,N}, 0, ..., 0, Δ, 0, ..., 0),

where the first N components of x'_t coincide with those of x_t and the only other non-zero component is the (N + t)-th component, which is set to Δ, a parameter whose value will be determined later. Define l_t by l_t = (1 − y_t (u · x_t)/ρ)_+. Then, the vector u is replaced by the vector u' defined by

    u' = (u_1/Z, ..., u_N/Z, y_1 l_1 ρ/(Δ Z), ..., y_T l_T ρ/(Δ Z)).

The first N components of u' are equal to the components of u/Z and the remaining T components are functions of the labels and hinge losses. The normalization factor Z is chosen to guarantee that ‖u'‖ ≤ 1:

    Z = √( 1 + ρ² ‖L_ρ(u)‖_2² / Δ² ).

Since the additional coordinates of the instances are non-zero exactly once, the predictions made by the Perceptron algorithm for x'_t, t ∈ [1, T], coincide with those made in the original space for x_t, t ∈ [1, T]. In particular, a change made to the additional coordinates of w does not affect any subsequent prediction. Furthermore, by definition of u' and x'_t, we can write for any t ∈ I:

    y_t (u' · x'_t) = y_t (u · x_t)/Z + l_t ρ/Z ≥ y_t (u · x_t)/Z + (ρ − y_t (u · x_t))/Z = ρ/Z,

where the inequality results from the definition of l_t. Summing up these inequalities for all t ∈ I and using the Cauchy-Schwarz inequality and Lemma 1, as in the proof of Theorem 1, yields

    M_T ρ/Z ≤ ∑_{t∈I} y_t (u' · x'_t) ≤ √( ∑_{t∈I} ‖x'_t‖² ) = √( R² + M_T Δ² ),

where R² = ∑_{t∈I} ‖x_t‖². Substituting the value of Z and re-writing in terms of R implies

    M_T ρ ≤ √( (1 + ρ² ‖L_ρ(u)‖_2² / Δ²)(R² + M_T Δ²) ).

Now, solving for Δ to minimize this bound gives Δ² = ρ ‖L_ρ(u)‖_2 R / √(M_T), which further simplifies the bound to

    M_T ρ ≤ √( R² + 2 ρ ‖L_ρ(u)‖_2 R √(M_T) + M_T ρ² ‖L_ρ(u)‖_2² ) = R + √(M_T) ρ ‖L_ρ(u)‖_2,

that is, M_T ≤ R/ρ + √(M_T) ‖L_ρ(u)‖_2. Solving the second-degree inequality M_T − √(M_T) ‖L_ρ(u)‖_2 − R/ρ ≤ 0 in terms of √(M_T) proves the first statement of the theorem. The second statement is obtained by first bounding R by r √(M_T) and then solving the resulting second-degree inequality.
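To make the statement concrete, here is an illustrative helper (not from the paper) that evaluates the right-hand sides of (7) and (8) for a fixed u and ρ; as before, the bounds hold for any admissible choice, and the infimum over u and ρ can only make them tighter:

    import numpy as np

    def l2_hinge_bounds(X, y, I, u, rho):
        """Right-hand sides of bounds (7) and (8) for a fixed u (||u|| <= 1) and rho > 0."""
        XI, yI = X[I], np.asarray(y)[I]
        L2 = np.linalg.norm(np.maximum(0.0, 1.0 - yI * (XI @ u) / rho))   # ||L_rho(u)||_2
        R = np.sqrt((np.linalg.norm(XI, axis=1) ** 2).sum())              # sqrt(sum_{t in I} ||x_t||^2)
        r = np.linalg.norm(X, axis=1).max()                               # radius of the sample
        bound7 = (L2 / 2 + np.sqrt(L2 ** 2 / 4 + R / rho)) ** 2
        bound8 = (r / rho + L2) ** 2
        return bound7, bound8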

3.3 Discussion

One natural question this survey raises is the respective quality of the L1- and L2-norm bounds. The comparison of (4) and (8) for the ρ-margin hinge loss shows that, for a fixed ρ, the bounds differ only by the following two quantities:

    min_{‖u‖≤1} ‖L_ρ(u)‖_1  = min_{‖u‖≤1} ∑_{t∈I} (1 − y_t (u · x_t)/ρ)_+
    min_{‖u‖≤1} ‖L_ρ(u)‖_2² = min_{‖u‖≤1} ∑_{t∈I} ((1 − y_t (u · x_t)/ρ)_+)².

These two quantities are data-dependent and in general not comparable. For a vector u for which the individual losses (1 − y_t (u · x_t)/ρ)_+ are all at most one, we have ‖L_ρ(u)‖_2² ≤ ‖L_ρ(u)‖_1, while the opposite inequality holds if the individual losses are all larger than one.

4 Generalization Bounds

In this section, we consider the case where the training sample processed is drawn according to some distribution D. Under some mild conditions on the loss function, the hypotheses returned by an on-line learning algorithm can then be combined to define a hypothesis whose generalization error can be bounded in terms of its regret. Such a hypothesis can be determined via cross-validation [Littlestone, 1989] or using the online-to-batch theorem of Cesa-Bianchi et al. [2004]. The latter can be combined with any of the mistake bounds presented in the previous section to derive generalization bounds for the Perceptron predictor.

Given δ > 0, a sequence of labeled examples (x_1, y_1), ..., (x_T, y_T), a sequence of hypotheses h_1, ..., h_T, and a loss function L, define the penalized risk minimizing hypothesis as ĥ = h_{i*} with

    i* = argmin_{i∈[1,T]} [ (1/(T − i + 1)) ∑_{t=i}^{T} L(y_t h_i(x_t)) + √( (1/(2(T − i + 1))) log(T(T + 1)/δ) ) ].

The following theorem gives a bound on the expected loss of ĥ on future examples.

Theorem 4 (Cesa-Bianchi et al. [2004]). Let S be a labeled sample ((x_1, y_1), ..., (x_T, y_T)) drawn i.i.d. according to D, L a loss function bounded by one, and h_1, ..., h_T the sequence of hypotheses generated by an on-line algorithm A sequentially processing S. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

    E_{(x,y)∼D}[L(y ĥ(x))] ≤ (1/T) ∑_{i=1}^{T} L(y_i h_i(x_i)) + 6 √( (1/T) log(2(T + 1)/δ) ).    (9)
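A small illustrative implementation of this selection rule (under the reconstruction of the display above, so treat the exact constants as indicative) is sketched below; it assumes a precomputed array losses with losses[i][t] = L(y_t h_i(x_t)) and uses 0-based indices:

    import numpy as np

    def penalized_risk_minimizer(losses, delta):
        """Return the index i* of the penalized risk minimizing hypothesis (0-based indices)."""
        T = losses.shape[0]
        scores = np.empty(T)
        for i in range(T):
            m = T - i                                              # number of examples t = i, ..., T-1
            emp = losses[i, i:].mean()                             # empirical risk of h_i on the suffix
            pen = np.sqrt(np.log(T * (T + 1) / delta) / (2 * m))   # confidence penalty
            scores[i] = emp + pen
        return int(np.argmin(scores))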

Note that Theorem 4 does not require the loss function to be convex. Thus, if L is the zero-one loss, the empirical loss term is precisely the average number of mistakes made by the algorithm. Plugging in any of the mistake bounds from the previous sections then gives a learning guarantee with respect to the performance of the best hypothesis as measured by a margin loss (or any γ-admissible loss if using Theorem 2). Let ŵ denote the weight vector corresponding to the penalized risk minimizing Perceptron hypothesis chosen from all the intermediate hypotheses generated by the algorithm. Then, in view of Theorem 2, the following corollary holds.

Corollary 3. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T ∈ R^N. For any vector u ∈ R^N with ‖u‖ ≤ 1 and any γ-admissible loss function φ_γ, consider the vector of losses incurred by u: L_{φ_γ}(u) = [φ_γ(y_t (u · x_t))]_{t∈I}. Then, for any δ > 0, with probability at least 1 − δ, the following generalization bound holds for the penalized risk minimizing Perceptron hypothesis ŵ:

    Pr_{(x,y)∼D}[y (ŵ · x) < 0] ≤ inf_{γ>0, ‖u‖≤1} [ ‖L_{φ_γ}(u)‖_1 / (φ_γ(0) T) + (γ / (φ_γ(0) T)) √( ∑_{t∈I} ‖x_t‖² ) ] + 6 √( (1/T) log(2(T + 1)/δ) ).

Any γ-admissible loss can be used to derive a more explicit form of this bound in special cases, in particular the hinge loss or the squared hinge loss.
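For illustration only (the quantities follow the reconstruction of Corollary 3 above, instantiated with the ρ-margin hinge loss), the right-hand side of this bound can be evaluated for a fixed u, ρ, and confidence δ as follows:

    import numpy as np

    def perceptron_generalization_bound(X, y, I, u, rho, delta):
        """RHS of Corollary 3 for the rho-margin hinge loss and a fixed u, rho (illustrative)."""
        T = len(X)
        XI, yI = X[I], np.asarray(y)[I]
        hinge = np.maximum(0.0, 1.0 - yI * (XI @ u) / rho)
        mistake_term = (hinge.sum() + np.sqrt((np.linalg.norm(XI, axis=1) ** 2).sum()) / rho) / T
        confidence_term = 6.0 * np.sqrt(np.log(2 * (T + 1) / delta) / T)
        return mistake_term + confidence_term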

Using Theorem 3, we obtain the following L2-norm generalization bound.

Corollary 4. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T ∈ R^N. For any ρ > 0 and any u ∈ R^N with ‖u‖ ≤ 1, consider the vector of ρ-margin hinge losses incurred by u: L_ρ(u) = [(1 − y_t (u · x_t)/ρ)_+]_{t∈I}. Then, for any δ > 0, with probability at least 1 − δ, the following generalization bound holds for the penalized risk minimizing Perceptron hypothesis ŵ:

    Pr_{(x,y)∼D}[y (ŵ · x) < 0] ≤ inf_{ρ>0, ‖u‖≤1} (1/T) [ ‖L_ρ(u)‖_2 / 2 + √( ‖L_ρ(u)‖_2² / 4 + (1/ρ) √( ∑_{t∈I} ‖x_t‖² ) ) ]² + 6 √( (1/T) log(2(T + 1)/δ) ).

5 Kernel Perceptron algorithm

The Perceptron algorithm of Figure 1 can be straightforwardly extended to define a non-linear separator using a positive definite symmetric (PDS) kernel K [Aizerman et al., 1964]. Figure 2 gives the pseudocode of that algorithm, known as the kernel Perceptron algorithm. The classifier sgn(h) learned by the algorithm is defined by h: x ↦ ∑_{t=1}^{T} α_t y_t K(x_t, x). The results of the previous sections apply similarly to the kernel Perceptron algorithm, with ‖x_t‖² replaced by K(x_t, x_t). In particular, the quantity ∑_{t∈I} ‖x_t‖² appearing in several of the learning guarantees can be replaced by the familiar trace Tr[K] of the kernel matrix K = [K(x_i, x_j)]_{i,j∈I} restricted to the points at which an update is made, a standard term appearing in margin bounds for kernel-based hypothesis sets.

KernelPerceptron(α_0)
 1  α ← α_0                                         ▷ typically α_0 = 0
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(∑_{s=1}^{T} α_s y_s K(x_s, x_t))
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          α_t ← α_t + 1
 8  return α

Fig. 2. Kernel Perceptron algorithm for a PDS kernel K.
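As a closing illustration (again not part of the original paper), here is a minimal Python sketch of the kernel Perceptron of Figure 2 together with the trace term that replaces ∑_{t∈I} ‖x_t‖² in the kernel setting; the Gaussian kernel at the end is just one possible choice of PDS kernel:

    import numpy as np

    def kernel_perceptron(X, y, K):
        """Kernel Perceptron of Figure 2: alpha[t] counts the updates made at round t."""
        T = len(X)
        alpha = np.zeros(T)
        G = np.array([[K(X[s], X[t]) for t in range(T)] for s in range(T)])   # Gram matrix
        I = []                                              # update rounds
        for t in range(T):
            score = np.sum(alpha * y * G[:, t])             # sum_s alpha_s y_s K(x_s, x_t)
            y_hat = 1.0 if score >= 0 else -1.0
            if y_hat != y[t]:
                alpha[t] += 1.0
                I.append(t)
        return alpha, I                                     # classifier: x -> sgn(sum_t alpha_t y_t K(x_t, x))

    def trace_term(X, I, K):
        """Tr[K] over the update rounds: replaces sum_{t in I} ||x_t||^2 in the kernel bounds."""
        return sum(K(X[t], X[t]) for t in I)

    # Example kernel: Gaussian kernel, for which K(x, x) = 1 and the trace term equals |I|.
    K_gauss = lambda a, b: float(np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2))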

Bibliography

Mark A. Aizerman, E. M. Braverman, and Lev I. Rozonoèr. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.

Nick Littlestone. From on-line to batch learning. In COLT, pages 269-284, 1989.

Albert B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622, 1962.

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.
