Foundations of Machine Learning: On-Line Learning. Mehryar Mohri, Courant Institute and Google Research. mohri@cims.nyu.edu

Motivation. PAC learning: distribution fixed over time (training and test); IID assumption. On-line learning: no distributional assumption; worst-case (adversarial) analysis; training and test mixed. Performance measures: mistake model, regret.

This Lecture: prediction with expert advice; linear classification.

General On-Line Setting. For t = 1 to T: receive instance x_t ∈ X; predict ŷ_t ∈ Y; receive label y_t ∈ Y; incur loss L(ŷ_t, y_t). Classification: Y = {0, 1}, L(y', y) = |y' − y|. Regression: Y ⊆ R, L(y', y) = (y' − y)². Objective: minimize the total loss Σ_{t=1}^T L(ŷ_t, y_t).

Prediction with Expert Advice. For t = 1 to T: receive instance x_t ∈ X and advice y_{t,i} ∈ Y, i ∈ [1, N]; predict ŷ_t ∈ Y; receive label y_t ∈ Y; incur loss L(ŷ_t, y_t). Objective: minimize the regret, i.e., the difference between the total loss incurred and that of the best expert:
Regret(T) = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i=1,…,N} Σ_{t=1}^T L(y_{t,i}, y_t).

Mistake Bound Model. Definition: the maximum number of mistakes a learning algorithm L makes to learn a concept c is M_L(c) = max_{x_1,…,x_T} mistakes(L, c). Definition: for any concept class C, the maximum number of mistakes L makes is M_L(C) = max_{c∈C} M_L(c). A mistake bound is a bound M on M_L(C).

Halving Algorithm (see Mitchell, 1997).
Halving(H)
   H_1 ← H
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← MajorityVote(H_t, x_t)
      Receive(y_t)
      if ŷ_t ≠ y_t then
         H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
   return H_{T+1}

Halving Algorithm - Bound (Littlestone, 1988). Theorem: Let H be a finite hypothesis set; then M_Halving(H) ≤ log₂ |H|. Proof: at each mistake, the hypothesis set is reduced at least by half.
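The Halving scheme above is short enough to state directly in code. Below is a minimal Python sketch over an illustrative finite class of threshold functions (the class, data, and names are our construction, not from the slides); it filters the hypothesis set only on mistakes, exactly as in the pseudocode, so the number of mistakes cannot exceed log₂ |H|.

```python
import math

def halving(hypotheses, stream):
    """Halving algorithm: predict by majority vote over the surviving
    hypotheses; on a mistake, discard every hypothesis that erred."""
    H = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        ones = sum(1 for h in H if h(x) == 1)
        y_hat = 1 if 2 * ones >= len(H) else 0   # majority vote
        if y_hat != y:
            mistakes += 1
            H = [h for h in H if h(x) == y]      # at least half removed
    return mistakes, H

# Illustrative class: 16 threshold functions h_k(x) = 1 iff x >= k.
hypotheses = [(lambda x, k=k: 1 if x >= k else 0) for k in range(16)]
target = hypotheses[11]
stream = [(x, target(x)) for x in [3, 14, 9, 12, 10, 11, 0, 15]]
mistakes, survivors = halving(hypotheses, stream)
```

On this toy run the algorithm makes 2 mistakes, within the log₂ 16 = 4 bound.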

VC Dimension Lower Bound (Littlestone, 1988). Theorem: Let opt(H) be the optimal mistake bound for H. Then,
VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂ |H|.
Proof: for a fully shattered set, an adversary can force mistakes along a complete binary tree of height VCdim(H).

Weighted Majority Algorithm (Littlestone and Warmuth, 1988). Here y_t, y_{t,i} ∈ {0, 1} and β ∈ [0, 1).
Weighted-Majority(N experts)
   for i ← 1 to N do
      w_{1,i} ← 1
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← 1 if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} else 0   (weighted majority vote)
      Receive(y_t)
      if ŷ_t ≠ y_t then
         for i ← 1 to N do
            if y_{t,i} ≠ y_t then w_{t+1,i} ← β w_{t,i}
            else w_{t+1,i} ← w_{t,i}
   return w_{T+1}
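A direct Python rendering of the pseudocode may help; the expert data below is illustrative (our construction, not from the slides), with one perfect expert so that the realizable-case bound applies.

```python
import math

def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: predictions and labels in {0, 1}, beta in [0, 1).
    The weight of each erring expert is multiplied by beta after each
    algorithm mistake. Returns (number of mistakes, final weights)."""
    N = len(expert_preds[0])
    w = [1.0] * N
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        w1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        w0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        y_hat = 1 if w1 >= w0 else 0               # weighted majority vote
        if y_hat != y:
            mistakes += 1
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes, w

# Illustrative run: expert 0 is perfect, the others are not.
T, N = 20, 5
labels = [t % 2 for t in range(T)]
preds = [[labels[t], 0, 1, 1 - labels[t], (t // 2) % 2] for t in range(T)]
mistakes, w = weighted_majority(preds, labels)
```

Since the best expert makes no mistake here, the theorem on the next slide gives mistakes ≤ log N / log(2/(1+β)).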

Weighted Majority - Bound. Theorem: Let m_t be the number of mistakes made by the WM algorithm up to time t and m*_t that of the best expert. Then, for all t,
m_t ≤ (log N + m*_t log(1/β)) / log(2/(1+β)).
Thus, m_t ≤ O(log N) + constant × (mistakes of best expert). Realizable case: m_t ≤ O(log N). Halving algorithm: β = 0.

Weighted Majority - Proof. Potential: Φ_t = Σ_{i=1}^N w_{t,i}. Upper bound: after each error,
Φ_{t+1} ≤ (1/2 + β/2) Φ_t = ((1+β)/2) Φ_t; thus, Φ_t ≤ ((1+β)/2)^{m_t} N.
Lower bound: for any expert i, Φ_t ≥ w_{t,i} = β^{m_{t,i}}. Comparison:
β^{m*_t} ≤ ((1+β)/2)^{m_t} N ⟹ m*_t log β ≤ log N + m_t log((1+β)/2) ⟹ m_t log(2/(1+β)) ≤ log N + m*_t log(1/β).

Weighted Majority - Notes. Advantage: remarkable bound requiring no assumption. Disadvantage: with the binary loss, no deterministic algorithm can achieve a regret R_T = o(T); better guarantees hold for randomized WM, and for WM with convex losses.

Exponential Weighted Average. Algorithm: weight update w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)}; thus w_{t+1,i} = e^{−η L_{t,i}}, where L_{t,i} is the total loss incurred by expert i up to time t. Prediction: ŷ_t = Σ_{i=1}^N w_{t,i} y_{t,i} / Σ_{i=1}^N w_{t,i}. Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, …, y_T ∈ Y, the regret at T satisfies
Regret(T) ≤ log N/η + ηT/8.
For η = √(8 log N/T), Regret(T) ≤ √((T/2) log N).
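A sketch of the forecaster in Python, using the squared loss (p − y)², which is convex in its first argument and takes values in [0, 1] when predictions and labels lie in [0, 1]; the toy data is an assumed setup of ours, with one perfect expert.

```python
import math

def exp_weighted_average(expert_preds, labels, eta):
    """Exponentially weighted average forecaster with L(p, y) = (p - y)^2.
    expert_preds[t][i] in [0, 1]; labels[t] in [0, 1]. Returns the
    algorithm's cumulative loss and each expert's cumulative loss."""
    N = len(expert_preds[0])
    w = [1.0] * N
    total = 0.0
    expert_loss = [0.0] * N
    for preds, y in zip(expert_preds, labels):
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
        total += (y_hat - y) ** 2
        for i, p in enumerate(preds):
            expert_loss[i] += (p - y) ** 2
            w[i] *= math.exp(-eta * (p - y) ** 2)   # multiplicative update
    return total, expert_loss

# Assumed toy data: expert 0 is always right, the rest are poor.
T, N = 200, 4
labels = [t % 2 for t in range(T)]
preds = [[labels[t], 0.0, 1.0, 0.5] for t in range(T)]
eta = math.sqrt(8 * math.log(N) / T)
total, expert_loss = exp_weighted_average(preds, labels, eta)
regret = total - min(expert_loss)
```

With η = √(8 log N / T), the theorem guarantees regret ≤ √((T/2) log N) ≈ 11.8 here.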

Exponential Weighted Avg - Proof. Potential: Φ_t = log Σ_{i=1}^N w_{t,i}. Upper bound (expectations are with respect to the distribution defined by the normalized weights w_{t−1}):
Φ_t − Φ_{t−1} = log [ Σ_{i=1}^N w_{t−1,i} e^{−η L(y_{t,i}, y_t)} / Σ_{i=1}^N w_{t−1,i} ] = log E_{w_{t−1}}[e^{−η L(y_{t,i}, y_t)}]
= log E_{w_{t−1}}[ exp( −η (L(y_{t,i}, y_t) − E_{w_{t−1}}[L(y_{t,i}, y_t)]) ) ] − η E_{w_{t−1}}[L(y_{t,i}, y_t)]
≤ η²/8 − η E_{w_{t−1}}[L(y_{t,i}, y_t)]   (Hoeffding's inequality)
≤ η²/8 − η L(E_{w_{t−1}}[y_{t,i}], y_t)   (convexity of L in its first argument)
= η²/8 − η L(ŷ_t, y_t).

Exponential Weighted Avg - Proof. Upper bound: summing the inequalities over t yields Φ_T − Φ_0 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8. Lower bound:
Φ_T − Φ_0 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{i=1,…,N} e^{−η L_{T,i}} − log N = −η min_{i=1,…,N} L_{T,i} − log N.
Comparison: combining the two bounds gives
−η min_i L_{T,i} − log N ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8,
that is, Σ_{t=1}^T L(ŷ_t, y_t) ≤ min_{i=1,…,N} L_{T,i} + log N/η + ηT/8.

Exponential Weighted Avg - Notes. Advantage: the bound on the regret per round is of the form R_T/T = O(√(log N / T)). Disadvantage: the choice of η requires knowledge of the horizon T.

Doubling Trick. Idea: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k, k = 0, …, n, with T ≤ 2^{n+1} − 1, and choose η_k = √(8 log N / 2^k) in each period. Theorem: with the same assumptions as before, for any T,
Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √(log N / 2).

Doubling Trick - Proof. By the previous theorem, for any period I_k = [2^k, 2^{k+1} − 1],
L_{I_k} − min_{i=1,…,N} L_{I_k,i} ≤ √((2^k/2) log N).
Thus, with L_T = Σ_{k=0}^n L_{I_k},
L_T ≤ Σ_{k=0}^n min_i L_{I_k,i} + Σ_{k=0}^n √(2^k (log N)/2) ≤ min_{i=1,…,N} L_{T,i} + Σ_{k=0}^n √(2^k (log N)/2),
and
Σ_{k=0}^n 2^{k/2} = (2^{(n+1)/2} − 1)/(√2 − 1) ≤ (√(2(T+1)) − 1)/(√2 − 1) ≤ √2 √T/(√2 − 1) + 1.
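The period structure above can be sketched as a wrapper that restarts the forecaster with a fresh η_k at each boundary. This version uses the absolute loss |p − y| (convex, with values in [0, 1]) and resets the weights at each period start, the standard reading of the trick; the toy data is again an assumed setup of ours.

```python
import math

def ewa_doubling(expert_preds, labels):
    """EWA with the doubling trick: period k covers 2^k rounds and uses
    eta_k = sqrt(8 log N / 2^k); weights are reset at each period start.
    Loss is |p - y|, with values in [0, 1]. Returns the cumulative loss."""
    N = len(expert_preds[0])
    T = len(labels)
    total, t, k = 0.0, 0, 0
    while t < T:
        length = 2 ** k                            # period k has 2^k rounds
        eta = math.sqrt(8 * math.log(N) / length)
        w = [1.0] * N                              # restart the weights
        for s in range(t, min(t + length, T)):
            preds, y = expert_preds[s], labels[s]
            y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
            total += abs(y_hat - y)
            w = [wi * math.exp(-eta * abs(p - y)) for wi, p in zip(w, preds)]
        t += length
        k += 1
    return total

T, N = 255, 4
labels = [t % 2 for t in range(T)]
preds = [[labels[t], 0.0, 1.0, 0.5] for t in range(T)]
total = ewa_doubling(preds, labels)
```

Expert 0 incurs zero loss, so the theorem bounds the cumulative loss itself by (√2/(√2 − 1))√((T/2) log N) + √(log N/2).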

Notes. The doubling trick is used in a variety of other contexts and proofs. More general method, learning parameter as a function of time: η_t = √((8 log N)/t). This yields a constant-factor improvement: Regret(T) ≤ 2√((T/2) log N) + √((1/8) log N).

This Lecture: prediction with expert advice; linear classification.

Perceptron Algorithm (Rosenblatt, 1958).
Perceptron(w_0)
   w_1 ← w_0   (typically w_0 = 0)
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(w_t · x_t)
      Receive(y_t)
      if ŷ_t ≠ y_t then
         w_{t+1} ← w_t + y_t x_t   (more generally w_{t+1} ← w_t + η y_t x_t, η > 0)
      else w_{t+1} ← w_t
   return w_{T+1}
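In code, one online pass of the pseudocode looks as follows; the four training points are an illustrative linearly separable toy set of ours.

```python
def perceptron(data, w0=None):
    """One online pass of the Perceptron. data: list of (x, y) with x a list
    of floats and y in {-1, +1}. Returns (weights, number of updates)."""
    dim = len(data[0][0])
    w = list(w0) if w0 is not None else [0.0] * dim
    updates = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1            # sgn, with sgn(0) = +1
        if y_hat != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]   # additive update
            updates += 1
    return w, updates

# Illustrative separable toy set: label = sign of x1 + x2.
data = [([1, 2], 1), ([2, 1], 1), ([-1, -2], -1), ([-2, -1], -1)]
w, updates = perceptron(data)
```

On this set a single update already yields a separating hyperplane.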

Separating Hyperplane. [Figure: margin and errors for a hyperplane w · x = 0, with margin ρ; the signed distance of a point x_i to the hyperplane is y_i(w · x_i)/‖w‖.]

Perceptron = Stochastic Gradient Descent. Objective function: convex but not differentiable,
F(w) = (1/T) Σ_{t=1}^T max(0, −y_t(w · x_t)) = E_{x∼D̂}[f(w, x)], with f(w, x) = max(0, −y(w · x)).
Stochastic gradient: for each x_t, the update is w_{t+1} ← w_t − η∇_w f(w_t, x_t) if f is differentiable there, w_{t+1} ← w_t otherwise, where η > 0 is a learning rate parameter. Here:
w_{t+1} ← w_t + η y_t x_t if y_t(w_t · x_t) < 0, w_{t+1} ← w_t otherwise.

Perceptron Algorithm - Bound (Novikoff, 1962). Theorem: Assume that ‖x_t‖ ≤ R for all t ∈ [1, T] and that, for some ρ > 0 and v ∈ R^N, y_t(v · x_t)/‖v‖ ≥ ρ for all t ∈ [1, T]. Then, the number of mistakes made by the perceptron algorithm is bounded by R²/ρ². Proof: Let I be the set of rounds t at which there is an update and let M be the total number of updates.

Summing up the assumption inequalities gives:
Mρ ≤ v · Σ_{t∈I} y_t x_t / ‖v‖
   = v · Σ_{t∈I} (w_{t+1} − w_t) / ‖v‖   (definition of updates)
   = v · w_{T+1} / ‖v‖
   ≤ ‖w_{T+1}‖   (Cauchy-Schwarz inequality)
   = ‖w_{t_m} + y_{t_m} x_{t_m}‖   (t_m largest t in I)
   = [ ‖w_{t_m}‖² + ‖x_{t_m}‖² + 2 y_{t_m} (w_{t_m} · x_{t_m}) ]^{1/2}
   ≤ [ ‖w_{t_m}‖² + R² ]^{1/2}   (y_{t_m}(w_{t_m} · x_{t_m}) ≤ 0 at an update)
   ≤ [ M R² ]^{1/2} = √M R   (applying the same bound to the previous rounds in I).
Thus, M ≤ R²/ρ².
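Novikoff's bound can be checked empirically. The sketch below (our construction, with assumed synthetic data) samples points with a known margin around a fixed v, cycles the Perceptron until a full pass makes no update, and compares the total number of updates with R²/ρ²; the bound applies to any presentation order, so repeated passes are covered.

```python
import math
import random

def perceptron_updates(data, max_passes=500):
    """Cycle the Perceptron (update when y (w.x) <= 0, as in the proof)
    until a full pass makes no update; return the total number of updates."""
    w = [0.0] * len(data[0][0])
    updates = 0
    for _ in range(max_passes):
        changed = False
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:
            break
    return updates

random.seed(1)
v = [1.0, -2.0, 0.5]
data = []
while len(data) < 50:                  # keep only points with margin > 0.3
    x = [random.uniform(-1, 1) for _ in range(3)]
    m = sum(vi * xi for vi, xi in zip(v, x))
    if abs(m) > 0.3:
        data.append((x, 1 if m > 0 else -1))

R = max(math.sqrt(sum(xi * xi for xi in x)) for x, _ in data)
norm_v = math.sqrt(sum(vi * vi for vi in v))
rho = min(y * sum(vi * xi for vi, xi in zip(v, x)) for x, y in data) / norm_v
updates = perceptron_updates(data)
```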

Notes: the bound is independent of the dimension and tight. Convergence can be slow for a small margin: it can be in Ω(2^N). Among the many variants: the voted perceptron algorithm, which predicts according to sgn((Σ_{t∈I} c_t w_t) · x), where c_t is the number of iterations w_t survives. The points {x_t : t ∈ I} are the support vectors of the perceptron algorithm. In the non-separable case, the algorithm does not converge.

Perceptron - Leave-One-Out Analysis. Theorem: Let h_S be the hypothesis returned by the perceptron algorithm for a sample S = (x_1, …, x_T) and let M(S) be the number of updates defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[min(M(S), R²_{m+1}/ρ²_{m+1})] / (m+1).
Proof: Let S ∼ D^{m+1} be a linearly separable sample and let x ∈ S. If h_{S−{x}} misclassifies x, then x must be a support vector for h_S (update at x). Thus, R̂_loo(perceptron) ≤ M(S)/(m+1).

Perceptron - Non-Separable Bound (MM and Rostamizadeh, 2013). Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, …, x_T and let M_T = |I|. Then,
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ √(L_ρ(u)) + R/ρ ]²,
where R = max_{t∈I} ‖x_t‖ and L_ρ(u) = Σ_{t∈I} (1 − y_t(u · x_t)/ρ)₊.

Proof: for any t, ρ − y_t(u · x_t) ≤ ρ(1 − y_t(u · x_t)/ρ)₊. Summing up these inequalities for t ∈ I and upper-bounding Σ_{t∈I} y_t(u · x_t) ≤ √M_T R, as in the proof for the separable case, yields:
M_T ρ ≤ ρ Σ_{t∈I} (1 − y_t(u · x_t)/ρ)₊ + Σ_{t∈I} y_t(u · x_t) ≤ ρ L_ρ(u) + √M_T R,
thus M_T ≤ L_ρ(u) + √M_T R/ρ; solving this second-degree inequality in √M_T gives
√M_T ≤ (1/2)[ R/ρ + √(R²/ρ² + 4 L_ρ(u)) ] ≤ R/ρ + √(L_ρ(u)).

Non-Separable Case - L2 Bound (Freund and Schapire, 1998; MM and Rostamizadeh, 2013). Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, …, x_T and let M_T = |I|. Then,
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ ‖L_ρ(u)‖₂/2 + √( ‖L_ρ(u)‖₂²/4 + √(Σ_{t∈I} ‖x_t‖₂²)/ρ ) ]².
When ‖x_t‖ ≤ R for all t ∈ I, this implies
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ R/ρ + ‖L_ρ(u)‖₂ ]²,
where L_ρ(u) = ( (1 − y_t(u · x_t)/ρ)₊ )_{t∈I} is the vector of ρ-hinge losses.

Proof: reduce the problem to the separable case in a higher dimension. Let l_t = (1 − y_t(u · x_t)/ρ)₊ for t ∈ [1, T]. Mapping (similar to the trivial mapping): extend each x_t ∈ R^N to x'_t ∈ R^{N+T} by setting its (N+t)-th new component to Δ and the other new components to 0:
x'_t = (x_{t,1}, …, x_{t,N}, 0, …, 0, Δ, 0, …, 0),
u' = (u_1/Z, …, u_N/Z, y_1 l_1 ρ/(ΔZ), …, y_T l_T ρ/(ΔZ)),
with ‖u'‖ = 1 ⟺ Z = √(1 + ρ² ‖L_ρ(u)‖₂²/Δ²).

Observe that the Perceptron algorithm makes the same predictions and makes updates at the same rounds when processing x'_1, …, x'_T. For any t ∈ I,
y_t(u' · x'_t) = y_t(u · x_t)/Z + y_t² l_t ρ/Z = (1/Z)[ y_t(u · x_t) + (ρ − y_t(u · x_t))₊ ] ≥ ρ/Z.
Summing up and using the proof in the separable case yields:
M_T ρ/Z ≤ √(Σ_{t∈I} ‖x'_t‖²) = √(Σ_{t∈I} ‖x_t‖² + M_T Δ²).

The inequality can be rewritten as
M_T² ρ² ≤ (1 + ρ²‖L_ρ(u)‖₂²/Δ²)(r² + M_T Δ²) = r² + M_T Δ² + ρ²‖L_ρ(u)‖₂² r²/Δ² + M_T ρ²‖L_ρ(u)‖₂²,
where r² = Σ_{t∈I} ‖x_t‖₂². Selecting Δ to minimize the bound gives Δ² = ρ‖L_ρ(u)‖₂ r/√M_T and leads to
M_T² ρ² ≤ r² + 2√M_T ρ‖L_ρ(u)‖₂ r + M_T ρ²‖L_ρ(u)‖₂² = ( r + √M_T ρ‖L_ρ(u)‖₂ )².
Solving the second-degree inequality M_T − √M_T ‖L_ρ(u)‖₂ − r/ρ ≤ 0 yields directly the first statement. The second one results from replacing r with √M_T R.

Dual Perceptron Algorithm.
Dual-Perceptron(α_0)
   α ← α_0   (typically α_0 = 0)
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(Σ_{s=1}^T α_s y_s (x_s · x_t))
      Receive(y_t)
      if ŷ_t ≠ y_t then
         α_t ← α_t + 1
   return α

Kernel Perceptron Algorithm (Aizerman et al., 1964). K is a PDS kernel.
Kernel-Perceptron(α_0)
   α ← α_0   (typically α_0 = 0)
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(Σ_{s=1}^T α_s y_s K(x_s, x_t))
      Receive(y_t)
      if ŷ_t ≠ y_t then
         α_t ← α_t + 1
   return α
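A Python sketch of the kernel version; the Gaussian kernel and the XOR-style data are our illustrative choices, showing a case no linear perceptron could handle in the original space.

```python
import math

def kernel_perceptron(data, K):
    """Kernel Perceptron: one online pass. data: list of (x, y) with
    y in {-1, +1}; K: a PDS kernel. Returns the dual coefficients alpha."""
    alpha = [0] * len(data)
    for t, (x_t, y_t) in enumerate(data):
        score = sum(a * y_s * K(x_s, x_t)
                    for a, (x_s, y_s) in zip(alpha, data) if a)
        y_hat = 1 if score >= 0 else -1
        if y_hat != y_t:
            alpha[t] += 1                          # dual update on a mistake
    return alpha

def kp_predict(alpha, data, K, x):
    s = sum(a * y_s * K(x_s, x) for a, (x_s, y_s) in zip(alpha, data) if a)
    return 1 if s >= 0 else -1

def rbf(x, z, gamma=1.0):                          # Gaussian (PDS) kernel
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# XOR-style data, not linearly separable in R^2; repetition = 5 passes.
data = [([0, 0], -1), ([1, 1], -1), ([0, 1], 1), ([1, 0], 1)] * 5
alpha = kernel_perceptron(data, rbf)
```

After a few passes the dual predictor classifies all four XOR patterns correctly.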

Winnow Algorithm (Littlestone, 1988). Here y_t ∈ {−1, +1}.
Winnow(η)
   w_1 ← (1/N, …, 1/N)
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(w_t · x_t)
      Receive(y_t)
      if ŷ_t ≠ y_t then
         Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
         for i ← 1 to N do
            w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
      else w_{t+1} ← w_t
   return w_{T+1}
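A sketch of the normalized Winnow update in Python; the sparse-target data (v = e₁, x_t ∈ {±1}^N) is our illustration of the regime where the 2(R∞²/ρ∞²) log N bound of the next slides shines.

```python
import math
import random

def winnow(data, eta):
    """Winnow: weights start uniform on the simplex; on a mistake, multiply
    each w_i by exp(eta * y * x_i) and renormalize. y in {-1, +1}."""
    N = len(data[0][0])
    w = [1.0 / N] * N
    mistakes = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:
            mistakes += 1
            w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
            Z = sum(w)
            w = [wi / Z for wi in w]               # normalize
    return w, mistakes

# Sparse target v = e_1: the label is the first coordinate of x.
random.seed(0)
N, T = 50, 200
data = []
for _ in range(T):
    x = [random.choice([-1, 1]) for _ in range(N)]
    data.append((x, x[0]))
# Here R_inf = 1 and rho_inf = 1, so the bound-optimizing eta is 1.
w, mistakes = winnow(data, eta=1.0)
```

The mistake bound 2(R∞²/ρ∞²) log N = 2 log 50 ≈ 7.8 holds despite N = 50 dimensions.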

Notes. Winnow = weighted majority: for y_{t,i} = x_{t,i} ∈ {−1, +1}, sgn(w_t · x_t) coincides with the majority vote; multiplying the weights of correct or incorrect experts by e^η or e^{−η} is equivalent, after normalization, to multiplying the weights of incorrect ones by β = e^{−2η}. Relationships with other algorithms: e.g., boosting and the Perceptron (Winnow and the Perceptron can be viewed as special instances of a general family).

Winnow Algorithm - Bound. Theorem: Assume that ‖x_t‖_∞ ≤ R_∞ for all t ∈ [1, T] and that, for some ρ_∞ > 0 and v ∈ R^N with v ≥ 0, y_t(v · x_t)/‖v‖₁ ≥ ρ_∞ for all t ∈ [1, T]. Then, the number of mistakes made by the Winnow algorithm is bounded by 2(R_∞²/ρ_∞²) log N. Proof: Let I be the set of rounds t at which there is an update and let M be the total number of updates.

Notes. Comparison with the perceptron bound: dual norms: ‖·‖_∞ for x_t and ‖·‖₁ for v, versus ‖·‖₂ for both. Similar bounds with different norms, each advantageous in different cases: the Winnow bound is favorable when a sparse set of experts can predict well; for example, if x_t ∈ {±1}^N and v = e₁, it is in log N vs N for the Perceptron. The Perceptron bound is favorable in the opposite situation.

Winnow Algorithm - Bound. Potential (relative entropy): Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / w_{t,i} ). Upper bound: for each t in I,
Φ_{t+1} − Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log(w_{t,i}/w_{t+1,i})
= Σ_{i=1}^N (v_i/‖v‖₁) log( Z_t / exp(η y_t x_{t,i}) )
= log Z_t − η y_t (v · x_t)/‖v‖₁
≤ log[ Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ] − η ρ_∞
= log E_{w_t}[exp(η y_t x_t)] − η ρ_∞   (w_t viewed as a distribution)
≤ η²(2R_∞)²/8 + η y_t (w_t · x_t) − η ρ_∞   (Hoeffding)
≤ η² R_∞²/2 − η ρ_∞   (y_t(w_t · x_t) ≤ 0 at an update).

Winnow Algorithm - Bound. Upper bound: summing up the inequalities yields Φ_{T+1} − Φ_1 ≤ M(η²R_∞²/2 − ηρ_∞). Lower bound: note that
Φ_1 = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / (1/N) ) = log N + Σ_{i=1}^N (v_i/‖v‖₁) log(v_i/‖v‖₁) ≤ log N,
and, for all t, Φ_t ≥ 0 (property of the relative entropy). Thus, Φ_{T+1} − Φ_1 ≥ 0 − log N = −log N. Comparison: we obtain −log N ≤ M(η²R_∞²/2 − ηρ_∞). For η = ρ_∞/R_∞², this gives M ≤ 2 log N · R_∞²/ρ_∞².

Conclusion. On-line learning: a wide and fast-growing literature; many related topics, e.g., game theory, text compression, convex optimization; online-to-batch bounds and techniques; online versions of batch algorithms, e.g., regression algorithms (see the regression lecture).

References
Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837.
Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9): 2050-2057, 2004.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. In Proceedings of COLT 1998. ACM Press, 1998.
Nick Littlestone. From On-Line to Batch Learning. COLT 1989: 269-284.
Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning, 2: 285-318, 1988.

References
Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Algorithm. FOCS 1989: 256-261.
Tom Mitchell. Machine Learning. McGraw Hill, 1997.
Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208, 2013.
Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.
Rosenblatt, Frank. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6): 386-408, 1958.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Mehryar Mohri - Foundations of Machine Learning. Appendix

SVMs - Leave-One-Out Analysis. Theorem: let h_S be the optimal hyperplane for a sample S and let N_SV(S) be the number of support vectors defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[min(N_SV(S), R²_{m+1}/ρ²_{m+1})] / (m+1).
Proof: one part was proven in lecture 4; the other part is due to α_i ≥ 1/R²_{m+1} for any x_i misclassified by SVMs (Vapnik, 1995).

Comparison. These are bounds on the expected error, not high-probability statements. Leave-one-out bounds are not sufficient to distinguish SVMs and the perceptron algorithm. Note however: the same maximum margin ρ_{m+1} can be used in both, but the radius R_{m+1} of the support vectors differs. The difference lies in the margin distribution.