Foundations of Machine Learning
On-Line Learning
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Motivation
PAC learning:
- distribution fixed over time (training and test).
- IID assumption.
On-line learning:
- no distributional assumption.
- worst-case analysis (adversarial).
- mixed training and test.
- performance measures: mistake model, regret.
This Lecture
- Prediction with expert advice
- Linear classification
General On-Line Setting
For $t = 1$ to $T$ do:
- receive instance $x_t \in X$;
- predict $\widehat{y}_t \in Y$;
- receive label $y_t \in Y$;
- incur loss $L(\widehat{y}_t, y_t)$.
Classification: $Y = \{0, 1\}$, $L(y, y') = |y' - y|$.
Regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.
Objective: minimize total loss $\sum_{t=1}^T L(\widehat{y}_t, y_t)$.
Prediction with Expert Advice
For $t = 1$ to $T$ do:
- receive instance $x_t \in X$ and advice $y_{t,i} \in Y$, $i \in [1, N]$;
- predict $\widehat{y}_t \in Y$;
- receive label $y_t \in Y$;
- incur loss $L(\widehat{y}_t, y_t)$.
Objective: minimize the regret, i.e., the difference between the total loss incurred and that of the best expert:
$$\mathrm{Regret}(T) = \sum_{t=1}^T L(\widehat{y}_t, y_t) - \min_{i=1}^N \sum_{t=1}^T L(y_{t,i}, y_t).$$
Mistake Bound Model
Definition: the maximum number of mistakes a learning algorithm $L$ makes to learn a concept $c$ is defined by
$$M_L(c) = \max_{x_1, \ldots, x_T} |\text{mistakes}(L, c)|.$$
Definition: for any concept class $C$, the maximum number of mistakes a learning algorithm $L$ makes is
$$M_L(C) = \max_{c \in C} M_L(c).$$
A mistake bound is a bound $M$ on $M_L(C)$.
Halving Algorithm (see (Mitchell, 1997))

Halving(H)
 1  H_1 ← H
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← MajorityVote(H_t, x_t)
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
 8      else H_{t+1} ← H_t
 9  return H_{T+1}
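The pseudocode above can be sketched in Python as follows; the finite class of integer threshold functions and the sample sequence are illustrative assumptions, not part of the lecture.

```python
def halving(hypotheses, stream):
    """hypotheses: list of callables h(x) -> {0, 1};
    stream: iterable of (x_t, y_t) pairs. Returns (H_final, mistakes)."""
    H = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        # predict with the (unweighted) majority vote of surviving hypotheses
        votes = sum(h(x) for h in H)
        y_hat = 1 if 2 * votes >= len(H) else 0
        if y_hat != y:
            mistakes += 1
            # a mistake means the majority was wrong, so filtering on
            # consistency with y removes at least half of H
            H = [h for h in H if h(x) == y]
    return H, mistakes

# Illustration: threshold functions on {0,...,10}; target threshold is 8.
hyps = [lambda x, k=k: int(x >= k) for k in range(11)]   # |H| = 11
target = lambda x: int(x >= 8)
xs = [5, 6, 7, 9, 8]
H_final, m = halving(hyps, [(x, target(x)) for x in xs])
```

On this sequence a single mistake occurs (at x = 5), consistent with the mistake bound log2 |H| ≈ 3.46.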
Halving Algorithm - Bound (Littlestone, 1988)
Theorem: let $H$ be a finite hypothesis set. Then,
$$M_{\text{Halving}}(H) \leq \log_2 |H|.$$
Proof: at each mistake, the hypothesis set is reduced at least by half.
VC Dimension Lower Bound (Littlestone, 1988)
Theorem: let $\mathrm{opt}(H)$ be the optimal mistake bound for $H$. Then,
$$\mathrm{VCdim}(H) \leq \mathrm{opt}(H) \leq M_{\text{Halving}}(H) \leq \log_2 |H|.$$
Proof: for a fully shattered set, form a complete binary tree of the mistakes with height $\mathrm{VCdim}(H)$.
Weighted Majority Algorithm (Littlestone and Warmuth, 1988)
Here $y_t, y_{t,i} \in \{0, 1\}$ and $\beta \in [0, 1)$.

Weighted-Majority(N experts)
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      ŷ_t ← 1_{∑_{i: y_{t,i}=1} w_{t,i} ≥ ∑_{i: y_{t,i}=0} w_{t,i}}    (weighted majority vote)
 6      Receive(y_t)
 7      if ŷ_t ≠ y_t then
 8          for i ← 1 to N do
 9              if y_{t,i} ≠ y_t then
10                  w_{t+1,i} ← β w_{t,i}
11              else w_{t+1,i} ← w_{t,i}
12  return w_{T+1}
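A minimal Python sketch of the update rule above, assuming advice is given as lists of binary predictions per round; the three toy experts and the choice β = 1/2 are illustrative.

```python
def weighted_majority(advice, labels, beta=0.5):
    """advice: list of T lists of N expert predictions in {0, 1};
    labels: list of T labels in {0, 1}. Returns the number of mistakes."""
    N = len(advice[0])
    w = [1.0] * N
    mistakes = 0
    for preds, y in zip(advice, labels):
        w1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        w0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        y_hat = 1 if w1 >= w0 else 0          # weighted majority vote
        if y_hat != y:
            mistakes += 1
            # downweight the experts that erred on this round
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes

# Expert 0 is always right; experts 1 and 2 are always wrong.
labels = [1, 0, 1, 0, 1, 0]
advice = [[y, 1 - y, t % 2] for t, y in enumerate(labels)]
m = weighted_majority(advice, labels)
```

With m* = 0 mistakes for the best expert and β = 1/2, the bound of the next slide gives m ≤ log 3 / log(4/3) ≈ 3.8; this run makes 2 mistakes before the two bad experts are downweighted enough.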
Weighted Majority - Bound
Theorem: let $m_t$ be the number of mistakes made by the WM algorithm up to time $t$, and $m_t^*$ the number of mistakes made by the best expert up to time $t$. Then, for all $t$,
$$m_t \leq \frac{\log N + m_t^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}.$$
Thus, $m_t \leq O(\log N) + \text{constant} \times (\text{mistakes of best expert})$.
Realizable case: $m_t \leq O(\log N)$.
Halving algorithm: $\beta = 0$.
Weighted Majority - Proof
Potential: $\Phi_t = \sum_{i=1}^N w_{t,i}$.
Upper bound: after each error,
$$\Phi_{t+1} \leq \left[\tfrac{1}{2} + \tfrac{1}{2}\beta\right] \Phi_t = \frac{1+\beta}{2}\, \Phi_t.$$
Thus, $\Phi_t \leq \left(\frac{1+\beta}{2}\right)^{m_t} N$.
Lower bound: for any expert $i$, $\Phi_t \geq w_{t,i} = \beta^{m_{t,i}}$.
Comparison: $\beta^{m_t^*} \leq \left(\frac{1+\beta}{2}\right)^{m_t} N$. Taking logarithms,
$$m_t^* \log \beta \leq \log N + m_t \log \frac{1+\beta}{2},$$
that is,
$$m_t \log \frac{2}{1+\beta} \leq \log N + m_t^* \log \frac{1}{\beta}.$$
Weighted Majority - Notes
Advantage: remarkable bound requiring no assumption.
Disadvantage: no deterministic algorithm can achieve a regret $R_T = o(T)$ with the binary loss.
- better guarantee with randomized WM.
- better guarantee for WM with convex losses.
Exponential Weighted Average
Algorithm:
- weight update: $w_{t+1,i} \leftarrow w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)}$, so that $w_{t+1,i} = e^{-\eta L_{t,i}}$, where $L_{t,i}$ is the total loss incurred by expert $i$ up to time $t$;
- prediction: $\widehat{y}_t = \frac{\sum_{i=1}^N w_{t,i}\, y_{t,i}}{\sum_{i=1}^N w_{t,i}}$.
Theorem: assume that $L$ is convex in its first argument and takes values in $[0, 1]$. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in Y$, the regret at $T$ satisfies
$$\mathrm{Regret}(T) \leq \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
For $\eta = \sqrt{8 \log N / T}$, $\mathrm{Regret}(T) \leq \sqrt{(T/2) \log N}$.
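A minimal Python sketch of the algorithm under the theorem's assumptions, using the squared loss on [0, 1]; the two experts and the outcome sequence are illustrative.

```python
import math

def exp_weighted_average(advice, labels, eta):
    """advice: list of T lists of N expert predictions in [0, 1];
    labels: list of T outcomes in [0, 1].
    Returns (algorithm loss, best expert loss) under the squared loss."""
    N = len(advice[0])
    cum = [0.0] * N                       # cumulative expert losses L_{t,i}
    alg_loss = 0.0
    for preds, y in zip(advice, labels):
        w = [math.exp(-eta * L) for L in cum]
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
        alg_loss += (y_hat - y) ** 2
        for i, p in enumerate(preds):
            cum[i] += (p - y) ** 2
    return alg_loss, min(cum)

T, N = 200, 2
labels = [(t % 3) / 2 for t in range(T)]       # outcomes in {0, 0.5, 1}
advice = [[0.5, labels[t]] for t in range(T)]  # expert 1 predicts perfectly
eta = math.sqrt(8 * math.log(N) / T)
L_alg, L_best = exp_weighted_average(advice, labels, eta)
regret = L_alg - L_best
bound = math.sqrt((T / 2) * math.log(N))
```

Here the best expert incurs zero loss and, since the squared loss on [0, 1] is convex in its first argument with values in [0, 1], the theorem guarantees regret at most √((T/2) log N) ≈ 8.3.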
Exponential Weighted Avg - Proof
Potential: $\Phi_t = \log \sum_{i=1}^N w_{t,i}$.
Upper bound:
$$\Phi_t - \Phi_{t-1} = \log \frac{\sum_{i=1}^N w_{t-1,i}\, e^{-\eta L(y_{t,i}, y_t)}}{\sum_{i=1}^N w_{t-1,i}} = \log \mathop{\mathbb{E}}_{w_{t-1}}\left[e^{-\eta L(y_{t,i}, y_t)}\right]$$
$$= \log \mathop{\mathbb{E}}_{w_{t-1}}\left[\exp\left(-\eta\left(L(y_{t,i}, y_t) - \mathop{\mathbb{E}}_{w_{t-1}}[L(y_{t,i}, y_t)]\right)\right)\right] - \eta \mathop{\mathbb{E}}_{w_{t-1}}[L(y_{t,i}, y_t)]$$
$$\leq \frac{\eta^2}{8} - \eta \mathop{\mathbb{E}}_{w_{t-1}}[L(y_{t,i}, y_t)] \qquad \text{(Hoeffding's inequality)}$$
$$\leq \frac{\eta^2}{8} - \eta\, L\!\left(\mathop{\mathbb{E}}_{w_{t-1}}[y_{t,i}], y_t\right) \qquad \text{(convexity of $L$ in its first argument)}$$
$$= \frac{\eta^2}{8} - \eta\, L(\widehat{y}_t, y_t).$$
Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
$$\Phi_T - \Phi_0 \leq -\eta \sum_{t=1}^T L(\widehat{y}_t, y_t) + \frac{\eta^2 T}{8}.$$
Lower bound:
$$\Phi_T - \Phi_0 = \log \sum_{i=1}^N e^{-\eta L_{T,i}} - \log N \geq \log \max_{i=1}^N e^{-\eta L_{T,i}} - \log N = -\eta \min_{i=1}^N L_{T,i} - \log N.$$
Comparison:
$$-\eta \min_{i=1}^N L_{T,i} - \log N \leq -\eta \sum_{t=1}^T L(\widehat{y}_t, y_t) + \frac{\eta^2 T}{8},$$
and therefore
$$\sum_{t=1}^T L(\widehat{y}_t, y_t) \leq \min_{i=1}^N L_{T,i} + \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the form $R_T/T = O\!\left(\sqrt{\log(N)/T}\right)$.
Disadvantage: the choice of $\eta$ requires knowledge of the horizon $T$.
Doubling Trick
Idea: divide time into periods $[2^k, 2^{k+1} - 1]$ of length $2^k$, with $k = 0, \ldots, n$ and $T \geq 2^n - 1$, and choose $\eta_k = \sqrt{\frac{8 \log N}{2^k}}$ in each period.
Theorem: with the same assumptions as before, for any $T$, the following holds:
$$\mathrm{Regret}(T) \leq \frac{\sqrt{2}}{\sqrt{2} - 1} \sqrt{(T/2) \log N} + \sqrt{\log N / 2}.$$
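The schedule can be sketched with a small helper that returns the η_k in effect at each round; the helper itself is an illustrative assumption, not part of the slides.

```python
import math

def doubling_etas(T, N):
    """Return the eta_k in effect at each round t = 1..T when time is
    split into periods [2^k, 2^(k+1) - 1] of length 2^k."""
    etas = []
    k = 0
    while len(etas) < T:
        eta_k = math.sqrt(8 * math.log(N) / 2 ** k)
        etas.extend([eta_k] * (2 ** k))   # period k has length 2^k
        k += 1
    return etas[:T]                        # last period may be truncated

etas = doubling_etas(10, N=4)
# rounds 1..10 fall into periods of lengths 1, 2, 4, 3 (last truncated)
```

Each halving of η corresponds to quadrupling the period length, since η_k scales as 1/√(2^k).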
Doubling Trick - Proof
By the previous theorem, for any period $I_k = [2^k, 2^{k+1} - 1]$,
$$L_{I_k} - \min_{i=1}^N L_{I_k,i} \leq \sqrt{\frac{2^k}{2} \log N}.$$
Thus, with $L_T = \sum_{k=0}^n L_{I_k}$,
$$L_T \leq \sum_{k=0}^n \min_{i=1}^N L_{I_k,i} + \sum_{k=0}^n \sqrt{2^k (\log N)/2} \leq \min_{i=1}^N L_{T,i} + \sqrt{(\log N)/2} \sum_{k=0}^n \sqrt{2}^{\,k}.$$
Now, since $2^n \leq T + 1$,
$$\sum_{k=0}^n \sqrt{2}^{\,k} = \frac{\sqrt{2}^{\,n+1} - 1}{\sqrt{2} - 1} \leq \frac{\sqrt{2}\sqrt{T+1} - 1}{\sqrt{2} - 1} \leq \frac{\sqrt{2}(\sqrt{T} + 1) - 1}{\sqrt{2} - 1} = \frac{\sqrt{2}}{\sqrt{2} - 1}\sqrt{T} + 1.$$
Notes
Doubling trick used in a variety of other contexts and proofs.
More general method, learning parameter function of time: $\eta_t = \sqrt{(8 \log N)/t}$. Constant factor improvement:
$$\mathrm{Regret}(T) \leq 2\sqrt{(T/2) \log N} + \sqrt{(1/8) \log N}.$$
This Lecture
- Prediction with expert advice
- Linear classification
Perceptron Algorithm (Rosenblatt, 1958)

Perceptron(w_0)
 1  w_1 ← w_0    (typically w_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          w_{t+1} ← w_t + y_t x_t    (more generally w_t + η y_t x_t, η > 0)
 8      else w_{t+1} ← w_t
 9  return w_{T+1}
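A minimal Python sketch of the algorithm with w_0 = 0; the toy linearly separable data set and the convention sgn(0) = +1 are illustrative assumptions.

```python
def perceptron(data, n_epochs=1):
    """data: list of (x, y) with x a list of floats and y in {-1, +1}.
    Returns (w, number_of_updates)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    updates = 0
    for _ in range(n_epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if score >= 0 else -1   # sgn(w . x), with sgn(0) := +1
            if y_hat != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
    return w, updates

# Linearly separable toy data: label = sign of the first coordinate.
data = [([2.0, 1.0], 1), ([1.0, -1.0], 1),
        ([-2.0, 1.0], -1), ([-1.0, -2.0], -1)]
w, m = perceptron(data, n_epochs=5)
```

With v = (1, 0) the margin is ρ = 1 and R = √5, so Novikoff's bound (next slides) gives at most R²/ρ² = 5 updates; this run makes 2 and separates the data.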
Separating Hyperplane
[Figure: separating hyperplane $w \cdot x = 0$ with margin $\rho$, and errors measured by $y_i (w \cdot x_i)/\|w\|$.]
Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable,
$$F(w) = \frac{1}{T} \sum_{t=1}^T \max\left(0, -y_t (w \cdot x_t)\right) = \mathop{\mathbb{E}}_{x \sim \widehat{D}}[f(w, x)], \qquad f(w, x) = \max\left(0, -y (w \cdot x)\right).$$
Stochastic gradient: for each $x_t$, the update is
$$w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w f(w_t, x_t) & \text{if differentiable,} \\ w_t & \text{otherwise,} \end{cases}$$
where $\eta > 0$ is a learning rate parameter.
Here:
$$w_{t+1} \leftarrow \begin{cases} w_t + \eta\, y_t x_t & \text{if } y_t (w_t \cdot x_t) < 0, \\ w_t & \text{otherwise.} \end{cases}$$
Perceptron Algorithm - Bound (Novikoff, 1962)
Theorem: assume that $\|x_t\| \leq R$ for all $t \in [1, T]$ and that for some $\rho > 0$ and $v \in \mathbb{R}^N$, for all $t \in [1, T]$,
$$\rho \leq \frac{y_t (v \cdot x_t)}{\|v\|}.$$
Then, the number of mistakes made by the Perceptron algorithm is bounded by $R^2/\rho^2$.
Proof: let $I$ be the set of rounds at which there is an update and let $M$ be the total number of updates.
Summing up the assumption inequalities gives:
$$M \rho \leq \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|} = \frac{v \cdot \sum_{t \in I} (w_{t+1} - w_t)}{\|v\|} = \frac{v \cdot w_{T+1}}{\|v\|} \qquad \text{(definition of updates)}$$
$$\leq \|w_{T+1}\| \qquad \text{(Cauchy-Schwarz inequality)}$$
$$= \|w_{t_m} + y_{t_m} x_{t_m}\| \qquad (t_m \text{ largest } t \text{ in } I)$$
$$= \Big[\|w_{t_m}\|^2 + \|x_{t_m}\|^2 + 2 \underbrace{y_{t_m} (w_{t_m} \cdot x_{t_m})}_{\leq 0}\Big]^{1/2} \leq \left[\|w_{t_m}\|^2 + R^2\right]^{1/2}$$
$$\leq \left[M R^2\right]^{1/2} = \sqrt{M}\, R \qquad \text{(applying the same bound to the previous $t$s in $I$)}.$$
Thus $\sqrt{M}\,\rho \leq R$, that is $M \leq R^2/\rho^2$.
Notes:
- bound independent of dimension and tight.
- convergence can be slow for a small margin: it can be in $\Omega(2^N)$.
- among the many variants: the voted perceptron algorithm, which predicts according to $\mathrm{sgn}\big(\big(\sum_{t \in I} c_t w_t\big) \cdot x\big)$, where $c_t$ is the number of iterations $w_t$ survives.
- $\{x_t : t \in I\}$ are the support vectors of the perceptron algorithm.
- non-separable case: the algorithm does not converge.
Perceptron - Leave-One-Out Analysis
Theorem: let $h_S$ be the hypothesis returned by the perceptron algorithm for a sample $S = (x_1, \ldots, x_{m+1})$ and let $M(S)$ be the number of updates defining $h_S$. Then,
$$\mathop{\mathbb{E}}_{S \sim D^m}[R(h_S)] \leq \mathop{\mathbb{E}}_{S \sim D^{m+1}}\left[\frac{\min\left(M(S),\, R_{m+1}^2/\rho_{m+1}^2\right)}{m+1}\right].$$
Proof: let $S \sim D^{m+1}$ be a linearly separable sample and let $x \in S$. If $h_{S - \{x\}}$ misclassifies $x$, then $x$ must be a support vector for $h_S$ (update at $x$). Thus,
$$\widehat{R}_{\text{loo}}(\text{perceptron}) \leq \frac{M(S)}{m+1}.$$
Perceptron - Non-Separable Bound (MM and Rostamizadeh, 2013)
Theorem: let $I$ denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T$, and let $M_T = |I|$. Then,
$$M_T \leq \inf_{\rho > 0,\, \|u\|_2 \leq 1} \left[\sqrt{L_\rho(u)} + \frac{R}{\rho}\right]^2,$$
where $R = \max_{t \in I} \|x_t\|$ and $L_\rho(u) = \sum_{t \in I} \left[1 - \frac{y_t (u \cdot x_t)}{\rho}\right]_+$.
Proof: for any,, t 1 y t (u x t ) up these inequalities for by upper-bounding for the separable case. yields: solving the second-degree inequality gives M T apple X t2i 1 apple L (u)+ M T apple L (u)+ p MT apple q R + R 2 2 t y t (u x t ) p MT R, p MT R, 2 + I 1 + X t2i t I (y tu x t ) +4L (u) y t (u x t ) + y t (u x t ) apple R + ql (u). summing as in the proof 30
Non-Separable Case - L2 Bound (Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let $I$ denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T$, and let $M_T = |I|$. Then,
$$M_T \leq \inf_{\rho > 0,\, \|u\|_2 \leq 1} \left[\frac{\|L_\rho(u)\|_2}{2} + \sqrt{\frac{\|L_\rho(u)\|_2^2}{4} + \frac{1}{\rho}\Big(\sum_{t \in I} \|x_t\|_2^2\Big)^{1/2}}\,\right]^2.$$
When $\|x_t\| \leq R$ for all $t \in I$, this implies
$$M_T \leq \inf_{\rho > 0,\, \|u\|_2 \leq 1} \left[\frac{R}{\rho} + \|L_\rho(u)\|_2\right]^2,$$
where $L_\rho(u) = \left(\left[1 - \frac{y_t(u \cdot x_t)}{\rho}\right]_+\right)_{t \in I}$.
Proof: reduce the problem to the separable case in a higher dimension. Let $\ell_t = \rho\left[1 - \frac{y_t(u \cdot x_t)}{\rho}\right]_+ 1_{t \in I}$, for $t \in [1, T]$.
Mapping (similar to the trivial mapping): extend each $x_t \in \mathbb{R}^N$ to $x'_t \in \mathbb{R}^{N+T}$, with $\Delta$ placed in the $(N+t)$-th component, and $u$ to $u'$:
$$x'_t = (x_{t,1}, \ldots, x_{t,N}, 0, \ldots, 0, \Delta, 0, \ldots, 0)^\top, \qquad u' = \left(\frac{u_1}{Z}, \ldots, \frac{u_N}{Z}, \frac{y_1 \ell_1}{\Delta Z}, \ldots, \frac{y_T \ell_T}{\Delta Z}\right)^\top,$$
with $Z = \sqrt{1 + \frac{\rho^2 \|L_\rho(u)\|_2^2}{\Delta^2}}$ chosen so that $\|u'\| = 1$.
Observe that the Perceptron algorithm makes the same predictions and makes updates at the same rounds when processing $x'_1, \ldots, x'_T$. For any $t \in I$,
$$y_t (u' \cdot x'_t) = \frac{y_t (u \cdot x_t)}{Z} + \frac{\ell_t}{Z} = \frac{1}{Z}\left(y_t (u \cdot x_t) + \rho\left[1 - \frac{y_t(u \cdot x_t)}{\rho}\right]_+\right) \geq \frac{\rho}{Z}.$$
Summing up and using the proof in the separable case yields:
$$M_T\, \frac{\rho}{Z} \leq \sum_{t \in I} y_t (u' \cdot x'_t) \leq \sqrt{\sum_{t \in I} \|x'_t\|^2}.$$
The inequality can be rewritten as
$$M_T^2 \leq \left(\frac{1}{\rho^2} + \frac{\|L_\rho(u)\|_2^2}{\Delta^2}\right)\left(r^2 + M_T \Delta^2\right) = \frac{r^2}{\rho^2} + \frac{r^2 \|L_\rho(u)\|_2^2}{\Delta^2} + \frac{M_T \Delta^2}{\rho^2} + M_T \|L_\rho(u)\|_2^2,$$
where $r = \left(\sum_{t \in I} \|x_t\|_2^2\right)^{1/2}$. Selecting $\Delta$ to minimize the bound gives $\Delta^2 = \frac{\rho\, r\, \|L_\rho(u)\|_2}{\sqrt{M_T}}$ and leads to
$$M_T^2 \leq \frac{r^2}{\rho^2} + 2 \sqrt{M_T}\, \|L_\rho(u)\|_2\, \frac{r}{\rho} + M_T \|L_\rho(u)\|_2^2 = \left(\frac{r}{\rho} + \sqrt{M_T}\, \|L_\rho(u)\|_2\right)^2.$$
Solving the second-degree inequality $M_T - \sqrt{M_T}\, \|L_\rho(u)\|_2 - \frac{r}{\rho} \leq 0$ yields directly the first statement. The second one results from replacing $r$ with $\sqrt{M_T}\, R$.
Dual Perceptron Algorithm

Dual-Perceptron(α_0)
 1  α ← α_0    (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(∑_{s=1}^T α_s y_s (x_s · x_t))
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          α_t ← α_t + 1
 8  return α
Kernel Perceptron Algorithm (Aizerman et al., 1964)
$K$ a PDS kernel.

Kernel-Perceptron(α_0)
 1  α ← α_0    (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(∑_{s=1}^T α_s y_s K(x_s, x_t))
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          α_t ← α_t + 1
 8  return α
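A minimal Python sketch of the kernel Perceptron; the Gaussian kernel, its width, the XOR-style toy data, and the epoch count are illustrative assumptions.

```python
import math

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel, a PDS kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_perceptron(data, K, n_epochs=5):
    """data: list of (x, y) with y in {-1, +1}.
    Returns the dual coefficients alpha (update counts per point)."""
    alpha = [0] * len(data)
    for _ in range(n_epochs):
        for t, (x_t, y_t) in enumerate(data):
            score = sum(a * y_s * K(x_s, x_t)
                        for a, (x_s, y_s) in zip(alpha, data))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y_t:
                alpha[t] += 1
    return alpha

# XOR-like data, not linearly separable in the input space:
data = [([0.0, 0.0], 1), ([1.0, 1.0], 1),
        ([0.0, 1.0], -1), ([1.0, 0.0], -1)]
alpha = kernel_perceptron(data, rbf)

def predict(x):
    return 1 if sum(a * y_s * rbf(x_s, x)
                    for a, (x_s, y_s) in zip(alpha, data)) >= 0 else -1
```

In the RBF feature space this XOR data becomes separable, and the algorithm classifies all four points correctly after a handful of updates, which the linear perceptron cannot do.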
Winnow Algorithm (Littlestone, 1988)

Winnow(η)
 1  w_1 ← (1/N, ..., 1/N)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)    (y_t ∈ {−1, +1})
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          Z_t ← ∑_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}
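A minimal Python sketch of the normalized multiplicative update above; the sparse toy target and the choice η = ρ∞/R∞² (the value used in the mistake-bound proof) are illustrative assumptions.

```python
import math
import random

def winnow(data, eta):
    """data: list of (x, y) with x in {-1, +1}^N and y in {-1, +1}.
    Normalized multiplicative updates, applied on mistakes only."""
    N = len(data[0][0])
    w = [1.0 / N] * N
    mistakes = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:
            mistakes += 1
            w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
            Z = sum(w)
            w = [wi / Z for wi in w]       # renormalize
    return w, mistakes

# Sparse target: the label is the first of N = 16 coordinates, so
# v = e_1 achieves margin rho_inf = 1 with R_inf = 1.
random.seed(0)
N, T = 16, 200
data = []
for _ in range(T):
    x = [random.choice([-1, 1]) for _ in range(N)]
    data.append((x, x[0]))
eta = 1.0            # eta = rho_inf / R_inf^2, as in the proof
w, m = winnow(data, eta)
```

The bound of the following slides gives m ≤ 2 (R∞²/ρ∞²) log N = 2 log 16 ≈ 5.5, independent of the 15 irrelevant coordinates.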
Notes
Winnow = weighted majority: for $y_{t,i} = x_{t,i} \in \{-1, +1\}$, $\mathrm{sgn}(w_t \cdot x_t)$ coincides with the majority vote; multiplying the weights of correct (resp. incorrect) experts by $e^{\eta}$ (resp. $e^{-\eta}$) is equivalent to multiplying the weights of incorrect experts by $\beta = e^{-2\eta}$.
Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).
Winnow Algorithm - Bound
Theorem: assume that $\|x_t\|_\infty \leq R_\infty$ for all $t \in [1, T]$ and that for some $\rho_\infty > 0$ and $v \in \mathbb{R}^N$, $v \geq 0$, for all $t \in [1, T]$,
$$\rho_\infty \leq \frac{y_t (v \cdot x_t)}{\|v\|_1}.$$
Then, the number of mistakes made by the Winnow algorithm is bounded by $2 (R_\infty^2/\rho_\infty^2) \log N$.
Proof: let $I$ be the set of $t$s at which there is an update and let $M$ be the total number of updates.
Notes
Comparison with the perceptron bound:
- dual norms: $\|\cdot\|_\infty$ norm for the $x_t$s and $\|\cdot\|_1$ norm for $v$.
- similar bounds with different norms, each advantageous in different cases:
- the Winnow bound is favorable when a sparse set of experts can predict well: for example, if $x_t \in \{\pm 1\}^N$ and $v = e_1$, the comparison is $\log N$ vs $N$.
- the Perceptron bound is favorable in the opposite situation.
Winnow Algorithm - Bound
Potential (relative entropy): $\Phi_t = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{w_{t,i}}$.
Upper bound: for each $t$ in $I$,
$$\Phi_{t+1} - \Phi_t = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{w_{t,i}}{w_{t+1,i}} = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{Z_t}{\exp(\eta y_t x_{t,i})} = \log Z_t - \eta \sum_{i=1}^N \frac{v_i}{\|v\|_1}\, y_t x_{t,i} \leq \log Z_t - \eta \rho_\infty.$$
Moreover,
$$\log Z_t = \log \sum_{i=1}^N w_{t,i} \exp(\eta y_t x_{t,i}) = \log \mathop{\mathbb{E}}_{w_t}\left[\exp(\eta y_t x_t)\right] \leq \log\left[\exp\left(\frac{\eta^2 (2R_\infty)^2}{8}\right) \exp\Big(\eta \underbrace{y_t (w_t \cdot x_t)}_{\leq 0}\Big)\right] \leq \frac{\eta^2 R_\infty^2}{2} \qquad \text{(Hoeffding)}.$$
Thus, $\Phi_{t+1} - \Phi_t \leq \frac{\eta^2 R_\infty^2}{2} - \eta \rho_\infty$.
Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
$$\Phi_{T+1} - \Phi_1 \leq M\left(\frac{\eta^2 R_\infty^2}{2} - \eta \rho_\infty\right).$$
Lower bound: note that
$$\Phi_1 = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{1/N} = \log N + \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{v_i}{\|v\|_1} \leq \log N,$$
and, for all $t$, $\Phi_t \geq 0$ (property of the relative entropy). Thus,
$$\Phi_{T+1} - \Phi_1 \geq 0 - \log N = -\log N.$$
Comparison: we obtain $-\log N \leq M\left(\frac{\eta^2 R_\infty^2}{2} - \eta \rho_\infty\right)$. For $\eta = \frac{\rho_\infty}{R_\infty^2}$, this gives
$$M \leq 2 \log N\, \frac{R_\infty^2}{\rho_\infty^2}.$$
Conclusion
On-line learning:
- wide and fast-growing literature.
- many related topics, e.g., game theory, text compression, convex optimization.
- online-to-batch bounds and techniques.
- online versions of batch algorithms, e.g., regression algorithms (see regression lecture).
References
- Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837.
- Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
- Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. In Proceedings of COLT 1998. ACM Press, 1998.
- Nick Littlestone. From on-line to batch learning. In Proceedings of COLT 1989, pages 269-284.
- Nick Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.
References
- Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In Proceedings of FOCS 1989, pages 256-261.
- Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
- Mehryar Mohri and Afshin Rostamizadeh. Perceptron mistake bounds. arXiv:1305.0208, 2013.
- Novikoff, A. B. (1962). On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, 12:615-622. Polytechnic Institute of Brooklyn.
- Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
Appendix
SVMs - Leave-One-Out Analysis
Theorem: let $h_S$ be the optimal hyperplane for a sample $S$ and let $N_{SV}(S)$ be the number of support vectors defining $h_S$. Then,
$$\mathop{\mathbb{E}}_{S \sim D^m}[R(h_S)] \leq \mathop{\mathbb{E}}_{S \sim D^{m+1}}\left[\frac{\min\left(N_{SV}(S),\, R_{m+1}^2/\rho_{m+1}^2\right)}{m+1}\right].$$
Proof: one part proven in lecture 4. The other part is due to $\alpha_i \geq 1/R_{m+1}^2$ for $x_i$ misclassified by SVMs (Vapnik, 1995).
Comparison
Bounds on the expected error, not high-probability statements.
Leave-one-out bounds not sufficient to distinguish SVMs and the perceptron algorithm. Note however:
- the same maximum margin $\rho_{m+1}$ can be used in both.
- but different radius $R_{m+1}$ of support vectors.
Difference: margin distribution.