Foundations of Machine Learning: On-Line Learning
Mehryar Mohri, Courant Institute and Google Research
2 Motivation
PAC learning: distribution fixed over time (training and test); i.i.d. assumption.
On-line learning: no distributional assumption; worst-case (adversarial) analysis; training and test mixed.
Performance measures: mistake model, regret.
3 This Lecture
Prediction with expert advice.
Linear classification.
4 General On-Line Setting
For t = 1 to T do:
- receive instance x_t ∈ X;
- predict ŷ_t ∈ Y;
- receive label y_t ∈ Y;
- incur loss L(ŷ_t, y_t).
Classification: Y = {0, 1}, L(y, y′) = |y′ − y|.
Regression: Y ⊆ ℝ, L(y, y′) = (y′ − y)².
Objective: minimize the total loss Σ_{t=1}^T L(ŷ_t, y_t).
5 Prediction with Expert Advice
For t = 1 to T do:
- receive instance x_t ∈ X and advice y_{t,i} ∈ Y, i ∈ [1, N];
- predict ŷ_t ∈ Y;
- receive label y_t ∈ Y;
- incur loss L(ŷ_t, y_t).
Objective: minimize the regret, i.e., the difference between the total loss incurred and that of the best expert:
Regret(T) = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i=1}^N Σ_{t=1}^T L(y_{t,i}, y_t).
6 Mistake Bound Model
Definition: the maximum number of mistakes a learning algorithm L makes to learn a concept c is
M_L(c) = max_{x_1,...,x_T} |mistakes(L, c)|.
Definition: for any concept class C, the maximum number of mistakes L makes is
M_L(C) = max_{c ∈ C} M_L(c).
A mistake bound is a bound M on M_L(C).
7 Halving Algorithm (see Mitchell, 1997)
Halving(H)
 1  H_1 ← H
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← MajorityVote(H_t, x_t)
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
 8  return H_{T+1}
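The pseudocode above can be sketched in a few lines of Python. This is an illustrative toy, not from the lecture: the finite hypothesis class is a set of integer threshold functions, and the target concept and stream are invented for the example.

```python
def halving(hypotheses, stream):
    """Halving algorithm: predict by majority vote over the version space;
    on each mistake, keep only the hypotheses consistent with the label."""
    H = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = [c(x) for c in H]
        y_hat = 1 if votes.count(1) >= votes.count(0) else 0
        if y_hat != y:
            mistakes += 1
            # At least half of H voted for y_hat, so |H| at least halves.
            H = [c for c in H if c(x) == y]
    return H, mistakes

# Hypothetical finite class: threshold classifiers c_t(x) = 1_{x >= t}.
hyps = [lambda x, t=t: int(x >= t) for t in range(8)]
stream = [(x, int(x >= 2)) for x in [0, 5, 1, 3, 2]]  # target: threshold at 2
H, mistakes = halving(hyps, stream)
```

Since |H| = 8, the theorem on the next slide guarantees at most log₂ 8 = 3 mistakes on any consistent stream.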
8 Halving Algorithm - Bound (Littlestone, 1988)
Theorem: Let H be a finite hypothesis set. Then
M_Halving(H) ≤ log₂ |H|.
Proof: at each mistake, the hypothesis set is reduced at least by half.
9 VC Dimension Lower Bound (Littlestone, 1988)
Theorem: Let opt(H) be the optimal mistake bound for H. Then
VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂ |H|.
Proof: for a fully shattered set, an adversary can force mistakes along a complete binary tree of height VCdim(H).
10 Weighted Majority Algorithm (Littlestone and Warmuth, 1988)
y_t, y_{t,i} ∈ {0, 1}; β ∈ [0, 1).
Weighted-Majority(N experts)
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      ŷ_t ← 1 if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i}, else 0   (weighted majority vote)
 6      Receive(y_t)
 7      if ŷ_t ≠ y_t then
 8          for i ← 1 to N do
 9              if y_{t,i} ≠ y_t then
10                  w_{t+1,i} ← β w_{t,i}
11              else w_{t+1,i} ← w_{t,i}
12  return w_{T+1}
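A minimal Python sketch of Weighted Majority for binary advice; the three experts and their predictions below are invented for illustration (expert 0 is always correct):

```python
def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: on each mistake, multiply the weights of the
    experts that erred by beta in [0, 1)."""
    N = len(expert_preds[0])
    w = [1.0] * N
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        ones = sum(wi for wi, p in zip(w, preds) if p == 1)
        zeros = sum(wi for wi, p in zip(w, preds) if p == 0)
        y_hat = 1 if ones >= zeros else 0      # weighted majority vote
        if y_hat != y:
            mistakes += 1
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return w, mistakes

# Three hypothetical experts over four rounds.
labels = [1, 0, 1, 1]
expert_preds = [[1, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
w, mistakes = weighted_majority(expert_preds, labels)
```

Here the best expert makes no mistakes, so the bound on the next slide reads m_T ≤ log 3 / log(2/(1+β)) ≈ 3.8, and indeed only 2 mistakes occur.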
11 Weighted Majority - Bound
Theorem: Let m_t be the number of mistakes made by the WM algorithm up to time t, and m*_t that of the best expert. Then, for all t,
m_t ≤ (log N + m*_t log(1/β)) / log(2/(1+β)).
Thus, m_t ≤ O(log N) + constant × (mistakes of the best expert).
Realizable case: m_t ≤ O(log N).
Halving algorithm: β = 0.
12 Weighted Majority - Proof
Potential: Φ_t = Σ_{i=1}^N w_{t,i}.
Upper bound: after each error,
Φ_{t+1} ≤ [1/2 + (1/2)β] Φ_t = ((1+β)/2) Φ_t.
Thus, Φ_t ≤ ((1+β)/2)^{m_t} N.
Lower bound: for any expert i, Φ_t ≥ w_{t,i} = β^{m_{t,i}}.
Comparison: β^{m*_t} ≤ ((1+β)/2)^{m_t} N. Taking logs,
m*_t log β ≤ log N + m_t log((1+β)/2),
that is, m_t log(2/(1+β)) ≤ log N + m*_t log(1/β).
13 Weighted Majority - Notes
Advantage: remarkable bound requiring no assumption.
Disadvantage: no deterministic algorithm can achieve a regret R_T = o(T) with the binary loss.
- better guarantee with randomized WM.
- better guarantee for WM with convex losses.
14 Exponential Weighted Average
Algorithm:
- weight update: w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)} = e^{−η L_{t,i}}, where L_{t,i} is the total loss incurred by expert i up to time t.
- prediction: ŷ_t = Σ_{i=1}^N w_{t,i} y_{t,i} / Σ_{i=1}^N w_{t,i}.
Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret at T satisfies
Regret(T) ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Regret(T) ≤ √((T/2) log N).
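The forecaster can be sketched as follows, here with the squared loss, which is convex in its first argument and [0, 1]-valued on this data. The two experts and the horizon are invented for the example:

```python
import math

def exp_weighted_average(expert_preds, labels, eta):
    """Exponential weighted average: predict the weight-averaged expert
    advice; multiply each expert's weight by exp(-eta * its loss)."""
    N = len(expert_preds[0])
    w = [1.0] * N
    alg_loss = 0.0
    expert_loss = [0.0] * N
    for preds, y in zip(expert_preds, labels):
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
        alg_loss += (y_hat - y) ** 2          # squared loss, in [0, 1]
        for i, p in enumerate(preds):
            loss = (p - y) ** 2
            expert_loss[i] += loss
            w[i] *= math.exp(-eta * loss)
    return alg_loss - min(expert_loss)        # regret

# Two hypothetical experts over T = 100 rounds; expert 1 is always right.
T, N = 100, 2
expert_preds = [[0.0, 1.0]] * T
labels = [1.0] * T
eta = math.sqrt(8 * math.log(N) / T)
regret = exp_weighted_average(expert_preds, labels, eta)
```

With this choice of η, the theorem guarantees regret at most √((T/2) log N) ≈ 5.9 on this sequence.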
15 Exponential Weighted Avg - Proof
Potential: Φ_t = log Σ_{i=1}^N w_{t,i}.
Upper bound:
Φ_t − Φ_{t−1} = log [Σ_{i=1}^N w_{t−1,i} e^{−η L(y_{t,i}, y_t)} / Σ_{i=1}^N w_{t−1,i}]
= log E_{w_{t−1}}[e^{−η L(y_{t,i}, y_t)}]
≤ log exp(−η E_{w_{t−1}}[L(y_{t,i}, y_t)] + η²/8)   (Hoeffding's inequality)
= −η E_{w_{t−1}}[L(y_{t,i}, y_t)] + η²/8
≤ −η L(E_{w_{t−1}}[y_{t,i}], y_t) + η²/8   (convexity of L in its first argument)
= −η L(ŷ_t, y_t) + η²/8.
16 Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
Φ_T − Φ_0 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8.
Lower bound:
Φ_T − Φ_0 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{i=1}^N e^{−η L_{T,i}} − log N = −η min_{i=1}^N L_{T,i} − log N.
Comparison:
−η min_{i=1}^N L_{T,i} − log N ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8,
thus Σ_{t=1}^T L(ŷ_t, y_t) ≤ min_{i=1}^N L_{T,i} + (log N)/η + ηT/8.
17 Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the form R_T/T = O(√(log N / T)).
Disadvantage: the choice of η requires knowledge of the horizon T.
18 Doubling Trick
Idea: divide time into periods I_k = [2^k, 2^{k+1} − 1] of length 2^k, with k = 0, ..., n and T ≥ 2^n − 1, and choose η_k = √(8 log N / 2^k) in each period.
Theorem: with the same assumptions as before, for any T, the following holds:
Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2).
19 Doubling Trick - Proof
By the previous theorem, for any period I_k = [2^k, 2^{k+1} − 1],
L_{I_k} − min_{i=1}^N L_{I_k,i} ≤ √((2^k/2) log N).
Thus, with L_T = Σ_{k=0}^n L_{I_k},
L_T ≤ Σ_{k=0}^n min_{i=1}^N L_{I_k,i} + Σ_{k=0}^n √(2^k (log N)/2) ≤ min_{i=1}^N L_{T,i} + √((log N)/2) Σ_{k=0}^n √(2^k).
Since
Σ_{k=0}^n √(2^k) = (√(2^{n+1}) − 1)/(√2 − 1) ≤ (√2 √(T+1) − 1)/(√2 − 1) ≤ √(2T)/(√2 − 1) + 1,
this yields Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2).
20 Notes
Doubling trick used in a variety of other contexts and proofs.
More general method, learning parameter function of time: η_t = √((8 log N)/t). Constant factor improvement:
Regret(T) ≤ 2√((T/2) log N) + √((1/8) log N).
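The doubling trick can be sketched as a restart wrapper around the exponential weighted average forecaster; everything below (experts, loss, horizon) is an invented toy, with the squared loss playing the role of the [0, 1]-valued convex loss:

```python
import math

def doubling_trick(expert_preds, labels):
    """Restart the exponential weighted average forecaster on periods
    [2^k, 2^{k+1} - 1], using eta_k = sqrt(8 log N / 2^k) in period k."""
    N = len(expert_preds[0])
    T = len(labels)
    alg_loss = 0.0
    expert_loss = [0.0] * N
    t, k = 0, 0
    while t < T:
        eta = math.sqrt(8 * math.log(N) / 2 ** k)
        w = [1.0] * N                         # weights reset at each period
        for _ in range(min(2 ** k, T - t)):
            preds, y = expert_preds[t], labels[t]
            y_hat = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
            alg_loss += (y_hat - y) ** 2
            for i, p in enumerate(preds):
                loss = (p - y) ** 2
                expert_loss[i] += loss
                w[i] *= math.exp(-eta * loss)
            t += 1
        k += 1
    return alg_loss - min(expert_loss)

T, N = 100, 2
regret = doubling_trick([[0.0, 1.0]] * T, [1.0] * T)
```

No horizon is needed up front: η_k depends only on the current period length, at the price of the constant √2/(√2 − 1) in the bound.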
21 This Lecture
Prediction with expert advice.
Linear classification.
22 Perceptron Algorithm (Rosenblatt, 1958)
Perceptron(w_0)
 1  w_1 ← w_0   (typically w_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          w_{t+1} ← w_t + y_t x_t   (more generally w_t + η y_t x_t, η > 0)
 8      else w_{t+1} ← w_t
 9  return w_{T+1}
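A direct Python transcription of the algorithm, run for several epochs over a small batch; the toy data is invented for the example and is linearly separable with margin ρ = 1 under v = (1, 0), with R² = 10, so Novikoff's theorem below caps the total number of updates at R²/ρ² = 10:

```python
def perceptron(data, epochs=20):
    """Perceptron: on each mistake, update w <- w + y * x."""
    dim = len(data[0][0])
    w = [0.0] * dim
    updates = 0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if score > 0 else -1     # sgn, ties broken as -1
            if y_hat != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
    return w, updates

# Toy data: labels given by the sign of the first coordinate.
data = [([2.0, 1.0], 1), ([-2.0, 1.0], -1),
        ([3.0, -1.0], 1), ([-1.0, -1.0], -1)]
w, updates = perceptron(data)
```

Since at most 10 updates can ever occur, 20 epochs are more than enough for a full clean pass, after which w never changes.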
23 Separating Hyperplane
[Figure: separating hyperplane w·x = 0 with margin ρ; margin errors measured by ρ − y_i(w·x_i)/‖w‖.]
24 Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable,
F(w) = (1/T) Σ_{t=1}^T max(0, −y_t(w · x_t)) = E_{x∼D̂}[f(w, x)],
with f(w, x) = max(0, −y(w · x)) and D̂ the empirical distribution.
Stochastic gradient: for each x_t, the update is
w_{t+1} ← w_t − η ∇_w f(w_t, x_t) if f is differentiable at w_t, w_t otherwise,
where η > 0 is a learning rate parameter.
Here:
w_{t+1} ← w_t + η y_t x_t if y_t(w_t · x_t) < 0, w_t otherwise.
25 Perceptron Algorithm - Bound (Novikoff, 1962)
Theorem: Assume that ‖x_t‖ ≤ R for all t ∈ [1, T] and that, for some ρ > 0 and v ∈ ℝ^N, y_t(v · x_t)/‖v‖ ≥ ρ for all t ∈ [1, T]. Then, the number of mistakes made by the Perceptron algorithm is bounded by R²/ρ².
Proof: Let I be the set of rounds t at which there is an update and let M be the total number of updates.
26 Summing up the assumption inequalities gives:
Mρ ≤ v · (Σ_{t∈I} y_t x_t)/‖v‖
= v · (Σ_{t∈I} (w_{t+1} − w_t))/‖v‖   (definition of updates)
= v · w_{T+1}/‖v‖
≤ ‖w_{T+1}‖   (Cauchy-Schwarz inequality)
= ‖w_{t_m} + y_{t_m} x_{t_m}‖   (t_m largest t in I)
= [‖w_{t_m}‖² + ‖x_{t_m}‖² + 2 y_{t_m} (w_{t_m} · x_{t_m})]^{1/2}   (last term ≤ 0 at an update)
≤ [‖w_{t_m}‖² + R²]^{1/2}
≤ [M R²]^{1/2} = √M R   (applying the same bound to the previous ts in I).
Thus, √M ≤ R/ρ, i.e., M ≤ R²/ρ².
27 Notes:
- bound independent of the dimension and tight.
- convergence can be slow for a small margin: it can be in Ω(2^N).
- among the many variants: voted perceptron algorithm. Predict according to sgn((Σ_{t∈I} c_t w_t) · x), where c_t is the number of iterations w_t survives.
- {x_t : t ∈ I} are the support vectors of the perceptron algorithm.
- non-separable case: does not converge.
28 Perceptron - Leave-One-Out Analysis
Theorem: Let h_S be the hypothesis returned by the perceptron algorithm for sample S = (x_1, ..., x_T) and let M(S) be the number of updates defining h_S. Then
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[min(M(S), R²_{m+1}/ρ²_{m+1})]/(m+1).
Proof: Let S ∼ D^{m+1} be a linearly separable sample and let x ∈ S. If h_{S−{x}} misclassifies x, then x must be a support vector for h_S (update at x). Thus,
R̂_loo(perceptron) ≤ M(S)/(m+1).
29 Perceptron - Non-Separable Bound (MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T, and let M_T = |I|. Then
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [√(L_ρ(u)) + R/ρ]²,
where R = max_{t∈I} ‖x_t‖ and L_ρ(u) = Σ_{t∈I} (1 − y_t(u · x_t)/ρ)_+.
30 Proof: for any t ∈ I,
1 ≤ (1 − y_t(u · x_t)/ρ)_+ + y_t(u · x_t)/ρ.
Summing up these inequalities for t ∈ I and upper-bounding Σ_{t∈I} y_t(u · x_t)/ρ ≤ √(M_T) R/ρ as in the proof for the separable case yields:
M_T ≤ Σ_{t∈I} (1 − y_t(u · x_t)/ρ)_+ + Σ_{t∈I} y_t(u · x_t)/ρ ≤ L_ρ(u) + √(M_T) R/ρ.
Solving the second-degree inequality in √(M_T) gives
√(M_T) ≤ R/(2ρ) + √(R²/(4ρ²) + L_ρ(u)) ≤ R/ρ + √(L_ρ(u)).
31 Non-Separable Case - L2 Bound (Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T, and let M_T = |I|. Then
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [‖L_ρ(u)‖₂/2 + √(‖L_ρ(u)‖₂²/4 + √(Σ_{t∈I} ‖x_t‖²)/ρ)]².
When ‖x_t‖ ≤ R for all t ∈ I, this implies
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [R/ρ + ‖L_ρ(u)‖₂]²,
where L_ρ(u) is the vector ((1 − y_t(u · x_t)/ρ)_+)_{t∈I}.
32 Proof: Reduce the problem to the separable case in a higher dimension. Let l_t = (1 − y_t(u · x_t)/ρ)_+ for t ∈ [1, T]. Mapping (similar to the trivial mapping): extend each x_t ∈ ℝ^N to x′_t ∈ ℝ^{N+T} with x′_{t,j} = x_{t,j} for j ≤ N, x′_{t,N+t} = Δ, and 0 otherwise; define
u′ = (u_1/Z, ..., u_N/Z, y_1 l_1 ρ/(ΔZ), ..., y_T l_T ρ/(ΔZ)),
where Z is chosen so that ‖u′‖ = 1:
Z = √(1 + ρ² ‖L_ρ(u)‖₂²/Δ²).
33 Observe that the Perceptron algorithm makes the same predictions and the same updates when processing x′_1, ..., x′_T. For any t ∈ I,
y_t(u′ · x′_t) = y_t(u · x_t)/Z + l_t ρ/Z = (1/Z)[y_t(u · x_t) + (ρ − y_t(u · x_t))_+] ≥ ρ/Z.
Summing up and using the proof in the separable case yields:
M_T ρ/Z ≤ √(Σ_{t∈I} ‖x′_t‖²) = √(Σ_{t∈I} ‖x_t‖² + M_T Δ²).
34 The inequality can be rewritten as
M_T² ρ² ≤ Z²(r² + M_T Δ²) = r² + M_T Δ² + ρ² ‖L_ρ(u)‖₂² r²/Δ² + M_T ρ² ‖L_ρ(u)‖₂²,
where r² = Σ_{t∈I} ‖x_t‖². Selecting Δ to minimize the bound gives Δ² = ρ ‖L_ρ(u)‖₂ r/√(M_T) and leads to
M_T² ρ² ≤ r² + 2√(M_T) ρ ‖L_ρ(u)‖₂ r + M_T ρ² ‖L_ρ(u)‖₂² = (r + √(M_T) ρ ‖L_ρ(u)‖₂)².
Solving the second-degree inequality M_T − √(M_T) ‖L_ρ(u)‖₂ − r/ρ ≤ 0 yields directly the first statement. The second one results from replacing r with √(M_T) R.
35 Dual Perceptron Algorithm
Dual-Perceptron(α_0)
 1  α ← α_0   (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(Σ_{s=1}^T α_s y_s (x_s · x_t))
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          α_t ← α_t + 1
 8  return α
36 Kernel Perceptron Algorithm (Aizerman et al., 1964)
K a PDS kernel.
Kernel-Perceptron(α_0)
 1  α ← α_0   (typically α_0 = 0)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(Σ_{s=1}^T α_s y_s K(x_s, x_t))
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          α_t ← α_t + 1
 8  return α
37 Winnow Algorithm (Littlestone, 1988)
Winnow(η)
 1  w_1 ← 1/N
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)   (y_t ∈ {−1, +1})
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i})/Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}
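A Python sketch of Winnow; the data below realizes the sparse-expert case discussed on the later slides (target v = e₁ over x ∈ {±1}^N), where R_∞ = ρ_∞ = 1, so with η = ρ_∞/R_∞² = 1 the bound caps the mistakes at 2 log 8 ≈ 4.2 on any ordering of the points:

```python
import itertools
import math

def winnow(data, eta):
    """Winnow: multiplicative weight updates on mistakes, normalized by Z_t."""
    N = len(data[0][0])
    w = [1.0 / N] * N
    mistakes = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score > 0 else -1
        if y_hat != y:
            mistakes += 1
            unnorm = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
            Z = sum(unnorm)
            w = [u / Z for u in unnorm]
    return w, mistakes

# Sparse target: the label is just the first coordinate of x.
data = [(list(x), x[0]) for x in itertools.product([1, -1], repeat=8)]
w, mistakes = winnow(data, eta=1.0)
```

Coordinate 0 always agrees with the label, so its weight receives the largest multiplier at every update and ends up dominating the others.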
38 Notes
Winnow vs. weighted majority: for y_{t,i} = x_{t,i} ∈ {−1, +1}, sgn(w_t · x_t) coincides with the majority vote; multiplying the weights of correct or incorrect experts by e^η or e^{−η} is equivalent to multiplying the weights of incorrect ones by β = e^{−2η}.
Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).
39 Winnow Algorithm - Bound
Theorem: Assume that ‖x_t‖_∞ ≤ R_∞ for all t ∈ [1, T] and that, for some ρ_∞ > 0 and v ∈ ℝ^N with v ≥ 0, y_t(v · x_t)/‖v‖₁ ≥ ρ_∞ for all t ∈ [1, T]. Then, for η = ρ_∞/R_∞², the number of mistakes made by the Winnow algorithm is bounded by 2(R_∞²/ρ_∞²) log N.
Proof: Let I be the set of rounds t at which there is an update and let M be the total number of updates.
40 Notes
Comparison with the perceptron bound:
- dual norms: ‖·‖_∞ and ‖·‖₁ norms for x_t and v, instead of ‖·‖₂ norms.
- similar bounds with different norms.
- each advantageous in different cases: the Winnow bound is favorable when a sparse set of experts can predict well. For example, if x_t ∈ {±1}^N and v = e_1, the bounds compare as log N vs. N. The Perceptron bound is favorable in the opposite situation.
41 Winnow Algorithm - Bound
Potential: Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log((v_i/‖v‖₁)/w_{t,i})   (relative entropy).
Upper bound: for each t in I,
Φ_{t+1} − Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log(w_{t,i}/w_{t+1,i})
= Σ_{i=1}^N (v_i/‖v‖₁) log(Z_t/exp(η y_t x_{t,i}))
= log Z_t − η y_t (v · x_t)/‖v‖₁
≤ log Z_t − η ρ_∞.
Moreover,
log Z_t = log Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) = log E_{w_t}[exp(η y_t x_t)]
≤ log exp(η²(2R_∞)²/8 + η y_t (w_t · x_t))   (Hoeffding)
≤ η² R_∞²/2,
since y_t (w_t · x_t) ≤ 0 at an update. Thus, Φ_{t+1} − Φ_t ≤ η² R_∞²/2 − η ρ_∞.
42 Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ M(η² R_∞²/2 − η ρ_∞).
Lower bound: note that
Φ_1 = Σ_{i=1}^N (v_i/‖v‖₁) log((v_i/‖v‖₁)/(1/N)) = log N + Σ_{i=1}^N (v_i/‖v‖₁) log(v_i/‖v‖₁) ≤ log N,
and Φ_t ≥ 0 for all t (property of the relative entropy). Thus,
Φ_{T+1} − Φ_1 ≥ 0 − log N = −log N.
Comparison: we obtain −log N ≤ M(η² R_∞²/2 − η ρ_∞). For η = ρ_∞/R_∞²,
M ≤ 2(R_∞²/ρ_∞²) log N.
43 Conclusion
On-line learning:
- wide and fast-growing literature.
- many related topics, e.g., game theory, text compression, convex optimization.
- online-to-batch bounds and techniques.
- online versions of batch algorithms, e.g., regression algorithms (see regression lecture).
44 References
- Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25.
- Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9).
- Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. In Proceedings of COLT 1998. ACM Press, 1998.
- Nick Littlestone. From On-Line to Batch Learning. COLT 1989.
- Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning, 2(4), 1988.
45 References
- Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Algorithm. FOCS 1989.
- Tom Mitchell. Machine Learning. McGraw Hill, 1997.
- Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv, 2013.
- Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12. Polytechnic Institute of Brooklyn.
- Rosenblatt, Frank. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 1958.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
46 Mehryar Mohri - Foundations of Machine Learning
Appendix
47 SVMs - Leave-One-Out Analysis (Vapnik, 1995)
Theorem: let h_S be the optimal hyperplane for a sample S and let N_SV(S) be the number of support vectors defining h_S. Then
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[min(N_SV(S), R²_{m+1}/ρ²_{m+1})]/(m+1).
Proof: one part was proven in lecture 4. The other part follows from α_i ≥ 1/R²_{m+1} for any x_i misclassified by the SVM solution.
48 Comparison
Bounds on the expected error, not high-probability statements.
Leave-one-out bounds not sufficient to distinguish SVMs and the perceptron algorithm. Note however:
- the same maximum margin ρ_{m+1} can be used in both.
- but a different radius R_{m+1} of support vectors.
Difference: margin distribution.
More informationTHE WEIGHTED MAJORITY ALGORITHM
THE WEIGHTED MAJORITY ALGORITHM Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 3, 2006 OUTLINE 1 PREDICTION WITH EXPERT ADVICE 2 HALVING: FIND THE PERFECT EXPERT!
More informationTheory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!
Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training
More informationStructured Prediction
Structured Prediction Ningshan Zhang Advanced Machine Learning, Spring 2016 Outline Ensemble Methods for Structured Prediction[1] On-line learning Boosting AGeneralizedKernelApproachtoStructuredOutputLearning[2]
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationTutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning
Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationi=1 cosn (x 2 i y2 i ) over RN R N. cos y sin x
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 November 16, 017 Due: Dec 01, 017 A. Kernels Show that the following kernels K are PDS: 1.
More informationMore about the Perceptron
More about the Perceptron CMSC 422 MARINE CARPUAT marine@cs.umd.edu Credit: figures by Piyush Rai and Hal Daume III Recap: Perceptron for binary classification Classifier = hyperplane that separates positive
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationLearning Theory: Basic Guarantees
Learning Theory: Basic Guarantees Daniel Khashabi Fall 2014 Last Update: April 28, 2016 1 Introduction The Learning Theory 1, is somewhat least practical part of Machine Learning, which is most about the
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationAdvanced Machine Learning
Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:
More informationPreliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1
90 8 80 7 70 6 60 0 8/7/ Preliminaries Preliminaries Linear models and the perceptron algorithm Chapters, T x + b < 0 T x + b > 0 Definition: The Euclidean dot product beteen to vectors is the expression
More informationOnline Bounds for Bayesian Algorithms
Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present
More informationCOMS 4771 Lecture Boosting 1 / 16
COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?
More informationFrom Bandits to Experts: A Tale of Domination and Independence
From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A
More informationMehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel
More informationOnline prediction with expert advise
Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with
More informationNew bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationOnline Learning: Random Averages, Combinatorial Parameters, and Learnability
Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago
More informationThe Perceptron algorithm
The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationReferences. Lecture 7: Support Vector Machines. Optimum Margin Perceptron. Perceptron Learning Rule
References Lecture 7: Support Vector Machines Isabelle Guyon guyoni@inf.ethz.ch An training algorithm for optimal margin classifiers Boser-Guyon-Vapnik, COLT, 992 http://www.clopinet.com/isabelle/p apers/colt92.ps.z
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More information