Online Learning with Experts & Multiplicative Weights Algorithms
1 Online Learning with Experts & Multiplicative Weights Algorithms. CS 159, lecture #2. Stephan Zheng, April 1, 2016, Caltech.
2 Table of contents. 1. Online Learning with Experts (with a perfect expert; without perfect experts). 2. Multiplicative Weights algorithms (regret of Multiplicative Weights). 3. Learning with Experts revisited (continuous vector predictions; subtlety in the discrete case).
3 Today. What should you take away from this lecture? How should you base your prediction on expert predictions? What are the characteristics of multiplicative weights algorithms?
4 References. Online Learning [Ch. 2-4], Gabor Bartok, David Pal, Csaba Szepesvari, and Istvan Szita. The Multiplicative Weights Update Method: A Meta-Algorithm and Applications, Sanjeev Arora, Elad Hazan, and Satyen Kale. Theory of Computing, 8:121-164, 2012.
5 Online Learning with Experts
6 Leo at the Oscars: an online binary prediction problem. Will Leo win an Oscar this year? (A running question since ...)
7 Leo at the Oscars: an online binary prediction problem. ... 2005: No. 2006: No. 2007: No. 2008: No. 2009: No. 2010: No.
8 Leo at the Oscars: an online binary prediction problem. 2011: No. 2012: No. 2013: No. 2014: No. 2015: No. 2016: Yes!
9 Leo at the Oscars: an online binary prediction problem. A frivolous extrapolation: 2016: Yes! 2017: No. 2018: Yes! 2019: No. 2020: Yes. 2021: Yes.
10 Leo at the Oscars: an online binary prediction problem. Imagine you're a showbiz fan and want to predict the answer every year t. But you don't know all the ins and outs of Hollywood. Instead, you are lucky enough to have access to experts $\{i\}_{i \in I}$, each of which makes a (possibly wrong!) prediction $p_{i,t}$. How do you use the experts for your own prediction? How do you incorporate the feedback from Nature? How do we achieve sublinear regret?
11 Formal description
12 Online Learning with Experts. Formal setup: there are T rounds in which we predict a binary label $\hat p_t \in Y = \{0, 1\}$. At every timestep t: 1. Each expert $i = 1 \dots n$ predicts $p_{i,t} \in Y$. 2. You make a prediction $\hat p_t \in Y$. 3. Nature reveals $y_t \in Y$. 4. We suffer a loss $\ell(\hat p_t, y_t) = 1[\hat p_t \neq y_t]$. 5. Each expert suffers a loss $\ell(p_{i,t}, y_t) = 1[p_{i,t} \neq y_t]$. Loss: $\hat L_T = \sum_{t=1}^T \ell(\hat p_t, y_t)$, $L_{i,T} = \sum_{t=1}^T \ell(p_{i,t}, y_t)$ (= # of mistakes made). Regret: $R_T = \hat L_T - \min_i L_{i,T}$. How do you decide what $\hat p_t$ to predict? How do you incorporate the feedback from Nature? How do we achieve sublinear regret?
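The protocol above can be sketched as a small Python harness (an illustration, not the lecture's code): `learner` is any rule mapping the experts' predictions to $\hat p_t$, and the unweighted majority-vote rule below is a hypothetical placeholder strategy.

```python
# Toy harness for the experts protocol: play T rounds, track the learner's
# mistakes L_hat and each expert's mistakes L_i, and return the regret
# R_T = L_hat - min_i L_i.
def run_protocol(expert_preds, outcomes, learner):
    """expert_preds[t][i] = p_{i,t} in {0,1}; outcomes[t] = y_t."""
    n = len(expert_preds[0])
    L_hat, L = 0, [0] * n
    for preds, y in zip(expert_preds, outcomes):
        p_hat = learner(preds)           # the learner's prediction \hat p_t
        L_hat += int(p_hat != y)         # 0/1 loss for the learner
        for i in range(n):
            L[i] += int(preds[i] != y)   # 0/1 loss for expert i
    return L_hat - min(L)                # regret R_T


# A simple (unweighted) majority-vote learner as a placeholder.
def majority(preds):
    return int(sum(preds) > len(preds) / 2)
```

With three experts and two rounds, `run_protocol([[1, 0, 0], [1, 1, 0]], [1, 1], majority)` returns the regret against the best expert in hindsight.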
13 Online Learning with Experts. Experts are a way to abstract a hypothesis class. For the most part, we'll deal with a finite, discrete number of experts, because that is easier to analyze. In general, there can be a continuous space of experts, which corresponds to using a standard hypothesis class. Boosting uses a collection of n weak classifiers as experts. At every time t it adds one weak classifier with weight $\alpha_t$ to the ensemble classifier $h_{1:t}$. The prediction $\hat p_t$ is based on the ensemble $h_{1:t}$ that we have collected so far.
14 Weighted Majority: the halving algorithm. Let's assume that there is a perfect expert $i^*$ that is guaranteed to know the right answer (say, a mind-reader that can read the minds of the voting Academy members). That is, $\forall t: p_{i^*,t} = y_t$ and $\ell_{i^*,t} = 0$. To keep regret (= # mistakes) small, find the perfect expert quickly, with few mistakes: eliminate experts that make a mistake, and take a majority vote of the alive experts. Let's define some groups of experts. The alive set: $E_t = \{i : i \text{ did not make a mistake until time } t\}$, $E_0 = \{1, \dots, n\}$. The nay-sayers: $E^0_{t-1} = \{i \in E_{t-1} : p_{i,t} = 0\}$. The yay-sayers: $E^1_{t-1} = \{i \in E_{t-1} : p_{i,t} = 1\}$.
15 Weighted Majority: the halving algorithm. (Worked example: a table of rounds t with the incurred loss $\hat\ell_t$, Nature's answer $y_t$, the prediction $\hat p_t$, and the expert predictions $p_{1,t}, \dots, p_{12,t}$.)
16 Weighted Majority: the halving algorithm.
The halving algorithm:
1: for t = 1 ... T do
2:   Receive expert predictions $(p_{1,t}, \dots, p_{n,t})$
3:   Split $E_{t-1}$ into $E^0_{t-1} \cup E^1_{t-1}$
4:   if $|E^1_{t-1}| > |E^0_{t-1}|$ then (follow the majority)
5:     Predict $\hat p_t = 1$
6:   else
7:     Predict $\hat p_t = 0$
8:   end if
9:   Receive Nature's answer $y_t$ (and incur loss $\ell(\hat p_t, y_t)$)
10:  Update $E_t$ with experts that continue to be right
11: end for
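As a sanity check, the pseudocode above can be sketched in a few lines of Python (an illustration, not the lecture's code). With a perfect expert present, the mistake count should never exceed $\log_2 n$.

```python
# Sketch of the halving algorithm: predict with the majority of the alive
# set E_{t-1}, then eliminate every expert that was wrong this round.
def halving(expert_preds, outcomes):
    n = len(expert_preds[0])
    alive = set(range(n))                            # E_0 = {1, ..., n}
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        yay = [i for i in alive if preds[i] == 1]    # E^1_{t-1}
        p_hat = int(len(yay) > len(alive) / 2)       # majority vote
        mistakes += int(p_hat != y)
        alive = {i for i in alive if preds[i] == y}  # keep correct experts
    return mistakes
```

For example, with four experts of which expert 0 is perfect, `halving([[1, 0, 0, 0], [1, 0, 0, 0]], [1, 1])` makes one mistake in the first round, eliminates the three wrong experts, and is correct from then on: at most $\log_2 4 = 2$ mistakes, as the bound promises.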
17 Weighted Majority: the halving algorithm. Notice that if halving makes a mistake, then at least half of the experts in $E_{t-1}$ were wrong: $W_t = |E_t| = |E^{y_t}_{t-1}| \le |E_{t-1}|/2$. Since there is always a perfect expert, $W_t \ge 1$, so the algorithm makes no more than $\log_2 |E_0| = \log_2 n$ mistakes: sublinear regret. Qualitatively speaking: $W_t$ decreases multiplicatively when halving makes a mistake. If $W_t$ can't shrink too much, then halving can't make too many mistakes. There is a lower bound on $W_t$ for all t, since there is a perfect expert, so $W_t$ can't shrink too much starting from its initial value. We'll see similar behavior later.
18 Weighted Majority. So what should we do if we don't know anything about the experts? The majority vote might always be wrong! Give each expert a weight $w_{i,t}$, initialized as $w_{i,1} = 1$. Decide based on a weighted sum of experts: $\hat p_t = 1[\text{weighted sum of yay-sayers} > \text{weighted sum of nay-sayers}]$, i.e. $\hat p_t = 1\left[\sum_{i=1}^n w_{i,t-1}\, p_{i,t} > \sum_{i=1}^n w_{i,t-1} (1 - p_{i,t})\right]$.
19 Weighted Majority. Recall: with a perfect expert we had the halving algorithm (previous slides): predict with the majority of the alive set $E_{t-1}$ and eliminate every expert that makes a mistake.
20 Weighted Majority. Without knowledge about the experts: choose a decay factor $\beta \in [0, 1)$.
The Weighted Majority algorithm:
1: Initialize $w_{i,1} = 1$, $W_1 = n$
2: for t = 1 ... T do
3:   Receive expert predictions $(p_{1,t}, \dots, p_{n,t})$
4:   if $\sum_{i=1}^n w_{i,t-1}\, p_{i,t} > \sum_{i=1}^n w_{i,t-1} (1 - p_{i,t})$ then
5:     Predict $\hat p_t = 1$
6:   else
7:     Predict $\hat p_t = 0$
8:   end if
9:   Receive Nature's answer $y_t$ (and incur loss $\ell(\hat p_t, y_t)$)
10:  $w_{i,t} \leftarrow w_{i,t-1}\, \beta^{1[p_{i,t} \neq y_t]}$
11: end for
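The algorithm above can be sketched in Python (an illustrative sketch; the function and variable names are my own): weights start at 1, wrong experts are decayed by $\beta$, and the prediction compares the weighted yay and nay totals.

```python
# Sketch of the Weighted Majority algorithm with decay factor beta in [0, 1).
def weighted_majority(expert_preds, outcomes, beta=0.5):
    n = len(expert_preds[0])
    w = [1.0] * n                        # w_{i,1} = 1
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        yay = sum(w[i] for i in range(n) if preds[i] == 1)
        nay = sum(w[i] for i in range(n) if preds[i] == 0)
        p_hat = int(yay > nay)           # weighted vote
        mistakes += int(p_hat != y)
        for i in range(n):
            if preds[i] != y:            # w_{i,t} <- w_{i,t-1} * beta^{1[p_{i,t} != y_t]}
                w[i] *= beta
    return mistakes
```

For instance, with one always-right and one always-wrong expert, the first (tied) round can go wrong, but after a single decay step the good expert dominates the vote.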
21 Weighted Majority. Halving is equivalent to choosing $\beta = 0$: then $w_{i,t} = 1$ if i is alive, and 0 otherwise. What happens as $\beta \to 1$? Faulty experts are not punished that much.
22 Weighted Majority. Define the potential $W_t = \sum_{i=1}^n w_{i,t}$ (the weighted size of the set of experts). Theorem 1: $W_t \le W_{t-1}$, and if $\hat p_t \neq y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$. Compare with before: $W_t$ decreases multiplicatively when the algorithm makes a mistake. Is there always a non-zero lower bound on $W_t$ for all t? Yes, with finite T. No, with infinite time: all weights could decay towards 0. Note that we didn't normalize the $w_{i,t}$; we'll fix this in the next part. What about the regret behavior? Is it sublinear? We'll leave this for now.
23 Multiplicative Weights algorithms
24 Multiplicative Weights. High-level intuition: a more general case: choose from n options; stochastic strategies are allowed; experts are not always present.
25 Multiplicative Weights. Setup: at each timestep t = 1 ... T: 1. You want to take a decision $d \in D = \{1, \dots, n\}$. 2. You choose a distribution $p_t$ over D and sample randomly from it. 3. Nature reveals the cost vector $c_t$, where each cost $c_{i,t}$ is bounded in $[-1, 1]$. 4. The expected cost is then $\mathbb{E}_{i \sim p_t}[c_{i,t}] = c_t \cdot p_t$. Goal: minimize regret with respect to the best decision in hindsight, $\min_i \sum_{t=1}^T c_{i,t}$. In halving, decision = choose expert, cost = mistake, and $p_t$ was concentrated on the majority of alive experts (note that $p_t$ is a distribution here, not a prediction).
26 The Multiplicative Weights algorithm. Comes in many forms!
The multiplicative weights algorithm:
1: Fix $\eta \le \frac{1}{2}$
2: Initialize $w_{i,1} = 1$
3: Initialize $W_1 = n$
4: for t = 1 ... T do
5:   $W_t = \sum_i w_{i,t}$
6:   Choose a decision $i \sim p_t$, where $p_{i,t} = \frac{w_{i,t}}{W_t}$
7:   Nature reveals $c_t$
8:   $w_{i,t+1} \leftarrow w_{i,t} (1 - \eta c_{i,t})$
9: end for
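The update above can be sketched in Python (an illustrative sketch, not the lecture's code; the cost sequence in the usage example is made up): the distribution is the normalized weight vector, a decision is sampled from it, and each weight is scaled by $(1 - \eta c_{i,t})$.

```python
import random

# Sketch of Multiplicative Weights over n decisions with costs in [-1, 1].
# Returns the total expected cost sum_t c_t . p_t.
def multiplicative_weights(costs, eta=0.5, seed=0):
    """costs[t][i] = c_{i,t}."""
    rng = random.Random(seed)
    n = len(costs[0])
    w = [1.0] * n                                           # w_{i,1} = 1
    total = 0.0
    for c in costs:
        W = sum(w)
        p = [wi / W for wi in w]                            # p_t = w_t / W_t
        decision = rng.choices(range(n), weights=p)[0]      # sample i ~ p_t
        total += sum(ci * pi for ci, pi in zip(c, p))       # expected cost c_t . p_t
        w = [wi * (1 - eta * ci) for wi, ci in zip(w, c)]   # multiplicative update
    return total
```

On the toy cost sequence `[[1, 0], [1, 0], [1, 0]]` with $\eta = 1/2$, the expected cost is $1/2 + 1/3 + 1/5 \approx 1.03$, comfortably below the theorem's bound of $\log(2)/\eta \approx 1.39$ for the zero-cost decision.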
27 Regret of Multiplicative Weights. Theorem (regret of multiplicative weights): assume each cost $c_{i,t}$ is bounded in $[-1, 1]$ and $\eta \le \frac{1}{2}$. Then after T rounds, for any decision i: $\sum_{t=1}^T c_t \cdot p_t \le \sum_{t=1}^T c_{i,t} + \eta \sum_{t=1}^T |c_{i,t}| + \frac{\log n}{\eta}$.
28 Regret of Multiplicative Weights. In the bound $\sum_{t=1}^T c_t \cdot p_t \le \sum_{t=1}^T c_{i,t} + \eta \sum_{t=1}^T |c_{i,t}| + \frac{\log n}{\eta}$: the first term is the cost of the fixed decision i; the second term relates to how quickly the MWA updates itself after observing the costs at time t; the third term measures how conservative the MWA is (if $\eta$ is too small, the algorithm updates too cautiously and this term blows up).
29 Regret of Multiplicative Weights. About the bound $\sum_{t=1}^T c_t \cdot p_t \le \sum_{t=1}^T c_{i,t} + \eta \sum_{t=1}^T |c_{i,t}| + \frac{\log n}{\eta}$: $\eta \le \frac{1}{2}$ is needed to make some inequalities work. Generality: no assumptions are made about the sequence of events (they may be correlated, and the costs may be adversarial)! The bound holds for all i, so taking a min over decisions, we see: $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_{t=1}^T |c_{i,t}|$.
30 Regret of Multiplicative Weights. What is the regret behavior? $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_{t=1}^T |c_{i,t}|$. Since the sum is O(T), the regret is sublinear with the appropriate choice $\eta = \sqrt{\frac{\log n}{T}}$. Then $R_T \le O\left(\sqrt{T \log n}\right)$. What is the issue with this? We need to know the horizon T up front! If T is unknown, use $\eta_t = \min\left\{\sqrt{\frac{\log n}{t}}, \frac{1}{2}\right\}$ or the doubling trick.
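The doubling trick mentioned above can be sketched as follows (my own hypothetical helper, not from the lecture): run the algorithm in phases k = 0, 1, 2, ..., where phase k lasts $2^k$ rounds and restarts with the rate tuned for that known phase length, $\eta_k = \min\{\sqrt{\log n / 2^k},\, 1/2\}$.

```python
import math

# Sketch of a doubling-trick schedule: phase k lasts 2^k rounds and uses
# the learning rate tuned for that (known) phase length.
def doubling_schedule(T, n):
    """Return (phase_length, eta_k) pairs covering at least T rounds."""
    schedule, covered, k = [], 0, 0
    while covered < T:
        length = 2 ** k
        eta = min(math.sqrt(math.log(n) / length), 0.5)  # eta_k, clipped at 1/2
        schedule.append((length, eta))
        covered += length
        k += 1
    return schedule
```

For T = 10 and n = 4 this yields phases of length 1, 2, 4, 8 (covering 15 >= 10 rounds) with non-increasing rates; summing the per-phase regret bounds recovers the $O(\sqrt{T \log n})$ rate up to a constant.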
31 Regret of Multiplicative Weights. What if we don't choose $\eta \propto \frac{1}{\sqrt{T}}$ in $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_t |c_{i,t}|$? What if $\eta$ is constant w.r.t. T? What if $\eta < \frac{1}{\sqrt{T}}$? There is a fundamental tension between fitting to the experts and learning from Nature. Optimality of MW: in the online setting you can't do better: the regret is lower bounded as $R_T \ge \Omega\left(\sqrt{T \log n}\right)$. See Theorem 4.1 in Arora et al.
32 Regret of Multiplicative Weights. $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_t |c_{i,t}|$. Why can't you achieve O(log T) regret as with FTL in this case? Recall that for Follow The Leader with strongly convex loss, the difference between two consecutive decisions scaled as $\|p_t - p_{t-1}\| = O(\frac{1}{t})$. In MW, the decision-differential scales with $\|p_t - p_{t-1}\| = O(\eta)$, which is $O(\frac{1}{\sqrt{t}})$, so the MW algorithm needs to be prepared to make much bigger changes than FTL over time.
33 Proof of MW. The steps in the proof are: 1. Upper bound $W_t$ in terms of cumulative decay factors. 2. Lower bound $W_t$ using convexity arguments. 3. Combine the upper and lower bounds on $W_t$ to get the answer. Let's go through the proof carefully.
34 Proof of MW: getting the upper bound on $W_t$.
$W_{t+1} = \sum_i w_{i,t+1} = \sum_i w_{i,t}(1 - \eta c_{i,t}) = W_t - \eta W_t \sum_i c_{i,t}\, p_{i,t}$ (using $w_{i,t} = W_t\, p_{i,t}$) $= W_t (1 - \eta\, c_t \cdot p_t) \le W_t \exp(-\eta\, c_t \cdot p_t)$ (convexity: $\forall x \in \mathbb{R}: 1 - x \le e^{-x}$).
Recursively using this inequality: $W_{T+1} \le W_1 \exp\left(-\eta \sum_{t=1}^T c_t \cdot p_t\right) = n \exp\left(-\eta \sum_{t=1}^T c_t \cdot p_t\right)$.
35 Proof of MW: getting the lower bound on $W_t$.
$W_{T+1} \ge w_{i,T+1} = \prod_{t \le T}(1 - \eta c_{i,t})$ (multiplicative updates and $w_{i,1} = 1$) $= \prod_{t: c_{i,t} \ge 0}(1 - \eta c_{i,t}) \prod_{t: c_{i,t} < 0}(1 - \eta c_{i,t})$ (split into positive and negative costs).
Now we use the facts that $\forall x \in [0, 1]: (1 - \eta)^x \le 1 - \eta x$ and $\forall x \in [-1, 0]: (1 + \eta)^{-x} \le 1 - \eta x$. Since $c_{i,t} \in [-1, 1]$, we can apply this to each factor in the product above:
$W_{T+1} \ge (1 - \eta)^{\sum_{t: c_{i,t} \ge 0} c_{i,t}}\, (1 + \eta)^{-\sum_{t: c_{i,t} < 0} c_{i,t}}$.
This is the lower bound we want.
36 Proof of MW: combine the upper and lower bounds. We get:
$(1 - \eta)^{\sum_{t: c_{i,t} \ge 0} c_{i,t}}\, (1 + \eta)^{-\sum_{t: c_{i,t} < 0} c_{i,t}} \le W_{T+1} \le n \exp\left(-\eta \sum_t c_t \cdot p_t\right)$.
Take logs:
$\sum_{t: c_{i,t} \ge 0} c_{i,t} \log(1 - \eta) - \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta) \le \log n - \eta \sum_t c_t \cdot p_t$.
37 Proof of MW: combine the upper and lower bounds. Now we'll massage this into the form we want:
$\eta \sum_t c_t \cdot p_t \le \log n - \sum_{t: c_{i,t} \ge 0} c_{i,t} \log(1 - \eta) + \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta) = \log n + \sum_{t: c_{i,t} \ge 0} c_{i,t} \log\left(\frac{1}{1 - \eta}\right) + \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta)$.
38 Proof of MW: combine the upper and lower bounds. Since $\eta \le \frac{1}{2}$, we can use $\log\left(\frac{1}{1-\eta}\right) \le \eta + \eta^2$ and $\log(1 + \eta) \ge \eta - \eta^2$:
$\eta \sum_t c_t \cdot p_t \le \log n + \sum_{t: c_{i,t} \ge 0} c_{i,t} \log\left(\frac{1}{1-\eta}\right) + \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta) \le \log n + \sum_{t: c_{i,t} \ge 0} c_{i,t} (\eta + \eta^2) + \sum_{t: c_{i,t} < 0} c_{i,t} (\eta - \eta^2)$ (use the inequalities) $= \log n + \eta \sum_t c_{i,t} + \eta^2 \sum_{t: c_{i,t} \ge 0} c_{i,t} - \eta^2 \sum_{t: c_{i,t} < 0} c_{i,t} = \log n + \eta \sum_t c_{i,t} + \eta^2 \sum_t |c_{i,t}|$ (combine the sums).
39 Proof of MW: combine the upper and lower bounds. Dividing by $\eta$, we get what we want:
$\sum_t c_t \cdot p_t \le \frac{\log n}{\eta} + \sum_t c_{i,t} + \eta \sum_t |c_{i,t}|$.
40 Some remarks. There is a matrix form of MW, and a version with gains instead of losses. See Arora et al.'s paper for more details.
56 Learning with Experts revisited
57 Multiplicative Weights with continuous label spaces.
58 Exponential Weighted Average: continuous prediction case. Formal setup: there are T rounds in which you predict a label $\hat p_t \in C$, where C is a convex subset of some vector space. At every timestep t: 1. Each expert $i = 1 \dots n$ predicts $p_{i,t} \in C$. 2. You make a prediction $\hat p_t \in C$. 3. Nature reveals $y_t \in Y$. 4. We suffer a loss $\ell(\hat p_t, y_t)$. 5. Each expert suffers a loss $\ell(p_{i,t}, y_t)$. Loss: $\hat L_T = \sum_{t=1}^T \ell(\hat p_t, y_t)$, $L_{i,T} = \sum_{t=1}^T \ell(p_{i,t}, y_t)$. Regret: $R_T = \hat L_T - \min_i L_{i,T}$.
59 Exponential Weighted Average: continuous prediction case. We assume that the loss $\ell : C \times Y \to \mathbb{R}$ is bounded, $\forall p \in C, y \in Y: \ell(p, y) \in [0, 1]$, and that $\ell(\cdot, y)$ is convex for any fixed $y \in Y$.
60 Exponential Weighted Average. This brings us to a familiar algorithm. Choose an $\eta > 0$ (we'll make this choice more precise later).
Exponential Weighted Average algorithm:
1: Initialize $w_{i,0} = 1$, $W_0 = n$
2: for t = 1 ... T do
3:   Receive expert predictions $(p_{1,t}, \dots, p_{n,t})$
4:   Predict $\hat p_t = \frac{\sum_{i=1}^n w_{i,t-1}\, p_{i,t}}{W_{t-1}}$
5:   Receive Nature's answer $y_t$ (and incur loss $\ell(\hat p_t, y_t)$)
6:   $w_{i,t} \leftarrow w_{i,t-1}\, e^{-\eta \ell(p_{i,t}, y_t)}$
7:   $W_t \leftarrow \sum_{i=1}^n w_{i,t}$
8: end for
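A minimal Python sketch of the algorithm above (illustrative, not the lecture's code): the prediction is the weight-averaged expert prediction, and each weight decays by $e^{-\eta \ell}$. The squared loss used in the default is just my example of a convex, [0, 1]-bounded loss when predictions and labels lie in [0, 1].

```python
import math

# Sketch of Exponential Weighted Average: predict the weighted average of
# the experts' predictions, then decay each weight by exp(-eta * loss).
def ewa(expert_preds, outcomes, eta=1.0, loss=lambda p, y: (p - y) ** 2):
    n = len(expert_preds[0])
    w = [1.0] * n                            # w_{i,0} = 1
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        W = sum(w)
        p_hat = sum(wi * pi for wi, pi in zip(w, preds)) / W  # weighted average
        total_loss += loss(p_hat, y)
        w = [wi * math.exp(-eta * loss(pi, y)) for wi, pi in zip(w, preds)]
    return total_loss
```

With one expert always predicting 0 and one always predicting 1, and outcomes all 1, the prediction starts at 0.5 and shifts toward the good expert as the bad expert's weight decays.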
61 Exponential Weighted Average. Theorem (regret of EWA): assume that the loss is a function $\ell : C \times Y \to [0, 1]$, where C is convex and $\ell(\cdot, y)$ is convex for any fixed $y \in Y$. Then $R_T = \hat L_T - \min_i L_{i,T} \le \frac{\log n}{\eta} + \frac{\eta T}{8}$. Hence, if we choose $\eta = \sqrt{\frac{8 \log n}{T}}$, the regret is $R_T \le \sqrt{\frac{T \log n}{2}} = O(\sqrt{T})$. The proof follows from an application of Jensen's inequality and Hoeffding's lemma; see chapter 3 of Bartok et al. for details. Again note that we need to know T in advance.
62 So how about the regret in the discrete case?
63 Weighted Majority. Theorem (regret of Weighted Majority): let $C = Y$, let Y have at least 2 elements, and let $\ell(p, y) = 1[p \neq y]$. Let $L^*_{i,T} = \min_i L_{i,T}$. Then $\hat L_T \le \frac{\log_2\left(\frac{1}{\beta}\right) L^*_{i,T} + \log_2 N}{\log_2\left(\frac{2}{1+\beta}\right)}$. This bound is of the form $R_T = a L^*_{i,T} + b = O(T)$! The discrete case is harder than the continuous case if we stick to deterministic algorithms. The worst-case regret is achieved by a set of 2 experts and an outcome sequence $y_t$ such that $\hat L_T = T$. See chapter 4 of Bartok et al. for details.
64 Today. What should you (at least) take away from this lecture? How should you base your prediction on expert predictions? You can use a weighted majority algorithm, which is a type of MWA; this is a general framework for online learning with experts (e.g. boosting). What are the characteristics of multiplicative weights algorithms? Few assumptions on costs / Nature (they can be adversarial), therefore broadly applicable. There is a fundamental tension between fitting to the decision and learning from Nature. MW tends to have an instantaneous decision-differential of $O(\frac{1}{\sqrt{t}})$, worse than the $O(\frac{1}{t})$ of the version of FTL that we saw in lecture 1.
65 Questions?
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 9 Learning Classifiers: Bagging & Boosting Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationBagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7
Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector
More informationFoundations of Machine Learning and Data Science. Lecturer: Avrim Blum Lecture 9: October 7, 2015
10-806 Foundations of Machine Learning and Data Science Lecturer: Avrim Blum Lecture 9: October 7, 2015 1 Computational Hardness of Learning Today we will talk about some computational hardness results
More informationOnline Sparse Linear Regression
JMLR: Workshop and Conference Proceedings vol 49:1 11, 2016 Online Sparse Linear Regression Dean Foster Amazon DEAN@FOSTER.NET Satyen Kale Yahoo Research SATYEN@YAHOO-INC.COM Howard Karloff HOWARD@CC.GATECH.EDU
More informationGame Theory, On-line prediction and Boosting (Freund, Schapire)
Game heory, On-line prediction and Boosting (Freund, Schapire) Idan Attias /4/208 INRODUCION he purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting,
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationLecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem
Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods
More informationMidterm, Fall 2003
5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are
More informationAn Algorithms-based Intro to Machine Learning
CMU 15451 lecture 12/08/11 An Algorithmsbased Intro to Machine Learning Plan for today Machine Learning intro: models and basic issues An interesting algorithm for combining expert advice Avrim Blum [Based
More information2 Upper-bound of Generalization Error of AdaBoost
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #10 Scribe: Haipeng Zheng March 5, 2008 1 Review of AdaBoost Algorithm Here is the AdaBoost Algorithm: input: (x 1,y 1 ),...,(x m,y
More informationTuring Machine Recap
Turing Machine Recap DFA with (infinite) tape. One move: read, write, move, change state. High-level Points Church-Turing thesis: TMs are the most general computing devices. So far no counter example Every
More information15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015
5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and
More informationEmpirical Risk Minimization Algorithms
Empirical Risk Minimization Algorithms Tirgul 2 Part I November 2016 Reminder Domain set, X : the set of objects that we wish to label. Label set, Y : the set of possible labels. A prediction rule, h:
More informationNondeterministic finite automata
Lecture 3 Nondeterministic finite automata This lecture is focused on the nondeterministic finite automata (NFA) model and its relationship to the DFA model. Nondeterminism is an important concept in the
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationarxiv: v4 [cs.lg] 27 Jan 2016
The Computational Power of Optimization in Online Learning Elad Hazan Princeton University ehazan@cs.princeton.edu Tomer Koren Technion tomerk@technion.ac.il arxiv:1504.02089v4 [cs.lg] 27 Jan 2016 Abstract
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationStat 260/CS Learning in Sequential Decision Problems. Peter Bartlett
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationLecture 9: Generalization
Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen
More informationThe No-Regret Framework for Online Learning
The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,
More informationAdvanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12
Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you
More informationDecision Trees: Overfitting
Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationAn analogy from Calculus: limits
COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More informationLecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018
CS17 Integrated Introduction to Computer Science Klein Contents Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018 1 Tree definitions 1 2 Analysis of mergesort using a binary tree 1 3 Analysis of
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 2018
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 208 Review of Game heory he games we will discuss are two-player games that can be modeled by a game matrix
More informationThe Perceptron algorithm
The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following
More information