
Online Learning 9.520 Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das

About this class

Goal:
- To introduce the general setting of online learning.
- To describe an online version of the RLS algorithm and analyze its performance.
- To discuss convergence results of the classical Perceptron algorithm.
- To introduce the experts framework and prove mistake bounds in that framework.
- To show the relationship between online learning and the theory of learning in games.

What is online learning?
Sample data are arranged in a sequence. Each time we get a new input, the algorithm tries to predict the corresponding output. As the number of samples seen so far increases, the predictions hopefully improve.

Assets
1. Does not require storing all data samples.
2. Typically fast algorithms.
3. It is possible to give formal guarantees that do not assume probabilistic hypotheses (mistake bounds).
But...
- Performance can be worse than that of the best batch algorithms.
- Generalization bounds always require some assumption on the generation of the sample data.

Online setting
Sequence of sample data $z_1, z_2, \dots, z_n$. Each sample is an input-output pair $z_i = (x_i, y_i)$, with $x_i \in X \subseteq \mathbb{R}^d$ and $y_i \in Y \subseteq \mathbb{R}$. In the classification case $Y = \{+1, -1\}$; in the regression case $Y = [-M, M]$.
Loss function $V : \mathbb{R} \times Y \to \mathbb{R}_+$ (e.g. the 0-1 loss $E(w, y) = \Theta(-yw)$ and the hinge loss $V(w, y) = |1 - yw|_+$).
Estimators $f_i : X \to Y$ constructed using the first $i$ data samples.

Online setting (cont.)
initialization: $f_0$
for $i = 1, 2, \dots, n$
  receive $x_i$
  predict $f_{i-1}(x_i)$
  receive $y_i$
  update: $(f_{i-1}, z_i) \to f_i$
Note: storing $f_{i-1}$ efficiently may require much less memory than storing all previous samples $z_1, z_2, \dots, z_{i-1}$.
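A minimal sketch of this protocol in Python (the estimator's predict/update interface and the loss signature are illustrative, not part of the lecture):

```python
# Minimal sketch of the online protocol above: predict, incur loss, update.
def run_online(estimator, samples, loss):
    """samples yields pairs (x_i, y_i); returns the cumulative loss."""
    cumulative_loss = 0.0
    for x_i, y_i in samples:                  # receive x_i
        y_hat = estimator.predict(x_i)        # predict f_{i-1}(x_i)
        cumulative_loss += loss(y_hat, y_i)   # receive y_i, pay V(f_{i-1}(x_i), y_i)
        estimator.update(x_i, y_i)            # update (f_{i-1}, z_i) -> f_i
    return cumulative_loss
```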

Goals
Batch learning: reducing the expected loss $I[f_n] = \mathbb{E}_z\, V(f_n(x), y)$.
Online learning: reducing the cumulative loss $\sum_{i=1}^n V(f_{i-1}(x_i), y_i)$.

Online implementation of RLS
Update rule: for some choice of the sequences of positive parameters $\gamma_i$ and $\lambda_i$,
$$ f_i = f_{i-1} - \gamma_i \big( (f_{i-1}(x_i) - y_i) K_{x_i} + \lambda_i f_{i-1} \big), $$
where $K : X \times X \to \mathbb{R}$ is a positive definite kernel and, for every $x \in X$, $K_x(x') = K(x, x')$.
Note: this rule has a simple justification assuming that the sample points $(x_i, y_i)$ are i.i.d. from a probability distribution $\rho$.
[Online learning algorithms. Smale, Yao, '05]
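Written in terms of kernel expansion coefficients, the update shrinks all existing coefficients and adds a new center at $x_i$. A coefficient-space sketch, assuming a Gaussian kernel and constant step and regularization parameters (these choices are illustrative):

```python
import numpy as np

# Online RLS in coefficient form: f_i(x) = sum_j coeffs[j] * K(centers[j], x).
# The update f_i = (1 - gamma*lam) f_{i-1} - gamma (f_{i-1}(x_i) - y_i) K_{x_i}
# shrinks the old coefficients and appends a new one for the center x_i.
def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

def online_rls(samples, gamma=0.1, lam=0.1, kernel=gaussian_kernel):
    centers, coeffs = [], []
    for x_i, y_i in samples:
        f_prev = sum(c * kernel(x_j, x_i) for c, x_j in zip(coeffs, centers))
        coeffs = [(1 - gamma * lam) * c for c in coeffs]  # shrink: -gamma*lam*f_{i-1}
        coeffs.append(-gamma * (f_prev - y_i))            # new center at x_i
        centers.append(x_i)
    return centers, coeffs
```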

Interpretation of online RLS
For the sake of simplicity, let us set $\lambda_t = \lambda > 0$. We would like to estimate the ideal regularized least-squares estimator $f_\rho^\lambda$:
$$ f_\rho^\lambda = \arg\min_{f \in \mathcal{H}_K} \int_{X \times Y} (f(x) - y)^2 \, d\rho(z) + \lambda \|f\|_K^2 . $$
From the definition above it can be shown that $f_\rho^\lambda$ also satisfies
$$ \mathbb{E}_{z \sim \rho}\big[ (f_\rho^\lambda(x) - y) K_x + \lambda f_\rho^\lambda \big] = 0, $$
therefore $f_\rho^\lambda$ is also the equilibrium point of the averaged online update equation
$$ f_i = f_{i-1} - \gamma_i \, \mathbb{E}_{z_i \sim \rho}\big[ (f_{i-1}(x_i) - y_i) K_{x_i} + \lambda f_{i-1} \big]. $$

Generalization bound for online algorithm
Theorem: Let $f_\rho$ be the minimizer of the expected squared loss $I[f]$ (i.e. the regression function). Assume $K(x, x) \le \kappa$ for some positive constant $\kappa$, and $L_K^{-r} f_\rho \in L^2(X, \rho_X)$ for some $r \in [1/2, 1]$. Then, letting $\gamma_i = c_1 i^{-\frac{2r}{2r+1}}$ and $\lambda_i = c_2 i^{-\frac{1}{2r+1}}$ for some constants $c_1$ and $c_2$, with probability greater than $1 - \delta$, for all $i \in \mathbb{N}$ it holds that
$$ I[f_i] \le I[f_\rho] + C\, i^{-\frac{2r}{2r+1}}, $$
where $C$ depends on $M$, $\kappa$, $r$, $\|L_K^{-r} f_\rho\|_\rho$ and $\delta$.
Note: the rates of convergence $O\big(i^{-\frac{2r}{2r+1}}\big)$ are the best theoretically attainable under these assumptions.
[Online learning as stochastic approximations of regularization paths. Tarres, Yao, '05]

The Perceptron Algorithm
We consider the classification problem: $Y = \{-1, +1\}$. We deal with linear estimators $f_i(x) = \omega_i \cdot x$, with $\omega_i \in \mathbb{R}^d$. The 0-1 loss $E(f_i(x), y) = \Theta(-y(\omega_i \cdot x))$ is the natural choice in the classification context. We will also consider the more tractable hinge loss $V(f_i(x), y) = |1 - y(\omega_i \cdot x)|_+$.
Update rule: if $E_i = E(f_{i-1}(x_i), y_i) = 0$ then $\omega_i = \omega_{i-1}$, otherwise $\omega_i = \omega_{i-1} + y_i x_i$.
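A direct translation of this update rule into Python (a minimal sketch; treating a zero margin as a mistake is a common convention, not specified on the slide):

```python
import numpy as np

# Perceptron: keep the weights on a correct prediction, add y_i * x_i on a mistake.
def perceptron(X, y):
    """X: (n, d) array of inputs, y: labels in {-1, +1}; one pass over the data."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:   # 0-1 loss E_i = 1: mistake
            w = w + y_i * x_i
            mistakes += 1
    return w, mistakes
```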

The Perceptron Algorithm (cont.)
Passive-Aggressive strategy of the update rule: if $f_{i-1}$ classifies $x_i$ correctly, don't move; if $f_{i-1}$ classifies it incorrectly, try to increase the margin $y_i(\omega \cdot x_i)$. In fact,
$$ y_i(\omega_i \cdot x_i) = y_i(\omega_{i-1} \cdot x_i) + \|x_i\|^2 > y_i(\omega_{i-1} \cdot x_i). $$

Perceptron Convergence Theorem
Theorem: If the samples $z_1, \dots, z_n$ are linearly separable, then, presenting them cyclically to the Perceptron algorithm, the sequence of weight vectors $\omega_i$ will eventually converge.
We will prove a more general result encompassing both the separable and the inseparable cases.
[Pattern Classification. Duda, Hart, Stork, '01]

Mistake Bound
Theorem: Assume $\|x_i\| \le R$ for every $i = 1, 2, \dots, n$. Then for every $u \in \mathbb{R}^d$,
$$ \sqrt{M} \;\le\; R\,\|u\| + \sqrt{\sum_{i=1}^n \hat{V}_i^2}, $$
where $\hat{V}_i = V(u \cdot x_i, y_i)$ and $M$ is the total number of mistakes: $M = \sum_{i=1}^n E_i = \sum_{i=1}^n E(f_{i-1}(x_i), y_i)$.
[Online Passive-Aggressive Algorithms. Crammer et al., '03]

Mistake Bound (cont.)
- The boundedness condition $\|x_i\| \le R$ is necessary.
- In the separable case, there exists $u$ inducing margins $y_i(u \cdot x_i) \ge 1$, and therefore null batch loss over the sample points. The mistake bound becomes $M \le R^2 \|u\|^2$.
- In the inseparable case, we can let $u$ be the best possible linear separator. The bound compares the online performance with the best batch performance over a given class of competitors.
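A small numerical sanity check of the separable-case bound $M \le R^2\|u\|^2$, running a single Perceptron pass over synthetic linearly separable data (the data-generating recipe, seed, and constants are all illustrative):

```python
import numpy as np

# Sanity check of the separable-case bound M <= R^2 ||u||^2 on synthetic data.
rng = np.random.default_rng(0)
u_star = np.array([1.0, -2.0])                  # true separating direction
X = rng.normal(size=(500, 2))
y = np.sign(X @ u_star)

u = u_star / np.abs(X @ u_star).min()           # rescale so that y_i (u . x_i) >= 1
R = np.linalg.norm(X, axis=1).max()

w, mistakes = np.zeros(2), 0
for x_i, y_i in zip(X, y):                      # one Perceptron pass over the data
    if y_i * np.dot(w, x_i) <= 0:
        w = w + y_i * x_i
        mistakes += 1

print(mistakes, "<=", R ** 2 * np.linalg.norm(u) ** 2)   # the bound should hold
```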

Proof
The terms $\omega_i \cdot u$ increase as $i$ increases:
1. If $E_i = 0$ then $\omega_i \cdot u = \omega_{i-1} \cdot u$.
2. If $E_i = 1$, since $\hat{V}_i = |1 - y_i(x_i \cdot u)|_+$,
$$ \omega_i \cdot u = \omega_{i-1} \cdot u + y_i(x_i \cdot u) \ge \omega_{i-1} \cdot u + 1 - \hat{V}_i. $$
3. Hence, in both cases, $\omega_i \cdot u \ge \omega_{i-1} \cdot u + (1 - \hat{V}_i) E_i$.
4. Summing up, $\omega_n \cdot u \ge M - \sum_{i=1}^n \hat{V}_i E_i$.

Proof (cont.)
The terms $\|\omega_i\|$ do not increase too quickly:
1. If $E_i = 0$ then $\|\omega_i\|^2 = \|\omega_{i-1}\|^2$.
2. If $E_i = 1$, since $y_i(\omega_{i-1} \cdot x_i) \le 0$,
$$ \|\omega_i\|^2 = (\omega_{i-1} + y_i x_i) \cdot (\omega_{i-1} + y_i x_i) = \|\omega_{i-1}\|^2 + \|x_i\|^2 + 2 y_i (\omega_{i-1} \cdot x_i) \le \|\omega_{i-1}\|^2 + R^2. $$
3. Summing up, $\|\omega_n\|^2 \le M R^2$.

Proof (cont.)
Using the estimates for $\omega_n \cdot u$ and $\|\omega_n\|^2$, and applying the Cauchy-Schwarz inequality:
1. By Cauchy-Schwarz, $\omega_n \cdot u \le \|\omega_n\| \, \|u\|$, hence
$$ M - \sum_{i=1}^n \hat{V}_i E_i \;\le\; \omega_n \cdot u \;\le\; \|\omega_n\| \, \|u\| \;\le\; \sqrt{M}\, R\, \|u\|. $$
2. Finally, by Cauchy-Schwarz, $\sum_{i=1}^n \hat{V}_i E_i \le \sqrt{\sum_{i=1}^n \hat{V}_i^2} \sqrt{\sum_{i=1}^n E_i^2} = \sqrt{M} \sqrt{\sum_{i=1}^n \hat{V}_i^2}$, hence
$$ \sqrt{M} \;\le\; R\,\|u\| + \sqrt{\sum_{i=1}^n \hat{V}_i^2}. $$

The Experts Framework
We will focus on the classification case. Suppose we have a pool of prediction strategies, called experts, denoted $E = \{E_1, \dots, E_n\}$. Each expert predicts $y_i$ based on $x_i$. We want to combine these experts to produce a single master algorithm for classification and prove bounds on how much worse it is than the best expert.

The Halving Algorithm
Suppose all the experts are functions (their predictions for a point in the space do not change over time) and at least one of them is consistent with the data. At each step, predict what the majority of the experts that have not made a mistake so far would predict. Note that all inconsistent experts get thrown away! Maximum of $\log_2(|E|)$ errors.
But what if there is no consistent function in the pool? (Noise in the data, limited pool, etc.)
[Barzdin and Freivald, On the prediction of general recursive functions, 1972; Littlestone and Warmuth, The Weighted Majority Algorithm, 1994]
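A minimal sketch of the halving rule, assuming each expert is a callable returning a label in {-1, +1} (this interface is illustrative):

```python
# Halving Algorithm: predict with the majority vote of the experts that have never
# erred, then discard every expert inconsistent with the revealed label y_t.
def halving(experts, samples):
    alive = list(experts)
    mistakes = 0
    for x_t, y_t in samples:
        votes = sum(e(x_t) for e in alive)
        prediction = 1 if votes > 0 else -1
        if prediction != y_t:
            mistakes += 1                              # <= log2(|E|) if some expert is consistent
        alive = [e for e in alive if e(x_t) == y_t]    # throw away inconsistent experts
    return mistakes
```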

The Weighted Majority Algorithm
Associate a weight $w_i$ with every expert. Initialize all weights to 1. At example $t$:
$$ q_{-1} = \sum_{i=1}^{|E|} w_i \, I[E_i \text{ predicted } y_t = -1], \qquad q_{1} = \sum_{i=1}^{|E|} w_i \, I[E_i \text{ predicted } y_t = 1]. $$
Predict $y_t = 1$ if $q_1 > q_{-1}$, else predict $y_t = -1$.
If the prediction is wrong, multiply the weights of each expert that made a wrong prediction by $0 \le \beta < 1$. Note that for $\beta = 0$ we get the halving algorithm.
[Littlestone and Warmuth, 1994]
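The same rule in code, reusing the callable-expert convention from the halving sketch (illustrative; weights are updated only on rounds where the master errs, as on the slide):

```python
# Weighted Majority: weighted vote; when the master is wrong, multiply the weight of
# every expert that was wrong by beta (beta = 0 recovers the halving algorithm).
def weighted_majority(experts, samples, beta=0.5):
    weights = [1.0] * len(experts)
    mistakes = 0
    for x_t, y_t in samples:
        preds = [e(x_t) for e in experts]                 # predictions in {-1, +1}
        q_pos = sum(w for w, p in zip(weights, preds) if p == 1)
        q_neg = sum(w for w, p in zip(weights, preds) if p == -1)
        prediction = 1 if q_pos > q_neg else -1
        if prediction != y_t:
            mistakes += 1
            weights = [w * beta if p != y_t else w
                       for w, p in zip(weights, preds)]   # penalize wrong experts
    return mistakes
```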

Mistake Bound for WM
For some example $t$, let $W_t = \sum_{i=1}^{|E|} w_i = q_{-1} + q_1$. Then, when a mistake occurs, $W_{t+1} \le u\,W_t$ where $u < 1$. Therefore $W_0\, u^m \ge W_n$, or
$$ m \;\le\; \frac{\log(W_0 / W_n)}{\log(1/u)}. $$
Then $m \le \dfrac{\log(W_0 / W_n)}{\log(2/(1+\beta))}$ (setting $u = \frac{1+\beta}{2}$).

Mistake Bound for WM (contd.)
Why? Because when a mistake is made, the ratio of the total weight after the trial to the total weight before the trial is at most $(1+\beta)/2$. W.l.o.g. assume WM predicted $-1$ and the true outcome was $+1$. Then the new total weight after the trial is
$$ \beta q_{-1} + q_1 \;\le\; \beta q_{-1} + q_1 + \frac{1-\beta}{2}(q_{-1} - q_1) \;=\; \frac{1+\beta}{2}(q_{-1} + q_1). $$
The main theorem (Littlestone & Warmuth): Assume $m_i$ is the number of mistakes made by the $i$th expert on a sequence of $n$ instances and that $|E| = k$. Then the WM algorithm makes at most the following number of mistakes:
$$ \frac{\log(k) + m_i \log(1/\beta)}{\log(2/(1+\beta))}. $$
Big fact: ignoring leading constants, the number of errors of the pooled predictor is bounded by the sum of the number of errors of the best expert in the pool and the log of the number of experts!

Finishing the Proof
$W_0 = k$ and $W_n \ge \beta^{m_i}$.
$\log(W_0 / W_n) = \log(W_0) - \log(W_n)$.
$\log(W_n) \ge m_i \log\beta$, so $-\log(W_n) \le m_i \log(1/\beta)$.
Therefore $\log(W_0) - \log(W_n) \le \log k + m_i \log(1/\beta)$.
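A quick numerical evaluation of the resulting bound, with illustrative values for the pool size, the best expert's mistake count, and $\beta$:

```python
import math

# Littlestone-Warmuth bound: (log k + m_i log(1/beta)) / log(2/(1+beta)).
def wm_mistake_bound(k, m_best, beta):
    return (math.log(k) + m_best * math.log(1 / beta)) / math.log(2 / (1 + beta))

print(wm_mistake_bound(k=100, m_best=20, beta=0.5))  # about 64 mistakes at most
```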

A Whirlwind Tour of Game Theory
Players choose actions and receive rewards based on their own actions and those of the other players. A strategy is a specification for how to play the game for a player. A pure strategy defines, for every possible choice a player could make, which action the player picks. A mixed strategy is a probability distribution over strategies. A Nash equilibrium is a profile of strategies for all players such that each player's strategy is an optimal response to the other players' strategies. Formally, a mixed-strategy profile $\sigma^*$ is a Nash equilibrium if for all players $i$:
$$ u_i(\sigma_i^*, \sigma_{-i}^*) \;\ge\; u_i(s_i, \sigma_{-i}^*) \qquad \forall\, s_i \in S_i. $$

Some Games: Prisoner's Dilemma

            Cooperate   Defect
Cooperate   +3, +3      0, +5
Defect      +5, 0       +1, +1

Nash equilibrium: both players defect!

Some Games: Matching Pennies

     H         T
H   +1, -1    -1, +1
T   -1, +1    +1, -1

Nash equilibrium: both players randomize half and half between actions.

Learning in Games
Suppose I don't know what payoffs my opponent will receive. I can try to learn her actions when we play repeatedly (consider 2-player games for simplicity).
Fictitious play in two-player games assumes stationarity of the opponent's strategy, and that players do not attempt to influence each other's future play. Learn weight functions
$$ \kappa_i^t(s_{-i}) = \kappa_i^{t-1}(s_{-i}) + \begin{cases} 1 & \text{if } s_{-i}^{t-1} = s_{-i}, \\ 0 & \text{otherwise.} \end{cases} $$
[Fudenberg & Levine, The Theory of Learning in Games, 1998]

Calculate the probabilities of the other player playing various moves as
$$ \gamma_i^t(s_{-i}) = \frac{\kappa_i^t(s_{-i})}{\sum_{\tilde{s}_{-i} \in S_{-i}} \kappa_i^t(\tilde{s}_{-i})}, $$
then choose the best-response action.
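A compact sketch of fictitious play on the matching pennies game from the earlier slide (the payoff matrix follows that slide; the initial weights and horizon are illustrative):

```python
import numpy as np

# Fictitious play in matching pennies: each player best-responds to the empirical
# frequencies (kappa, normalized to gamma) of the opponent's past actions.
payoff_p1 = np.array([[1, -1],   # rows: P1 plays H/T, columns: P2 plays H/T
                      [-1, 1]])
payoff_p2 = -payoff_p1           # zero-sum game

kappa_1 = np.array([1.5, 2.0])   # P1's weights over P2's actions (H, T)
kappa_2 = np.array([2.0, 1.5])   # P2's weights over P1's actions (H, T)

for t in range(10):
    gamma_2 = kappa_1 / kappa_1.sum()          # P1's belief about P2's next move
    gamma_1 = kappa_2 / kappa_2.sum()          # P2's belief about P1's next move
    a1 = int(np.argmax(payoff_p1 @ gamma_2))   # P1's best response
    a2 = int(np.argmax(gamma_1 @ payoff_p2))   # P2's best response
    kappa_1[a2] += 1                           # update beliefs with observed actions
    kappa_2[a1] += 1
    print(t + 1, "HT"[a1], "HT"[a2])           # actions typically cycle, as on the next slide
```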

Fictitious Play (contd.)
If fictitious play converges, it converges to a Nash equilibrium. If the two players ever play a (strict) NE at time $t$, they will play it thereafter. (Proofs omitted.) If the empirical marginal distributions converge, they converge to NE. But this doesn't mean that play is similar!

t   Player 1 action   Player 2 action   κ_1^t       κ_2^t
1   T                 T                 (1.5, 3)    (2, 2.5)
2   T                 H                 (2.5, 3)    (2, 3.5)
3   T                 H                 (3.5, 3)    (2, 4.5)
4   H                 H                 (4.5, 3)    (3, 4.5)
5   H                 H                 (5.5, 3)    (4, 4.5)
6   H                 H                 (6.5, 3)    (5, 4.5)
7   H                 T                 (6.5, 4)    (6, 4.5)

Cycling of actions in fictitious play in the matching pennies game.

Universal Consistency
Persistent miscoordination: players start with weights of (1, 2).

     A       B
A   0, 0    1, 1
B   1, 1    0, 0

A rule $\rho^i$ is said to be $\epsilon$-universally consistent if, for any $\rho^{-i}$,
$$ \limsup_{T} \left[ \max_{\sigma_i} u_i(\sigma_i, \gamma_{-i}^t) - \frac{1}{T} \sum_t u_i\big(\rho_t^i(h_{t-1})\big) \right] \le \epsilon $$
almost surely under the distribution generated by $(\rho^i, \rho^{-i})$, where $h_{t-1}$ is the history up to time $t-1$, available to the decision-making algorithm at time $t$.

Back to Experts
Bayesian learning cannot give good payoff guarantees. Define universal expertise analogously to universal consistency, and bound regret (lost utility) with respect to the best expert, which is a strategy. The best-response function is derived by solving the optimization problem
$$ \max_{I^i} \; I^i \cdot u_t^i + \lambda\, v^i(I^i), $$
where $u_t^i$ is the vector of average payoffs player $i$ would receive by using each of the experts, and $I^i$ is a probability distribution over experts.

$\lambda$ is a small positive number. Under technical conditions on $v$, satisfied by the negative entropy $\sum_{s} \sigma(s) \log \sigma(s)$, we retrieve the exponential weighting scheme, and for every $\epsilon$ there is a $\lambda$ such that our procedure is $\epsilon$-universally expert.
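Taking the perturbation $v$ to be the Shannon entropy (as in Fudenberg & Levine's smooth fictitious play; this sign convention is an assumption here), the maximizer of $I \cdot u_t + \lambda v(I)$ over the simplex has a closed form: the exponential (softmax) weights. A minimal sketch with illustrative values:

```python
import numpy as np

# Smooth best response with entropy regularization: the maximizer of
#   I . u_t + lambda * (-sum_s I(s) log I(s))
# over probability vectors I is the exponential-weights distribution I(s) ~ exp(u_t(s)/lambda).
def exponential_weights(avg_payoffs, lam):
    z = np.array(avg_payoffs) / lam
    z -= z.max()                  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

print(exponential_weights([1.0, 0.5, 0.0], lam=0.1))   # concentrates on the best expert
```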