Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research]

Learning Is (Often) Just a Game
some learning problems:
  learn from training examples to detect faces in images
  sequentially predict what decision a person will make next (e.g., what link a user will click on)
  learn to drive toy car by observing how human does it
these appear to be very different
this talk: can all be viewed as game-playing problems, and can all be handled as applications of a single algorithm for playing repeated games
take-home message:
  game-theoretic algorithms have wide applications in learning
  reveals deep connections between seemingly unrelated learning problems

Outline
how to play repeated games: theory and an algorithm
applications:
  boosting
  on-line learning
  apprenticeship learning

How to Play Repeated Games [with Freund]

Games
game defined by matrix M:

              Rock   Paper   Scissors
  Rock         1/2     1        0
  Paper         0     1/2       1
  Scissors      1      0       1/2

row player ("Mindy") chooses row i
column player ("Max") chooses column j (simultaneously)
Mindy's goal: minimize her loss M(i, j)
assume (wlog) all entries in [0,1]

Randomized Play
usually allow randomized play:
  Mindy chooses distribution P over rows
  Max chooses distribution Q over columns (simultaneously)
Mindy's (expected) loss = Σ_{i,j} P(i) M(i,j) Q(j) = PᵀMQ ≡ M(P,Q)
i, j = "pure" strategies
P, Q = "mixed" strategies
m = # rows of M
also write M(i,Q) and M(P,j) when one side plays pure and other plays mixed
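
As a quick concrete check of the expected-loss formula above, here is a minimal sketch in Python/NumPy (my own illustration, using the RPS matrix from the previous slide; the particular P and Q are arbitrary):

import numpy as np

# Rock-Paper-Scissors loss matrix for the row player (Mindy), entries in [0,1]
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([0.5, 0.25, 0.25])   # Mindy's mixed strategy over rows
Q = np.array([1/3, 1/3, 1/3])     # Max's mixed strategy over columns

# expected loss M(P,Q) = sum_{i,j} P(i) M(i,j) Q(j) = P^T M Q
expected_loss = P @ M @ Q
print(expected_loss)              # 0.5 here: uniform Q makes every row equally costly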

Sequential Play
say Mindy plays before Max
if Mindy chooses P, then Max will pick Q to maximize M(P,Q); loss will be
  L(P) ≡ max_Q M(P,Q) = max_j M(P,j)
  [note: maximum realized at a pure strategy]
so Mindy should pick P to minimize L(P); loss will be
  min_P L(P) = min_P max_Q M(P,Q)
similarly, if Max plays first, loss will be max_Q min_P M(P,Q)

Minmax Theorem
playing second (with knowledge of other player's move) cannot be worse than playing first, so:
  min_P max_Q M(P,Q)  ≥  max_Q min_P M(P,Q)
  (left side: Mindy plays first; right side: Mindy plays second)
von Neumann's minmax theorem:
  min_P max_Q M(P,Q) = max_Q min_P M(P,Q)
in words: no advantage to playing second

Optimal Play
minmax theorem:
  min_P max_Q M(P,Q) = max_Q min_P M(P,Q) = value v of game
optimal strategies:
  P* = argmin_P max_Q M(P,Q) = minmax strategy
  Q* = argmax_Q min_P M(P,Q) = maxmin strategy
in words:
  Mindy's minmax strategy P* guarantees loss ≤ v (regardless of Max's play)
  optimal because Max has maxmin strategy Q* that can force loss ≥ v (regardless of Mindy's play)
e.g.: in RPS, P* = Q* = uniform
solving game = finding minmax/maxmin strategies
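
As the slide notes, solving a game means computing these strategies, e.g. by linear programming. Below is a minimal sketch of that LP (my own illustration, assuming SciPy is available; not part of the original talk): minimize v subject to (PᵀM)_j ≤ v for every column j and P a distribution.

import numpy as np
from scipy.optimize import linprog

M = np.array([[0.5, 1.0, 0.0],    # RPS loss matrix from before
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
m, n = M.shape

# variables x = (P(1), ..., P(m), v); objective: minimize v
c = np.concatenate([np.zeros(m), [1.0]])
# for each column j: sum_i P(i) M(i,j) - v <= 0
A_ub = np.hstack([M.T, -np.ones((n, 1))])
b_ub = np.zeros(n)
# P must sum to 1
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])
bounds = [(0, 1)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
P_star, v = res.x[:m], res.x[m]
print(P_star, v)   # for RPS: roughly [1/3, 1/3, 1/3] and value 0.5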

Weaknesses of Classical Theory
seems to fully answer how to play games: just compute minmax strategy (e.g., using linear programming)
weaknesses:
  game M may be unknown
  game M may be extremely large
  opponent may not be fully adversarial, so may be possible to do better than value v
  e.g.:
    Lisa (thinks): "Poor predictable Bart, always takes Rock."
    Bart (thinks): "Good old Rock, nothing beats that."

Repeated Play if only playing once, hopeless to overcome ignorance of game M or opponent but if game played repeatedly, may be possible to learn to play well goal: play (almost) as well as if knew game and how opponent would play ahead of time

Repeated Play (cont.)
M unknown
for t = 1,...,T:
  Mindy chooses P_t
  Max chooses Q_t (possibly depending on P_t)
  Mindy's loss = M(P_t, Q_t)
  Mindy observes loss M(i, Q_t) of each pure strategy i
want:
  (1/T) Σ_{t=1}^T M(P_t, Q_t)  ≤  min_P (1/T) Σ_{t=1}^T M(P, Q_t) + [small amount]
  (actual average loss ≤ best average loss in hindsight + small amount)

Multiplicative-weights Algorithm (MW)
choose η > 0
initialize: P_1 = uniform
on round t:
  P_{t+1}(i) = P_t(i) exp(-η M(i, Q_t)) / normalization
idea: decrease weight of strategies suffering the most loss
directly generalizes [Littlestone & Warmuth]
other algorithms: [Hannan '57] [Blackwell '56] [Foster & Vohra] [Fudenberg & Levine] ...
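
A minimal sketch of this update in code (my own rendering in Python/NumPy; the opponent argument is a placeholder for however Max chooses Q_t):

import numpy as np

def mw_play(M, T, eta, opponent):
    """Run MW for T rounds on loss matrix M (rows = Mindy's pure strategies).
    opponent(P_t) is any function returning Max's column distribution Q_t."""
    m, n = M.shape
    P = np.ones(m) / m                    # P_1 = uniform
    avg_loss = 0.0
    for t in range(T):
        Q = opponent(P)
        losses = M @ Q                    # loss M(i, Q_t) of each pure strategy i
        avg_loss += (P @ losses) / T      # Mindy's loss M(P_t, Q_t)
        P = P * np.exp(-eta * losses)     # decrease weight of high-loss strategies
        P = P / P.sum()                   # normalization
    return P, avg_loss

A step size eta on the order of sqrt(ln(m)/T) matches the analysis on the next slide.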

Analysis
Theorem: can choose η so that, for any game M with m rows, and any opponent,
  (1/T) Σ_{t=1}^T M(P_t, Q_t)  ≤  min_P (1/T) Σ_{t=1}^T M(P, Q_t) + Δ_T
  (actual average loss ≤ best average loss (≤ v) + Δ_T)
where Δ_T = O(√(ln m / T)) → 0
regret Δ_T is:
  logarithmic in # rows m
  independent of # columns
running time (of MW) is:
  linear in # rows m
  independent of # columns
therefore, can use when working with very large games
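
Where the choice of η and the √(ln m / T) rate come from, in one line: the generic exponential-weights bound for losses in [0,1] (the constants in the Freund-Schapire analysis may differ slightly) gives

  (1/T) Σ_{t=1}^T M(P_t,Q_t) ≤ min_P (1/T) Σ_{t=1}^T M(P,Q_t) + ln(m)/(ηT) + η/8,

and minimizing the last two terms over η gives η = √(8 ln(m)/T), hence Δ_T ≤ √(ln(m)/(2T)) = O(√(ln(m)/T)).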

Can Prove Minmax Theorem as Corollary
want to prove: min_P max_Q M(P,Q) ≤ max_Q min_P M(P,Q)
  (already argued ≥)
game M played repeatedly
Mindy plays using MW
on round t, Max chooses best response:
  Q_t = argmax_Q M(P_t, Q)   [note: Q_t can be pure]
let P̄ = (1/T) Σ_{t=1}^T P_t,   Q̄ = (1/T) Σ_{t=1}^T Q_t

Proof of Minmax Theorem
  min_P max_Q M(P,Q)
    ≤ max_Q M(P̄, Q)
    = max_Q (1/T) Σ_{t=1}^T M(P_t, Q)
    ≤ (1/T) Σ_{t=1}^T max_Q M(P_t, Q)
    = (1/T) Σ_{t=1}^T M(P_t, Q_t)
    ≤ min_P (1/T) Σ_{t=1}^T M(P, Q_t) + Δ_T
    = min_P M(P, Q̄) + Δ_T
    ≤ max_Q min_P M(P,Q) + Δ_T

Solving a Game
derivation shows that:
  max_Q M(P̄, Q) ≤ v + Δ_T   and   min_P M(P, Q̄) ≥ v - Δ_T
so: P̄ and Q̄ are Δ_T-approximate minmax and maxmin strategies
further: can choose the Q_t's to be pure
  then Q̄ = (1/T) Σ_t Q_t will be sparse (≤ T non-zero entries)
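
Putting the pieces together, here is a sketch of the game-solving recipe just described (my own illustration): run MW for the rows, let the column player best-respond with a pure strategy each round, and average both sides' plays.

import numpy as np

def solve_game(M, T=2000):
    """Approximately compute minmax/maxmin strategies of loss matrix M via MW."""
    m, n = M.shape
    eta = np.sqrt(8.0 * np.log(m) / T)
    P = np.ones(m) / m
    P_sum = np.zeros(m)
    Q_counts = np.zeros(n)
    for t in range(T):
        j = int(np.argmax(P @ M))         # pure best response for the column player
        P_sum += P
        Q_counts[j] += 1
        P = P * np.exp(-eta * M[:, j])    # MW update against pure column j
        P /= P.sum()
    return P_sum / T, Q_counts / T        # P-bar and Q-bar (Q-bar has <= T non-zero entries)

On the RPS matrix from earlier, both averages come out close to uniform, matching P* = Q* = uniform.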

Summary of Game Playing
presented MW algorithm for repeatedly playing matrix games
  plays almost as well as best fixed strategy in hindsight
  very efficient: independent of # columns
proved minmax theorem
MW can also be used to approximately solve a game

Application to Boosting [with Freund]

Example: Spam Filtering
problem: filter out spam (junk email)
gather large collection of examples of spam and non-spam:
  From: yoav@ucsd.edu  "Rob, can you review a paper..."  -> non-spam
  From: xa412@hotmail.com  "Earn money without working!!!!..."  -> spam
goal: have computer learn from examples to distinguish spam from non-spam
main observation:
  easy to find "rules of thumb" that are often correct ("if viagra occurs in message, then predict spam")
  hard to find single rule that is very highly accurate

The Boosting Approach
given:
  training examples
  access to weak learning algorithm for finding weak classifiers ("rules of thumb")
repeat:
  choose distribution on training examples (intuition: put most weight on "hardest" examples)
  train weak learner on distribution to get weak classifier
combine weak classifiers into final classifier: use (weighted) majority vote
weak learning assumption: weak classifiers slightly but consistently better than random
want: final classifier to be very accurate

Boosting as a Game
Mindy (row player) = booster
Max (column player) = weak learner
matrix M:
  row = training example
  column = weak classifier
  M(i,j) = 1 if j-th weak classifier correct on i-th training example, 0 else
encodes which weak classifiers are correct on which examples
huge # of columns: one for every possible weak classifier

Boosting and the Minmax Theorem
γ-weak learning assumption: for every distribution on examples, can find weak classifier with weighted error ≤ 1/2 - γ
equivalent to: (value of game M) ≥ 1/2 + γ
by minmax theorem, implies that:
  there exists a weighted majority classifier that correctly classifies all training examples
  further, weights are given by maxmin strategy of game M
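
Spelling out the implication in the notation above: if the value of M is at least 1/2 + γ, the maxmin strategy Q* satisfies min_i M(i, Q*) ≥ 1/2 + γ, i.e., on every training example i the weak classifiers weighted by Q* are correct with total weight more than 1/2, so the Q*-weighted majority vote classifies every training example correctly.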

Idea for Boosting
maxmin strategy of M has perfect (training) accuracy
find it approximately using earlier algorithm for solving a game, i.e., apply MW to M
yields (variant of) AdaBoost
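
A minimal sketch of boosting seen this way (my own illustration, not the exact AdaBoost update): MW keeps a distribution over the training examples (the rows), and weak_learn is a hypothetical helper standing in for any weak learning algorithm.

import numpy as np

def boost_by_mw(X, y, weak_learn, T, eta):
    """Boosting via MW over training examples; labels y are in {-1, +1}.
    weak_learn(X, y, D) is a hypothetical helper returning a classifier h(x) in {-1, +1}
    that is slightly better than random under distribution D."""
    n = len(y)
    D = np.ones(n) / n                    # distribution over training examples
    hs = []
    for t in range(T):
        h = weak_learn(X, y, D)
        hs.append(h)
        correct = np.array([h(x) == yi for x, yi in zip(X, y)], dtype=float)
        D = D * np.exp(-eta * correct)    # down-weight examples this h got right,
        D /= D.sum()                      # i.e., concentrate on the "hardest" examples
    def final(x):                         # final classifier: majority vote of the h's
        return 1 if sum(h(x) for h in hs) >= 0 else -1
    return final

Unlike AdaBoost proper, the step size eta here is fixed rather than chosen per round, which is one reason the slide says "variant of" AdaBoost.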

AdaBoost and Game Theory
AdaBoost can be derived as special case of general game playing algorithm (MW)
idea of boosting intimately tied to minmax theorem
game-theoretic interpretation can further be used to analyze:
  AdaBoost's convergence properties
  AdaBoost as a large-margin classifier (margin = measure of confidence, useful for analyzing generalization accuracy)

Application: Detecting Faces [Viola & Jones] problem: find faces in photograph or movie weak classifiers: detect light/dark rectangles in image many clever tricks to make extremely fast and accurate

Application to On-line Learning [with Freund]

Example: Predicting Stock Market want to predict daily if stock market will go up or down every morning: listen to predictions of other forecasters (based on market conditions) formulate own prediction every evening: find out if market actually went up or down want own predictions to be almost as good as best forecaster

On-line Learning
given access to pool of predictors
  could be: people, simple fixed functions, other learning algorithms, etc.
  for our purposes, think of as fixed functions of observable context
on each round:
  get predictions of all predictors, given current context
  make own prediction
  find out correct outcome
want: # mistakes close to that of best predictor
training and testing occur simultaneously
no statistical assumptions: examples chosen by adversary

On-line Learning as a Game
Mindy (row player) = learner
Max (column player) = adversary
matrix M:
  row = predictor
  column = context
  M(i,j) = 1 if i-th predictor wrong on j-th context, 0 else
actually, the same game as in boosting, but with roles of players reversed

Applying MW
can apply MW to game M = "weighted majority algorithm" [Littlestone & Warmuth]
MW analysis gives:
  (# mistakes of MW)  ≤  (# mistakes of best predictor) + [small amount]
regret only logarithmic in # predictors
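
A sketch of the resulting prediction loop (my own rendering; predictions and outcomes are bits, and the learner follows the weighted majority of the pool):

import numpy as np

def weighted_majority(predictors, rounds, eta=0.5):
    """Online prediction with multiplicative weights over a fixed pool of predictors.
    rounds yields (context, outcome) pairs; each predictor maps context -> 0/1."""
    w = np.ones(len(predictors))
    mistakes = 0
    for context, outcome in rounds:
        preds = np.array([p(context) for p in predictors])
        vote = 1 if w @ preds >= w.sum() / 2 else 0   # weighted majority vote
        mistakes += int(vote != outcome)
        w = w * np.exp(-eta * (preds != outcome))     # penalize wrong predictors
    return mistakes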

Example: Mind-reading Game [with Freund & Doshi]
can use to play "penny-matching": human and computer each choose a bit; if they match, computer wins, else human wins
random play wins with 50% probability, but very hard for humans to play randomly
can do better if can predict what opponent will do
algorithm: MW with many fixed predictors, each prediction based on recent history of plays
  e.g.: if human played 0 on last two rounds, then predict next play will be 1, else predict next play will be 0
play at: seed.ucsd.edu/mindreader
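
The pool of fixed predictors described above can be generated mechanically: one expert for every lookup table from the last k human plays to a predicted next bit. A rough sketch of that construction (my own illustration; such a pool could be fed to a weighted-majority loop like the one above):

from itertools import product

def history_predictors(k=2):
    """One predictor per table mapping each length-k history of the human's plays
    to a predicted next bit (2^(2^k) predictors in total)."""
    predictors = []
    for table in product([0, 1], repeat=2 ** k):
        def p(history, table=table, k=k):
            if len(history) < k:
                return 0                  # arbitrary prediction before enough history
            idx = int("".join(map(str, history[-k:])), 2)
            return table[idx]
        predictors.append(p)
    return predictors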

Histogram of Scores
[figure: histogram of number of games vs. score; negative scores = computer wins, positive scores = human wins]
score = (rounds won by human) - (rounds won by computer)
based on 11,882 games
computer won 86.6% of games
average score = -41.0

Slower is Better (for People)
[figure: average game duration (seconds) vs. score]
the faster people play, the lower their score
when playing fast, people probably fall into a rhythmic pattern that is easy for the computer to detect/predict

Application to Apprenticeship Learning [with Syed]

Learning How to Behave in Complex Environments
example: learning to drive in toy world
  states: positions of cars
  actions: steer left/right; speed up; slow down
  reward: for going fast without crashing or going off road
in general, called Markov decision process (MDP)
goal: find policy that maximizes expected long-term reward
  policy = mapping from states to actions

Unknown Reward Function [Abbeel & Ng]
when real-valued reward function is known, optimal policy can be found using well-established techniques
however, often reward function not known
in driving example: may know reward "features":
  faster is better than slower
  collisions are bad
  going off road is bad
but may not know how to combine these into a single real-valued function
we assume reward is an unknown convex combination of known real-valued features

Apprenticeship Learning
also assume get to observe behavior of expert
goal:
  imitate expert
  improve on expert, if possible

Game-theoretic Approach
want to find policy such that improvement over expert is as large as possible, for all possible reward functions
can formulate as game:
  rows = features
  columns = policies (note: extremely large number of columns)
optimal (randomized) policy, under our criterion, is exactly maxmin of game
can solve (approximately) using MW: to pick best column on each round, use methods for standard MDPs with known reward
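
A rough sketch of that outer loop (my own illustration under strong assumptions, not the exact algorithm from the paper): MW maintains a distribution over the k reward features, mu_expert holds the expert's estimated feature expectations, and solve_mdp is a hypothetical helper returning an optimal policy and its feature expectations for a given weighting of the features.

import numpy as np

def apprenticeship_mw(mu_expert, solve_mdp, T, eta):
    """Game-theoretic apprenticeship learning sketch.
    mu_expert: length-k vector of the expert's expected feature values.
    solve_mdp(w): hypothetical helper returning (policy, mu), where mu are the
    feature expectations of a policy optimal for the w-weighted reward."""
    k = len(mu_expert)
    W = np.ones(k) / k                   # distribution over reward features (the rows)
    policies = []
    for t in range(T):
        policy, mu = solve_mdp(W)        # best-response column: solve the MDP for this reward
        policies.append(policy)
        gains = mu - mu_expert           # per-feature improvement over the expert
        W = W * np.exp(-eta * gains)     # shift weight onto features where we lag behind
        W /= W.sum()
    return policies                      # mixed policy: pick one uniformly at random per episode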

Conclusions
presented MW algorithm for playing repeated games
  used to prove minmax theorem
  can use to approximately solve games efficiently
many diverse applications and insights:
  boosting: key concepts and algorithms are all game-theoretic
  on-line learning: in essence, the same problem as boosting
  apprenticeship learning: new and richer problem, but solvable as a game

References
Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Ninth Annual Conference on Computational Learning Theory, 1996.
Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103, 1999.
Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Coming soon (hopefully): Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2010(?).