Theory and Applications of a Repeated Game-Playing Algorithm
Rob Schapire, Princeton University (currently visiting Yahoo! Research)
Learning Is (Often) Just a Game
some learning problems:
- learn from training examples to detect faces in images
- sequentially predict what decision a person will make next (e.g., what link a user will click on)
- learn to drive a toy car by observing how a human does it
these appear to be very different
this talk:
- can all be viewed as game-playing problems
- can all be handled as applications of a single algorithm for playing repeated games
take-home message:
- game-theoretic algorithms have wide applications in learning
- reveals deep connections between seemingly unrelated learning problems
Outline
- how to play repeated games: theory and an algorithm
- applications:
  - boosting
  - on-line learning
  - apprenticeship learning
How to Play Repeated Games [with Freund]
Games
game defined by matrix M:

              Rock   Paper   Scissors
  Rock         1/2     1        0
  Paper         0     1/2       1
  Scissors      1      0       1/2

- row player ("Mindy") chooses row i
- column player ("Max") chooses column j (simultaneously)
- Mindy's goal: minimize her loss M(i, j)
- assume (wlog) all entries in [0, 1]
Randomized Play
usually allow randomized play:
- Mindy chooses distribution P over rows
- Max chooses distribution Q over columns (simultaneously)
- Mindy's (expected) loss = sum_{i,j} P(i) M(i,j) Q(j) = P^T M Q = M(P,Q)
- i, j = "pure" strategies; P, Q = "mixed" strategies
- m = # rows of M
- also write M(i,Q) and M(P,j) when one side plays pure and the other plays mixed
Sequential Play
say Mindy plays before Max
- if Mindy chooses P, then Max will pick Q to maximize M(P,Q), so loss will be
    L(P) = max_Q M(P,Q) = max_j M(P,j)
  [note: maximum realized at a pure strategy]
- so Mindy should pick P to minimize L(P); loss will be
    min_P L(P) = min_P max_Q M(P,Q)
- similarly, if Max plays first, loss will be
    max_Q min_P M(P,Q)
Minmax Theorem
playing second (with knowledge of the other player's move) cannot be worse than playing first, so:
    min_P max_Q M(P,Q)  >=  max_Q min_P M(P,Q)
    (left side: Mindy plays first; right side: Mindy plays second)
von Neumann's minmax theorem:
    min_P max_Q M(P,Q)  =  max_Q min_P M(P,Q)
in words: no advantage to playing second
Optimal Play
minmax theorem:
    min_P max_Q M(P,Q) = max_Q min_P M(P,Q) = value v of game
optimal strategies:
- P* = argmin_P max_Q M(P,Q) = minmax strategy
- Q* = argmax_Q min_P M(P,Q) = maxmin strategy
in words:
- Mindy's minmax strategy P* guarantees loss <= v (regardless of Max's play)
- optimal because Max has maxmin strategy Q* that can force loss >= v (regardless of Mindy's play)
e.g.: in RPS, P* = Q* = uniform
solving game = finding minmax/maxmin strategies
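The claim that uniform play is optimal for RPS can be checked numerically. The sketch below (an illustration, not part of the talk) verifies that the uniform strategy guarantees loss exactly 1/2, the value of the game, against every column:

```python
# Rock-Paper-Scissors loss matrix for the row player (Mindy):
# 1/2 = tie, 1 = Mindy loses, 0 = Mindy wins
M = [
    [0.5, 1.0, 0.0],  # Rock    vs  Rock, Paper, Scissors
    [0.0, 0.5, 1.0],  # Paper
    [1.0, 0.0, 0.5],  # Scissors
]

def loss(P, j):
    """Expected loss M(P, j): mixed row strategy P against pure column j."""
    return sum(P[i] * M[i][j] for i in range(len(P)))

uniform = [1 / 3] * 3
# uniform yields loss 1/2 against every column, so
# max_Q M(uniform, Q) = 1/2 = value v of the game
worst_case = max(loss(uniform, j) for j in range(3))
```

Since the worst-case loss over columns equals the game's value, no mixed strategy can guarantee less, which is exactly what makes uniform a minmax strategy here.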
Weaknesses of Classical Theory
seems to fully answer how to play games: just compute a minmax strategy (e.g., using linear programming)
weaknesses:
- game M may be unknown
- game M may be extremely large
- opponent may not be fully adversarial; may be possible to do better than value v
  e.g.:
    Lisa (thinks): "Poor predictable Bart, always takes Rock."
    Bart (thinks): "Good old Rock, nothing beats that."
Repeated Play
- if only playing once, hopeless to overcome ignorance of game M or of opponent
- but if game is played repeatedly, may be possible to learn to play well
- goal: play (almost) as well as if knew game and how opponent would play ahead of time
Repeated Play (cont.)
M unknown; for t = 1, ..., T:
- Mindy chooses P_t
- Max chooses Q_t (possibly depending on P_t)
- Mindy's loss = M(P_t, Q_t)
- Mindy observes loss M(i, Q_t) of each pure strategy i
want:
    (1/T) sum_{t=1}^T M(P_t, Q_t)  <=  min_P (1/T) sum_{t=1}^T M(P, Q_t) + [small amount]
    (actual average loss <= best average loss in hindsight + small amount)
Multiplicative-Weights Algorithm (MW)
- choose η > 0
- initialize: P_1 = uniform
- on round t:
    P_{t+1}(i) = P_t(i) exp(-η M(i, Q_t)) / (normalization)
- idea: decrease weight of strategies suffering the most loss
- directly generalizes [Littlestone & Warmuth]
- other algorithms: [Hannan '57], [Blackwell '56], [Foster & Vohra], [Fudenberg & Levine], ...
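The update above fits in a few lines of Python. This is a minimal sketch; the value of η and the loss vector in the toy run are illustrative choices, not the talk's:

```python
import math

def mw_update(P, losses, eta):
    """One round of MW: P_{t+1}(i) proportional to P_t(i) * exp(-eta * loss_i)."""
    w = [p * math.exp(-eta * l) for p, l in zip(P, losses)]
    Z = sum(w)  # normalization constant
    return [x / Z for x in w]

# toy run: two pure strategies; the first suffers loss 1 every round
P = [0.5, 0.5]
for _ in range(10):
    P = mw_update(P, losses=[1.0, 0.0], eta=0.5)
# nearly all weight has shifted to the strategy with no loss
```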
Analysis
Theorem: can choose η so that, for any game M with m rows, and any opponent,
    (1/T) sum_{t=1}^T M(P_t, Q_t)  <=  min_P (1/T) sum_{t=1}^T M(P, Q_t) + Δ_T
    (actual average loss <= best average loss (<= v) + regret)
where Δ_T = O(sqrt((ln m) / T)) → 0
regret Δ_T is:
- logarithmic in # rows m
- independent of # columns
running time (of MW) is:
- linear in # rows m
- independent of # columns
therefore, can use when working with very large games
Can Prove Minmax Theorem as Corollary
want to prove:
    min_P max_Q M(P,Q)  <=  max_Q min_P M(P,Q)
(already argued >=)
- game M played repeatedly
- Mindy plays using MW
- on round t, Max chooses best response:
    Q_t = argmax_Q M(P_t, Q)
  [note: Q_t can be pure]
- let P̄ = (1/T) sum_{t=1}^T P_t and Q̄ = (1/T) sum_{t=1}^T Q_t
Proof of Minmax Theorem
    min_P max_Q P^T M Q
      <=  max_Q P̄^T M Q
       =  max_Q (1/T) sum_t P_t^T M Q
      <=  (1/T) sum_t max_Q P_t^T M Q
       =  (1/T) sum_t P_t^T M Q_t            [Q_t is a best response to P_t]
      <=  min_P (1/T) sum_t P^T M Q_t + Δ_T  [MW guarantee]
       =  min_P P^T M Q̄ + Δ_T
      <=  max_Q min_P P^T M Q + Δ_T
since Δ_T → 0, the minmax theorem follows.
Solving a Game
derivation shows that:
    max_Q M(P̄, Q)  <=  v + Δ_T    and    min_P M(P, Q̄)  >=  v - Δ_T
so: P̄ and Q̄ are Δ_T-approximate minmax and maxmin strategies
further: can choose the Q_t's to be pure, so Q̄ = (1/T) sum_t Q_t will be sparse (<= T non-zero entries)
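Putting the pieces together, here is a sketch of approximately solving a game this way: run MW for the row player against best-response (pure) columns and average both sides' plays. The values of T and η are illustrative choices, not the talk's exact constants:

```python
import math

def solve_game(M, T=2000):
    """Approximately solve a matrix game: MW for the row player against
    best-response pure columns; returns the averaged strategies."""
    m, n = len(M), len(M[0])
    eta = math.sqrt(math.log(m) / T)  # illustrative step size
    P = [1.0 / m] * m
    P_sum = [0.0] * m
    Q_count = [0] * n
    for _ in range(T):
        # Max best-responds with the pure column maximizing Mindy's loss
        j = max(range(n), key=lambda c: sum(P[i] * M[i][c] for i in range(m)))
        Q_count[j] += 1
        for i in range(m):
            P_sum[i] += P[i]
        # MW update against that column
        w = [P[i] * math.exp(-eta * M[i][j]) for i in range(m)]
        Z = sum(w)
        P = [x / Z for x in w]
    return [x / T for x in P_sum], [c / T for c in Q_count]

# Rock-Paper-Scissors (value 1/2): the averaged strategies are
# approximate minmax / maxmin strategies
M = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
P_bar, Q_bar = solve_game(M)
```

Note that Q_bar has at most T non-zero entries regardless of how many columns the game has, which is what makes the sparsity claim on this slide useful for huge games.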
Summary of Game Playing
- presented MW algorithm for repeatedly playing matrix games
- plays almost as well as best fixed strategy in hindsight
- very efficient: independent of # columns
- proved minmax theorem
- MW can also be used to approximately solve a game
Application to Boosting [with Freund]
Example: Spam Filtering
problem: filter out spam (junk email)
gather large collection of examples of spam and non-spam:
    From: yoav@ucsd.edu  "Rob, can you review a paper..."  → non-spam
    From: xa412@hotmail.com  "Earn money without working!!!!..."  → spam
goal: have computer learn from examples to distinguish spam from non-spam
main observation:
- easy to find "rules of thumb" that are often correct
  e.g.: "If 'viagra' occurs in message, then predict spam"
- hard to find a single rule that is very highly accurate
The Boosting Approach
given:
- training examples
- access to weak learning algorithm for finding weak classifiers (rules of thumb)
repeat:
- choose distribution on training examples (intuition: put most weight on "hardest" examples)
- train weak learner on distribution to get weak classifier
combine weak classifiers into final classifier using (weighted) majority vote
weak learning assumption: weak classifiers slightly but consistently better than random
want: final classifier to be very accurate
Boosting as a Game
- Mindy (row player) = booster
- Max (column player) = weak learner
matrix M:
- row = training example
- column = weak classifier
- M(i,j) = 1 if j-th weak classifier correct on i-th training example, 0 else
- encodes which weak classifiers are correct on which examples
- huge # of columns: one for every possible weak classifier
Boosting and the Minmax Theorem
γ-weak learning assumption: for every distribution on examples, can find weak classifier with weighted error <= 1/2 - γ
equivalent to: (value of game M) >= 1/2 + γ
by minmax theorem, implies that:
- there exists a weighted majority classifier that correctly classifies all training examples
- further, the weights are given by the maxmin strategy of game M
Idea for Boosting
- maxmin strategy of M has perfect (training) accuracy
- find approximately using earlier algorithm for solving a game
- i.e., apply MW to M
- yields (variant of) AdaBoost
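A minimal AdaBoost sketch illustrates the loop described above. The dataset and the exhaustive threshold-stump "weak learner" are invented for illustration; the labeling +,+,-,-,+,+ is chosen so that no single stump is perfect, yet a weighted majority of stumps is:

```python
import math

def stumps(xs):
    """Weak classifiers: all 1-D threshold rules, in both orientations."""
    hs = []
    for th in xs:
        hs.append(lambda x, t=th: 1 if x >= t else -1)
        hs.append(lambda x, t=th: -1 if x >= t else 1)
    return hs

def adaboost(xs, ys, T=5):
    """Boosting loop: reweight hard examples, combine stumps by weighted vote."""
    n = len(xs)
    D = [1.0 / n] * n   # distribution over training examples
    ensemble = []       # (vote weight, weak classifier) pairs
    for _ in range(T):
        # weak learner: stump with smallest weighted error under D
        def werr(h):
            return sum(d for d, x, y in zip(D, xs, ys) if h(x) != y)
        h = min(stumps(xs), key=werr)
        err = min(max(werr(h), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # increase weight of misclassified examples, then renormalize
        D = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [d / Z for d in D]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# labels +,+,-,-,+,+ : not separable by any single stump,
# but a weighted majority of stumps fits the data perfectly
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
final = adaboost(xs, ys)
```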
AdaBoost and Game Theory
- AdaBoost can be derived as special case of general game-playing algorithm (MW)
- idea of boosting intimately tied to minmax theorem
- game-theoretic interpretation can further be used to analyze:
  - AdaBoost's convergence properties
  - AdaBoost as a large-margin classifier (margin = measure of confidence, useful for analyzing generalization accuracy)
Application: Detecting Faces [Viola & Jones]
- problem: find faces in photograph or movie
- weak classifiers: detect light/dark rectangles in image
- many clever tricks to make extremely fast and accurate
Application to On-line Learning [with Freund]
Example: Predicting Stock Market
want to predict daily if stock market will go up or down
every morning:
- listen to predictions of other forecasters (based on market conditions)
- formulate own prediction
every evening: find out if market actually went up or down
want: own predictions to be almost as good as best forecaster's
On-line Learning
given access to pool of "predictors"
- could be: people, simple fixed functions, other learning algorithms, etc.
- for our purposes, think of as fixed functions of observable context
on each round:
- get predictions of all predictors, given current context
- make own prediction
- find out correct outcome
want: # mistakes close to that of best predictor
- "training" and "testing" occur simultaneously
- no statistical assumptions; examples chosen by adversary
On-line Learning as a Game
- Mindy (row player) = learner
- Max (column player) = adversary
matrix M:
- row = predictor
- column = context
- M(i,j) = 1 if i-th predictor wrong on j-th context, 0 else
actually the same game as in boosting, but with the roles of the players reversed
Applying MW
can apply MW to game M → "weighted majority algorithm" [Littlestone & Warmuth]
MW analysis gives:
    (# mistakes of MW)  <=  (# mistakes of best predictor) + [small amount]
regret only logarithmic in # predictors
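A sketch of the resulting algorithm: keep one weight per predictor, predict with the weighted vote, and multiplicatively downweight each predictor that errs. The expert pool and η in the toy run are invented for illustration:

```python
import math

def weighted_majority(all_preds, outcomes, eta=0.5):
    """Weighted-majority sketch: returns # mistakes of the combined vote."""
    n = len(all_preds[0])   # number of predictors
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(all_preds, outcomes):
        # predict 1 iff the weighted vote for 1 is at least half the total
        vote_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if vote_for_1 >= sum(w) / 2 else 0
        mistakes += (guess != y)
        # downweight every predictor that was wrong this round
        w = [wi * math.exp(-eta) if p != y else wi for wi, p in zip(w, preds)]
    return mistakes

# three predictors: always-1, always-0, and one that happens to be right
outcomes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
all_preds = [[1, 0, y] for y in outcomes]
mistakes = weighted_majority(all_preds, outcomes)
```

Because the perfect predictor's weight never decays, the combined vote tracks it almost immediately, matching the regret bound on this slide.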
Example: "Mind-reading" Game [with Freund & Doshi]
can use to play penny-matching:
- human and computer each choose a bit
- if match, computer wins; else human wins
random play wins with 50% probability, but very hard for humans to play randomly
- can do better if can predict what opponent will do
algorithm: MW with many fixed predictors
- each prediction based on recent history of plays
- e.g.: if human played 0 on last two rounds, then predict next play will be 1; else predict next play will be 0
play at: seed.ucsd.edu/~mindreader
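One of the fixed context predictors might look like the sketch below (a hypothetical example; the talk's actual predictor pool is a family of fixed history-based rules): guess the bit that most often followed the opponent's last k plays:

```python
from collections import defaultdict

def predict_next(history, k=2):
    """Hypothetical history-based predictor: return the bit that most
    often followed the last k bits of play so far (defaulting to 0)."""
    if len(history) < k:
        return 0
    counts = defaultdict(lambda: [0, 0])
    for i in range(len(history) - k):
        ctx = tuple(history[i:i + k])
        counts[ctx][history[i + k]] += 1
    c0, c1 = counts[tuple(history[-k:])]
    return 1 if c1 > c0 else 0

# a human who falls into the rhythmic pattern 0,0,1,0,0,1,...
# becomes predictable almost immediately
history = [0, 0, 1] * 5
guess = predict_next(history)
```

MW would then run over many such predictors (different k, different rules) and quickly concentrate weight on whichever one matches the human's habits.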
Histogram of Scores
[histogram: number of games vs. score; negative scores = computer wins, positive = human wins]
- score = (rounds won by human) - (rounds won by computer)
- based on 11,882 games
- computer won 86.6% of games
- average score = -41.0
Slower Is Better (for People)
[plot: average game duration (seconds) vs. score]
- the faster people play, the lower their score
- when playing fast, people probably fall into rhythmic patterns, easy for the computer to detect/predict
Application to Apprenticeship Learning [with Syed]
Learning How to Behave in Complex Environments
example: learning to drive in a toy world
- states: positions of cars
- actions: steer left/right; speed up; slow down
- reward: for going fast without crashing or going off road
in general, called a Markov decision process (MDP)
goal: find policy that maximizes expected long-term reward
- policy = mapping from states to actions
Unknown Reward Function [Abbeel & Ng]
- when real-valued reward function is known, optimal policy can be found using well-established techniques
- however, often reward function not known
- in driving example, may know reward "features":
  - faster is better than slower
  - collisions are bad
  - going off road is bad
  but may not know how to combine these into a single real-valued function
- we assume reward is an unknown convex combination of known real-valued features
Apprenticeship Learning
- also assume get to observe behavior of "expert"
- goal: imitate expert; improve on expert, if possible
Game-theoretic Approach
want to find policy such that improvement over expert is as large as possible, for all possible reward functions
can formulate as game:
- rows = features
- columns = policies (note: extremely large number of columns)
optimal (randomized) policy, under our criterion, is exactly maxmin of game
can solve (approximately) using MW:
- to pick best column on each round, use methods for standard MDPs with known reward
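A stylized sketch of this game (all numbers invented; a real instance would compute each policy's feature expectations by solving an MDP, as the slide notes): rows are reward features, columns are candidate policies, and entries are each policy's advantage over the expert on each feature. MW plays the features, the column player best-responds, and the averaged columns give a randomized policy that approximately improves on the expert under every allowed reward:

```python
import math

# hypothetical feature expectations for three candidate policies and the expert
policies = [[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]]
expert = [0.5, 0.4]

# raw[i][j] = advantage of policy j over the expert on feature i;
# shifted into [0, 1] for the MW update
raw = [[policies[j][i] - expert[i] for j in range(3)] for i in range(2)]
M = [[x + 0.5 for x in row] for row in raw]

def maxmin_mix(M, T=2000):
    """MW over rows (reward features, the minimizing side); each round the
    column player best-responds with the policy of largest weighted
    advantage.  Returns the frequency of each column in the play."""
    m, n = len(M), len(M[0])
    eta = math.sqrt(math.log(m) / T)
    P = [1.0 / m] * m
    count = [0] * n
    for _ in range(T):
        j = max(range(n), key=lambda c: sum(P[i] * M[i][c] for i in range(m)))
        count[j] += 1
        w = [P[i] * math.exp(-eta * M[i][j]) for i in range(m)]
        Z = sum(w)
        P = [x / Z for x in w]
    return [c / T for c in count]

mix = maxmin_mix(M)
# worst-case advantage of the mix over the expert, across all features
worst = min(sum(mix[j] * raw[i][j] for j in range(3)) for i in range(2))
```

For these toy numbers the maxmin mix blends the first two policies so that its advantage over the expert is positive no matter which feature the adversarial reward emphasizes.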
Conclusions
- presented MW algorithm for playing repeated games
  - used to prove minmax theorem
  - can use to approximately solve games efficiently
- many diverse applications and insights:
  - boosting: key concepts and algorithms are all game-theoretic
  - on-line learning: in essence, the same problem as boosting
  - apprenticeship learning: new and richer problem, but solvable as a game
References
- Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Ninth Annual Conference on Computational Learning Theory, 1996.
- Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
- Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.
- Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Coming soon (hopefully): Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2010(?).