Game Theory, On-line prediction and Boosting (Freund, Schapire)

Idan Attias, 1/4/2018

1 INTRODUCTION

The purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting, seemingly unrelated topics.

Paper outline:
- Review of game theory.
- An algorithm for learning to play repeated games (based on the on-line prediction methods of Littlestone and Warmuth (LW)).
- A new, simple proof of von Neumann's minimax theorem (stemming from the analysis of the algorithm above).
- A method for approximately solving a game (also stemming from that analysis).
- An on-line prediction model (obtained by applying the game-playing algorithm to an appropriate choice of game).
- A boosting algorithm (obtained by applying the same algorithm to the dual of this game).

2 GAME THEORY

Example: The loss matrix of Rock, Paper, Scissors is:

          R     P     S
    R    1/2    1     0
    P     0    1/2    1
    S     1     0    1/2

Game setting: We study two-person games in normal form. That is, each game is defined by a matrix M. There are two players, called the row player and the column player. To play the game, the row player chooses a row i and simultaneously the column player chooses a column j.

Definition: A pure strategy is a deterministic choice of a single row (column) by a player.

Definition: The loss suffered by the row player is M(i, j).

Players' goals: The row player's goal is to minimize its loss; the goal of the column player is to maximize this loss (zero-sum game). However, the results apply even when no assumptions are made about the goal or strategy of the column player.

Assumptions:
1. All the losses are in [0,1]; simple scaling can be used to get more general results.
2. The game matrix is finite (that is, the number of choices available to each player is finite). Most of the results translate, under very mild additional assumptions, to infinite matrix games.

2.1 RANDOMIZED PLAY

Game setting: The row player chooses a distribution P over the rows of M and simultaneously the column player chooses a distribution Q over the columns.

Definition: A mixed strategy is a choice of a distribution over the matrix rows (columns) by a player.

Definition: The expected loss of the row player (the loss from now on) is computed as

    sum_{i,j} P(i) M(i,j) Q(j) = P^T M Q := M(P, Q)

If the row player chooses a mixed strategy P but the column player chooses a pure strategy, a single column j, the loss is

    sum_i P(i) M(i, j) := M(P, j)

The notation M(i, Q) is defined analogously.
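To make the notation concrete, here is a minimal numeric check of M(P, Q) = P^T M Q on the Rock, Paper, Scissors matrix above; the particular strategies P and Q (and the use of NumPy) are our own illustration, not part of the paper.

```python
import numpy as np

# Rock, Paper, Scissors loss matrix (rows and columns ordered R, P, S).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1/3, 1/3, 1/3])  # row player's mixed strategy (uniform)
Q = np.array([0.5, 0.5, 0.0])  # column player's mixed strategy

print(P @ M @ Q)    # expected loss M(P, Q) = P^T M Q -> 0.5
print(P @ M[:, 0])  # loss against the pure column j = R: M(P, j) -> 0.5
```

Against the uniform P every column yields loss exactly 1/2, which is the value of this game in the sense defined in the next subsections.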

2.2 SEQUENTIAL PLAY

Game setting: Suppose now that, instead of choosing strategies simultaneously, play is sequential: the column player chooses its strategy Q after the row player has chosen and announced its strategy P.

Players' goals: The column player's goal is to maximize the row player's loss (zero-sum game). Given P, such a worst-case/adversarial column player will choose Q to maximize M(P, Q), so the row player's loss will be max_Q M(P, Q). Knowing this, the row player should choose P to minimize this quantity, and its loss will be min_P max_Q M(P, Q). Notice that the row player plays first here. If the column player plays first and the row player can choose its play with the benefit of knowing the strategy Q, the row player's loss will be max_Q min_P M(P, Q).

2.3 THE MINMAX THEOREM

Definition: A minmax strategy is a mixed strategy P* realizing the minimum min_P max_Q M(P, Q). A maxmin strategy is a mixed strategy Q* realizing the maximum max_Q min_P M(P, Q). (Footnote 1: Explain why these extrema exist.)

Theorem: max_Q min_P M(P, Q) = min_P max_Q M(P, Q) := v.

Definition: v is called the value of the game M.

Explanations: We expect the player who chooses its strategy last to have the advantage, since it plays knowing its opponent's strategy; thus max_Q min_P M(P, Q) <= min_P max_Q M(P, Q). The theorem, however, states that playing last gives no advantage. The row player has a (minmax) strategy P* such that, regardless of the strategy Q played by the column player, the loss suffered M(P*, Q) will be at most v. Symmetrically, the column player has a (maxmin) strategy Q* such that, regardless of the strategy P played by the row player, the loss M(P, Q*) will be at least v. This means that the strategies P* and Q* are optimal in a strong sense.

Playing by classical game theory: Given a zero-sum game M, one should play using a minmax strategy (which can be computed by linear programming).

Problems with this approach:

- The column player may not be truly adversarial and may behave in a manner that admits loss significantly smaller than the game value v.
- M may be unknown.
- M may be too large to work with computationally.

2.4 REPEATED PLAY

Motivation: We try to overcome these difficulties by playing the game repeatedly (the one-shot game is hopeless). Our goal is to learn how to play well against a particular opponent (which is not necessarily adversarial).

Game setting: We refer to the row player as the learner and the column player as the environment. Let M be a matrix, possibly unknown to the learner. The game is played repeatedly in a sequence of rounds. On round t = 1,...,T:
1. The learner chooses a mixed strategy P_t.
2. The environment chooses a mixed strategy Q_t (which may be chosen with knowledge of P_t).
3. The learner is permitted to observe the loss M(i, Q_t) for each row i.
4. The learner suffers loss M(P_t, Q_t).

Learner's goal: To suffer cumulative loss sum_{t=1}^T M(P_t, Q_t) which is not much worse than min_P sum_{t=1}^T M(P, Q_t), the loss of the best fixed strategy in hindsight against the actual sequence of plays Q_1,...,Q_T.

Algorithm LW(β)

Parameter: β in [0,1).
Initialize: all weights are set to unity, w_1(i) = 1 for 1 <= i <= n. [w_t(i) := weight at time t on row i]
Do for t = 1,...,T:
1. Compute the mixed strategy: P_t(i) = w_t(i) / sum_i w_t(i), for 1 <= i <= n.
2. Update the weights: w_{t+1}(i) = w_t(i) β^{M(i, Q_t)}. (Sanity check: why do these updates make sense?)
Output: P_1,...,P_T.

Theorem 1: For any matrix M with n rows and entries in [0,1], and for any sequence of mixed strategies Q_1,...,Q_T played by the environment, the sequence of mixed strategies P_1,...,P_T produced by algorithm LW with parameter β in [0,1) satisfies

    sum_{t=1}^T M(P_t, Q_t) <= a_β min_P sum_{t=1}^T M(P, Q_t) + c_β ln(n)

where a_β = ln(1/β) / (1-β) and c_β = 1 / (1-β).
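Before the proof, here is a sketch of LW(β) in code; the `environment` callback, which returns Q_t possibly as a function of P_t, is our own framing of the protocol above.

```python
import numpy as np

def lw(M, T, beta, environment):
    """Algorithm LW(beta) on an n x m game matrix M with entries in [0, 1].
    `environment` maps the learner's mixed strategy P_t to the column
    player's mixed strategy Q_t (and may be adversarial)."""
    n = M.shape[0]
    w = np.ones(n)               # w_1(i) = 1 for every row i
    Ps, total_loss = [], 0.0
    for t in range(T):
        P = w / w.sum()          # step 1: P_t(i) = w_t(i) / sum_i w_t(i)
        Q = environment(P)
        total_loss += P @ M @ Q  # learner suffers loss M(P_t, Q_t)
        w = w * beta ** (M @ Q)  # step 2: w_{t+1}(i) = w_t(i) * beta^{M(i, Q_t)}
        Ps.append(P)
    return Ps, total_loss
```

As to the sanity check: a row that suffered a large loss against Q_t has its weight multiplied by a factor close to β, so future probability mass shifts toward rows that have fared well so far.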

Proof: For t = 1,...,T we have

    sum_{i=1}^n w_{t+1}(i) = sum_{i=1}^n w_t(i) β^{M(i, Q_t)}        [definition of the update]
                          <= sum_{i=1}^n w_t(i) (1 - (1-β) M(i, Q_t))
                           = (sum_{i=1}^n w_t(i)) (1 - (1-β) M(P_t, Q_t))

The inequality follows from β^x <= 1 - (1-β)x for β > 0 and x in [0,1]. (Footnote 2: Prove it. Hint: use a convexity argument (recall the definition).) The last equality follows from

    sum_i w_t(i) (1 - (1-β) M(i, Q_t)) = (sum_i w_t(i)) sum_i P_t(i) (1 - (1-β) M(i, Q_t))
                                       = (sum_i w_t(i)) (1 - (1-β) sum_i P_t(i) M(i, Q_t))
                                       = (sum_i w_t(i)) (1 - (1-β) M(P_t, Q_t))

[Recall: sum_i P_t(i) = 1 and P_t(i) = w_t(i) / sum_i w_t(i).]

Unwrapping this recurrence gives [base case: w_1(i) = 1, so sum_{i=1}^n w_1(i) = n]:

    sum_{i=1}^n w_{T+1}(i) <= n prod_{t=1}^T (1 - (1-β) M(P_t, Q_t))        (*)

Note that for any 1 <= j <= n:

    w_{T+1}(j) = β^{sum_{t=1}^T M(j, Q_t)}        (**)

Combining (*) and (**) and taking logs gives

    ln(β) sum_{t=1}^T M(j, Q_t) <= ln(n) + sum_{t=1}^T ln(1 - (1-β) M(P_t, Q_t))
                                <= ln(n) - (1-β) sum_{t=1}^T M(P_t, Q_t)

where the last step uses ln(1-x) <= -x for x < 1. Rearranging terms, and noticing that this holds for any j (so taking the min over j gives the tightest bound), we get

    sum_{t=1}^T M(P_t, Q_t) <= [ ln(1/β) min_j sum_{t=1}^T M(j, Q_t) + ln(n) ] / (1-β)

Since the minimum (over mixed strategies P) in the bound of the theorem must be achieved by a pure strategy j, this implies the theorem.

Remarks:
- lim_{β -> 1} a_β = 1.
- For fixed β, as the number of rounds T becomes large, c_β ln(n) becomes negligible relative to T. Thus, by choosing β close to 1, the learner ensures that its loss will not be much worse than the loss of the best strategy (formalized in the following corollary).

Corollary 2: Under the conditions of Theorem 1, and with β = 1 / (1 + sqrt(2 ln(n) / T)), the average per-trial loss suffered by the learner is

    (1/T) sum_{t=1}^T M(P_t, Q_t) <= min_P (1/T) sum_{t=1}^T M(P, Q_t) + Δ_T

where Δ_T = sqrt(2 ln(n) / T) + ln(n)/T = O( sqrt(ln(n)/T) ).
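For a sense of the rate in Corollary 2, a quick computation (the choice of n and T here is an illustration of our own):

```python
import numpy as np

n, T = 100, 10_000
delta_T = np.sqrt(2 * np.log(n) / T) + np.log(n) / T
print(delta_T)  # ~0.031: per-trial overhead over the best fixed strategy
```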

Corollary 3: Under the conditions of Corollary 2,

    (1/T) sum_{t=1}^T M(P_t, Q_t) <= v + Δ_T

Proof: Let P* be a minmax strategy for M, so that for every column strategy Q: M(P*, Q) <= v. Using Corollary 2:

    (1/T) sum_{t=1}^T M(P_t, Q_t) <= min_P (1/T) sum_{t=1}^T M(P, Q_t) + Δ_T <= (1/T) sum_{t=1}^T M(P*, Q_t) + Δ_T <= v + Δ_T

Remarks:
- Theorem 1 guarantees that the cumulative loss is not much larger than that of any fixed mixed strategy, and in particular not much larger than the game value (Corollary 3).
- If the environment is non-adversarial, there might be a better fixed mixed strategy for the player, in which case the algorithm is guaranteed to be almost as good as that better strategy.

2.5 PROOF OF THE MINMAX THEOREM

We prove min_P max_Q M(P, Q) <= max_Q min_P M(P, Q); the reverse inequality was explained in section 2.3.

Suppose that we run algorithm LW (producing P_1,...,P_T) against the maximally adversarial environment: on each round t the environment chooses Q_t = argmax_Q M(P_t, Q). Denote P̄ = (1/T) sum_{t=1}^T P_t and Q̄ = (1/T) sum_{t=1}^T Q_t (both are probability distributions). We have:

    min_P max_Q M(P, Q) <= max_Q M(P̄, Q)
                         = max_Q (1/T) sum_{t=1}^T M(P_t, Q)         [definition of P̄]
                        <= (1/T) sum_{t=1}^T max_Q M(P_t, Q)
                         = (1/T) sum_{t=1}^T M(P_t, Q_t)             [definition of Q_t]
                        <= min_P (1/T) sum_{t=1}^T M(P, Q_t) + Δ_T   [Corollary 2]
                         = min_P M(P, Q̄) + Δ_T                       [definition of Q̄]
                        <= max_Q min_P M(P, Q) + Δ_T

Δ_T can be made arbitrarily close to zero, which completes the proof.

2.6 APPROXIMATELY SOLVING A GAME

The algorithm LW can be used to find an approximate minmax or maxmin strategy.

Definition: An approximate minmax strategy P satisfies max_Q M(P, Q) <= v + Δ (where Δ can be made arbitrarily small). An approximate maxmin strategy is defined analogously.

Claim: P̄ = (1/T) sum_{t=1}^T P_t is an approximate minmax strategy (with P_t produced by LW). (Footnote 3: Give a full proof. Follow proof 2.5 exactly.)

Proof: From proof 2.5 we get max_Q M(P̄, Q) <= max_Q min_P M(P, Q) + Δ_T = v + Δ_T.

Claim: Q̄ = (1/T) sum_{t=1}^T Q_t, where Q_t = argmax_Q M(P_t, Q), is an approximate maxmin strategy; from proof 2.5, min_P M(P, Q̄) >= v - Δ_T.

Q_t can always be chosen to be a pure strategy. (Footnote 4: Prove it. Use directly the definition of the loss M(P_t, Q).) This is important for our derivation of a boosting algorithm in section 4.
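Section 2.6 translates directly into a small solver: run LW against the best-pure-response environment and average both players' strategies. A sketch (β set as in Corollary 2; the Rock, Paper, Scissors matrix is reused for illustration):

```python
import numpy as np

def approx_solve(M, T):
    """Approximate minmax/maxmin strategies for a game M with entries in
    [0, 1]: LW against a maximally adversarial (best pure response) column."""
    n, m = M.shape
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / T))  # as in Corollary 2
    w = np.ones(n)
    P_bar, Q_bar = np.zeros(n), np.zeros(m)
    for t in range(T):
        P = w / w.sum()
        j = int(np.argmax(P @ M))  # Q_t: pure best response, argmax_j M(P_t, j)
        P_bar += P / T             # P_bar = (1/T) sum_t P_t
        Q_bar[j] += 1.0 / T        # Q_bar = (1/T) sum_t Q_t
        w = w * beta ** M[:, j]
    return P_bar, Q_bar

M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P_bar, Q_bar = approx_solve(M, 20000)
print(P_bar, (P_bar @ M).max())  # max loss close to the value v = 1/2
```

Here max_j M(P̄, j) approaches 1/2 from above as T grows, consistent with the claim max_Q M(P̄, Q) <= v + Δ_T.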

3 ON-LINE PREDICTION

On-line prediction setting: In the on-line prediction model, the learner observes a sequence of examples and predicts their labels one at a time. The learner's goal is to minimize its prediction errors.

Formal definition: Let X be a finite set of instances, and let H be a finite set of hypotheses h: X -> {0,1} (there are generalizations to infinite cases). Let c: X -> {0,1} be an unknown target concept, not necessarily in H (how to choose a good H? bias-variance trade-off). Learning takes place in a sequence of rounds. On round t = 1,...,T:
1. The learner observes an example x_t in X.
2. The learner makes a randomized prediction ŷ_t in {0,1} of the label associated with x_t.
3. The learner observes the correct label c(x_t).

The goal of the learner: To minimize the expected number (with respect to its own randomization) of mistakes that it makes, relative to the best hypothesis in the space H.

Definition: The mistake matrix M (the game matrix in this case), with rows indexed by H and columns indexed by X, is defined as follows:

    M(h, x) = 1 if h(x) != c(x), and 0 otherwise.

Reduction to the repeated game problem: The environment's choice of a column corresponds to the choice of an instance x that is presented to the learner in a given iteration (a pure strategy). The learner's choice of a distribution over the matrix rows (hypotheses) corresponds to making a random choice of a hypothesis with which to predict (a mixed strategy).

Applying the LW algorithm: On round t, we:
- have a distribution P_t over H;
- given the instance x_t (a pure strategy), randomly select h_t in H according to P_t and predict ŷ_t = h_t(x_t);
- given c(x_t), compute M(h, x_t) for every h in H and update the weights by LW.

Analysis:

    M(P_t, x_t) = sum_{h in H} P_t(h) M(h, x_t) = Pr_{h ~ P_t}[h(x_t) != c(x_t)]

Therefore, the expected number of mistakes made by the learner satisfies (by Corollary 2)

    sum_{t=1}^T M(P_t, x_t) <= min_{h in H} sum_{t=1}^T M(h, x_t) + O( sqrt(T ln|H|) )

Remarks:
- An analysis using Theorem 1 gives a better bound.
- The result can be generalized to any bounded loss function and also to more general settings.
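The reduction can be sketched directly in code; `H` as a list of hypothesis functions and `c` as a callable target are our own minimal framing, assumed available so that the loss column M(·, x_t) can be computed each round.

```python
import numpy as np

def online_predict(H, c, xs, beta, seed=0):
    """On-line prediction by running LW over the mistake matrix:
    rows are the hypotheses in H; the environment's pure strategy
    on round t is the instance x_t."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(H))
    mistakes = 0
    for x in xs:
        P = w / w.sum()
        h_t = H[rng.choice(len(H), p=P)]  # draw h_t ~ P_t
        mistakes += int(h_t(x) != c(x))   # randomized prediction y_t = h_t(x_t)
        loss = np.array([float(h(x) != c(x)) for h in H])  # column M(., x_t)
        w = w * beta ** loss              # LW weight update
    return mistakes
```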

4 BOOSTING

Weak learning definition: For γ > 0, we say that an algorithm WL is a γ-weak learning algorithm for (H, c) if, for any distribution P over X, the algorithm takes as input a set of labeled examples distributed according to P and outputs a hypothesis h in H with error slightly better than random guessing:

    Pr_{x ~ P}[h(x) != c(x)] <= 1/2 - γ

Boosting definition: Boosting is the problem of converting a weak learning algorithm into one that performs with good accuracy. The goal is to run the weak algorithm many times on many distributions and to combine the selected hypotheses into a final hypothesis with small error rate. The main issues are how to choose the distributions D_t and how to combine the hypotheses.

The boosting process: Boosting proceeds in rounds. On round t = 1,...,T:
1. The booster constructs a distribution D_t on X, which is passed to the weak learner.
2. The weak learner produces a hypothesis h_t in H with error Pr_{x ~ D_t}[h_t(x) != c(x)] <= 1/2 - γ.
3. After T rounds, the weak hypotheses h_1,...,h_T are combined into a final hypothesis h_fin.

Assumptions:
- Probability of success: We assume that the weak learner always succeeds, so the boosting algorithm succeeds with absolute certainty. (Usually we only assume that weak learning succeeds with high probability, in which case the boosting algorithm succeeds with probability > 1 - δ [PAC].)
- Final hypothesis error: We have full access to the labels associated with the entire domain X. Thus, we require that the final hypothesis have error zero, so that all instances are correctly classified. The algorithm can be modified to fit the more standard (and practical) model in which the final error need only be less than some positive parameter ɛ. Given a labeled training set, all distributions would be computed over the training set; the generalization error of the final hypothesis can then be bounded using, for instance, standard VC theory.

4.1 BOOSTING AND THE MINMAX THEOREM

Motivation: This part gives the intuition behind the boosting algorithm of the next section.

Relationship between the mistake matrix M (section 3) and the minmax theorem:

    min_P max_x M(P, x) = min_P max_Q M(P, Q) = v = max_Q min_P M(P, Q) = max_Q min_h M(h, Q)        (*)

Last equality: for any Q, the minimum min_P M(P, Q) is realized at a pure strategy h. First equality: for any P, the maximum max_Q M(P, Q) is realized at a pure strategy x. Note that M(h, Q) = Pr_{x ~ Q}[h(x) != c(x)].

- There exists a distribution Q* (a maxmin strategy) on X such that for every h: M(h, Q*) = Pr_{x ~ Q*}[h(x) != c(x)] >= v. From weak learnability, there exists h such that Pr_{x ~ Q*}[h(x) != c(x)] <= 1/2 - γ. Hence v <= 1/2 - γ.
- There exists a distribution P* (a minmax strategy) over H such that for every x: M(P*, x) = Pr_{h ~ P*}[h(x) != c(x)] <= v <= 1/2 - γ < 1/2. That is, every x is misclassified by less than half of the hypotheses (as weighted by the minmax strategy P*). Therefore, the target concept c is equivalent to a weighted majority of hypotheses in H.

Corollary: If (H, c) is γ-weakly learnable, then c can be computed exactly as a weighted majority of hypotheses in H. The weights, given by a distribution P* on the rows (hypotheses) of the game M, are a minmax strategy for this game.

4.2 IDEA FOR THE BOOSTING ALGORITHM

The idea: By the corollary above, we can approximate c by approximating the weights of the minmax strategy (recall subsection 2.6).

Adapting LW to the boosting model: The LW algorithm does not fit the boosting model directly. Recall that on each round, algorithm LW computes a distribution over the rows of the game matrix (hypotheses), whereas in the boosting model we want to compute on each round a distribution over instances (columns of M). Rather than using the game M directly, we construct the dual M' of M, which is the identical game except that the roles of the row and column players are reversed.

Constructing the dual matrix: We modify our game matrix M (rows: hypotheses, columns: instances) to fit the boosting model, constructing the dual game matrix M' as follows:
1. LW computes a distribution over rows (hypotheses), while in the boosting model we want a distribution over the instances (columns of M) on each round. We need to exchange rows and columns, so we take the transpose M^T.
2. The column player of M wants to maximize the loss, but the row player of M' wants to minimize it. To reverse the meaning of minimum and maximum, we negate: -M^T.
3. By our convention, losses lie in [0,1]; for that reason we take 1 - M^T.

Definition: The dual matrix is defined by

    M'(x, h) = 1 - M(h, x) = 1 if h(x) = c(x), and 0 otherwise.

Remark: Any minmax strategy of the game M becomes a maxmin strategy of the game M'. Therefore, whereas before we were interested in finding an approximate minmax strategy of M, we are now interested in finding an approximate maxmin strategy of M'.

Applying algorithm LW to M': The reduction proceeds as follows. On round t of boosting:
1. LW computes a distribution P_t over the rows of M' (i.e., over X).
2. The boosting algorithm sets D_t = P_t and passes D_t to the weak learning algorithm.
3. The weak learning algorithm returns h_t with Pr_{x ~ D_t}[h_t(x) = c(x)] >= 1/2 + γ.
4. The weights maintained by LW are updated with Q_t defined to be the pure strategy h_t; that is, w_{t+1}(i) = w_t(i) β^{M'(i, h_t)} for every i.
5. Return the hypotheses h_1,...,h_T.
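In code the dual construction is a one-liner; the tiny 0/1 mistake matrix below is an illustrative stand-in of our own (rows: hypotheses, columns: instances, as in section 3).

```python
import numpy as np

# Toy mistake matrix: 2 hypotheses (rows) x 3 instances (columns);
# M[h, x] = 1 iff hypothesis h errs on instance x (values illustrative).
M = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])

# Dual game: transpose (swap the players' roles), negate (swap min and max),
# and add 1 (shift losses back into [0, 1]).
M_dual = 1.0 - M.T   # M_dual[x, h] = 1 iff h(x) = c(x)
print(M_dual)
```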

Constructing an approximate maxmin strategy: According to the method of approximately solving a game (2.6), on each round t, Q_t may be a pure strategy h_t, which should then be chosen to maximize

    M'(P_t, h_t) = sum_x P_t(x) M'(x, h_t) = Pr_{x ~ P_t}[h_t(x) = c(x)]

That is, h_t should have maximum accuracy with respect to the distribution P_t. For that, we use the weak learner. Although it is not guaranteed to find the best h_t, finding one of accuracy 1/2 + γ turns out to be sufficient for our purposes. Finally, this method suggests that Q̄ = (1/T) sum_{t=1}^T Q_t is an approximate maxmin strategy, and we showed (by 4.1) that the target c is equivalent to a majority of the hypotheses when they are weighted by a maxmin strategy of M'. Each Q_t is a pure strategy (hypothesis) h_t, which leads us to choose a simple majority of h_1,...,h_T:

    h_fin = majority(h_1,...,h_T)

4.3 ANALYSIS

Our boosting procedure computes h_fin identical to c (for sufficiently large T). For all t:

    M'(P_t, h_t) = Pr_{x ~ P_t}[h_t(x) = c(x)] >= 1/2 + γ

By Corollary 2 we have

    1/2 + γ <= (1/T) sum_{t=1}^T M'(P_t, h_t) <= min_x (1/T) sum_{t=1}^T M'(x, h_t) + Δ_T

Rearranging terms, for all x:

    (1/T) sum_{t=1}^T M'(x, h_t) >= 1/2 + γ - Δ_T > 1/2 for large T (so that Δ_T < γ)

where Δ_T = O( sqrt(ln|X| / T) ); we choose T = Ω( ln|X| / γ² ) to make Δ_T < γ. By definition of M', sum_{t=1}^T M'(x, h_t) is exactly the number of hypotheses h_t that agree with c on x, and this number is more than T/2. So, by definition of h_fin, we get h_fin(x) = c(x) for all x.

Algorithm 2: Boosting

Input: instance space X, target function c, and a γ-weak learning algorithm.
1. Set T = (4/γ²) ln|X| (so that Δ_T < γ).
2. Set β = 1 / (1 + sqrt(2 ln|X| / T)).
3. Initialize D_1(x) = 1/|X| for x in X.
4. For t = 1,...,T:
   - Pass the distribution D_t to the weak learner.
   - Get back a hypothesis h_t such that Pr_{x ~ D_t}[h_t(x) != c(x)] <= 1/2 - γ.
   - Update the weights: w_{t+1}(x) = w_t(x)·β if h_t(x) = c(x), and w_t(x) otherwise, for every x.
   - Update D_t: D_{t+1}(x) = w_{t+1}(x) / sum_x w_{t+1}(x) for every x.
Output: final hypothesis h_fin = majority(h_1,...,h_T).
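A sketch of Algorithm 2 in Python; the `weak_learner` callback (which must return h_t with error at most 1/2 - γ under D_t) and the representation of hypotheses as Python functions are our own assumptions, not part of the notes.

```python
import numpy as np

def boost(X, c, weak_learner, gamma):
    """Algorithm 2: boosting by running LW on the dual game M'.
    X: list of instances; c: target function X -> {0, 1};
    weak_learner(D): returns h with Pr_{x~D}[h(x) != c(x)] <= 1/2 - gamma."""
    N = len(X)
    T = int(np.ceil(4.0 / gamma ** 2 * np.log(N)))     # step 1: Delta_T < gamma
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(N) / T))  # step 2
    w = np.ones(N)                                     # D_1 is uniform
    hs = []
    for t in range(T):
        D = w / w.sum()
        h = weak_learner(D)                            # weak hypothesis h_t
        hs.append(h)
        correct = np.array([h(x) == c(x) for x in X])
        w[correct] *= beta  # shrink weight where h_t is right, so D_{t+1}
                            # concentrates on the instances h_t got wrong
    # final hypothesis: simple (unweighted) majority vote over h_1,...,h_T
    return lambda x: int(sum(h(x) for h in hs) > len(hs) / 2)
```

The weight update multiplies by β exactly on the instances the weak hypothesis classified correctly, which is the LW update on the dual matrix M' with Q_t the pure strategy h_t.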