Game Theory, On-line Prediction and Boosting (Freund, Schapire)

Idan Attias

1/4/2018

1 INTRODUCTION

The purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting, three seemingly unrelated topics.

Paper outline:
- Review of game theory.
- An algorithm for learning to play repeated games (based on the on-line prediction methods of Littlestone and Warmuth (LW)).
- A new, simple proof of von Neumann's minimax theorem (stemming from the analysis of the algorithm above).
- A method for approximately solving a game (also stemming from that analysis).
- An on-line prediction model (obtained by applying the game-playing algorithm to an appropriate choice of game).
- A boosting algorithm (obtained by applying the same algorithm to the dual of this game).

2 GAME THEORY

Example: The loss matrix of Rock, Paper, Scissors is:

             R     P     S
      R     1/2    1     0
      P      0    1/2    1
      S      1     0    1/2
Game setting: We study two-person games in normal form. That is, each game is defined by a matrix M. There are two players, called the row player and the column player. To play the game, the row player chooses a row i and, simultaneously, the column player chooses a column j.

Definition: A pure strategy is a deterministic choice of a single row (column) by a player.

Definition: The loss suffered by the row player is M(i, j).

Players' goals: The row player's goal is to minimize its loss; the column player's goal is to maximize this loss (a zero-sum game). However, the results apply even when no assumptions are made about the goal or strategy of the column player.

Assumptions:
1. All the losses are in [0,1]; simple scaling can be used to get more general results.
2. The game matrix is finite (that is, the number of choices available to each player is finite). Most of the results translate, with very mild additional assumptions, to infinite matrix games.

2.1 RANDOMIZED PLAY

Game setting: The row player chooses a distribution P over the rows of M and, simultaneously, the column player chooses a distribution Q over the columns.

Definition: A mixed strategy is a choice of a distribution over the matrix rows (columns) by a player.

Definition: The expected loss of the row player (simply "the loss" from now on) is computed as

$$\sum_{i,j} P(i)\, M(i,j)\, Q(j) = P^\top M Q := M(P, Q)$$

If the row player chooses a mixed strategy P but the column player chooses a pure strategy, a single column j, the loss is

$$\sum_i P(i)\, M(i, j) := M(P, j)$$

The notation M(i, Q) is defined analogously.
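For concreteness, the loss M(P, Q) is just the bilinear form $P^\top M Q$. A minimal Python sketch for the Rock, Paper, Scissors matrix above (the particular mixed strategies are illustrative choices, not from the text):

```python
import numpy as np

# Rock, Paper, Scissors loss matrix from the example above (rows/columns: R, P, S).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1/3, 1/3, 1/3])    # row player's mixed strategy
Q = np.array([0.5, 0.25, 0.25])  # column player's mixed strategy

loss = P @ M @ Q                 # M(P, Q) = sum_{i,j} P(i) M(i,j) Q(j)
loss_per_column = P @ M          # vector of M(P, j) for each pure column j
print(loss, loss_per_column)
```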
2.2 SEQUENTIAL PLAY

Game setting: Suppose now that, instead of choosing strategies simultaneously, play is sequential. The column player chooses its strategy Q after the row player has chosen and announced its strategy P.

Players' goals: The column player's goal is to maximize the row player's loss (zero-sum game). Given P, such a worst-case/adversarial column player will choose Q to maximize M(P, Q). The row player's loss will thus be $\max_Q M(P, Q)$. Knowing this, the row player should choose P to minimize it, and its loss will be $\min_P \max_Q M(P, Q)$. Notice that the row player plays first here. If the column player plays first and the row player can choose its play with the benefit of knowing Q, the row player's loss will be $\max_Q \min_P M(P, Q)$.

2.3 THE MINMAX THEOREM

Definition: A minmax strategy is a mixed strategy P* realizing the minimum $\min_P \max_Q M(P, Q)$. A maxmin strategy is a mixed strategy Q* realizing the maximum $\max_Q \min_P M(P, Q)$. [Footnote 1: Explain why these extrema exist.]

Theorem (von Neumann's minimax theorem): $\max_Q \min_P M(P, Q) = \min_P \max_Q M(P, Q) := v$.

Definition: v is called the value of the game M.

Explanations: We expect the player who chooses its strategy last to have the advantage, since it plays knowing its opponent's strategy; thus $\max_Q \min_P M(P, Q) \le \min_P \max_Q M(P, Q)$. The theorem, however, states that playing last gives no advantage. The row player has a (minmax) strategy P* such that, regardless of the strategy Q played by the column player, the loss suffered M(P*, Q) will be at most v. Symmetrically, the column player has a (maxmin) strategy Q* such that, regardless of the strategy P played by the row player, the loss M(P, Q*) will be at least v. This means that the strategies P* and Q* are optimal in a strong sense.

Playing by classical game theory: Given a zero-sum game M, one should play using a minmax strategy, which can be computed by linear programming (a minimal sketch follows).
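The LP formulation is standard: minimize v subject to $(P^\top M)_j \le v$ for every column j, with P a distribution. A minimal sketch assuming SciPy's linprog (the function name and variable layout are mine, not from the notes):

```python
import numpy as np
from scipy.optimize import linprog

def minmax_strategy(M):
    """Solve min_P max_j M(P, j) as a linear program.

    Variables: (P(1), ..., P(n), v). Minimize v subject to
    sum_i P(i) M(i, j) <= v for every column j, sum_i P(i) = 1, P >= 0.
    """
    n, m = M.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                    # objective: minimize v
    A_ub = np.hstack([M.T, -np.ones((m, 1))])      # M^T P - v <= 0, one row per column j
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # sum_i P(i) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]      # P >= 0, v unconstrained
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]                    # minmax strategy P*, game value v
```

For the Rock, Paper, Scissors matrix this should return the uniform distribution with value v = 1/2.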
Problems with this approach:
- The column player may not be truly adversarial and may behave in a manner that admits loss significantly smaller than the game value v.
- M may be unknown.
- M may be too large to compute with.

2.4 REPEATED PLAY

Motivation: We try to overcome these difficulties by playing the game repeatedly (the one-shot game is hopeless). Our goal is to learn how to play well against a particular opponent (which is not necessarily adversarial).

Game setting: We refer to the row player as the learner and to the column player as the environment. Let M be a matrix, possibly unknown to the learner. The game is played repeatedly in a sequence of rounds. On round t = 1, ..., T:
1. The learner chooses a mixed strategy P_t.
2. The environment chooses a mixed strategy Q_t (which may be chosen with knowledge of P_t).
3. The learner is permitted to observe the loss M(i, Q_t) for each row i.
4. The learner suffers loss M(P_t, Q_t).

Learner's goal: To suffer cumulative loss $\sum_{t=1}^T M(P_t, Q_t)$ which is not much worse than $\min_P \sum_{t=1}^T M(P, Q_t)$, the loss of the best fixed strategy in hindsight against the actual sequence of plays Q_1, ..., Q_T.
Algorithm LW(β)

Parameter: β ∈ [0, 1)
Initialize: all weights are set to unity, $w_1(i) = 1$ for 1 ≤ i ≤ n. [$w_t(i)$ := weight at time t on row i]
Do for t = 1, ..., T:
1. Compute the mixed strategy: $P_t(i) = \frac{w_t(i)}{\sum_i w_t(i)}$ for 1 ≤ i ≤ n.
2. Update the weights: $w_{t+1}(i) = w_t(i)\, \beta^{M(i, Q_t)}$. (Sanity check: why do these updates make sense?)
Output: P_1, ..., P_T.
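A minimal Python sketch of LW(β), with the environment abstracted as the sequence of observed loss vectors $(M(i, Q_t))_i$ from step 3 of the repeated game; the interface is my choice, not from the notes:

```python
import numpy as np

def lw(loss_vectors, beta):
    """Run LW(beta) on a sequence of loss vectors.

    loss_vectors: iterable of length-n arrays; the t-th array holds
    M(i, Q_t) in [0, 1] for each row i. Returns P_1, ..., P_T.
    """
    strategies = []
    w = None
    for loss in loss_vectors:
        loss = np.asarray(loss, dtype=float)
        if w is None:
            w = np.ones_like(loss)       # w_1(i) = 1
        P = w / w.sum()                  # P_t(i) = w_t(i) / sum_i w_t(i)
        strategies.append(P)
        w = w * beta ** loss             # w_{t+1}(i) = w_t(i) * beta^{M(i, Q_t)}
    return strategies
```

The update makes intuitive sense: rows that suffered large loss against Q_t have their weight multiplied by a smaller factor, so future play shifts toward rows that have performed well so far.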
Theorem 1: For any matrix M with n rows and entries in [0,1], and for any sequence of mixed strategies Q_1, ..., Q_T played by the environment, the sequence of mixed strategies P_1, ..., P_T produced by algorithm LW with parameter β ∈ [0,1) satisfies:

$$\sum_{t=1}^T M(P_t, Q_t) \le a_\beta \min_P \sum_{t=1}^T M(P, Q_t) + c_\beta \ln(n)$$

where

$$a_\beta = \frac{\ln(1/\beta)}{1-\beta}, \qquad c_\beta = \frac{1}{1-\beta}$$

Proof: For t = 1, ..., T we have:

$$\sum_{i=1}^n w_{t+1}(i) \overset{\text{def}}{=} \sum_{i=1}^n w_t(i)\,\beta^{M(i,Q_t)} \le \sum_{i=1}^n w_t(i)\big(1 - (1-\beta)M(i,Q_t)\big) = \Big(\sum_{i=1}^n w_t(i)\Big)\big(1 - (1-\beta)M(P_t,Q_t)\big)$$

The inequality follows from $\beta^x \le 1 - (1-\beta)x$ for β > 0 and x ∈ [0,1]. [Footnote 2: Prove it. Hint: use a convexity argument (recall the definition).] The last equality follows from:

$$\sum_{i=1}^n w_t(i)\big(1 - (1-\beta)M(i,Q_t)\big) = \Big(\sum_{i=1}^n w_t(i)\Big)\sum_{i=1}^n P_t(i)\big(1 - (1-\beta)M(i,Q_t)\big)$$
$$= \Big(\sum_{i=1}^n w_t(i)\Big)\Big(1 - (1-\beta)\sum_{i=1}^n P_t(i)\,M(i,Q_t)\Big) = \Big(\sum_{i=1}^n w_t(i)\Big)\big(1 - (1-\beta)M(P_t,Q_t)\big)$$

[Recall: $\sum_i P_t(i) = 1$ and $P_t(i) = w_t(i)/\sum_i w_t(i)$.]

Unwrapping this recurrence gives [base case: $w_1(i) = 1$, so $\sum_{i=1}^n w_1(i) = n$]:

$$\sum_{i=1}^n w_{T+1}(i) \le n \prod_{t=1}^T \big(1 - (1-\beta)M(P_t,Q_t)\big) \qquad (*)$$

Note that, for any 1 ≤ j ≤ n:

$$\beta^{\sum_{t=1}^T M(j,Q_t)} = w_{T+1}(j) \le \sum_{i=1}^n w_{T+1}(i) \qquad (**)$$

Combining (*) and (**) and taking logs gives:

$$\ln(\beta)\sum_{t=1}^T M(j,Q_t) \le \ln(n) + \sum_{t=1}^T \ln\big(1 - (1-\beta)M(P_t,Q_t)\big) \le \ln(n) - (1-\beta)\sum_{t=1}^T M(P_t,Q_t)$$

where the last step uses $\ln(1-x) \le -x$ for x < 1. Rearranging terms, and noticing that this holds for any j (so taking the minimum over j gives the tightest bound), we get:

$$\sum_{t=1}^T M(P_t,Q_t) \le \frac{\ln(1/\beta)\,\min_j \sum_{t=1}^T M(j,Q_t) + \ln(n)}{1-\beta}$$

Since the minimum (over mixed strategies P) in the bound of the theorem is achieved by a pure strategy j, this implies the theorem.

Remarks:
- $\lim_{\beta \to 1} a_\beta = 1$.
- For fixed β, as the number of rounds T becomes large, $c_\beta \ln(n)$ becomes negligible relative to T. Thus, by choosing β close to 1, the learner ensures that its loss will not be much worse than the loss of the best fixed strategy (formalized in the following corollary).

Corollary 2: Under the conditions of Theorem 1 and with $\beta = \frac{1}{1 + \sqrt{2\ln(n)/T}}$, the average per-trial loss suffered by the learner is

$$\frac{1}{T}\sum_{t=1}^T M(P_t,Q_t) \le \min_P \frac{1}{T}\sum_{t=1}^T M(P,Q_t) + \Delta_T$$

where

$$\Delta_T = \sqrt{\frac{2\ln(n)}{T}} + \frac{\ln(n)}{T} = O\!\left(\sqrt{\frac{\ln(n)}{T}}\right)$$
Corollary 3: Under the conditions of Corollary 2,

$$\frac{1}{T}\sum_{t=1}^T M(P_t,Q_t) \le v + \Delta_T$$

Proof: Let P* be a minmax strategy for M, so that for every column strategy Q we have M(P*, Q) ≤ v. Using Corollary 2:

$$\frac{1}{T}\sum_{t=1}^T M(P_t,Q_t) \le \frac{1}{T}\sum_{t=1}^T M(P^*,Q_t) + \Delta_T \le v + \Delta_T$$

Remarks:
- Theorem 1 guarantees that the cumulative loss is not much larger than that of any fixed mixed strategy, and in particular not much larger than the game value (Corollary 3).
- If the environment is non-adversarial, there might be a better fixed mixed strategy for the player, in which case the algorithm is guaranteed to be almost as good as that better strategy.

2.5 PROOF OF THE MINMAX THEOREM

We prove: $\min_P \max_Q M(P,Q) \le \max_Q \min_P M(P,Q)$ (the reverse inequality was argued in 2.3). Suppose that we run algorithm LW (producing P_1, ..., P_T) against the maximally adversarial environment: on each round t the environment chooses $Q_t = \arg\max_Q M(P_t, Q)$. Denote $\bar{P} = \frac{1}{T}\sum_{t=1}^T P_t$ and $\bar{Q} = \frac{1}{T}\sum_{t=1}^T Q_t$ (both are probability distributions). We have:

$$\min_P \max_Q M(P,Q) \le \max_Q M(\bar{P},Q) = \max_Q \frac{1}{T}\sum_{t=1}^T M(P_t,Q) \qquad [\text{def of } \bar{P}]$$
$$\le \frac{1}{T}\sum_{t=1}^T \max_Q M(P_t,Q) = \frac{1}{T}\sum_{t=1}^T M(P_t,Q_t) \qquad [\text{def of } Q_t]$$
$$\le \min_P \frac{1}{T}\sum_{t=1}^T M(P,Q_t) + \Delta_T \qquad [\text{Corollary 2}]$$
$$= \min_P M(P,\bar{Q}) + \Delta_T \qquad [\text{def of } \bar{Q}]$$
$$\le \max_Q \min_P M(P,Q) + \Delta_T$$

Since $\Delta_T$ can be made arbitrarily close to zero, this completes the proof.
2.6 APPROXIMATELY SOLVING A GAME

The algorithm LW can be used to find an approximate minmax or maxmin strategy.

Definition: An approximate minmax strategy P satisfies $\max_Q M(P,Q) \le v + \epsilon$ (where ε can be made arbitrarily small). An approximate maxmin strategy is defined analogously.

Claim: $\bar{P} = \frac{1}{T}\sum_{t=1}^T P_t$ is an approximate minmax strategy (with P_t produced by LW). [Footnote 3: Give a full proof. Follow the proof in 2.5 exactly.] Proof sketch: from the proof in 2.5 we get $\max_Q M(\bar{P},Q) \le \max_Q \min_P M(P,Q) + \Delta_T = v + \Delta_T$.

Claim: $\bar{Q} = \frac{1}{T}\sum_{t=1}^T Q_t$, where $Q_t = \arg\max_Q M(P_t,Q)$, is an approximate maxmin strategy; from the proof in 2.5, $v - \Delta_T \le \min_P M(P,\bar{Q})$. Moreover, Q_t can always be chosen to be a pure strategy. [Footnote 4: Prove it. Use directly the definition of the loss M(P_t, Q).] This fact is important to our derivation of a boosting algorithm in section 4.
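Combining 2.5 and 2.6 gives a simple approximate game solver: run LW against a best pure-response environment and average both players' strategies. A sketch under the same conventions as the LW code above (function name mine):

```python
import numpy as np

def approx_solve(M, T, beta):
    """Approximately solve the game M (entries in [0,1]).

    Returns (P_bar, Q_bar): the average of the LW strategies P_t and the
    average of the adversary's best pure responses Q_t, which by 2.6 are
    approximate minmax / maxmin strategies respectively.
    """
    n, m = M.shape
    w = np.ones(n)
    P_bar, Q_bar = np.zeros(n), np.zeros(m)
    for _ in range(T):
        P = w / w.sum()
        j = int(np.argmax(P @ M))   # Q_t: pure best response, maximizes M(P_t, j)
        P_bar += P / T
        Q_bar[j] += 1 / T
        w *= beta ** M[:, j]        # LW update with Q_t = pure strategy j
    return P_bar, Q_bar
```

For the Rock, Paper, Scissors matrix, both averages should approach the uniform distribution as T grows, with $\max_j M(\bar{P}, j)$ approaching the value v = 1/2.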
3 ON-LINE PREDICTION

On-line prediction setting: In the on-line prediction model, the learner observes a sequence of examples and predicts their labels one at a time. The learner's goal is to minimize its prediction errors.

Formal definition: Let X be a finite set of instances, and let H be a finite set of hypotheses h : X → {0,1} (there are generalizations to infinite cases). Let c : X → {0,1} be an unknown target concept, not necessarily in H (how to choose a good H? bias-variance trade-off). Learning takes place in a sequence of rounds. On round t = 1, ..., T:
1. The learner observes an example x_t ∈ X.
2. The learner makes a randomized prediction $\hat{y}_t \in \{0,1\}$ of the label associated with x_t.
3. The learner observes the correct label c(x_t).

The goal of the learner: To minimize the expected number (with respect to its own randomization) of mistakes that it makes, relative to the best hypothesis in the space H.

Definition: The mistake matrix M, of size |H| × |X| (the game matrix in this case), is defined as follows:

$$M(h,x) = \begin{cases} 1 & \text{if } h(x) \ne c(x) \\ 0 & \text{otherwise} \end{cases}$$

Reduction to the repeated game problem: The environment's choice of a column corresponds to the choice of an instance x presented to the learner on a given round (a pure strategy). The learner's choice of a distribution over the matrix rows (hypotheses) corresponds to a random choice of a hypothesis with which to predict (a mixed strategy).

Applying the LW algorithm: On round t, we have a distribution P_t over H. Given instance x_t (a pure strategy), we randomly select h_t ∈ H according to P_t and predict $\hat{y}_t = h_t(x_t)$. Given c(x_t), we compute M(h, x_t) for every h ∈ H and update the weights by LW.

Analysis:

$$M(P_t, x_t) = \sum_{h \in H} P_t(h)\, M(h, x_t) = \Pr_{h \sim P_t}[h(x_t) \ne c(x_t)]$$

Therefore, the expected number of mistakes made by the learner satisfies (by Corollary 2):

$$\sum_{t=1}^T M(P_t, x_t) \le \min_{h \in H} \sum_{t=1}^T M(h, x_t) + O\!\left(\sqrt{T \ln|H|}\right)$$

Remarks:
- An analysis using Theorem 1 gives a better bound.
- The result can be generalized to any bounded loss function and to more general settings.
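The reduction is mechanical: on round t, the loss vector fed to LW is the column of the mistake matrix indexed by x_t. A minimal sketch, with hypotheses as callables (the interface and names are my assumptions):

```python
import numpy as np

def online_predict(H, examples, c, beta, seed=0):
    """On-line prediction by running LW over the hypothesis space H.

    H: list of hypotheses h: X -> {0, 1}; examples: sequence of instances;
    c: target concept. Returns the number of (randomized) mistakes.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(len(H))
    mistakes = 0
    for x in examples:
        P = w / w.sum()
        h = H[rng.choice(len(H), p=P)]    # draw hypothesis h_t ~ P_t
        mistakes += int(h(x) != c(x))     # predict y_hat = h_t(x)
        loss = np.array([float(g(x) != c(x)) for g in H])  # column M(., x_t)
        w *= beta ** loss                 # LW update
    return mistakes
```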
4 BOOSTING

Weak learning definition: For γ > 0, we say that an algorithm WL is a γ-weak learning algorithm for (H, c) if, for any distribution D over X, the algorithm takes as input a set of labeled examples distributed according to D and outputs a hypothesis h ∈ H with error slightly better than random guessing:

$$\Pr_{x \sim D}[h(x) \ne c(x)] \le \frac{1}{2} - \gamma$$

Boosting definition: Boosting is the problem of converting a weak learning algorithm into one that performs with good accuracy. The goal is to run the weak algorithm many times on many distributions and to combine the selected hypotheses into a final hypothesis with small error rate. The main issues are how to choose the distributions D_t and how to combine the hypotheses.

The boosting process: Boosting proceeds in rounds. On round t = 1, ..., T:
1. The booster constructs a distribution D_t on X, which is passed to the weak learner.
2. The weak learner produces a hypothesis h_t ∈ H with error

$$\Pr_{x \sim D_t}[h_t(x) \ne c(x)] \le \frac{1}{2} - \gamma$$

3. After T rounds, the weak hypotheses h_1, ..., h_T are combined into a final hypothesis h_fin.

Assumptions:
- Probability of success: We assume that the weak learner always succeeds, so the boosting algorithm succeeds with absolute certainty. (Usually we assume only that the weak learner succeeds with high probability, in which case the boosting algorithm succeeds with probability > 1 − δ [PAC].)
- Final hypothesis error: We have full access to the labels associated with the entire domain X. Thus, we require that the final hypothesis have error zero, so that all instances are correctly classified. The algorithm can be modified to fit the more standard (and practical) model in which the final error need only be less than some positive parameter ɛ. With a given labeled training set, all distributions would be computed over the training set. The generalization error of the final hypothesis can then be bounded using, for instance, standard VC theory.
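For concreteness, the weak-learning guarantee is just a statement about error under a given distribution D over X; a small helper to check it for a candidate hypothesis (illustrative only, names mine):

```python
def weak_error(h, D, X, c):
    """Pr_{x ~ D}[h(x) != c(x)] for a distribution D over the finite set X."""
    return sum(D[i] for i, x in enumerate(X) if h(x) != c(x))

def satisfies_weak_learning(h, D, X, c, gamma):
    """Check the gamma-weak-learning guarantee: error <= 1/2 - gamma under D."""
    return weak_error(h, D, X, c) <= 0.5 - gamma
```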
4.1 BOOSTING AND THE MINMAX THEOREM

Motivation: This part gives intuition for the boosting algorithm of the next section.

Relationship between the mistake matrix M (section 3) and the minmax theorem:

$$\min_P \max_x M(P, x) = \min_P \max_Q M(P, Q) = v = \max_Q \min_P M(P, Q) = \max_Q \min_h M(h, Q) \qquad (*)$$

Last equality: for any Q, $\min_P M(P, Q)$ is realized at a pure strategy h. First equality: for any P, $\max_Q M(P, Q)$ is realized at a pure strategy x. Note that $M(h, Q) = \Pr_{x \sim Q}[h(x) \ne c(x)]$.

- There exists a distribution (maxmin strategy) Q* on X such that for every h: $M(h, Q^*) = \Pr_{x \sim Q^*}[h(x) \ne c(x)] \ge v$. From weak learnability, there exists h such that $\Pr_{x \sim Q^*}[h(x) \ne c(x)] \le \frac{1}{2} - \gamma$. Hence $v \le \frac{1}{2} - \gamma$.
- There exists a distribution (minmax strategy) P* over H such that for every x: $M(P^*, x) = \Pr_{h \sim P^*}[h(x) \ne c(x)] \le v \le \frac{1}{2} - \gamma < \frac{1}{2}$. That is, every x is misclassified by less than half of the hypotheses (as weighted by the minmax strategy). Therefore, the target concept c is equivalent to a weighted majority of hypotheses in H.

Corollary: If (H, c) is γ-weakly learnable, then c can be computed exactly as a weighted majority of hypotheses in H. The weights, given by a distribution on the rows (hypotheses) of the game M, are a minmax strategy for this game.

4.2 IDEA FOR BOOSTING ALGORITHM

The idea: By the corollary above, we approximate c by approximating the weights of the minmax strategy (recall subsection 2.6).

Adapting LW to the boosting model: The LW algorithm does not fit the boosting model directly. Recall that on each round, algorithm LW computes a distribution over the rows of the game matrix (hypotheses). However, in the boosting model, we want to compute on each round a distribution over instances (columns of M). Rather than using the game M directly, we construct the dual M′ of M, which is the identical game except that the roles of the row and column players have been reversed.
Constructing the dual matrix: We modify our matrix game M (rows: hypotheses, columns: instances) to fit the boosting model. We construct the dual game matrix M′ as follows:
1. The algorithm computes a distribution over rows (hypotheses), but in the boosting model we want to compute at each round a distribution over the instances (columns of M). We need to swap rows and columns, so we take the transpose $M^\top$.
2. The column player of M wants to maximize the loss, but the row player of M′ wants to minimize it. We want to reverse the meaning of minimum and maximum, so we take $-M^\top$.
3. By our convention, losses must lie in [0,1]; for that reason we take $\mathbf{1} - M^\top$.

Definition: The dual matrix is defined as

$$M'(x, h) = 1 - M(h, x) = \begin{cases} 1 & \text{if } h(x) = c(x) \\ 0 & \text{otherwise} \end{cases}$$

Remark: Any minmax strategy of the game M becomes a maxmin strategy of the game M′. Therefore, whereas before we were interested in finding an approximate minmax strategy of M, we are now interested in finding an approximate maxmin strategy of M′.

Applying algorithm LW to M′: We apply LW to the dual game matrix M′. The reduction proceeds as follows. On round t of boosting:
1. LW computes a distribution P_t over the rows of M′ (that is, over X).
2. The boosting algorithm sets D_t = P_t and passes D_t to the weak learning algorithm.
3. The weak learning algorithm returns h_t with $\Pr_{x \sim D_t}[h_t(x) = c(x)] \ge \frac{1}{2} + \gamma$.
4. The weights maintained by LW are updated with Q_t defined to be the pure strategy h_t. That is, $w_{t+1}(x) = w_t(x)\,\beta^{M'(x, h_t)}$ for all x.
5. Return the hypotheses h_1, ..., h_T.

Constructing an approximate maxmin strategy: According to the method of approximately solving a game (2.6), on each round t, Q_t may be a pure strategy h_t, which should be chosen to maximize:

$$M'(P_t, h_t) = \sum_x P_t(x)\, M'(x, h_t) = \Pr_{x \sim P_t}[h_t(x) = c(x)]$$

That is, h_t should have maximum accuracy with respect to the distribution P_t.
For that, we use the weak learner. Although it is not guaranteed to find the best h_t, finding one of accuracy $\frac{1}{2} + \gamma$ turns out to be sufficient for our purposes. Finally, this method suggests that $\bar{Q} = \frac{1}{T}\sum_{t=1}^T Q_t$ is an approximate maxmin strategy, and we showed in 4.1 that the target c is equivalent to a majority of the hypotheses when weighted by a maxmin strategy of M′. Each Q_t is a pure strategy (hypothesis) h_t, and this leads us to choose a simple majority of h_1, ..., h_T:

$$h_{fin} = \text{majority}(h_1, \ldots, h_T)$$

4.3 ANALYSIS

Our boosting procedure computes h_fin identical to c (for sufficiently large T). For all t:

$$M'(P_t, h_t) = \Pr_{x \sim P_t}[h_t(x) = c(x)] \ge \frac{1}{2} + \gamma$$

By Corollary 2 we have:

$$\frac{1}{2} + \gamma \le \frac{1}{T}\sum_{t=1}^T M'(P_t, h_t) \le \min_x \frac{1}{T}\sum_{t=1}^T M'(x, h_t) + \Delta_T$$

Rearranging terms, for all x:

$$\frac{1}{T}\sum_{t=1}^T M'(x, h_t) \ge \frac{1}{2} + \gamma - \Delta_T > \frac{1}{2} \quad \text{for large } T \ (\text{i.e., } \Delta_T < \gamma)$$

Here $\Delta_T = O(\sqrt{\ln|X| / T})$, so we choose $T = \Omega(\ln|X| / \gamma^2)$ to ensure $\Delta_T < \gamma$. By the definition of M′, $\sum_{t=1}^T M'(x, h_t)$ is exactly the number of hypotheses h_t that agree with c on x, and this number is more than T/2. So, by the definition of h_fin, we get $h_{fin}(x) = c(x)$ for all x.
Algorithm 2 Boosting algorithm

Input: instance space X, target function c, and a γ-weak learning algorithm.
1. Set $T = \frac{4}{\gamma^2}\ln|X|$ (so that $\Delta_T < \gamma$).
2. Set $\beta = \frac{1}{1 + \sqrt{2\ln|X| / T}}$.
3. Set $D_1(x) = \frac{1}{|X|}$ for all x ∈ X.
4. For t = 1, ..., T:
   - Pass the distribution D_t to the weak learner.
   - Get back a hypothesis h_t such that $\Pr_{x \sim D_t}[h_t(x) \ne c(x)] \le \frac{1}{2} - \gamma$.
   - Update the weights, for all x:
     $$w_{t+1}(x) = w_t(x) \cdot \begin{cases} \beta & \text{if } h_t(x) = c(x) \\ 1 & \text{otherwise} \end{cases}$$
   - Update D_t: $D_{t+1}(x) = \frac{w_{t+1}(x)}{\sum_x w_{t+1}(x)}$ for all x.
Output: final hypothesis $h_{fin} = \text{majority}(h_1, \ldots, h_T)$.
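A minimal Python sketch of Algorithm 2, assuming a weak_learner(D) callable that returns a hypothesis meeting the (1/2 − γ) error guarantee under D (this interface is my assumption, not part of the notes):

```python
import numpy as np

def boost(X, c, weak_learner, gamma):
    """Boosting by majority via LW on the dual game M' (Algorithm 2).

    X: list of instances; c: target function on X; weak_learner(D) must
    return a hypothesis with error <= 1/2 - gamma under the distribution D.
    """
    N = len(X)
    T = int(np.ceil(4 * np.log(N) / gamma ** 2))   # so that Delta_T < gamma
    beta = 1 / (1 + np.sqrt(2 * np.log(N) / T))
    w = np.ones(N)                                  # uniform D_1
    hs = []
    for _ in range(T):
        D = w / w.sum()
        h = weak_learner(D)
        hs.append(h)
        # LW update on the dual game: M'(x, h_t) = 1 iff h_t(x) = c(x),
        # so correctly classified instances are down-weighted by beta.
        agree = np.array([float(h(x) == c(x)) for x in X])
        w *= beta ** agree
    def h_fin(x):                                   # simple majority vote
        return int(sum(h(x) for h in hs) > len(hs) / 2)
    return h_fin
```

The down-weighting of correctly classified instances is exactly the familiar boosting heuristic: later rounds concentrate the distribution on the instances that the hypotheses found so far get wrong.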