CS260: Machine Learning Theory Lecture 12: No Regret and the Minimax Theorem of Game Theory November 2, 2011


Lecturer: Jennifer Wortman Vaughan

All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.

1 Regret Bound for Randomized Weighted Majority

In the last class, we proved a general regret bound for Follow the Regularized Leader. In particular, for any arbitrary sequence of losses $\vec{\ell}_1, \ldots, \vec{\ell}_T$ with each $\ell_{i,t} \in [0,1]$, let $\vec{p}_1, \ldots, \vec{p}_T$ be the distributions chosen by Follow the Regularized Leader with parameter $\eta$ and regularizer $R$. Then for any $\vec{p} \in \Delta_n$,

$$\sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t - \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p} \le \sum_{t=1}^{T} \vec{\ell}_t \cdot (\vec{p}_t - \vec{p}_{t+1}) + \frac{1}{\eta}\left(R(\vec{p}) - R(\vec{p}_1)\right),$$

which implies that

$$\sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t - \min_{\vec{p} \in \Delta_n} \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p} \le \sum_{t=1}^{T} \vec{\ell}_t \cdot (\vec{p}_t - \vec{p}_{t+1}) + \frac{1}{\eta}\left(\max_{\vec{p} \in \Delta_n} R(\vec{p}) - R(\vec{p}_1)\right).$$

Today we will use this result to prove a regret bound of $O(\sqrt{T \log n})$ for Randomized Weighted Majority, which we know is a Follow the Regularized Leader algorithm with $R(\vec{p}) = -H(\vec{p}) = \sum_{i=1}^{n} p_i \log p_i$. To do this, we must bound the two terms on the right hand side of the bound above.

Step 1: Bounding the Range of the Regularizer

We begin by deriving upper and lower bounds on the entropy function $H(\vec{p})$. The lower bound is easy. Since $p_i \log(1/p_i) \ge 0$ for all $i$, we have $H(\vec{p}) = \sum_{i=1}^{n} p_i \log(1/p_i) \ge 0$. (Remember that we define $0 \log(1/0)$ to be 0 by convention.) As we discussed before, $H(\vec{p}) = 0$ is achieved when $\vec{p}$ puts all of its weight on a single expert.

To upper bound $H(\vec{p})$, we can use Jensen's inequality. We get

$$H(\vec{p}) = \sum_{i=1}^{n} p_i \log\frac{1}{p_i} \le \log\left(\sum_{i=1}^{n} p_i \cdot \frac{1}{p_i}\right) = \log \sum_{i=1}^{n} 1 = \log n.$$

This value is achieved when $\vec{p}$ is uniform. We have that

$$\frac{1}{\eta}\left(\max_{\vec{p} \in \Delta_n} R(\vec{p}) - R(\vec{p}_1)\right) = \frac{1}{\eta}\left(0 - (-\log n)\right) = \frac{1}{\eta}\log n.$$
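As a quick numerical check of Step 1, the following Python sketch evaluates the entropy at a point mass, at the uniform distribution, and at one skewed distribution, and then computes the regularizer term $(1/\eta)(\max R - R(\vec{p}_1))$. The function names `entropy` and `regularizer_range_term` and the example distributions are illustrative assumptions, not part of the notes.

```python
import math

def entropy(p):
    """H(p) = sum_i p_i * log(1 / p_i), using the convention 0 * log(1/0) = 0."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def regularizer_range_term(n, eta):
    """(1/eta) * (max_p R(p) - R(p_1)) for R = -H and p_1 uniform over n experts."""
    uniform = [1.0 / n] * n
    max_R = 0.0                 # R = -H is 0 at any point mass
    R_p1 = -entropy(uniform)    # equals -log n
    return (max_R - R_p1) / eta

# Illustrative distributions over n = 4 experts (assumed values).
point_mass = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(point_mass), entropy(skewed), entropy(uniform), math.log(4))
print(regularizer_range_term(4, eta=0.5))
```

The skewed distribution should land strictly between the two extremes, and the last value should match $(\log 4)/0.5$.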

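Step 2: Bounding the Stability Term

We will prove the following lemma, which gives a bound on the stability term for RWM. We make use of the fact that we can write the probability that RWM assigns to expert $i$ at time $t$ as

$$p_{i,t} = \frac{w_{i,t}}{\sum_{j=1}^{n} w_{j,t}}, \quad \text{where } w_{i,t} = e^{-\eta L_{i,t-1}}$$

and $L_{i,t-1} = \sum_{s=1}^{t-1} \ell_{i,s}$ is the cumulative loss of expert $i$ through round $t-1$.

Lemma 1. For any sequence of losses $\vec{\ell}_1, \ldots, \vec{\ell}_T$ with each $\ell_{i,t} \in [0,1]$, let $\vec{p}_1, \ldots, \vec{p}_T$ be the distributions chosen by Randomized Weighted Majority with parameter $\eta$. For all $t$,

$$\vec{\ell}_t \cdot (\vec{p}_t - \vec{p}_{t+1}) \le 2\eta.$$

The constant 2 in this bound can be improved, but we won't worry about that here.

Proof: First we note that this bound holds trivially for all $\eta \ge 1/2$, since $\vec{p}_t \in \Delta_n$ and $0 \le \ell_{i,t} \le 1$ for all $i$ and $t$, so the left-hand side is at most $\vec{\ell}_t \cdot \vec{p}_t \le 1 \le 2\eta$. For the remainder of the proof, we consider the case in which $\eta < 1/2$.

Let's first think about the relationship between $w_{i,t}$ and $w_{i,t+1}$ for some $i$ and $t$. We have

$$w_{i,t+1} = e^{-\eta L_{i,t}} = e^{-\eta (L_{i,t-1} + \ell_{i,t})} = e^{-\eta L_{i,t-1}} e^{-\eta \ell_{i,t}} = w_{i,t} e^{-\eta \ell_{i,t}}.$$

Since $\ell_{i,t} \in [0,1]$, $e^{-\eta} \le e^{-\eta \ell_{i,t}} \le 1$. Combining this with the last equation, we get

$$w_{i,t+1} \le w_{i,t} \quad \text{and} \quad w_{i,t+1} \ge w_{i,t} e^{-\eta}.$$

Using these two bounds, we can relate $p_{i,t}$ to $p_{i,t+1}$. In particular, we have

$$p_{i,t} = \frac{w_{i,t}}{\sum_{j=1}^{n} w_{j,t}} \le \frac{e^{\eta} w_{i,t+1}}{\sum_{j=1}^{n} w_{j,t+1}} = e^{\eta} p_{i,t+1}.$$

Since this holds for all experts $i$, and the losses are nonnegative, we have

$$\vec{\ell}_t \cdot (\vec{p}_t - \vec{p}_{t+1}) \le \vec{\ell}_t \cdot \left(e^{\eta} \vec{p}_{t+1} - \vec{p}_{t+1}\right) = (e^{\eta} - 1)\, \vec{\ell}_t \cdot \vec{p}_{t+1} \le e^{\eta} - 1 \le (e - 1)\eta \le 2\eta,$$

where the inequality $e^{\eta} - 1 \le (e - 1)\eta$ holds because we are only considering the case in which $\eta < 1/2$ (see Figure 1). The point of this step is to get some linear bound on $e^{\eta} - 1$; we're not worrying too much about the constants, and clearly this could be made tighter.

Exercise: During the break, someone in the class pointed out that this bound can be tightened to get $\vec{\ell}_t \cdot (\vec{p}_t - \vec{p}_{t+1}) \le \eta$ without requiring the case analysis at all by using the inequality $1 + x \le e^{x}$ that we saw, for example, in Lecture 2. It's true! Try working through this alternative analysis. (The first few steps remain the same.)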
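The stability bound can be checked empirically. The sketch below is a minimal pure-Python version of the RWM weights, assuming losses drawn uniformly at random from [0, 1]; the helper names `rwm_distribution` and `stability_terms`, the seed, and the values of `eta`, the number of experts, and the number of rounds are all illustrative choices rather than anything fixed by the notes.

```python
import math
import random

def rwm_distribution(cum_loss, eta):
    """RWM distribution: p_i proportional to exp(-eta * L_i) for cumulative losses L_i."""
    base = min(cum_loss)  # shifting all L_i by a constant leaves p unchanged, avoids underflow
    weights = [math.exp(-eta * (L - base)) for L in cum_loss]
    total = sum(weights)
    return [w / total for w in weights]

def stability_terms(losses, eta):
    """Return l_t . (p_t - p_{t+1}) for every round t of the given loss sequence."""
    n = len(losses[0])
    cum = [0.0] * n
    p = rwm_distribution(cum, eta)  # p_1 is uniform
    terms = []
    for loss in losses:
        cum = [c + l for c, l in zip(cum, loss)]
        p_next = rwm_distribution(cum, eta)
        terms.append(sum(l * (a - b) for l, a, b in zip(loss, p, p_next)))
        p = p_next
    return terms

rng = random.Random(0)  # fixed seed so the run is reproducible
losses = [[rng.random() for _ in range(5)] for _ in range(200)]
eta = 0.3
terms = stability_terms(losses, eta)
print(max(terms), "<=", 2 * eta)
```

Every entry of `terms` should sit below $2\eta$, as the lemma guarantees for any loss sequence in $[0,1]$, not only for random ones.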

Figure 1: $e^{\eta} - 1 \le (e - 1)\eta$ for $\eta \in [0, 1]$.

Step 3: Putting it Together

We can now apply the general FRL regret bound to get a regret bound for Randomized Weighted Majority:

$$\sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t - \min_{\vec{p} \in \Delta_n} \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p} \le \sum_{t=1}^{T} \vec{\ell}_t \cdot (\vec{p}_t - \vec{p}_{t+1}) + \frac{1}{\eta}\left(\max_{\vec{p} \in \Delta_n} R(\vec{p}) - R(\vec{p}_1)\right) \le 2\eta T + \frac{\log n}{\eta}.$$

If we know the value of $T$ (or an upper bound on $T$) in advance, we can optimize the value of $\eta$ to minimize this bound. Setting the derivative of the bound equal to 0, we get that it is minimized when

$$\eta = \sqrt{\frac{\log n}{2T}},$$

giving us a regret bound of $2\sqrt{2T \log n} = O(\sqrt{T \log n})$, which is sublinear in $T$ as desired. Since the average regret per time step approaches 0 as $T$ gets large, we call RWM a no-regret algorithm.

A few comments are in order. First, achieving this bound requires knowing $T$ in advance. However, there are tricks that can be used to get around this (such as the so-called "doubling trick"), yielding bounds that are only slightly worse when $T$ is unknown. Second, although we only showed it for the particular case of RWM, it turns out that as long as the regularizer $R$ is strongly convex, the upper bound on the regret for FRL will always be sublinear in $T$. For example, sublinear regret can be achieved using $R(\vec{p}) = \|\vec{p}\|^2$ (which turns out to be equivalent to the well-studied Online Gradient Descent algorithm).

2 The Minimax Theorem of Game Theory

We will now see how the existence of no-regret algorithms implies a key result from game theory, Von Neumann's Minimax Theorem. We begin by reviewing the idea of a two-player zero-sum game.

A two-player zero-sum game is defined by a matrix $M \in \mathbb{R}^{n \times m}$. Player 1 (the "row player") chooses a row $i \in \{1, \ldots, n\}$, and Player 2 (the "column player") chooses a column $j \in \{1, \ldots, m\}$. Player 1 then suffers a loss of $M_{i,j}$ (or equivalently receives the payoff $-M_{i,j}$), while Player 2 receives the payoff $M_{i,j}$. Table 1 shows the payoff matrix $M$ for the well known game Rock-Paper-Scissors.
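To make the loss and payoff convention concrete, here is a small Python sketch. The 2 x 3 matrix is a made-up example (it is not the Rock-Paper-Scissors matrix from the notes), the function name `outcome` is hypothetical, and rows and columns are indexed from 0 as is usual in Python.

```python
# Hypothetical loss matrix, assumed for illustration only: M[i][j] is the loss
# Player 1 suffers when Player 1 picks row i and Player 2 picks column j.
M = [[0.2, 0.9, 0.4],
     [0.7, 0.1, 0.6]]

def outcome(M, i, j):
    """Return (payoff to Player 1, payoff to Player 2) for pure actions (i, j)."""
    loss_p1 = M[i][j]
    return -loss_p1, loss_p1  # Player 2 gains exactly what Player 1 loses

payoff_p1, payoff_p2 = outcome(M, 0, 1)
print(payoff_p1, payoff_p2)
```

The two payoffs sum to zero for every pair of actions, which is what "zero-sum" refers to.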

Table 1: Matrix corresponding to the losses for Player 1 for Rock-Paper-Scissors (rows and columns are each indexed by R, P, S).

For a game like Rock-Paper-Scissors, there is an advantage to using a randomized strategy for each player. We can therefore think of Player 1 as choosing a distribution $\vec{p}$ over actions (rows) $1, \ldots, n$, and Player 2 choosing a distribution $\vec{q}$ over actions (columns) $1, \ldots, m$. Consider the following two scenarios:

1. Player 1 announces a distribution $\vec{p}$, after which Player 2 announces a distribution $\vec{q}$. Player 1 receives the expected payoff $-\sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}$, while Player 2 receives $\sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}$.

2. Player 2 announces a distribution $\vec{q}$, after which Player 1 announces a distribution $\vec{p}$. Again, Player 1 receives the expected payoff $-\sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}$, while Player 2 receives $\sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}$.

Intuitively, it seems that the person who can choose their strategy second should have an advantage. However, one can show mathematically that neither player has an advantage if both play optimally. This is formalized in the following theorem, which we can prove using the existence of no-regret algorithms.

Theorem 1 (Von Neumann Minimax Theorem). For any $n \times m$ matrix $M$,

$$\min_{\vec{p}} \max_{\vec{q}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j} = \max_{\vec{q}} \min_{\vec{p}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}.$$

Proof: We will prove a special case of this theorem for matrices $M$ with entries in $[0,1]$, but this result can be generalized easily. (Exercise: Show how it can be generalized to any arbitrary matrix $M$.)

Suppose that two players play the zero-sum game defined by the matrix $M$ repeatedly for a sequence of rounds $t = 1, \ldots, T$. At each round $t$, Player 1 chooses a distribution $\vec{p}_t$ and Player 2 chooses a distribution $\vec{q}_t$. For every $i$ and $t$, define

$$\ell_{i,t} = \sum_{j=1}^{m} q_{j,t} M_{i,j}.$$

This is the loss (negative payoff) that Player 1 would suffer for choosing the action $i$. We can then write the expected loss of Player 1 on round $t$ as $\sum_{i=1}^{n} p_{i,t} \ell_{i,t} = \vec{p}_t \cdot \vec{\ell}_t$.

Suppose that Player 1 chooses $\vec{p}_t$ at each round by running a no-regret algorithm on this sequence of losses, and suppose that Player 2, knowing this, chooses the distribution $\vec{q}_t$ on each round that will hurt Player 1 the most. Since the average per time step regret of Player 1's algorithm goes to 0, we know that for any $\epsilon > 0$, it is possible to make $T$ large enough so that

$$\frac{1}{T}\sum_{t=1}^{T} \vec{p}_t \cdot \vec{\ell}_t - \min_{\vec{p}} \frac{1}{T}\sum_{t=1}^{T} \vec{p} \cdot \vec{\ell}_t \le \epsilon.$$

Suppose we make $T$ sufficiently large.

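For any $t$,

$$\min_{\vec{p}} \max_{\vec{q}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j} \le \max_{\vec{q}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_{i,t} q_j M_{i,j} = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{i,t} q_{j,t} M_{i,j},$$

where the equality holds because Player 2 chooses $\vec{q}_t$ to hurt Player 1 the most given $\vec{p}_t$. Averaging over the $T$ rounds, this implies that

$$\min_{\vec{p}} \max_{\vec{q}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j} \le \frac{1}{T}\sum_{t=1}^{T} \sum_{i=1}^{n}\sum_{j=1}^{m} p_{i,t} q_{j,t} M_{i,j} = \frac{1}{T}\sum_{t=1}^{T} \vec{p}_t \cdot \vec{\ell}_t \le \min_{\vec{p}} \frac{1}{T}\sum_{t=1}^{T} \vec{p} \cdot \vec{\ell}_t + \epsilon \le \max_{\vec{q}} \min_{\vec{p}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j} + \epsilon.$$

The last inequality holds because $\frac{1}{T}\sum_{t=1}^{T} \vec{p} \cdot \vec{\ell}_t = \sum_{i=1}^{n}\sum_{j=1}^{m} p_i \bar{q}_j M_{i,j}$ for the average distribution $\bar{q} = \frac{1}{T}\sum_{t=1}^{T} \vec{q}_t$, which is one particular choice of $\vec{q}$. We can drive $\epsilon$ arbitrarily close to 0 and this bound will still hold, so we must have that

$$\min_{\vec{p}} \max_{\vec{q}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j} \le \max_{\vec{q}} \min_{\vec{p}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}.$$

It remains to show the opposite, that is, that

$$\max_{\vec{q}} \min_{\vec{p}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j} \le \min_{\vec{p}} \max_{\vec{q}} \sum_{i=1}^{n}\sum_{j=1}^{m} p_i q_j M_{i,j}.$$

This is the more intuitive direction that says that the loss that Player 1 can guarantee himself when playing second is no more than the loss that Player 1 can guarantee himself when playing first. The proof is left as an exercise.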
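The repeated-play argument in the proof can be simulated. In the sketch below, Player 1 runs RWM with $\eta = \sqrt{\log n / (2T)}$ and Player 2 best-responds each round. The loss matrix is an assumption made for illustration (0 when the row player wins, 0.5 for a tie, 1 when the row player loses, in the order Rock, Paper, Scissors); it is not taken from the notes. The names `rwm_distribution` and `play` are likewise hypothetical. Under this assumed matrix the value of the game is 0.5.

```python
import math

# Assumed loss matrix for the row player, entries in [0, 1] (order: Rock, Paper, Scissors).
M = [[0.5, 1.0, 0.0],
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]

def rwm_distribution(cum_loss, eta):
    """RWM distribution: p_i proportional to exp(-eta * L_i)."""
    base = min(cum_loss)  # constant shift for numerical stability; p is unchanged
    weights = [math.exp(-eta * (L - base)) for L in cum_loss]
    total = sum(weights)
    return [w / total for w in weights]

def play(M, T):
    """Average expected loss of Player 1 over T rounds against a best-responding Player 2."""
    n, m = len(M), len(M[0])
    eta = math.sqrt(math.log(n) / (2.0 * T))
    cum = [0.0] * n
    total = 0.0
    for _ in range(T):
        p = rwm_distribution(cum, eta)
        # Expected loss of Player 1 against each pure column. The objective is linear
        # in q, so some pure column attains the maximum over all distributions q.
        col_loss = [sum(p[i] * M[i][j] for i in range(n)) for j in range(m)]
        j_t = max(range(m), key=lambda j: col_loss[j])
        total += col_loss[j_t]          # this round's p_t . l_t
        for i in range(n):
            cum[i] += M[i][j_t]         # l_{i,t} = M[i][j_t] for a pure q_t
    return total / T

T = 5000
avg_loss = play(M, T)
eps_bound = 2.0 * math.sqrt(2.0 * T * math.log(len(M))) / T  # regret bound divided by T
print(avg_loss, eps_bound)
```

The printed average loss should lie between the game value 0.5 and 0.5 plus `eps_bound`, mirroring the chain of inequalities in the proof; increasing `T` shrinks the gap.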
