1 Online Learning and Regret Minimization



2.997 Decision-Making in Large-Scale Systems, May 10
MIT, Spring 2004, Handout #29
Lecture Note 24

In this lecture, we consider the problem of sequential decision making in an online environment. At each time stage, the decision maker must choose an action and receives a cost or reward that is a function of its action and of the action of the environment. We assume that nothing is known a priori about the evolution law for the actions of the environment, which in particular may depend on the actions of the decision maker and/or on an unobservable state of the environment, and be nonstationary.

The problem of online sequential decision making can be cast as a two-player repeated game, where the environment is seen as the opponent. Although this seems restrictive, from the standpoint of the algorithms we study, two-player repeated games actually encompass a large class of problems: how to play a classical two-player game such as the repeated prisoner's dilemma, rock-paper-scissors, or matching pennies against a human; how to play a game against multiple other players, such as investing in the stock market; how to control a system that is difficult to model and possibly nonstationary, such as systems that interact with other systems and/or humans.

Formally, we describe our problem as follows. At each time stage $t$, the decision maker must choose an action $a_t$ taking values in a finite set $\mathcal{A}$, with $|\mathcal{A}| = K$, and receives a reward $r(a_t, b_t)$ that is a function of its action and of the action $b_t$ of the environment, taking values in a (possibly infinite) set $\mathcal{B}$. We assume that nothing is known about the evolution law for the action of the environment $b_t$, which in particular may depend on the actions of the decision maker and/or on an unobservable state of the environment, and may vary over time.

The decision maker's objective is, of course, to accumulate as much reward as possible. However, due to the lack of assumptions about the opponent, certain objectives such as maximizing expected reward raise some controversy as to how the expectation should be defined.
A popular criterion in online learning is regret minimization. Regret is defined as the difference between the reward that could have been achieved, given the choices of the opponent, and what was actually achieved. Specifically, we define the regret of an algorithm $A$ at time $T$ for the decision maker as

$$R_A(T) = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} r(a, b_t) - \sum_{t=1}^{T} r(a_t, b_t).$$

Having no regret at time $T$ implies that, in retrospect, given the opponent's actions $b_1, \dots, b_T$, no single action could have achieved higher reward than the algorithm's sequence of actions $a_1, \dots, a_T$: the algorithm performs as well as the best action. Note that the best action is chosen with full knowledge of the opponent's whole sequence of actions, whereas algorithm $A$ must choose action $a_t$ based solely on the past history $b_1, \dots, b_{t-1}$, or on the vector of rewards $r(\cdot, b_1), \dots, r(\cdot, b_{t-1})$, or in some cases only on the rewards effectively achieved $r(a_1, b_1), \dots, r(a_{t-1}, b_{t-1})$. Nonetheless, we will see that in each of these cases there is a relatively simple algorithm that achieves no regret, asymptotically.

We start with what we call the full information case: the situation where the opponent's actions or, equivalently, the vector of rewards $r(\cdot, b_t)$ is observed after each stage. Note that, in the full information case, before choosing action $a_t$ one is able to compute how well each action $a$ would have done relative to the previous sequence of moves of the opponent; explicitly, we have

$$G_a(t-1) = \sum_{\tau=1}^{t-1} r(a, b_\tau).$$

An intuitive algorithm could choose action $a_t$ with probability proportional to $G_a(t-1)$. In the sequel, we consider the following algorithm.

Full-Information Algorithm: Take $\eta > 0$. Let $G_a(0) = 0$, $\forall a \in \mathcal{A}$. For $t = 1, 2, \dots$,

1. Choose $a_t$ with probability $P_a(t)$, where
$$P_a(t) = \frac{\exp(\eta G_a(t-1))}{\sum_{a' \in \mathcal{A}} \exp(\eta G_{a'}(t-1))}.$$
2. For all actions $a$, compute $G_a(t) = G_a(t-1) + r(a, b_t)$.

We can prove the following result about the expected reward achieved by the full information algorithm. It follows from the theorem that, for every $T$, the difference in total reward achieved by the algorithm and by the best action is on the order of $O(\sqrt{T})$. Therefore the algorithm achieves no regret asymptotically.

Theorem 1 For all $b_1, b_2, \dots, b_T$ and $a_1, a_2, \dots, a_T$ generated according to $P(t)$, we have

$$E\left[\sum_{t=1}^{T} r(a_t, b_t)\right] \ge \frac{\eta \max_{a} \sum_{t=1}^{T} r(a, b_t) - \ln K}{e^{\eta} - 1}.$$

Proof: Let $W_t = \sum_{a} \exp(\eta G_a(t))$, so that $W_0 = K$. Then we have

$$\begin{aligned}
\frac{W_t}{W_{t-1}} &= \frac{\sum_{a} \exp(\eta G_a(t-1)) \exp(\eta r(a, b_t))}{\sum_{a'} \exp(\eta G_{a'}(t-1))} \\
&= \sum_{a} P_a(t) \exp(\eta r(a, b_t)) \\
&\le 1 + \eta \sum_{a} P_a(t) r(a, b_t) + (e^{\eta} - 1 - \eta) \sum_{a} P_a(t) r(a, b_t) \\
&= 1 + (e^{\eta} - 1) \sum_{a} P_a(t) r(a, b_t),
\end{aligned}$$

where the inequality uses $e^{\eta r} \le 1 + (e^{\eta} - 1) r$, which holds by convexity when rewards take values in $[0, 1]$.

It follows that

$$\ln \frac{W_T}{W_0} = \sum_{t=1}^{T} \ln \frac{W_t}{W_{t-1}} \le \sum_{t=1}^{T} \ln\left(1 + (e^{\eta} - 1) \sum_{a} P_a(t) r(a, b_t)\right) \le (e^{\eta} - 1) \sum_{t=1}^{T} \sum_{a} P_a(t) r(a, b_t). \qquad (1)$$

On the other hand, we have

$$\ln \frac{W_T}{W_0} = \ln \frac{\sum_{a} \exp(\eta G_a(T))}{K} \ge \eta \max_{a} G_a(T) - \ln K.$$

Combining (1) with this lower bound, and noting that $\sum_{t=1}^{T} \sum_{a} P_a(t) r(a, b_t)$ is the expected total reward of the algorithm, the theorem follows.

Corollary 1 If $\eta = \ln\left(1 + \sqrt{2 (\ln K)/T}\right)$, we have

$$E\left[\sum_{t=1}^{T} r(a_t, b_t)\right] \ge \max_{a} \sum_{t=1}^{T} r(a, b_t) - \sqrt{2 T \ln K}.$$

1.1 Partial Information Case

We now consider the situation where we only observe the sequence of rewards $r(a_1, b_1), \dots, r(a_{t-1}, b_{t-1})$ before choosing action $a_t$. The following algorithm is a slight modification of the full information algorithm.

Partial Information Algorithm: Take $\eta > 0$ and $\gamma \in (0, 1]$. Let $G_a(0) = 0$, $\forall a \in \mathcal{A}$. For $t = 1, 2, \dots$,

1. Choose action $a_t$ with probability
$$P_a(t) = (1 - \gamma) \frac{\exp(\eta G_a(t-1))}{\sum_{a'} \exp(\eta G_{a'}(t-1))} + \frac{\gamma}{K}.$$
2. Let
$$G_{a_t}(t) = G_{a_t}(t-1) + \frac{r(a_t, b_t)}{P_{a_t}(t)}, \qquad G_a(t) = G_a(t-1), \ \forall a \ne a_t.$$

Note that

$$E_{P(t)}[G_a(t)] = G_a(t-1) + r(a, b_t),$$

so that in expected value the partial information algorithm performs the same updates in $G_a(t)$ as the full information algorithm. However, the probability of choosing each action differs slightly, as actions are not chosen solely based on their expected rewards, but rather every action is chosen with probability at least $\gamma/K$. This introduces extra exploration in the algorithm, which is necessary in the partial information case to ensure that all actions are tested often enough. Theorem 2 below shows that the partial information algorithm also achieves no regret asymptotically, and the total loss at time $T$ is still on the order of $O(\sqrt{T})$.
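As a rough illustration, the full information and partial information algorithms described above might be sketched in Python as follows. The function names, the `reward` callable, the seeded `random.Random` generators, and the toy environment (an opponent that always plays action 0, with reward 1 for matching it) are assumptions made for this sketch and are not part of the lecture. Rewards are assumed to lie in $[0, 1]$.

```python
import math
import random


def exp_weights(G, eta):
    """Return probabilities proportional to exp(eta * G[a])."""
    m = max(G)  # subtracting the max leaves the ratios unchanged and avoids overflow
    w = [math.exp(eta * (g - m)) for g in G]
    total = sum(w)
    return [x / total for x in w]


def full_information(reward, opponent_actions, K, eta, rng):
    """Play the full information algorithm; return the list of actions chosen."""
    G = [0.0] * K
    actions = []
    for b in opponent_actions:
        P = exp_weights(G, eta)
        actions.append(rng.choices(range(K), weights=P)[0])
        for a in range(K):  # every action's reward is observed
            G[a] += reward(a, b)
    return actions


def partial_information(reward, opponent_actions, K, eta, gamma, rng):
    """Play the partial information algorithm; return the list of actions chosen."""
    G = [0.0] * K
    actions = []
    for b in opponent_actions:
        # Mix with the uniform distribution so that each P[a] >= gamma / K.
        P = [(1 - gamma) * p + gamma / K for p in exp_weights(G, eta)]
        a = rng.choices(range(K), weights=P)[0]
        G[a] += reward(a, b) / P[a]  # only the chosen action is updated
        actions.append(a)
    return actions


def regret(reward, actions, opponent_actions, K):
    """Reward of the best fixed action in hindsight minus the reward achieved."""
    best = max(sum(reward(a, b) for b in opponent_actions) for a in range(K))
    achieved = sum(reward(a, b) for a, b in zip(actions, opponent_actions))
    return best - achieved


# Toy environment (an assumption for illustration only).
T, K = 2000, 2
reward = lambda a, b: 1.0 if a == b else 0.0
opponent = [0] * T

eta_full = math.log(1 + math.sqrt(2 * math.log(K) / T))            # Corollary 1
gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))  # Theorem 2

regret_full = regret(
    reward, full_information(reward, opponent, K, eta_full, random.Random(0)),
    opponent, K)
regret_partial = regret(
    reward, partial_information(reward, opponent, K, gamma / K, gamma, random.Random(0)),
    opponent, K)
```

The realized regrets `regret_full` and `regret_partial` depend on the random seed; the theorems bound their expectations, not every individual run.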

Theorem 2 Suppose that $\eta = \gamma/K$, and $\gamma = \min\left\{1, \sqrt{\dfrac{K \ln K}{(e - 1) T}}\right\}$. Then, $\forall\, b_1, b_2, \dots, b_T$,

$$E\left[\sum_{t=1}^{T} r(a_t, b_t)\right] \ge \max_{a} \sum_{t=1}^{T} r(a, b_t) - 2.63 \sqrt{T K \ln K}.$$

The proof of Theorem 2 is based on Theorem 1. The main difference in the results is that the reward loss grows faster with the number of actions. The previous results, as well as lower bounds on the reward loss, are summarized in the table below.

| | Full Information | Partial Information |
|---|---|---|
| Upper Bound | $O(\sqrt{T \ln K})$ | $O(\sqrt{T K \ln K})$ |
| Lower Bound | $\Omega(\sqrt{T \ln K})$ | $\Omega(\sqrt{T K})$ |

It is still an open question whether the lower bound of $\Omega(\sqrt{T K})$ can be matched in the partial information case.

Experts Algorithms. In the previous results, we compare the performance of the online learning algorithm with that of the best action. It is conceivable that, in many practical circumstances, reasonable decision-making strategies would not consist of playing a single action all the time, but would rather choose an action (possibly with randomization) based on the whole history. The previous analysis can be extended to allow for comparison of the online learning algorithm with any strategy in a fixed set of strategies. We call each such strategy an expert. Experts algorithms try to decide, based on the action $a_t^e$ suggested by each expert $e$ at time $t$, which action to choose next. We can extend the full or partial information algorithms to choose among experts in a trivial way, by keeping track of the reward $G_e(t)$ associated with each expert, rather than $G_a(t)$. It can be shown that, if there are $N$ experts, the loss in reward is at most

$$2 \sqrt{e - 1}\, \sqrt{T K \ln N}. \qquad (2)$$

Note that we could apply Theorem 2 directly, treating each expert as an action, to obtain a bound on the order of $O(\sqrt{T N \log N})$. However, it is possible to exploit the fact that there are only $K$ underlying actions to achieve the upper bound (2) instead. The bound (2) is especially attractive considering that, in many practical situations, the number of actions $K$ may be relatively small, but one may want to consider a large number of strategies/experts.
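A minimal sketch of the "trivial" extension mentioned above, in the full information setting: the weights are kept over experts rather than actions, and experts suggesting the same action pool their probability. The names `expert_action_distribution` and `update_experts`, the representation of experts as functions of the opponent's history, and the toy setup are all assumptions for illustration. This is not the refined algorithm that attains bound (2).

```python
import math


def expert_action_distribution(G_experts, advice, K, eta):
    """Turn expert weights into a distribution over the K actions.

    G_experts[e] is the cumulative reward of expert e, and advice[e] is the
    action that expert e suggests at the current stage.
    """
    m = max(G_experts)  # shift for numerical stability; ratios are unchanged
    w = [math.exp(eta * (g - m)) for g in G_experts]
    total = sum(w)
    P = [0.0] * K
    for e, a in enumerate(advice):
        P[a] += w[e] / total  # experts giving the same advice pool their weight
    return P


def update_experts(G_experts, advice, reward, b):
    """Full information update: credit each expert with the reward of its advice."""
    return [g + reward(a, b) for g, a in zip(G_experts, advice)]


# Toy setup (hypothetical): two constant experts, opponent always plays 0.
reward = lambda a, b: 1.0 if a == b else 0.0
experts = [lambda history: 0, lambda history: 1]
G = [0.0, 0.0]
history = []
for b in [0, 0, 0, 0]:
    advice = [expert(history) for expert in experts]
    P = expert_action_distribution(G, advice, K=2, eta=1.0)
    G = update_experts(G, advice, reward, b)
    history.append(b)

final_P = expert_action_distribution(G, [0, 1], K=2, eta=1.0)
```

After four stages the first expert has accumulated all the reward, so `final_P` places most of its mass on that expert's suggested action.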
The notion of regret minimization is an interesting one, and it is particularly meaningful when one must make sequential decisions in an environment that is not affected by one's own choices. This is the case, for instance, when a small investor is trading a stock in the stock market, and the volume of his transactions is not large enough to affect prices. However, in environments that may be affected by the decision maker, having zero regret may not be as meaningful, or even desirable, as illustrated by the following example.

The Prisoner's Dilemma. In the single-stage Prisoner's Dilemma (PD) game, each player can either cooperate (C) or defect (D). Defecting is better than cooperating regardless of what the opponent does, but it is better for both players if both cooperate than if both defect. Consider the repeated PD. One possible payoff matrix for the prisoner's dilemma game is given below, with rows indexed by the row player's action, columns by the column player's action, and entries listing (row payoff, column payoff):

| | D | C |
|---|---|---|
| D | (1,1) | (4,0) |
| C | (0,4) | (3,3) |

Suppose the row player consults with a set of experts, including the defecting expert, who recommends defection all the time. Let the strategy of the column player in the repeated game be fixed. In particular, the column player may be very patient and cooperative, willing to wait for the row player to become cooperative, but eventually becoming noncooperative if the row player does not seem to cooperate. Since defection is a dominant strategy in the stage game, the defecting expert achieves in each step a reward as high as any other expert against any sequence of choices of the column player, so the row player learns with the experts algorithm to defect all the time. In retrospect, this does minimize regret, since for any fixed sequence of actions by the column player, constant defection is the best response. However, constant defection is not the best response in the repeated game against many possible strategies of the column player. For instance, the row player would very much regret using the experts algorithm if he were told later that the column player had been playing a strategy such as Tit-for-Tat, which repeats at time $t$ the same action played by the row player at time $t - 1$. Against Tit-for-Tat, the defecting expert induces defection in every stage of the game, achieving average reward equal to 1. The best strategy against Tit-for-Tat is the cooperating expert, which induces cooperation in every stage and achieves average reward equal to 3.
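The comparison against Tit-for-Tat can be checked with a small simulation. This is only a sketch: the function name, the horizon `T = 1000`, and the assumption that Tit-for-Tat opens by cooperating are choices made here for illustration, while the payoffs are the row player's entries from the payoff matrix given in the text.

```python
# Row player's payoff, indexed as PAYOFF[row_action][column_action].
PAYOFF = {"D": {"D": 1, "C": 4}, "C": {"D": 0, "C": 3}}


def average_reward_vs_tit_for_tat(row_strategy, T, first_move="C"):
    """Average row-player reward over T stages against Tit-for-Tat."""
    total = 0
    column = first_move  # assumption: Tit-for-Tat starts by cooperating
    for t in range(T):
        row = row_strategy(t)
        total += PAYOFF[row][column]
        column = row  # Tit-for-Tat repeats the row player's previous action
    return total / T


always_defect = lambda t: "D"
always_cooperate = lambda t: "C"

T = 1000
defect_average = average_reward_vs_tit_for_tat(always_defect, T)
cooperate_average = average_reward_vs_tit_for_tat(always_cooperate, T)
```

Under these assumptions the defecting expert gains only in the first stage, so its average reward approaches 1 as `T` grows, while constant cooperation earns 3 in every stage.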
