Advanced Course in Machine Learning                                Spring 2010

Games Against Nature

Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz

In the previous lectures we talked about experts in different setups and analyzed the regret of the algorithm by comparing its performance to the performance of the best fixed expert (and later the best shifting expert). In this lecture we consider the game theory connection and present games against Nature. Along the way, we present one of the most common tools to analyze prediction problems: approachability theory. The setup in today's lecture is that of full information. The next lecture will be devoted to the partial information setup. We start from a more general model for the game and then show how to apply it to different online learning setups.

1 The Model

The model is comprised of a single player playing against Nature. The game is repeated in time, and at stage $t$ the decision maker has to choose an action $a_t \in A$ and Nature chooses (simultaneously) an action $b_t \in B$. As a result the decision maker obtains a reward $r_t \sim R(a_t, b_t)$ (that is, the reward can be stochastic; we will only need finite second moments). The game continues ad infinitum. We let the average reward be denoted by $\hat{r}_t = \frac{1}{t}\sum_{\tau=1}^{t} r_\tau$.

Note: There is no reward for Nature, therefore this is not a game in the standard sense of the word (or, one can say this is a zero-sum game).

The decision maker keeps track of the rewards and of Nature's actions. We consider the empirical frequency of Nature's actions: $q_t(b) = \frac{1}{t}\sum_{\tau=1}^{t} 1\{b_\tau = b\}$, and note that $q_t \in \Delta(B)$, the set of distributions over $B$.

1.1 The stationary case

If Nature is stationary (i.e., the actions are generated from an IID source $q$) then $q_t \to q$ a.s. (In fact, we have exponentially fast convergence: $\Pr(\|q_t - q\| > \epsilon) \le C \exp(-C' t \epsilon^2)$.) In that case, one can hope to obtain a reward as high as the best-response reward:
$$r^*(q) = \max_{p \in \Delta(A)} \sum_{a,b} q(b)\,p(a)\,r(a,b) = \max_{a} \sum_{b} q(b)\,r(a,b).$$
By obtaining we mean: $\hat{r}_t - r^*(q) \to 0$ a.s.

Here is a simple fictitious play algorithm that obtains that:
1. Observe $b_t$ and form an estimate: $q_t(b) = \frac{1}{t}\sum_{\tau=1}^{t} 1\{b_\tau = b\}$.
2. Play $a_{t+1} \in \arg\max_a r(a, q_t)$.

This algorithm is also based on the celebrated certainty-equivalence scheme.

Theorem 1 The Fictitious Play algorithm satisfies $\hat{r}_t - r^*(q) \to 0$ a.s.

But what happens if Nature is not stationary?

1.2 Arbitrary source

Suppose now that the sequence $b_1, b_2, \ldots$ is generated by an arbitrary process. Arbitrary here means not necessarily stochastic. Clearly, we cannot assume that $q_t$ converges. Our objective of having the average reward converge to $r^*(q)$ is not well defined anymore since the limit of $q_t$ may not exist. We can define the average regret as:
$$R_t = r^*(q_t) - \hat{r}_t.$$
This is a random variable; the randomness is determined by the randomness in the algorithm. The basic question is therefore: Can we find an algorithm such that $\limsup_t R_t \le 0$ a.s.? If such an algorithm exists we call it 0-regret (we will later call such an algorithm 0 external regret, but this is sufficient for now). This is, of course, the same notion from the previous two lectures, except that we consider the average regret as opposed to the cumulative regret.

Nature models.
1. Oblivious. Nature writes down the sequence $b_1, b_2, \ldots$ at time 0 (not disclosing them).
2. Non-oblivious. Nature is adversarial and it tries to maximize the regret. Nature may even be aware of any randomization the decision maker does (but not the value of private coin tosses).

Observations:
1. A non-oblivious opponent is a very strong model: it encompasses a worst-case view on disturbances in many systems and it generalizes play against an adversary.
2. Fictitious play would fail, since randomization is needed. Fictitious play is called here follow the leader (FL).
3. If the leader does not change (asymptotically), FL does have 0 regret. More interestingly, as long as there are not many switches, FL works. More precisely, we say that FL switches from action $a$ to $a'$ at time $t$ if $a_{t-1} = a$ and $a_t = a'$. We let the number of switches up to time $t$ be $N_t$. We say that FL exhibits infrequent switches along a history if for every $\epsilon > 0$ there exists $T$ such that $N_t/t < \epsilon$ for all $t \ge T$.
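As an illustration, here is a minimal simulation of fictitious play / follow the leader against a stationary IID source, counting the switches discussed above. The reward matrix and the source distribution are made-up examples, not taken from the lecture:

```python
import random

def follow_the_leader(reward, q_true, T, seed=0):
    """Play FL (fictitious play) for T rounds against an IID source.

    reward[a][b] is a deterministic reward matrix; q_true is Nature's
    stationary distribution over B. Returns the average reward and the
    number of action switches N_T.
    """
    rng = random.Random(seed)
    A, B = len(reward), len(reward[0])
    counts = [0] * B          # empirical counts of Nature's actions
    total, switches, prev_a = 0.0, 0, None
    for t in range(1, T + 1):
        # best response to the empirical frequencies q_t (ties -> lowest index)
        a = max(range(A),
                key=lambda a: sum(counts[b] * reward[a][b] for b in range(B)))
        b = rng.choices(range(B), weights=q_true)[0]
        counts[b] += 1
        total += reward[a][b]
        if prev_a is not None and a != prev_a:
            switches += 1
        prev_a = a
    return total / T, switches

# Example: best response to q = (0.8, 0.2) is action 0, with
# best-response reward r*(q) = 0.8 * 1.0 + 0.2 * 0.0 = 0.8.
reward = [[1.0, 0.0], [0.0, 0.5]]
avg, switches = follow_the_leader(reward, [0.8, 0.2], T=20000)
```

In the stationary case the leader stabilizes quickly, so the switch count stays small and the average reward approaches $r^*(q)$, in line with Theorem 1.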
Theorem 2 If FL exhibits infrequent switches along a history, then $\limsup_t R_t \le 0$ along that history.

Proof: Home exercise. (Note that we do not use almost-sure quantifiers, since clearly FL is not optimal for every history.)
1.3 A generalized notion of regret

In general, regret can be defined as the difference between the obtained (cumulative) reward and the reward that would have been obtained by the best strategy in a reference set. That is:
$$R_t = \sup_{\text{strategy } \sigma} r_t(\sigma, \text{history}) - \hat{r}_t,$$
where $r_t(\sigma, \text{history})$ is an estimate of the average reward if playing $\sigma$. This is not always well defined or achievable. In the example above, the set of strategies is simply the set of stationary strategies. One can easily think of other sets of strategies, such as the set of strategies that depend on the last observation from Nature. In that case, the set of strategies is identified with $p = p(a \mid b_{t-1}) \in \Delta(A)^B$ and the reward as a function of history is defined as:
$$r_t(\sigma, \text{history}) = \frac{1}{t} \sum_{\tau=1}^{t} \sum_{a} p(a \mid b_{\tau-1})\, r(a, b_\tau),$$
where $b_0$ is defined as one of the members of $B$. We observe that this comparison class is richer than the comparison class we considered above, which can be identified with $p(a) \in \Delta(A)$. We will show later that there is an asymptotic 0-regret strategy against this particular comparison class.

2 Blackwell's Approachability

We now introduce a useful tool in the analysis of repeated games against Nature called Blackwell's approachability theory. Let us define a vector-valued two-player game. We call the players P1 and P2 to distinguish them from the decision makers above. We consider a two-player vector-valued repeated game where both P1 and P2 choose actions as before from finite sets $A$ and $B$. The reward is now a $k$-dimensional vector, $m(a,b) \in \mathbb{R}^k$. As before, the stage-game reward is $m_t \sim m(a_t, b_t)$ (the reward can be a random vector). The average reward is $\hat{m}_t = \frac{1}{t}\sum_{\tau=1}^{t} m_\tau$. P1's task is to approach a target set $T$, namely to ensure convergence of the average reward vector to this set irrespectively of P2's actions. Formally, let $T \subseteq \mathbb{R}^k$ denote the target set. In the following, $d$ is the Euclidean distance in $\mathbb{R}^k$. The set-to-point distance between a point $x$ and a set $T$ is $d(x, T) = \inf_{y \in T} d(x, y)$. (We let $P_{\pi,\sigma}$ denote the probability measure when P1 plays the policy $\pi$ and P2 plays policy $\sigma$.)
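To make the richer comparison class concrete, here is a sketch that computes the best reward in hindsight over strategies that condition on Nature's previous action. It exploits the fact, noted later in the lecture, that the maximum over $p(a \mid b)$ is attained at a pure map $B \to A$, so brute-force enumeration suffices for small $|A|, |B|$; the reward matrix and history are made-up examples:

```python
from itertools import product

def best_markov_strategy_reward(history, reward, num_actions, b0=0):
    """Average reward of the best strategy in hindsight among strategies
    that map Nature's previous action to an action.

    history holds Nature's actions b_1..b_T; reward[a][b] is the reward
    matrix; b0 stands in for the fictitious action b_0. Since the max
    over p(a|b) is attained at a pure map, we enumerate all |A|^|B| maps.
    """
    B = sorted(set(history) | {b0})
    pairs = list(zip([b0] + list(history[:-1]), history))  # (b_{tau-1}, b_tau)
    best = float("-inf")
    for assignment in product(range(num_actions), repeat=len(B)):
        amap = dict(zip(B, assignment))
        avg = sum(reward[amap[prev]][cur] for prev, cur in pairs) / len(history)
        best = max(best, avg)
    return best

# On an alternating sequence, a strategy conditioning on the previous
# action predicts (almost) perfectly, while any stationary action
# earns only about 0.5.
rew = [[1.0, 0.0], [0.0, 1.0]]   # action a earns 1 iff it matches b
hist = [0, 1] * 50
value = best_markov_strategy_reward(hist, rew, num_actions=2)
```

Here the best map plays action 1 after seeing 0 and action 0 after seeing 1, missing only the first step, which illustrates why this comparison class is strictly richer than the stationary one.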
Definition 1 A policy $\pi \in \Pi$ of P1 approaches a set $T \subseteq \mathbb{R}^k$ if
$$\lim_{n \to \infty} d(\hat{m}_n, T) = 0 \quad P_{\pi,\sigma}\text{-a.s., for every } \sigma \in \Sigma.$$
A policy $\sigma \in \Sigma$ of P2 excludes a set $T$ if for some $\delta > 0$,
$$\liminf_{n \to \infty} d(\hat{m}_n, T) > \delta \quad P_{\pi,\sigma}\text{-a.s., for every } \pi \in \Pi.$$
The policy $\pi$ ($\sigma$) will be called an approaching (excluding) policy for P1 (P2). A set is approachable if there exists an approaching policy. Noting that approaching a set and its topological closure are the same, we shall henceforth suppose that the set $T$ is closed. The notion of approachability and excludability assumes uniformity with respect to time (and the strategy of P2 for approachability, or of P1 for excludability).
2.1 The projected game

Let $u$ be a unit vector in the reward space $\mathbb{R}^k$. We often consider the projected game in direction $u$ as the zero-sum game with the same dynamics as above, and scalar rewards $r_n = m_n \cdot u$. Here $\cdot$ stands for the standard inner product in $\mathbb{R}^k$. Denote this game by $\Gamma(u)$.

2.2 The Basic Approachability Results

For any $x \notin T$, denote by $C_x$ a closest point in $T$ to $x$, and let $u_x$ be the unit vector in the direction of $C_x - x$, which points from $x$ to the goal set $T$. The following theorem requires, geometrically, that there exists a (mixed) action $p(x)$ such that the set of all possible (vector-valued) expected rewards is on the other side of the hyperplane supported by $C_x$ in direction $u_x$.

Theorem 3 Assume that for every point $x \notin T$ there exists a strategy $p(x)$ such that:
$$(m(p(x), q) - C_x) \cdot u_x \ge 0, \qquad \forall q \in \Delta(B). \tag{1}$$
Then $T$ is approachable by P1. An approaching policy is given as follows: if $\hat{m}_n \notin T$, play $p(\hat{m}_n)$; otherwise, play arbitrarily.

Proof Let $y_n = C_{\hat{m}_n}$ and denote by $F_n$ the filtration generated by the history up to time $n$. We further let $d_n = \|\hat{m}_n - y_n\|$. We want to prove that $d_n \to 0$ a.s. We have that:
$$E(d_{n+1}^2 \mid F_n) = E(\|\hat{m}_{n+1} - y_{n+1}\|^2 \mid F_n) \le E(\|\hat{m}_{n+1} - y_n\|^2 \mid F_n) = E(\|\hat{m}_{n+1} - \hat{m}_n + \hat{m}_n - y_n\|^2 \mid F_n)$$
$$= \|\hat{m}_n - y_n\|^2 + E(\|\hat{m}_{n+1} - \hat{m}_n\|^2 \mid F_n) + 2 E\big((\hat{m}_n - y_n) \cdot (\hat{m}_{n+1} - \hat{m}_n) \mid F_n\big).$$
Now, since $\hat{m}_{n+1} - \hat{m}_n = m_{n+1}/(n+1) - \hat{m}_n/(n+1)$ we have that:
$$E(d_{n+1}^2 \mid F_n) \le d_n^2 + \frac{C}{n^2} + 2 E\big((\hat{m}_n - y_n) \cdot (\hat{m}_{n+1} - \hat{m}_n) \mid F_n\big).$$
Expanding the last term we obtain:
$$(\hat{m}_n - y_n) \cdot (\hat{m}_{n+1} - \hat{m}_n) = (\hat{m}_n - y_n) \cdot \big(m_{n+1}/(n+1) - \hat{m}_n/(n+1)\big)$$
$$= (\hat{m}_n - y_n) \cdot \big(y_n/(n+1) - \hat{m}_n/(n+1) + m_{n+1}/(n+1) - y_n/(n+1)\big)$$
$$= -\frac{d_n^2}{n+1} + \frac{1}{n+1} (\hat{m}_n - y_n) \cdot (m_{n+1} - y_n).$$
Now, the conditional expectation of the last term is nonpositive by condition (1) (note that $\hat{m}_n - y_n = -d_n u_{\hat{m}_n}$), so we obtain:
$$E(d_{n+1}^2 \mid F_n) \le \Big(1 - \frac{2}{n+1}\Big) d_n^2 + \frac{c}{n^2}.$$
It follows by Lemma 1 that $d_n \to 0$ almost surely.

Remarks:
1. Convergence rates. The convergence rate of the above policy is $O(1/\sqrt{t})$ and is independent of the dimension. The only dependence kicks in through the magnitude of the randomness (the second moment, to be exact).
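For a concrete feel for the geometry in Theorem 3, here is a small sketch computing $C_x$, $u_x$ and $d(x, T)$ when $T$ is the nonpositive orthant — our own choice of convex target set; the coordinatewise projection formula is specific to it:

```python
import math

def closest_point_nonpositive_orthant(x):
    """Return (C_x, u_x, d(x, T)) for T = {y : y_i <= 0 for all i}.

    For this particular convex T the Euclidean projection is
    coordinatewise: C_x = min(x, 0). u_x is the unit vector pointing
    from x toward the target set, as in Theorem 3.
    """
    c = [min(xi, 0.0) for xi in x]
    dist = math.dist(x, c)
    if dist == 0.0:
        return c, None, 0.0       # x is already in T; u_x is undefined
    u = [(ci - xi) / dist for ci, xi in zip(c, x)]
    return c, u, dist

c, u, d = closest_point_nonpositive_orthant([3.0, -1.0, 4.0])
```

Only the positive coordinates contribute to the distance, and $u_x$ has mass exactly on those coordinates — this is the direction in which the approaching policy must push the expected reward.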
2. Complexity. There are two distinct elements to computing an approaching strategy as in Theorem 3. The first is finding the closest point $C_x$ and the second is solving the projected game. Solving the projected zero-sum game can easily be done using linear programming (or other methods) with polynomial dependence on the number of actions of both players. Finding $C_x$, however, can in general be a very hard problem, as finding the closest point in a non-convex set is NP-hard. There are, however, some easy instances, such as the case where $T$ is convex and described in some compact form. In fact, it is enough to assume that a convex $T$ has a separation oracle (i.e., we can query in polytime if a point belongs to $T$ or not).

3. Is a set approachable? In general, it is NP-hard even to determine if a point is approachable, where hardness here is measured in the dimension (if the dimension is fixed it is not hard to decide if a point is approachable).

4. The game theory connection. The above result generalizes the celebrated min-max theorem. To observe that, take a one-dimensional problem. In that case the approachable set is the segment $[v, \infty)$, where $v$ is the value of the game.

For convex target sets, the condition of the last theorem turns out to be both sufficient and necessary. Moreover, this condition may be expressed in a simpler form, which may be considered as a generalization of the minimax theorem for scalar games. Given a stationary policy $q \in \Delta(B)$ for P2, let $\Phi(A, q) \triangleq \mathrm{co}(\{m(p, q)\}_{p \in \Delta(A)})$, where co is the convex hull operator. The Euclidean unit sphere in $\mathbb{R}^k$ is denoted by $\mathbb{B}_k$. The following theorem characterizes convex approachable sets in an elegant way.

Theorem 4 Let $T$ be a closed convex set in $\mathbb{R}^k$.
(i) $T$ is approachable if and only if $\Phi(A, q) \cap T \neq \emptyset$ for every stationary policy $q \in \Delta(B)$.
(ii) If $T$ is not approachable then it is excludable by P2. In fact, any stationary policy $q$ that violates (i) is an excluding policy.
(iii) $T$ is approachable if and only if $\mathrm{val}\,\Gamma(u) \ge \inf_{m \in T} u \cdot m$ for every $u \in \mathbb{B}_k$, where val is the value of the (scalar) zero-sum game.
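As remark 2 notes, the projected game $\Gamma(u)$ is an ordinary scalar zero-sum game that can be solved by linear programming. A minimal sketch of the standard LP formulation follows (using `scipy.optimize.linprog`; the payoff matrices are made-up examples):

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_value(M):
    """Value of the zero-sum game with payoff matrix M (row player maximizes).

    Decision variables are (p_1, ..., p_m, v). We maximize v subject to
    p^T M[:, j] >= v for every column j, with p a probability vector.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                # minimize -v, i.e. maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])   # v - p^T M[:, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                           # sum of p_i equals 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1]

value = zero_sum_value([[1, -1], [-1, 1]])      # matching pennies: value 0
```

The LP size is linear in the number of actions of both players, matching the polynomial-complexity claim in the remark; condition (iii) of Theorem 4 can then be checked for any fixed direction $u$ by comparing this value with $\inf_{m \in T} u \cdot m$.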
Condition (i) in Theorem 4 is sometimes very easy to check, as we see below.

3 Back to regret

We are now ready to use approachability for proving we can minimize the regret. Consider the following vector-valued game. When the decision maker plays $a$, Nature plays $b$, and a reward $r$ is obtained, the vector-valued reward is $m = (r, e_b)$, where $e_b$ is a vector of zeros except for the $b$-th entry, which is one. It holds that $\hat{m}_t = (\hat{r}_t, q_t)$. Now, define the following target set $T \subseteq \mathbb{R} \times \Delta(B)$:
$$T = \{(r, q) : r \ge r^*(q), \ q \in \Delta(B)\}.$$
We claim that $T$ is convex. Indeed, $r^*(q)$ is convex as a maximum of linear functions, and the set $T$ is convex as the epigraph of a convex function. We now claim that $T$ is approachable. By Theorem 4, a necessary and sufficient condition is that $\Phi(A, q) \cap T \neq \emptyset$ for every $q$. Fix some $q$ and let $p \in \Delta(A)$ be a member of the argmax of $r$, that is: $p \in \arg\max_{p'} r(p', q)$. But then this is easy to show, since $m(p, q) = (r^*(q), q)$, so $m(p, q) \in \Phi(A, q)$ and $m(p, q) \in T$. This means that by using approachability we have that $d(\hat{m}_t, T) \to 0$. What is left is to argue that approaching $T$ implies that $\liminf_t (\hat{r}_t - r^*(q_t)) \ge 0$. This holds since $r^*$ is a uniformly continuous function (it is convex, continuous, and on a compact domain). We have thus proved:
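The following sketch illustrates the approachability route to no-regret play numerically. Rather than implementing the exact construction above, it uses regret matching — the same Blackwell argument applied to steering the per-action regret vector toward the nonpositive orthant; the reward matrix and Nature sequence are made-up examples:

```python
import random

def regret_matching(reward, nature, T, seed=1):
    """No-external-regret play via a Blackwell-style argument.

    The per-action regret vector R(a) = avg_t [r(a, b_t) - r(a_t, b_t)]
    is steered toward the nonpositive orthant by playing each action
    with probability proportional to the positive part of its regret.
    reward[a][b] is the reward matrix; nature(t) returns b_t.
    """
    rng = random.Random(seed)
    A = len(reward)
    regret = [0.0] * A            # cumulative per-action regrets
    total = 0.0
    for t in range(T):
        pos = [max(r, 0.0) for r in regret]
        s = sum(pos)
        a = rng.choices(range(A), weights=pos)[0] if s > 0 else rng.randrange(A)
        b = nature(t)
        total += reward[a][b]
        for a2 in range(A):
            regret[a2] += reward[a2][b] - reward[a][b]
    return total / T, max(regret) / T

rew = [[1.0, 0.0], [0.2, 0.6]]
# Nature plays b = 1 every third round, b = 0 otherwise, so the empirical
# frequencies are roughly (2/3, 1/3) and r*(q_T) is about 2/3.
avg, reg = regret_matching(rew, lambda t: 0 if t % 3 else 1, T=30000)
```

Since $\max_a \frac{1}{T}\sum_t r(a, b_t) = r^*(q_T)$ for a deterministic sequence, the maximal average regret shrinking to zero means the average reward catches up with $r^*(q_T)$, as Theorem 5 asserts.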
Theorem 5 There exists a strategy that guarantees that $\limsup_t \big(r^*(q_t) - \hat{r}_t\big) \le 0$ a.s.

In fact, we have proved that the convergence rate is $O(1/\sqrt{T})$.

We now return to the problem where we considered generalized regret. We claim a 0-regret strategy does exist. Indeed, consider the target set of the form:
$$T = \Big\{(r, \pi) \in \mathbb{R} \times \Delta(B^2) : r \ge \max_{p \in \Delta(A)^B} \sum_{b, b' \in B} \sum_{a} \pi(b, b')\, p(a \mid b)\, r(a, b')\Big\},$$
where we identify $p$ with a conditional probability of choosing an action given the past observation (note that it suffices to choose a pure action). It is easy to see that $T$ is convex, as an epigraph of a convex function. Now, we need to define the game: when P1 chooses $a$, P2 chooses $b'$, and the previous action chosen by P2 was $b$, the reward is a vector whose first coordinate is $r(a, b')$ and whose remaining coordinates are zero, except for a one at the $b \cdot |B| + b'$ coordinate. It remains an easy exercise to show that the set $T$ is approachable. (We note that a slight extension of approachability is needed; see "The Empirical Bayes Envelope and Regret Minimization in Competitive Markov Decision Processes", S. Mannor and N. Shimkin, MOR 28(1):327-345.)

4 Calibration

The definition of calibration and a very easy proof using approachability is provided in the attached note.

A Appendix

Lemma 1 Assume $e_t$ is a non-negative random variable, measurable with respect to the sigma algebra $F_t$ ($F_t \subseteq F_{t+1}$), and that
$$E(e_{t+1} \mid F_t) \le (1 - d_t) e_t + c\, d_t^2. \tag{2}$$
Further assume that $\sum_{t=1}^{\infty} d_t = \infty$, $d_t \ge 0$, and $d_t \to 0$. Then $e_t \to 0$ $P$-a.s.

Proof First note that by taking the expectation of Eq. (2) we get: $E e_{t+1} \le (1 - d_t) E e_t + c\, d_t^2$. According to Bertsekas and Tsitsiklis (Neuro-Dynamic Programming, page 117) it follows that $E e_t \to 0$. Since $e_t$ is non-negative, it suffices to show that $e_t$ converges. Fix $\epsilon > 0$ and let $V_t^\epsilon = \max\{\epsilon, e_t\}$. Since $d_t \to 0$ there exists $T(\epsilon)$ such that $c\, d_t < \epsilon$ for $t > T(\epsilon)$. Restrict attention to $t > T(\epsilon)$. If $e_t < \epsilon$ then
$$E(V_{t+1}^\epsilon \mid F_t) \le (1 - d_t)\epsilon + c\, d_t^2 \le \epsilon \le V_t^\epsilon.$$
If $e_t \ge \epsilon$ we have:
$$E(V_{t+1}^\epsilon \mid F_t) \le (1 - d_t) e_t + c\, d_t^2 \le (1 - d_t) e_t + d_t \epsilon \le e_t = V_t^\epsilon.$$
Thus $V_t^\epsilon$ is a super-martingale; by a standard convergence argument we get $V_t^\epsilon \to V_\infty^\epsilon$. By definition $V_t^\epsilon \ge \epsilon$ and therefore $E V_\infty^\epsilon \ge \epsilon$. Since $E[\max(X, Y)] \le E X + E Y$ for non-negative $X, Y$, it follows that $E V_\infty^\epsilon \le \lim_t E e_t + \epsilon = \epsilon$.
So that $E V_\infty^\epsilon = \epsilon$. Now we have a non-negative random variable with expectation $\epsilon$ which is at least $\epsilon$ with probability 1. It follows that $V_\infty^\epsilon = \epsilon$ a.s. To summarize, we have shown that for every $\epsilon > 0$, with probability 1:
$$\limsup_t e_t \le \limsup_t V_t^\epsilon = \lim_t V_t^\epsilon = \epsilon.$$
Since $\epsilon$ is arbitrary and $e_t$ is non-negative, it follows that $e_t \to 0$ almost surely.
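As a quick numerical sanity check of Lemma 1, the following sketch iterates the deterministic envelope of recursion (2) with the step sizes $d_t = 2/(t+1)$ that arise in the proof of Theorem 3 (the constant $c = 1$ and the starting value are arbitrary choices of ours):

```python
def iterate_bound(e0, c, T):
    """Iterate e_{t+1} = (1 - d_t) * e_t + c * d_t**2 with d_t = 2/(t+1).

    This is the deterministic envelope of the bound in Lemma 1; since
    sum(d_t) diverges and d_t -> 0, the iterates should shrink to 0.
    """
    e = e0
    for t in range(1, T + 1):
        d = 2.0 / (t + 1)
        e = (1.0 - d) * e + c * d * d
    return e

final = iterate_bound(e0=10.0, c=1.0, T=100000)
```

With these step sizes the iterates decay roughly like $1/t$, consistent with the $O(1/\sqrt{t})$ rate for $d_n = \sqrt{e_t}$-type quantities quoted for the approaching policy.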