Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium

Size: px

Start display at page:

Download "Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium"

Shauna Dorsey
5 years ago
Views:

1 Algorihm Online Learning, Regre Minimizaion, Minimax Opimaliy, and Correlaed Equilibrium High level Las ime we discussed noion of Nash equilibrium Saic concep: se of prob Disribuions (p,q, ) such ha nobody has any incenive o deviae Bu doesn alk abou how sysem would ge here Troubling ha even finding one can be hard in large games Wha if agens adap (learn) in ways ha are well-moivaed in erms of heir own rewards? Wha can we say abou he sysem hen? Avrim Blum High level Today: Online learning guaranees ha are achievable when acing in a changing and unpredicable environmen Wha happens when wo players in a zero-sum game boh use such sraegies? Approach minimax opimaliy Gives alernaive proof of minimax heorem Wha happens in a general-sum game? Approaches (he se of) correlaed equilibria Consider he following seing Each morning, you need o pick one of N possible roues o drive o work Bu raffic is differen each day No clear a priori which will be bes When you ge here you find ou how long your roue ook (And maybe ohers oo or maybe no) Is here a sraegy for picking roues so ha in he long run, whaever he sequence of raffic paerns has been, you ve done nearly as well as he bes fixed roue in hindsigh? (In expecaion, over inernal randomness in he algorihm) Yes Robos R Us 32 min No-regre algorihms for repeaed decisions A bi more generally: Algorihm has N opions World chooses cos vecor Can view as marix like his (maybe infinie # cols) No-regre algorihms for repeaed decisions A each ime sep, algorihm picks row, life picks column Alg pays cos for acion chosen Alg ges column as feedback (or jus is own cos in he bandi model) Need o assume some bound on max cos Le s say all coss beween 0 and 1 A each ime sep, algorihm picks row, life picks column Define Alg average pays cos regre for acion in T ime chosen seps as: (avg per-day cos of alg) (avg per-day cos of bes Alg ges column as feedback (or jus is own cos in fixed row in hindsigh) he bandi model) We wan his o go o 0 or beer as T ges large Need o assume some bound on max cos Le s say all coss [called beween a no-regre 0 and 1 algorihm] 1

2 Algorihm Algorihm Some inuiion & properies of no-regre algs Le s look a a small example: des Noe: No rying o compee wih bes adapive sraegy jus bes fixed row (pah) in hindsigh No-regre algorihms can do much beer han playing minimax opimal, and never much worse Exisence of no-regre algs yields immediae proof of minimax hm (will see in a bi) Hisory and developmen (abridged) [Hannan 57, Blackwell 56]: Alg wih avg regre O((N/T) 1/2 ) Re-phrasing, need only T = O(N/ 2 ) seps o ge imeaverage regre down o (will call his quaniy T ) Opimal dependence on T (or ) Game-heoriss viewed #rows N as consan, no so imporan as T, so prey much done Why opimal in T? Say world flips fair coin each day Any alg, in T days, has expeced cos T/2 Bu E[min(# heads,#ails)] = T/2 Θ T So, expeced avg per-day gap is Θ 1/ T des Hisory and developmen (abridged) [Hannan 57, Blackwell 56]: Alg wih avg regre O((N/T) 1/2 ) Re-phrasing, need only T = O(N/ 2 ) seps o ge imeaverage regre down o (will call his quaniy T ) Opimal dependence on T (or ) Game-heoriss viewed #rows N as consan, no so imporan as T, so prey much done Learning-heory 80s-90s: combining exper advice Imagine large class C of N predicion rules Perform (nearly) as well as bes f2c [LilesoneWarmuh 89]: Weighed-majoriy algorihm E[cos] OPT(1+ ) + (log N)/ Regre O((log N)/T) 1/2 T = O((log N)/ 2 ) Opimal as fn of N oo, plus los of work on exac consans, 2 nd order erms, ec [CFHHSW93] Exensions o bandi model (adds exra facor of N) To hink abou his, le s look a he problem of combining exper advice Using exper advice Say we wan o predic he sock marke We solici n expers for heir advice (Will he marke go up or down?) We hen wan o use heir advice somehow o make our predicion Eg, Basic quesion: Is here a sraegy ha allows us o do nearly as well as bes of hese in hindsigh? [ exper = someone wih an opinion No necessarily someone who knows anyhing] Simpler quesion We have n expers One of hese is perfec (never makes a misake) We jus don know which one Can we find a sraegy ha makes no more han lg(n) misakes? Answer: sure Jus ake majoriy voe over all expers ha have been correc so far Each misake cus # available by facor of 2 Noe: his means ok for n o be very large halving algorihm 2

3 Wha if no exper is perfec? One idea: jus run above proocol unil all expers are crossed off, hen repea Makes a mos log(n) misakes per misake of he bes exper (plus iniial log(n)) Seems waseful Consanly forgeing wha we've learned Can we do beer? Weighed Majoriy Algorihm Inuiion: Making a misake doesn' compleely disqualify an exper So, insead of crossing off, jus lower is weigh Weighed Majoriy Alg: Sar wih all expers having weigh 1 Predic based on weighed majoriy voe Penalize misakes by cuing weigh in half Analysis: do nearly as well as bes exper in hindsigh M = # misakes we've made so far m = # misakes bes exper has made so far W = oal weigh (sars a n) Afer each misake, W drops by a leas 25% So, afer M misakes, W is a mos n(3/4) M Weigh of bes exper is (1/2) m So, So, if m is small, hen M is prey small oo consan raio Randomized Weighed Majoriy 24(m + lg n) no so good if he bes exper makes a misake 20% of he ime Can we do beer? Yes Insead of aking majoriy voe, use weighs as probabiliies (eg, if 70% on up, 30% on down, hen pick 70:30) Idea: smooh ou he wors case Also, generalize ½ o 1- M = expeced #misakes unlike mos wors-case bounds, numbers are prey good Analysis Say a ime we have fracion F of weigh on expers ha made misake So, we have probabiliy F of making a misake, and we remove an F fracion of he oal weigh W final = n(1- F 1 )(1 - F 2 ) ln(w final ) = ln(n) + [ln(1 - F )] ln(n) - F (using ln(1-x) < -x) = ln(n) - M ( F = E[# misakes]) If bes exper makes m misakes, hen ln(w final ) > ln((1- ) m ) Now solve: ln(n) - M > m ln(1- ) Summarizing E[# misakes] (1+ )m + -1 log(n) If se =(log(n)/m) 1/2 o balance he wo erms ou (or use guess-and-double), ge bound of E[misakes] m + 2(m log n) 1/2 Since m T, his is a mos m + 2(Tlog n) 1/2 So, avg regre = 2(Tlog n) 1/2 /T! 0 3

4 Wha can we use his for? Can use o combine muliple algorihms o do nearly as well as bes in hindsigh Bu wha abou cases like choosing pahs o work, where expers are differen acions, no differen predicions? Game-heoreic version Wha if expers are acions? (pahs in a nework, rows in a marix game, ) A each ime, each has a loss (cos) in {0,1} Can sill run he algorihm Raher han viewing as pick a predicion wih prob proporional o is weigh, View as pick an exper wih probabiliy proporional o is weigh Choose exper i wih probabiliy p i = w i / i w i Same analysis applies Game-heoreic version Wha if expers are acions? (pahs in a nework, rows in a marix game, ) Wha if losses (coss) in [0,1]? If exper i has cos c i, do: w i Ã w i (1-c i ) Our expeced cos = i c i w i /W Amoun of weigh removed = i w i c i So, fracion removed = (our cos) Res of proof coninues as before So, now we can drive o work! (assuming full feedback) (1- c 12 )(1- c 11 ) 1 (1- c 22 )(1- c 21 ) 1 (1- c 32 )(1- c 31 ) (1- c n2 )(1- c n1 ) 1 Illusraion c 1 c 2 Guaranee: E[cos] OPT + 2(OPT log n) 1/2 scaling so coss in [0,1] Since OPT T, his is a mos OPT + 2(Tlog n) 1/2 So, regre/ime sep 2(Tlog n) 1/2 /T! 0 Connecions o Minimax Opimaliy Minimax-opimal sraegies Can solve for minimax-opimal sraegies using Linear programming Claim: no-regre sraegies will do nearly as well or beer agains any sequence of opponen plays Do nearly as well as bes fixed choice in hindsigh Implies do nearly as well as bes disrib in hindsigh Implies do nearly as well as minimax opimal! Lef Righ Lef Righ (½,-½) (1,-1) (1,-1) (0,0) 4

5 Proof of minimax hm using RWM Suppose for conradicion i was false This means some game G has V C > V R : If Column player commis firs, here exiss a row ha ges he Row player a leas V C Bu if Row player has o commi firs, he Column player can make him ge only V R Scale marix so payoffs o row are in [-1,0] Say V R = V C - V C V R Proof cond Now, consider playing randomized weighedmajoriy alg as Row, agains Col who plays opimally agains Row s disrib In T seps, Alg ges [bes row in hindsigh] 2(Tlog n) 1/2 BRiH T V C [Bes agains opponen s empirical disribuion] Alg T V R [Each ime, opponen knows your randomized sraegy] Gap is T Conradics assumpion once T > 2(Tlog n) 1/2, or T > 4log(n)/ 2 Proof cond Now, consider playing randomized weighedmajoriy alg as Row, agains Col who plays opimally agains Row s disrib Wha if wo RWMs play each oher? Can anyone see he argumen ha heir ime-average sraegies mus be approaching minimax opimaliy? Noe ha our procedure gives a fas way o compue apx minimax-opimal sraegies, if we can simulae Col (bes-response) quickly Wha if all players minimize regre? Inernal/Swap Regre and Correlaed Equilibria In zero-sum games, empirical frequencies quickly approaches minimax opimal In general-sum games, does behavior quickly (or a all) approach a Nash equilibrium? Afer all, a Nash Eq is exacly a se of disribuions ha are no-regre wr each oher So if he disribuions sabilize, hey mus converge o a Nash equil Well, unforunaely, no 5

6 Wha can we say? If algorihms minimize inernal or swap regre, hen empirical disribuion of play approaches correlaed equilibrium Foser & Vohra, Har & Mas-Colell, Though doesn imply play is sabilizing Wha are inernal/swap regre and correlaed equilibria? More general forms of regre 1 bes exper or exernal regre: Given n sraegies Compee wih bes of hem in hindsigh 2 sleeping exper or regre wih ime-inervals : Given n sraegies, k properies Le S i be se of days saisfying propery i (migh overlap) Wan o simulaneously achieve low regre over each S i 3 inernal or swap regre: like (2), excep ha S i = se of days in which we chose sraegy i Inernal/swap-regre Eg, each day we pick one sock o buy shares in Don wan o have regre of he form every ime I bough IBM, I should have bough Microsof insead Formally, swap regre is wr opimal funcion f:{1,,n}!{1,,n} such ha every ime you played acion j, i plays f(j) Weird why care? Correlaed equilibrium Disribuion over enries in marix, such ha if a rused pary chooses one a random and ells you your par, you have no incenive o deviae Eg, Shapley game R P S R P S -1,-1-1,1 1,-1 1,-1-1,-1-1,1-1,1 1,-1-1,-1-1,-1 In general-sum games, if all players have low swapregre, hen empirical disribuion of play is apx correlaed equilibrium Connecion If all paries run a low swap regre algorihm, hen empirical disribuion of play is an apx correlaed equilibrium Correlaor chooses random ime 2 {1,2,,T} Tells each player o play he acion j hey played in ime (bu does no reveal value of ) Expeced incenive o deviae: j Pr(j)(Regre j) = swap-regre of algorihm So, his suggess correlaed equilibria may be naural hings o see in muli-agen sysems where individuals are opimizing for hemselves Correlaed vs Coarse-correlaed Eq In boh cases: a disribuion over enries in he marix Think of a hird pary choosing from his disr and elling you your par as advice Correlaed equilibrium You have no incenive o deviae, even afer seeing wha he advice is Coarse-Correlaed equilibrium If only choice is o see and follow, or no o see a all, would prefer he former Low exernal-regre ) apx coarse correlaed equilib 6

7 Inernal/swap-regre, cond Algorihms for achieving low regre of his form: Foser & Vohra, Har & Mas-Colell, Fudenberg & Levine Will presen mehod of [BM05] showing how o conver any bes exper algorihm ino one achieving low swap regre Unforunaely, #seps o achieve low swap regre is O(n log n) raher han O(log n) Can conver any bes exper algorihm A ino one achieving low swap regre Idea: Insaniae one copy A j responsible for expeced regre over imes we play j Play p = pq Q Alg Cos vecor c q 2 Allows us o view p j as prob we play acion j, or as prob we play alg A j Give A j feedback of p j c A j guaranees (p j c ) q j min i p j c i + [regre erm] Wrie as: p j (q j c ) min i p j c i + [regre erm] p 2 c A 1 A 2 A n Can conver any bes exper algorihm A ino one achieving low swap regre Idea: Insaniae one copy A j responsible for expeced regre over imes we play j Can conver any bes exper algorihm A ino one achieving low swap regre Idea: Insaniae one copy A j responsible for expeced regre over imes we play j Q Play p = pq Alg Cos vecor c q 2 p 2 c A 1 A 2 Q Play p = pq Alg Cos vecor c q 2 p 2 c A 1 A 2 Sum over j, ge: A n Sum over j, ge: A n p Q c j min i p j c i + n[regre erm] p Q c j min i p j c i + n[regre erm] Our oal cos For each j, can move our prob o is own i=f(j) Our oal cos For each j, can move our prob o is own i=f(j) Wrie as: p j (q j c ) min i p j c i + [regre erm] Ge swap-regre a mos n imes orig exernal regre 7

Topics in Machine Learning Theory

Topics in Machine Learning Theory The Adversarial Muli-armed Bandi Problem, Inernal Regre, and Correlaed Equilibria Avrim Blum 10/8/14 Plan for oday Online game playing / combining exper advice bu: Wha