Topics in Machine Learning Theory

Size: px

Start display at page:

Download "Topics in Machine Learning Theory"

Dominic Goodman
5 years ago
Views:

1 Topics in Machine Learning Theory The Adversarial Muli-armed Bandi Problem, Inernal Regre, and Correlaed Equilibria Avrim Blum 10/8/14 Plan for oday Online game playing / combining exper advice bu: Wha if we only ge feedback for he acion we chose? (called he muli-armed bandi seing) Wha abou sronger forms of regre-minimizaion (inernal regre)? Connecion o noion of correlaed equilibria Bu firs, a quick discussion of [0,1] vs {0,1} coss for RWM algorihm Robos R Us 32 min [0,1] coss vs {0,1} coss We analyzed Randomized Wd Majoriy for case ha all coss in {0,1} (and slighly hand-waved exension o [0,1]) Here is an alernaive simple way o exend o [0,1] Given cos vecor c, view c i as bias of coin Flip o creae vecor c 0,1 n, s E[c i ] = c i Feed c o alg A world c For any sequence of vecors c 0,1 n, we have: E A [cos (A)] min i cos (i) + [regre erm] So, E $ [E A [cos (A)]] E $ [min i cos (i)] + [regre erm] LHS is E A [cos(a)] (since E $ E A cos A = E $ c p = c p) RHS min i E $ [cos (i)] + [r] = min i [cos(i)] + [r] In oher words, coss beween 0 and 1 jus make he problem easier $ c A Cos = cos on c vecors Expers! Bandi seing In he bandi seing, only ge feedback for he acion we choose Sill wan o compee wih bes acion in hindsigh [ACFS02] give algorihm wih cumulaive regre O( (TN log N) 1/2 ) [average regre O( ((N log N)/T) 1/2 )] Will do a somewha weaker version of heir analysis (same algorihm bu no as igh a bound) Talk abou i in he conex of online pricing Online pricing Say you are selling lemonade (or a cool new sofware ool, or boles of waer a he world cup) View each possible For =1,2, T price as a differen Seller ses price p row/exper Buyer arrives wih valuaion v If v p, buyer purchases and pays p, else doesn Repea $2 Assume all valuaions h Goal: do nearly as well as bes fixed price in hindsigh If v revealed, run RWM E[gain] (1-²) - O(² -1 h log n) Muli-armed bandi problem Exponenial Weighs for Exploraion and Exploiaion (exp 3 ) [Auer,Cesa-Bianchi,Freund,Schapire] Exper i ~ q Gain g i q Exp3 q = (1- )p + unif ĝ = (0,,0, g i /q i,0,,0) Disrib p Gain vecor ĝ 1 RWM believes gain is: p ĝ = p i (g i /q i ) g RWM 2 g RWM (1-²) - O(² -1 nh/ log n) 3 Acual gain is: g i = g RWM (q i /p i ) g RWM(1- ) nh/ RWM n = #expers 4 E[ ] Because E[ĝ j ] = (1- q j )0 + q j (g j /q j ) = g j, so E[max j [ ĝ j ]] max j [ E[ ĝ j ] ] = 1

2 Muli-armed bandi problem Exponenial Weighs for Exploraion and Exploiaion (exp 3 ) [Auer,Cesa-Bianchi,Freund,Schapire] Exper i ~ q Gain g i q Exp3 q = (1- )p + unif ĝ = (0,,0, g i /q i,0,,0) Disrib p Gain vecor ĝ Conclusion ( = ²): E[Exp3] (1-²) 2 - O(² -2 nh log(n)) nh/ RWM n = #expers Balancing would give O(( nh log n) 2/3 ) in bound because of ² -2 Bu can reduce o ² -1 and O(( nh log n) 1/2 ) wih beer analysis General-sum games In general-sum games, can ge win-win and lose-lose siuaions Eg, wha side of sidewalk o walk on? : you Lef Righ Lef Righ (1,1) (-1,-1) (-1,-1) (1,1) person walking owards you Nash Equilibrium A Nash Equilibrium is a sable pair of sraegies (could be randomized) Sable means ha neiher player has incenive o deviae on heir own Eg, wha side of sidewalk o walk on : Lef Righ Uses Economiss use games and equilibria as models of ineracion Eg, polluion / prisoner s dilemma: (imagine polluion conrols cos $4 bu improve everyone s environmen by $3) don pollue pollue Lef (1,1) (-1,-1) don pollue (2,2) (-1,3) Righ (-1,-1) (1,1) pollue (3,-1) (0,0) NE are: boh lef, boh righ, or boh 50/50 Need o add exra incenives o ge good overall behavior Exisence of NE Nash (1950) proved: any general-sum game mus have a leas one such equilibrium Migh require mixed sraegies Proof is non-consrucive Finding Nash equilibria in general appears o be hard (is PPAD-hard) Wha if all players minimize regre? In zero-sum games, empirical frequencies quickly approach minimax opimaliy In general-sum games, does behavior quickly (or a all) approach a Nash equilibrium? Afer all, a Nash Eq is exacly a se of disribuions ha are no-regre wr each oher So if he disribuions sabilize, hey mus converge o a Nash equil Well, unforunaely, hey migh no sabilize 2

A bad example for general-sum games Augmened Shapley game from [Zinkevich04]: Firs 3 rows/cols are Shapley game (rock / paper / scissors bu if boh do same acion hen boh lose) 4 h acion play foosball

3 A bad example for general-sum games Augmened Shapley game from [Zinkevich04]: Firs 3 rows/cols are Shapley game (rock / paper / scissors bu if boh do same acion hen boh lose) 4 h acion play foosball has sligh negaive if oher player is sill doing r/p/s bu posiive if oher player does 4 h acion oo RWM will cycle among firs 3 and have no regre, bu do worse han only Nash Equilibrium of boh playing foosball Anoher ineresing bad example [Balcan-Consanin-Meha12]: Failure o converge even in Rank-1 games (games where R+C has rank 1) Ineresing because one can find equilibria efficienly in such games We didn really expec his o work given how hard NE can be o find Inernal/Swap Regre and Correlaed Equilibria Wha can we say? If algorihms minimize inernal or swap regre, hen empirical disribuion of play approaches correlaed equilibrium Foser & Vohra, Har & Mas-Colell, Though doesn imply play is sabilizing Wha are inernal/swap regre and correlaed equilibria? More general forms of regre 1 bes exper or exernal regre: Given n sraegies Compee wih bes of hem in hindsigh 2 sleeping exper or regre wih ime-inervals : Given n sraegies, k properies Le S i be se of days saisfying propery i (migh overlap) Wan o simulaneously achieve low regre over each S i 3 inernal or swap regre: like (2), excep ha S i = se of days in which we chose sraegy i Inernal/swap-regre Eg, each day we pick one sock o buy shares in Don wan o have regre of he form every ime I bough IBM, I should have bough Microsof insead Formally, swap regre is wr opimal funcion f:{1,,n}!{1,,n} such ha every ime you played acion j, i plays f(j) 3

4 Correlaed equilibrium Disribuion over enries in marix, such ha if a rused pary chooses one a random and ells you your par, you have no incenive o deviae Eg, Shapley game R P S R P S -1,-1-1,1 1,-1 1,-1-1,-1-1,1-1,1 1,-1-1,-1-1,-1 In general-sum games, if all players have low swapregre, hen empirical disribuion of play is apx correlaed equilibrium Connecion If all paries run a low swap regre algorihm, hen empirical disribuion of play is an apx correlaed equilibrium Correlaor chooses random ime 2 {1,2,,T} Tells each player o play he acion j hey played in ime (bu does no reveal value of ) Expeced incenive o deviae: j Pr(j)(Regre j) = swap-regre of algorihm So, his suggess correlaed equilibria may be naural hings o see in muli-agen sysems where individuals are opimizing for hemselves Correlaed vs Coarse-correlaed Eq In boh cases: a disribuion over enries in he marix Think of a hird pary choosing from his disr and elling you your par as advice Correlaed equilibrium You have no incenive o deviae, even afer seeing wha he advice is Coarse-Correlaed equilibrium If only choice is o see and follow, or no o see a all, would prefer he former Inernal/swap-regre, cond orihms for achieving low regre of his form: Foser & Vohra, Har & Mas-Colell, Fudenberg & Levine Will presen mehod of [BM05] showing how o conver any bes exper algorihm ino one achieving low swap regre Unforunaely, #seps o achieve low swap regre is O(n log n) raher han O(log n) Low exernal-regre ) apx coarse correlaed equilib Can conver any bes exper algorihm A ino one achieving low swap regre Idea: Insaniae one copy A j responsible for expeced regre over imes we play j Can conver any bes exper algorihm A ino one achieving low swap regre Idea: Insaniae one copy A j responsible for expeced regre over imes we play j Cos vecor c Cos vecor c Allows us o view p j as prob we play acion j, or as prob we play alg A j Give A j feedback of p j c A j guaranees (p j c ) q j min i p j c i + [regre erm] Wrie as: p j (q j c ) min i p j c i + [regre erm] Sum over j, ge: p c j min i p j c i + n[regre erm] Our oal cos For each j, can move our prob o is own i=f(j) Wrie as: p j (q j c ) min i p j c i + [regre erm] 4

5 Can conver any bes exper algorihm A ino one achieving low swap regre Idea: Insaniae one copy A j responsible for expeced regre over imes we play j Sum over j, ge: Cos vecor c p c j min i p j c i + n[regre erm] Our oal cos For each j, can move our prob o is own i=f(j) Ge swap-regre a mos n imes orig exernal regre 5

Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium

Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium Algorihm Online Learning, Regre Minimizaion, Minimax Opimaliy, and Correlaed Equilibrium High level Las ime we discussed noion of Nash equilibrium Saic concep: se of prob Disribuions (p,q, ) such ha nobody