Online Learning with Partial Feedback


Advanced Course in Machine Learning, Spring 2010

Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz

In previous lectures we talked about the general framework of online convex optimization and derived an algorithm for prediction with expert advice from this general framework. To apply the online algorithm, we need to know the gradient of the loss function at the end of each round. In the prediction with expert advice setting, this boils down to knowing the cost of each individual expert. In this lecture, we show that in order to apply the online mirror descent algorithm it suffices to know an estimate of the gradient. In particular, this yields a no-regret algorithm for a famous problem called the multi-armed bandit problem.

1 Online Mirror Descent with Estimated Gradient

Recall the online mirror descent algorithm we described in Lecture 4. Now suppose that instead of setting $v_t$ to be a sub-gradient of $g_t$ at $w_t$, we set $v_t$ to be a random vector with $\mathbb{E}[v_t] \in \partial g_t(w_t)$.

Algorithm 1 Online Mirror Descent with Estimated Gradient
    Initialize: $w_1 \leftarrow \nabla f^*(0)$
    for $t = 1$ to $T$
        Play $w_t \in A$
        Pick $v_t$ at random s.t. $\mathbb{E}[v_t \mid v_1, \ldots, v_{t-1}] \in \partial g_t(w_t)$
        Update $w_{t+1} \leftarrow \nabla f^*\big(-\eta \sum_{s=1}^{t} v_s\big)$
    end for

We now show that the analysis still holds as long as we have some bound on $\mathbb{E}[\|v_t\|_*^2]$.

Theorem 1 Suppose Algorithm 1 is used with a function $f$ that is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ on $A$ and has $f^*(0) = 0$. Suppose the loss functions $g_t$ are convex and that $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} \|v_t\|_*^2\big] \le V^2$. Then, the algorithm run with any positive $\eta$ enjoys the expected regret bound

$$\mathbb{E}\Big[\sum_{t=1}^{T} g_t(w_t)\Big] - \min_{u \in A} \sum_{t=1}^{T} g_t(u) \le \frac{\max_{u \in A} f(u)}{\eta} + \frac{\eta V^2 T}{2\beta}.$$

In particular, choosing $\eta = \sqrt{\frac{2\beta \max_{u \in A} f(u)}{V^2 T}}$ we obtain

$$\mathbb{E}\Big[\sum_{t=1}^{T} g_t(w_t)\Big] - \min_{u \in A} \sum_{t=1}^{T} g_t(u) \le V \sqrt{\frac{2 \max_{u \in A} f(u)\, T}{\beta}}.$$

Proof Apply Corollary 1 from Lecture 4 to the sequence $-\eta v_1, \ldots, -\eta v_T$ to get, for all $u$,

$$-\eta \sum_{t=1}^{T} \langle v_t, u \rangle - f(u) \le -\eta \sum_{t=1}^{T} \langle v_t, w_t \rangle + \frac{1}{2\beta} \sum_{t=1}^{T} \|\eta v_t\|_*^2.$$

Rearranging gives,

$$\sum_{t=1}^{T} \langle v_t, w_t - u \rangle \le \frac{f(u)}{\eta} + \frac{\eta}{2\beta} \sum_{t=1}^{T} \|v_t\|_*^2.$$

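Taking expectation of both sides with respect to the randomness in choosing $v_t$ we obtain that

$$\mathbb{E}\Big[\sum_{t=1}^{T} \langle v_t, w_t - u \rangle\Big] \le \frac{f(u)}{\eta} + \frac{\eta}{2\beta}\, T\, \mathbb{E}\Big[\frac{1}{T} \sum_{t=1}^{T} \|v_t\|_*^2\Big].$$

At each round, let $\bar v_t = \mathbb{E}[v_t \mid v_1, \ldots, v_{t-1}] \in \partial g_t(w_t)$. Since $w_t$ is determined by $v_1, \ldots, v_{t-1}$, we have $\mathbb{E}[\langle v_t, w_t - u \rangle] = \mathbb{E}[\langle \bar v_t, w_t - u \rangle]$. Using the assumptions in the theorem we get that

$$\mathbb{E}\Big[\sum_{t=1}^{T} \langle \bar v_t, w_t - u \rangle\Big] \le \frac{f(u)}{\eta} + \frac{\eta}{2\beta}\, T V^2.$$

By convexity of $g_t$, $g_t(w_t) - g_t(u) \le \langle \bar v_t, w_t - u \rangle$. Therefore,

$$\mathbb{E}\Big[\sum_{t=1}^{T} g_t(w_t)\Big] - \sum_{t=1}^{T} g_t(u) \le \frac{f(u)}{\eta} + \frac{\eta V^2 T}{2\beta}.$$

Since the above holds for all $u \in A$ the result follows.

2 The Multi-Armed Bandit Problem

In the multi-armed bandit problem, there are $d$ arms, and on each online round the learner should choose one of the arms, denoted $I_t$, where the chosen arm can be a random variable. Then, it receives a cost of choosing this arm, $c_{t,I_t} \in [0, 1]$. The vector $c_t \in [0, 1]^d$ associates a cost with each of the arms, but the learner only gets to see the cost of the arm it pulls. Nothing is assumed about the sequence of vectors $c_1, c_2, \ldots$. The performance of the learner is measured by its regret for not always pulling the best arm,

$$\mathbb{E}\Big[\sum_{t=1}^{T} c_{t,I_t}\Big] - \min_i \sum_{t=1}^{T} c_{t,i},$$

where the expectation is over the randomness of the learner. This problem nicely captures the exploration-exploitation tradeoff. On one hand, we would like to pull the arm which, based on previous rounds, we believe has the lowest cost. On the other hand, maybe it is better to explore the arms and find another arm with a smaller cost.

To approach the multi-armed bandit problem we use the general result derived in the previous section. Let the loss function be $g_t(w) = \langle w, c_t \rangle$ and note that if $w_t$ is a probability vector and $I_t \sim w_t$, then $g_t(w_t) = \mathbb{E}[c_{t,I_t}]$. The gradient of the loss is $c_t$, but we don't know the value of all elements of $c_t$. To estimate the gradient we shall define a vector $v_t$ s.t.

$$v_{t,j} = \begin{cases} c_{t,j}/w_{t,j} & \text{if } j = I_t \\ 0 & \text{else} \end{cases}$$

Clearly, $\mathbb{E}[v_t] = c_t$. Additionally,

$$\mathbb{E}[\|v_t\|_*^2] \le \sum_i w_{t,i} \frac{(c_{t,i})^2}{w_{t,i}^2} \le \sum_i \frac{1}{w_{t,i}}.$$

To ensure that this quantity is not excessively large we will define the set of allowed distributions to be $A = \{w : w_i \in [\gamma, 1],\ \sum_i w_i = 1\}$, where $\gamma$ is a parameter to be defined later. Thus, $\mathbb{E}[\|v_t\|_*^2] \le d/\gamma \le 1/\gamma^2$, where the last step uses $d\gamma \le 1$ (the $d$ entries of $w$ are each at least $\gamma$ and sum to one). Applying Theorem 1 with $V = 1/\gamma$ (and taking $f$ to be the entropic regularizer of the experts setting, for which $\max_{u} f(u) \le \log(d)$ and $\beta = 1$) we obtain that for all $u \in A$

$$\mathbb{E}\Big[\sum_{t=1}^{T} g_t(w_t)\Big] - \sum_{t=1}^{T} g_t(u) \le \frac{\sqrt{2 \log(d)\, T}}{\gamma}.$$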
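The two properties claimed for the estimator $v_t$ above (unbiasedness and the second-moment bound) can be checked exactly on a small example. The sketch below is only an illustration: the function names and the cost and probability values are made up for the example, and the squared norm is taken to be the squared largest entry, which is an assumption about the dual norm (it makes no difference here, since $v_t$ has a single non-zero entry).

```python
from fractions import Fraction

def estimate(costs, w, arm):
    """Importance-weighted estimate when `arm` is pulled:
    costs[arm] / w[arm] at position `arm`, zero elsewhere."""
    return [costs[j] / w[j] if j == arm else Fraction(0)
            for j in range(len(costs))]

def expected_estimate(costs, w):
    """Exact expectation of the estimate when the arm is drawn from w."""
    d = len(costs)
    exp = [Fraction(0)] * d
    for arm in range(d):
        v = estimate(costs, w, arm)
        for j in range(d):
            exp[j] += w[arm] * v[j]
    return exp

def expected_sq_norm(costs, w):
    """Exact E[max_j v_j^2], i.e. sum_i w_i * (c_i / w_i)^2."""
    return sum(w[i] * max(x * x for x in estimate(costs, w, i))
               for i in range(len(costs)))

# Hypothetical costs in [0, 1] and a hypothetical distribution over 3 arms.
costs = [Fraction(1, 2), Fraction(1), Fraction(1, 4)]
w = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(expected_estimate(costs, w) == costs)                # unbiasedness
print(expected_sq_norm(costs, w) <= sum(1 / x for x in w)) # second-moment bound
```

Exact rational arithmetic is used so that the unbiasedness check is an equality rather than an approximate comparison.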

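Finally, let $C_i = \sum_{t=1}^{T} c_{t,i}$ and note that for each $i$, if we set $u$ to be s.t. $u_i = 1 - (d-1)\gamma$ and $u_j = \gamma$ for $j \ne i$, then

$$\sum_{t=1}^{T} g_t(u) = C_i + \gamma \sum_{j \ne i} (C_j - C_i) \le C_i + \gamma d T.$$

So, overall,

$$\mathbb{E}\Big[\sum_{t=1}^{T} g_t(w_t)\Big] \le C_i + \gamma d T + \frac{\sqrt{2 \log(d)\, T}}{\gamma}.$$

Setting $\gamma = \big(2 \log(d)\, T / (d^2 T^2)\big)^{1/4} = \big(2 \log(d) / (d^2 T)\big)^{1/4}$ we obtain the regret bound

$$\mathbb{E}\Big[\sum_{t=1}^{T} g_t(w_t)\Big] \le C_i + O\big((\log(d)\, d^2 T^3)^{1/4}\big) = C_i + \tilde{O}(d^{1/2} T^{3/4}).$$

3 An improved Multi-Armed Bandit Predictor

We now derive another algorithm, called EXP3 (which stands for exponential-weight algorithm for exploration and exploitation), that enjoys a regret bound of $O(\sqrt{T})$. The algorithm is due to Auer, Cesa-Bianchi, Freund, and Schapire.

Remark: Throughout this section, we think about $c_t$ as a gain that we would like to maximize rather than a cost. One can derive a result for minimizing a cost by defining $c_{t,i} \leftarrow 1 - c_{t,i}$ for all $t$ and $i$.

Algorithm 2 EXP3
    Parameter: $\gamma \in (0, 1]$
    Initialize: $w_1 = (1, \ldots, 1)$
    for $t = 1$ to $T$
        Set $Z_t = \sum_{j=1}^{d} w_{t,j}$
        Set $p_{t,i} = (1-\gamma)\, w_{t,i}/Z_t + \gamma/d$
        Pull $I_t$ randomly according to $p_t$
        Receive cost $c_{t,I_t} \in [0, 1]$
        Let $v_t$ be the vector with $v_{t,j} = \frac{c_{t,j}}{p_{t,j}} \mathbb{1}[I_t = j]$
        Update: $w_{t+1,j} = w_{t,j}\, e^{\gamma v_{t,j}/d}$
    end for

Theorem 2 For any $\gamma \in (0, 1)$ and $j \in [d]$ we have

$$\sum_{t=1}^{T} c_{t,j} - \mathbb{E}[C_{\mathrm{exp3}}] \le (e-1)\,\gamma \sum_{t=1}^{T} c_{t,j} + \frac{d \ln(d)}{\gamma},$$

where $C_{\mathrm{exp3}} = \sum_{t=1}^{T} c_{t,I_t}$ is the total gain collected by EXP3.

Proof We have

$$\frac{Z_{t+1}}{Z_t} = \sum_i \frac{w_{t+1,i}}{Z_t} = \sum_i \frac{w_{t,i}}{Z_t}\, e^{\gamma v_{t,i}/d} \le \sum_i \frac{w_{t,i}}{Z_t} \Big(1 + \frac{\gamma v_{t,i}}{d} + (e-2) \Big(\frac{\gamma v_{t,i}}{d}\Big)^2\Big),$$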

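where in the last inequality we used the inequality $e^x \le 1 + x + (e-2)x^2$, which holds for $x \le 1$ (here $x = \gamma v_{t,i}/d \le 1$ because $p_{t,i} \ge \gamma/d$). Denote $\tilde w_{t,i} = w_{t,i}/Z_t$. Since $\sum_i \tilde w_{t,i} = 1$ and, by the definition of $v_t$, only the coordinate $I_t$ is non-zero, the above implies:

$$\frac{Z_{t+1}}{Z_t} \le 1 + \frac{\gamma}{d}\, \tilde w_{t,I_t} v_{t,I_t} + (e-2) \Big(\frac{\gamma}{d}\Big)^2 \tilde w_{t,I_t} v_{t,I_t}^2.$$

Since $\tilde w_{t,i} \le p_{t,i}/(1-\gamma)$, and using the definition of $v_{t,i}$ (together with $c_{t,I_t}^2 \le c_{t,I_t}$), we get

$$\frac{Z_{t+1}}{Z_t} \le 1 + \frac{\gamma/d}{1-\gamma}\, c_{t,I_t} + \frac{(e-2)(\gamma/d)^2}{1-\gamma} \cdot \frac{c_{t,I_t}}{p_{t,I_t}}.$$

Taking logarithms of both sides and using $\ln(1+x) \le x$ we get

$$\ln \frac{Z_{t+1}}{Z_t} \le \frac{\gamma/d}{1-\gamma}\, c_{t,I_t} + \frac{(e-2)(\gamma/d)^2}{1-\gamma} \cdot \frac{c_{t,I_t}}{p_{t,I_t}}.$$

Summing over $t$ we obtain

$$\ln \frac{Z_{T+1}}{Z_1} \le \frac{\gamma/d}{1-\gamma}\, C_{\mathrm{exp3}} + \frac{(e-2)(\gamma/d)^2}{1-\gamma} \sum_{t=1}^{T} \frac{c_{t,I_t}}{p_{t,I_t}}.$$

On the other hand, for any action $j$ we have

$$\ln \frac{Z_{T+1}}{Z_1} \ge \ln \frac{w_{T+1,j}}{Z_1} = \frac{\gamma}{d} \sum_{t=1}^{T} v_{t,j} - \ln(d).$$

Combining the upper and lower bounds we obtain

$$\frac{\gamma}{d} \sum_{t=1}^{T} v_{t,j} - \ln(d) \le \frac{\gamma/d}{1-\gamma}\, C_{\mathrm{exp3}} + \frac{(e-2)(\gamma/d)^2}{1-\gamma} \sum_{t=1}^{T} \frac{c_{t,I_t}}{p_{t,I_t}}.$$

Now, take expectation of both sides (w.r.t. the random choice of $I_t$). Note that $\mathbb{E}[v_t \mid I_1, \ldots, I_{t-1}] = c_t$ and that $\mathbb{E}[c_{t,I_t}/p_{t,I_t} \mid I_1, \ldots, I_{t-1}] = \sum_i c_{t,i}$. Summed over $t$, the latter is at most $d \sum_t c_{t,j}$ whenever $j$ is an arm with the largest cumulative gain, which is the comparator that matters for the regret. Therefore, for such $j$,

$$\mathbb{E}\Big[\frac{\gamma}{d} \sum_{t=1}^{T} c_{t,j} - \ln(d)\Big] \le \mathbb{E}\Big[\frac{\gamma/d}{1-\gamma}\, C_{\mathrm{exp3}} + (e-2) \Big(\frac{\gamma}{d}\Big)^2 \frac{1}{1-\gamma}\, d \sum_{t=1}^{T} c_{t,j}\Big].$$

Multiplying by $d(1-\gamma)/\gamma$ and rearranging gives

$$\sum_{t=1}^{T} c_{t,j} - \mathbb{E}[C_{\mathrm{exp3}}] \le (e-1)\,\gamma \sum_{t=1}^{T} c_{t,j} + \frac{d \ln(d)}{\gamma},$$

which concludes our proof.

Corollary 1 Choose $\gamma = \min\big\{1, \sqrt{d \ln(d) / ((e-1)\, g)}\big\}$. Then for any $j$ s.t. $\sum_{t=1}^{T} c_{t,j} \le g$ we have

$$\sum_{t=1}^{T} c_{t,j} - \mathbb{E}[C_{\mathrm{exp3}}] \le 2 \sqrt{e-1}\, \sqrt{g\, d \ln(d)} = O\big(\sqrt{d\, T \ln(d)}\big).$$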
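Algorithm 2 and the choice of $\gamma$ in Corollary 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names `exp3` and `tuned_gamma`, the gain table and the seed are all hypothetical, and the weights are stored as logarithms (shifted by their maximum before exponentiating) purely to avoid overflow, which leaves the probabilities $p_t$ unchanged.

```python
import math
import random

def exp3(gains, gamma, rng):
    """Run EXP3 on a T x d table of gains in [0, 1].

    Only gains[t][arm] for the pulled arm is read, mimicking bandit
    feedback. Returns the total gain and the list of pulled arms.
    """
    d = len(gains[0])
    log_w = [0.0] * d                      # log of w_1 = (1, ..., 1)
    total, pulls = 0.0, []
    for row in gains:
        m = max(log_w)                     # shift for numerical stability
        w = [math.exp(x - m) for x in log_w]
        z = sum(w)
        p = [(1 - gamma) * wi / z + gamma / d for wi in w]
        arm = rng.choices(range(d), weights=p)[0]
        c = row[arm]                       # the only observed gain
        total += c
        pulls.append(arm)
        # v_{t,j} is non-zero only for the pulled arm, so only it is updated
        log_w[arm] += gamma * (c / p[arm]) / d
    return total, pulls

def tuned_gamma(d, g):
    """Exploration rate from Corollary 1, for an upper bound g on the best total gain."""
    return min(1.0, math.sqrt(d * math.log(d) / ((math.e - 1) * g)))

# Hypothetical instance: arm 0 always pays 1, the other two arms pay 0.
T, d = 2000, 3
gains = [[1.0, 0.0, 0.0] for _ in range(T)]
gamma = tuned_gamma(d, T)
total, pulls = exp3(gains, gamma, random.Random(0))
print(gamma, total, T - total)
```

On this instance the last printed value is the realized regret against arm 0, which can be compared with the bound $2\sqrt{e-1}\sqrt{g\, d \ln(d)}$ from Corollary 1 with $g = T$ (the corollary bounds the expectation, not a single run).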

3.1 Lower bound

Theorem 3 For any $d \ge 2$ and $T \ge 1$ there exists a distribution over assignments of rewards such that the expected regret of any algorithm (where the expectation is with respect to both the randomization of the algorithm and the assignments of rewards) is at least $\Omega(\min\{\sqrt{dT}, T\})$.

A proof can be found in the paper by Auer et al. The idea is to define a distribution over rewards of arms as follows. Before the play begins, one action $I$ is chosen uniformly at random to be the good action. The rewards of the good action are chosen i.i.d. to be 1 with probability $1/2 + \epsilon$ and 0 otherwise, for some $\epsilon$ to be defined later. The rewards of the rest of the arms are chosen to be either 0 or 1 with probability $1/2$. Now, the idea is to show that any function defined on rewards in previous rounds cannot distinguish too well between rewards that come according to the distribution mentioned above and rewards that come from a uniform distribution.
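The reward distribution used in this argument is easy to sample, which can help build intuition for why the good arm is hard to identify when $\epsilon$ is small. The sketch below only draws the reward table and does not reproduce the proof; the function name `sample_hard_instance` and the values of `d`, `T`, `eps` and the seed are hypothetical, since the text leaves $\epsilon$ to be chosen later.

```python
import random

def sample_hard_instance(d, T, eps, rng):
    """Draw a T x d table of 0/1 rewards.

    One arm, chosen uniformly at random, is the good arm and pays 1 with
    probability 1/2 + eps; every other arm pays 1 with probability 1/2.
    Returns the index of the good arm and the table.
    """
    good = rng.randrange(d)
    rewards = [
        [1 if rng.random() < (0.5 + eps if j == good else 0.5) else 0
         for j in range(d)]
        for _ in range(T)
    ]
    return good, rewards

# Illustrative values only; a small eps makes the arms nearly indistinguishable.
good, rewards = sample_hard_instance(d=4, T=20000, eps=0.2, rng=random.Random(0))
means = [sum(row[j] for row in rewards) / len(rewards) for j in range(4)]
print(good, means)
```

With full information, as here, the empirical means reveal the good arm; a bandit learner sees only one entry per row, which is what the lower bound exploits.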
