Lecture 4
Instructor: Haipeng Luo

In the following lectures, we focus on the expert problem and study more adaptive algorithms. Although Hedge is proven to be worst-case optimal, one may wonder how well it actually performs on a practical problem that is probably not the worst case, or is even relatively easy. Indeed, the regret bound we proved for Hedge only says that for all problem instances, Hedge's regret is uniformly bounded by $O(\sqrt{T \ln N})$. Ideally, however, we want an algorithm that enjoys a much smaller regret in many easy situations, while still guaranteeing the minimax regret $O(\sqrt{T \ln N})$ in the worst case. Deriving adaptive algorithms and adaptive regret bounds is exactly one way to achieve this goal.

Small-loss Bounds

We start with arguably the simplest adaptive bound, sometimes called a small-loss bound or first-order bound. Recall that we proved the following intermediate bound for Hedge:

$$ R_T = \hat{L}_T - L_{T,i^\star} \le \frac{\ln N}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{N} p_t(i)\,\ell_t(i)^2, \qquad (1) $$

where $L_T \in \mathbb{R}^N$ is the cumulative loss vector, $i^\star \in \operatorname{argmin}_i L_{T,i}$ is the best expert, and we define $\hat{L}_T = \sum_{t=1}^{T} \langle p_t, \ell_t \rangle$ to be the cumulative loss of the algorithm. By boundedness of the losses ($\ell_t \in [0,1]^N$), the last term above can be bounded by $\eta \hat{L}_T$. If $\eta \le 1/2$, then rearranging gives

$$ R_T \le \frac{2\ln N}{\eta} + 2\eta L_{T,i^\star}. $$

Therefore, if for a moment we assume that we knew the quantity $L_{T,i^\star}$ ahead of time and were able to set $\eta = \min\{1/2, \sqrt{\ln N / L_{T,i^\star}}\}$, then we arrive at

$$ R_T \le \max\Big\{ 8\ln N,\; 4\sqrt{L_{T,i^\star} \ln N} \Big\} = O\Big(\sqrt{L_{T,i^\star} \ln N} + \ln N\Big). $$

The final bound above is the so-called small-loss bound, which essentially replaces the dependence on $T$ in the minimax bound $\sqrt{T \ln N}$ by the loss of the best expert $L_{T,i^\star}$. Note that $L_{T,i^\star}$ is bounded by $T$, so the small-loss bound is never worse than the minimax bound. More importantly, it can be much smaller than $\sqrt{T \ln N}$ when the best expert indeed suffers very small loss. In particular, if the best expert makes no mistakes at all and has $L_{T,i^\star} = 0$, then the small-loss bound is only $O(\ln N)$, independent of $T$. This is one typical example of the adaptive bounds that we are aiming for. Of course, one obvious issue in the above derivation is that the learning rate has to be set in terms of the unknown quantity $L_{T,i^\star}$.
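To make this concrete, here is a minimal sketch of Hedge (not code from the lecture; the instance and all names are illustrative) run on an easy problem where the best expert suffers zero loss. With $\eta = 1/2$, which is the value the tuning above selects when $L_{T,i^\star} = 0$, the observed regret stays $O(\ln N)$ no matter how large $T$ is:

```python
import math

def hedge(losses, eta):
    """Run Hedge with a fixed learning rate on a T x N loss sequence.

    Returns (algorithm's expected cumulative loss, experts' cumulative losses).
    """
    N = len(losses[0])
    L = [0.0] * N      # cumulative expert losses L_{t,i}
    alg_loss = 0.0     # cumulative algorithm loss \hat{L}_t
    for loss_t in losses:
        m = min(L)     # shift weights for numerical stability
        w = [math.exp(-eta * (Li - m)) for Li in L]
        s = sum(w)
        p = [wi / s for wi in w]               # p_t(i) ∝ exp(-eta * L_{t-1,i})
        alg_loss += sum(pi * li for pi, li in zip(p, loss_t))
        L = [Li + li for Li, li in zip(L, loss_t)]
    return alg_loss, L

# Easy instance: expert 0 is perfect, all others suffer loss 1 every round.
T, N = 2000, 16
losses = [[0.0] + [1.0] * (N - 1) for _ in range(T)]

eta = 0.5  # the tuned value min{1/2, sqrt(ln N / L*)} when L* = 0
alg_loss, L = hedge(losses, eta)
regret = alg_loss - min(L)
print(f"regret = {regret:.2f}, ln N = {math.log(N):.2f}")
```

Doubling $T$ leaves the regret essentially unchanged here, which is exactly the small-loss behavior the bound predicts.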
In fact, this becomes an even more severe problem in a non-oblivious environment, since $L_{T,i^\star}$ can then depend on the algorithm's actions, making the definition of $\eta$ circular. Fortunately, there are many different ways to address this issue, and we explore one of them here. The idea is to use a more adaptive, time-varying learning rate schedule. Specifically, the algorithm predicts $p_t(i) \propto \exp(-\eta_t L_{t-1,i})$ where $\eta_t = \sqrt{\ln N / \hat{L}_{t-1}}$. Note that $\hat{L}_{t-1} = \sum_{\tau=1}^{t-1} \langle p_\tau, \ell_\tau \rangle$ is the cumulative loss of the algorithm prior to round $t$ and is thus available at the beginning of round $t$. This is sometimes called a self-confident learning rate: the algorithm is confident that its loss is close to the loss of the best expert, and thus uses its own loss as a proxy for the loss of the best expert to tune the learning rate. We next prove that this algorithm indeed provides a small-loss bound.

Theorem 1. Hedge with the adaptive learning rate schedule $\eta_t = \sqrt{\ln N / \hat{L}_{t-1}}$ ensures

$$ R_T \le 3\sqrt{L_{T,i^\star} \ln N} + 9 \ln N. $$

Proof. Let $\Phi_t(\eta) = \frac{1}{\eta} \ln\big( \frac{1}{N} \sum_{i=1}^N \exp(-\eta L_{t,i}) \big)$. In Lecture 2 we already proved

$$ \Phi_t(\eta_t) - \Phi_{t-1}(\eta_t) \le -\langle p_t, \ell_t \rangle + \eta_t \sum_{i=1}^N p_t(i)\,\ell_t(i)^2 \le -(1 - \eta_t)\langle p_t, \ell_t \rangle, $$

where the last step uses $\ell_t(i)^2 \le \ell_t(i)$. Summing over $t$ and rearranging give

$$ \hat{L}_T \le -\Phi_T(\eta_T) + \sum_{t=1}^{T} \eta_t \langle p_t, \ell_t \rangle + \sum_{t=1}^{T-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big), $$

using $\Phi_0(\eta_1) = 0$ and the telescoping identity $\sum_{t=1}^{T} \big( \Phi_t(\eta_t) - \Phi_{t-1}(\eta_t) \big) = \Phi_T(\eta_T) - \sum_{t=1}^{T-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big)$. For the first term,

$$ -\Phi_T(\eta_T) \le -\frac{1}{\eta_T} \ln\Big( \frac{1}{N} \exp(-\eta_T L_{T,i^\star}) \Big) = L_{T,i^\star} + \frac{\ln N}{\eta_T} = L_{T,i^\star} + \sqrt{\hat{L}_{T-1} \ln N} \le L_{T,i^\star} + \sqrt{\hat{L}_T \ln N}. $$

To bound the term $\sum_t \eta_t \langle p_t, \ell_t \rangle$, note that $\langle p_t, \ell_t \rangle = \hat{L}_t - \hat{L}_{t-1}$, and thus, comparing the sum with an integral,

$$ \sum_{t=1}^{T} \eta_t \langle p_t, \ell_t \rangle = \sqrt{\ln N} \sum_{t=1}^{T} \frac{\hat{L}_t - \hat{L}_{t-1}}{\sqrt{\hat{L}_{t-1}}} \le \sqrt{\ln N} \int_0^{\hat{L}_T} \frac{dx}{\sqrt{x}} = 2\sqrt{\hat{L}_T \ln N} $$

(here we gloss over the early rounds where $\hat{L}_{t-1}$ is close to $0$; since the losses lie in $[0,1]$, a more careful treatment of these rounds only affects lower-order terms).

It remains to bound the last term $\sum_{t=1}^{T-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big)$. Since the schedule satisfies $\eta_{t+1} \le \eta_t$, it suffices to prove that $\Phi_t(\eta)$ is increasing in $\eta$, so that every summand is non-positive, that is, it suffices to prove that the derivative is non-negative. Indeed, direct calculation shows that with $q(i) \propto \exp(-\eta L_{t,i})$,

$$ \frac{d}{d\eta} \Phi_t(\eta) = -\frac{1}{\eta^2} \ln\Big( \frac{1}{N} \sum_{i=1}^N e^{-\eta L_{t,i}} \Big) - \frac{1}{\eta} \sum_{i=1}^N q(i) L_{t,i} = \frac{1}{\eta^2} \sum_{i=1}^N q(i) \ln\big( N q(i) \big) \ge 0, $$

where the second equality uses $-\eta L_{t,i} = \ln q(i) + \ln \sum_{j=1}^N e^{-\eta L_{t,j}}$, and the last step is by the fact that entropy is maximized by the uniform distribution (so that $\sum_i q(i) \ln(N q(i)) = \ln N - H(q) \ge 0$).

To sum up, we have proven $R_T = \hat{L}_T - L_{T,i^\star} \le 3\sqrt{\hat{L}_T \ln N}$. Solving this quadratic inequality for $\sqrt{\hat{L}_T}$ leads to

$$ \sqrt{\hat{L}_T} \le \frac{3\sqrt{\ln N} + \sqrt{9 \ln N + 4 L_{T,i^\star}}}{2}. $$

Finally, multiplying both sides by $3\sqrt{\ln N}$ and using $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ give

$$ R_T \le 3\sqrt{\hat{L}_T \ln N} \le \frac{9 \ln N + 3\sqrt{\ln N}\big( 3\sqrt{\ln N} + 2\sqrt{L_{T,i^\star}} \big)}{2} = 9 \ln N + 3\sqrt{L_{T,i^\star} \ln N}, $$

which completes the proof.

Besides enjoying a better theoretical regret bound, this algorithm is also intuitively more reasonable, since it tunes the learning rate adaptively based on observed data. In general, learning rate tuning is an important topic in machine learning and can be of great practical importance.

Quantile Bounds

Small-loss bounds improve the dependence on $T$ in the minimax regret bound to $L_{T,i^\star}$. Is it possible to improve the other term, $\ln N$, to something better? To answer this question, consider again Hedge with a fixed learning rate for simplicity, and note that we proved in Lecture 2,

$$ \hat{L}_T \le -\frac{1}{\eta} \ln\Big( \frac{1}{N} \sum_{i=1}^N \exp(-\eta L_{T,i}) \Big) + \eta T. $$

Without loss of generality, assume $L_{T,1} \le L_{T,2} \le \cdots \le L_{T,N}$, so that expert $i$ is the $i$-th best expert. Previously we obtained the final regret bound by lower bounding $\sum_i \exp(-\eta L_{T,i})$ by $\max_i \exp(-\eta L_{T,i}) = \exp(-\eta L_{T,1})$. In general, however, for each $i$ we have

$$ \sum_{j=1}^{N} \exp(-\eta L_{T,j}) \ge \sum_{j=1}^{i} \exp(-\eta L_{T,j}) \ge i \exp(-\eta L_{T,i}), $$

and we therefore have the following regret bound against the $i$-th best expert:

$$ \hat{L}_T \le L_{T,i} + \frac{1}{\eta} \ln\frac{N}{i} + \eta T. \qquad (2) $$

With $\eta$ optimally tuned to $\sqrt{\ln(N/i)/T}$, the bound becomes $L_{T,i} + 2\sqrt{T \ln(N/i)}$. This is called a quantile bound, and it states that the algorithm suffers at most this amount of regret compared to all but an $i/N$ fraction of the experts.

Of course, at the end of the day what we care about is the loss of the algorithm. So assuming we had knowledge of $L_T$ for a moment, we could pick the optimal $i$ to achieve

$$ \hat{L}_T \le \min_i \Big( L_{T,i} + 2\sqrt{T \ln\frac{N}{i}} \Big), \qquad (3) $$

which is a strictly better bound compared to $L_{T,1} + 2\sqrt{T \ln N}$. To understand the improvement, consider the case when $N$ is huge but there are many similar experts, so that, for example, the top 1% of them all have the same cumulative loss. Then bound (3) is at most

$$ L_{T,N/100} + 2\sqrt{T \ln\frac{N}{N/100}} = L_{T,1} + 2\sqrt{T \ln 100}, $$

which is independent of $N$.

Just as in the previous discussion, one obvious issue in the derivation of bound (3) is again that the learning rate needs to be tuned based on unknown knowledge. To address the issue, here we explore a quite different approach. The idea is to run different instances of Hedge with different learning rates, and have a master Hedge combine the predictions of these meta-experts. To this end, we use Hedge($\eta$) to denote an instance of Hedge running with learning rate $\eta$. The algorithm is shown below.

Algorithm 1: Hedge with Quantile Bounds
Input: master learning rate $\eta > 0$, base learning rates $\eta_1, \ldots, \eta_M$
Initialize: $M$ Hedge algorithms Hedge($\eta_1$), ..., Hedge($\eta_M$); $C_0(j) = 0$ for all $j \in [M]$
for $t = 1, \ldots, T$ do
    let $p_t^j$ be the prediction of Hedge($\eta_j$) on round $t$
    compute $p_t = \sum_{j=1}^M q_t(j)\, p_t^j$, where $q_t(j) \propto \exp(-\eta\, C_{t-1}(j))$
    play $p_t$ and observe loss vector $\ell_t \in [0,1]^N$
    update $C_t(j) = C_{t-1}(j) + \langle p_t^j, \ell_t \rangle$ for all $j \in [M]$
    pass $\ell_t$ to Hedge($\eta_1$), ..., Hedge($\eta_M$)

By Eq. (2), we have for each Hedge($\eta_j$) and each expert $i$,

$$ \sum_{t=1}^{T} \langle p_t^j, \ell_t \rangle \le L_{T,i} + \frac{1}{\eta_j} \ln\frac{N}{i} + \eta_j T. $$

On the other hand, for the master Hedge, we have for each meta-expert $j$,

$$ \sum_{t=1}^{T} \sum_{j'=1}^{M} q_t(j') \langle p_t^{j'}, \ell_t \rangle \le C_T(j) + \frac{\ln M}{\eta} + \eta T. $$

Note that by construction, $\sum_{j'=1}^{M} q_t(j') \langle p_t^{j'}, \ell_t \rangle = \langle p_t, \ell_t \rangle$ and $C_T(j) = \sum_{t=1}^{T} \langle p_t^j, \ell_t \rangle$. Therefore, summing up the above two inequalities leads to

$$ \hat{L}_T \le L_{T,i} + \frac{1}{\eta_j} \ln\frac{N}{i} + \eta_j T + \frac{\ln M}{\eta} + \eta T = L_{T,i} + \frac{1}{\eta_j} \ln\frac{N}{i} + \eta_j T + 2\sqrt{T \ln M}, $$

where the last step is by picking the optimal $\eta = \sqrt{\ln M / T}$. Note that the above holds for all $j$ and all $i$. Therefore, suppose (a) for each $i$, there is a $j$ such that $\frac{1}{\eta_j} \ln\frac{N}{i} + \eta_j T = O\big(\sqrt{T \ln(N/i)}\big)$, and (b) $M$ is much smaller than $N$; then we obtain bound (3) up to constants and lower-order terms.
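The master construction is short to implement. Below is a sketch of Algorithm 1 (variable names and the test instance are my own; both levels are just the exponential-weights rule):

```python
import math

def exp_weights(cum_losses, eta):
    """Exponential weights: p(i) ∝ exp(-eta * cum_losses[i])."""
    m = min(cum_losses)  # shift for numerical stability
    w = [math.exp(-eta * (c - m)) for c in cum_losses]
    s = sum(w)
    return [wi / s for wi in w]

def master_hedge(losses, base_etas, master_eta):
    """A master Hedge combining M base Hedge(eta_j) instances.

    Returns (master's cumulative loss, experts' cumulative losses).
    """
    N, M = len(losses[0]), len(base_etas)
    L = [0.0] * N   # experts' cumulative losses (shared by all base instances)
    C = [0.0] * M   # C_t(j): cumulative loss of base instance Hedge(eta_j)
    total = 0.0
    for loss_t in losses:
        base_preds = [exp_weights(L, eta_j) for eta_j in base_etas]
        q = exp_weights(C, master_eta)        # master's weights over the bases
        p = [sum(q[j] * base_preds[j][i] for j in range(M)) for i in range(N)]
        total += sum(p[i] * loss_t[i] for i in range(N))
        for j in range(M):
            C[j] += sum(base_preds[j][i] * loss_t[i] for i in range(N))
        L = [Li + li for Li, li in zip(L, loss_t)]
    return total, L

# Illustrative instance: expert 0 loses 0.3 per round, all others 0.6.
T, N = 500, 8
losses = [[0.3] + [0.6] * (N - 1) for _ in range(T)]
base_etas = [0.5, 0.1, math.sqrt(math.log(N) / T), 0.01]
master_eta = math.sqrt(math.log(len(base_etas)) / T)
total, L = master_hedge(losses, base_etas, master_eta)
print(f"master loss = {total:.1f}, best expert = {min(L):.1f}")
```

Since the grid contains the rate $\sqrt{\ln N / T}$, the combined loss is at most $\min_i L_{T,i} + 2\sqrt{T \ln N} + 2\sqrt{T \ln M}$ on any instance, and in practice it is usually far below that.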
Setting $M = N$ and $\eta_j = \sqrt{\ln(N/j)/T}$ would clearly satisfy (a), but not (b). Fortunately, it turns out that one only needs $M = O(\ln\ln N)$ meta-experts to satisfy (a) for essentially all $i$. Specifically, let

$$ \eta_j = 2^{1-j} \sqrt{\frac{\ln N}{T}} \quad (j = 1, \ldots, M), \qquad M = \Big\lceil \log_2 \sqrt{\ln N} \Big\rceil + 1, \qquad (4) $$

so that the $\eta_j$'s form a geometric grid covering the whole range of candidate learning rates $\sqrt{\ln(N/i)/T}$.
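This doubling construction is easy to sanity-check numerically. The sketch below (assuming the geometric grid $\eta_j = 2^{1-j}\sqrt{\ln N / T}$ with $M \approx \log_2 \sqrt{\ln N}$ rates, as described above) verifies the covering property (a): every candidate rate $\sqrt{\ln(N/i)/T}$ in the covered range lies within a factor of $2$ of some grid point:

```python
import math

# Geometric grid of base learning rates: eta_j = 2^(1-j) * sqrt(ln N / T),
# for j = 1, ..., M with M = ceil(log2(sqrt(ln N))) + 1.
T, N = 10_000, 1_000_000
M = math.ceil(math.log2(math.sqrt(math.log(N)))) + 1
etas = [2 ** (1 - j) * math.sqrt(math.log(N) / T) for j in range(1, M + 1)]

# Covering property: every target rate sqrt(ln(N/i)/T) with
# ln(N/i) >= ln(N) / 4^(M-1) is bracketed within a factor of 2 by some eta_j,
# so the tuned bound is lost only up to a constant.
thr = math.log(N) / 4 ** (M - 1)
covered = 0
for i in range(1, N):
    if math.log(N / i) < thr:
        continue                     # below the grid; handled separately by eta_M
    target = math.sqrt(math.log(N / i) / T)
    assert any(eta <= target <= 2 * eta for eta in etas), i
    covered += 1
print(f"M = {M} learning rates cover {covered} of {N - 1} quantile levels")
```

Note how few rates are needed: for $N = 10^6$, the grid has only $M = 3$ entries.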
Now clearly, for each $i$ with $\ln(N/i) \ge \ln N / 4^{M-1}$, there exists a $j$ such that $\eta_j \le \sqrt{\ln(N/i)/T} \le 2\eta_j$, and therefore

$$ \hat{L}_T \le L_{T,i} + \frac{1}{\eta_j} \ln\frac{N}{i} + \eta_j T + 2\sqrt{T \ln M} \le L_{T,i} + 2\sqrt{T \ln\frac{N}{i}} + \sqrt{T \ln\frac{N}{i}} + 2\sqrt{T \ln M} = L_{T,i} + 3\sqrt{T \ln\frac{N}{i}} + 2\sqrt{T \ln M}. $$

For the remaining $i$, note that the choice of $M$ ensures $4^{M-1} \ge \ln N$, so $\ln(N/i) < \ln N / 4^{M-1} \le 1$; picking $j = M$ (whose learning rate satisfies $\frac{1}{2\sqrt{T}} \le \eta_M \le \frac{1}{\sqrt{T}}$) then gives $\hat{L}_T \le L_{T,i} + O(\sqrt{T}) + 2\sqrt{T \ln M}$, which is of lower order. It remains to observe that $M$ is small enough: indeed, $M = \lceil \log_2 \sqrt{\ln N} \rceil + 1 = O(\ln\ln N)$, and hence the overhead $2\sqrt{T \ln M} = O(\sqrt{T \ln\ln\ln N})$. So at least for the experts with $\ln(N/i) = \Omega(\ln\ln\ln N)$, the term $\sqrt{T \ln M}$ is dominated by $\sqrt{T \ln(N/i)}$ in the regret bound. We summarize the result in the following theorem.

Theorem 2. Algorithm 1 with $\eta = \sqrt{\ln M / T}$, $\eta_j = 2^{1-j}\sqrt{\ln N / T}$, and $M = \lceil \log_2 \sqrt{\ln N} \rceil + 1$ ensures

$$ \hat{L}_T \le \min_i \Big( L_{T,i} + 3\sqrt{T \ln\frac{N}{i}} \Big) + O\Big( \sqrt{T \ln\ln\ln N} \Big). $$

This idea of combining algorithms using Hedge is useful for many other problems. It is usually a quick and easy way to verify whether some regret bound is achievable in theory. However, the resulting algorithm might not be particularly elegant or practical. In the next lecture, we will study a different algorithm that not only guarantees a quantile bound (in fact, one even better than proven here), but also enjoys several other useful properties.
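As a closing sanity check, the self-confident learning-rate schedule from the first section is equally simple to prototype. In the sketch below (my own, not code from the lecture), the cap on $\eta_t$ during the early rounds, where $\hat{L}_{t-1}$ is still near zero, is an implementation choice mirroring the rounds glossed over in the analysis:

```python
import math

def self_confident_hedge(losses):
    """Hedge with the self-confident schedule eta_t = sqrt(ln N / Lhat_{t-1}).

    Returns (algorithm's cumulative loss, experts' cumulative losses).
    """
    N = len(losses[0])
    L = [0.0] * N   # experts' cumulative losses
    Lhat = 0.0      # algorithm's cumulative loss \hat{L}_{t-1}
    for loss_t in losses:
        # cap eta at 1 while Lhat is tiny (the analysis glosses over these rounds)
        eta = math.sqrt(math.log(N) / Lhat) if Lhat > math.log(N) else 1.0
        m = min(L)  # shift for numerical stability
        w = [math.exp(-eta * (Li - m)) for Li in L]
        s = sum(w)
        p = [wi / s for wi in w]
        Lhat += sum(pi * li for pi, li in zip(p, loss_t))
        L = [Li + li for Li, li in zip(L, loss_t)]
    return Lhat, L

# Easy instance: expert 0 loses 0.01 per round, all others 0.5.
T, N = 1000, 4
losses = [[0.01, 0.5, 0.5, 0.5] for _ in range(T)]
Lhat, L = self_confident_hedge(losses)
L_star = min(L)
bound = 3 * math.sqrt(L_star * math.log(N)) + 9 * math.log(N)
print(f"regret = {Lhat - L_star:.2f}, small-loss bound = {bound:.2f}")
```

On this instance the best expert's loss is only about $10$, and the observed regret sits well inside the small-loss bound, while a minimax-style $\sqrt{T \ln N}$ budget would be an order of magnitude larger.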