6.883: Online Methods in Machine Learning Alexander Rakhlin
LECTURE 23. SOME CONSEQUENCES OF ONLINE NO-REGRET METHODS

In this lecture, we explore some consequences of the developed techniques.

1.1 Convex optimization

Whenever we have an online linear optimization guarantee

    \frac{1}{n} \sum_{t=1}^n \langle \hat{y}_t, z_t \rangle - \min_{v \in K} \frac{1}{n} \sum_{t=1}^n \langle v, z_t \rangle \le \frac{Reg_n}{n}        (1)

that holds for all sequences z_1, \ldots, z_n \in K', then for any convex function f : K \to \mathbb{R} with gradients \{\nabla f(v) : v \in K\} contained in K', it holds that

    f(\bar{y}_n) - \min_{v \in K} f(v) \le \frac{Reg_n}{n}        (2)

for \bar{y}_n = \frac{1}{n} \sum_{t=1}^n \hat{y}_t. In other words, by running an online linear optimization method and averaging the trajectory (called Polyak averaging), we obtain a point in K with suboptimality bounded by the normalized regret Reg_n / n. For instance, if K = B_2 and the gradients of f are also bounded by 1 in Euclidean norm, the methods developed earlier give Reg_n = \sqrt{n}, and thus the average of the trajectory satisfies

    f(\bar{y}_n) - \min_{v \in K} f(v) \le \frac{1}{\sqrt{n}}.

Recall that this is a bound we already derived in the beginning of the course.

Let us prove the claim (2). By convexity, for any v \in K,

    f(\bar{y}_n) - f(v) \le \frac{1}{n} \sum_{t=1}^n f(\hat{y}_t) - f(v) \le \frac{1}{n} \sum_{t=1}^n \langle \hat{y}_t - v, \nabla f(\hat{y}_t) \rangle.        (3)

Then (2) follows from (1) by taking z_t = \nabla f(\hat{y}_t). This step is valid because z_t can be chosen based on \hat{y}_t (as per the online linear optimization protocol). We remark that (2) holds with Reg_n = Reg_n(z_1, \ldots, z_n). Indeed, adaptive data-dependent bounds for optimization have been derived via this approach.
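As a concrete illustration of this averaging scheme, here is a minimal sketch (our own, not from the lecture): projected online gradient descent on the unit Euclidean ball K = B_2, returning the Polyak average of the trajectory. The objective f, its gradient, and the step size \eta = 1/\sqrt{n} are choices invented for the example.

```python
import numpy as np

def ogd_polyak(grad_f, d, n, eta):
    """Projected online gradient descent on the unit Euclidean ball B_2.

    On round t we play y_t, receive z_t = grad_f(y_t), take a gradient
    step, and project back onto B_2. Returns the Polyak average of the
    played points y_1, ..., y_n.
    """
    y = np.zeros(d)
    y_bar = np.zeros(d)
    for _ in range(n):
        y_bar += y / n             # accumulate the average of played points
        z = grad_f(y)              # z_t may be chosen based on y_t
        y = y - eta * z            # gradient step
        norm = np.linalg.norm(y)
        if norm > 1.0:             # projection onto B_2
            y = y / norm
    return y_bar

# Example: f(v) = 0.5 ||v - v_star||^2 with a minimizer inside B_2 (hypothetical).
v_star = np.array([0.3, -0.4, 0.2])
n = 10000
y_bar = ogd_polyak(lambda y: y - v_star, d=3, n=n, eta=1.0 / np.sqrt(n))
suboptimality = 0.5 * np.linalg.norm(y_bar - v_star) ** 2
```

Per (2), the suboptimality of y_bar is bounded by the normalized regret, i.e., of order 1/\sqrt{n} in this setting.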
Finally, observe that a version of (3),

    \frac{1}{n} \sum_{t=1}^n f_t(\hat{y}_t) - \frac{1}{n} \sum_{t=1}^n f_t(v) \le \frac{1}{n} \sum_{t=1}^n \langle \hat{y}_t - v, \nabla f_t(\hat{y}_t) \rangle,        (4)

holds with time-varying convex functions f_t. Hence, a regret guarantee for online linear optimization translates into a regret guarantee for this online convex optimization scenario.

1.2 Zero-sum games

Let M be a d_1 \times d_2 real matrix with bounded entries. We think of M_{i,j} (resp., -M_{i,j}) as the cost for Player I (resp., Player II) when the two players choose their moves as i and j. As discussed a few lectures ago, the minimax value is

    \min_{q \in \Delta_{d_1}} \max_{j \in [d_2]} q^T M e_j = \max_{p \in \Delta_{d_2}} \min_{i \in [d_1]} e_i^T M p = \min_{q \in \Delta_{d_1}} \max_{p \in \Delta_{d_2}} q^T M p.        (5)

It turns out that both players can find a near-minimax-optimal strategy simply by employing an online linear optimization method with a non-trivial regret guarantee. More precisely, we assume that on round t, Player I picks a probability distribution q_t \in \Delta_{d_1} on the rows, while Player II picks a probability distribution p_t \in \Delta_{d_2} on the columns of M. The first player then receives M p_t (or a random draw M e_{J_t}, J_t \sim p_t), while the second player receives q_t^T M (or the corresponding random draw). Let \bar{q} = \frac{1}{n} \sum_{t=1}^n q_t and \bar{p} = \frac{1}{n} \sum_{t=1}^n p_t be the averages of the trajectories. Since each player employs a no-regret strategy,

    \frac{1}{n} \sum_{t=1}^n q_t^T M p_t - \min_{q \in \Delta_{d_1}} \frac{1}{n} \sum_{t=1}^n q^T M p_t \le \frac{Reg_n^{(1)}}{n}        (6)

and

    \frac{1}{n} \sum_{t=1}^n (-q_t^T M) p_t - \min_{p \in \Delta_{d_2}} \frac{1}{n} \sum_{t=1}^n (-q_t^T M) p \le \frac{Reg_n^{(2)}}{n}.        (7)

We put the minuses in the second expression because the second player can pretend she is the minimizing player in the game with losses given according to -M. Adding the two expressions and using the definitions of \bar{q} and \bar{p},

    \max_{p \in \Delta_{d_2}} \bar{q}^T M p - \min_{q \in \Delta_{d_1}} q^T M \bar{p} \le \frac{1}{n} \left( Reg_n^{(1)} + Reg_n^{(2)} \right).        (8)

However, we know that

    \min_{q \in \Delta_{d_1}} q^T M \bar{p} \le \max_{p \in \Delta_{d_2}} \min_{q \in \Delta_{d_1}} q^T M p = \min_{q \in \Delta_{d_1}} \max_{p \in \Delta_{d_2}} q^T M p \le \max_{p \in \Delta_{d_2}} \bar{q}^T M p.        (9)

In view of (8) and (9), the pair (\bar{q}, \bar{p}) attains a value within \frac{1}{n}(Reg_n^{(1)} + Reg_n^{(2)}) of minimax. In other words, the averages of the trajectories of both players converge (in the sense of the minimax value) to the set of minimax optima (Nash equilibria). This technique extends to multi-player games, whereby each decision-maker runs a no-regret algorithm against the (complicated) set of moves of the other players.
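The self-play dynamic can be sketched as follows (a hypothetical example, not from the lecture): both players run Exponential Weights, a no-regret method from earlier lectures, on a small matrix game invented for the demonstration; the horizon and learning rate are our own choices.

```python
import numpy as np

def selfplay(M, n, eta):
    """Both players run Exponential Weights on the game matrix M.

    Player I suffers costs M p_t; Player II suffers costs -q_t^T M.
    Returns the averaged strategies (q_bar, p_bar).
    """
    d1, d2 = M.shape
    loss_q = np.zeros(d1)              # cumulative costs for Player I
    loss_p = np.zeros(d2)              # cumulative costs for Player II
    q_bar, p_bar = np.zeros(d1), np.zeros(d2)
    for _ in range(n):
        q = np.exp(-eta * loss_q); q /= q.sum()
        p = np.exp(-eta * loss_p); p /= p.sum()
        q_bar += q / n
        p_bar += p / n
        loss_q += M @ p                # Player I observes M p_t
        loss_p += -(q @ M)             # Player II observes -q_t^T M
    return q_bar, p_bar

# A small game (invented) with minimax value 2/3 at q* = (1/3, 2/3), p* = (1/3, 2/3).
M = np.array([[0.0, 1.0], [1.0, 0.5]])
n = 5000
q_bar, p_bar = selfplay(M, n=n, eta=np.sqrt(np.log(2) / n))
gap = np.max(q_bar @ M) - np.min(M @ p_bar)    # duality gap, as in (8)
```

By (8), the gap is bounded by the sum of the normalized regrets, which for Exponential Weights is of order \sqrt{\log d / n}.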
1.2.1 What if Player II best-responds?

As an alternative (which will be used in the next section), suppose Player I employs a regret-minimization strategy, while Player II best-responds to the observed q_t^T M:

    p_t = \arg\max_{j \in [d_2]} q_t^T M e_j

(equivalently, the minimizer of her cost -q_t^T M e_j). This behavior also leads to the minimax strategy, since

    \max_{p \in \Delta_{d_2}} \bar{q}^T M p \le \frac{1}{n} \sum_{t=1}^n \max_{p \in \Delta_{d_2}} q_t^T M p = \frac{1}{n} \sum_{t=1}^n q_t^T M p_t        (10)

    \le \min_{q \in \Delta_{d_1}} \frac{1}{n} \sum_{t=1}^n q^T M p_t + \frac{Reg_n^{(1)}}{n} = \min_{q \in \Delta_{d_1}} q^T M \bar{p} + \frac{Reg_n^{(1)}}{n}.        (11)

In view of (9), the pair of strategies (\bar{q}, \bar{p}) is within Reg_n^{(1)}/n of minimax optimal. In particular, note that d_2 does not affect the rate of convergence; this will be important in the next section.

1.3 Application: Boosting

We briefly describe a variant of Boosting, a procedure that aggregates weak classifiers into a strong one. This is a batch procedure, and the online aspect will appear as an iterative improvement of the aggregate. More precisely, let (x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\} be a dataset, and F a class of [-1, 1]-valued functions on X. Each function f \in F induces a binary prediction sign(f(x_t)) at x_t. For the given pair (x_t, y_t), a mistake occurs if y_t f(x_t) < 0. Suppose the class F has finite cardinality k (though this assumption can be lifted at the end, since the final bound does not depend on this value) and enumerate F = \{f_1, \ldots, f_k\}. Define a matrix M \in [-1, 1]^{m \times k} with M_{i,j} = y_i f_j(x_i). Consider the minimax value of the game given by M, where q is a distribution on examples and p is a distribution on functions. We have

    \min_{q \in \Delta_m} \max_{j \in [k]} q^T M e_j = \max_{p \in \Delta_k} \min_{i \in [m]} e_i^T M p,        (12)

and we call this common value \gamma. Suppose \gamma > 0. The two sides of (12) have interesting and distinct interpretations. The left-hand side says: no matter what distribution q on the dataset we choose, there is some function f_j such that E_{i \sim q}\, y_i f_j(x_i) \ge \gamma. If you have encountered the literature on boosting (search for it on the internet if you haven't), the above requirement is known as the weak learnability condition [FS95, SF12]. To be able to boost weak learners, one must be able to find, for any discrete distribution on the m datapoints, a hypothesis that beats a random guess.

The right-hand side of (12) says: there exists a distribution p over the k functions such that on any data point (x_i, y_i), one has

    y_i \bar{f}(x_i) \ge \gamma,        (13)
for \bar{f} = \sum_{i=1}^k p_i f_i. This can be recognized as a separability assumption with margin \gamma (see Lecture 2). Thus, separability with a margin is equivalent to the weak learnability condition. It is no surprise that weak learnability means one can create a mixture of functions that attains zero mistakes.

Equivalence aside, the goal of boosting is to furnish a distribution p that nearly achieves (13). According to the previous subsection and, in particular, Eq. (11), one may achieve this by running exponential weighting (or some other regret minimization procedure) over the m choices for Player I, and best response for Player II. Best response exactly corresponds to the usual weak learnability assumption, whereby a choice f_j (a pure strategy) is returned to Player I for the given q_t, a key step in any boosting procedure.

It remains to check the number of iterations required. If the margin is \gamma, we can aim for achieving a margin of \gamma/2. Then, equating the Exponential Weights regret bound to the target accuracy \gamma/2,

    \frac{Reg_n^{(1)}}{n} = C \sqrt{\frac{\log m}{n}} = \frac{\gamma}{2}, \quad \text{we find} \quad n = O\!\left( \frac{\log m}{\gamma^2} \right).        (14)

The typical analysis of AdaBoost claims exponentially decaying error, which is expressed as

    \frac{1}{m} \sum_{i=1}^m 1\{ y_i \bar{f}(x_i) \le 0 \} \le \exp\{ -c \gamma^2 n \}        (15)

after n iterations. The setting of (14) ensures that the upper bound is less than 1/m, and hence no errors are made (convince yourself of this fact!). That is precisely what we found by ensuring a margin of \gamma/2. Observe that k = |F| never enters the picture, since we are using the best response strategy for Player II.

1.4 From Online to Statistical Learning (individual sequence to i.i.d.)

Consider an abstract scenario where the learner chooses \hat{y}_t, observes z_t, and incurs a cost \ell(\hat{y}_t, z_t). We assume \ell is convex in the first argument and that we are able to prove

    \frac{1}{n} \sum_{t=1}^n \ell(\hat{y}_t, z_t) \le \min_v \frac{1}{n} \sum_{t=1}^n \ell(v, z_t) + \frac{Reg_n}{n}        (16)

for any sequence z_1, \ldots, z_n. We are not specifying the sets where the decisions and outcomes take values, as the technique we are about to describe is quite general. Now, suppose the data are actually i.i.d. with some distribution P. Then

    E\left[ \frac{1}{n} \sum_{t=1}^n \ell(\hat{y}_t, Z_t) \right] \le E\left[ \min_v \frac{1}{n} \sum_{t=1}^n \ell(v, Z_t) \right] + \frac{Reg_n}{n}        (17)

since the bound holds for each realization (Z_1, \ldots, Z_n). Let \bar{y} = \frac{1}{n} \sum_{t=1}^n \hat{y}_t.
Observe that

    E\,\ell(\bar{y}, Z) \le E\left[ \frac{1}{n} \sum_{t=1}^n \ell(\hat{y}_t, Z) \right]        (18)

by convexity, where the expectation is over both Z and Z_1, \ldots, Z_n. The expression in (18) can be written as

    \frac{1}{n} \sum_{t=1}^n E\left[ E\left[ \ell(\hat{y}_t(Z^{t-1}), Z) \mid Z^{t-1} \right] \right]        (19)

which is the same as

    \frac{1}{n} \sum_{t=1}^n E\left[ E\left[ \ell(\hat{y}_t(Z^{t-1}), Z_t) \mid Z^{t-1} \right] \right] = E\left[ \frac{1}{n} \sum_{t=1}^n \ell(\hat{y}_t, Z_t) \right].        (20)

On the other hand,

    E\left[ \min_v \frac{1}{n} \sum_{t=1}^n \ell(v, Z_t) \right] \le \min_v \frac{1}{n} \sum_{t=1}^n E\,\ell(v, Z_t) = \min_v E\,\ell(v, Z).

Putting everything together,

    E\,\ell(\bar{y}, Z) \le \min_v E\,\ell(v, Z) + \frac{Reg_n}{n}.        (21)

Such a bound is a goal of Statistical Learning Theory. Moreover, if \ell is the square loss, the above bound also translates into an exact oracle inequality, a subject of interest in Statistics. We see that a regret guarantee gives rise to a statistical guarantee for the average of the trajectory if we assume that the data are i.i.d.

In particular, consider the supervised learning scenario, where at each step we observe side information x_t, predict a real value \hat{y}_t, and observe y_t. We can think of \hat{y}_t as a function of x_t, since for different x_t values we would have made a different prediction. In other words, an equivalent formulation would be: on round t, predict a function \hat{y}_t : X \to \mathbb{R} and observe (x_t, y_t). Applying the earlier argument, if (X_1, Y_1), \ldots, (X_n, Y_n) are i.i.d., then the function

    \bar{y}(\cdot) = \frac{1}{n} \sum_{t=1}^n \hat{y}_t(\cdot)

has the guarantee

    E\,\ell(\bar{y}(X), Y) \le \min_{f \in F} E\,\ell(f(X), Y) + \frac{Reg_n}{n}        (22)

whenever \ell(\bar{y}(X), Y) is convex in \bar{y}(\cdot). While further development of this technique is beyond the scope of this course, let us mention that under certain assumptions, this way of constructing an estimator \bar{y}(\cdot) is optimal for a wide range of problems and loss functions, both in Nonparametric Statistics and Statistical Learning Theory. (Ask me if you would like to know more.)

Those who know the analysis of ERM (empirical risk minimization) in Statistical Learning may ask whether we have miraculously circumvented the issue of uniform convergence of means to expectations, a necessary step in the analysis of ERM. The answer is that we haven't: it is hidden in the regret bound Reg_n = Reg_n(F). More precisely, it can be shown that this term is controlled by a martingale version of uniform convergence [RST15]. In many cases of interest, this convergence is equivalent to i.i.d. convergence. Hence, by passing through online methods, we might be gaining in terms of algorithmic understanding, and (under certain conditions) not losing in terms of accuracy for i.i.d. data.
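To make the online-to-batch conversion concrete, here is a small sketch of our own (the distribution, loss, and parameters are invented for the example): projected online gradient descent runs once over an i.i.d. sample with the squared loss \ell(v, z) = \frac{1}{2}\|v - z\|^2, and the averaged iterate \bar{y} is returned as the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5000
mu = np.array([0.3, -0.2, 0.1, 0.25])            # E[Z]; risk minimizer over B_2
Z = mu + 0.3 * rng.uniform(-1, 1, size=(n, d))   # i.i.d. sample (hypothetical)

eta = 1.0 / np.sqrt(n)
y = np.zeros(d)
y_bar = np.zeros(d)
for z in Z:                        # single online pass over the data
    y_bar += y / n                 # average of the trajectory
    y = y - eta * (y - z)          # gradient step on l(., z) = 0.5||. - z||^2
    norm = np.linalg.norm(y)
    if norm > 1.0:                 # projection onto B_2
        y = y / norm

# Excess risk: E l(v, Z) = 0.5||v - mu||^2 + const, minimized at v = mu.
excess_risk = 0.5 * np.linalg.norm(y_bar - mu) ** 2
```

By (21), the excess risk of the averaged iterate is bounded by the normalized regret Reg_n / n, of order 1/\sqrt{n} here.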
1.5 Equivalence of deterministic regret bounds and martingale inequalities

The connections between online regret bounds and Statistics and Probability are even more intriguing than outlined above. Let us sketch one example. Let z_1, \ldots, z_n \in B_2 be an arbitrary sequence in the unit Euclidean ball. We construct the sequence of predictions by \hat{y}_1 = 0 and

    \hat{y}_{t+1} = \Pi_{B_2}(\hat{y}_t - \eta z_t),        (23)

which we can all recognize as gradient descent with projections \Pi_{B_2} onto B_2 and step size \eta = 1/\sqrt{n}. Earlier in the course we proved a regret bound of

    \sum_{t=1}^n \langle \hat{y}_t - v, z_t \rangle \le \sqrt{n}        (24)

for any z_1, \ldots, z_n \in B_2 and v \in B_2. Let us choose the optimal v and rearrange terms:

    \forall z_1, \ldots, z_n \in B_2, \quad \left\| \sum_{t=1}^n z_t \right\| \le \sqrt{n} - \sum_{t=1}^n \langle \hat{y}_t, z_t \rangle.        (25)

Now, apply this inequality to a realization of a martingale difference sequence Z_1, \ldots, Z_n (that is, E[Z_{t+1} \mid Z_1, \ldots, Z_t] = 0). Since the inequality holds for any realization,

    P\left( \left\| \sum_{t=1}^n Z_t \right\| \ge \sqrt{n} + u \right) \le P\left( -\sum_{t=1}^n \langle \hat{y}_t, Z_t \rangle \ge u \right)        (26)

for any u > 0. Since \hat{y}_t = \hat{y}_t(Z_1, \ldots, Z_{t-1}), the right-hand side concerns a sum of real-valued martingale differences with values in [-1, 1], and so it is upper bounded by \exp\{-u^2/(2n)\} via the Hoeffding-Azuma inequality. Therefore,

    P\left( \left\| \sum_{t=1}^n Z_t \right\| \ge \sqrt{n} + u \right) \le \exp\{ -u^2/(2n) \}.        (27)

One may pass from this statement in probability to an expected-value bound

    E\left\| \sum_{t=1}^n Z_t \right\| \le c\sqrt{n}        (28)

by integrating the tails in (27). Note that the inequality holds for all martingale difference sequences with values in B_2. Using the minimax theorem, one can show (we basically have done this already) that there exists a strategy with (25) for any sequence because (28) holds. That is, (25) implies (27), which implies (28), which implies (25) (with a worse constant). Hence, the statements are, in a certain sense, equivalent. One can, in fact (see the relevant paper [RS15]), improve the tail bounds (27) by taking a smarter gradient descent method with an adaptive step size, and, conversely, it is possible to derive new optimization methods by proving tail bounds for martingales.

This lecture was meant to give a brief overview of some interesting connections between online methods and the nearby areas of game theory, statistics, probability, and optimization.
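As a quick numerical illustration (our own sketch, not part of the lecture), one can check the deterministic inequality (25) on arbitrary sequences by running the projected gradient descent (23) and comparing the two sides; the random sequence below is invented for the example.

```python
import numpy as np

def check_inequality(zs):
    """Run gradient descent (23) on B_2 with step 1/sqrt(n).

    Returns (||sum_t z_t||, sqrt(n) - sum_t <y_t, z_t>); by the
    deterministic regret bound (24)-(25), the first is at most the second.
    """
    n, d = zs.shape
    eta = 1.0 / np.sqrt(n)
    y = np.zeros(d)                # y_1 = 0
    inner = 0.0
    for z in zs:
        inner += y @ z             # accumulate <y_t, z_t>
        y = y - eta * z            # gradient step
        norm = np.linalg.norm(y)
        if norm > 1.0:             # projection onto B_2
            y = y / norm
    return np.linalg.norm(zs.sum(axis=0)), np.sqrt(n) - inner

rng = np.random.default_rng(1)
n, d = 1000, 3
zs = rng.uniform(-1, 1, size=(n, d))
zs /= np.maximum(np.linalg.norm(zs, axis=1, keepdims=True), 1.0)  # force z_t into B_2
lhs, rhs = check_inequality(zs)
```

The inequality holds for every sequence in B_2, not just random ones; applied to martingale difference realizations, it yields the tail bound (27).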
References

[FS95] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory. Springer, 1995.

[RS15] Alexander Rakhlin and Karthik Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv preprint arXiv:1510.03925, 2015.

[RST15] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111-153, 2015.

[SF12] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
More information4. Partial Sums and the Central Limit Theorem
1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems
More informationRecurrence Relations
Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The
More information62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +
62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of
More information7.1 Convergence of sequences of random variables
Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite
More informationNYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)
NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we
More informationDiscrete Mathematics for CS Spring 2008 David Wagner Note 22
CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig
More informationBertrand s Postulate
Bertrad s Postulate Lola Thompso Ross Program July 3, 2009 Lola Thompso (Ross Program Bertrad s Postulate July 3, 2009 1 / 33 Bertrad s Postulate I ve said it oce ad I ll say it agai: There s always a
More information1 Review of Probability & Statistics
1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5
More informationLecture 15: Learning Theory: Concentration Inequalities
STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that
More informationMarkov Decision Processes
Markov Decisio Processes Defiitios; Statioary policies; Value improvemet algorithm, Policy improvemet algorithm, ad liear programmig for discouted cost ad average cost criteria. Markov Decisio Processes
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece
More informationChapter 6 Infinite Series
Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat
More informationNotes 5 : More on the a.s. convergence of sums
Notes 5 : More o the a.s. covergece of sums Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: Dur0, Sectios.5; Wil9, Sectio 4.7, Shi96, Sectio IV.4, Dur0, Sectio.. Radom series. Three-series
More informationREAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS
REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS 18th Feb, 016 Defiitio (Lipschitz fuctio). A fuctio f : R R is said to be Lipschitz if there exists a positive real umber c such that for ay x, y i the domai
More informationA Proof of Birkhoff s Ergodic Theorem
A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed
More information(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3
MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special
More information1 Inferential Methods for Correlation and Regression Analysis
1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet
More informationSieve Estimators: Consistency and Rates of Convergence
EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes
More informationApplication to Random Graphs
A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let
More information