6.883: Online Methods in Machine Learning Alexander Rakhlin

LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS

All the algorithms presented so far hallucinate the future values as random draws and then perform two evaluations of φ. In many situations, an easier method is available, one that does not require drawing the random variables. In fact, most of the methods encountered in the online learning literature can be seen as doing precisely this: getting rid of the random variables for future rounds.

One of the most famous scenarios in online learning is that of prediction with expert advice. There are several roughly equivalent formulations, but once you've seen one, you'll be able to modify the proof for the other. Consider the situation where on each round t = 1, ..., n, we observe advice of N experts. Suppose the advice comes in the form of a vector x_t ∈ [−1, 1]^N, and we think of x_t(i) as, say, the buy/sell advice by expert i. We treat x_t as side information for making our own decision. After seeing the advice, we decide on a mixed strategy ŷ_t ∈ Δ(N) (a distribution over the N experts) and make a prediction ⟨ŷ_t, x_t⟩ ∈ [−1, 1] by mixing the opinions according to ŷ_t. The outcome y_t ∈ {±1} is then revealed. Once we have the mean ⟨ŷ_t, x_t⟩ for the mixed strategy, we may either draw the actual binary-valued prediction from this distribution, or we may simply think of (1/2)|y_t − ⟨ŷ_t, x_t⟩| = (1/2)(1 − ⟨ŷ_t, y_t x_t⟩) as the expected indicator loss of our strategy (see the collaborative filtering example). What is different here from previous lectures is that our decision variable ŷ_t is not a real number, but a distribution. The goal of the learner is to incur small average loss

    (1/n) ∑_{t=1}^n |y_t − ⟨ŷ_t, x_t⟩|.     (1)

As it turns out, a simple algorithm allows the learner to keep this average loss not much worse than the loss of the best expert, without knowing who the best is until the end. In particular, we prove that

Lemma 1. There is an algorithm (in fact, several distinct methods) that guarantees

    (1/n) ∑_{t=1}^n |y_t − ⟨ŷ_t, x_t⟩| ≤ min_{j∈[N]} (1/n) ∑_{t=1}^n |y_t − x_t(j)| + c √(log N / n)     (2)

for any sequence (x_1, y_1), ..., (x_n, y_n). As an example, this bound (with c = √8) is attained by the exponential weights algorithm

    ŷ_t(j) = exp{η ∑_{s=1}^{t−1} y_s x_s(j)} / ∑_{j'=1}^N exp{η ∑_{s=1}^{t−1} y_s x_s(j')}     (3)

with a step size η = √(log N / (2n)).
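To make the protocol and the update (3) concrete, here is a minimal Python sketch of the exponential weights forecaster for this setting. The function name and the synthetic data are mine (illustrative only, not from the notes); the update follows (3) with the step size η = √(log N / (2n)) above.

```python
import numpy as np

def exponential_weights(x_seq, y_seq, eta):
    """Prediction with expert advice via exponential weights, cf. (3).

    x_seq: (n, N) array of expert advice in [-1, 1]
    y_seq: (n,) array of outcomes in {-1, +1}
    Returns the average absolute loss of the forecaster and of the best expert.
    """
    n, N = x_seq.shape
    cum_gain = np.zeros(N)            # running sums  sum_{s<t} y_s x_s(j)
    learner_loss = 0.0
    for t in range(n):
        # weights proportional to exp(eta * cumulative gain); subtract max for numerical stability
        logits = eta * cum_gain
        w = np.exp(logits - logits.max())
        y_hat = w / w.sum()                       # mixed strategy ŷ_t in the simplex Δ(N)
        pred = float(y_hat @ x_seq[t])            # ⟨ŷ_t, x_t⟩ ∈ [-1, 1]
        learner_loss += abs(y_seq[t] - pred)      # absolute loss, as in (1)
        cum_gain += y_seq[t] * x_seq[t]           # update only after y_t is revealed
    best_expert_loss = np.min(np.abs(y_seq[:, None] - x_seq).sum(axis=0))
    return learner_loss / n, best_expert_loss / n

# illustrative run on synthetic data
rng = np.random.default_rng(0)
n, N = 1000, 20
x = rng.choice([-1.0, 1.0], size=(n, N))
y = np.sign(x[:, 0] + 0.5 * rng.standard_normal(n))   # expert 0 is informative
eta = np.sqrt(np.log(N) / (2 * n))
avg_loss, best_loss = exponential_weights(x, y, eta)
print(avg_loss, best_loss, best_loss + np.sqrt(8 * np.log(N) / n))
```

By Lemma 1, the first printed number should exceed the second by no more than √(8 log N / n), on any sequence and not just this random one.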

We will present two very similar proofs. After proving the Lemma, we will re-do the proof in the slightly simpler transductive setting through the lens of Cover's statement. It's instructive to look at both proofs and see the few small differences.

Both proofs will utilize the following inequalities. First is the soft-max bound. Choose a parameter η > 0 and let A_1, ..., A_N be real numbers. We then have

    max_{j∈[N]} A_j = (1/η) max_j η A_j = (1/η) log exp{max_j η A_j} = (1/η) log max_j exp{η A_j} ≤ (1/η) log ∑_{j=1}^N exp{η A_j}.

There is only one inequality: between the maximum over j and the soft-max function. Suppose all A_j are equal. Then the right-hand side is larger than the left-hand side by an additive (log N)/η factor (verify this!). As η increases, the gap between the two sides vanishes. The same can be argued for the case when the values are not equal. In fact, the last upper bound becomes an equality if we are allowed to take η → ∞.

The second inequality we use is

    (1/2)(e^x + e^{−x}) ≤ e^{x²/2},

which you can prove via Taylor expansions. The inequality implies

    E e^{λε} ≤ e^{λ²/2}     (4)

for the Rademacher random variable ε and a constant λ ∈ R. The same bound holds for any zero-mean random variable Z with values in [−1, 1]:

    E e^{λZ} ≤ e^{λ²/2},     (5)

and it is immediate that the bound becomes e^{b²λ²/2} for [−b, b]-valued Z.

1.1 Proof of Lemma 1

Thanks to the identity |a − b| = 1 − ab, which holds for a ∈ {±1} and b ∈ [−1, 1], we may rewrite the difference between the loss of the algorithm and the benchmark in (2) as

    (1/n) [ ∑_{t=1}^n (−y_t ⟨ŷ_t, x_t⟩) − min_{j∈[N]} ∑_{t=1}^n (−y_t x_t(j)) ].     (6)

Let us omit the fraction 1/n and bring it back at the very end. When comparing to the proof in the next section, just insert this fraction throughout.
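Before continuing with the proof, here is a quick numerical sanity check of the two inequalities above (the soft-max bound and (4)/(5)); it is purely illustrative and not part of the argument, with data and parameter values chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)

# Soft-max bound: max_j A_j <= (1/eta) log sum_j exp(eta * A_j),
# with the gap equal to (log N)/eta when all A_j coincide.
A = rng.standard_normal(10)
for eta in [0.5, 2.0, 10.0, 50.0]:
    softmax = np.log(np.sum(np.exp(eta * A))) / eta
    assert A.max() <= softmax + 1e-12
    print(eta, softmax - A.max())          # the gap shrinks as eta grows

# Inequality (5): E exp(lambda * Z) <= exp(lambda^2 / 2) for zero-mean Z in [-1, 1].
lam = 1.3
Z = rng.uniform(-1, 1, size=2_000_000)     # zero-mean, [-1, 1]-valued
print(np.mean(np.exp(lam * Z)), np.exp(lam**2 / 2))
```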

Consider the last step t = n. In the first sum, all the terms except the last one are fixed, and so we need to solve

    min_{ŷ_n ∈ Δ(N)} max_{y_n ∈ {±1}} { −y_n ⟨ŷ_n, x_n⟩ + Rel(x_{1:n}, y_{1:n}) }     (7)

with

    Rel(x_{1:n}, y_{1:n}) ≥ − min_{j∈[N]} ∑_{t=1}^n (−y_t x_t(j)) = max_{j∈[N]} ∑_{t=1}^n y_t x_t(j).     (8)

Here Δ(N) is the probability simplex on N experts. We could choose Rel to be equal to the right-hand side in (8). However, for computational purposes, we slightly modify this function. Instead of max we shall work with soft-max. That is, take

    Rel(x_{1:n}, y_{1:n}) = (1/η) log ∑_{j=1}^N exp{η ∑_{t=1}^n y_t x_t(j)}

for some η, to be determined. The algorithm (3), in the form (6), can be written as

    ŷ_t(j) ∝ exp{η ∑_{s=1}^{t−1} y_s x_s(j)},     (9)

while the loss on round t is (trivially)

    −y_t ⟨ŷ_t, x_t⟩ = −E_{j∼ŷ_t}[y_t x_t(j)] = (1/η) log exp{−η E_{j∼ŷ_t}[y_t x_t(j)]}.

The key observation now is that

    ∑_{j=1}^N exp{η ∑_{t=1}^n y_t x_t(j)} = ∑_{j=1}^N exp{η y_n x_n(j)} exp{η ∑_{t=1}^{n−1} y_t x_t(j)}     (10)
                                          = ( ∑_{j=1}^N exp{η ∑_{t=1}^{n−1} y_t x_t(j)} ) · E_{j∼ŷ_n}[exp{η y_n x_n(j)}]     (11)

because E_{j∼ŷ_n}[A(j)] = ∑_{j=1}^N ŷ_n(j) A(j) with

    ŷ_n(j) = exp{η ∑_{t=1}^{n−1} y_t x_t(j)} / ∑_{j'=1}^N exp{η ∑_{t=1}^{n−1} y_t x_t(j')}.     (12)

Putting everything together, (7) is upper bounded by the particular choice of the infimum strategy ŷ_n as

    (7) ≤ max_{y_n∈{±1}} { −y_n ⟨ŷ_n, x_n⟩ + Rel(x_{1:n}, y_{1:n}) }     (13)
        = max_{y_n∈{±1}} { (1/η) log exp{−η E_{j∼ŷ_n}[y_n x_n(j)]} + (1/η) log E_{j∼ŷ_n}[exp{η y_n x_n(j)}] }     (14)
          + (1/η) log ∑_{j=1}^N exp{η ∑_{t=1}^{n−1} y_t x_t(j)}.     (15)

We now focus on the two terms in (14):

    (1/η) log exp{−η E_{j∼ŷ_n}[y_n x_n(j)]} + (1/η) log E_{j∼ŷ_n}[exp{η y_n x_n(j)}]     (16)
    = (1/η) log E_{j∼ŷ_n}[exp{η (y_n x_n(j) − E_{j∼ŷ_n}[y_n x_n(j)])}]     (17)
    ≤ (1/η) · (2η)²/2 = 2η     (18)

by (5).
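Both the factorization (10)-(11) and the per-step bound (16)-(18) are easy to verify numerically. The short check below is illustrative only (random data, my own variable names), not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, eta = 50, 8, 0.3
x = rng.uniform(-1, 1, size=(n, N))
y = rng.choice([-1, 1], size=n)
S = (y[:, None] * x).cumsum(axis=0)        # S[t, j] = sum_{s<=t} y_s x_s(j)

# ŷ_n from (12), built on rounds 1..n-1
w = np.exp(eta * S[-2])
y_hat_n = w / w.sum()

# Factorization (11): sum_j exp(eta S_n(j)) = (sum_j exp(eta S_{n-1}(j))) * E_{j~ŷ_n} exp(eta y_n x_n(j))
lhs = np.sum(np.exp(eta * S[-1]))
rhs = np.sum(np.exp(eta * S[-2])) * np.sum(y_hat_n * np.exp(eta * y[-1] * x[-1]))
print(lhs, rhs)                             # equal up to floating-point error

# Bound (16)-(18): (1/eta) log E exp(eta (Z - E Z)) <= 2*eta, where Z(j) = y_n x_n(j) ∈ [-1, 1]
Z = y[-1] * x[-1]
centered = Z - np.sum(y_hat_n * Z)
lhs18 = np.log(np.sum(y_hat_n * np.exp(eta * centered))) / eta
print(lhs18, 2 * eta)
```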

Note that the range of the zero-mean random variable is [−2, 2], and so an additional factor of 2² appears from the application of (5). Observe that this last step of peeling off the zero-mean term makes the expression independent of y_n and x_n! In particular, it does not matter whether the sequence of x's is generated i.i.d. or in an arbitrary manner. Unlike Cover's approach of solving the max over the two alternatives, we presented a particular ŷ_t that allows (through an upper bound) to make the choice y_n irrelevant. While the two approaches give slightly different algorithms, the upper bounds they enjoy are the same. Now, we simply define

    Rel(x_{1:n−1}, y_{1:n−1}) = (1/η) log ∑_{j=1}^N exp{η ∑_{t=1}^{n−1} y_t x_t(j)} + 2η     (19)

and

    Rel(x_{1:t}, y_{1:t}) = (1/η) log ∑_{j=1}^N exp{η ∑_{s=1}^t y_s x_s(j)} + 2(n − t)η.     (20)

Of course,

    Rel(∅) = (log N)/η + 2ηn = 2√(2 n log N)     (21)

by choosing η = √(log N / (2n)). Now, since we initially omitted the fraction 1/n, dividing through by n gives the bound of the lemma.

1.2 (Slightly easier) transductive setting through the lens of Cover's algorithm

Consider the simplified setting where the expert advice x_1, ..., x_n ∈ {±1}^N is fixed and known a priori, and let F = {x ↦ x(j) : j ∈ [N]} be the set of N functions that simply output a coordinate of x. As discussed earlier, F induces a subset F ⊆ {±1}^n of finite cardinality at most N,

    F = {y_{1:n} : y_{1:n} = (x_1(j), ..., x_n(j)), j ∈ [N]},

and

    φ(y_{1:n}) = (1/n) d_H(y_{1:n}, F) + C = min_{f∈F} (1/n) ∑_{t=1}^n 1{f_t ≠ y_t} + C     (22)

for some appropriate C which we will define later. In this section, we will directly solve for the real-valued prediction q_t ∈ [−1, 1], which can be viewed as the mean of the mixed strategy for predicting ŷ_t ∈ {±1}. This is slightly different from what was described earlier, where the prediction is calculated by mixing the advice as ⟨ŷ_t, x_t⟩. We will solve the latter directly in the next section. Recall that the choice of relaxation Rel defines the algorithm. In this section we give the derivation using the very basic technique that goes back to Lecture 1. The algorithm that arises from Cover's lemma is not exponential weights, but it gives the same guarantee on performance as the exponential weights method.
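To make the transductive setup concrete, the snippet below builds the induced set F ⊆ {±1}^n from fixed advice and evaluates the potential φ of (22). It is an illustration with made-up data; the constant C is kept as a free parameter since it is only pinned down later in the derivation.

```python
import numpy as np

def induced_class(x_seq):
    """Columns of the advice matrix, viewed as the induced set F: one ±1 vector per expert."""
    return [x_seq[:, j].copy() for j in range(x_seq.shape[1])]

def phi(y, F, C):
    """Potential (22): normalized Hamming distance from y to the class, plus a constant C."""
    n = len(y)
    return min(np.sum(f != y) for f in F) / n + C

# illustrative data: n rounds, N experts with fixed ±1 advice
rng = np.random.default_rng(3)
n, N = 100, 16
x_seq = rng.choice([-1, 1], size=(n, N))
F = induced_class(x_seq)
y = rng.choice([-1, 1], size=n)
# one plausible choice of C, matching the scale of (31); illustration only
print(phi(y, F, C=np.sqrt(np.log(N) / (2 * n))))
```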

Let us take Rel(y_{1:n}) to be any upper bound on the maximum appearing in the benchmark term,

    min_{f∈F} (1/n) ∑_{t=1}^n 1{f_t ≠ y_t} = 1/2 − (1/(2n)) max_{f∈F} ⟨f, y_{1:n}⟩.     (23)

We will use the soft-max upper bound (verify that it holds):

    Rel(y_{1:n}) = (1/(2ηn)) log ∑_{f∈F} exp{η ⟨f, y_{1:n}⟩} ≥ (1/(2n)) max_{f∈F} ⟨f, y_{1:n}⟩.     (24)

Check that this function does not change by more than 1/n when flipping one bit. Now, as before,

    min_{q_n ∈ [−1,1]} max_{y_n ∈ {±1}} { (1/n) E[1{ŷ_n ≠ y_n}] + Rel(y_{1:n}) } = E_{ε_n} Rel(y_{1:n−1}, ε_n) + 1/(2n).     (25)

By Jensen's inequality (E log ≤ log E),

    E_{ε_n} Rel(y_{1:n−1}, ε_n) ≤ (1/(2ηn)) log E_{ε_n} ∑_{f∈F} exp{η ⟨f, ỹ⟩}     (26)

with ỹ = (y_{1:n−1}, ε_n). The only randomness in the above expression is the ε_n on the last coordinate of ỹ. Let us abuse the notation and write ⟨f, ỹ⟩ = ⟨f, y_{1:n−1}⟩ + ε_n f_n. Our aim is to get rid of ε_n. If we succeed, we do not have to draw the random coin flips for random playout at the intermediate steps. By (4),

    E_{ε_n} exp{η ⟨f, ỹ⟩} ≤ exp{η ⟨f, y_{1:n−1}⟩} exp{η²/2}     (27)

and, therefore, (26) is upper bounded by

    (1/(2ηn)) log ∑_{f∈F} exp{η ⟨f, y_{1:n−1}⟩} + η/(4n).     (28)

In view of (25), we can now define

    Rel(y_{1:n−1}) = (1/(2ηn)) log ∑_{f∈F} exp{η ⟨f, y_{1:n−1}⟩} + η/(4n).     (29)

That's it! There is no ε_n in the relaxation at time n−1. We peeled it off. One can check that at the intermediate step t,

    Rel(y_{1:t}) = (1/(2ηn)) log ∑_{f∈F} exp{η ⟨f_{1:t}, y_{1:t}⟩} + (n − t) η/(4n)     (30)

with ⟨f_{1:t}, y_{1:t}⟩ being defined as ∑_{s=1}^t f_s y_s. We also see that

    Rel(∅) = (1/(2ηn)) log ∑_{f∈F} exp{0} + η/4 = (log N)/(2ηn) + η/4 = √(log N / (2n))     (31)

by choosing η = √(2 log N / n). This is a non-algorithmic derivation, and the algorithm is given in Lecture 1. We leave it as a homework exercise to write it explicitly (hint: it does not become exponential weights). We also note that the difference in the constant c comes from scaling of the indicator loss vs the absolute value loss.
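The stability property invoked around (24)-(25) (the relaxation moves by at most 1/n when a single bit of y is flipped) is easy to check numerically. The snippet below is an illustration with random data and my own helper names.

```python
import numpy as np

def rel(y, F, eta, n):
    """Soft-max relaxation (1/(2*eta*n)) * log sum_f exp(eta * <f, y>), cf. (24) and (30)."""
    vals = np.array([eta * float(np.dot(f, y)) for f in F])
    m = vals.max()
    return (m + np.log(np.sum(np.exp(vals - m)))) / (2 * eta * n)   # stable log-sum-exp

rng = np.random.default_rng(4)
n, N = 60, 12
F = [rng.choice([-1, 1], size=n) for _ in range(N)]   # induced class: one sign vector per expert
eta = np.sqrt(2 * np.log(N) / n)

y = rng.choice([-1, 1], size=n).astype(float)
base = rel(y, F, eta, n)
max_change = 0.0
for t in range(n):
    y_flip = y.copy()
    y_flip[t] = -y_flip[t]
    max_change = max(max_change, abs(rel(y_flip, F, eta, n) - base))
print(max_change, 1.0 / n)    # the observed change never exceeds 1/n
```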

1.3 Discussion

The two proofs are essentially the same. Both start by relaxing the max to a soft-max, and taking this as Rel. Then, the second approach explicitly solves for the optimal real-valued prediction (the mean of the mixed strategy), while the first approach guesses a (potentially suboptimal) strategy of exponential weights. Once the strategy for the infimum is plugged in, one obtains an expression with a zero-mean random variable. This zero-mean variable is eliminated using a probabilistic inequality (Eq. (17) and (27), respectively). To reiterate, the salient features of the proofs are: (1) passing to a relaxation Rel, (2) solving for the best strategy or guessing a near-best strategy, and (3) using probabilistic inequalities to remove the random variable that arises from plugging in the strategy. These steps can be taken as a rough prescription for the development of online methods. We will illustrate the steps again in the subsequent lectures.

The next note is on the nature of the sequence x_1, ..., x_n. Essentially, both approaches make it irrelevant how the x_t's are generated. That is not to say that the method does not take the side information into account (of course it does, through the losses of the experts). Rather, the point is that we can successfully deal with adversarially generated x's, a strength of the experts approach. Another strength of the experts bound is its mild (logarithmic) dependence on the number of experts. One may take a large number of experts and still have an average error that is o(1) from the average error of the best expert.

Finally, we remark that the experts approach can be seen as a union bound or an aggregation procedure. Suppose one has N algorithms making predictions. Then one can predict as well as any of these algorithms by paying O(√(log N / n)). Such a black-box technique is very useful (see the version of linearized experts below for the general black-box statement). For instance, suppose one does not know how to choose a parameter θ ∈ [0, 1] of the algorithm optimally. One can then run (at least in principle) N = 1/ε algorithms corresponding to an ε-discretization of the parameter. If the output is in some sense Lipschitz with respect to the parameter choice, one can claim that the resulting aggregating procedure does as well as the best choice, plus an ε-precision term, plus an O(√(log(1/ε) / n)) penalty.

1.4 Linearized Experts

Recall that in the experts setting introduced in the beginning of the lecture, we observe predictions x_t ∈ [−1, 1]^N of the experts, choose a distribution ŷ_t ∈ Δ(N) for the weighted vote, and then observe the outcome y_t ∈ {±1}. Observe that the exponential weights algorithm at time t does not use x_t to calculate the distribution over experts. Hence, we may think of a setting where we choose a distribution ŷ_t ∈ Δ(N) and then observe both the predictions x_t and the true outcome y_t. Rather than mixing the advice of the experts to produce our own, we may instead choose an expert at random from the distribution ŷ_t and go with her advice. Then the expected cost for period t is ⟨ŷ_t, z_t⟩, where z_t ∈ [−1, 1]^N is the vector of losses of the experts: z_t(j) = −y_t x_t(j). In fact, the loss function does not matter anymore, and it does not matter that the data comes in the form of (x_t, y_t) pairs.

Instead, we may just think of each expert incurring some cost: we choose an expert at random and incur the same cost as that expert. In expectation, we pay ⟨ŷ_t, z_t⟩. Let us state the protocol explicitly:

    For t = 1, ..., n:
        Predict ŷ_t ∈ Δ(N)
        Observe costs z_t ∈ [−1, 1]^N

Alternatively, we may choose a random expert j according to ŷ_t and pay z_t(j). The goal here is to have small expected cost, relative to the cost of the best expert:

    (1/n) ∑_{t=1}^n ⟨ŷ_t, z_t⟩ ≤ min_{j∈[N]} (1/n) ∑_{t=1}^n ⟨e_j, z_t⟩ + c √(log N / n)     (32)

for any sequence z_1, ..., z_n of costs. The cost z_t may be chosen even with the knowledge of our decision ŷ_t. Let us quickly prove that the exponential weights algorithm

    ŷ_t(j) ∝ exp{−η ∑_{s=1}^{t−1} z_s(j)}

achieves the above guarantee. We write the last step of the problem (removing the 1/n normalization term) as

    min_{ŷ_n ∈ Δ(N)} max_{z_n ∈ [−1,1]^N} { ⟨ŷ_n, z_n⟩ + Rel(z_{1:n}) }     (33)

with

    − min_{j∈[N]} ∑_{t=1}^n z_t(j) = max_{j∈[N]} ∑_{t=1}^n (−z_t(j)) ≤ (1/η) log ∑_{j=1}^N exp{−η ∑_{t=1}^n z_t(j)} ≜ Rel(z_{1:n}).     (34)

The rest of the proof of (32) is essentially identical to that of the proof of Lemma 1.
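To close the loop with the aggregation discussion in Section 1.3, here is a minimal sketch of the linearized experts update ŷ_t(j) ∝ exp{−η ∑_{s<t} z_s(j)} used as a black box over an ε-grid of a tuning parameter θ. The base algorithm, its cost function, and all names here are hypothetical placeholders, not anything from the notes.

```python
import numpy as np

def hedge_over_grid(cost_fn, thetas, n, eta):
    """Linearized experts: one 'expert' per candidate value of θ, costs z_t(j) ∈ [-1, 1].

    cost_fn(t, theta) returns the round-t cost of running the (hypothetical) base
    algorithm with parameter theta; it is revealed only after ŷ_t is committed.
    """
    N = len(thetas)
    cum_cost = np.zeros(N)
    total_expected_cost = 0.0
    for t in range(n):
        logits = -eta * cum_cost
        w = np.exp(logits - logits.max())
        y_hat = w / w.sum()                                  # distribution ŷ_t over the grid
        z_t = np.array([cost_fn(t, th) for th in thetas])    # observed cost vector
        total_expected_cost += float(y_hat @ z_t)
        cum_cost += z_t
    return total_expected_cost / n, cum_cost.min() / n       # learner vs best fixed θ on the grid

# illustrative use: ε-discretization of θ ∈ [0, 1], synthetic costs in [-1, 1]
eps, n = 0.05, 2000
thetas = np.arange(0.0, 1.0 + 1e-12, eps)
eta = np.sqrt(np.log(len(thetas)) / (2 * n))
rng = np.random.default_rng(5)
noise = rng.uniform(-0.2, 0.2, size=n)
cost = lambda t, th: np.clip((th - 0.3) ** 2 - 0.5 + noise[t], -1, 1)   # made-up cost, Lipschitz in θ
print(hedge_over_grid(cost, thetas, n, eta))
```

By (32), the learner's average cost exceeds that of the best grid point by at most about c√(log(1/ε)/n), which is exactly the aggregation penalty mentioned in Section 1.3.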
