6.883: Online Methods in Machine Learning Alexander Rakhlin

LECTURE 1

0.1 Preamble

This course will focus on online methods in machine learning. Roughly speaking, online methods are those that process one datum at a time. This is a somewhat vague definition, but the scope of methods will become clear as we go through the course. We will focus on the algorithmic aspect ("how it works") and the theoretical aspect ("why it works"). We will also discuss applications, and you will get to implement the methods for homework/projects. For the more application-oriented students there will be plenty of opportunities to try things out in practice; for the more theoretically-inclined, there will be plenty of new mathematical ideas and research directions.

We can divide the scope of online methods into those that have to process one datum at a time because there is too much data to fit into memory, and those where the application is itself inherently online. We will cover both of these and draw many connections between them. While the first group of online methods is often covered in machine learning classes (e.g. run stochastic gradient descent on a neural network), the second is more rare. In this second part, we shall focus on very recent developments in online learning and develop powerful tools for creating new algorithms (that may or may not look like some form of gradient descent). With these new tools we will be able to derive and implement methods for prediction in social networks, ad placement, and so forth.

0.2 Bit prediction

In the 1950's, David Hagelbarger [Hag56] at Bell Labs built the so-called "mind reading machine" to play the game of matching pennies. His colleague and friend, Claude Shannon [Sha53], built a simplified version (which is currently in the MIT Museum's storage facility in Somerville [Pou14]). According to some accounts, the machine was consistently predicting the sequence of bits produced by untrained players better than 50% of the time. There is a modern version of this machine on the web, by Yoav Freund and colleagues. Try to beat it!

Of course, the only reason the machine might be able to predict the sequence entered by a human is that these sequences tend to be nonrandom. How can we capture the structure in the sequence and exploit it to our advantage? (If you are coming from information theory, you might be familiar with the close connection between predictability of a sequence and the ability to compress it.)

Let us try to formalize the problem. Even in this simple setting there will be a few surprises. Let $y = (y_1, \ldots, y_n) \in \{\pm 1\}^n$ be a sequence of (signed) bits.

A deterministic prediction strategy can be written as $A = (\hat{y}_1, \ldots, \hat{y}_n)$, with $\hat{y}_t = \hat{y}_t(y_1, \ldots, y_{t-1}) \in \{\pm 1\}$. If we employ the deterministic strategy $\hat{y}$ on $(y_1, \ldots, y_n)$, we make

$$\frac{1}{n}\sum_{t=1}^{n} \mathbb{1}\{\hat{y}_t \neq y_t\}$$

average number of mistakes. You should be able to convince yourself that for any such deterministic strategy there exists a sequence on which the strategy makes a mistake on every round, thus incurring an average loss of 1. This upsetting issue, however, is fixed by considering randomized strategies, as we show next.

A randomized algorithm will be denoted by $A = (q_1, \ldots, q_n)$, with $q_t = q_t(y_1, \ldots, y_{t-1}) \in [-1, 1]$. Each value $q_t$ is to be understood as parametrizing the mean of a distribution on $\{\pm 1\}$. For any sequence $y_1, \ldots, y_n$, we now consider the expected value of the loss

$$\ell(A; y_1, \ldots, y_n) = \mathbb{E}\left[\frac{1}{n}\sum_{t=1}^{n} \mathbb{1}\{\hat{y}_t \neq y_t\}\right],$$

where the expectation is over the randomization of the algorithm. The value $\ell(A; y_1, \ldots, y_n)$ tells us how many mistakes, on average, the randomized algorithm $A$ is expected to make on the given sequence. An algorithm that guesses randomly will incur an average loss of $1/2$ on any sequence, and so there is no longer a single bad sequence for the algorithm.

Can $\ell(A; y_1, \ldots, y_n)$ always be smaller than 50%? That is, can we find a super-algorithm that will predict all sequences better than a random guess? Let's take a look. Let $\epsilon_1, \ldots, \epsilon_n$ denote independent Rademacher random variables (unbiased coin flips). Then

$$\mathbb{E}_{\epsilon_1, \ldots, \epsilon_n}\, \ell(A; \epsilon_1, \ldots, \epsilon_n) = \sum_{(y_1, \ldots, y_n)} 2^{-n}\, \ell(A; y_1, \ldots, y_n) = \frac{1}{2}.$$

Convince yourself of all these steps. Conclusion? Any algorithm will necessarily pay a larger expected loss on some sequences and a smaller one on others, so that, on average over all sequences, the loss is 50%.
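Before moving on, we can sanity-check this averaging claim numerically. The following is a minimal Python sketch (an illustration added here, not part of the original notes): it enumerates all $2^n$ sequences for a small $n$, computes the expected loss of a randomized strategy on each, and confirms that the uniform average is exactly $1/2$. The particular majority-vote rule below is a placeholder assumption; any $q_t$ gives the same average.

```python
import itertools

n = 8

def q_t(history):
    """A placeholder randomized strategy, specified by its mean q_t in [-1, 1]:
    lean towards the majority bit of the history seen so far."""
    s = sum(history)
    return 0.0 if s == 0 else (1.0 if s > 0 else -1.0)

def expected_loss(y):
    """Expected average number of mistakes on the sequence y. Since
    P(yhat_t = +1) = (1 + q_t)/2, we get E[1{yhat_t != y_t}] = (1 - q_t*y_t)/2."""
    return sum((1 - q_t(y[:t]) * y[t]) / 2 for t in range(n)) / n

# Average the expected loss over all 2^n sequences: exactly 1/2,
# regardless of the strategy plugged in above.
avg = sum(expected_loss(y) for y in itertools.product([-1, 1], repeat=n)) / 2 ** n
print(avg)  # 0.5
```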

This sounds like a simple negative result, but we can put a positive spin on it. If we care about particular sequences (e.g. those produced by humans), we may try to develop an algorithm $A$ that has a small expected number of mistakes on these chosen sequences, and a large number of mistakes on the ones we do not expect to encounter. Of course, the next question is: how does one construct a good algorithm, given that we have some idea of the sequences we will likely encounter in practice?

Let $\phi(y_1, \ldots, y_n)$ denote the target expected value we would like to achieve on the sequence $(y_1, \ldots, y_n)$, and let us specify the values of $\phi$ on all such sequences. We may think of $\phi$ as a function on the binary hypercube $\{\pm 1\}^n$. Let us also posit that $\phi$ does not change too fast on nearby vertices:

$$\left|\phi(\ldots, +1, \ldots) - \phi(\ldots, -1, \ldots)\right| \leq \frac{1}{n}. \qquad (1)$$

If $\mathbb{E}\phi(\epsilon_1, \ldots, \epsilon_n) < 1/2$, no algorithm can achieve $\ell(A; y_1, \ldots, y_n) = \phi(y_1, \ldots, y_n)$ for all sequences; this follows from the previous argument. But the interesting question is whether we can achieve any $\phi$ with $\mathbb{E}\phi \geq 1/2$. This simple but somewhat surprising result is stated in [Cov65].

Lemma 1. Let $\phi: \{\pm 1\}^n \to \mathbb{R}$ be such that (1) holds and $\mathbb{E}\phi = 1/2$ under the uniform distribution on $\{\pm 1\}^n$. Then there exists a randomized algorithm $A$ with

$$\forall (y_1, \ldots, y_n), \qquad \ell(A; y_1, \ldots, y_n) = \phi(y_1, \ldots, y_n).$$

Proof. Define the following shorthand to ease the proof a bit:

$$\phi_t(y_1, \ldots, y_t) = \mathbb{E}\, \phi(y_1, \ldots, y_t, \epsilon_{t+1}, \ldots, \epsilon_n),$$

where the expectation is over $\epsilon_{t+1}, \ldots, \epsilon_n$; we also abbreviate $y^t = (y_1, \ldots, y_t)$, so that $\phi_n = \phi$ and $\phi_0 = \mathbb{E}\phi = 1/2$. Showing $\ell(A; y_1, \ldots, y_n) = \phi(y_1, \ldots, y_n)$ is equivalent to exhibiting an algorithm with

$$\mathbb{E}\left[\sum_{t=1}^{n} \left(\frac{1}{n}\mathbb{1}\{\hat{y}_t \neq y_t\} + \phi_{t-1}(y^{t-1}) - \phi_t(y^t) - \frac{1}{2n}\right)\right] = 0 \qquad (2)$$

(you should check this by simplifying the expression and using the assumptions of the lemma). To show (2), we start from the end and define the algorithm $A$ on round $n$ by

$$q_n = n\left(\phi(y_1, \ldots, y_{n-1}, -1) - \phi(y_1, \ldots, y_{n-1}, +1)\right).$$

Writing $\mathbb{1}\{a \neq b\} = \frac{1}{2}(1 - ab)$ for $a, b \in \{\pm 1\}$ (a useful representation to keep in mind for the rest of the course), and using linearity of expectation, the expected loss on round $n$ is

$$\mathbb{E}\left[\mathbb{1}\{\hat{y}_n \neq y_n\}\right] = \frac{1}{2}(1 - q_n y_n). \qquad (3)$$

We now peel off the last term from the sum in (2):

$$\frac{1}{n}\mathbb{E}\left[\mathbb{1}\{\hat{y}_n \neq y_n\}\right] + \phi_{n-1}(y^{n-1}) - \phi_n(y^n) - \frac{1}{2n} = \phi_{n-1}(y^{n-1}) - \frac{1}{2}\left(\phi(y_1, \ldots, y_{n-1}, +1) + \phi(y_1, \ldots, y_{n-1}, -1)\right) = 0. \qquad (4)$$

We also remark that the choice of $q_n$ equalizes the value of the left-hand side of (4) over the two possibilities $y_n = +1$ and $y_n = -1$. Such a strategy is called an equalizer. For the intermediate steps, the algorithm takes the form

$$q_t(y_1, \ldots, y_{t-1}) = n\left(\phi_t(y_1, \ldots, y_{t-1}, -1) - \phi_t(y_1, \ldots, y_{t-1}, +1)\right). \qquad (5)$$

Continuing in this fashion from $n$ backwards to $t = 1$ proves the claim.

We may interpret $\phi$ as a potential function. Then the algorithm at time $t$ is saying: the prediction should be biased towards that label whose potential (under a random evolution of the future) is higher. What is especially interesting, we are justified (by the fact that our method is optimal) in using coin flips $\epsilon_{t+1}, \ldots, \epsilon_n$ for the future, even though the sequence $y_1, \ldots, y_t$ will in reality continue to evolve in an arbitrary way on which we place no stochastic assumptions. This is a fortuitous consequence of our choice of an equalizer strategy. A trivial corollary is:

Corollary 2. For any $\phi: \{\pm 1\}^n \to \mathbb{R}$ satisfying (1) and $\mathbb{E}\phi \geq 1/2$, there exists a randomized algorithm $A$ making (on average over its randomization) a number of mistakes no more than $\phi(y_1, \ldots, y_n)$, for any sequence:

$$\forall (y_1, \ldots, y_n), \qquad \ell(A; y_1, \ldots, y_n) \leq \phi(y_1, \ldots, y_n).$$
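The proof is constructive, and the strategy (5) is easy to simulate. Here is a minimal Python sketch (an illustration, not code from the notes), which estimates $\phi_t$ by Monte Carlo over random futures; the sample size and the sanity-check potential below are assumptions of the sketch.

```python
import random

def phi_t(prefix, n, phi, samples=2000):
    """Monte Carlo estimate of phi_t(y_1..y_t) = E phi(y_1..y_t, eps_{t+1}..eps_n),
    averaging phi over uniformly random completions of the prefix."""
    total = 0.0
    for _ in range(samples):
        future = [random.choice((-1, 1)) for _ in range(n - len(prefix))]
        total += phi(prefix + future)
    return total / samples

def equalizer_predict(history, n, phi):
    """One round of strategy (5): q_t = n*(phi_t(...,-1) - phi_t(...,+1)),
    then draw the prediction in {-1,+1} with mean q_t."""
    q = n * (phi_t(history + [-1], n, phi) - phi_t(history + [+1], n, phi))
    q = max(-1.0, min(1.0, q))  # condition (1) keeps |q_t| <= 1, up to MC noise
    return +1 if random.random() < (1 + q) / 2 else -1

# Sanity check: phi = average number of mistakes of the constant "+1" predictor
# satisfies (1) and has E[phi] = 1/2; the strategy then recovers "always say +1".
n = 16
phi = lambda y: sum(1 for b in y if b != +1) / n
print([equalizer_predict([+1] * t, n, phi) for t in range(4)])
```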

Why is this simple result surprising? It tells us that the existence of a prediction method can be verified by simply checking a probabilistic inequality ($\mathbb{E}\phi \geq 1/2$). In the next part of the course, we will spend some time learning several techniques for checking (slightly more complicated) expressions that certify the existence of prediction methods, and for finding those methods.

We now have a way of choosing a $\phi$ that will work well for the problem at hand. For instance, if we are predicting class labels for the nodes of a graph, we may tilt $\phi$ towards smooth labelings, expecting homogeneity with respect to the graph structure. Moreover, this is done in a completely deterministic fashion, without reliance on a probabilistic source for the data. Later in the course we will cover more general techniques that do not arise from simply taking a $\phi$ function and defining an algorithm the way we did. But for now we can already try to construct good $\phi$'s for the problem of predicting sequences generated by humans. One of the first tools is a method for combining several good predictors into a meta-predictor. After that, we may define a collection of $\phi$'s that predict human-generated sequences according to some finite state machines, and create a meta-predictor that does as well as the best of them on the given sequence. (For those interested, the version of the mind-reader on the web seems to be using context weighted trees.)

Performing as well as a given finite collection of predictors is often called the experts setting. Assume that we have access to the predictions of $k$ experts $E_1, \ldots, E_k$, which produce on round $t$ (in a non-anticipating manner) a number (called advice) $E_j(y_1, \ldots, y_{t-1}) \in \{\pm 1\}$. For each $j \in [k]$, define

$$\phi^j(y_1, \ldots, y_n) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{1}\{E_j(y_1, \ldots, y_{t-1}) \neq y_t\},$$

the performance of expert $j$, which is only known at the end. We check that $\mathbb{E}\phi^j(\epsilon_1, \ldots, \epsilon_n) = 1/2$. Of course, we can do as well as any single expert $j$ by following her advice, thus incurring the same $\phi^j$ average number of mistakes. The question is whether it is possible to do (almost) as well as the best expert. The first attempt is to take the function $\min_{j \in [k]} \phi^j(y_1, \ldots, y_n)$ as the overall quality function $\phi(y_1, \ldots, y_n)$. However, we observe that the expectation of this function under the uniform distribution is less than $1/2$ (prove it!). Luckily, we need only a small correction:

$$\phi(y_1, \ldots, y_n) = \min_{j \in [k]} \phi^j(y_1, \ldots, y_n) + c\sqrt{\frac{\log k}{n}}.$$

Homework: find a value of $c$ that ensures $\mathbb{E}\phi \geq 1/2$ for this function.

Now that we have certified $\mathbb{E}\phi \geq 1/2$, there must exist a method that makes a number of mistakes no more than that of the best of the experts, plus a vanishing term. What is the method? There are several. Of course, the one in (5) works; however, it requires a certain averaging with random signs. There are more efficient methods, and a popular one is called Exponential Weights (or Multiplicative Weights). It can be derived in a principled manner with the tools we will introduce. Two caveats: the argument in the lemma requires us to be able to simulate the values of the experts on a hypothetical future given by random coin flips, and the experts cannot know the future. As we will see, the Exponential Weights algorithm works even if these two requirements are removed; a sketch of the algorithm is given below.
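The following is a minimal Python sketch of the randomized Exponential Weights predictor (our illustration in its standard textbook form; the learning-rate tuning $\eta = \sqrt{8\log(k)/n}$ is a common choice, not one fixed by these notes):

```python
import math
import random

def exponential_weights(expert_advice, outcomes, eta=None):
    """Randomized Exponential Weights for bit prediction with k experts.

    expert_advice[t][j] in {-1,+1}: advice of expert j on round t.
    outcomes[t] in {-1,+1}: the revealed bit y_t.
    Returns the expected number of mistakes made by the algorithm."""
    n, k = len(outcomes), len(expert_advice[0])
    if eta is None:
        eta = math.sqrt(8 * math.log(k) / n)  # a standard tuning of the rate
    weights = [1.0] * k
    expected_mistakes = 0.0
    for advice, y in zip(expert_advice, outcomes):
        total = sum(weights)
        # Follow expert j with probability w_j / total; the probability of a
        # mistake is the relative weight of the experts that are wrong now.
        expected_mistakes += sum(w for w, a in zip(weights, advice) if a != y) / total
        # Multiplicatively penalize the experts that erred on this round.
        weights = [w * math.exp(-eta) if a != y else w
                   for w, a in zip(weights, advice)]
    return expected_mistakes

# Example: two constant experts ("always +1" / "always -1") on a sequence with
# an 80/20 imbalance; the error rate comes out near min(p, 1-p) = 0.2.
n = 10_000
outcomes = [+1 if random.random() < 0.8 else -1 for _ in range(n)]
print(exponential_weights([[+1, -1]] * n, outcomes) / n)
```

With this tuning, the expected number of mistakes is at most that of the best expert plus $\sqrt{(n/2)\log k}$; dividing by $n$, this is a correction of order $\sqrt{\log(k)/n}$, matching the one added to $\phi$ above.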

We finish this lecture with a concrete example. Suppose you have a hunch that human-entered sequences tend to have an imbalance of $+1$'s and $-1$'s. Can we exploit it? Here is what follows immediately from our previous discussion: there is a (simple) algorithm that will make at most $\min\{1-p,\, p\} + O(1/\sqrt{n})$ mistakes on any sequence with a proportion $p$ of $+1$'s, and the method does not need to know this proportion. For example, if the sequence happens to consist of 80% $+1$'s, the algorithm makes roughly 20% mistakes, provided $n$ is large enough. This is not a trivial statement, since we do not assume that the sequence is i.i.d. The conclusion is: we can easily build a prediction method that will win over any unbalanced sequence entered by a human.

References

[Cov65] Thomas M. Cover. Behaviour of sequential predictors of binary sequences. In Proc. 4th Prague Conf. on Information Theory, Statistical Decision Functions, Random Processes, 1965.

[Hag56] D. W. Hagelbarger. SEER, a sequence extrapolating robot. IRE Transactions on Electronic Computers, (1):1-7, 1956.

[Pou14] W. Poundstone. How to Predict the Unpredictable: The Art of Outsmarting Almost Everyone. 2014.

[Sha53] Claude E. Shannon. A mind-reading machine. Bell Laboratories memorandum, 1953.
