
Statistical Techniques in Robotics (16-831, F10)    Lecture #10 (Thursday, September 23)
General Weighted Majority, Online Learning as Online Optimization
Lecturer: Drew Bagnell    Scribe: Nathaniel Barshay¹

¹ Content adapted from previous scribes: Anca Drăgan, Jared Goerner.

1 Generalized Weighted Majority

1.1 Recap

In general online learning, we cannot hope to make guarantees about loss in any absolute sense. Instead, we use the notion of regret $R$ to compare the loss of our algorithm (alg) against that of the best expert ($e^*$) from some family of experts. We thus define regret as:

$$R = \sum_t l_t(\mathrm{alg}) - \sum_t l_t(e^*) \qquad (1)$$

The first of such algorithms analyzed was Weighted Majority (WM), which works on 0/1 loss and achieves a number of mistakes $m$ bounded by:

$$m \le 2.4\,(m^* + \log N) \qquad (2)$$

where $N$ is the number of experts and $m^*$ is the number of mistakes the best expert in retrospect makes over the entire time horizon. Next we looked at Randomized Weighted Majority (RWM), which is similar to WM but uses a weighted draw (rather than a weighted average) to make a prediction at a given timestep, and also introduces a learning rate $\beta$.

1.2 The Master Learning Algorithm: Generalized Weighted Majority

The RWM algorithm mentioned above assumes a binary loss function: $l \in \{0, 1\}$. We now generalize to any loss function with outputs in $[0, 1]$ (keeping RWM as a special case). The algorithm is:

1. Set $w_0^i = 1$ for every expert $i$.
2. At time $t$, pick an expert $e_i$ in proportion to its weight $w_t^i$, and let that expert decide.
3. Adjust the expert weights: $w_{t+1}^i \leftarrow w_t^i \, e^{-\epsilon\, l_t(e_i)}$.

The bound on the regret of this algorithm becomes

$$E[R] \le \epsilon \sum_t l_t(e^*) + \frac{1}{\epsilon} \ln N \qquad (3)$$
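The three steps above are straightforward to implement. The following is a minimal sketch, assuming the full loss vector of every expert is revealed at each round and a fixed learning rate $\epsilon$; the matrix-based interface and names are illustrative choices, not from the notes.

```python
import numpy as np

def generalized_weighted_majority(loss_matrix, eps=0.1, seed=0):
    """Sketch of GWM: loss_matrix is T x N with entries in [0, 1],
    where loss_matrix[t, i] is the loss of expert i at time t."""
    rng = np.random.default_rng(seed)
    T, N = loss_matrix.shape
    w = np.ones(N)                          # step 1: w_0^i = 1
    incurred = []
    for t in range(T):
        i = rng.choice(N, p=w / w.sum())    # step 2: draw expert i with prob. proportional to w_t^i
        incurred.append(loss_matrix[t, i])
        w *= np.exp(-eps * loss_matrix[t])  # step 3: w_{t+1}^i = w_t^i * exp(-eps * l_t(e_i))
    return np.array(incurred)

# Toy check against the best expert in hindsight.
losses = np.random.default_rng(1).random((1000, 5))
print(generalized_weighted_majority(losses, eps=0.05).sum(), losses.sum(axis=0).min())
```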

Ideally, we would like this algorithm to be what is called No Regret, defined as the average regret over time converging to 0:

$$\frac{R_T}{T} \longrightarrow 0 \quad \text{as } T \to \infty \qquad (4)$$

To do so, we need to make sure that $\epsilon(T)$ decays in such a way that the regret grows less than linearly as a function of $T$. Since $l_t(e^*) \le 1$ by definition, we have that $\sum_t l_t(e^*) \le O(T)$. Therefore, applying this to (3), we get:

$$E[R] \le O\!\left(\epsilon T + \frac{1}{\epsilon} \ln N\right) \qquad (5)$$

Setting $\epsilon = \frac{1}{\sqrt{T}}$, we get that

$$E[R] \le O\!\left(\sqrt{T} + \sqrt{T} \ln N\right) \qquad (6)$$

This grows sublinearly in $T$, thus the ratio in (4) tends towards $\frac{1}{\sqrt{T}}$, and we have shown a no-regret algorithm. Of course, this requires knowing $T$ beforehand; it turns out one can also achieve no regret (via a harder proof) by varying $\epsilon$ with time: $\epsilon_t = \frac{1}{\sqrt{t}}$.

Note: This is the point where Drew says that this algorithm can solve any problem in the universe. For more information on General Weighted Majority, refer to the original paper by Arora et al. [1]. This algorithm has many surprising applications: computational geometry, AdaBoost (where the experts are the data points), and even Yao's XOR lemma (see [2] for more details) in complexity theory.

1.3 Application: Self-Supervised Learning

Suppose we have a robot (driving in 1 dimension) that wants to learn to identify objects at long range, given that it can identify objects perfectly at short range. Such a sensor model is quite common: we have far less information (and thus classification is far more difficult) when objects are far away. It might be desirable to learn such a model in a path planning setting. Let us assume that every observed obstacle is either a Tree or a Giant Carrot (and thus the difficulty of classification is quite understandable). The formal online learning setup is as follows: we get features (from an object at range) and decide on a class (Tree/Carrot) from the features available, then we drive close to the object and the world gives us the true classification. We will use 0/1 loss (0 if correct, 1 if incorrect). Almost any family of classifiers can be used as our set of experts (decision trees, linear classifiers). However, we must discretize the parameters of such learners to keep the number of experts finite.
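This online protocol can be written as a short loop. Below is a minimal sketch assuming a generic finite expert set and 0/1 loss; the expert representation and the `stream` interface (long-range features paired with the ground-truth label obtained up close) are illustrative, not part of the original notes.

```python
import numpy as np

def online_tree_vs_carrot(experts, stream, eps=0.1, seed=0):
    """0/1-loss online learning over a finite expert set, GWM-style.

    experts: list of callables mapping a feature vector to a label in {-1, +1}
             (say, -1 = Tree and +1 = Giant Carrot).
    stream:  iterable of (long_range_features, true_label) pairs, where the
             true label stands in for the perfect short-range classification.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(len(experts))
    mistakes = 0
    for features, true_label in stream:
        i = rng.choice(len(experts), p=w / w.sum())    # draw an expert by weight
        prediction = experts[i](features)              # classify the object at range
        mistakes += int(prediction != true_label)      # "drive close" reveals the truth
        losses = np.array([float(e(features) != true_label) for e in experts])
        w *= np.exp(-eps * losses)                     # GWM update on the 0/1 losses
    return mistakes
```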

1.4 Example: Linear Classifiers

The general form of a binary linear classifier is

$$h_\theta(f) = \begin{cases} +1 & \text{if } \theta^T f \ge 0 \\ -1 & \text{if } \theta^T f < 0 \end{cases}$$

Here, $f$ is a feature vector and $\theta$ is the vector of weights. Note that if we assert $\|\theta\| = 1$, then $\theta$ essentially has only $d - 1$ free parameters, since the last is redundant (and each $\theta_i \in [-1, 1]$). In order to have a finite family of experts, we might discretize each $\theta_i$ into $b$ levels, and have an expert for each combination of the $\theta_i$. In this case we have $N = b^{d-1}$, and we can run GWM verbatim. Plugging the number of experts into (3), we get:

$$E[R] \le \epsilon \sum_t l_t(e^*) + \frac{1}{\epsilon}\, O\!\left(\ln b^{d-1}\right) = \epsilon \sum_t l_t(\theta^*) + \frac{1}{\epsilon}\, O(d \ln b) \qquad (7)$$

Therefore, the regret scales linearly in the dimension of the feature space. It turns out it is generally true that we need $O(n)$ samples for a linear classifier (when introducing the constant, it becomes about $10n$). This is great theoretically, but keep in mind we still need to track weights for each of $O(b^d)$ experts! Thus the algorithm is only practical for small $d$ (upper bound at about 4).
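To make the discretization concrete, here is a minimal sketch of how the finite expert family might be enumerated. It uses the naive $b^d$ grid over $[-1, 1]^d$ rather than the unit-norm $b^{d-1}$ parameterization; the construction and names are illustrative assumptions, not from the notes.

```python
import itertools
import numpy as np

def discretized_linear_experts(d, b):
    """Enumerate linear classifiers whose weight vector theta lies on a
    b-level grid in [-1, 1]^d. Fixing ||theta|| = 1 as in the notes would
    remove one degree of freedom, but the family is still exponential in d."""
    levels = np.linspace(-1.0, 1.0, b)
    experts = []
    for combo in itertools.product(levels, repeat=d):
        theta = np.array(combo)
        # Each expert predicts +1 if theta . f >= 0 and -1 otherwise.
        experts.append(lambda f, theta=theta: 1 if theta @ f >= 0 else -1)
    return experts

print(len(discretized_linear_experts(d=3, b=5)))   # 5**3 = 125 experts
print(5 ** 10)                                     # already 9,765,625 experts at d = 10
```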

2 Online Learning as Online Optimization

Motivation: "In fact, the great watershed in optimization isn't between linearity and nonlinearity, but between convexity and nonconvexity."

Most importantly, if we use:

1. Convex sets of experts
2. Convex loss functions

then we may be able to solve our online learning problem efficiently in the realm of online optimization.

2.1 What are Convex Sets and Functions?

A convex set is a set such that any convex combination of two points in the set is also in the set:

$$\text{if } A \in C,\ B \in C,\ \theta \in [0, 1], \text{ then } \theta A + (1 - \theta) B \in C \qquad (8)$$

For example, the perimeter of a circle is not convex, because a convex combination of two points is a chord, which passes through the interior (which is not in the set). Examples of convex sets include:

- Unit ball in $\mathbb{R}^n$ under the $l_2$ norm: $S = \{x : \|x\|_2 \le R\}$
- Box in $\mathbb{R}^n$: $S = \{x : \|x\|_\infty \le R\}$
- General unit ball: $S = \{x : \|x\| \le 1\}$
- Linear subspace
- Half space: $S = \{x : w^T x \le b\}$
- Intersection of half spaces, i.e. a polyhedron
- Cone, i.e. all positive linear combinations of a set of vectors

Convex functions are the functions for which the epigraph (the area above the curve) is a convex set. Defined rigorously, we have:

$$\theta f(A) + (1 - \theta) f(B) \ge f(\theta A + (1 - \theta) B) \qquad (9)$$

This directly generalizes to Jensen's inequality:

$$\sum_i \theta_i = 1 \;\Longrightarrow\; f(\theta_1 x_1 + \dots + \theta_n x_n) \le \sum_i \theta_i f(x_i)$$

2.2 Subgradients

Convex functions have subgradients at every point in their domain. A vector $\partial f(x)$ is a subgradient at $x$ if it is the normal to some plane that touches $f$ at $x$ and lies below the rest of $f$. In symbols:

$$f(y) \ge f(x) + \partial f(x)^T (y - x) \quad \forall y \qquad (10)$$

If a function is differentiable at a point, then it has a unique subgradient at that point (the gradient). Furthermore, a convex function is the pointwise maximum of the affine functions defined by its subgradients. This is an interesting property that will be used later in the class, because the maximum of convex functions is convex.

Several key properties of convex functions follow:

- Any local minimum is also a global minimum (not necessarily the unique global minimum). This is easy to see: the subgradient plane at a local minimum lower-bounds the function's value everywhere.
- Local optimization never gets stuck (we can always follow a subgradient down, unless we are already at a global minimum).
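As a small numerical check of inequality (10), here is a hedged sketch using the absolute-value function (an illustrative choice, not an example from the notes); at its kink any value in $[-1, 1]$ is a valid subgradient, and $0$ is used here.

```python
import numpy as np

def subgradient_abs(x):
    """One valid subgradient of f(x) = |x|: the derivative away from 0,
    and 0 at the kink (any value in [-1, 1] would also work there)."""
    return np.sign(x)

# Check f(y) >= f(x) + g * (y - x) for all sampled y, at a few anchor points x.
f = np.abs
ys = np.linspace(-5, 5, 101)
for x in [-2.0, 0.0, 1.5]:
    g = subgradient_abs(x)
    assert np.all(f(ys) >= f(x) + g * (ys - x) - 1e-12)
print("subgradient inequality (10) holds at all sampled points")
```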

2.3 The Online Convex Programming Problem - Intro

Online Convex Programming was proposed by Martin Zinkevich [3] in 2003. It is framed in the same context of time steps, loss functions, experts, and weighted majority, preserving all the same qualities from WM while being computationally feasible. The idea is that the experts are elements of some convex set, and that the loss at time $t$ is convex over the set of experts and thus has a subgradient. At every time step, we need to predict an expert $x_t$ and receive the loss $l_t(x_t)$ and a subgradient $\partial l_t(x_t)$.

Example: for the case of the linear classifier, where the experts are $x_t = \theta_t$ in some convex set, the loss function could be

$$l_t(\theta_t) = (\theta_t^T f_t - y_t)^2$$

where $y_t \in \{-1, 1\}$ is the actual label for the data point $f_t$. This loss is convex, and is in fact a parabola in terms of $\theta_t$.

The next lecture will formalize the Online Convex Programming Problem better, and explain its applications to No Regret Portfolio creation.

References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta algorithm and its applications. Technical report, Princeton University.
[2] O. Goldreich, N. Nisan, and A. Wigderson. On Yao's XOR-lemma.
[3] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. ICML 2003.