COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire    Lecture #16    Scribe: Yannan Wang    April 3, 2014


1 Introduction

The goal of the online learning scenario from last class was to compare with the best single expert and do nearly as well as it does. An alternative scenario is one in which there is no single good expert, but a committee of experts, suitably combined, might do much better. We formalize this as follows. We have N experts, and for t = 1, ..., T:

- we observe x_t ∈ {-1, +1}^N, the vector of the experts' predictions, whose i-th component x_{t,i} is the prediction of expert i;
- the learner predicts ŷ_t ∈ {-1, +1};
- we observe the outcome y_t ∈ {-1, +1}.

The protocol is the same as before; what changes is the assumption on the data. We assume that there is a perfect committee, i.e., a weighted combination of the experts that is always right. Formally, there exists u ∈ R^N such that for all t,

    y_t = sign( Σ_{i=1}^N u_i x_{t,i} ) = sign(u · x_t),   equivalently   y_t (u · x_t) > 0.

Geometrically, the perfect committee means that there is a linear threshold, defined by the appropriate weighting u of the experts, that separates the +1 points from the -1 points.

2 How to do updates

We maintain w_t, our current estimate of u; it is a guess of the correct weighting of the experts, and we update this weighting on each round. Today we look at two algorithms. To specify each algorithm we only need to give (1) the initialization and (2) the update rule.
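
To make the round structure concrete, here is a minimal Python sketch of the protocol above; the function and variable names (run_online, init, update, stream) are placeholders and not part of the lecture, and each algorithm below simply supplies its own init and update.

    import numpy as np

    def run_online(stream, init, update):
        """Generic online protocol: predict sign(w_t . x_t), observe y_t, then
        let the algorithm-specific rule revise the weights.
        stream yields (x_t, y_t) with x_t in {-1,+1}^N and y_t in {-1,+1}."""
        w = init()
        mistakes = 0
        for x_t, y_t in stream:
            y_hat = 1 if np.dot(w, x_t) > 0 else -1   # the learner's prediction
            if y_hat != y_t:
                mistakes += 1
            w = update(w, x_t, y_t)                   # (2) the update rule
        return w, mistakes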

Figure 1: Perceptron geometric intuition: tipping the hyperplane.

2.1 Perceptron

The first way of updating the weights gives an algorithm called Perceptron. The rules are as follows:

Initialize: w_1 = 0.
Update: if there is a mistake (ŷ_t ≠ y_t, i.e., y_t (w_t · x_t) ≤ 0), then w_{t+1} = w_t + y_t x_t; otherwise w_{t+1} = w_t.

Not adjusting the weights when there is no mistake makes the algorithm conservative: it ignores correctly classified examples. The intuition is that after a wrong answer we shift the weights on all the experts in the direction of the correct answer. Figure 1 gives a geometric picture of the Perceptron update. There y_t = +1; when (x_t, y_t) is classified incorrectly, we add y_t x_t to w_t, which tips the hyperplane defined by w_t in a direction that makes it more likely to classify (x_t, y_t) correctly the next time.
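
As a concrete illustration of the two rules above, here is a minimal Python sketch of the Perceptron initialization and update (a sketch only; the helper names are not from the lecture). It plugs directly into the round structure sketched in Section 2.

    import numpy as np

    def perceptron_init(N):
        return np.zeros(N)                      # w_1 = 0

    def perceptron_update(w, x_t, y_t):
        # Conservative, mistake-driven update: change w only when
        # y_t (w . x_t) <= 0, and then shift w toward the correct label.
        if y_t * np.dot(w, x_t) <= 0:
            return w + y_t * x_t                # w_{t+1} = w_t + y_t x_t
        return w                                # w_{t+1} = w_t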

Now let us state a theorem that formally analyzes the performance of the Perceptron algorithm. First we make a few assumptions:

- Mistakes happen in every round. This is because the algorithm does not change on the other rounds, so T = # of rounds = # of mistakes.
- We normalize the prediction vectors x_t so that ||x_t||_2 ≤ 1.
- We normalize the weight vector of the perfect committee so that ||u||_2 = 1. (This is fine because the value of the sign function is not affected by this normalization.)
- We assume the points are linearly separable with margin at least δ: there exist δ > 0 and u ∈ R^N such that for all t, y_t (u · x_t) ≥ δ > 0. Note that, unlike the others, this assumption is with loss of generality.

Theorem 2.1 Under the assumptions above, with T = # of mistakes, we have T ≤ 1/δ².

Proof: We find a quantity that depends on the state of the algorithm at time t, upper and lower bound it, and derive a bound from there. The quantity here is Φ_t, the cosine of the angle Θ between w_t and u. More formally (using ||u||_2 = 1),

    Φ_t = (w_t · u) / ||w_t||_2 = cos Θ ≤ 1.

For the lower bound, we will prove that Φ_{T+1} ≥ √T · δ. We do this in two parts, by lower bounding the numerator of Φ_{T+1} and upper bounding its denominator.

Step 1: w_{T+1} · u ≥ T δ. We have

    w_{t+1} · u = (w_t + y_t x_t) · u = w_t · u + y_t (u · x_t) ≥ w_t · u + δ,

where the inequality is by the fourth (margin) assumption above. Since initially w_1 · u = 0, this implies w_{T+1} · u ≥ T δ.

Step 2: ||w_{T+1}||_2 ≤ √T. We have

    ||w_{t+1}||_2² = w_{t+1} · w_{t+1} = (w_t + y_t x_t) · (w_t + y_t x_t) = ||w_t||_2² + 2 y_t (x_t · w_t) + ||x_t||_2².

Since we assumed a mistake occurs in every round, y_t (x_t · w_t) ≤ 0, and by the normalization assumption ||x_t||_2² ≤ 1, so ||w_{t+1}||_2² ≤ ||w_t||_2² + 1. Since initially w_1 = 0, this gives ||w_{T+1}||_2² ≤ T, i.e., ||w_{T+1}||_2 ≤ √T.

Putting Steps 1 and 2 together,

    1 ≥ Φ_{T+1} ≥ T δ / √T = √T · δ,   i.e.,   T ≤ 1/δ².

Let H be the hypothesis space and M_perceptron(H) the number of mistakes made by the Perceptron algorithm on it. As a simple consequence of the above, since the VC-dimension of the hypothesis space is upper bounded by the optimal mistake bound, which in turn is at most the number of mistakes made by any particular algorithm, the VC-dimension of linear threshold functions with margin at least δ is at most 1/δ²:

    VC-dim(H) ≤ opt(H) ≤ M_perceptron(H) ≤ 1/δ².
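
As a quick numeric sanity check of Theorem 2.1 (not part of the notes), the sketch below constructs a synthetic stream satisfying the assumptions (||x_t||_2 ≤ 1 and margin δ with respect to a unit vector u), runs the Perceptron update, and compares the mistake count with 1/δ²; the construction and all names here are my own.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, delta = 20, 5000, 0.05
    u = rng.normal(size=N)
    u /= np.linalg.norm(u)                      # perfect committee, ||u||_2 = 1

    # Rejection-sample unit vectors whose margin with respect to u is at least delta.
    X, Y = [], []
    while len(X) < T:
        x = rng.normal(size=N)
        x /= np.linalg.norm(x)                  # ||x_t||_2 = 1
        m = float(np.dot(u, x))
        if abs(m) >= delta:
            X.append(x)
            Y.append(np.sign(m))                # y_t = sign(u . x_t)

    w, mistakes = np.zeros(N), 0
    for x_t, y_t in zip(X, Y):
        if y_t * np.dot(w, x_t) <= 0:           # mistake: apply the Perceptron update
            mistakes += 1
            w = w + y_t * x_t

    print(mistakes, "<=", 1 / delta**2)         # bound from Theorem 2.1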

Now consider a scenario where the target u consists of 0s and 1s, and the number of 1s in the vector is k:

    u = (1/√k) (0, 1, 0, 0, 1, ...).

Note that the factor 1/√k is for normalization (so that ||u||_2 = 1). Think of k as being small compared to N, the number of experts, i.e., u could be a very sparse vector. This is also one of the problems we examined earlier: the k experts form the perfect committee. We have

    x_t = (1/√N) (+1, -1, -1, +1, ...),   y_t = sign(u · x_t),   y_t (u · x_t) ≥ 1/√(kN),

where the factor 1/√N is again for normalization. So, using 1/√(kN) as δ, Theorem 2.1 says the Perceptron algorithm makes at most kN mistakes. This is not good: think of the experts as features, of which there may be millions of irrelevant ones, while the committee consists of the few (maybe a dozen) important ones. We get a linear dependence on N, which is usually large. Motivated by this example, we present another update rule, called the Winnow algorithm, which achieves a better bound.

2.2 Winnow Algorithm

Initialize: for all i, w_{1,i} = 1/N; we start with a uniform distribution over all the experts.
Update: if we make a mistake, then for all i,

    w_{t+1,i} = w_{t,i} e^{η y_t x_{t,i}} / Z_t,

where η is a parameter we will choose later and Z_t is a normalization factor. Otherwise, w_{t+1} = w_t.

This update rule is an exponential punishment of the experts that are wrong. Ignoring the normalization factor, the update is equivalent to w_{t+1,i} = w_{t,i} e^{η} if expert i predicts correctly and w_{t+1,i} = w_{t,i} e^{-η} otherwise; or again, up to normalization, to w_{t+1,i} = w_{t,i} if expert i predicts correctly and w_{t+1,i} = w_{t,i} e^{-2η} otherwise. This is the same as the weighted majority vote.
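
For concreteness, here is a minimal Python sketch of the Winnow initialization and multiplicative update just described (the helper names are mine, not from the lecture); η is left as a parameter and will be chosen in Theorem 2.2 below.

    import numpy as np

    def winnow_init(N):
        return np.ones(N) / N                   # w_{1,i} = 1/N: uniform over experts

    def winnow_update(w, x_t, y_t, eta):
        # Multiplicative update, applied only on a mistake:
        # w_{t+1,i} is proportional to w_{t,i} * exp(eta * y_t * x_{t,i}),
        # renormalized by Z_t so the weights remain a distribution.
        if y_t * np.dot(w, x_t) <= 0:
            w = w * np.exp(eta * y_t * x_t)
            return w / w.sum()                  # divide by Z_t
        return w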

Theorem. Under the assumptons above, We have the followng upper bound on the number of mstakes: ln N T ηδ + ln( ) e η +e η If we choose an optmal η to mnmze the bound, we get when η = 1 ln( 1+δ 1 δ ), T ln N δ Proof The approach s smlar to the prevous one. We use a quantty Φ t, whch we both upper and lower-bound. The quantty we use here s Φ t = RE(u w t ). Immedately we have, Φ t 0 for all t. Φ t+1 Φ t = = u u ln( ) w t+1, u ln( w t, w t+1, ) u ln( u w t, ) = Z t u ln( e ηytx ) t, (1) = u ln Z t u ln e ηytx t, = ln Z t ηy t (u x t ) ln Z t ηδ The last nequalty follows from the margn property we assumed. Now let s approxmate Z t. We know that Z s the normalzaton factor and can be computed as: Z = w e ηyx () Note that here we are droppng the subscrpt t for smplcty; Z and w are same as Z t and w t,. We wll bound the exponental term by a lnear functon, as llustrated n fgure : e ηx ( 1 + x )e η + ( 1 x )e η, for 1 x 1. Usng ths bound, we have: Z = w e ηyx = eη + e η = eη + e η eη + e η w ( 1 + yx )e η + w + eη e η y + eη + e η y(w x) w ( 1 yx )e η w x (3) 5

Figure 2: Using a linear function to bound the exponential function.

Combining the two bounds, we have

    Φ_{t+1} - Φ_t ≤ ln Z_t - ηδ ≤ ln( (e^η + e^{-η})/2 ) - ηδ = -C.        (4)

Here ln( (e^η + e^{-η})/2 ) - ηδ is a (negative) constant, which we write as -C; so on each round Φ_t decreases by at least C = ln( 2/(e^η + e^{-η}) ) + ηδ.

In the next class, we will finish the proof of Theorem 2.2, and we will study a modified version of the Winnow algorithm, called the Balanced Winnow algorithm, which gets rid of the assumption that u_i ≥ 0 for all i.
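
To see how the optimized bound in Theorem 2.2 comes out numerically, the short check below (mine, not from the notes) evaluates the per-round decrease C = ηδ + ln(2/(e^η + e^{-η})) at η = (1/2) ln((1+δ)/(1-δ)) for a few margins and compares ln N / C with 2 ln N / δ².

    import numpy as np

    N = 10**6
    for delta in [0.05, 0.1, 0.2, 0.4]:
        eta = 0.5 * np.log((1 + delta) / (1 - delta))                     # optimal eta
        C = eta * delta + np.log(2.0 / (np.exp(eta) + np.exp(-eta)))      # per-round decrease of Phi_t
        print(delta, np.log(N) / C, "<=", 2 * np.log(N) / delta**2)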