E0 370 Statistical Learning Theory    Lecture 18    Nov 8, 2011

Online Classification: Perceptron and Winnow

Lecturer: Shivani Agarwal    Scribe: Shivani Agarwal

1 Introduction

In this lecture we will start to study the online learning setting that was discussed briefly in the first lecture. Unlike the batch setting we have studied so far, where one is given a sample or batch of training data and the goal is to learn from this data a model that can make accurate predictions in the future, in the online setting learning takes place in a sequence of trials: on each trial, the learner must make a prediction or take some action, each of which can potentially result in some loss, and the goal is to update the prediction/decision model at the end of each trial so as to minimize the total loss incurred over a sequence of such trials. Online learning is relevant for a variety of problems, including prediction problems (e.g. forecasting the weather the next day) and decision/allocation problems (e.g. investing in different stocks or mutual funds).

We will start by considering online supervised learning problems, where on each trial the learner receives an instance and must predict its label, following which the true label is revealed and a corresponding loss incurred; as noted above, the goal of the learner is to minimize the total loss over a sequence of trials. We will focus in this lecture on online binary classification problems, and in the next lecture on online regression problems. We will then discuss online learning from experts, a framework that can be useful for both online supervised learning problems and online decision/allocation problems; we will analyze this framework in some detail in a couple of lectures, and then will conclude in the last lecture with a brief discussion of how online learning algorithms and their analyses can be transported back into the batch setting.

The basic online binary classification setting can be described as follows:

Online Binary Classification
    For $t = 1, \ldots, T$:
        Receive instance $x_t \in X$
        Predict $\hat{y}_t \in \{\pm 1\}$
        Receive label $y_t \in \{\pm 1\}$; incur loss $\ell(y_t, \hat{y}_t)$

The goal of a learning algorithm in this setting is to minimize the total loss incurred.
Specifically, let $S = ((x_1, y_1), \ldots, (x_T, y_T))$. Then the cumulative loss of an algorithm $A$ on the trial sequence $S$ is given by

    $L^{\ell}_S[A] = \sum_{t=1}^T \ell(y_t, \hat{y}_t)$.    (1)

The goal is to design algorithms with small cumulative loss on any trial sequence (or any trial sequence satisfying certain properties); the analysis here is therefore worst-case, rather than probabilistic as in the batch setting. For binary classification with zero-one loss $\ell^{0\text{-}1}$, the cumulative loss of an algorithm over a trial sequence $S$ corresponds to the number of prediction mistakes made by the algorithm on this sequence; bounds on the cumulative zero-one loss $M_S[A] = L^{\ell^{0\text{-}1}}_S[A]$ are therefore termed mistake bounds. In the following, we will study two classical algorithms for online binary classification, namely the perceptron and winnow algorithms, and discuss the mistake bounds that can be derived for them.
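To make the protocol and the cumulative zero-one loss concrete, here is a minimal sketch in Python of a driver that runs an online learner over a trial sequence and returns its mistake count $M_S[A]$; the function and class names (`run_online`, `ConstantLearner`) are our own illustrative choices, not part of the notes, and the constant learner is a toy included only to exercise the loop.

```python
def run_online(learner, trials):
    """Drive the online protocol: on each trial the learner predicts,
    then the true label is revealed; return the cumulative 0-1 loss
    (the number of prediction mistakes)."""
    mistakes = 0
    for x, y in trials:
        y_hat = learner.predict(x)   # prediction made before y_t is revealed
        if y_hat != y:
            mistakes += 1            # zero-one loss incurred on this trial
        learner.update(x, y)         # model update at the end of the trial
    return mistakes

class ConstantLearner:
    """Toy learner that always predicts +1 (illustration only)."""
    def predict(self, x):
        return 1
    def update(self, x, y):
        pass
```

Any conservative algorithm such as the perceptron or winnow discussed below fits this interface: `predict` evaluates the current model, and `update` modifies it only on mistake trials.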
2 Perceptron

In its basic form, the perceptron algorithm applies to Euclidean instance spaces $X \subseteq \mathbb{R}^n$, and maintains a linear classifier represented by a weight vector in such a space:

Algorithm Perceptron
    Initial weight vector $w_1 = \mathbf{0} \in \mathbb{R}^n$
    For $t = 1, \ldots, T$:
        Receive instance $x_t \in X \subseteq \mathbb{R}^n$
        Predict $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t)$
        Receive label $y_t$; incur loss $\ell^{0\text{-}1}(y_t, \hat{y}_t)$
        If $\hat{y}_t \ne y_t$: $w_{t+1} = w_t + y_t x_t$
        Else: $w_{t+1} = w_t$

Notice that the algorithm makes an update to its model (weight vector) only when there is a mistake in its prediction; online algorithms satisfying this property are said to be conservative. To get an intuitive feel for the algorithm, observe that if the true label $y_t$ on trial $t$ is $+1$ and the algorithm predicts $\hat{y}_t = -1$, then it means $w_t \cdot x_t < 0$; in order to improve the prediction on this example, the algorithm must increase the value of this dot product. Indeed, we have

    $w_{t+1} \cdot x_t = (w_t + x_t) \cdot x_t = w_t \cdot x_t + \|x_t\|^2 \ge w_t \cdot x_t$.

Similarly, it can be verified that when $y_t = -1$ and the algorithm predicts $\hat{y}_t = +1$, the update has the effect of decreasing the value of the dot product. Thus the updates make sense intuitively. More formally, one can prove the following classical mistake bound for the perceptron algorithm in the linearly separable case:

Theorem 2.1 (Perceptron Convergence Theorem; Block, 1962; Novikoff, 1962). Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{\pm 1\})^T$. Let $R = \max\{\|x_t\| : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in \mathbb{R}^n$ such that $y_t (u \cdot x_t) \ge \gamma$ $\forall t \in [T]$,

    $M_S[\mathrm{Perceptron}] \le \frac{R^2 \|u\|^2}{\gamma^2}$.    (2)

Proof. Denote $M_S[\mathrm{Perceptron}] = k$. Consider measuring the progress towards $u$ (or closeness to $u$) on each trial in terms of $w_t \cdot u$. For each trial $t$ on which there is a mistake, we have

    $w_{t+1} \cdot u - w_t \cdot u = y_t (x_t \cdot u) \ge \gamma$.

For all other trials $t$, we have $w_{t+1} \cdot u - w_t \cdot u = 0$. Therefore summing over $t = 1, \ldots, T$ gives

    $w_{T+1} \cdot u - w_1 \cdot u \ge k\gamma$.    (3)

Noting that $w_1 = \mathbf{0}$ and using Cauchy-Schwarz, we have

    $k\gamma \le w_{T+1} \cdot u \le \|w_{T+1}\| \, \|u\|$.    (4)

Now for each trial $t$ on which there is a mistake,

    $\|w_{t+1}\|^2 = \|w_t + y_t x_t\|^2 = \|w_t\|^2 + 2 y_t (w_t \cdot x_t) + \|x_t\|^2$    (5)
                $\le \|w_t\|^2 + R^2$, since $y_t (w_t \cdot x_t) \le 0$ for a mistake trial.    (6)

For all other trials $t$, $\|w_{t+1}\|^2 - \|w_t\|^2 = 0$. Therefore summing over $t = 1, \ldots, T$ and noting again $w_1 = \mathbf{0}$, we get

    $\|w_{T+1}\|^2 \le kR^2$.    (7)

Substituting in Eq. (4) gives

    $k\gamma \le \sqrt{k} \, R \, \|u\|$.    (8)

Squaring both sides yields the result.
One can also show the following weaker mistake bound in the general (non-separable) case:

Theorem 2.2 (Freund and Schapire, 1999). Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{\pm 1\})^T$. Let $R = \max\{\|x_t\| : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in \mathbb{R}^n$,

    $M_S[\mathrm{Perceptron}] \le \left( \frac{R\|u\| + \sqrt{\sum_{t=1}^T d_t^2}}{\gamma} \right)^2$, where $d_t = \max\{0, \, \gamma - y_t (u \cdot x_t)\}$.

Details of the proof can be found in [1].

We conclude our discussion of the perceptron algorithm by observing that the algorithm can be re-written so as to use only dot products between instances seen by the algorithm, which facilitates a natural extension to a kernel-based variant for arbitrary instance spaces $X$:

Algorithm Kernel Perceptron
    Kernel function $K : X \times X \to \mathbb{R}$
    For $t = 1, \ldots, T$:
        Receive instance $x_t \in X$
        Predict $\hat{y}_t = \mathrm{sign}\big( \sum_{r=1}^{t-1} \alpha_r K(x_r, x_t) \big)$
        Receive label $y_t$; incur loss $\ell^{0\text{-}1}(y_t, \hat{y}_t)$
        If $\hat{y}_t \ne y_t$: $\alpha_t = y_t$
        Else: $\alpha_t = 0$

A mistake bound similar to that for the linear perceptron algorithm can be shown in this case too:

Theorem 2.3 (Kernel Perceptron Convergence Theorem). Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (X \times \{\pm 1\})^T$ and let $K : X \times X \to \mathbb{R}$ be a kernel function on $X$. Let $R = \max\{\sqrt{K(x_t, x_t)} : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in X$ such that $y_t K(u, x_t) \ge \gamma$ $\forall t \in [T]$,

    $M_S[\mathrm{Perceptron}_K] \le \frac{R^2 \, K(u, u)}{\gamma^2}$.

We leave the proof details as an exercise.

3 Winnow

The winnow algorithm also maintains a linear classifier in a Euclidean instance space; in this case, however, the updates to the weight vector are multiplicative rather than additive:

Algorithm Winnow
    Learning rate parameter $\eta > 0$
    Initial weight vector $w_1 = (1/n, \ldots, 1/n) \in \mathbb{R}^n$
    For $t = 1, \ldots, T$:
        Receive instance $x_t \in X \subseteq \mathbb{R}^n$
        Predict $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t)$
        Receive label $y_t$; incur loss $\ell^{0\text{-}1}(y_t, \hat{y}_t)$
        If $\hat{y}_t \ne y_t$: $\forall i \in [n]$: $w_{t+1,i} = \frac{w_{t,i} \exp(\eta y_t x_{t,i})}{Z_t}$, where $Z_t = \sum_{j=1}^n w_{t,j} \exp(\eta y_t x_{t,j})$
        Else: $w_{t+1} = w_t$
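The multiplicative update above can be sketched in a few lines of Python (the function name `winnow` and the use of numpy are our own choices; the division by $Z_t$ keeps $w_t$ on the probability simplex):

```python
import numpy as np

def winnow(trials, eta, n):
    """One pass of the winnow algorithm (multiplicative updates) over a
    sequence of (x_t, y_t) trials with x_t in R^n; returns the final
    weight vector and the number of mistakes made."""
    w = np.full(n, 1.0 / n)                          # w_1 = (1/n, ..., 1/n)
    mistakes = 0
    for x, y in trials:
        y_hat = 1.0 if np.dot(w, x) >= 0 else -1.0   # sign(w_t . x_t)
        if y_hat != y:                               # conservative update
            w = w * np.exp(eta * y * x)              # multiplicative step
            w = w / w.sum()                          # divide by Z_t to renormalize
            mistakes += 1
    return w, mistakes
```

The learning rate $\eta$ is a free parameter here; as Theorem 3.1 below shows, when the margin and norms are known it can be tuned to a value $\eta^*$ that yields a mistake bound of order $\|u\|_1^2 \ln n$.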
Here too, one can observe that when a mistake is made on some trial $t$, the effect of the update is to move the dot product $w_{t+1} \cdot x_t$ in the right direction compared to $w_t \cdot x_t$. Formally, we have the following mistake bound for trial sequences that are linearly separable by a non-negative weight vector:

Theorem 3.1. Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{\pm 1\})^T$. Let $R = \max\{\|x_t\|_\infty : t \in [T]\}$ and let $\gamma > 0$. Then for any $u \in \mathbb{R}^n$ such that $u_i \ge 0$ $\forall i \in [n]$ and $y_t (u \cdot x_t) \ge \gamma$ $\forall t \in [T]$,

    $M_S[\mathrm{Winnow}(\eta)] \le \frac{\ln n}{\frac{\eta\gamma}{\|u\|_1} - \ln\left(\frac{e^{\eta R} + e^{-\eta R}}{2}\right)}$.

Moreover, if $R$, $\|u\|_1$, and $\gamma$ are known, then one can select $\eta$ to yield

    $M_S[\mathrm{Winnow}(\eta^*)] \le \frac{2 R^2 \|u\|_1^2}{\gamma^2} \ln n$.

Proof. Denote $M_S[\mathrm{Winnow}(\eta)] = k$, and let $p = u / \|u\|_1$ so that $p \in \Delta_n$, where $\Delta_n$ is the probability simplex in $\mathbb{R}^n$. Consider again measuring the progress towards $u$ (or $p$) on each trial; in this case, we will measure the distance of $w_t$ from $p$ in terms of the KL-divergence, $\mathrm{KL}(p \| w_t) = \sum_{i=1}^n p_i \ln \frac{p_i}{w_{t,i}}$. For each trial $t$ on which there is a mistake, we have

    $\mathrm{KL}(p \| w_t) - \mathrm{KL}(p \| w_{t+1}) = \sum_{i=1}^n p_i \ln \frac{w_{t+1,i}}{w_{t,i}}$    (9)
        $= \sum_{i=1}^n p_i \ln \frac{e^{\eta y_t x_{t,i}}}{Z_t}$    (10)
        $= \eta y_t (p \cdot x_t) - \sum_{i=1}^n p_i \ln Z_t$    (11)
        $= \eta y_t (p \cdot x_t) - \ln Z_t$    (12)
        $\ge \frac{\eta\gamma}{\|u\|_1} - \ln Z_t$.    (13)

Now, $Z_t = \sum_{i=1}^n w_{t,i} \, e^{\eta y_t x_{t,i}}$. Noting that $y_t x_{t,i} \in [-R, R]$ for all $i, t$, we can bound $Z_t$ as follows using convexity of the mapping $a \mapsto e^{\eta a}$:

    $Z_t \le \sum_{i=1}^n w_{t,i} \left( \frac{1 + y_t x_{t,i}/R}{2} \, e^{\eta R} + \frac{1 - y_t x_{t,i}/R}{2} \, e^{-\eta R} \right)$    (14)
        $= \frac{e^{\eta R} + e^{-\eta R}}{2} \sum_{i=1}^n w_{t,i} + \frac{e^{\eta R} - e^{-\eta R}}{2R} \, y_t \sum_{i=1}^n w_{t,i} x_{t,i}$    (15)
        $= \frac{e^{\eta R} + e^{-\eta R}}{2} + \frac{e^{\eta R} - e^{-\eta R}}{2R} \, y_t (w_t \cdot x_t)$    (16)
        $\le \frac{e^{\eta R} + e^{-\eta R}}{2}$, since $e^{\eta R} - e^{-\eta R} > 0$ and $y_t (w_t \cdot x_t) \le 0$ for mistake trials $t$.    (17)

Therefore, on each mistake trial $t$, we have

    $\mathrm{KL}(p \| w_t) - \mathrm{KL}(p \| w_{t+1}) \ge \frac{\eta\gamma}{\|u\|_1} - \ln\left(\frac{e^{\eta R} + e^{-\eta R}}{2}\right)$.    (18)

On all other trials $t$,

    $\mathrm{KL}(p \| w_t) - \mathrm{KL}(p \| w_{t+1}) = 0$.    (19)

Therefore summing over $t = 1, \ldots, T$, we have

    $\mathrm{KL}(p \| w_1) - \mathrm{KL}(p \| w_{T+1}) \ge k \left( \frac{\eta\gamma}{\|u\|_1} - \ln\left(\frac{e^{\eta R} + e^{-\eta R}}{2}\right) \right)$.    (20)
Now,

    $\mathrm{KL}(p \| w_1) = \sum_{i=1}^n p_i \ln(n p_i) \le \ln n$;    (21)
    $\mathrm{KL}(p \| w_{T+1}) \ge 0$.    (22)

This yields the desired bound

    $k \le \frac{\ln n}{\frac{\eta\gamma}{\|u\|_1} - \ln\left(\frac{e^{\eta R} + e^{-\eta R}}{2}\right)}$.    (23)

Now if $R$, $\|u\|_1$, and $\gamma$ are known, then one can minimize the right hand side above w.r.t. $\eta$; this yields

    $\eta^* = \frac{1}{2R} \ln\left(\frac{R\|u\|_1 + \gamma}{R\|u\|_1 - \gamma}\right)$.    (24)

With this choice of $\eta$, one gets

    $k \le \frac{\ln n}{g\left(\frac{\gamma}{R\|u\|_1}\right)}$,    (25)

where $g(\epsilon) = \frac{1+\epsilon}{2}\ln(1+\epsilon) + \frac{1-\epsilon}{2}\ln(1-\epsilon)$ (note that $\gamma/(R\|u\|_1) \le 1$, since $y_t (u \cdot x_t) \le \|u\|_1 R$). One can show that $g(\epsilon) \ge \epsilon^2/2$, which when applied to the above yields the desired result. □

4 Comparison of the Two Algorithms

To understand the relative strengths of the two algorithms, consider the following two examples, where $k \ll n$ and in both cases the trial sequences are separable with margin $\gamma = 1$; we write $R_2 = \max_t \|x_t\|_2$ and $R_\infty = \max_t \|x_t\|_\infty$ for the quantities $R$ appearing in Theorems 2.1 and 3.1 respectively.

Example 1 (Sparse target vector, dense instances). Let $u \in \{0,1\}^n$ with at most $k$ non-zero components, and let $x_t \in \{\pm 1\}^n$ $\forall t$. Thus $\|u\|_1 \le k$, $\|u\|_2 \le \sqrt{k}$, $R_2 = \sqrt{n}$, and $R_\infty = 1$.

Example 2 (Dense target vector, sparse instances). Let $u = \mathbf{1} \in \mathbb{R}^n$, and let $x_t \in \{0,1\}^n$ $\forall t$ such that each $x_t$ has at most $k$ non-zero components. Thus $\|u\|_1 = n$, $\|u\|_2 = \sqrt{n}$, $R_2 = \sqrt{k}$, and $R_\infty = 1$.

In Example 1, the mistake bound we get for the perceptron is $nk$, while that for winnow is $2k^2 \ln n$. On the other hand, in Example 2, the mistake bound we get for the perceptron is $kn$, whereas that for winnow is $2n^2 \ln n$. Thus, for a sparse target vector that depends on only a small number of relevant features, winnow gives a better mistake bound; for dense target vectors and sparse instances, the perceptron has a better bound.

5 Next Lecture

In the next lecture we will see both additive and multiplicative update algorithms for online regression, and will derive bounds on their regret, which measures the cumulative loss of the algorithm with respect to the best possible loss within some class of predictors.

Acknowledgments. The proof of the mistake bound for winnow is based on a proof described by Sham Kakade and Ambuj Tewari in their lecture notes for a course taught at TTI Chicago in Spring 2008.

References

[1] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.