Online Linear Regression using Burg Entropy


Prateek Jain, Brian Kulis, and Inderjit Dhillon
Technical Report TR
University of Texas at Austin, Austin, TX 78712
February 4, 2007

Abstract

We consider the problem of online prediction with a linear model. In contrast to existing work in online regression, which regularizes based on squared loss or KL-divergence, we regularize using divergences arising from the Burg entropy. We demonstrate regret bounds for our resulting online gradient-descent algorithm; to our knowledge, these are the first online bounds involving Burg entropy. We extend this analysis to the matrix case, where our algorithm employs LogDet-based regularization, and discuss an application to online metric learning. We demonstrate empirically that using Burg entropy for regularization is useful in the presence of noisy data.

1 Introduction

Recently, online learning has received significant attention, and several recent results have demonstrated strong provable online bounds. In the online linear regression problem considered in this paper, the algorithm receives an instance $x_t$ ($x_t \in \mathbb{R}^m_+$) at time $t$ and predicts $\hat{y}_t = w_t \cdot x_t$, where $w_t$ is the current model used by the algorithm. We denote $w_{t,i}$ as the $i$-th component of the model at the $t$-th step. The error incurred at step $t$ is $\ell_t(w_t) = (y_t - \hat{y}_t)^2$, where $y_t$ is the true/target output, and the goal is to minimize the total loss over all time steps, $\sum_t \ell_t(w_t)$. If there is no correlation between the input $x_t$ and the output $y_t$, then the cumulative loss of the algorithm may be unbounded. Hence, online algorithms are traditionally compared to the performance of the best possible offline solution, where all the input and output instances are provided beforehand. The goal is to prove a relative loss bound on how the online algorithm compares to the optimal offline algorithm. Given a $T$-trial sequence $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)\}$, the optimal offline solution is given by

$$u = \arg\min_{w} \sum_{t=1}^{T} \ell_t(w) \quad \text{s.t. } w > 0.$$

A typical approach to solving the online linear regression problem uses the standard gradient descent method with regularization for minimizing the loss at each step. Specifically, the algorithm tries to minimize the following function at each step:

$$U(w) = \underbrace{D(w, w_t)}_{\text{Regularization Term}} + \eta_t \underbrace{L(y_t, w \cdot x_t)}_{\text{Loss Term}}, \tag{1}$$

where $\eta_t$ is the learning rate at the $t$-th step and $D(w, w_t)$ is some distance measure between the new weight vector $w$ and the current weight vector $w_t$. In the above objective function, the regularization term favors weight vectors which are close to the current model $w_t$, representing a tendency for conservativeness. On the other hand, the loss term is minimized when $w$ is updated to exactly satisfy $y_t = \hat{y}_t$; hence the loss term has a tendency to satisfy target outputs for recent examples.
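As a rough illustration of the template in (1) (this sketch is not part of the original report), the regularizer $D$ can be instantiated with the squared Euclidean distance, which recovers the familiar online gradient-descent step; all function and variable names below are hypothetical.

```python
import numpy as np

def online_regression(stream, dim, eta=0.05):
    """Schematic online protocol for the update template (1).

    With D(w, w_t) = 0.5 * ||w - w_t||^2 and the squared loss, setting the
    gradient of (1) to zero and evaluating the loss gradient at w_t gives
    the familiar step  w_{t+1} = w_t - 2 * eta * (yhat_t - y_t) * x_t.
    """
    w = np.ones(dim) / dim          # initial model
    total_loss = 0.0
    for x_t, y_t in stream:         # receive instance, predict, then see outcome
        yhat_t = w @ x_t
        total_loss += (y_t - yhat_t) ** 2
        w = w - 2 * eta * (yhat_t - y_t) * x_t   # conservative update toward w_t
    return w, total_loss
```

The algorithms studied in this report replace the Euclidean regularizer with Burg-entropy-based divergences, which changes the form of the update but not this overall protocol.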

The tradeoff between these two quantities is handled by the learning rate $\eta_t$, an important parameter in online learning problems. The performance of any method following this approach depends heavily on the choice of the regularization term and the loss function. Kivinen et al. [6, 8] show that the performance of any regularization term in turn depends heavily on the problem domain. For example, for sparse inputs, relative entropy as a regularization term often works better than squared Euclidean distance. For dense inputs, relative entropy's performance is generally very poor compared to that of squared Euclidean distance.

In this paper, we study the performance of online linear regression when the Burg entropy is used as the regularization term:

$$D_{\text{Burg}}(w_{t+1}, w_t) = \sum_{i=1}^{N}\left( \frac{w_{t+1,i}}{w_{t,i}} - \log\left(\frac{w_{t+1,i}}{w_{t,i}}\right) - 1 \right).$$

The relative Burg entropy (or Itakura-Saito distance) is a special case of a Bregman divergence, and such measures have been shown to be useful distance measures for various machine learning tasks [1] (relative entropy and squared Euclidean distance are other special cases of Bregman divergences). The Burg entropy in particular has recently been shown to be useful for matrix approximations and kernel learning [7]. Our study of Burg entropy is also motivated by the fact that online information-theoretic metric learning reduces to an online linear regression problem with the LogDet divergence (the natural generalization of the Burg entropy) as the regularization term [3]. The results in this paper complement those of our work on information-theoretic metric learning. We present an algorithm that uses the Burg relative entropy for regularization and prove an associated regret bound. We also generalize this approach to the matrix case when the input matrices are rank-one; this is precisely the case needed for information-theoretic metric learning. In Section 5, we empirically show how the Burg entropy performs compared to other standard regularization terms.

2 The Vector Case

In this study, we select the Bregman divergence corresponding to the Burg entropy as our regularization term for (1):

$$D_{\text{Burg}}(w_{t+1}, w_t) = \sum_{i=1}^{N}\left( \frac{w_{t+1,i}}{w_{t,i}} - \log\left(\frac{w_{t+1,i}}{w_{t,i}}\right) - 1 \right).$$

We derive Algorithm 1 from standard gradient descent; in particular, if we differentiate $U(w)$, we obtain the update given as (2) in Algorithm 1.

Algorithm 1 Burg Gradient (BG) Algorithm
1: Initialize: Set $\eta'_t = \frac{1}{4mR^2}$ and $w_{0,i} = \frac{1}{N}, \forall i$.
2: Prediction: Upon receiving the $t$-th instance $x_t$, give the prediction $\hat{y}_t = w_t \cdot x_t$.
3: Update: Upon receiving the $t$-th outcome $y_t$, update the weights according to the rule

$$w_{t+1,i} = \frac{w_{t,i}}{1 + 2\eta_t(\hat{y}_t - y_t)\,w_{t,i}x_{t,i}}, \qquad\text{equivalently}\qquad \frac{1}{w_{t+1,i}} = \frac{1}{w_{t,i}} + 2\eta_t(\hat{y}_t - y_t)x_{t,i}, \tag{2}$$

where

$$\eta_t = \begin{cases} \eta'_t & \text{if } \hat{y}_t - y_t > 0, \\[4pt] \min\left(\eta'_t,\; \min_i \dfrac{1}{2(y_t - \hat{y}_t)\,x_{t,i}\,w_{t,i}}\right) & \text{if } \hat{y}_t - y_t < 0. \end{cases} \tag{3}$$
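The following is a minimal NumPy sketch of Algorithm 1 (it is not from the original report; the names, the data stream, and the safety margin are hypothetical choices). It applies the coordinate-wise update (2) with the adaptive rate (3), which keeps every weight strictly positive.

```python
import numpy as np

def burg_gradient(stream, dim, m, R):
    """Sketch of the Burg Gradient (BG) algorithm (Algorithm 1).

    `m` bounds the sum of the weights and `R` bounds the entries of x_t,
    as assumed in the regret analysis below.
    """
    eta_prime = 1.0 / (4 * m * R ** 2)
    w = np.full(dim, 1.0 / dim)                 # positive initial weights, w_0i = 1/N
    for x_t, y_t in stream:                     # x_t is assumed to have positive entries
        yhat_t = w @ x_t                        # step 2: prediction
        err = yhat_t - y_t
        if err >= 0:
            eta_t = eta_prime
        else:
            # adaptive rate (3): keep 1 + 2*eta*err*w_i*x_i strictly positive
            caps = 1.0 / (2 * (y_t - yhat_t) * x_t * w)
            eta_t = min(eta_prime, 0.9 * caps.min())   # 0.9 is a safety margin (sketch choice)
        w = w / (1 + 2 * eta_t * err * w * x_t)        # update (2) in ratio form
    return w
```

A call such as `burg_gradient(zip(xs, ys), dim=100, m=1.0, R=1.0)` on positive inputs returns a positive weight vector, matching the positivity property discussed next.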

Note that the divergence $D_{\text{Burg}}$ is only defined over positive-valued weight vectors. To maintain positivity, we use an adaptive learning rate $\eta_t$, which guarantees that the resulting vector $w$ is positive. Assuming $\eta'_t > 0$, then $\eta_t > 0$ since $\eta_t$ is the minimum of two positive quantities. Also, if $\hat{y}_t - y_t > 0$ then clearly $w_{t+1,i} > 0$ (assuming $w_{t,i} > 0$). On the other hand, if $\hat{y}_t - y_t < 0$, then since $\eta_t \le \frac{1}{2(y_t - \hat{y}_t)x_{t,i}w_{t,i}}$, we can conclude that $w_{t+1,i} > 0$. Hence, we maintain positivity.

2.1 Regret Bounds

Now we demonstrate regret bounds for the proposed algorithm. Let the total loss of Algorithm 1 be $\text{Loss}_{BG} = \sum_t \ell_t(w_t)$, and let $\text{Loss}_u = \sum_t \ell_t(u)$ be the loss of the best offline solution. First we bound the loss of the algorithm at a single trial in terms of the loss of a comparison vector $u$ at that trial and the progress of the algorithm towards $u$.

Lemma 1. $a_t(\hat{y}_t - y_t)^2 - b_t(u \cdot x_t - y_t)^2 \le D_{\text{Burg}}(u, w_t) - D_{\text{Burg}}(u, w_{t+1})$, where $a_t$ and $b_t$ are positive constants and $u$ is the optimal offline solution.

Proof.

$$\begin{aligned}
D_{\text{Burg}}(u, w_t) - D_{\text{Burg}}(u, w_{t+1}) &= \sum_{i=1}^{N}\left[ u_i\left(\frac{1}{w_{t,i}} - \frac{1}{w_{t+1,i}}\right) + \ln\frac{w_{t,i}}{w_{t+1,i}} \right]\\
&= -2\eta_t(\hat{y}_t - y_t)\,u\cdot x_t + \sum_{i=1}^{N}\ln\big(1 + 2\eta_t(\hat{y}_t - y_t)w_{t,i}x_{t,i}\big).
\end{aligned}$$

Consider the following two cases:

1. $\hat{y}_t - y_t \ge 0$: Since $w_{t,i} > 0$ and $x_{t,i} > 0$, it follows that $2\eta_t(\hat{y}_t - y_t)w_{t,i}x_{t,i} \ge 0$.

2. $\hat{y}_t - y_t < 0$: Since $\eta_t < \frac{1}{2(y_t - \hat{y}_t)x_{t,i}w_{t,i}}$, it follows that $2\eta_t(y_t - \hat{y}_t)w_{t,i}x_{t,i} < 1$, and so $2\eta_t(\hat{y}_t - y_t)w_{t,i}x_{t,i} > -1$.

Thus, in either case, $2\eta_t(\hat{y}_t - y_t)w_{t,i}x_{t,i} > -1$. We can therefore apply the inequality

$$\ln(1 + z) \ge z - z^2/2,$$

which holds for all $z > -1$. Hence

$$D_{\text{Burg}}(u, w_t) - D_{\text{Burg}}(u, w_{t+1}) \ge -2\eta_t(\hat{y}_t - y_t)\,u\cdot x_t + \sum_{i=1}^{N}\Big( 2\eta_t(\hat{y}_t - y_t)w_{t,i}x_{t,i} - 2\eta_t^2(\hat{y}_t - y_t)^2(w_{t,i}x_{t,i})^2 \Big).$$

Assume $x_{t,i} \le R$ and $\sum_{i=1}^{N} w_{t,i} \le m$; the above inequality simplifies to

$$D_{\text{Burg}}(u, w_t) - D_{\text{Burg}}(u, w_{t+1}) \ge -2\eta_t(\hat{y}_t - y_t)\,u\cdot x_t + 2\eta_t(\hat{y}_t - y_t)\hat{y}_t - 2\eta_t^2(\hat{y}_t - y_t)^2 mR^2. \tag{4}$$

Let $r = u \cdot x_t$. By (4), proving the lemma amounts to showing that

$$a_t(\hat{y}_t - y_t)^2 - b_t(r - y_t)^2 \le -2\eta_t(\hat{y}_t - y_t)r + 2\eta_t(\hat{y}_t - y_t)\hat{y}_t - 2\eta_t^2(\hat{y}_t - y_t)^2 mR^2 \tag{5}$$

for some positive constants $a_t$ and $b_t$. Consider the function

$$F(r) = a_t(\hat{y}_t - y_t)^2 - b_t(r - y_t)^2 + 2\eta_t(\hat{y}_t - y_t)r - 2\eta_t(\hat{y}_t - y_t)\hat{y}_t + 2\eta_t^2(\hat{y}_t - y_t)^2 mR^2.$$

Equation (5) is equivalent to $F(r) \le 0$ for all $r$. It can easily be seen that $F(r)$ is maximized when $r = y_t + \frac{\eta_t}{b_t}(\hat{y}_t - y_t)$. Substituting for $r$ in $F(r)$ and simplifying, we get:

$$a_t(\hat{y}_t - y_t)^2 + \frac{\eta_t^2}{b_t}(\hat{y}_t - y_t)^2 - 2\eta_t(\hat{y}_t - y_t)^2 + 2\eta_t^2(\hat{y}_t - y_t)^2 mR^2.$$

Hence, we need to prove that

$$0 \ge a_t(\hat{y}_t - y_t)^2 + \frac{\eta_t^2}{b_t}(\hat{y}_t - y_t)^2 - 2\eta_t(\hat{y}_t - y_t)^2 + 2\eta_t^2(\hat{y}_t - y_t)^2 mR^2.$$

Since $(\hat{y}_t - y_t)^2 \ge 0$, this amounts to showing

$$0 \ge a_t b_t + \eta_t^2 - 2\eta_t b_t + 2\eta_t^2 mR^2 b_t = Q(\eta_t).$$

Now, $Q(\eta_t)$ is minimized for $\eta_t = \frac{b_t}{1 + 2mR^2 b_t}$, which implies $b_t = \frac{\eta_t}{1 - 2\eta_t mR^2}$. Since $\eta_t \le \eta'_t = \frac{1}{4mR^2}$, we know that $b_t > 0$. With this choice of $b_t$, $Q(\eta_t) \le 0$ iff $a_t \le \eta_t$. Hence, for $0 \le a_t \le \eta_t$ and $b_t = \frac{\eta_t}{1 - 2\eta_t mR^2}$, Lemma 1 holds.

Using Lemma 1, we can sum the loss over all time steps to get an overall bound on the loss of the Burg entropy-based online learning algorithm.

Theorem 1. $L_{BG} \le \frac{1}{1 - 2\eta_T mR^2}\,L_u + \frac{1}{\eta_T}D_{\text{Burg}}(u, w_0)$, where $L_{BG}$ is the loss incurred by the Burg Gradient descent algorithm and $L_u$ is the loss incurred by the optimal offline algorithm.

Proof. By Lemma 1, for each trial $t$,

$$\eta_t(\hat{y}_t - y_t)^2 - \frac{\eta_t}{1 - 2\eta_t mR^2}(u\cdot x_t - y_t)^2 \le D_{\text{Burg}}(u, w_t) - D_{\text{Burg}}(u, w_{t+1}).$$

Hence, adding over all trials we get

$$\sum_{t=1}^{T}\eta_t(\hat{y}_t - y_t)^2 \le \sum_{t=1}^{T}\frac{\eta_t}{1 - 2\eta_t mR^2}(u\cdot x_t - y_t)^2 + D_{\text{Burg}}(u, w_0) - D_{\text{Burg}}(u, w_T).$$

Thus, since $\eta_t \ge \eta_T$ for all $t$, we have:

$$L_{BG} \le \frac{1}{1 - 2\eta_T mR^2}\,L_u + \frac{D_{\text{Burg}}(u, w_0)}{\eta_T}.$$

Note that similar bounds can be obtained using the framework presented in Gordon [5], but the resulting bound is substantially weaker than our bound and the proof is much more involved.

3 The Matrix Case

In the matrix case, at the $t$-th step the algorithm receives an instance $X_t$ ($X_t \in \mathbb{R}^{m\times m}_+$) and predicts $\hat{y}_t = \text{tr}(W_tX_t)$, where $W_t$ is the current model used by the algorithm. The error incurred at the $t$-th step is $\ell_t(W_t) = (y_t - \hat{y}_t)^2$. Similarly to the vector case, for a given $T$-trial sequence $S = \{(X_1, y_1), (X_2, y_2), \ldots, (X_T, y_T)\}$, the optimal offline solution is given by

$$U = \arg\min_{W}\sum_{t=1}^{T}\ell_t(W) \quad \text{s.t. } W \succ 0.$$

As a result, the goal here is to compare the loss of the online algorithm with that of the optimal offline solution, just as in the vector case. For the matrix case, we use the LogDet divergence as the regularization in the objective. The LogDet divergence is the generalization of $D_{\text{Burg}}$ to matrices, given by

$$D_{ld}(W, W_t) = \text{tr}(WW_t^{-1}) - \log\det(WW_t^{-1}) - m.$$
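As a small illustration (not part of the original report), the LogDet divergence above can be evaluated numerically as follows; the function name is hypothetical, and the $-m$ normalization term follows the standard definition and cancels in all the differences used in the analysis below.

```python
import numpy as np

def logdet_divergence(W, W_t):
    """D_ld(W, W_t) = tr(W W_t^{-1}) - log det(W W_t^{-1}) - m,
    for symmetric positive definite W and W_t of size m x m."""
    m = W.shape[0]
    M = np.linalg.solve(W_t, W)        # W_t^{-1} W has the same trace and determinant as W W_t^{-1}
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - m

# Sanity check: the divergence is zero when the two arguments coincide.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
print(logdet_divergence(A, A))         # ~0.0
```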

3.1 Rank-One Input

We consider the special case where each input $X_t$ is a rank-one matrix. This assumption holds in the case of information-theoretic metric learning (discussed later), where the inputs are rank-one constraints. We further assume that the input is symmetric; thus, the input matrix $X_t$ can be written as $X_t = z_tz_t^T$. Using LogDet as the regularization term, we derive Algorithm 2 analogously to the vector case. We set $\eta_t$ carefully at each step so that the algorithm ensures positive definiteness of the weight matrix $W_t$; Lemma 2 proves this invariant.

Algorithm 2 Matrix Burg Gradient (MBG) Algorithm
1: Initialize: Set $\eta'_t = \frac{1}{4R^2}$ and $W_0 = \frac{1}{N}I$.
2: Prediction: Upon receiving the $t$-th instance $X_t$, give the prediction $\hat{y}_t = \text{tr}(W_tX_t)$.
3: Update: Upon receiving the $t$-th outcome $y_t$, update the weights according to the rule

$$W_{t+1}^{-1} = W_t^{-1} + 2\eta_t(\hat{y}_t - y_t)X_t, \tag{6}$$

where

$$\eta_t = \begin{cases}\eta'_t & \text{if } \hat{y}_t - y_t > 0, \\[6pt] \min\left(\eta'_t,\; \dfrac{1}{2(y_t - \hat{y}_t)\,\text{tr}\!\left(\big(I + (W_t^{-1} - I)^{-1}\big)W_tX_t\right)}\right) & \text{if } \hat{y}_t - y_t < 0. \end{cases} \tag{7}$$
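Below is a minimal NumPy sketch of a single rank-one step of Algorithm 2 (it is not from the original report; the names and the safety margin are hypothetical choices). It computes the adaptive rate (7) and applies the update (6) in its Sherman-Morrison form, Equation (8) below, so no explicit inversion of the new model is needed.

```python
import numpy as np

def mbg_step(W_t, z_t, y_t, R):
    """One step of the Matrix Burg Gradient (MBG) algorithm for X_t = z_t z_t^T."""
    n = len(z_t)
    eta_prime = 1.0 / (4 * R ** 2)
    yhat_t = z_t @ W_t @ z_t                  # tr(W_t X_t) = z_t^T W_t z_t
    err = yhat_t - y_t
    if err >= 0:
        eta_t = eta_prime
    else:
        # adaptive cap from (7); W_t^{-1} - I is positive definite because 0 < W_t < I
        M = np.eye(n) + np.linalg.inv(np.linalg.inv(W_t) - np.eye(n))
        cap = 1.0 / (2 * (y_t - yhat_t) * np.trace(M @ W_t @ np.outer(z_t, z_t)))
        eta_t = min(eta_prime, 0.9 * cap)     # 0.9 keeps the inequality strict (sketch choice)
    # Sherman-Morrison form of update (6), i.e., Equation (8)
    Wz = W_t @ z_t
    coeff = 2 * eta_t * err / (1 + 2 * eta_t * err * (z_t @ Wz))
    return W_t - coeff * np.outer(Wz, Wz)
```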

Lemma 2. Algorithm 2 preserves the invariant $0 \prec W_t \prec I$.

Proof. We prove this by induction over $t$. The base case holds trivially for $N > 1$. By the induction hypothesis, $0 \prec W_t \prec I$. Write $X_t = zz^T$, since $X_t$ is a symmetric positive semidefinite rank-one matrix. Thus, using the update rule and the Sherman-Morrison-Woodbury formula [4]:

$$W_{t+1} = W_t - \frac{2\eta_t(\hat{y}_t - y_t)\,W_tzz^TW_t}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TW_tz}. \tag{8}$$

We break the proof into the following two cases.

Case 1: $\hat{y}_t - y_t \ge 0$. Consider the update $W_{t+1}^{-1} = W_t^{-1} + 2\eta_t(\hat{y}_t - y_t)X_t$. Since $\eta_t > 0$ and $\hat{y}_t - y_t \ge 0$, we have $2\eta_t(\hat{y}_t - y_t)X_t \succeq 0$. Now, $W_{t+1}^{-1}$ is the sum of a symmetric positive definite matrix and a symmetric positive semidefinite matrix, and so $W_{t+1}$ is positive definite. Using Equation (8),

$$I - W_{t+1} = I - W_t + \frac{2\eta_t(\hat{y}_t - y_t)\,W_tzz^TW_t}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TW_tz}.$$

Since $W_t$ is symmetric, $W_tzz^TW_t = W_tzz^TW_t^T = vv^T \succeq 0$, where $v = W_tz$. Also, $\eta_t > 0$ and $\hat{y}_t - y_t \ge 0$, and thus the second term above is positive semidefinite. Furthermore, $W_t \prec I$, and so $I - W_t \succ 0$. Thus, $I - W_{t+1}$ is the sum of a symmetric positive definite and a symmetric positive semidefinite matrix, implying $I - W_{t+1} \succ 0$. We conclude that $W_{t+1} \prec I$.

Case 2: $\hat{y}_t - y_t < 0$. By Equation (8),

$$W_{t+1} = W_t - \frac{2\eta_t(\hat{y}_t - y_t)\,W_tzz^TW_t}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TW_tz} = W_t + \frac{2\eta_t(y_t - \hat{y}_t)\,W_tzz^TW_t}{1 - 2\eta_t(y_t - \hat{y}_t)\,z^TW_tz}.$$

The update rule guarantees that $\eta_t < \frac{1}{2(y_t - \hat{y}_t)\,\text{tr}(W_tX_t)}$. Since

$$\text{tr}(W_tX_t) = \text{tr}(W_tzz^T) = \text{tr}(z^TW_tz) = z^TW_tz, \tag{9}$$

we know that $1 - 2\eta_t(y_t - \hat{y}_t)z^TW_tz > 0$, implying that $W_{t+1}$ is the sum of a positive definite and a positive semidefinite matrix; hence $W_{t+1} \succ 0$. Write $W_t$ via its eigendecomposition $W_t = Q\Sigma Q^T$. Since $W_t$ is a positive definite matrix, $\Sigma$ is a diagonal matrix with positive diagonal. Using this decomposition, the update for $W_{t+1}$ is expressed as

$$W_{t+1} = Q\Sigma Q^T - \frac{2\eta_t(\hat{y}_t - y_t)\,Q\Sigma Q^Tzz^TQ\Sigma Q^T}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TQ\Sigma Q^Tz}. \tag{10}$$

Since $W_t \prec I$, we have $I - \Sigma \succ 0$. Using Equation (10),

$$\begin{aligned}
I - W_{t+1} &= I - Q\Sigma Q^T - \frac{2\eta_t(y_t - \hat{y}_t)\,Q\Sigma Q^Tzz^TQ\Sigma Q^T}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TQ\Sigma Q^Tz}\\
&= Q\left(I - \Sigma - \frac{2\eta_t(y_t - \hat{y}_t)\,\Sigma Q^Tzz^TQ\Sigma}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TQ\Sigma Q^Tz}\right)Q^T\\
&= Q(I - \Sigma)^{1/2}\left(I - \frac{2\eta_t(y_t - \hat{y}_t)\,(I-\Sigma)^{-1/2}\Sigma Q^Tzz^TQ\Sigma(I-\Sigma)^{-1/2}}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TQ\Sigma Q^Tz}\right)(I - \Sigma)^{1/2}Q^T.
\end{aligned}$$

Let $G = Q(I - \Sigma)^{1/2}$. If the middle term is symmetric positive definite, then it can be written as $EE^T$. This would lead to $I - W_{t+1} = GEE^TG^T = (GE)(GE)^T \succeq 0$, since $AA^T \succeq 0$ for any $A$. Thus, to prove that $W_{t+1} \prec I$, we need to prove that

$$I - \frac{2\eta_t(y_t - \hat{y}_t)\,(I-\Sigma)^{-1/2}\Sigma Q^Tzz^TQ\Sigma(I-\Sigma)^{-1/2}}{1 + 2\eta_t(\hat{y}_t - y_t)\,z^TQ\Sigma Q^Tz} \succ 0. \tag{11}$$

To do this, we calculate the eigenvalues of this matrix and show that they are greater than zero. Let $v = (I-\Sigma)^{-1/2}\Sigma Q^Tz$. Then $(I-\Sigma)^{-1/2}\Sigma Q^Tzz^TQ\Sigma(I-\Sigma)^{-1/2} = vv^T$. Since $vv^T$ is a rank-one matrix, it has exactly one nonzero eigenvalue, and that eigenvalue is equal to its trace. Expanding the trace yields

$$\text{tr}\big((I-\Sigma)^{-1/2}\Sigma Q^Tzz^TQ\Sigma(I-\Sigma)^{-1/2}\big) = \text{tr}\big(Q\Sigma(I-\Sigma)^{-1}\Sigma Q^Tzz^T\big) = \text{tr}\big(Q(\Sigma^{-1} - I)^{-1}\Sigma Q^Tzz^T\big) = \text{tr}\big((W_t^{-1} - I)^{-1}W_tX_t\big).$$

To prove Equation (11), we need to show that the eigenvalues are positive; equivalently, we can show that

$$1 - \frac{2\eta_t(y_t - \hat{y}_t)\,\text{tr}\big((W_t^{-1} - I)^{-1}W_tX_t\big)}{1 + 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(W_tX_t)} > 0. \tag{12}$$

By the update rule, $\eta_t < \frac{1}{2(y_t - \hat{y}_t)\,\text{tr}\left(\big(I + (W_t^{-1} - I)^{-1}\big)W_tX_t\right)}$, so

$$2\eta_t(y_t - \hat{y}_t)\Big(\text{tr}(W_tX_t) + \text{tr}\big((W_t^{-1} - I)^{-1}W_tX_t\big)\Big) < 1.$$

In particular $2\eta_t(y_t - \hat{y}_t)\,\text{tr}(W_tX_t) < 1$, and hence the denominator of the second term in (12) is positive. Also, $2\eta_t(y_t - \hat{y}_t)\,\text{tr}\big((W_t^{-1} - I)^{-1}W_tX_t\big) < 1 - 2\eta_t(y_t - \hat{y}_t)\,\text{tr}(W_tX_t) = 1 + 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(W_tX_t)$, so Equation (12) holds. Thus $W_{t+1} \prec I$.

3.2 Regret Bound

Similar to the vector case, we bound the loss incurred by the algorithm at each step in terms of the loss incurred by the optimal offline solution. Assume $\|z\|_2 \le R$.

Lemma 3. $a_t(\hat{y}_t - y_t)^2 - b_t(\text{tr}(UX_t) - y_t)^2 \le D_{ld}(U, W_t) - D_{ld}(U, W_{t+1})$, where $a_t$ and $b_t$ are positive constants and $U$ is the optimal offline solution.

Proof.

$$D_{ld}(U, W_t) - D_{ld}(U, W_{t+1}) = \log\left(\frac{\det W_t}{\det W_{t+1}}\right) + \text{tr}\big((W_t^{-1} - W_{t+1}^{-1})U\big).$$

Since $\log(\det(A)) = \text{tr}(\log(A))$ and $\text{tr}(AB) = \text{tr}(BA)$, we have that

$$\begin{aligned}
D_{ld}(U, W_t) - D_{ld}(U, W_{t+1}) &= \text{tr}\big(\log(W_t) - \log(W_{t+1})\big) + \text{tr}\big((W_t^{-1} - W_{t+1}^{-1})U\big)\\
&= \text{tr}\big(\log(W_tW_{t+1}^{-1})\big) - 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(X_tU)\\
&= \text{tr}\big(\log(I + 2\eta_t(\hat{y}_t - y_t)W_tX_t)\big) - 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(UX_t).
\end{aligned}$$

Using the Taylor series expansion and ignoring higher-order terms,

$$D_{ld}(U, W_t) - D_{ld}(U, W_{t+1}) \ge 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(W_tX_t) - 2\eta_t^2(\hat{y}_t - y_t)^2\,\text{tr}\big((W_tX_t)^2\big) - 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(UX_t).$$

We write $X_t$ as $zz^T$, so $\text{tr}\big((W_tX_t)^2\big) = \text{tr}(W_tzz^TW_tzz^T) = (z^TW_tz)\,\text{tr}(W_tzz^T) = \text{tr}(W_tX_t)^2$. Since $W_t \prec I$ by Lemma 2, $\text{tr}(W_tX_t) = z^TW_tz \le z^Tz \le R^2$. As a result,

$$\begin{aligned}
D_{ld}(U, W_t) - D_{ld}(U, W_{t+1}) &\ge 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(W_tX_t) - 2\eta_t^2(\hat{y}_t - y_t)^2R^2 - 2\eta_t(\hat{y}_t - y_t)\,\text{tr}(UX_t)\\
&= 2\eta_t(\hat{y}_t - y_t)\hat{y}_t - 2\eta_t^2(\hat{y}_t - y_t)^2R^2 - 2\eta_t(\hat{y}_t - y_t)r,
\end{aligned}$$

where $r = \text{tr}(UX_t)$. Thus, to prove Lemma 3, we need to prove the same inequality as in Equation (5), with $mR^2$ replaced by $R^2$. Hence, for $0 \le a_t \le \eta_t$ and $b_t = \frac{\eta_t}{1 - 2\eta_tR^2}$, Lemma 3 holds.

As in the vector case, we can obtain the final bound by summing the loss at each step.

Theorem 2. $L_{MBG} \le \frac{1}{1 - 2\eta_TR^2}\,L_U + \frac{1}{\eta_T}H_{ld}(U)$, where $L_{MBG}$ is the loss incurred by the Matrix Burg Gradient descent algorithm, $L_U$ is the loss incurred by the optimal offline algorithm, and $H_{ld}(U) = \log\det(U)$.

Proof. The proof is very similar to that of the vector case.

4 Application: Online Metric Learning

Selecting an appropriate distance measure (or metric) is fundamental to many learning algorithms such as k-means, nearest neighbor searches, and others. However, choosing such a measure is highly problem-specific and ultimately dictates the success or failure of the learning algorithm. To this end, there have been several recent approaches that attempt to learn distance functions. These methods work by exploiting distance information that is intrinsically available in many learning settings. For example, in the problem of semi-supervised clustering, points are constrained to be either similar (i.e., the distance between them should be relatively small) or dissimilar (the distance should be larger). In information retrieval settings, constraints between pairs of distances can be gathered from click-through feedback. In fully supervised settings, constraints can be inferred so that points in the same class have smaller distances to each other than to points in different classes.

Given a set of $n$ points $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, we seek a positive definite matrix $A$ which parameterizes the Mahalanobis distance:

$$d_A(x_i, x_j) = (x_i - x_j)^TA(x_i - x_j). \tag{13}$$

(The term Mahalanobis distance frequently refers to the squared Mahalanobis distance, even though it is $\sqrt{d_A(x, y)}$ that is a metric. This difference does not affect our method, hence we drop the qualification "squared" when referring to (13).)
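As a quick numerical illustration (not in the original report; the variable names are hypothetical), the Mahalanobis distance (13) can equivalently be computed as a trace against the rank-one matrix $zz^T$ with $z = x_i - x_j$, which is exactly the form used by the rank-one regression algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
B = rng.normal(size=(d, d))
A = B @ B.T + np.eye(d)                   # a positive definite Mahalanobis matrix
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

z = x_i - x_j
direct = z @ A @ z                        # (x_i - x_j)^T A (x_i - x_j), Equation (13)
as_trace = np.trace(A @ np.outer(z, z))   # tr(A z z^T)

print(np.isclose(direct, as_trace))       # True: the two forms agree
```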

We can quantify the measure of closeness between two Mahalanobis matrices $A$ and $A_0$ via a natural information-theoretic approach. There exists a simple bijection (up to a scaling function) between the set of Mahalanobis distances and the set of equal-mean multivariate Gaussian distributions (without loss of generality, we can assume the Gaussians have mean $\mu$). Given a Mahalanobis distance parameterized by $A$, we express its corresponding multivariate Gaussian as $p(x; A) = \frac{1}{Z}\exp\left(-\frac{1}{2}d_A(x, \mu)\right)$, where $Z$ is a normalizing constant. Using this bijection, we measure the distance between two Mahalanobis distance functions parameterized by $A$ and $A_0$ by the (differential) relative entropy between their corresponding multivariate Gaussians:

$$KL\big(p(x; A)\,\|\,p(x; A_0)\big) = \int p(x; A)\log\frac{p(x; A)}{p(x; A_0)}\,dx. \tag{14}$$

The distance (14) provides a well-founded measure of closeness between two Mahalanobis distance functions and forms the basis of information-theoretic metric learning. It is known that the differential relative entropy between two multivariate Gaussians can be expressed as the convex combination of a Mahalanobis distance between mean vectors and the LogDet divergence between the covariance matrices. Assuming the means to be the same, we have

$$KL\big(p(x; A)\,\|\,p(x; A_0)\big) = \frac{1}{2}D_{ld}(A, A_0). \tag{15}$$

We therefore select $D(A, A_t) = D_{ld}(A, A_t)$ as the regularizer for (1), using the above equivalence (15) to obtain the update

$$A_{t+1} = \arg\min_{A}\; D_{ld}(A, A_t) + \eta_t\big(d_t - \hat{d}_t\big)^2.$$

We can view the problem of online metric learning in the following manner: at each time step, we receive a matrix $X_t$ and a target $y_t$; these encode the distance between two vectors $x_i$ and $x_j$. If we let $z = x_i - x_j$, then

$$\text{tr}(A_tX_t) = z^TA_tz = (x_i - x_j)^TA_t(x_i - x_j) = d_{A_t}(x_i, x_j).$$

Thus, the prediction $\text{tr}(A_tX_t)$ captures the Mahalanobis distance between the vectors $x_i$ and $x_j$ using the current Mahalanobis matrix $A_t$. The resulting loss $(\text{tr}(A_tX_t) - y_t)^2$, where $y_t$ is the target distance between the two vectors, tells us how close the current Mahalanobis matrix $A_t$ is in terms of satisfying the given distance constraint. As a result, the rank-one online regression algorithm using the LogDet divergence can be used as a method for learning a Mahalanobis matrix $A$, where at every time step two points, along with their target distance, are presented to the algorithm. The regret bound proven in the previous section applies directly to this case. See [2, 3] for further details on information-theoretic metric learning, including the analysis of an offline algorithm.

5 Experimental Results

In Sections 2.1 and 3.2, we showed worst-case bounds for our algorithms. To simulate the online learning problem, we now generate simple artificial data. First, we generate a random vector $u$ drawn from $S^m$. Then, we generate some random inputs $x_t$, also drawn from $S^m$. We output $y_t$ as $u \cdot x_t$ mixed with some noise, i.e., $y_t = (u \cdot x_t)(1 + \nu Z)$, where $Z$ is a uniformly distributed random number over $[-0.5, 0.5]$ and $\nu$ is the noise level. We generate this artificial data for $N = 100$ and $T = 10000$ iterations. Figure 1 compares the performance of the Burg entropy-based algorithm with those using relative entropy and squared Euclidean distance as the regularization term, when the noise level is set to 0. Clearly, the Burg algorithm converges more slowly than the other methods. But Figures 2 and 3 show that as noise increases, the Burg algorithm's performance improves in comparison to both relative entropy and squared Euclidean distance. Figure 4 shows that when the noise is around 30%, the Burg algorithm performs comparably to relative entropy, and squared Euclidean distance performs poorly.
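A minimal sketch of the simulation protocol described above (not from the original report): it generates the noisy targets $y_t = (u \cdot x_t)(1 + \nu Z)$ and can feed them to any of the online updates; the sampling distribution, normalization, and function names are hypothetical choices.

```python
import numpy as np

def make_stream(dim=100, T=10000, noise=0.1, seed=0):
    """Yield (x_t, y_t) pairs with multiplicative uniform noise on the targets."""
    rng = np.random.default_rng(seed)
    u = rng.random(dim)
    u /= u.sum()                      # a fixed positive "true" model (assumed normalization)
    for _ in range(T):
        x_t = rng.random(dim)         # positive inputs, as required by the Burg updates
        Z = rng.uniform(-0.5, 0.5)
        y_t = (u @ x_t) * (1 + noise * Z)
        yield x_t, y_t

# Example: peek at one trial from a 30%-noise stream.
x0, y0 = next(make_stream(dim=100, T=10000, noise=0.3))
print(x0.shape, y0)
```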
Figure 1: Comparison of the performance of the Burg, relative entropy, and Euclidean distance measures with noise = 0%.
Figure 2: Comparison of the performance of the Burg, relative entropy, and Euclidean distance measures with noise = 10%.
Figure 3: Comparison of the performance of the Burg, relative entropy, and Euclidean distance measures with noise = 20%.
Figure 4: Comparison of the performance of the Burg, relative entropy, and Euclidean distance measures with noise = 30%.

6 Conclusion

In this paper, we introduced online regression based on Burg entropy regularization, leading to new online regret bounds. An extension to the rank-one matrix case was also analyzed. The main application is to online metric learning, which can be cast as an online regression problem for rank-one matrices; this metric learning formulation has an information-theoretic motivation, and results in [2] have shown that it outperforms existing methods. Future work includes improving the bounds so that they do not depend on the learning rate, and extending the matrix analysis to the general case.


References

[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, Clustering with Bregman divergences, in SDM, 2004.
[2] J. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, Information-theoretic metric learning, in International Conference on Machine Learning (ICML), 2007, to appear.
[3] J. Davis, B. Kulis, S. Sra, and I. Dhillon, Information-theoretic metric learning, in NIPS 2006 Workshop on Learning to Compare Examples, 2006.
[4] G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins Series in the Mathematical Sciences, Johns Hopkins Univ. Press, second ed., 1989.
[5] G. J. Gordon, Regret bounds for prediction problems, in COLT, 1999.
[6] J. Kivinen and M. K. Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Inf. Comput., 132 (1997), pp. 1-63.
[7] B. Kulis, M. Sustik, and I. S. Dhillon, Learning low-rank kernel matrices, in ICML, 2006.
[8] K. Tsuda, G. Raetsch, and M. K. Warmuth, Matrix exponentiated gradient updates for online learning and Bregman projection, Journal of Machine Learning Research, 6 (2005).
