Linear classification models: Perceptron. CS534-Machine learning

Classification problem
Given input $x$, the goal is to predict $y$, which is a categorical variable. $y$ is called the class label; $x$ is the feature vector.
Examples: $x$: monthly income and bank saving amount, $y$: risky or not risky. $x$: review text for a product, $y$: sentiment (positive, negative or neutral).

Linear Classifier
We will begin with the simplest choice: linear classifiers.
[Figure: positive and negative points in the plane separated by a linear decision boundary.]

Why linear models?

Binary classification: General Setup
Given a set of training examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$, where each $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.
Learn a linear function $g(\mathbf{w}, \mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_d x_d$.
Given an example $\mathbf{x} = (x_1, \ldots, x_d)^T$: predict $y = 1$ if $g(\mathbf{w}, \mathbf{x}) \ge 0$, predict $y = -1$ otherwise.
Compactly, the classifier can be represented as $\hat{y} = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_d x_d) = \mathrm{sgn}(\mathbf{w}^T \mathbf{x})$, where $\mathbf{w} = (w_0, w_1, \ldots, w_d)^T$ and $\mathbf{x} = (1, x_1, \ldots, x_d)^T$.
Goal: find a good $\mathbf{w}$ that minimizes some loss function $J(\mathbf{w})$.
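To make the compact form concrete, here is a minimal NumPy sketch of the $\hat{y} = \mathrm{sgn}(\mathbf{w}^T \mathbf{x})$ classifier; the function name predict_linear and the convention of prepending the constant feature $x_0 = 1$ inside the function are ours, chosen to match the augmented $\mathbf{w}$ and $\mathbf{x}$ above.

```python
import numpy as np

def predict_linear(w, X):
    """Predict labels in {-1, +1} as sgn(w^T x) for each row of X.

    w: (d+1,) weight vector (w_0, w_1, ..., w_d)
    X: (n, d) matrix of raw feature vectors (without the constant feature)
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1
    scores = X_aug @ w                                 # g(w, x) = w^T x
    return np.where(scores >= 0, 1, -1)                # sgn, ties mapped to +1
```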

0/1 Loss
$J_{0/1}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L_{0/1}(\mathrm{sgn}(\mathbf{w}^T \mathbf{x}_i), y_i)$, where $L(\hat{y}, y) = 0$ when $\hat{y} = y$, otherwise $L(\hat{y}, y) = 1$.
[Figure: the 0/1 loss surface over $\mathbf{w}$; each flat region corresponds to a fixed number of mistakes.]
Issue: it does not produce a useful gradient, since the surface of $J_{0/1}$ is piece-wise flat.

0/1 loss vs. the Perceptron criterion
Perceptron Loss: $J_p(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \max(0, -y_i \mathbf{w}^T \mathbf{x}_i)$
If the prediction is correct, $y_i \mathbf{w}^T \mathbf{x}_i > 0$, so $\max(0, -y_i \mathbf{w}^T \mathbf{x}_i) = 0$.
If incorrect, $y_i \mathbf{w}^T \mathbf{x}_i \le 0$, so $\max(0, -y_i \mathbf{w}^T \mathbf{x}_i) = -y_i \mathbf{w}^T \mathbf{x}_i \ge 0$, a linear function of the input features.
$J_p$ is piece-wise linear and has a nice gradient leading to the solution region.
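For illustration, the two losses might be computed side by side as in the sketch below; the names zero_one_loss and perceptron_loss are ours, and X is assumed to already include the constant feature.

```python
import numpy as np

def zero_one_loss(w, X, y):
    """J_0/1(w): fraction of examples with sgn(w^T x_i) != y_i."""
    preds = np.where(X @ w >= 0, 1, -1)
    return np.mean(preds != y)

def perceptron_loss(w, X, y):
    """J_p(w): mean of max(0, -y_i w^T x_i); zero on correctly classified points."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, -margins))
```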

Stochastic Gradient Descent
The objective function consists of a sum over the data points: $J_p(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \max(0, -y_i \mathbf{w}^T \mathbf{x}_i)$.
Stochastic gradient descent updates the parameters after observing each example. The per-example gradient is $-y_i \mathbf{x}_i$ if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$, and $0$ otherwise.
Update rule: after observing $(\mathbf{x}_i, y_i)$, if it is a mistake, $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$.

Online Perceptron (stochastic gradient descent)
Let $\mathbf{w} \leftarrow (0, 0, 0, \ldots, 0)$
Repeat until convergence:
  for every training example $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$:
    if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$: $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$
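A direct rendering of this pseudocode as a sketch; the max_epochs cap is an assumption added so the loop terminates even on non-separable data, and the step size is fixed at 1 as in the update above.

```python
import numpy as np

def online_perceptron(X, y, max_epochs=100):
    """X: (n, d+1) array with the constant feature included; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # mistake (or exactly on the boundary)
                w = w + y_i * x_i      # move w in the direction that corrects it
                mistakes += 1
        if mistakes == 0:              # a full pass with no mistakes: converged
            break
    return w
```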

When an error is made, the update moves the weight vector in a direction that corrects the error.
[Figure: decision boundaries 1, 2 and 3 after successive updates; red points belong to the positive class, blue points belong to the negative class.]

Convergence Theorem (Block, 1962; Novikoff, 1962)
Given a training example sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$. If $\|\mathbf{x}_i\| \le D$ for all $i$, and there exists a weight vector $\mathbf{u}$ with $\|\mathbf{u}\| = 1$ and $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma > 0$ for all $i$, then the number of mistakes the perceptron algorithm makes is at most $D^2 / \gamma^2$. Note that $\|\cdot\|$ denotes the Euclidean norm of a vector.

Proof
Let $\mathbf{u}$ be a solution vector, i.e. $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for all $i$. We can assume $\|\mathbf{u}\| = 1$, since the scaling is an arbitrary factor and a scaled $\mathbf{u}$ is also a solution.
Let $\mathbf{w}_k$ be the weight vector when the $k$-th mistake is made, on example $(\mathbf{x}_i, y_i)$, so $\mathbf{w}_{k+1} = \mathbf{w}_k + y_i \mathbf{x}_i$, with $\mathbf{w}_1 = \mathbf{0}$.
Then $\mathbf{w}_{k+1} \cdot \mathbf{u} = \mathbf{w}_k \cdot \mathbf{u} + y_i \mathbf{x}_i \cdot \mathbf{u} \ge \mathbf{w}_k \cdot \mathbf{u} + \gamma$ (because $y_i \mathbf{x}_i \cdot \mathbf{u} \ge \gamma$).
Also $\|\mathbf{w}_{k+1}\|^2 = \|\mathbf{w}_k\|^2 + 2\, y_i \mathbf{w}_k \cdot \mathbf{x}_i + \|\mathbf{x}_i\|^2 \le \|\mathbf{w}_k\|^2 + D^2$ (because $y_i \mathbf{w}_k \cdot \mathbf{x}_i \le 0$ on a mistake, and because $\|\mathbf{x}_i\| \le D$).

Proof (cont.)
By induction on $k$: $\mathbf{w}_{k+1} \cdot \mathbf{u} \ge k\gamma$ and $\|\mathbf{w}_{k+1}\|^2 \le k D^2$.
Combining the two, $k\gamma \le \mathbf{w}_{k+1} \cdot \mathbf{u} \le \|\mathbf{w}_{k+1}\| \|\mathbf{u}\| = \|\mathbf{w}_{k+1}\| \le \sqrt{k}\, D$, hence $k \le D^2 / \gamma^2$.

Margin
$\gamma$ is referred to as the margin: the minimum distance from the data points to the decision boundary.
Bigger margin -> easier classification problem.
Bigger margin -> more confidence in our prediction.
This concept will be utilized in later methods: support vector machines.

Batch Perceptron Algorithm
Given: training examples $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$
Let $\mathbf{w} \leftarrow (0, 0, 0, \ldots, 0)$
repeat {
  delta $\leftarrow (0, 0, 0, \ldots, 0)$
  for $i = 1$ to $n$ {
    if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$: delta $\leftarrow$ delta $+\, y_i \mathbf{x}_i$
  }
  delta $\leftarrow$ delta$/n$
  $\mathbf{w} \leftarrow \mathbf{w} + \lambda\,$delta
} until $\|$delta$\| < \varepsilon$
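The same pseudocode written out as a sketch; lam and eps play the roles of $\lambda$ and $\varepsilon$ above, and their default values plus the max_iters safety cap are illustrative assumptions.

```python
import numpy as np

def batch_perceptron(X, y, lam=1.0, eps=1e-6, max_iters=1000):
    """Batch perceptron: accumulate corrections over a full pass, then apply them."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):                 # safety cap, not in the pseudocode
        delta = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                delta += y_i * x_i             # accumulate the correction
        delta /= len(y)
        w = w + lam * delta                    # apply all corrections at once
        if np.linalg.norm(delta) < eps:        # no net correction left
            break
    return w
```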

Online vs. Batch Perceptron
Batch learning learns from a batch of examples collectively; online learning learns from one example at a time. Both learning mechanisms are useful in practice.
The online perceptron is sensitive to the order in which training examples are received.
In batch training, the corrections are accumulated and applied at once. In online training, each correction is applied immediately once a mistake is encountered, which changes the decision boundary; thus online and batch training may encounter different mistakes.
Online training performs stochastic gradient descent, an approximation to the true gradient descent used by batch training.

Not linearly separable case
In such cases the algorithm will never converge! How to fix it? Look for the decision boundary that makes as few mistakes as possible: NP-hard!

Fixing the Perceptron
Idea one: only go through the data once, or a fixed number of times.
Let $\mathbf{w} \leftarrow (0, 0, 0, \ldots, 0)$
Repeat for $T$ times:
  for each training example $(\mathbf{x}_i, y_i)$:
    if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$: $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$
At least this stops. Problem: the final $\mathbf{w}$ might not be good, e.g. the last update could be on a total outlier.

Voted Perceptron
Keep the intermediate hypotheses and have them vote [Freund and Schapire 1998].
Let $\mathbf{w}_0 \leftarrow (0, 0, 0, \ldots, 0)$, $c_0 = 0$, $m = 0$
Repeat for $T$ times:
  for each training example $(\mathbf{x}_i, y_i)$:
    if $y_i \mathbf{w}_m^T \mathbf{x}_i \le 0$:
      $\mathbf{w}_{m+1} \leftarrow \mathbf{w}_m + y_i \mathbf{x}_i$; $c_{m+1} \leftarrow 0$; $m \leftarrow m + 1$
    else:
      $c_m \leftarrow c_m + 1$
The output is a collection of linear separators $\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_M$ along with their survival times $c_0, c_1, \ldots, c_M$. The $c$'s can be viewed as measures of the reliability of the $\mathbf{w}$'s.
For classification, take a weighted vote among all separators: $\hat{y} = \mathrm{sgn}\left(\sum_{m=0}^{M} c_m\, \mathrm{sgn}(\mathbf{w}_m^T \mathbf{x})\right)$
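A sketch of the voted perceptron along these lines; storing the $(\mathbf{w}_m, c_m)$ pairs in a Python list is our choice of representation, not prescribed by the slide.

```python
import numpy as np

def voted_perceptron(X, y, T=10):
    """Return the list of separators (w_m, c_m) produced over T passes."""
    w = np.zeros(X.shape[1])
    c = 0
    separators = []
    for _ in range(T):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                separators.append((w.copy(), c))  # retire w_m with its survival time
                w = w + y_i * x_i                  # w_{m+1}
                c = 0
            else:
                c += 1
    separators.append((w.copy(), c))               # the final surviving separator
    return separators

def voted_predict(separators, x):
    """Weighted vote: sgn( sum_m c_m * sgn(w_m^T x) )."""
    vote = sum(c * np.sign(w @ x) for w, c in separators)
    return 1 if vote >= 0 else -1
```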

Average Perceptron
The voted perceptron requires storing all intermediate weights: large memory consumption and slow prediction time.
Average perceptron: $\hat{y} = \mathrm{sgn}\left(\sum_{m=0}^{M} c_m\, \mathbf{w}_m^T \mathbf{x}\right)$: take the weighted average of all the intermediate weights.
Can be implemented by maintaining a running average, no need to store all weights. Fast prediction time.
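One way the running average might be maintained is sketched below: summing the current $\mathbf{w}$ after every example weights each intermediate vector by its survival time while keeping only two vectors in memory (an assumption about the implementation, not stated on the slide).

```python
import numpy as np

def average_perceptron(X, y, T=10):
    """Return the averaged weight vector; only two vectors are kept in memory."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(T):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w = w + y_i * x_i
            w_sum += w                  # each step adds the currently surviving w
    return w_sum / (T * len(y))

# Prediction is then a single dot product, sgn(w_avg^T x), instead of a
# vote over every stored separator as in the voted perceptron.
```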

Final Discussion
The perceptron learns $\hat{y} = f(\mathbf{x})$ directly: a discriminative method.
Gradient descent is used to optimize the perceptron loss; the online version performs stochastic gradient descent.
Guaranteed to converge in a finite number of steps if the data is linearly separable. The upper bound on the number of corrections needed is inversely proportional to the (squared) margin of the optimal decision boundary.
If not linearly separable, the voted or average perceptron can be used.
Hyper-parameter: the number of epochs $T$. A very large $T$ could still lead to overfitting.

Beyond the Basic Perceptron

Structured Prediction with Perceptrons
[Figure: several candidate parse trees (S, NP, VP, PP, N, V, P, D) for the sentence "Time flies like an arrow".]
Based on Jason Eisner's notes.

A general problem (structured prediction)
Given some input $x$: an email, a sentence. Consider a set of candidate outputs $y$:
classifications of $x$ (small number: often just 2),
taggings of $x$ (exponentially many),
parses of $x$ (exponentially many),
translations of $x$ (exponentially many).
We want to find the best $y$, given $x$.
Based on Jason Eisner's notes.

Scoring by Linear Models
Given some input $x$, consider a set of candidate outputs $y$. Define a scoring function score$(x, y)$.
Linear function: a sum of feature weights, $\mathrm{score}(x, y) = \sum_k \theta_k f_k(x, y)$ (you pick the features!).
$\theta_k$: the weight of feature $k$, learned or set by hand. $k$ ranges over all features, e.g. $k = 5$ (numbered features) or $k$ = "see Det Noun" (named features).
$f_k(x, y)$: whether $(x, y)$ has feature $k$ (0 or 1), or how many times it fires ($\ge 0$), or how strongly it fires (a real number).
Choose the $y$ that maximizes score$(x, y)$.
Based on Jason Eisner's notes.
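As a toy illustration of such a scoring function over named features, a sketch is given below; the helper name feats and the dictionary representation are hypothetical, not from the notes.

```python
def score(x, y, theta, feats):
    """Linear score: sum over features k of theta[k] * f_k(x, y).

    theta: dict mapping feature name -> weight
    feats(x, y): dict mapping feature name -> firing value (0/1, a count, or a real)
    """
    return sum(theta.get(k, 0.0) * v for k, v in feats(x, y).items())
```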

Scoring by Linear Models (cont.)
Given some input $x$, consider a set of candidate outputs $y$, and define a linear scoring function score$(x, y)$: a sum of feature weights (you pick the features; the weights are learned or set by hand).
This linear decision rule is sometimes called a perceptron. It is a structured perceptron if it does structured prediction (the number of candidates is unbounded, e.g. grows with the size of $x$).
Choose the $y$ that maximizes score$(x, y)$.
Based on Jason Eisner's notes.

Perceptron Training Algorithm
Initialize $\theta$ (usually to the zero vector).
Repeat: pick a training example $(x, y)$; the model predicts the $y^*$ that maximizes score$(x, y^*)$; update the weights by a step of size $\varepsilon > 0$: $\theta \leftarrow \theta + \varepsilon\,(f(x, y) - f(x, y^*))$.
If the model prediction was wrong ($y^* \ne y$), then we must have score$(x, y) \le$ score$(x, y^*)$ instead of $>$ as we want. Equivalently, $\theta \cdot f(x, y) \le \theta \cdot f(x, y^*)$; equivalently, $\theta \cdot (f(x, y) - f(x, y^*)) \le 0$, but we want it positive. Our update increases it by $\varepsilon\, \|f(x, y) - f(x, y^*)\|^2 \ge 0$.
Based on Jason Eisner's notes.
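Schematically, the training loop might look like the sketch below; feats(x, y) and argmax_y(theta, x) are assumed user-supplied hooks (the feature map and the inference routine discussed on the next slide), and their names are placeholders, not part of the notes.

```python
import numpy as np

def structured_perceptron(data, feats, argmax_y, dim, epochs=5, eps=1.0):
    """data: list of (x, y) pairs; feats returns a length-dim numpy feature vector.

    Outputs y are assumed comparable with != (e.g. tuples of tags).
    """
    theta = np.zeros(dim)                        # initialize theta to the zero vector
    for _ in range(epochs):
        for x, y in data:
            y_star = argmax_y(theta, x)          # inference: model's highest-scoring y*
            if y_star != y:                      # wrong prediction
                theta += eps * (feats(x, y) - feats(x, y_star))
    return theta
```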

Perceptron for Structured Prediction
What we see here is the same as the regular perceptron, with a similar convergence guarantee.
The challenge is the inference part: finding the $y$ that maximizes the score for a given $x$; we cannot resort to brute-force enumeration.
Much research goes into: how to devise proper features and efficient algorithms for inference; how to perform approximate inference; and how to learn when inference is approximate.