CSC Machine Learning. Lecture: Classification II. September. Sam Roweis.

Probabilistic Classification: Bayes Classifiers
Generative model: p(x, y) = p(y) p(x|y).
p(y) are called class priors; p(x|y) are called class-conditional feature distributions.
For the prior we use a Bernoulli or multinomial: p(y = k | π) = π_k with Σ_k π_k = 1, π_k > 0.
What classification rule should we use? Pick the class that best models the data, i.e. argmax_k p(x | y = k)? No! This behaves very badly if the class priors are skewed. MAP is best:
argmax_y p(y|x) = argmax_y p(x, y) = argmax_y [log p(x|y) + log p(y)]
How should we fit model parameters? Maximum joint likelihood:
Φ = Σ_n log p(x^n, y^n) = Σ_n log p(x^n | y^n) + Σ_n log p(y^n)
1) Sort data into batches by class label.
2) Estimate p(y) by counting the sizes of the batches (plus regularization).
3) Estimate p(x|y) separately within each batch using ML on the class-conditional model (also with regularization).

Review: Classification
Given examples ⟨x, y⟩ of a discrete class label y and some features x. Goal: compute p(y|x) for new x.
Last class: compute a discriminant f(x) and then take the max. This class: probabilistic classifiers.
Two approaches:
Generative: model p(x, y) = p(y) p(x|y); then use Bayes' rule to infer the conditional p(y|x).
Discriminative: model the posterior p(y|x) directly.
The generative approach is related to joint density estimation, while the discriminative approach is closer to regression.

Three Key Regularization Ideas
To avoid overfitting, we can put priors on the parameters of the class and class-conditional feature distributions.
We can also tie some parameters together so that fewer of them are estimated using more data.
Finally, we can make factorization or independence assumptions about the distributions. In particular, for the class-conditional distributions we can assume the features are fully dependent, partly dependent, or independent (!).
[Figure: graphical models over features x_1 ... x_m given class y: (a) fully dependent, (b) partly dependent, (c) independent.]
Class Priors and Multinomial Smoothing
Let's say you were trying to estimate the bias of a coin. You flip it K times; what is your estimate of the probability z of heads?
One answer: maximum likelihood, z = #h/K.
What if you flip it twice and you get both heads? Do you think that z = 1? Would you be infinitely surprised to see a tail?
ML is almost always a bad idea. We need to incorporate a prior belief to modulate the results of small numbers of trials. We do this with a technique called smoothing:
z = (#h + α) / (K + 2α)
where α is the number of pseudo-counts you use for your prior.
The same situation occurs when estimating class priors from data:
p(c) = (#c + α) / (N + Cα)
A very common setting is α = 1, which is called Laplace smoothing.

Regularized Gaussians
Idea 1: assume all the covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis.
Idea 2: use a Wishart prior on the covariance matrix. (Smoothing!) This fattens up the posteriors by making the MAP estimates the sample covariances plus a bit of the identity matrix.
Idea 3: make independence assumptions to get diagonal or identity-multiple covariances (i.e. sparse inverse covariances). More on this in a few minutes...

Gaussian Class-Conditional Distributions
If all the input features are continuous, a popular choice is a Gaussian class-conditional model:
p(x | y = k, θ) = |2πΣ|^{-1/2} exp{ -(1/2)(x - μ_k)' Σ^{-1} (x - μ_k) }
Fitting: use the following simple but useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!]
One very nice feature of this model is that the maximum likelihood parameters can be found in closed form, so we don't have to worry about numerical optimization, local minima, search, etc.
Seems easy. And works surprisingly well. But we can do even better with some simple regularization...
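The smoothing formulas above (the coin and the class priors) can be sketched in a few lines. This is a minimal illustration with made-up counts; the function names are my own, not from the lecture.

```python
import numpy as np

def smoothed_estimate(heads, K, alpha=1.0):
    """Smoothed estimate of p(heads): (#h + alpha) / (K + 2*alpha),
    with alpha pseudo-counts for each of the two outcomes."""
    return (heads + alpha) / (K + 2 * alpha)

def smoothed_class_priors(labels, num_classes, alpha=1.0):
    """Class priors p(c) = (#c + alpha) / (N + C*alpha)."""
    counts = np.bincount(labels, minlength=num_classes)
    return (counts + alpha) / (len(labels) + num_classes * alpha)

# Two flips, both heads: ML would insist p(heads) = 1; smoothing pulls it back.
print(smoothed_estimate(heads=2, K=2))                      # 0.75
print(smoothed_class_priors(np.array([0, 0, 1]), 3))        # no class gets zero mass
```

Note that with α = 1 (Laplace smoothing) no class ever gets probability zero, which is exactly the protection against the "infinitely surprised" failure mode described above.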
Gaussian Bayes Classifier
Maximum likelihood estimates for parameters:
priors π_k: use observed frequencies of classes (plus smoothing)
means μ_k: use class means
covariance Σ: use data from a single class, or pooled data (x^n - μ_{y^n}), to estimate (full/diagonal) covariances
Compute the posterior via Bayes' rule. For equal covariances:
p(y = k | x, θ) = p(x | y = k, θ) p(y = k | π) / Σ_j p(x | y = j, θ) p(y = j | π)
  = exp{ μ_k' Σ^{-1} x - μ_k' Σ^{-1} μ_k / 2 + log π_k } / Σ_j exp{ μ_j' Σ^{-1} x - μ_j' Σ^{-1} μ_j / 2 + log π_j }
  = e^{β_k' x} / Σ_j e^{β_j' x} = exp{β_k' x} / Z
where β_k = [Σ^{-1} μ_k ; (-μ_k' Σ^{-1} μ_k / 2 + log π_k)] and x is augmented with a constant 1 (the last term is the bias).
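Putting the recipe above together, a shared-covariance Gaussian Bayes classifier fits in a few lines. This is a sketch under my own assumptions (toy data, a small ridge on the pooled covariance standing in for the Wishart-prior idea); the function names are not from the lecture.

```python
import numpy as np

def fit_gaussian_bayes(X, y, alpha=1.0, ridge=1e-3):
    """ML fit: smoothed priors, class means, pooled covariance
    (plus a bit of the identity, as the regularization slide suggests)."""
    classes = np.unique(y)
    counts = np.array([np.sum(y == k) for k in classes], dtype=float)
    priors = (counts + alpha) / (counts + alpha).sum()
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - means[np.searchsorted(classes, y)]    # x^n - mu_{y^n}
    cov = resid.T @ resid / len(X) + ridge * np.eye(X.shape[1])
    return classes, priors, means, cov

def posterior(x, priors, means, cov):
    """p(y=k|x) = exp(beta_k' [x;1]) / Z, with beta_k as on the slide."""
    P = np.linalg.inv(cov)
    # score_k = mu_k' P x - mu_k' P mu_k / 2 + log pi_k
    scores = means @ P @ x - 0.5 * np.einsum('kd,de,ke->k', means, P, means) \
             + np.log(priors)
    e = np.exp(scores - scores.max())                 # stable softmax
    return e / e.sum()

# Two well-separated toy clusters: the posterior should be near-certain.
X = np.array([[0., 0.], [0.2, 0.], [0., 0.2], [5., 5.], [5.2, 5.], [5., 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
classes, priors, means, cov = fit_gaussian_bayes(X, y)
print(posterior(np.array([0., 0.]), priors, means, cov))
```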
Linear Geometry
Taking the ratio of any two posteriors (the "odds") shows that the contours of equal pairwise probability are linear surfaces in the feature space if the covariances of all classes are equal:
p(y = k | x, θ) / p(y = j | x, θ) = exp{ (β_k - β_j)' x }
The pairwise discrimination contours p(y_k | x) = p(y_j | x) are orthogonal to the differences of the means in feature space when Σ = σ²I. For general Σ (shared between all classes) the same is true in the transformed feature space u = Σ^{-1/2} x.
The priors do not change the geometry; they only shift the operating point on the logit by the log-odds log(π_k / π_j).
Summary: for equal class covariances, we obtain a linear classifier. If we use different covariances for each class, we get a quadratic classifier with conic-section decision surfaces.

Discriminative Models
Observation: if p(y|x) are linear functions of x (or monotone transforms thereof), the decision surfaces will be piecewise linear.
Idea: parametrize p(y|x) directly; forget p(x, y) and Bayes' rule.
Advantages: we don't need to model the density of the features p(x), which often takes lots of parameters and seems redundant, since many densities give the same linear classifier.
Disadvantages: we cannot detect outliers, compare models using likelihoods, or generate new labelled data.
What should our objective function be? We'll try to use one that is closer to the one we care about at test time (i.e. error rate).

Exponential Family Class-Conditionals
The Bayes classifier has the same form whenever the class-conditional densities are any exponential family density:
p(x | y = k, η_k) = h(x) exp{ η_k' x - a(η_k) }
p(y = k | x, η) = p(x | y = k, η_k) p(y = k | π) / Σ_j p(x | y = j, η_j) p(y = j | π)
  = exp{ η_k' x - a(η_k) + log π_k } / Σ_j exp{ η_j' x - a(η_j) + log π_j } = e^{β_k' x} / Σ_j e^{β_j' x}
where β_k = [η_k ; -a(η_k) + log π_k] and we have augmented x with a constant component always equal to 1 (bias term).
The resulting classifier is linear in the sufficient statistics.

Logistic/Softmax Regression
Model: y is a multinomial random variable whose posterior is the softmax of linear functions of the feature vector:
p(y = k | x, θ) = e^{θ_k' x} / Σ_j e^{θ_j' x}
Fitting: now we optimize the conditional log-likelihood:
ℓ(θ; D) = Σ_n Σ_k [y^n = k] log p(y = k | x^n, θ) = Σ_n Σ_k y_k^n log p_k^n
The gradient follows from the chain rule through z_k^n = θ_k' x^n:
∂ℓ/∂θ_k = Σ_n Σ_j (∂ℓ^n/∂p_j^n)(∂p_j^n/∂z_k^n)(∂z_k^n/∂θ_k),  with  ∂p_j^n/∂z_k^n = p_j^n (δ_jk - p_k^n)
which simplifies to:
∂ℓ/∂θ_k = Σ_n (y_k^n - p_k^n) x^n
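The closed-form gradient above is easy to check against finite differences. A small sanity sketch on random data (the data and names here are made up, not from the lecture):

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def grad_conditional_loglik(Theta, X, Y):
    """Gradient of l(theta) = sum_n sum_k y_k^n log p_k^n,
    which reduces to sum_n (y_k^n - p_k^n) x^n, stacked as X' (Y - P)."""
    P = softmax(X @ Theta)            # n x K posteriors
    return X.T @ (Y - P)              # d x K gradient

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = np.eye(4)[rng.integers(0, 4, size=5)]      # one-hot labels
Theta = rng.normal(size=(3, 4))

def loglik(Th):
    return np.sum(Y * np.log(softmax(X @ Th)))

G = grad_conditional_loglik(Theta, X, Y)
eps, i, k = 1e-6, 1, 2
Tp = Theta.copy(); Tp[i, k] += eps
Tm = Theta.copy(); Tm[i, k] -= eps
numeric = (loglik(Tp) - loglik(Tm)) / (2 * eps)
print(abs(G[i, k] - numeric))        # agreement to ~1e-8
```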
Softmax/Logit
The squashing function is known as the softmax or logit:
φ_k(z) = e^{z_k} / Σ_j e^{z_j}        g(η) = 1 / (1 + e^{-η})
It is invertible (up to a constant):
z_k = log φ_k + c        η = log(g / (1 - g))
Its derivative is easy:
∂φ_k/∂z_j = φ_k (δ_kj - φ_j)        dg/dη = g(1 - g)
[Figure: plot of the logistic squashing function φ(z) against z.]

Artificial Neural Networks
Historically motivated by relations to biology, but for our purposes ANNs are just nonlinear classification machines of the form:
p(y = k | x, θ) = e^{θ_k' h(x)} / Σ_j e^{θ_j' h(x)}
where h_j = σ(b_j' x) are known as the hidden unit activations; the y_k are the output units and the x are the input units.
The nonlinear scalar function σ is called an activation function. We usually use invertible and differentiable activation functions.
If the activation function is linear, the whole network reduces to a linear network, equivalent to logistic regression. [Only if there are at least as many hiddens as inputs and outputs.]
It is often a good idea to add "skip weights" directly connecting inputs to outputs to take care of this linear component directly.

More on Logistic Regression
Hardest part: picking the feature vector x (see next slide!).
Amazing fact: the conditional likelihood is convex in the parameters θ (assuming regularization). Still no local minima!
But the optimal parameters cannot be computed in closed form. However, the gradient is easy to compute, so it is easy to optimize. Slow: gradient descent, IIS. Fast: BFGS, Newton-Raphson, IRLS.
Regularization? Gaussian prior on θ (weight decay): add (ε/2)‖θ‖² to the cost function, which subtracts εθ from each gradient.
Logistic regression could really be called "softmax linear regression". The log odds (logit) between any two classes is linear in the parameters.
Consider what happens if there are two features with identical classification patterns in our training data. Logistic regression can only see the sum of the corresponding weights. Luckily, weight decay will solve this. Moral: always regularize!
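The two softmax facts on the Softmax/Logit slide, invertibility up to a constant and the simple Jacobian, can be verified numerically. A quick check on a toy z vector (the vector is made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0])
phi = softmax(z)

# 1) Invertible up to a constant: z - log(phi) is the same for every component.
c = z - np.log(phi)
print(np.ptp(c))                     # range of c is ~0: all entries equal

# 2) d phi_k / d z_j = phi_k (delta_kj - phi_j), checked by finite differences.
J_analytic = np.diag(phi) - np.outer(phi, phi)
eps = 1e-6
J_numeric = np.empty((3, 3))
for j in range(3):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps; zm[j] -= eps
    J_numeric[:, j] = (softmax(zp) - softmax(zm)) / (2 * eps)
print(np.abs(J_analytic - J_numeric).max())
```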
Common Activation Functions
Two common activation functions: sigmoid and hyperbolic tangent.
sigmoid(z) = 1 / (1 + exp(-z))        tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))
[Figure: plots of sigmoid(z) and tanh(z) against z.]
For small weights, these functions will be operating near zero, where their behaviour is almost linear. Thus, for small weights, the network behaves essentially linearly. But for larger weights, we are effectively learning the input feature functions for a nonlinear version of logistic regression.
In general we want a saturating activation function (why?).
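The "small weights behave linearly" claim follows from the Taylor expansions sigmoid(z) ≈ 1/2 + z/4 and tanh(z) ≈ z near zero. A quick numeric illustration on a small range of pre-activations (the range is my choice):

```python
import numpy as np

# Pre-activations stay near zero when weights are small.
z = np.linspace(-0.1, 0.1, 101)

sig = 1 / (1 + np.exp(-z))
err_sig = np.abs(sig - (0.5 + z / 4)).max()      # cubic-order error, ~z^3/48
err_tanh = np.abs(np.tanh(z) - z).max()          # cubic-order error, ~z^3/3

print(err_sig, err_tanh)    # both tiny: the units are effectively linear here
```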
Geometry of ANNs
ANNs can be thought of as generalized linear models in which the basis functions (hidden units) are sigmoidal "cliffs". The cliff direction is determined by the input-to-hidden weights, and the cliffs are combined by the hidden-to-output weights.
We include bias units of course, and these set where the cliff is positioned relative to the origin.
[Figure: a sigmoidal cliff h_j = σ(b_j' x) in input space.]

Discrete Bayesian Classifier
If the inputs are discrete (categorical), what should we do?
The simplest class-conditional model is a joint multinomial (table):
p(x_1 = a, x_2 = b, ... | y = c) = η^c_{ab...}
This is conceptually correct, but there is a big practical problem.
Fitting: the ML parameters are the observed counts:
η^c_{ab...} = Σ_n [y^n = c][x_1^n = a][x_2^n = b][...] / Σ_n [y^n = c]
Consider digit images at discrete gray levels. How many entries are in the table? How many will be zero? What happens at test time? Doh!
We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard also (what to share?). But what about independence?

Neural Networks for Classification
Neural nets with one hidden layer trained for classification are doing nonlinear logistic regression:
p(y = k | x) = softmax[θ_k' σ(Bx)]
where B and θ are the first- and second-layer weights and σ() is a squashing function (e.g. tanh) that operates componentwise.
The gradient of the conditional likelihood is still easily computable, using the efficient backpropagation algorithm, which we'll see later.
But: we lose the convexity property, so local minima become a problem.
[Figure: two-layer network with inputs x, hidden units σ(Bx), and softmax outputs weighted by θ.]

Naive (Idiot's) Bayes Classifier
Assumption: conditioned on class, attributes are independent.
p(x | y) = Π_i p(x_i | y)
Sounds crazy, right? Right! But it works.
Algorithm: sort data cases into bins according to y^n. Compute marginal probabilities p(y = c) using frequencies. For each class, estimate the distribution of the i-th variable: p(x_i | y = c).
At test time, compute argmax_c p(c | x) using
c(x) = argmax_c p(c | x) = argmax_c [log p(x | c) + log p(c)] = argmax_c [log p(c) + Σ_i log p(x_i | c)]
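The naive Bayes decision rule above is just a sum of log-tables. A minimal sketch with made-up conditional tables (two classes, two binary features; all numbers are illustrative):

```python
import numpy as np

def naive_bayes_predict(x, log_priors, log_cond):
    """argmax_c [log p(c) + sum_i log p(x_i | c)], the decision rule above.
    log_cond[c, i, j] holds log p(x_i = j | y = c)."""
    scores = [lp + sum(log_cond[c, i, xi] for i, xi in enumerate(x))
              for c, lp in enumerate(log_priors)]
    return int(np.argmax(scores))

log_priors = np.log([0.5, 0.5])
# p(x_i = j | c): class 0 favours value 0, class 1 favours value 1.
log_cond = np.log(np.array([[[0.9, 0.1], [0.8, 0.2]],
                            [[0.2, 0.8], [0.3, 0.7]]]))

print(naive_bayes_predict((0, 0), log_priors, log_cond))   # class 0
print(naive_bayes_predict((1, 1), log_priors, log_cond))   # class 1
```

Working in log space avoids the underflow you would get from multiplying many small probabilities, which matters once there are hundreds of features.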
Discrete (Multinomial) Naive Bayes
Discrete features x_i, assumed independent given the class label y:
p(x_i = j | y = k) = η_{ijk}
p(x | y = k, η) = Π_i Π_j η_{ijk}^{[x_i = j]}
Classification rule:
p(y = k | x, η) = π_k Π_i Π_j η_{ijk}^{[x_i = j]} / Σ_q π_q Π_i Π_j η_{ijq}^{[x_i = j]} = e^{β_k' x} / Σ_q e^{β_q' x}
β_k = [log η_{11k}; ...; log η_{ijk}; ...; log π_k]
x = [x_1 = 1; x_1 = 2; ...; x_i = j; ...; 1]
(indicator encoding of the features, augmented with a constant 1)
[Figure: graphical model with features x_1 ... x_m independent given y.]

Gaussian Naive Bayes
This is just a Gaussian Bayes classifier with a separate but diagonal covariance matrix for each class.
Equivalent to fitting a 1-D Gaussian to each input for each class.
NB: the decision surfaces are quadratics, not linear...
Even better idea: blend between full, diagonal, and identity covariances.

Fitting Discrete Naive Bayes
ML parameters are class-conditional frequency counts:
η*_{ijk} = Σ_n [x_i^n = j][y^n = k] / Σ_n [y^n = k]
How do we know? Write down the likelihood:
ℓ(θ; D) = Σ_n log p(y^n | π) + Σ_n Σ_i log p(x_i^n | y^n, η)
and optimize it by setting its derivative to zero (careful! enforce normalization with Lagrange multipliers):
ℓ(η; D) = Σ_n Σ_{ijk} [x_i^n = j][y^n = k] log η_{ijk} + Σ_{ik} λ_{ik} (1 - Σ_j η_{ijk})
∂ℓ/∂η_{ijk} = Σ_n [x_i^n = j][y^n = k] / η_{ijk} - λ_{ik}
∂ℓ/∂η_{ijk} = 0  ⇒  λ_{ik} = Σ_n [y^n = k]  ⇒  η*_{ijk} = Σ_n [x_i^n = j][y^n = k] / Σ_n [y^n = k]

Noisy-OR Classifier
Many probabilistic models can be obtained as noisy versions of formulas from propositional logic.
Noisy-OR: each input activates the output with some probability.
p(y = 0 | x, α) = Π_i α_i^{x_i} = exp{ Σ_i x_i log α_i }
Letting θ_i = -log α_i we get yet another linear classifier:
p(y = 1 | x, θ) = 1 - e^{-θ' x}
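The noisy-OR identity above, the product over α_i versus the linearized θ form, can be verified directly. A minimal sketch with made-up failure probabilities α_i:

```python
import numpy as np

# alpha_i = probability that an active input i *fails* to turn the output on.
alpha = np.array([0.9, 0.5, 0.1])
theta = -np.log(alpha)               # theta_i = -log alpha_i

def noisy_or(x):
    """Returns p(y=1|x) computed two ways: the product form
    1 - prod_i alpha_i^{x_i}, and the linear form 1 - exp(-theta . x)."""
    p_product = 1.0 - np.prod(alpha ** x)
    p_linear = 1.0 - np.exp(-theta @ x)
    return p_product, p_linear

p1, p2 = noisy_or(np.array([1, 0, 1]))
print(p1, p2)                        # both 1 - 0.9*0.1 = 0.91
```

Because the score θ'x is linear in the (binary) inputs, noisy-OR joins the family of linear classifiers catalogued in this lecture, only the link function differs.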
Joint vs. Conditional Models
Many of the methods we have seen so far have linear or piecewise-linear decision surfaces in some space: LDA, perceptron, Gaussian Bayes, Naive Bayes, Noisy-OR, KNN, ...
But the criteria used to find this hyperplane are different:
KNN/perceptron optimize training-set classification error.
Gaussian/Naive Bayes are joint models; they optimize p(x, y) = p(y) p(x | y).
Logistic regression/NNs are conditional: they optimize p(y | x) directly.
Very important point: in general there is a large difference between the architecture used for classification and the objective function used to optimize the parameters of that architecture. See the reading...

Further Points...
Last class: non-parametric classifiers (e.g. K-nearest-neighbour). Those classifiers return a single guess for y without a distribution.
This class: probabilistic generative models p(x, y) (e.g. Gaussian class-conditional, Naive Bayes) and discriminative (conditional) models p(y | x) (e.g. logistic regression, ANNs, noisy-OR).
(Plus many more we didn't talk about, e.g. probit regression, complementary log-log, generalized linear models, ...)
Advanced topic: kernel machine classifiers (e.g. kernel voted perceptron, support vector machines, Gaussian processes).
Advanced topic: combining multiple weak classifiers into a single stronger one using boosting, bagging, stacking...
Readings: Hastie et al.; Duda & Hart.

Classification via Regression?
We could forget that y was a discrete (categorical) random variable and just attempt to model p(y | x) using regression.
Idea: do regression to an indicator matrix. (In the binary case p(y = 1 | x) is also the conditional expectation.)
For two classes, this is equivalent to LDA. For three or more, disaster...
Very bad idea! Noise models (e.g. Gaussian) for regression are totally inappropriate, and the fits are oversensitive to outliers. Furthermore, it gives unreasonable predictions y < 0 and y > 1.
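The "unreasonable predictions" complaint is easy to see concretely: least-squares regression to an indicator target happily extrapolates outside [0, 1]. A toy 1-D sketch (the data points are made up):

```python
import numpy as np

# Indicator regression: y is 0/1 class membership, fit by plain least squares.
x = np.array([0., 1., 2., 3., 4.])
y = np.array([0., 0., 0., 1., 1.])
A = np.vstack([x, np.ones_like(x)]).T      # design matrix [x, 1]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(w * 6 + b)     # prediction at x = 6: exceeds 1
print(w * -2 + b)    # prediction at x = -2: below 0
```

A squashed model like logistic regression keeps its outputs in (0, 1) by construction, which is one reason the lecture prefers it over regression to indicators.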