Machine Learning. Classification. Theory of Classification and Nonparametric Classifier. Representing data: Hypothesis (classifier). Eric Xing

Machine Learning 10-701/15-781, Fall 2008. Theory of Classification and Nonparametric Classifier. Eric Xing. Lecture 2, September 10, 2008. Reading: Chap. 1, 5 of CB and handouts.

Classification. Representing data: N samples x_n, each a vector of M features; K classes. Hypothesis (classifier): h(x).

Outline

- What is theoretically the best classifier: the Bayesian decision rule for minimum error
- Nonparametric classifiers (instance-based learning): nonparametric density estimation; the K-nearest-neighbor classifier; optimality of kNN; problems of kNN

Decision-making as dividing a high-dimensional space. [figure: distributions of samples from a normal and an abnormal machine]

Continuous Distributions

Uniform probability density function:
$f(x) = \frac{1}{b-a}$ for $a \le x \le b$; $f(x) = 0$ elsewhere.

Normal (Gaussian) probability density function:
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
The distribution is symmetric, and is often illustrated as a bell-shaped curve. Two parameters, $\mu$ (mean) and $\sigma$ (standard deviation), determine the location and shape of the distribution. The highest point on the normal curve is at the mean, which is also the median and mode. The mean can be any numerical value: negative, zero, or positive.

Multivariate Gaussian:
$p(\vec{x};\, \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right)$

Class-Conditional Probability
Classification-specific dist.: $p(X \mid Y=1) = \mathcal{N}(X;\, \vec{\mu}_1, \Sigma_1)$, $p(X \mid Y=2) = \mathcal{N}(X;\, \vec{\mu}_2, \Sigma_2)$
Class prior (i.e., "weight"): $P(Y)$
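As a numerical companion to the multivariate density above, a minimal sketch (NumPy assumed; `gaussian_pdf` is my name, not the lecture's):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Evaluate the multivariate Gaussian density N(x; mu, Sigma)."""
    n = len(mu)
    diff = x - mu
    # 1 / sqrt((2*pi)^n * |Sigma|): the normalizing constant
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    # exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu)), computed via a linear solve
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

# Example: standard bivariate Gaussian evaluated at its mean
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592 = 1/(2*pi)
```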

The Bayes Rule

What we have just done leads to the following general expression:

$P(Y \mid X) = \frac{p(X \mid Y)\, P(Y)}{p(X)}$

This is Bayes Rule.

The Bayes Decision Rule for Minimum Error
The a posteriori probability of a sample:
$q_i(X) = P(Y=i \mid X) = \frac{p(X \mid Y=i)\, \pi_i}{p(X)}$
Bayes Test: decide $Y=1$ if $q_1(X) > q_2(X)$, and $Y=2$ otherwise.
Likelihood Ratio: $\ell(X) = \frac{p(X \mid Y=1)}{p(X \mid Y=2)} \gtrless \frac{\pi_2}{\pi_1}$
Discriminant function: $h(X) = -\log \ell(X) \lessgtr \log \frac{\pi_1}{\pi_2}$
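A minimal sketch of this likelihood-ratio test for two hypothetical 1-D Gaussian classes (the densities N(0,1) and N(2,1) and the priors 0.7/0.3 are made-up examples, not from the lecture; SciPy assumed):

```python
from scipy.stats import norm

def bayes_decide(x, pi1=0.7, pi2=0.3):
    """Decide the class by comparing the likelihood ratio l(x) to pi2/pi1."""
    l = norm.pdf(x, loc=0, scale=1) / norm.pdf(x, loc=2, scale=1)
    return 1 if l > pi2 / pi1 else 2

print(bayes_decide(0.3), bayes_decide(1.8))  # -> 1 2
```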

Example of Decision Rules
When each class is a normal, we can write the decision boundary analytically in some cases (homework!!).

Bayes Error
We must calculate the probability of error: the probability that a sample is assigned to the wrong class. Given a datum X, what is the risk? The Bayes error (the expected risk):
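In the notation of the posteriors $q_i(X)$ above, a sketch of the standard expressions:

```latex
% Risk of deciding at a given datum X: the smaller of the two posteriors
r(X) = \min\left[\, q_1(X),\ q_2(X) \,\right]
% Bayes error = the expected risk over the data distribution
E = \int r(x)\, p(x)\, dx
```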

More on Bayes Error
Bayes error is the lower bound of the probability of classification error. The Bayes classifier is the theoretically best classifier, the one that minimizes the probability of classification error. Computing the Bayes error is in general a very complex problem. Why? It requires density estimation, and integrating the density function.

Learning Classifier
The decision rule: assign each sample to the class with the largest posterior.
Learning strategies: Generative Learning (parametric, nonparametric); Discriminative Learning (parametric, nonparametric); Instance-based Learning (store all past experience in memory): a special case of nonparametric classifier.

Supervised Learning
K-Nearest-Neighbor Classifier: where the h(·) is represented by all the data, and by an algorithm.

Recall: Vector Space Representation
Each document is a vector, one component for each term (= word):

          Doc 1   Doc 2   Doc 3   ...
Word 1      3       0       0     ...
Word 2      0       8       1     ...
Word 3     12       1      10     ...
...         0       1       3     ...
...         0       0       0     ...

Normalize to unit length. High-dimensional vector space: terms are axes (10,000+ dimensions, or even 100,000+); docs are vectors in this space.
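A minimal sketch of the unit-length normalization on the toy term-document matrix above (NumPy assumed):

```python
import numpy as np

# Toy term-document matrix: rows = words, columns = documents.
counts = np.array([[ 3., 0.,  0.],
                   [ 0., 8.,  1.],
                   [12., 1., 10.]])

# Normalize each document (column) to unit length, so similarity between
# docs depends on direction in term space, not on document length.
docs = counts / np.linalg.norm(counts, axis=0, keepdims=True)
print(np.linalg.norm(docs, axis=0))  # [1. 1. 1.]
```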

Classes in a Vector Space
[figure: documents from the Sports, Science, and Arts classes as regions in the vector space]

Test Document = ?
[figure: an unlabeled test document among the Sports, Science, and Arts regions]

K-Nearest Neighbor (kNN) classifier
[figure: the test document and its nearest neighbors among the Sports, Science, and Arts regions]

kNN Is Close to Optimal
Cover and Hart (1967): asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier knowing the model that generated the data]. In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
Decision boundary: [figure]

Where does kNN come from?
How do we estimate $p(X)$? Nonparametric density estimation: the Parzen density estimate, e.g., kernel density estimation; more generally, $\hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{V} K(x, x_n)$.

Where does kNN come from? (cont.)
Nonparametric density estimation: in the Parzen (kernel) density estimate we fix the volume $V$ and count the samples falling inside it; in the kNN density estimate we fix the count $k$ and grow the volume $V(x)$ until it captures $k$ samples, giving $\hat{p}(x) = \frac{k}{N\, V(x)}$.
Bayes classifier based on the kNN density estimator: the voting kNN classifier. Pick $K_1$ and $K_2$ implicitly by picking $K_1 + K_2 = K$, $V_1 = V_2$, $N_1$, $N_2$; the estimated posterior of class $i$ is then proportional to $K_i / K$, so the test point goes to the class with the most votes among its $K$ nearest neighbors.
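A 1-D numerical sketch of the two estimators (the Gaussian kernel choice, the function names, and the interval-volume bookkeeping are my assumptions):

```python
import numpy as np

def parzen_density(x, data, h):
    """Parzen window estimate at x with a Gaussian kernel of bandwidth h."""
    kernels = np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()  # average of one kernel per sample

def knn_density(x, data, k):
    """kNN estimate: p(x) ~ k / (N * V(x)), V(x) reaching the k-th neighbor."""
    radius = np.sort(np.abs(data - x))[k - 1]  # distance to k-th nearest sample
    return k / (len(data) * 2 * radius)        # 1-D "volume" is 2 * radius
```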

Voting kNN
The procedure: compute the distance from the test point to every stored sample, take the k closest, and assign the majority class among them. [figure: Sports, Science, Arts]

Asymptotic Analysis
Conditional risk: $r_k(X, X_{NN})$, for a test sample $X$ and its nearest-neighbor sample $X_{NN}$. Denote the event "$X$'s class is $i$". Assuming $k = 1$: when an infinite number of samples is available, $X_{NN}$ will be so close to $X$ that the class posteriors at $X_{NN}$ approach those at $X$.
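A minimal sketch of the voting procedure (Euclidean distance and NumPy arrays assumed; `knn_classify` is my name):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Voting kNN: majority class among the k nearest training samples.
    X_train and y_train are NumPy arrays."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```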

Asymptotic Analysis, cont.
Recall the conditional Bayes risk: $r^*(X) = \min[q_1(X), q_2(X)]$. Thus the asymptotic conditional risk can be expanded; this is the Maclaurin series expansion. It can be shown that the asymptotic 1-NN risk is bounded by twice the Bayes risk (see the sketch below). This is remarkable, considering that the procedure does not use any information about the underlying distributions, and only the class of the single nearest neighbor determines the outcome of the decision.
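A sketch of the standard two-class argument (the Cover-Hart bound), in the notation above:

```latex
% Conditional 1-NN risk, given the test point X and its neighbor X_NN:
r_1(X, X_{NN}) = q_1(X)\, q_2(X_{NN}) + q_2(X)\, q_1(X_{NN})
% As N -> infinity, X_NN -> X, and with r^*(X) = \min[q_1(X), q_2(X)]:
r_1(X) = 2\, q_1(X)\, q_2(X) = 2\, r^*(X)\bigl(1 - r^*(X)\bigr)
% Hence the asymptotic 1-NN risk is at most twice the Bayes risk:
r^*(X) \;\le\; r_1(X) \;\le\; 2\, r^*(X)
```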

kNN is an instance of Instance-Based Learning
What makes an Instance-Based Learner? A distance metric; how many nearby neighbors to look at; a weighting function (optional); how to relate to the local points.

Euclidean Distance Metric:
$D(x, x') = \sqrt{\sum_i \sigma_i^2 (x_i - x'_i)^2}$, or equivalently $D(x, x') = \sqrt{(x - x')^T \Sigma (x - x')}$ with $\Sigma$ diagonal.

Other metrics: $L_1$ norm: $|x - x'|$; $L_\infty$ norm: $\max |x - x'|$ (element-wise); Mahalanobis: as above, but where $\Sigma$ is full and symmetric; correlation; angle; Hamming distance; Manhattan distance.
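A sketch of the metrics listed above (NumPy assumed; for the Mahalanobis case I follow the slide's form with a full, symmetric $\Sigma$ as the weight matrix):

```python
import numpy as np

def euclidean(x, y, scales=None):
    """Scaled Euclidean distance; `scales` holds the per-axis sigma_i^2."""
    scales = np.ones_like(x) if scales is None else scales
    return np.sqrt(np.sum(scales * (x - y) ** 2))

def mahalanobis(x, y, sigma):
    """sqrt((x-y)^T Sigma (x-y)) with Sigma full and symmetric."""
    d = x - y
    return np.sqrt(d @ sigma @ d)

def l1(x, y):    # Manhattan distance
    return np.sum(np.abs(x - y))

def linf(x, y):  # element-wise max norm
    return np.max(np.abs(x - y))
```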

1-Nearest Neighbor (kNN) classifier
[figure: Sports, Science, Arts]

2-Nearest Neighbor (kNN) classifier
[figure: Sports, Science, Arts]

3-Nearest Neighbor (kNN) classifier
[figure: Sports, Science, Arts]

5-Nearest Neighbor (kNN) classifier
[figure: Sports, Science, Arts]

Nearest-Neighbor Learning Algorithm
Learning is just storing the representations of the training examples in D.
Testing instance x: compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
Does not explicitly compute a generalization or category prototypes.
Also called: case-based learning, memory-based learning, lazy learning.

Case Study: kNN for Web Classification
Dataset: 20 Newsgroups (20 classes). Download: http://people.csail.mit.edu/jrennie/20Newsgroups/. 61,188 words, 18,774 documents. Class labels and descriptions.
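A minimal sketch of the testing step, using cosine similarity on unit-length document vectors (the names are mine):

```python
import numpy as np

def nn_classify(x, docs, labels):
    """1-NN: return the label of the most similar stored document.
    Rows of `docs` are unit-length vectors, so a dot product with the
    (unit-length) test vector x is the cosine similarity."""
    sims = docs @ x
    return labels[int(np.argmax(sims))]
```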

Experimental Setup
Training/Test Sets: 50%-50% randomly split; 10 runs, report average results.
Evaluation Criteria: accuracy.

Results: Binary Classes
[figure: accuracy vs. k for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles]
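A sketch of this protocol (10 random 50/50 splits, mean accuracy), reusing the `knn_classify` sketch from earlier; the helper name and the use of NumPy arrays are my assumptions:

```python
import numpy as np

def evaluate(X, y, k=5, runs=10, seed=0):
    """Average accuracy of voting kNN over `runs` random 50%-50% splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        idx = rng.permutation(len(y))
        half = len(y) // 2
        train, test = idx[:half], idx[half:]
        preds = np.array([knn_classify(x, X[train], y[train], k)
                          for x in X[test]])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))
```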

Results: Multiple Classes
[figure: accuracy vs. k, for 5 classes randomly selected out of 20 (repeated over 10 runs and averaged) and for all 20 classes]

Is kNN ideal? More later.

Effect of Parameters
Sample size: the more the better; needs an efficient search algorithm for the NN.
Dimensionality: curse of dimensionality.
Density: how smooth?
Metric: the relative scalings in the distance metric affect region shapes.
Weight: spurious or less relevant points need to be downweighted.
K.

Sample size and dimensionality: [figure, from page 316, Fukunaga]

Neighborhood size: [figure, from page 350, Fukunaga]

Summary
The Bayes classifier is the best classifier: it minimizes the probability of classification error.
Nonparametric vs. parametric classifiers: a nonparametric classifier does not rely on any assumption concerning the structure of the underlying density function.
A classifier becomes the Bayes classifier if the density estimates converge to the true densities when an infinite number of samples is used. The resulting error is the Bayes error, the smallest achievable error given the underlying distributions.