Machine Learning 10-701/15-781, Fall 2011
Nonparametric Methods
Eric Xing
Lecture 2, September 14, 2011
Reading:

Classification
Representing data:
Hypothesis (classifier)
Clustering

Supervised vs. Unsupervised Learning
Univariate prediction without using a model: good or bad?

Nonparametric Classifier (Instance-based learning)
  Nonparametric density estimation
  K-nearest-neighbor classifier
  Optimality of kNN
Spectral clustering
  Clustering
  Graph partition and normalized cut
  The spectral clustering algorithm

Very little learning is involved in these methods, but they are indeed among the most popular and powerful machine learning methods.

Decision-making as dividing a high-dimensional space

Class-specific distributions: $P(X \mid Y)$
$$p(X \mid Y = 1) = p_1(X;\, \mu_1, \Sigma_1)$$
$$p(X \mid Y = 2) = p_2(X;\, \mu_2, \Sigma_2)$$
Class prior (i.e., "weight"): $P(Y)$
The Bayes Decision Rule for Minimum Error

The a posteriori probability of a sample:
$$q_i(X) = P(Y = i \mid X) = \frac{p(X \mid Y = i)\,\pi_i}{p(X)}$$

Bayes test: decide class 1 if $q_1(X) > q_2(X)$, and class 2 otherwise. Equivalently, the likelihood ratio test:
$$\ell(X) = \frac{p(X \mid Y = 1)}{p(X \mid Y = 2)} \;\gtrless\; \frac{\pi_2}{\pi_1}$$

Discriminant function: $h(X) = \ell(X)$

Example of Decision Rules

When each class is a normal distribution, we can write the decision boundary analytically in some cases (homework!).
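As a concrete illustration of the rule above, here is a minimal sketch (not from the lecture) of the likelihood-ratio test for two 1-D Gaussian classes; the parameter values and the names `gaussian_pdf` and `bayes_decide` are my own.

```python
# Hedged sketch: Bayes decision via the likelihood-ratio test
# l(x) = p(x|Y=1)/p(x|Y=2) compared against pi2/pi1.
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_decide(x, mu1, sigma1, pi1, mu2, sigma2, pi2):
    """Return class 1 or 2 by comparing l(x) to pi2/pi1."""
    l = gaussian_pdf(x, mu1, sigma1) / gaussian_pdf(x, mu2, sigma2)
    return 1 if l > pi2 / pi1 else 2

# With equal priors and symmetric Gaussians the boundary is the midpoint 0.
print(bayes_decide(-1.0, -2, 1, 0.5, 2, 1, 0.5))  # class 1
print(bayes_decide(1.0, -2, 1, 0.5, 2, 1, 0.5))   # class 2
```

Raising $\pi_2$ relative to $\pi_1$ shifts the boundary toward class 1's mean, exactly as the threshold $\pi_2/\pi_1$ suggests.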
Bayes Error

We must calculate the probability of error: the probability that a sample is assigned to the wrong class.
Given a datum X, what is the risk?
The Bayes error (the expected risk):

More on Bayes Error

The Bayes error is the lower bound on the probability of classification error.
The Bayes classifier is the theoretically best classifier: it minimizes the probability of classification error.
Computing the Bayes error is in general a very complex problem. Why?
  Density estimation
  Integrating the density function
Learning Classifiers

The decision rule:
Learning strategies:
  Generative learning
  Discriminative learning
  Instance-based learning (store all past experience in memory)
    A special case of nonparametric classifier

K-Nearest-Neighbor Classifier: h(x) is represented by ALL the data, together with an algorithm.

Recall: Vector Space Representation

Each document is a vector, one component for each term (= word).

          Doc 1   Doc 2   Doc 3  ...
  Word 1    3       0       0    ...
  Word 2    0       8       1    ...
  Word 3   12       1      10    ...
  ...       0       1       3    ...
  ...       0       0       0    ...

Normalize each vector to unit length.
High-dimensional vector space:
  Terms are axes: 10,000+ dimensions, or even 100,000+
  Docs are vectors in this space
Test Document = ?
[Figure: an unlabeled test document among Sports / Science / Arts documents]

1-Nearest Neighbor (kNN) classifier
[Figure: the test document takes the label of its single nearest neighbor]
2-Nearest Neighbor (kNN) classifier
[Figure: the two nearest neighbors of the test document]

3-Nearest Neighbor (kNN) classifier
[Figure: the three nearest neighbors of the test document]
K-Nearest Neighbor (kNN) classifier: voting kNN
[Figure: the test document is classified by a vote among its k nearest Sports / Science / Arts neighbors]

Classes in a Vector Space
[Figure: Sports, Science, and Arts documents occupy different regions of the vector space]
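The voting-kNN rule above can be sketched in a few lines (a minimal illustration; the toy document vectors and labels are invented to echo the Sports/Science/Arts example):

```python
# Hedged sketch of a voting kNN classifier: Euclidean distance,
# majority vote among the k nearest training points.
from collections import Counter
import math

def knn_predict(query, data, labels, k):
    """Majority vote among the k training points nearest to `query`."""
    dists = sorted((math.dist(query, x), y) for x, y in zip(data, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "documents" (invented for illustration).
data = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["Arts", "Arts", "Sports", "Sports", "Sports"]
print(knn_predict((0.95, 1.0), data, labels, 3))  # "Sports"
```

With k = 1 the same query near the Arts cluster would flip label, which is the sensitivity to k that the later slides quantify.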
kNN Is Close to Optimal (Cover and Hart, 1967)

Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier that knows the model that generated the data].
In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
Decision boundary:

Where does kNN come from?

How to estimate p(x)? Nonparametric density estimation.
Parzen density estimate, e.g. (kernel density estimate, 1-D):
$$\hat p(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)$$
More generally, if $k$ of the $N$ samples fall in a small volume $V$ around $x$:
$$\hat p(x) \approx \frac{k}{NV}$$
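The Parzen estimate can be sketched as follows (a minimal 1-D illustration with a Gaussian kernel; the sample values and bandwidth are my own choices):

```python
# Hedged sketch of a Parzen (kernel) density estimate:
# p_hat(x) = (1/(N*h)) * sum_i K((x - x_i)/h), Gaussian K.
import math

def parzen_density(x, samples, h):
    """Gaussian-kernel density estimate with bandwidth h."""
    n = len(samples)
    k = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in samples) / (n * h)

samples = [-0.2, 0.0, 0.1, 0.3]
# The estimate is high near the bulk of the samples, low far away.
print(parzen_density(0.1, samples, 0.5))
print(parzen_density(3.0, samples, 0.5))
```

Shrinking h sharpens the estimate around each sample; growing h smooths it, mirroring the bias/variance trade-off the slides raise under "How smooth?".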
Where does kNN come from? (cont.)

Nonparametric density estimation:
  Parzen density estimate: fix the volume $V$ and count the samples $k$ that fall in it.
  kNN density estimate: fix $k$ and grow $V$ until it contains the $k$ nearest samples, giving $\hat p(x) = k/(NV)$.

Bayes classifier based on the kNN density estimator: the voting kNN classifier.
Pick $k_1$ and $k_2$ implicitly by picking $k_1 + k_2 = k$, $V_1 = V_2$, $N_1 = N_2$.

Asymptotic Analysis

Conditional risk: $r_k(X, X_{NN})$
Test sample $X$; nearest-neighbor sample $X_{NN}$
Denote the event that $X$ is class $i$ as $X_i$
Assuming $k = 1$:
When an infinite number of samples is available, $X_{NN}$ will be so close to $X$ that the two share (essentially) the same posterior.
Asymptotic Analysis, cont.

Recall the conditional Bayes risk.
Thus the asymptotic conditional risk (via a Maclaurin series expansion).
It can be shown that
$$R^* \;\le\; R_{1NN} \;\le\; 2R^*(1 - R^*) \;\le\; 2R^*.$$
This is remarkable, considering that the procedure does not use any information about the underlying distributions: only the class of the single nearest neighbor determines the outcome of the decision.

In fact ...

Example:
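The bound can be reconstructed as follows (a standard sketch of the Cover-Hart argument for two classes; the notation $\eta$, $r^*$ is mine, not the slides'):

```latex
% Let \eta(X) = P(Y = 1 \mid X), so the conditional Bayes risk is
% r^*(X) = \min\{\eta(X),\, 1 - \eta(X)\}.
% As N \to \infty, X_{NN} \to X, so the label of X_{NN} is drawn from
% the same posterior as the label of X. The 1-NN rule errs exactly
% when the two labels disagree:
r_1(X) = \eta(X)\bigl(1 - \eta(X)\bigr) + \bigl(1 - \eta(X)\bigr)\eta(X)
       = 2\, r^*(X)\bigl(1 - r^*(X)\bigr).
% Taking expectations, and using \mathbb{E}[(r^*)^2] \ge (\mathbb{E}[r^*])^2:
R^* \;\le\; R_{1NN} = \mathbb{E}\bigl[2\, r^*(1 - r^*)\bigr]
    \;\le\; 2R^*(1 - R^*) \;\le\; 2R^*.
```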
kNN is an instance of Instance-Based Learning

What makes an Instance-Based Learner?
  A distance metric
  How many nearby neighbors to look at?
  A weighting function (optional)
  How to relate to the local points?

Distance Metric

Euclidean distance (with per-dimension scales $\sigma_i$):
$$D(x, x') = \sqrt{\sum_i \sigma_i^2 (x_i - x_i')^2}$$
Or equivalently,
$$D(x, x') = \sqrt{(x - x')^T \Sigma (x - x')}$$
Other metrics:
  $L_1$ norm: $\sum_i |x_i - x_i'|$
  $L_\infty$ norm: $\max_i |x_i - x_i'|$ (elementwise)
  Mahalanobis: as above, where $\Sigma$ is full and symmetric
  Correlation
  Angle
  Hamming distance, Manhattan distance
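The metrics listed above can be sketched in a few lines (a minimal illustration; the 2-D example vectors and the use of an identity weight matrix for the Mahalanobis case are my own choices):

```python
# Hedged sketches of common kNN distance metrics.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def l1(x, y):
    return np.sum(np.abs(x - y))

def linf(x, y):
    return np.max(np.abs(x - y))

def mahalanobis(x, y, sigma):
    """Quadratic-form distance with a full symmetric weight matrix."""
    d = x - y
    return np.sqrt(d @ sigma @ d)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(euclidean(x, y))               # sqrt(2)
print(l1(x, y))                      # 2.0
print(linf(x, y))                    # 1.0
print(mahalanobis(x, y, np.eye(2)))  # equals Euclidean with identity weight
```

The choice of metric (and of the scales in $\Sigma$) reshapes the neighborhoods, which is exactly the "relative scalings affect region shapes" point made later under "Effect of Parameters".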
Case Study: kNN for Web Classification

Dataset:
  20 News Groups (20 classes)
  Download: http://people.csail.mit.edu/jrennie/20newsgroups/
  61,118 words, 18,774 documents
  Class labels descriptions

Experimental Setup

Training/Test Sets: 50%-50% randomly split; 10 runs, report average results.
Evaluation Criteria:
Results: Binary Classes
[Figure: accuracy vs. k for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles]

Results: Multiple Classes
[Figure: accuracy vs. k; randomly select 5 out of 20 classes (10 runs, averaged) and all 20 classes]
Is kNN ideal? ... more later

Effect of Parameters

Sample size
  The more the better
  Need an efficient search algorithm for NN
Dimensionality
  Curse of dimensionality
Density
  How smooth?
Metric
  The relative scalings in the distance metric affect region shapes
Weight
  Spurious or less relevant points need to be downweighted
K
Sample Size and Dimensionality
From page 316, Fukunaga

Neighborhood Size
From page 350, Fukunaga
kNN for image classification: basic set-up
[Figure: a query image ("Antelope"?) matched against training images of Trombone, Jellyfish, Kangaroo, German Shepherd]

Voting?
[Figure: 5-NN vote counts over Antelope, Jellyfish, German Shepherd, Kangaroo, Trombone; Kangaroo receives the most votes]
10K classes, 4.5M queries, 4.5M training images
Background image courtesy: Antonio Torralba

kNN on 10K classes
  10K classes
  4.5M queries
  4.5M training images
  Features: BOW, GIST
Deng, Berg, Li & Fei-Fei, ECCV 2010
Nearest Neighbor Search in High-Dimensional Metric Space

Linear search: e.g., scanning 4.5M images!
k-d trees: axis-parallel partitions of the data
  Only effective on low-dimensional data
Large-scale approximate indexing:
  Locality Sensitive Hashing (LSH)
  Spill-tree
  NV-tree
  All of the above run on a single machine with all data in memory, and scale to millions of images
Web-scale approximate indexing:
  Parallel variants of Spill-tree and NV-tree on distributed systems
  Scale to billions of images on disks across multiple machines

Locality Sensitive Hashing

Approximate kNN
  Good enough in practice
  Can get around the curse of dimensionality
Locality sensitive hashing
  Nearby feature points get (likely) the same hash values
  Hash table
Example: Random Projection

$h(x) = \mathrm{sign}(x \cdot r)$, where $r$ is a random unit vector.
$h(x)$ gives 1 bit. Repeat and concatenate.
$$\Pr[h(x) = h(y)] = 1 - \theta(x, y)/\pi$$
[Figures: random hyperplanes through the origin split examples x and y; the concatenated bits index into a hash table, e.g. buckets 000 and 101]
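A minimal sketch of the sign-random-projection hash described above (the 8-hyperplane setup and the name `lsh_signature` are illustrative assumptions, not from the slides):

```python
# Hedged sketch of sign-random-projection hashing: each random
# direction r contributes one bit sign(x . r); per bit, two vectors
# collide with probability 1 - theta(x, y)/pi.
import numpy as np

def lsh_signature(x, projections):
    """Concatenate one sign bit per random projection vector."""
    return tuple(int(x @ r > 0) for r in projections)

rng = np.random.default_rng(0)
# 8 random hyperplanes in 2-D; only the direction of each row matters,
# so unnormalized Gaussian vectors suffice in place of unit vectors.
projections = rng.standard_normal((8, 2))

x = np.array([1.0, 0.1])
print(lsh_signature(x, projections))
# Positive scaling never changes the bits: sign((c*x).r) = sign(x.r) for c > 0.
print(lsh_signature(x, projections) == lsh_signature(3 * x, projections))  # True
```

In practice the signature indexes a bucket of a hash table; only points sharing the bucket are scanned, which is the source of the speed-up quoted on the next slide.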
Locality sensitive hashing
[Figure: a query is hashed to a bucket; the bucket's contents are the retrieved NNs]

Locality sensitive hashing: roughly 1000x speed-up at 50% recall of the top 10-NN (1.2M images, 1000 dimensions)
[Figure: recall of L1Prod at top 10 vs. scan cost (percentage of points scanned), comparing L1Prod LSH + L1Prod ranking with RandHP LSH + L1Prod ranking]
Summary: Nearest-Neighbor Learning Algorithm

Learning is just storing the representations of the training examples in D.
Testing instance x:
  Compute the similarity between x and all examples in D
  Assign x the category of the most similar example in D
Does not explicitly compute a generalization or category prototype.
Efficient indexing is needed in high-dimensional, large-scale problems.
Also called:
  Case-based learning
  Memory-based learning
  Lazy learning

Summary (continued)

The Bayes classifier is the best classifier: it minimizes the probability of classification error.
Nonparametric vs. parametric classifiers:
  A nonparametric classifier does not rely on any assumption concerning the structure of the underlying density function.
  A classifier becomes the Bayes classifier if the density estimates converge to the true densities when an infinite number of samples are used.
  The resulting error is the Bayes error, the smallest achievable error given the underlying distributions.
Clustering

Data Clustering

Two different criteria:
  Compactness, e.g., k-means, mixture models
  Connectivity, e.g., spectral clustering
[Figure: compact blobs vs. connected rings]
Graph-based Clustering

Data grouping: build a graph $G = \{V, E\}$ with edge weights
$$W_{ij} = f(d(x_i, x_j))$$
Image segmentation
Affinity matrix: $W = [w_{ij}]$
Degree matrix: $D = \mathrm{diag}(d_i)$

Affinity Function
$$W_{ij} = e^{-\|X_i - X_j\|^2 / 2\sigma^2}$$
Affinities grow as $\sigma$ grows.
How does the choice of $\sigma$ affect the results? What would be the optimal choice for $\sigma$?
A Spectral Clustering Algorithm (Ng, Jordan, and Weiss, 2002)

Given a set of points $S = \{s_1, \ldots, s_n\}$:
  Form the affinity matrix $w_{ij} = e^{-\|s_i - s_j\|^2 / 2\sigma^2}$ for $i \ne j$, with $w_{ii} = 0$.
  Define the diagonal matrix $D_{ii} = \sum_k w_{ik}$.
  Form the matrix $L = D^{-1/2} W D^{-1/2}$.
  Stack the $k$ largest eigenvectors of $L$ as the columns of a new matrix $X = [x_1\, x_2 \cdots x_k]$.
  Renormalize each row of $X$ to unit length, giving the matrix $Y$.
  Cluster the rows of $Y$ as points in $\mathbb{R}^k$.

Why does it work? K-means in the spectral space!
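The steps above can be sketched in numpy (a hedged illustration, not the authors' reference code; `sigma`, the toy points, and the helper name `spectral_embed` are my own choices):

```python
# Hedged sketch of the Ng-Jordan-Weiss embedding: affinity -> normalized
# matrix L = D^{-1/2} W D^{-1/2} -> top-k eigenvectors -> row-normalize.
import numpy as np

def spectral_embed(points, sigma, k):
    """Return Y, whose rows are the unit-normalized spectral embedding."""
    # Gaussian affinity with zero diagonal.
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    w = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; take the k largest.
    _, vecs = np.linalg.eigh(lap)
    x = vecs[:, -k:]
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Two well-separated blobs map to two tight, near-orthogonal clusters
# of rows in R^2, which the final k-means step separates trivially.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = spectral_embed(pts, sigma=1.0, k=2)
print(np.round(y, 2))
```

Rows within one blob land (nearly) on the same unit vector, so clustering the rows of Y recovers the blobs; this is the "k-means in the spectral space" remark made concrete.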
More formally

Spectral clustering is equivalent to minimizing a generalized normalized cut over segments $A_1, \ldots, A_k$:
$$\min \; \mathrm{Ncut}(A_1, \ldots, A_k) = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar A_r)}{d_{A_r}}$$
which relaxes to the trace problem
$$\max_Y \; \mathrm{tr}\!\left(Y^T D^{-1/2} W D^{-1/2} Y\right) \quad \text{s.t. } Y^T Y = I$$
[Figure: pixels as graph nodes with affinities $W_{ij}$]

Toy examples
Images from Matthew Brand (TR-2002-42)
Spectral Clustering

Algorithms that cluster points using eigenvectors of matrices derived from the data.
Obtain a data representation in a low-dimensional space that can be easily clustered.
A variety of methods use the eigenvectors differently (we have seen an example).
Empirically very successful.
Authors disagree on:
  Which eigenvectors to use
  How to derive clusters from these eigenvectors

Summary

Two nonparametric methods:
  kNN classifier
  Spectral clustering
A nonparametric method does not rely on any assumption concerning the structure of the underlying density function.
Good news:
  Simple and powerful methods; flexible and easy to apply to many problems.
  The kNN classifier asymptotically approaches the Bayes classifier, which is theoretically the best classifier that minimizes the probability of classification error.
  Spectral clustering optimizes the normalized cut.
Bad news:
  High memory requirements.
  Very dependent on the scale factor for a specific problem.