
Machine Learning 10-701/15-781, Fall 2011
Nonparametric methods
Eric Xing
Lecture 2, September 14, 2011

Classification
Representing data: Hypothesis (classifier)

Clustering

Supervised vs. Unsupervised Learning

Outline
Univariate prediction without using a model: good or bad?
Nonparametric classifiers (instance-based learning): nonparametric density estimation, K-nearest-neighbor classifier, optimality of kNN.
Spectral clustering: graph partition and normalized cut, the spectral clustering algorithm.
Very little learning is involved in these methods, but they are indeed among the most popular and powerful machine learning methods.

Decision-making as dividing a high-dimensional space
Class-specific distributions P(X|Y):
$p(X \mid Y=1) = p(X;\ \vec\mu_1, \Sigma_1)$
$p(X \mid Y=2) = p(X;\ \vec\mu_2, \Sigma_2)$
Class prior (i.e., "weight"): P(Y)

The Bayes Decision Rule for Minimum Error
The a posteriori probability of a sample:
$q_i(X) \equiv P(Y=i \mid X) = \dfrac{p(X \mid Y=i)\,\pi_i}{p(X)}$
Bayes test: decide $Y=1$ if $q_1(X) > q_2(X)$, otherwise decide $Y=2$.
Likelihood ratio: $\ell(X) = \dfrac{p(X \mid Y=1)}{p(X \mid Y=2)}$, compared against the threshold $\pi_2/\pi_1$.
Discriminant function: $h(X) = -\log \ell(X)$; decide $Y=1$ if $h(X) < \log(\pi_1/\pi_2)$.

Example of Decision Rules
When each class is a normal distribution, we can write the decision boundary analytically in some cases (homework!).
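To make the rule concrete, here is a minimal sketch (not from the slides) of the Bayes decision rule for two Gaussian classes; the means, covariances, and priors are made-up illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians p(X|Y=i) and priors pi_i (illustrative values only).
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = [np.eye(2), np.array([[1.5, 0.3], [0.3, 1.5]])]
prior = [0.6, 0.4]

def bayes_decision(x):
    """Return the class (1 or 2) with the largest posterior q_i(x), proportional to p(x|Y=i) * pi_i."""
    likelihoods = [multivariate_normal.pdf(x, mean=mu[i], cov=Sigma[i]) for i in range(2)]
    posteriors_unnorm = [likelihoods[i] * prior[i] for i in range(2)]
    # Equivalent likelihood-ratio test: decide class 1 iff l(x) = p(x|1)/p(x|2) > pi_2/pi_1.
    return 1 if posteriors_unnorm[0] > posteriors_unnorm[1] else 2

print(bayes_decision(np.array([0.5, 0.2])))  # -> 1 (closer to the class-1 mean)
print(bayes_decision(np.array([2.5, 1.8])))  # -> 2 (closer to the class-2 mean)
```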

Bayes Error
We must calculate the probability of error: the probability that a sample is assigned to the wrong class.
Given a datum X, what is the risk? The conditional risk is $r^*(X) = \min[q_1(X), q_2(X)]$.
The Bayes error (the expected risk): $\varepsilon^* = E[r^*(X)] = \int \min[\pi_1\, p(X \mid Y=1),\ \pi_2\, p(X \mid Y=2)]\, dX$.

More on Bayes Error
The Bayes error is the lower bound on the probability of classification error.
The Bayes classifier is the theoretically best classifier that minimizes the probability of classification error.
Computing the Bayes error is in general a very complex problem. Why? It requires density estimation and integrating the density function over the decision regions.

Learning a Classifier
The decision rule: $y = h(x)$.
Learning strategies: generative learning; discriminative learning; instance-based learning (store all past experience in memory), a special case of nonparametric classifiers.
K-Nearest-Neighbor classifier: h(x) is represented by ALL the data, together with an algorithm.

Recall: Vector Space Representation
Each document is a vector, one component for each term (= word).

            Doc 1   Doc 2   Doc 3  ...
  Word 1      3       0       0    ...
  Word 2      0       8       1    ...
  Word 3     12       1      10    ...
  ...         0       1       3    ...
  ...         0       0       0    ...

Normalize each vector to unit length.
High-dimensional vector space: terms are axes (10,000+ dimensions, or even 100,000+); documents are vectors in this space.
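As a small illustration of this representation (not part of the original slides), the sketch below builds term-count vectors like the table above and normalizes each document to unit length; the counts are made up.

```python
import numpy as np

# Hypothetical term-document count matrix: rows = words, columns = documents (as in the table above).
counts = np.array([[3, 0, 0],
                   [0, 8, 1],
                   [12, 1, 10]], dtype=float)

# Normalize each document (column) to unit Euclidean length.
doc_vectors = counts / np.linalg.norm(counts, axis=0, keepdims=True)
print(doc_vectors)
```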

Test Document = ?  [figure: a test document plotted among Sports, Science, and Arts documents]

1-Nearest Neighbor (kNN) classifier  [figure: the test document labeled by its single nearest neighbor]

2-Nearest Neighbor (kNN) classifier  [figure]

3-Nearest Neighbor (kNN) classifier  [figure]

K-Nearest Neighbor (kNN) classifier: voting kNN  [figure]

Classes in a Vector Space  [figure: Sports, Science, and Arts regions in the document vector space]
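A minimal voting-kNN classifier in the spirit of the preceding slides (this code is not from the lecture; cosine similarity on unit-length document vectors is one reasonable choice of metric).

```python
import numpy as np
from collections import Counter

def knn_predict(query, X, y, k=3):
    """Voting kNN: label the query with the majority class among its k nearest training points.

    X: (n, d) array of unit-length training vectors, y: length-n labels, query: (d,) vector.
    With unit-length vectors, the largest dot products correspond to the nearest neighbors.
    """
    sims = X @ query                      # cosine similarity to every training example
    nn_idx = np.argsort(-sims)[:k]        # indices of the k most similar examples
    votes = Counter(y[i] for i in nn_idx)
    return votes.most_common(1)[0][0]

# Tiny made-up example with three classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.array(["Sports", "Sports", "Science", "Science", "Arts"])
print(knn_predict(X[0], X, y, k=3))       # -> "Sports"
```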

kNN Is Close to Optimal (Cover and Hart, 1967)
Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier knowing the model that generated the data]. In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
Decision boundary: [figure of the piecewise-linear boundary induced by the training points]

Where does kNN come from?
How to estimate p(X)? Nonparametric density estimation.
Parzen density estimate, e.g. (kernel density estimate): $\hat p(X) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h} K\!\left(\frac{X - X_i}{h}\right)$.
More generally: $\hat p(X) \approx \frac{k}{N V}$, where $k$ is the number of samples falling in a region of volume $V$ around $X$.
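A minimal one-dimensional Gaussian Parzen-window estimate, as a sketch of the kernel density idea (the kernel and bandwidth below are assumptions, not taken from the slides).

```python
import numpy as np

def parzen_density(x, samples, h=0.5):
    """Parzen estimate p_hat(x) = (1/N) * sum_i (1/h) * K((x - x_i)/h) with a Gaussian kernel K."""
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.sum() / (len(samples) * h)

samples = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=200)
print(parzen_density(0.0, samples))   # should be close to the true N(0,1) density at 0 (about 0.40)
```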

Where does kNN come from? (cont.)
Nonparametric density estimation: Parzen density estimate vs. kNN density estimate $\hat p(X) = \frac{K}{N\,V(X)}$, where $V(X)$ is the volume of the smallest region around $X$ containing $K$ samples.
Bayes classifier based on the kNN density estimator: the voting kNN classifier. Pick $K_1$ and $K_2$ implicitly by picking $K_1 + K_2 = K$, $V_1 = V_2$, $N_1 = N_2$.

Asymptotic Analysis
Conditional risk: $r_k(X, X_{NN})$, where $X$ is the test sample and $X_{NN}$ its nearest training sample.
Denote the event "X is class i" as $X^i$. Assuming $k = 1$:
When an infinite number of samples is available, $X_{NN}$ will be so close to $X$ that the class posteriors at $X_{NN}$ approach those at $X$.

Asymptotic Analysis, cont.
Recall the conditional Bayes risk: $r^*(X) = \min[q_1(X), q_2(X)]$.
Thus the asymptotic conditional risk of 1-NN can be expanded (a Maclaurin series expansion), and it can be shown that the asymptotic 1-NN error is at most twice the Bayes error.
This is remarkable, considering that the procedure does not use any information about the underlying distributions, and only the class of the single nearest neighbor determines the outcome of the decision.

In fact, a tighter bound holds; a hedged statement of the classical result is given below. Example: [worked example from the slides not preserved in the transcription]
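The equations on these slides did not survive the transcription; as an assumption about what was shown, the following LaTeX restates the classical two-class Cover–Hart result that the surrounding text appears to refer to.

```latex
% Asymptotic conditional 1-NN risk (two classes), with r^*(X) = \min[q_1(X), q_2(X)]:
r_{1\mathrm{NN}}(X) \;=\; 2\, q_1(X)\, q_2(X) \;=\; 2\, r^*(X)\bigl(1 - r^*(X)\bigr)
% Taking expectations gives the Cover--Hart bound on the asymptotic 1-NN error:
\varepsilon^* \;\le\; \varepsilon_{1\mathrm{NN}} \;\le\; 2\,\varepsilon^*\bigl(1 - \varepsilon^*\bigr) \;\le\; 2\,\varepsilon^*
```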

kNN is an instance of Instance-Based Learning
What makes an Instance-Based Learner? A distance metric; how many nearby neighbors to look at; a weighting function (optional); how to relate to the local points.

Distance Metric
Euclidean distance: $D^2(x, x') = \sum_i \sigma_i^2 (x_i - x_i')^2$, or equivalently $D^2(x, x') = (x - x')^T \Sigma (x - x')$ with diagonal $\Sigma$.
Other metrics: $L_1$ norm $\sum_i |x_i - x_i'|$; $L_\infty$ norm $\max_i |x_i - x_i'|$ (elementwise); Mahalanobis distance, where $\Sigma$ is full and symmetric; correlation; angle; Hamming distance; Manhattan distance.
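A brief sketch (not from the slides) of a few of these metrics in numpy; the Mahalanobis computation below uses the standard inverse-covariance form, and the covariance matrix is a made-up example.

```python
import numpy as np

x, xp = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])

euclidean = np.linalg.norm(x - xp)                   # L2 (Euclidean) distance
l1 = np.abs(x - xp).sum()                            # L1 (Manhattan) distance
linf = np.abs(x - xp).max()                          # L-infinity norm (elementwise max)
angle = np.arccos(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)))  # angle between the vectors

Sigma = np.array([[2.0, 0.3, 0.0],                   # hypothetical full, symmetric covariance
                  [0.3, 1.0, 0.0],
                  [0.0, 0.0, 0.5]])
diff = x - xp
mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

print(euclidean, l1, linf, angle, mahalanobis)
```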

Case Study: kNN for Web Classification
Dataset: 20 Newsgroups (20 classes). Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
61,118 words, 18,774 documents. Class labels and descriptions.

Experimental Setup
Training/Test sets: 50%-50% random split; 10 runs, report average results.
Evaluation criterion: classification accuracy.

Results: Binary Classes
[plot: accuracy vs. k for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles]

Results: Multiple Classes
[plot: accuracy vs. k for a random 5-out-of-20 subset of classes (10 runs, averaged) and for all 20 classes]

Is kNN ideal? (more later)

Effect of Parameters
Sample size: the more the better; need an efficient search algorithm for NN.
Dimensionality: curse of dimensionality.
Density: how smooth?
Metric: the relative scalings in the distance metric affect region shapes.
Weight: spurious or less relevant points need to be downweighted.
K: how many neighbors to use.

Sample size and dimensionality  [figure from page 316, Fukunaga]

Neighborhood size  [figure from page 350, Fukunaga]

kNN for image classification: basic set-up
[figure: a query image ("Antelope?") and training images from the classes Trombone, Jellyfish, Kangaroo, German Shepherd]

Voting: Kangaroo
[figure: counts of the 5 nearest neighbors per class (Antelope, Jellyfish, German Shepherd, Kangaroo, Trombone)]

10K classes, 4.5M queries, 4.5M training images (background image courtesy: Antonio Torralba)

kNN on 10K classes
10K classes, 4.5M queries, 4.5M training images. Features: BOW, GIST. (Deng, Berg, Li & Fei-Fei, ECCV 2010)

Nearest Neighbor Search in High-Dimensional Metric Space
Linear search: e.g., scanning 4.5M images!
k-d trees: axis-parallel partitions of the data; only effective on low-dimensional data.
Large-scale approximate indexing: Locality Sensitive Hashing (LSH), Spill-Tree, NV-Tree. All of the above run on a single machine with all data in memory, and scale to millions of images.
Web-scale approximate indexing: parallel variants of Spill-tree and NV-tree on distributed systems; scale to billions of images on disks across multiple machines.

Locality sensitive hashing
Approximate kNN: good enough in practice; can get around the curse of dimensionality.
Locality sensitive hashing: nearby feature points (likely) get the same hash values; store points in a hash table.

Example: Random projection
$h(x) = \mathrm{sgn}(x \cdot r)$, where $r$ is a random unit vector; $h(x)$ gives 1 bit. Repeat and concatenate.
$\Pr[h(x) = h(y)] = 1 - \theta(x, y)/\pi$, where $\theta(x, y)$ is the angle between $x$ and $y$.
[figures: example pairs x, y on the same or opposite sides of a random hyperplane with normal r (giving h(x) = h(y) or h(x) ≠ h(y)), and the concatenated bit strings (e.g., 000, 101) used as keys into a hash table]
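A minimal random-projection hashing sketch along the lines described above (the bit width, dimensionality, and data are arbitrary choices for illustration).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_bits = 64, 16
R = rng.normal(size=(n_bits, d))
R /= np.linalg.norm(R, axis=1, keepdims=True)    # n_bits random unit vectors r

def lsh_key(x):
    """Concatenate n_bits sign bits h(x) = sgn(x . r) into a hash-table key."""
    bits = (R @ x) >= 0
    return bits.tobytes()

# Index a made-up dataset, then look up candidates that share the query's bucket.
data = rng.normal(size=(1000, d))
table = defaultdict(list)
for i, x in enumerate(data):
    table[lsh_key(x)].append(i)

query = data[0] + 0.01 * rng.normal(size=d)      # a slightly perturbed copy of point 0
candidates = table[lsh_key(query)]
print(candidates)                                 # very likely contains index 0
```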

Locality sensitive hashing
[figure: a query hashed into the hash table; the contents of its bucket are the retrieved NNs]

Locality sensitive hashing
Roughly 1000x speed-up with 50% recall of the top 10-NN (1.2M images, 1000 dimensions).
[plot: recall of L1Prod at top 10 vs. scan cost (percentage of points scanned), comparing L1Prod LSH + L1Prod ranking with RandHP LSH + L1Prod ranking]

Summary: Nearest-Neighbor Learning Algorithm
Learning is just storing the representations of the training examples in D.
Testing instance x: compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
Does not explicitly compute a generalization or category prototype.
Efficient indexing is needed in high-dimensional, large-scale problems.
Also called: case-based learning, memory-based learning, lazy learning.

Summary (continued)
The Bayes classifier is the best classifier: it minimizes the probability of classification error.
Nonparametric vs. parametric classifiers: a nonparametric classifier does not rely on any assumption concerning the structure of the underlying density function.
A classifier becomes the Bayes classifier if the density estimates converge to the true densities when an infinite number of samples is used. The resulting error is the Bayes error, the smallest achievable error given the underlying distributions.

Clustering

Data Clustering
Two different criteria: compactness (e.g., k-means, mixture models) and connectivity (e.g., spectral clustering).
[figures: an example of compact clusters and an example of connectivity-based clusters]

Graph-based Clustering
Data grouping as a graph $G = \{V, E\}$ with edge weights $W_{ij} = f(d(x_i, x_j))$. Application: image segmentation.
Affinity matrix: $W = [w_{i,j}]$. Degree matrix: $D = \mathrm{diag}(d_i)$, with $d_i = \sum_j w_{i,j}$.

Affinity Function
$W_{i,j} = e^{-\|X_i - X_j\|^2 / 2\sigma^2}$
Affinities grow as $\sigma$ grows. How does the choice of the $\sigma$ value affect the results? What would be the optimal choice for $\sigma$?

A Spectral Clustering Algorithm (Ng, Jordan, and Weiss, 2003)
Given a set of points $S = \{s_1, \ldots, s_n\}$:
1. Form the affinity matrix $W$, with $w_{i,j} = e^{-\|s_i - s_j\|^2 / 2\sigma^2}$ for $i \ne j$ and $w_{i,i} = 0$.
2. Define the diagonal degree matrix $D$ with $D_{ii} = \sum_k w_{i,k}$.
3. Form the matrix $L = D^{-1/2} W D^{-1/2}$.
4. Stack the $k$ largest eigenvectors of $L$ to form the columns of a new matrix $X = [x_1\ x_2\ \cdots\ x_k]$.
5. Renormalize each of X's rows to have unit length, giving a new matrix Y.
6. Cluster the rows of Y as points in $\mathbb{R}^k$.

Why it works: k-means in the spectrum space!
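A compact sketch of these steps (not the authors' reference code; the choices of σ, k, the toy data, and the use of scipy/scikit-learn are assumptions made for illustration).

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def ngjw_spectral_clustering(S, k, sigma=1.0):
    """Ng-Jordan-Weiss style spectral clustering of the points S (n x d) into k clusters."""
    W = np.exp(-cdist(S, S, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)                          # w_ii = 0
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W @ D_inv_sqrt                   # L = D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)
    X = eigvecs[:, -k:]                               # k largest eigenvectors as columns
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # renormalize rows to unit length
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)  # cluster rows of Y in R^k

# Two concentric rings: connectivity-based structure that plain k-means would miss.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.r_[np.ones(100), 3 * np.ones(100)]
S = np.c_[radius * np.cos(theta), radius * np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
print(ngjw_spectral_clustering(S, k=2, sigma=0.5))
```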

More formally
Spectral clustering is equivalent to minimizing a generalized normalized cut:
$\min\ \mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \dfrac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$
which, after relaxation, is equivalent to the trace problem
$\max_Y\ \operatorname{tr}\!\big(Y^{T} D^{-1/2} W D^{-1/2} Y\big) \quad \text{s.t. } Y^{T} Y = I$
where the rows of $Y$ correspond to pixels and its columns to segments.

Toy examples
[figures: images from Matthew Brand (TR-2002-42)]

Spectral Clustering
Algorithms that cluster points using eigenvectors of matrices derived from the data.
They obtain a data representation in a low-dimensional space that can be easily clustered.
A variety of methods use the eigenvectors differently (we have seen one example). Empirically very successful.
Authors disagree on which eigenvectors to use and on how to derive clusters from these eigenvectors.

Summary
Two nonparametric methods: the kNN classifier and spectral clustering.
A nonparametric method does not rely on any assumption concerning the structure of the underlying density function.
Good news: simple and powerful methods; flexible and easy to apply to many problems. The kNN classifier asymptotically approaches the Bayes classifier, which is theoretically the best classifier that minimizes the probability of classification error. Spectral clustering optimizes the normalized cut.
Bad news: high memory requirements; very dependent on the scale factor σ for a specific problem.