Lecture 12: Classification


Lecture 12: Classification
- Discriminant functions
- The optimal Bayes classifier
- Quadratic classifiers
- Euclidean and Mahalanobis metrics
- K Nearest Neighbor classifiers

Intelligent Sensor Systems, Ricardo Gutierrez-Osuna, Wright State University

Discriminant functions
- A convenient way to represent a pattern classifier is in terms of a family of discriminant functions g_i(x), with a simple MAX gate as the classification rule:

    Assign x to class ω_i if g_i(x) > g_j(x) for all j ≠ i

  [Figure: the features x_1, x_2, ..., x_d feed the discriminant functions g_1(x), g_2(x), ..., g_C(x); a "select max" gate produces the class assignment]
- How do we choose the discriminant functions g_i(x)?
  - That depends on the objective function to minimize:
    - Probability of error
    - Bayes risk
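As an illustration of the MAX-gate rule (this sketch is not part of the original slides), the rule can be written in a few lines of Python; the two linear discriminants g1 and g2 below are made-up examples.

import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant function is largest (the MAX gate)."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Two illustrative (hypothetical) linear discriminants.
g1 = lambda x: 2.0 * x[0] - x[1]
g2 = lambda x: -x[0] + 0.5 * x[1]
print(classify(np.array([1.0, 0.5]), [g1, g2]))  # prints 0, i.e. the first class wins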

Minimizing probability of error
- The probability of error P[error|x] is the probability of assigning x to the wrong class
  - For a two-class problem, P[error|x] is simply

    P(error|x) = P(\omega_1|x) if we decide \omega_2
    P(error|x) = P(\omega_2|x) if we decide \omega_1

- It makes sense to design the classification rule to minimize the average probability of error P[error] across all possible values of x:

    P(error) = \int_{-\infty}^{+\infty} P(error, x) \, dx = \int_{-\infty}^{+\infty} P(error|x) \, P(x) \, dx

- To ensure P(error) is minimum, we minimize P(error|x) by choosing the class with maximum posterior P(\omega_i|x) at each x
  - This is called the MAXIMUM A POSTERIORI (MAP) rule
  - The associated discriminant functions become

    g_i^{MAP}(x) = P(\omega_i|x)
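A minimal sketch of the MAP rule (not from the original slides), assuming two hypothetical univariate Gaussian likelihoods with equal priors:

import numpy as np
from scipy.stats import norm

def map_decide(x, likelihoods, priors):
    """MAP rule: choose the class with the largest posterior P(omega_i|x).
    Since P(x) is common to all classes, comparing P(x|omega_i) * P(omega_i) is enough."""
    scores = [lik.pdf(x) * prior for lik, prior in zip(likelihoods, priors)]
    return int(np.argmax(scores))

# Assumed class-conditional densities and priors, for illustration only.
likelihoods = [norm(loc=0.0, scale=1.0), norm(loc=2.0, scale=1.0)]
priors = [0.5, 0.5]
print(map_decide(0.5, likelihoods, priors))  # x = 0.5 lies closer to the first mean, so class 0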

Minimizing probability of error
- We prove the optimality of the MAP rule graphically
  - The right plot shows the posterior P(\omega_i|x) for each of the two classes
  - The bottom plots show P(error|x) for the MAP rule and for another rule
  - Which one has the lower P(error) (the color-filled area)?
  [Figure: posteriors P(\omega|x) versus x, with the decision regions "choose RED" / "choose BLUE" under the MAP rule and under the other rule]

Quadratic classifiers
- Let us assume that the likelihood densities are Gaussian:

    P(x|\omega_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right)

- Using Bayes rule, the MAP discriminant functions become

    g_i(x) = P(\omega_i|x) = \frac{P(x|\omega_i) P(\omega_i)}{P(x)} = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right) \frac{P(\omega_i)}{P(x)}

  - Eliminating terms that are constant across classes:

    g_i(x) = |\Sigma_i|^{-1/2} \exp\left( -\frac{1}{2} (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right) P(\omega_i)

  - Taking natural logs (the logarithm is monotonically increasing):

    g_i(x) = -\frac{1}{2} (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2} \log|\Sigma_i| + \log P(\omega_i)

- This is known as a Quadratic Discriminant Function
- The quadratic term is known as the Mahalanobis distance
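A minimal sketch of this quadratic discriminant in Python (not from the slides; the means, covariances, and priors below are made-up values):

import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    """g_i(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 log|Sigma_i| + log P(omega_i)."""
    diff = x - mean
    mahalanobis = diff @ np.linalg.solve(cov, diff)   # the quadratic (Mahalanobis) term
    return -0.5 * mahalanobis - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

# Two hypothetical Gaussian classes in 2-D.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]
x = np.array([2.5, 2.0])
scores = [quadratic_discriminant(x, m, S, p) for m, S, p in zip(means, covs, priors)]
print(int(np.argmax(scores)))  # index of the class with the largest discriminant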

Mahalanobis distance
- The Mahalanobis distance can be thought of as a vector distance that uses a \Sigma^{-1} norm:

    \|x - y\|^2_{\Sigma^{-1}} = (x - y)^T \Sigma^{-1} (x - y)

  [Figure: ellipses of constant Mahalanobis distance \|x - \mu\|^2_{\Sigma^{-1}} = K around the mean \mu]
  - \Sigma^{-1} can be thought of as a stretching factor on the space
  - Note that for an identity covariance matrix (\Sigma = I), the Mahalanobis distance becomes the familiar Euclidean distance
- In the following slides we look at special cases of the quadratic classifier
  - For convenience we will assume equiprobable priors, so we can drop the term log P(\omega_i)
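A quick illustration in Python (not from the slides); it also checks that an identity covariance recovers the Euclidean distance:

import numpy as np

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)^T Sigma^{-1} (x - y)."""
    diff = x - y
    return float(diff @ np.linalg.solve(cov, diff))

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(mahalanobis_sq(x, y, np.eye(2)))                           # 8.0, equal to ||x - y||^2
print(mahalanobis_sq(x, y, np.array([[4.0, 0.0], [0.0, 1.0]])))  # 5.0: the first axis is "stretched"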

Special case 1: Σ_i = σ²I
- In this case, the discriminant becomes

    g_i(x) = -(x - \mu_i)^T (x - \mu_i)

  - This is known as a MINIMUM DISTANCE CLASSIFIER
  - Notice the linear decision boundaries
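A tiny sketch of the minimum distance classifier (not from the slides; the class means are made-up):

import numpy as np

def min_distance_classify(x, means):
    """Assign x to the class with the nearest mean, i.e. maximize g_i(x) = -(x - mu_i)^T (x - mu_i)."""
    dists = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
print(min_distance_classify(np.array([1.0, 1.5]), means))  # 0: the point lies closer to the first mean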

Special case 2: Σ_i = Σ (Σ diagonal)
- In this case, the discriminant becomes

    g_i(x) = -(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)

  - This is known as a MAHALANOBIS DISTANCE CLASSIFIER
  - Still linear decision boundaries

Special case 3: Σ_i = Σ (Σ non-diagonal)
- In this case, the discriminant becomes

    g_i(x) = -(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)

  - This is also known as a MAHALANOBIS DISTANCE CLASSIFIER
  - Still linear decision boundaries (a short derivation of why is sketched below)
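Why equal covariance matrices yield linear boundaries (a short derivation, not in the original slides): expanding the quadratic form,

    g_i(x) = -(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)
           = -x^T \Sigma^{-1} x + 2 \mu_i^T \Sigma^{-1} x - \mu_i^T \Sigma^{-1} \mu_i

Since x^T \Sigma^{-1} x is the same for every class when \Sigma_i = \Sigma, it can be dropped, leaving a discriminant that is linear in x:

    g_i(x) = w_i^T x + w_{i0}, \quad w_i = 2 \Sigma^{-1} \mu_i, \quad w_{i0} = -\mu_i^T \Sigma^{-1} \mu_i

so the decision boundary g_i(x) = g_j(x) is a hyperplane.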

Case 4: Σ_i = σ_i²I, example
- In this case the quadratic expression cannot be simplified any further
- Notice that the decision boundaries are no longer linear but quadratic
  [Figure: example decision boundaries, shown at two zoom levels]

Case 5: Σ_i ≠ Σ_j, general case, example
- In this case there are no constraints, so the quadratic expression cannot be simplified any further
- Notice that the decision boundaries are also quadratic
  [Figure: example decision boundaries, shown at two zoom levels]

Limitations of quadratic classifiers
- The fundamental limitation is the unimodal Gaussian assumption
  - For non-Gaussian or multimodal Gaussian data, the results may be significantly sub-optimal
- A practical limitation is associated with the minimum required size of the dataset
  - If the number of examples per class is less than the number of dimensions, the covariance matrix becomes singular and, therefore, its inverse cannot be computed
- In this case it is common to assume the same covariance structure for all classes and compute the covariance matrix using all the examples, regardless of class (a sketch of this pooled estimate follows below)
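A minimal sketch of the pooled covariance estimate (not from the slides; the tiny dataset is made-up):

import numpy as np

def pooled_covariance(X, y):
    """Single covariance shared by all classes: center each example on its own class mean,
    then compute one scatter matrix from all the centered examples."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    centered = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    # Dividing by N - C gives an unbiased estimate (N examples, C classes); dividing by N also works.
    return centered.T @ centered / (len(X) - len(classes))

X = [[0, 0], [1, 1], [0, 1], [5, 5], [6, 6], [5, 6]]
y = [0, 0, 0, 1, 1, 1]
print(pooled_covariance(X, y))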

Conclusions
- We can extract the following conclusions:
  - The Bayes classifier for normally distributed classes is quadratic
  - The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier
  - The minimum Mahalanobis distance classifier is optimal for normally distributed classes with equal covariance matrices and equal priors
  - The minimum Euclidean distance classifier is optimal for normally distributed classes with equal covariance matrices proportional to the identity matrix and equal priors
  - Both the Euclidean and Mahalanobis distance classifiers are linear
- The goal of this discussion was to show that some of the most popular classifiers can be derived from decision-theoretic principles and some simplifying assumptions
  - It is important to realize that using a specific (Euclidean or Mahalanobis) minimum distance classifier implicitly corresponds to certain statistical assumptions
  - The question of whether these assumptions hold or not can rarely be answered in practice; in most cases we are limited to posing and answering the question "does this classifier solve our problem or not?"

K Nearest Neighbor classifier
- The kNN classifier is based on non-parametric density estimation techniques
  - Let us assume we seek to estimate the density function P(x) from a dataset of examples
  - P(x) can be approximated by the expression

    P(x) \approx \frac{k}{N V}

    where V is the volume surrounding x, N is the total number of examples, and k is the number of examples inside V
  - The volume V is determined by the distance R_k(x) between x and its k-th nearest neighbor in D dimensions:

    P(x) \approx \frac{k}{N V} = \frac{k}{N \, c_D \, R_k^D(x)}

    where c_D is the volume of the unit sphere in D dimensions
  [Figure: in two dimensions V = \pi R^2, so P(x) \approx \frac{k}{N \pi R^2}]
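A minimal sketch of this k-NN density estimate (not from the slides; the normal sample is an assumed setup for illustration):

import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """P(x) ~ k / (N * c_D * R_k^D), with R_k the distance to the k-th nearest neighbor
    and c_D the volume of the unit sphere in D dimensions."""
    data = np.asarray(data, dtype=float)
    N, D = data.shape
    r_k = np.sort(np.linalg.norm(data - x, axis=1))[k - 1]   # distance to the k-th nearest neighbor
    c_D = pi ** (D / 2) / gamma(D / 2 + 1)                   # unit-sphere volume in D dimensions
    return k / (N * c_D * r_k ** D)

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 1))                 # 500 points drawn from a standard normal
print(knn_density(np.array([0.0]), sample, k=10))  # roughly 0.4, the height of the N(0,1) peak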

K Nearest Neighbor classifier
- We use the previous result to estimate the posterior probability
  - Applying the same estimate within class ω_i (N_i examples, k_i of them among the k neighbors inside V) gives the likelihood

    P(x|\omega_i) \approx \frac{k_i}{N_i V}

  - The unconditional density is, again, estimated with

    P(x) \approx \frac{k}{N V}

  - And the priors can be estimated by

    P(\omega_i) \approx \frac{N_i}{N}

  - The posterior probability then becomes

    P(\omega_i|x) = \frac{P(x|\omega_i) P(\omega_i)}{P(x)} = \frac{\frac{k_i}{N_i V} \cdot \frac{N_i}{N}}{\frac{k}{N V}} = \frac{k_i}{k}

  - Yielding discriminant functions

    g_i(x) = \frac{k_i}{k}

- This is known as the k Nearest Neighbor classifier

K Nearest Neighbor classifier
- The kNN classifier is a very intuitive method
  - Examples are classified based on their similarity to the training data
- For a given unlabeled example x_u ∈ R^D, find the k closest labeled examples in the training data set and assign x_u to the class that appears most frequently within the k-subset
- The kNN classifier only requires (a sketch in code follows below):
  - An integer k
  - A set of labeled examples
  - A measure of "closeness"
  [Figure: scatter plot of labeled examples from several classes, with an unlabeled query point marked "?"]
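A minimal sketch of the kNN rule in Python (not from the slides; the training points are made-up and Euclidean distance is used as the measure of closeness):

import numpy as np
from collections import Counter

def knn_classify(x_u, X_train, y_train, k=5):
    """Assign x_u to the class that appears most frequently among its k nearest training examples."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - x_u, axis=1)    # Euclidean distance to every training example
    nearest = np.argsort(dists)[:k]                  # indices of the k closest examples
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_classify(np.array([0.5, 0.5]), X_train, y_train, k=3))  # the three nearest neighbors all vote for class 0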

kNN in action: example 1
- We generate data for a 2-dimensional, 3-class problem, where the class-conditional densities are multi-modal and non-linearly separable
- We used kNN with
  - k = 5
  - Metric = Euclidean distance

kNN in action: example 2
- We generate data for a 2-dimensional, 3-class problem, where the likelihoods are unimodal and are distributed in rings around a common mean
  - These classes are also non-linearly separable
- We used kNN with
  - k = 5
  - Metric = Euclidean distance

kNN versus 1-NN
[Figure: decision regions for 1-NN, 5-NN, and 10-NN on the same data]

Characteristics of the kNN classifier
- Advantages
  - Analytically tractable, simple implementation
  - Nearly optimal in the large-sample limit (N → ∞):

    P[error]_{Bayes} \le P[error]_{1-NNR} \le 2 \, P[error]_{Bayes}

  - Uses local information, which can yield highly adaptive behavior
  - Lends itself very easily to parallel implementations
- Disadvantages
  - Large storage requirements
  - Computationally intensive recall
  - Highly susceptible to the curse of dimensionality
- 1-NN versus kNN
  - The use of large values of k has two main advantages
    - Yields smoother decision regions
    - Provides probabilistic information: the ratio of examples for each class gives information about the ambiguity of the decision
  - However, too large a value of k is detrimental
    - It destroys the locality of the estimation
    - In addition, it increases the computational burden