Lecture 4: Bayes Classification Rule. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering.


Lecture 4: Bayes Classification Rule
Dept. of Electrical and Computer Engineering, 0909.40.0 / 0909.504.04, Theory & Applications of Pattern Recognition

Topics: a priori / a posteriori probabilities, the loss function, the Bayes decision rule, the likelihood ratio test, the maximum a posteriori (MAP) criterion, minimum error rate classification, discriminant functions, error bounds and error probabilities.

Background: Pattern Classification, Duda, Hart and Stork, copyright John Wiley and Sons; PR logo copyright Robi Polikar.

Theory and Applications of Pattern Recognition © 2003, Robi Polikar, Rowan University, Glassboro, NJ

Today in PR
Review of Bayes theorem. Bayes decision theory: Bayes rule, loss function & expected loss, minimum error rate classification, classification using discriminant functions, error bounds & probabilities.

Bayes Rule
Suppose we know P(ω_1), P(ω_2), P(x | ω_1) and P(x | ω_2), and that we have observed the value of the feature (a random variable) x. How would you decide on the state of nature, i.e., the type of fish, based on this information? Bayes theorem allows us to compute the posterior probabilities from the prior and class-conditional probabilities.

Likelihood: the (class-conditional) probability of observing a feature value of x, given that the correct class is ω_j. All things being equal, the category with the higher class-conditional probability is more likely to be the correct class.

Posterior probability: the (conditional) probability of the correct class being ω_j, given that the feature value x has been observed:

P(ω_j | x) = P(x ∩ ω_j) / P(x) = P(x | ω_j) P(ω_j) / ∑_{k=1}^{C} P(x | ω_k) P(ω_k)

Prior probability: the total probability of the correct class being ω_j, determined from prior experience.

Evidence: P(x), the total probability of observing the feature value x.
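
As a quick numerical illustration of the rule above, here is a minimal Python sketch; the priors and likelihoods are made-up numbers, not values from the lecture.

```python
import numpy as np

# Minimal sketch of Bayes' rule for a single observed feature value x.
# The priors and class-conditional likelihoods are illustrative numbers only.
priors = np.array([0.6, 0.4])          # P(w1), P(w2)
likelihoods = np.array([0.3, 0.7])     # P(x | w1), P(x | w2) for the observed x

evidence = np.sum(likelihoods * priors)         # P(x), the normalizer
posteriors = likelihoods * priors / evidence    # P(w_j | x)

print(posteriors)            # ~[0.39, 0.61]
print(posteriors.argmax())   # index of the most probable class
```

The final argmax already anticipates the Bayes decision rule on the next slide: pick the class with the largest posterior.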

Bayes Decision Rule
Choose ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, i, j = 1, 2, …, c.
If there are multiple features, x = {x_1, x_2, …, x_d}:
Choose ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, i, j = 1, 2, …, c.

The Loss Function
A mathematical description of how costly each action (making a class decision) is. Are certain mistakes costlier than others?
{ω_1, ω_2, …, ω_c}: set of states of nature (classes).
{α_1, α_2, …, α_a}: set of possible actions. Note that a need not be the same as c, because we may have more (or fewer) actions than classes. For example, not making a decision (rejection) is also an action.
{λ_1, λ_2, …, λ_a}: losses associated with each action.
λ(α_i | ω_j): the loss function, the loss incurred by taking action α_i when the true state of nature is in fact ω_j.
R(α_i | x): the conditional risk, the expected loss for taking action α_i:

R(α_i | x) = ∑_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x)

The Bayes decision takes the action that minimizes the conditional risk!

Bayes Decision Rule Using Conditional Risk
1. Compute the conditional risk R(α_i | x) = ∑_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x) for each action α_i.
2. Select the action that has the minimum conditional risk. Let this be action α_k.
3. The overall risk is then R = ∫_X R(α_k | x) p(x) dx, integrated over all possible values of x. Here R(α_k | x) is the conditional risk associated with taking action α(x) based on the observation x, and p(x) is the probability that x will be observed.
4. This is the Bayes Risk, the minimum possible risk that can be achieved by any classifier!
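
A minimal sketch of steps 1 and 2 for a single observation, assuming an illustrative 2x2 loss matrix and the posteriors from the previous sketch (none of these numbers come from the lecture):

```python
import numpy as np

# loss[i, j] = lambda(alpha_i | w_j): cost of taking action alpha_i
# when the true class is w_j (illustrative values).
loss = np.array([[0.0, 2.0],     # action 1: decide w1
                 [1.0, 0.0]])    # action 2: decide w2
posteriors = np.array([0.391, 0.609])    # P(w1 | x), P(w2 | x)

cond_risk = loss @ posteriors            # R(alpha_i | x) = sum_j loss[i, j] * P(w_j | x)
best_action = cond_risk.argmin()         # Bayes decision: minimize conditional risk

print(cond_risk)      # ~[1.22, 0.39]
print(best_action)    # 1 -> decide w2
```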

Two-Class Special Case
Definitions: α_1: decide on ω_1; α_2: decide on ω_2; λ_ij = λ(α_i | ω_j): loss for deciding on ω_i when the state of nature is ω_j.
Conditional risks:

R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)

Note that λ_11 and λ_22 need not be zero, though we expect λ_11 < λ_21 and λ_22 < λ_12.
Decide on ω_1 if R(α_1 | x) < R(α_2 | x); decide on ω_2 otherwise. Equivalently,

Λ(x) = P(x | ω_1) / P(x | ω_2) > [(λ_12 − λ_22) P(ω_2)] / [(λ_21 − λ_11) P(ω_1)]  ⇒ decide ω_1 (otherwise ω_2)

The Likelihood Ratio Test (LRT): pick ω_1 if the likelihood ratio is greater than a threshold that is independent of x. This rule, which minimizes the Bayes risk, is also called the Bayes Criterion.
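
A small sketch of the LRT for two hypothetical 1-D Gaussian classes; means, variances, priors and losses are assumed for illustration only:

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-class setup (not from the lecture).
mu1, sigma1, mu2, sigma2 = 0.0, 1.0, 2.0, 1.0
P1, P2 = 0.5, 0.5
lam11, lam12, lam21, lam22 = 0.0, 1.0, 1.0, 0.0   # zero-one loss here

x = 0.8
Lambda = norm.pdf(x, mu1, sigma1) / norm.pdf(x, mu2, sigma2)   # likelihood ratio
threshold = (lam12 - lam22) * P2 / ((lam21 - lam11) * P1)      # independent of x

decision = 1 if Lambda > threshold else 2
print(Lambda, threshold, decision)   # Lambda ~ 1.49 > 1 -> decide w1
```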

Example. (Figure omitted; from R. Gutierrez @ TAMU.)

Example (to be fully solved, on request, on Friday). (Figure omitted; it specifies the four loss values λ. Modified from R. Gutierrez @ TAMU.)

Minimum Error-Rate Classification: Multiclass Case
If we associate taking action α_i with selecting class ω_i, and if all errors are equally costly, we obtain the zero-one loss (symmetrical cost function):

λ(α_i | ω_j) = 0 if i = j, 1 if i ≠ j,  i, j = 1, 2, …, c

This loss function assigns no loss to a correct classification, and a unit loss to any misclassification. The risk corresponding to this loss function is then

R(α_i | x) = ∑_{j≠i} P(ω_j | x) = 1 − P(ω_i | x)

What does this tell us? To minimize this risk (the average probability of error), we need to choose the class that maximizes the posterior probability P(ω_i | x). In the two-class case:

P(ω_1 | x) > P(ω_2 | x) ⇒ decide ω_1   (maximum a posteriori (MAP) criterion)
Λ(x) = P(x | ω_1) / P(x | ω_2) > P(ω_2) / P(ω_1) ⇒ decide ω_1   (maximum likelihood criterion for equal priors)
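
The equivalence between minimizing the conditional risk under the zero-one loss and the MAP rule can be checked numerically; the posteriors below are illustrative:

```python
import numpy as np

# With the zero-one loss, minimizing conditional risk equals maximizing the posterior.
c = 3
posteriors = np.array([0.2, 0.5, 0.3])          # P(w_j | x), illustrative
zero_one_loss = np.ones((c, c)) - np.eye(c)     # lambda(a_i | w_j) = 0 if i == j else 1

cond_risk = zero_one_loss @ posteriors          # equals 1 - P(w_i | x) for each i
print(cond_risk)                                # [0.8, 0.5, 0.7]
print(cond_risk.argmin() == posteriors.argmax())  # True: min-risk action == MAP class
```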

Error Probabilities (Bayes Rule Rules!)
In a two-class case, there are two sources of error: x falls in R_1 yet the state of nature is ω_2, or vice versa.

P(error) = ∫ P(error | x) p(x) dx = ∫_{R_1} P(ω_2 | x) p(x) dx + ∫_{R_2} P(ω_1 | x) p(x) dx
         = P(x ∈ R_1, ω_2) + P(x ∈ R_2, ω_1)

where P(x ∈ R_1, ω_2) = P(x ∈ R_1 | ω_2) P(ω_2) and P(x ∈ R_2, ω_1) = P(x ∈ R_2 | ω_1) P(ω_1).

(Figure omitted: x_B marks the optimal Bayes decision boundary, x* a non-optimal one.)

Probability of Error
In the multi-class case, there are more ways to be wrong than to be right, so we exploit the fact that P(error) = 1 − P(correct), where

P(correct) = ∑_{i=1}^{C} P(x ∈ R_i, ω_i) = ∑_{i=1}^{C} P(x ∈ R_i | ω_i) P(ω_i)
           = ∑_{i=1}^{C} ∫_{R_i} P(x | ω_i) P(ω_i) dx = ∑_{i=1}^{C} ∫_{R_i} P(ω_i | x) P(x) dx

Of course, in order to minimize P(error) we need to maximize P(correct), for which we need to maximize each and every one of the integrals. Note that P(x) is common to all integrals; therefore the expression will be maximized by choosing the decision regions R_i where the posterior probabilities P(ω_i | x) are maximum.

(Figure omitted; from R. Gutierrez @ TAMU.)
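
As a sanity check, the error probability of the MAP rule can be estimated by Monte Carlo simulation. The two 1-D Gaussian classes below are assumed for illustration; the empirical error should approach the analytical Bayes error (about 0.159 for this configuration):

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of P(error) = 1 - P(correct) for a MAP classifier
# on two illustrative 1-D Gaussian classes (numbers not from the lecture).
rng = np.random.default_rng(0)
mus, sigmas, priors = np.array([0.0, 2.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

n = 200_000
labels = rng.choice(2, size=n, p=priors)          # draw true classes
x = rng.normal(mus[labels], sigmas[labels])       # draw feature values

# MAP decision: argmax_i P(x | w_i) P(w_i)  (P(x) is common to all classes)
scores = np.stack([norm.pdf(x, m, s) * P for m, s, P in zip(mus, sigmas, priors)])
decisions = scores.argmax(axis=0)

print(1.0 - np.mean(decisions == labels))   # ~0.159 for this setup
```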

Discriminant Based Classification
A discriminant is a function g_i(x) that discriminates between classes. This function assigns the input vector to a class according to its definition:

Choose class ω_i if g_i(x) > g_j(x) for all j ≠ i, i, j = 1, 2, …, c.

Bayes rule can be implemented in terms of discriminant functions: g_i(x) = P(ω_i | x). The c discriminant functions partition the feature space into c decision regions R_1, …, R_c, which are separated by decision boundaries. Decision regions need NOT be contiguous. A point x belongs to R_i if g_i(x) > g_j(x) for all j ≠ i, j = 1, 2, …, c, and the decision boundary between R_i and R_j satisfies g_i(x) = g_j(x).

Discriminant Functions
We may view the classifier as an automated machine that computes c discriminants and selects the category corresponding to the largest discriminant. A neural network is one such classifier.

For the Bayes classifier with non-uniform risks: g_i(x) = −R(α_i | x)  (maximizing the discriminant minimizes the risk)
For the MAP classifier (uniform risks): g_i(x) = P(ω_i | x)
For the maximum likelihood classifier (equal priors): g_i(x) = P(x | ω_i)

Discriminant Functions
In fact, multiplying every DF by the same positive constant, or adding/subtracting the same constant to all DFs, does not change the decision boundary. In general, every g_i(x) can be replaced by f(g_i(x)), where f(·) is any monotonically increasing function, without affecting the actual decision boundary. Some linear or non-linear transformations of the previously stated DFs may greatly simplify the design of the classifier. What examples can you think of?
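
One common example is taking the logarithm of g_i(x) = P(x | ω_i) P(ω_i); since ln(·) is monotonically increasing, the decisions are unchanged. A small sketch with assumed 1-D Gaussian classes:

```python
import numpy as np
from scipy.stats import norm

# Replacing g_i(x) = p(x | w_i) P(w_i) by its logarithm (a monotone transform)
# leaves every decision unchanged. All numbers are illustrative.
mus, sigmas, priors = np.array([0.0, 2.0]), np.array([1.0, 0.5]), np.array([0.7, 0.3])
x = np.linspace(-3.0, 4.0, 200)

g = np.stack([norm.pdf(x, m, s) * P for m, s, P in zip(mus, sigmas, priors)])
g_log = np.stack([norm.logpdf(x, m, s) + np.log(P) for m, s, P in zip(mus, sigmas, priors)])

print(np.array_equal(g.argmax(axis=0), g_log.argmax(axis=0)))   # True
```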

Normal Densities
If the likelihoods are normally distributed, a number of simplifications can be made. In particular, the discriminant function can be written in this greatly simplified form (!):

P(x | ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp[ −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) ],   i.e. (x | ω_i) ~ N(μ_i, Σ_i)

g_i(x) = −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i)

There are three distinct cases that can occur:
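
A minimal sketch of this discriminant, using scipy's Gaussian log-density so that the constant and ln|Σ_i| terms are handled automatically; the means, covariances and priors are assumed for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Gaussian discriminant g_i(x) = ln p(x | w_i) + ln P(w_i), illustrative parameters.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs  = [np.array([[1.0, 0.3], [0.3, 1.0]]), np.array([[1.5, 0.0], [0.0, 0.5]])]
priors = [0.6, 0.4]

def g(x, i):
    """ln p(x | w_i) + ln P(w_i); logpdf already includes the constant and ln|Sigma_i| terms."""
    return multivariate_normal.logpdf(x, mean=means[i], cov=covs[i]) + np.log(priors[i])

x = np.array([1.0, 1.5])
scores = [g(x, i) for i in range(2)]
print(scores, int(np.argmax(scores)))   # pick the class with the largest discriminant
```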

Case 1: Σ_i = σ²I
Features are statistically independent, and all features have the same variance. The distributions are spherical in d dimensions, the boundary is a generalized hyperplane (linear discriminant) of d−1 dimensions, and the features form equal-sized hyperspherical clusters. (Figure of example hyperspherical clusters omitted.)

The general form of the discriminant is then

g_i(x) = −||x − μ_i||² / (2σ²) + ln P(ω_i)

If the priors are the same:

g_i(x) = −(1/(2σ²)) (x − μ_i)^T (x − μ_i)   (the Minimum Distance Classifier)

Case 1: Σ_i = σ²I (continued)
This case results in linear discriminants that can be written in the form

g_i(x) = w_i^T x + w_i0

with

w_i = μ_i / σ²,   w_i0 = −(1/(2σ²)) μ_i^T μ_i + ln P(ω_i)

where w_i0 is the threshold (bias) of the i-th category. (Figures of the 1-D, 2-D and 3-D cases omitted.) Note how unequal priors shift the decision boundary away from the more likely mean!
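
A small sketch of this linear form, with assumed means, σ² and priors; with equal priors it reduces to the minimum Euclidean distance classifier, as the last line checks:

```python
import numpy as np

# Case 1 linear discriminant (Sigma_i = sigma^2 * I); illustrative numbers only.
sigma2 = 1.0
means = np.array([[0.0, 0.0], [3.0, 1.0]])
priors = np.array([0.5, 0.5])

W = means / sigma2                                              # rows are w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)  # bias terms w_i0

x = np.array([1.0, 0.2])
g = W @ x + w0                                                  # g_i(x) = w_i^T x + w_i0
print(g, int(g.argmax()))

# With equal priors this is the minimum (Euclidean) distance classifier:
print(int(np.argmin(np.sum((means - x)**2, axis=1))))           # same class index
```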

Case 2: Σ_i = Σ
The covariance matrices are arbitrary, but equal to each other for all classes. The features then form hyperellipsoidal clusters of equal size and shape. This also results in linear discriminant functions whose decision boundaries are again hyperplanes:

g_i(x) = −(1/2) (x − μ_i)^T Σ^{−1} (x − μ_i) + ln P(ω_i)

or equivalently

g_i(x) = w_i^T x + w_i0,   with   w_i = Σ^{−1} μ_i,   w_i0 = −(1/2) μ_i^T Σ^{−1} μ_i + ln P(ω_i)
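
A sketch of the Case 2 discriminant with an assumed shared covariance matrix (illustrative numbers only):

```python
import numpy as np

# Case 2 linear discriminant (shared covariance Sigma); illustrative parameters.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
priors = np.array([0.3, 0.7])

Sigma_inv = np.linalg.inv(Sigma)
W = means @ Sigma_inv                                    # rows are w_i = Sigma^-1 mu_i
w0 = -0.5 * np.array([m @ Sigma_inv @ m for m in means]) + np.log(priors)

x = np.array([1.0, 0.5])
g = W @ x + w0                                           # linear discriminants
print(g, int(g.argmax()))
```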

Case 3: Σ_i = arbitrary
All bets are off! In the two-class case, the decision boundaries form hyperquadrics. The discriminant functions are now, in general, quadratic (not linear), and the decision regions need not be contiguous:

g_i(x) = x^T W_i x + w_i^T x + w_i0

with

W_i = −(1/2) Σ_i^{−1},   w_i = Σ_i^{−1} μ_i,   w_i0 = −(1/2) μ_i^T Σ_i^{−1} μ_i − (1/2) ln |Σ_i| + ln P(ω_i)

(Figures omitted: the boundaries can be hyperbolic, parabolic, linear, ellipsoidal or circular.)
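
A sketch of the general quadratic discriminant with assumed class parameters (illustrative numbers only):

```python
import numpy as np

# Case 3 quadratic discriminant (arbitrary Sigma_i); illustrative parameters.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs  = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[2.0, 0.8], [0.8, 1.0]])]
priors = [0.5, 0.5]

def g(x, i):
    """g_i(x) = x^T W_i x + w_i^T x + w_i0 for class i."""
    Sinv = np.linalg.inv(covs[i])
    Wi = -0.5 * Sinv
    wi = Sinv @ means[i]
    wi0 = (-0.5 * means[i] @ Sinv @ means[i]
           - 0.5 * np.log(np.linalg.det(covs[i]))
           + np.log(priors[i]))
    return x @ Wi @ x + wi @ x + wi0

x = np.array([1.0, 0.5])
scores = [g(x, i) for i in range(2)]
print(scores, int(np.argmax(scores)))
```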

Case 3: Σ_i = arbitrary (multi-class)
For the multi-class case, the boundaries will look even more complicated. (Example figure of the decision boundaries omitted.)

Case 3: Σ_i = arbitrary, in 3-D. (Figure omitted.)

Conclusions
The Bayes classifier for normally distributed classes is, in general, a quadratic classifier whose discriminant functions can be computed directly from the class means, covariances and priors.
The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier.
For normally distributed classes with equal covariance matrices and equal priors, it is a minimum Mahalanobis distance classifier.
For normally distributed classes with equal covariance matrices proportional to the identity matrix and with equal priors, it is a minimum Euclidean distance classifier.
Note that using a minimum Euclidean or Mahalanobis distance classifier implicitly makes certain assumptions regarding the statistical properties of the data, which may or may not be true, and in general are not. However, in many cases, certain simplifications and approximations can be made that warrant making such assumptions even if they are not true. The bottom line in practice, in deciding whether the assumptions are warranted, is: does the damn thing solve my classification problem?

Error Bounds
It is difficult at best, if possible at all, to compute the error probabilities analytically, particularly when the decision regions are not contiguous. However, upper bounds on this error can be obtained: the Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds that are often used. If the distributions are Gaussian, these expressions are relatively easy to compute; oftentimes even non-Gaussian cases are treated as Gaussian.
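
The slide names the Bhattacharyya bound but does not give its formula; for two Gaussian classes the standard expression from Duda, Hart and Stork is P(error) ≤ sqrt(P(ω_1) P(ω_2)) · exp(−k(1/2)), with k(1/2) computed as below. The class parameters here are assumed for illustration:

```python
import numpy as np

# Bhattacharyya bound for two Gaussian classes (standard Duda/Hart/Stork formula);
# means, covariances and priors are illustrative numbers.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.array([[1.0, 0.0], [0.0, 1.0]])
S2 = np.array([[2.0, 0.5], [0.5, 1.0]])
P1, P2 = 0.5, 0.5

S = 0.5 * (S1 + S2)
d = mu2 - mu1
k_half = (0.125 * d @ np.linalg.inv(S) @ d
          + 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

bound = np.sqrt(P1 * P2) * np.exp(-k_half)   # P(error) <= sqrt(P1*P2) * exp(-k(1/2))
print(bound)
```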