Minimum Error-Rate Discriminant

Transcription of lecture slides from Bayesian Decision Theory, J. Corso (SUNY at Buffalo).

Discriminants: Minimum Error-Rate Discriminant. In the case of the zero-one loss function, the Bayes discriminant can be further simplified to $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$. (29)
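
As an illustration (not part of the slides), here is a minimal Python sketch of this rule for two classes, assuming 1-D Gaussian class-conditional densities; the means, variances, and priors are made-up values.

```python
import numpy as np
from scipy.stats import norm

# A minimal sketch of the minimum error-rate rule g_i(x) = P(omega_i | x)
# for two classes with assumed 1-D Gaussian class-conditional densities.
# The means, variances, and priors below are illustrative, not from the slides.
priors = np.array([0.5, 0.5])
likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.5)]

def posteriors(x):
    """Return P(omega_i | x) for each class via Bayes' rule."""
    joint = np.array([p * L.pdf(x) for p, L in zip(priors, likelihoods)])
    return joint / joint.sum()          # normalize by the evidence p(x)

def classify(x):
    """Decide the class with the largest discriminant g_i(x) = P(omega_i | x)."""
    return int(np.argmax(posteriors(x)))

print(classify(0.0))   # class 0 for x near the first mean
print(classify(3.0))   # class 1 for x near the second mean
```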

Discriminants: Uniqueness of Discriminants. Is the choice of discriminant functions unique? No! We can multiply them by some positive constant or shift them by some additive constant without changing the decision.
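
A tiny sketch (illustrative values only) confirming that a positive scale and an additive shift leave the decision unchanged:

```python
import numpy as np

# Sketch: a positive scale and an additive shift of the discriminants
# do not change the winning class, so the decision is unchanged.
g = np.array([0.2, 0.5, 0.3])              # example discriminant values g_i(x)
assert np.argmax(g) == np.argmax(3 * g + 7)
```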

Discriminants: Visualizing Discriminants, Decision Regions. The effect of any decision rule is to divide the feature space into decision regions. Denote the decision region for $\omega_i$ by $\mathcal{R}_i$. One region, not necessarily connected, is created for each category, and assignment is according to: if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$, then $\mathbf{x}$ is in $\mathcal{R}_i$. (33) Decision boundaries separate the regions; they are ties among the discriminant functions.

Discriminants: Visualizing Discriminants, Decision Regions. [Figure: $p(x \mid \omega_1)P(\omega_1)$ and $p(x \mid \omega_2)P(\omega_2)$ plotted over a one-dimensional feature; the decision boundaries split the axis into $\mathcal{R}_1$ and a disconnected $\mathcal{R}_2$.]
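
A small Python sketch (with assumed Gaussian class-conditionals and priors, chosen so that $\mathcal{R}_2$ comes out disconnected) that labels each point of a 1-D feature axis with its region and locates the boundaries numerically:

```python
import numpy as np
from scipy.stats import norm

# Sketch of decision regions on a 1-D feature, with illustrative (assumed)
# Gaussian class-conditionals and priors.
priors = [0.5, 0.5]
likelihoods = [norm(0.0, 0.8), norm(2.0, 1.8)]

x = np.linspace(-5, 7, 1201)
g = np.vstack([P * L.pdf(x) for P, L in zip(priors, likelihoods)])  # g_i(x) = p(x|w_i)P(w_i)
labels = np.argmax(g, axis=0)            # region index at each x

# Decision boundaries are the points where the winning discriminant changes.
boundaries = x[1:][np.diff(labels) != 0]
print(boundaries)    # two boundaries: here R_2 is the union of two intervals (disconnected)
```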

Discriminants: Two-Category Discriminants (Dichotomizers). In the two-category case, one considers a single discriminant $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$. (34) What is a suitable decision rule? The following simple rule is then used: decide $\omega_1$ if $g(\mathbf{x}) > 0$; otherwise decide $\omega_2$. (35) Various manipulations of the discriminant are possible: $g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$ (36) and $g(\mathbf{x}) = \ln \dfrac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$. (37)
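
A minimal sketch of a dichotomizer built from the log-likelihood-ratio form (37), with assumed 1-D Gaussian class-conditionals and priors (illustrative values, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Sketch of a dichotomizer g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2),
# with assumed 1-D Gaussian class-conditionals (illustrative parameters).
p1, p2 = 0.6, 0.4
L1, L2 = norm(0.0, 1.0), norm(2.0, 1.0)

def g(x):
    """Log-likelihood ratio plus log prior ratio."""
    return (L1.logpdf(x) - L2.logpdf(x)) + np.log(p1 / p2)

def decide(x):
    """Decide omega_1 if g(x) > 0, otherwise omega_2."""
    return "omega_1" if g(x) > 0 else "omega_2"

print(decide(0.5), decide(1.8))   # -> omega_1 omega_2
```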

The Normal Density: Entropy. Entropy measures the uncertainty in random samples from a distribution: $H(p(x)) = -\int p(x) \ln p(x)\, dx$. (44) The normal density has the maximum entropy among all distributions having a given mean and variance. What is the entropy of the uniform distribution? The uniform distribution has maximum entropy among all distributions on a given interval.
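
To make the definition concrete, this sketch numerically evaluates $H$ for a Gaussian and compares it with the known closed form $\tfrac{1}{2}\ln(2\pi e \sigma^2)$; the value of $\sigma$ is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Sketch: numerically evaluate H(p) = -∫ p(x) ln p(x) dx for a Gaussian and
# compare with the closed form 0.5 * ln(2*pi*e*sigma^2).  sigma is an
# illustrative choice, not a value from the slides.
sigma = 1.5
p = norm(0.0, sigma).pdf
H_numeric, _ = integrate.quad(lambda x: -p(x) * np.log(p(x)), -30, 30)
H_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(H_numeric, H_closed)   # the two values agree closely
```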

The Normal Density: The Covariance Matrix. $\Sigma \equiv E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T p(\mathbf{x})\, d\mathbf{x}$. It is symmetric and positive semi-definite (but DHS only considers the positive-definite case, so that the determinant is strictly positive). The diagonal elements $\sigma_{ii}$ are the variances of the respective coordinates $x_i$; the off-diagonal elements $\sigma_{ij}$ are the covariances of $x_i$ and $x_j$. What does $\sigma_{ij} = 0$ imply? That coordinates $x_i$ and $x_j$ are statistically independent. What does the multivariate normal density reduce to if all off-diagonal elements are 0? The product of the $d$ univariate densities.
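
A short sketch of the last point: for a diagonal $\Sigma$, the multivariate normal density equals the product of the univariate densities. The mean and standard deviations below are illustrative choices, not values from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Sketch: for a diagonal covariance, the multivariate normal density
# factorizes into a product of univariate densities.
mu = np.array([0.0, 1.0, -2.0])
sigmas = np.array([1.0, 0.5, 2.0])
Sigma = np.diag(sigmas**2)               # diagonal covariance matrix

x = np.array([0.3, 0.8, -1.5])
joint = multivariate_normal(mu, Sigma).pdf(x)
product = np.prod([norm(m, s).pdf(xi) for xi, m, s in zip(x, mu, sigmas)])
print(joint, product)     # identical up to floating-point error
```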

The Normal Density: Simple Case, Statistically Independent Features with the Same Variance. What do the decision boundaries look like if we assume $\Sigma_i = \sigma^2 I$? They are hyperplanes. [Figure: one-, two-, and three-dimensional examples of $p(x \mid \omega_i)$ for classes $\omega_1$ and $\omega_2$ with equal priors $P(\omega_1) = P(\omega_2) = 0.5$, showing the linear decision boundaries between $\mathcal{R}_1$ and $\mathcal{R}_2$.] Let's see why...
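
The derivation itself is not included in this transcription; the following is a standard sketch (after the usual DHS treatment) of why the boundaries are linear when $\Sigma_i = \sigma^2 I$. Starting from $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$ with $p(\mathbf{x} \mid \omega_i) = N(\boldsymbol{\mu}_i, \sigma^2 I)$,
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i) + \text{const}.$$
Expanding $\|\mathbf{x} - \boldsymbol{\mu}_i\|^2 = \mathbf{x}^T\mathbf{x} - 2\boldsymbol{\mu}_i^T\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\mu}_i$ and dropping the class-independent $\mathbf{x}^T\mathbf{x}$ term gives
$$g_i(\mathbf{x}) = \frac{1}{\sigma^2}\boldsymbol{\mu}_i^T\mathbf{x} - \frac{1}{2\sigma^2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + \ln P(\omega_i),$$
which is linear in $\mathbf{x}$, so each boundary $g_i(\mathbf{x}) = g_j(\mathbf{x})$ is a hyperplane.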

The Normal Density: General Case, Arbitrary $\Sigma_i$.

The Normal Density: General Case for Multiple Categories. [Figure: quite complicated decision regions $\mathcal{R}_1, \dots, \mathcal{R}_4$ for four normal distributions; a complicated decision surface, with $\mathcal{R}_4$ appearing as a disconnected region.]

Receiver Operating Characteristics: Signal Detection Theory. The detector uses $x$ to decide if the external signal is present. Discriminability characterizes how difficult it will be to decide if the external signal is present without knowing $x$: $d' = \dfrac{|\mu_2 - \mu_1|}{\sigma}$. (63) Even if we do not know $\mu_1$, $\mu_2$, $\sigma$, or $x^*$, we can find $d'$ by using a receiver operating characteristic (ROC) curve, as long as we know the state of nature for some experiments.

Receiver Operating Characteristics: Definitions. A hit is the probability that the internal signal is above $x^*$ given that the external signal is present: $P(x > x^* \mid x \in \omega_2)$. (64) A correct rejection is the probability that the internal signal is below $x^*$ given that the external signal is not present: $P(x < x^* \mid x \in \omega_1)$. (65) A false alarm is the probability that the internal signal is above $x^*$ despite there being no external signal present: $P(x > x^* \mid x \in \omega_1)$. (66)
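
A sketch of how an empirical ROC curve is traced out: sweep the threshold $x^*$ and estimate the hit and false-alarm rates from samples. The two Gaussian models of the internal signal below are assumed for illustration.

```python
import numpy as np

# Sketch of an empirical ROC curve: sweep the threshold x* and estimate the
# hit rate P(x > x* | omega_2) and false-alarm rate P(x > x* | omega_1).
# The two Gaussian "internal signal" models are illustrative choices.
rng = np.random.default_rng(0)
mu1, mu2, sigma = 0.0, 1.5, 1.0              # so d' = |mu2 - mu1| / sigma = 1.5
x_absent  = rng.normal(mu1, sigma, 10000)    # samples from omega_1 (no signal)
x_present = rng.normal(mu2, sigma, 10000)    # samples from omega_2 (signal present)

thresholds = np.linspace(-4, 6, 201)
hits         = [(x_present > t).mean() for t in thresholds]
false_alarms = [(x_absent  > t).mean() for t in thresholds]

# Each threshold gives one (false-alarm, hit) point on the ROC curve;
# larger d' pushes the curve toward the upper-left corner.
print(list(zip(false_alarms, hits))[::50])
```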

Missing or Bad Features: Missing Features. Suppose we have built a classifier on multiple features, for example lightness and width. What do we do if one of the features is not measurable for a particular case? For example, the lightness can be measured but the width cannot because of occlusion. Marginalize! Let $\mathbf{x}$ be our full feature vector, let $\mathbf{x}_g$ be the subset of features that are measurable (or good), and let $\mathbf{x}_b$ be the subset that are missing (or bad/noisy). We seek an estimate of the posterior given just the good features $\mathbf{x}_g$.

Missing or Bad Features: Missing Features.
$$P(\omega_i \mid \mathbf{x}_g) = \frac{p(\omega_i, \mathbf{x}_g)}{p(\mathbf{x}_g)} \qquad (68)$$
$$= \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{p(\mathbf{x}_g)} \qquad (69)$$
$$= \frac{\int P(\omega_i \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}_b}{p(\mathbf{x}_g)} \qquad (70)$$
$$= \frac{\int g_i(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}_b}{\int p(\mathbf{x})\, d\mathbf{x}_b} \qquad (71)$$
We will cover the Expectation-Maximization algorithm later. This is normally quite expensive to evaluate unless the densities are special (like Gaussians).
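
A sketch of (68)-(71) in practice: with two assumed 2-D Gaussian classes and the second feature missing, the posterior over the good feature can be obtained by numerical integration over $\mathbf{x}_b$, and (for Gaussians) it matches the closed-form marginal. All parameters below are illustrative.

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal, norm

# Sketch of classifying with a missing feature by marginalization,
# P(w_i | x_g) ∝ ∫ p(x_g, x_b | w_i) P(w_i) dx_b, for two assumed 2-D
# Gaussian classes (feature 0 = "good", feature 1 = "bad"/missing).
priors = [0.5, 0.5]
means  = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs   = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.5]])]

x_g = 1.2   # observed value of the good feature; the bad feature is unknown

# Numerator terms: integrate the joint over the missing feature x_b.
terms = []
for P, mu, S in zip(priors, means, covs):
    joint = lambda x_b, mu=mu, S=S: multivariate_normal(mu, S).pdf([x_g, x_b])
    val, _ = integrate.quad(joint, -20, 20)
    terms.append(P * val)
post = np.array(terms) / sum(terms)

# For Gaussians the integral has a closed form: the marginal of x_g is
# itself Gaussian with the corresponding mean and variance.
closed = np.array([P * norm(mu[0], np.sqrt(S[0, 0])).pdf(x_g)
                   for P, mu, S in zip(priors, means, covs)])
closed /= closed.sum()
print(post, closed)    # the two posteriors agree
```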

Bayesian Belief Networks: Statistical Independence. Two variables $x_i$ and $x_j$ are independent if $p(x_i, x_j) = p(x_i)\,p(x_j)$. (72) [Figure 2.23: a three-dimensional distribution which obeys $p(x_1, x_3) = p(x_1)p(x_3)$; thus $x_1$ and $x_3$ are statistically independent, but the other feature pairs are not. From Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, John Wiley & Sons, 2001.]
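
A tiny check of definition (72) on an assumed discrete joint table (an illustrative 2x3 distribution, not taken from the figure):

```python
import numpy as np

# Sketch: check p(x_i, x_j) = p(x_i) p(x_j) for a small discrete joint table.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.30, 0.15]])
p_i = joint.sum(axis=1, keepdims=True)   # marginal of x_i
p_j = joint.sum(axis=0, keepdims=True)   # marginal of x_j
print(np.allclose(joint, p_i * p_j))     # True: the two variables are independent
```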

Bayesian Belief Networks: Simple Example of Conditional Independence (from Russell and Norvig). Consider a simple example consisting of four variables: the weather, the presence of a cavity, the presence of a toothache, and the presence of other mouth-related variables such as dry mouth (the catch). The weather is clearly independent of the other three variables, and the toothache and catch are conditionally independent given the cavity (one has no effect on the other given the information about the cavity). [Graph: Weather stands alone; Cavity is the parent of both Toothache and Catch.]

Bayesian Belief Networks: Naïve Bayes Rule. If we assume that all of our individual features $x_i$, $i = 1, \dots, d$, are conditionally independent given the class, then we have $p(\omega_k \mid \mathbf{x}) \propto \prod_{i=1}^{d} p(x_i \mid \omega_k)$. (73) Circumvents issues of dimensionality. Performs with surprising accuracy even in cases violating the underlying independence assumption.
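
A minimal Gaussian naïve Bayes sketch on made-up data: each class-conditional is modeled as a product of per-feature 1-D Gaussians, and the (assumed equal) priors enter through the log term, which is a constant here and so consistent with the proportionality in (73).

```python
import numpy as np
from scipy.stats import norm

# Sketch of a Gaussian naive Bayes classifier: per-class, per-feature 1-D
# Gaussians are multiplied as if the features were conditionally independent.
# The toy data below is randomly generated for illustration.
rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], [1.0, 2.0], size=(100, 2))   # class omega_1 samples
X1 = rng.normal([2, 1], [1.5, 1.0], size=(100, 2))   # class omega_2 samples

# Fit: estimate a mean and standard deviation per class and per feature.
params = [(X.mean(axis=0), X.std(axis=0)) for X in (X0, X1)]
priors = [0.5, 0.5]

def log_posterior(x):
    """Unnormalized log P(omega_k | x) under the naive independence assumption."""
    return np.array([np.log(P) + norm(mu, sd).logpdf(x).sum()
                     for P, (mu, sd) in zip(priors, params)])

print(np.argmax(log_posterior([0.2, -0.5])))   # -> 0
print(np.argmax(log_posterior([2.5,  1.2])))   # -> 1
```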