University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2: The Multivariate Gaussian & Decision Boundaries

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing

Handout 2: The Multivariate Gaussian & Decision Boundaries

Mark Gales mjfg@eng.cam.ac.uk

Lent 5

2. The Multivariate Gaussian & Decision Boundaries

Multi-Dimensional Data

In the previous lecture we looked at classifying vowels using formants. The features extracted from the waveform were the frequency of formant 1, x, and the frequency of formant 2, y. We have a 2-dimensional feature vector,
$$\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$$
We need to be able to define class-conditional densities for these multi-dimensional feature vectors. We have already seen joint distributions of the form p(x, y). This is now written in vector notation (and may be generalised to more than two variables) as p(x).

mean: $\boldsymbol{\mu} = E\{\mathbf{x}\} = \begin{bmatrix} E\{x\} \\ E\{y\} \end{bmatrix}$

covariance: $\boldsymbol{\Sigma} = E\{(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'\} = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{bmatrix}$

independent: p(x, y) = p(x)p(y)

uncorrelated (compare to independent): E{xy} = E{x}E{y}

One common form is the multivariate Gaussian.
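As an illustration of these definitions, the short numpy sketch below estimates the mean vector and covariance matrix of some synthetic 2-dimensional feature vectors; the particular means and covariances used to generate the data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D feature vectors (think formant-1 / formant-2 pairs), drawn
# from a correlated Gaussian chosen purely for illustration.
true_mu = np.array([500.0, 1500.0])
true_Sigma = np.array([[100.0**2, 0.6 * 100 * 200],
                       [0.6 * 100 * 200, 200.0**2]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)   # shape (N, 2)

# Sample estimates of the mean vector and covariance matrix
mu_hat = X.mean(axis=0)                      # estimate of E{x}
Sigma_hat = np.cov(X, rowvar=False)          # estimate of E{(x-mu)(x-mu)'}

print("mean estimate:\n", mu_hat)
print("covariance estimate:\n", Sigma_hat)
# The clearly non-zero off-diagonal entry sigma_xy shows the two features
# are correlated (and hence not independent).
```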

Multivariate Gaussian

For a d-dimensional feature vector, x, the multivariate Gaussian distribution is
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

(Figures: contour and surface plots of two example two-dimensional Gaussians with different covariance matrices.)
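A direct transcription of this density into code is useful for later examples. The sketch below evaluates the formula with plain numpy; the example mean and covariance are arbitrary choices.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density p(x) for a d-dimensional x."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff     # (x-mu)' Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(gaussian_pdf(np.array([1.0, -1.0]), mu, Sigma))
```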

Properties

The distribution is characterised by the mean vector, µ, and the covariance matrix, Σ, defined as
$$\boldsymbol{\mu} = E\{\mathbf{x}\}, \qquad \boldsymbol{\Sigma} = E\{(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'\}$$
The covariance matrix is symmetric and for d dimensions is described by d(d+1)/2 parameters.

The diagonal elements of the covariance matrix, σ_ii, are the variances in the individual dimensions (commonly written σ_i²). The off-diagonal elements determine the correlation. If all off-diagonal elements are zero, the data is uncorrelated and
$$p(\mathbf{x}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$
There are then only d parameters for the covariance matrix.

If the PDF of x is a multivariate Gaussian and
$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$$
then p(y) is Gaussian with
$$\boldsymbol{\mu}_y = \mathbf{A}\boldsymbol{\mu}_x + \mathbf{b}, \qquad \boldsymbol{\Sigma}_y = \mathbf{A}\boldsymbol{\Sigma}_x\mathbf{A}'$$
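The linear-transformation property is easy to check empirically. In the sketch below the matrices A and b and the Gaussian parameters are arbitrary choices; the sample moments of y = Ax + b are compared with the stated formulas.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_x = np.array([1.0, 2.0])
Sigma_x = np.array([[2.0, 0.3],
                    [0.3, 1.0]])
A = np.array([[1.0, 1.0],
              [0.5, -1.0]])
b = np.array([0.0, 3.0])

# Transform a large sample y = A x + b and compare moments with the theory
X = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
Y = X @ A.T + b

print("empirical mean  :", Y.mean(axis=0))
print("theoretical mean:", A @ mu_x + b)
print("empirical cov   :\n", np.cov(Y, rowvar=False))
print("theoretical cov :\n", A @ Sigma_x @ A.T)
```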

Decision Boundaries

Decision boundaries partition the feature space into regions with a class label associated with each region. Only two-class problems will be considered. We would like the decision boundary to reflect the Bayes decision rule from the previous lecture. For a two-class problem the class of observation x is selected as

Decide Class ω1 if P(ω1|x) > P(ω2|x), Class ω2 otherwise.

For this two-class problem it is clear that the decision boundary will occur when the posterior probabilities of class 1 and class 2 are equal. This tells us that the decision boundary occurs when the posterior probability is
$$P(\omega_1|\mathbf{x}) = \frac{1}{2} = \frac{p(\mathbf{x}|\omega_1)P(\omega_1)}{p(\mathbf{x}|\omega_1)P(\omega_1) + p(\mathbf{x}|\omega_2)P(\omega_2)}$$
We can therefore say that a point x lies on the decision boundary when it satisfies
$$p(\mathbf{x}|\omega_1)P(\omega_1) = p(\mathbf{x}|\omega_2)P(\omega_2)$$
Commonly the ln() is taken:
$$\ln(p(\mathbf{x}|\omega_1)) + \ln(P(\omega_1)) = \ln(p(\mathbf{x}|\omega_2)) + \ln(P(\omega_2))$$
It is interesting to see what forms the decision boundaries have when the class-conditional PDFs are multivariate Gaussians.
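The log form of the rule maps directly onto code. Below is a minimal sketch of the two-class Bayes decision for Gaussian class-conditionals; the function names and the use of slogdet/inv are one convenient numpy implementation, not part of the handout.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for a d-dimensional x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.inv(Sigma) @ diff)

def bayes_decide(x, mu1, Sigma1, P1, mu2, Sigma2, P2):
    """Return 1 or 2 according to the log form of the Bayes decision rule."""
    g1 = log_gaussian(x, mu1, Sigma1) + np.log(P1)
    g2 = log_gaussian(x, mu2, Sigma2) + np.log(P2)
    return 1 if g1 > g2 else 2

# Simple usage: the point is closer to mu1, so class 1 is chosen
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
Sigma = np.eye(2)
print(bayes_decide(np.array([0.5, 1.0]), mu1, Sigma, 0.5, mu2, Sigma, 0.5))
```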

Example Case: Σ_i = σ²I

This is the case where the covariance matrices of both classes are equal, diagonal and the variances of all features are the same. In two dimensions, the scatter of the two classes would look circular.

The class-conditional multivariate Gaussian distribution for class ω1 has mean µ1 and that for class ω2 has mean µ2. The discriminant function in this case reduces to a simple expression, since the determinant and inverse of the covariance matrix take very simple forms:
$$|\boldsymbol{\Sigma}_i| = \sigma^{2d}, \qquad \boldsymbol{\Sigma}_i^{-1} = \frac{1}{\sigma^2}\mathbf{I}$$
Hence, equating the two expressions to find the position of the decision boundary,
$$\ln\!\left(\frac{P(\omega_1)}{\sqrt{\sigma^{2d}(2\pi)^{d}}}\right) - \frac{1}{2\sigma^{2}}(\mathbf{x}-\boldsymbol{\mu}_1)'(\mathbf{x}-\boldsymbol{\mu}_1) = \ln\!\left(\frac{P(\omega_2)}{\sqrt{\sigma^{2d}(2\pi)^{d}}}\right) - \frac{1}{2\sigma^{2}}(\mathbf{x}-\boldsymbol{\mu}_2)'(\mathbf{x}-\boldsymbol{\mu}_2)$$
Cancelling terms yields
$$\frac{1}{\sigma^{2}}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\mathbf{x} = \frac{1}{2\sigma^{2}}\left(\boldsymbol{\mu}_1'\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\boldsymbol{\mu}_2\right) + \ln\!\left(\frac{P(\omega_2)}{P(\omega_1)}\right)$$
Since the means are fixed, this may be written as
$$\mathbf{w}'\mathbf{x} = b$$
which is the equation of a line.

Example

Consider the case of Gaussian distributions with identity covariance matrices, Σ1 = Σ2 = I, and equal priors on the two classes. From the previous slide the boundary is then w'x = b with
$$\mathbf{w} = \boldsymbol{\mu}_1-\boldsymbol{\mu}_2, \qquad b = \frac{1}{2}\left(\boldsymbol{\mu}_1'\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\boldsymbol{\mu}_2\right)$$
so the decision boundary is the perpendicular bisector of the line joining the two means. The lines of equal likelihood and the decision boundary are shown below.

(Figure: circular contours of equal likelihood for the two classes with the linear decision boundary between them.)
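To make the algebra concrete, the following sketch computes w and b for this equal-variance case and checks that the midpoint of the two means lies on the boundary when the priors are equal. The particular means used here are illustrative choices, not the values from the figure.

```python
import numpy as np

def spherical_boundary(mu1, mu2, sigma2=1.0, P1=0.5, P2=0.5):
    """Return (w, b) of the linear boundary w'x = b when Sigma_i = sigma^2 I."""
    w = (mu1 - mu2) / sigma2
    b = (mu1 @ mu1 - mu2 @ mu2) / (2 * sigma2) + np.log(P2 / P1)
    return w, b

# Illustrative means (chosen for this sketch)
mu1 = np.array([1.0, 5.0])
mu2 = np.array([4.0, 1.0])
w, b = spherical_boundary(mu1, mu2)

midpoint = 0.5 * (mu1 + mu2)
print("w =", w, " b =", b)
print("w'x - b at midpoint:", w @ midpoint - b)   # 0: the boundary bisects the means
```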

Special Case: Σ_i = Σ

Here the covariance matrices are common but full. The discriminant function for class ω_i can be expressed as
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln\!\left(\frac{P(\omega_i)}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\right)$$
Again cancelling terms from either side gives
$$(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \frac{1}{2}\left(\boldsymbol{\mu}_1'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2\right) + \ln\!\left(\frac{P(\omega_2)}{P(\omega_1)}\right)$$
again a linear decision boundary
$$\mathbf{w}'\mathbf{x} = b$$

The Mahalanobis distance is related to this:
$$(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$$
The Mahalanobis distance both weights the effect of individual features (by their inverse variance) and accounts for inter-feature correlations. In the case of common covariances a classifier can be more simply constructed by transforming the input space so that the features are decorrelated. For example, using an eigenvector/eigenvalue decomposition the Mahalanobis distance may be expressed as
$$(\mathbf{A}\mathbf{x}-\mathbf{A}\boldsymbol{\mu})'(\mathbf{A}\mathbf{x}-\mathbf{A}\boldsymbol{\mu})$$
Thus by transforming the feature space the Mahalanobis distance becomes a Euclidean distance.
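A brief numerical check of this whitening idea: with A built from the eigen-decomposition of Σ as A = Λ^(-1/2) U' (one possible choice of decorrelating transform, assumed here), the squared Mahalanobis distance equals the squared Euclidean distance between the transformed vectors.

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
mu = np.array([1.0, 2.0])
x = np.array([3.0, 0.5])

# Squared Mahalanobis distance in the original space
d2_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# Whitening transform from the eigen-decomposition Sigma = U diag(lam) U'
lam, U = np.linalg.eigh(Sigma)
A = np.diag(1.0 / np.sqrt(lam)) @ U.T          # so that A Sigma A' = I

# Squared Euclidean distance between the transformed points
d2_eucl = np.sum((A @ x - A @ mu) ** 2)

print(d2_mahal, d2_eucl)   # identical up to rounding
```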

General Case

For the general case we define
$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)'\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln\!\left(\frac{P(\omega_i)}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_i|^{1/2}}\right)$$
This function is quadratic in nature. Equating the two discriminants for the two-class problem reveals a hyperquadric decision boundary of the form
$$\mathbf{x}'\mathbf{A}\mathbf{x} + \mathbf{w}'\mathbf{x} + b = 0$$
where
$$\mathbf{A} = \boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1}, \qquad \mathbf{w} = 2\left(\boldsymbol{\mu}_2'\boldsymbol{\Sigma}_2^{-1} - \boldsymbol{\mu}_1'\boldsymbol{\Sigma}_1^{-1}\right)'$$
and the constant is given by
$$b = \boldsymbol{\mu}_1'\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2 + 2\ln\!\left(\frac{P(\omega_2)\,|\boldsymbol{\Sigma}_1|^{1/2}}{P(\omega_1)\,|\boldsymbol{\Sigma}_2|^{1/2}}\right)$$
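These coefficient formulas can be verified numerically: the quadratic form above equals -2(g1(x) - g2(x)), so the two quantities should agree at any point. The sketch below uses arbitrary example parameters and checks the identity at a random point.

```python
import numpy as np

def g(x, mu, Sigma, P):
    """Quadratic discriminant g_i(x) for a Gaussian class-conditional."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            + np.log(P) - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet)

def quad_boundary_coeffs(mu1, Sigma1, P1, mu2, Sigma2, P2):
    """Coefficients of x'Ax + w'x + b, equal to -2(g1(x) - g2(x))."""
    S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    A = S1inv - S2inv
    w = 2.0 * (S2inv @ mu2 - S1inv @ mu1)
    b = (mu1 @ S1inv @ mu1 - mu2 @ S2inv @ mu2
         + np.log(P2**2 * np.linalg.det(Sigma1)
                  / (P1**2 * np.linalg.det(Sigma2))))
    return A, w, b

# Spot-check on arbitrary parameters
rng = np.random.default_rng(2)
mu1, mu2 = np.array([0.0, 1.0]), np.array([2.0, -1.0])
Sigma1 = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma2 = np.array([[2.0, -0.4], [-0.4, 1.5]])
P1, P2 = 0.3, 0.7
A, w, b = quad_boundary_coeffs(mu1, Sigma1, P1, mu2, Sigma2, P2)
x = rng.normal(size=2)
print(x @ A @ x + w @ x + b)                                  # quadratic form
print(-2 * (g(x, mu1, Sigma1, P1) - g(x, mu2, Sigma2, P2)))   # same value
```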

Examples of General Case

Arbitrary Gaussian distributions can lead to general hyperquadratic boundaries. The following figures (from DHS) indicate this. Note that the boundaries can of course be straight lines and that the regions may not be simply connected.

(Figures from DHS: pairs of Gaussian class-conditional densities with the resulting decision boundaries and decision regions.)

Example Decision Boundary

Assume two classes, equal priors, with
$$\boldsymbol{\mu}_1 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}, \quad \boldsymbol{\Sigma}_1 = \begin{bmatrix} 1/2 & 0 \\ 0 & 2 \end{bmatrix}; \qquad \boldsymbol{\mu}_2 = \begin{bmatrix} 3 \\ -2 \end{bmatrix}, \quad \boldsymbol{\Sigma}_2 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$
The inverse covariance matrices are then
$$\boldsymbol{\Sigma}_1^{-1} = \begin{bmatrix} 2 & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad \boldsymbol{\Sigma}_2^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix}$$
Equating g1(x) = g2(x) yields
$$\begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix} 3/2 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} -9 & -8 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 28.11 = 0$$
i.e.
$$1.5x_1^2 - 9x_1 - 8x_2 + 28.11 = 0, \qquad 28.11 = 36 - 6.5 - 2\ln(2)$$
so
$$x_2 = 3.514 - 1.125x_1 + 0.1875x_1^2$$
which is a parabola with a minimum at (3, 1.83).

(Figure from DHS: the two class means µ1 and µ2 in the (x1, x2) plane with the parabolic decision boundary.)

Note that the boundary does not pass through the mid-point between the means.
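As a quick check of the numbers in this example, the sketch below plugs the example's parameters into the general-case coefficient expressions above and prints the resulting parabola; it reproduces x2 = 3.514 - 1.125 x1 + 0.1875 x1^2.

```python
import numpy as np

# Parameters of the worked example (equal priors)
mu1, Sigma1 = np.array([3.0, 6.0]), np.diag([0.5, 2.0])
mu2, Sigma2 = np.array([3.0, -2.0]), np.diag([2.0, 2.0])

S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
A = S1inv - S2inv                              # quadratic term
w = 2.0 * (S2inv @ mu2 - S1inv @ mu1)          # linear term
b = (mu1 @ S1inv @ mu1 - mu2 @ S2inv @ mu2
     + np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma2)))

print("A =\n", A)
print("w =", w, " b =", b)

# Boundary: A[0,0]*x1^2 + w[0]*x1 + w[1]*x2 + b = 0  (here A[1,1] = 0),
# i.e. x2 = -(A[0,0]*x1^2 + w[0]*x1 + b) / w[1]
print("x2 =", -b / w[1], "+", -w[0] / w[1], "* x1 +", -A[0, 0] / w[1], "* x1^2")
```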

Cost of Mis-Classification

We have assumed that the goal is to minimise the average probability of classification error. Recall that for the two-class problem, the Bayes minimum average error decision rule can be written as
$$P(\omega_1|\mathbf{x}) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|\mathbf{x})$$
or, using the likelihood ratio,
$$\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)}$$
Sometimes the cost (or loss) for misclassification is specified (or can be estimated) and different types of classification error may not have equal cost:

C12 : cost of choosing ω1 given x from ω2
C21 : cost of choosing ω2 given x from ω1

and C_ii is the cost of correct classification (may be zero). The aim now is to minimise the Bayes risk, which is the expected value of the classification cost.

Minimising Classification Cost

Let the decision region associated with class ω_j be denoted R_j. Consider all the patterns that belong to class ω_1. The expected cost (or risk) for these patterns is given by
$$\mathcal{R}_1 = \sum_{i=1}^{2} C_{i1} \int_{R_i} p(\mathbf{x}|\omega_1)\,d\mathbf{x}$$
The overall cost is found as
$$\mathcal{R} = \sum_{j=1}^{2} \mathcal{R}_j\, P(\omega_j) = \sum_{j=1}^{2}\sum_{i=1}^{2} C_{ij} \int_{R_i} p(\mathbf{x}|\omega_j)\,d\mathbf{x}\; P(\omega_j) = \sum_{i=1}^{2} \int_{R_i} \sum_{j=1}^{2} C_{ij}\, p(\mathbf{x}|\omega_j) P(\omega_j)\,d\mathbf{x}$$
To minimise this, the integrand should be minimised at every point, so choose R_1 such that
$$\sum_{j=1}^{2} C_{1j}\, p(\mathbf{x}|\omega_j) P(\omega_j) < \sum_{j=1}^{2} C_{2j}\, p(\mathbf{x}|\omega_j) P(\omega_j)$$
In the case that C_11 = C_22 = 0 we obtain
$$C_{21} P(\omega_1|\mathbf{x}) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} C_{12} P(\omega_2|\mathbf{x})$$
or, using the likelihood ratio,
$$\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)\,C_{12}}{P(\omega_1)\,C_{21}}$$
Note that the decision rule to minimise the Bayes risk is the minimum error rule when C_12 = C_21 = 1 and correct classification has zero cost.
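The cost-weighted likelihood-ratio test is a one-line change to the minimum-error rule. A minimal sketch, with illustrative likelihood values and a hypothetical function name of my own choosing:

```python
import numpy as np

def decide_with_costs(px_w1, px_w2, P1, P2, C12, C21):
    """Likelihood-ratio test minimising the Bayes risk (assuming C11 = C22 = 0).

    px_w1, px_w2 : class-conditional likelihoods p(x|w1), p(x|w2) at the point x
    C12 : cost of choosing w1 when the true class is w2
    C21 : cost of choosing w2 when the true class is w1
    """
    threshold = (P2 * C12) / (P1 * C21)
    return 1 if px_w1 / px_w2 > threshold else 2

# With C12 = C21 = 1 this reduces to the minimum-error rule: class 1 here
print(decide_with_costs(px_w1=0.2, px_w2=0.1, P1=0.5, P2=0.5, C12=1.0, C21=1.0))
# Making it ten times more costly to choose w1 when w2 is true raises the
# threshold, so the same point is now assigned to class 2
print(decide_with_costs(px_w1=0.2, px_w2=0.1, P1=0.5, P2=0.5, C12=10.0, C21=1.0))
```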

ROC curves

In some problems, such as medical diagnostics, there is a target class that you want to separate from the rest of the population (i.e. it is a detection problem). Four types of outcome can be identified (let class ω2 be positive, ω1 be negative):

True Positive (Hit)
True Negative
False Positive (False Alarm)
False Negative

As the decision threshold is changed the ratio of True Positives to False Positives changes. This trade-off is often plotted in a Receiver Operating Characteristic or ROC curve. The ROC curve is a plot of the probability of a true positive (hit) against the probability of a false positive (false alarm). This allows a designer to see an overview of the characteristics of a system.

ROC curves (Example)

Example: 1-d data, equal variances and equal priors. The threshold for minimum error would be (µ1 + µ2)/2.

(Figure, left panel: "Error Probabilities", the class-conditional densities for the two classes. Right panel: "Receiver Operating Characteristic Curve", True Positive against False Positive.)

Left are the plots of p(x|ω_i) for classes ω2 and ω1; each value of the threshold x gives a probability for each outcome, and for the value of x marked the probabilities are shown. Right is the associated ROC curve obtained by varying x (here percentages rather than probabilities are given on the axes). Curves going into the top left corner are good; a straight line at 45 degrees is random.
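An ROC curve like this can be traced out numerically by sweeping the decision threshold over samples from the two class-conditionals. The sketch below uses two 1-d Gaussians with equal variances (the means, variance and sample sizes are arbitrary illustrative choices).

```python
import numpy as np

# Two 1-d Gaussian class-conditionals with equal variances: class w2
# ("positive") has the larger mean, class w1 ("negative") the smaller.
rng = np.random.default_rng(3)
mu_neg, mu_pos, sigma = 0.0, 2.0, 1.0
neg = rng.normal(mu_neg, sigma, 10_000)
pos = rng.normal(mu_pos, sigma, 10_000)

# Sweep the decision threshold: declare "positive" when x > threshold
thresholds = np.linspace(-4.0, 6.0, 201)
tpr = [(pos > t).mean() for t in thresholds]   # P(true positive), hit rate
fpr = [(neg > t).mean() for t in thresholds]   # P(false positive), false alarm rate

# Each (fpr, tpr) pair is one point on the ROC curve; for equal priors and
# variances the minimum-error threshold is (mu_neg + mu_pos) / 2
t_min_err = 0.5 * (mu_neg + mu_pos)
print("hit / false-alarm rate at the minimum-error threshold:",
      (pos > t_min_err).mean(), (neg > t_min_err).mean())
```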