University of Cambridge
Engineering Part IIB
Module 3F3: Signal and Pattern Processing
Handout: The Multivariate Gaussian & Decision Boundaries
Mark Gales
mjfg@eng.cam.ac.uk
Lent
The Multivariate Gaussian & Decision Boundaries

Multi-Dimensional Data

In the previous lecture we looked at classifying vowels using formants. The features extracted from the waveform were the frequency of formant 1, x, and the frequency of formant 2, y. We therefore have a 2-dimensional feature vector

x = \begin{bmatrix} x \\ y \end{bmatrix}

We need to be able to define class-conditional densities for these multi-dimensional feature vectors. We have already seen joint distributions of the form p(x, y). This is now written in vector notation (and may be generalised to more than two variables) as p(x).

mean:         \mu = E\{x\} = \begin{bmatrix} E\{x\} \\ E\{y\} \end{bmatrix}
covariance:   \Sigma = E\{(x - \mu)(x - \mu)^T\} = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{bmatrix}
independent:  p(x, y) = p(x)p(y)
uncorrelated (compare to independent):  E\{xy\} = E\{x\}E\{y\}

One common form of class-conditional density is the multivariate Gaussian.
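As a minimal illustration (not part of the original handout), the sample mean vector and covariance matrix of a set of 2-dimensional feature vectors can be estimated with numpy; the formant values below are made-up numbers purely for the example.

import numpy as np

# Hypothetical 2-D feature vectors: rows are observations [formant 1, formant 2] in Hz
X = np.array([[310.0, 2020.0],
              [290.0, 2250.0],
              [640.0, 1190.0],
              [710.0, 1100.0]])

mu = X.mean(axis=0)                # sample mean vector, estimate of E{x}
Sigma = np.cov(X, rowvar=False)    # sample covariance, estimate of E{(x-mu)(x-mu)^T}

print("mean:", mu)
print("covariance:\n", Sigma)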
Multivariate Gaussian

For a d-dimensional feature vector, x, the multivariate Gaussian distribution is

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

[Figure: contour and surface plots of the density for two example covariance matrices, one with correlated dimensions and one diagonal.]
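A short sketch (assuming numpy and scipy are available) that evaluates the multivariate Gaussian density directly from the formula above, and checks it against scipy.stats.multivariate_normal; the mean and covariance values are illustrative only.

import numpy as np
from scipy.stats import multivariate_normal

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density evaluated directly from the formula."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 0.0])               # illustrative mean
Sigma = np.array([[3.0, 1.0],           # illustrative covariance
                  [1.0, 2.0]])
x = np.array([1.0, -1.0])

print(gauss_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree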
Properties

The distribution is characterised by the mean vector, \mu, and the covariance matrix, \Sigma, defined as

\mu = E\{x\}, \qquad \Sigma = E\{(x - \mu)(x - \mu)^T\}

The covariance matrix is symmetric and for d dimensions is described by d(d+1)/2 parameters. The diagonal elements of the covariance matrix, \sigma_{ii}, are the variances in the individual dimensions (commonly written \sigma_i^2). The off-diagonal elements determine the correlation. If all off-diagonal elements are zero, the data is uncorrelated and

p(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right)

There are then only d parameters for the covariance matrix.

If the PDF of x is a multivariate Gaussian and

y = A x + b

then p(y) is Gaussian with

\mu_y = A \mu_x + b, \qquad \Sigma_y = A \Sigma_x A^T
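The linear-transformation property can be checked empirically by sampling, as in this sketch; the matrix A and offset b are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

mu_x = np.array([1.0, 2.0])
Sigma_x = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
A = np.array([[1.0, -1.0],
              [0.5,  2.0]])          # illustrative transform
b = np.array([3.0, -1.0])

X = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
Y = X @ A.T + b                      # y = A x + b applied to each sample

print("empirical mean ", Y.mean(axis=0))
print("predicted mean ", A @ mu_x + b)
print("empirical cov\n", np.cov(Y, rowvar=False))
print("predicted cov\n", A @ Sigma_x @ A.T)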
Decision Boundaries

Decision boundaries partition the feature space into regions with a class label associated with each region. Only two-class problems will be considered. We would like the decision boundary to reflect the Bayes decision rule from the previous lecture. For a two-class problem the class of observation x is selected as

Decide class \omega_1 if P(\omega_1|x) > P(\omega_2|x), class \omega_2 otherwise.

For this two-class problem it is clear that the decision boundary will occur where the posterior probabilities of class 1 and class 2 are equal. This tells us that on the decision boundary the posterior probability is

P(\omega_1|x) = \frac{1}{2} = \frac{p(x|\omega_1)P(\omega_1)}{p(x|\omega_1)P(\omega_1) + p(x|\omega_2)P(\omega_2)}

We can therefore say that a point x lies on the decision boundary when it satisfies

p(x|\omega_1)P(\omega_1) = p(x|\omega_2)P(\omega_2)

Commonly the ln() is taken, giving

\ln(p(x|\omega_1)) + \ln(P(\omega_1)) = \ln(p(x|\omega_2)) + \ln(P(\omega_2))

It is interesting to see what forms the decision boundaries take when the class-conditional PDFs are multivariate Gaussians.
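The Bayes rule above translates directly into code. This is a minimal sketch: the class-conditional Gaussians and priors are assumed values, not taken from the handout.

import numpy as np
from scipy.stats import multivariate_normal

# Assumed class-conditional Gaussians and priors (illustrative values)
class1 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
class2 = multivariate_normal(mean=[2.0, 1.0], cov=np.eye(2))
P1, P2 = 0.5, 0.5

def posterior_class1(x):
    """P(omega_1 | x) via Bayes' rule; the decision boundary is where this equals 1/2."""
    num = class1.pdf(x) * P1
    den = num + class2.pdf(x) * P2
    return num / den

x = np.array([1.2, 0.3])
print("P(w1|x) =", posterior_class1(x))
print("decide:", "omega_1" if posterior_class1(x) > 0.5 else "omega_2")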
Example Case: \Sigma_i = \sigma^2 I

This is the case where the covariance matrices of both classes are equal, diagonal and the variances of all features are the same. In two dimensions, the scatter of the two classes would look circular. The class-conditional multivariate Gaussian distribution for class \omega_1 has mean \mu_1 and that for class \omega_2 has mean \mu_2. The discriminant function in this case can be reduced to a simple expression, since the determinant and inverse of the covariance matrix take very simple forms:

|\Sigma_i| = \sigma^{2d} \quad \text{and} \quad \Sigma_i^{-1} = \frac{1}{\sigma^2} I

Hence, equating the two expressions to find the position of the decision boundary,

\ln\left( \frac{P(\omega_1)}{\sigma^d (2\pi)^{d/2}} \right) - \frac{1}{2\sigma^2} (x - \mu_1)^T (x - \mu_1) = \ln\left( \frac{P(\omega_2)}{\sigma^d (2\pi)^{d/2}} \right) - \frac{1}{2\sigma^2} (x - \mu_2)^T (x - \mu_2)

Cancelling terms yields

\frac{1}{\sigma^2} (\mu_1 - \mu_2)^T x = \frac{1}{2\sigma^2} (\mu_1^T \mu_1 - \mu_2^T \mu_2) + \ln\left( \frac{P(\omega_2)}{P(\omega_1)} \right)

Since the means are fixed, this may be written as

w^T x = b

which is the equation of a line.
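A minimal sketch of the algebra above: for \Sigma_i = \sigma^2 I the boundary parameters w and b follow directly from the means, \sigma^2 and the priors. The numerical values in the call are assumptions for illustration only.

import numpy as np

def linear_boundary(mu1, mu2, sigma2, P1, P2):
    """Return (w, b) such that the boundary for Sigma_i = sigma^2 I is w^T x = b."""
    w = (mu1 - mu2) / sigma2
    b = (mu1 @ mu1 - mu2 @ mu2) / (2.0 * sigma2) + np.log(P2 / P1)
    return w, b

# Illustrative means, variance and priors (not from the handout)
mu1 = np.array([1.0, 5.0])
mu2 = np.array([4.0, 1.0])
w, b = linear_boundary(mu1, mu2, sigma2=1.0, P1=0.5, P2=0.5)
print("w =", w, " b =", b)        # decide omega_1 when w @ x > b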
Example

Consider the case of Gaussian distributions with identity covariance matrices, \Sigma_1 = \Sigma_2 = I, and equal priors on the two classes. Substituting the two means into the expressions from the previous slide gives

w = \mu_1 - \mu_2, \qquad b = \frac{1}{2}(\mu_1^T \mu_1 - \mu_2^T \mu_2)

(the prior term vanishes since the priors are equal). The lines of equal likelihood and the resulting linear decision boundary are shown in the figure.

[Figure: contours of equal likelihood for the two classes and the linear decision boundary between them.]
Special Case: \Sigma_i = \Sigma

Here the covariance matrices are common but full. The discriminant function for class \omega_i can be expressed as

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln\left( \frac{P(\omega_i)}{(2\pi)^{d/2} |\Sigma|^{1/2}} \right)

Again cancelling terms from either side,

(\mu_1 - \mu_2)^T \Sigma^{-1} x = \frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2 \right) + \ln\left( \frac{P(\omega_2)}{P(\omega_1)} \right)

This is again a linear decision boundary,

w^T x = b

The Mahalanobis distance is related to this:

(x - \mu)^T \Sigma^{-1} (x - \mu)

The Mahalanobis distance both weights the effect of individual features (by their inverse variance) and accounts for inter-feature correlations. In the case of common covariances a classifier can be more simply constructed by transforming the input space so that the features are decorrelated. For example, using an eigenvector/eigenvalue decomposition of \Sigma, the Mahalanobis distance may be expressed as

(Ax - A\mu)^T (Ax - A\mu)

Thus by transforming the feature space the Mahalanobis distance becomes a Euclidean distance.
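A sketch of the decorrelation idea: using the eigendecomposition \Sigma = U \Lambda U^T, the transform A = \Lambda^{-1/2} U^T whitens the features, after which the Mahalanobis distance is an ordinary Euclidean distance. The covariance and point values are illustrative.

import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])          # illustrative common covariance
mu = np.array([1.0, 2.0])
x = np.array([3.0, 1.0])

# Mahalanobis distance computed directly
d2_mahal = (x - mu) @ np.linalg.solve(Sigma, x - mu)

# Whitening transform from the eigendecomposition Sigma = U diag(lam) U^T
lam, U = np.linalg.eigh(Sigma)
A = np.diag(1.0 / np.sqrt(lam)) @ U.T

# Euclidean distance in the transformed space equals the Mahalanobis distance
d2_euclid = np.sum((A @ x - A @ mu) ** 2)

print(d2_mahal, d2_euclid)               # the two values agree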
General Case

For the general case we define

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln\left( \frac{P(\omega_i)}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \right)

This function is quadratic in x. Equating g_1(x) = g_2(x) for the two-class problem reveals a hyperquadric decision boundary of the form

x^T A x + w^T x + b = 0

where

A = \frac{1}{2} \left( \Sigma_1^{-1} - \Sigma_2^{-1} \right), \qquad w = \Sigma_2^{-1} \mu_2 - \Sigma_1^{-1} \mu_1

and the constant is given by

b = \frac{1}{2} \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right) + \ln\left( \frac{P(\omega_2) |\Sigma_1|^{1/2}}{P(\omega_1) |\Sigma_2|^{1/2}} \right)
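These coefficients translate directly into code. A sketch that builds A, w and b for the general two-class quadratic boundary; the parameters in the example call are those of the worked example on the next slide.

import numpy as np

def quadratic_boundary(mu1, Sigma1, P1, mu2, Sigma2, P2):
    """Return (A, w, b) with decision boundary x^T A x + w^T x + b = 0 (g1(x) = g2(x))."""
    S1inv = np.linalg.inv(Sigma1)
    S2inv = np.linalg.inv(Sigma2)
    A = 0.5 * (S1inv - S2inv)
    w = S2inv @ mu2 - S1inv @ mu1
    b = 0.5 * (mu1 @ S1inv @ mu1 - mu2 @ S2inv @ mu2) \
        + np.log(P2 * np.sqrt(np.linalg.det(Sigma1))
                 / (P1 * np.sqrt(np.linalg.det(Sigma2))))
    return A, w, b

A, w, b = quadratic_boundary(np.array([3.0, 6.0]), np.diag([0.5, 2.0]), 0.5,
                             np.array([3.0, -2.0]), np.diag([2.0, 2.0]), 0.5)
print(A); print(w); print(b)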
Examples of General Case

Arbitrary Gaussian distributions can lead to general hyperquadric boundaries. The following figures (from DHS) indicate this. Note that the boundaries can of course be straight lines, and the regions may not be simply connected.

[Figures from DHS: pairs of Gaussian class-conditional densities with the resulting decision boundaries.]
Example Decision Boundary

Assume two classes with equal priors and

\mu_1 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1/2 & 0 \\ 0 & 2 \end{bmatrix}; \qquad \mu_2 = \begin{bmatrix} 3 \\ -2 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}

The inverse covariance matrices are then

\Sigma_1^{-1} = \begin{bmatrix} 2 & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad \Sigma_2^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix}

Equating g_1(x) = g_2(x) yields

\begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 3/4 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} -9/2 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \frac{36 - 6.5}{2} - \ln(2) = 0

i.e.

1.5 x_1^2 - 9 x_1 - 8 x_2 + 28.11 = 0 \quad \Rightarrow \quad x_2 = 3.514 - 1.125 x_1 + 0.1875 x_1^2

which is a parabola with a minimum at (3, 1.83).

[Figure (from DHS): the two class means and the parabolic decision boundary.]

Note that the boundary does not pass through the mid-point between the means.
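As a numerical check (a self-contained sketch, not part of the handout), points on the quoted parabola can be verified to give equal class log-likelihoods, which with equal priors is exactly the boundary condition.

import numpy as np
from scipy.stats import multivariate_normal

g1 = multivariate_normal(mean=[3.0, 6.0], cov=np.diag([0.5, 2.0]))
g2 = multivariate_normal(mean=[3.0, -2.0], cov=np.diag([2.0, 2.0]))

# Points on the quoted boundary x2 = 3.514 - 1.125*x1 + 0.1875*x1^2
for x1 in (0.0, 3.0, 6.0):
    x2 = 3.514 - 1.125 * x1 + 0.1875 * x1 ** 2
    x = np.array([x1, x2])
    # With equal priors the boundary is where the class log-likelihoods match
    print(x, g1.logpdf(x), g2.logpdf(x))   # the two log-likelihoods are (nearly) equal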
Cost of Mis-Classification

We have assumed that the goal is to minimise the average probability of classification error. Recall that for the two-class problem, the Bayes minimum average error decision rule can be written as

P(\omega_1|x) \mathop{\gtrless}_{\omega_2}^{\omega_1} P(\omega_2|x)

or, using the likelihood ratio,

\frac{p(x|\omega_1)}{p(x|\omega_2)} \mathop{\gtrless}_{\omega_2}^{\omega_1} \frac{P(\omega_2)}{P(\omega_1)}

Sometimes the cost (or loss) for misclassification is specified (or can be estimated), and different types of classification error may not have equal cost:

C_{12}: cost of choosing \omega_1 given x from \omega_2
C_{21}: cost of choosing \omega_2 given x from \omega_1

and C_{ii} is the cost of correct classification (which may be zero). The aim now is to minimise the Bayes risk, which is the expected value of the classification cost.
Minimising Classification Cost

Let the decision region associated with class \omega_j be denoted R_j. Consider all the patterns that belong to class \omega_1. The expected cost (or risk) for these patterns is

R_1 = \sum_{i=1}^{2} C_{i1} \int_{R_i} p(x|\omega_1)\,dx

The overall cost R is found as

R = \sum_{j=1}^{2} R_j P(\omega_j) = \sum_{j=1}^{2} \sum_{i=1}^{2} C_{ij} \int_{R_i} p(x|\omega_j)\,dx \, P(\omega_j) = \sum_{i=1}^{2} \int_{R_i} \sum_{j=1}^{2} C_{ij}\, p(x|\omega_j) P(\omega_j)\,dx

To minimise R, minimise the integrand at all points: choose R_1 so that

\sum_{j=1}^{2} C_{1j}\, p(x|\omega_j) P(\omega_j) < \sum_{j=1}^{2} C_{2j}\, p(x|\omega_j) P(\omega_j)

In the case that C_{11} = C_{22} = 0 we obtain

C_{21} P(\omega_1|x) \mathop{\gtrless}_{\omega_2}^{\omega_1} C_{12} P(\omega_2|x)

or, using the likelihood ratio,

\frac{p(x|\omega_1)}{p(x|\omega_2)} \mathop{\gtrless}_{\omega_2}^{\omega_1} \frac{P(\omega_2) C_{12}}{P(\omega_1) C_{21}}

Note that the decision rule to minimise the Bayes risk is the minimum error rule when C_{12} = C_{21} = 1 and correct classification has zero cost.
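The cost-weighted rule can be applied as a likelihood-ratio test against a threshold. A minimal sketch with assumed 1-d class likelihoods, priors and costs; none of these numbers come from the handout.

import numpy as np
from scipy.stats import norm

# Assumed 1-d class-conditional densities, priors and misclassification costs
p1, p2 = norm(loc=0.0, scale=1.0), norm(loc=2.0, scale=1.0)
P1, P2 = 0.7, 0.3
C12, C21 = 5.0, 1.0        # choosing omega_1 when the truth is omega_2 is 5x worse

def decide(x):
    """Minimum-Bayes-risk decision: likelihood ratio against a cost-weighted threshold."""
    ratio = p1.pdf(x) / p2.pdf(x)
    threshold = (P2 * C12) / (P1 * C21)
    return "omega_1" if ratio > threshold else "omega_2"

for x in (-1.0, 1.0, 1.5, 3.0):
    print(x, decide(x))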
ROC curves

In some problems, such as medical diagnostics, there is a target class that you want to separate from the rest of the population (i.e. it is a detection problem). Four types of outcome can be identified (let class \omega_2 be positive and \omega_1 be negative):

True Positive (Hit)
True Negative
False Positive (False Alarm)
False Negative

As the decision threshold is changed the ratio of True Positives to False Positives changes. This trade-off is often plotted in a Receiver Operating Characteristic or ROC curve. The ROC curve is a plot of the probability of a true positive (hit) against the probability of a false positive (false alarm). This allows a designer to see an overview of the characteristics of a system.
ROC curves (Example)

Example: 1-d data, equal variances and equal priors. The threshold for minimum error would be (\mu_1 + \mu_2)/2.

[Figure. Left: plots of p(x|\omega_i) for the two classes with the error probabilities indicated. Right: the associated Receiver Operating Characteristic curve, True Positive against False Positive.]

On the left are the plots of p(x|\omega_i) for classes \omega_2 and \omega_1:
- each value of x gives a probability for each outcome
- for the value of x marked, the probabilities are shown
On the right is the associated ROC curve obtained by varying x (here percentages rather than probabilities are given on the axes):
- curves going into the top left corner are good
- a straight line at 45 degrees is random
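An ROC curve for a 1-d example of this kind can be traced by sweeping the decision threshold; a sketch with assumed means and unit variances (plotting is left out, but the arrays could be passed straight to matplotlib).

import numpy as np
from scipy.stats import norm

# Assumed 1-d classes: omega_2 (positive) and omega_1 (negative), equal unit variances
positive = norm(loc=2.0, scale=1.0)
negative = norm(loc=0.0, scale=1.0)

thresholds = np.linspace(-4.0, 6.0, 201)

# Decide "positive" when x exceeds the threshold
true_positive = positive.sf(thresholds)    # P(x > t | positive): hit rate
false_positive = negative.sf(thresholds)   # P(x > t | negative): false-alarm rate

# Each (false_positive, true_positive) pair is one point on the ROC curve
for t in (0.0, 1.0, 2.0):
    i = np.argmin(np.abs(thresholds - t))
    print(f"threshold {t:4.1f}: TP {true_positive[i]:.3f}  FP {false_positive[i]:.3f}")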