L5: Quadratic classifiers

Size: px

Start display at page:

Download "L5: Quadratic classifiers"

Oliver Ward
6 years ago
Views:

1 L5: Quadratic classifiers Bayes classifiers for Normally distributed classes Case 1: Σ i = σ 2 I Case 2: Σ i = Σ (Σ diagonal) Case 3: Σ i = Σ (Σ non-diagonal) Case 4: Σ i = σ 2 i I Case 5: Σ i Σ j (general case) Numerical example Linear and quadratic classifiers: conclusions CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 1

2 Bayes classifiers for Gaussian classes Recap On L4 we showed that the decision rule that minimized P[error] could be formulated in terms of a family of discriminant functions For normally Gaussian classes, these DFs reduce to simple expressions The multivariate Normal pdf is Discriminant functions Features f X x = 2π N/2 Σ 1/2 e 1 2 x μ T Σ 1 x μ Using Bayes rule, the DFs become g i x = P ω i x = P ω i p x ω i p x x 1 g 1 (x) Class assignment Select max g 2 (x) x 2 x 3 g C (x) x d Costs = 2π N/2 Σ i 1/2 e 1 2 x μ i T Σ i 1 x μi P ω i p x Eliminating constant terms g i x = Σ i 1/2 e 1 2 x μ i T Σ i 1 x μi P ω i And taking natural logs g i x = 1 2 x μ i T Σ i 1 x μ i 1 2 log Σ i + logp(ω i ) This expression is called a quadratic discriminant function CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 2

3 Minimum Selector Case 1: Σ i = σ 2 I This situation occurs when features are statistically independent with equal variance for all classes In this case, the quadratic DFs become g i x = 1 2 x μ i T σ 2 I 1 x μ i 1 2 log σ2 I + logp i 1 2σ 2 x μ i T x μ i + logp i Expanding this expression g i x = 1 2σ 2 xt x 2μ T i x + μ T i μ i + logp i Eliminating the term x T x, which is constant for all classes g i x = 1 2σ 2 2μ i T x + μ T i μ i + logp i = w T i x + w 0 So the DFs are linear, and the boundaries g i (x) = g j (x) are hyper-planes If we assume equal priors g i x = 1 2σ 2 x μ i T x μ i This is called a minimum-distance or nearest mean classifier The equiprobable contours are hyper-spheres For unit variance (σ 2 = 1), g i x is the Euclidean distance Distance CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 3 x 1 2 C Distance Distance [Schalkoff, 1992] class

4 Example Three-class 2D problem with equal priors μ 1 = 3 2 T μ 2 = 7 4 T μ 3 = 2 5 T Σ 1 = 2 2 Σ 1 = 2 2 Σ 1 = 2 2 CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 4

5 Case 2: Σ i = Σ (diagonal) Classes still have the same covariance, but features are allowed to have different variances In this case, the quadratic DFs becomes g i x = 1 2 x μ i T Σ i 1 x μ i 1 2 log Σ i + logp i = x N k μ i,k k=1 2 σ k 1 2 log k=1 N σ 2 k + logp i Eliminating the term x k 2, which is constant for all classes g i x = 1 2 N k=1 2x kμ i,k + μ i,k σ k 2 CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU logπ k=1 N σ 2 k + logp i This discriminant is also linear, so the decision boundaries g i (x) = g j (x) will also be hyper-planes The equiprobable contours are hyper-ellipses aligned with the reference frame Note that the only difference with the previous classifier is that the distance of each axis is normalized by its variance

6 Example Three-class 2D problem with equal priors μ 1 = 3 2 T μ 2 = 5 4 T μ 3 = 2 5 T Σ 1 = 1 2 Σ 2 = 1 2 Σ 3 = 1 2 CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 6

7 Case 3: Σ i = Σ (non-diagonal) Classes have equal covariance matrix, but no longer diagonal The quadratic discriminant becomes g i x = 1 2 x μ i T Σ 1 x μ i 1 2 log Σ + logp i Eliminating the term log Σ, which is constant for all classes, and assuming equal priors g i x = 1 2 x μ i T Σ 1 x μ i The quadratic term is called the Mahalanobis distance, a very important concept in statistical pattern recognition The Mahalanobis distance is a vector distance that uses a Σ 1 norm, Σ 1 acts as a stretching factor on the space Note that when Σ = I, the Mahalanobis distance becomes the familiar Euclidean distance x 1 x 2 2 xi - μ -1 = K å 2 xi - μ = Κ CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 7

8 Minimum Selector Expanding the quadratic term g i x = 1 2 xt Σ 1 x 2μ i T Σ 1 x + μ i T Σ 1 μ i Removing the term x T Σ 1 x, which is constant for all classes g i x = 1 2 2μ i T Σ 1 x + μ i T Σ 1 μ i = w 1 T x + w 0 So the DFs are still linear, and the decision boundaries will also be hyper-planes The equiprobable contours are hyper-ellipses aligned with the eigenvectors of Σ This is known as a minimum (Mahalanobis) distance classifier x å 1 Distance 2 Distance class C Distance CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 8

Example Three-class 2D problem with equal priors μ 1 = 3 2 T μ 2 = 5 4 T μ 3 = 2 5 T Σ 1 = 1.

9 Example Three-class 2D problem with equal priors μ 1 = 3 2 T μ 2 = 5 4 T μ 3 = 2 5 T Σ 1 = Σ 2 = Σ 3 = CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 9

10 Case 4: Σ i = σ i 2 I In this case, each class has a different covariance matrix, which is proportional to the identity matrix The quadratic discriminant becomes g i x = 1 2 x μ i T σ i 2 x μ i 1 2 Nlog σ i 2 + logp i This expression cannot be reduced further The decision boundaries are quadratic: hyper-ellipses The equiprobable contours are hyper-spheres aligned with the feature axis CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 10

11 Example Three-class 2D problem with equal priors μ 1 = 3 2 T μ 2 = 5 4 T μ 3 = 2 5 T Σ 1 =.5.5 Σ 2 = 1 1 Σ 3 = 2 2 Zoom out CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 11

12 Case 5: Σ i Σ j (general case) We already derived the expression for the general case g i x = 1 2 x μ i T Σ i 1 x μ i 1 2 log Σ i + logp i Reorganizing terms in a quadratic form yields g i x = x T W 2,i x + w T 1,i x + w 0,i W 2,i = 1 2 Σ i 1 where w 1,i = Σ i 1 μ i w o,i = 1 2 μ i T Σ i 1 μ i 1 2 log Σ i + logp i The equiprobable contours are hyper-ellipses, oriented with the eigenvectors of Σ i for that class The decision boundaries are again quadratic: hyper-ellipses or hyperparabolloids Notice that the quadratic expression in the discriminant is proportional to the Mahalanobis distance for covariance Σ i CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 12

13 Example Three-class 2D problem with equal priors μ 1 = 3 2 T μ 2 = 5 4 T μ 3 = 3 4 T Σ 1 = Σ 2 = Σ 3 = Zoom out CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 13

14 Numerical example Derive a linear DF for the following 2-class 3D problem g 1 x Solution μ 1 = T ; μ 2 = T ; Σ 1 = Σ 2 = g 1 x = 1 2σ 2 x μ 1 T x μ 1 + logp 1 = 1 2 g 2 x = 1 2 x 1 y 1 z 1 T 4 ω 1 > < ω 2 g 2 x 2 x 2 + y 2 + z 2 + lg 1 3 x + y + z x 0 y 0 z T 4 x 1 y 1 z 1 ; P 2 = 2P log x 0 y 0 z 0 + log 1 3 ω 1 > < ω 2 2 x y z lg 2 3 ω 2 > 6 log2 < 4 ω 1 Classify the test example x u = T = 1.6 = 1.32 ω 2 > < ω x u ω 2 CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 14

15 Conclusions The examples in this lecture illustrate the following points The Bayes classifier for Gaussian classes (general case) is quadratic The Bayes classifier for Gaussian classes with equal covariance is linear The Mahalanobis distance classifier is Bayes-optimal for normally distributed classes and equal covariance matrices and equal priors The Euclidean distance classifier is Bayes-optimal for normally distributed classes and equal covariance matrices proportional to the identity matrix and equal priors Both Euclidean and Mahalanobis distance classifiers are linear classifiers Thus, some of the simplest and most popular classifiers can be derived from decision-theoretic principles Using a specific (Euclidean or Mahalanobis) minimum distance classifier implicitly corresponds to certain statistical assumptions The question whether these assumptions hold or don t can rarely be answered in practice; in most cases we can only determine whether the classifier solves our problem CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU 15

L11: Pattern recognition principles

L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction