Informatics 2B: Learning and Data
Guido Sanguinetti
9 March

Overview

Today's lecture:
- Posterior probabilities, decision regions and minimising the probability of error
- Gaussians and discriminant functions
- Linear discriminants

Decision Regions and Decision Boundaries

Consider a 1-d feature space (x) and two classes, C1 and C2. A decision rule divides the feature space into decision regions R1 and R2: a point falling in R1 is assigned to class C1, and a point falling in R2 is assigned to class C2. The surfaces separating the decision regions are the decision boundaries.

[Figure: class-conditional densities and decision regions for the 1-d two-class example. Figure: decision regions for the 3-class example.]
Minimising the probability of error

For the 1-d feature space (x) and the two classes C1 and C2, there are two ways the classifier can make an error:
- assigning x to C1 when it belongs to C2 (x is in R1 when it belongs to C2)
- assigning x to C2 when it belongs to C1 (x is in R2 when it belongs to C1)

Total probability of error:

P(error) = P(x ∈ R2, C1) + P(x ∈ R1, C2)
         = P(x ∈ R2 | C1) P(C1) + P(x ∈ R1 | C2) P(C2)
         = ∫_{R2} p(x|C1) P(C1) dx + ∫_{R1} p(x|C2) P(C2) dx

[Figure: the two terms of P(error) shown as areas under p(x|C1)P(C1) and p(x|C2)P(C2).]

To minimise P(error): for a given x, if p(x|C1) P(C1) > p(x|C2) P(C2), then x should be placed in region R1.

The probability of error is thus minimised by assigning each point to the class with the maximum posterior probability.

This justification for the maximum posterior probability rule extends to d-dimensional feature vectors and K classes.
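As an illustrative aside (not from the slides), here is a minimal Python sketch of this minimum-error rule for two 1-d Gaussian classes; the means, variances and priors are made-up example values.

import numpy as np
from scipy.stats import norm

# Hypothetical 1-d example: class-conditional Gaussians and priors (made-up values)
mu1, sigma1, prior1 = 0.0, 1.0, 0.6   # class C1
mu2, sigma2, prior2 = 2.0, 1.0, 0.4   # class C2

def classify(x):
    """Assign x to the class with the larger p(x|C) P(C),
    i.e. the class with the maximum posterior probability."""
    return np.where(norm.pdf(x, mu1, sigma1) * prior1 >
                    norm.pdf(x, mu2, sigma2) * prior2, 1, 2)

# Estimate P(error) by Monte Carlo: sample labelled points from the model,
# classify them with the rule above, and count the mistakes.
rng = np.random.default_rng(0)
n = 50_000
labels = rng.choice([1, 2], size=n, p=[prior1, prior2])
xs = np.where(labels == 1,
              rng.normal(mu1, sigma1, n),
              rng.normal(mu2, sigma2, n))
print("Estimated P(error):", np.mean(classify(xs) != labels))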
Discriminant Functions

If we assign x to the class C with the highest posterior probability P(C|x), then the posterior probability, or the log posterior probability, forms a suitable discriminant function:

y_c(x) = ln P(C|x) = ln p(x|C) + ln P(C) - ln p(x)

Since ln p(x) is the same for every class, it can be dropped, leaving y_c(x) = ln p(x|C) + ln P(C).

Equation of a decision boundary: decision boundaries are defined where the discriminant functions of two classes are equal,

y_k(x) = y_l(x)

Considering discriminant functions changes the emphasis from how individual classes are modelled to how the boundaries between classes are estimated.

Discriminant functions for Gaussians

If the discriminant function is the log posterior probability,

y_c(x) = ln p(x|C) + ln P(C)

then substituting in the log probability of a Gaussian and dropping constant terms gives:

y_c(x) = -1/2 (x - µ_c)^T Σ_c^{-1} (x - µ_c) - 1/2 ln |Σ_c| + ln P(C)

This function is quadratic in x.
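A minimal sketch (not from the slides) of the quadratic Gaussian discriminant above, assuming class means, covariances and priors have already been estimated from training data; the parameter values below are illustrative.

import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """y_c(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) - 1/2 ln|Sigma| + ln P(C)."""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * logdet
            + np.log(prior))

# Illustrative 2-d, three-class example with made-up parameters
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]]), 0.5 * np.eye(2)]
priors = [0.5, 0.3, 0.2]

x = np.array([1.0, 1.5])
scores = [quadratic_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("Assigned class:", int(np.argmax(scores)))   # class with the largest y_c(x)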
Gaussians estimated from training data

[Figures: Gaussians estimated from the training data (non-equal covariance and equal covariance); testing data under each model; decision regions for the 3-class example with the equal covariance and non-equal covariance Gaussians.]

Results

[Table: confusion matrix on the test data (predicted class vs. true class A, B, C) for the non-equal covariance Gaussians, with the fraction of test points classified correctly.]

[Table: confusion matrix on the test data (predicted class vs. true class A, B, C) for the equal covariance Gaussians, with the fraction of test points classified correctly.]
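Since the slides report results as a confusion matrix and a fraction correct, here is a small illustrative sketch of how such a table can be computed; the label lists are hypothetical stand-ins for the true and predicted test-set classes.

import numpy as np

classes = ["A", "B", "C"]

# Hypothetical true and predicted labels for a handful of test points
true_labels = ["A", "A", "B", "B", "C", "C", "A", "C"]
pred_labels = ["A", "B", "B", "B", "C", "A", "A", "C"]

# Confusion matrix: rows = predicted class, columns = true class
confusion = np.zeros((len(classes), len(classes)), dtype=int)
for t, p in zip(true_labels, pred_labels):
    confusion[classes.index(p), classes.index(t)] += 1

print(confusion)
# Fraction correct = sum of the diagonal / total number of test points
print("Fraction correct:", np.trace(confusion) / len(true_labels))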
Linear discriminants: equal covariance Gaussians

If the classes share a single covariance matrix Σ, the quadratic term x^T Σ^{-1} x is the same for every class. Dropping terms that are now constant, we can simplify the discriminant function to:

y_c(x) = (µ_c^T Σ^{-1}) x - 1/2 µ_c^T Σ^{-1} µ_c + ln P(C)

This is a linear function of x.

We can define two variables in terms of µ_c, P(C) and Σ:

w_c^T = µ_c^T Σ^{-1}
w_c0  = -1/2 µ_c^T Σ^{-1} µ_c + ln P(C)

Substituting w_c and w_c0 into the expression for y_c(x) gives the linear discriminant function:

y_c(x) = w_c^T x + w_c0

[Figure: decision boundary y_1(x) = y_2(x) between classes C1 and C2 for equal covariance Gaussians in two dimensions (x1, x2).]

In two dimensions the decision boundary is a line, in three dimensions it is a plane, and in d dimensions it is a hyperplane.

Spherical Gaussians with equal covariance

Spherical Gaussians have a diagonal covariance matrix with the same variance in each dimension: Σ = σ^2 I.

If we further assume that the prior probabilities of the classes are equal, we can write the discriminant function as:

y_c(x) = -||x - µ_c||^2 / (2σ^2)

This assigns a test vector to the class whose mean is closest. In this case the class means may be regarded as class templates or prototypes.
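A minimal sketch, assuming a shared covariance matrix Σ and made-up parameter values, of the linear discriminant w_c^T x + w_c0 and its nearest-mean special case for spherical Gaussians with equal priors:

import numpy as np

# Made-up class means, shared covariance and priors (three classes, 2-d)
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([0.0, 3.0])]
Sigma  = np.array([[1.0, 0.3], [0.3, 1.0]])       # shared across classes
priors = [0.5, 0.3, 0.2]
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant: y_c(x) = w_c^T x + w_c0
ws  = [Sigma_inv @ mu for mu in mus]                                         # w_c = Sigma^{-1} mu_c
w0s = [-0.5 * mu @ Sigma_inv @ mu + np.log(p) for mu, p in zip(mus, priors)]  # w_c0

def classify_linear(x):
    scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
    return int(np.argmax(scores))

# Spherical Gaussians (Sigma = sigma^2 I) with equal priors reduce to
# assigning x to the class with the closest mean.
def classify_nearest_mean(x):
    dists = [np.sum((x - mu) ** 2) for mu in mus]
    return int(np.argmin(dists))

x = np.array([1.0, 1.2])
print(classify_linear(x), classify_nearest_mean(x))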
Two-class linear discriminants

For a two-class problem, the log odds can be used as a single discriminant function:

y(x) = ln p(x|C1) - ln p(x|C2) + ln P(C1) - ln P(C2)

The decision boundary is given by y(x) = 0.

If the pdfs are Gaussians with a shared covariance matrix, we again have a linear discriminant:

y(x) = w^T x + w0

where w and w0 are functions of µ1, µ2, Σ, P(C1) and P(C2).

Let x_a and x_b be two points on the decision boundary. Then:

y(x_a) = 0 = y(x_b)
w^T x_a + w0 = 0 = w^T x_b + w0

A little rearranging gives us:

w^T (x_a - x_b) = 0

Geometry of a two-class linear discriminant

[Figure: the decision boundary y(x) = 0, with the weight vector w normal to it and a point x on the boundary.]

Since x_a - x_b is a vector lying in the decision boundary, w is normal to any vector lying in the hyperplane decision boundary.

If x is a point on the hyperplane, then the normal distance from the hyperplane to the origin is given by:

l = w^T x / ||w|| = -w0 / ||w||    (using y(x) = 0)

A small numerical check of these facts is sketched after the summary.

Summary

- Obtaining decision boundaries from probability models and a decision rule
- Minimising the probability of error
- Discriminant functions and Gaussian pdfs
- Linear discriminants

Next lecture: training discriminants directly.
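As referenced above, here is a small numerical check (not from the slides) of the two-class linear discriminant and its geometry, using made-up means, covariance and priors:

import numpy as np

# Made-up two-class parameters with a shared covariance matrix
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma    = np.array([[1.0, 0.2], [0.2, 1.5]])
P1, P2   = 0.6, 0.4
Sigma_inv = np.linalg.inv(Sigma)

# Two-class linear discriminant y(x) = w^T x + w0
# (the difference of the per-class linear discriminants derived earlier)
w  = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(P1 / P2)

# Pick two points x_a, x_b on the boundary y(x) = 0 and check w^T (x_a - x_b) = 0
def point_on_boundary(x1):
    # Solve w[0]*x1 + w[1]*x2 + w0 = 0 for x2
    return np.array([x1, -(w[0] * x1 + w0) / w[1]])

xa, xb = point_on_boundary(-1.0), point_on_boundary(3.0)
print("w^T (x_a - x_b) =", w @ (xa - xb))              # ~0: w is normal to the boundary
print("distance to origin:", -w0 / np.linalg.norm(w))  # l = -w0 / ||w||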