Generative classifiers: The Gaussian classifier Ata Kaban School of Computer Science University of Birmingham
Outline. We have already seen how Bayes' rule can be turned into a classifier. In all our examples so far the attributes were discrete-valued (e.g. in {sunny, rainy}, {+, -}). Today we learn how to do this when the data attributes are continuous-valued.
Example. Task: predict the gender of individuals based on their heights. Given: 100 height examples of women and 100 height examples of men. [Figure: histograms of the empirical height data for males and females; x-axis: Height (meters), 1.1 to 2.0; y-axis: Frequency]
Class priors. We can encode the values of the hypothesis (class) as 1 (male) and 0 (female), so h ∈ {0, 1}. Since in this example we had the same number of males and females, we have P(h=1) = P(h=0) = 0.5. These are the prior probabilities of class membership, because they can be set before measuring any data. Note that when the class proportions are imbalanced, we can use the priors to make predictions even before seeing any data.
Class-conditional likelihood. Our measurements are heights. This is our data, x. Class-conditional likelihoods: p(x|h=1) is the probability density that a male has height x meters; p(x|h=0) is the probability density that a female has height x meters.
Class posterior. As before, from Bayes' rule we can obtain the class posteriors:

P(h=1|x) = p(x|h=1) P(h=1) / [ p(x|h=1) P(h=1) + p(x|h=0) P(h=0) ]

The denominator is the probability of measuring the height value x irrespective of the class. If we can compute this, then we can use it to predict the gender from the height measurement.
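The posterior computation above can be sketched in a few lines of Python. The likelihood values below are assumed for illustration only (they are not taken from the lecture's data); the priors are the 0.5 values from the example.

```python
# Hypothetical class-conditional density values p(x|h) at x = 1.7 m,
# chosen for illustration, and the equal priors from the example.
lik_male, lik_female = 2.1, 0.4
prior_male = prior_female = 0.5

# Bayes' rule: the denominator ("evidence") is p(x) irrespective of class.
evidence = lik_male * prior_male + lik_female * prior_female
post_male = lik_male * prior_male / evidence
post_female = lik_female * prior_female / evidence
```

Note that the two posteriors always sum to 1, since the evidence normalises them.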
Discriminant function. When does our prediction switch from predicting h=0 to predicting h=1? [Figure: histograms of the empirical height data for males and females; x-axis: Height (meters)] "When the measured height passes a certain threshold" -- more precisely, when P(h=0|x) = P(h=1|x).
Discriminant function. If we make a measurement, say we get x = 1.7 m, we compute the posteriors and find P(h=1|x=1.7) > P(h=0|x=1.7). Then we decide to predict h = 1, i.e., male. If we measured x = 1.2 m, we would get P(h=1|x=1.2) < P(h=0|x=1.2).
Discriminant function. We can define a discriminant function as:

f1(x) = P(h=1|x) / P(h=0|x)

and compare the function value to 1. It is more convenient to have the switching at 0 rather than at 1, so define the discriminant function as the log of f1:

f(x) = log [ P(h=1|x) / P(h=0|x) ]

Then the sign of this function defines the prediction (if f(x) > 0 => male; if f(x) < 0 => female).
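As a minimal sketch of the sign test, the posterior values passed in below are assumed numbers for illustration, not computed from real data:

```python
import math

# Log-odds discriminant f(x) = log( P(h=1|x) / P(h=0|x) );
# its sign gives the prediction.
def discriminant(post_1, post_0):
    return math.log(post_1 / post_0)

f = discriminant(0.84, 0.16)      # e.g. assumed posteriors at x = 1.7 m
prediction = 1 if f > 0 else 0    # 1 = male, 0 = female
```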
How do we compute it? Let's write it out using Bayes' rule:

f(x) = log [ P(h=1|x) / P(h=0|x) ] = log [ p(x|h=1) P(h=1) / ( p(x|h=0) P(h=0) ) ]

Now we need the class-conditional likelihood terms, p(x|h=0) and p(x|h=1). Note that x now takes continuous real values. We will model each class by a Gaussian distribution. (Note: there are other ways to do it; this is a generic problem that Density Estimation deals with. Here we consider the specific case of using a Gaussian, which is fairly common in practice.)
Illustration -- our 1D example. [Figure: histograms of the empirical height data for males and females, with the fitted Gaussian distribution for each class overlaid; x-axis: Height (meters); y-axis: Frequency]
Gaussian - univariate

p(x) = 1 / sqrt(2πσ²) · exp( -(x - m)² / (2σ²) )

where m is the mean (center) and σ² is the variance (spread). These are the parameters that describe the distribution. We will have a separate Gaussian for each class: the female class will have m_0 as its mean and σ_0² as its variance, and the male class will have m_1 as its mean and σ_1² as its variance. We need to estimate these parameters from the data.
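A minimal sketch of fitting one univariate Gaussian per class, using tiny made-up height samples (the data values below are assumptions, not the lecture's dataset):

```python
import math

def gauss_pdf(x, m, var):
    """Univariate Gaussian density with mean m and variance var."""
    return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy height samples (metres) -- assumed values for illustration.
heights_f = [1.55, 1.60, 1.62, 1.58, 1.65]   # female class
heights_m = [1.72, 1.78, 1.80, 1.75, 1.70]   # male class

def fit(xs):
    """Maximum-likelihood estimates of mean and variance."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, var

m0, v0 = fit(heights_f)   # female parameters: m_0, sigma_0^2
m1, v1 = fit(heights_m)   # male parameters:   m_1, sigma_1^2
```

With these fitted parameters, evaluating `gauss_pdf` at a new height gives the class-conditional likelihoods needed in Bayes' rule.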
Gaussian - multivariate. Let x = (x_1, x_2, ..., x_d), so x has d attributes. Let k ∈ {0, 1}.

p(x|h=k) = 1 / sqrt( (2π)^d |Σ_k| ) · exp{ -1/2 (x - m_k)^T Σ_k^{-1} (x - m_k) }

where the m_k are the mean vectors and the Σ_k are the covariance matrices. These are the parameters that describe the distributions, and they are estimated from the data.
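The density formula above can be sketched directly with NumPy; the mean and covariance values below are assumptions picked for illustration (e.g. a hypothetical height/weight pair):

```python
import numpy as np

def mvn_pdf(x, m, Sigma):
    """Multivariate Gaussian density N(x; m, Sigma) for d-dimensional x."""
    d = len(m)
    diff = x - m
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-m)^T Sigma^{-1} (x-m)
    return np.exp(-0.5 * quad) / norm

# Assumed parameters for one class: mean (height, weight) and covariance.
m = np.array([1.75, 70.0])
Sigma = np.array([[0.01, 0.1],
                  [0.1, 25.0]])

density_at_mean = mvn_pdf(m, m, Sigma)
```

At the mean the exponent is zero, so the density equals 1 / sqrt((2π)^d |Σ|), which is a handy sanity check.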
[Figure: 2D example with 2 classes; axes: Attribute 1, Attribute 2]
Naïve Bayes. Notice that the full covariances are d × d. In many situations there is not enough data to estimate the full covariance, e.g. when d is large. The Naïve Bayes assumption is again an easy simplification that we can make, and it tends to work well in practice. In the Gaussian model it means that the covariance matrix is diagonal. For the brave: check this last statement for yourself! 3% extra credit if you hand in a correct solution to me before next Thursday's class!
Are we done? How do we estimate the parameters, i.e. the means m_k and the variances/covariances Σ_k? If we use the Naïve Bayes assumption, we can compute the estimates of the mean and variance in each class separately for each attribute. If d is small and you have many points in your training set, then working with the full covariance is expected to work better. In MatLab there are built-in functions that you can use: mean, cov, var.
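The same estimates can be sketched in NumPy (the equivalents of MatLab's mean, cov, var). The toy two-attribute training samples below are assumed values, not real data:

```python
import numpy as np

# Toy 2-attribute training data (height, weight) per class -- assumed values.
X0 = np.array([[1.60, 55.0], [1.58, 52.0], [1.65, 60.0], [1.62, 57.0]])  # class 0
X1 = np.array([[1.75, 72.0], [1.80, 80.0], [1.72, 70.0], [1.78, 76.0]])  # class 1

# Full-covariance estimates (fine when d is small and data is plentiful).
m0, S0 = X0.mean(axis=0), np.cov(X0, rowvar=False)
m1, S1 = X1.mean(axis=0), np.cov(X1, rowvar=False)

# Naive Bayes: keep only the per-attribute variances, i.e. a diagonal
# covariance matrix, estimated separately for each attribute.
S0_nb = np.diag(X0.var(axis=0, ddof=1))
S1_nb = np.diag(X1.var(axis=0, ddof=1))
```

The Naïve Bayes covariance agrees with the full one on the diagonal; it simply discards the off-diagonal (cross-attribute) terms.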
Multi-class classification. We may have more than 2 classes, e.g. healthy, disease type 1, disease type 2. Our Gaussian classifier is easy to use in multi-class problems: we compute the posterior probability for each of the classes, and we predict the class whose posterior probability is highest.
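The multi-class rule above is just an argmax over posteriors. The likelihood and prior numbers below are assumed for illustration (a hypothetical three-class medical example):

```python
import numpy as np

# Assumed class-conditional likelihoods p(x|h=k) at some measurement x,
# and assumed priors P(h=k), for k = 0 (healthy), 1 (type 1), 2 (type 2).
likelihoods = np.array([0.5, 2.0, 0.1])
priors = np.array([0.6, 0.3, 0.1])

# Posterior is proportional to likelihood * prior; normalise, then argmax.
unnorm = likelihoods * priors
posteriors = unnorm / unnorm.sum()
prediction = int(np.argmax(posteriors))
```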
Summing up. This type of classifier is called generative, because it rests on the assumption that the cloud of points in each class can be seen as generated by some distribution, e.g. a Gaussian, and it works out its decisions by estimating these distributions. One could instead model the discriminant function directly! That type of classifier is called discriminative. For the brave: try to work out the form of the discriminant function by plugging the Gaussian class-conditional densities into it. You will get a quadratic function of x in general. When does it reduce to a linear function? Recommended reading: Rogers & Girolami, Chapter 5.