Generative classifiers: The Gaussian classifier. Ata Kaban School of Computer Science University of Birmingham



1 Generative classifiers: The Gaussian classifier. Ata Kaban, School of Computer Science, University of Birmingham

2 Outline We have already seen how Bayes' rule can be turned into a classifier. In all our examples so far we had discrete-valued attributes (e.g. in {sunny, rainy}, {+, -}). Today we learn how to do this when the data attributes are continuous-valued.

3 Example Task: predict the gender of individuals based on their heights. Given: 100 height examples of women and 100 height examples of men. [Figure: histograms of the empirical data for male and for female; x-axis: height (meters), y-axis: frequency.]

4 Class priors We can encode the values of the hypothesis (class) as 1 (male) and 0 (female), so $h \in \{0, 1\}$. Since in this example we had the same number of males and females, we have $P(h=1) = P(h=0) = 0.5$. These are the prior probabilities of class membership, because they can be set before measuring any data. Note that when the class proportions are imbalanced, we can use the priors to make predictions even before seeing any data.
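As a minimal sketch of how such priors could be set from class proportions, the snippet below counts labels in a training set. The function name `class_priors` and the label list are hypothetical and chosen only for illustration.

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(h=k) as the fraction of training examples with label k."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts[k] / total for k in counts}

# Hypothetical labels: 1 = male, 0 = female, 100 examples of each
labels = [1] * 100 + [0] * 100
priors = class_priors(labels)
```

With balanced classes, as in the slide, both priors come out equal; with imbalanced data the same count-based estimate would favour the larger class.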

5 Class-conditional likelihood Our measurements are heights. This is our data, $x$. Class-conditional likelihoods: $p(x \mid h=1)$: probability that a male has height $x$ meters; $p(x \mid h=0)$: probability that a female has height $x$ meters.

6 Class posterior As before, from Bayes' rule we can obtain the class posteriors: $$P(h=1 \mid x) = \frac{p(x \mid h=1)\, P(h=1)}{p(x \mid h=1)\, P(h=1) + p(x \mid h=0)\, P(h=0)}$$ The denominator is the probability of measuring the height value $x$ irrespective of the class. If we can compute the posterior, then we can use it to predict the gender from the height measurement.
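The posterior formula can be sketched in a few lines of Python. The function name `posterior_h1` and the likelihood values passed to it are hypothetical; the code assumes the two class-conditional likelihood values have already been computed somehow.

```python
def posterior_h1(lik1, lik0, prior1=0.5, prior0=0.5):
    """P(h=1 | x) from the likelihood values p(x|h=1), p(x|h=0) and the priors."""
    # Denominator: probability of measuring x irrespective of the class
    evidence = lik1 * prior1 + lik0 * prior0
    return lik1 * prior1 / evidence

# Hypothetical likelihood values for some measured height x
p_male = posterior_h1(lik1=3.0, lik0=1.0)
p_female = 1.0 - p_male  # the two posteriors sum to one
```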

7 Discriminant function When does our prediction switch from predicting $h=0$ to predicting $h=1$? "When the measured height passes a certain threshold"; more precisely, when $P(h=0 \mid x) = P(h=1 \mid x)$. [Figure: histograms of the empirical data for male and for female; x-axis: height (meters), y-axis: frequency.]

8 Discriminant function If we make a measurement, say we get $x = 1.7$ m. We compute the posteriors and find $P(h=1 \mid x=1.7) > P(h=0 \mid x=1.7)$. Then we decide to predict $h = 1$, i.e., male. If we measured $x = 1.2$ m, we would get $P(h=1 \mid x=1.2) < P(h=0 \mid x=1.2)$, and we would predict $h = 0$, i.e., female.

9 Discriminant function We can define a discriminant function as $$f_1(x) = \frac{P(h=1 \mid x)}{P(h=0 \mid x)}$$ and compare the function value to 1. It is more convenient to have the switching at 0 rather than at 1, so we define the discriminant function as the log of $f_1$: $$f(x) = \log \frac{P(h=1 \mid x)}{P(h=0 \mid x)}$$ Then the sign of this function defines the prediction (if $f(x) > 0$ => male, if $f(x) < 0$ => female).

10 How do we compute it? Let's write it out using Bayes' rule: $$f(x) = \log \frac{P(h=1 \mid x)}{P(h=0 \mid x)} = \log \frac{p(x \mid h=1)\, P(h=1)}{p(x \mid h=0)\, P(h=0)}$$ Now we need the class-conditional likelihood terms, $p(x \mid h=0)$ and $p(x \mid h=1)$. Note that $x$ now takes continuous real values. We will model each class by a Gaussian distribution. (Note: there are other ways to do it; this is a generic problem that density estimation deals with. Here we consider the specific case of using a Gaussian, which is fairly commonly done in practice.)
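A minimal sketch of the log-ratio discriminant follows. The names `discriminant` and `predict` are hypothetical, and the likelihood values are assumed to be given and strictly positive (the logarithm is undefined at zero). Note that the denominator of Bayes' rule cancels in the ratio, so it never needs to be computed.

```python
import math

def discriminant(lik1, lik0, prior1=0.5, prior0=0.5):
    """f(x) = log[ p(x|h=1) P(h=1) ] - log[ p(x|h=0) P(h=0) ]."""
    return (math.log(lik1) + math.log(prior1)) - (math.log(lik0) + math.log(prior0))

def predict(lik1, lik0, prior1=0.5, prior0=0.5):
    """Predict 1 (male) if f(x) > 0, otherwise 0 (female)."""
    return 1 if discriminant(lik1, lik0, prior1, prior0) > 0 else 0

# Hypothetical likelihood values with equal priors
f_value = discriminant(lik1=3.0, lik0=1.0)
label = predict(lik1=3.0, lik0=1.0)
```

In this sketch a tie, $f(x) = 0$, is resolved in favour of class 0; that is an arbitrary choice, since the slides do not say how ties are handled.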

11 Illustration: our 1D example [Figure: empirical data and fitted distribution for male, empirical data and fitted distribution for female; x-axis: height (meters), y-axis: frequency.]

12 Gaussian - univariate $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$ where $m$ is the mean (center) and $\sigma^2$ is the variance (spread). These are the parameters that describe the distribution. We will have a separate Gaussian for each class. So the female class will have $m_0$ as its mean and $\sigma_0^2$ as its variance. The male class will have $m_1$ as its mean and $\sigma_1^2$ as its variance. We need to estimate these parameters from the data.
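The univariate density can be written directly from the formula. In the sketch below, `gaussian_pdf` is a hypothetical helper name, and the class means and variances are made-up values for illustration, not estimates from the lecture's data.

```python
import math

def gaussian_pdf(x, m, var):
    """Univariate Gaussian density with mean m and variance var."""
    return math.exp(-((x - m) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical parameters: (mean, variance) of height in meters per class
m1, var1 = 1.78, 0.01  # male class, h = 1
m0, var0 = 1.65, 0.01  # female class, h = 0

x = 1.70
lik1 = gaussian_pdf(x, m1, var1)  # p(x | h=1)
lik0 = gaussian_pdf(x, m0, var0)  # p(x | h=0)
```

Because densities are not probabilities, the returned values can exceed 1 when the variance is small; only their ratio matters for the discriminant.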

13 Gaussian - multivariate Let $x = (x_1, x_2, \ldots, x_d)$, so $x$ has $d$ attributes. Let $k \in \{0, 1\}$. $$p(x \mid h=k) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_k|}} \exp\left\{-\frac{1}{2} (x - m_k)^T \Sigma_k^{-1} (x - m_k)\right\}$$ where $m_k$ are the mean vectors and $\Sigma_k$ are the covariance matrices. These are the parameters that describe the distributions, and they are estimated from the data.
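To make the multivariate formula concrete, here is a sketch restricted to $d = 2$, where the determinant and inverse of the covariance matrix have simple closed forms. The function name `gaussian_pdf_2d` is hypothetical, and the covariance is assumed to be symmetric and positive definite; a general-$d$ version would normally rely on a linear algebra library.

```python
import math

def gaussian_pdf_2d(x, m, cov):
    """Multivariate Gaussian density for d = 2; cov is [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b  # |Sigma|
    dx0, dx1 = x[0] - m[0], x[1] - m[1]
    # (x - m)^T Sigma^{-1} (x - m), using the closed-form 2x2 inverse
    quad = (c * dx0 * dx0 - 2 * b * dx0 * dx1 + a * dx1 * dx1) / det
    norm = math.sqrt((2 * math.pi) ** 2 * det)
    return math.exp(-0.5 * quad) / norm

# Density at the mean of a standard 2D Gaussian
peak = gaussian_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```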

14 Gaussian - multivariate

15 2D example with 2 classes [Figure: 2D example with 2 classes; axes: Attribute 1 and Attribute 2.]

16 Naïve Bayes Notice the full covariances are $d \times d$. In many situations there is not enough data to estimate the full covariance, e.g. when $d$ is large. The Naïve Bayes assumption is again an easy simplification that we can make, and it tends to work well in practice. In the Gaussian model it means that the covariance matrix is diagonal. For the brave: Check this last statement for yourself! 3% extra credit if you hand in a correct solution to me before next Thursday's class!
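One direction of this statement can be sketched as follows. The symbols $m_{kj}$ and $\sigma_{kj}^2$ (mean and variance of attribute $j$ in class $k$) are notation introduced here for illustration. With a diagonal covariance the determinant is a product and the quadratic form is a sum, so the density factorises over the attributes, which is exactly the Naïve Bayes assumption:

```latex
\Sigma_k = \mathrm{diag}(\sigma_{k1}^2, \ldots, \sigma_{kd}^2)
\;\Rightarrow\;
|\Sigma_k| = \prod_{j=1}^{d} \sigma_{kj}^2,
\qquad
(x - m_k)^T \Sigma_k^{-1} (x - m_k) = \sum_{j=1}^{d} \frac{(x_j - m_{kj})^2}{\sigma_{kj}^2}

p(x \mid h=k)
= \frac{1}{\sqrt{(2\pi)^d \prod_{j} \sigma_{kj}^2}}
  \exp\left\{ -\frac{1}{2} \sum_{j=1}^{d} \frac{(x_j - m_{kj})^2}{\sigma_{kj}^2} \right\}
= \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_{kj}^2}}
  \exp\left( -\frac{(x_j - m_{kj})^2}{2\sigma_{kj}^2} \right)
= \prod_{j=1}^{d} p(x_j \mid h=k)
```

Conversely, if the attributes are independent given the class, then all off-diagonal covariances are zero, so the covariance matrix is diagonal.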

17 Are we done? How do we estimate the parameters, i.e. the means $m_k$ and the variances/covariances $\Sigma_k$? If we use the Naïve Bayes assumption, we can compute the estimates of the mean and variance in each class separately for each feature. If $d$ is small and you have many points in your training set, then working with the full covariance is expected to work better. In MATLAB there are built-in functions that you can use: mean, cov, var.
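Under the Naïve Bayes assumption, the estimation step can be sketched with Python's standard `statistics` module. The function name `fit_naive_gaussian` and the toy data are hypothetical. `statistics.variance` divides by $n - 1$, which is also the default normalisation of MATLAB's `var`.

```python
from statistics import mean, variance

def fit_naive_gaussian(X, y):
    """Per-class, per-feature mean and variance (Naive Bayes assumption)."""
    params = {}
    for k in set(y):
        rows = [x for x, label in zip(X, y) if label == k]
        columns = list(zip(*rows))  # one tuple of values per feature
        params[k] = {
            "mean": [mean(col) for col in columns],
            "var": [variance(col) for col in columns],  # divides by n - 1
        }
    return params

# Hypothetical training data: two features per example, two classes
X = [[1.0, 10.0], [3.0, 14.0], [6.0, 20.0], [8.0, 24.0]]
y = [0, 0, 1, 1]
params = fit_naive_gaussian(X, y)
```

Each class needs at least two examples here, since the sample variance is undefined for a single point.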

18 Multi-class classification We may have more than 2 classes, e.g. healthy, disease type 1, disease type 2. Our Gaussian classifier is easy to use in multi-class problems. We compute the posterior probability for each of the classes. We predict the class whose posterior probability is highest.
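The multi-class rule can be sketched for a single continuous attribute as below. The helper names, class parameters, and priors are hypothetical. Since the denominator of Bayes' rule is the same for every class, comparing likelihood times prior is enough to find the class with the highest posterior.

```python
import math

def gaussian_pdf(x, m, var):
    """Univariate Gaussian density with mean m and variance var."""
    return math.exp(-((x - m) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_class(x, params, priors):
    """Return the class k maximising p(x | h=k) * P(h=k)."""
    scores = {k: gaussian_pdf(x, m, var) * priors[k] for k, (m, var) in params.items()}
    return max(scores, key=scores.get)

# Hypothetical (mean, variance) per class and equal priors
params = {"healthy": (0.0, 1.0), "disease type 1": (5.0, 1.0), "disease type 2": (10.0, 1.0)}
priors = {k: 1.0 / 3.0 for k in params}
prediction = predict_class(4.0, params, priors)
```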

19 Summing up This type of classifier is called generative, because it rests on the assumption that the cloud of points in each class can be seen as generated by some distribution, e.g. a Gaussian, and it works out its decisions based on estimating these distributions. One could instead model the discriminant function directly! That type of classifier is called discriminative. For the brave: Try to work out the form of the discriminant function by plugging into it the form of the Gaussian class-conditional densities. You will get a quadratic function of $x$ in general. When does it reduce to a linear function? Recommended reading: Rogers & Girolami, Chapter 5.
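For the univariate case, the form of the discriminant can be sketched by substituting the two class Gaussians (means $m_0, m_1$ and variances $\sigma_0^2, \sigma_1^2$) into $f(x)$. This is a worked outline for the 1D case only, not the full multivariate solution:

```latex
f(x) = \log \frac{p(x \mid h=1)\, P(h=1)}{p(x \mid h=0)\, P(h=0)}
     = -\frac{(x - m_1)^2}{2\sigma_1^2} + \frac{(x - m_0)^2}{2\sigma_0^2}
       - \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_0^2}
       + \log \frac{P(h=1)}{P(h=0)}

f(x) = \frac{1}{2}\left( \frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2} \right) x^2
     + \left( \frac{m_1}{\sigma_1^2} - \frac{m_0}{\sigma_0^2} \right) x
     + \frac{m_0^2}{2\sigma_0^2} - \frac{m_1^2}{2\sigma_1^2}
     - \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_0^2}
     + \log \frac{P(h=1)}{P(h=0)}
```

The coefficient of $x^2$ vanishes exactly when $\sigma_0^2 = \sigma_1^2$, so in 1D the discriminant is linear when the two classes share the same variance. In the multivariate case the analogous condition is $\Sigma_0 = \Sigma_1$.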
