Generative classifiers: The Gaussian classifier. Ata Kaban, School of Computer Science, University of Birmingham


Outline. We have already seen how Bayes' rule can be turned into a classifier. In all our examples so far the attributes were discrete valued (e.g. in {sunny, rainy}, {+, -}). Today we learn how to do this when the data attributes are continuous valued.

Example. Task: predict the gender of individuals based on their heights. Given: 100 height examples of women and 100 height examples of men. [Figure: histograms of the empirical male and female height data; x-axis: Height (meters), y-axis: Frequency]

Class priors. We can encode the values of the hypothesis (class) as 1 (male) and 0 (female), so h ∈ {0, 1}. Since in this example we had the same number of males and females, we have P(h=1) = P(h=0) = 0.5. These are the prior probabilities of class membership, because they can be set before measuring any data. Note that when the class proportions are imbalanced, we can use the priors to make predictions even before seeing any data.
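
As a minimal sketch (not part of the slides), the priors can be estimated from the label counts; the 0/1 label array below is hypothetical:

import numpy as np

labels = np.array([1] * 100 + [0] * 100)   # 100 males (h=1) and 100 females (h=0), as in the example
prior_male = np.mean(labels == 1)          # P(h=1) = 100/200 = 0.5
prior_female = np.mean(labels == 0)        # P(h=0) = 0.5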

Class-conditional likelihood. Our measurements are heights; this is our data, x. Class-conditional likelihoods: p(x|h=1) is the probability that a male has height x meters; p(x|h=0) is the probability that a female has height x meters.

Class posterior. As before, from Bayes' rule we can obtain the class posteriors: P(h=1|x) = p(x|h=1)P(h=1) / [p(x|h=1)P(h=1) + p(x|h=0)P(h=0)]. The denominator is the probability of measuring the height value x irrespective of the class. If we can compute the posterior, then we can use it to predict the gender from the height measurement.
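
As an illustrative sketch (not from the slides), once the two class-conditional likelihood values and the priors are known for a measured height x, the posterior follows directly from the formula above; the likelihood numbers here are made up:

lik_male, lik_female = 2.1, 0.4        # assumed values of p(x|h=1) and p(x|h=0) at the measured x
prior_male, prior_female = 0.5, 0.5    # P(h=1) and P(h=0)

evidence = lik_male * prior_male + lik_female * prior_female   # p(x), the denominator
post_male = lik_male * prior_male / evidence                   # P(h=1|x)
post_female = lik_female * prior_female / evidence             # P(h=0|x), equals 1 - post_male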

Discriminant function. When does our prediction switch from predicting h=0 to predicting h=1? [Figure: histograms of the empirical male and female height data; x-axis: Height (meters), y-axis: Frequency] When the measured height passes a certain threshold; more precisely, when P(h=0|x) = P(h=1|x).

Discriminant function. If we make a measurement, say we get x = 1.7 m, we compute the posteriors and find P(h=1|x=1.7) > P(h=0|x=1.7), so we decide to predict h=1, i.e., male. If we measured x = 1.2 m, we would get P(h=1|x=1.2) < P(h=0|x=1.2), so we would predict h=0, i.e., female.

Discriminant function. We can define a discriminant function as f1(x) = P(h=1|x) / P(h=0|x) and compare the function value to 1. It is more convenient to have the switching at 0 rather than at 1, so define the discriminant function as the log of f1: f(x) = log [P(h=1|x) / P(h=0|x)]. Then the sign of this function defines the prediction: if f(x) > 0 we predict male, if f(x) < 0 we predict female.

How do we compute it? Let's write it out using Bayes' rule: f(x) = log [P(h=1|x) / P(h=0|x)] = log [p(x|h=1)P(h=1) / (p(x|h=0)P(h=0))]. Now we need the class-conditional likelihood terms, p(x|h=0) and p(x|h=1). Note that x now takes continuous real values. We will model each class by a Gaussian distribution. (Note that there are other ways to do it; this is a generic problem that density estimation deals with. Here we consider the specific case of using a Gaussian, which is fairly commonly done in practice.)
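
A minimal sketch of this computation (not from the slides); the function below assumes the likelihood values at x and the priors are already available:

import numpy as np

def log_odds(lik_male, lik_female, prior_male=0.5, prior_female=0.5):
    # f(x) = log[ p(x|h=1)P(h=1) / ( p(x|h=0)P(h=0) ) ]
    return np.log(lik_male * prior_male) - np.log(lik_female * prior_female)

# The sign of f(x) gives the prediction: f(x) > 0 -> male (h=1), f(x) < 0 -> female (h=0).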

Illustration: our 1D example. [Figure: histograms of the empirical male and female height data with the fitted Gaussian distribution for each class overlaid; x-axis: Height (meters), y-axis: Frequency]

Gaussian - univariate. p(x) = (1 / sqrt(2πσ²)) exp(−(x − m)² / (2σ²)), where m is the mean (center) and σ² is the variance (spread). These are the parameters that describe the distribution. We will have a separate Gaussian for each class: the female class will have m_0 as its mean and σ_0² as its variance, and the male class will have m_1 as its mean and σ_1² as its variance. We need to estimate these parameters from the data.
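
A minimal sketch (not from the slides) of the univariate Gaussian density and of estimating its parameters from data; the height samples are hypothetical:

import numpy as np

def gaussian_pdf(x, m, var):
    # univariate Gaussian density with mean m and variance var
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

heights_male = np.array([1.75, 1.82, 1.69, 1.78])    # hypothetical male heights (meters)
m1, var1 = heights_male.mean(), heights_male.var()   # estimated mean and variance for the male class
density_at_1_7 = gaussian_pdf(1.7, m1, var1)         # p(x=1.7 | h=1) under the fitted Gaussian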

Gaussian - multivariate. Let x = (x_1, x_2, ..., x_d), so x has d attributes, and let k ∈ {0, 1}. p(x|h=k) = (1 / sqrt((2π)^d |Σ_k|)) exp{ −(1/2)(x − m_k)^T Σ_k^{−1} (x − m_k) }, where m_k is the mean vector and Σ_k the covariance matrix of class k. These are the parameters that describe the distributions, and they are estimated from the data.
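
A minimal sketch (not from the slides) of evaluating this multivariate density with NumPy; m and Sigma are assumed to have been estimated already:

import numpy as np

def mvn_pdf(x, m, Sigma):
    # p(x|h=k) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-m)^T Sigma^{-1} (x-m))
    d = len(m)
    diff = x - m
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm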

Gaussian - multivariate

2D example with 2 classes. [Figure: two classes of points in a 2D attribute space; axes: Attribute 1, Attribute 2]

Naïve Bayes. Notice the full covariances are d × d. In many situations there is not enough data to estimate the full covariance, e.g. when d is large. The Naïve Bayes assumption is again an easy simplification that we can make, and it tends to work well in practice. In the Gaussian model it means that the covariance matrix is diagonal. For the brave: check this last statement for yourself! 3% extra credit if you hand in a correct solution to me before next Thursday's class!
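
As an illustrative sketch (not from the slides): with a diagonal covariance the multivariate density factorises into a product of univariate Gaussians, one per feature; the per-feature means and variances are assumed to have been estimated already:

import numpy as np

def naive_bayes_likelihood(x, means, variances):
    # product of independent univariate Gaussians, one per attribute
    return np.prod(np.exp(-(x - means) ** 2 / (2 * variances))
                   / np.sqrt(2 * np.pi * variances))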

Are we done? How do we estimate the parameters, i.e. the means m_k and the variances/covariances Σ_k? If we use the Naïve Bayes assumption, we can compute the estimates of the mean and variance in each class separately for each feature. If d is small and you have many points in your training set, then working with the full covariance is expected to work better. In MATLAB there are built-in functions that you can use: mean, cov, var.
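
A minimal sketch (not from the slides) of the per-class parameter estimates; X is a hypothetical (n, d) data matrix and y a vector of 0/1 labels:

import numpy as np

def fit_class(X, y, k):
    Xk = X[y == k]                          # training points belonging to class k
    m_k = Xk.mean(axis=0)                   # mean vector
    Sigma_full = np.cov(Xk, rowvar=False)   # full d x d covariance
    Sigma_diag = np.diag(Xk.var(axis=0))    # diagonal covariance (Naive Bayes assumption)
    return m_k, Sigma_full, Sigma_diag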

Multi-class classification. We may have more than 2 classes, e.g. healthy, disease type 1, disease type 2. Our Gaussian classifier is easy to use in multi-class problems: we compute the posterior probability for each of the classes and predict the class whose posterior probability is highest.
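
A minimal sketch (not from the slides) of the multi-class decision rule; class_likelihoods is assumed to be a list of functions, one per class, each returning p(x|h=k) for the fitted Gaussian of that class:

import numpy as np

def predict(x, class_priors, class_likelihoods):
    # unnormalised posteriors p(x|h=k) P(h=k); the common denominator p(x) does not change the argmax
    scores = np.array([lik(x) * prior
                       for lik, prior in zip(class_likelihoods, class_priors)])
    return int(np.argmax(scores))   # predict the class with the highest posterior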

Summing up. This type of classifier is called generative, because it rests on the assumption that the cloud of points in each class can be seen as generated by some distribution, e.g. a Gaussian, and it works out its decisions by estimating these distributions. One could instead model the discriminant function directly; that type of classifier is called discriminative. For the brave: try to work out the form of the discriminant function by plugging into it the form of the Gaussian class-conditional densities. You will get a quadratic function of x in general. When does it reduce to a linear function? Recommended reading: Rogers & Girolami, Chapter 5.