University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

Mark Gales (mjfg@eng.cam.ac.uk), Michaelmas 2

[Title-page figure: surface plot of a bivariate Gaussian density.]

Generative Model Decision Boundaries

The previous lecture discussed Bayes' decision rule and how it may be used with generative models to yield a classifier and decision boundaries. In generative models the joint distribution is modelled as

p(x, \omega) = p(x|\omega)P(\omega)

The decision boundary will depend on

- p(x|\omega): the class-conditional PDF
- P(\omega): the prior distribution

(Continuous observation feature vectors, x, are considered.)

A large number of class-conditional PDFs could be used, for example

- univariate Gaussian, parameters µ, σ²:

  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

- uniform, parameters a, b:

  p(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}

- etc.

This lecture will look at the multivariate Gaussian, p(x|\omega_i) = \mathcal{N}(x; \mu_i, \Sigma_i):

- the nature of the distribution and decision boundaries
- estimating the model parameters

A small sketch of this style of generative classification is given below.
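As a concrete illustration of this generative set-up (not part of the original handout), the sketch below classifies a scalar observation with univariate Gaussian class-conditional PDFs by comparing p(x|ω)P(ω) for each class; the class names, means, variances and priors are all made-up values.

```python
# Illustrative sketch only: Bayes' rule with univariate Gaussian class-conditionals.
# All parameter values below are assumed for the example, not taken from the handout.
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def classify(x, params, priors):
    """Return the class maximising p(x|omega) P(omega)."""
    return max(params, key=lambda w: gaussian_pdf(x, *params[w]) * priors[w])

params = {"omega_1": (0.0, 1.0), "omega_2": (2.0, 0.5)}   # (mean, variance) per class
priors = {"omega_1": 0.6, "omega_2": 0.4}
print(classify(1.4, params, priors))   # class with the larger p(x|omega) P(omega)
```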

Multivariate Gaussian Distribution

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)' \Sigma^{-1} (x-\mu) \right)

The distribution is characterised by:

- the mean vector µ
- the covariance matrix Σ

[Figures: surface and contour plots of two bivariate Gaussian densities with different covariance matrices.]

A direct numerical evaluation of this density is sketched below.
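A minimal sketch (not from the handout) of evaluating this density directly from the formula; the test point and parameter values are arbitrary.

```python
# Sketch: evaluate the multivariate Gaussian density N(x; mu, Sigma) from the formula.
# The mean, covariance and test point below are assumed example values.
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density at x (all arguments d-dimensional numpy arrays)."""
    d = len(mu)
    diff = x - mu
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)' Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
# Can be cross-checked against scipy.stats.multivariate_normal(mu, Sigma).pdf([1.0, -1.0])
```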

Properties

The mean and covariance matrix are defined as

\mu = E\{x\}
\Sigma = E\{(x-\mu)(x-\mu)'\}

The covariance matrix is clearly symmetric and for d dimensions is described by d(d+1)/2 parameters. The diagonal elements σ_ii of the covariance matrix are the variances in the individual dimensions, σ_i²; the off-diagonal elements determine the correlation. If all off-diagonal elements are zero the dimensions are uncorrelated (the covariance matrix is diagonal), and the distribution is equivalent to a univariate Gaussian in each dimension:

p(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x_i-\mu_i)^2}{2\sigma_i^2} \right)

For a full covariance matrix, correlations cause the contours of equal probability density, which are ellipses, to be angled to the axes of the feature space.

An important property that we will return to is the effect of a linear transformation on a Gaussian distribution. Given that the distribution of vectors x is Gaussian and that y = Ax + b (with A non-singular), then y is Gaussian with

\mu_y = A\mu_x + b
\Sigma_y = A\Sigma_x A'

A quick numerical check of this property is sketched below.
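A quick Monte-Carlo check of the linear-transformation property (a sketch assuming numpy; the particular A, b, mean and covariance are arbitrary choices).

```python
# Sketch: sample from a Gaussian, apply y = Ax + b, and compare the empirical
# mean and covariance of y with A mu_x + b and A Sigma_x A'.
import numpy as np

rng = np.random.default_rng(0)
mu_x = np.array([1.0, -2.0])
Sigma_x = np.array([[2.0, 0.8], [0.8, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])       # non-singular
b = np.array([3.0, -1.0])

x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = x @ A.T + b

print(np.allclose(y.mean(axis=0), A @ mu_x + b, atol=0.02))      # mu_y = A mu_x + b
print(np.allclose(np.cov(y.T), A @ Sigma_x @ A.T, atol=0.05))    # Sigma_y = A Sigma_x A'
```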

Binary Decision Boundaries

For a two-class problem, what is the form of the decision boundary when multivariate Gaussian distributions are used for the class-conditional PDFs? Here the minimum probability of error decision boundary will be computed, thus

P(\omega_1|x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|x), \qquad \frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)}

Normally logs are taken of both sides, so a point, x, on the decision boundary satisfies

\log(p(x|\omega_1)) - \log(p(x|\omega_2)) = \log(P(\omega_2)) - \log(P(\omega_1))

Substituting in

p(x|\omega_1) = \mathcal{N}(x; \mu_1, \Sigma_1), \qquad p(x|\omega_2) = \mathcal{N}(x; \mu_2, \Sigma_2)

yields the following quadratic equation

-\frac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2) + \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} = \log\frac{P(\omega_2)}{P(\omega_1)}

which can be expressed as

x'(\Sigma_1^{-1} - \Sigma_2^{-1})x + 2(\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1)'x + \mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2 - \log\frac{|\Sigma_2|}{|\Sigma_1|} - 2\log\frac{P(\omega_1)}{P(\omega_2)} = 0

i.e. of the form

x'Ax + b'x + c = 0

which gives the equation of the decision boundary. A helper that computes these coefficients is sketched below.
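The quadratic-boundary coefficients can be computed mechanically from the class parameters. The helper below is an assumed sketch (not the handout's code) that returns A, b and c exactly as defined above.

```python
# Sketch: coefficients of the quadratic decision boundary x'Ax + b'x + c = 0
# for two Gaussian class-conditional PDFs and priors P1, P2.
import numpy as np

def quadratic_boundary(mu1, Sigma1, mu2, Sigma2, P1=0.5, P2=0.5):
    S1i, S2i = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    A = S1i - S2i
    b = 2.0 * (S2i @ mu2 - S1i @ mu1)
    c = (mu1 @ S1i @ mu1 - mu2 @ S2i @ mu2
         - np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))
         - 2.0 * np.log(P1 / P2))
    return A, b, c
```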

Examples of General Case

Arbitrary Gaussian distributions can lead to general hyperquadratic boundaries. The following figures (from DHS) indicate this. Note that the boundaries can of course be straight lines and the regions may not be simply connected.

[Figures (from DHS): pairs of Gaussian class-conditional densities p_1, p_2 and the resulting decision boundaries.]

[Further figures (from DHS): additional examples of hyperquadratic decision boundaries.]

Example Decision Boundary

Assume two classes with

\mu_1 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1/2 & 0 \\ 0 & 2 \end{bmatrix}, \qquad \mu_2 = \begin{bmatrix} 3 \\ -2 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}

The inverse covariance matrices are then

\Sigma_1^{-1} = \begin{bmatrix} 2 & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad \Sigma_2^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix}

Substituting into the general expression for Gaussian boundaries (with equal priors) yields:

\begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 3/2 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 2\begin{bmatrix} -9/2 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 36 - 6.5 - \log 4 = 0

i.e.

1.5x_1^2 - 9x_1 - 8x_2 + 28.1 = 0 \quad\Rightarrow\quad x_2 = 3.514 - 1.125x_1 + 0.1875x_1^2

which is a parabola with a minimum at (3, 1.83). This is illustrated (from DHS) below. The graph shows sample points from each class, the means and the decision boundary. Note that the boundary does not pass through the mid-point between the means. A numerical check of this example is sketched below.
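A numerical check of this worked example (a sketch, with equal priors assumed as in the handout): the boundary coefficients are recomputed and the value of x_2 at x_1 = 3 should come out near 1.83, the minimum of the parabola.

```python
# Sketch: verify the worked example numerically (equal priors, so the prior term drops out).
import numpy as np

mu1, Sigma1 = np.array([3.0, 6.0]), np.diag([0.5, 2.0])
mu2, Sigma2 = np.array([3.0, -2.0]), np.diag([2.0, 2.0])
S1i, S2i = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)

A = S1i - S2i                                  # diag(3/2, 0)
b = 2.0 * (S2i @ mu2 - S1i @ mu1)              # (-9, -8)
c = (mu1 @ S1i @ mu1 - mu2 @ S2i @ mu2
     - np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)))   # 36 - 6.5 - log 4

x1 = 3.0                                       # at the parabola's minimum
x2 = (A[0, 0] * x1**2 + b[0] * x1 + c) / (-b[1])
print(round(x2, 3))                            # approx 1.83
```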

[Figure (from DHS): the two class means µ_1 and µ_2, sample points and the parabolic decision boundary in the (x_1, x_2) plane.]

Constrained Case: Σ_i = Σ

[Figure: two Gaussian classes with a shared covariance matrix and the resulting linear decision boundary.]

Constraining both class-conditional PDF covariance matrices to be the same simplifies the decision boundary to

2(\mu_2 - \mu_1)'\Sigma^{-1}x + \mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2 - 2\log\frac{P(\omega_1)}{P(\omega_2)} = 0

This is a linear decision boundary

b'x + c = 0

Here the classifier effectively compares weighted distances, called Mahalanobis distances, from the input data x to the class means, as in the sketch below.
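A sketch (assumed, not the handout's code) of the shared-covariance classifier written in terms of squared Mahalanobis distances to the class means plus the log priors; this is equivalent to the linear boundary above.

```python
# Sketch: shared-covariance Gaussian classifier via Mahalanobis distances.
# Example parameter values are arbitrary.
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

def classify_shared_cov(x, mu1, mu2, Sigma, P1=0.5, P2=0.5):
    g1 = -0.5 * mahalanobis_sq(x, mu1, Sigma) + np.log(P1)
    g2 = -0.5 * mahalanobis_sq(x, mu2, Sigma) + np.log(P2)
    return 1 if g1 > g2 else 2

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
print(classify_shared_cov(np.array([1.2, 0.3]), mu1, mu2, Sigma))
```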

Posterior for Σ_i = Σ

It is interesting to look at the posteriors:

P(\omega_1|x) = \frac{p(x|\omega_1)P(\omega_1)}{p(x|\omega_1)P(\omega_1) + p(x|\omega_2)P(\omega_2)} = \frac{1}{1 + \frac{p(x|\omega_2)P(\omega_2)}{p(x|\omega_1)P(\omega_1)}}

Simply comparing with the decision working on the previous slide (the boundary expression there is twice the log of this likelihood ratio),

P(\omega_1|x) = \frac{1}{1 + \exp(b'x + c)}, \qquad b = \Sigma^{-1}(\mu_2 - \mu_1)

This looks like a multivariate version of the sigmoid

\frac{1}{1 + \exp(-\rho x)}

[Figure: the sigmoid function 1/(1 + exp(-ρx)) plotted against x.]

From 3F3 this can be compared to logistic regression/classification

P(\omega_1|x) = \frac{1}{1 + \exp(-b'x - c)} = \frac{1}{1 + \exp(-\tilde{b}'\tilde{x})}, \qquad \tilde{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}, \quad \tilde{b} = \begin{bmatrix} b \\ c \end{bmatrix}

- one is a generative model, one is discriminative
- the training criteria for the two differ

A numerical check that the posterior has exactly this sigmoid form is sketched below.
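A small numerical check (illustrative values, not from the handout) that with a shared covariance matrix the posterior computed by Bayes' rule equals a sigmoid of a linear function of x, with b = Σ^{-1}(µ_2 - µ_1) and c collecting the constant and prior terms.

```python
# Sketch: exact posterior under shared covariance equals a sigmoid of a linear function.
import numpy as np

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
P1 = P2 = 0.5

Si = np.linalg.inv(Sigma)
b = Si @ (mu2 - mu1)
c = 0.5 * (mu1 @ Si @ mu1 - mu2 @ Si @ mu2) + np.log(P2 / P1)

def posterior(x):
    """P(omega_1|x) by Bayes' rule; shared normalisers cancel in the ratio."""
    def logquad(mu):
        d = x - mu
        return -0.5 * d @ Si @ d
    ratio = np.exp(logquad(mu2) - logquad(mu1)) * P2 / P1
    return 1.0 / (1.0 + ratio)

x = np.array([0.7, -0.4])
print(np.isclose(posterior(x), 1.0 / (1.0 + np.exp(b @ x + c))))   # True
```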

Training Generative Models

So far the parameters have been assumed to be known. In practice this is seldom the case; we need to estimate

- the class-conditional PDFs, p(x|ω)
- the class priors, P(ω)

Training is supervised, so all the training examples associated with a particular class can be extracted and used to train these models. The performance of a generative-model-based classifier is highly dependent on how good the models are. The classifier is the minimum error classifier if

- the form of the class-conditional PDF is correct
- the training sample set is infinite
- the training algorithm finds the correct parameters
- the correct prior is used

None of these is usually true! But things still work (sometimes...).

Priors can be simply estimated using, for example,

P(\omega_1) \approx \frac{n_1}{n_1 + n_2}

How do we find the parameters of the PDF?

Maximum Likelihood Estimation

We need to estimate the parameter vector θ of the class-conditional PDFs from training data. The underlying assumption for ML estimation is that the parameter values are fixed but unknown.

Assume that the parameters are to be estimated from a training/design data set, D, with n example patterns

D = \{x_1, \ldots, x_n\}

and note that the estimate of θ depends on D. If these training vectors are drawn independently, i.e. are independent and identically distributed (IID), the joint probability density of the training set is given by

p(D|\theta) = \prod_{i=1}^{n} p(x_i|\theta)

p(D|θ) viewed as a function of θ is called the likelihood of θ given D. In ML estimation, the value of θ is chosen which is most likely to have given rise to the observed training data. For convenience the log-likelihood function, L(θ), is often maximised instead, i.e.

L(\theta) = \log p(D|\theta) = \sum_{i=1}^{n} \log p(x_i|\theta)

This can be maximised either by iterative techniques (e.g. gradient descent and the expectation-maximisation algorithm: see later in the course) or, in some cases, by a direct closed-form solution. Either way we need to differentiate the log-likelihood function with respect to the unknown parameters and equate the result to zero.

Gaussian Log-Likelihood Functions

As an example, consider estimating the parameters of a univariate Gaussian distribution with data generated from a Gaussian distribution with mean 2.0 and variance 0.6.

[Figure: log-likelihood as a function of the mean, with the variance fixed at its correct value.]

The variation of the log-likelihood with the mean is shown above (assuming that the correct variance is known).

[Figure: log-likelihood as a function of the variance, with the mean fixed at its correct value.]

Similarly, the variation with the variance (assuming that the correct mean is known). A sketch reproducing this kind of plot is given below.
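A sketch of how such a plot might be produced (assuming numpy and matplotlib; the sample size of 100 and the random seed are arbitrary, and the mean 2.0 and variance 0.6 are taken from the example above).

```python
# Sketch: log-likelihood of a univariate Gaussian as a function of the mean,
# with the variance held at its true value.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(2.0, np.sqrt(0.6), size=100)      # sample size assumed

def log_likelihood(mu, var, x):
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

mus = np.linspace(0.5, 4.0, 200)
plt.plot(mus, [log_likelihood(m, 0.6, data) for m in mus])
plt.xlabel("mean")
plt.ylabel("log-likelihood")
plt.show()                                           # peak near the sample mean (about 2.0)
```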

Mean of a Gaussian distribution

Now we would like to obtain an analytical expression for the estimate of the mean of a Gaussian distribution. Consider a single-dimensional observation (d = 1) and estimating only the mean, so θ = µ. First the log-likelihood may be written as

L(\mu) = \sum_{i=1}^{n} \log(p(x_i|\mu)) = \sum_{i=1}^{n} \left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2} \right)

Differentiating this gives

\nabla L(\mu) = \frac{\partial}{\partial\mu} L(\mu) = \sum_{i=1}^{n} \frac{(x_i-\mu)}{\sigma^2}

We now want to find the value of the model parameter at which the gradient is zero. Thus

\sum_{i=1}^{n} \frac{(x_i-\hat{\mu})}{\sigma^2} = 0

So (much as expected!) the ML estimate of the mean, µ̂, is

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i

Similarly the ML estimate of the variance can be derived.

Multivariate Gaussian Case

For the general case the set of model parameters associated with a Gaussian distribution is

\theta = \begin{bmatrix} \mu \\ \mathrm{vec}(\Sigma) \end{bmatrix}

We will not go into the details of the derivation here (do this as an exercise), but it can be shown that the ML solutions for the mean (µ̂) and the covariance matrix (Σ̂) are

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i

and

\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})'

Note that when deriving ML estimates for multivariate distributions, the following matrix calculus identities are useful (given for reference only):

\frac{\partial}{\partial A}(b'Ac) = bc'
\frac{\partial}{\partial a}(a'Ba) = 2Ba \quad (B \text{ symmetric})
\frac{\partial}{\partial a}(a'Bc) = Bc
\frac{\partial}{\partial A}\log(|A|) = (A^{-1})'

These estimates are applied directly in the sketch below.
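A direct implementation of these ML estimates (a sketch assuming numpy; the generating parameters below are arbitrary).

```python
# Sketch: ML mean and covariance of a multivariate Gaussian from data.
import numpy as np

def ml_gaussian(X):
    """ML estimates from the rows of X (an n x d array)."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    diff = X - mu_hat
    Sigma_hat = diff.T @ diff / n          # note 1/n, not 1/(n-1)
    return mu_hat, Sigma_hat

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=5000)
mu_hat, Sigma_hat = ml_gaussian(X)
print(mu_hat)
print(Sigma_hat)                           # close to the generating parameters
```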

Biased Estimators

You will previously have found that the unbiased estimate of the covariance matrix, Σ̂, with an unknown value of the mean is

\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})'

There is a difference between this and the ML solution (1/(n-1) versus 1/n). In the limit as n tends to infinity the two values are the same.

So which is correct and which is wrong? Neither - they are just different. There are two important statistical properties illustrated here.

1. Unbiased estimators: the expected value over a large number of estimates of the parameters is the true parameter.
2. Consistent estimators: in the limit as the number of points tends to infinity the estimate tends to the true value.

It can be shown that the ML estimate of the mean is unbiased, whereas the ML estimate of the variance is biased but consistent. The simulation below illustrates the bias.
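A small simulation (illustrative, not from the handout) of the bias: for samples of size n = 5 from a unit-variance Gaussian, the 1/n estimate averages close to (n-1)/n = 0.8, while the 1/(n-1) estimate averages close to the true value of 1.

```python
# Sketch: compare the average of the 1/n and 1/(n-1) variance estimates over many samples.
import numpy as np

rng = np.random.default_rng(3)
n, trials = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
ss = ((samples - mu_hat) ** 2).sum(axis=1)      # sum of squared deviations per sample

print((ss / n).mean())        # approx (n-1)/n = 0.8 : biased ML estimate
print((ss / (n - 1)).mean())  # approx 1.0           : unbiased estimate
```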

Iris data

A famous (standard) database from the machine learning/pattern recognition literature. Measurements are taken from three forms of iris:

- sepal length and width
- petal length and width

Only the petal information is considered here.

[Figures: scatter plots of petal width against petal length for the classes Setosa, Versicolour and Virginica, with fitted Gaussian contours.]

Use multivariate Gaussians to model each class. The plots show

- the data points
- lines at 1 and 2 standard deviations from the mean
- regions assigned using Bayes' decision rule
- all priors are assumed to be equal

A sketch of this classifier on the iris petal features is given below.
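A sketch of this classifier on the iris petal features (scikit-learn is assumed to be available just to load the data set; the modelling follows the ML estimates from earlier in the handout, with equal priors).

```python
# Sketch: per-class multivariate Gaussians on the iris petal features, Bayes' rule
# with equal priors.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target        # petal length, petal width

params = []
for k in range(3):
    Xk = X[y == k]
    mu = Xk.mean(axis=0)
    Sigma = np.cov(Xk.T, bias=True)          # ML (1/n) covariance estimate
    params.append((mu, Sigma))

def log_gauss(x, mu, Sigma):
    # constant term -d/2 log(2 pi), common to all classes, is omitted
    d = x - mu
    return -0.5 * (np.log(np.linalg.det(Sigma)) + d @ np.linalg.solve(Sigma, d))

pred = np.array([np.argmax([log_gauss(x, m, S) for m, S in params]) for x in X])
print("training accuracy:", (pred == y).mean())
```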

[Figure: decision regions for the three iris classes (Setosa, Versicolour, Virginica) in the petal length-petal width plane.]

Logistic Regression/Classification

When the covariance matrices of the class-conditional Gaussian PDFs are constrained to be the same, the posteriors look like logistic regression - but the training is very different. The criterion for training the logistic regression parameters b̃ aims to maximise the likelihood of producing the class labels rather than the observations:

L(\tilde{b}) = \sum_{i=1}^{N} \log(P(y_i|x_i, \tilde{b})) = \sum_{i=1}^{N} \left( y_i \log\frac{1}{1+\exp(-\tilde{b}'\tilde{x}_i)} + (1-y_i)\log\frac{1}{1+\exp(\tilde{b}'\tilde{x}_i)} \right)

where

y_i = \begin{cases} 1, & x_i \text{ generated by class } \omega_1 \\ 0, & x_i \text{ generated by class } \omega_2 \end{cases}

and (noting that P(ω_1|x) + P(ω_2|x) = 1)

P(\omega_1|x) = \frac{1}{1+\exp(-\tilde{b}'\tilde{x})}, \qquad P(\omega_2|x) = \frac{1}{1+\exp(\tilde{b}'\tilde{x})}

This is optimised using gradient descent, Newton's method, etc. (discussed later in the course); a gradient-ascent sketch is given below.
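A gradient-ascent sketch of this training criterion (assumed code, not from the handout; the synthetic data, learning rate and iteration count are arbitrary choices).

```python
# Sketch: maximise the label log-likelihood above by gradient ascent on b~ = [b' c]'.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """X: n x d data, y: labels in {0, 1} (1 means class omega_1)."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 to form x~ = [x' 1]'
    b = np.zeros(Xt.shape[1])                       # b~ = [b' c]'
    for _ in range(iters):
        p = sigmoid(Xt @ b)                         # P(omega_1 | x_i)
        b += lr * Xt.T @ (y - p) / len(y)           # gradient of the log-likelihood
    return b

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)), rng.normal([-1, -1], 1.0, (50, 2))])
y = np.hstack([np.ones(50), np.zeros(50)])
print(fit_logistic(X, y))
```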

MAP Estimation

It is sometimes useful to use a prior over the model parameters, for example with

- a high-dimensional observation feature space
- limited training data

Both are related to the curse of dimensionality and how well the classifier will generalise.

Consider a prior on the multivariate Gaussian mean, µ, of the form

p(\mu) = \frac{1}{(2\pi)^{d/2}|\Sigma_p|^{1/2}} \exp\left( -\frac{1}{2}(\mu - \mu_p)'\Sigma_p^{-1}(\mu - \mu_p) \right)

where µ_p and Σ_p are the parameters of the prior. The MAP criterion for the mean is

F(\mu) = L(\mu) + \log(p(\mu))

Differentiating and equating to zero gives

\hat{\mu} = \left( n\Sigma^{-1} + \Sigma_p^{-1} \right)^{-1} \left( \Sigma_p^{-1}\mu_p + \Sigma^{-1}\sum_{i=1}^{n} x_i \right)

- as Σ_p tends to infinity this tends to the ML solution
- as n tends to infinity this tends to the ML solution
- as n tends to 0 this tends to the prior mean µ_p

A sketch of this estimate is given below.
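A sketch of this MAP estimate (illustrative values; the observation covariance Σ is taken as known, as in the derivation above), showing how it moves from the prior mean towards the ML mean as n grows.

```python
# Sketch: MAP estimate of a Gaussian mean with a Gaussian prior on the mean.
import numpy as np

def map_mean(X, Sigma, mu_p, Sigma_p):
    n = X.shape[0]
    Si, Spi = np.linalg.inv(Sigma), np.linalg.inv(Sigma_p)
    return np.linalg.solve(n * Si + Spi, Spi @ mu_p + Si @ X.sum(axis=0))

rng = np.random.default_rng(5)
Sigma = np.eye(2)                                  # known observation covariance
mu_p, Sigma_p = np.array([0.0, 0.0]), 0.5 * np.eye(2)
for n in (2, 20, 2000):
    X = rng.multivariate_normal([3.0, -1.0], Sigma, size=n)
    print(n, map_mean(X, Sigma, mu_p, Sigma_p))    # interpolates from mu_p to the ML mean
```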

Curse of dimensionality

Given a powerful enough classifier, or a high enough dimensional observation feature space, the training data can always be perfectly classified (think of a look-up table). BUT we care about performance on held-out data - we already know the labels of the training data! Classification of previously unseen data is generalisation.

[Figure: error rate (%) against number of parameters, for training data and future "test" data.]

Often, when designing classifiers, it is convenient to have a set of held-out training data that can be used to determine the appropriate complexity of the classifier. This is often called a holdout or validation set.