COM336: Neural Computing
http://www.dcs.shef.ac.uk/~sjr/com336/
Lecture 2: Density Estimation
Steve Renals
Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK
email: s.renals@dcs.shef.ac.uk
Objectives
1. To extend the discussion to multidimensional and continuous inputs
2. To investigate pattern recognition based on density estimation and the use of Bayes' rule to produce posterior probabilities
Representation: parametric probability density functions (the Gaussian distribution and mixtures of Gaussians)
Inference: maximum likelihood estimation; the EM algorithm
Reading: Bishop, chapter 2 (sections 2.1, 2.2, 2.3, 2.6)
Multivariate Input Data
- So far we have considered single-variable inputs
- Many (most) problems have multidimensional inputs
- Handwriting recognition:
  - Preprocessed features: height/width ratio; amount of black ink; curves; angles
  - Raw features: image pixel values
- Curse of dimensionality: the size of the input space increases exponentially with the dimension
Continuous Input Features
- In many cases we want to represent the input using continuous random variables
- We can describe the behaviour of a continuous RV X using the probability density function (pdf), p(x)
- p(x) is not the probability that X has value x. But the pdf is proportional to the probability that X lies in a small region centred on x.
- In one dimension, the probability that x lies between a and b is given by:
$$P(a \le x \le b) = \int_a^b p(x)\,dx$$
- In several dimensions the probability that x lies in a region R is given by:
$$P(\mathbf{x} \in R) = \int_R p(\mathbf{x})\,d\mathbf{x}$$
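The distinction between a density and a probability can be checked numerically. The sketch below is illustrative only: it assumes a univariate Gaussian as the example density (the Gaussian is defined later in the lecture), and the helper names `gaussian_pdf` and `prob_between` are hypothetical. It integrates the pdf over an interval with the trapezoidal rule.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def prob_between(pdf, a, b, steps=10_000):
    """Approximate P(a <= x <= b), the integral of pdf over [a, b], by the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * h)
    return total * h

# Probability mass within one standard deviation of the mean (about 0.6827).
p_one_sigma = prob_between(gaussian_pdf, -1.0, 1.0)

# A density value can exceed 1, which a probability never can.
peak_density = gaussian_pdf(0.0, mu=0.0, sigma2=0.01)

print(round(p_one_sigma, 4), peak_density > 1.0)
```

The narrow Gaussian shows why p(x) cannot itself be a probability: its peak value is well above 1, yet its integral over any interval is still at most 1.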
Expectation of a Continuous RV
- For a continuous random variable X with pdf p(x) we must integrate to compute the mean (or expectation):
$$E[X] = \int x\,p(x)\,dx$$
- Similarly the expectation of a function Q(x) is given by:
$$E[Q] = \int Q(x)\,p(x)\,dx$$
- In both cases the integral is over all of x-space
- We can approximate the expectation by averaging over a sample of data points (x_1, ..., x_N) drawn from the distribution of X:
$$E[Q] = \int Q(x)\,p(x)\,dx \simeq \frac{1}{N}\sum_{n=1}^{N} Q(x_n)$$
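The sample-average approximation can be sketched as follows. This is a minimal illustration, assuming samples from a standard Gaussian (so that E[X] = 0 and E[X^2] = 1 are known); the function name `monte_carlo_expectation` is hypothetical.

```python
import random

def monte_carlo_expectation(q, sample):
    """Approximate E[Q] by (1/N) * sum of Q(x_n) over the sample."""
    return sum(q(x) for x in sample) / len(sample)

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean_estimate = monte_carlo_expectation(lambda x: x, sample)               # close to 0
second_moment_estimate = monte_carlo_expectation(lambda x: x * x, sample)  # close to 1

print(mean_estimate, second_moment_estimate)
```

The estimates fluctuate around the true values, with an error that shrinks roughly as one over the square root of N.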
Bayes' Theorem in General
- In the case of continuous input we replace the class-conditional probabilities with class-conditional probability densities, p(x | C_k)
- Bayes' theorem can now be written as:
$$P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,P(C_k)}{p(\mathbf{x})}$$
- p(x) is the unconditional density:
$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid C_k)\,P(C_k)$$
- The class-conditional probability density is often called the likelihood:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizer}}$$
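As a small sketch of the formula above, the function below turns class-conditional density values and priors into posteriors. The function name `posteriors` and the numbers are made up for illustration; they do not come from the lecture.

```python
def posteriors(likelihoods, priors):
    """Return P(C_k | x) for each class from p(x | C_k) and P(C_k)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the normalizer
    return [j / evidence for j in joint]

# Hypothetical density values p(x | C_1) = 0.2, p(x | C_2) = 0.05
# and priors P(C_1) = 0.3, P(C_2) = 0.7.
post = posteriors([0.2, 0.05], [0.3, 0.7])
print(post)
```

Dividing by p(x) only rescales the scores so that they sum to one; it never changes which class has the largest posterior.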
Probability Density Estimation
- In a generative approach to pattern recognition, the crucial thing is the estimation of the likelihood function p(x | C_k)
- The prior can be estimated from the training data by counting
- For classification p(x) does not need to be computed since it does not depend on the class
- We will look at two approaches to density estimation:
  - Parametric model (e.g. Gaussian): assume the density function has a specific parametric form, such as a normal distribution, and fit the parameters accordingly
  - Mixture model: assume the data comes from a mixture or combination of Gaussians (or other parametric distributions) and again fit the parameters accordingly
Gaussian Distribution (1)
- For a scalar, the Gaussian or Normal density function is written as:
$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- µ is the mean; σ^2 is the variance. It can be shown that:
$$\mu = E[x]; \qquad \sigma^2 = E[(x-\mu)^2]$$
- In d dimensions, the multivariate Gaussian density function is:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
- µ is the d-dimensional mean vector
- Σ is the d × d covariance matrix
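Gaussian Distribution (2)
- µ and Σ satisfy:
$$\boldsymbol{\mu} = E[\mathbf{x}] \qquad \Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T]$$
- Note that Σ is a symmetric matrix, so it has d(d + 1)/2 independent elements
- Thus the multivariate Gaussian has d + d(d + 1)/2 = d(d + 3)/2 parameters (d for the mean, the rest for the covariance)
- Note that
$$\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$$
is sometimes called the (squared) Mahalanobis distance between x and µ.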
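The multivariate density and the Mahalanobis term can be sketched for the special case d = 2, where the inverse covariance can be written out by hand. The function names below are hypothetical, and restricting to two dimensions is an assumption made to keep the example dependency-free.

```python
import math

def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu) for d = 2."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    return (dx0 * (d * dx0 - b * dx1) + dx1 * (-c * dx0 + a * dx1)) / det

def gaussian_pdf_2d(x, mu, cov):
    """Bivariate Gaussian density; the normaliser is (2*pi)^{d/2} |cov|^{1/2} with d = 2."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    norm = 2.0 * math.pi * math.sqrt(det)
    return math.exp(-0.5 * mahalanobis_sq_2d(x, mu, cov)) / norm

def n_gaussian_params(d):
    """Free parameters of a d-dimensional Gaussian: d means plus d(d+1)/2 covariances."""
    return d + d * (d + 1) // 2

print(gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))
print(mahalanobis_sq_2d((2.0, 1.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0))))
print(n_gaussian_params(2), n_gaussian_params(3))
```

With a diagonal covariance the squared Mahalanobis distance is just the squared Euclidean distance after each axis has been divided by its standard deviation.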
Some Properties of the Gaussian Distribution
- Straightforward analytical properties: it is possible to obtain many useful results explicitly
- Central Limit Theorem: the (suitably normalised) sum of a large number of independent and identically distributed random variables with finite variance tends to a Gaussian distribution
- Marginal densities (obtained by integrating out some variables) of a Gaussian are also Gaussian
- Conditional densities (obtained by holding some variables fixed) of a Gaussian are also Gaussian
- Decision boundaries between classes with Gaussian class-conditional densities are quadratic
- If the covariance matrices of all classes are constrained to be equal then the decision boundaries are linear: linear discriminants
(See level 2 pattern processing notes, or Bishop, for more details)
Gaussian Classifier
$$p(\mathbf{x} \mid C_k; \boldsymbol{\mu}_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k)\right)$$
- Each class C_k is parameterised by a mean vector µ_k and a covariance matrix Σ_k.
- Possible constraints:
  - Grand covariance matrix: all classes have individual means and share the same covariance matrix
  - Diagonal covariance matrix: assumes the components of the input vector are independent, so the off-diagonal terms are 0 (reduces the number of covariance parameters from d(d + 1)/2 to d)
  - Spherical covariance matrix: input components are independent and have equal variances (reduces to 1 covariance parameter per class): Σ_k = σ_k^2 I
Maximum Likelihood Estimation (1)
- Given a density function p(x | θ), where θ represents the parameters (µ and Σ for a Gaussian), the inference problem is to estimate the parameters θ given training data X = {x_1, ..., x_N}
- Define a likelihood function L(θ), assuming the data points are drawn independently:
$$L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \theta)$$
- Maximum likelihood estimation (MLE) aims to adjust θ so as to maximise the likelihood of generating the training data (i.e. maximise L(θ) with respect to θ given the training data)
- Negative log likelihood E:
$$E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(\mathbf{x}_n \mid \theta)$$
- Interpret the negative log likelihood as an error function
MLE (2)
- Maximizing L(θ) (or minimising E) requires finding where the derivative with respect to θ is zero.
- This normally requires an iterative procedure, but it can be done analytically for a Gaussian
- It is possible to show that the mean vector µ̂ and covariance matrix Σ̂ that maximise the likelihood given the training data are given by:
$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}_n - \hat{\boldsymbol{\mu}})(\mathbf{x}_n - \hat{\boldsymbol{\mu}})^T$$
- This is intuitive, since the mean is estimated by the sample mean and the covariance by the sample covariance
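For the univariate case the result can be checked directly. The derivation below is a sketch for a scalar Gaussian with parameters µ and σ^2; the multivariate case follows the same pattern but needs matrix derivatives.

```latex
% Negative log likelihood of N independent scalar observations
E = -\sum_{n=1}^{N} \ln p(x_n \mid \mu, \sigma^2)
  = \frac{N}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2

% Setting the derivative with respect to the mean to zero
\frac{\partial E}{\partial \mu} = -\frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n-\mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n

% Setting the derivative with respect to the variance to zero
\frac{\partial E}{\partial \sigma^2}
  = \frac{N}{2\sigma^2} - \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n-\hat{\mu})^2 = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n-\hat{\mu})^2
```

Note that the maximum likelihood variance divides by N rather than N - 1.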
Example (1)
A pattern recognition problem has two classes, S and T, each of which is assumed to follow a Gaussian distribution. Some labelled observations are available for each class, detailed in the table below:
Class S: 10  8 10 10 11 11
Class T: 12  9 15 10 13 13
Using the above data, estimate the parameters of the pdf for each class. Sketch the pdf for each class.
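One way to check a hand calculation for this example is the short sketch below. It assumes the maximum likelihood estimates (variance divided by N, not N - 1); the function name `gaussian_mle` is hypothetical.

```python
def gaussian_mle(data):
    """Maximum likelihood mean and variance of scalar data under a Gaussian model."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # ML estimate divides by N
    return mean, var

class_s = [10, 8, 10, 10, 11, 11]
class_t = [12, 9, 15, 10, 13, 13]

mean_s, var_s = gaussian_mle(class_s)  # mean 10, variance 1
mean_t, var_t = gaussian_mle(class_t)  # mean 12, variance 4

print(mean_s, var_s, mean_t, var_t)
```

Class T therefore has both a larger mean and a broader pdf than class S, which matters when classifying points far from both means.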
Example (2)
The following unlabelled data points are available:
x_1 = 10, x_2 = 11, x_3 = 6
To which class should each of the data points be assigned? (Assume the two classes have equal prior probabilities.)
Now assume that the two classes do not have equal prior probabilities; in fact:
P(S) = 0.3, P(T) = 0.7
Including this prior information, to which class should each of the above test data points (x_1, x_2, x_3) now be assigned?
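A possible way to check the answer is sketched below. It assumes univariate Gaussian class-conditional densities fitted by maximum likelihood to the class S and class T observations from Example (1), and assigns each point to the class with the largest value of likelihood times prior. The function names are hypothetical.

```python
import math

def gaussian_mle(data):
    """Maximum likelihood mean and variance (variance divides by N)."""
    mean = sum(data) / len(data)
    return mean, sum((x - mean) ** 2 for x in data) / len(data)

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def classify(x, params, priors):
    """Pick the class maximising p(x | C) P(C); p(x) is not needed."""
    scores = {c: gaussian_pdf(x, *params[c]) * priors[c] for c in params}
    return max(scores, key=scores.get)

data = {"S": [10, 8, 10, 10, 11, 11], "T": [12, 9, 15, 10, 13, 13]}
params = {c: gaussian_mle(values) for c, values in data.items()}
test_points = [10, 11, 6]

equal = [classify(x, params, {"S": 0.5, "T": 0.5}) for x in test_points]
unequal = [classify(x, params, {"S": 0.3, "T": 0.7}) for x in test_points]

print(equal)    # ['S', 'S', 'T']
print(unequal)  # ['S', 'T', 'T']
```

Two things stand out under these assumptions: x_3 = 6 goes to T even though it is closer to the mean of S, because T has the larger variance; and the unequal priors flip the decision for x_2 = 11 only.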
Another Example (1)
Consider the following data:
Length: 38 44 41 36 47 38 38 42 39 45 39 45
Using the above data, estimate the parameters of the pdf for this data. Sketch this pdf.
Another Example (2)-(4)
[Figures: three plots of the length data; horizontal axis from 36 to 47, vertical axis from 0 to 4]
Mixture Models (1)
- Gaussian models assume that the density has a single mode, but we often want to model multi-modal densities
- We can have as many modes as we like by combining a set of component densities, p(x | j):
$$p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x} \mid j)\,P(j)$$
- This is an M-component mixture model
- The coefficients P(j) are called the mixing parameters; they are probabilities, so they are non-negative and sum to one
- This is also a generative model: to generate a data point from a mixture distribution, first choose a mixture component j with probability P(j), then generate a data point from the corresponding component density p(x | j).
- Given enough components, mixture densities can approximate any continuous density to arbitrary accuracy
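The two-stage generative procedure can be sketched in a few lines. The example assumes univariate Gaussian components; the function name `sample_mixture` and the parameter values are made up for illustration.

```python
import random

def sample_mixture(priors, means, sigmas, n, rng):
    """Draw n points: pick component j with probability P(j), then sample from p(x | j)."""
    samples, labels = [], []
    for _ in range(n):
        j = rng.choices(range(len(priors)), weights=priors)[0]
        samples.append(rng.gauss(means[j], sigmas[j]))
        labels.append(j)
    return samples, labels

rng = random.Random(0)  # seeded generator so the run is repeatable
priors = [0.3, 0.7]     # hypothetical mixing parameters P(j)
means = [0.0, 5.0]      # hypothetical component means
sigmas = [1.0, 0.5]     # hypothetical component standard deviations

samples, labels = sample_mixture(priors, means, sigmas, 20_000, rng)
frac_component_1 = labels.count(1) / len(labels)  # close to P(1) = 0.7
sample_mean = sum(samples) / len(samples)         # close to 0.3*0 + 0.7*5 = 3.5

print(frac_component_1, sample_mean)
```

In real data only `samples` would be observed; the `labels` list is exactly the hidden information that makes fitting a mixture model harder than fitting a single Gaussian.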
Mixture Models (2)
- The mixture component j is missing or hidden data: given a data point, we do not know which component was responsible for generating it
- But we can write down the posterior probability of a mixture component given a data point, using Bayes' theorem:
$$P(j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid j)\,P(j)}{p(\mathbf{x})}$$
- We shall mainly consider mixture models with Gaussian components and spherical covariances (Σ_j = σ_j^2 I):
$$p(\mathbf{x} \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$
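Network Diagram of a Mixture Model
[Figure: network diagram in which the component densities p(x | 1), ..., p(x | M) are combined, weighted by the mixing parameters P(1), ..., P(M), to give the output p(x)]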
Mixture Model: MLE
- Estimate mixture model parameters using maximum likelihood
- Negative log likelihood is:
$$E = -\ln L = -\sum_{n=1}^{N} \ln p(\mathbf{x}_n) = -\sum_{n=1}^{N} \ln\left(\sum_{j=1}^{M} p(\mathbf{x}_n \mid j)\,P(j)\right)$$
- If we knew which mixture component was responsible for generating each training data point, then maximum likelihood estimation would be straightforward: the same as for a Gaussian classifier, with each mixture component corresponding to a class.
- In that case each training data point x_n is labelled with the mixture component that generated it, c_n.
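Labelled Component Case: MLE (1)
- Let µ̂_j and Σ̂_j be the parameters of mixture component j:
$$\hat{\boldsymbol{\mu}}_j = \frac{1}{N_j}\sum_{n=1}^{N} \delta_{jc_n}\,\mathbf{x}_n \qquad \hat{\Sigma}_j = \frac{1}{N_j}\sum_{n=1}^{N} \delta_{jc_n}\,(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_j)^T$$
- where δ_{jc_n} is the Kronecker delta and N_j is the number of samples generated from component j:
$$\delta_{ab} = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \ne b \end{cases} \qquad N_j = \sum_{n=1}^{N} \delta_{jc_n}$$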
Labelled Component Case: MLE (2)
- Estimate the mixing parameter for component j, P̂(j), as the proportion of the training data generated by this component:
$$\hat{P}(j) = \frac{1}{N}\sum_{n=1}^{N} \delta_{jc_n} = \frac{N_j}{N}$$
- Note that this enforces the sum-to-one constraint:
$$\sum_{j=1}^{M} \hat{P}(j) = 1$$
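The labelled-component estimates can be sketched for scalar data as follows. The function name `labelled_mle`, the toy data and the labels are hypothetical, and the code assumes every component has at least one data point.

```python
def labelled_mle(data, labels, n_components):
    """Per-component ML estimates (mean, variance, mixing parameter) from labelled scalar data."""
    n = len(data)
    params = []
    for j in range(n_components):
        # Selecting the points with c_n == j plays the role of the Kronecker delta.
        members = [x for x, c in zip(data, labels) if c == j]
        n_j = len(members)
        mean_j = sum(members) / n_j
        var_j = sum((x - mean_j) ** 2 for x in members) / n_j
        params.append((mean_j, var_j, n_j / n))
    return params

data = [1.0, 2.0, 3.0, 10.0, 12.0]  # toy observations
labels = [0, 0, 0, 1, 1]            # component c_n that generated each point
params = labelled_mle(data, labels, 2)

print(params)
```

Because each point contributes to exactly one component, the counts N_j add up to N and the estimated mixing parameters sum to one automatically.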
General Case
- The power of a mixture model lies in the fact that the mixture component which generated each data point is treated as missing data
- Minimizing E in this case is not straightforward:
  - Iterative optimization: unlike the labelled component case there is no closed-form solution for the parameters, so numerical methods are required.
  - Singular solutions: it is possible for the likelihood to go to infinity (e.g. if there are the same number of components as data points, the mean of each component falls on a data point, and σ_j → 0)
  - Local minima: the iterative process may converge on a locally optimal solution that is not globally optimal
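The singular-solution problem can be seen numerically. The sketch below uses a simplified variant of the situation described above: a two-component univariate mixture in which one broad component covers all the data while the other sits exactly on one data point with a shrinking variance. All names and values are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def neg_log_likelihood(data, priors, means, variances):
    """E = -sum_n ln sum_j p(x_n | j) P(j) for a univariate Gaussian mixture."""
    total = 0.0
    for x in data:
        p_x = sum(P * gaussian_pdf(x, m, v) for P, m, v in zip(priors, means, variances))
        total -= math.log(p_x)
    return total

data = [0.0, 1.0, 2.0, 3.0]

# Component 0: broad, centred on the data. Component 1: centred on data[0],
# with a variance that is made smaller and smaller.
errors = [
    neg_log_likelihood(data, [0.5, 0.5], [1.5, data[0]], [4.0, s2])
    for s2 in (1e-2, 1e-4, 1e-6, 1e-8)
]

print(errors)  # E keeps decreasing as the variance shrinks
```

The error has no lower bound along this path, so an unconstrained optimiser can be drawn towards a useless solution in which one component explains a single data point.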
The EM Algorithm (1)
- Iterative scheme for finding parameter values that minimize E
- Start with a guess for the parameter values, then update these old parameter values to obtain a revised estimate: the new parameter values
- This process is then iterated: the power of the algorithm is that it can be proven never to increase E, so E decreases at each iteration until a local minimum is found
- The algorithm for estimating Gaussian mixture models is a special case of a more general algorithm: the EM algorithm
- EM = Expectation-Maximization
The EM Algorithm (2)
- The basic idea of the EM algorithm is to use the posterior probability (or responsibility) P(j | x; θ) of mixture component j being responsible for data point x.
- The responsibilities also depend on the current parameter estimates
- Each iteration of the EM algorithm has two steps:
  - E step: re-estimate the responsibilities P(j | x; θ) given the parameter estimates
  - M step: given the responsibility estimates, re-estimate the parameters, θ = (µ, σ, P(j))
E Step: Estimating the Responsibilities
- On iteration (t + 1) we use the parameter values estimated at iteration t to estimate P^{(t+1)}(j | x; θ^t):
$$P^{(t+1)}(j \mid \mathbf{x}; \theta^t) = \frac{p(\mathbf{x} \mid j; \mu^t, \sigma^t)\,P^t(j)}{p(\mathbf{x}; \theta^t)} = \frac{p(\mathbf{x} \mid j; \mu^t, \sigma^t)\,P^t(j)}{\sum_{i=1}^{M} p(\mathbf{x} \mid i; \mu^t, \sigma^t)\,P^t(i)}$$
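The E step can be sketched for a univariate Gaussian mixture as below. The function names and the parameter values are illustrative assumptions, not taken from the lecture.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(data, priors, means, variances):
    """Return resp[n][j] = P(j | x_n; theta^t) for every data point and component."""
    resp = []
    for x in data:
        joint = [P * gaussian_pdf(x, m, v) for P, m, v in zip(priors, means, variances)]
        p_x = sum(joint)  # p(x_n; theta^t), the denominator
        resp.append([value / p_x for value in joint])
    return resp

# Two equally weighted unit-variance components centred at 0 and 5.
resp = e_step([0.0, 2.5, 5.0], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])

for row in resp:
    print([round(r, 4) for r in row])
```

A point at 2.5, midway between the two means, is shared equally between the components, while points at 0 and 5 are assigned almost entirely to the nearer one.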
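M Step: Estimating the Parameters
- The M step of the EM algorithm re-estimates the parameters so that they maximize the joint likelihood p(X, c | θ) of generating the data X and the component sequence c = (c_1, ..., c_N):
$$p(X, c \mid \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n, c_n \mid \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid c_n; \mu, \sigma)\,P(c_n)$$
- Since the component sequence is hidden, it turns out that this can be done by maximizing an auxiliary function, in which each possible label is weighted by its responsibility:
$$Q(\theta \mid \theta^t) = \sum_{j=1}^{M}\sum_{n=1}^{N} P^{(t+1)}(j \mid \mathbf{x}_n; \theta^t)\,\ln p(\mathbf{x}_n, c_n = j \mid \theta)$$
(See further reading for details on this.)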
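Update Equations
- This results in the following equations for updating the parameters of each component j:
$$P^{(t+1)}(j) = \frac{1}{N}\sum_{n=1}^{N} P^{(t+1)}(j \mid \mathbf{x}_n; \theta^t)$$
$$\boldsymbol{\mu}_j^{(t+1)} = \frac{\sum_{n=1}^{N} P^{(t+1)}(j \mid \mathbf{x}_n; \theta^t)\,\mathbf{x}_n}{\sum_{n=1}^{N} P^{(t+1)}(j \mid \mathbf{x}_n; \theta^t)}$$
$$\left(\sigma_j^{(t+1)}\right)^2 = \frac{1}{d}\,\frac{\sum_{n=1}^{N} P^{(t+1)}(j \mid \mathbf{x}_n; \theta^t)\,\|\mathbf{x}_n - \boldsymbol{\mu}_j^{(t+1)}\|^2}{\sum_{n=1}^{N} P^{(t+1)}(j \mid \mathbf{x}_n; \theta^t)}$$
- This is an intuitive result since it may be viewed as a soft version of the case where the component label sequence was known, with the posterior probability (responsibility) taking care of the uncertainty over which component was responsible for generating each data point.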
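Putting the two steps together, a complete EM iteration for a univariate mixture (d = 1) might look like the sketch below. The function names, the toy data and the starting values are assumptions chosen for illustration, and no safeguard against singular solutions is included.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(data, priors, means, variances):
    """One EM iteration; returns the new parameters and E evaluated at the old ones."""
    n, m = len(data), len(priors)
    # E step: responsibilities under the old parameters, accumulating E on the way.
    resp, error = [], 0.0
    for x in data:
        joint = [priors[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(m)]
        p_x = sum(joint)
        error -= math.log(p_x)
        resp.append([value / p_x for value in joint])
    # M step: responsibility-weighted versions of the labelled-case estimates.
    new_priors, new_means, new_vars = [], [], []
    for j in range(m):
        r_j = sum(resp[i][j] for i in range(n))  # soft count for component j
        mu_j = sum(resp[i][j] * data[i] for i in range(n)) / r_j
        var_j = sum(resp[i][j] * (data[i] - mu_j) ** 2 for i in range(n)) / r_j  # d = 1
        new_priors.append(r_j / n)
        new_means.append(mu_j)
        new_vars.append(var_j)
    return new_priors, new_means, new_vars, error

data = [0.8, 1.0, 1.3, 0.9, 1.1, 4.7, 5.2, 5.0, 4.9, 5.3]  # toy data, two clusters
priors, means, variances = [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]  # initial guess

errors = []
for _ in range(20):
    priors, means, variances, e = em_step(data, priors, means, variances)
    errors.append(e)

print(means, variances, priors)
```

On this well-separated toy data the responsibilities quickly become almost hard assignments, so the estimates approach the per-cluster sample means, and the recorded values of E never increase from one iteration to the next.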
Mixture Models: Summary
- In mixture models the component which generates each data point is hidden
- The likelihood cannot be maximized in closed form; however, an iterative process may be applied.
- In practice, initialisation of the parameters is important to avoid singularities, etc.
- Note that these approaches do not estimate the number of mixture components M: this must be pre-specified
- See the practical exercises for further examples
Mixture Models: Further Reading
- Bishop, section 2.6
- Yoshi Gotoh has written a clear (but technical) review at ftp://ftp.dcs.shef.ac.uk/share/spandh/pubs/yg/em.ps.gz
- Mixture models have been used for many different applications, for example:
  - Speech recognition
  - Image processing
  - Financial modelling
  - Astronomical modelling
  - Biomedical modelling
Summary
- Bayes' theorem enables class-conditional probability density functions to be used for classification (together with a prior)
- Density functions may be estimated from training data by maximising the likelihood of the data given the parameters
- Gaussian density functions are convenient and mathematically tractable but not suitable for every situation
- Mixture models are more general and, given enough components, can approximate any continuous density function to arbitrary accuracy
- Maximum likelihood estimation of mixture models requires the use of an iterative algorithm: the EM algorithm
Next Lecture: Single Layer Networks