COM336: Neural Computing

http://www.dcs.shef.ac.uk/~sjr/com336/

Lecture 2: Density Estimation

Steve Renals
Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK
email: s.renals@dcs.shef.ac.uk

Objectives
1. To extend the discussion to multidimensional and continuous inputs
2. To investigate pattern recognition based on density estimation and the use of Bayes' Rule to produce posterior probabilities
Representation: parametric probability density functions, namely the Gaussian distribution and mixtures of Gaussians
Inference: maximum likelihood estimation; the EM algorithm
Reading: Bishop, chapter 2 (sections 2.1, 2.2, 2.3, 2.6)

Multivariate Input Data
So far we have considered single-variable inputs, but many (most) problems have multidimensional inputs.
Handwriting recognition:
Preprocessed features: height/width ratio; amount of black ink; curves; angles
Raw features: image pixel values
Curse of dimensionality: the size of the input space increases exponentially with the dimension.

Continuous Input Features
In many cases we want to represent the input using continuous random variables.
We can describe the behaviour of a continuous RV X using the probability density function p(x). p(x) is not the probability that X has value x, but the pdf is proportional to the probability that X lies in a small region centred on x.
In one dimension, the probability that x lies between a and b is given by:
$P(a \le x \le b) = \int_a^b p(x)\, dx$
In several dimensions the probability that $\mathbf{x}$ lies in a region R is given by:
$P(\mathbf{x} \in R) = \int_R p(\mathbf{x})\, d\mathbf{x}$

Expectation of a Continuous RV
For a continuous random variable X with pdf p(x) we must integrate to compute the mean (or expectation):
$E[X] = \int x\, p(x)\, dx$
Similarly the expectation of a function Q(x) is given by:
$E[Q] = \int Q(x)\, p(x)\, dx$
In both cases the integral is over all of x-space.
We can approximate the expectation by averaging over a sample of data points $(x_1, \ldots, x_N)$ drawn from the distribution of X:
$E[Q] = \int Q(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{n=1}^{N} Q(x_n)$
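As a quick illustration of this sample-based approximation (a minimal sketch, not part of the original notes), the following Python snippet estimates $E[Q]$ for $Q(x) = x^2$ under a standard Gaussian, where the exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples from the distribution of X (here a standard Gaussian)
N = 100_000
x = rng.normal(loc=0.0, scale=1.0, size=N)

def Q(x):
    return x ** 2

# Approximate E[Q] = integral of Q(x) p(x) dx by the average of Q(x_n)
print(Q(x).mean())   # close to the exact value E[x^2] = 1
```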

Bayes' Theorem in General
In the case of continuous input we replace the class-conditional probabilities with class-conditional probability densities, $p(\mathbf{x} | C_k)$.
Bayes' theorem can now be written as:
$P(C_k | \mathbf{x}) = \frac{p(\mathbf{x} | C_k) P(C_k)}{p(\mathbf{x})}$
$p(\mathbf{x})$ is the unconditional density:
$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} | C_k) P(C_k)$
The class-conditional probability density is often called the likelihood:
$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizer}}$

Probability Density Estimation
In a generative approach to pattern recognition, the crucial thing is the estimation of the likelihood function $p(\mathbf{x} | C_k)$.
The prior can be estimated from the training data by counting.
For classification, $p(\mathbf{x})$ does not need to be computed, since it does not depend on the class.
We will look at two approaches to density estimation:
Parametric model (e.g. Gaussian): assume the density function is Gaussian and fit the parameters accordingly.
Mixture model: assume the data comes from a mixture, or combination, of Gaussians (or another parametric distribution) and again fit the parameters accordingly.

Gaussian Distribution (1)
For a scalar, the Gaussian or Normal distribution function is written as:
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
$\mu$ is the mean; $\sigma^2$ is the variance. It can be shown that:
$\mu = E[x]$;  $\sigma^2 = E[(x-\mu)^2]$
In d dimensions, the multivariate Gaussian density function is:
$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$
$\boldsymbol{\mu}$ is the d-dimensional mean vector; $\Sigma$ is the $d \times d$ covariance matrix.
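As an illustrative sketch (not from the original slides), the multivariate density can be evaluated directly from this formula and checked against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density for mean mu and covariance Sigma."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree
```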

Gaussian Distribution (2)
$\boldsymbol{\mu}$ and $\Sigma$ satisfy:
$\boldsymbol{\mu} = E[\mathbf{x}]$
$\Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T]$
Note that $\Sigma$ is a symmetric matrix. Thus the multivariate Gaussian has $d + d(d+1)/2 = d(d+3)/2$ parameters.
Note that $\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is sometimes called the (squared) Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol{\mu}$.

Some Properties of the Gaussian Distribution
Straightforward analytical properties: it is possible to obtain many useful results explicitly.
Central Limit Theorem: the sum of independent and identically distributed random variables will tend to a Gaussian distribution.
Marginal densities (obtained by integrating out some variables) of a Gaussian are also Gaussian.
Conditional densities (obtained by holding some variables fixed) of a Gaussian are also Gaussian.
Decision boundaries between classes are quadratic. If the covariance matrices of all classes are constrained to be equal, then the decision boundaries are linear: linear discriminants.
(See the level 2 pattern processing notes, or Bishop, for more details.)

Gaussian Classifier
$p(\mathbf{x} | C_k; \boldsymbol{\mu}_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k) \right)$
Each class $C_k$ is parameterised by a mean vector $\boldsymbol{\mu}_k$ and a covariance matrix $\Sigma_k$. Possible constraints:
Grand covariance matrix: all classes have individual means and share the same covariance matrix.
Diagonal covariance matrix: assumes the components of the input vector are independent, so the off-diagonal terms are 0 (reduces the number of covariance parameters from $d(d+1)/2$ to $d$).
Spherical covariance matrix: input components are independent and have equal variances (reduces to 1 covariance parameter): $\Sigma_k = \sigma_k^2 I$
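As a minimal sketch (data and names are illustrative, not from the notes) of how these constraints affect the covariance estimate for a single class, given an N x d data matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))      # illustrative N x d data for one class
d = X.shape[1]

mu = X.mean(axis=0)
diff = X - mu

full_cov = diff.T @ diff / len(X)                       # d(d+1)/2 covariance parameters
diag_cov = np.diag(diff.var(axis=0))                    # d parameters: off-diagonals forced to 0
spherical_cov = diff.var(axis=0).mean() * np.eye(d)     # 1 parameter: sigma^2 * I
```

A grand (shared) covariance matrix would instead pool the centred `diff` vectors from all classes before forming a single covariance estimate.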

Maximum Likelihood Estimation (1)
Given a density function $p(\mathbf{x} | \theta)$, where $\theta$ represents the parameters ($\boldsymbol{\mu}$ and $\Sigma$ for a Gaussian), the inference problem is to estimate the parameters $\theta$ given training data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.
Define a likelihood function $L(\theta)$ (assuming the data points are drawn independently):
$L(\theta) = p(X | \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n | \theta)$
Maximum likelihood estimation (MLE) aims to adjust $\theta$ so as to maximise the likelihood of generating the training data (i.e. maximise $L(\theta)$ with respect to $\theta$ given the training data).
Negative log likelihood E:
$E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(\mathbf{x}_n | \theta)$
We interpret the negative log likelihood as an error function.

MLE (2)
Maximising $L(\theta)$ (or minimising E) requires finding where the derivative is zero. This normally requires an iterative procedure, but it can be done analytically for a Gaussian.
It is possible to show that the mean vector $\hat{\boldsymbol{\mu}}$ and covariance matrix $\hat{\Sigma}$ that maximise the likelihood given the training data are:
$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$
$\hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \hat{\boldsymbol{\mu}})(\mathbf{x}_n - \hat{\boldsymbol{\mu}})^T$
This is intuitive, since the mean is estimated by the sample mean and the covariance by the sample covariance.
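A minimal Python sketch (illustrative, not from the slides) of these closed-form estimates for data stored as an N x d array:

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood mean and covariance of the rows of X (N x d)."""
    mu_hat = X.mean(axis=0)                  # sample mean
    diff = X - mu_hat
    Sigma_hat = diff.T @ diff / len(X)       # sample covariance (1/N, not 1/(N-1))
    return mu_hat, Sigma_hat

# Illustrative usage with synthetic data
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=5000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat)
print(Sigma_hat)
```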

Example (1)
A pattern recognition problem has two classes, S and T, which are assumed to follow a Gaussian distribution. Some labelled observations are available for each class, detailed in the table below:
Class S: 10  8  10  10  11  11
Class T: 12  9  15  10  13  13
Using the above data, estimate the parameters of the pdf for each class. Sketch the pdf for each class.


Example (2)
The following unlabelled data points are available:
$x_1 = 10$, $x_2 = 11$, $x_3 = 6$
To which class should each of the data points be assigned? (Assume the two classes have equal prior probabilities.)
Now assume that the two classes do not have equal prior probabilities; in fact:
$P(S) = 0.3$, $P(T) = 0.7$
Including this prior information, to which class should each of the above test data points ($x_1$, $x_2$, $x_3$) now be assigned?
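To make the example concrete, here is a small Python sketch (illustrative; the helper names `fit` and `gauss` are not from the notes) that fits a univariate Gaussian to each class by maximum likelihood and compares the unnormalised posteriors for the three test points under both sets of priors:

```python
import numpy as np

S = np.array([10, 8, 10, 10, 11, 11], dtype=float)
T = np.array([12, 9, 15, 10, 13, 13], dtype=float)

def fit(data):
    # ML estimates: sample mean and sample variance (dividing by N)
    return data.mean(), data.var()

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

(mu_S, var_S), (mu_T, var_T) = fit(S), fit(T)   # mu_S = 10, var_S = 1; mu_T = 12, var_T = 4

for p_S, p_T in [(0.5, 0.5), (0.3, 0.7)]:
    for x in [10.0, 11.0, 6.0]:
        post_S = gauss(x, mu_S, var_S) * p_S    # proportional to P(S | x)
        post_T = gauss(x, mu_T, var_T) * p_T    # proportional to P(T | x)
        print(f"P(S)={p_S}, x={x}: assign to {'S' if post_S > post_T else 'T'}")
```

Running this gives: with equal priors, $x_1$ and $x_2$ are assigned to class S and $x_3$ to class T; with $P(S) = 0.3$ and $P(T) = 0.7$, the decision for $x_2 = 11$ flips to class T.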


Another Example (1)
Consider the following data:
Length: 38  44  41  36  47  38  38  42  39  45  39  45
Using the above data, estimate the parameters of the pdf for this data. Sketch the pdf.
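As a quick numerical check (illustrative, not part of the slides), the maximum likelihood estimates for this data are a mean of 41 and a variance of 11.5:

```python
import numpy as np

length = np.array([38, 44, 41, 36, 47, 38, 38, 42, 39, 45, 39, 45], dtype=float)
print(length.mean())  # 41.0
print(length.var())   # 11.5 (maximum likelihood estimate, dividing by N)
```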

Another Example (2)
[Figure: plot over the data range 36 to 47, vertical axis 0 to 4]

Another Example (3)
[Figure: plot over the data range 36 to 47, vertical axis 0 to 4]

Another Example (4)
[Figure: plot over the data range 36 to 47, vertical axis 0 to 4]

Mixture Models (1)
Gaussian models assume that the density has a single mode, but we often want multi-modal densities.
We can have as many modes as we like by combining a set of component densities $p(\mathbf{x} | j)$:
$p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x} | j) P(j)$
This is an M-component mixture model. The coefficients $P(j)$ are called the mixing parameters.
This is also a generative model: to generate a data point from a mixture distribution, first choose a mixture component j with probability P(j), then generate a data point from the corresponding component density $p(\mathbf{x} | j)$.
Given enough components, mixture densities can approximate any continuous density to arbitrary accuracy.
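The generative view translates directly into code. Here is a minimal sketch (the parameter values are illustrative) that samples from a three-component univariate Gaussian mixture exactly as described, first choosing a component j with probability P(j) and then sampling from p(x | j):

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.5, 0.3, 0.2])        # mixing parameters P(j)
mu = np.array([0.0, 5.0, 10.0])      # component means
sigma = np.array([1.0, 0.5, 2.0])    # component standard deviations

def sample_mixture(n):
    j = rng.choice(len(P), size=n, p=P)            # choose component j with probability P(j)
    return rng.normal(loc=mu[j], scale=sigma[j])   # then sample from p(x | j)

x = sample_mixture(100_000)
print(x.mean())   # close to sum_j P(j) * mu_j = 3.5
```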

Mixture Models (2)
The mixture component j is missing, or hidden, data: given a data point, we do not know which component was responsible for generating it.
But we can write down the posterior probability of a mixture component given a data point, using Bayes' theorem:
$P(j | \mathbf{x}) = \frac{p(\mathbf{x} | j) P(j)}{p(\mathbf{x})}$
We shall mainly consider mixture models with Gaussian components and spherical covariances ($\Sigma_j = \sigma_j^2 I$):
$p(\mathbf{x} | j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\left( -\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2} \right)$

Network Diagram of a Mixture Model
[Figure: network diagram showing the input variables feeding M component densities p(x | 1), ..., p(x | M), which are combined with mixing weights P(1), ..., P(M) to produce the output p(x)]

Mixture Model: MLE
We estimate the mixture model parameters using maximum likelihood. The negative log likelihood is:
$E = -\ln L = -\sum_{n=1}^{N} \ln p(\mathbf{x}_n) = -\sum_{n=1}^{N} \ln\left( \sum_{j=1}^{M} p(\mathbf{x}_n | j) P(j) \right)$
If we knew which mixture component was responsible for generating each training data point, then maximum likelihood estimation would be straightforward: the same as for a Gaussian classifier, with each mixture component corresponding to a class. In that case each training data point $\mathbf{x}_n$ is labelled with the mixture component that generated it, $c_n$.

Labelled Component Case: MLE (1)
Let $\hat{\boldsymbol{\mu}}_j$ and $\hat{\Sigma}_j$ be the parameters of mixture component j:
$\hat{\boldsymbol{\mu}}_j = \frac{1}{N_j} \sum_{n=1}^{N} \delta_{j c_n} \mathbf{x}_n$
$\hat{\Sigma}_j = \frac{1}{N_j} \sum_{n=1}^{N} \delta_{j c_n} (\mathbf{x}_n - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_j)^T$
where $\delta_{j c_n}$ is the Kronecker delta and $N_j$ is the number of samples generated from component j:
$\delta_{ab} = 1$ if $a = b$, $\delta_{ab} = 0$ if $a \neq b$, and $N_j = \sum_{n=1}^{N} \delta_{j c_n}$
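A brief sketch (illustrative; `labels[n]` plays the role of $c_n$) of these labelled-component estimates, using boolean masks in place of the Kronecker delta:

```python
import numpy as np

def labelled_mixture_mle(X, labels, M):
    """ML estimates of each component's parameters when the generating component is known."""
    N = len(X)
    params = []
    for j in range(M):
        mask = labels == j                  # delta_{j c_n} as a boolean mask
        Nj = mask.sum()
        mu_j = X[mask].mean(axis=0)
        diff = X[mask] - mu_j
        Sigma_j = diff.T @ diff / Nj
        P_j = Nj / N                        # mixing parameter (see the next slide)
        params.append((P_j, mu_j, Sigma_j))
    return params
```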

Labelled Component Case: MLE (2)
Estimate the mixing parameter for component j, $\hat{P}(j)$, as the proportion of the training data generated by this component:
$\hat{P}(j) = \frac{1}{N} \sum_{n=1}^{N} \delta_{j c_n} = \frac{N_j}{N}$
Note that this enforces the sum-to-one constraint: $\sum_{j=1}^{M} \hat{P}(j) = 1$

General Case
The power of a mixture model lies in the fact that the mixture component which generated each data point is missing data. Minimising E in this case is not straightforward:
Iterative optimisation: unlike the labelled component case, there is no closed-form solution for the parameters, so numerical methods are required.
Singular solutions: it is possible for the likelihood to go to infinity (e.g. if there are the same number of components as data points, the mean of each component falls on a data point, and $\sigma_j \to 0$).
Local minima: the iterative process may converge on a locally optimal solution that is not globally optimal.

The EM Algorithm (1)
An iterative scheme for finding parameter values that minimise E.
Start with a guess for the parameter values (the old parameter values), then update them to obtain a revised estimate (the new parameter values).
This process is then iterated: the power of the algorithm is that it can be proven never to increase E at any iteration, decreasing it until a local minimum is found.
The algorithm for estimating Gaussian mixture models is a special case of a more general algorithm: the EM (Expectation-Maximization) algorithm.

The EM Algorithm (2)
The basic idea of the EM algorithm is to use the posterior probability (or responsibility) $P(j | \mathbf{x}; \theta)$ of mixture component j being responsible for data point $\mathbf{x}$. The responsibilities also depend on the current parameter estimates.
Each iteration of the EM algorithm has two steps:
E step: re-estimate the responsibilities $P(j | \mathbf{x}; \theta)$ given the current parameter estimates.
M step: given the responsibility estimates, re-estimate the parameters $\theta = (\boldsymbol{\mu}, \sigma, P(j))$.

E Step: Estimating the Responsibilities
On iteration (t+1) we use the parameter values estimated at iteration t to estimate the responsibilities $P^{(t+1)}(j | \mathbf{x}; \theta^t)$:
$P^{(t+1)}(j | \mathbf{x}; \theta^t) = \frac{p(\mathbf{x} | j; \boldsymbol{\mu}^t, \sigma^t)\, P^t(j)}{p(\mathbf{x}; \theta^t)} = \frac{p(\mathbf{x} | j; \boldsymbol{\mu}^t, \sigma^t)\, P^t(j)}{\sum_{i=1}^{M} p(\mathbf{x} | i; \boldsymbol{\mu}^t, \sigma^t)\, P^t(i)}$

M Step: Estimating the Parameters
The M step of the EM algorithm re-estimates the parameters so that they maximise the joint likelihood $p(X, c | \theta)$ of generating the data and the component sequence:
$p(X, c | \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n, c_n | \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n | \boldsymbol{\mu}; \sigma)\, P(c_n)$
It turns out that this can be maximised using an auxiliary function:
$Q(\theta | \theta^t) = \sum_{j=1}^{M} \sum_{n=1}^{N} P^{(t+1)}(j | \mathbf{x}_n; \theta^t) \ln p(\mathbf{x}_n, c_n = j | \theta)$
(See the further reading for details on this.)

Update Equations
This results in the following equations for updating the parameters:
$P^{(t+1)}(j) = \frac{1}{N} \sum_{n=1}^{N} P^{(t+1)}(j | \mathbf{x}_n; \theta^t)$
$\boldsymbol{\mu}_j^{(t+1)} = \frac{\sum_{n=1}^{N} P^{(t+1)}(j | \mathbf{x}_n; \theta^t)\, \mathbf{x}_n}{\sum_{n=1}^{N} P^{(t+1)}(j | \mathbf{x}_n; \theta^t)}$
$\left(\sigma_j^{(t+1)}\right)^2 = \frac{1}{d} \cdot \frac{\sum_{n=1}^{N} P^{(t+1)}(j | \mathbf{x}_n; \theta^t)\, \|\mathbf{x}_n - \boldsymbol{\mu}_j^{(t+1)}\|^2}{\sum_{n=1}^{N} P^{(t+1)}(j | \mathbf{x}_n; \theta^t)}$
This is an intuitive result, since it may be viewed as a soft version of the case where the component label sequence was known, with the posterior probability (responsibility) taking care of the uncertainty over which component was responsible for generating each data point.
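Putting the E step and these update equations together, here is a compact Python sketch of EM for a mixture of spherical Gaussians (illustrative code, not from the course; a practical implementation would work with log probabilities and guard against the singular solutions mentioned earlier):

```python
import numpy as np

def spherical_gauss(X, mu, var):
    """p(x | j) for a spherical Gaussian with mean mu and covariance var * I."""
    d = X.shape[1]
    sq = ((X - mu) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def em_gmm(X, M, n_iter=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, M, replace=False)]    # initialise means at random data points
    var = np.full(M, X.var())                  # one variance per component
    P = np.full(M, 1.0 / M)                    # mixing parameters P(j)

    for _ in range(n_iter):
        # E step: responsibilities P(j | x_n), shape (N, M)
        lik = np.stack([P[j] * spherical_gauss(X, mu[j], var[j]) for j in range(M)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)

        # M step: update equations from the slide above
        Nj = resp.sum(axis=0)
        P = Nj / N
        mu = (resp.T @ X) / Nj[:, None]
        for j in range(M):
            sq = ((X - mu[j]) ** 2).sum(axis=1)
            var[j] = (resp[:, j] @ sq) / (d * Nj[j])
    return P, mu, var

# Example: fit a 2-component mixture to the 1-D "length" data from the earlier example
# X = np.array([38, 44, 41, 36, 47, 38, 38, 42, 39, 45, 39, 45], dtype=float).reshape(-1, 1)
# print(em_gmm(X, M=2))
```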

Mixture Models: Summary
In mixture models the component which generates each data point is hidden.
Direct maximisation of the likelihood has no closed-form solution; however, an iterative process may be applied.
In practice, initialisation of the parameters is important to avoid singularities, etc.
Note that these approaches do not estimate the number of mixture components M: this must be pre-specified.
See the practical exercises for further examples.

Mixture Models: Further Reading
Bishop, section 2.6.
Yoshi Gotoh has written a clear (but technical) review at ftp://ftp.dcs.shef.ac.uk/share/spandh/pubs/yg/em.ps.gz
Mixture models have been used for many different applications, for example: speech recognition, image processing, financial modelling, astronomical modelling, biomedical modelling.

Summary
Bayes' Theorem enables class-conditional probability density functions to be used for classification (together with a prior).
Density functions may be estimated from training data by maximising the likelihood of the data given the parameters.
Gaussian density functions are convenient and mathematically tractable, but not suitable for every situation.
Mixture models are more general and are able to model any density function given enough components.
Maximum likelihood estimation of mixture models requires the use of an iterative algorithm: the EM algorithm.
Next Lecture: Single Layer Networks