Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Mixture Models and EM 1 Mixture Models and EM 2 Neural Networks 1 Neural Networks 2 Principal Component Analysis Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Sequential Data 1 Sequential Data 2 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 824
Part XI Mixture Models and EM 1 419of 824
Marginalisation Sum rule p(a, B) = C p(a, B, C) Product rule p(a, B) = p(a B)p(B) Why do we optimize the log likelihood? 42of 824
Strategy in this course Estimate best predictor = training = learning Given data (x 1, y 1 ),..., (x n, y n ), find a predictor f w ( ). 1 Identify the type of input x and output y data 2 Propose a (linear) mathematical model for f w 3 Design an objective function or likelihood 4 Calculate the optimal parameter (w) 5 Model uncertainty using the Bayesian approach 6 Implement and compute (the algorithm in python) 7 Interpret and diagnose results We will study unsupervised learning this week 421of 824
Mixture Models and EM Complex marginal distributions over observed variables can be expressed via more tractable joint distributions over the expanded space of observed and latent variables. Mixture Models can also be used to cluster data. General technique for finding maximum likelihood estimators in latent variable models: expectation-maximisation (EM) algorithm. 422of 824
Example - Wallaby Distribution Introduced very recently to show....3.25.2.15.1.5. -2-15 -1-5 5 1 15 2 423of 824
Example - Wallaby Distribution... that already a mixture of three Gaussian can be fun. p(x) = 3 1 N (x 5,.5) + 3 4 N (x 9, 2) + N (x 2, 2) 1 1.3.25.2.15.1.5. -2-15 -1-5 5 1 15 2 424of 824
Example - Wallaby Distribution Use µ, σ as latent variables and define a distribution 3 1 if (µ, σ) = (5,.5) 3 p(µ, σ) = 1 if (µ, σ) = (9, 2) 4 1 if (µ, σ) = (2, 2) otherwise. p(x) = = p(x, µ, σ) dµ dσ p(x µ, σ) p(µ, σ) dµ dσ = 3 1 N (x 5,.5) + 3 4 N (x 9, 2) + N (x 2, 2) 1 1.3.25.2.15.1.5. -2-15 -1-5 5 1 15 2 425of 824
Given a set of data {x 1,..., x N } where x n R D, n = 1,..., N. Goal: Partition the data into K clusters. 426of 824
Given a set of data {x 1,..., x N } where x n R D, n = 1,..., N. Goal: Partition the data into K clusters. Each cluster contains points close to each other. Introduce a prototype µ k R D for each cluster. Goal: Find 1 a set prototypes µ k, k = 1,..., K, each representing a different cluster. 2 an assignment of each data point to exactly one cluster. 427of 824
- The Algorithm Start with arbitrary chosen prototypes µ k, k = 1,..., K. 1 Assign each data point to the closest prototype. 2 Calculate new prototypes as the mean of all data points assigned to each of them. In the following, we will formalise this introducing a notation which will be useful later. 428of 824
- Notation Binary indicator variables { 1, if data point x n belongs to cluster k r nk =, otherwise using the 1-of-K coding scheme. Define a distortion measure N K J = r nk x n µ k 2 n=1 k=1 Find the values for {r nk } and {µ k } so as to minimise J. 429of 824
- Notation Find the values for {r nk } and {µ k } so as to minimise J. But {r nk } depends on {µ k }, and {µ k } depends on {r nk }. 43of 824
- Notation Find the values for {r nk } and {µ k } so as to minimise J. But {r nk } depends on {µ k }, and {µ k } depends on {r nk }. Iterate until no further change 1 Minimise J w.r.t. r nk while keeping {µ k } fixed, r nk = { 1, if k = arg min j x n µ j 2, otherwise. n = 1,..., N Expectation step 2 Minimise J w.r.t. {µ k } while keeping r nk fixed, = 2 µ k = N r nk(x n µ k ) n=1 N n=1 rnkxn N n=1 rnk Maximisation step 431of 824
- Example 2 (a) 2 (b) 2 (c) 2 2 2 2 2 2 2 2 2 2 (d) 2 (e) 2 (f) 2 2 2 2 2 2 2 2 2 432of 824
- Example 2 (d) 2 (e) 2 (f) 2 2 2 2 2 2 2 2 2 2 (g) 2 (h) 2 (i) 2 2 2 2 2 2 2 2 2 433of 824
- Cost Function 1 J 5 1 2 3 4 Cost function J after each E step (blue points) and M step (red points). 434of 824
- Notes Initial condition crucial for convergence. What happens, if at least one cluster centre is too far from all data points? Complex step: Finding the nearest neighbour. (Use triangle inequality; build K-D trees,...) Generalise to non-euclidean dissimilarity measures V(x n, µ k ) (called K-medoids algorithm), N K J = r nk V(x n, µ k ). n=1 k=1 Online stochastic algorithm 1 Draw data point x n and locate nearest prototype µ k. 2 Update only µ k using decreasing learning rate η n µ new k = µ old k + η n(x n µ old k ). 435of 824
- Image Segmentation Segment an image into regions of reasonable homogeneous appearance. Each pixel is a point in R 3 (red, blue, green). (Note that the pixel intensities are bounded in the range [, 1] and therefore this space is strictly speaking not Euclidean). Run K-means on all points of the image until convergence. Replace all pixels with the corresponding mean µ k. Results in an image with a palette only K different colours. There are much better approaches to image segmentation (but it is an active research topic), this here serves only to illustrate K-means. 436of 824
Illustrating - Segmentation K = 2 K = 1 K = 3 Original image 437of 824
Illustrating - Segmentation 438of 824
- Compression Lossy data compression: accept some errors in the reconstruction as trade-off for higher compression. Apply K-means to the data. Store the code-book vectors µ k. Store the data in the form of references (labels) to the code-book. Each data point has a label in the range [1,..., K]. New data points are also compressed by finding the closest code-book vector and then storing only the label. This technique is also called vector quantisation. 439of 824
Illustrating - Compression K = 2 K = 3 4.2% 8.3% K = 1 Original image 16.7% 1 % 44of 824
Latent variable modeling We have already seen a mixture of two Gaussians for linear classification However in the clustering scenario, we do not observe the class membership Strategy (this is vague) We have a difficult distribution p(x) We introduce a new variable z to get p(x, z) Model p(x, z) = p(x z)p(z) with easy distributions 441of 824
A Gaussian mixture distribution is a linear superposition of Gaussians of the form p(x) = K π k N (x µ k, Σ k ). k=1 As p(x) dx = 1, if follows K k=1 π k = 1. Let us write this with the help of a latent variable z. Definition (Latent variables) Latent variables (as opposed to observable variables), are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed and directly measured. They are also sometimes called hidden variables, model parameters, or hypothetical variables. 442of 824
Let z {, 1} K and K k=1 z k = 1. In words, z is a K-dimensional vector in 1-of-K representation. There are exactly K different possible vectors z depending on which of the K entries is 1. Define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x z) as p(x, z) = p(z) p(x z) z x 443of 824
Set the marginal distribution to p(z k = 1) = π k where π k 1 together with K k=1 π k = 1. Because z uses 1-of-K coding, we can also write p(z) = K k=1 π zk k. Set the conditional distribution of x given a particular z to p(x z k = 1) = N (x µ k, Σ k ), or K p(x z) = N (x µ k, Σ k ) zk, k=1 444of 824
The marginal distribution over x is now found by summing the joint distribution over all possible states of z p(x) = p(z) p(x z) = z z K = π k N (x µ k, Σ k ) k=1 K k=1 π zk k K N (x µ k, Σ k ) zk k=1 The marginal distribution of x is a Gaussian mixture. For several observations x 1,..., x N we need one latent variable z n per observation. What have we gained? Can now work with the joint distribution p(x, z). Will lead to significant simplification later, especially for EM algorithm. 445of 824
Conditional probability of z given x by Bayes theorem γ(z k ) = p(z k = 1 x) = p(z k = 1) p(x z k = 1) K j=1 p(z j = 1) p(x z j = 1) = π k N (x µ k, Σ k ) K j=1 π j N (x µ j, Σ j ) γ(z k ) is the responsibility of component k to explain the observation x. 446of 824
- Ancestral Sampling Goal: Generate random samples distributed according to the mixture model. 1 Generate a sample ẑ from the distribution p(z). 2 Generate a value x from the conditional distribution p(x ẑ). Example: Mixture of 3 Gaussians, 5 points. 1 (a) 1 (b) 1 (c).5.5.5.5 1.5 1.5 1 Original states of z. Marginal p(x). (R, G, B) - colours mixed according to γ(z nk ). 447of 824
- Maximum Likelihood Given N data points, each of dimension D, we have the data matrix X R N D where each row contains one data point. Similarly, we have the matrix of latent variables Z R N K with rows z T n. Assume the data are drawn i.i.d., the distribution for the data can be represented by a graphical model. π z n x n µ Σ N 448of 824
- Maximum Likelihood The log of the likelihood function is then { N K } ln p(x π, µ, Σ) = ln π k N (x µ k, Σ k ) n=1 k=1 Significant problem: If a mean µ j sits directly on a data point x n then N (x n x n, σ 2 j I) = 1 1. (2π) 1/2 σ j Here we assumed Σ k = σk 2 I. But problem is general, just think of a main axis transformation for Σ k. Overfitting (in disguise) occuring again with the maximum likelihood approach. Use heuristics to detect this situation and reset the mean of the corresponding component of the mixture. 449of 824
- Maximum Likelihood A K component mixture has a total of K! equivalent solutions corresponding to the K! ways of assigning K sets of parameters to K solutions. Also called identifiability problem. Needs to be considered when the parameters discovered by a model are interpreted. Maximising the log likelihood of a Gaussian mixture is more complex then for a single Gaussian. Summation over all K components inside of the logarithm make it harder. Setting the derivatives of the log likelihood to zero does not longer result in a closed form. May use gradient-based optimisation. Or EM algorithm. Stay tuned. 45of 824