MIXTURE MODELS AND EM
Last updated: November 6, 2012
Credits
Some of these slides were sourced and/or modified from:
Christopher Bishop, Microsoft UK
Simon Prince, University College London
Sergios Theodoridis, University of Athens & Konstantinos Koutroumbas, National Observatory of Athens
Mixture Models and EM: Topics
1. Intuition
2. Equations
3. Examples
4. Applications
Motivation
What do we do if a distribution is not well approximated by a standard parametric model?
Mixtures of Gaussians
Combine simple models into a complex model:

    p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k)

where \mathcal{N}(x; \mu_k, \Sigma_k) is the kth component and \pi_k is its mixing coefficient. [Figure: a mixture with K = 3 components]
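To make the combination concrete, here is a minimal MATLAB sketch (illustrative, not from the slides; all parameter values are assumptions) that evaluates a K = 3 mixture density as a weighted sum of Gaussian components:

    % Sketch: evaluate a K = 3 univariate Gaussian mixture density.
    pis    = [0.5 0.3 0.2];     % mixing coefficients (sum to 1) -- assumed values
    mus    = [-2  0   3 ];      % component means -- assumed values
    sigmas = [0.5 1   0.7];     % component standard deviations -- assumed values

    xs = linspace(-5, 6, 500);
    px = zeros(size(xs));
    for k = 1:3
        px = px + pis(k) * normpdf(xs, mus(k), sigmas(k));  % add weighted component
    end
    plot(xs, px);   % the complex density built from three simple Gaussians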
Mixtures of Gaussians
[Figure]
Mixtures of Gaussians
Determining the parameters \mu, \Sigma and \pi using maximum log likelihood:

    \ln p(X) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) \right]

This is the log of a sum; there is no closed-form maximum. Solution: use standard iterative numerical optimization methods, or the expectation maximization (EM) algorithm.
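For reference, a small MATLAB sketch of this log likelihood (an assumption-laden illustration, not the lecture's code: X is taken to be N-by-D data, and alphas, mu, S are assumed to hold the mixture parameters in the same layout as the MATLAB example later in the deck):

    % Sketch: log likelihood of a GMM -- note the sum over k sits INSIDE the log.
    L = 0;
    for n = 1:size(X,1)
        pxn = 0;
        for k = 1:numel(alphas)
            pxn = pxn + alphas(k) * mvnpdf(X(n,:), mu(k,:), squeeze(S(k,:,:)));
        end
        L = L + log(pxn);   % log of a sum: no closed-form maximum in the parameters
    end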
Hidden Variable Interpretation
Assumptions: for each training observation x_n there is a hidden variable z_n, where z_n \in \{1, \ldots, K\} represents which Gaussian x_n came from.
With this interpretation, we have:

    p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) = \sum_{k=1}^{K} p(x_n \mid z_n = k) \, p(z_n = k)

where p(z_n = k) = \pi_k and p(x_n \mid z_n = k) = \mathcal{N}(x_n; \mu_k, \Sigma_k). [Figure: x and its component labels z = 1, 2, 3]
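This generative reading suggests ancestral sampling: draw z_n first, then x_n from the selected Gaussian. A minimal MATLAB sketch with illustrative parameter values (not from the slides; randsample is from the Statistics Toolbox):

    % Sketch of the hidden-variable (generative) view:
    % first draw z_n with p(z_n = k) = pi_k, then x_n ~ N(mu_k, sigma_k^2).
    pis    = [0.5 0.3 0.2];                 % assumed mixing coefficients
    mus    = [-2  0   3 ];                  % assumed component means
    sigmas = [0.5 1   0.7];                 % assumed standard deviations
    N      = 1000;

    z = randsample(1:3, N, true, pis);      % hidden component labels z_n
    z = z(:)';                              % ensure a row vector
    x = mus(z) + sigmas(z) .* randn(1, N);  % x_n | z_n = k  ~  N(mu_k, sigma_k^2)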
Hidden Variable Interpretation

    p(x_n \mid \pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) = \sum_{k=1}^{K} p(z_n = k) \, p(x_n \mid z_n = k)

[Figure: x and its component labels z = 1, 2, 3]
Hidden Variable Interpretation
OUR GOAL: to estimate the parameters θ:
- the means \mu_k
- the covariances \Sigma_k
- the weights (mixing coefficients) \pi_k
for all K components of the model.
THING TO NOTICE: if we knew the hidden variables z_n for the training data, it would be easy to estimate the parameters θ: just estimate the individual Gaussians separately.
Hidden Variable Interpretation
THING TO NOTICE #2: if we knew the parameters θ, it would be very easy to estimate the posterior distribution over each hidden variable z_n using Bayes' rule:

    p(z = k \mid x, \theta) = \frac{p(x \mid z = k, \theta) \, \pi_k}{\sum_{k'=1}^{K} p(x \mid z = k', \theta) \, \pi_{k'}}

[Figure: the posterior p(z | x) over z = 1, 2, 3]
Expectation Maximization
Chicken-and-egg problem:
- could find z_1, \ldots, z_N if we knew θ
- could find θ if we knew z_1, \ldots, z_N
Solution: the Expectation Maximization (EM) algorithm (Dempster, Laird and Rubin 1977).
EM for Gaussian mixtures can be viewed as alternation between two steps:
1. Expectation Step (E-Step): for fixed θ, find the posterior distribution over the hidden variables z_1, \ldots, z_N (the responsibilities).
2. Maximization Step (M-Step): use these posterior probabilities to re-estimate θ.
K-Means: A Poor Man's Approximation
The K-means algorithm is very similar to EM, except that:
- We assume the mixing coefficients are the same for each component.
- We assume each component is isotropic with common variance.
- We do not admit any uncertainty about the responsibilities at each iteration.
K-Means: A Poor Man's Approximation
K-means consists of the following steps:
1. Randomly select the mean for each component.
2. Associate each input with the nearest component mean.
3. Recompute the means based upon the new association.
Steps 2 and 3 are alternated until convergence.
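A minimal MATLAB sketch of these three steps (illustrative assumptions: pdist2 from the Statistics Toolbox, a fixed iteration cap in place of a convergence test, and a crude guard for empty clusters):

    % Sketch: K-means following steps 1-3 above. X is N-by-D data.
    K  = 3;
    mu = X(randperm(size(X,1), K), :);           % 1. random initial means
    for iter = 1:100
        [~, idx] = min(pdist2(X, mu), [], 2);    % 2. assign to nearest mean
        for k = 1:K
            if any(idx == k)                     % skip empty clusters
                mu(k,:) = mean(X(idx == k, :), 1);  % 3. recompute the means
            end
        end
    end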
Example
[Figure: nine panels (a)–(i) showing successive K-means iterations]
Mixture Models and EM: Topics
1. Intuition
2. Equations
3. Examples
4. Applications
Responsibilities
The responsibility r_{nk} of component k for observation x_n is the posterior probability that component k generated x_n:

    r_{nk} \equiv P(z_n = k \mid x_n; \theta)

Responsibilities update equation:

    r_{nk} = \frac{p(x_n \mid z_n = k; \theta^{(t)}) \, \pi_k^{(t)}}{\sum_{k'=1}^{K} p(x_n \mid z_n = k'; \theta^{(t)}) \, \pi_{k'}^{(t)}}

Let N_k be the effective number of observations explained by component k:

    N_k \equiv \sum_{n=1}^{N} r_{nk}

[Figure: responsibilities after L = 5 iterations, panel (e)]
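In MATLAB this update might look as follows (a sketch consistent with the equation above, not the lecture's code; X is assumed to be N-by-D data, and alphas, mu, S are assumed to hold the current parameters in the same layout as the MATLAB example later in the deck):

    % Sketch: E-step -- compute the N-by-K responsibility matrix and N_k.
    r = zeros(N, K);
    for k = 1:K
        r(:,k) = alphas(k) * mvnpdf(X, mu(k,:), squeeze(S(k,:,:))); % pi_k N(x_n; mu_k, Sigma_k)
    end
    r  = r ./ repmat(sum(r,2), 1, K);   % normalize rows: r_nk = P(z_n = k | x_n)
    Nk = sum(r, 1);                     % effective number of points per component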
Example: Mixture of Gaussians
Differentiating the log likelihood does not allow us to isolate the parameters analytically. However, it does generate equations that can be used for iterative estimation of these parameters by EM.
Example: Mixture of Gaussians
Parameter update equations:

    \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, x_n

    \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^T

    \pi_k = \frac{1}{N} \sum_{n=1}^{N} r_{nk} = \frac{N_k}{N}
Mixture of (Multivariate) Gaussians
Here we will derive the formula for the mixing coefficients \pi_k (a sketch follows below).
Textbook Problem 11.2: derive the EM formulae for the mean \mu_k and covariance \Sigma_k. Please do this at home!
When solving for \Sigma_k, recall that:

    \frac{d}{dA}|A| = |A| \, A^{-T} \quad \text{and} \quad \frac{d}{dA} x^T A x = x x^T
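A sketch of that derivation for \pi_k (the standard Lagrange-multiplier argument, stated here using the responsibilities r_{nk} and N_k defined earlier; not reproduced from the slides): maximize the expected complete-data log likelihood subject to the constraint that the mixing coefficients sum to one:

    \frac{\partial}{\partial \pi_k}\left[\sum_{n=1}^{N}\sum_{j=1}^{K} r_{nj}\ln\pi_j + \lambda\Big(\sum_{j=1}^{K}\pi_j - 1\Big)\right] = \frac{N_k}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k = -\frac{N_k}{\lambda}

Summing over k and using \sum_k \pi_k = 1 and \sum_k N_k = N gives \lambda = -N, hence \pi_k = N_k / N, matching the update equation above.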
Initialization
EM typically takes longer to converge than K-means, and the computations are more elaborate. It is thus common to use the solution to K-means to initialize the parameters for EM.
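A sketch of such an initialization (illustrative assumptions: kmeans from the Statistics Toolbox; variable names follow the MATLAB example below):

    % Sketch: initialize EM parameters from a K-means clustering of X (N-by-D).
    D = size(X, 2);
    [idx, mu] = kmeans(X, K);                % cluster labels and K-by-D means
    S = zeros(K, D, D);  alphas = zeros(1, K);
    for k = 1:K
        Xk = X(idx == k, :);
        S(k,:,:)  = cov(Xk);                 % per-cluster covariance
        alphas(k) = size(Xk, 1) / size(X, 1);   % cluster fraction as mixing weight
    end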
Convergence
EM is guaranteed to monotonically increase the likelihood. However, since in general the likelihood is nonconvex, we are not guaranteed to find the globally optimal parameters.
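In practice this monotonicity gives a natural stopping rule. A sketch (assumptions: loglik is a hypothetical helper standing in for the log-likelihood computation shown earlier, and tol and maxIter are assumed values):

    % Sketch: stop EM when the log likelihood stops increasing appreciably.
    tol     = 1e-6;
    maxIter = 200;
    Lold    = -Inf;
    for iter = 1:maxIter
        % ... E-step and M-step updates go here ...
        Lnew = loglik(X, alphas, mu, S);   % hypothetical helper; never decreases under EM
        if Lnew - Lold < tol
            break
        end
        Lold = Lnew;
    end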
End of Lecture: October 24, 2012
Mixture Models and EM: Topics
1. Intuition
2. Equations
3. Examples
4. Applications
Bivariate Gaussian Mixture Example
[Figure: (a) samples from the joint p(x_k, j_k | Θ); (b) samples from the marginal p(x_k | Θ); (c) responsibilities P(j_k | x_k; Θ^{(t)})]
2-Component Bivariate MATLAB Example
[Figure: scatter plot of assignment grade (%) vs. exam grade (%), CSE 211Z 21W]
2-Component Bivariate MATLAB Example 28 %update responsibilities for i=1:k end p(:,i)=alphas(i).*mvnpdf(x,mu(i,:),squeeze(s(i,:,:))); p=p./repmat((sum(p,2)),1,k); %update parameters for i=1:k end Nk=sum(p(:,i)); mu(i,:)=p(:,i)'*x/nk; dev=x-repmat(mu(i,:),n,1); S(i,:,:)=(repmat(p(:,i),1,D).*dev)'*dev/Nk; alphas(i)=nk/n; Exam grade (%) 1 8 6 4 2 CSE 211Z 21W 2 4 6 8 1 Assignment grade (%)
Old Faithful Example
[Figure: EM fits to the Old Faithful data (duration of eruption (min) vs. time to next eruption (min)), panels (a)–(f), after L = 1, 2, 5, … iterations]
Face Detection Example: 2 Components
[Figure: prior, mean, and standard deviation for each component of the face model (priors 0.4999 and 0.5001) and the non-face model (priors 0.4675 and 0.5325)]
Each component is still assumed to have diagonal covariance. The face model and non-face model have each divided the data into two clusters. In each case, these clusters have roughly equal weights. The primary thing that these components seem to have captured is the photometric (luminance) variation. Note that the standard deviations have become smaller than for the single Gaussian model, as any given data point is likely to be close to one mean or the other.
Results for MOG 2 Model
[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the MOG 2, Diagonal, and Uniform models]
Performance improves relative to a single Gaussian model, although it is not a dramatic improvement. We have a better description of the data likelihood.
MOG 5 Components
[Figure: prior, mean, and standard deviation for each of the five components of the face and non-face models]
MOG 10 Components
[Figure: prior, mean, and standard deviation for each of the ten components of the face and non-face models]
Results for MOG 10 Model
[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the MOG 10, MOG 2, Diagonal, and Uniform models]
Performance improves slightly more, particularly at low false alarm rates.
Background Subtraction
[Figure: test image and desired segmentation]
GOAL: (i) learn a background model; (ii) use this to segment regions where the background has been occluded.
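One simple way such a model might be used for segmentation (a sketch under assumptions, not the lecture's implementation: bgdens is a hypothetical helper returning the learned background mixture density at a pixel value, and tau is an assumed threshold) is to label a pixel foreground when its value is unlikely under the background model:

    % Sketch: per-pixel background test under a learned mixture model.
    tau  = 1e-4;                             % assumed density threshold
    [H, W, ~] = size(I);                     % I is an H-by-W(-by-3) image
    mask = false(H, W);
    for r = 1:H
        for c = 1:W
            v = double(squeeze(I(r,c,:)))';  % pixel value (e.g., RGB)
            mask(r,c) = bgdens(v) < tau;     % low likelihood => foreground
        end
    end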
What if the scene isn't static?
A Gaussian is no longer a good fit to the data, and it is not obvious exactly what probability model would fit better.
Background Mixture Model
[Figure]
Background Subtraction Example
[Figure]