Mixture Models & EM
Nicholas Ruozzi, University of Texas at Dallas (based on the slides of Vibhav Gogate)
Previously
- We looked at k-means and hierarchical clustering as mechanisms for unsupervised learning
- k-means was simply a block coordinate descent scheme for a specific objective function
Today: how to learn probabilistic models for unsupervised learning problems
EM: Soft Clustering
- Clustering (e.g., k-means) typically assumes that each instance is given a hard assignment to exactly one cluster
- This does not allow uncertainty in class membership or for an instance to belong to more than one cluster
- Problematic because data points that lie roughly midway between cluster centers are still forced into a single cluster
- Soft clustering instead gives probabilities that an instance belongs to each of a set of clusters
Probabilistic Clustering
- Try a probabilistic model! Allows overlaps, clusters of different size, etc.
- Can tell a generative story for the data: p(x | y) p(y)
- Challenge: we need to estimate the model parameters without labeled y's (i.e., in the unsupervised setting)

Z    X1    X2
??   0.1   2.1
??   0.5  -1.1
??   0.0   3.0
??  -0.1  -2.0
??   0.2   1.5
Probabilistic Clustering
- Clusters of different shapes and sizes
- Clusters can overlap! (k-means doesn't allow this)
Finite Mixture Models
Given a dataset x^(1), ..., x^(N), a mixture model with parameters Θ = {λ_1, ..., λ_K, θ_1, ..., θ_K} is

p(x | Θ) = Σ_{y=1}^K λ_y p_y(x | θ_y)

where p_y(x | θ_y) is a mixture component from some family of probability distributions parameterized by θ_y, and the λ_y ≥ 0 with Σ_y λ_y = 1 are the mixture weights.
We can think of λ_y = p(Y = y | Θ) for some random variable Y that takes values in {1, ..., K}.
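As a concrete illustration (a minimal Python sketch, not part of the original slides; all parameter values are made up), a mixture density is just a weighted sum of component densities:

```python
import numpy as np

# Minimal sketch: evaluate a 1-D finite mixture density
# p(x | Theta) = sum_y lambda_y * p_y(x | theta_y), here with Gaussian components.
lambdas = np.array([0.3, 0.7])        # mixture weights, nonnegative, summing to 1
means   = np.array([-2.0, 1.5])       # theta_y for each component (means)
stds    = np.array([0.5, 1.0])        # theta_y for each component (standard deviations)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x):
    # Weighted sum over the K mixture components
    return sum(lam * gaussian_pdf(x, mu, s) for lam, mu, s in zip(lambdas, means, stds))

print(mixture_pdf(0.0))
```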
Multivariate Gaussian
A d-dimensional multivariate Gaussian distribution is defined by a d × d covariance matrix Σ and a mean vector μ:

p(x | Σ, μ) = (1 / √((2π)^d det Σ)) exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )

- The covariance matrix describes the degree to which pairs of variables vary together; the diagonal elements correspond to the variances of the individual variables
- The covariance matrix must be a symmetric positive definite matrix in order for the above to make sense
- Positive definite: all eigenvalues are positive and the matrix is invertible; this ensures that the quadratic form in the exponent is concave
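A minimal sketch of evaluating this density with NumPy, assuming Σ is symmetric positive definite (the function name and example values below are illustrative, not from the lecture):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Density of a d-dimensional Gaussian N(mu, Sigma) at the point x.

    Sigma must be symmetric positive definite; otherwise det(Sigma) <= 0
    or the inverse does not exist and the formula breaks down.
    """
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(exponent)

# Example with illustrative values
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_gaussian_pdf(np.array([0.5, -1.0]), mu, Sigma))
```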
Gaussian Mixture Models (GMMs)
We can define a GMM by choosing the k-th component of the mixture to be a Gaussian density with parameters θ_k = (μ_k, Σ_k):

p(x | Σ_k, μ_k) = (1 / √((2π)^d det Σ_k)) exp( -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) )

We could cluster by fitting a mixture of Gaussians to our data. How do we learn these kinds of models?
Learning Gaussian Parameters
MLE for the supervised univariate Gaussian:

μ_MLE = (1/N) Σ_i x^(i)
σ²_MLE = (1/N) Σ_i (x^(i) - μ_MLE)²

MLE for the supervised multivariate Gaussian:

μ_MLE = (1/N) Σ_i x^(i)
Σ_MLE = (1/N) Σ_i (x^(i) - μ_MLE)(x^(i) - μ_MLE)^T
Learning Gaussian Parameters
MLE for a supervised multivariate mixture of Gaussian distributions:

μ_k^MLE = (1/N_k) Σ_{i : y^(i) = k} x^(i)
Σ_k^MLE = (1/N_k) Σ_{i : y^(i) = k} (x^(i) - μ_k^MLE)(x^(i) - μ_k^MLE)^T

The sums are over the N_k observations that were generated by the k-th mixture component (this requires that we know which points were generated by which distribution!).
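A short sketch of these supervised MLE formulas in NumPy, assuming the component label of every point is observed (the data and variable names below are made up for illustration):

```python
import numpy as np

# Sketch: MLE for a mixture of Gaussians when the component label of every
# point is observed (the "supervised" case). Illustrative synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # N x d data matrix
y = rng.integers(0, 3, size=100)         # observed component labels in {0, 1, 2}

for k in range(3):
    Xk = X[y == k]                       # only the points generated by component k
    mu_k = Xk.mean(axis=0)               # mu_k = (1/N_k) sum over those points
    diff = Xk - mu_k
    Sigma_k = diff.T @ diff / len(Xk)    # Sigma_k = (1/N_k) sum (x - mu_k)(x - mu_k)^T
    print(k, mu_k, Sigma_k)
```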
The Unsupervised Case
What if our observations do not include information about which of the mixture components generated them?
Consider a joint probability distribution over data points x^(i) and mixture assignments y ∈ {1, ..., K}.

MLE: arg max_Θ Π_i p(x^(i) | Θ)
   = arg max_Θ Π_i Σ_{y=1}^K p(x^(i), Y = y | Θ)
   = arg max_Θ Π_i Σ_{y=1}^K p(x^(i) | Y = y, Θ) p(Y = y | Θ)

(We only know how to compute p(x^(i) | Y = y, Θ), the probability under each individual mixture component.)
The Unsupervised Case
In the case of a Gaussian mixture model:

p(x^(i) | Y = y, Θ) = (1 / √((2π)^d det Σ_y)) exp( -(1/2) (x^(i) - μ_y)^T Σ_y^{-1} (x^(i) - μ_y) )
p(Y = y | Θ) = λ_y

Differentiating the MLE objective yields a system of equations that is difficult to solve in general. The solution: modify the objective to make the optimization easier.
Expectation Maximization
Jensen's Inequality
For a convex function f: R^n → R, any a_1, ..., a_K ∈ [0, 1] such that a_1 + ... + a_K = 1, and any x^(1), ..., x^(K) ∈ R^n,

a_1 f(x^(1)) + ... + a_K f(x^(K)) ≥ f(a_1 x^(1) + ... + a_K x^(K))
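A quick numerical sanity check of the inequality, using the convex function f(x) = x² and random weights (purely illustrative, not from the slides):

```python
import numpy as np

# Check Jensen's inequality for a convex f (here f(x) = x**2):
# a_1 f(x_1) + ... + a_K f(x_K) >= f(a_1 x_1 + ... + a_K x_K).
rng = np.random.default_rng(1)
a = rng.random(5); a /= a.sum()          # nonnegative weights summing to 1
x = rng.normal(size=5)

lhs = np.sum(a * x ** 2)                 # weighted average of f(x_i)
rhs = (np.sum(a * x)) ** 2               # f of the weighted average
assert lhs >= rhs
print(lhs, rhs)
```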
EM Algorithm
For any distributions q_1, ..., q_N,

log l(Θ) = Σ_i log Σ_{y=1}^K p(x^(i), Y = y | Θ)
         = Σ_i log Σ_{y=1}^K q_i(y) · [ p(x^(i), Y = y | Θ) / q_i(y) ]
         ≥ Σ_i Σ_{y=1}^K q_i(y) log [ p(x^(i), Y = y | Θ) / q_i(y) ]
         ≡ F(Θ, q)

- Each q_i(y) is an arbitrary positive probability distribution
- The last step follows from Jensen's inequality (log is concave)
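The bound can be checked numerically. The sketch below (illustrative parameter values, 1-D Gaussian components assumed) evaluates both the log-likelihood and F(Θ, q) for an arbitrary choice of the q_i and confirms that F lower-bounds the log-likelihood:

```python
import numpy as np

# Sketch: numerically verify that F(Theta, q) <= log-likelihood for arbitrary q_i.
rng = np.random.default_rng(2)
lambdas = np.array([0.4, 0.6])
means   = np.array([-1.0, 2.0])
stds    = np.array([1.0, 0.5])
X = rng.normal(size=20)                          # some data points

def comp_pdf(x, k):
    return np.exp(-0.5 * ((x - means[k]) / stds[k]) ** 2) / (stds[k] * np.sqrt(2 * np.pi))

# p(x_i, Y=y | Theta) for every point and component
joint = np.array([[lambdas[k] * comp_pdf(x, k) for k in range(2)] for x in X])

log_lik = np.sum(np.log(joint.sum(axis=1)))      # sum_i log sum_y p(x_i, Y=y | Theta)

q = rng.random(joint.shape); q /= q.sum(axis=1, keepdims=True)   # arbitrary distributions q_i
F = np.sum(q * np.log(joint / q))                # sum_i sum_y q_i(y) log[p(x_i, Y=y)/q_i(y)]

assert F <= log_lik
print(F, log_lik)
```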
EM Algorithm
arg max_{Θ, q_1, ..., q_N} Σ_i Σ_{y=1}^K q_i(y) log [ p(x^(i), Y = y | Θ) / q_i(y) ]

- This objective is not jointly concave in Θ and q_1, ..., q_N
- The best we can hope for is a local maximum (and there could be A LOT of them)
- The EM algorithm is a block coordinate ascent scheme that finds a local optimum of this objective
- Start from an initialization Θ^(0) and q_1^(0), ..., q_N^(0)
EM Algorithm
E step: with Θ fixed, maximize the objective over q:

q^(t+1) ← arg max_{q_1, ..., q_N} Σ_i Σ_{y=1}^K q_i(y) log [ p(x^(i), Y = y | Θ^(t)) / q_i(y) ]

Using the method of Lagrange multipliers for the constraint that Σ_y q_i(y) = 1 gives

q_i^(t+1)(y) = p(Y = y | X = x^(i), Θ^(t))
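A sketch of this E-step for a Gaussian mixture in NumPy: by Bayes' rule, the posterior responsibility of component y for point x^(i) is proportional to λ_y times the component density (the function and example values are illustrative, not from the lecture):

```python
import numpy as np

# Sketch of the E-step for a GMM: q_i(y) = p(Y=y | X=x_i, Theta), i.e. the
# posterior "responsibility" of each component for each point.
def e_step(X, lambdas, means, covs):
    N, d = X.shape
    K = len(lambdas)
    resp = np.zeros((N, K))
    for k in range(K):
        diff = X - means[k]
        inv = np.linalg.inv(covs[k])
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[k]))
        dens = norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))
        resp[:, k] = lambdas[k] * dens           # lambda_y * p(x_i | mu_y, Sigma_y)
    resp /= resp.sum(axis=1, keepdims=True)      # normalize over y (Bayes' rule)
    return resp

# Toy usage with made-up parameters
X = np.random.default_rng(0).normal(size=(5, 2))
print(e_step(X, np.array([0.5, 0.5]),
             [np.zeros(2), np.ones(2)],
             [np.eye(2), np.eye(2)]))
```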
EM Algorithm
M step: with the q's fixed, maximize the objective over Θ:

Θ^(t+1) ← arg max_Θ Σ_i Σ_{y=1}^K q_i^(t+1)(y) log [ p(x^(i), Y = y | Θ) / q_i^(t+1)(y) ]

- For the case of a GMM, we can compute this update in closed form
- This is not necessarily the case for every model and may require gradient ascent
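For a GMM the M-step reduces to weighted versions of the usual Gaussian MLE formulas. A minimal sketch, assuming the responsibilities come from an E-step like the one above (names and data are illustrative):

```python
import numpy as np

# Sketch of the closed-form M-step for a GMM given responsibilities resp (N x K).
def m_step(X, resp):
    N, d = X.shape
    Nk = resp.sum(axis=0)                                 # effective counts per component
    lambdas = Nk / N                                      # lambda_y = (1/N) sum_i q_i(y)
    means = (resp.T @ X) / Nk[:, None]                    # responsibility-weighted means
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])   # weighted covariances
    return lambdas, means, covs

# Toy usage with made-up responsibilities
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
resp = rng.random((6, 2)); resp /= resp.sum(axis=1, keepdims=True)
print(m_step(X, resp))
```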
EM Algorithm
- Start with random parameters
- The E-step maximizes a lower bound on the log-sum for fixed parameters
- The M-step solves the MLE estimation problem for fixed probabilities
- Iterate between the E-step and M-step until convergence
EM for Gaussian Mixtures
E-step:

q_i^(t+1)(y) = λ_y^(t) p(x^(i) | μ_y^(t), Σ_y^(t)) / Σ_{y'} λ_{y'}^(t) p(x^(i) | μ_{y'}^(t), Σ_{y'}^(t))

where p(x^(i) | μ_y, Σ_y) is the probability of x^(i) under the appropriate multivariate normal distribution.

M-step:

μ_y^(t+1) = Σ_i q_i^(t+1)(y) x^(i) / Σ_i q_i^(t+1)(y)
Σ_y^(t+1) = Σ_i q_i^(t+1)(y) (x^(i) - μ_y^(t+1))(x^(i) - μ_y^(t+1))^T / Σ_i q_i^(t+1)(y)
λ_y^(t+1) = (1/N) Σ_i q_i^(t+1)(y)
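Putting the two steps together, here is a compact end-to-end sketch of EM for a Gaussian mixture (an illustrative implementation, not the lecture's reference code; the initialization, iteration count, and toy data are arbitrary choices):

```python
import numpy as np

# End-to-end sketch: alternate the E-step (responsibilities) and the
# M-step (weighted MLE updates) for a fixed number of iterations.
def gaussian_density(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))

def em_gmm(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    lambdas = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]       # random initialization
    covs = [np.eye(d) for _ in range(K)]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = np.stack([lambdas[k] * gaussian_density(X, means[k], covs[k])
                         for k in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        Nk = resp.sum(axis=0)
        lambdas = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        covs = [((resp[:, k, None] * (X - means[k])).T @ (X - means[k])) / Nk[k]
                for k in range(K)]
    return lambdas, means, covs

# Toy usage: data drawn from two well-separated Gaussians
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
print(em_gmm(X, K=2)[1])   # estimated component means
```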
Gaussian Mixture Example
(figures: the fitted mixture at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations of EM)
Properties of EM
- EM converges to a local optimum
- This is because each iteration improves the log-likelihood; the proof is the same as for k-means (just block coordinate ascent)
- The E-step can never decrease the likelihood
- The M-step can never decrease the likelihood
- If we make hard assignments instead of soft ones, the algorithm is equivalent to k-means!
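To see the k-means connection concretely, here is a small sketch of the hard-assignment variant of the E-step, assuming identity covariances and equal mixture weights so that the most responsible component is simply the nearest mean (illustrative code, not from the slides):

```python
import numpy as np

# Hard-assignment E-step: replace the soft responsibilities with a one-hot
# argmax, which turns EM into a k-means-style algorithm.
def hard_e_step(X, means):
    # Assign each point entirely to its nearest mean (Euclidean distance,
    # corresponding to identity covariances and equal mixture weights).
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    resp = np.zeros_like(dists)
    resp[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
    return resp

X = np.random.default_rng(0).normal(size=(5, 2))
print(hard_e_step(X, np.array([[0.0, 0.0], [2.0, 2.0]])))
```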