Introduction to Machine Learning
Brown University CSCI 1950-F, Spring 2012
Prof. Erik Sudderth
Lecture 20: Expectation Maximization Algorithm; EM for Mixture Models
Many figures courtesy of Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective
Gaussian Mixture Models
[Figures: Mixture of 3 Gaussian Distributions in 2D; Contour Plot of Joint Density, Marginalizing Cluster Assignments]
Gaussian Mixture Models
• Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
• Hidden cluster labels: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
• Hidden mixture means: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
• Hidden mixture covariances: $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$
• Hidden mixture probabilities: $\pi_k$, $\sum_{k=1}^{K} \pi_k = 1$
• Gaussian mixture marginal likelihood:
  $p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i=1}^{K} \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$,
  where $p(x_i \mid z_i, \pi, \mu, \Sigma) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$
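A minimal sketch of evaluating this marginal likelihood with NumPy/SciPy; the function name and log-sum-exp layout are illustrative, not from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """Sum over i of log sum_k pi_k N(x_i | mu_k, Sigma_k) for data X of shape (N, d)."""
    K = len(pi)
    # log p(x_i | z_i = k) for every point and component, shape (N, K)
    log_comp = np.stack([multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
    log_joint = np.log(pi) + log_comp                  # log pi_k + log N(x_i | ...)
    # log-sum-exp over components marginalizes the hidden labels z_i
    m = log_joint.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze(1) + np.log(np.exp(log_joint - m).sum(axis=1))))
```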
Clustering with Gaussian Mixtures
[Figures: (a) Complete Data Labeled by True Cluster Assignments; (b) Incomplete Data: Points to be Clustered]
With complete data, learning is Gaussian discriminant analysis.
C. Bishop, Pattern Recognition & Machine Learning
Expectation Maximization (EM)
[Graphical models: Supervised Training, Supervised Testing, Unsupervised Learning]
$\pi, \theta$: parameters (define cluster location and shape)
$z_1, \ldots, z_N$: hidden data (group observations into clusters)
• Initialization: Randomly select starting parameters
• E-Step: Given parameters, find posterior of hidden data
  - Equivalent to test inference of the full posterior distribution
• M-Step: Given posterior distributions, find likely parameters
  - Distinct from supervised ML/MAP, but often still tractable
• Iteration: Alternate E-step & M-step until convergence (see the sketch after this list)
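A high-level sketch of this alternation, assuming generic e_step, m_step, and log_likelihood callbacks that bundle the parameters however is convenient (the component-wise functions sketched later in these notes would need thin wrappers to match these signatures):

```python
import numpy as np

def run_em(X, init_params, e_step, m_step, log_likelihood, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate posterior inference and parameter updates."""
    params = init_params                  # random or heuristic starting point
    prev_ll = -np.inf
    for _ in range(max_iters):
        resp = e_step(X, params)          # E-step: posterior over hidden labels z
        params = m_step(X, resp)          # M-step: maximize expected complete-data log-likelihood
        ll = log_likelihood(X, params)    # EM guarantees this never decreases
        if ll - prev_ll < tol:            # stop once the improvement is negligible
            break
        prev_ll = ll
    return params
```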
EM Algorithm
[Figures: EM iterations on a two-component Gaussian mixture in 2D, showing the fit at initialization, after L = 1 iteration, and after L = 5 iterations]
C. Bishop, Pattern Recognition & Machine Learning
Convexity
Convexity & Jensen's Inequality
For a convex function $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$
[Figure: a convex $f(x)$ lies below the chord joining $(a, f(a))$ and $(b, f(b))$]
Concavity & Jensen's Inequality
For a concave function such as $\ln$: $\ln(\mathbb{E}[X]) \ge \mathbb{E}[\ln(X)]$
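A quick numerical check of this inequality (purely illustrative, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=100_000)   # any positive random variable works
lhs = np.log(x.mean())                     # ln(E[X])
rhs = np.log(x).mean()                     # E[ln X]
assert lhs >= rhs                          # Jensen: ln is concave
print(f"ln(E[X]) = {lhs:.4f} >= E[ln X] = {rhs:.4f}")
```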
EM as Lower Bound Maximization
$\ln p(x \mid \theta) = \ln \Big( \sum_z p(x, z \mid \theta) \Big) = \ln \Big( \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \Big)$
$\ln p(x \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) \triangleq \mathcal{L}(q, \theta)$
• Initialization: Randomly select starting parameters $\theta^{(0)}$
• E-Step: Given parameters, find posterior of hidden data: $q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})$
• M-Step: Given posterior distributions, find likely parameters: $\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta)$
• Iteration: Alternate E-step & M-step until convergence
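A minimal sketch of evaluating this bound $\mathcal{L}(q, \theta)$ for a mixture model with discrete hidden labels, where the responsibilities play the role of $q$ (the function name and array layout are assumptions of these notes):

```python
import numpy as np

def lower_bound(resp, log_joint):
    """L(q, theta) = sum_i sum_k q(z_i=k) [ log p(x_i, z_i=k | theta) - log q(z_i=k) ].

    resp      : (N, K) responsibilities q(z_i = k), rows sum to 1
    log_joint : (N, K) values of log p(x_i, z_i = k | theta) = log pi_k + log p(x_i | theta_k)
    """
    # Convention 0 * log 0 = 0 avoids NaNs when a responsibility is exactly zero
    entropy_term = np.where(resp > 0, resp * np.log(resp), 0.0)
    return float(np.sum(resp * log_joint) - np.sum(entropy_term))
```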
Lower Bounds on Marginal Likelihood
[Figure: the decomposition $\ln p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$; since $\mathrm{KL}(q \,\|\, p) \ge 0$, $\mathcal{L}(q, \theta)$ is a lower bound on $\ln p(x \mid \theta)$]
C. Bishop, Pattern Recognition & Machine Learning
EM: A Sequence of Lower Bounds
[Figure: $\mathcal{L}(q, \theta)$ as a lower bound on $\ln p(x \mid \theta)$, tight at $\theta^{\text{old}}$ and maximized at $\theta^{\text{new}}$]
C. Bishop, Pattern Recognition & Machine Learning
Expectation Maximization Algorithm
[Figures: the bound before and after each step]
• E Step: Optimize distribution on hidden variables given parameters; this sets $\mathrm{KL}(q \,\|\, p) = 0$, so $\mathcal{L}(q, \theta^{\text{old}}) = \ln p(x \mid \theta^{\text{old}})$
• M Step: Optimize parameters given distribution on hidden variables; this raises the bound to $\mathcal{L}(q, \theta^{\text{new}}) \le \ln p(x \mid \theta^{\text{new}})$, so the marginal likelihood never decreases
C. Bishop, Pattern Recognition & Machine Learning
EM: Expectation Step
$\ln p(x \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) = \mathcal{L}(q, \theta)$
$q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})$
• General solution, for any probabilistic model: $q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})$, the posterior distribution given current parameters
• Applying to probabilistic mixture models: $p(z_i \mid \pi) = \mathrm{Cat}(z_i \mid \pi)$, $p(x_i \mid z_i, \theta) = p(x_i \mid \theta_{z_i})$
  $r_{ik} = p(z_i = k \mid x_i, \pi, \theta) = \dfrac{\pi_k \, p(x_i \mid \theta_k)}{\sum_{l=1}^{K} \pi_l \, p(x_i \mid \theta_l)}$
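A sketch of this responsibility computation for a Gaussian mixture, done in log space for numerical stability (the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Return responsibilities r[i, k] = p(z_i = k | x_i, pi, theta)."""
    K = len(pi)
    # log pi_k + log N(x_i | mu_k, Sigma_k), shape (N, K)
    log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(K)], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # normalize over components
```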
E-Step for Gaussian Mixtures
[Figures: (b) Incomplete Data: Points to be Clustered; (c) Posterior Probabilities of Assignment to Each Cluster]
$r_{ik} = p(z_i = k \mid x_i, \pi, \theta) = \dfrac{\pi_k \, p(x_i \mid \theta_k)}{\sum_{l=1}^{K} \pi_l \, p(x_i \mid \theta_l)}$
Reminder: Exponential Families
$\phi(x) \in \mathbb{R}^d$: fixed vector of sufficient statistics (features), specifying the family of distributions
$\theta \in \Theta$: unknown vector of natural parameters, determining the particular distribution in this family
$Z(\theta) > 0$: normalization constant or partition function, ensuring this is a valid probability distribution
$h(x) > 0$: reference measure independent of parameters (for many models, we simply have $h(x) = 1$)
• Examples: Multinomial, Dirichlet, Gaussian, Poisson, ...
• ML estimation via moment matching: $\mathbb{E}_{\hat\theta}[\phi(x)] = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)$
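A small illustration of moment matching for the Gaussian case, where the sufficient statistics are $\phi(x) = (x, xx^T)$ and matching the empirical moments recovers the sample mean and (ML) covariance; this example is an assumption of these notes, not taken from the slides:

```python
import numpy as np

def gaussian_ml_by_moment_matching(X):
    """Match E[x] and E[x x^T] to the empirical moments of data X of shape (N, d)."""
    mu = X.mean(axis=0)                        # first moment
    second_moment = (X.T @ X) / X.shape[0]     # empirical E[x x^T]
    Sigma = second_moment - np.outer(mu, mu)   # covariance from the two moments
    return mu, Sigma
```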
Reminder: Constrained Optimization
$\hat\theta = \arg\max_\theta \sum_{k=1}^{K} a_k \log \theta_k$ subject to $\sum_{k=1}^{K} \theta_k = 1$, $\theta_k \ge 0$, $a_k \ge 0$
• Solution: $\hat\theta_k = \dfrac{a_k}{a_0}$, where $a_0 = \sum_{k=1}^{K} a_k$
• Proof for K = 2: Change of variables to an unconstrained problem
• Proof for general K: Lagrange multipliers (see textbook)
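A quick numeric check of this closed-form solution against a generic constrained optimizer (illustrative only; the tolerances and bounds are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

a = np.array([3.0, 1.0, 6.0])
closed_form = a / a.sum()                      # theta_k = a_k / a_0, here [0.3, 0.1, 0.6]
# Maximize sum_k a_k log theta_k subject to sum_k theta_k = 1 (negate for minimization)
res = minimize(lambda t: -(a * np.log(t)).sum(), x0=np.full(3, 1 / 3),
               bounds=[(1e-9, 1.0)] * 3,
               constraints={"type": "eq", "fun": lambda t: t.sum() - 1.0})
assert np.allclose(res.x, closed_form, atol=1e-3)
```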
EM: Maximization Step
$\ln p(x \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) = \mathcal{L}(q, \theta)$
$\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta) = \arg\max_\theta \sum_z q(z) \ln p(x, z \mid \theta)$
• Unlike the E-step, there is no simplified general solution
• Applying to mixtures of exponential families: $p(z_i \mid \pi) = \mathrm{Cat}(z_i \mid \pi)$, $p(x_i \mid z_i, \theta) = \exp\{\theta_{z_i}^T \phi(x_i) - A(\theta_{z_i})\}$
  $\hat\pi_k = \dfrac{N_k}{N}$, $\quad N_k = \sum_{i=1}^{N} r_{ik}$, $\quad \mathbb{E}_{\hat\theta_k}[\phi(x)] = \dfrac{1}{N_k} \sum_{i=1}^{N} r_{ik} \, \phi(x_i)$ (weighted moment matching)
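For the Gaussian case, weighted moment matching yields the familiar GMM M-step updates; a minimal sketch (the small ridge term keeping covariances well conditioned is an assumption of these notes, not from the slides):

```python
import numpy as np

def m_step(X, resp, ridge=1e-6):
    """Weighted moment matching: update pi, mu_k, Sigma_k from responsibilities resp (N, K)."""
    N, d = X.shape
    Nk = resp.sum(axis=0)                        # effective counts N_k
    pi = Nk / N
    mus = (resp.T @ X) / Nk[:, None]             # responsibility-weighted means
    Sigmas = []
    for k in range(Nk.shape[0]):
        diff = X - mus[k]
        Sigmas.append((resp[:, k, None] * diff).T @ diff / Nk[k] + ridge * np.eye(d))
    return pi, mus, np.array(Sigmas)
```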
Binary Features: Mixtures of Bernoullis
[Figure: 10 Clusters Identified via EM Algorithm from Binarized MNIST Digits]
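A compact EM sketch for a mixture of Bernoullis on binary data such as binarized digits; implementation details like the smoothing constant and random initialization are assumptions, not from the lecture:

```python
import numpy as np

def bernoulli_mixture_em(X, K, iters=50, eps=1e-6, seed=0):
    """EM for p(x_i) = sum_k pi_k prod_j mu_kj^{x_ij} (1 - mu_kj)^{1 - x_ij}, X binary (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, d))    # random cluster prototypes
    for _ in range(iters):
        # E-step: responsibilities computed in log space
        log_r = np.log(pi) + X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted moment matching for pi and the Bernoulli means
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
    return pi, mu
```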
Density Shape Matters
[Figures: bankruptcy data (classes Bankrupt vs. Solvent), 21 errors using a mixture of Gaussians vs. 4 errors using a mixture of Student-t densities (red = error)]
Real Clusters may not be Round Ng, Jordan, & Weiss, NIPS02
Dimensionality Reduction

               Supervised Learning              Unsupervised Learning
  Discrete     classification or categorization clustering
  Continuous   regression                       dimensionality reduction

• Goal: Infer label/response y given only features x
• Classical: Find latent variables y good for compression of x
• Probabilistic learning: Estimate parameters of joint distribution p(x, y) which maximize marginal probability p(x)