Expectation Maximization


Expectation Maximization
Aaron C. Courville, Université de Montréal
Note: Material for the slides is taken directly from a presentation prepared by Christopher M. Bishop.

Learning in DAGs
Two things could be learned:
- the graph structure
- the parameters governing the conditional probability distributions
Learning the structure often involves a search over candidate structures and a method to score each structure.
- In practice, it is often difficult to extract the conditional independence relationships that make DAGs so appealing in the first place.
- MCMC methods are also used to search over the space of structures.
We will focus on the problem of learning the parameters.

Maximum Likelihood Learning
Consider the parameter set θ = (θ_1, θ_2, ...) which governs the conditional probability distributions P(X_i | Pa_i, θ). One way to learn the parameters is to maximize the likelihood (or probability) of the data D (the set of observed variables):
θ_ML = argmax_θ { ln P(D | θ) }
     = argmax_θ { ln ∏_{n=1}^N P(x_n | θ) }
     = argmax_θ { Σ_{n=1}^N ln P(x_n | θ) }
     = argmax_θ { Σ_{n=1}^N ln Σ_{h_n} P(x_n, h_n | θ) }    (the sum over the latents h_n could be problematic)
When there are no latent variables, the situation is much easier: if we can condition on all the variables, the graph factors by d-separation and we can estimate the parameters for each of the P(X_i | Pa_i, θ) independently (e.g. the Naïve Bayes classifier).
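To make the fully observed case concrete, here is a minimal sketch (not from the slides; the toy data and variable names are assumptions) of maximum-likelihood estimation of a conditional probability table in a fully observed two-node network X → Y, which reduces to counting:

```python
from collections import Counter

# Hypothetical fully observed data for a two-node binary network X -> Y.
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]   # (x, y) pairs

counts = Counter(data)
for x in (0, 1):
    total = sum(counts[(x, y)] for y in (0, 1))
    for y in (0, 1):
        # The ML estimate of the CPT entry P(Y = y | X = x) is just the empirical frequency.
        print(f"P(Y={y} | X={x}) = {counts[(x, y)] / total:.3f}")
```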

ML Example: Multivariate Gaussian
Multivariate Gaussian distribution:
p(X = x) = N(x | μ, Σ) = (2π)^{-D/2} |Σ|^{-1/2} exp{ -(1/2) (x - μ)^T Σ^{-1} (x - μ) }
with mean μ and covariance Σ.
Maximum likelihood solution:
μ_ML = (1/N) Σ_{n=1}^N x_n
Σ_ML = (1/N) Σ_{n=1}^N (x_n - μ_ML)(x_n - μ_ML)^T
Nice simple closed-form solutions.
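As a quick illustration (not part of the original slides; the toy data are an assumption), these closed-form estimates are one line each in NumPy:

```python
import numpy as np

# Assumed toy data: N samples from a 2-D Gaussian.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=500)

mu_ml = X.mean(axis=0)                    # (1/N) sum_n x_n
diff = X - mu_ml
sigma_ml = diff.T @ diff / X.shape[0]     # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)
print(sigma_ml)
```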

Gaussian Mixture Models
Now let's consider a random variable distributed according to a mixture of Gaussians. Conditional distributions for a D-dimensional X:
P(I = i) = w_i
p(X = x | I = i) = N(x | μ_i, Σ_i) = (2π)^{-D/2} |Σ_i|^{-1/2} exp{ -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) }
where I is an index over the multivariate Gaussian components in the mixture and the mixing proportion w_i is the marginal probability that X is generated by mixture component i.
Marginal distribution:
p(X = x) = Σ_i p(X = x | I = i) P(I = i) = Σ_i w_i N(x | μ_i, Σ_i)
[Figure: panel (a) shows the component densities p(x | i); panel (b) shows the resulting mixture density p(x).]
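A minimal sketch of evaluating this mixture density (the function name and example parameters are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Mixture density p(x) = sum_i w_i N(x | mu_i, Sigma_i)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Assumed two-component, 2-D example.
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```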

Gaussian Mixture Models (cont.)
[Graphical model: a plate over the N observations; the component indicator I, with mixing weights w, is a parent of the observed X, which also depends on the parameters μ and Σ.]

Maximum Likelihood of GMM
Log likelihood function:
ln p(D | w, μ, Σ) = Σ_{n=1}^N ln { Σ_i w_i N(x_n | μ_i, Σ_i) }
The sum over mixture components appears inside the log, so there is no closed-form ML solution.

Complete and Incomplete Data
If we knew the mixture component identities, things would be easier. This is the difference between complete data and incomplete data:
[Figure: panel (a) shows the complete data, with component assignments known; panel (b) shows the incomplete data, with assignments unobserved.]
The complete data picture treats the latent variables as missing data.

The makings of an iterative scheme
Problem: we don't know the values of the latent variables (they're missing!).
The EM idea: instead maximize the expected value of the complete-data log likelihood, with the expectation taken w.r.t. P(latents | observed, parameters).
[Figure: the incomplete-data view (b) and the complete-data view (a) of the same data.]

Expectation-Maximization Algorithm
E-step (expectation): evaluate the posterior distribution P(Z | X, θ_old) using the current estimate θ_old of the parameters.
M-step (maximization): re-estimate θ by maximizing the expected complete-data log likelihood:
θ_new = argmax_θ Q(θ, θ_old)
Q(θ, θ_old) = Σ_Z P(Z | X, θ_old) ln P(X, Z | θ)
Note that the log and the summation have been exchanged; this will often make the summation tractable.
Iterate E and M steps until convergence. EM is guaranteed to converge to a local optimum, with a linear convergence rate.

E Step: Mixture of Gaussians
Calculate P(I_n | X_n, θ) for each observed example X_n = [X_{1,n}, ..., X_{d,n}, ..., X_{D,n}]^T:
P(I_n = i | X_n, θ) = P(I_n = i | w) P(X_n | I_n = i, μ_i, Σ_i) / P(X_n)
                    = P(I_n = i | w) P(X_n | I_n = i, μ_i, Σ_i) / Σ_j P(I_n = j | w) P(X_n | I_n = j, μ_j, Σ_j)
                    = w_i N(X_n | μ_i, Σ_i) / Σ_j w_j N(X_n | μ_j, Σ_j)
where θ = {θ_1, ..., θ_i, ..., θ_K}, θ_i = {w_i, μ_i, Σ_i}, and N(·) is the multivariate Gaussian probability density function.
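A hedged NumPy sketch of this E step (the function and variable names are illustrative, not from the slides); it returns the N x K matrix of responsibilities P(I_n = i | X_n, θ):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """r[n, i] = w_i N(x_n | mu_i, Sigma_i) / sum_j w_j N(x_n | mu_j, Sigma_j)."""
    N, K = X.shape[0], len(weights)
    r = np.empty((N, K))
    for i in range(K):
        r[:, i] = weights[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
    r /= r.sum(axis=1, keepdims=True)   # normalize over components for each data point
    return r
```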

M Step: Mixture of Gaussians
For mixtures of Gaussians:
θ' ← argmax_{θ'} Σ_n Σ_i P(I_n = i | X_n = x_n, θ) ln P(X_n = x_n, I_n = i | θ')
We already computed P(I_n = i | X_n = x_n, θ) in the E step, and we can decompose the joint P(X_n = x_n, I_n = i | θ'):
Σ_n Σ_i P(i_n | x_n, θ) ln p(x_n, i_n | θ') = Σ_n Σ_i P(i_n | x_n, θ) ln [ p(x_n | i_n, θ') P(i_n | θ') ]
  = Σ_n Σ_i P(i_n | x_n, θ) ln w'_i + Σ_n Σ_i P(i_n | x_n, θ) ln N(x_n | μ'_i, Σ'_i)
Now we maximize this expression w.r.t. θ' (on to the M step).

M Step: Mixture of Gaussians (cont.)
Let's consider updating w_i, subject to the constraint Σ_i w_i = 1 (enforced with a Lagrange multiplier λ):
∂/∂w_i [ Σ_n Σ_i P(i_n | x_n, θ) ln w_i + λ (Σ_i w_i - 1) ] = 0
Σ_{n=1}^N (1/w_i) P(i_n | x_n, θ) + λ = 0
Summing this condition over i and using Σ_i w_i = 1 gives λ = -N, so
w_i ← (1/N) Σ_{n=1}^N P(i_n | x_n, θ)

M Step: Mixture of Gaussians (cont.)
Now consider updating the mean vectors μ_i:
∂/∂μ_i [ Σ_n Σ_i P(i_n | x_n, θ) ln N(x_n | μ_i, Σ_i) ] = 0
∂/∂μ_i [ Σ_n Σ_i P(i_n | x_n, θ) ( -(1/2) ln|Σ_i| - (1/2) (x_n - μ_i)^T Σ_i^{-1} (x_n - μ_i) ) ] = 0
Solving for μ_i:
μ_i ← Σ_{n=1}^N P(i_n | x_n, θ) x_n / Σ_{n=1}^N P(i_n | x_n, θ)

M Step: Mixture of Gaussians (cont.)
Finally, let's consider updating the covariance matrices Σ_i:
∂/∂Σ_i [ Σ_n Σ_i P(i_n | x_n, θ) ln N(x_n | μ_i, Σ_i) ] = 0
∂/∂Σ_i [ Σ_n Σ_i P(i_n | x_n, θ) ( -(1/2) ln|Σ_i| - (1/2) (x_n - μ_i)^T Σ_i^{-1} (x_n - μ_i) ) ] = 0
Solving for Σ_i, with μ_i here being the new mean from the previous slide:
Σ_i ← Σ_{n=1}^N P(i_n | x_n, θ) (x_n - μ_i)(x_n - μ_i)^T / Σ_{n=1}^N P(i_n | x_n, θ)
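A matching M-step sketch (illustrative names; assumes a responsibility matrix r such as the one produced by the e_step sketch above):

```python
import numpy as np

def m_step(X, r):
    """Re-estimate (weights, means, covariances) from data X (N x D) and responsibilities r (N x K)."""
    N, D = X.shape
    Nk = r.sum(axis=0)                          # effective number of points per component
    weights = Nk / N                            # w_i <- (1/N) sum_n P(i | x_n, theta)
    means = (r.T @ X) / Nk[:, None]             # mu_i <- responsibility-weighted average of the data
    covs = []
    for i in range(r.shape[1]):
        diff = X - means[i]
        covs.append((r[:, i, None] * diff).T @ diff / Nk[i])  # weighted outer products
    return weights, means, covs
```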

EM Gaussian Mixture: Summary
Given observed X_1, ..., X_N and hidden variables I_1, ..., I_N (the mixture components), iterate E and M steps until convergence.
E step: for each data point n compute
P(i_n | x_n, θ) = w_i N(x_n | μ_i, Σ_i) / Σ_{j=1}^K w_j N(x_n | μ_j, Σ_j)
M step: update the parameters of each component i (from 1 to K) with
w_i ← (1/N) Σ_{n=1}^N P(i_n | x_n, θ)
μ_i ← Σ_{n=1}^N P(i_n | x_n, θ) x_n / Σ_{n=1}^N P(i_n | x_n, θ)
Σ_i ← Σ_{n=1}^N P(i_n | x_n, θ) (x_n - μ_i)(x_n - μ_i)^T / Σ_{n=1}^N P(i_n | x_n, θ)

EM I: GMM model of Old Faithful data
[Figure-only slides: successive EM iterations fitting a Gaussian mixture to the Old Faithful data; axes are duration of eruption (minutes) and time between eruptions (minutes).]

Over-fitting with Gaussian Mixtures
Singularities (infinities) occur in the likelihood function when a component collapses onto a data point:
N(x_n | x_n, σ² I) = 1 / ((2π)^{D/2} σ^D) → ∞ as σ → 0
Also, maximum likelihood cannot determine the number of mixture components (the likelihood always increases with more components).
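A small sketch (not from the slides) that makes the singularity concrete: the density of a component centred exactly on a data point blows up as its scale shrinks. The common safeguard mentioned in the comments is an assumption of this sketch, not something the slides prescribe.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A component centred exactly on a data point has density (2*pi)^(-D/2) * sigma^(-D),
# which grows without bound as sigma -> 0: the likelihood singularity.
x = np.zeros(2)
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(sigma, multivariate_normal.pdf(x, mean=x, cov=sigma**2 * np.eye(2)))

# A common safeguard (an assumption here) is to regularize each covariance update,
# e.g. Sigma_i <- Sigma_i + 1e-6 * np.eye(D), so no component can fully collapse.
```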

EM II: Clustering Documents with Naïve Bayes
Suppose you have a collection of D unlabeled documents. Build an initial naïve Bayes classifier with parameters θ, then use EM to find the maximum likelihood estimate of the parameters.
Naïve Bayes assumption for document clustering:
- The probability of a document d_i given class c_j is the product of the probabilities of the words w_{d_i,k} in the document given that class:
P(d_i | c_j, θ) = ∏_k P(w_{d_i,k} | c_j, θ)
- The model parameters are the probabilities of the words w_t given the class c_j, θ_{w_t|c_j} (think of this as a class-specific vocabulary distribution), and the marginal probabilities of the classes, θ_{c_j}.
Repeat until convergence:
- E step: Use the current classifier θ to estimate the component membership of each unlabeled document, i.e. the probability that each class generated each document, P(c_j | d_i, θ).
- M step: Re-estimate the classifier θ given the estimated component membership of each document.

EM II: Clustering Documents with Naïve Bayes
E step:
P(y_i = c_j | d_i, θ) = P(c_j | θ) P(d_i | c_j, θ) / P(d_i | θ)
                      = P(c_j | θ) ∏_{k=1}^{|d_i|} P(w_{d_i,k} | c_j, θ) / Σ_{r=1}^{C} P(c_r | θ) ∏_{k=1}^{|d_i|} P(w_{d_i,k} | c_r, θ)
M step:
θ_{w_t|c_j} = P(w_t | c_j, θ) = Σ_{i=1}^{D} Num(w_t, d_i) P(y_i = c_j | d_i) / Σ_{s=1}^{V} Σ_{i=1}^{D} Num(w_s, d_i) P(y_i = c_j | d_i)
θ_{c_j} = P(c_j | θ) = ( Σ_{i=1}^{D} P(y_i = c_j | d_i) ) / D
where D is the number of documents, C is the number of classes, |d_i| is the number of words in document d_i, Num(w_t, d_i) is the count of word w_t in document d_i, and w_t is the t-th word in the vocabulary of size V.
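A compact NumPy sketch of this document-clustering EM (the function name, the bag-of-words matrix counts, the random initialization, and the small smoothing constant alpha are illustrative assumptions beyond the slides); it operates on a documents-by-vocabulary count matrix and works in log space for numerical stability:

```python
import numpy as np

def naive_bayes_em(counts, n_classes, n_iters=50, alpha=1e-2, seed=0):
    """counts: (D, V) bag-of-words matrix. Returns (class priors, word probs, posteriors)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    post = rng.dirichlet(np.ones(n_classes), size=D)       # random soft init of P(y_i = c_j | d_i)
    for _ in range(n_iters):
        # M step: re-estimate theta_{c_j} and theta_{w_t | c_j} (with a little smoothing).
        priors = post.sum(axis=0) / D
        word_counts = post.T @ counts                       # (C, V) expected word counts per class
        word_probs = (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True) + alpha * V)
        # E step: log P(c_j | d_i) proportional to log prior + sum_t Num(w_t, d_i) log P(w_t | c_j).
        log_post = np.log(priors) + counts @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)     # stabilize before exponentiating
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return priors, word_probs, post

# Tiny illustrative corpus: 4 documents over a 3-word vocabulary.
counts = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 3]])
priors, word_probs, post = naive_bayes_em(counts, n_classes=2)
print(post.round(2))
```

The smoothing constant departs slightly from pure maximum likelihood but keeps the log probabilities finite for words that a class has not yet been credited with.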

EM: optimizing a lower bound
Recall: our original goal is to maximize the likelihood p(X | θ). Suppose that direct optimization of p(X | θ) is difficult, but that optimizing the complete-data likelihood function p(X, Z | θ) is significantly easier. Introduce a distribution q(Z) over the latents; for any choice of q(Z):
ln p(X | θ) = L(q, θ) + KL(q || p)
where
L(q, θ) = Σ_Z q(Z) ln { p(X, Z | θ) / q(Z) }
KL(q || p) = - Σ_Z q(Z) ln { p(Z | X, θ) / q(Z) }
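For completeness, a short check of this decomposition (a sketch of the standard argument the slide summarizes, using p(X, Z | θ) = p(Z | X, θ) p(X | θ) and Σ_Z q(Z) = 1):

```latex
\mathcal{L}(q,\theta) + \mathrm{KL}(q\,\|\,p)
  = \sum_Z q(Z)\,\ln\frac{p(X,Z\mid\theta)}{q(Z)}
    \;-\; \sum_Z q(Z)\,\ln\frac{p(Z\mid X,\theta)}{q(Z)}
  = \sum_Z q(Z)\,\ln\frac{p(X,Z\mid\theta)}{p(Z\mid X,\theta)}
  = \sum_Z q(Z)\,\ln p(X\mid\theta)
  = \ln p(X\mid\theta).
```

Since KL(q || p) ≥ 0, L(q, θ) is a lower bound on ln p(X | θ), with equality exactly when q(Z) = p(Z | X, θ).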

EM: optimizing a lower bound (cont.)
Maximizing L(q, θ) with respect to a free-form q distribution, we obtain the true posterior distribution:
q(Z) = p(Z | X, θ)
The lower bound L(q, θ) then becomes
L(q, θ) = Σ_Z p(Z | X, θ_old) ln { p(X, Z | θ) / p(Z | X, θ_old) } = Q(θ, θ_old) + const
which, as a function of θ, is the expected complete-data log likelihood (up to an additive constant).

EM: optimizing a lower bound (cont.) Initial Configuration:

EM: optimizing a lower bound (cont.) E-step:

EM: optimizing a lower bound (cont.) M-step:

Acknowledgments
The material presented here is taken from tutorials, notes and lecture slides from Yoshua Bengio, Christopher Bishop, Andrew Moore, Tom Mitchell and Scott Davies.
Note to other teachers and users of these slides: Andrew and Scott would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.