The Expectation Maximization or EM algorithm

Carl Edward Rasmussen
November 15th, 2017

Contents

- notation, objective
- the lower bound functional F(q(z), θ)
- the EM algorithm
- example: Gaussian mixture model
- Appendix: KL divergence

Notation

Probabilistic models may have visible (or observed) variables y, latent variables (also called hidden or unobserved variables, or missing data) z, and parameters θ.

Example: in a Gaussian mixture model, the visible variables are the observations, the latent variables are the assignments of data points to mixture components, and the parameters are the means, variances, and weights of the mixture components.

The likelihood, p(y|θ), is the probability of the visible variables given the parameters. The goal of the EM algorithm is to find parameters θ which maximize the likelihood. The EM algorithm is iterative and converges to a local maximum.

Throughout, q(z) will be used to denote an arbitrary distribution of the latent variables, z. The exposition assumes the latent variables are continuous; an analogous derivation for discrete z is obtained by substituting sums for the integrals.

The lower bound

Bayes' rule,

p(z|y, θ) = p(y|z, θ) p(z|θ) / p(y|θ),

can be rearranged as

p(y|θ) = p(y|z, θ) p(z|θ) / p(z|y, θ).

Multiply and divide by an arbitrary (non-zero) distribution q(z):

p(y|θ) = [ p(y|z, θ) p(z|θ) / q(z) ] [ q(z) / p(z|y, θ) ],

take logarithms:

log p(y|θ) = log [ p(y|z, θ) p(z|θ) / q(z) ] + log [ q(z) / p(z|y, θ) ],

and average both sides wrt q(z):

log p(y|θ) = ∫ q(z) log [ p(y|z, θ) p(z|θ) / q(z) ] dz + ∫ q(z) log [ q(z) / p(z|y, θ) ] dz,

where the first term is the lower bound functional F(q(z), θ) and the second term is the non-negative KL divergence KL(q(z) || p(z|y, θ)).
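
As a sanity check on this decomposition, the minimal sketch below (not part of the lecture) verifies log p(y|θ) = F(q(z), θ) + KL(q(z) || p(z|y, θ)) numerically for a toy model with a discrete latent variable, so the integrals above become sums; all numbers and variable names are made up for illustration.

```python
# Numerical check of log p(y|θ) = F(q, θ) + KL(q || p(z|y, θ))
# for a hypothetical 3-state discrete latent variable z.
import numpy as np

rng = np.random.default_rng(0)

prior = np.array([0.5, 0.3, 0.2])       # p(z|θ)
lik = np.array([0.1, 0.7, 0.4])         # p(y|z, θ) for one observed y
joint = lik * prior                     # p(y, z|θ)
evidence = joint.sum()                  # p(y|θ)
posterior = joint / evidence            # p(z|y, θ)

q = rng.dirichlet(np.ones(3))           # an arbitrary distribution q(z)

F = np.sum(q * np.log(joint / q))       # lower bound functional F(q, θ)
KL = np.sum(q * np.log(q / posterior))  # KL(q(z) || p(z|y, θ)) >= 0

assert np.isclose(np.log(evidence), F + KL)
print(np.log(evidence), F, KL)
```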

The EM algorithm

From initial (random) parameters θ^(t=0), iterate the two steps for t = 1, ..., T:

E step: for fixed θ^(t−1), maximize the lower bound F(q(z), θ^(t−1)) wrt q(z). Since the log likelihood log p(y|θ) is independent of q(z), maximizing the lower bound is equivalent to minimizing KL(q(z) || p(z|y, θ^(t−1))), so q^t(z) = p(z|y, θ^(t−1)).

M step: for fixed q^t(z), maximize the lower bound F(q^t(z), θ) wrt θ. We have

F(q(z), θ) = ∫ q(z) log( p(y|z, θ) p(z|θ) ) dz − ∫ q(z) log q(z) dz,

whose second term is the entropy of q(z), independent of θ, so the M step is

θ^t = argmax_θ ∫ q^t(z) log( p(y|z, θ) p(z|θ) ) dz.

Although the steps work with the lower bound, each iteration cannot decrease the log likelihood, since

log p(y|θ^(t−1)) = F(q^t(z), θ^(t−1)) ≤ F(q^t(z), θ^t) ≤ log p(y|θ^t),

where the equality holds by the E step, the first inequality by the M step, and the last because F is a lower bound.
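
The iteration itself is just a loop around these two steps. The sketch below is a minimal skeleton (not from the lecture) assuming hypothetical user-supplied e_step and m_step functions; the Gaussian mixture example on the following slides shows what those pieces look like concretely.

```python
# A minimal sketch of the EM loop, assuming model-specific e_step and m_step
# functions are supplied by the user (hypothetical names, not from the lecture).
def em(y, theta0, e_step, m_step, T=100):
    theta = theta0
    for t in range(T):
        q = e_step(y, theta)   # E step: q^t(z) = p(z | y, θ^(t-1))
        theta = m_step(y, q)   # M step: θ^t maximizes ∫ q^t(z) log[p(y|z,θ)p(z|θ)] dz
    return theta
```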

EM as Coordinate Ascent in F

(figure)

Example: Mixture of Gaussians

In a Gaussian mixture model, the parameters are θ = {µ_j, σ²_j, π_j}, j = 1, ..., k: the mixture means, variances and mixing proportions for each of the k components. There is one latent variable per data point, z_i, i = 1, ..., n, taking on values 1, ..., k. The probability of the observations given the latent variables and the parameters, and the prior on the latent variables, are

p(y_i | z_i = j, θ) = exp( −(y_i − µ_j)² / (2σ²_j) ) / √(2πσ²_j),    p(z_i = j | θ) = π_j,

so the E step becomes

q(z_i = j) ∝ u_ij = π_j exp( −(y_i − µ_j)² / (2σ²_j) ) / √(2πσ²_j),
q(z_i = j) = r_ij = u_ij / u_i,   where u_i = Σ_{j=1}^k u_ij.

This shows that the posterior for each latent variable z_i follows a discrete distribution with probability given by the product of the prior and likelihood, renormalized. Here, r_ij is called the responsibility that component j takes for data point i.
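
As an illustration, a possible implementation of this E step for a one-dimensional mixture of Gaussians is sketched below; the array names (y, mu, var, pi) are assumptions for the sketch, not notation from the lecture.

```python
# Sketch of the E step: responsibilities r_ij for a 1-D Gaussian mixture.
# y has shape (n,); mu, var, pi each have shape (k,).
import numpy as np

def e_step(y, mu, var, pi):
    # u_ij = π_j exp(-(y_i - µ_j)² / (2σ²_j)) / sqrt(2πσ²_j)
    u = pi * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return u / u.sum(axis=1, keepdims=True)   # r_ij = u_ij / u_i
```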

Example: Mixture of Gaussians continued

The lower bound is

F(q(z), θ) = Σ_{i=1}^n Σ_{j=1}^k q(z_i = j) [ log π_j − (y_i − µ_j)² / (2σ²_j) − (1/2) log σ²_j ] + const.

The M step optimizes F(q(z), θ) wrt the parameters θ:

∂F/∂µ_j = Σ_{i=1}^n q(z_i = j) (y_i − µ_j) / σ²_j = 0   ⇒   µ_j = Σ_{i=1}^n q(z_i = j) y_i / Σ_{i=1}^n q(z_i = j),

∂F/∂σ²_j = Σ_{i=1}^n q(z_i = j) [ (y_i − µ_j)² / (2σ⁴_j) − 1 / (2σ²_j) ] = 0   ⇒   σ²_j = Σ_{i=1}^n q(z_i = j) (y_i − µ_j)² / Σ_{i=1}^n q(z_i = j),

∂/∂π_j [ F + λ ( 1 − Σ_{j=1}^k π_j ) ] = 0   ⇒   π_j = (1/n) Σ_{i=1}^n q(z_i = j),

which have nice interpretations in terms of weighted averages.
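
A corresponding sketch of these M-step updates, reusing e_step from the previous sketch, is shown below; the synthetic data and initial values are invented purely for illustration.

```python
# Sketch of the M step: weighted-average updates for means, variances and
# mixing proportions, followed by a small EM run on synthetic 1-D data.
import numpy as np

def m_step(y, r):
    nk = r.sum(axis=0)                                   # Σ_i q(z_i = j)
    mu = (r * y[:, None]).sum(axis=0) / nk               # weighted means
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk  # weighted variances
    pi = nk / len(y)                                     # mixing proportions
    return mu, var, pi

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for t in range(50):
    r = e_step(y, mu, var, pi)     # E step (previous sketch)
    mu, var, pi = m_step(y, r)     # M step
print(mu, var, pi)                 # roughly recovers the two components
```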

Clustering with MoG

(figures)

Appendix: some properties of KL divergence

The (asymmetric) Kullback-Leibler divergence (or relative entropy)

KL(q(x) || p(x)) = ∫ q(x) log [ q(x) / p(x) ] dx

is non-negative. To minimize it, add a Lagrange multiplier enforcing proper normalization and take variational derivatives:

δ/δq(x) [ ∫ q(x) log [ q(x) / p(x) ] dx + λ ( 1 − ∫ q(x) dx ) ] = log [ q(x) / p(x) ] + 1 − λ.

Find the stationary point by setting the derivative to zero:

q(x) = exp(λ − 1) p(x); the normalization condition gives λ = 1, so q(x) = p(x),

which corresponds to a minimum, since the second derivative is positive:

δ² KL(q(x) || p(x)) / δq(x) δq(x) = 1 / q(x) > 0.

The minimum value, attained at q(x) = p(x), is KL(p(x) || p(x)) = 0, showing that KL(q(x) || p(x)) is non-negative and attains its minimum 0 when p(x) and q(x) are equal (almost everywhere).
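
For intuition, the short sketch below (not part of the lecture) checks this property numerically for discrete distributions; the particular probability vectors are arbitrary examples.

```python
# KL(q || p) is non-negative and zero exactly when q = p (discrete check).
import numpy as np

def kl(q, p):
    return np.sum(q * np.log(q / p))

p = np.array([0.2, 0.5, 0.3])
print(kl(np.array([0.3, 0.4, 0.3]), p))   # positive
print(kl(p, p))                           # 0.0 at the minimum q = p
```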