Variational Mixture of Gaussians. Sargur Srihari

Variational Mixture of Gaussians
Sargur Srihari, srihari@cedar.buffalo.edu

Objective
Apply the variational inference machinery to Gaussian Mixture Models. This demonstrates how a Bayesian treatment elegantly resolves the difficulties encountered with maximum likelihood. Many more complex distributions can be handled using straightforward extensions of this analysis.

Graphical Model for GMM
Graphical model corresponding to the likelihood function of the standard GMM (plate notation and its equivalent unrolled network; a directed acyclic graph representing the mixture).
For each observation x_n we have a corresponding latent variable z_n, a 1-of-K binary vector with elements z_nk for k = 1,..,K.
Denote the observed data by X = {x_1,..,x_N} and the latent variables by Z = {z_1,..,z_N}.

Likelihood Function for GMM
The mixture density is p(x) = Σ_{k=1}^K π_k N(x|µ_k, Σ_k), since z takes the values {z_k} with probabilities π_k.
Therefore the likelihood function is
p(X|π,µ,Σ) = Π_{n=1}^N Σ_{k=1}^K π_k N(x_n|µ_k, Σ_k)
where the product is over the i.i.d. samples, and the log-likelihood is
ln p(X|π,µ,Σ) = Σ_{n=1}^N ln { Σ_{k=1}^K π_k N(x_n|µ_k, Σ_k) }
We seek the parameters π, µ and Σ that maximize the log-likelihood, a more difficult problem than for a single Gaussian.
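
As an aside (not part of the original slides), a minimal NumPy/SciPy sketch of this log-likelihood; the function name and the toy data are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """ln p(X|pi,mu,Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    # weighted_densities[n, k] = pi_k * N(x_n | mu_k, Sigma_k)
    weighted_densities = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    return np.sum(np.log(weighted_densities.sum(axis=1)))

# Toy check on two well-separated 2-D components
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.full(2, 4.0)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, pi, mus, Sigmas))
```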

GMM m.l.e. Expressions
Obtained by setting derivatives of the log-likelihood to zero:
Means: µ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n
Covariance matrices: Σ_k = (1/N_k) Σ_{n=1}^N γ(z_nk)(x_n − µ_k)(x_n − µ_k)^T
Mixing coefficients: π_k = N_k/N, where N_k = Σ_{n=1}^N γ(z_nk)
All three are expressed in terms of the responsibilities γ(z_nk). These are not closed-form solutions for the parameters, since the responsibilities depend on those parameters in a complex way.

EM for GMM
E step: use the current values of the parameters µ_k, Σ_k, π_k to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(z_nk).
M step: use these posterior probabilities to re-estimate the means, covariances and mixing coefficients, by maximizing the expectation of ln p(X,Z) with respect to p(Z|X).
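
A hedged sketch of one such maximum-likelihood EM iteration, under the same illustrative naming as the previous snippet; this is plain EM, not the variational update developed later.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mus, Sigmas):
    """One maximum-likelihood EM iteration for a GMM (illustrative sketch)."""
    N, D = X.shape
    K = len(pi)

    # E step: responsibilities gamma[n, k] = p(z_nk = 1 | x_n, current parameters)
    weighted = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # M step: re-estimate means, covariances and mixing coefficients
    Nk = gamma.sum(axis=0)
    new_mus = [(gamma[:, k, None] * X).sum(axis=0) / Nk[k] for k in range(K)]
    new_Sigmas = []
    for k in range(K):
        diff = X - new_mus[k]
        new_Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    new_pi = Nk / N
    return new_pi, new_mus, new_Sigmas, gamma
```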

Graphical Model for Bayesian GMM
(Plate diagrams: the standard GMM and the Bayesian GMM, in which the mixing coefficients π, the precisions Λ and the means µ are themselves random variables.)
To specify the model we need these conditional probabilities:
1. p(Z|π): conditional distribution of Z given the mixing coefficients
2. p(X|Z,µ,Λ): conditional distribution of the observed data given the latent variables and component parameters
3. p(π): distribution of the mixing coefficients
4. p(µ,Λ): prior governing the mean and precision of each component

Conditional Distribution Expressions
1. Conditional distribution of Z = {z_1,..,z_N} given the mixing coefficients π. Since the components are mutually exclusive (compare the single-sample form p(z) = Π_k π_k^{z_k}):
p(Z|π) = Π_{n=1}^N Π_{k=1}^K π_k^{z_nk}
2. Conditional distribution of the observed data X = {x_1,..,x_N} given the latent variables and component parameters. Since the components are Gaussian (compare p(x|z) = Π_k N(x|µ_k,Σ_k)^{z_k}):
p(X|Z,µ,Λ) = Π_{n=1}^N Π_{k=1}^K N(x_n|µ_k, Λ_k^{-1})^{z_nk}
where µ = {µ_k} and Λ = {Λ_k}; use of the precision matrix simplifies the further analysis.

Parameter Priors: Mixing Coefficients
3. Distribution of the mixing coefficients p(π). Conjugate priors simplify the analysis, so we take a Dirichlet distribution over π:
p(π) = Dir(π|α_0) = C(α_0) Π_{k=1}^K π_k^{α_0 − 1}
We have chosen the same parameter α_0 for each of the components; C(α_0) is the normalization constant of the Dirichlet distribution.

Parameter Priors: Mean, Precision
4. Distribution of the mean and precision of the Gaussian components, p(µ,Λ). The Gaussian-Wishart prior is
p(µ,Λ) = p(µ|Λ) p(Λ) = Π_{k=1}^K N(µ_k|m_0, (β_0 Λ_k)^{-1}) W(Λ_k|W_0, ν_0)
which is the conjugate prior when both the mean and the precision are unknown. The resulting model has a link between Λ and µ, due to this distribution.
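
To make the prior concrete, a small sketch (not from the slides) of drawing component means and precisions from this Gaussian-Wishart prior with scipy.stats; the hyperparameter values m_0, β_0, W_0, ν_0 below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import wishart, multivariate_normal

D, K = 2, 3                  # data dimensionality and number of components (illustrative)
m0 = np.zeros(D)             # prior mean of each mu_k
beta0 = 1.0                  # scaling of the precision of mu_k given Lambda_k
W0 = np.eye(D)               # Wishart scale matrix
nu0 = D + 2.0                # Wishart degrees of freedom (must exceed D - 1)

for k in range(K):
    # Lambda_k ~ W(W0, nu0), then mu_k ~ N(m0, (beta0 * Lambda_k)^(-1))
    Lambda_k = wishart(df=nu0, scale=W0).rvs(random_state=k)
    mu_k = multivariate_normal(mean=m0, cov=np.linalg.inv(beta0 * Lambda_k)).rvs(random_state=100 + k)
    print(f"component {k}: mu_k = {mu_k}")
```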

Bayesian Network for Bayesian GMM
Joint distribution of all random variables:
p(X,Z,π,µ,Λ) = p(X|Z,µ,Λ) p(Z|π) p(π) p(µ|Λ) p(Λ)
All the factors were given earlier; only X = {x_1,..,x_N} is observed.
This Bayesian network provides a nice distinction between latent variables and parameters. Variables such as z_n that appear inside the plate are latent variables, whose number grows with the data set. Variables outside the plate (the mixing coefficients, precisions and means) are parameters, fixed in number and independent of the data set size. From the viewpoint of PGMs there is no fundamental difference between them.

Recall GMM: the Variational Approach
Recall the GMM: p(x) = Σ_z p(z) p(x|z) = Σ_{k=1}^K π_k N(x|µ_k, Σ_k); here p(z) has parameter π, which now has its own distribution p(π).
The EM approach: 1. evaluate the posterior distribution p(Z|X); 2. evaluate the expectation of the complete-data log-likelihood ln p(X,Z) with respect to p(Z|X).
In the variational approach our goal is to specify a variational distribution q(Z,π,µ,Λ) that approximates p(Z,π,µ,Λ|X).
Recall the decomposition ln p(X) = L(q) + KL(q||p), where
L(q) = ∫ q(Z) ln { p(X,Z)/q(Z) } dZ
KL(q||p) = − ∫ q(Z) ln { p(Z|X)/q(Z) } dZ

Variational Distribution
In variational inference we can specify q using a factorized distribution q(Z) = Π_{i=1}^M q_i(Z_i) (subscripts on the q's are omitted below).
For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ, so we consider the variational distribution
q(Z,π,µ,Λ) = q(Z) q(π,µ,Λ)
Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model. The functional forms of both q(Z) and q(π,µ,Λ) are determined automatically by optimizing the variational distribution.

Sequential Update Equations
Using the general result for factorized distributions: when L(q) is defined as
L(q) = ∫ q(Z) ln { p(X,Z)/q(Z) } dZ = ∫ Π_i q_i { ln p(X,Z) − Σ_i ln q_i } dZ
the factor q_j that makes the functional L(q) largest is given by
ln q*_j(Z_j) = E_{i≠j}[ ln p(X,Z) ] + const
For the Bayesian GMM the log of the optimized factor is
ln q*(Z) = E_{π,µ,Λ}[ ln p(X,Z,π,µ,Λ) ] + const
Since p(X,Z,π,µ,Λ) = p(X|Z,µ,Λ) p(Z|π) p(π) p(µ|Λ) p(Λ), we have
ln q*(Z) = E_π[ ln p(Z|π) ] + E_{µ,Λ}[ ln p(X|Z,µ,Λ) ] + const
Note: the expectations are just weighted sums.

Simplification of q*(Z)
Expression for the factor q*(Z):
ln q*(Z) = E_π[ ln p(Z|π) ] + E_{µ,Λ}[ ln p(X|Z,µ,Λ) ] + const
Absorbing terms not depending on Z into the constant:
ln q*(Z) = Σ_{n=1}^N Σ_{k=1}^K z_nk ln ρ_nk + const
where
ln ρ_nk = E[ln π_k] + ½ E[ln |Λ_k|] − (D/2) ln(2π) − ½ E_{µ_k,Λ_k}[ (x_n − µ_k)^T Λ_k (x_n − µ_k) ]
and D is the dimensionality of the data variable x.
Taking exponentials on both sides, q*(Z) ∝ Π_{n=1}^N Π_{k=1}^K ρ_nk^{z_nk}, and the normalized distribution is
q*(Z) = Π_{n=1}^N Π_{k=1}^K r_nk^{z_nk}   where   r_nk = ρ_nk / Σ_{j=1}^K ρ_nj
The r_nk are positive, since the ρ_nk are exponentials of real numbers, and sum to one as required.

Factor q*(Z) Has the Same Form as the Prior
We have found the form of q*(Z) that maximizes the functional L(q); the normalized distribution
q*(Z) = Π_{n=1}^N Π_{k=1}^K r_nk^{z_nk}
has the same form as the prior p(Z|π) = Π_{n=1}^N Π_{k=1}^K π_k^{z_nk}.
The distribution q*(Z) is discrete, with the standard result E[z_nk] = r_nk, which play the role of responsibilities.
Since the equations for q*(Z) depend on moments of the other variables, they are coupled and solved iteratively.

Variational EM
Variational E-step: determine the responsibilities r_nk.
Variational M-step: 1. determine the statistics of the data set (see the sketch below):
N_k = Σ_{n=1}^N r_nk   (total responsibility of the kth component)
x̄_k = (1/N_k) Σ_{n=1}^N r_nk x_n   (mean of the kth component)
S_k = (1/N_k) Σ_{n=1}^N r_nk (x_n − x̄_k)(x_n − x̄_k)^T   (covariance matrix of the kth component)
and 2. find the optimal solution for the factor q(π,µ,Λ).
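
A short sketch of these three statistics, assuming a data matrix X of shape (N, D) and a responsibility matrix r of shape (N, K); the function name is illustrative.

```python
import numpy as np

def component_statistics(X, r):
    """N_k, xbar_k and S_k from data X (N, D) and responsibilities r (N, K)."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0)                                   # N_k = sum_n r_nk
    xbar = (r.T @ X) / Nk[:, None]                       # xbar_k = (1/N_k) sum_n r_nk x_n
    S = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]                               # x_n - xbar_k
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted scatter / N_k
    return Nk, xbar, S
```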

Factorization of q(π,µ,Λ)
Using the general result for factorized distributions, ln q*_j(Z_j) = E_{i≠j}[ ln p(X,Z) ] + const, we can write
ln q*(π,µ,Λ) = ln p(π) + Σ_{k=1}^K ln p(µ_k,Λ_k) + E_Z[ ln p(Z|π) ] + Σ_{k=1}^K Σ_{n=1}^N E[z_nk] ln N(x_n|µ_k, Λ_k^{-1}) + const
which decomposes into terms involving only π and terms involving only µ and Λ. The terms involving µ and Λ are themselves a sum of terms involving the individual µ_k and Λ_k, leading to the factorization
q(π,µ,Λ) = q(π) Π_{k=1}^K q(µ_k,Λ_k)

Factor q(π) Is a Dirichlet
Given the factorization q(π,µ,Λ) = q(π) Π_{k=1}^K q(µ_k,Λ_k), consider each factor in turn: q(π) and q(µ_k,Λ_k).
(2a) Identifying the terms depending on π, q(π) has the solution
ln q*(π) = (α_0 − 1) Σ_{k=1}^K ln π_k + Σ_{k=1}^K Σ_{n=1}^N r_nk ln π_k + const
Taking exponentials on both sides we get q*(π) as a Dirichlet:
q*(π) = Dir(π|α)   where α has components α_k = α_0 + N_k
Recall the Dirichlet distribution: Dir(µ|α) = [Γ(α̂)/(Γ(α_1)···Γ(α_K))] Π_{k=1}^K µ_k^{α_k − 1}, where α̂ = Σ_{k=1}^K α_k.
(Figure: Dirichlet densities for α_k = 0.1 and α_k = 3.)

Factor q*(µ_k,Λ_k) Is a Gaussian-Wishart
(2b) The variational posterior q*(µ_k,Λ_k) does not factorize further into marginals; it is a Gaussian-Wishart distribution
q*(µ_k,Λ_k) = N(µ_k|m_k, (β_k Λ_k)^{-1}) W(Λ_k|W_k, ν_k)
W is the Wishart distribution. It has the form
W(Λ|W,ν) = B(W,ν) |Λ|^{(ν−D−1)/2} exp[ −½ Tr(W^{-1}Λ) ]
where ν is the number of degrees of freedom, W is a D x D scale matrix, Tr is the trace and B(W,ν) is a normalization constant. It is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ.

Parameters of q*(µ_k,Λ_k)
The Gaussian-Wishart factor is q*(µ_k,Λ_k) = N(µ_k|m_k, (β_k Λ_k)^{-1}) W(Λ_k|W_k, ν_k), where we have defined
β_k = β_0 + N_k
m_k = (1/β_k)(β_0 m_0 + N_k x̄_k)
W_k^{-1} = W_0^{-1} + N_k S_k + [β_0 N_k/(β_0 + N_k)] (x̄_k − m_0)(x̄_k − m_0)^T
ν_k = ν_0 + N_k
These update equations, sketched in code below, are analogous to the M-step of EM for the maximum-likelihood solution of the GMM, and involve evaluating the same sums over the data set as EM.
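
A sketch of these updates, continuing the illustrative names from the statistics snippet above (Nk, xbar, S) and passing the prior hyperparameters as arguments; this is one possible arrangement, not a definitive implementation.

```python
import numpy as np

def update_variational_parameters(Nk, xbar, S, alpha0, beta0, m0, W0, nu0):
    """Variational M-step updates for (alpha_k, beta_k, m_k, W_k, nu_k) — a sketch."""
    K, D = xbar.shape
    alpha = alpha0 + Nk                        # Dirichlet parameters of q*(pi)
    beta = beta0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk
    W = np.empty((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        d = (xbar[k] - m0)[:, None]            # (xbar_k - m0) as a D x 1 column
        W_inv_k = W0_inv + Nk[k] * S[k] + (beta0 * Nk[k] / (beta0 + Nk[k])) * (d @ d.T)
        W[k] = np.linalg.inv(W_inv_k)
    return alpha, beta, m, nu, W
```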

Expressions for the Responsibilities
For the M step we need the expectations E[z_nk] = r_nk, which are obtained by normalizing the ρ_nk:
r_nk = ρ_nk / Σ_{j=1}^K ρ_nj
where
ln ρ_nk = E[ln π_k] + ½ E[ln |Λ_k|] − (D/2) ln(2π) − ½ E_{µ_k,Λ_k}[ (x_n − µ_k)^T Λ_k (x_n − µ_k) ]
The three expectations with respect to the variational distribution of the parameters are easily evaluated (a code sketch follows):
ln π̃_k ≡ E[ln π_k] = ψ(α_k) − ψ(α̂)   where α̂ = Σ_k α_k, which appears in the definition of the Dirichlet
ln Λ̃_k ≡ E[ln |Λ_k|] = Σ_{i=1}^D ψ((ν_k + 1 − i)/2) + D ln 2 + ln |W_k|   where ν_k is the number of degrees of freedom of the Wishart
E_{µ_k,Λ_k}[ (x_n − µ_k)^T Λ_k (x_n − µ_k) ] = D β_k^{-1} + ν_k (x_n − m_k)^T W_k (x_n − m_k)
Here ψ is the digamma function, ψ(a) = (d/da) ln Γ(a).
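
A sketch of the variational E-step built from these three expectations, using scipy.special.digamma for ψ; the parameter arrays (alpha, beta, m, nu, W) follow the update sketch above, and the log-domain normalization is an implementation choice for numerical stability.

```python
import numpy as np
from scipy.special import digamma

def responsibilities(X, alpha, beta, m, nu, W):
    """Variational E-step: r_nk from the three expectations above (a sketch)."""
    N, D = X.shape
    K = len(alpha)
    E_ln_pi = digamma(alpha) - digamma(alpha.sum())        # E[ln pi_k]
    log_rho = np.empty((N, K))
    for k in range(K):
        # E[ln |Lambda_k|] = sum_i psi((nu_k + 1 - i)/2) + D ln 2 + ln |W_k|
        E_ln_det = (digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
                    + D * np.log(2.0) + np.linalg.slogdet(W[k])[1])
        diff = X - m[k]
        # E_{mu_k, Lambda_k}[(x_n - mu_k)^T Lambda_k (x_n - mu_k)]
        E_quad = D / beta[k] + nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)
        log_rho[:, k] = (E_ln_pi[k] + 0.5 * E_ln_det
                         - 0.5 * D * np.log(2.0 * np.pi) - 0.5 * E_quad)
    # Normalize in the log domain: r_nk = rho_nk / sum_j rho_nj
    log_rho -= log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)
```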

Evaluation of Responsibilities
Substituting the three expectations into ln ρ_nk gives
r_nk ∝ π̃_k Λ̃_k^{1/2} exp{ −D/(2β_k) − (ν_k/2)(x_n − m_k)^T W_k (x_n − m_k) }
This is similar to the responsibilities in maximum-likelihood EM:
γ(z_k) ≡ p(z_k=1|x) = p(z_k=1) p(x|z_k=1) / Σ_{j=1}^K p(z_j=1) p(x|z_j=1) = π_k N(x|µ_k,Σ_k) / Σ_{j=1}^K π_j N(x|µ_j,Σ_j)
which can be written in the form
r_nk ∝ π_k |Λ_k|^{1/2} exp{ −½ (x_n − µ_k)^T Λ_k (x_n − µ_k) }
where we have used the precision Λ_k instead of the covariance Σ_k to highlight the similarity.

Summary of Optimization
Optimization of the variational posterior distribution involves cycling between two stages, analogous to the E and M steps of maximum-likelihood EM.
Variational E-step: use the current distribution over the model parameters to evaluate the moments and hence the responsibilities E[z_nk] = r_nk.
Variational M-step: keep the responsibilities fixed and use them to recompute the variational distribution over the parameters, using
q*(π) = Dir(π|α)   and   q*(µ_k,Λ_k) = N(µ_k|m_k, (β_k Λ_k)^{-1}) W(Λ_k|W_k, ν_k)

Variational Bayesian GMM: Old Faithful Data Set
(Figure: fits with K = 6 components. After convergence only two components remain; the density of red ink inside each ellipse shows the mean value of its mixing coefficient.)
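
The same pruning behaviour can be reproduced with scikit-learn's BayesianGaussianMixture, which implements this variational treatment; the sketch below uses synthetic two-cluster data in place of Old Faithful, and the hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two well-separated clusters stand in for the Old Faithful data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (200, 2)),
               rng.normal([4.0, 4.0], 0.5, (200, 2))])

# Start with K = 6; a small Dirichlet concentration (alpha_0) lets the
# variational updates drive the unneeded mixing coefficients toward zero.
vb = BayesianGaussianMixture(
    n_components=6,
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-3,
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(vb.weights_, 3))   # most of the six weights collapse toward zero
```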

Similarity of Variational Bayes and EM
There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood. In the limit N → ∞, the Bayesian treatment converges to maximum-likelihood EM. The variational algorithm is somewhat more expensive, but the problem of singularities is eliminated.

Variational Lower Bound
We can straightforwardly evaluate the lower bound L(q) for this model. Recall
ln p(X) = L(q) + KL(q||p)
where
L(q) = ∫ q(Z) ln { p(X,Z)/q(Z) } dZ
KL(q||p) = − ∫ q(Z) ln { p(Z|X)/q(Z) } dZ
The lower bound is used to monitor the re-estimation and to test for convergence.

Predictive Density
In using a Bayesian GMM we will be interested in the predictive density for a new value x̂ of the observed variable. Marginalizing over the corresponding latent variable ẑ, we can show that
p(x̂|X) ≈ (1/α̂) Σ_{k=1}^K α_k St(x̂|m_k, L_k, ν_k + 1 − D)
where the kth component has mean m_k and precision
L_k = [(ν_k + 1 − D) β_k / (1 + β_k)] W_k
This mixture of Student's t distributions becomes a GMM as N → ∞.
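
A sketch of evaluating this predictive density with scipy.stats.multivariate_t (available in SciPy 1.6 and later), reusing the illustrative parameter arrays from the earlier sketches; note that multivariate_t is parameterized by a scale (shape) matrix, so the precision L_k is inverted.

```python
import numpy as np
from scipy.stats import multivariate_t

def predictive_density(x_new, alpha, beta, m, nu, W):
    """p(x_new | X) as a mixture of Student's t densities (a sketch)."""
    K, D = m.shape
    alpha_hat = alpha.sum()
    density = 0.0
    for k in range(K):
        df = nu[k] + 1 - D
        L_k = (df * beta[k] / (1.0 + beta[k])) * W[k]      # precision of the kth t component
        density += (alpha[k] / alpha_hat) * multivariate_t.pdf(
            x_new, loc=m[k], shape=np.linalg.inv(L_k), df=df)
    return density
```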

Determining the Number of Components
(Figure: plot of the variational lower bound L versus the number of components K, showing a distinct peak at K = 2. For each K the model is trained from 100 different random starts; the results are shown as '+' marks.)
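
A hedged sketch of the same model-comparison idea with scikit-learn, comparing the fitted lower bound across K with a few random restarts per setting; the exact normalization of scikit-learn's lower_bound_ differs from the slides' L, but the comparison across K serves the same purpose, and the synthetic data here are illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (200, 2)),
               rng.normal([4.0, 4.0], 0.5, (200, 2))])

for K in range(1, 7):
    best = max(
        (BayesianGaussianMixture(n_components=K, covariance_type="full",
                                 max_iter=500, random_state=seed).fit(X)
         for seed in range(5)),               # a few random restarts per K
        key=lambda model: model.lower_bound_,
    )
    print(f"K = {K}: variational lower bound = {best.lower_bound_:.3f}")
```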