Variational Mixture of Gaussians
Sargur Srihari
srihari@cedar.buffalo.edu
Objective
- Apply the variational inference machinery to Gaussian Mixture Models (GMMs)
- Demonstrate how a Bayesian treatment elegantly resolves the difficulties of maximum likelihood
- Many more complex distributions can be handled by straightforward extensions of this analysis
Graphical Model for GMM
- Graphical model corresponding to the likelihood function of the standard GMM: a directed acyclic graph representing the mixture, shown in plate notation and as the equivalent expanded network
- For each observation x_n we have a corresponding latent variable z_n, a 1-of-K binary vector with elements z_nk, k = 1,..,K
- Denote the observed data by X = {x_1,..,x_N} and the latent variables by Z = {z_1,..,z_N}
Likelihood Function for GMM
- Since z takes the values {z_k} with probabilities π_k, the mixture density function is
  p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)
- Therefore the likelihood function, a product over the N i.i.d. samples, is
  p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
- Therefore the log-likelihood function is
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}
- Find the parameters π, µ and Σ that maximize the log-likelihood: a more difficult problem than for a single Gaussian
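As an illustration, here is a minimal NumPy/SciPy sketch of evaluating this log-likelihood. The function name gmm_log_likelihood and the shape conventions (X is N x D, mu is K x D, Sigma is K x D x D) are assumptions for this example, not part of the slides.

```python
# Hypothetical helper: evaluate ln p(X | pi, mu, Sigma) for a GMM.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    # Weighted component densities pi_k * N(x_n | mu_k, Sigma_k), shape (N, K)
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                     for k in range(len(pi))], axis=1)
    # Sum over components inside the log, then sum over the N samples
    return np.sum(np.log(dens.sum(axis=1)))
```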
GMM m.l.e. Expressions
- Obtained by setting derivatives of the log-likelihood to zero:
  Means:                \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
  Covariance matrices:  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
  Mixing coefficients:  \pi_k = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})
- All three are expressed in terms of the responsibilities γ(z_nk)
- These are not closed-form solutions for the parameters, since the responsibilities γ(z_nk) depend on those parameters in a complex way
EM for GMM
- E step: use the current values of the parameters µ_k, Σ_k, π_k to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(z_nk)
- M step: use these posterior probabilities to re-estimate the means, covariances and mixing coefficients, by maximizing the expectation of ln p(X,Z) taken with respect to p(Z|X)
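A minimal sketch of one such EM iteration, assuming X has shape (N, D), pi has length K, mu is (K, D) and Sigma is (K, D, D); the function name em_step is a label for this example only.

```python
# One E step followed by one M step of maximum-likelihood EM for a GMM.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    N, D = X.shape
    K = len(pi)
    # E step: responsibilities gamma(z_nk) = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                     for k in range(K)], axis=1)              # (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate parameters from the responsibilities
    Nk = gamma.sum(axis=0)                                    # effective counts N_k
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pi_new = Nk / N
    return pi_new, mu_new, Sigma_new, gamma
```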
Graphical Model for Bayesian GMM
(Figure: graphical models for the standard GMM and the Bayesian GMM; the Bayesian model adds nodes for the mixing coefficients, precisions and means.)
To specify the model we need these conditional probabilities:
1. p(Z|π): conditional distribution of Z given the mixing coefficients
2. p(X|Z, µ, Λ): conditional distribution of the observed data given the latent variables and component parameters
3. p(π): distribution of the mixing coefficients
4. p(µ, Λ): prior governing the mean and precision of each component
Conditional Distribution Expressions
1. Conditional distribution of Z = {z_1,..,z_N} given the mixing coefficients π (the components are mutually exclusive):
   p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
   (recall, for a single observation: p(z) = \prod_k \pi_k^{z_k} and p(x \mid z) = \prod_k \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k})
2. Conditional distribution of the observed data X = {x_1,..,x_N} given the latent variables and component parameters (the components are Gaussian):
   p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}
   where µ = {µ_k} and Λ = {Λ_k}; use of the precision matrix simplifies the subsequent analysis
Parameter Priors: Mixing Coefficients
3. Distribution of the mixing coefficients p(π). Conjugate priors simplify the analysis; the conjugate prior is a Dirichlet distribution over π:
   p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}
- We have chosen the same parameter α_0 for each of the components
- C(α_0) is the normalization constant for the Dirichlet distribution
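A quick sketch of what this symmetric Dirichlet prior looks like when sampled; the values K = 6 and α_0 = 0.1 are illustrative assumptions, not prescribed by the slides.

```python
# Draw mixing-coefficient vectors pi from a symmetric Dirichlet prior Dir(pi | alpha_0).
import numpy as np

rng = np.random.default_rng(0)
K, alpha0 = 6, 0.1
pi_samples = rng.dirichlet(alpha0 * np.ones(K), size=5)   # shape (5, K)
print(pi_samples.sum(axis=1))                             # each sampled pi sums to one
# A small alpha_0 (< 1) concentrates mass on a few components (sparse mixtures);
# a large alpha_0 favours nearly equal mixing coefficients.
```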
Parameter Priors: Mean, Precision
4. Distribution of the mean and precision of the Gaussian components, p(µ, Λ). The Gaussian-Wishart prior is
   p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)
   which represents the conjugate prior when both the mean and the precision are unknown
- The resulting model has a link between Λ and µ, due to distribution (4) above
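A minimal sketch of drawing one (µ_k, Λ_k) pair from this Gaussian-Wishart prior, assuming illustrative hyperparameter values (D = 2, m_0 = 0, β_0 = 1, W_0 = I, ν_0 = D) that are not specified on the slides.

```python
# Sample a precision matrix Lambda from W(Lambda | W0, nu0), then a mean mu
# from N(mu | m0, (beta0 * Lambda)^-1).
import numpy as np
from scipy.stats import wishart

D = 2
m0, beta0 = np.zeros(D), 1.0
W0, nu0 = np.eye(D), float(D)

Lambda = wishart.rvs(df=nu0, scale=W0, random_state=0)            # precision matrix
mu = np.random.default_rng(0).multivariate_normal(
        m0, np.linalg.inv(beta0 * Lambda))                        # mean given precision
```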
Bayesian Network for Bayesian GMM
- Joint distribution of all the random variables:
  p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)
  All of the factors were given earlier; only X = {x_1,..,x_N} is observed
- This BN provides a nice distinction between latent variables and parameters:
  - Variables such as z_n that appear inside the plate are latent variables; the number of such variables grows with the data set
  - Variables outside the plate (mixing coefficients, precisions, means) are parameters; they are fixed in number, independent of the size of the data set
- From the viewpoint of PGMs there is no fundamental difference between them
Recall GMM: The Variational Approach
- The EM approach:
  1. Evaluation of the posterior distribution p(Z|X)
  2. Evaluation of the expectation of ln p(X,Z) with respect to p(Z|X)
- Recall the GMM density p(x) = \sum_z p(z)\, p(x \mid z) = \sum_k \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k); here p(z) has parameter π, which in the Bayesian treatment has its own distribution p(π)
- Our goal is to specify the variational distribution q(Z, π, µ, Λ), which approximates the posterior p(Z, π, µ, Λ | X)
- Recall the decomposition ln p(X) = L(q) + KL(q ∥ p), where
  \mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ
  \mathrm{KL}(q \parallel p) = -\int q(Z) \ln\left\{ \frac{p(Z \mid X)}{q(Z)} \right\} dZ
Variational Distribution
- In variational inference we specify q by a factorized distribution over M groups of variables:
  q(Z) = \prod_{i=1}^{M} q_i(Z_i)
- For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ, so we consider the variational distribution
  q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda)
  (subscripts on the q's are omitted)
- Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model
- The functional forms of both q(Z) and q(π, µ, Λ) are determined automatically by optimizing the variational distribution
Sequential Update Equations
- Using the general result for factorized distributions: when
  \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{ \ln p(X,Z) - \sum_i \ln q_i \right\} dZ,
  the factor q_j that makes the functional L(q) largest is given by
  \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const}
- For the Bayesian GMM the log of the optimized factor is
  \ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}
- Since p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda), we have
  \ln q^*(Z) = \mathbb{E}_{\pi}[\ln p(Z \mid \pi)] + \mathbb{E}_{\mu,\Lambda}[\ln p(X \mid Z, \mu, \Lambda)] + \text{const}
- Note: the expectations are just weighted sums
Simplification of q*(Z)
- Expression for the factor q*(Z):
  \ln q^*(Z) = \mathbb{E}_{\pi}[\ln p(Z \mid \pi)] + \mathbb{E}_{\mu,\Lambda}[\ln p(X \mid Z, \mu, \Lambda)] + \text{const}
- Absorbing terms that do not depend on Z into the constant:
  \ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}
  where
  \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \tfrac{1}{2}\mathbb{E}[\ln |\Lambda_k|] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\!\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]
  and D is the dimensionality of the data variable x
- Taking exponentials of both sides: q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}
- The normalized distribution is
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}, \quad \text{where } r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
- The r_nk are positive, since the ρ_nk are exponentials of real numbers, and they sum to one over k as required
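In practice ln ρ_nk is computed in the log domain and normalised with a log-sum-exp; a minimal sketch follows, assuming log_rho is an (N, K) array (the function name is a label for this example only).

```python
# Convert unnormalised log rho_nk values into responsibilities r_nk.
import numpy as np
from scipy.special import logsumexp

def responsibilities_from_log_rho(log_rho):
    # Subtracting the log-sum-exp keeps the exponentiation numerically stable.
    log_r = log_rho - logsumexp(log_rho, axis=1, keepdims=True)
    r = np.exp(log_r)
    return r   # rows are positive and sum to one, as required
```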
Factor q*(Z) Has the Same Form as the Prior
- We have found the form of q*(Z) that maximizes the functional L(q); the normalized distribution
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
  has the same form as the prior
  p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
- The distribution q*(Z) is discrete and has the standard result E[z_nk] = r_nk, so the r_nk play the role of responsibilities
- Since the equations for q*(Z) depend on moments of the other variables, they are coupled and are solved iteratively
Variational EM
- Variational E-step: determine the responsibilities r_nk
- Variational M-step:
  1. Determine the statistics of the data set:
     N_k = \sum_{n=1}^{N} r_{nk}   (total responsibility, i.e. effective number of points, of the kth component)
     \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n   (mean of the kth component)
     S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T   (covariance matrix of the kth component)
  2. Find the optimal solution for the factor q(π, µ, Λ)
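A sketch of computing these statistics, assuming X is (N, D) and the responsibilities r are (N, K); the name vb_statistics and the small epsilon guard are assumptions for this example.

```python
# Weighted data statistics N_k, xbar_k, S_k for the variational M step.
import numpy as np

def vb_statistics(X, r):
    Nk = r.sum(axis=0) + 1e-10                  # N_k = sum_n r_nk (guarded against empty components)
    xbar = (r.T @ X) / Nk[:, None]              # weighted component means
    N, D = X.shape
    K = r.shape[1]
    S = np.zeros((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariances
    return Nk, xbar, S
```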
Factorization of q(π, µ, Λ)
- Using the general result for factorized distributions,
  \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const},
  we can write
  \ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z[\ln p(Z \mid \pi)] + \sum_{k=1}^{K} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}
- This decomposes into terms involving only π and terms involving only µ and Λ
- The terms involving µ and Λ comprise a sum of terms involving µ_k and Λ_k, leading to the factorization
  q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)
Factor q*(π) Is a Dirichlet
- Given the factorization q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k), consider each factor in turn: q(π) and q(µ_k, Λ_k)
- (2a) Identifying the terms depending on π, q(π) has the solution
  \ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}
- Taking exponentials of both sides, q*(π) is a Dirichlet:
  q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha), \quad \text{where } \alpha \text{ has components } \alpha_k = \alpha_0 + N_k
- Recall the Dirichlet distribution:
  \mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \text{where } \hat{\alpha} = \sum_{k=1}^{K} \alpha_k \text{ (written } \hat{\alpha} \text{ here to avoid confusion with the prior parameter } \alpha_0\text{)}
- (Figure: example Dirichlet densities, e.g. α_k = 3 and α_k = 0.1)
Factor q*(µ_k, Λ_k) Is a Gaussian-Wishart
- (2b) The variational posterior q*(µ_k, Λ_k) does not factorize further into marginals; it is a Gaussian-Wishart distribution:
  q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
- W is the Wishart distribution. It has the form
  \mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left[-\tfrac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right]
  where ν is the number of degrees of freedom, W is a D x D scale matrix, Tr denotes the trace, and B(W, ν) is a normalization constant
- The Wishart is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ
Parameters of q*(µ_k, Λ_k)
- The Gaussian-Wishart factor is
  q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
  where we have defined
  \beta_k = \beta_0 + N_k
  m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right)
  W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}(\bar{x}_k - m_0)(\bar{x}_k - m_0)^T
  \nu_k = \nu_0 + N_k + 1
- These update equations are analogous to the M-step of EM for the m.l. solution of the GMM; they involve evaluating the same sums over the data set as EM
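A sketch of these M-step updates, following the equations as stated on this slide. The inputs Nk, xbar, S are the statistics from the previous step, and alpha0, beta0, m0, W0, nu0 are the prior hyperparameters; the function name and array layouts are assumptions for this example.

```python
# Variational M-step parameter updates for alpha_k, beta_k, m_k, nu_k, W_k.
import numpy as np

def vb_m_step(Nk, xbar, S, alpha0, beta0, m0, W0, nu0):
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk + 1                       # degrees-of-freedom update as given on the slide
    K, D = xbar.shape
    W = np.zeros((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]      # column vector (xbar_k - m_0)
        W_inv = (W0_inv + Nk[k] * S[k]
                 + (beta0 * Nk[k] / (beta0 + Nk[k])) * (diff @ diff.T))
        W[k] = np.linalg.inv(W_inv)
    return alpha, beta, m, nu, W
```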
Expression for Responsibilities
- For the M step we need the expectations E[z_nk] = r_nk, which are obtained by normalizing the ρ_nk:
  r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}, \quad \text{where} \quad
  \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \tfrac{1}{2}\mathbb{E}[\ln |\Lambda_k|] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\!\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]
- The three expectations with respect to the variational distribution of the parameters are easily evaluated:
  \ln \tilde{\pi}_k \equiv \mathbb{E}[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}), \quad \hat{\alpha} = \sum_k \alpha_k
  \ln \tilde{\Lambda}_k \equiv \mathbb{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|
  \mathbb{E}_{\mu_k,\Lambda_k}\!\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] = D\beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)
- ψ is the digamma function, ψ(a) = d/da ln Γ(a); ν_k is the number of degrees of freedom of the Wishart, and α̂ is the quantity appearing in the normalization of the Dirichlet
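A sketch of evaluating these three expectations, assuming the variational parameters alpha (K,), beta (K,), m (K, D), W (K, D, D) and nu (K,) from the M step; the helper names are labels for this example.

```python
# The three expectations that enter ln rho_nk.
import numpy as np
from scipy.special import digamma

def expected_log_pi(alpha):
    # E[ln pi_k] = psi(alpha_k) - psi(alpha_hat)
    return digamma(alpha) - digamma(alpha.sum())

def expected_log_det_Lambda(W, nu, D):
    # E[ln |Lambda_k|] = sum_{i=1..D} psi((nu_k + 1 - i)/2) + D ln 2 + ln |W_k|
    i = np.arange(1, D + 1)
    return np.array([digamma(0.5 * (nu_k + 1 - i)).sum()
                     + D * np.log(2.0) + np.log(np.linalg.det(W_k))
                     for W_k, nu_k in zip(W, nu)])

def expected_mahalanobis(X, m, W, nu, beta):
    # E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)] = D/beta_k + nu_k (x_n - m_k)^T W_k (x_n - m_k)
    N, D = X.shape
    quad = np.empty((N, len(nu)))
    for k in range(len(nu)):
        diff = X - m[k]
        quad[:, k] = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
    return quad
```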
Evaluation of Responsibilities
- Substituting the three expectations into ln ρ_nk gives
  r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\!\left\{ -\frac{D}{2\beta_k} - \frac{\nu_k}{2}(x_n - m_k)^T W_k (x_n - m_k) \right\}
- This is similar to the responsibilities in maximum-likelihood EM:
  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}
  which can be written in the form
  r_{nk} \propto \pi_k\, |\Lambda_k|^{1/2} \exp\!\left\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right\}
  where we have used the precision Λ_k instead of the covariance Σ_k to highlight the similarity
Summary of Optimization
- Optimization of the variational posterior distribution involves cycling between two stages, analogous to the E and M steps of maximum-likelihood EM
- Variational E-step: use the current distribution over the model parameters to evaluate the moments and hence evaluate E[z_nk] = r_nk
- Variational M-step: keep the responsibilities fixed; use them to recompute the variational distribution over the parameters using
  q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha) \quad \text{and} \quad q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right)\mathcal{W}(\Lambda_k \mid W_k, \nu_k)
Variational Bayesian GMM: Old Faithful Data Set
(Figure: variational Bayesian GMM fitted to the Old Faithful data set with K = 6 components. After convergence there are effectively only two components; the density of red ink inside each ellipse shows the mean value of its mixing coefficient.)
Similarity of Variational Bayes and EM
- There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood
- In the limit N → ∞, the Bayesian treatment converges to maximum-likelihood EM
- The variational algorithm is more expensive, but the problem of singularities is eliminated
Variational Lower Bound
- We can straightforwardly evaluate the lower bound L(q) for this model
- Recall ln p(X) = L(q) + KL(q ∥ p), where
  \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ \quad \text{and} \quad \mathrm{KL}(q \parallel p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ
- The lower bound is used to monitor the re-estimation and to test for convergence
Predictive Density
- In using a Bayesian GMM we will be interested in the predictive density for a new value x̂ of the observed variable, with a corresponding latent variable ẑ
- We can show that
  p(\hat{x} \mid X) = \frac{1}{\hat{\alpha}} \sum_{k=1}^{K} \alpha_k\, \mathrm{St}\!\left(\hat{x} \mid m_k, L_k, \nu_k + 1 - D\right)
  where the kth component has mean m_k and precision
  L_k = \frac{(\nu_k + 1 - D)\,\beta_k}{(1 + \beta_k)}\, W_k
- This mixture of Student's t distributions becomes a GMM in the limit N → ∞
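A sketch of evaluating this predictive density at a single point, using the variational parameters alpha, beta, m, W, nu; it relies on scipy.stats.multivariate_t (available in SciPy 1.6+), and the function name is an assumption for this example.

```python
# Predictive density p(x_hat | X) as a mixture of Student's t distributions.
import numpy as np
from scipy.stats import multivariate_t

def predictive_density(x_hat, alpha, beta, m, W, nu):
    D = len(x_hat)
    weights = alpha / alpha.sum()                 # alpha_k / alpha_hat
    dens = 0.0
    for k in range(len(alpha)):
        df = nu[k] + 1 - D                        # degrees of freedom of the kth Student's t
        L = df * beta[k] / (1.0 + beta[k]) * W[k]              # precision L_k
        dens += weights[k] * multivariate_t.pdf(x_hat, loc=m[k],
                                                shape=np.linalg.inv(L), df=df)
    return dens
```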
Determining the Number of Components
- Plot of the variational lower bound L versus the number of components K
- Distinct peak at K = 2
- For each value of K the model is trained from 100 different starts; the results are shown as '+' symbols