Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin


Recall: The Supervised Learning Problem. Given a set of n samples X = {(x_i, y_i)}, i = 1, ..., n (Chapter 3 of DHS). Assume the examples in each class come from a parameterized Gaussian density. Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification. Estimation uses the Maximum Likelihood approach.

Review of Maximum Likelihood. Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ. Goal: estimate θ, denoted θ̂, such that the observed data is most likely to have come from the distribution with that θ. Steps involved: write the likelihood of the observed data; maximize the likelihood with respect to the parameter.
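
For a Gaussian density both steps have a closed-form answer: the log-likelihood is maximized by the sample mean and the (biased) sample variance. A minimal numpy sketch, assuming 1D i.i.d. data (the array name and sample size are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)   # illustrative i.i.d. sample

# ML estimates for a 1D Gaussian: the maximizers of sum_j log N(x_j | mu, sigma^2)
mu_hat = x.mean()                              # sample mean
var_hat = ((x - mu_hat) ** 2).mean()           # biased sample variance (divides by n, not n-1)

print(mu_hat, var_hat)
```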

Example: 1D Gaussian Distribution. [Figure: maximum likelihood estimation of the mean of a Gaussian distribution; the true density compared with MLE fits from two sample sizes.]

Example: 2D Gaussian Distribution. [Figure: 2D parameter estimation for a Gaussian distribution; blue: true density, red: estimated from 50 examples.]

Multimodal Class Distributions. A single Gaussian may not accurately model the classes. Find subclasses in handwritten online characters (thousands of characters written by on the order of a hundred writers). Performance improves by modeling the subclasses. Connell and Jain, Writer Adaptation for Online Handwriting Recognition, IEEE PAMI, March 2002.

Multimodal Classes. Handwritten 'f' vs. 'y' classification task. A single Gaussian distribution may not model the classes accurately.

An extreme example of multimodal classes: limitations of unimodal class modeling. [Figure: red vs. blue classification in the (x, y) plane.] The classes are well separated; however, incorrect model assumptions result in high classification error. The red class is a mixture of two Gaussian distributions, and there is no class label information when modeling the density of just the red class.

Finite mixtures. k random sources with probability density functions f_i(x), i = 1, ..., k. [Diagram: one of the sources f_1(x), ..., f_k(x) is chosen at random, and the random variable X is drawn from it.]

Finite mixtures. Example: 3 species (Iris).

Finite mixtures. Choose a source at random, with Prob(source i) = α_i, and draw the random variable X from that source's density. Conditional: f(x | source i) = f_i(x). Joint: f(x and source i) = f_i(x) α_i. Unconditional: f(x) = Σ_{all sources i} f(x and source i) = Σ_{i=1}^{k} α_i f_i(x).

Finite mixtures. f(x) = Σ_{i=1}^{k} α_i f_i(x), where the f_i(x) are the component densities and the mixing probabilities satisfy α_i ≥ 0 and Σ_{i=1}^{k} α_i = 1. Parameterized components (e.g., Gaussian): f_i(x) = f(x | θ_i), so f(x | Θ) = Σ_{i=1}^{k} α_i f(x | θ_i), with Θ = {θ_1, θ_2, ..., θ_k, α_1, α_2, ..., α_k}.
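
As a concrete illustration of f(x | Θ) = Σ_i α_i f(x | θ_i), the sketch below evaluates a 1D Gaussian mixture density on a grid; the parameter values are made up for the example, not taken from the slides:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """1D Gaussian density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Illustrative mixture parameters Theta = {mu_i, var_i, alpha_i}
alphas = np.array([0.6, 0.3, 0.1])     # mixing probabilities: non-negative, summing to 1
mus    = np.array([2.0, 4.0, 7.0])
vars_  = np.array([1.0, 0.5, 0.2])

xs = np.linspace(0.0, 9.0, 200)
# f(x | Theta) = sum_i alpha_i N(x | mu_i, var_i)
fx = sum(a * gauss_pdf(xs, m, v) for a, m, v in zip(alphas, mus, vars_))
```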

Gaussian mixtures: f(x | θ_i) is Gaussian. Arbitrary covariances: f(x | θ_i) = N(x | µ_i, C_i), with Θ = {µ_1, ..., µ_k, C_1, ..., C_k, α_1, ..., α_k}. Common covariance: f(x | θ_i) = N(x | µ_i, C), with Θ = {µ_1, ..., µ_k, C, α_1, ..., α_k}.

Mixture fitting / estimation. Data: n independent observations, x = {x^(1), x^(2), ..., x^(n)}. Goals: estimate the parameter set Θ, and possibly classify the observations. Example: how many species are there, and what is the mean of each? Which points belong to each species? [Figure: observed data vs. the same data classified (classes unknown).]

Gaussian mixtures (1D), an example. [Figure: density of a three-component 1D Gaussian mixture with mixing probabilities α_1 = 0.6, α_2 = 0.3, α_3 = 0.1 and differing means and variances.]

Gaussian mixtures, an R² example with k = 3 components. [Figure: 500 points sampled from a three-component 2D Gaussian mixture; the slide lists the means µ_1, µ_2, µ_3 and covariance matrices C_1, C_2, C_3.]
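
The generative story behind such a figure (pick a component with probability α_i, then draw from that component's Gaussian) is easy to simulate. A minimal sketch with made-up means and covariances, assuming 500 points as on the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2D, 3-component Gaussian mixture (parameter values are made up)
alphas = np.array([0.4, 0.3, 0.3])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([8.0, 0.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]]), np.eye(2)]

n = 500
labels = rng.choice(len(alphas), size=n, p=alphas)                 # choose a source at random
points = np.array([rng.multivariate_normal(mus[z], covs[z]) for z in labels])
```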

Uses of mixtures in pattern recognition. Unsupervised learning (model-based clustering): each component models one cluster, so clustering = mixture fitting. Observations: unclassified points. Goals: find the classes and classify the points.

Uses of mixtures in pattern recognition. Mixtures can approximate arbitrary densities, so they are good for representing class-conditional densities in supervised learning. Example: two strongly non-Gaussian classes; use mixtures to model each class-conditional density.

Uses of mixtures in pattern recognition. Find subclasses (lexemes), e.g., in online characters. Performance improves by modeling the subclasses (thousands of characters written by on the order of a hundred writers). Connell and Jain, Writer Adaptation for Online Handwriting Recognition, IEEE PAMI, 2002.

Fitting mixtures. n independent observations x = {x^(1), x^(2), ..., x^(n)}. Maximum (log-)likelihood (ML) estimate of Θ: Θ̂ = arg max_Θ L(x, Θ), where L(x, Θ) = log Π_{j=1}^{n} f(x^(j) | Θ) = Σ_{j=1}^{n} log Σ_{i=1}^{k} α_i f(x^(j) | θ_i). The mixture ML estimate has no closed-form solution.
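
Although it cannot be maximized in closed form, this log-likelihood is straightforward to evaluate numerically. A sketch for 1D data (the function name and parameterization are my own; in practice the inner sum is usually computed with a log-sum-exp rearrangement for numerical stability):

```python
import numpy as np

def mixture_loglik(x, alphas, mus, vars_):
    """L(x, Theta) = sum_j log sum_i alpha_i N(x_j | mu_i, var_i), for 1D data x."""
    x = np.asarray(x)[:, None]                                # shape (n, 1), broadcasts against (k,)
    comp = alphas * np.exp(-(x - mus) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
    return np.log(comp.sum(axis=1)).sum()                     # sum over j of log sum over i
```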

Gaussian mixtures: a peculiar type of ML problem. Θ = {µ_1, ..., µ_k, C_1, ..., C_k, α_1, ..., α_k}. Maximum (log-)likelihood estimate: Θ̂ = arg max_Θ L(x, Θ), subject to: each C_i positive definite, α_i ≥ 0, and Σ_{i=1}^{k} α_i = 1. Problem: the likelihood function is unbounded as det(C_i) → 0, so there is no global maximum. Unusual goal: a good local maximum.

A peculiar type of ML problem. Example: a 2-component Gaussian mixture, f(x | µ_1, µ_2, σ, α) = [α / (√(2π) σ)] e^{−(x−µ_1)² / (2σ²)} + [(1−α) / √(2π)] e^{−(x−µ_2)² / 2}. Given data points {x_1, x_2, ..., x_n}, set µ_1 = x_1; then L(x, Θ) = log[ α / (√(2π) σ) + ((1−α) / √(2π)) e^{−(x_1−µ_2)² / 2} ] + Σ_{j=2}^{n} log(...) → ∞ as σ → 0.

Fitting mixtures: a missing data problem. The ML estimate has no closed-form solution; the standard alternative is the expectation-maximization (EM) algorithm. Missing data problem: Observed data: x = {x^(1), x^(2), ..., x^(n)}. Missing data: z = {z^(1), z^(2), ..., z^(n)}, the missing labels ("colors"). Each z^(j) = [z_1^(j), z_2^(j), ..., z_k^(j)] = [0 ... 0 1 0 ... 0]^T, with the 1 at position i meaning that x^(j) was generated by component i.

Fitting mixtures: a missing data problem. Observed data: x = {x^(1), ..., x^(n)}. Missing data: z = {z^(1), ..., z^(n)}, with z^(j) = [z_1^(j), ..., z_k^(j)] containing k−1 zeros and a single one. Complete log-likelihood function: L_c(x, z, Θ) = Σ_{j=1}^{n} Σ_{i=1}^{k} z_i^(j) log[α_i f(x^(j) | θ_i)] = Σ_{j=1}^{n} log f(x^(j), z^(j) | Θ). In the presence of both x and z, Θ would be easy to estimate, but z is missing.

The EM algorithm. Iterative procedure: Θ̂^(0), Θ̂^(1), ..., Θ̂^(t), Θ̂^(t+1), ... Under mild conditions, Θ̂^(t) converges to a local maximum of L(x, Θ) as t → ∞. The E-step: compute the expected value of L_c(x, z, Θ): Q(Θ, Θ̂^(t)) = E[L_c(x, z, Θ) | x, Θ̂^(t)]. The M-step: update the parameter estimates: Θ̂^(t+1) = arg max_Θ Q(Θ, Θ̂^(t)).

The EM algorithm: the Gaussian case. The E-step: Q(Θ, Θ̂^(t)) = E[L_c(x, z, Θ) | x, Θ̂^(t)] = L_c(x, E[z | x, Θ̂^(t)], Θ), because L_c(x, z, Θ) is linear in z. Since z_i^(j) is a binary variable, Bayes' law gives E[z_i^(j) | x, Θ̂^(t)] = Pr{z_i^(j) = 1 | x^(j), Θ̂^(t)} = α̂_i^(t) f(x^(j) | θ̂_i^(t)) / Σ_{l=1}^{k} α̂_l^(t) f(x^(j) | θ̂_l^(t)) ≡ w_i^(j,t). Here w_i^(j,t) is the estimate, at iteration t, of the probability that x^(j) was produced by component i: a soft probabilistic assignment.

The EM algorithm: the Gaussian case. Result of the E-step: w_i^(j,t), the estimate at iteration t of the probability that x^(j) was produced by component i. The M-step:
α̂_i^(t+1) = (1/n) Σ_{j=1}^{n} w_i^(j,t)
µ̂_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) x^(j) / Σ_{j=1}^{n} w_i^(j,t)
Ĉ_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) (x^(j) − µ̂_i^(t+1)) (x^(j) − µ̂_i^(t+1))^T / Σ_{j=1}^{n} w_i^(j,t)
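
A minimal numpy sketch of one EM iteration for a 1D Gaussian mixture, directly following the E-step and M-step formulas above (function and variable names are my own; a full implementation would also handle initialization, convergence checks, and the degenerate σ → 0 case):

```python
import numpy as np

def em_step(x, alphas, mus, vars_):
    """One EM iteration for a 1D Gaussian mixture; x has shape (n,)."""
    x = np.asarray(x)[:, None]                                  # shape (n, 1)
    # E-step: w[j, i] = alpha_i N(x_j | mu_i, var_i) / sum_l alpha_l N(x_j | mu_l, var_l)
    dens = np.exp(-(x - mus) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
    w = alphas * dens
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weighted updates of mixing probabilities, means, and variances
    n_i = w.sum(axis=0)                                         # effective number of points per component
    alphas_new = n_i / len(x)
    mus_new = (w * x).sum(axis=0) / n_i
    vars_new = (w * (x - mus_new) ** 2).sum(axis=0) / n_i
    return alphas_new, mus_new, vars_new
```

Iterating em_step from some initial guess until the log-likelihood stops increasing yields the kind of local maximum discussed below.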

Difficulties with EM. It is a local (greedy) algorithm (the likelihood never decreases), and it is initialization dependent. [Figure: two runs started from different initializations converge, after different numbers of iterations, to different solutions.]

Automatically deciding the number of components. Add a penalty term to the objective function which increases with the number of clusters. Start with a large number of clusters. Modify the M-step to include a "killer" criterion which removes components satisfying a certain condition. Finally, choose the number of components resulting in the largest objective function value (likelihood − penalty).
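
As an illustration of the "likelihood − penalty" objective (the slides do not commit to a specific penalty; a BIC-style term is one common choice, so treat this sketch as an assumption, not the authors' criterion):

```python
import numpy as np

def penalized_objective(loglik, k, n):
    """Log-likelihood minus a BIC-style penalty for a k-component 1D Gaussian mixture.

    Each component has a mean and a variance, plus there are k-1 free mixing
    probabilities, so the number of free parameters is 3k - 1."""
    num_params = 3 * k - 1
    return loglik - 0.5 * num_params * np.log(n)
```

Fitting mixtures for a range of k and keeping the one with the largest penalized objective is the simplest version of this idea; the approach described above instead starts with many components and removes them during EM.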

Example. Same as in [Ueda and Nakano, 1998].

Example. A four-component Gaussian mixture with identity covariances (C_i = I), fitted starting from a larger initial number of components (k_max); the extra components are eliminated during fitting. [Figure: the data and the evolution of the fitted components.]

Example. Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

Example. An example with overlapping components.

The Iris (4-dimensional) data set: 3 components correctly identified.

Another supervised learning example. Problem: learn to classify textures from 9 Gabor features; four classes. Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture, and test on the remaining data. Error rates: mixture-based 0.0074, linear discriminant 0.085, quadratic discriminant 0.055.

Resulting decision regions. [Figure: 2-d projection of the texture data and the obtained mixtures.]

Properties of EM. EM is extremely popular because of the following properties: it is easy to implement; it guarantees that the likelihood increases monotonically (why?); and it guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?). Limitations of EM: the resulting solution depends highly on the initialization, and it can be slow in several cases compared to direct optimization methods (e.g., iterative scaling).

EM as lower bound optimization.
Start with an initial guess {θ_1^0, θ_2^0}.
Come up with a lower bound l(θ_1, θ_2) ≥ l(θ_1^0, θ_2^0) + Q(θ_1, θ_2), where Q(θ_1, θ_2) is a concave function with touch point Q(θ_1 = θ_1^0, θ_2 = θ_2^0) = 0, i.e., the bound touches l(θ_1, θ_2) at {θ_1^0, θ_2^0}.
Search for the solution {θ_1^1, θ_2^1} that maximizes Q(θ_1, θ_2).
Repeat the procedure, producing {θ_1^0, θ_2^0}, {θ_1^1, θ_2^1}, {θ_1^2, θ_2^2}, ...
Converge to a local optimum.
[Figures: the log-likelihood surface l(θ_1, θ_2), successive concave lower bounds touching it, and the sequence of touch points approaching the optimal point.]
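
The lower bound pictured here is the standard Jensen-inequality bound on the mixture log-likelihood. Written out for a single observation, for any distribution q over the missing label (q_i ≥ 0, Σ_i q_i = 1; the q notation is mine, and equality holds when q is the posterior computed in the E-step):

$$
\log f(x \mid \Theta)
  = \log \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i)
  = \log \sum_{i=1}^{k} q_i \,\frac{\alpha_i f(x \mid \theta_i)}{q_i}
  \;\ge\; \sum_{i=1}^{k} q_i \log \frac{\alpha_i f(x \mid \theta_i)}{q_i},
\qquad
q_i = \frac{\alpha_i f(x \mid \theta_i)}{\sum_{l=1}^{k} \alpha_l f(x \mid \theta_l)}
\text{ at the touch point.}
$$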

Summary. The Expectation-Maximization algorithm: E-step: compute the expected complete-data likelihood; M-step: maximize that likelihood to find the parameters. It can be used with any model with hidden (latent) variables; the hidden variables can be natural to the model or artificially introduced, which makes parameter estimation simpler and more efficient. The EM algorithm can be explained from many perspectives: bound optimization, proximal point optimization, etc. Several generalizations and specializations exist. It is easy to implement, and is widely used!