Computing the maximum likelihood estimates: concentrated likelihood, EM-algorithm. Dmitry Pavlyuk


Computing the maximum likelihood estimates: concentrated likelihood, EM-algorithm. Dmitry Pavlyuk. The Mathematical Seminar, Transport and Telecommunication Institute, Riga, 13.05.2016

Presentation outline
1. Basics of MLE
2. Pseudo-likelihood
3. Finite Mixture Models
4. The Expectation-Maximization algorithm
5. Numerical Example

1. Basics of MLE

The estimation problem. Let X = (X^(1), X^(2), …, X^(d)) be a multivariate (d-variate) random variable with known multivariate p.d.f. f(x, θ) with K unknown parameters θ = (θ_1, θ_2, …, θ_K), θ ∈ Θ. The problem is to estimate the parameters θ on the basis of a sample of size n from X:

x = (x_1, x_2, …, x_n), where x_i = (x_i^(1), x_i^(2), …, x_i^(d))

Maximum likelihood estimator. The likelihood function L(θ | x) represents the probability of observing the sample x given parameters θ. In the case of independent observations in the sample:

L(θ | x) = ∏_{i=1}^n f(x_i, θ)

The maximum likelihood estimator is (R. Fisher, 1912+):

θ̂_mle = argmax_{θ ∈ Θ} L(θ | x),

if a maximum exists.

Maximum likelihood estimator. For computational purposes the log-likelihood function is introduced:

l(θ | x) = ln L(θ | x) = ln ∏_{i=1}^n f(x_i, θ) = Σ_{i=1}^n ln f(x_i, θ)

Good limiting statistical properties of θ̂_mle:
- Consistency
- Asymptotic efficiency
- Asymptotic normality

Maximum likelihood estimator. First-order conditions (FOC):

l(θ | x) = Σ_{i=1}^n ln f(x_i, θ) → max_θ

∂l(θ | x) / ∂θ_k = 0 for all k = 1, …, K

Not all log-likelihood functions have analytical derivatives!

MLE example: multivariate normal. For example, for the multivariate normal variable X ~ MVN(μ, Σ), θ_mv = (μ, Σ):

f_mv(x, θ_mv) = φ(x, μ, Σ) = 1 / √((2π)^d det Σ) · exp(−½ (x − μ)^T Σ^{-1} (x − μ)) = (2π)^{−d/2} (det Σ)^{−1/2} exp(−½ (x − μ)^T Σ^{-1} (x − μ))

MLE example: multivariate normal. The log-likelihood function:

l_mv(μ, Σ | x) = Σ_{i=1}^n ln φ(x_i, μ, Σ) = −(nd/2) ln 2π − (n/2) ln det Σ − ½ Σ_{i=1}^n (x_i − μ)^T Σ^{-1} (x_i − μ)

FOC:

∂l_mv(μ, Σ | x) / ∂μ = 0
∂l_mv(μ, Σ | x) / ∂Σ = 0

MLE example: multivariate normal. Matrix calculus (for symmetric A):

∂(b^T A b) / ∂b = 2 b^T A
∂ ln det A / ∂A = A^{-1}

MLE example: multivariate normal.

∂l_mv(μ, Σ | x) / ∂μ = ∂/∂μ [−(nd/2) ln 2π − (n/2) ln det Σ − ½ Σ_{i=1}^n (x_i − μ)^T Σ^{-1} (x_i − μ)] = Σ_{i=1}^n (x_i − μ)^T Σ^{-1}

Setting this to zero we obtain the pleasant result:

μ̂ = (1/n) Σ_{i=1}^n x_i = x̄

MLE example: multivariate normal.

∂l_mv(μ, Σ | x) / ∂Σ = −(n/2) Σ^{-1} + ½ Σ_{i=1}^n Σ^{-1} (x_i − μ)(x_i − μ)^T Σ^{-1}

Setting this to zero we obtain the result:

Σ̂ = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T
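These closed-form estimators are easy to check numerically. A minimal sketch in Python/NumPy (the seminar's own script is in R; this translation and the parameter values in it are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Known parameters of a bivariate normal (d = 2), chosen for illustration
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
x = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

# MLE from the formulas above: sample mean and the 1/n scatter matrix
mu_hat = x.mean(axis=0)
Sigma_hat = (x - mu_hat).T @ (x - mu_hat) / len(x)

print(mu_hat)     # close to mu_true
print(Sigma_hat)  # close to Sigma_true
```

With n = 5000 both estimates land within sampling error of the true values, as the consistency property promises.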

2. Pseudo-likelihood

Pseudo-likelihood. There are a number of suggestions for modifying the likelihood function to extract the evidence in the sample concerning a parameter of interest θ_A when θ = (θ_A, θ_B). The sample vector x is also transformed into 2 parts: x → s = (s_A, s_B). Such modifications are generally known as pseudo-likelihood functions:
- Conditional likelihood
- Marginal likelihood
- Concentrated (profile) likelihood

Marginal likelihood. Marginal likelihood function:

f(X, θ) = f_s(s_A, s_B, θ_A, θ_B) = f_marginal,A(s_A | θ_A) · f_marginal,B(s_B | s_A, θ_A, θ_B)

Maximum likelihood estimates for θ_A are obtained by maximizing the marginal density f_marginal,A(s_A | θ_A).

Problems:
- Ignores some of the data
- Requires analytical forms of the functions

Conditional likelihood. Conditional likelihood function:

f(X, θ) = f_s(s_A, s_B, θ_A, θ_B) = f_conditional,A(s_A | s_B, θ_A) · f_conditional,B(s_B | θ_A, θ_B)

Maximum likelihood estimates for θ_A are obtained by maximizing the conditional density f_conditional,A(s_A | s_B, θ_A).

Problems:
- Ignores some of the data variability
- Requires analytical forms of the functions

Concentrated likelihood. Concentrated likelihood function:

f(X, θ) = f(X, θ_A, θ_B) → f_concentrated(X, θ_A) = f(X, θ_A, θ̂_B(θ_A))

Maximum likelihood estimates for θ_A are obtained by maximizing the concentrated likelihood f_concentrated.

Problems:
- Can be severely biased
- Requires θ̂_B(θ_A)

Concentrated likelihood.

l(θ_A, θ_B | x) → max over (θ_A, θ_B)

Taking ∂l(θ_A, θ_B | x) / ∂θ_B analytically and solving

∂l(θ_A, θ_B | x) / ∂θ_B = 0

we obtain θ̂_B(θ_A) and move to the concentrated (profile) likelihood.

Concentrated likelihood. For the multivariate normal example:

Σ̂(μ) = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T

The concentrated likelihood:

l_mv,concentrated(μ, Σ̂(μ) | x) = −(nd/2) ln 2π − (n/2) ln det[(1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T] − ½ Σ_{i=1}^n (x_i − μ)^T [(1/n) Σ_{k=1}^n (x_k − μ)(x_k − μ)^T]^{-1} (x_i − μ)

Concentrated likelihood. The quadratic term collapses: by the trace trick it equals tr(Σ̂(μ)^{-1} · n Σ̂(μ)) = nd, so

l_mv,concentrated(μ, Σ̂(μ) | x) = −(n/2) [d ln 2π + ln det((1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T) + d]

μ̂ = argmin_μ ln det Σ_{i=1}^n (x_i − μ)(x_i − μ)^T

This result is quite famous in econometrics!
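The argmin characterisation can be checked numerically: minimising ln det of the scatter matrix over μ recovers the sample mean. A small illustrative sketch using scipy.optimize (all data and names here are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=500)

def profile_objective(mu):
    # ln det of the scatter matrix sum_i (x_i - mu)(x_i - mu)^T
    d = x - mu
    return np.linalg.slogdet(d.T @ d)[1]

res = minimize(profile_objective, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-12})
print(res.x)           # minimiser of the concentrated objective
print(x.mean(axis=0))  # coincides with the sample mean
```

The two printed vectors agree to optimizer tolerance, which is exactly the concentrated-likelihood result above.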

3. Finite Mixture Models

Gaussian mixture model. Suppose we have a mixture of M multivariate random variables (for example, normal):

X_m ~ MVN(μ_m, Σ_m), m = 1, …, M,

with probability π_m for every class.

θ_gmm = (μ_1, …, μ_M, Σ_1, Σ_2, …, Σ_M, π_1, …, π_M)

McLachlan G., Peel D. (2000) Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, New York.

Gaussian mixture model. [Plots of example mixture densities for d = 1 and d = 2.]

Gaussian mixture model. Applications:

Medical applications:
- Schlattmann P. (2009) Medical Applications of Finite Mixture Models, Statistics for Biology and Health, Springer.

Financial applications:
- Brigo, D.; Mercurio, F. (2002). Lognormal-mixture dynamics and calibration to market volatility smiles.
- Alexander, C. (2004). "Normal mixture diffusion with uncertain volatility: Modelling short- and long-term smile effects".

Image, speech, text recognition:
- Stylianou, Y. et al. (2005). GMM-Based Multimodal Biometric Verification.
- Reynolds, D., Rose, R. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models.
- Permuter, H.; Francos, J.; Jermyn, I.H. (2003). Gaussian mixture models of texture and colour for image database retrieval.

Gaussian mixture model. Following the law of total probability, the likelihood function is:

L_gmm(θ_gmm | x) = ∏_{i=1}^n Σ_{m=1}^M π_m φ(x_i, μ_m, Σ_m)

l_gmm(θ_gmm | x) = ln L_gmm(θ_gmm | x) = Σ_{i=1}^n ln Σ_{m=1}^M π_m φ(x_i, μ_m, Σ_m) = Σ_{i=1}^n ln[π_1 φ(x_i, μ_1, Σ_1) + … + π_M φ(x_i, μ_M, Σ_M)]

The logarithm of a sum prevents analytical derivatives!
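Although the log-of-sum blocks closed-form derivatives, the mixture log-likelihood itself is easy to evaluate numerically. A minimal illustrative sketch (the parameter values are the two-class example DGP used later in the talk; the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(x, pis, mus, Sigmas):
    # l = sum_i ln sum_m pi_m * phi(x_i; mu_m, Sigma_m)
    dens = sum(p * multivariate_normal(m, S).pdf(x)
               for p, m, S in zip(pis, mus, Sigmas))
    return np.log(dens).sum()

pis = [0.8, 0.2]
mus = [np.array([1.0, 1.0]), np.array([3.0, 4.0])]
Sigmas = [np.array([[2.0, 0.0], [0.0, 2.0]]),
          np.array([[2.0, 0.7], [0.7, 1.0]])]

x = np.array([[1.0, 1.0], [3.0, 4.0]])
print(gmm_loglik(x, pis, mus, Sigmas))
```

In practice the inner sum is usually computed via log-sum-exp for numerical stability; the plain form above mirrors the formula on the slide.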

4. The Expectation-Maximization algorithm

EM-algorithm. The expectation-maximization (EM) algorithm is a general method for finding maximum likelihood estimates when there are missing values or latent variables. In the mixture model context, the missing data is represented by a set of observations of a discrete random variable Z that indicates which mixture component generated observation i:

z_im = 1 if observation i belongs to class m, 0 otherwise.

EM-algorithm: GMM.

z_im = 1 if observation i belongs to class m, 0 otherwise.

If Z = {z_im} is given, the log-likelihood

l_gmm(θ_gmm | x) = Σ_{i=1}^n ln[π_1 φ(x_i, μ_1, Σ_1) + … + π_M φ(x_i, μ_M, Σ_M)]

is transformed into the complete-data log-likelihood:

l_gmm,complete(θ_gmm | x, Z) = Σ_{i=1}^n Σ_{m=1}^M z_im ln[π_m φ(x_i, μ_m, Σ_m)] = Σ_{i=1}^n Σ_{m=1}^M z_im [ln π_m + ln φ(x_i, μ_m, Σ_m)]

EM-algorithm. The EM iteration includes:
- an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and
- a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.

These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

EM-algorithm: GMM. Assume initial values θ_gmm^(0) and move to maximization of the expectation of the log-likelihood function:

E_Z[l_gmm,complete(θ_gmm | x, Z)] = Σ_{i=1}^n Σ_{m=1}^M E_Z[z_im | x_i, θ_gmm^(0)] (ln π_m + ln φ(x_i, μ_m, Σ_m))

EM-algorithm: E-step.

E_Z[z_im | x_i, θ_gmm^(0)] = τ_m(x_i, θ_gmm^(0)) = 0 · P(z_im = 0 | x_i, θ_gmm^(0)) + 1 · P(z_im = 1 | x_i, θ_gmm^(0)) = P(z_im = 1 | x_i, θ_gmm^(0)) = f(x_i | θ_gmm^(0), z_im = 1) P(z_im = 1 | θ_gmm^(0)) / f(x_i | θ_gmm^(0)) = π_m^(0) φ(x_i, μ_m^(0), Σ_m^(0)) / Σ_{m'=1}^M π_{m'}^(0) φ(x_i, μ_{m'}^(0), Σ_{m'}^(0))

These values are called class responsibilities.
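In code the responsibilities reduce to one vectorised normalisation per observation. A minimal illustrative sketch (the function name and test points are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, pis, mus, Sigmas):
    """Responsibilities tau_m(x_i): rows are observations, columns are classes."""
    # Unnormalised weights pi_m * phi(x_i; mu_m, Sigma_m)
    w = np.column_stack([p * multivariate_normal(m, S).pdf(x)
                         for p, m, S in zip(pis, mus, Sigmas)])
    return w / w.sum(axis=1, keepdims=True)

x = np.array([[1.0, 1.0], [3.0, 4.0], [2.0, 2.0]])
tau = e_step(x, [0.8, 0.2],
             [np.array([1.0, 1.0]), np.array([3.0, 4.0])],
             [np.eye(2), np.eye(2)])
print(tau)  # each row sums to one
```

A point sitting on a class mean gets nearly all of that class's responsibility, as expected.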

EM-algorithm: M-step.

E_Z[l_gmm,complete(θ_gmm | x, Z)] = Σ_{i=1}^n Σ_{m=1}^M τ_m(x_i, θ_gmm^(0)) [ln π_m + ln φ(x_i, μ_m, Σ_m)]

FOC:

∂E_Z[l_gmm,complete(θ_gmm | x, Z)] / ∂μ_m = 0,
∂E_Z[l_gmm,complete(θ_gmm | x, Z)] / ∂Σ_m = 0,
∂E_Z[l_gmm,complete(θ_gmm | x, Z)] / ∂π_m = 0

EM-algorithm: M-step.

∂E_Z[l_complete(θ_gmm | x, Z)] / ∂μ_m = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) ∂ln φ(x_i, μ_m, Σ_m) / ∂μ_m = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) (x_i − μ_m)^T Σ_m^{-1} = 0

μ̂_m^(1) = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) x_i / Σ_{i=1}^n τ_m(x_i, θ_gmm^(0))

EM-algorithm: M-step.

∂E_Z[l_complete(θ_gmm | x, Z)] / ∂Σ_m = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) [−½ Σ_m^{-1} + ½ Σ_m^{-1} (x_i − μ_m)(x_i − μ_m)^T Σ_m^{-1}] = 0

Σ̂_m^(1) = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) (x_i − μ̂_m^(1))(x_i − μ̂_m^(1))^T / Σ_{i=1}^n τ_m(x_i, θ_gmm^(0))

EM-algorithm: M-step. For π_m the constraint Σ_{m=1}^M π_m = 1 is handled with a Lagrange multiplier λ:

∂/∂π_m [E_Z[l_complete(θ_gmm | x, Z)] + λ (Σ_{m=1}^M π_m − 1)] = (1/π_m) Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) + λ = 0

π̂_m^(1) = (1/n) Σ_{i=1}^n τ_m(x_i, θ_gmm^(0))
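Collected together, the three M-step updates are just responsibility-weighted averages. A minimal illustrative sketch, assuming the responsibilities are given as an n × M array (function name and data are ours):

```python
import numpy as np

def m_step(x, tau):
    """Closed-form M-step updates for a Gaussian mixture from responsibilities tau."""
    n, M = tau.shape
    nk = tau.sum(axis=0)               # effective number of points per class
    pis = nk / n                       # pi_m^(1)
    mus = [tau[:, m] @ x / nk[m] for m in range(M)]   # mu_m^(1)
    Sigmas = []
    for m in range(M):
        d = x - mus[m]
        Sigmas.append((tau[:, m] * d.T) @ d / nk[m])  # Sigma_m^(1)
    return pis, mus, Sigmas

# Sanity check: with hard 0/1 responsibilities the updates reduce
# to the per-class single-normal MLE formulas.
x = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pis, mus, Sigmas = m_step(x, tau)
print(pis, mus[0])
```

With hard responsibilities, class 1 gets π̂ = 2/3 and its mean is the average of the first two points.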

EM-algorithm.
1. Initialisation: choose initial values θ^(0), calculate the likelihood l(θ^(0) | x) and set s = 0.
2. E-step: compute the expectations of the latent variables, E_Z[z_im | x_i, θ^(s)].
3. M-step: compute the new estimates θ^(s+1).
4. Convergence check: compute the new likelihood, and if l(θ^(s+1) | x) − l(θ^(s) | x) > precision, then return to step 2.
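Steps 1-4 fit in a short, self-contained routine. A Python sketch (the seminar's actual script is in R; this version and its names are illustrative, and θ^(0) is passed in as arguments):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, pis, mus, Sigmas, precision=1e-6, max_iter=200):
    """EM for a Gaussian mixture: E-step, M-step, convergence check."""
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities tau (n x M)
        w = np.column_stack([p * multivariate_normal(m, S).pdf(x)
                             for p, m, S in zip(pis, mus, Sigmas)])
        ll = np.log(w.sum(axis=1)).sum()      # current log-likelihood
        if ll - ll_old <= precision:          # convergence check (step 4)
            break
        ll_old = ll
        tau = w / w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above
        nk = tau.sum(axis=0)
        pis = nk / len(x)
        mus = [tau[:, m] @ x / nk[m] for m in range(len(nk))]
        Sigmas = [((tau[:, m] * (x - mus[m]).T) @ (x - mus[m])) / nk[m]
                  for m in range(len(nk))]
    return pis, mus, Sigmas, ll
```

By construction each pass through the loop does not decrease the log-likelihood, so the stopping rule on l(θ^(s+1) | x) − l(θ^(s) | x) is well defined.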

EM-algorithm. Dempster, Laird, and Rubin (1977) show that the likelihood function l(θ^(s) | x) is not decreased after an EM iteration; that is, for s = 0, 1, 2, …:

l(θ^(s+1) | x) ≥ l(θ^(s) | x)

See the proof in: McLachlan G.J., Krishnan T. (1997) The EM Algorithm and Extensions, Wiley. 304 p.

5. Numerical Example

Numerical example. DGP: d = 2

Class | π   | μ      | Σ
1     | 0.8 | (1, 1) | [[2, 0], [0, 2]]
2     | 0.2 | (3, 4) | [[2, 0.7], [0.7, 1]]

Implemented with R; the script is available on the seminar web page.
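Drawing a sample from this DGP takes only a few lines. A sketch in Python rather than the R script from the seminar page (the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2016)
n = 1000

# Class labels: class 1 with probability 0.8, class 2 with probability 0.2
z = rng.random(n) < 0.8
comp1 = rng.multivariate_normal([1, 1], [[2, 0], [0, 2]], n)
comp2 = rng.multivariate_normal([3, 4], [[2, 0.7], [0.7, 1]], n)
x = np.where(z[:, None], comp1, comp2)
print(x.shape, z.mean())  # sample of 1000 bivariate points, class-1 share near 0.8
```

The resulting sample is what the EM iterations below are fitted to.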

Numerical example. Sample size n = 1000.

Iteration | Log-likelihood
0  | 0.7573044
1  | 12.96707
2  | 26.22704
3  | 29.21921
20 | 30.37445

[Plots of the fitted mixture at each iteration.]

Numerical example. Real values vs. estimates after 1000 observations:

Class | π   | μ      | Σ                    | π̂_em | μ̂_em           | Σ̂_em
1     | 0.8 | (1, 1) | [[2, 0], [0, 2]]     | 0.873 | (1.033, 1.125) | [[1.994, 0.113], [0.113, 2.406]]
2     | 0.2 | (3, 4) | [[2, 0.7], [0.7, 1]] | 0.127 | (3.592, 4.376) | [[1.203, 0.522], [0.522, 0.880]]

Problems with EM:
- Local maxima: partially solved with careful (repetitive) initial values selection
- Slow convergence (in some cases)
- Meta-algorithm: should be adapted for every specific problem
- Singularities and over-fitting

After EM. Next step: Variational Bayes:
- treat all parameters θ as missing variables
- iterate over the components of the missing variables (including θ) and recalculate their expectations

Recommended literature:
- McLachlan G., Krishnan T. (2008) The EM Algorithm and Extensions, Wiley Series in Probability and Statistics, 2nd Edition, 400 p.
- McLachlan G., Peel D. (2000) Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, New York.
- Gelman A., Carlin J., Stern H., Dunson D., Vehtari A., Rubin D. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science). http://www.stat.columbia.edu/~gelman/book/

Thank you for your attention! Questions are very appreciated. Contacts: email: Dmitry.Pavlyuk@tsi.lv, phone: +37129958338