A Derivation of the EM Updates for Finding the Maximum Likelihood Parameter Estimates of the Student's t Distribution


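Carl Scheffler
First draft: September 2008

Contents
1 The Student's t Distribution
2 The General EM Algorithm
3 Derivation of the EM Update Equations for the Univariate Student's t Distribution
4 Multivariate Case
5 Demonstration
A Distributions and Relations

1 The Student's t Distribution

Univariate

$$\mathrm{Student}(x \mid \mu, \lambda, \nu) = \int_0^\infty \mathrm{Normal}\left(x \mid \mu, (\lambda\eta)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda (x-\mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$

Multivariate

$$\mathrm{Student}(x \mid \mu, \Lambda, \nu) = \int_0^\infty \mathrm{Normal}\left(x \mid \mu, (\eta\Lambda)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \frac{\Gamma\left(\frac{\nu+D}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}} \left(1 + \frac{(x-\mu)^T \Lambda (x-\mu)}{\nu}\right)^{-\frac{\nu+D}{2}}$$

where the vectors $x$ and $\mu$ are D-dimensional (they are all column vectors) and $\Lambda$ is $D \times D$.

2 The General EM Algorithm

We want to find the maximum likelihood estimate for a set of parameters $\Theta$ given a set of observed data $X$ by maximizing $P(X \mid \Theta)$. We assume that it is hard to solve this problem directly but that it is relatively easy to evaluate $P(X, H \mid \Theta)$,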

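where $H$ is a set of latent variables such that

$$P(X \mid \Theta) = \int P(X, H \mid \Theta)\, dH.$$

The EM method then involves the following steps.

1. Write down the complete data log likelihood, $\log P(X, H \mid \Theta)$.
2. Write down the posterior latent distribution, $P(H \mid X, \Theta)$.
3. E step: write down the expectations under the distribution $P(H \mid X, \Theta_0)$ for all terms in the complete data log likelihood (step 1).
4. Write down the function to maximize, $Q(\Theta, \Theta_0) = \int P(H \mid X, \Theta_0) \log P(X, H \mid \Theta)\, dH$, replacing integrals with the expectations from the E step.
5. M step: solve $\partial Q / \partial \Theta = 0$ to yield the update equations.

Once all of the expectation update equations (from step 3) and maximization update equations (from step 5) are known, we initialize our current estimate of the parameters $\Theta$ and update them by iterating the E and M updates until convergence. Note that a subscript 0 is used here and in the rest of the article to denote the old setting of the parameters. With each iteration the new parameter setting (found in step 5) will replace the old one.

3 Derivation of the EM Update Equations for the Univariate Student's t Distribution

To cast the Student's t distribution in the EM framework we write the likelihood for a single data point,

$$P(x_i \mid \Theta) = \mathrm{Student}(x_i \mid \mu, \lambda, \nu).$$

By viewing this as an infinite mixture of Normal distributions,

$$P(x_i \mid \Theta) = \int_0^\infty \mathrm{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \mathrm{Gamma}(\eta_i \mid \nu/2, \nu/2)\, d\eta_i,$$

we identify

$$X = \{x_i\}, \quad H = \{\eta_i\}, \quad \Theta = \{\mu, \lambda, \nu\}$$

so that the complete likelihood function is

$$P(X, H \mid \Theta) = \prod_{i=1}^{N} \mathrm{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \mathrm{Gamma}(\eta_i \mid \nu/2, \nu/2).$$

Step 1: Complete log likelihood function

$$\log P(X, H \mid \Theta) = \sum_{i=1}^{N} \left[ \log \mathrm{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) + \log \mathrm{Gamma}(\eta_i \mid \nu/2, \nu/2) \right]$$

$$= \sum_{i=1}^{N} \left[ -\frac{1}{2}\log 2\pi + \frac{1}{2}\log\lambda + \frac{1}{2}\log\eta_i - \frac{\lambda\eta_i}{2}(x_i-\mu)^2 - \log\Gamma\left(\frac{\nu}{2}\right) + \frac{\nu}{2}\log\frac{\nu}{2} + \left(\frac{\nu}{2}-1\right)\log\eta_i - \frac{\nu}{2}\eta_i \right] \tag{1}$$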
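As a sanity check on the expansion in equation (1), the following minimal sketch evaluates the complete data log likelihood in two ways: directly from the log densities of the Normal and Gamma distributions (Appendix A), and term by term as written in (1). The function names and the sample values are illustrative assumptions, not part of the original derivation or its accompanying code.

```python
import math

def log_normal_pdf(x, mu, precision):
    """Log density of Normal(x | mu, precision^-1)."""
    return 0.5 * math.log(precision / (2.0 * math.pi)) - 0.5 * precision * (x - mu) ** 2

def log_gamma_pdf(eta, a, b):
    """Log density of Gamma(eta | a, b) with shape a and rate b."""
    return a * math.log(b) - math.lgamma(a) + (a - 1.0) * math.log(eta) - b * eta

def complete_log_likelihood(xs, etas, mu, lam, nu):
    """Equation (1), written out term by term."""
    total = 0.0
    for x, eta in zip(xs, etas):
        total += (
            -0.5 * math.log(2.0 * math.pi)
            + 0.5 * math.log(lam)
            + 0.5 * math.log(eta)
            - 0.5 * lam * eta * (x - mu) ** 2
            - math.lgamma(0.5 * nu)
            + 0.5 * nu * math.log(0.5 * nu)
            + (0.5 * nu - 1.0) * math.log(eta)
            - 0.5 * nu * eta
        )
    return total

# Arbitrary illustrative values (assumptions, not data from the article).
xs = [0.3, -1.2, 2.5]
etas = [0.8, 1.5, 0.2]
mu, lam, nu = 0.1, 2.0, 4.0

# Direct evaluation: sum of log Normal plus log Gamma for every data point.
direct = sum(
    log_normal_pdf(x, mu, lam * eta) + log_gamma_pdf(eta, 0.5 * nu, 0.5 * nu)
    for x, eta in zip(xs, etas)
)
expanded = complete_log_likelihood(xs, etas, mu, lam, nu)
print(direct, expanded)
```

The two printed numbers should agree up to floating point rounding, which confirms that no term was lost in the expansion.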

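Step 2: Posterior latent distribution

$$P(H \mid X, \Theta) \propto P(X, H \mid \Theta) = \prod_{i=1}^{N} \mathrm{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \mathrm{Gamma}(\eta_i \mid \nu/2, \nu/2) \propto \prod_{i=1}^{N} \mathrm{Gamma}(\eta_i \mid a_i, b_i) \tag{2}$$

where the last step follows from the fact that the Gamma distribution is the conjugate prior to a Normal distribution with unknown precision. We find the parameters $a_i$ and $b_i$ by combining the factors from the Normal and Gamma distributions (see Appendix A for the algebraic expansions of the Normal and Gamma distributions).

$$P(\eta_i \mid x_i, \Theta) \propto \mathrm{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \mathrm{Gamma}(\eta_i \mid \nu/2, \nu/2)$$

$$\propto \eta_i^{\frac{\nu+1}{2}-1} \exp\left(-\eta_i \left[\frac{\nu}{2} + \frac{\lambda}{2}(x_i-\mu)^2\right]\right)$$

$$\propto \mathrm{Gamma}\left(\eta_i \,\Big|\, \frac{\nu+1}{2},\ \frac{\nu}{2} + \frac{\lambda}{2}(x_i-\mu)^2\right)$$

(All factors independent of $\eta_i$ are taken up in the proportionality.) Hence

$$a_i = \frac{\nu+1}{2}, \qquad b_i = \frac{\nu}{2} + \frac{\lambda}{2}(x_i-\mu)^2.$$

Step 3: Expectation

By looking at (1) we find that we need to calculate the expectations of $\eta_i$ and $\log\eta_i$ under the posterior latent distribution (2).

$$E[\eta_i] = \int \eta_i \prod_j \mathrm{Gamma}(\eta_j \mid a_j, b_j)\, dH = \int \eta_i\, \mathrm{Gamma}(\eta_i \mid a_i, b_i)\, d\eta_i = a_i / b_i = \frac{\nu_0 + 1}{\nu_0 + \lambda_0 (x_i-\mu_0)^2}$$

$$E[\log\eta_i] = \int \log\eta_i \prod_j \mathrm{Gamma}(\eta_j \mid a_j, b_j)\, dH = \int \log\eta_i\, \mathrm{Gamma}(\eta_i \mid a_i, b_i)\, d\eta_i = \psi(a_i) - \log b_i = \psi\left(\frac{\nu_0+1}{2}\right) - \log\left(\frac{\nu_0 + \lambda_0 (x_i-\mu_0)^2}{2}\right)$$

See Appendix A for the definition of the digamma function, $\psi(\cdot)$.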
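The E step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `digamma` and `e_step` are hypothetical, and the digamma function is approximated here with the standard recurrence $\psi(x) = \psi(x+1) - 1/x$ followed by an asymptotic series, so that the sketch needs only the standard library.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0.

    Shifts x upward with psi(x) = psi(x + 1) - 1/x, then applies the
    asymptotic series log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result

def e_step(xs, mu0, lam0, nu0):
    """Return the lists E[eta_i] and E[log eta_i] under the old parameters."""
    a = 0.5 * (nu0 + 1.0)                          # a_i is the same for every i
    e_eta, e_log_eta = [], []
    for x in xs:
        b = 0.5 * (nu0 + lam0 * (x - mu0) ** 2)    # b_i grows with the squared residual
        e_eta.append(a / b)
        e_log_eta.append(digamma(a) - math.log(b))
    return e_eta, e_log_eta

# Illustrative values (assumptions): one point at the mean and one far away.
e_eta, e_log_eta = e_step([0.0, 10.0], mu0=0.0, lam0=1.0, nu0=3.0)
print(e_eta, e_log_eta)
```

Because $b_i$ grows with the squared residual, a point far from $\mu_0$ receives a much smaller $E[\eta_i]$ than a point near it. These expectations act as weights in the M step, which is where the robustness to outliers comes from.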

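Step 4: Function to optimize

$$Q(\Theta, \Theta_0) = \int P(H \mid X, \Theta_0) \log P(X, H \mid \Theta)\, dH$$

$$= \sum_{i=1}^{N} \left[ -\frac{1}{2}\log 2\pi + \frac{1}{2}\log\lambda + \frac{1}{2}E[\log\eta_i] - \frac{\lambda}{2}(x_i-\mu)^2 E[\eta_i] - \log\Gamma\left(\frac{\nu}{2}\right) + \frac{\nu}{2}\log\frac{\nu}{2} + \left(\frac{\nu}{2}-1\right)E[\log\eta_i] - \frac{\nu}{2}E[\eta_i] \right]$$

Note that all the elements of $\Theta_0$ are now implicit in the expectations.

Step 5: Maximization

$$\frac{\partial Q}{\partial \mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \lambda (x_i-\mu) E[\eta_i] = 0 \;\Rightarrow\; \mu = \frac{\sum_{i=1}^{N} x_i E[\eta_i]}{\sum_{i=1}^{N} E[\eta_i]}$$

$$\frac{\partial Q}{\partial \lambda} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \left[\frac{1}{2\lambda} - \frac{1}{2}(x_i-\mu)^2 E[\eta_i]\right] = 0 \;\Rightarrow\; \lambda = \left(\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2 E[\eta_i]\right)^{-1}$$

Note that we require the updated $\mu$ value to find $\lambda$.

$$\frac{\partial Q}{\partial \nu} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \left[-\frac{1}{2}\psi\left(\frac{\nu}{2}\right) + \frac{1}{2}\log\frac{\nu}{2} + \frac{1}{2} + \frac{1}{2}E[\log\eta_i] - \frac{1}{2}E[\eta_i]\right] = 0$$

$$\Rightarrow\; -\psi\left(\frac{\nu}{2}\right) + \log\frac{\nu}{2} + 1 + \frac{1}{N}\sum_{i=1}^{N} \left(E[\log\eta_i] - E[\eta_i]\right) = 0$$

Note that there is no closed form solution for $\nu$; it has to be found numerically.

4 Multivariate Case

The derivation of the multivariate case requires taking vector and matrix derivatives to find the values of $\mu$ and $\Lambda$ but is otherwise very similar to the univariate case. The update equations turn out to be

$$E[\eta_i] = \frac{\nu_0 + D}{\nu_0 + (x_i-\mu_0)^T \Lambda_0 (x_i-\mu_0)}$$

$$E[\log\eta_i] = \psi\left(\frac{\nu_0+D}{2}\right) - \log\left(\frac{\nu_0 + (x_i-\mu_0)^T \Lambda_0 (x_i-\mu_0)}{2}\right)$$

$$\mu = \frac{\sum_{i=1}^{N} x_i E[\eta_i]}{\sum_{i=1}^{N} E[\eta_i]}$$

$$\Lambda = \left(\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)(x_i-\mu)^T E[\eta_i]\right)^{-1}$$

where $D$ is still the dimensionality. The update equation for $\nu$ remains unchanged.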
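Putting the univariate E and M updates of Section 3 together gives a complete iteration. The sketch below is one possible arrangement, not the code that accompanies the article. The function names, the starting value $\nu = 5$, the fixed iteration count, the bisection bracket for $\nu$ and the synthetic data are all assumptions made for illustration. The $\nu$ update is solved by bisection because $\log(\nu/2) - \psi(\nu/2)$ decreases monotonically in $\nu$.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))

def solve_nu(c, lo=1e-3, hi=1e3, iters=100):
    """Bisection for log(nu/2) - psi(nu/2) + c = 0, where
    c = 1 + mean(E[log eta_i] - E[eta_i]). The bracket is an assumption."""
    def f(nu):
        return math.log(0.5 * nu) - digamma(0.5 * nu) + c
    if f(hi) > 0.0:      # no root below the cap: data look close to Normal
        return hi
    if f(lo) < 0.0:      # root below the bracket: return the lower cap
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_student_t(xs, iters=100):
    """Iterate the E and M updates a fixed number of times (no convergence test)."""
    n = len(xs)
    mu = sum(xs) / n                                # start from the Normal ML fit
    lam = n / sum((x - mu) ** 2 for x in xs)        # assumes the data are not all equal
    nu = 5.0                                        # arbitrary starting value
    for _ in range(iters):
        # E step: expectations under the old parameters.
        a = 0.5 * (nu + 1.0)
        bs = [0.5 * (nu + lam * (x - mu) ** 2) for x in xs]
        e_eta = [a / b for b in bs]
        e_log = [digamma(a) - math.log(b) for b in bs]
        # M step: mu first, then lambda (which needs the new mu), then nu.
        mu = sum(x * w for x, w in zip(xs, e_eta)) / sum(e_eta)
        lam = n / sum(w * (x - mu) ** 2 for x, w in zip(xs, e_eta))
        c = 1.0 + sum(l - w for l, w in zip(e_log, e_eta)) / n
        nu = solve_nu(c)
    return mu, lam, nu

# Synthetic data: Normal(0, 1) samples plus three hand-placed outliers.
random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(200)]
data = clean + [20.0, 25.0, 30.0]

clean_mean = sum(clean) / len(clean)
normal_mean = sum(data) / len(data)      # ML Normal mean, pulled by the outliers
mu_t, lam_t, nu_t = fit_student_t(data)
print(clean_mean, normal_mean, mu_t, lam_t, nu_t)
```

In this setup the Student's t location estimate should stay close to the mean of the clean samples, while the Normal mean is dragged toward the outliers. A production implementation would normally stop on a convergence criterion, such as the change in log likelihood, rather than after a fixed number of iterations.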

5 Demonstration

The update equations for the univariate case were implemented in Python. (If you have a look at the code, you'll probably notice that the update equations are a bit different from those in the text. This is simply due to optimization; they are mathematically equivalent.) Figure 1 shows the results. Notice how the ML fit to the Normal distribution has to increase its variance (flatten out) more and more to accommodate the increasing number of outliers. The Student's t distribution is relatively robust to this.

[Figure 1, two panels: "1 outlier" and "3 outliers". Each panel shows three curves: ML Normal distribution (ignoring outliers), ML Normal distribution (with outliers), ML Student's t distribution.]

Figure 1: Plots of data points drawn from a Normal distribution along with a small number of outliers. The ML fit of a Normal (with and without considering the outliers) and a Student's t to the data are also shown.

A Distributions and Relations

$$\mathrm{Normal}(x \mid \mu, \sigma^2) = \left(2\pi\sigma^2\right)^{-1/2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$\mathrm{Gamma}(x \mid a, b) = \frac{1}{\Gamma(a)}\, b^a x^{a-1} e^{-bx}$$

$$E_{\mathrm{Gamma}}[x] = a/b$$

$$E_{\mathrm{Gamma}}[\log x] = \psi(a) - \log b$$

$$\psi(x) = \frac{\Gamma'(x)}{\Gamma(x)}$$

References

[1] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, 2006.
[2] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, New York, NY, USA.
