A canonical application of the EM algorithm is its use in fitting a mixture model where we assume we observe an IID sample of $(X_i)_{1 \le i \le n}$ from


Florence Baldwin
5 years ago

1 The EM algorithm

In this set of notes, we discuss the EM (Expectation-Maximization) algorithm, a common algorithm used in statistical estimation to try to find the MLE. It is often used for models that are not exponential families themselves but are derived from exponential families. A common mechanism by which such likelihoods arise is missing data, i.e. we only observe some of the sufficient statistics of the family.

1.1 Mixture model

A canonical application of the EM algorithm is its use in fitting a mixture model, where we assume we observe an IID sample $(X_i)_{1 \le i \le n}$ from

$$
Y \sim \text{Multinomial}(1, \pi), \quad \pi \in \mathbb{R}^L, \qquad X \mid Y = l \sim P_{\eta_l}
$$

with the simplest example of $P_\eta$ being the univariate normal model

$$
P_{\eta_l} = N(\mu_l, \sigma^2_l),
$$

keeping in mind that the parameters on the right are the mean space parameters, not the natural parameters.

Exercise

1. Show that the joint distribution of $(X, Y)$ is an exponential family. What is its reference measure, and what are its sufficient statistics? Write out the log-likelihood based on observing an IID sample $(X_i, Y_i)_{1 \le i \le n}$ for this model. Call this $\ell_c(\eta; X, Y)$, the complete log-likelihood.
2. What is the marginal density of $X$?
3. Write out the log-likelihood $\ell(\eta; X)$ based on observing an IID sample $(X_i)_{1 \le i \le n}$ from this model. What are its parameters?

In the mixture model, we only observe $X$, though the marginal distribution of $X$ is the same as if we had generated pairs $(X, Y)$ and marginalized over $Y$. In this problem, $Y$ is missing data, which we might call $M$, and $X$ is observed data, which we might call $O$. Formally, then, we partition our sufficient statistic into two sets: those observed and those missing.

1.2 The EM algorithm

The EM algorithm usually has two steps, both of which are based on the following function:

$$
Q(\eta; \tilde\eta) = E_{\tilde\eta}\left( \ell_c(\eta; O, M) \mid O \right).
$$

The basis of the EM algorithm is the following result:

$$
Q(\eta; \tilde\eta) \ge Q(\tilde\eta; \tilde\eta) \implies \ell(\eta; O) \ge \ell(\tilde\eta; O).
$$
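The result just stated can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it uses a two-component univariate normal mixture, a hypothetical parameter layout `(pi1, mu1, sigma1, mu2, sigma2)`, simulated data and arbitrary parameter ranges, none of which come from the notes. It verifies the slightly stronger statement that $\ell(\eta; O) - \ell(\tilde\eta; O)$ is at least $Q(\eta; \tilde\eta) - Q(\tilde\eta; \tilde\eta)$, which implies the result.

```python
import numpy as np

def phi(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def component_terms(x, eta):
    """Array of shape (2, n) with entries pi_l * phi(x_i; mu_l, sigma_l).

    eta = (pi1, mu1, sigma1, mu2, sigma2) is a hypothetical parameter layout.
    """
    pi1, mu1, sigma1, mu2, sigma2 = eta
    return np.array([pi1 * phi(x, mu1, sigma1), (1 - pi1) * phi(x, mu2, sigma2)])

def loglik(x, eta):
    """Observed-data log-likelihood l(eta; O)."""
    return np.log(component_terms(x, eta).sum(axis=0)).sum()

def Q(x, eta, eta_tilde):
    """Q(eta; eta_tilde): conditional expectation of the complete log-likelihood."""
    joint_tilde = component_terms(x, eta_tilde)
    posterior = joint_tilde / joint_tilde.sum(axis=0)  # P_{eta_tilde}(Y = l | X_i)
    return (posterior * np.log(component_terms(x, eta))).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 1.0, 50), rng.normal(-1.0, 0.8, 150)])
eta_tilde = (0.5, 0.0, 1.0, 1.0, 4.0)

# For random candidate parameters eta, record
# [l(eta) - l(eta_tilde)] - [Q(eta; eta_tilde) - Q(eta_tilde; eta_tilde)].
gaps = []
for _ in range(200):
    eta = (rng.uniform(0.05, 0.95),
           rng.uniform(-3.0, 4.0), rng.uniform(0.5, 3.0),
           rng.uniform(-3.0, 4.0), rng.uniform(0.5, 3.0))
    dl = loglik(x, eta) - loglik(x, eta_tilde)
    dQ = Q(x, eta, eta_tilde) - Q(x, eta_tilde, eta_tilde)
    gaps.append(dl - dQ)

min_gap = min(gaps)  # should be non-negative up to rounding error
```

Whenever `dQ` is non-negative, a non-negative gap forces `dl` to be non-negative as well, which is the implication in the displayed result.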

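Therefore, any sequence $(\eta^{(k)})_{k \ge 1}$ satisfying

$$
Q(\eta^{(k+1)}; \eta^{(k)}) \ge Q(\eta^{(k)}; \eta^{(k)})
$$

has $\ell(\eta^{(k)}; O)$ non-decreasing. An algorithm that produces such a sequence is called a GEM algorithm (generalized EM algorithm).

The proof of this is fairly straightforward after some initial sleight of hand. After this sleight of hand, we see that the main ingredient in the proof is the deviance of the conditional distribution of $M \mid O$. In the general case, this deviance is not expressed in terms of natural parameters, but the argument is the same.

Here is the proof. Write the joint density of $(O, M)$ (assuming it has a density with respect to $P_0$) as

$$
\frac{dP_\eta}{dP_0} = f_{\eta,(O,M)}(o, m) = f_{\eta,O}(o) \, f_{\eta,M|O}(m \mid o)
$$

where the $f$'s are densities with respect to $P_0$. Or,

$$
f_{\eta,O}(o) = \frac{f_{\eta,(O,M)}(o, m)}{f_{\eta,M|O}(m \mid o)}.
$$

Although the RHS seems to depend on $m$, the above equality shows that it is actually measurable with respect to $o$.

We see that

$$
\ell(\eta; O) = \sum_{i=1}^n \log f_\eta(O_i) = \sum_{i=1}^n \left[ \log f_\eta(O_i, M_i) - \log f_\eta(M_i \mid O_i) \right]
$$

where we know that $f_\eta(m \mid o)$ is an exponential family for $O$ fixed. The right hand side is measurable with respect to $O$, so its conditional expectation with respect to $O$ leaves it unchanged. Therefore, for any $\tilde\eta$ we have the equality

$$
\begin{aligned}
\ell(\eta; O) &= \sum_{i=1}^n \log f_\eta(O_i) \\
&= \sum_{i=1}^n \left[ \log f_\eta(O_i, M_i) - \log f_\eta(M_i \mid O_i) \right] \\
&= \sum_{i=1}^n \left[ E_{\tilde\eta}\left( \log f_\eta(O_i, M_i) \mid O_i \right) - E_{\tilde\eta}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \right] \\
&= E_{\tilde\eta}\left( \ell_c(\eta; O, M) \mid O \right) - \sum_{i=1}^n E_{\tilde\eta}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \\
&= Q(\eta; \tilde\eta) - \sum_{i=1}^n E_{\tilde\eta}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right).
\end{aligned}
$$

Now,

$$
\ell(\eta; O) - \ell(\tilde\eta; O) = Q(\eta; \tilde\eta) - Q(\tilde\eta; \tilde\eta) + \sum_{i=1}^n \left[ E_{\tilde\eta}\left( \log f_{\tilde\eta}(M_i \mid O_i) \mid O_i \right) - E_{\tilde\eta}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \right].
$$

The term

$$
\sum_{i=1}^n \left[ E_{\tilde\eta}\left( \log f_{\tilde\eta}(M_i \mid O_i) \mid O_i \right) - E_{\tilde\eta}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \right]
$$

is essentially half the deviance of the exponential family of conditional distributions for $M \mid O$ with sufficient statistics $M$. To see this, recall our general form of the conditional density of $T_1 \mid T_2 = s_2$ for an $\mathbb{R}^p$-valued sufficient statistic partitioned as $T_1 \in \mathbb{R}^k$, $T_2 \in \mathbb{R}^{p-k}$:

$$
\begin{aligned}
f_{T_1 \mid T_2 = s_2}(t_1) &= \frac{f_{T_1,T_2}(t_1, s_2)}{\int_{\mathbb{R}^k} f_{T_1,T_2}(s_1, s_2) \, ds_1} \\
&= \frac{e^{\eta_1^T t_1 + \eta_2^T s_2} \, m_0(t_1, s_2)}{\int_{\mathbb{R}^k} e^{\eta_1^T s_1 + \eta_2^T s_2} \, m_0(s_1, s_2) \, ds_1} \\
&= \frac{e^{\eta_1^T t_1} \, m_0(t_1, s_2)}{\int_{\mathbb{R}^k} e^{\eta_1^T s_1} \, m_0(s_1, s_2) \, ds_1}.
\end{aligned}
$$

Therefore, with $C$ a function independent of $\eta$,

$$
\begin{aligned}
\log f_\eta(M_i \mid O_i) &= \eta_M^T M_i - \log\left( \int_{\mathbb{R}^k} e^{\eta_M^T s} \, m_0(s, O_i) \, ds \right) + C(M_i, O_i) \\
&= \eta_M^T M_i - \Lambda(\eta_M, O_i) + C(M_i, O_i)
\end{aligned}
$$

where $\Lambda(\eta_M, O_i)$ is the appropriate CGF for this conditional distribution.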

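We see then, that

$$
\log f_{\tilde\eta}(M_i \mid O_i) - \log f_\eta(M_i \mid O_i) = \Lambda(\eta_M, O_i) - \Lambda(\tilde\eta_M, O_i) - (\eta_M - \tilde\eta_M)^T M_i.
$$

Taking conditional expectation with respect to $O$ at $\tilde\eta$ yields

$$
E_{\tilde\eta}\left( \log f_{\tilde\eta}(M_i \mid O_i) - \log f_\eta(M_i \mid O_i) \mid O \right) = \frac{1}{2} D(\tilde\eta; \eta \mid O) \ge 0.
$$

1.3 The two basic steps

The algorithm is often described as having two steps: the E step and the M step. Formally, the E step can be described as evaluating $Q(\eta; \tilde\eta)$ with $\tilde\eta$ fixed. That is, fix $\tilde\eta$ and compute

$$
q_{\tilde\eta}(\eta) = E_{\tilde\eta}\left( \ell_c(\eta; O, M) \mid O \right)
$$

as a function of $\eta$. The M step is the maximization step and amounts to finding

$$
\hat\eta(\tilde\eta) = \operatorname{argmax}_\eta Q(\eta; \tilde\eta) = \operatorname{argmax}_\eta q_{\tilde\eta}(\eta).
$$

1.4 EM algorithm for exponential families

The EM algorithm for exponential families takes a particularly nice form when the MLE map is nice in the complete data problem. Expressed sequentially, it is given by the recursion

$$
\hat\eta^{(k+1)} = \operatorname{argmax}_\eta \left[ \eta^T E_{\hat\eta^{(k)}}\left( (M, O) \mid O \right) - \Lambda(\eta) \right].
$$

In other words, we need to form the conditional expectation of all the sufficient statistics given the sufficient statistics we did observe. Following this, we just return the MLE as if we had observed those sufficient statistics. Another way to phrase this is

$$
\hat\eta^{(k+1)} = \nabla\Lambda^*\left( E_{\hat\eta^{(k)}}\left( (M, O) \mid O \right) \right)
$$

where $\Lambda^*$ denotes the convex conjugate of $\Lambda$, so that $\nabla\Lambda^* = (\nabla\Lambda)^{-1}$ is the complete-data MLE map.

1.5 Mixture model example

In the mixture model, if we write $Y_i = (Y_{i1}, \dots, Y_{iL})$, the sufficient statistics can be taken to be

$$
t(X, Y) = \left( \sum_{i=1}^n Y_{ij}, \; \sum_{i=1}^n Y_{ij} X_i, \; \sum_{i=1}^n Y_{ij} X_i^2 \right)_{1 \le j \le L}
$$

where only

$$
\sum_{j=1}^L Y_{ij} X_i = X_i, \quad 1 \le i \le n
$$

is observed.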
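To make these sufficient statistics concrete, here is a minimal sketch that computes $t(X, Y)$ when the labels happen to be available, and checks that summing $Y_{ij} X_i$ over $j$ recovers $X_i$. The function name `sufficient_statistics`, the one-hot encoding of $Y_i$ and the toy data are assumptions made for illustration only.

```python
import numpy as np

def sufficient_statistics(x, y_onehot):
    """Complete-data sufficient statistics of the normal mixture.

    x: array of shape (n,); y_onehot: array of shape (n, L) with rows Y_i.
    Returns (sum_i Y_ij, sum_i Y_ij X_i, sum_i Y_ij X_i^2), each of shape (L,).
    """
    counts = y_onehot.sum(axis=0)
    sum_x = y_onehot.T @ x
    sum_x2 = y_onehot.T @ x ** 2
    return counts, sum_x, sum_x2

# Toy data (hypothetical): four observations, two classes.
x = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 1, 0, 1])
y_onehot = np.eye(2)[labels]  # row i is the indicator vector Y_i

counts, sum_x, sum_x2 = sufficient_statistics(x, y_onehot)

# Only sum_j Y_ij X_i = X_i is observed: the class-wise split is the missing part.
observed = (y_onehot * x[:, None]).sum(axis=1)
```

In the EM setting the indicators $Y_{ij}$ are not available, so each one is replaced by its conditional expectation given $X$ before the same sums are formed.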

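1.5.1 Exercise

Use Bayes rule to show that, in our univariate normal mixture model,

$$
P_\eta(Y = l \mid X = x) = \frac{\pi_l \, \phi(x, \mu_l, \sigma^2_l)}{\sum_{j=1}^L \pi_j \, \phi(x, \mu_j, \sigma^2_j)}
$$

where $\phi(x, \mu, \sigma^2)$ is the univariate density of $N(\mu, \sigma^2)$.

If we set

$$
\hat\gamma_l(x, \eta) = P_\eta(Y = l \mid X = x),
$$

the above exercise shows that

$$
\begin{aligned}
E_\eta\left( Y_{il} X_i \mid X \right) &= \hat\gamma_l(X_i, \eta) X_i \\
E_\eta\left( Y_{il} X_i^2 \mid X \right) &= \hat\gamma_l(X_i, \eta) X_i^2 \\
E_\eta\left( Y_{il} \mid X \right) &= \hat\gamma_l(X_i, \eta).
\end{aligned}
$$

The usual MLE map (for the mean parameters) in this model can be expressed as

$$
\begin{aligned}
\hat\pi_l &= \frac{\sum_{i=1}^n Y_{il}}{n} \\
\hat\mu_l &= \frac{\sum_{i=1}^n Y_{il} X_i}{\sum_{i=1}^n Y_{il}} \\
\hat\sigma^2_l &= \frac{\sum_{i=1}^n Y_{il} (X_i - \hat\mu_l)^2}{\sum_{i=1}^n Y_{il}} = \frac{\sum_{i=1}^n Y_{il} X_i^2}{\sum_{i=1}^n Y_{il}} - \left( \frac{\sum_{i=1}^n Y_{il} X_i}{\sum_{i=1}^n Y_{il}} \right)^2.
\end{aligned}
$$

This leads to the following algorithm. Given an initial set of parameters $\eta^{(0)}$, we repeat the following updates for $k \ge 0$:

1. Form the responsibilities $\hat\gamma_l(X_i; \eta^{(k)})$, $1 \le l \le L$, $1 \le i \le n$.
2. Compute

$$
\begin{aligned}
\hat\pi_l^{(k+1)} &= \frac{\sum_{i=1}^n \hat\gamma_l(X_i; \eta^{(k)})}{n} \\
\hat\mu_l^{(k+1)} &= \frac{\sum_{i=1}^n \hat\gamma_l(X_i; \eta^{(k)}) X_i}{\sum_{i=1}^n \hat\gamma_l(X_i; \eta^{(k)})} \\
\hat\sigma_l^{2(k+1)} &= \frac{\sum_{i=1}^n \hat\gamma_l(X_i; \eta^{(k)}) X_i^2}{\sum_{i=1}^n \hat\gamma_l(X_i; \eta^{(k)})} - \left( \hat\mu_l^{(k+1)} \right)^2.
\end{aligned}
$$

3. Repeat.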
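The updates above can be sketched for a general number of components $L$ with array operations. This is one possible vectorized arrangement rather than the notes' own implementation: the names `normal_pdf` and `em_step`, the use of variances instead of standard deviations as parameters, and the simulated data and starting values are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_step(x, pi, mu, var):
    """One EM update for an L-component univariate normal mixture.

    x has shape (n,); pi, mu and var have shape (L,). Returns the updated
    (pi, mu, var) and the log-likelihood evaluated at the input parameters.
    """
    joint = pi * normal_pdf(x[:, None], mu, var)  # (n, L): pi_l * phi(x_i, mu_l, var_l)
    marginal = joint.sum(axis=1)                  # mixture density at each x_i
    gamma = joint / marginal[:, None]             # responsibilities
    weights = gamma.sum(axis=0)                   # sum_i gamma_l(x_i)
    new_pi = weights / x.shape[0]
    new_mu = gamma.T @ x / weights
    new_var = gamma.T @ x ** 2 / weights - new_mu ** 2
    return new_pi, new_mu, new_var, np.log(marginal).sum()

# Simulated two-component data and arbitrary starting values (assumptions).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(2.0, 1.0, 200), rng.normal(-1.0, 0.8, 600)])
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 1.0])
var = np.array([1.0, 16.0])

logliks = []
for _ in range(30):
    pi, mu, var, ll = em_step(x, pi, mu, var)
    logliks.append(ll)
```

Passing length-3 or length-4 arrays for `pi`, `mu` and `var` runs the same update with more components, since nothing in `em_step` depends on $L = 2$.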

Let's test out our algorithm on some data from the mixture model.

    import numpy as np
    import matplotlib.pyplot as plt
    # The %R cells assume the rpy2 IPython extension has been loaded
    # (e.g. with %load_ext rpy2.ipython).

    mu1, sigma1 = 2, 1
    mu2, sigma2 = -1, 0.8
    X1 = np.random.standard_normal(200)*sigma1 + mu1
    X2 = np.random.standard_normal(600)*sigma2 + mu2
    X = np.hstack([X1, X2])

    %R -i X plot(density(X))

    def phi(x, mu, sigma):
        """
        Normal density
        """
        return np.exp(-(x-mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    def responsibilities(X, params):
        """
        Compute the responsibilities, as well as the log-likelihood at the same time.
        """
        mu1, mu2, sigma1, sigma2, pi1, pi2 = params
        gamma1 = phi(X, mu1, sigma1) * pi1
        gamma2 = phi(X, mu2, sigma2) * pi2
        denom = gamma1 + gamma2   # mixture density at each observation
        gamma1 /= denom
        gamma2 /= denom
        return np.array([gamma1, gamma2]).T, np.log(denom).sum()

    # initial parameter values
    mu1, mu2, sigma1, sigma2, pi1, pi2 = 0, 1, 1, 4, 0.5, 0.5
    gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))

Here is our iterative estimation procedure, which is fairly straightforward here.

    niter = 20
    n = X.shape[0]
    values = []
    for _ in range(niter):
        # E step: responsibilities and log-likelihood at the current parameters
        gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))
        # M step: weighted versions of the complete-data MLE
        pi1, pi2 = gamma.sum(0) / n
        mu1 = (gamma[:,0] * X).sum() / (pi1*n)
        mu2 = (gamma[:,1] * X).sum() / (pi2*n)
        sigma1_sq = (gamma[:,0] * X**2).sum() / (n*pi1) - mu1**2
        sigma2_sq = (gamma[:,1] * X**2).sum() / (n*pi2) - mu2**2
        sigma1 = np.sqrt(sigma1_sq)
        sigma2 = np.sqrt(sigma2_sq)
        values.append(likelihood)

We can track the value of the log-likelihood and, since we have an EM algorithm, the log-likelihood should be non-decreasing over iterations.

    plt.plot(values)
    plt.gca().set_ylabel(r'$\ell^{(k)}$')
    plt.gca().set_xlabel(r'Iteration $k$')
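Besides inspecting the plot, the monotonicity of the tracked log-likelihood values can be checked programmatically. The helper below is a small sketch: the name `is_nondecreasing` and the tolerance `tol`, which allows for floating-point rounding, are assumptions and not part of the notes.

```python
import numpy as np

def is_nondecreasing(values, tol=1e-8):
    """Return True if each successive difference of values is at least -tol."""
    values = np.asarray(values, dtype=float)
    return bool(np.all(np.diff(values) >= -tol))

# Hypothetical log-likelihood traces for illustration.
good_trace = [-1500.0, -1420.5, -1420.5, -1419.9]
bad_trace = [-1500.0, -1510.0, -1490.0]
check_good = is_nondecreasing(good_trace)
check_bad = is_nondecreasing(bad_trace)
```

Calling `is_nondecreasing(values)` on the list collected in the loop is a cheap sanity check: a decrease larger than rounding error usually points to a bug in the E or M step.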

Let's plot our density estimate to see how well the mixture model was fit.

    %%R -i pi1,pi2,sigma1,sigma2,mu1,mu2
    X = sort(X)
    plot(X, pi1*dnorm(X,mu1,sigma1)+pi2*dnorm(X,mu2,sigma2), col='red', lwd=2, type='l', ylab='Density')
    lines(density(X))

1.5.2 Exercise

1. Refit the mixture model assuming the variance is the same within each class, i.e. $\sigma^2_l = \sigma^2$, independent of class $l$.
2. Try fitting 3 and 4 component mixture models to the above data, which has only two components. What do you expect to see in the fitted density?

1.6 Gaussian random effects model

Another application of the EM algorithm is to random or linear mixed effects models. One version of a linear mixed effects model is

$$
Y \mid X, Z \sim N\left( X\beta, \; \sigma^2 I + Z \Sigma Z^T \right)
$$

where $X$ is a fixed effects design matrix, $Z$ is a random effects design matrix and $\Sigma$ is a covariance matrix that must be estimated along with $\sigma$. The covariance matrix $\Sigma$ might not be estimated in a completely unrestricted fashion. In the example below, the model is $\Sigma = \sigma^2_\alpha I$ for some constant $\sigma^2_\alpha$. This distribution is the same as the distribution of

$$
X\beta + Z\alpha + \epsilon \mid X, Z
$$

where $\alpha \sim N(0, \Sigma)$, $\epsilon \sim N(0, \sigma^2 I)$ independently given $X, Z$.

The simplest version of such a random effects model would be one in which observations are grouped by subject and each subject has a random intercept:

$$
Y_{ij} = X_i^T \beta + \alpha_i + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2), \quad \alpha_i \sim N(0, \sigma^2_\alpha), \quad 1 \le i \le n, \; 1 \le j \le n_i
$$

with the $\epsilon$'s and $\alpha$'s being independent. This corresponds to $Z$ being a design matrix of indicator variables for a factor that has $n$ levels, i.e. subject. Here, the matrix $\Sigma = \sigma^2_\alpha I_{n \times n}$.

Exercise

Define the complete data to be

$$
(Y_{ij}, \alpha_i, X_i)_{1 \le i \le n, \, 1 \le j \le n_i}
$$

and assume you are only able to observe

$$
(Y_{ij}, X_i)_{1 \le i \le n, \, 1 \le j \le n_i}.
$$

1. What are the sufficient statistics for the joint likelihood of the complete data (conditional on $X$)?
2. What is the conditional distribution of $\alpha_i \mid (Y_{ij}, X_i)_{1 \le j \le n_i}$?
3. Describe the EM algorithm to estimate $(\beta, \sigma^2, \sigma^2_\alpha)$.
4. How would you estimate the accuracy of your estimate of $\sigma^2_\alpha$?
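As a way of getting a feel for the random intercept model before attempting the exercise, the following sketch simulates from it and compares the empirical covariance of $Y$ with $\sigma^2 I + Z \Sigma Z^T$. Everything specific here is an assumption chosen for illustration: the balanced design with `n_per_subject` observations per subject, the values of `sigma`, `sigma_alpha` and `beta`, and the number of simulated draws. It does not carry out the EM algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_per_subject = 4, 3   # hypothetical sizes, balanced for simplicity
sigma, sigma_alpha = 0.5, 1.5      # hypothetical standard deviations
beta = np.array([1.0, -2.0])       # hypothetical fixed effects

# subject[r] is the subject index i of row r; Z holds the subject indicators.
subject = np.repeat(np.arange(n_subjects), n_per_subject)
Z = np.eye(n_subjects)[subject]                       # shape (N, n)
X_subject = rng.normal(size=(n_subjects, beta.shape[0]))
X = X_subject[subject]                                # row r carries X_i for its subject

def simulate():
    """Draw one response vector Y = X beta + Z alpha + eps."""
    alpha = rng.normal(0.0, sigma_alpha, n_subjects)
    eps = rng.normal(0.0, sigma, subject.shape[0])
    return X @ beta + Z @ alpha + eps

# Marginal covariance implied by the model with Sigma = sigma_alpha^2 I.
marginal_cov = sigma ** 2 * np.eye(subject.shape[0]) + sigma_alpha ** 2 * (Z @ Z.T)

draws = np.array([simulate() for _ in range(20000)])
empirical_cov = np.cov(draws, rowvar=False)
max_abs_error = np.abs(empirical_cov - marginal_cov).max()
```

Observations from the same subject share the covariance $\sigma^2_\alpha$, while observations from different subjects are uncorrelated, which is the block structure visible in `marginal_cov`.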
