ECE 275B Homework #2
Due Thursday 2-16-12

MIDTERM is scheduled for Tuesday, February 21, 2012.

Reading

Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11, pp. 178-82. Also read and understand the material in Section 7.8 of Kay. Read and understand the Expectation-Maximization (EM) Algorithm described in Moon and Stirling, Sections 17.1-17.6.

The following papers are highly recommended:

"Eliminating Multiple Root Problems in Estimation," C.G. Small, J. Wang, and Z. Yang, Statistical Science, Vol. 15, No. 4 (Nov. 2000), pp. 313-332.

"Maximum Likelihood from Incomplete Data via the EM Algorithm," A.P. Dempster, N.M. Laird, and D.B. Rubin, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38. Highly recommended.

"On the Convergence Properties of the EM Algorithm," C.F. Jeff Wu, The Annals of Statistics, Vol. 11, No. 1 (Mar. 1983), pp. 95-103.

"On the Convergence of the EM Algorithm," R.A. Boyles, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 45, No. 1 (1983), pp. 47-50.

"Mixture Densities, Maximum Likelihood and the EM Algorithm," R.A. Redner and H.F. Walker, SIAM Review, Vol. 26, No. 2 (Apr. 1984), pp. 195-239. Highly recommended.

"Direct Calculation of the Information Matrix via the EM Algorithm," D. Oakes, Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 2 (1999), pp. 479-482.

"The Variational Approximation for Bayesian Inference: Life After the EM Algorithm," D.G. Tzikas, A.C. Likas, and N.P. Galatsanos, IEEE Signal Processing Magazine, November 2008, pp. 131-146.

Comments on the Above Cited Papers

As mentioned in class, for highly nonlinear problems of the type considered by Kay, and for parametric mixture distributions,[1] there will generally be multiple roots of the likelihood equation, corresponding to multiple local maxima of the likelihood function.
The question of how to deal with this problem, and in particular the question of how to track the zero of the score function having the desired asymptotic property of efficiency (which generally does not correspond to the global maximum of the likelihood function), is a difficult one of keen interest. In this regard, you might want to read the paper "Eliminating Multiple Root Problems in Estimation" cited above.

[1] In both cases we are leaving the nice world of regular exponential family distributions.

Note from your readings in Kay that the Newton-Raphson procedure uses the Sample Information Matrix (SIM), while the Method of Scores uses the actual Fisher Information Matrix (FIM), which is the expected value of the SIM. As discussed in Kay, in the limit of an infinitely large number of samples it is the case that FIM ≈ SIM. Thus a perceived merit of the Newton-like numerical approaches is that an asymptotically valid estimate of the parameter error covariance matrix is provided by the availability of the SIM.[2]

Important and useful background material on the history, applications, and convergence properties of the EM algorithm as a procedure for performing missing-data maximum likelihood estimation can be found in the classic paper by Dempster, Laird, and Rubin (DLR) cited above. The DLR paper discusses a variety of applications, including mixture-density estimation and the example given in Moon. In the DLR paper the complete data space is not a simple Cartesian product of the actual data and hidden data spaces. The subsequent papers by Wu and Boyles, both cited above, discuss the lack of validity of the convergence proof given in the original DLR paper. Boyles gives an illuminating counterexample which contradicts the convergence claim made in DLR. Wu gives a rigorous and correct proof of convergence for the EM algorithm, and states sufficient conditions for convergence. Also cited above is the excellent and highly recommended SIAM Review paper by Redner and Walker on applying the EM algorithm to the problem of mixture density estimation.
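As a small numerical illustration of the two Newton-type iterations discussed above, the sketch below (Python used for illustration; the unit-scale Cauchy location problem is a made-up example, not one from Kay) contrasts Newton-Raphson, which divides the score by the sample (observed) information, with the Method of Scores, which divides by the FIM. For a unit-scale Cauchy location family the FIM is the constant N/2, so the scoring iteration never recomputes second derivatives:

```python
import math
import random

random.seed(1)

# Made-up example: MLE of the location theta of a unit-scale Cauchy density
# p(x; theta) = 1 / (pi * (1 + (x - theta)^2)).
theta_true = 3.0
N = 300
data = [theta_true + math.tan(math.pi * (random.random() - 0.5)) for _ in range(N)]

def score(theta):
    # First derivative of the log-likelihood.
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in data)

def sample_info(theta):
    # Observed (sample) information: minus the second derivative of the log-likelihood.
    return sum(2.0 * (1.0 - (x - theta) ** 2) / (1.0 + (x - theta) ** 2) ** 2 for x in data)

fim = N / 2.0  # Fisher information for the unit-scale Cauchy location family

def iterate(use_fim, theta0, iters=100):
    theta = theta0
    for _ in range(iters):
        denom = fim if use_fim else sample_info(theta)
        step = score(theta) / denom
        # Crude safeguard against a wild Newton step far from the root.
        step = max(-1.0, min(1.0, step))
        theta += step
    return theta

start = sorted(data)[N // 2]                     # sample median as a sensible start
theta_nr = iterate(use_fim=False, theta0=start)  # Newton-Raphson (SIM)
theta_sc = iterate(use_fim=True, theta0=start)   # Method of Scores (FIM)
print(theta_nr, theta_sc)
```

Started from the same point, both iterations track the same root of the score function; scoring trades a possibly slower constant-metric step for never having to form the Hessian.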
Despite the existence of excellent books now available on the subject of mixture density estimation, the classic Redner and Walker paper is still well worth reading. The use of mixture densities to model arbitrary density functions, and classification when class-labeled training data are not available, are two immensely important uses of the EM algorithm. The convergence conditions given by DLR and Wu are clearly important for mixture density estimation. While mixtures of regular exponential family distributions (REFDs) result in convergent behavior, this is not true when one or more of the component densities of a mixture are not from an REFD. For example, a mixture of a uniform distribution and a Gaussian distribution will not result in a convergent EM algorithm, as you will find in one of the homework problems below.

One claimed potential drawback to the basic EM algorithm is the general lack of availability of the SIM or FIM for use as a parameter error covariance matrix.[3] In fact, a careful reading of DLR shows that the SIM can be readily computed. This fact is discussed and made salient in the above-cited paper "Direct Calculation of the Information Matrix via the EM Algorithm," which discusses the additional computations (over and above the basic EM algorithm computations) needed to compute an estimate of the information matrix.

Finally, the recent review paper by Tzikas et al. on the relationship between the EM algorithm and the variational approximation is a nice, short synopsis of material to be found in the textbook Pattern Recognition and Machine Learning, C. Bishop, Springer, 2006. The derivation of the EM algorithm given in this paper is slightly simpler than the one I give in lecture because (unlike DLR) they assume that the complete data space is the Cartesian product of the actual data and hidden data spaces (Bishop makes this same simplifying assumption).

[2] Recall that for regular statistical families the MLE asymptotically attains the C-R lower bound, which is the inverse of the FIM. The availability of the error covariance matrix, coupled with the fact that the MLE is asymptotically normal, allows one to easily construct confidence intervals (error bars) about the estimated parameter values.

[3] On the one hand, the ability to avoid the explicit computation of the FIM (usually required to be updated at every iteration step) typically provides a huge computational savings relative to the Newton algorithm. On the other hand, having this matrix available once the algorithm has finally converged to an estimate of the unknown parameters allows one to construct confidence intervals (parameter error bars) for these parameters.

Homework Problems and Matlab Programming Exercises

1. Let the scalar random variables X_1, ..., X_n and Y_1, ..., Y_n all be mutually independent, where Y_i ~ Poisson(β τ_i) and X_i ~ Poisson(τ_i), i = 1, ..., n. The two data sets {X_i}_{i=1}^n and {Y_i}_{i=1}^n taken together comprise the complete data. The parameters β and τ_i, i = 1, ..., n, are deterministic and unknown.[4] This provides a simple model of the incidence of a disease Y_i seen in hospital i, where the underlying rate is a function of an overall (hospital-independent) effect β, due to the intrinsic virulence of the disease, and an additional hospital-specific factor τ_i. The availability of the measurements X_i allows one to directly estimate the hospital-specific factor τ_i.[5]

(a) Find the complete information maximum likelihood estimate (MLE) of the unknown parameters.

(b) Assume that the observation for X_1 is missing. (Thus each hospital i, except for hospital 1, is able to provide a measurement of the quantity X_i.) Determine an EM algorithm for finding the actual information MLE of the unknown parameters.
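Before deriving the estimators in Problem 1, it can help to simulate the model. The sketch below (Python used for illustration; the sample size and parameter values are made up) draws the complete data using a standard-library Poisson sampler and computes the crude moment-style ratio sum(Y)/sum(X), which should land near β since E{Y_i} = β τ_i and E{X_i} = τ_i. This is only a sanity check, to be compared with the MLE you derive in part (a):

```python
import math
import random

rng = random.Random(0)

def poisson(lam, rng):
    # Knuth's multiplicative method; adequate for the modest rates used here.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

# Made-up true parameters for the hospital model of Problem 1.
n = 500
beta_true = 2.0
tau = [rng.uniform(5.0, 15.0) for _ in range(n)]   # hospital-specific factors

X = [poisson(t, rng) for t in tau]                 # X_i ~ Poisson(tau_i)
Y = [poisson(beta_true * t, rng) for t in tau]     # Y_i ~ Poisson(beta * tau_i)

# Moment-style check: E{sum Y} = beta * sum tau and E{sum X} = sum tau,
# so sum(Y)/sum(X) should be close to beta for large n.
beta_ratio = sum(Y) / sum(X)
print(beta_ratio)
```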
2. Let A = [a_ij] be an arbitrary real m × n matrix with ij-th component a_ij. For f a differentiable scalar real-valued function of A, we define the matrix derivative and the matrix gradient of f(A) in a component-wise manner respectively as

    [∂f(A)/∂A]_ij = ∂f(A)/∂a_ji   and   ∇_A f(A) = (∂f(A)/∂A)^T, i.e., [∇_A f(A)]_ij = ∂f(A)/∂a_ij.

Note that this convention is consistent with the convention used to define the (covariant) vector derivative and (contravariant Cartesian) gradient in ECE 275A. For Σ square and invertible (but not necessarily symmetric), prove that

    ∂/∂Σ log det Σ = Σ^{-1}   and   ∂/∂Σ tr(Σ^{-1} W) = -Σ^{-1} W Σ^{-1}.

Note that the condition for the first identity to be well-defined is that det Σ > 0. Assuming only that det Σ ≠ 0, what is ∂/∂Σ log |det Σ|?

3. Let y be a Gaussian random vector with unknown mean µ and unknown covariance matrix Σ, which is assumed to be full rank. Given a collection of N iid samples of y, use the result of the previous homework problem to find the maximum likelihood estimates of µ and Σ.[6] Why is just determining a stationary point of the log-likelihood function generally sufficient to claim uniqueness and optimality of the solution?

4. (i) In order to model the pdf of a scalar random variable y, completely set up the M-component Gaussian mixture density problem as a hidden data problem and then derive the EM algorithm for identifying the unknown mixture parameters. Assume that the individual mixture parameters are all independent. (ii) Write a program (in Matlab, or your favorite programming language) to implement your algorithm as requested in the problems given below.

5. Note that the algorithm derived in Problem 4 above requires no derivatives, Hessians, or step-size parameters in its implementation. Is this true for a complete implementation of the EM solution given by Kay as equations (7.58) and (7.59)? Why? Can one always avoid derivatives when using the EM algorithm?

6. On pages 290, 291, 293, and 294 (culminating in Example 9.3), Kay shows how to determine consistent estimates for the parameters of a zero-mean, unimodal Gaussian mixture. What utility could these parameters have in determining MLEs of the parameters via numerical means?

[4] Recall that a Poisson(λ) distribution has the form p_Z(z; λ) = e^{-λ} λ^z / z!, z = 0, 1, ..., with mean E{z} = λ.

[5] For example, the parameter τ_i could depend upon the size and overall health of the population in the region served by hospital i, and X_i could be the waiting time in minutes before an entering patient is first examined by a trained medical professional. Or τ_i could depend on the quality of the health care provided by hospital i, and X_i could be the number of nurses with advanced training. Et cetera.
[6] Do NOT use the gradient formulas for matrices constrained to be symmetric, as is done by Moon and Stirling on page 558. Although the symmetrically constrained derivative formulas are sometimes useful, they are not needed here because the stationary point of the likelihood just happens to turn out to be symmetric (a happy accident!) without the need to impose this constraint. This means that the (accidentally symmetric) solution is optimal in the space of all matrices, not just on the manifold of symmetric matrices. I have never seen the symmetry constraint applied when taking the derivative except in Moon and Stirling.
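The identities claimed in Problem 2 are easy to spot-check numerically before proving them. The sketch below (Python, pure standard library; the non-symmetric 2 × 2 matrices Σ and W are arbitrary choices made up for the check) compares central finite differences of log det Σ and tr(Σ^{-1}W) against Σ^{-1} and -Σ^{-1}WΣ^{-1}, using the component convention [∂f/∂Σ]_ij = ∂f/∂σ_ji from the problem statement:

```python
import math

# Arbitrary non-symmetric test matrices with det(Sigma) > 0 (made up for the check).
Sigma = [[2.0, 0.3], [0.5, 1.5]]
W = [[1.0, -0.4], [0.2, 0.8]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def tr_inv_W(A):
    AiW = matmul(inv2(A), W)
    return AiW[0][0] + AiW[1][1]

def fd(f, A, i, j, h=1e-5):
    # Central finite difference of f with respect to the entry a_ij.
    Ap = [row[:] for row in A]
    Am = [row[:] for row in A]
    Ap[i][j] += h
    Am[i][j] -= h
    return (f(Ap) - f(Am)) / (2.0 * h)

Si = inv2(Sigma)
SiWSi = matmul(matmul(Si, W), Si)

# Convention from Problem 2: [df/dSigma]_ij = df/dsigma_ji,
# so the (i, j) entry of the matrix derivative differentiates entry (j, i).
for i in range(2):
    for j in range(2):
        assert abs(fd(lambda A: math.log(det2(A)), Sigma, j, i) - Si[i][j]) < 1e-6
        assert abs(fd(tr_inv_W, Sigma, j, i) - (-SiWSi[i][j])) < 1e-6
print("identities verified numerically")
```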
7. Consider a two-Gaussian mixture density p(y; Θ), with true mixture parameters α = α_1 = 0.6, ᾱ = α_2 = 1 - α = 0.4, means µ_1 = 0, µ_2 = 10, and variances σ_1² and σ_2². Thus Θ = {α, ᾱ, θ}, θ = {µ_1, µ_2, σ_1², σ_2²}. The computer experiments described below are to be performed for the two cases σ_1² = σ_2² = 1 (i.e., both component standard deviations equal to 1) and σ_1² = σ_2² = 25 (i.e., both component standard deviations equal to 5). Note that in the first case the means of the two component densities are located 10 standard deviations apart (probably a rarely seen situation in practice), whereas in the second case the means of the two component densities are located 2 standard deviations apart (probably a not-so-rare situation).

(a) Plot the true pdf p(y; Θ) as a function of y. Generate and save 400 iid samples from the pdf. Show frequency histograms of the data for 25 samples, 50 samples, 100 samples, 200 samples, and 400 samples, plotted on the same plot as the pdf.

(b) (i) Derive an analytic expression for the mean µ and variance σ² of y as a function of the mixture parameters, and evaluate this expression using the true parameter values. (ii) Compute the sample mean µ̂ and sample variance σ̂² of y using the 400 data samples and compare to the analytically determined true values. (iii) Show that the sample mean and sample variance are the ML estimates for the mean and variance of a (single) Gaussian model for the data. Plot this estimated single-Gaussian model against the true two-Gaussian mixture density.

(c) (i) Assume that the true density is unknown and estimate the parameters for a two-component Gaussian mixture[7] using the EM algorithm starting from the following initial conditions:

    µ_1 = µ_2 = µ̂,  σ_1² = σ_2² = σ̂²,  α = 1/2.

What happens and why? (ii) Perturb the initial conditions slightly to destroy their symmetry and describe the resulting behavior of the EM algorithm in words.
(iii) Again using all 400 samples, estimate the parameters for a variety of initial conditions and try to trap the algorithm in a false local optimum. Describe the results. (iv) Once you have determined a good set of initial conditions, estimate the mixture parameters for the two-component model via the EM algorithm for 25, 50, 100, 200, and 400 samples. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm. Plot your final estimated parameter values versus log_2(Sample Number).

[7] By visual inspection of the data frequency histograms, one can argue that whatever the unknown distribution is, it is smooth with bimodal symmetric humps and hence should be adequately modelled by a two-component Gaussian mixture.
(d) Discuss the difference between the case where the component standard deviations both have the value 1 and the case where they both have the value 5.

8. (i) For the two-component Gaussian mixture case, derive the EM algorithm when the constraint σ_1² = σ_2² = σ² is known to hold. (I.e., incorporate this known constraint into the algorithm.) Compare the resulting algorithm to the one derived in Problem 4 above. (ii) Estimate the unknown parameters of a two-component Gaussian mixture using the synthetic data generated in Problem 7 above using your modified algorithm and compare the results.

9. EM computer experiment when one of the components is non-regular. Consider a two-component mixture model where the first component is uniformly distributed between 0 and θ_1, U[0, θ_1], with mixture parameter α = 0.55, and the second component is normally distributed with mean µ and variance σ², θ_2 = (µ, σ²)^T, with mixture parameter ᾱ = 1 - α = 0.45. Assume that all of the parameters are independent of each other. Assume that 0 log 0 = 0.

(a) i. Generate 1000 samples from the distribution described above for θ_1 = 1, µ = 3/2, and σ² = 1/16. Plot a histogram of the data and the density function on the same graph.

ii. Now assume that you do not know how the samples were generated, and learn the unknown distribution using 1, 2, 3, 10, and 50 component Gaussian mixture models by training on the synthesized data. Plot the true mixture distribution and the 5 learned Gaussian mixture models on the same graph. Recall that the MLE-based density estimate is asymptotically closest to the true density in the Kullback-Leibler divergence sense. Note that our learned models approximate an unknown non-regular mixture density by regular Gaussian mixtures. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm.

iii. Repeat the above for the values µ = 1/2, 1, and 2, with all other parameters unchanged.
(b) Suppose we now try to solve the non-regular two-component mixture problem described above by forming a uniform-plus-Gaussian two-component mixture model and applying the EM algorithm in an attempt to learn the component parameters θ_1, µ, and σ² directly. Convince yourself that in setting up the EM algorithm the Gaussian component can be handled as has already been done above, so that the novelty, and potential difficulty, is in dealing with the non-regular uniform distribution component.

i. In the M-step of the EM algorithm, show that if x̂_{l,1} ≠ 0 for some sample y_l > θ_1, then the sum

    sum_{j=1}^{N} x̂_{j,1} log p_1(y_j; θ_1),
where p_1(y_j; θ_1) denotes the uniform density, is not bounded from below. What does this say about the admissible values of θ̂_1^+?

ii. Show that maximizing sum_{j=1}^{N} x̂_{j,1} log p_1(y_j; θ_1) with respect to θ_1 is equivalent to maximizing

    (1/θ_1)^{N̂_1} ∏_j 1_{[0,θ_1]}(y_j)   with θ_1 ≥ y_j for all y_j > 0,   (1)

where N̂_1 = sum_{j=1}^{N} x̂_{j,1}, {y_j} are the data samples for which x̂_{j,1} ≠ 0, and 1_{[0,θ_1]}(y_j) is the indicator function that indicates whether or not y_j is in the closed interval [0, θ_1]. Maximize the expression (1) to obtain the estimate θ̂_1^+. (Hint: Compare to the full information MLE solution you derived last quarter.) Explain why the estimate provided by the EM algorithm remains stuck at the value of the very first update θ̂_1^+, and never changes thereafter, and why the value one gets stuck at depends on the initialization value of θ̂_1. Verify these facts in simulation (see below for additional simulation requests). Try various initialization values for θ̂_1, including θ̂_1 > max_j y_j. (Note that only initialization values that have an actual data log-likelihood that is bounded from below need be considered.) Show that the actual data log-likelihood function monotonically increases with each iteration of the EM algorithm (i.e., that we have monotonic hill-climbing on the actual data log-likelihood function).

iii. Let us attempt to heuristically fix the stickiness problem of the EM algorithm with an ad hoc attempt at stabilizing the algorithm. There is no guarantee that the procedure outlined below will work, as we are not conforming to the theory of the EM algorithm. Thus, monotonic hill-climbing on the actual data log-likelihood function is not guaranteed. First put a floor on the value of x̂_{j,1} for positive y_j:

    If x̂_{j,1} < ε and y_j > 0, set x̂_{j,1} = ε and x̂_{j,2} = 1 - ε.

Choose ε to be a very small value. In particular, choose ε ≪ 1/θ̂_1, where θ̂_1 > max_j y_j is the initialization value of the estimate of θ_1.
Next define two intermediate estimates, θ̃_1 and θ̄_1, of θ_1 as follows:

    θ̃_1 = (2/N̂_1) sum_{j=1}^{N} x̂_{j,1} y_j   with   N̂_1 = sum_{j=1}^{N} x̂_{j,1}

and

    θ̄_1 = max_j x̂_{j,1} y_j exp( -(y_j - θ̃_1)² / τ ).

In the complete information case, which corresponds to setting x̂_{j,1} = x_{j,1} and N̂_1 = N_1, the estimate θ̃_1 corresponds to the BLUE (and method of moments) estimate of θ_1. We can refer to θ̃_1 as the pseudo-BLUE estimate. The estimate θ̄_1 modifies the EM update estimate by adding the factor x̂_{j,1}, which weights y_j according to the probability that it was generated by the uniform distribution, and an exponential factor which weights y_j according to how far it is from the pseudo-BLUE estimate of θ_1. We can refer to θ̄_1 as the weighted EM update of θ_1. The parameter τ determines how aggressively we penalize the deviation of y_j from θ̃_1. We construct our final Modified EM update of θ_1 as

    θ̂_1^+ = β θ̃_1 + β̄ θ̄_1   for 0 ≤ β ≤ 1 and β̄ = 1 - β.

For example, setting β = 1/2 yields θ̂_1^+ = (1/2)(θ̃_1 + θ̄_1).

We see that there are three free parameters to be set when implementing the Modified EM update step, namely ε, τ, and β. Attempt to run the modified EM algorithm on 1000 samples generated for the cases θ_1 = 1, σ² = 1/16, and µ = 1/2, 1, 3/2, and 2. Plot the actual data log-likelihood function to see if it fails to increase monotonically with each iteration of the EM algorithm. If you can't get the algorithm to work, don't worry; I thought up this heuristic fix on the fly and it definitely is not motivated by any rigorous mathematics.[8] If you can get the procedure to work, describe its performance relative to the pure Gaussian mixture cases investigated above and show its performance for various choices of β, including β = 0, 0.5, and 1.

[8] I dreamed up this fix a few years ago. So far only one student has actually gotten it to work, and work surprisingly well! Of course I'm always curious as to how many students can get it to work every time I teach this course.
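For the Gaussian-mixture experiments of Problems 4 and 7, the entire EM loop fits in a few lines. Below is a minimal sketch (Python used for illustration; the seed, the deliberately asymmetric initialization, and the iteration count are arbitrary choices of this sketch, not prescriptions of the problems) for the well-separated case α = 0.6, µ_1 = 0, µ_2 = 10, σ_1² = σ_2² = 1. It also records the actual data log-likelihood after each iteration so that the monotonic hill-climbing property can be checked:

```python
import math
import random

rng = random.Random(7)

# Draw 400 samples from the true two-component mixture of Problem 7
# (alpha = 0.6, means 0 and 10, unit variances).
alpha_true, mus_true = 0.6, (0.0, 10.0)
data = [rng.gauss(mus_true[0] if rng.random() < alpha_true else mus_true[1], 1.0)
        for _ in range(400)]

def gauss_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def loglik(data, a, mu, var):
    return sum(math.log(a * gauss_pdf(y, mu[0], var[0])
                        + (1.0 - a) * gauss_pdf(y, mu[1], var[1])) for y in data)

# Deliberately asymmetric initialization (the symmetric one of 7(c)(i) gets stuck).
a = 0.5
mu = [min(data), max(data)]
mean = sum(data) / len(data)
v = [sum((y - mean) ** 2 for y in data) / len(data)] * 2   # sample variance for both

lls = []
for _ in range(100):
    # E-step: responsibility of component 1 for each sample.
    r = []
    for y in data:
        p1 = a * gauss_pdf(y, mu[0], v[0])
        p2 = (1.0 - a) * gauss_pdf(y, mu[1], v[1])
        r.append(p1 / (p1 + p2))
    # M-step: responsibility-weighted sample statistics.
    n1 = sum(r)
    n2 = len(data) - n1
    a = n1 / len(data)
    mu = [sum(ri * y for ri, y in zip(r, data)) / n1,
          sum((1 - ri) * y for ri, y in zip(r, data)) / n2]
    v = [sum(ri * (y - mu[0]) ** 2 for ri, y in zip(r, data)) / n1,
         sum((1 - ri) * (y - mu[1]) ** 2 for ri, y in zip(r, data)) / n2]
    lls.append(loglik(data, a, mu, v))

print(round(mu[0], 2), round(mu[1], 2))
```

The recorded values in `lls` should be non-decreasing up to floating-point noise; this is the monotonic increase of the actual data log-likelihood that the problems ask you to verify.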