An Approximate Fisher Scoring Algorithm for Finite Mixtures of Multinomials


Andrew M. Raim, Minglei Liu, Nagaraj K. Neerchal and Jorge G. Morel

Andrew Raim is a graduate student at the Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, MD 21250, USA (Email: araim1@umbc.edu). Minglei Liu is Senior Principal Biostatistician at Medtronic, Santa Rosa, CA, USA. Nagaraj Neerchal is Professor at the Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, MD 21250, USA (Email: nagaraj@umbc.edu). Jorge Morel is Principal Statistician at Procter & Gamble, Cincinnati, OH, USA.

Abstract

Finite mixture distributions arise naturally in many applications including clustering and classification. Since they usually do not yield closed forms for maximum likelihood estimates (MLEs), numerical methods using the well known Fisher Scoring or Expectation-Maximization algorithms are considered. In this work, an approximation to the Fisher Information Matrix of an arbitrary mixture of multinomial distributions is introduced. This leads to an Approximate Fisher Scoring algorithm (AFSA), which turns out to be closely related to Expectation-Maximization, and is more robust to the choice of initial value than Fisher Scoring iterations. A combination of AFSA and the classical Fisher Scoring iterations provides the best of both computational efficiency and stable convergence properties.

Key Words: Multinomial; Finite mixture; Maximum likelihood; Fisher Information Matrix; Fisher Scoring.

1 Introduction

A finite mixture model arises when observations are believed to originate from one of several populations, but it is unknown to which population each observation belongs. Model identifiability of mixtures is an important issue with a considerable literature; see, for example, Robbins (1948) and Teicher. The book by McLachlan and Peel (2000) is a good entry point to the literature on mixtures. A majority of the mixture literature deals with mixtures of normal distributions; however, Blischke (1962; 1964), Krolikowska (1976), and Kabir (1968) are a few early works which address mixtures of discrete distributions. The present work focuses on mixtures of multinomials, which have wide applications in clustering and classification problems, as well as in modeling overdispersed data (Morel and Nagaraj). It is well known that computation of maximum likelihood estimates (MLEs) under mixture distributions is often analytically intractable, and therefore iterative numerical methods are needed.

Newton-Raphson and Fisher Scoring are two widely used classical iterative techniques. The more recent Expectation-Maximization (EM) algorithm discussed in Dempster, Laird and Rubin (1977) has become another standard technique to compute MLEs. EM is a framework for performing estimation in missing data problems. The idea is to solve a difficult incomplete data problem by repeatedly solving tractable complete data problems. If the unknown population labels are treated as missing information, then estimation under the mixture distribution can be considered a missing data problem, and an EM algorithm can be used. Unlike the Fisher Scoring Algorithm (FSA), EM does not require computation of an expected Hessian in each iteration, which is a great advantage if this matrix is difficult to compute. Slow speed of convergence has been cited as a disadvantage of EM. Variations and improved versions of the EM algorithm have been widely used for obtaining MLEs for mixtures (McLachlan and Peel, 2000, chapter 2).

Fisher Scoring iterations require the inverse of the Fisher Information Matrix (FIM). In the mixture setting, computing the FIM involves a complicated expectation which does not have an analytically tractable form. The matrix can be approximated numerically, for example by Monte Carlo simulation, but this is computationally expensive, especially when repeated over many iterations. Morel and Nagaraj (1991; 1993) proposed a variant of Fisher Scoring using an approximate FIM in their study of a multinomial model with extra variation. This model, now referred to as the Random Clumped Multinomial (see Example 3.1 for details), is a special case of the finite mixture of multinomials. The approximate FIM was justified asymptotically, and was used to obtain MLEs for the model and to demonstrate their efficiency.

In the present paper, we extend the approximate FIM idea to general finite mixtures of multinomials and hence formulate the Approximate Fisher Scoring Algorithm (AFSA) for this family of distributions. By using the approximate FIM in place of the true FIM, we obtain an algorithm which is closely related to EM. Both AFSA and EM have a slower convergence rate than Fisher Scoring once they are in the proximity of a maximum, but both are also much more robust than Fisher Scoring in finding such regions from an arbitrary initial value. The rest of the paper is organized as follows.
In Section 2, a large cluster approximation for the Fisher Information Matrix is derived and some of its properties are presented. This approximate information matrix is easily computed and has an immediate application in Fisher Scoring, which is presented in Section 3. Simulation studies are presented in Section 4, illustrating convergence properties of the approximate information matrix and approximate Fisher Scoring. Concluding remarks are given in Section 5.

2 An Approximate Fisher Information Matrix

Consider the multinomial sample space with $m$ trials placed into $k$ categories at random,
$$\Omega = \Big\{ x = (x_1, \dots, x_k) : x_j \in \{0, 1, \dots, m\},\ \sum_{j=1}^{k} x_j = m \Big\}.$$
The standard multinomial density is
$$f(x; p, m) = \frac{m!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}\, I(x \in \Omega),$$

where $I(\cdot)$ is the indicator function, and the parameter space is
$$\Big\{ p = (p_1, \dots, p_{k-1}) : 0 < p_j < 1,\ \sum_{j=1}^{k-1} p_j < 1 \Big\} \subseteq \mathbb{R}^{k-1}.$$
If a random variable $X$ has distribution $f(x; p, m)$, we will write $X \sim \text{Mult}_k(p, m)$. Since $x_k = m - \sum_{j=1}^{k-1} x_j$ and $p_k = 1 - \sum_{j=1}^{k-1} p_j$, the $k$th category can be considered as redundant information. Following the sampling and overdispersion literature, we will refer to the number of trials $m$ as the cluster size of a multinomial observation.

Now suppose there are $s$ multinomial populations
$$\text{Mult}_k(p_1, m), \dots, \text{Mult}_k(p_s, m), \qquad p_\ell = (p_{\ell 1}, \dots, p_{\ell, k-1}),$$
where the $\ell$th population occurs with proportion $\pi_\ell$ for $\ell = 1, \dots, s$. If we draw $X$ from the mixed population, its probability density is a finite mixture of multinomials
$$f(x; \theta) = \sum_{\ell=1}^{s} \pi_\ell f(x; p_\ell, m), \qquad \theta = (p_1, \dots, p_s, \pi), \tag{2.1}$$
and we will write $X \sim \text{MultMix}_k(\theta, m)$. The dimension of $\theta$ is $q := s(k-1) + (s-1) = sk - 1$, disregarding the redundant parameters $p_{1k}, \dots, p_{sk}, \pi_s$. We will also make use of the following slightly less cumbersome notation for densities:

$P(x) := f(x; \theta, m)$: the mixture,
$P_\ell(x) := f(x; p_\ell, m)$: the $\ell$th component of the mixture.

The setting of this paper will be an independent sample $X_i \sim \text{MultMix}_k(\theta, m_i)$, $i = 1, \dots, n$, with cluster sizes not necessarily equal. The resulting likelihood is
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} \Big\{ \sum_{\ell=1}^{s} \pi_\ell \Big[ \frac{m_i!}{x_{i1}! \cdots x_{ik}!}\, p_{\ell 1}^{x_{i1}} \cdots p_{\ell k}^{x_{ik}}\, I(x_i \in \Omega) \Big] \Big\}. \tag{2.2}$$
The inner summation prevents closed-form likelihood maximization, hence our goal will be to compute the MLE $\hat\theta$ numerically. Some additional preliminaries are given in Appendix A.

In general, as mentioned earlier, the Fisher Information Matrix (FIM) for mixtures involves a complicated expectation which does not have a tractable form. Since the multinomial mixture has a finite sample space, it can be computed naively by using the definition of the expectation
$$I(\theta) = \sum_{x \in \Omega} \Big\{ \frac{\partial}{\partial\theta} \log f(x; \theta) \Big\} \Big\{ \frac{\partial}{\partial\theta} \log f(x; \theta) \Big\}^T f(x; \theta), \tag{2.3}$$
given a particular value for $\theta$. Although the number of terms $\binom{k+m-1}{m}$ in the summation is finite, it grows quickly with $m$ and $k$, and this method becomes intractable as $m$ and $k$ increase. For example, when $m = 100$ and $k = 4$, the sample space $\Omega$ contains $\binom{103}{3} = 176{,}851$ elements. To avoid these potentially expensive computations, we extend the approximate FIM approach of Morel and Nagaraj (1991; 1993) to the general finite mixture of multinomials. The following theorem states our main result.

Theorem 2.1. Suppose $X \sim \text{MultMix}_k(\theta, m)$ is a single observation from the mixed population. Denote the exact FIM with respect to $X$ as $I(\theta)$. Then an approximation to the FIM with respect to $X$ is given by the $(sk-1) \times (sk-1)$ block-diagonal matrix
$$\tilde{I}(\theta) := \text{Blockdiag}(\pi_1 F_1, \dots, \pi_s F_s, F_\pi),$$
where, for $\ell = 1, \dots, s$,
$$F_\ell = m \big[ D_\ell^{-1} + p_{\ell k}^{-1} \mathbf{1}\mathbf{1}^T \big] \quad \text{and} \quad D_\ell = \text{diag}(p_{\ell 1}, \dots, p_{\ell, k-1})$$
are $(k-1) \times (k-1)$ matrices,
$$F_\pi = D_\pi^{-1} + \pi_s^{-1} \mathbf{1}\mathbf{1}^T \quad \text{and} \quad D_\pi = \text{diag}(\pi_1, \dots, \pi_{s-1})$$
are $(s-1) \times (s-1)$ matrices, and $\mathbf{1}$ denotes a vector of ones of the appropriate dimension. To emphasize the dependence of the FIM and the approximation on $m$, we will also write $I_m(\theta)$ and $\tilde{I}_m(\theta)$. If the vectors $p_1, \dots, p_s$ are distinct (i.e. $p_a \ne p_b$ for every pair of populations $a \ne b$), then $I_m(\theta) - \tilde{I}_m(\theta) \to 0$ as $m \to \infty$.

A proof is given in Appendix B. Notice that the matrix $F_\ell$ is exactly the FIM of $\text{Mult}_k(p_\ell, m)$ for the $\ell$th population, and $F_\pi$ is the FIM of $\text{Mult}_s(\pi, 1)$ corresponding to the mixing probabilities $\pi$; see Appendix A for details. The approximate FIM turns out to be equivalent to a complete data FIM, as shown in Proposition 2.2 below, which provides an interesting connection to EM. This matrix can be formulated for any finite mixture whose components have a well-defined FIM, and is not limited to the case of multinomials.

Proposition 2.2. The matrix $\tilde{I}(\theta)$ is equivalent to the FIM of $(X, Z)$, where
$$Z = \begin{cases} 1 & \text{with probability } \pi_1 \\ \ \vdots \\ s & \text{with probability } \pi_s, \end{cases} \qquad \text{and} \qquad (X \mid Z = \ell) \sim \text{Mult}_k(p_\ell, m).$$

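Proof of Proposition 2.2. Here $Z$ represents the population from which $X$ was drawn. The complete data likelihood is then
$$L(\theta \mid x, z) = \prod_{\ell=1}^{s} \big[ \pi_\ell f(x \mid p_\ell, m) \big]^{I(z = \ell)}.$$
Write $\Delta = (\Delta_1, \dots, \Delta_s)$ with $\Delta_\ell = I(Z = \ell) \sim \text{Bernoulli}(\pi_\ell)$, let $\Delta_{-s}$ denote the vector $(\Delta_1, \dots, \Delta_{s-1})$, and let $x_{-k}$ denote $(x_1, \dots, x_{k-1})$. This likelihood leads to the score vectors
$$\frac{\partial}{\partial p_a} \log L(\theta) = \Delta_a \Big[ D_a^{-1} x_{-k} - \frac{x_k}{p_{ak}} \mathbf{1} \Big], \qquad \frac{\partial}{\partial \pi} \log L(\theta) = D_\pi^{-1} \Delta_{-s} - \frac{\Delta_s}{\pi_s} \mathbf{1}.$$
Taking second derivatives yields
$$\frac{\partial^2}{\partial p_a \partial p_a^T} \log L(\theta) = -\Delta_a \Big[ D_a^{-2}\, \text{diag}(x_{-k}) + \frac{x_k}{p_{ak}^2} \mathbf{1}\mathbf{1}^T \Big],$$
$$\frac{\partial^2}{\partial p_a \partial p_b^T} \log L(\theta) = 0 \ \text{ for } a \ne b, \qquad \frac{\partial^2}{\partial p_a \partial \pi^T} \log L(\theta) = 0,$$
$$\frac{\partial^2}{\partial \pi \partial \pi^T} \log L(\theta) = -\Big[ D_\pi^{-2}\, \text{diag}(\Delta_{-s}) + \frac{\Delta_s}{\pi_s^2} \mathbf{1}\mathbf{1}^T \Big].$$
Now take the expected value of the negative of each of these terms, jointly with respect to $(X, Z)$, to obtain the blocks of $\tilde{I}(\theta)$.

Corollary 2.3. Suppose $X_i \sim \text{MultMix}_k(\theta, m_i)$, $i = 1, \dots, n$, is an independent sample from the mixed population with varying cluster sizes, and $M = m_1 + \cdots + m_n$. Then the approximate FIM with respect to $(X_1, \dots, X_n)$ is given by
$$\tilde{I}(\theta) = \text{Blockdiag}(\pi_1 F_1, \dots, \pi_s F_s, F_\pi),$$
$$F_\ell = M \big[ D_\ell^{-1} + p_{\ell k}^{-1} \mathbf{1}\mathbf{1}^T \big], \quad \ell = 1, \dots, s, \qquad F_\pi = n \big[ D_\pi^{-1} + \pi_s^{-1} \mathbf{1}\mathbf{1}^T \big].$$

Proof of Corollary 2.3. Let $\tilde{I}_i(\theta)$ represent the approximate FIM with respect to observation $X_i$. The result is obtained using
$$\tilde{I}(\theta) = \tilde{I}_1(\theta) + \cdots + \tilde{I}_n(\theta),$$
corresponding to the additive property of exact FIMs for independent samples. The additive property can be justified by noting that each $\tilde{I}_i(\theta)$ is a true complete data FIM, by Proposition 2.2.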

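Since $\tilde{I}(\theta)$ is a block diagonal matrix, some useful expressions can be obtained in closed form.

Corollary 2.4. Let $\tilde{I}(\theta)$ represent the approximate FIM with respect to an independent sample $X_i \sim \text{MultMix}_k(\theta, m_i)$, $i = 1, \dots, n$. Then:

(a) The inverse of $\tilde{I}(\theta)$ is given by
$$\tilde{I}^{-1}(\theta) = \text{Blockdiag}\big( \pi_1^{-1} F_1^{-1}, \dots, \pi_s^{-1} F_s^{-1}, F_\pi^{-1} \big), \tag{2.5}$$
$$F_\ell^{-1} = M^{-1} \{ D_\ell - p_\ell p_\ell^T \}, \quad \ell = 1, \dots, s, \qquad F_\pi^{-1} = n^{-1} \{ D_\pi - \pi \pi^T \}.$$

(b) The trace of $\tilde{I}(\theta)$ is given by
$$\text{tr}\, \tilde{I}(\theta) = \sum_{\ell=1}^{s} \sum_{j=1}^{k-1} M \pi_\ell \big\{ p_{\ell j}^{-1} + p_{\ell k}^{-1} \big\} + \sum_{\ell=1}^{s-1} n \big\{ \pi_\ell^{-1} + \pi_s^{-1} \big\}.$$

(c) The determinant of $\tilde{I}(\theta)$ is given by
$$\det \tilde{I}(\theta) = \Big\{ \prod_{\ell=1}^{s} \Big[ (M \pi_\ell)^{k-1}\, p_{\ell k}^{-1} \prod_{j=1}^{k-1} p_{\ell j}^{-1} \Big] \Big\} \Big\{ \pi_s^{-1} \prod_{\ell=1}^{s-1} n \pi_\ell^{-1} \Big\}.$$

Proof of Corollary 2.4 (a). Since $\tilde{I}(\theta)$ is block diagonal, its inverse can be obtained by inverting the blocks, which can immediately be seen to be (2.5). To find the expressions for the individual blocks, we can apply the Sherman-Morrison formula (see for example Rao 1965, chapter 1)
$$(C + u v^T)^{-1} = C^{-1} - \frac{C^{-1} u v^T C^{-1}}{1 + v^T C^{-1} u}.$$
For the case of $F_\pi^{-1}$, for example, take $C = D_\pi^{-1}$ and $u = v = \pi_s^{-1/2} \mathbf{1}$, and use the expressions in Corollary 2.3.

Proof of Corollary 2.4 (b). Since the trace of a block diagonal matrix is the sum of the traces of its blocks, we have
$$\text{tr}\, \tilde{I}(\theta) = \pi_1\, \text{tr}(F_1) + \cdots + \pi_s\, \text{tr}(F_s) + \text{tr}(F_\pi). \tag{2.6}$$
The individual traces can be obtained as
$$\text{tr}(F_\ell) = \text{tr}\big( M [ D_\ell^{-1} + p_{\ell k}^{-1} \mathbf{1}\mathbf{1}^T ] \big) = \sum_{j=1}^{k-1} M \big\{ p_{\ell j}^{-1} + p_{\ell k}^{-1} \big\},$$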

a summation over the diagonal elements. Similarly for the block corresponding to $\pi$,
$$\text{tr}(F_\pi) = \text{tr}\big( n [ D_\pi^{-1} + \pi_s^{-1} \mathbf{1}\mathbf{1}^T ] \big) = \sum_{\ell=1}^{s-1} n \big\{ \pi_\ell^{-1} + \pi_s^{-1} \big\}.$$
The result is obtained by substituting these expressions into (2.6).

Proof of Corollary 2.4 (c). Since $\tilde{I}(\theta)$ has a block diagonal structure,
$$\det \tilde{I}(\theta) = \det\{F_\pi\} \prod_{\ell=1}^{s} \det\{\pi_\ell F_\ell\} = n^{s-1} \det\big\{ D_\pi^{-1} + \pi_s^{-1} \mathbf{1}\mathbf{1}^T \big\} \prod_{\ell=1}^{s} \pi_\ell^{k-1} M^{k-1} \det\big\{ D_\ell^{-1} + p_{\ell k}^{-1} \mathbf{1}\mathbf{1}^T \big\}. \tag{2.7}$$
Recall the property (see for example Rao 1965, chapter 1) that for a non-singular matrix $A$ we have
$$\det(A + u u^T) = \det \begin{pmatrix} A & -u \\ u^T & 1 \end{pmatrix} = \det(A)\, (1 + u^T A^{-1} u).$$
This yields, for instance,
$$\det\big\{ D_\pi^{-1} + \pi_s^{-1} \mathbf{1}\mathbf{1}^T \big\} = \det\{D_\pi^{-1}\} \big( 1 + \pi_s^{-1} \mathbf{1}^T D_\pi \mathbf{1} \big) = \Big[ \prod_{\ell=1}^{s-1} \pi_\ell^{-1} \Big] \Big( 1 + \frac{1 - \pi_s}{\pi_s} \Big) = \pi_s^{-1} \prod_{\ell=1}^{s-1} \pi_\ell^{-1}.$$
The result can be obtained by substituting the simplified determinants into (2.7).

The determinant and trace of the FIM are not utilized in the computation of MLEs, but are used in the computation of many statistics in subsequent analysis. In such applications, it may be preferable to have a closed form for these expressions. As one example, consider the Consistent Akaike Information Criterion with Fisher Information (CAICF) formulated in Bozdogan. The CAICF is an information-theoretic criterion for model selection, and is a function of the log-determinant of the FIM.

It can also be shown that $I_m^{-1}(\theta) - \tilde{I}_m^{-1}(\theta) \to 0$ as $m \to \infty$, which we now state as a theorem. A proof is given in Appendix B. This result is perhaps more immediately relevant than Theorem 2.1 for our Fisher Scoring application presented in the following section.

Theorem 2.5. Let $I_m(\theta)$ and $\tilde{I}_m(\theta)$ be defined as in Theorem 2.1 (namely the FIM and approximate FIM with respect to a single observation with cluster size $m$). Then $I_m^{-1}(\theta) - \tilde{I}_m^{-1}(\theta) \to 0$ as $m \to \infty$.

In the next section, we use the approximate FIM obtained in Theorem 2.1 to define an approximate Fisher Scoring algorithm and investigate its properties.
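As a rough numerical illustration of Corollaries 2.3 and 2.4, the following Python sketch assembles the approximate FIM for a sample and checks the closed-form inverse and log-determinant against generic linear algebra routines. It assumes NumPy is available. The function names (`mult_fim`, `approx_fim`, and so on) and the convention of storing each component as a full length-$k$ probability row are illustrative choices, not part of the paper.

```python
import numpy as np

def mult_fim(p_full, scale):
    """scale * (D^{-1} + 11^T / p_k), written in the first k-1 probabilities."""
    p_full = np.asarray(p_full, dtype=float)
    head, last = p_full[:-1], p_full[-1]
    # Adding the scalar 1/p_k to every entry is the same as adding 11^T / p_k.
    return scale * (np.diag(1.0 / head) + 1.0 / last)

def mult_fim_inv(p_full, scale):
    """Closed-form inverse from Corollary 2.4(a): (D - p p^T) / scale."""
    head = np.asarray(p_full, dtype=float)[:-1]
    return (np.diag(head) - np.outer(head, head)) / scale

def block_diag(blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    q = sum(b.shape[0] for b in blocks)
    out = np.zeros((q, q))
    start = 0
    for b in blocks:
        stop = start + b.shape[0]
        out[start:stop, start:stop] = b
        start = stop
    return out

def approx_fim(P, pi, m_sizes):
    """Approximate FIM of Corollary 2.3; m_sizes holds the n cluster sizes."""
    M, n = float(np.sum(m_sizes)), len(m_sizes)
    blocks = [pi_l * mult_fim(p_l, M) for p_l, pi_l in zip(P, pi)]
    blocks.append(mult_fim(pi, n))          # block for the mixing proportions
    return block_diag(blocks)

def approx_fim_inv(P, pi, m_sizes):
    """Closed-form inverse (2.5), block by block."""
    M, n = float(np.sum(m_sizes)), len(m_sizes)
    blocks = [mult_fim_inv(p_l, M) / pi_l for p_l, pi_l in zip(P, pi)]
    blocks.append(mult_fim_inv(pi, n))
    return block_diag(blocks)

def approx_fim_logdet(P, pi, m_sizes):
    """Log of the determinant in Corollary 2.4(c)."""
    P, pi = np.asarray(P, dtype=float), np.asarray(pi, dtype=float)
    M, n = float(np.sum(m_sizes)), len(m_sizes)
    s, k = P.shape
    return ((s - 1) * np.log(n) - np.log(pi).sum()
            + (k - 1) * np.log(M * pi).sum() - np.log(P).sum())
```

For instance, with two trinomial components one could compare `approx_fim(P, pi, sizes) @ approx_fim_inv(P, pi, sizes)` with the identity matrix, and `np.linalg.slogdet(approx_fim(P, pi, sizes))` with `approx_fim_logdet(P, pi, sizes)`.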

3 Approximate Fisher Scoring Algorithm

Consider an independent sample with varying cluster sizes
$$X_i \sim \text{MultMix}_k(\theta, m_i), \quad i = 1, \dots, n.$$
Let $\theta^{(0)}$ be an initial guess for $\theta$, and $S(\theta)$ be the score vector with respect to the sample (see Appendix A). Then by independence
$$S(\theta) = \sum_{i=1}^{n} S(\theta; x_i),$$
where $S(\theta; x_i)$ is the score vector with respect to the $i$th observation. The Fisher Scoring Algorithm is given by computing the iterations
$$\theta^{(g+1)} = \theta^{(g)} + I^{-1}(\theta^{(g)})\, S(\theta^{(g)}), \quad g = 1, 2, \dots \tag{3.1}$$
until the convergence criterion
$$\big| \log L(\theta^{(g+1)}) - \log L(\theta^{(g)}) \big| < \varepsilon$$
is met, for some given tolerance $\varepsilon > 0$. In practice, a line search may be used in every iteration after determining a search direction, but such modifications will not be considered here. Note that (3.1) uses the exact FIM, which may not be easily computable. We propose to substitute the approximation $\tilde{I}(\theta)$ for $I(\theta)$, and will refer to the resulting method as the Approximate Fisher Scoring Algorithm (AFSA). The expressions for $\tilde{I}(\theta)$ and its inverse are available in closed form, as seen in Corollaries 2.3 and 2.4.

AFSA can be applied to finite mixture of multinomial models which are not explicitly in the form of (2.2). We now give two examples which use AFSA to compute MLEs for such models. The first is the Random Clumped model for overdispersed multinomial data. The second is an arbitrary mixture of multinomials with links from parameters to covariates.

Example 3.1. In Section 1 we mentioned the Random Clumped Multinomial (RCM), a distribution that addresses overdispersion due to clumped sampling in the multinomial framework. RCM is an interesting model for exploring computational methods. Recently, Zhou and Lange (2010) have used it as an illustrative example for the minorization-maximization principle. Raim et al. (2012) have explored parallel computing in maximum likelihood estimation using large RCM models as a test problem. It turns out that RCM conforms to the finite mixture of multinomials representation (2.1), and can therefore be fitted by the AFSA algorithm.
Once the mixture representation is established, the score vector and approximate FIM can be formulated by the use of transformations; see for example section 2.6 of Lehmann and Casella. Hence, we can obtain the algorithm presented in Morel and Nagaraj (1993) and Neerchal and Morel (1998) as an AFSA-type algorithm.

Consider a cluster of $m$ trials, where each trial results in one of $k$ possible outcomes with probabilities $\pi_1, \dots, \pi_k$. Suppose a default category is also selected at random, so that each trial either results in this default outcome with probability $\rho$, or an independent choice with probability $1 - \rho$. Intuitively, if $\rho \to 0$, RCM approaches a standard multinomial distribution. Using this idea, an RCM random variable can be obtained from the following procedure. Let
$$Y_0, Y_1, \dots, Y_m \overset{iid}{\sim} \text{Mult}_k(\pi, 1) \quad \text{and} \quad U_1, \dots, U_m \overset{iid}{\sim} U(0, 1)$$
be independent samples, then
$$X = Y_0 \sum_{i=1}^{m} I(U_i \le \rho) + \sum_{i=1}^{m} Y_i\, I(U_i > \rho) = Y_0 N + (Z \mid N) \tag{3.2}$$
follows the distribution $\text{RCM}_k(\pi, \rho)$. The representation (3.2) emphasizes that $N \sim \text{Binomial}(m, \rho)$, $(Z \mid N) \sim \text{Mult}_k(\pi, m - N)$, and $Y_0 \sim \text{Mult}_k(\pi, 1)$, where $N$ and $Y_0$ are independent. RCM is also a special case of the finite mixture of multinomials, so that
$$X \sim f(x; \pi, \rho) = \sum_{\ell=1}^{k} \pi_\ell f(x; p_\ell, m),$$
$$p_\ell = (1 - \rho)\pi + \rho e_\ell \ \text{ for } \ell = 1, \dots, k-1, \qquad p_k = (1 - \rho)\pi,$$
where $f(x; p, m)$ is our usual notation for the density of $\text{Mult}_k(p, m)$. This mixture representation can be derived using moment generating functions, as shown in Morel and Nagaraj. Notice that in this mixture $s = k$, so that the number of mixture components matches the number of categories. There are also only $k$ distinct parameters rather than $sk - 1$ as in the general mixture.

The approximate FIM for the RCM model can be obtained by transformation, starting with the expression for the general mixture. Consider transforming the $k$-dimensional $\eta = (\pi, \rho)$ to the $q = sk - 1 = (k+1)(k-1)$-dimensional $\theta = (p_1, \dots, p_s, \pi)$ so that
$$\theta(\eta) = \begin{pmatrix} (1-\rho)\pi + \rho e_1 \\ \vdots \\ (1-\rho)\pi + \rho e_{k-1} \\ (1-\rho)\pi \\ \pi \end{pmatrix}.$$
The $q \times k$ Jacobian of this transformation is
$$\frac{\partial\theta}{\partial\eta} = \Big( \frac{\partial\theta_i}{\partial\eta_j} \Big) = \begin{pmatrix} (1-\rho) I_{k-1} & -\pi + e_1 \\ \vdots & \vdots \\ (1-\rho) I_{k-1} & -\pi + e_{k-1} \\ (1-\rho) I_{k-1} & -\pi \\ I_{k-1} & 0 \end{pmatrix}.$$
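The sampling procedure (3.2) and the component probabilities of the RCM mixture representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it relies on NumPy's random generator, the function names `rcm_sample` and `rcm_mixture_probs` are hypothetical, and probability vectors are stored with all $k$ entries (so the row for component $k$ is $(1-\rho)\pi + \rho e_k$ in $\mathbb{R}^k$, which agrees with $p_k = (1-\rho)\pi$ in the first $k-1$ coordinates).

```python
import numpy as np

def rcm_sample(pi, rho, m, rng):
    """Draw one RCM_k(pi, rho) observation using representation (3.2)."""
    y0 = rng.multinomial(1, pi)              # Y_0: the randomly chosen default category
    n_default = rng.binomial(m, rho)         # N: number of trials copying the default
    z = rng.multinomial(m - n_default, pi)   # Z | N: remaining independent trials
    return y0 * n_default + z

def rcm_mixture_probs(pi, rho):
    """Rows are the full k-vectors (1 - rho) * pi + rho * e_l for l = 1, ..., k."""
    pi = np.asarray(pi, dtype=float)
    return (1.0 - rho) * pi[None, :] + rho * np.eye(pi.size)
```

A call such as `rcm_sample([0.2, 0.3, 0.5], 0.4, 20, np.random.default_rng(0))` returns a length-3 count vector summing to 20, and each row of `rcm_mixture_probs` sums to one.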

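Using the relations
$$S(\eta) = \frac{\partial}{\partial\eta} \log f(x; \theta) = \Big( \frac{\partial\theta}{\partial\eta} \Big)^T \frac{\partial}{\partial\theta} \log f(x; \theta), \qquad I(\eta) = \text{Var}(S(\eta)) = \Big( \frac{\partial\theta}{\partial\eta} \Big)^T I(\theta) \Big( \frac{\partial\theta}{\partial\eta} \Big),$$
it is possible to obtain an explicit form of the approximate FIM as $\tilde{I}(\eta) = ((a_{ij}))$, where
$$a_{ij} = \begin{cases} m(1-\rho)^2 (\beta_i + \beta_k) + \pi_i^{-1} + \pi_k^{-1}, & i = j,\ i, j \in \{1, \dots, k-1\}, \\ m(1-\rho)^2 \beta_k + \pi_k^{-1}, & i \ne j,\ i, j \in \{1, \dots, k-1\}, \\ m(1-\rho)(\gamma_i - \gamma_k), & j = k,\ i \in \{1, \dots, k-1\}, \\ \dfrac{m}{1-\rho} \sum_{i=1}^{k} \pi_i (1 - \pi_i) \big[ (1-\rho)\pi_i + \rho \big]^{-1}, & i = k,\ j = k, \end{cases}$$
and
$$\beta_i = \frac{\pi_i}{(1-\rho)\pi_i + \rho} + \frac{1 - \pi_i}{(1-\rho)\pi_i}, \qquad \gamma_i = \frac{\pi_i (1 - \pi_i)}{(1-\rho)\pi_i + \rho} + \frac{\pi_i}{1-\rho}, \qquad i = 1, \dots, k.$$
It can be shown rigorously that $\tilde{I}(\eta) - I(\eta) \to 0$ as $m \to \infty$, as stated in Morel and Nagaraj (1993), and proved in detail in Morel and Nagaraj. The proof is similar in spirit to the proof of Theorem 2.1. We then have AFSA iterations for RCM,
$$\eta^{(g+1)} = \eta^{(g)} + \tilde{I}^{-1}(\eta^{(g)})\, S(\eta^{(g)}), \quad g = 1, 2, \dots$$

The following example involves a mixture of multinomials where the response probabilities are functions of covariates. The idea is analogous to the usual multinomial with logit link, but with links corresponding to each component of the mixture.

Example 3.2. In practice there are often covariates to be linked into the model. As an example of how AFSA can be applied, consider the following fixed effect model for the response $Y \sim \text{MultMix}_k(\theta(x), m)$ with $d \times 1$ covariates $x$ and $z$. To each $p_\ell$ vector, a generalized logit link will be added,
$$\log \frac{p_{\ell j}(x)}{p_{\ell k}(x)} = \eta_{\ell j}, \qquad \eta_{\ell j} = x^T \beta_{\ell j}, \qquad \text{for } \ell = 1, \dots, s \text{ and } j = 1, \dots, k-1.$$
A proportional odds model will be assumed for $\pi$,
$$\log \frac{\pi_1(z) + \cdots + \pi_\ell(z)}{\pi_{\ell+1}(z) + \cdots + \pi_s(z)} = \eta_\ell^\pi, \qquad \eta_\ell^\pi = \nu_\ell + z^T \alpha, \qquad \text{for } \ell = 1, \dots, s-1,$$
taking $\eta_0^\pi := -\infty$ and $\eta_s^\pi := \infty$. The unknown parameters are the $d \times 1$ vectors $\alpha$ and $\beta_{\ell j}$, and the scalars $\nu_\ell$. Denote these parameters collectively as
$$B = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_s \\ \nu \\ \alpha \end{pmatrix}, \quad \text{where} \quad \beta_\ell = \begin{pmatrix} \beta_{\ell 1} \\ \vdots \\ \beta_{\ell, k-1} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} \nu_1 \\ \vdots \\ \nu_{s-1} \end{pmatrix}.$$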

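Expressions for the $\theta$ parameters can be obtained as
$$p_{\ell j}(x) = \frac{e^{\eta_{\ell j}}}{1 + \sum_{b=1}^{k-1} e^{\eta_{\ell b}}}, \qquad \text{for } \ell = 1, \dots, s \text{ and } j = 1, \dots, k-1,$$
$$\pi_\ell(z) = \frac{e^{\eta_\ell^\pi}}{1 + e^{\eta_\ell^\pi}} - \frac{e^{\eta_{\ell-1}^\pi}}{1 + e^{\eta_{\ell-1}^\pi}}, \qquad \text{for } \ell = 1, \dots, s.$$
To implement AFSA, a score vector and approximate FIM are needed. For the score vector we have
$$S(B) = \frac{\partial}{\partial B} \log f(y; \theta) = \Big( \frac{\partial N}{\partial B} \Big)^T \Big( \frac{\partial\theta}{\partial N} \Big)^T \frac{\partial}{\partial\theta} \log f(y; \theta),$$
where $N = (\eta_1, \dots, \eta_s, \eta^\pi)$, $\eta_\ell = (\eta_{\ell 1}, \dots, \eta_{\ell, k-1})$, and $\eta^\pi = (\eta_1^\pi, \dots, \eta_{s-1}^\pi)$. For the FIM we have
$$I(B) = \text{Var}(S(B)) = \Big( \frac{\partial N}{\partial B} \Big)^T \Big( \frac{\partial\theta}{\partial N} \Big)^T I(\theta) \Big( \frac{\partial\theta}{\partial N} \Big) \Big( \frac{\partial N}{\partial B} \Big).$$
Finding expressions for the two Jacobians is tedious but straightforward.

Propositions 3.3 and 3.4 and Theorem 3.5 state consequences of the main approximation result, which have significant implications for the computation of MLEs. We have already seen that the approximate FIM is equivalent to a complete data FIM from EM. There is also an interesting connection between AFSA and EM, in that the iterations are algebraically related. To see this connection, explicit forms for AFSA and EM iterations are first presented, with proofs given in Appendix B.

Proposition 3.3 (AFSA Iterations). The AFSA iterations
$$\theta^{(g+1)} = \theta^{(g)} + \tilde{I}^{-1}(\theta^{(g)})\, S(\theta^{(g)}), \quad g = 1, 2, \dots,$$
can be written explicitly as
$$\pi_\ell^{(g+1)} = \pi_\ell^{(g)}\, \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)}, \qquad \ell = 1, \dots, s,$$
$$p_{\ell j}^{(g+1)} = \frac{1}{M} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)}\, x_{ij} + p_{\ell j}^{(g)} \Big[ 1 - \frac{1}{M} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)}\, m_i \Big], \qquad \ell = 1, \dots, s, \ j = 1, \dots, k,$$
where $M = m_1 + \cdots + m_n$.

Proposition 3.4 (EM Iterations). Consider the complete data
$$Z_i = \begin{cases} 1 & \text{with probability } \pi_1 \\ \ \vdots \\ s & \text{with probability } \pi_s, \end{cases} \qquad \text{and} \qquad (X_i \mid Z_i = \ell) \sim \text{Mult}_k(p_\ell, m_i),$$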

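where $(X_i, Z_i)$ are independent for $i = 1, \dots, n$. Denote $\gamma_{i\ell}^{(g)} := P(Z_i = \ell \mid x_i, \theta^{(g)})$ as the posterior probability that the $i$th observation belongs to the $\ell$th group. Iterations for an EM algorithm are given by
$$\pi_\ell^{(g+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{i\ell}^{(g)} = \pi_\ell^{(g)}\, \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)}, \qquad \ell = 1, \dots, s,$$
$$p_{\ell j}^{(g+1)} = \frac{\sum_{i=1}^{n} x_{ij} \gamma_{i\ell}^{(g)}}{\sum_{i=1}^{n} m_i \gamma_{i\ell}^{(g)}} = \frac{\sum_{i=1}^{n} x_{ij}\, P_\ell(x_i)/P(x_i)}{\sum_{i=1}^{n} m_i\, P_\ell(x_i)/P(x_i)}, \qquad \ell = 1, \dots, s, \ j = 1, \dots, k.$$

The iterations for AFSA or EM are repeated for $g = 1, 2, \dots$, with a given initial guess $\theta^{(0)}$, until
$$\big| \log L(\theta^{(g+1)}) - \log L(\theta^{(g)}) \big| < \varepsilon,$$
where $\varepsilon > 0$ is a given tolerance, which is taken to be the stopping criterion for the remainder of this paper.

Theorem 3.5. Denote the estimator from EM by $\hat\theta$, and the estimator from AFSA by $\tilde\theta$. Suppose cluster sizes are equal, so that $m_1 = \cdots = m_n = m$. If the two algorithms start at the $g$th iteration with $\theta^{(g)}$, then for the $(g+1)$th iteration,
$$\tilde\pi_\ell^{(g+1)} = \hat\pi_\ell^{(g+1)} \quad \text{and} \quad \tilde{p}_{\ell j}^{(g+1)} = \Big( \frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} \Big) \hat{p}_{\ell j}^{(g+1)} + \Big( 1 - \frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} \Big) p_{\ell j}^{(g)} \tag{3.4}$$
for $\ell = 1, \dots, s$ and $j = 1, \dots, k$.

Proof of Theorem 3.5. It is immediate from Propositions 3.3 and 3.4 that $\tilde\pi_\ell^{(g+1)} = \hat\pi_\ell^{(g+1)}$, and that
$$\frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)}.$$
Now,
$$\Big( \frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} \Big) \hat{p}_{\ell j}^{(g+1)} + \Big( 1 - \frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} \Big) p_{\ell j}^{(g)} = \Big[ \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)} \Big] \frac{\sum_{i=1}^{n} x_{ij}\, P_\ell(x_i)/P(x_i)}{m \sum_{i=1}^{n} P_\ell(x_i)/P(x_i)} + p_{\ell j}^{(g)} \Big[ 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)} \Big]$$
$$= \frac{1}{mn} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)}\, x_{ij} + p_{\ell j}^{(g)} \Big[ 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)} \Big] = \tilde{p}_{\ell j}^{(g+1)}.$$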

The $(g+1)$th AFSA iterate can then be seen as a linear combination of the $g$th iterate and the $(g+1)$th step of EM. The coefficient $\hat\pi_\ell^{(g+1)} / \pi_\ell^{(g)}$ is non-negative but may be larger than 1. Therefore $\tilde{p}_{\ell j}^{(g+1)}$ need not lie strictly between $\hat{p}_{\ell j}^{(g+1)}$ and $p_{\ell j}^{(g)}$. Figure 1 shows a plot of $\tilde{p}_{\ell j}^{(g+1)}$ as the ratio $\hat\pi_\ell^{(g+1)} / \pi_\ell^{(g)}$ varies. However, suppose that at the $g$th step the EM algorithm is close to convergence. Then
$$\hat\pi_\ell^{(g+1)} \approx \hat\pi_\ell^{(g)}, \quad \text{i.e.} \quad \frac{\hat\pi_\ell^{(g+1)}}{\hat\pi_\ell^{(g)}} \approx 1, \quad \text{for } \ell = 1, \dots, s.$$
From (3.4) we will also have
$$\tilde{p}_{\ell j}^{(g+1)} \approx \hat{p}_{\ell j}^{(g+1)}, \quad \text{for } \ell = 1, \dots, s, \text{ and } j = 1, \dots, k.$$
From this point on, AFSA and EM iterations are approximately the same. Hence, in the vicinity of a solution, AFSA and EM will produce the same estimate. Note that this result holds for any $m$, and does not require a large cluster size justification. For the case of varying cluster sizes $m_1, \dots, m_n$,
$$\Big( \frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} \Big) \hat{p}_{\ell j}^{(g+1)} + \Big( 1 - \frac{\hat\pi_\ell^{(g+1)}}{\pi_\ell^{(g)}} \Big) p_{\ell j}^{(g)} = \Big[ \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)} \Big] \frac{\sum_{i=1}^{n} x_{ij}\, P_\ell(x_i)/P(x_i)}{\sum_{i=1}^{n} m_i\, P_\ell(x_i)/P(x_i)} + p_{\ell j}^{(g)} \Big[ 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{P_\ell(x_i)}{P(x_i)} \Big], \tag{3.5}$$
which does not simplify to $\tilde{p}_{\ell j}^{(g+1)}$ as in the proof of Theorem 3.5. However, this illustrates that EM and AFSA are still closely related. This also suggests an ad-hoc revision to AFSA, letting $\tilde{p}_{\ell j}^{(g+1)}$ equal (3.5) so that the algebraic relationship to EM would be maintained as in (3.4) for the balanced case.

A more general connection is known between EM and iterations of the form
$$\theta^{(g+1)} = \theta^{(g)} + I_c^{-1}(\theta^{(g)})\, S(\theta^{(g)}), \quad g = 1, 2, \dots, \tag{3.6}$$
where $I_c(\theta)$ is a complete data FIM. Titterington (1984) shows that the two iterations are approximately equivalent under appropriate regularity conditions. The equivalence is exact when the complete data likelihood is a regular exponential family
$$L(\mu) = \exp\big\{ b(x) + \eta^T t + a(\eta) \big\}, \qquad \eta = \eta(\mu), \quad t = t(x),$$
and $\mu := E(t(X))$ is the parameter of interest. The complete data likelihood for our multinomial mixture is indeed a regular exponential family, but the parameter of interest $\theta$ is a transformation of $\mu$ rather than $\mu$ itself. Therefore the equivalence is approximate, as we have seen in Theorem 3.5.

The justification for AFSA leading to this paper followed the historical approach of Blischke (1964), and not from the role of $\tilde{I}(\theta)$ as a complete data FIM. But the relationship between EM and the iterations (3.6) suggests that AFSA is a reasonable approach for finite mixtures beyond the multinomial setting.
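The explicit updates of Propositions 3.3 and 3.4 translate almost directly into array code, which also gives a quick numerical check of relation (3.4). The following Python sketch assumes NumPy, stores each component as a full length-$k$ probability row (so the update is applied for $j = 1, \dots, k$), and uses hypothetical function names (`component_ratios`, `afsa_step`, `em_step`). It performs a single step only, without the convergence loop, boundary checks, or label reordering that a practical implementation would need.

```python
import numpy as np

def component_ratios(X, P, pi):
    """w[i, l] = P_l(x_i) / P(x_i); the multinomial coefficient cancels in the ratio."""
    log_kernel = X @ np.log(P).T                          # (n, s): sum_j x_ij log p_lj
    log_kernel = log_kernel - log_kernel.max(axis=1, keepdims=True)  # stabilise exp
    kernel = np.exp(log_kernel)
    return kernel / (kernel @ pi)[:, None]

def afsa_step(X, P, pi):
    """One AFSA update in the explicit form of Proposition 3.3."""
    X, P, pi = np.asarray(X), np.asarray(P, float), np.asarray(pi, float)
    m = X.sum(axis=1)                                     # cluster sizes m_i
    M = m.sum()
    w = component_ratios(X, P, pi)
    pi_new = pi * w.mean(axis=0)
    P_new = P + (w.T @ X - (w.T @ m)[:, None] * P) / M
    return P_new, pi_new

def em_step(X, P, pi):
    """One EM update in the form of Proposition 3.4."""
    X, P, pi = np.asarray(X), np.asarray(P, float), np.asarray(pi, float)
    m = X.sum(axis=1)
    gamma = component_ratios(X, P, pi) * pi               # posterior memberships
    pi_new = gamma.mean(axis=0)
    P_new = (gamma.T @ X) / (gamma.T @ m)[:, None]
    return P_new, pi_new
```

With equal cluster sizes, starting both steps from the same `(P, pi)` and setting `r = pi_em / pi`, the AFSA probabilities should agree with `r[:, None] * P_em + (1 - r)[:, None] * P` up to floating point error, which is the statement of Theorem 3.5.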

[Figure 1: The next AFSA iteration $\tilde{p}_{\ell j}^{(g+1)}$ is a linear combination of $\hat{p}_{\ell j}^{(g+1)}$ and $p_{\ell j}^{(g)}$, which depends on the ratio $\hat\pi_\ell^{(g+1)} / \pi_\ell^{(g)}$. Panel title: AFSA step compared to previous iterate and EM step. Horizontal axis: $\hat\pi_\ell^{(g+1)} / \pi_\ell^{(g)}$, with 0 and 1 marked.]

4 Simulation Studies

The main result stated in Theorem 2.1 allows us to approximate the matrix $I(\theta)$ by $\tilde{I}(\theta)$, which is much more easily computed. Theorem 2.5 justifies $\tilde{I}^{-1}(\theta)$ as an approximation for the inverse FIM. In the present section, simulation studies investigate the quality of the two approximations as a function of $m$. We also present studies to demonstrate the convergence speed and solution quality of AFSA.

4.1 Distance between true and approximate FIM

Consider two concepts of distance to compare the closeness of the exact and approximate matrices. Based on the Frobenius norm $\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$, a distance metric
$$d_F(A, B) = \|A - B\|_F$$
can be constructed using the sum of squared differences of corresponding elements. This distance will be larger in general when the magnitudes of the elements are larger, so we will also consider a scaled version
$$d_S(A, B) = \frac{d_F(A, B)}{\|B\|_F} = \sqrt{\frac{\sum_i \sum_j (a_{ij} - b_{ij})^2}{\sum_i \sum_j b_{ij}^2}},$$
noting that this is not a true distance metric since it is not symmetric. Using these two metrics, we compare the distance between true and approximate FIMs, and also the distance between their inverses. Consider a mixture $\text{MultMix}_2(\theta, m)$ of three binomials, with parameters
$$p = (1/7,\ 1/3,\ 2/3) \quad \text{and} \quad \pi = (1/6,\ 2/6,\ 3/6).$$
Figure 2 plots the two distance types for both the FIM and inverse FIM as $m$ varies. Note that distances are plotted on a log scale, so the vertical axis represents orders of magnitude. To see more concretely what is being compared, the original displays, for the moderate cluster size $m = 20$, the approximate and exact FIMs side by side, followed by the approximate and exact inverse FIMs [the matrix entries were not recovered]. Since the approximations are block diagonal matrices, they have no way of capturing the off-diagonal blocks, which are present in the exact matrices but are eventually dominated by the block-diagonal elements as $m \to \infty$. This emphasizes one obvious disadvantage of the approximate FIM, which is that it cannot be used to estimate all the asymptotic covariances for the MLEs for a fixed cluster size. For this $m = 20$ case, the block-diagonal elements for both pairs of matrices are not very close, although they are at least of the same order of magnitude with the same signs. The magnitudes of elements in the inverse FIMs are in general much smaller than those in the FIMs, so the unscaled distance will naturally be smaller between the inverses.

Now in Figure 2 consider the distance $d_F(\tilde{I}(\theta), I(\theta))$ as $m$ is varied. For the FIM, the distance appears to be moderate at first, then increases with $m$, and finally begins to vanish as $m$ becomes large. What is not reflected here is that the magnitudes of the elements themselves are increasing; this inflates the distance until the convergence of Theorem 2.1 begins to take effect. Considering the scaled distance $d_S(\tilde{I}(\theta), I(\theta))$ helps to suppress the effect of the element magnitudes and gives a clearer picture of the convergence.

Focusing next on the inverse FIM, consider the distance $d_F(\tilde{I}^{-1}(\theta), I^{-1}(\theta))$. For $m < 5$ the exact FIM is computationally singular, so its inverse cannot be computed. Note that in this case the conditions for identifiability are not satisfied (see Appendix A).
This is not just a coincidence; there is a known relationship between model non-identifiability and singularity of the FIM (Rothenberg). For $m$ between 5 and about 23, the distance is very large at first because of near-singularity of the FIM, but quickly returns to a reasonable magnitude. As $m$ increases further, the distance quickly vanishes toward zero. We also consider the scaled distance $d_S(\tilde{I}^{-1}(\theta), I^{-1}(\theta))$. Again, this helps to remove the effects of the element magnitudes, which are becoming very small as $m$ increases. Even after taking into account the scale of the elements, the distance between the inverse matrices appears to be converging more quickly than the distance between the FIM and approximate FIM.

[Figure 2: Distance between exact and approximate FIM and its inverse, as $m$ is varied. Panel (a), using unscaled distance: log of Frobenius distance between exact and approximate matrices. Panel (b), using scaled distance: log of scaled Frobenius distance between exact and approximate matrices. Horizontal axis: $m$; vertical axis: log(distance); series: FIM, Inverse FIM.]

4.2 Effectiveness of AFSA method

Convergence Speed

We first observe the convergence speed of AFSA and several of its competitors. Consider the mixture of two trinomials
$$Y_i \overset{iid}{\sim} \text{MultMix}_3(\theta, m = 20), \quad i = 1, \dots, n = 500,$$
with $p_1 = (1/3,\ 1/3,\ 1/3)$ [the values of $p_2$ and $\pi$ were not recovered]. We fit the MLE using AFSA, FSA, and EM. After the $g$th iteration, the quantity
$$\delta^{(g)} = \log L(\theta^{(g)}) - \log L(\theta^{(g-1)})$$
is measured. The sequence $\log |\delta^{(g)}|$ is plotted for each algorithm in Figure 3. Note that $\delta^{(g)}$ may be negative, except for example in EM, which guarantees an improvement to the log-likelihood in every step. A negative $\delta^{(g)}$ can be interpreted as negative progress, at least from a local maximum. The absolute value is taken to make plotting possible on the log scale, but some steps with negative progress have been obscured. The resulting estimates and standard errors for all algorithms are shown in Table 1, and additional summary information is shown in Table 2. We see that AFSA and EM have almost exactly the same rate of convergence toward the same solution, as suggested by Theorem 3.5. FSA had severe problems, and was not able to converge within 100 iterations; i.e. $|\delta^{(g)}| < 10^{-8}$ was not attained. The situation for FSA is worse than it appears in the plot. Although $\log |\delta^{(g)}|$ is becoming small, FSA's steps result in both positive and negative $\delta^{(g)}$ values until the iteration limit is reached. This indicates a failure to approach any maximum of the log-likelihood.

We also considered an FSA hybrid with a warmup period, where for a given $\varepsilon_0 > 0$ the approximate FIM is used until the first time $\delta^{(g)} < \varepsilon_0$ is crossed. Notice that $\varepsilon_0 = \infty$ corresponds to no warmup period. A similar idea has been considered by Neerchal and Morel (2005), who proposed a two-stage procedure for AFSA in the RCM setting of Example 3.1. The first stage consisted of running AFSA iterations until convergence, and in the second stage one additional iteration of exact Fisher Scoring was performed. The purpose of the FSA iteration was to improve standard error estimates, which were previously found to be inaccurate when computed directly from the approximate FIM (Neerchal and Morel). Here we note that FSA also offers a faster convergence rate than AFSA, given an initial path to a solution. Therefore, AFSA can be used in early iterations to move to the vicinity of a solution, then a switch to FSA will give an accelerated convergence to the solution. This approach depends on the exact FIM being feasible to compute, so the sample space cannot be too large. For the present simulations, we make use of the naive summation (2.3). Hence, there is a trade-off in the choice of $\varepsilon_0$ between effort spent on computing the exact FIM and a larger number of iterations required for AFSA.
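The naive summation (2.3) used for the exact FIM, together with the scaled distance $d_S$ of Section 4.1, can be sketched as follows. This is a minimal illustration assuming NumPy and a single observation of cluster size $m$. The function names are hypothetical, components are stored as full length-$k$ probability rows, and the enumeration is written for clarity rather than speed, so it is only practical for small $m$ and $k$.

```python
import itertools
import math
import numpy as np

def sample_space(m, k):
    """Yield every x in Omega: k non-negative integer counts summing to m."""
    for head in itertools.product(range(m + 1), repeat=k - 1):
        if sum(head) <= m:
            yield np.array(head + (m - sum(head),))

def exact_fim(P, pi, m):
    """Exact FIM of MultMix_k(theta, m) by the naive summation (2.3).

    theta is ordered as (p_1, ..., p_s, pi_1, ..., pi_{s-1}), each p_l holding
    its first k-1 probabilities.
    """
    P, pi = np.asarray(P, dtype=float), np.asarray(pi, dtype=float)
    s, k = P.shape
    q = s * (k - 1) + (s - 1)
    fim = np.zeros((q, q))
    for x in sample_space(m, k):
        log_coef = math.lgamma(m + 1) - sum(math.lgamma(v + 1) for v in x)
        comp = np.exp(log_coef + np.log(P) @ x)      # P_l(x) for each component
        mix = comp @ pi                              # P(x)
        # Score in p_lj (j < k), using p_lk = 1 - sum_j p_lj.
        d_p = (pi * comp / mix)[:, None] * (x[:-1] / P[:, :-1] - x[-1] / P[:, [-1]])
        # Score in pi_l (l < s), using pi_s = 1 - sum_l pi_l.
        d_pi = (comp[:-1] - comp[-1]) / mix
        score = np.concatenate([d_p.ravel(), d_pi])
        fim += mix * np.outer(score, score)
    return fim

def approx_fim(P, pi, m):
    """Block-diagonal approximation of Theorem 2.1 in the same ordering."""
    P, pi = np.asarray(P, dtype=float), np.asarray(pi, dtype=float)
    s, k = P.shape
    out = np.zeros((s * (k - 1) + (s - 1),) * 2)
    for l in range(s):
        rows = slice(l * (k - 1), (l + 1) * (k - 1))
        out[rows, rows] = pi[l] * m * (np.diag(1.0 / P[l, :-1]) + 1.0 / P[l, -1])
    tail = slice(s * (k - 1), None)
    out[tail, tail] = np.diag(1.0 / pi[:-1]) + 1.0 / pi[-1]
    return out

def scaled_distance(A, B):
    """d_S(A, B) = ||A - B||_F / ||B||_F."""
    return np.linalg.norm(A - B) / np.linalg.norm(B)
```

For example, `scaled_distance(approx_fim(P, pi, m), exact_fim(P, pi, m))` can be evaluated over a grid of `m` values to trace a curve of the kind shown in Figure 2(b). The cost of `exact_fim` grows with the size of $\Omega$, which is the trade-off discussed above.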
Figure 3 shows that the hybrid strategy is effective, addressing the erratic behavior of FSA from an arbitrary starting value and the slower convergence rates of EM and AFSA. Table 2 shows that even a very limited warmup period such as $\varepsilon_0 = 10$ can be sufficient. The Newton-Raphson algorithm, which has not been shown here, performed similarly to Fisher Scoring but has issues with singularity in some samples.

Standard errors for AFSA were obtained as $\sqrt{a_{11}}, \dots, \sqrt{a_{qq}}$, denoting $\tilde{I}^{-1}(\hat\theta) = ((a_{ij}))$. For FSA and FSA-Hybrid, the inverse of the exact FIM was used instead. The basic EM algorithm does not yield standard error estimates. Several extensions have been proposed to address this, such as by Louis (1982) and Meng and Rubin. In light of Theorem 3.5, standard errors from $\tilde{I}^{-1}(\theta)$ evaluated at EM estimates could also be used to obtain similar results to AFSA.

Monte Carlo Study

We next consider a Monte Carlo study of the difference between AFSA and EM estimators. Observations were generated from
$$Y_i \overset{ind}{\sim} \text{MultMix}_k(\theta, m_i), \quad i = 1, \dots, n = 500,$$
given varying cluster sizes $m_1, \dots, m_n$ which themselves were generated from
$$Z_1, \dots, Z_n \overset{iid}{\sim} \text{Gamma}(\alpha, \beta),$$
with each $m_i$ an integer obtained from $Z_i$ [the rounding operator was not recovered].

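[Figure 3: Convergence of several competing algorithms for a small test problem. Panel title: Convergence of competing algorithms. Horizontal axis: iteration; vertical axis: log(abs(delta)); series: AFSA, FSA, EM, FSA w/ warmup.]

[Table 1: Estimates and standard errors for the competing algorithms. FSA Hybrid produced the same results with $\varepsilon_0$ set to 0.001, 0.01, 0.1, 1, and 10. Columns: FSA, AFSA, EM, FSA Hybrid, each reporting the estimate and its SE; rows: the components of $\hat{p}$ and $\hat\pi$. The numerical entries were not recovered.]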

Table 2: Convergence of several competing algorithms. Hybrid FSA is shown with several choices of the warmup tolerance $\varepsilon_0$. Exact FSA uses $\varepsilon_0$ [value missing]. Columns: method, $\varepsilon_0$, logLik, tol, iter. Rows: AFSA, EM, and six FSA runs.

Several different settings of $\theta$ are considered, with $s = 2$ mixing components and proportion $\pi = 0.75$ for the first component. The parameters $\alpha$ and $\beta$ were chosen such that $E(Z_i) = \alpha\beta = 20$. This gives $\beta = 20/\alpha$, so only $\alpha$ is free, and $\mathrm{Var}(Z_i) = \alpha\beta^2 = 400/\alpha$ can be chosen as desired. The expectation and variance of $m_i$ are intuitively similar to those of $Z_i$, and their exact values may be computed numerically. Once the $n$ observations are generated, an AFSA estimator $\tilde{\theta}$ and an EM estimator $\hat{\theta}$ are fit. This process is repeated 1000 times, yielding $\tilde{\theta}^{(r)}$ and $\hat{\theta}^{(r)}$ for $r = 1, \ldots, 1000$. A default initial value was selected for each setting of $\theta$, and used for both algorithms in every repetition. To measure the closeness of the two estimators, a maximum relative difference is taken over all components of $\theta$, then averaged over all repetitions:

$$D = \frac{1}{1000} \sum_{r=1}^{1000} D_r, \quad \text{where} \quad D_r = \bigvee_{j=1}^{q} \left| \frac{\tilde{\theta}_j^{(r)} - \hat{\theta}_j^{(r)}}{\tilde{\theta}_j^{(r)}} \right|.$$

Here $\bigvee$ represents the maximum operator. Notice that obtaining a good result for $D$ depends on the vectors $\hat{\theta}$ and $\tilde{\theta}$ being ordered in the same way. To help ensure this, we add the constraint $\pi_1 > \cdots > \pi_s$, which is enforced in both algorithms by reordering the estimates for $\pi_1, \ldots, \pi_s$ and $p_1, \ldots, p_s$ accordingly after every iteration.

Table 3 shows the results of the simulation. Nine different scenarios for $\theta$ are considered. The cluster sizes $m_1, \ldots, m_n$ are selected in three different ways: a balanced case where $m_i = 20$ for $i = 1, \ldots, n$, cluster sizes selected at random with small variability (using $\alpha = 100$), and cluster sizes selected at random with moderate variability (using $\alpha = 25$). Both algorithms are susceptible to finding local maxima of the likelihood, but in this experiment AFSA encountered the problem much more frequently. These cases stood out because the local maxima occurred with one of the mixing proportions or category probabilities close to zero, i.e. a convergence to the boundary of the parameter space. This is an especially bad situation for our Monte Carlo statistic $D$, which can become very large if

this occurs even once for a given scenario. The problem occurred most frequently for the case $p_1 = (0.1, 0.3)$ and $p_2 = (1/3, 1/3)$. To counter this, we restarted AFSA with a random starting value whenever a solution with any estimate less than 0.01 was obtained. For this experiment, no more than 15 out of 1000 trials required a restart, and no more than two restarts were needed for the same trial. In practice, we recommend starting AFSA with several initial values to ensure that any solutions on the boundary are not missteps taken by the algorithm.

Table 3: Closeness between AFSA and EM estimates, over 1000 trials. Rows: Scenarios A through I, each defined by a pair $p_1$, $p_2$. Columns: equal cluster sizes ($m_i = 20$), $\alpha = 100$, and $\alpha = 25$, the last two labelled with the corresponding $\mathrm{Var}(m_i)$.

The entries in Table 3 show that small to moderate variation of the cluster sizes does not have a significant impact on the equivalence of AFSA and EM. On the other hand, as $p_1$ and $p_2$ are moved closer together, the quantity $D$ tends to become larger. Theorem 2.1 depends on the distinctness of the category probability vectors, so the quality of the FIM approximation at moderate cluster size may begin to suffer in this case. The estimation problem itself also intuitively becomes more difficult as $p_1$ and $p_2$ become closer. Recall that the dimension of $p_i$ is $k - 1$; it can be seen from Table 3 that increasing $k$ from 2 to 4 does not necessarily have a negative effect on the results. In Scenario E, $p_1$ and $p_2$ are not too close together, yet $D$ has a similar magnitude to Scenario D, where the two vectors are closer. Figure 4 shows a plot of the individual $D_r$ for Scenarios D and E. Notice that in Scenario E, one particular simulation in each case is responsible for the large magnitude of $D$. Upon removal of these simulations, the order of $D$ is reduced from $10^{-3}$ to [...]. However, many large $D_r$ were present in the Scenario D results.

5 Conclusions

A large cluster approximation was presented for the FIM of the finite mixture of multinomials model (Theorem 2.1).
This matrix has a convenient block diagonal form, where each non-zero block is the FIM of a standard multinomial observation. Furthermore, the approximation is equivalent to the complete data FIM, had population labels been recorded for each observation (Proposition 2.2). Using this approximation to the FIM, we formulated the Approximate Fisher Scoring Algorithm (AFSA), and showed that its iterations are closely related to the well known Expectation-Maximization (EM) algorithm for finite mixtures (Theorem 3.5). Simulations show that a rather large cluster size is needed before the exact and approximate FIM are close; this is not surprising given that a block diagonal matrix is being used to approximate a dense matrix. A large cluster size is also needed for a close approximation of the inverse, although the inverses are seen to converge together more quickly. Therefore, the approximate FIM and its inverse are not well-suited to replace the exact matrices for general usage. This means, for example, that one should be cautious about computing standard errors for the MLE from the approximate inverse FIM. As another example of a general use for the approximate FIM, consider approximate $1 - \alpha$ level Wald-type and Score-type confidence regions,

$$\left\{ \theta_0 : (\hat{\theta} - \theta_0)^T \, \tilde{I}(\hat{\theta}) \, (\hat{\theta} - \theta_0) \le \chi^2_{q,\alpha} \right\} \quad \text{and} \quad \left\{ \theta_0 : S(\theta_0)^T \, \tilde{I}^{-1}(\theta_0) \, S(\theta_0) \le \chi^2_{q,\alpha} \right\}, \tag{5.1}$$

respectively, using the approximate FIM in place of the exact FIM. Such regions are very practical to compute, but will likely not have the desired coverage for $\theta$. However, we might expect the Score region to perform better for moderate cluster sizes because it involves the inverse matrix.

[Figure 4: Boxplots for Scenarios D and E of the Monte Carlo study, titled "Relative Distances Between EM and AFSA for Scenarios D and E". Panels: Scenario D and Scenario E, each with groups $m = 20$, $\alpha = 100$, and $\alpha = 25$. At this scale, the boxes appear as thin horizontal lines.]

On the other hand, the approximate FIM works well as a tool for estimation in the AFSA algorithm. This is interesting because the more standard Fisher Scoring and Newton-Raphson algorithms do not work well on their own. For Newton-Raphson, the invertibility of the Hessian depends on the sample as well as the current iterate $\theta^{(g)}$ and the model. Fisher Scoring can be computed when the cluster size is not too small so that the FIM is non-singular, but it is often unable to make progress at all from an arbitrarily chosen starting point. In this case, AFSA or EM is useful for giving FSA some initial help. If FSA has a sufficiently good starting point, it can converge very quickly. Therefore we recommend a hybrid approach: use AFSA iterations for an initial warmup period, then switch to FSA once a path toward a solution has been established. This approach may also help to reduce the number of exact FIM computations needed, which may be expensive.

Although AFSA and EM are closely related and often tend toward the same solution, AFSA is not restricted to the parameter space. Additional precautions may therefore be needed to prevent AFSA iterations from drifting outside of the space. AFSA also tended to converge to the boundary of the space more often than EM; hence, we reiterate the usual advice of trying several initial values as a good practice. AFSA may be preferable to EM in situations where it is more natural to formulate. Derivation of the E-step conditional log-likelihood may involve evaluating a complicated expectation, but is not required for AFSA. A tradeoff for AFSA is that the score vector for the observed data must be computed; this may involve a messy differentiation, but is arguably easier to address numerically than the E-step. AFSA iterations were obtained for the Random-Clumped Multinomial in Example 3.1, starting from a general multinomial mixture and using an appropriate transformation of the parameters.

It is interesting to note the relationship between FSA, AFSA, and EM as Newton-type algorithms. Fisher Scoring is a classic algorithm where the Hessian is replaced by its expectation. In AFSA the Hessian is replaced instead by a complete data FIM. EM can be considered a Newton-type algorithm also, where the entire likelihood is replaced by a complete data likelihood with missing data integrated out. In this light, EM and AFSA iterations are seen to be approximately equivalent.
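To make the closeness of AFSA and EM concrete, the following sketch compares a single update of each for a two-component mixture of binomials (the $k = 2$, $s = 2$ case). It is an illustration under stated assumptions rather than the authors' implementation: the approximate FIM is taken to be the block diagonal complete data FIM described above, which in this case reduces to the three scalars $\pi M / \{p_1(1-p_1)\}$, $(1-\pi) M / \{p_2(1-p_2)\}$, and $n / \{\pi(1-\pi)\}$ with $M = \sum_i m_i$. The data, starting values, and function names are hypothetical.

```python
from math import comb


def binom_pmf(x, m, p):
    """Binomial(m, p) probability of x successes."""
    return comb(m, x) * p**x * (1 - p) ** (m - x)


def posterior_weights(xs, ms, p1, p2, pi):
    """Posterior probability that each observation came from component 1."""
    weights = []
    for x, m in zip(xs, ms):
        f1 = pi * binom_pmf(x, m, p1)
        f2 = (1 - pi) * binom_pmf(x, m, p2)
        weights.append(f1 / (f1 + f2))
    return weights


def em_step(xs, ms, p1, p2, pi):
    """One EM update for a two-component binomial mixture."""
    w = posterior_weights(xs, ms, p1, p2, pi)
    new_pi = sum(w) / len(xs)
    new_p1 = sum(wi * x for wi, x in zip(w, xs)) / sum(wi * m for wi, m in zip(w, ms))
    new_p2 = (sum((1 - wi) * x for wi, x in zip(w, xs))
              / sum((1 - wi) * m for wi, m in zip(w, ms)))
    return new_p1, new_p2, new_pi


def afsa_step(xs, ms, p1, p2, pi):
    """One scoring update using the ASSUMED block diagonal approximate FIM."""
    w = posterior_weights(xs, ms, p1, p2, pi)
    n, total_m = len(xs), sum(ms)
    # Observed-data score for (p1, p2, pi).
    s_p1 = sum(wi * (x - m * p1) for wi, x, m in zip(w, xs, ms)) / (p1 * (1 - p1))
    s_p2 = sum((1 - wi) * (x - m * p2) for wi, x, m in zip(w, xs, ms)) / (p2 * (1 - p2))
    s_pi = sum(wi - pi for wi in w) / (pi * (1 - pi))
    # Assumption: complete data FIM, one 1x1 block per parameter when k = s = 2.
    i_p1 = pi * total_m / (p1 * (1 - p1))
    i_p2 = (1 - pi) * total_m / (p2 * (1 - p2))
    i_pi = n / (pi * (1 - pi))
    return p1 + s_p1 / i_p1, p2 + s_p2 / i_p2, pi + s_pi / i_pi


# Hypothetical data: eight clusters of size 20, two visibly separated groups.
xs = [2, 3, 15, 17, 4, 16, 1, 18]
ms = [20] * len(xs)
start = (0.2, 0.7, 0.4)  # (p1, p2, pi)

em_next = em_step(xs, ms, *start)
afsa_next = afsa_step(xs, ms, *start)
print("EM  :", em_next)
print("AFSA:", afsa_next)
```

Under these assumptions the two updates of $\pi$ coincide algebraically, while the updates of $p_1$ and $p_2$ differ only in the normalizer: AFSA divides by $\pi M$ where EM divides by the weighted total $\sum_i w_i m_i$.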
Several interesting questions can be raised at this point. There is a relationship between AFSA and EM which extends beyond the multinomial mixture; we wonder if the relationship between the exact and complete data information matrix generalizes as well. Also, for the present multinomial mixture, perhaps there is a small cluster bias correction that could be applied to improve the approximation. This might allow standard errors and confidence regions such as (5.1) to safely be derived from the approximate FIM.

6 Acknowledgements

The hardware used in the computational studies is part of the UMBC High Performance Computing Facility (HPCF). The facility is supported by the U.S. National Science Foundation through the MRI program (grant no. CNS [...]) and the SCREMS program (grant no. DMS [...]), with additional substantial support from the University of Maryland, Baltimore County (UMBC). See [...] for more information on HPCF and the projects using its resources. The first author additionally acknowledges financial support as HPCF RA.

References

W. R. Blischke. Moment estimators for the parameters of a mixture of two binomial distributions. The Annals of Mathematical Statistics, 33(2), 1962.
W. R. Blischke. Estimating the parameters of mixtures of binomial distributions. Journal of the American Statistical Association, 59(306), 1964.
H. Bozdogan. Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3).
S. Chandra. On the mixtures of probability distributions. Scandinavian Journal of Statistics, 4, 1977.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.
A. B. Kabir. Estimation of parameters of a finite mixture of distributions. Journal of the Royal Statistical Society, Series B, 30, 1968.
K. Krolikowska. Estimation of the parameters of any finite mixture of geometric distributions. Demonstratio Mathematica, 9, 1976.
K. Lange. Numerical Analysis for Statisticians. Springer, 2nd edition.
E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, 2nd edition.
T. A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44.
G. McLachlan and D. Peel. Finite Mixture Models. Wiley-Interscience, 2000.
X. L. Meng and D. B. Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86(416).
C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM.
J. G. Morel and N. K. Nagaraj. A finite mixture distribution for modeling multinomial extra variation. Technical Report, Research report 91-03, Department of Mathematics and Statistics, University of Maryland, Baltimore County, 1991.
J. G. Morel and N. K. Nagaraj. A finite mixture distribution for modeling multinomial extra variation. Biometrika, 80(2).
J. G. Morel and N. K. Nagaraj. Overdispersion Models in SAS. SAS Institute.
N. K. Neerchal and J. G. Morel. Large cluster results for two parametric multinomial extra variation models. Journal of the American Statistical Association, 93(443).
N. K. Neerchal and J. G. Morel. An improved method for the computation of maximum likelihood estimates for multinomial overdispersion models. Computational Statistics & Data Analysis, 49(1):33-43.
M. Okamoto. Some inequalities relating to the partial sum of binomial probabilities. Annals of the Institute of Statistical Mathematics, 10:29-35, 1959.
A. M. Raim, M. K. Gobbert, N. K. Neerchal, and J. G. Morel. Maximum likelihood estimation of the random-clumped multinomial model as prototype problem for large-scale statistical computing. Accepted.
C. R. Rao. Linear statistical inference and its applications. John Wiley and Sons Inc.
H. Robbins. Mixture of distributions. The Annals of Mathematical Statistics, 19(3), 1948.
T. J. Rothenberg. Identification in parametric models. Econometrica, 39.
H. Teicher. On the mixture of distributions. The Annals of Mathematical Statistics, 31(1):55-73.
D. M. Titterington. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B, 46.
H. Zhou and K. Lange. MM algorithms for some discrete multivariate distributions. Journal of Computational and Graphical Statistics, 19(3).

A Appendix: Preliminaries and Notation

Given an independent sample $X_1, \ldots, X_n$ with joint likelihood $L(\theta)$ and $\theta$ having dimension $q \times 1$, the score vector is

$$S(\theta) = \frac{\partial}{\partial \theta} \log L(\theta) = \frac{\partial}{\partial \theta} \log f(x; \theta).$$

For $X_i \sim \mathrm{Mult}_k(p, m)$ the score vector for a single observation can be obtained from

$$\frac{\partial}{\partial p_a} \log f(x; p, m) = \frac{\partial}{\partial p_a} \left[ x_1 \log p_1 + \cdots + x_{k-1} \log p_{k-1} + x_k \log\Big( 1 - \sum_{j=1}^{k-1} p_j \Big) \right] = x_a / p_a - x_k / p_k, \tag{A.1}$$

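so that

$$\frac{\partial}{\partial p} \log f(x; p, m) = \begin{pmatrix} x_1 / p_1 \\ \vdots \\ x_{k-1} / p_{k-1} \end{pmatrix} - \begin{pmatrix} x_k / p_k \\ \vdots \\ x_k / p_k \end{pmatrix} = D^{-1} x_{-k} - \frac{x_k}{p_k} \mathbf{1},$$

denoting $D := \mathrm{diag}(p_1, \ldots, p_{k-1})$ and $x_{-k} := (x_1, \ldots, x_{k-1})$. The score vector for a single observation $X \sim \mathrm{MultMix}_k(\theta, m)$ can also be obtained,

$$\frac{\partial}{\partial p_a} \log P(x) = \frac{\partial}{\partial p_a} \log\Big\{ \sum_{\ell=1}^{s} \pi_\ell P_\ell(x) \Big\} = \frac{1}{P(x)} \, \pi_a \frac{\partial P_a(x)}{\partial p_a} = \frac{\pi_a P_a(x)}{P(x)} \frac{\partial \log P_a(x)}{\partial p_a} = \frac{\pi_a P_a(x)}{P(x)} \left[ D_a^{-1} x_{-k} - \frac{x_k}{p_{ak}} \mathbf{1} \right], \quad a = 1, \ldots, s,$$

where $D_a := \mathrm{diag}(p_{a1}, \ldots, p_{a,k-1})$, and

$$\frac{\partial}{\partial \pi_a} \log P(x) = \frac{\partial}{\partial \pi_a} \log\Big\{ \sum_{\ell=1}^{s} \pi_\ell P_\ell(x) \Big\} = \frac{P_a(x) - P_s(x)}{P(x)}, \quad a = 1, \ldots, s - 1.$$

Next, consider the $q \times q$ FIM for the independent sample $X_1, \ldots, X_n$,

$$I(\theta) = \mathrm{Var}(S(\theta)) = E\left[ \Big\{ \frac{\partial}{\partial \theta} \log L(\theta) \Big\} \Big\{ \frac{\partial}{\partial \theta} \log L(\theta) \Big\}^T \right] = -E\left[ \frac{\partial^2}{\partial \theta \, \partial \theta^T} \log L(\theta) \right].$$

The last equality holds under appropriate regularity conditions. For the multinomial FIM, we may use (A.1) to obtain

$$\frac{\partial^2}{\partial p_a \, \partial p_b} \log f(x; p, m) = \begin{cases} -x_k / p_k^2 & \text{if } a \ne b, \\ -x_a / p_a^2 - x_k / p_k^2 & \text{otherwise,} \end{cases}$$

and so

$$\frac{\partial^2}{\partial p \, \partial p^T} \log f(x; p, m) = -\mathrm{diag}\left( \frac{x_1}{p_1^2}, \ldots, \frac{x_{k-1}}{p_{k-1}^2} \right) - \frac{x_k}{p_k^2} \mathbf{1}\mathbf{1}^T.$$

Therefore, we have

$$I(p) = -E\left[ \frac{\partial^2}{\partial p \, \partial p^T} \log f(x; p, m) \right] = \mathrm{diag}\left( \frac{m p_1}{p_1^2}, \ldots, \frac{m p_{k-1}}{p_{k-1}^2} \right) + \frac{m p_k}{p_k^2} \mathbf{1}\mathbf{1}^T = m \left( D^{-1} + p_k^{-1} \mathbf{1}\mathbf{1}^T \right).$$

The score vector and Hessian of the log-likelihood can be used to implement the Newton-Raphson algorithm, where the $(g+1)$th iteration is given by

$$\theta^{(g+1)} = \theta^{(g)} - \left\{ \frac{\partial^2}{\partial \theta \, \partial \theta^T} \log L(\theta^{(g)}) \right\}^{-1} S(\theta^{(g)}).$$

The Hessian may be replaced with the FIM to implement Fisher Scoring

$$\theta^{(g+1)} = \theta^{(g)} + I^{-1}(\theta^{(g)}) \, S(\theta^{(g)}).$$

In order for the estimation problem to be well-defined in the first place, the model must be identifiable. For finite mixtures, this is taken to mean that the equality

$$\sum_{\ell=1}^{s} \pi_\ell f(x; \theta_\ell) \overset{a.s.}{=} \sum_{\ell=1}^{v} \lambda_\ell f(x; \xi_\ell)$$

implies $s = v$ and terms within the sums are equal, except the indices may be permuted (McLachlan and Peel, 2000, section [...]). Chandra (1977) provides some insight into the identifiability issue, and shows that a family of multivariate mixtures is identifiable if any of the corresponding marginal mixtures are identifiable. In the present case, the multivariate mixtures consist of multinomial densities, and the marginal densities are binomials. It is well known that a finite mixture of $s$ components from $\{ \mathrm{Binomial}(m, \theta) : \theta \in (0, 1) \}$ is identifiable if and only if $m \ge 2s - 1$; see, for example, Blischke. Then a sufficient condition for model (2.2) to be identifiable is that $m_i \ge 2s - 1$ for at least one observation. This can be seen by the following lemma.

Lemma A.1. Suppose $X_i \overset{ind}{\sim} f_i(x; \theta)$, $i = 1, \ldots, n$, where the $f_i$'s are densities, and for at least one $r \in \{1, \ldots, n\}$ the family $\{ f_r(\cdot\,; \theta) : \theta \in \Theta \}$ is identifiable. Then the joint model is identifiable.

Proof of Lemma A.1. WLOG assume that $r = 1$, and suppose we have

$$\prod_{i=1}^{n} f_i(x_i; \theta) \overset{a.s.}{=} \prod_{i=1}^{n} f_i(x_i; \xi).$$

Integrating both sides with respect to $x_2, \ldots, x_n$, using the appropriate dominating measure, gives $f_1(x_1; \theta) \overset{a.s.}{=} f_1(x_1; \xi)$. Since the family $\{ f_1(\cdot\,; \theta) : \theta \in \Theta \}$ is identifiable, this implies $\theta = \xi$. Hence the joint family $\{ \prod_{i=1}^{n} f_i(\cdot\,; \theta) : \theta \in \Theta \}$ is identifiable.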
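As a numerical check on the multinomial score (A.1) and the FIM $I(p) = m(D^{-1} + p_k^{-1} \mathbf{1}\mathbf{1}^T)$ derived earlier in this appendix, the sketch below runs one Fisher Scoring iteration for a single multinomial observation. For this model a short calculation gives $I^{-1}(p) S(p) = x_{-k}/m - p$, so a single step from any interior starting point should land on the MLE $x_{-k}/m$. The counts, starting value, and helper names are hypothetical, and the small Gaussian elimination routine is included only to keep the example free of external libraries.

```python
def score(x, p):
    """Score (A.1) with respect to the free probabilities p_1, ..., p_{k-1}."""
    p_k = 1.0 - sum(p)
    x_k = x[-1]
    return [x[a] / p[a] - x_k / p_k for a in range(len(p))]


def fim(m, p):
    """Multinomial FIM m * (D^{-1} + 11^T / p_k) as a nested list."""
    p_k = 1.0 - sum(p)
    d = len(p)
    return [[m * ((1.0 / p[a] if a == b else 0.0) + 1.0 / p_k) for b in range(d)]
            for a in range(d)]


def solve(a_matrix, b_vector):
    """Solve a_matrix @ z = b_vector by Gaussian elimination with partial pivoting."""
    n = len(b_vector)
    aug = [row[:] + [rhs] for row, rhs in zip(a_matrix, b_vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= factor * aug[col][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (aug[r][n] - rest) / aug[r][r]
    return z


# Hypothetical observation with k = 4 categories and m = 20 trials.
x = [3, 5, 2, 10]
m = sum(x)
p_start = [0.1, 0.4, 0.3]  # arbitrary interior starting value for (p_1, p_2, p_3)

step = solve(fim(m, p_start), score(x, p_start))
p_next = [p + dz for p, dz in zip(p_start, step)]  # theta + I^{-1}(theta) S(theta)
mle = [xi / m for xi in x[:-1]]

print("after one step:", p_next)
print("MLE x_{-k}/m  :", mle)
```

This one-step behaviour is particular to the plain multinomial; for the mixture model the exact FIM is dense and no such closed form is available.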

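B Appendix: Additional Proofs

To prove Theorem 2.1, we will first establish a key inequality. A similar strategy was used by Morel and Nagaraj (1991), but they considered the special case $k = s$, so that the number of mixture components is equal to the number of categories within each component. Here we generalize their argument to the general case where $k = s$ need not hold. The original proof was inspired by the following inequality from Okamoto (1959) for the tail probability of the binomial distribution, which was also considered by Blischke.

Lemma B.1. Suppose $X \sim \mathrm{Binomial}(m, p)$ and let $f(x; m, p)$ be its density. Then for $c \ge 0$,
i. $P(X/m - p \ge c) \le e^{-2mc^2}$,
ii. $P(X/m - p \le -c) \le e^{-2mc^2}$.

Theorem B.2. For a given index $b \in \{1, \ldots, s\}$ we have

$$\sum_{x \in \Omega} \sum_{a \ne b}^{s} \frac{\pi_a P_a(x) P_b(x)}{P(x)} \le \frac{2}{\pi_b} \sum_{a \ne b}^{s} e^{-\frac{m}{2} \delta_{ab}^2},$$

where $\delta_{ab} = \bigvee_{j=1}^{k-1} |p_{aj} - p_{bj}|$.

Proof of Theorem B.2. For $a, b \in \{1, \ldots, s\}$, assume WLOG that

$$\delta_{ab} := \bigvee_{j=1}^{k-1} |p_{aj} - p_{bj}| = p_{aL} - p_{bL}, \quad \text{for some } L \in \{1, \ldots, k-1\},$$

is positive. Denote as $\Omega(x_j)$ the multinomial sample space when the $j$th element of $x$ is fixed at a number $x_j$. Then we have

$$\sum_{x \in \Omega} \frac{\pi_a P_a(x) P_b(x)}{P(x)} = \sum_{x_L = 0}^{m} \sum_{x \in \Omega(x_L)} \frac{\pi_a P_a(x) P_b(x)}{P(x)}$$

$$= \sum_{x_L \le \frac{m}{2}(p_{aL} + p_{bL})} \sum_{x \in \Omega(x_L)} \frac{\pi_a P_a(x) P_b(x)}{P(x)} + \sum_{x_L > \frac{m}{2}(p_{aL} + p_{bL})} \sum_{x \in \Omega(x_L)} \frac{\pi_a P_a(x) P_b(x)}{P(x)}$$

$$\le \frac{\pi_a}{\pi_b} \sum_{x_L \le \frac{m}{2}(p_{aL} + p_{bL})} \sum_{x \in \Omega(x_L)} P_a(x) + \sum_{x_L > \frac{m}{2}(p_{aL} + p_{bL})} \sum_{x \in \Omega(x_L)} P_b(x), \tag{B.1}$$

where the inequality uses $P(x) \ge \pi_b P_b(x)$ in the first sum and $P(x) \ge \pi_a P_a(x)$ in the second. Notice that the last statement above consists of marginal probabilities for the $L$th coordinate of $k$-dimensional multinomials, which are binomial probabilities. Following Blischke (1962), suppose $A \sim \mathrm{Binomial}(m, p_{aL})$ and $B \sim \mathrm{Binomial}(m, p_{bL})$; then (B.1) is equal to

$$\frac{\pi_a}{\pi_b} P\left\{ A \le \frac{m}{2}(p_{aL} + p_{bL}) \right\} + P\left\{ B > \frac{m}{2}(p_{aL} + p_{bL}) \right\}. \tag{B.2}$$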
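The tail bounds of Lemma B.1 drive the exponential rate in Theorem B.2, and they are easy to check numerically. The sketch below computes both exact binomial tails and compares them with $e^{-2mc^2}$ over a small grid. The grid values and function names are hypothetical choices for illustration; a finite grid of course illustrates the lemma rather than proving it.

```python
from math import comb, exp


def binom_pmf(x, m, p):
    """Binomial(m, p) probability of x successes."""
    return comb(m, x) * p**x * (1 - p) ** (m - x)


def upper_tail(m, p, c):
    """Exact P(X/m - p >= c) for X ~ Binomial(m, p)."""
    return sum(binom_pmf(x, m, p) for x in range(m + 1) if x / m - p >= c)


def lower_tail(m, p, c):
    """Exact P(X/m - p <= -c) for X ~ Binomial(m, p)."""
    return sum(binom_pmf(x, m, p) for x in range(m + 1) if x / m - p <= -c)


def okamoto_bound(m, c):
    """Right-hand side exp(-2 m c^2) shared by parts i and ii of Lemma B.1."""
    return exp(-2 * m * c * c)


# Hypothetical grid of cluster sizes, success probabilities, and deviations.
checks = []
for m in (5, 20, 50):
    for p in (0.1, 0.3, 0.5):
        for c in (0.05, 0.1, 0.25):
            checks.append((m, p, c,
                           upper_tail(m, p, c),
                           lower_tail(m, p, c),
                           okamoto_bound(m, c)))

all_hold = all(up <= b and low <= b for (_, _, _, up, low, b) in checks)
print("bounds hold on the whole grid:", all_hold)
```

In the proof above the lemma is applied with $c = \delta_{ab}/2$, which is what turns $e^{-2mc^2}$ into the factor $e^{-\frac{m}{2}\delta_{ab}^2}$ appearing in Theorem B.2.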
