Expectation Maximization - Math and Pictures Johannes Traa


This document covers the basics of the EM algorithm, maximum likelihood (ML) estimation, and maximum a posteriori (MAP) estimation. It also covers the EM derivations for the following mixture models:

Gaussian Mixture Model (GMM)
Wrapped Gaussian Mixture Model (WGMM)
von Mises Mixture Model (vMMM)
von Mises-Fisher Mixture Model (vMFMM)
Line Mixture Model (LineMM)
Laplacian Mixture Model (LapMM)
Probabilistic Latent Semantic Indexing (PLSI)

The best way to understand this stuff is to code it up. Plot everything. Excellent references for the EM algorithm and probabilistic methods:

Chapter 9: Mixture Models and EM in Pattern Recognition and Machine Learning (2006) (Bishop)
Chapter 10: Grouping and Model Fitting in Computer Vision: A Modern Approach (2012) (Forsyth, Ponce)
Machine Learning: A Probabilistic Perspective (2012) (Murphy)

1 BASIC IDEA

Parameter estimation is a general problem that shows up again and again. A typical situation is where we have collected some data and we want to summarize or find structure in that dataset. Take the following simple example. You are trying to model the interaction between good weather and the number of people at the beach (it's a silly example, but just roll with it). If we measure both of these quantities every day of the year and make a scatterplot of our data, it might look like the set of points in Figure 1. There's a positive correlation between good weather and people going to the beach, and the data is spread around a center point. Instead of keeping track of the entire dataset, we can represent it with a Gaussian distribution, which looks like a squished and rotated bell curve in 2 dimensions. Slices through the Gaussian (contours) are shown in Figure 1 as ellipses. This pretty much summarizes all the data with two parameters: a mean vector (2 x 1) and a covariance matrix (2 x 2).
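In the spirit of coding everything up, here is a minimal NumPy sketch (my own illustration, not from the text) that fits a single 2D Gaussian to a synthetic "weather vs. beach" dataset by computing the sample mean and covariance, then draws new samples from the fit, as in Figure 1. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "weather vs. beach" data: positively correlated 2D points.
true_mean = np.array([5.0, 200.0])
true_cov = np.array([[4.0, 30.0],
                     [30.0, 400.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=365)   # (365, 2)

# ML fit of a single Gaussian: sample mean and sample covariance.
mu = X.mean(axis=0)                       # (2,)
Sigma = (X - mu).T @ (X - mu) / len(X)    # (2, 2), divides by N as in the ML estimate

# Draw new samples from the fit to check that they spread like the data.
X_new = rng.multivariate_normal(mu, Sigma, size=365)
print(mu, Sigma, X_new.shape)
```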

Figure 1: (Left) Dataset representing the relationship between the number of people at the beach and the goodness of the weather. (Middle) Gaussian distribution fit to the data. (Right) Samples drawn from the Gaussian fit.

Figure 2: (Left) Multimodal dataset. (Middle) Gaussian fit. (Right) Mixture of Gaussians fit.

By fitting a Gaussian, we are implicitly making the assumption that it's reasonable to model the data as having been sampled from a Gaussian. If we generate data from our Gaussian fit, as shown in Figure 1, we can see that the samples are spread in the same way as the actual data. So, in this case, our implicit assumption is reasonable. But what if our data looks like that of Figure 2? The Gaussian assumption doesn't make sense here. At least, it looks like we can do better. The underlying distribution appears multimodal: it has multiple peaks. So, we can just fit multiple Gaussians instead of one.

To fit a single distribution, we typically apply the maximum likelihood (ML) or maximum a posteriori (MAP) method. The former tries to find the distribution that makes the most sense given the data, while the latter also takes into account our belief, independent of the data, of what that distribution should look like. The Expectation-Maximization (EM) algorithm is a straightforward way to fit mixture models that starts with an initial guess and iteratively improves the fit. Each iteration consists of two steps. The first assigns data points to clusters and the second re-estimates the cluster parameters according to the assignments. In EM, we typically associate each data point with each cluster with some probability rather than using binary assignments. The rest is technical details.
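To make the last point concrete, here is a small sketch (my own, not from the text) contrasting hard assignments with the soft, probabilistic assignments EM uses, for a toy 1D problem with two fixed clusters. The cluster parameters are made up for illustration.

```python
import numpy as np

x = np.array([-2.1, -0.3, 0.1, 1.8, 2.5])                     # toy 1D data points
means, stds, weights = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]    # two fixed clusters (illustrative)

# Hard assignment: each point goes entirely to the nearest cluster mean.
hard = np.argmin(np.abs(x[:, None] - np.array(means)[None, :]), axis=1)

# Soft assignment: posterior probability of each cluster given the point
# (weighted Gaussian densities, normalized across clusters). This is what EM's E-step computes.
dens = np.stack([w * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)
                 for w, m, s in zip(weights, means, stds)], axis=1)
soft = dens / dens.sum(axis=1, keepdims=True)

print(hard)           # [0 0 1 1 1]: every point belongs to exactly one cluster
print(soft.round(3))  # points near 0 get split between the two clusters
```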

2 MAXIMUM LIKELIHOOD FOR THE GAUSSIAN DISTRIBUTION

The maximum likelihood estimate of the mean of a Gaussian distribution is simple. The multivariate Gaussian distribution has pdf:

\[
\mathcal{N}(x;\mu,\Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} . \tag{1}
\]

The likelihood function for a dataset drawn i.i.d. from the distribution is:

\[
L = \prod_{i=1}^N \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)} , \tag{2}
\]

so the log likelihood is (dropping constants):

\[
\log L = -\frac{1}{2}\sum_{i=1}^N \log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) . \tag{3}
\]

Differentiating with respect to each parameter, setting the result equal to zero, and solving gives the ML parameter estimates:

\[
\frac{\partial \log L}{\partial \mu} = \sum_{i=1}^N \Sigma^{-1}(x_i-\mu) = 0 , \tag{4}
\]
\[
\mu_{ML} = \frac{1}{N}\sum_{i=1}^N x_i , \tag{5}
\]
\[
\frac{\partial \log L}{\partial \Sigma} = \sum_{i=1}^N \left[ -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T\Sigma^{-1} \right] = 0 , \tag{6}
\]
\[
\Sigma_{ML} = \frac{1}{N}\sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T . \tag{7}
\]

So the ML estimates are just the sample mean and covariance.

2.1 DATA WEIGHTING

We might also include a weight w_i for each data point to reflect how confident we are that it is reliable. We modify the likelihood as:

\[
L = \prod_{i=1}^N \left[ \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)} \right]^{w_i} , \tag{8}
\]

so the log likelihood is (dropping constants):

\[
\log L = -\frac{1}{2}\sum_{i=1}^N w_i \left[ \log|\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] . \tag{9}
\]

Maximizing with respect to the parameters gives the weighted ML estimates:

\[
\mu_{WML} = \frac{\sum_{i=1}^N w_i \, x_i}{\sum_{i=1}^N w_i} , \tag{10}
\]
\[
\Sigma_{WML} = \frac{\sum_{i=1}^N w_i \, (x_i-\mu)(x_i-\mu)^T}{\sum_{i=1}^N w_i} . \tag{11}
\]

3 MAXIMUM A POSTERIORI FOR THE GAUSSIAN DISTRIBUTION

We can regularize the mean and covariance estimates of a Gaussian distribution by incorporating prior information. This biases the solution towards what we believe it should look like before seeing any data. The conjugate distributions for the mean and covariance are the Gaussian and inverse-Wishart distributions, respectively. These conjugate priors ensure that the posterior distribution (likelihood x prior) is of the same form as the prior (this is merely a convenience at this point).

3.1 PRIOR ON THE MEAN

We can regularize the maximum likelihood solution by imposing a Gaussian prior on the mean:

\[
P(\mu;\mu_s,\Sigma_s) = \frac{1}{|2\pi\Sigma_s|^{1/2}} \, e^{-\frac{1}{2}(\mu-\mu_s)^T \Sigma_s^{-1} (\mu-\mu_s)} . \tag{12}
\]

Thus, we have:

\[
P = \left[ \prod_{i=1}^N P(x_i;\mu,\Sigma) \right] P(\mu;\mu_s,\Sigma_s) , \tag{13}
\]
\[
\log P = -\frac{1}{2}\sum_{i=1}^N \left[ \log|2\pi\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] - \frac{1}{2}\log|2\pi\Sigma_s| - \frac{1}{2}(\mu-\mu_s)^T \Sigma_s^{-1} (\mu-\mu_s) , \tag{14}
\]
\[
\frac{\partial \log P}{\partial \mu} = \sum_{i=1}^N \left[ \Sigma^{-1}(x_i-\mu) \right] - \Sigma_s^{-1}(\mu-\mu_s) = 0 . \tag{15}
\]

Solving for \mu, the MAP solution is:

\[
\mu_{MAP} = \left( N\,\Sigma_s\Sigma^{-1} + I \right)^{-1} \left( \Sigma_s\Sigma^{-1}\sum_{i=1}^N x_i + \mu_s \right) \tag{16}
\]
\[
= \left( N\,\Sigma_s\Sigma^{-1} + I \right)^{-1} \left( N\,\Sigma_s\Sigma^{-1}\mu_{ML} + \mu_s \right) . \tag{17}
\]
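A minimal NumPy sketch of the estimators above: the ML estimates (5) and (7), the weighted ML estimates (10) and (11), and the MAP estimate of the mean (17). This is my own illustration under the stated model; function and variable names are illustrative.

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates (5), (7): sample mean and covariance."""
    mu = X.mean(axis=0)
    d = X - mu
    return mu, d.T @ d / len(X)

def gaussian_weighted_ml(X, w):
    """Weighted ML estimates (10), (11) with per-point weights w_i."""
    w = np.asarray(w, dtype=float)
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    d = X - mu
    return mu, (w[:, None] * d).T @ d / w.sum()

def gaussian_map_mean(X, Sigma, mu_s, Sigma_s):
    """MAP estimate of the mean, eq. (17), with a Gaussian prior N(mu_s, Sigma_s)."""
    N, D = X.shape
    B = N * Sigma_s @ np.linalg.inv(Sigma)
    mu_ml = X.mean(axis=0)
    return np.linalg.solve(B + np.eye(D), B @ mu_ml + mu_s)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) + np.array([1.0, -2.0])
print(gaussian_ml(X)[0])
print(gaussian_map_mean(X, np.eye(2), np.zeros(2), 0.1 * np.eye(2)))
```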

Consider the 1D case for simplicity:

\[
\mu_{MAP} = \frac{\frac{\sigma_s^2}{\sigma^2} N \mu_{ML} + \mu_s}{\frac{\sigma_s^2}{\sigma^2} N + 1} . \tag{18}
\]

As \sigma_s^2/\sigma^2 \to \infty, the prior is uninformative, so \mu_{MAP} \to \mu_{ML}. And as \sigma_s^2/\sigma^2 \to 0, the data is uninformative (the prior is strict), so \mu_{MAP} \to \mu_s. When \sigma_s^2/\sigma^2 = 1, the prior behaves as if one additional measurement at \mu_s were present to calculate the ML solution. Also, as N \to \infty, the prior becomes redundant, so \mu_{MAP} \to \mu_{ML}.

3.2 PRIOR ON THE COVARIANCE

We can also regularize the solution for the covariance matrix using the inverse-Wishart distribution (with the appropriate degrees of freedom):

\[
P(\Sigma;\Sigma_0) \propto |\Sigma_0|^{\frac{n}{2}} \, |\Sigma|^{-\frac{n}{2}} \, e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma_0)} . \tag{19}
\]

Thus,

\[
P = \left[ \prod_{i=1}^N P(x_i;\mu,\Sigma) \right] P(\Sigma;\Sigma_0) , \tag{20}
\]
\[
\log P = -\frac{1}{2}\sum_{i=1}^N \left[ \log|2\pi\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] + \frac{n}{2}\log|\Sigma_0| - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\!\left(\Sigma^{-1}\Sigma_0\right) , \tag{21}
\]
\[
\frac{\partial \log P}{\partial \Sigma} = \sum_{i=1}^N \left[ -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T\Sigma^{-1} \right] - \frac{n}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\Sigma_0\Sigma^{-1} = 0 . \tag{22}
\]

Thus, we have that:

\[
\Sigma_{MAP} = \frac{\Sigma_0 + \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T}{n+N} = \frac{\Sigma_0 + N\,\Sigma_{ML}}{n+N} . \tag{23}
\]

Thanks to the conjugacy relationship between the likelihood and the prior, the MAP solution is very intuitive. The parameter n in the inverse-Wishart distribution controls how confident we are that \Sigma_0 is the correct estimate. If we are not very confident, n is set to a small number and a moderate number N of data samples will cause the MAP estimate to ignore the prior.

4 THE MATH BEHIND EM

Expectation-Maximization (EM) is a learning algorithm for maximum-likelihood problems with hidden variables. In the case of a mixture model, we have observed data/variables X,

unobserved data/variables Z, and parameters Θ (to be learned). The hidden variables Z indicate how the observed data X are assigned to the mixture components. The complete data likelihood for a mixture model with i.i.d. samples is

\[
L = \prod_{i=1}^N P(x_i, z_i; \Theta) \tag{24}
\]
\[
= \prod_{i=1}^N \prod_{k=1}^K \left[ P(x_i \mid z_{ik}; \theta_k) \, P(z_k; \theta_k) \right]^{z_{ik}} \tag{25}
\]
\[
= \prod_{i=1}^N \prod_{k=1}^K \left[ P(x_i; \theta_k) \, P(z_k; \theta_k) \right]^{z_{ik}} \tag{26}
\]
\[
= \prod_{i=1}^N \prod_{k=1}^K \left[ P(x_i; \theta_k) \, \pi_k \right]^{z_{ik}} . \tag{27}
\]

P(x_i, z_i; Θ) is the complete data likelihood for the i-th observation x_i, P(x_i; θ_k) is the probability model (pdf) of the k-th component in the mixture evaluated at x_i, and π_k = P(z_k; θ_k) is the mixing weight of the k-th component. The hidden variables z_{ik} are treated as indicator variables for each i in the above notation. So, for the i-th observation x_i, z_{ik} takes the value 1 for a single index k and 0 for all others. This has the effect of selecting one term in the product over k for each i. It's easier to work with the log likelihood, in which case we have

\[
\log L = \sum_{i=1}^N \sum_{k=1}^K z_{ik} \log\!\left[ P(x_i; \theta_k) \, \pi_k \right] . \tag{28}
\]

This is easy to maximize w.r.t. the parameters θ_k if we know the values of the indicator variables z_{ik}. In that case, we can just estimate the parameters for the k-th component using all the data whose indicator is active for that component (i.e. z_{ik} = 1). Seeing as we don't know these data associations, we can first lower-bound the log likelihood by taking its expected value w.r.t. the hidden variables (this requires Jensen's inequality). This gives what is known as the "Q function":

\[
Q = E_{z \mid x, \Theta_{old}}\!\left[ \log L \right] \tag{29}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \alpha_{ik} \log\!\left[ P(x_i; \theta_k) \, \pi_k \right] , \tag{30}
\]

where

\[
\alpha_{ik} = E_{z \mid x, \Theta_{old}}\!\left[ z_{ik} \right] \tag{31}
\]
\[
= P(z_{ik} = 1 \mid x_i; \Theta_{old}) \tag{32}
\]
\[
= \frac{P(x_i \mid z_{ik}; \theta_k^{old}) \, P(z_{ik}; \theta_k^{old})}{\sum_{l=1}^K P(x_i \mid z_{il}; \theta_l^{old}) \, P(z_{il}; \theta_l^{old})} \tag{33}
\]
\[
= \frac{P(x_i \mid z_{ik}; \theta_k^{old}) \, \pi_k}{\sum_{l=1}^K P(x_i \mid z_{il}; \theta_l^{old}) \, \pi_l} \tag{34}
\]

represents our belief that the k-th component in the mixture is responsible for generating the i-th observation. (32) follows since the expectation of an indicator variable is its probability of being 1. The Q function is easier to maximize and leads to the EM algorithm. In the E step, we fix the current estimate of the parameters Θ and calculate the posterior probabilities α_{ik}. This captures how much information each data point x_i contributes in estimating the parameters of each component θ_k. Then, in the M step, we use these posteriors as soft weights to update the model parameters via maximization of (30). Data points with higher weights for a specific value of k will exert more influence on the update of the k-th component's parameters. After the M step, Θ has changed, so the α_{ik} have changed. We can re-estimate α_{ik}, update Θ, and repeat until convergence. This procedure never decreases the data likelihood and converges to a local maximum in practice.

5 GAUSSIAN MIXTURE MODEL (GMM)

The model is a K-component Mixture of Gaussians (MoG). All data is drawn independently from this mixture. The likelihood function is given by

\[
L = \prod_{i=1}^N \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) , \tag{35}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) . \tag{36}
\]

The Q function is given by

Figure 3: Mixture of Gaussians fit to a dataset in 2 dimensions. Each Gaussian is depicted by its mean µ (black "+"), covariance Σ (1-σ ellipse), and mixing weight π (transparency). Data points are colored by their posterior probabilities η_{ik}.

\[
Q = E_{z \mid x, \Theta^{(t)}}\!\left[ \log P(x, z \mid \Theta) \right] \tag{37}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \log\!\left[ P(z_k \mid \Theta) \, P(x_i \mid z_k; \Theta) \right] P(z_k \mid x_i; \Theta^{(t)}) \tag{38}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \log\!\left[ \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) \right] \eta_{ik} \tag{39}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \log\!\left[ \frac{\pi_k}{|2\pi\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(\mu_k - x_i)^T \Sigma_k^{-1} (\mu_k - x_i)} \right] \eta_{ik} \tag{40}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \left[ \log \pi_k - \frac{1}{2}\log|2\pi\Sigma_k| - \frac{1}{2}(\mu_k - x_i)^T \Sigma_k^{-1} (\mu_k - x_i) \right] \eta_{ik} , \tag{41}
\]

where Θ is the parameter set to solve for, Θ^{(t)} is the previous iteration's parameters, and

η_{ik} = P(z_k | x_i; Θ^{(t)}) is the posterior probability of each hidden variable given the parameters from the previous iteration, given by

\[
\eta_{ik} = P(z_k \mid x_i; \Theta^{(t)}) = \frac{P(x_i \mid z_k; \Theta^{(t)}) \, P(z_k \mid \Theta^{(t)})}{P(x_i \mid \Theta^{(t)})} = \frac{P(x_i \mid z_k; \Theta^{(t)}) \, P(z_k \mid \Theta^{(t)})}{\sum_{k=1}^K P(x_i \mid z_k; \Theta^{(t)}) \, P(z_k \mid \Theta^{(t)})} . \tag{42}
\]

The hidden variables indicate what cluster each data point is generated from. In each iteration, we need to optimize the Q function in each coordinate of the parameter space. To do this, we take derivatives with respect to each of the parameters, set the result to zero, and solve for the locally optimal new values:

\[
\frac{\partial Q}{\partial \mu_k} = -\sum_{i=1}^N \Sigma_k^{-1}(\mu_k - x_i) \, \eta_{ik} = 0 , \tag{43}
\]
\[
\frac{\partial Q}{\partial \Sigma_k} = -\frac{1}{2}\sum_{i=1}^N \left[ \Sigma_k^{-1} - \Sigma_k^{-1}(\mu_k - x_i)(\mu_k - x_i)^T \Sigma_k^{-1} \right] \eta_{ik} = 0 , \tag{44}
\]
\[
\frac{\partial}{\partial \pi_k}\left( Q + \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ik}}{\pi_k} + \lambda = 0 . \tag{45}
\]

(Lagrange multipliers are used to enforce equality constraints.) Re-arranging terms and solving for the new model parameters, we get

\[
\mu_k = \frac{\sum_{i=1}^N \eta_{ik} \, x_i}{\sum_{i=1}^N \eta_{ik}} , \tag{46}
\]
\[
\Sigma_k = \frac{\sum_{i=1}^N \eta_{ik} \, (\mu_k - x_i)(\mu_k - x_i)^T}{\sum_{i=1}^N \eta_{ik}} , \tag{47}
\]
\[
\pi_k = \frac{1}{N}\sum_{i=1}^N \eta_{ik} . \tag{48}
\]

These form the M-step update equations for the GMM fitting algorithm. It's interesting to note that the mean and covariance updates are just weighted ML estimates. The posterior probability η_{ik} corresponds to how confident we are that the i-th data point was sampled from the k-th Gaussian.
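Putting the E step (42) and the M step (46)-(48) together, here is a compact NumPy sketch of GMM fitting by EM. It is a minimal illustration (my own, not the author's reference code) and skips the usual numerical safeguards such as log-sum-exp; the small ridge added to the covariances is only there to keep the toy example stable.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma) evaluated at the rows of X."""
    diff = X - mu
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('nd,de,ne->n', diff, Sinv, diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def gmm_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize: random data points as means, shared covariance, uniform weights.
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: posterior responsibilities eta_ik, eq. (42).
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        eta = dens / dens.sum(axis=1, keepdims=True)            # (N, K)
        # M step: posterior-weighted ML updates, eqs. (46)-(48).
        Nk = eta.sum(axis=0)                                    # (K,)
        mu = (eta.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (eta[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return mu, Sigma, pi, eta

# Example: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3, 0], 1.0, (200, 2)), rng.normal([3, 1], 0.5, (200, 2))])
mu, Sigma, pi, eta = gmm_em(X, K=2)
print(mu.round(2), pi.round(2))
```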

5.1 GMM WITH PRIORS AND DATA WEIGHTING

If we place priors on the means and/or covariances, the results are simply posterior-weighted MAP estimators. For example, with a Gaussian prior N(ν_k, Ω_k) on the k-th mean, the MAP update for the means is:

\[
\mu_k = \left[ \Omega_k \Sigma_k^{-1} \sum_{i=1}^N \eta_{ik} + I \right]^{-1} \left[ \Omega_k \Sigma_k^{-1} \sum_{i=1}^N \eta_{ik} \, x_i + \nu_k \right] . \tag{49}
\]

We can also include data weighting as in the case of a single Gaussian. The weights just multiply the posteriors:

\[
\tilde{\eta}_{ik} = \eta_{ik} \, w_i . \tag{50}
\]

6 WRAPPED GAUSSIAN MIXTURE MODEL (WGMM)

We can also derive a procedure for fitting a GMM on a torus.

6.1 UNIVARIATE WGMM

Figure 4: Mixture of univariate wrapped Gaussians (WG) fit to a dataset. WG components (red) linearly combine to form a mixture (blue) that describes the distribution of the data (bars).

In the 1D case, the torus is just a circle. This is useful for when we have data that lies on a circular axis in the range [-π, π]. The EM update equations are derived as in the regular

GMM case.

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \, \mathrm{wN}(x_i; \mu_j, \sigma_j^2) \tag{51}
\]
\[
= \prod_{i=1}^N \sum_{j=1}^K \pi_j \sum_{l=-\infty}^{\infty} \mathcal{N}(x_i; \mu_j + 2\pi l, \sigma_j^2) , \tag{52}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \sum_{l=-\infty}^{\infty} \mathcal{N}(x_i; \mu_j + 2\pi l, \sigma_j^2) . \tag{53}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \sum_{l=-\infty}^{\infty} \log\!\left[ \pi_j \, \mathcal{N}(x_i; \mu_j + 2\pi l, \sigma_j^2) \right] \eta_{ijl} \tag{54}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \sum_{l=-\infty}^{\infty} \left[ \log \pi_j - \frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma_j^2 - \frac{(x_i - \mu_j - 2\pi l)^2}{2\sigma_j^2} \right] \eta_{ijl} . \tag{55}
\]

Partial derivatives:

\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^N \sum_{l} \frac{x_i - \mu_j - 2\pi l}{\sigma_j^2} \, \eta_{ijl} = 0 , \tag{56}
\]
\[
\frac{\partial Q}{\partial \sigma_j^2} = \sum_{i=1}^N \sum_{l} \left[ -\frac{1}{2\sigma_j^2} + \frac{(x_i - \mu_j - 2\pi l)^2}{2(\sigma_j^2)^2} \right] \eta_{ijl} = 0 , \tag{57}
\]
\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \sum_{l} \frac{\eta_{ijl}}{\pi_j} + \lambda = 0 . \tag{58}
\]

Update rules:

\[
\mu_j = \frac{\sum_{i=1}^N \sum_{l} (x_i - 2\pi l) \, \eta_{ijl}}{\sum_{i=1}^N \sum_{l} \eta_{ijl}} , \tag{59}
\]
\[
\sigma_j^2 = \frac{\sum_{i=1}^N \sum_{l} (x_i - \mu_j - 2\pi l)^2 \, \eta_{ijl}}{\sum_{i=1}^N \sum_{l} \eta_{ijl}} , \tag{60}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \sum_{l} \eta_{ijl} . \tag{61}
\]
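A small sketch (mine, not from the text) of evaluating the wrapped Gaussian density and the responsibilities η_{ijl}, truncating the sum over l to |l| <= L as discussed in the next paragraph. The function names are illustrative.

```python
import numpy as np

def wrapped_normal_pdf(x, mu, var, L=3):
    """Wrapped Gaussian on [-pi, pi): sum of N(x; mu + 2*pi*l, var) truncated to |l| <= L."""
    x = np.atleast_1d(x)
    l = np.arange(-L, L + 1)
    shifted = x[:, None] - mu - 2 * np.pi * l[None, :]             # (N, 2L+1)
    comps = np.exp(-0.5 * shifted ** 2 / var) / np.sqrt(2 * np.pi * var)
    return comps.sum(axis=1)

def wgmm_responsibilities(x, mus, vars_, pis, L=3):
    """eta_{ijl}: posterior over component j and wrap index l for each point x_i."""
    x = np.atleast_1d(x)
    l = np.arange(-L, L + 1)
    # dens[i, j, l] = pi_j * N(x_i; mu_j + 2*pi*l, var_j)
    shifted = x[:, None, None] - np.asarray(mus)[None, :, None] - 2 * np.pi * l[None, None, :]
    dens = np.asarray(pis)[None, :, None] * np.exp(-0.5 * shifted ** 2 / np.asarray(vars_)[None, :, None])
    dens /= np.sqrt(2 * np.pi * np.asarray(vars_))[None, :, None]
    return dens / dens.sum(axis=(1, 2), keepdims=True)              # (N, K, 2L+1)

x = np.array([3.0, -3.0, 0.1])
print(wrapped_normal_pdf(x, mu=np.pi, var=0.2))    # points near +pi and -pi get similar density
print(wgmm_responsibilities(x, mus=[0.0, np.pi], vars_=[0.3, 0.3], pis=[0.5, 0.5]).sum(axis=(1, 2)))
```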

In practice, we can't evaluate expressions with an infinite number of terms numerically, so the WGs need to be truncated after a sufficient number of terms. This involves replacing \sum_{l=-\infty}^{\infty}(\cdot) with \sum_{l=-L}^{L}(\cdot).

6.2 BIVARIATE WGMM

Figure 5: Mixture of bivariate wrapped Gaussians (WG) fit to a dataset. Data points are colored by posterior probability.

When there are multiple circular axes to consider, we can make use of the multivariate WG distribution. For the case of two dimensions, we have:

\[
P(x; \mu, \Sigma) = \sum_{l_1, l_2} \mathcal{N}\!\left(x; \, \mu + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma\right) , \qquad x \in S^2 . \tag{63}
\]

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \sum_{l_1, l_2} \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right) , \tag{64}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \sum_{l_1, l_2} \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right) . \tag{65}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \sum_{l_1, l_2} \log\!\left[ \pi_j \, \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right) \right] \eta_{ijl_1l_2} \tag{66}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \sum_{l_1, l_2} \left[ \log \pi_j - \frac{1}{2}\log\left|2\pi\Sigma_j\right| - \frac{1}{2}\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)^T \Sigma_j^{-1} \left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right) \right] \eta_{ijl_1l_2} , \tag{67}
\]

where the posteriors are

\[
\eta_{ijl_1l_2} = \frac{\pi_j \, \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right)}{\sum_{j=1}^K \sum_{l_1, l_2} \pi_j \, \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right)} . \tag{68}
\]

Partial derivatives:

\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^N \sum_{l_1, l_2} \Sigma_j^{-1}\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right) \eta_{ijl_1l_2} = 0 , \tag{69}
\]
\[
\frac{\partial Q}{\partial \Sigma_j} = -\frac{1}{2}\sum_{i=1}^N \sum_{l_1, l_2} \left[ \Sigma_j^{-1} - \Sigma_j^{-1}\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)^T \Sigma_j^{-1} \right] \eta_{ijl_1l_2} = 0 , \tag{70}
\]
\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \sum_{l_1, l_2} \frac{\eta_{ijl_1l_2}}{\pi_j} + \lambda = 0 . \tag{71}
\]

Update rules:

\[
\mu_j = \frac{\sum_{i=1}^N \sum_{l_1, l_2} \left(x_i - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right) \eta_{ijl_1l_2}}{\sum_{i=1}^N \sum_{l_1, l_2} \eta_{ijl_1l_2}} , \tag{72}
\]
\[
\Sigma_j = \frac{\sum_{i=1}^N \sum_{l_1, l_2} \left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)^T \eta_{ijl_1l_2}}{\sum_{i=1}^N \sum_{l_1, l_2} \eta_{ijl_1l_2}} , \tag{73}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \sum_{l_1, l_2} \eta_{ijl_1l_2} . \tag{74}
\]

7 VON MISES MIXTURE MODEL (VMMM)

We can also cluster on the unit circle with the von Mises distribution, whose pdf is:

\[
\mathrm{vM}(x; \mu_j, \kappa_j) = \frac{1}{2\pi I_0(\kappa_j)} \, e^{\kappa_j \cos(x - \mu_j)} . \tag{75}
\]

Because the vM has a cos(\cdot) term, we will have to numerically update the concentration parameter κ. Otherwise, the derivation is standard.

7.1 UNIVARIATE VMMM

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \, \mathrm{vM}(x_i; \mu_j, \kappa_j) , \tag{76}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \, \mathrm{vM}(x_i; \mu_j, \kappa_j) . \tag{77}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \pi_j \, \mathrm{vM}(x_i; \mu_j, \kappa_j) \right] \eta_{ij} \tag{78}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \frac{\pi_j}{2\pi I_0(\kappa_j)} \, e^{\kappa_j \cos(x_i - \mu_j)} \right] \eta_{ij} \tag{79}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \left[ \log \pi_j - \log(2\pi) - \log I_0(\kappa_j) + \kappa_j \cos(x_i - \mu_j) \right] \eta_{ij} . \tag{80}
\]

I_0(\cdot) is the 0th-order modified Bessel function of the first kind.

Partial derivatives:

\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ij}}{\pi_j} + \lambda = 0 , \tag{81}
\]
\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^N \kappa_j \sin(x_i - \mu_j) \, \eta_{ij} \tag{82}
\]
\[
= \sum_{i=1}^N \kappa_j \left[ \sin(x_i)\cos(\mu_j) - \cos(x_i)\sin(\mu_j) \right] \eta_{ij} = 0 , \tag{83}
\]
\[
\frac{\partial Q}{\partial \kappa_j} = \sum_{i=1}^N \left[ -\frac{I_1(\kappa_j)}{I_0(\kappa_j)} + \cos(x_i - \mu_j) \right] \eta_{ij} \tag{84}
\]
\[
= \sum_{i=1}^N \left[ -A(\kappa_j) + \cos(x_i - \mu_j) \right] \eta_{ij} = 0 . \tag{85}
\]

Update rules:

\[
\mu_j = \tan^{-1}\!\left( \frac{\sum_{i=1}^N \sin(x_i) \, \eta_{ij}}{\sum_{i=1}^N \cos(x_i) \, \eta_{ij}} \right) , \tag{87}
\]
\[
A(\kappa_j) = \frac{\sum_{i=1}^N \cos(x_i - \mu_j) \, \eta_{ij}}{\sum_{i=1}^N \eta_{ij}} , \qquad A(\kappa_j) \equiv \frac{I_1(\kappa_j)}{I_0(\kappa_j)} , \tag{88}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \eta_{ij} . \tag{89}
\]

We can solve for κ_j with a standard zero-finder (e.g. bisection search). Notice that the vM distribution has wrapping built into its definition, whereas a truncated WG is a good approximation.
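A sketch of the zero-finder just mentioned for the concentration update (88): since A(κ) = I_1(κ)/I_0(κ) is monotonically increasing in κ, a simple bisection search recovers κ_j from the posterior-weighted mean resultant value. This is my own illustration; the function names are not from the text.

```python
from scipy.special import i0e, i1e   # exponentially scaled Bessel functions (avoid overflow)

def A(kappa):
    """A(kappa) = I1(kappa)/I0(kappa); the e^-kappa scaling cancels in the ratio."""
    return i1e(kappa) / i0e(kappa)

def solve_kappa(r, tol=1e-10, kappa_max=1e4):
    """Solve A(kappa) = r for kappa by bisection; A is monotonically increasing on (0, inf)."""
    lo, hi = 1e-12, kappa_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if A(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

kappa_true = 2.5
print(solve_kappa(A(kappa_true)))   # approximately 2.5
```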

8 VON MISES-FISHER MIXTURE MODEL (VMFMM)

The von Mises-Fisher is a convenient distribution for modeling uncertainty on the unit 2-sphere. The pdf is parameterized by a mean direction µ and concentration κ:

\[
P(x; \mu, \kappa) = \frac{\kappa}{2\pi\left(e^{\kappa} - e^{-\kappa}\right)} \, e^{\kappa \mu^T x} , \qquad \|\mu\|_2 = 1 . \tag{90}
\]

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \, \mathrm{vMF}(x_i; \mu_j, \kappa_j) , \tag{91}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \, \mathrm{vMF}(x_i; \mu_j, \kappa_j) . \tag{92}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \pi_j \, \mathrm{vMF}(x_i; \mu_j, \kappa_j) \right] \eta_{ij} \tag{93}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \frac{\pi_j \, \kappa_j}{2\pi\left(e^{\kappa_j} - e^{-\kappa_j}\right)} \, e^{\kappa_j \mu_j^T x_i} \right] \eta_{ij} \tag{94}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \left[ \log \pi_j + \log \kappa_j - \log(2\pi) - \log\!\left(e^{\kappa_j} - e^{-\kappa_j}\right) + \kappa_j \mu_j^T x_i \right] \eta_{ij} . \tag{95}
\]

Figure 6: Mixture of von Mises-Fisher distributions on the sphere. von Mises-Fisher distributions are denoted by their mean µ (black "+") and concentration κ (ellipses). Data points are colored by their posterior probabilities η_{ij}.

Partial derivatives:

\[
\frac{\partial}{\partial \mu_j}\left( Q + \lambda\left( \mu_j^T \mu_j - 1 \right) \right) = \sum_{i=1}^N \kappa_j \, x_i \, \eta_{ij} - \lambda \mu_j = 0 , \qquad \mu_j^T \mu_j = 1 , \tag{97}
\]
\[
\frac{\partial Q}{\partial \kappa_j} = \sum_{i=1}^N \left[ \frac{1}{\kappa_j} - \frac{e^{\kappa_j} + e^{-\kappa_j}}{e^{\kappa_j} - e^{-\kappa_j}} + \mu_j^T x_i \right] \eta_{ij} = 0 , \tag{98}
\]
\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ij}}{\pi_j} + \lambda = 0 . \tag{99}
\]

Update rules:

\[
\mu_j = \frac{\sum_{i=1}^N x_i \, \eta_{ij}}{\left\| \sum_{i=1}^N x_i \, \eta_{ij} \right\|_2} , \tag{100}
\]
\[
A(\kappa_j) \equiv \frac{e^{\kappa_j} + e^{-\kappa_j}}{e^{\kappa_j} - e^{-\kappa_j}} - \frac{1}{\kappa_j} = \frac{\sum_{i=1}^N \hat{\mu}_j^T x_i \, \eta_{ij}}{\sum_{i=1}^N \eta_{ij}} , \tag{101}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \eta_{ij} . \tag{102}
\]

The update of the concentration parameters is a pain, but there are good approximations. For κ ≥ 3 and large A(κ_j), the following can be used as an update approximation (from Mardia and Jupp, 2000, pg. 198):

\[
\kappa_j \approx \frac{1}{1 - A(\kappa_j)} . \tag{103}
\]

Even when the conditions of the approximation are not met, the clustering is sufficiently stable and accurate.
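A sketch of the vMF M-step pieces (100)-(103): normalize the posterior-weighted resultant vector to get the mean direction, and use the large-κ approximation for the concentration. This is my own illustration under the 3D (unit 2-sphere) parameterization above; names are illustrative.

```python
import numpy as np

def vmf_mstep(X, eta):
    """M-step for a vMF mixture on the unit sphere.

    X: (N, 3) unit vectors; eta: (N, K) posterior responsibilities.
    Returns mean directions (100), concentrations via approximation (103),
    and mixing weights (102).
    """
    N, K = eta.shape
    S = eta.T @ X                                                 # (K, 3) weighted resultant vectors
    mu = S / np.linalg.norm(S, axis=1, keepdims=True)             # eq. (100)
    A = np.einsum('kd,nd,nk->k', mu, X, eta) / eta.sum(axis=0)    # posterior-weighted mean of mu^T x, eq. (101)
    kappa = 1.0 / np.maximum(1.0 - A, 1e-8)                       # approximation (103), intended for concentrated clusters
    pi = eta.sum(axis=0) / N                                      # eq. (102)
    return mu, kappa, pi

# Toy check: points clustered around the north pole, one component.
rng = np.random.default_rng(0)
X = rng.normal([0, 0, 5], 1.0, (500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
mu, kappa, pi = vmf_mstep(X, np.ones((500, 1)))
print(mu.round(2), kappa.round(1), pi)
```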

9 LINE MIXTURE MODEL (LINEMM)

Here we derive the update equations for fitting a mixture of lines to a 2D dataset. In essence, we place 1D Gaussian distributions on each data point and measure error (negative likelihood) evaluated at the lines in the (vertical) y-coordinate. This is the multi-line extension of linear regression. The mixture model likelihood has the form:

\[
L = \prod_{i=1}^N \sum_{k=1}^K \pi_k \, \mathcal{N}\!\left(y_i; f_k(x_i), \sigma_k^2\right) \tag{104}
\]
\[
= \prod_{i=1}^N \sum_{k=1}^K \frac{\pi_k}{\sqrt{2\pi\sigma_k^2}} \, e^{-\frac{\left(y_i - f_k(x_i)\right)^2}{2\sigma_k^2}} , \tag{105}
\]

where

\[
f_k(x_i) = a_k x_i + b_k . \tag{106}
\]

The Q function is:

\[
Q = \sum_{i=1}^N \sum_{k=1}^K \left[ \log \pi_k - \frac{1}{2}\log\sigma_k^2 - \frac{\left(y_i - (a_k x_i + b_k)\right)^2}{2\sigma_k^2} \right] \eta_{ik} . \tag{107}
\]

Figure 7: Mixture of lines. Data points are colored by their posterior probabilities η_{ik}.

Taking derivatives,

\[
\frac{\partial Q}{\partial a_k} = \sum_{i=1}^N \frac{\left(y_i - a_k x_i - b_k\right) x_i}{\sigma_k^2} \, \eta_{ik} = 0 , \tag{108}
\]
\[
\frac{\partial Q}{\partial b_k} = \sum_{i=1}^N \frac{y_i - a_k x_i - b_k}{\sigma_k^2} \, \eta_{ik} = 0 , \tag{109}
\]
\[
\frac{\partial Q}{\partial \sigma_k^2} = \sum_{i=1}^N \left[ \frac{\left(y_i - (a_k x_i + b_k)\right)^2}{2(\sigma_k^2)^2} - \frac{1}{2\sigma_k^2} \right] \eta_{ik} = 0 , \tag{110}
\]
\[
\frac{\partial}{\partial \pi_k}\left( Q + \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ik}}{\pi_k} + \lambda = 0 , \qquad \sum_{k=1}^K \pi_k = 1 . \tag{111}
\]

Re-arranging and solving for the parameters gives

\[
\hat{a}_k = \frac{\sum_{i=1}^N x_i \left(y_i - b_k\right) \eta_{ik}}{\sum_{i=1}^N x_i^2 \, \eta_{ik}} , \tag{112}
\]
\[
\hat{b}_k = \frac{\sum_{i=1}^N \left(y_i - a_k x_i\right) \eta_{ik}}{\sum_{i=1}^N \eta_{ik}} , \tag{113}
\]
\[
\sigma_k^2 = \frac{\sum_{i=1}^N \left(y_i - (a_k x_i + b_k)\right)^2 \eta_{ik}}{\sum_{i=1}^N \eta_{ik}} , \tag{114}
\]
\[
\pi_k = \frac{1}{N}\sum_{i=1}^N \eta_{ik} . \tag{115}
\]
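A sketch of one EM iteration for the line mixture: the E step uses the Gaussian residual likelihood (105) and the M step applies (112)-(115), updating a_k with b_k held fixed and then b_k with the new a_k, mirroring the coordinate-wise derivation. My own illustration; names are illustrative.

```python
import numpy as np

def linemm_em_step(x, y, a, b, var, pi):
    """One EM iteration for a mixture of lines y = a_k x + b_k + noise."""
    a, b, var, pi = map(np.asarray, (a, b, var, pi))
    # E step: responsibilities from the per-line Gaussian residual likelihood (105).
    resid = y[:, None] - (a[None, :] * x[:, None] + b[None, :])          # (N, K)
    dens = pi * np.exp(-0.5 * resid ** 2 / var) / np.sqrt(2 * np.pi * var)
    eta = dens / dens.sum(axis=1, keepdims=True)
    # M step: eqs. (112)-(115).
    Nk = eta.sum(axis=0)
    a_new = (eta * (x[:, None] * (y[:, None] - b[None, :]))).sum(axis=0) / (eta * x[:, None] ** 2).sum(axis=0)
    b_new = (eta * (y[:, None] - a_new[None, :] * x[:, None])).sum(axis=0) / Nk
    resid_new = y[:, None] - (a_new[None, :] * x[:, None] + b_new[None, :])
    var_new = (eta * resid_new ** 2).sum(axis=0) / Nk
    pi_new = Nk / len(x)
    return a_new, b_new, var_new, pi_new, eta

# Toy data from two lines.
rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, 400)
y = np.where(rng.random(400) < 0.5, 2 * x + 1, -x + 4) + 0.3 * rng.normal(size=400)
params = (np.array([1.0, -2.0]), np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5]))
for _ in range(50):
    *params, eta = linemm_em_step(x, y, *params)
print([p.round(2) for p in params])
```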

10 LAPLACIAN MIXTURE MODEL (LAPMM)

Figure 8: Mixture of Laplacian distributions.

Now we derive the update equations for the M-step to fit a 1D mixture of Laplacian distributions. This is very much like the derivation for the GMM except that the distribution used is a Laplacian rather than a Gaussian.

\[
L = \prod_{i=1}^N \sum_{k=1}^K \pi_k \, \mathcal{L}(x_i; \mu_k, b_k) \tag{116}
\]
\[
= \prod_{i=1}^N \sum_{k=1}^K \frac{\pi_k}{2 b_k} \, e^{-\frac{|x_i - \mu_k|}{b_k}} , \tag{117}
\]
\[
Q = \sum_{i=1}^N \sum_{k=1}^K \left[ \log \pi_k - \log b_k - \frac{|x_i - \mu_k|}{b_k} \right] \eta_{ik} . \tag{118}
\]

Taking derivatives,

\[
\frac{\partial Q}{\partial \mu_k} = \sum_{i=1}^N \frac{x_i - \mu_k}{b_k \, |x_i - \mu_k|} \, \eta_{ik} = 0 , \tag{119}
\]
\[
\frac{\partial Q}{\partial b_k} = \sum_{i=1}^N \left[ -\frac{1}{b_k} + \frac{|x_i - \mu_k|}{b_k^2} \right] \eta_{ik} = 0 , \tag{120}
\]
\[
\frac{\partial}{\partial \pi_k}\left( Q + \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ik}}{\pi_k} + \lambda = 0 , \qquad \sum_{k=1}^K \pi_k = 1 . \tag{121}
\]

We need to assume that the denominator |x_i - \mu_k| in equation (119) is a constant for each i and k to continue. This leads to a stable algorithm in practice. Re-arranging and solving for the parameters gives

\[
\mu_k = \frac{\sum_{i=1}^N \dfrac{x_i \, \eta_{ik}}{|x_i - \mu_k|}}{\sum_{i=1}^N \dfrac{\eta_{ik}}{|x_i - \mu_k|}} , \tag{122}
\]
\[
b_k = \frac{\sum_{i=1}^N |x_i - \mu_k| \, \eta_{ik}}{\sum_{i=1}^N \eta_{ik}} , \tag{123}
\]
\[
\pi_k = \frac{1}{N}\sum_{i=1}^N \eta_{ik} . \tag{124}
\]
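A sketch of the Laplacian M-step (122)-(124), treating the |x_i - mu_k| denominators as constants computed from the previous means, as assumed above. My own illustration; names are illustrative.

```python
import numpy as np

def lapmm_mstep(x, eta, mu_old):
    """M-step for a 1D Laplacian mixture, eqs. (122)-(124).

    The |x_i - mu_k| weights are computed from the previous means (mu_old),
    matching the constant-denominator assumption made in the derivation.
    """
    eps = 1e-12
    w = eta / (np.abs(x[:, None] - mu_old[None, :]) + eps)        # (N, K)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)             # eq. (122)
    b = (eta * np.abs(x[:, None] - mu[None, :])).sum(axis=0) / eta.sum(axis=0)   # eq. (123)
    pi = eta.mean(axis=0)                                         # eq. (124)
    return mu, b, pi

# Quick check with a single component; iterating this update drives mu toward the
# posterior-weighted median, which is the ML location parameter of a Laplacian.
x = np.array([-4.0, -3.5, 0.0, 3.0, 3.2, 3.6])
eta = np.ones((6, 1))
print(lapmm_mstep(x, eta, mu_old=np.array([0.5])))
```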

11 PROBABILISTIC LATENT SEMANTIC INDEXING (PLSI)

PLSI is a model that represents data in the probability (canonical) simplex as convex combinations of categorical distributions referred to as "basis vectors," "endmembers," and "topics." Here, we look at a derivation of the PLSI update equations used by Hofmann in his paper. This is not a conventional application of the EM algorithm, but the details can be obscured without affecting the derivation.

Figure 9: PLSI clustering on a 3-component probability simplex. Data is colored by its activation weight P(i|k).

We consider the following factorization:

\[
P(i, j) = \sum_{k=1}^K P(i, j \mid k) \, P(k) = \sum_{k=1}^K P(i \mid k) \, P(j \mid k) \, P(k) , \tag{125}
\]

where i is the word (dimension) index and j is the document (data point) index. The word and document variables are assumed to be independent given the latent variable z, which captures which topic generated each data point. The log likelihood is just the negative cross-entropy between an observed data distribution X(i, j) and its reconstruction P(i, j) under the (symmetric) PLSI model:

\[
L = \sum_{i=1}^N \sum_{j=1}^D X(i, j) \log P(i, j) \tag{126}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^D X(i, j) \log \sum_{k=1}^K P(i \mid k) \, P(j \mid k) \, P(k) , \tag{127}
\]

where we require that

\[
\sum_{i=1}^N P(i \mid k) = 1 , \tag{128}
\]
\[
\sum_{j=1}^D P(j \mid k) = 1 , \tag{129}
\]
\[
\sum_{k=1}^K P(k) = 1 . \tag{130}
\]

The Q function is:

\[
Q = \sum_{i=1}^N \sum_{j=1}^D X(i, j) \sum_{k=1}^K \log\!\left[ P(i \mid k) \, P(j \mid k) \, P(k) \right] P(k \mid i, j) \tag{131}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^D X(i, j) \sum_{k=1}^K \left[ \log P(i \mid k) + \log P(j \mid k) + \log P(k) \right] P(k \mid i, j) , \tag{132}
\]

and the posterior is given by:

\[
P(k \mid i, j) = \frac{P(i, j, k)}{P(i, j)} = \frac{P(i \mid k) \, P(j \mid k) \, P(k)}{\sum_{k=1}^K P(i \mid k) \, P(j \mid k) \, P(k)} . \tag{133}
\]

Taking partial derivatives with the appropriate Lagrange multiplier expressions, we get:

\[
\frac{\partial}{\partial P(i \mid k)}\left( Q + \lambda\Big(\sum_{i=1}^N P(i \mid k) - 1\Big) \right) = \frac{\sum_{j=1}^D X(i, j) \, P(k \mid i, j)}{P(i \mid k)} + \lambda = 0 , \tag{134}
\]
\[
\frac{\partial}{\partial P(j \mid k)}\left( Q + \lambda\Big(\sum_{j=1}^D P(j \mid k) - 1\Big) \right) = \frac{\sum_{i=1}^N X(i, j) \, P(k \mid i, j)}{P(j \mid k)} + \lambda = 0 , \tag{135}
\]
\[
\frac{\partial}{\partial P(k)}\left( Q + \lambda\Big(\sum_{k=1}^K P(k) - 1\Big) \right) = \frac{\sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j)}{P(k)} + \lambda = 0 . \tag{136}
\]

Solving for the parameters P(i|k), P(j|k), and P(k), plugging these expressions into the constraint equations (128)-(130), we can solve for the Lagrange multipliers and substitute them back into equations (134)-(136). This gives the PLSI updates for the M-step:

\[
P(i \mid k) = \frac{\sum_{j=1}^D X(i, j) \, P(k \mid i, j)}{\sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j)} , \tag{137}
\]
\[
P(j \mid k) = \frac{\sum_{i=1}^N X(i, j) \, P(k \mid i, j)}{\sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j)} , \tag{138}
\]
\[
P(k) = \sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j) . \tag{139}
\]

11.1 MULTIPLICATIVE UPDATES

We can re-arrange (137)-(139) for efficient computation using matrix algebra. First, we collapse the last two terms into one:

\[
P(j, k) = P(j \mid k) \, P(k) = \sum_{i=1}^N X(i, j) \, P(k \mid i, j) , \tag{140}
\]

and plug the E step (posteriors) into the M step to get:

\[
P(i \mid k) = \frac{\sum_{j=1}^D Y(i, j) \, P(i, j, k)}{\sum_{j=1}^D P(j, k)} = \frac{P(i \mid k) \sum_{j=1}^D Y(i, j) \, P(j, k)}{\sum_{j=1}^D P(j, k)} , \tag{141}
\]
\[
P(j, k) = \sum_{i=1}^N Y(i, j) \, P(i, j, k) = P(j, k) \sum_{i=1}^N Y(i, j) \, P(i \mid k) , \tag{142}
\]

where

\[
Y(i, j) = \frac{X(i, j)}{P(i, j)} . \tag{143}
\]

Writing this in matrix notation (with elementwise multiplication and division), we get:

\[
W \leftarrow W \odot \frac{Y H^T}{J H^T} , \qquad J = \mathrm{ones}(N, D) , \tag{144}
\]
\[
H \leftarrow H \odot \left( W^T Y \right) , \tag{145}
\]

where W \in \mathbb{R}^{N \times K} and H \in \mathbb{R}^{K \times D} are the probabilities P(i|k) and P(j,k) arranged into matrices such that P = W H \in \mathbb{R}^{N \times D} is equivalent to P(i, j) = \sum_{k=1}^K P(i \mid k) \, P(j, k). We can recover P(i|k) from W and both P(j|k) and P(k) from H. These updates look suspiciously similar to the NMF updates derived by Lee and Seung for the KL-divergence error criterion. The two algorithms are actually equivalent up to a scaling factor. In practice, the H matrix should be normalized explicitly.
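A sketch of the multiplicative updates (143)-(145) in NumPy; this is my own illustration, not Hofmann's or Lee and Seung's reference code. X is any nonnegative matrix normalized to sum to 1, and H (and the columns of W) are renormalized explicitly as suggested above.

```python
import numpy as np

def plsi_multiplicative(X, K, n_iter=500, seed=0):
    """Fit the symmetric PLSI factorization P = W H to a normalized nonnegative matrix X."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    eps = 1e-12
    W = rng.random((N, K))
    W /= W.sum(axis=0, keepdims=True)          # columns of W hold P(i|k)
    H = rng.random((K, D))
    H /= H.sum()                               # entries of H hold P(j,k)
    for _ in range(n_iter):
        Y = X / (W @ H + eps)                  # Y(i,j) = X(i,j)/P(i,j), eq. (143)
        W *= (Y @ H.T) / (np.ones((N, D)) @ H.T + eps)    # eq. (144)
        H *= W.T @ Y                                      # eq. (145)
        H /= H.sum()                           # explicit renormalization so P(j,k) sums to 1
        W /= W.sum(axis=0, keepdims=True)      # keep columns of W as distributions P(i|k)
    Pk = H.sum(axis=1)                         # P(k)
    Pj_given_k = H / Pk[:, None]               # P(j|k)
    return W, Pj_given_k, Pk

# Toy example: a small nonnegative "word x document" matrix, normalized to sum to 1.
X = np.array([[4., 0., 1.], [3., 1., 0.], [0., 5., 4.], [1., 4., 5.]])
X /= X.sum()
W, PjK, Pk = plsi_multiplicative(X, K=2)
print((W @ (PjK * Pk[:, None])).round(3))      # reconstruction of P(i,j), compare to X
```

With the column normalization of W, dividing by J H^T changes nothing per column (it is constant over i), so these updates coincide with the exact EM updates (137)-(139) up to the global scale that the explicit normalization of H removes.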
