Expectation Maximization - Math and Pictures Johannes Traa


This document covers the basics of the EM algorithm, maximum likelihood (ML) estimation, and maximum a posteriori (MAP) estimation. It also covers the EM derivations for the following mixture models:

Gaussian Mixture Model (GMM)
Wrapped Gaussian Mixture Model (WGMM)
von Mises Mixture Model (vMMM)
von Mises-Fisher Mixture Model (vMFMM)
Line Mixture Model (LineMM)
Laplacian Mixture Model (LapMM)
Probabilistic Latent Semantic Indexing (PLSI)

The best way to understand this stuff is to code it up. Plot everything. Excellent references for the EM algorithm and probabilistic methods:

Chapter 9: Mixture Models and EM in Pattern Recognition and Machine Learning (2006) (Bishop)
Chapter 10: Grouping and Model Fitting in Computer Vision: A Modern Approach (2012) (Forsyth, Ponce)
Machine Learning: A Probabilistic Perspective (2012) (Murphy)

1 BASIC IDEA

Parameter estimation is a general problem that shows up again and again. A typical situation is where we have collected some data and we want to summarize or find structure in that dataset. Take the following simple example. You are trying to model the interaction between good weather and the number of people at the beach (it's a silly example, but just roll with it). If we measure both of these quantities every day of the year and make a scatterplot of our data, it might look like the set of points in Figure 1. There's a positive correlation between good weather and people going to the beach, and the data is spread around a center point. Instead of keeping track of the entire dataset, we can represent it with a Gaussian distribution, which looks like a squished and rotated bell curve in 2 dimensions. Slices through the Gaussian (contours) are shown in Figure 1 as ellipses. This pretty much summarizes all the data with two parameters: a mean vector (2 x 1) and a covariance matrix (2 x 2).
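In the spirit of coding everything up, here is a minimal NumPy sketch (my own illustration, not from the text) that fits a single 2D Gaussian to a synthetic "weather vs. beach" dataset by computing the sample mean and covariance, then draws new samples from the fit, as in Figure 1. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "weather vs. beach" data: positively correlated 2D points.
true_mean = np.array([5.0, 200.0])
true_cov = np.array([[4.0, 30.0],
                     [30.0, 400.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=365)   # (365, 2)

# ML fit of a single Gaussian: sample mean and sample covariance.
mu = X.mean(axis=0)                       # (2,)
Sigma = (X - mu).T @ (X - mu) / len(X)    # (2, 2), divides by N as in the ML estimate

# Draw new samples from the fit to check that they spread like the data.
X_new = rng.multivariate_normal(mu, Sigma, size=365)
print(mu, Sigma, X_new.shape)
```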

Figure 1: (Left) Dataset representing the relationship between the number of people at the beach and the goodness of the weather. (Middle) Gaussian distribution fit to the data. (Right) Samples drawn from the Gaussian fit.

Figure 2: (Left) Multimodal dataset. (Middle) Gaussian fit. (Right) Mixture of Gaussians fit.

By fitting a Gaussian, we are implicitly making the assumption that it's reasonable to model the data as having been sampled from a Gaussian. If we generate data from our Gaussian fit, as shown in Figure 1, we can see that the samples are spread in the same way as the actual data. So, in this case, our implicit assumption is reasonable. But what if our data looks like that of Figure 2? The Gaussian assumption doesn't make sense here. At least, it looks like we can do better. The underlying distribution appears multimodal: it has multiple peaks. So, we can just fit multiple Gaussians instead of one.

To fit a single distribution, we typically apply the maximum likelihood (ML) or maximum a posteriori (MAP) method. The former tries to find the distribution that makes the most sense given the data, while the latter also takes into account our belief, independent of the data, of what that distribution should look like. The Expectation-Maximization (EM) algorithm is a straightforward way to fit mixture models that starts with an initial guess and iteratively improves the fit. Each iteration consists of two steps. The first assigns data points to clusters and the second re-estimates the cluster parameters according to the assignments. In EM, we typically associate each data point with each cluster with some probability rather than using binary assignments. The rest is technical details.
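To make the last point concrete, here is a small sketch (my own, not from the text) contrasting hard assignments with the soft, probabilistic assignments EM uses, for a toy 1D problem with two fixed clusters. The cluster parameters are made up for illustration.

```python
import numpy as np

x = np.array([-2.1, -0.3, 0.1, 1.8, 2.5])                     # toy 1D data points
means, stds, weights = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]    # two fixed clusters (illustrative)

# Hard assignment: each point goes entirely to the nearest cluster mean.
hard = np.argmin(np.abs(x[:, None] - np.array(means)[None, :]), axis=1)

# Soft assignment: posterior probability of each cluster given the point
# (weighted Gaussian densities, normalized across clusters). This is what EM's E-step computes.
dens = np.stack([w * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)
                 for w, m, s in zip(weights, means, stds)], axis=1)
soft = dens / dens.sum(axis=1, keepdims=True)

print(hard)           # [0 0 1 1 1]: every point belongs to exactly one cluster
print(soft.round(3))  # points near 0 get split between the two clusters
```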

2 MAXIMUM LIKELIHOOD FOR THE GAUSSIAN DISTRIBUTION

The maximum likelihood estimate of the mean of a Gaussian distribution is simple. The multivariate Gaussian distribution has pdf:

\[
\mathcal{N}(x;\mu,\Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} . \tag{1}
\]

The likelihood function for a dataset drawn i.i.d. from the distribution is:

\[
L = \prod_{i=1}^N \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)} , \tag{2}
\]

so the log likelihood is (dropping constants):

\[
\log L = -\frac{1}{2}\sum_{i=1}^N \log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) . \tag{3}
\]

Differentiating with respect to each parameter, setting the result equal to zero, and solving gives the ML parameter estimates:

\[
\frac{\partial \log L}{\partial \mu} = \sum_{i=1}^N \Sigma^{-1}(x_i-\mu) = 0 , \tag{4}
\]
\[
\mu_{ML} = \frac{1}{N}\sum_{i=1}^N x_i , \tag{5}
\]
\[
\frac{\partial \log L}{\partial \Sigma} = \sum_{i=1}^N \left[ -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T\Sigma^{-1} \right] = 0 , \tag{6}
\]
\[
\Sigma_{ML} = \frac{1}{N}\sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T . \tag{7}
\]

So the ML estimates are just the sample mean and covariance.

2.1 DATA WEIGHTING

We might also include a weight w_i for each data point to reflect how confident we are that it is reliable. We modify the likelihood as:

\[
L = \prod_{i=1}^N \left[ \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)} \right]^{w_i} , \tag{8}
\]

so the log likelihood is (dropping constants):

\[
\log L = -\frac{1}{2}\sum_{i=1}^N w_i \left[ \log|\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] . \tag{9}
\]

Maximizing with respect to the parameters gives the weighted ML estimates:

\[
\mu_{WML} = \frac{\sum_{i=1}^N w_i \, x_i}{\sum_{i=1}^N w_i} , \tag{10}
\]
\[
\Sigma_{WML} = \frac{\sum_{i=1}^N w_i \, (x_i-\mu)(x_i-\mu)^T}{\sum_{i=1}^N w_i} . \tag{11}
\]

3 MAXIMUM A POSTERIORI FOR THE GAUSSIAN DISTRIBUTION

We can regularize the mean and covariance estimates of a Gaussian distribution by incorporating prior information. This biases the solution towards what we believe it should look like before seeing any data. The conjugate distributions for the mean and covariance are the Gaussian and inverse-Wishart distributions, respectively. These conjugate priors ensure that the posterior distribution (likelihood x prior) is of the same form as the prior (this is merely a convenience at this point).

3.1 PRIOR ON THE MEAN

We can regularize the maximum likelihood solution by imposing a Gaussian prior on the mean:

\[
P(\mu;\mu_s,\Sigma_s) = \frac{1}{|2\pi\Sigma_s|^{1/2}} \, e^{-\frac{1}{2}(\mu-\mu_s)^T \Sigma_s^{-1} (\mu-\mu_s)} . \tag{12}
\]

Thus, we have:

\[
P = \left[ \prod_{i=1}^N P(x_i;\mu,\Sigma) \right] P(\mu;\mu_s,\Sigma_s) , \tag{13}
\]
\[
\log P = -\frac{1}{2}\sum_{i=1}^N \left[ \log|2\pi\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] - \frac{1}{2}\log|2\pi\Sigma_s| - \frac{1}{2}(\mu-\mu_s)^T \Sigma_s^{-1} (\mu-\mu_s) , \tag{14}
\]
\[
\frac{\partial \log P}{\partial \mu} = \sum_{i=1}^N \left[ \Sigma^{-1}(x_i-\mu) \right] - \Sigma_s^{-1}(\mu-\mu_s) = 0 . \tag{15}
\]

Solving for \mu, the MAP solution is:

\[
\mu_{MAP} = \left( N\,\Sigma_s\Sigma^{-1} + I \right)^{-1} \left( \Sigma_s\Sigma^{-1}\sum_{i=1}^N x_i + \mu_s \right) \tag{16}
\]
\[
= \left( N\,\Sigma_s\Sigma^{-1} + I \right)^{-1} \left( N\,\Sigma_s\Sigma^{-1}\mu_{ML} + \mu_s \right) . \tag{17}
\]
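A minimal NumPy sketch of the estimators above: the ML estimates (5) and (7), the weighted ML estimates (10) and (11), and the MAP estimate of the mean (17). This is my own illustration under the stated model; function and variable names are illustrative.

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates (5), (7): sample mean and covariance."""
    mu = X.mean(axis=0)
    d = X - mu
    return mu, d.T @ d / len(X)

def gaussian_weighted_ml(X, w):
    """Weighted ML estimates (10), (11) with per-point weights w_i."""
    w = np.asarray(w, dtype=float)
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    d = X - mu
    return mu, (w[:, None] * d).T @ d / w.sum()

def gaussian_map_mean(X, Sigma, mu_s, Sigma_s):
    """MAP estimate of the mean, eq. (17), with a Gaussian prior N(mu_s, Sigma_s)."""
    N, D = X.shape
    B = N * Sigma_s @ np.linalg.inv(Sigma)
    mu_ml = X.mean(axis=0)
    return np.linalg.solve(B + np.eye(D), B @ mu_ml + mu_s)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) + np.array([1.0, -2.0])
print(gaussian_ml(X)[0])
print(gaussian_map_mean(X, np.eye(2), np.zeros(2), 0.1 * np.eye(2)))
```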

Consider the 1D case for simplicity:

\[
\mu_{MAP} = \frac{\frac{\sigma_s^2}{\sigma^2} N \mu_{ML} + \mu_s}{\frac{\sigma_s^2}{\sigma^2} N + 1} . \tag{18}
\]

As \sigma_s^2/\sigma^2 \to \infty, the prior is uninformative, so \mu_{MAP} \to \mu_{ML}. And as \sigma_s^2/\sigma^2 \to 0, the data is uninformative (the prior is strict), so \mu_{MAP} \to \mu_s. When \sigma_s^2/\sigma^2 = 1, the prior behaves as if one additional measurement at \mu_s were present to calculate the ML solution. Also, as N \to \infty, the prior becomes redundant, so \mu_{MAP} \to \mu_{ML}.

3.2 PRIOR ON THE COVARIANCE

We can also regularize the solution for the covariance matrix using the inverse-Wishart distribution (with the appropriate degrees of freedom):

\[
P(\Sigma;\Sigma_0) \propto |\Sigma_0|^{\frac{n}{2}} \, |\Sigma|^{-\frac{n}{2}} \, e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma_0)} . \tag{19}
\]

Thus,

\[
P = \left[ \prod_{i=1}^N P(x_i;\mu,\Sigma) \right] P(\Sigma;\Sigma_0) , \tag{20}
\]
\[
\log P = -\frac{1}{2}\sum_{i=1}^N \left[ \log|2\pi\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] + \frac{n}{2}\log|\Sigma_0| - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\!\left(\Sigma^{-1}\Sigma_0\right) , \tag{21}
\]
\[
\frac{\partial \log P}{\partial \Sigma} = \sum_{i=1}^N \left[ -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T\Sigma^{-1} \right] - \frac{n}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\Sigma_0\Sigma^{-1} = 0 . \tag{22}
\]

Thus, we have that:

\[
\Sigma_{MAP} = \frac{\Sigma_0 + \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T}{n+N} = \frac{\Sigma_0 + N\,\Sigma_{ML}}{n+N} . \tag{23}
\]

Thanks to the conjugacy relationship between the likelihood and the prior, the MAP solution is very intuitive. The parameter n in the inverse-Wishart distribution controls how confident we are that \Sigma_0 is the correct estimate. If we are not very confident, n is set to a small number and a moderate number N of data samples will cause the MAP estimate to ignore the prior.

4 THE MATH BEHIND EM

Expectation-Maximization (EM) is a learning algorithm for maximum-likelihood problems with hidden variables. In the case of a mixture model, we have observed data/variables X,

unobserved data/variables Z, and parameters Θ (to be learned). The hidden variables Z indicate how the observed data X are assigned to the mixture components. The complete data likelihood for a mixture model with i.i.d. samples is

\[
L = \prod_{i=1}^N P(x_i, z_i; \Theta) \tag{24}
\]
\[
= \prod_{i=1}^N \prod_{k=1}^K \left[ P(x_i \mid z_{ik}; \theta_k) \, P(z_k; \theta_k) \right]^{z_{ik}} \tag{25}
\]
\[
= \prod_{i=1}^N \prod_{k=1}^K \left[ P(x_i; \theta_k) \, P(z_k; \theta_k) \right]^{z_{ik}} \tag{26}
\]
\[
= \prod_{i=1}^N \prod_{k=1}^K \left[ P(x_i; \theta_k) \, \pi_k \right]^{z_{ik}} . \tag{27}
\]

P(x_i, z_i; Θ) is the complete data likelihood for the i-th observation x_i, P(x_i; θ_k) is the probability model (pdf) of the k-th component in the mixture evaluated at x_i, and π_k = P(z_k; θ_k) is the mixing weight of the k-th component. The hidden variables z_{ik} are treated as indicator variables for each i in the above notation. So, for the i-th observation x_i, z_{ik} takes the value 1 for a single index k and 0 for all others. This has the effect of selecting one term in the product over k for each i. It's easier to work with the log likelihood, in which case we have

\[
\log L = \sum_{i=1}^N \sum_{k=1}^K z_{ik} \log\!\left[ P(x_i; \theta_k) \, \pi_k \right] . \tag{28}
\]

This is easy to maximize w.r.t. the parameters θ_k if we know the values of the indicator variables z_{ik}. In that case, we can just estimate the parameters for the k-th component using all the data whose indicator is active for that component (i.e. z_{ik} = 1). Seeing as we don't know these data associations, we can first lower-bound the log likelihood by taking its expected value w.r.t. the hidden variables (this requires Jensen's inequality). This gives what is known as the "Q function":

\[
Q = E_{z \mid x, \Theta_{old}}\!\left[ \log L \right] \tag{29}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \alpha_{ik} \log\!\left[ P(x_i; \theta_k) \, \pi_k \right] , \tag{30}
\]

where

\[
\alpha_{ik} = E_{z \mid x, \Theta_{old}}\!\left[ z_{ik} \right] \tag{31}
\]
\[
= P(z_{ik} = 1 \mid x_i; \Theta_{old}) \tag{32}
\]
\[
= \frac{P(x_i \mid z_{ik}; \theta_k^{old}) \, P(z_{ik}; \theta_k^{old})}{\sum_{l=1}^K P(x_i \mid z_{il}; \theta_l^{old}) \, P(z_{il}; \theta_l^{old})} \tag{33}
\]
\[
= \frac{P(x_i \mid z_{ik}; \theta_k^{old}) \, \pi_k}{\sum_{l=1}^K P(x_i \mid z_{il}; \theta_l^{old}) \, \pi_l} \tag{34}
\]

represents our belief that the k-th component in the mixture is responsible for generating the i-th observation. (32) follows since the expectation of an indicator variable is its probability of being 1. The Q function is easier to maximize and leads to the EM algorithm. In the E step, we fix the current estimate of the parameters Θ and calculate the posterior probabilities α_{ik}. This captures how much information each data point x_i contributes in estimating the parameters of each component θ_k. Then, in the M step, we use these posteriors as soft weights to update the model parameters via maximization of (30). Data points with higher weights for a specific value of k will exert more influence on the update of the k-th component's parameters. After the M step, Θ has changed, so the α_{ik} have changed. We can re-estimate α_{ik}, update Θ, and repeat until convergence. This procedure never decreases the data likelihood and converges to a local maximum in practice.

5 GAUSSIAN MIXTURE MODEL (GMM)

The model is a K-component Mixture of Gaussians (MoG). All data is drawn independently from this mixture. The likelihood function is given by

\[
L = \prod_{i=1}^N \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) , \tag{35}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) . \tag{36}
\]

The Q function is given by

Figure 3: Mixture of Gaussians fit to a dataset in 2 dimensions. Each Gaussian is depicted by its mean µ (black "+"), covariance Σ (1-σ ellipse), and mixing weight π (transparency). Data points are colored by their posterior probabilities η_{ik}.

\[
Q = E_{z \mid x, \Theta^{(t)}}\!\left[ \log P(x, z \mid \Theta) \right] \tag{37}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \log\!\left[ P(z_k \mid \Theta) \, P(x_i \mid z_k; \Theta) \right] P(z_k \mid x_i; \Theta^{(t)}) \tag{38}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \log\!\left[ \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) \right] \eta_{ik} \tag{39}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \log\!\left[ \frac{\pi_k}{|2\pi\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(\mu_k - x_i)^T \Sigma_k^{-1} (\mu_k - x_i)} \right] \eta_{ik} \tag{40}
\]
\[
= \sum_{i=1}^N \sum_{k=1}^K \left[ \log \pi_k - \frac{1}{2}\log|2\pi\Sigma_k| - \frac{1}{2}(\mu_k - x_i)^T \Sigma_k^{-1} (\mu_k - x_i) \right] \eta_{ik} , \tag{41}
\]

where Θ is the parameter set to solve for, Θ^{(t)} is the previous iteration's parameters, and

η_{ik} = P(z_k | x_i; Θ^{(t)}) is the posterior probability of each hidden variable given the parameters from the previous iteration, given by

\[
\eta_{ik} = P(z_k \mid x_i; \Theta^{(t)}) = \frac{P(x_i \mid z_k; \Theta^{(t)}) \, P(z_k \mid \Theta^{(t)})}{P(x_i \mid \Theta^{(t)})} = \frac{P(x_i \mid z_k; \Theta^{(t)}) \, P(z_k \mid \Theta^{(t)})}{\sum_{k=1}^K P(x_i \mid z_k; \Theta^{(t)}) \, P(z_k \mid \Theta^{(t)})} . \tag{42}
\]

The hidden variables indicate what cluster each data point is generated from. In each iteration, we need to optimize the Q function in each coordinate of the parameter space. To do this, we take derivatives with respect to each of the parameters, set the result to zero, and solve for the locally optimal new values:

\[
\frac{\partial Q}{\partial \mu_k} = -\sum_{i=1}^N \Sigma_k^{-1}(\mu_k - x_i) \, \eta_{ik} = 0 , \tag{43}
\]
\[
\frac{\partial Q}{\partial \Sigma_k} = -\frac{1}{2}\sum_{i=1}^N \left[ \Sigma_k^{-1} - \Sigma_k^{-1}(\mu_k - x_i)(\mu_k - x_i)^T \Sigma_k^{-1} \right] \eta_{ik} = 0 , \tag{44}
\]
\[
\frac{\partial}{\partial \pi_k}\left( Q + \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ik}}{\pi_k} + \lambda = 0 . \tag{45}
\]

(Lagrange multipliers are used to enforce equality constraints.) Re-arranging terms and solving for the new model parameters, we get

\[
\mu_k = \frac{\sum_{i=1}^N \eta_{ik} \, x_i}{\sum_{i=1}^N \eta_{ik}} , \tag{46}
\]
\[
\Sigma_k = \frac{\sum_{i=1}^N \eta_{ik} \, (\mu_k - x_i)(\mu_k - x_i)^T}{\sum_{i=1}^N \eta_{ik}} , \tag{47}
\]
\[
\pi_k = \frac{1}{N}\sum_{i=1}^N \eta_{ik} . \tag{48}
\]

These form the M-step update equations for the GMM fitting algorithm. It's interesting to note that the mean and covariance updates are just weighted ML estimates. The posterior probability η_{ik} corresponds to how confident we are that the i-th data point was sampled from the k-th Gaussian.
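Putting the E step (42) and the M step (46)-(48) together, here is a compact NumPy sketch of GMM fitting by EM. It is a minimal illustration (my own, not the author's reference code) and skips the usual numerical safeguards such as log-sum-exp; the small ridge added to the covariances is only there to keep the toy example stable.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma) evaluated at the rows of X."""
    diff = X - mu
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('nd,de,ne->n', diff, Sinv, diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def gmm_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize: random data points as means, shared covariance, uniform weights.
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: posterior responsibilities eta_ik, eq. (42).
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        eta = dens / dens.sum(axis=1, keepdims=True)            # (N, K)
        # M step: posterior-weighted ML updates, eqs. (46)-(48).
        Nk = eta.sum(axis=0)                                    # (K,)
        mu = (eta.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (eta[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return mu, Sigma, pi, eta

# Example: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3, 0], 1.0, (200, 2)), rng.normal([3, 1], 0.5, (200, 2))])
mu, Sigma, pi, eta = gmm_em(X, K=2)
print(mu.round(2), pi.round(2))
```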

5.1 GMM WITH PRIORS AND DATA WEIGHTING

If we place priors on the means and/or covariances, the results are simply posterior-weighted MAP estimators. For example, with a Gaussian prior N(ν_k, Ω_k) on the k-th mean, the MAP update for the means is:

\[
\mu_k = \left[ \Omega_k \Sigma_k^{-1} \sum_{i=1}^N \eta_{ik} + I \right]^{-1} \left[ \Omega_k \Sigma_k^{-1} \sum_{i=1}^N \eta_{ik} \, x_i + \nu_k \right] . \tag{49}
\]

We can also include data weighting as in the case of a single Gaussian. The weights just multiply the posteriors:

\[
\tilde{\eta}_{ik} = \eta_{ik} \, w_i . \tag{50}
\]

6 WRAPPED GAUSSIAN MIXTURE MODEL (WGMM)

We can also derive a procedure for fitting a GMM on a torus.

6.1 UNIVARIATE WGMM

Figure 4: Mixture of univariate wrapped Gaussians (WG) fit to a dataset. WG components (red) linearly combine to form a mixture (blue) that describes the distribution of the data (bars).

In the 1D case, the torus is just a circle. This is useful for when we have data that lies on a circular axis in the range [-π, π]. The EM update equations are derived as in the regular

GMM case.

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \, \mathrm{wN}(x_i; \mu_j, \sigma_j^2) \tag{51}
\]
\[
= \prod_{i=1}^N \sum_{j=1}^K \pi_j \sum_{l=-\infty}^{\infty} \mathcal{N}(x_i; \mu_j + 2\pi l, \sigma_j^2) , \tag{52}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \sum_{l=-\infty}^{\infty} \mathcal{N}(x_i; \mu_j + 2\pi l, \sigma_j^2) . \tag{53}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \sum_{l=-\infty}^{\infty} \log\!\left[ \pi_j \, \mathcal{N}(x_i; \mu_j + 2\pi l, \sigma_j^2) \right] \eta_{ijl} \tag{54}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \sum_{l=-\infty}^{\infty} \left[ \log \pi_j - \frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma_j^2 - \frac{(x_i - \mu_j - 2\pi l)^2}{2\sigma_j^2} \right] \eta_{ijl} . \tag{55}
\]

Partial derivatives:

\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^N \sum_{l} \frac{x_i - \mu_j - 2\pi l}{\sigma_j^2} \, \eta_{ijl} = 0 , \tag{56}
\]
\[
\frac{\partial Q}{\partial \sigma_j^2} = \sum_{i=1}^N \sum_{l} \left[ -\frac{1}{2\sigma_j^2} + \frac{(x_i - \mu_j - 2\pi l)^2}{2(\sigma_j^2)^2} \right] \eta_{ijl} = 0 , \tag{57}
\]
\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \sum_{l} \frac{\eta_{ijl}}{\pi_j} + \lambda = 0 . \tag{58}
\]

Update rules:

\[
\mu_j = \frac{\sum_{i=1}^N \sum_{l} (x_i - 2\pi l) \, \eta_{ijl}}{\sum_{i=1}^N \sum_{l} \eta_{ijl}} , \tag{59}
\]
\[
\sigma_j^2 = \frac{\sum_{i=1}^N \sum_{l} (x_i - \mu_j - 2\pi l)^2 \, \eta_{ijl}}{\sum_{i=1}^N \sum_{l} \eta_{ijl}} , \tag{60}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \sum_{l} \eta_{ijl} . \tag{61}
\]
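A small sketch (mine, not from the text) of evaluating the wrapped Gaussian density and the responsibilities η_{ijl}, truncating the sum over l to |l| <= L as discussed in the next paragraph. The function names are illustrative.

```python
import numpy as np

def wrapped_normal_pdf(x, mu, var, L=3):
    """Wrapped Gaussian on [-pi, pi): sum of N(x; mu + 2*pi*l, var) truncated to |l| <= L."""
    x = np.atleast_1d(x)
    l = np.arange(-L, L + 1)
    shifted = x[:, None] - mu - 2 * np.pi * l[None, :]             # (N, 2L+1)
    comps = np.exp(-0.5 * shifted ** 2 / var) / np.sqrt(2 * np.pi * var)
    return comps.sum(axis=1)

def wgmm_responsibilities(x, mus, vars_, pis, L=3):
    """eta_{ijl}: posterior over component j and wrap index l for each point x_i."""
    x = np.atleast_1d(x)
    l = np.arange(-L, L + 1)
    # dens[i, j, l] = pi_j * N(x_i; mu_j + 2*pi*l, var_j)
    shifted = x[:, None, None] - np.asarray(mus)[None, :, None] - 2 * np.pi * l[None, None, :]
    dens = np.asarray(pis)[None, :, None] * np.exp(-0.5 * shifted ** 2 / np.asarray(vars_)[None, :, None])
    dens /= np.sqrt(2 * np.pi * np.asarray(vars_))[None, :, None]
    return dens / dens.sum(axis=(1, 2), keepdims=True)              # (N, K, 2L+1)

x = np.array([3.0, -3.0, 0.1])
print(wrapped_normal_pdf(x, mu=np.pi, var=0.2))    # points near +pi and -pi get similar density
print(wgmm_responsibilities(x, mus=[0.0, np.pi], vars_=[0.3, 0.3], pis=[0.5, 0.5]).sum(axis=(1, 2)))
```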

In practice, we can't evaluate expressions with an infinite number of terms numerically, so the WGs need to be truncated after a sufficient number of terms. This involves replacing \sum_{l=-\infty}^{\infty}(\cdot) with \sum_{l=-L}^{L}(\cdot).

6.2 BIVARIATE WGMM

Figure 5: Mixture of bivariate wrapped Gaussians (WG) fit to a dataset. Data points are colored by posterior probability.

When there are multiple circular axes to consider, we can make use of the multivariate WG distribution. For the case of two dimensions, we have:

\[
P(x; \mu, \Sigma) = \sum_{l_1, l_2} \mathcal{N}\!\left(x; \, \mu + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma\right) , \qquad x \in S^2 . \tag{63}
\]

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \sum_{l_1, l_2} \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right) , \tag{64}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \sum_{l_1, l_2} \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right) . \tag{65}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \sum_{l_1, l_2} \log\!\left[ \pi_j \, \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right) \right] \eta_{ijl_1l_2} \tag{66}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \sum_{l_1, l_2} \left[ \log \pi_j - \frac{1}{2}\log\left|2\pi\Sigma_j\right| - \frac{1}{2}\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)^T \Sigma_j^{-1} \left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right) \right] \eta_{ijl_1l_2} , \tag{67}
\]

where the posteriors are

\[
\eta_{ijl_1l_2} = \frac{\pi_j \, \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right)}{\sum_{j=1}^K \sum_{l_1, l_2} \pi_j \, \mathcal{N}\!\left(x_i; \, \mu_j + 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \, \Sigma_j\right)} . \tag{68}
\]

Partial derivatives:

\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^N \sum_{l_1, l_2} \Sigma_j^{-1}\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right) \eta_{ijl_1l_2} = 0 , \tag{69}
\]
\[
\frac{\partial Q}{\partial \Sigma_j} = -\frac{1}{2}\sum_{i=1}^N \sum_{l_1, l_2} \left[ \Sigma_j^{-1} - \Sigma_j^{-1}\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)^T \Sigma_j^{-1} \right] \eta_{ijl_1l_2} = 0 , \tag{70}
\]
\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \sum_{l_1, l_2} \frac{\eta_{ijl_1l_2}}{\pi_j} + \lambda = 0 . \tag{71}
\]

Update rules:

\[
\mu_j = \frac{\sum_{i=1}^N \sum_{l_1, l_2} \left(x_i - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right) \eta_{ijl_1l_2}}{\sum_{i=1}^N \sum_{l_1, l_2} \eta_{ijl_1l_2}} , \tag{72}
\]
\[
\Sigma_j = \frac{\sum_{i=1}^N \sum_{l_1, l_2} \left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)\left(x_i - \mu_j - 2\pi \begin{bmatrix} l_1 \\ l_2 \end{bmatrix}\right)^T \eta_{ijl_1l_2}}{\sum_{i=1}^N \sum_{l_1, l_2} \eta_{ijl_1l_2}} , \tag{73}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \sum_{l_1, l_2} \eta_{ijl_1l_2} . \tag{74}
\]

7 VON MISES MIXTURE MODEL (VMMM)

We can also cluster on the unit circle with the von Mises distribution, whose pdf is:

\[
\mathrm{vM}(x; \mu_j, \kappa_j) = \frac{1}{2\pi I_0(\kappa_j)} \, e^{\kappa_j \cos(x - \mu_j)} . \tag{75}
\]

Because the vM has a cos(\cdot) term, we will have to numerically update the concentration parameter κ. Otherwise, the derivation is standard.

7.1 UNIVARIATE VMMM

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \, \mathrm{vM}(x_i; \mu_j, \kappa_j) , \tag{76}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \, \mathrm{vM}(x_i; \mu_j, \kappa_j) . \tag{77}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \pi_j \, \mathrm{vM}(x_i; \mu_j, \kappa_j) \right] \eta_{ij} \tag{78}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \frac{\pi_j}{2\pi I_0(\kappa_j)} \, e^{\kappa_j \cos(x_i - \mu_j)} \right] \eta_{ij} \tag{79}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \left[ \log \pi_j - \log(2\pi) - \log I_0(\kappa_j) + \kappa_j \cos(x_i - \mu_j) \right] \eta_{ij} . \tag{80}
\]

I_0(\cdot) is the 0th-order modified Bessel function of the first kind.

Partial derivatives:

\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ij}}{\pi_j} + \lambda = 0 , \tag{81}
\]
\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^N \kappa_j \sin(x_i - \mu_j) \, \eta_{ij} \tag{82}
\]
\[
= \sum_{i=1}^N \kappa_j \left[ \sin(x_i)\cos(\mu_j) - \cos(x_i)\sin(\mu_j) \right] \eta_{ij} = 0 , \tag{83}
\]
\[
\frac{\partial Q}{\partial \kappa_j} = \sum_{i=1}^N \left[ -\frac{I_1(\kappa_j)}{I_0(\kappa_j)} + \cos(x_i - \mu_j) \right] \eta_{ij} \tag{84}
\]
\[
= \sum_{i=1}^N \left[ -A(\kappa_j) + \cos(x_i - \mu_j) \right] \eta_{ij} = 0 . \tag{85}
\]

Update rules:

\[
\mu_j = \tan^{-1}\!\left( \frac{\sum_{i=1}^N \sin(x_i) \, \eta_{ij}}{\sum_{i=1}^N \cos(x_i) \, \eta_{ij}} \right) , \tag{87}
\]
\[
A(\kappa_j) = \frac{\sum_{i=1}^N \cos(x_i - \mu_j) \, \eta_{ij}}{\sum_{i=1}^N \eta_{ij}} , \qquad A(\kappa_j) \equiv \frac{I_1(\kappa_j)}{I_0(\kappa_j)} , \tag{88}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \eta_{ij} . \tag{89}
\]

We can solve for κ_j with a standard zero-finder (e.g. bisection search). Notice that the vM distribution has wrapping built into its definition, whereas a truncated WG is a good approximation.
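A sketch of the zero-finder just mentioned for the concentration update (88): since A(κ) = I_1(κ)/I_0(κ) is monotonically increasing in κ, a simple bisection search recovers κ_j from the posterior-weighted mean resultant value. This is my own illustration; the function names are not from the text.

```python
from scipy.special import i0e, i1e   # exponentially scaled Bessel functions (avoid overflow)

def A(kappa):
    """A(kappa) = I1(kappa)/I0(kappa); the e^-kappa scaling cancels in the ratio."""
    return i1e(kappa) / i0e(kappa)

def solve_kappa(r, tol=1e-10, kappa_max=1e4):
    """Solve A(kappa) = r for kappa by bisection; A is monotonically increasing on (0, inf)."""
    lo, hi = 1e-12, kappa_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if A(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

kappa_true = 2.5
print(solve_kappa(A(kappa_true)))   # approximately 2.5
```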

8 VON MISES-FISHER MIXTURE MODEL (VMFMM)

The von Mises-Fisher is a convenient distribution for modeling uncertainty on the unit 2-sphere. The pdf is parameterized by a mean direction µ and concentration κ:

\[
P(x; \mu, \kappa) = \frac{\kappa}{2\pi\left(e^{\kappa} - e^{-\kappa}\right)} \, e^{\kappa \mu^T x} , \qquad \|\mu\|_2 = 1 . \tag{90}
\]

Likelihood:

\[
L = \prod_{i=1}^N \sum_{j=1}^K \pi_j \, \mathrm{vMF}(x_i; \mu_j, \kappa_j) , \tag{91}
\]
\[
\log L = \sum_{i=1}^N \log \sum_{j=1}^K \pi_j \, \mathrm{vMF}(x_i; \mu_j, \kappa_j) . \tag{92}
\]

Q function:

\[
Q = \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \pi_j \, \mathrm{vMF}(x_i; \mu_j, \kappa_j) \right] \eta_{ij} \tag{93}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \log\!\left[ \frac{\pi_j \, \kappa_j}{2\pi\left(e^{\kappa_j} - e^{-\kappa_j}\right)} \, e^{\kappa_j \mu_j^T x_i} \right] \eta_{ij} \tag{94}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^K \left[ \log \pi_j + \log \kappa_j - \log(2\pi) - \log\!\left(e^{\kappa_j} - e^{-\kappa_j}\right) + \kappa_j \mu_j^T x_i \right] \eta_{ij} . \tag{95}
\]

Figure 6: Mixture of von Mises-Fisher distributions on the sphere. von Mises-Fisher distributions are denoted by their mean µ (black "+") and concentration κ (ellipses). Data points are colored by their posterior probabilities η_{ij}.

Partial derivatives:

\[
\frac{\partial}{\partial \mu_j}\left( Q + \lambda\left( \mu_j^T \mu_j - 1 \right) \right) = \sum_{i=1}^N \kappa_j \, x_i \, \eta_{ij} - \lambda \mu_j = 0 , \qquad \mu_j^T \mu_j = 1 , \tag{97}
\]
\[
\frac{\partial Q}{\partial \kappa_j} = \sum_{i=1}^N \left[ \frac{1}{\kappa_j} - \frac{e^{\kappa_j} + e^{-\kappa_j}}{e^{\kappa_j} - e^{-\kappa_j}} + \mu_j^T x_i \right] \eta_{ij} = 0 , \tag{98}
\]
\[
\frac{\partial}{\partial \pi_j}\left( Q + \lambda\Big(\sum_{j=1}^K \pi_j - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ij}}{\pi_j} + \lambda = 0 . \tag{99}
\]

Update rules:

\[
\mu_j = \frac{\sum_{i=1}^N x_i \, \eta_{ij}}{\left\| \sum_{i=1}^N x_i \, \eta_{ij} \right\|_2} , \tag{100}
\]
\[
A(\kappa_j) \equiv \frac{e^{\kappa_j} + e^{-\kappa_j}}{e^{\kappa_j} - e^{-\kappa_j}} - \frac{1}{\kappa_j} = \frac{\sum_{i=1}^N \hat{\mu}_j^T x_i \, \eta_{ij}}{\sum_{i=1}^N \eta_{ij}} , \tag{101}
\]
\[
\pi_j = \frac{1}{N}\sum_{i=1}^N \eta_{ij} . \tag{102}
\]

The update of the concentration parameters is a pain, but there are good approximations. For κ ≥ 3 and large A(κ_j), the following can be used as an update approximation (from Mardia and Jupp, 2000, pg. 198):

\[
\kappa_j \approx \frac{1}{1 - A(\kappa_j)} . \tag{103}
\]

Even when the conditions of the approximation are not met, the clustering is sufficiently stable and accurate.
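A sketch of the vMF M-step pieces (100)-(103): normalize the posterior-weighted resultant vector to get the mean direction, and use the large-κ approximation for the concentration. This is my own illustration under the 3D (unit 2-sphere) parameterization above; names are illustrative.

```python
import numpy as np

def vmf_mstep(X, eta):
    """M-step for a vMF mixture on the unit sphere.

    X: (N, 3) unit vectors; eta: (N, K) posterior responsibilities.
    Returns mean directions (100), concentrations via approximation (103),
    and mixing weights (102).
    """
    N, K = eta.shape
    S = eta.T @ X                                                 # (K, 3) weighted resultant vectors
    mu = S / np.linalg.norm(S, axis=1, keepdims=True)             # eq. (100)
    A = np.einsum('kd,nd,nk->k', mu, X, eta) / eta.sum(axis=0)    # posterior-weighted mean of mu^T x, eq. (101)
    kappa = 1.0 / np.maximum(1.0 - A, 1e-8)                       # approximation (103), intended for concentrated clusters
    pi = eta.sum(axis=0) / N                                      # eq. (102)
    return mu, kappa, pi

# Toy check: points clustered around the north pole, one component.
rng = np.random.default_rng(0)
X = rng.normal([0, 0, 5], 1.0, (500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
mu, kappa, pi = vmf_mstep(X, np.ones((500, 1)))
print(mu.round(2), kappa.round(1), pi)
```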

9 LINE MIXTURE MODEL (LINEMM)

Here we derive the update equations for fitting a mixture of lines to a 2D dataset. In essence, we place 1D Gaussian distributions on each data point and measure error (negative likelihood) evaluated at the lines in the (vertical) y-coordinate. This is the multi-line extension of linear regression. The mixture model likelihood has the form:

\[
L = \prod_{i=1}^N \sum_{k=1}^K \pi_k \, \mathcal{N}\!\left(y_i; f_k(x_i), \sigma_k^2\right) \tag{104}
\]
\[
= \prod_{i=1}^N \sum_{k=1}^K \frac{\pi_k}{\sqrt{2\pi\sigma_k^2}} \, e^{-\frac{\left(y_i - f_k(x_i)\right)^2}{2\sigma_k^2}} , \tag{105}
\]

where

\[
f_k(x_i) = a_k x_i + b_k . \tag{106}
\]

The Q function is:

\[
Q = \sum_{i=1}^N \sum_{k=1}^K \left[ \log \pi_k - \frac{1}{2}\log\sigma_k^2 - \frac{\left(y_i - (a_k x_i + b_k)\right)^2}{2\sigma_k^2} \right] \eta_{ik} . \tag{107}
\]

Figure 7: Mixture of lines. Data points are colored by their posterior probabilities η_{ik}.

Taking derivatives,

\[
\frac{\partial Q}{\partial a_k} = \sum_{i=1}^N \frac{\left(y_i - a_k x_i - b_k\right) x_i}{\sigma_k^2} \, \eta_{ik} = 0 , \tag{108}
\]
\[
\frac{\partial Q}{\partial b_k} = \sum_{i=1}^N \frac{y_i - a_k x_i - b_k}{\sigma_k^2} \, \eta_{ik} = 0 , \tag{109}
\]
\[
\frac{\partial Q}{\partial \sigma_k^2} = \sum_{i=1}^N \left[ \frac{\left(y_i - (a_k x_i + b_k)\right)^2}{2(\sigma_k^2)^2} - \frac{1}{2\sigma_k^2} \right] \eta_{ik} = 0 , \tag{110}
\]
\[
\frac{\partial}{\partial \pi_k}\left( Q + \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ik}}{\pi_k} + \lambda = 0 , \qquad \sum_{k=1}^K \pi_k = 1 . \tag{111}
\]

Re-arranging and solving for the parameters gives

\[
\hat{a}_k = \frac{\sum_{i=1}^N x_i \left(y_i - b_k\right) \eta_{ik}}{\sum_{i=1}^N x_i^2 \, \eta_{ik}} , \tag{112}
\]
\[
\hat{b}_k = \frac{\sum_{i=1}^N \left(y_i - a_k x_i\right) \eta_{ik}}{\sum_{i=1}^N \eta_{ik}} , \tag{113}
\]
\[
\sigma_k^2 = \frac{\sum_{i=1}^N \left(y_i - (a_k x_i + b_k)\right)^2 \eta_{ik}}{\sum_{i=1}^N \eta_{ik}} , \tag{114}
\]
\[
\pi_k = \frac{1}{N}\sum_{i=1}^N \eta_{ik} . \tag{115}
\]
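A sketch of one EM iteration for the line mixture: the E step uses the Gaussian residual likelihood (105) and the M step applies (112)-(115), updating a_k with b_k held fixed and then b_k with the new a_k, mirroring the coordinate-wise derivation. My own illustration; names are illustrative.

```python
import numpy as np

def linemm_em_step(x, y, a, b, var, pi):
    """One EM iteration for a mixture of lines y = a_k x + b_k + noise."""
    a, b, var, pi = map(np.asarray, (a, b, var, pi))
    # E step: responsibilities from the per-line Gaussian residual likelihood (105).
    resid = y[:, None] - (a[None, :] * x[:, None] + b[None, :])          # (N, K)
    dens = pi * np.exp(-0.5 * resid ** 2 / var) / np.sqrt(2 * np.pi * var)
    eta = dens / dens.sum(axis=1, keepdims=True)
    # M step: eqs. (112)-(115).
    Nk = eta.sum(axis=0)
    a_new = (eta * (x[:, None] * (y[:, None] - b[None, :]))).sum(axis=0) / (eta * x[:, None] ** 2).sum(axis=0)
    b_new = (eta * (y[:, None] - a_new[None, :] * x[:, None])).sum(axis=0) / Nk
    resid_new = y[:, None] - (a_new[None, :] * x[:, None] + b_new[None, :])
    var_new = (eta * resid_new ** 2).sum(axis=0) / Nk
    pi_new = Nk / len(x)
    return a_new, b_new, var_new, pi_new, eta

# Toy data from two lines.
rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, 400)
y = np.where(rng.random(400) < 0.5, 2 * x + 1, -x + 4) + 0.3 * rng.normal(size=400)
params = (np.array([1.0, -2.0]), np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5]))
for _ in range(50):
    *params, eta = linemm_em_step(x, y, *params)
print([p.round(2) for p in params])
```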

10 LAPLACIAN MIXTURE MODEL (LAPMM)

Figure 8: Mixture of Laplacian distributions.

Now we derive the update equations for the M-step to fit a 1D mixture of Laplacian distributions. This is very much like the derivation for the GMM except that the distribution used is a Laplacian rather than a Gaussian.

\[
L = \prod_{i=1}^N \sum_{k=1}^K \pi_k \, \mathcal{L}(x_i; \mu_k, b_k) \tag{116}
\]
\[
= \prod_{i=1}^N \sum_{k=1}^K \frac{\pi_k}{2 b_k} \, e^{-\frac{|x_i - \mu_k|}{b_k}} , \tag{117}
\]
\[
Q = \sum_{i=1}^N \sum_{k=1}^K \left[ \log \pi_k - \log b_k - \frac{|x_i - \mu_k|}{b_k} \right] \eta_{ik} . \tag{118}
\]

Taking derivatives,

\[
\frac{\partial Q}{\partial \mu_k} = \sum_{i=1}^N \frac{x_i - \mu_k}{b_k \, |x_i - \mu_k|} \, \eta_{ik} = 0 , \tag{119}
\]
\[
\frac{\partial Q}{\partial b_k} = \sum_{i=1}^N \left[ -\frac{1}{b_k} + \frac{|x_i - \mu_k|}{b_k^2} \right] \eta_{ik} = 0 , \tag{120}
\]
\[
\frac{\partial}{\partial \pi_k}\left( Q + \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big) \right) = \sum_{i=1}^N \frac{\eta_{ik}}{\pi_k} + \lambda = 0 , \qquad \sum_{k=1}^K \pi_k = 1 . \tag{121}
\]

We need to assume that the denominator |x_i - \mu_k| in equation (119) is a constant for each i and k to continue. This leads to a stable algorithm in practice. Re-arranging and solving for the parameters gives

\[
\mu_k = \frac{\sum_{i=1}^N \dfrac{x_i \, \eta_{ik}}{|x_i - \mu_k|}}{\sum_{i=1}^N \dfrac{\eta_{ik}}{|x_i - \mu_k|}} , \tag{122}
\]
\[
b_k = \frac{\sum_{i=1}^N |x_i - \mu_k| \, \eta_{ik}}{\sum_{i=1}^N \eta_{ik}} , \tag{123}
\]
\[
\pi_k = \frac{1}{N}\sum_{i=1}^N \eta_{ik} . \tag{124}
\]
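A sketch of the Laplacian M-step (122)-(124), treating the |x_i - mu_k| denominators as constants computed from the previous means, as assumed above. My own illustration; names are illustrative.

```python
import numpy as np

def lapmm_mstep(x, eta, mu_old):
    """M-step for a 1D Laplacian mixture, eqs. (122)-(124).

    The |x_i - mu_k| weights are computed from the previous means (mu_old),
    matching the constant-denominator assumption made in the derivation.
    """
    eps = 1e-12
    w = eta / (np.abs(x[:, None] - mu_old[None, :]) + eps)        # (N, K)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)             # eq. (122)
    b = (eta * np.abs(x[:, None] - mu[None, :])).sum(axis=0) / eta.sum(axis=0)   # eq. (123)
    pi = eta.mean(axis=0)                                         # eq. (124)
    return mu, b, pi

# Quick check with a single component; iterating this update drives mu toward the
# posterior-weighted median, which is the ML location parameter of a Laplacian.
x = np.array([-4.0, -3.5, 0.0, 3.0, 3.2, 3.6])
eta = np.ones((6, 1))
print(lapmm_mstep(x, eta, mu_old=np.array([0.5])))
```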

11 PROBABILISTIC LATENT SEMANTIC INDEXING (PLSI)

PLSI is a model that represents data in the probability (canonical) simplex as convex combinations of categorical distributions referred to as "basis vectors," "endmembers," and "topics." Here, we look at a derivation of the PLSI update equations used by Hofmann in his paper. This is not a conventional application of the EM algorithm, but the details can be obscured without affecting the derivation.

Figure 9: PLSI clustering on a 3-component probability simplex. Data is colored by its activation weight P(i|k).

We consider the following factorization:

\[
P(i, j) = \sum_{k=1}^K P(i, j \mid k) \, P(k) = \sum_{k=1}^K P(i \mid k) \, P(j \mid k) \, P(k) , \tag{125}
\]

where i is the word (dimension) index and j is the document (data point) index. The word and document variables are assumed to be independent given the latent variable z, which captures which topic generated each data point. The log likelihood is just the negative cross-entropy between an observed data distribution X(i, j) and its reconstruction P(i, j) under the (symmetric) PLSI model:

\[
L = \sum_{i=1}^N \sum_{j=1}^D X(i, j) \log P(i, j) \tag{126}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^D X(i, j) \log \sum_{k=1}^K P(i \mid k) \, P(j \mid k) \, P(k) , \tag{127}
\]

where we require that

\[
\sum_{i=1}^N P(i \mid k) = 1 , \tag{128}
\]
\[
\sum_{j=1}^D P(j \mid k) = 1 , \tag{129}
\]
\[
\sum_{k=1}^K P(k) = 1 . \tag{130}
\]

The Q function is:

\[
Q = \sum_{i=1}^N \sum_{j=1}^D X(i, j) \sum_{k=1}^K \log\!\left[ P(i \mid k) \, P(j \mid k) \, P(k) \right] P(k \mid i, j) \tag{131}
\]
\[
= \sum_{i=1}^N \sum_{j=1}^D X(i, j) \sum_{k=1}^K \left[ \log P(i \mid k) + \log P(j \mid k) + \log P(k) \right] P(k \mid i, j) , \tag{132}
\]

and the posterior is given by:

\[
P(k \mid i, j) = \frac{P(i, j, k)}{P(i, j)} = \frac{P(i \mid k) \, P(j \mid k) \, P(k)}{\sum_{k=1}^K P(i \mid k) \, P(j \mid k) \, P(k)} . \tag{133}
\]

Taking partial derivatives with the appropriate Lagrange multiplier expressions, we get:

\[
\frac{\partial}{\partial P(i \mid k)}\left( Q + \lambda\Big(\sum_{i=1}^N P(i \mid k) - 1\Big) \right) = \frac{\sum_{j=1}^D X(i, j) \, P(k \mid i, j)}{P(i \mid k)} + \lambda = 0 , \tag{134}
\]
\[
\frac{\partial}{\partial P(j \mid k)}\left( Q + \lambda\Big(\sum_{j=1}^D P(j \mid k) - 1\Big) \right) = \frac{\sum_{i=1}^N X(i, j) \, P(k \mid i, j)}{P(j \mid k)} + \lambda = 0 , \tag{135}
\]
\[
\frac{\partial}{\partial P(k)}\left( Q + \lambda\Big(\sum_{k=1}^K P(k) - 1\Big) \right) = \frac{\sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j)}{P(k)} + \lambda = 0 . \tag{136}
\]

Solving for the parameters P(i|k), P(j|k), and P(k), plugging these expressions into the constraint equations (128)-(130), we can solve for the Lagrange multipliers and substitute them back into equations (134)-(136). This gives the PLSI updates for the M-step:

\[
P(i \mid k) = \frac{\sum_{j=1}^D X(i, j) \, P(k \mid i, j)}{\sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j)} , \tag{137}
\]
\[
P(j \mid k) = \frac{\sum_{i=1}^N X(i, j) \, P(k \mid i, j)}{\sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j)} , \tag{138}
\]
\[
P(k) = \sum_{i=1}^N \sum_{j=1}^D X(i, j) \, P(k \mid i, j) . \tag{139}
\]

11.1 MULTIPLICATIVE UPDATES

We can re-arrange (137)-(139) for efficient computation using matrix algebra. First, we collapse the last two terms into one:

\[
P(j, k) = P(j \mid k) \, P(k) = \sum_{i=1}^N X(i, j) \, P(k \mid i, j) , \tag{140}
\]

and plug the E step (posteriors) into the M step to get:

\[
P(i \mid k) = \frac{\sum_{j=1}^D Y(i, j) \, P(i, j, k)}{\sum_{j=1}^D P(j, k)} = \frac{P(i \mid k) \sum_{j=1}^D Y(i, j) \, P(j, k)}{\sum_{j=1}^D P(j, k)} , \tag{141}
\]
\[
P(j, k) = \sum_{i=1}^N Y(i, j) \, P(i, j, k) = P(j, k) \sum_{i=1}^N Y(i, j) \, P(i \mid k) , \tag{142}
\]

where

\[
Y(i, j) = \frac{X(i, j)}{P(i, j)} . \tag{143}
\]

Writing this in matrix notation (with elementwise multiplication and division), we get:

\[
W \leftarrow W \odot \frac{Y H^T}{J H^T} , \qquad J = \mathrm{ones}(N, D) , \tag{144}
\]
\[
H \leftarrow H \odot \left( W^T Y \right) , \tag{145}
\]

where W \in \mathbb{R}^{N \times K} and H \in \mathbb{R}^{K \times D} are the probabilities P(i|k) and P(j,k) arranged into matrices such that P = W H \in \mathbb{R}^{N \times D} is equivalent to P(i, j) = \sum_{k=1}^K P(i \mid k) \, P(j, k). We can recover P(i|k) from W and both P(j|k) and P(k) from H. These updates look suspiciously similar to the NMF updates derived by Lee and Seung for the KL-divergence error criterion. The two algorithms are actually equivalent up to a scaling factor. In practice, the H matrix should be normalized explicitly.
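A sketch of the multiplicative updates (143)-(145) in NumPy; this is my own illustration, not Hofmann's or Lee and Seung's reference code. X is any nonnegative matrix normalized to sum to 1, and H (and the columns of W) are renormalized explicitly as suggested above.

```python
import numpy as np

def plsi_multiplicative(X, K, n_iter=500, seed=0):
    """Fit the symmetric PLSI factorization P = W H to a normalized nonnegative matrix X."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    eps = 1e-12
    W = rng.random((N, K))
    W /= W.sum(axis=0, keepdims=True)          # columns of W hold P(i|k)
    H = rng.random((K, D))
    H /= H.sum()                               # entries of H hold P(j,k)
    for _ in range(n_iter):
        Y = X / (W @ H + eps)                  # Y(i,j) = X(i,j)/P(i,j), eq. (143)
        W *= (Y @ H.T) / (np.ones((N, D)) @ H.T + eps)    # eq. (144)
        H *= W.T @ Y                                      # eq. (145)
        H /= H.sum()                           # explicit renormalization so P(j,k) sums to 1
        W /= W.sum(axis=0, keepdims=True)      # keep columns of W as distributions P(i|k)
    Pk = H.sum(axis=1)                         # P(k)
    Pj_given_k = H / Pk[:, None]               # P(j|k)
    return W, Pj_given_k, Pk

# Toy example: a small nonnegative "word x document" matrix, normalized to sum to 1.
X = np.array([[4., 0., 1.], [3., 1., 0.], [0., 5., 4.], [1., 4., 5.]])
X /= X.sum()
W, PjK, Pk = plsi_multiplicative(X, K=2)
print((W @ (PjK * Pk[:, None])).round(3))      # reconstruction of P(i,j), compare to X
```

With the column normalization of W, dividing by J H^T changes nothing per column (it is constant over i), so these updates coincide with the exact EM updates (137)-(139) up to the global scale that the explicit normalization of H removes.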
