Expectation Maximization


1 Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori

2 Learning Parameters to Probability Distributions We discussed probabilistic models at length In assignments 3 and 4 you see that, given fully observed training data, setting parameters θ_i to probability distributions is straight-forward However, in many settings not all variables are observed (labelled) in the training data: x_i = (x_i, h_i) e.g. Speech recognition: have speech signals, but not phoneme labels e.g. Object recognition: have object labels (car, bicycle), but not part labels (wheel, door, seat) Unobserved variables are called latent variables Figure: output of the feature detector (figs from Fergus et al.) c Ghane/Mori

3 Outline K-Means Gaussian Mixture Models Expectation-Maximization c Ghane/Mori 3

5 Unsupervised Learning We will start with an unsupervised learning (clustering) problem: Given a dataset {x_1,..., x_N}, each x_i ∈ R^D, partition the dataset into K clusters Intuitively, a cluster is a group of points that are close together and far from points in other clusters c Ghane/Mori

6 Distortion Measure Formally, introduce prototypes (or cluster centers) µ_k ∈ R^D Use binary r_nk, 1 if point n is in cluster k, 0 otherwise (1-of-K coding scheme again) Find {µ_k}, {r_nk} to minimize the distortion measure: J = sum_{n=1}^N sum_{k=1}^K r_nk ||x_n − µ_k||^2 e.g. two clusters k = 1, 2: J = sum_{x_n ∈ C_1} ||x_n − µ_1||^2 + sum_{x_n ∈ C_2} ||x_n − µ_2||^2 c Ghane/Mori

7 Minimizing Distortion Measure Minimizing J directly is hard: J = sum_{n=1}^N sum_{k=1}^K r_nk ||x_n − µ_k||^2 However, two things are easy: If we know µ_k, minimizing J wrt r_nk If we know r_nk, minimizing J wrt µ_k This suggests an iterative procedure Start with an initial guess for µ_k Iteration of two steps: Minimize J wrt r_nk Minimize J wrt µ_k Rinse and repeat until convergence c Ghane/Mori

10 Determining Membership Variables Step 1 in an iteration of K-means is to minimize the distortion measure J wrt the cluster membership variables r_nk: J = sum_{n=1}^N sum_{k=1}^K r_nk ||x_n − µ_k||^2 Terms for different data points x_n are independent, so for each data point set r_nk to minimize sum_{k=1}^K r_nk ||x_n − µ_k||^2 Simply set r_nk = 1 for the cluster center µ_k with smallest distance (and 0 for all others) c Ghane/Mori
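As a concrete illustration of this assignment step, here is a minimal NumPy sketch (the array names X, mu, r are my own placeholders, not from the slides):

import numpy as np

def assign_clusters(X, mu):
    """Assignment step of K-means: hard 1-of-K memberships by smallest squared distance."""
    # Squared Euclidean distances between every point and every center: shape (N, K)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # r[n, k] = 1 for the nearest center, 0 otherwise
    r = np.zeros_like(d2)
    r[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1.0
    return r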

13 Determining Cluster Centers Step 2: fix r_nk, minimize J wrt the cluster centers µ_k: J = sum_{k=1}^K sum_{n=1}^N r_nk ||x_n − µ_k||^2 (switch order of sums) So we can minimize wrt each µ_k separately Take the derivative and set it to zero: 2 sum_{n=1}^N r_nk (x_n − µ_k) = 0 ⇒ µ_k = sum_n r_nk x_n / sum_n r_nk i.e. the mean of the datapoints x_n assigned to cluster k c Ghane/Mori

15 K-means Algorithm Start with initial guess for µ k Iteration of two steps: Minimize J wrt r nk Assign points to nearest cluster center Minimize J wrt µ k Set cluster center as average of points in cluster Rinse and repeat until convergence c Ghane/Mori 5
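A compact sketch of this procedure (illustrative NumPy code; the random initialization from data points and the stopping test on unchanged memberships are my own choices, not prescribed by the slides):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means: alternate assignments and mean updates until memberships stop changing."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)  # initial guess
    prev_z = None
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z = d2.argmin(axis=1)                                     # assign to nearest center
        J = d2[np.arange(X.shape[0]), z].sum()                    # distortion measure
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)                    # mean of assigned points
        if prev_z is not None and np.array_equal(z, prev_z):      # no change in membership: stop
            break
        prev_z = z
    return mu, z, J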

16 K-means example (a) c Ghane/Mori 6

17 K-means example (b) c Ghane/Mori 7

18 K-means example (c) c Ghane/Mori 8

19 K-means example (d) c Ghane/Mori 9

20 K-means example (e) c Ghane/Mori

21 K-means example (f) c Ghane/Mori

22 K-means example (g) c Ghane/Mori

23 K-means example (h) c Ghane/Mori 3

24 K-means example (i) Next step doesn't change membership, so stop c Ghane/Mori

25 K-means Convergence Repeat steps until no change in cluster assignments For each step, the value of J either goes down, or we stop There is a finite number of possible assignments of data points to clusters, so we are guaranteed to converge eventually Note it may be a local minimum rather than a global minimum to which we converge c Ghane/Mori

26 K-means Example - Image Segmentation Original image K-means clustering on pixel colour values Pixels in a cluster are coloured by the cluster mean Represent each pixel (e.g. a 24-bit colour value) by a cluster number (e.g. 4 bits when K ≤ 16), a compressed version This technique is known as vector quantization Represent each vector (in this case an RGB value in R^3) as a single discrete value c Ghane/Mori
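A sketch of this compression idea (an assumption-laden example: it takes an RGB image array img of shape (H, W, 3) and reuses the hypothetical kmeans function sketched earlier):

import numpy as np

def quantize_image(img, K=16, **kwargs):
    """Vector-quantize pixel colours: recolour each pixel by its cluster mean."""
    H, W, C = img.shape
    pixels = img.reshape(-1, C).astype(float)       # one RGB vector per pixel, in R^3
    mu, z, _ = kmeans(pixels, K, **kwargs)          # cluster centres and per-pixel indices
    return mu[z].reshape(H, W, C), z.reshape(H, W)  # recoloured image + discrete code per pixel

Storing only the K centres plus one small integer per pixel is the compressed representation the slide describes.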

27 Outline K-Means Gaussian Mixture Models Expectation-Maximization c Ghane/Mori 7

28 Hard Assignment vs. Soft Assignment (i) In the K-means algorithm, a hard assignment of points to clusters is made However, for points near the decision boundary, this may not be such a good idea Instead, we could think about making a soft assignment of points to clusters c Ghane/Mori 8

29 Gaussian Mixture Model (b) (a) The Gaussian mixture model (or mixture of Gaussians MoG) models the data as a combination of Gaussians Above shows a dataset generated by drawing samples from three different Gaussians c Ghane/Mori 9

30 Generative Model The mixture of Gaussians is a generative model To generate a datapoint x_n, we first generate a value for a discrete variable z_n ∈ {1,..., K} We then generate a value x_n ∼ N(x | µ_k, Σ_k) from the corresponding Gaussian c Ghane/Mori
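The two-stage sampling process can be written directly (a minimal sketch; pi, mus, Sigmas are assumed parameter arrays of shapes (K,), (K, D), (K, D, D), not values from the slides):

import numpy as np

def sample_mog(N, pi, mus, Sigmas, seed=0):
    """Ancestral sampling from a mixture of Gaussians: draw z_n, then x_n given z_n."""
    rng = np.random.default_rng(seed)
    K, D = mus.shape
    z = rng.choice(K, size=N, p=pi)  # latent component for each point
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z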

31 Graphical Model Full graphical model using plate notation: π → z_n → x_n ← (µ, Σ), with a plate over n = 1,..., N Note z_n is a latent variable, unobserved Need to give the conditional distributions p(z_n) and p(x_n | z_n) The one-of-K representation is helpful here: z_nk ∈ {0, 1}, z_n = (z_n1,..., z_nK) c Ghane/Mori

32 Graphical Model - Latent Component Variable Use a Bernoulli distribution for p(z_n), i.e. p(z_nk = 1) = π_k Parameters of this distribution are {π_k} Must have 0 ≤ π_k ≤ 1 and sum_{k=1}^K π_k = 1 p(z_n) = prod_{k=1}^K π_k^{z_nk} c Ghane/Mori

33 Graphical Model - Observed Variable Use a Gaussian distribution for p(x_n | z_n) Parameters of this distribution are {µ_k, Σ_k} p(x_n | z_nk = 1) = N(x_n | µ_k, Σ_k) p(x_n | z_n) = prod_{k=1}^K N(x_n | µ_k, Σ_k)^{z_nk} c Ghane/Mori

34 Graphical Model - Joint distribution The full joint distribution is given by: p(x, z) = prod_{n=1}^N p(z_n) p(x_n | z_n) = prod_{n=1}^N prod_{k=1}^K π_k^{z_nk} N(x_n | µ_k, Σ_k)^{z_nk} c Ghane/Mori

35 MoG Marginal over Observed Variables The marginal distribution p(x_n) for this model is: p(x_n) = sum_{z_n} p(x_n, z_n) = sum_{z_n} p(z_n) p(x_n | z_n) = sum_{k=1}^K π_k N(x_n | µ_k, Σ_k) A mixture of Gaussians c Ghane/Mori

36 MoG Conditional over Latent Variable The conditional p(z_nk = 1 | x_n) will play an important role for learning It is denoted γ(z_nk) and can be computed as: γ(z_nk) ≡ p(z_nk = 1 | x_n) = p(z_nk = 1) p(x_n | z_nk = 1) / sum_{j=1}^K p(z_nj = 1) p(x_n | z_nj = 1) = π_k N(x_n | µ_k, Σ_k) / sum_{j=1}^K π_j N(x_n | µ_j, Σ_j) γ(z_nk) is the responsibility of component k for datapoint n c Ghane/Mori
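In code, the responsibilities are just a normalized product of mixing coefficient and Gaussian density per component (an illustrative sketch using scipy.stats for the density; the array names are assumptions):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pi)
    dens = np.column_stack([multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                            for k in range(K)])   # (N, K) Gaussian densities
    weighted = dens * pi                          # numerators pi_k N(x_n | mu_k, Sigma_k)
    return weighted / weighted.sum(axis=1, keepdims=True)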

39 MoG Learning Given a set of observations {x_1,..., x_N}, without the latent variables z_n, how can we learn the parameters? Model parameters are θ = {π_k, µ_k, Σ_k} Answer will be similar to K-means: If we know the latent variables z_n, fitting the Gaussians is easy If we know the Gaussians µ_k, Σ_k, finding the latent variables is easy Rather than latent variables, we will use responsibilities γ(z_nk) c Ghane/Mori

42 MoG Maximum Likelihood Learning Given a set of observations {x_1,..., x_N}, without the latent variables z_n, how can we learn the parameters? Model parameters are θ = {π_k, µ_k, Σ_k} We can use the maximum likelihood criterion: θ_ML = argmax_θ prod_{n=1}^N sum_{k=1}^K π_k N(x_n | µ_k, Σ_k) = argmax_θ sum_{n=1}^N log { sum_{k=1}^K π_k N(x_n | µ_k, Σ_k) } Unfortunately, a closed-form solution is not possible this time: log of sum rather than log of product c Ghane/Mori

46 MoG Maximum Likelihood Learning - Problem Maximum likelihood criterion, 1-D: θ_ML = argmax_θ { sum_{n=1}^N log sum_{k=1}^K π_k (1/√(2πσ_k^2)) exp{ −(x_n − µ_k)^2/(2σ_k^2) } } Suppose we set µ_k = x_n for some k and n; then we have one term in the sum: π_k/√(2πσ_k^2) exp{ −(x_n − µ_k)^2/(2σ_k^2) } = π_k/√(2πσ_k^2) exp{ −(0)^2/(2σ_k^2) } = π_k/√(2πσ_k^2) In the limit as σ_k → 0, this goes to ∞ So the ML solution is to set some µ_k = x_n and σ_k = 0! c Ghane/Mori

47 ML for Gaussian Mixtures Keeping this problem in mind, we will develop an algorithm for ML estimation of the parameters of a MoG model Search for a local optimum Consider the log-likelihood function ℓ(θ) = sum_{n=1}^N log { sum_{k=1}^K π_k N(x_n | µ_k, Σ_k) } We can try taking derivatives and setting them to zero, even though no closed-form solution exists c Ghane/Mori

48 Maximizing Log-Likelihood - Means ℓ(θ) = sum_{n=1}^N log { sum_{k=1}^K π_k N(x_n | µ_k, Σ_k) } ∂ℓ(θ)/∂µ_k = sum_{n=1}^N [ π_k N(x_n | µ_k, Σ_k) / sum_j π_j N(x_n | µ_j, Σ_j) ] Σ_k^{-1} (x_n − µ_k) = sum_{n=1}^N γ(z_nk) Σ_k^{-1} (x_n − µ_k) Setting the derivative to 0 and multiplying by Σ_k: sum_{n=1}^N γ(z_nk) µ_k = sum_{n=1}^N γ(z_nk) x_n ⇒ µ_k = (1/N_k) sum_{n=1}^N γ(z_nk) x_n, where N_k = sum_{n=1}^N γ(z_nk) c Ghane/Mori

50 Maximizing Log-Likelihood - Means and Covariances Note that the mean µ_k is a weighted combination of points x_n, using the responsibilities γ(z_nk) for the cluster k: µ_k = (1/N_k) sum_{n=1}^N γ(z_nk) x_n N_k = sum_{n=1}^N γ(z_nk) is the effective number of points in the cluster A similar result comes from taking derivatives wrt the covariance matrices Σ_k: Σ_k = (1/N_k) sum_{n=1}^N γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T c Ghane/Mori

52 Maximizing Log-Likelihood - Mixing Coefficients We can also maximize wrt the mixing coefficients π_k Note there is a constraint that sum_k π_k = 1 Use Lagrange multipliers End up with: π_k = N_k / N, the average responsibility that component k takes c Ghane/Mori
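For completeness, the Lagrange-multiplier step behind this result can be spelled out (my own LaTeX write-up of the standard derivation, not text from the slides):

\frac{\partial}{\partial \pi_k}\left[\sum_{n=1}^{N}\ln\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j) + \lambda\Big(\sum_{j=1}^{K}\pi_j-1\Big)\right] = \sum_{n=1}^{N}\frac{\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)} + \lambda = 0 .

Multiplying by \pi_k gives N_k + \lambda\pi_k = 0; summing over k and using \sum_k \pi_k = 1 gives \lambda = -N, hence \pi_k = N_k/N.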

53 Three Parameter Types and Three Equations These three equations a solution does not make: µ_k = (1/N_k) sum_{n=1}^N γ(z_nk) x_n, Σ_k = (1/N_k) sum_{n=1}^N γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T, π_k = N_k / N All depend on γ(z_nk), which depends on all 3! But an iterative scheme can be used c Ghane/Mori

54 EM for Gaussian Mixtures Initialize parameters, then iterate: E step: Calculate responsibilities using current parameters: γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / sum_{j=1}^K π_j N(x_n | µ_j, Σ_j) M step: Re-estimate parameters using these γ(z_nk): µ_k = (1/N_k) sum_{n=1}^N γ(z_nk) x_n, Σ_k = (1/N_k) sum_{n=1}^N γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T, π_k = N_k / N This algorithm is known as the expectation-maximization algorithm (EM) Next we describe its general form, why it works, and why it's called EM (but first an example) c Ghane/Mori
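Putting the E and M steps together, a minimal NumPy/SciPy sketch of the updates on this slide (the random initialization, the fixed iteration count, and the small ridge added to each covariance to guard against the collapsing-variance singularity discussed earlier are my own choices; in practice K-means is often used to initialize, as the next slide notes):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a mixture of Gaussians: alternate responsibilities and parameter re-estimation."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)   # crude initialization
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k]
        dens = np.column_stack([multivariate_normal.pdf(X, mus[k], Sigmas[k]) for k in range(K)])
        gamma = dens * pi
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                       # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mus, Sigmas, gamma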

57 MoG EM - Example (a) Same initialization as with K-means before Often, K-means is actually used to initialize EM c Ghane/Mori 57

58 MoG EM - Example (b) Calculate responsibilities γ(z nk ) c Ghane/Mori 58

59 MoG EM - Example (c) Calculate model parameters {π k, µ k, Σ k } using these responsibilities c Ghane/Mori 59

60 MoG EM - Example (d) Iteration c Ghane/Mori 6

61 MoG EM - Example (e) Iteration 5 c Ghane/Mori 6

62 MoG EM - Example Iteration - converged (f) c Ghane/Mori 6

63 Outline K-Means Gaussian Mixture Models Expectation-Maximization c Ghane/Mori 63

64 General Version of EM In general, we are interested in maximizing the likelihood p(X | θ) = sum_Z p(X, Z | θ) where X denotes all observed variables, and Z denotes all latent (hidden, unobserved) variables Assume that maximizing p(X | θ) is difficult (e.g. mixture of Gaussians) But maximizing p(X, Z | θ) is tractable (everything observed) p(X, Z | θ) is referred to as the complete-data likelihood function, which we don't have c Ghane/Mori

66 A Lower Bound The strategy for optimization will be to introduce a lower bound on the likelihood This lower bound will be based on the complete-data likelihood, which is easy to optimize Iteratively increase this lower bound Make sure we're increasing the likelihood while doing so c Ghane/Mori

67 A Decomposition Trick To obtain the lower bound, we use a decomposition: ln p(X, Z | θ) = ln p(X | θ) + ln p(Z | X, θ) (product rule) ln p(X | θ) = L(q, θ) + KL(q ‖ p) L(q, θ) ≡ sum_Z q(Z) ln { p(X, Z | θ) / q(Z) } KL(q ‖ p) ≡ − sum_Z q(Z) ln { p(Z | X, θ) / q(Z) } KL(q ‖ p) is known as the Kullback-Leibler divergence (KL-divergence), and is ≥ 0 (see p. 55 of PRML, next slide) Hence ln p(X | θ) ≥ L(q, θ) c Ghane/Mori
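As a quick check of this decomposition (my own LaTeX write-up; it uses only the product rule and the fact that q(Z) sums to one):

\mathcal{L}(q,\theta) + \mathrm{KL}(q\,\|\,p) = \sum_Z q(Z)\ln\frac{p(X,Z\mid\theta)}{q(Z)} - \sum_Z q(Z)\ln\frac{p(Z\mid X,\theta)}{q(Z)} = \sum_Z q(Z)\ln\frac{p(X,Z\mid\theta)}{p(Z\mid X,\theta)} = \sum_Z q(Z)\ln p(X\mid\theta) = \ln p(X\mid\theta).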

68 Kullback-Leibler Divergence KL(p(x) ‖ q(x)) is a measure of the difference between distributions p(x) and q(x): KL(p(x) ‖ q(x)) = − sum_x p(x) log { q(x) / p(x) } Motivation: the average additional amount of information required to encode x using a code assuming distribution q(x) when x actually comes from p(x) Note it is not symmetric: KL(q(x) ‖ p(x)) ≠ KL(p(x) ‖ q(x)) in general It is non-negative: Jensen's inequality: ln( sum_x x p(x) ) ≥ sum_x p(x) ln x Apply to KL: KL(p ‖ q) = − sum_x p(x) log { q(x)/p(x) } ≥ − ln ( sum_x p(x) q(x)/p(x) ) = − ln sum_x q(x) = 0 c Ghane/Mori
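A small numeric illustration of these two properties (a sketch; the two example distributions are arbitrary):

import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(kl(p, q), kl(q, p))   # both non-negative, and generally not equal (asymmetry)
print(kl(p, p))             # zero when the distributions match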

72 Increasing the Lower Bound - E step EM is an iterative optimization technique which tries to maximize this lower bound: ln p(X | θ) ≥ L(q, θ) E step: Fix θ_old, maximize L(q, θ_old) wrt q, i.e. choose the distribution q to maximize L Reordering the bound: L(q, θ_old) = ln p(X | θ_old) − KL(q ‖ p) ln p(X | θ_old) does not depend on q The maximum is obtained when KL(q ‖ p) is as small as possible This occurs when q = p, i.e. q(Z) = p(Z | X, θ_old) This is the posterior over Z; recall these are the responsibilities from the MoG model c Ghane/Mori

76 Increasing the Lower Bound - M step M step: Fix q, maximize L(q, θ) wrt θ The maximization problem is on L(q, θ) = sum_Z q(Z) ln p(X, Z | θ) − sum_Z q(Z) ln q(Z) = sum_Z p(Z | X, θ_old) ln p(X, Z | θ) − sum_Z p(Z | X, θ_old) ln p(Z | X, θ_old) The second term is constant with respect to θ The first term is the ln of the complete-data likelihood, which is assumed easy to optimize It is the expected complete log likelihood: what we think the complete-data likelihood will be c Ghane/Mori

80 Why does EM work? In the M-step we changed from θ_old to θ_new This increased the lower bound L, unless we were at a maximum (so we would have stopped) This must have caused the log likelihood to increase The E-step set q to make the KL-divergence 0: ln p(X | θ_old) = L(q, θ_old) + KL(q ‖ p) = L(q, θ_old) Since the lower bound L increased when we moved from θ_old to θ_new: ln p(X | θ_old) = L(q, θ_old) < L(q, θ_new) = ln p(X | θ_new) − KL(q ‖ p_new) ≤ ln p(X | θ_new) So the log-likelihood has increased going from θ_old to θ_new c Ghane/Mori

84 Bounding Example Consider a two-component 1-D MoG with known variances (example from F. Dellaert) Figure: EM example, mixture components and data; the data consists of three samples drawn from each mixture component, shown as circles and triangles c Ghane/Mori

85 Bounding Example Figure: the true likelihood function of the two component means θ_1 and θ_2, given the data in the previous figure True likelihood function; recall we're fitting the means θ_1, θ_2 c Ghane/Mori

86 Bounding Example Figure: lower bounds on the likelihood at successive EM iterations Lower bound the likelihood function using the averaging distribution q(Z): ln p(X | θ) = L(q, θ) + KL(q(Z) ‖ p(Z | X, θ)) Since q(Z) = p(Z | X, θ_old), the bound is tight (equal to the actual likelihood) at θ = θ_old c Ghane/Mori
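A sketch in the spirit of this example (my own construction, not Dellaert's code: two 1-D components with known, equal unit variances and equal mixing weights, so EM only updates the two means θ_1, θ_2; the data values are invented placeholders):

import numpy as np

def em_two_means(x, theta, sigma=1.0, n_iters=10):
    """EM for a 1-D, two-component MoG with known variances and mixing weights of 1/2 each."""
    theta = np.array(theta, dtype=float)             # current estimates of the two means
    for _ in range(n_iters):
        # E step: responsibility of each component for each point (equal weights cancel)
        dens = np.exp(-0.5 * ((x[:, None] - theta[None, :]) / sigma) ** 2)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: each mean is a responsibility-weighted average of the data
        theta = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
    return theta

x = np.array([-2.5, -2.0, -1.5, 1.5, 2.0, 2.5])      # three samples near each component
print(em_two_means(x, theta=[-0.5, 0.5]))            # the means move apart toward the two clumps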

90 EM - Summary EM finds a local maximum of the likelihood p(X | θ) = sum_Z p(X, Z | θ) Iterates two steps: E step fills in the missing variables Z (calculates their distribution) M step maximizes the expected complete log likelihood (expectation wrt the E-step distribution) This works because these two steps are performing a coordinate-wise hill-climbing on a lower bound on the likelihood p(X | θ) c Ghane/Mori

91 Conclusion Readings: Ch. 9.1, 9.2, 9.4 K-means clustering Gaussian mixture model What about K? Model selection: either cross-validation or a Bayesian version (average over all values of K) Expectation-maximization, a general method for learning parameters of models when not all variables are observed c Ghane/Mori


More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

K-Means and Gaussian Mixture Models

K-Means and Gaussian Mixture Models K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

Self-Organization by Optimizing Free-Energy

Self-Organization by Optimizing Free-Energy Self-Organization by Optimizing Free-Energy J.J. Verbeek, N. Vlassis, B.J.A. Kröse University of Amsterdam, Informatics Institute Kruislaan 403, 1098 SJ Amsterdam, The Netherlands Abstract. We present

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

EM Algorithm LECTURE OUTLINE

EM Algorithm LECTURE OUTLINE EM Algorithm Lukáš Cerman, Václav Hlaváč Czech Technical University, Faculty of Electrical Engineering Department of Cybernetics, Center for Machine Perception 121 35 Praha 2, Karlovo nám. 13, Czech Republic

More information

Machine learning - HT Maximum Likelihood

Machine learning - HT Maximum Likelihood Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS

Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics Outline Maximum likelihood (ML) Priors, and

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information