Expectation Maximization
1 Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori
2 Learning Parameters to Probability Distributions We discussed probabilistic models at length Learning is straightforward given fully observed (labelled) training data: set the parameters θ_i of the probability distributions to maximize the likelihood However, in many settings not all variables are observed (labelled) in the training data: x_i = (x_i, h_i) e.g. Speech recognition: have speech signals, but not phoneme labels e.g. Object recognition: have object labels (car, bicycle), but not part labels (wheel, door, seat) Unobserved variables are called latent variables [Figure: output of the feature detector on two typical images from the motorbike dataset; the feature detector identifies regions of interest on each image, the coordinates of the centre give X and the size of the region gives S; figs from Fergus et al.] c Ghane/Mori 2
3 Outline K-Means Gaussian Mixture Models Expectation-Maximization c Ghane/Mori 3
5 Unsupervised Learning (a) We will start with an unsupervised learning (clustering) problem: Given a dataset {x_1, ..., x_N}, each x_i ∈ R^D, partition the dataset into K clusters Intuitively, a cluster is a group of points that are close together and far from the others c Ghane/Mori 5
6 Distortion Measure (a) (i) Formally, introduce prototypes (or cluster centers) µ_k ∈ R^D Use binary r_nk ∈ {0, 1}: r_nk = 1 if point n is in cluster k, 0 otherwise (1-of-K coding scheme again) Find {µ_k}, {r_nk} to minimize the distortion measure: J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||² e.g. two clusters k = 1, 2: J = Σ_{x_n ∈ C_1} ||x_n − µ_1||² + Σ_{x_n ∈ C_2} ||x_n − µ_2||² c Ghane/Mori 6
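As a concrete check, the distortion measure J can be computed directly; a minimal sketch in plain Python (the helper name `distortion` is illustrative, not from the lecture):

```python
# Distortion measure J = sum_n sum_k r_nk * ||x_n - mu_k||^2
# (illustrative helper, not from the slides)

def distortion(X, mu, r):
    """X: list of D-dim points, mu: list of K centers, r: binary memberships r[n][k]."""
    J = 0.0
    for n, x in enumerate(X):
        for k, m in enumerate(mu):
            if r[n][k]:
                J += sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return J

X = [[0.0], [1.0], [10.0]]
mu = [[0.5], [10.0]]
r = [[1, 0], [1, 0], [0, 1]]   # first two points in cluster 1, last in cluster 2
print(distortion(X, mu, r))    # 0.25 + 0.25 + 0.0 = 0.5
```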
7 Minimizing Distortion Measure Minimizing J directly is hard: J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||² However, two things are easy: If we know µ_k, minimizing J wrt r_nk If we know r_nk, minimizing J wrt µ_k This suggests an iterative procedure: Start with an initial guess for µ_k Iterate two steps: Minimize J wrt r_nk Minimize J wrt µ_k Rinse and repeat until convergence c Ghane/Mori 7
10 Determining Membership Variables (a) (b) Step 1 in an iteration of K-means is to minimize the distortion measure J wrt the cluster membership variables r_nk: J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||² Terms for different data points x_n are independent, so for each data point set r_nk to minimize Σ_{k=1}^K r_nk ||x_n − µ_k||² Simply set r_nk = 1 for the cluster center µ_k with the smallest distance c Ghane/Mori 10
13 Determining Cluster Centers (b) (c) Step 2: fix r_nk, minimize J wrt the cluster centers µ_k: J = Σ_{k=1}^K Σ_{n=1}^N r_nk ||x_n − µ_k||² (switch order of sums) So we can minimize wrt each µ_k separately Take the derivative, set to zero: 2 Σ_{n=1}^N r_nk (x_n − µ_k) = 0 ⇒ µ_k = Σ_n r_nk x_n / Σ_n r_nk i.e. the mean of the datapoints x_n assigned to cluster k c Ghane/Mori 13
15 K-means Algorithm Start with initial guess for µ k Iteration of two steps: Minimize J wrt r nk Assign points to nearest cluster center Minimize J wrt µ k Set cluster center as average of points in cluster Rinse and repeat until convergence c Ghane/Mori 5
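The two-step loop above translates directly into code; a toy sketch in plain Python (names are illustrative, and real use would prefer an optimized library implementation):

```python
import random

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means on a list of D-dimensional points (a sketch, not the lecture's code)."""
    rng = random.Random(seed)
    mu = [list(x) for x in rng.sample(X, K)]   # initial guess for centers
    assign = None
    for _ in range(iters):
        # Step 1: minimize J wrt r_nk -- assign each point to its nearest center
        new_assign = [min(range(K),
                          key=lambda k: sum((xi - mi) ** 2 for xi, mi in zip(x, mu[k])))
                      for x in X]
        if new_assign == assign:               # no change in membership -> stop
            break
        assign = new_assign
        # Step 2: minimize J wrt mu_k -- set each center to the mean of its points
        for k in range(K):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:
                mu[k] = [sum(c) / len(members) for c in zip(*members)]
    return mu, assign

X = [[0.0], [0.2], [0.4], [9.8], [10.0], [10.2]]
mu, assign = kmeans(X, K=2)
print(sorted(m[0] for m in mu))   # two centers, near 0.2 and 10.0
```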
16 K-means example (a) c Ghane/Mori 6
17 K-means example (b) c Ghane/Mori 7
18 K-means example (c) c Ghane/Mori 8
19 K-means example (d) c Ghane/Mori 9
20 K-means example (e) c Ghane/Mori
21 K-means example (f) c Ghane/Mori
22 K-means example (g) c Ghane/Mori
23 K-means example (h) c Ghane/Mori 3
24 K-means example (i) Next step doesn't change membership, so stop c Ghane/Mori 24
25 K-means Convergence Repeat steps until no change in cluster assignments For each step, the value of J either goes down, or we stop There is a finite number of possible assignments of data points to clusters, so we are guaranteed to converge eventually Note it may be a local minimum rather than a global minimum to which we converge c Ghane/Mori 25
26 K-means Example - Image Segmentation Original image K-means clustering on pixel colour values Pixels in a cluster are coloured by the cluster mean Represent each pixel (e.g. 24-bit colour value) by a cluster number (e.g. 4 bits for K = 16), a compressed version This technique is known as vector quantization Represent each vector (in this case an RGB value in R^3) as a single discrete value c Ghane/Mori 26
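The compression step can be illustrated in a few lines; a hypothetical sketch where the codebook plays the role of the cluster means (names and data are made up for illustration):

```python
# Vector quantization sketch (illustrative): replace each pixel by the index of its
# nearest codebook entry, then reconstruct every pixel from the codebook.
codebook = [[255, 0, 0], [0, 0, 255]]           # K = 2 cluster means (RGB)
pixels = [[250, 10, 5], [3, 2, 240], [200, 30, 40]]

def nearest(p):
    """Index of the codebook entry closest to pixel p in squared distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(p, codebook[k])))

indices = [nearest(p) for p in pixels]           # compressed: 1 bit per pixel for K = 2
reconstructed = [codebook[i] for i in indices]   # each pixel coloured by its cluster mean
print(indices)                                   # [0, 1, 0]
```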
27 Outline K-Means Gaussian Mixture Models Expectation-Maximization c Ghane/Mori 7
28 Hard Assignment vs. Soft Assignment (i) In the K-means algorithm, a hard assignment of points to clusters is made However, for points near the decision boundary, this may not be such a good idea Instead, we could think about making a soft assignment of points to clusters c Ghane/Mori 8
29 Gaussian Mixture Model (b) (a) The Gaussian mixture model (or mixture of Gaussians MoG) models the data as a combination of Gaussians Above shows a dataset generated by drawing samples from three different Gaussians c Ghane/Mori 9
30 Generative Model z (a) x The mixture of Gaussians is a generative model To generate a datapoint x_n, we first generate a value for a discrete variable z_n ∈ {1, ..., K} We then generate a value x_n ~ N(x | µ_k, Σ_k) for the corresponding Gaussian c Ghane/Mori 30
31 Graphical Model π z_n (a) x_n µ Σ N Full graphical model using plate notation Note z_n is a latent variable, unobserved Need to give the conditional distributions p(z_n) and p(x_n | z_n) The one-of-K representation is helpful here: z_nk ∈ {0, 1}, z_n = (z_n1, ..., z_nK) c Ghane/Mori 31
32 Graphical Model - Latent Component Variable π z_n (a) x_n µ Σ N Use a categorical (1-of-K) distribution for p(z_n), i.e. p(z_nk = 1) = π_k Parameters to this distribution: {π_k} Must have 0 ≤ π_k ≤ 1 and Σ_{k=1}^K π_k = 1 p(z_n) = Π_{k=1}^K π_k^{z_nk} c Ghane/Mori 32
33 Graphical Model - Observed Variable π z_n (a) x_n µ Σ N Use a Gaussian distribution for p(x_n | z_n) Parameters to this distribution: {µ_k, Σ_k} p(x_n | z_nk = 1) = N(x_n | µ_k, Σ_k) p(x_n | z_n) = Π_{k=1}^K N(x_n | µ_k, Σ_k)^{z_nk} c Ghane/Mori 33
34 Graphical Model - Joint Distribution π z_n (a) x_n µ Σ N The full joint distribution is given by: p(x, z) = Π_{n=1}^N p(z_n) p(x_n | z_n) = Π_{n=1}^N Π_{k=1}^K π_k^{z_nk} N(x_n | µ_k, Σ_k)^{z_nk} c Ghane/Mori 34
35 MoG Marginal over Observed Variables The marginal distribution p(x_n) for this model is: p(x_n) = Σ_{z_n} p(x_n, z_n) = Σ_{z_n} p(z_n) p(x_n | z_n) = Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) A mixture of Gaussians c Ghane/Mori 35
36 MoG Conditional over Latent Variable (b) (c) The conditional p(z_nk = 1 | x_n) will play an important role for learning It is denoted by γ(z_nk) and can be computed as: γ(z_nk) ≡ p(z_nk = 1 | x_n) = p(z_nk = 1) p(x_n | z_nk = 1) / Σ_{j=1}^K p(z_nj = 1) p(x_n | z_nj = 1) = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) γ(z_nk) is the responsibility of component k for datapoint n c Ghane/Mori 36
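The responsibility formula can be evaluated directly; a 1-D sketch in plain Python (helper names are illustrative, not from the slides):

```python
import math

def normal_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, pis, mus, vars_):
    """gamma(z_k) = pi_k N(x|mu_k,var_k) / sum_j pi_j N(x|mu_j,var_j)  (1-D sketch)."""
    weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)]
    total = sum(weighted)
    return [w / total for w in weighted]

g = responsibilities(0.1, pis=[0.5, 0.5], mus=[0.0, 5.0], vars_=[1.0, 1.0])
print(g)  # the component at mu = 0 takes nearly all responsibility; values sum to 1
```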
39 MoG Learning Given a set of observations {x_1, ..., x_N}, without the latent variables z_n, how can we learn the parameters? Model parameters are θ = {π_k, µ_k, Σ_k} The answer will be similar to K-means: If we know the latent variables z_n, fitting the Gaussians is easy If we know the Gaussians µ_k, Σ_k, finding the latent variables is easy Rather than latent variables, we will use the responsibilities γ(z_nk) c Ghane/Mori 39
42 MoG Maximum Likelihood Learning Given a set of observations {x_1, ..., x_N}, without the latent variables z_n, how can we learn the parameters? Model parameters are θ = {π_k, µ_k, Σ_k} We can use the maximum likelihood criterion: θ_ML = arg max_θ Π_{n=1}^N Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) = arg max_θ Σ_{n=1}^N log { Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) } Unfortunately, a closed-form solution is not possible this time: log of sum rather than log of product c Ghane/Mori 42
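Even without a closed-form maximizer, the log-of-sum objective can be evaluated numerically; a 1-D sketch (the log-sum-exp shift is a standard stability trick added here, not something from the slides):

```python
import math

def log_likelihood(X, pis, mus, vars_):
    """l(theta) = sum_n log sum_k pi_k N(x_n | mu_k, var_k), 1-D illustrative sketch.
    Uses a log-sum-exp shift so small densities do not underflow."""
    ll = 0.0
    for x in X:
        logs = [math.log(p) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                for p, m, v in zip(pis, mus, vars_)]
        mx = max(logs)
        ll += mx + math.log(sum(math.exp(l - mx) for l in logs))  # log of sum, not sum of logs
    return ll

# Single component: reduces to log N(0 | 0, 1) = -0.5 * log(2*pi)
print(log_likelihood([0.0], pis=[1.0], mus=[0.0], vars_=[1.0]))
```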
46 MoG Maximum Likelihood Learning - Problem Maximum likelihood criterion, 1-D: θ_ML = arg max_θ { Σ_{n=1}^N log Σ_{k=1}^K π_k (1/√(2πσ_k²)) exp{ −(x_n − µ_k)²/(2σ_k²) } } Suppose we set µ_k = x_n for some k and n; then we have one term in the sum: (π_k/√(2πσ_k²)) exp{ −(x_n − µ_k)²/(2σ_k²) } = (π_k/√(2πσ_k²)) exp{ −0/(2σ_k²) } = π_k/√(2πσ_k²) In the limit as σ_k → 0, this goes to ∞ So the ML solution is to set some µ_k = x_n, and σ_k = 0! c Ghane/Mori 46
47 ML for Gaussian Mixtures Keeping this problem in mind, we will develop an algorithm for ML estimation of the parameters for a MoG model Search for a local optimum Consider the log-likelihood function l(θ) = Σ_{n=1}^N log { Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) } We can try taking derivatives and setting them to zero, even though no closed-form solution exists c Ghane/Mori 47
48 Maximizing Log-Likelihood - Means l(θ) = Σ_{n=1}^N log { Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) } ∂l/∂µ_k = Σ_{n=1}^N [ π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j) ] Σ_k^{-1} (x_n − µ_k) = Σ_{n=1}^N γ(z_nk) Σ_k^{-1} (x_n − µ_k) Setting the derivative to 0, and multiplying by Σ_k: Σ_{n=1}^N γ(z_nk) µ_k = Σ_{n=1}^N γ(z_nk) x_n ⇒ µ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n, where N_k = Σ_{n=1}^N γ(z_nk) c Ghane/Mori 48
50 Maximizing Log-Likelihood - Means and Covariances Note that the mean µ_k is a weighted combination of the points x_n, using the responsibilities γ(z_nk) for cluster k: µ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n N_k = Σ_{n=1}^N γ(z_nk) is the effective number of points in the cluster A similar result comes from taking derivatives wrt the covariance matrices Σ_k: Σ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T c Ghane/Mori 50
52 Maximizing Log-Likelihood - Mixing Coefficients We can also maximize wrt the mixing coefficients π_k Note there is a constraint that Σ_k π_k = 1 Use Lagrange multipliers End up with: π_k = N_k / N, the average responsibility that component k takes c Ghane/Mori 52
53 Three Parameter Types and Three Equations These three equations do not, by themselves, constitute a solution: µ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n Σ_k = (1/N_k) Σ_{n=1}^N γ(z_nk)(x_n − µ_k)(x_n − µ_k)^T π_k = N_k / N All depend on γ(z_nk), which depends on all 3! But an iterative scheme can be used c Ghane/Mori 53
54 EM for Gaussian Mixtures Initialize the parameters, then iterate: E step: Calculate responsibilities using the current parameters: γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) M step: Re-estimate the parameters using these γ(z_nk): µ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n Σ_k = (1/N_k) Σ_{n=1}^N γ(z_nk)(x_n − µ_k)(x_n − µ_k)^T π_k = N_k / N This algorithm is known as the expectation-maximization algorithm (EM) Next we describe its general form, why it works, and why it's called EM (but first an example) c Ghane/Mori 54
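The E and M steps translate almost line for line into code; a 1-D sketch with illustrative names (the variance floor is an added guard against the σ_k → 0 singularity discussed earlier, not part of the plain ML updates):

```python
import math

def normal_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(X, pis, mus, vars_, iters=50):
    """One-dimensional mixture-of-Gaussians EM, a sketch of the E/M updates above."""
    N, K = len(X), len(pis)
    for _ in range(iters):
        # E step: responsibilities gamma[n][k] for each point and component
        gamma = []
        for x in X:
            w = [pis[k] * normal_pdf(x, mus[k], vars_[k]) for k in range(K)]
            s = sum(w)
            gamma.append([wk / s for wk in w])
        # M step: re-estimate parameters from the responsibilities
        Nk = [sum(gamma[n][k] for n in range(N)) for k in range(K)]
        mus = [sum(gamma[n][k] * X[n] for n in range(N)) / Nk[k] for k in range(K)]
        vars_ = [max(sum(gamma[n][k] * (X[n] - mus[k]) ** 2 for n in range(N)) / Nk[k], 1e-6)
                 for k in range(K)]        # floor guards against the sigma -> 0 singularity
        pis = [Nk[k] / N for k in range(K)]
    return pis, mus, vars_

X = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
pis, mus, vars_ = em_gmm_1d(X, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(sorted(round(m, 2) for m in mus))   # means close to -2.0 and 2.0
```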
57 MoG EM - Example (a) Same initialization as with K-means before Often, K-means is actually used to initialize EM c Ghane/Mori 57
58 MoG EM - Example (b) Calculate responsibilities γ(z nk ) c Ghane/Mori 58
59 MoG EM - Example (c) Calculate model parameters {π k, µ k, Σ k } using these responsibilities c Ghane/Mori 59
60 MoG EM - Example (d) Iteration 2 c Ghane/Mori 60
61 MoG EM - Example (e) Iteration 5 c Ghane/Mori 6
62 MoG EM - Example Iteration 20 - converged (f) c Ghane/Mori 62
63 Outline K-Means Gaussian Mixture Models Expectation-Maximization c Ghane/Mori 63
64 General Version of EM In general, we are interested in maximizing the likelihood p(X | θ) = Σ_Z p(X, Z | θ) where X denotes all observed variables, and Z denotes all latent (hidden, unobserved) variables Assume that maximizing p(X | θ) is difficult (e.g. mixture of Gaussians) But maximizing p(X, Z | θ) is tractable (everything observed) p(X, Z | θ) is referred to as the complete-data likelihood function, which we don't have c Ghane/Mori 64
66 A Lower Bound The strategy for optimization will be to introduce a lower bound on the likelihood This lower bound will be based on the complete-data likelihood, which is easy to optimize Iteratively increase this lower bound Make sure we're increasing the likelihood while doing so c Ghane/Mori 66
67 A Decomposition Trick To obtain the lower bound, we use a decomposition: ln p(X, Z | θ) = ln p(X | θ) + ln p(Z | X, θ) (product rule) ln p(X | θ) = L(q, θ) + KL(q || p) L(q, θ) ≡ Σ_Z q(Z) ln { p(X, Z | θ) / q(Z) } KL(q || p) ≡ − Σ_Z q(Z) ln { p(Z | X, θ) / q(Z) } KL(q || p) is known as the Kullback-Leibler divergence (KL-divergence), and is ≥ 0 (see p. 55 PRML, next slide) Hence ln p(X | θ) ≥ L(q, θ) c Ghane/Mori 67
68 Kullback-Leibler Divergence KL(p(x) || q(x)) is a measure of the difference between distributions p(x) and q(x): KL(p(x) || q(x)) = − Σ_x p(x) ln { q(x) / p(x) } Motivation: the average additional amount of information required to encode x using a code assuming distribution q(x) when x actually comes from p(x) Note it is not symmetric: KL(q(x) || p(x)) ≠ KL(p(x) || q(x)) in general It is non-negative; Jensen's inequality: ln(Σ_x x p(x)) ≥ Σ_x p(x) ln x Apply to KL: KL(p || q) = − Σ_x p(x) ln { q(x) / p(x) } ≥ − ln { Σ_x p(x) q(x)/p(x) } = − ln Σ_x q(x) = 0 c Ghane/Mori 68
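These properties are easy to verify numerically for discrete distributions; an illustrative sketch (the helper name `kl` is not from the slides):

```python
import math

def kl(p, q):
    """KL(p || q) = -sum_x p(x) ln(q(x)/p(x)) for two discrete distributions (a sketch).
    Terms with p(x) = 0 contribute 0 by convention."""
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q) >= 0, kl(q, p) >= 0)     # both non-negative
print(abs(kl(p, q) - kl(q, p)) > 1e-6)  # not symmetric in general
print(kl(p, p) == 0.0)                  # zero when the distributions are equal
```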
72 Increasing the Lower Bound - E step EM is an iterative optimization technique which tries to maximize this lower bound: ln p(X | θ) ≥ L(q, θ) E step: Fix θ_old, maximize L(q, θ_old) wrt q, i.e. choose the distribution q to maximize L Reordering the bound: L(q, θ_old) = ln p(X | θ_old) − KL(q || p) ln p(X | θ_old) does not depend on q The maximum is obtained when KL(q || p) is as small as possible Occurs when q = p, i.e. q(Z) = p(Z | X, θ_old) This is the posterior over Z; recall these are the responsibilities from the MoG model c Ghane/Mori 72
76 Increasing the Lower Bound - M step M step: Fix q, maximize L(q, θ) wrt θ The maximization problem is on L(q, θ) = Σ_Z q(Z) ln p(X, Z | θ) − Σ_Z q(Z) ln q(Z) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ) − Σ_Z p(Z | X, θ_old) ln p(Z | X, θ_old) The second term is constant with respect to θ The first term is the ln of the complete-data likelihood, which is assumed easy to optimize The expected complete log likelihood: what we think the complete-data likelihood will be c Ghane/Mori 76
80 Why does EM work? In the M-step we changed from θ_old to θ_new This increased the lower bound L, unless we were at a maximum (so we would have stopped) This must have caused the log likelihood to increase The E-step set q to make the KL-divergence 0: ln p(X | θ_old) = L(q, θ_old) + KL(q || p) = L(q, θ_old) Since the lower bound L increased when we moved from θ_old to θ_new: ln p(X | θ_old) = L(q, θ_old) < L(q, θ_new) ≤ L(q, θ_new) + KL(q || p_new) = ln p(X | θ_new) So the log-likelihood has increased going from θ_old to θ_new c Ghane/Mori 80
84 Bounding Example Consider a 2-component 1-D MoG with known variances (example from F. Dellaert) [Figure: EM example: mixture components and data. The data consists of three samples drawn from each mixture component, shown as circles and triangles.] c Ghane/Mori 84
85 Bounding Example [Figure: the true likelihood function of the two component means θ_1 and θ_2, given the data.] The true likelihood function Recall we're fitting the means θ_1, θ_2 c Ghane/Mori 85
86 Bounding Example [Figures: successive lower bounds on the likelihood over EM iterations i = 1, 2, 3, ...] Lower bound the likelihood function using the averaging distribution q(Z): ln p(X | θ) = L(q, θ) + KL(q(Z) || p(Z | X, θ)) Since q(Z) = p(Z | X, θ_old), the bound is tight (equal to the actual likelihood) at θ = θ_old c Ghane/Mori 86-89
EM - Summary

EM finds a local maximum of the likelihood

p(X|θ) = Σ_Z p(X, Z|θ)

It iterates two steps:

E step: fills in the missing variables Z (calculates their distribution)
M step: maximizes the expected complete log likelihood (expectation taken with respect to the E-step distribution)

This works because these two steps perform coordinate-wise hill-climbing on a lower bound on the likelihood p(X|θ).
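The two steps can be sketched for the same simplified mixture (two equal-weight, unit-variance components; data and initial means are made up for illustration, not taken from the lecture):

```python
import numpy as np

def em_step(x, mus):
    # E step: responsibilities q_nk = p(z_n = k | x_n, theta_old)
    log_p = -0.5 * (x[:, None] - mus[None, :]) ** 2  # equal weights, unit variance
    q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete log likelihood
    # (for Gaussian means this gives responsibility-weighted averages)
    return (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

x = np.array([-2.2, -1.9, -2.1, 1.8, 2.0, 2.2])
mus = np.array([-0.5, 0.5])  # initial guess
for _ in range(50):
    mus = em_step(x, mus)
# the means converge near the two cluster centres
```

Each iteration re-tightens the bound at the current parameters (E step) and then maximizes it (M step), so the likelihood never decreases.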
Conclusion

Readings: Ch. 9.1, 9.2, 9.4

K-means clustering

Gaussian mixture model
What about K? Model selection: either cross-validation or a Bayesian version (average over all values of K)

Expectation-maximization, a general method for learning the parameters of models when not all variables are observed
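The cross-validation route to choosing K can be sketched with the same simplified mixture (unit variance, equal weights; the synthetic data and candidate K values are assumptions for illustration): fit by EM on a training split, then compare average held-out log-likelihood across K.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_em(x, k, iters=100):
    # EM for a K-component, equal-weight, unit-variance mixture (illustrative)
    mus = np.quantile(x, np.linspace(0.1, 0.9, k))  # deterministic spread-out init
    for _ in range(iters):
        log_p = -0.5 * (x[:, None] - mus[None, :]) ** 2
        q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        mus = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
    return mus

def heldout_ll(x, mus):
    # average held-out log-likelihood under the fitted mixture
    log_p = (-0.5 * (x[:, None] - mus[None, :]) ** 2
             - 0.5 * np.log(2 * np.pi) - np.log(len(mus)))
    return np.logaddexp.reduce(log_p, axis=1).mean()

# two well-separated clusters: held-out likelihood should favour K = 2 over K = 1
train = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
valid = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
scores = {k: heldout_ll(valid, fit_em(train, k)) for k in (1, 2, 4)}
```

Unlike the training likelihood, which only grows with K, the held-out score penalizes superfluous components, which is what makes it usable for model selection.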
More information