Latent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning

Latent Variable Models and EM Algorithm

Dino Sejdinovic, Department of Statistics, Oxford

Slides and other materials available at:
Probabilistic Unsupervised Learning
Probabilistic Methods

Algorithmic approach: Data → Algorithm → Analysis/Interpretation.

Probabilistic modelling approach: Unobserved process → Generative Model → Data, followed by Analysis/Interpretation.
Mixture Models

Mixture models suppose that our dataset X was created by sampling i.i.d. from K distinct populations (called mixture components). Samples in population k can be modelled using a distribution $F_{\mu_k}$ with density $f(x \mid \mu_k)$, where $\mu_k$ is the model parameter for the k-th component. For a concrete example, consider a Gaussian with unknown mean $\mu_k$ and known diagonal covariance $\sigma^2 I$,

$$f(x \mid \mu_k) = (2\pi\sigma^2)^{-p/2} \exp\left( -\frac{1}{2\sigma^2} \| x - \mu_k \|_2^2 \right).$$

Generative model: for $i = 1, 2, \dots, n$:

First determine the assignment variable independently for each data item i: $Z_i \sim \mathrm{Discrete}(\pi_1, \dots, \pi_K)$, i.e. $P(Z_i = k) = \pi_k$, where the mixing proportions satisfy $\pi_k \geq 0$ for each k and $\sum_{k=1}^K \pi_k = 1$.

Given the assignment $Z_i = k$, $X_i = (X_i^{(1)}, \dots, X_i^{(p)})$ is sampled (independently) from the corresponding k-th component: $X_i \mid Z_i = k \sim f(x \mid \mu_k)$.

We observe $X_i = x_i$ for each i but not the $Z_i$ (latent variables), and would like to infer the parameters $\{\mu_k\}_{k=1}^K$ and $\{\pi_k\}_{k=1}^K$ ($\sigma^2$ can also be estimated).
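The following minimal Python sketch (not from the slides; the function name and parameter values are illustrative) samples a dataset from exactly this generative process, with spherical Gaussian components:

```python
import numpy as np

def sample_mixture(n, pis, mus, sigma, seed=0):
    """Z_i ~ Discrete(pi_1,...,pi_K);  X_i | Z_i = k ~ N(mu_k, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    K, p = mus.shape
    z = rng.choice(K, size=n, p=pis)                  # latent assignments Z_i
    x = mus[z] + sigma * rng.standard_normal((n, p))  # component-conditional draws
    return x, z

# Three well-separated 2-D components (illustrative values).
pis = np.array([0.3, 0.3, 0.4])
mus = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
x, z = sample_mixture(200, pis, mus, sigma=1.0)
```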
Mixture Models

Unknowns to learn given data are:

Parameters: $\theta = (\pi_k, \mu_k)_{k=1}^K$, where $\pi_1, \dots, \pi_K \in [0, 1]$ and $\mu_1, \dots, \mu_K \in \mathbb{R}^p$, and

Latent variables: $z_1, \dots, z_n$.

The joint probability over all cluster indicator variables $\{Z_i\}$ is:

$$p_Z((z_i)_{i=1}^n) = \prod_{i=1}^n \pi_{z_i} = \prod_{i=1}^n \prod_{k=1}^K \pi_k^{1(z_i = k)}$$

The joint density at observations $X_i = x_i$ given $Z_i = z_i$ is:

$$p_X((x_i)_{i=1}^n \mid (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n f(x_i \mid \mu_{z_i}) = \prod_{i=1}^n \prod_{k=1}^K f(x_i \mid \mu_k)^{1(z_i = k)}$$
Mixture Models: Joint pmf/pdf of Observed and Latent Variables

Unknowns to learn given data are:

Parameters: $\theta = (\pi_k, \mu_k)_{k=1}^K$, where $\pi_1, \dots, \pi_K \in [0, 1]$ and $\mu_1, \dots, \mu_K \in \mathbb{R}^p$, and

Latent variables: $z_1, \dots, z_n$.

The joint probability mass function/density is:

$$p_{X,Z}((x_i, z_i)_{i=1}^n) = p_Z((z_i)_{i=1}^n)\, p_X((x_i)_{i=1}^n \mid (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n \prod_{k=1}^K \left( \pi_k f(x_i \mid \mu_k) \right)^{1(z_i = k)}$$

And the marginal density of $x_i$ (the resulting model on the observed data) is:

$$p(x_i) = \sum_{j=1}^K p(z_i = j, x_i) = \sum_{j=1}^K \pi_j f(x_i \mid \mu_j).$$
Mixture Models: Gaussian Mixtures with Unequal Covariances

Here $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^K$ are all the model parameters and

$$f(x \mid (\mu_k, \Sigma_k)) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right),$$

$$p(x) = \sum_{k=1}^K \pi_k f(x \mid (\mu_k, \Sigma_k)).$$

[Figure from Murphy, 2012, Ch. 11: a mixture of 3 Gaussians in 2d, showing (a) contours of constant probability for each component and (b) a surface plot of the overall density.]
Mixture Models: Responsibility

Suppose we know the parameters $\theta = (\pi_k, \mu_k)_{k=1}^K$. $Z_i$ is a random variable and its conditional distribution given the data set X is:

$$Q_{ik} := p(z_i = k \mid x_i) = \frac{p(z_i = k, x_i)}{p(x_i)} = \frac{\pi_k f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)}$$

The conditional probability $Q_{ik}$ is called the responsibility of mixture component k for data point $x_i$. These conditionals softly partition the dataset among the K components: $\sum_{k=1}^K Q_{ik} = 1$.
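A sketch of this computation, continuing the sampling example above (the helper name `responsibilities` is illustrative); working in log-space avoids numerical underflow for points far from all components:

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(x, pis, mus, sigma):
    """Q[i, k] = pi_k f(x_i | mu_k) / sum_j pi_j f(x_i | mu_j) for spherical Gaussians."""
    sq = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)   # (n, K) squared distances
    log_num = np.log(pis)[None, :] - sq / (2 * sigma ** 2)  # shared Gaussian constant cancels
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))

Q = responsibilities(x, pis, mus, sigma=1.0)
assert np.allclose(Q.sum(axis=1), 1.0)   # each row softly partitions x_i over the K components
```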
Mixture Models: Maximum Likelihood

How can we learn about the parameters $\theta = (\pi_k, \mu_k)_{k=1}^K$ from data? Standard statistical methodology asks for the maximum likelihood estimator (MLE). The goal is to maximise the marginal probability of the data over the parameters:

$$\hat\theta_{ML} = \operatorname*{argmax}_\theta p(x \mid \theta) = \operatorname*{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \prod_{i=1}^n p(x_i \mid (\pi_k, \mu_k)_{k=1}^K) = \operatorname*{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \prod_{i=1}^n \sum_{k=1}^K \pi_k f(x_i \mid \mu_k) = \operatorname*{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \underbrace{\sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k)}_{:= \ell((\pi_k,\mu_k)_{k=1}^K)}.$$
Mixture Models: Maximum Likelihood

Marginal log-likelihood:

$$\ell((\pi_k, \mu_k)_{k=1}^K) := \log p(x \mid (\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k)$$

The gradient w.r.t. $\mu_k$:

$$\nabla_{\mu_k} \ell((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \frac{\pi_k f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)} \nabla_{\mu_k} \log f(x_i \mid \mu_k) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i \mid \mu_k).$$

Difficult to solve, as $Q_{ik}$ depends implicitly on $\mu_k$.
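For reference, the marginal log-likelihood itself is easy to evaluate in the spherical-Gaussian case; this sketch (illustrative helper name `log_lik`, reusing `x`, `pis`, `mus` from the sampling example) is used again later to monitor EM:

```python
import numpy as np
from scipy.special import logsumexp

def log_lik(x, pis, mus, sigma):
    """Marginal log-likelihood: sum_i log sum_k pi_k N(x_i | mu_k, sigma^2 I)."""
    n, p = x.shape
    sq = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    log_pk = (np.log(pis)[None, :]
              - 0.5 * p * np.log(2 * np.pi * sigma ** 2)
              - sq / (2 * sigma ** 2))                 # log(pi_k f(x_i | mu_k))
    return logsumexp(log_pk, axis=1).sum()

print(log_lik(x, pis, mus, sigma=1.0))
```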
Likelihood Surface for a Simple Example

If the latent variables $z_i$ were all observed, we would have a unimodal likelihood surface, but when we marginalise out the latents, the likelihood surface becomes multimodal: there is no unique MLE.

[Figure from Murphy, 2012: (left) n = 200 data points sampled from a mixture of two 1D Gaussians with $\pi_1 = \pi_2 = 0.5$, $\sigma = 5$, $\mu_1 = -10$ and $\mu_2 = 10$; (right) the observed-data log-likelihood surface $\ell(\mu_1, \mu_2)$, with all other parameters set to their true values. We see two symmetric modes, reflecting the unidentifiability of the parameters.]

Note that mixture models are not identifiable, which means there are many settings of the parameters which have the same likelihood.
Mixture Models: Maximum Likelihood

Recall we would like to solve:

$$\nabla_{\mu_k} \ell((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i \mid \mu_k) = 0$$

What if we ignore the dependence of $Q_{ik}$ on the parameters? Taking the mixture of Gaussians with covariance $\sigma^2 I$ as an example,

$$\sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \left( -\frac{p}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \| x_i - \mu_k \|_2^2 \right) = \frac{1}{\sigma^2} \sum_{i=1}^n Q_{ik} (x_i - \mu_k) = \frac{1}{\sigma^2} \left( \sum_{i=1}^n Q_{ik} x_i - \mu_k \sum_{i=1}^n Q_{ik} \right) = 0$$

$$\Rightarrow \quad \mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}$$
Mixture Models: Maximum Likelihood

The estimate is a weighted average of data points, where the estimated mean of cluster k uses its responsibilities to data points as weights:

$$\mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}.$$

This makes sense: suppose we knew that data point $x_i$ came from population $z_i$. Then $Q_{i z_i} = 1$ and $Q_{ik} = 0$ for $k \neq z_i$, and:

$$\mu_k^{ML?} = \frac{\sum_{i: z_i = k} x_i}{\sum_{i: z_i = k} 1} = \operatorname{avg}\{x_i : z_i = k\}$$

Our best guess of the originating population is given by $Q_{ik}$. A soft K-means algorithm?
Mixture Models: Maximum Likelihood

Gradient w.r.t. the mixing proportion $\pi_k$ (including a Lagrange multiplier term $\lambda\left(\sum_k \pi_k - 1\right)$ to enforce the constraint $\sum_k \pi_k = 1$):

$$\nabla_{\pi_k} \left( \ell((\pi_k, \mu_k)_{k=1}^K) - \lambda \left( \sum_{k=1}^K \pi_k - 1 \right) \right) = \sum_{i=1}^n \frac{f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)} - \lambda = \sum_{i=1}^n \frac{Q_{ik}}{\pi_k} - \lambda = 0 \quad \Rightarrow \quad \pi_k \propto \sum_{i=1}^n Q_{ik}$$

Note: $\sum_{k=1}^K \sum_{i=1}^n Q_{ik} = \sum_{i=1}^n \underbrace{\sum_{k=1}^K Q_{ik}}_{=1} = n$, so

$$\pi_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik}}{n}$$

Again this makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population k.
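Both closed-form updates are one line each in code; a sketch (illustrative name `m_step`), continuing the earlier example:

```python
import numpy as np

def m_step(x, Q):
    """mu_k = sum_i Q_ik x_i / sum_i Q_ik;   pi_k = (1/n) sum_i Q_ik."""
    Nk = Q.sum(axis=0)                 # "effective" number of points per component
    mus = (Q.T @ x) / Nk[:, None]      # responsibility-weighted averages
    pis = Nk / x.shape[0]
    return pis, mus

pis_new, mus_new = m_step(x, Q)
```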
Mixture Models: The EM Algorithm

Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model. Start with some initial parameters $(\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K$ and iterate for $t = 1, 2, \dots$:

Expectation step:

$$Q_{ik}^{(t)} := \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})}$$

Maximization step:

$$\pi_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)}}{n}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)} x_i}{\sum_{i=1}^n Q_{ik}^{(t)}}$$

Will the algorithm converge? What does it converge to?
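Combining the two sketches above gives a complete, if bare-bones, EM loop for the spherical-Gaussian mixture; the initialisation at random data points and the fixed iteration count are arbitrary illustrative choices:

```python
import numpy as np

def em_gmm(x, K, sigma, n_iter=50, seed=0):
    """Alternate the E-step (responsibilities) and M-step (weighted averages)."""
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = x[rng.choice(len(x), size=K, replace=False)].copy()   # init at K data points
    for _ in range(n_iter):
        Q = responsibilities(x, pis, mus, sigma)   # E-step
        pis, mus = m_step(x, Q)                    # M-step
    return pis, mus, Q

pis_hat, mus_hat, Q_hat = em_gmm(x, K=3, sigma=1.0)
```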
Example: Mixture of 3 Gaussians

An example with 3 clusters. [A sequence of scatter plots shows the data and the fitted components after the 1st, 2nd, 3rd, 4th and 5th E and M steps (Iterations 1 to 5).]
EM Algorithm

In a maximum likelihood framework, the objective function is the log-likelihood,

$$\ell(\theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k).$$

Direct maximisation is not feasible. Consider another objective function $F(\theta, q)$, where q is any probability distribution on the latent variables z, such that:

$$F(\theta, q) \leq \ell(\theta) \text{ for all } \theta, q, \qquad \max_q F(\theta, q) = \ell(\theta).$$

$F(\theta, q)$ is a lower bound on the log-likelihood. We can construct an alternating maximisation algorithm as follows. For $t = 1, 2, \dots$ until convergence:

$$q^{(t)} := \operatorname*{argmax}_q F(\theta^{(t-1)}, q), \qquad \theta^{(t)} := \operatorname*{argmax}_\theta F(\theta, q^{(t)})$$
EM Algorithm

The lower bound we use is called the variational free energy. Here q is a probability mass function for a distribution over $z := (z_i)_{i=1}^n$:

$$F(\theta, q) = \mathbb{E}_q[\log p(x, z \mid \theta) - \log q(z)] = \mathbb{E}_q\left[ \left( \sum_{i=1}^n \sum_{k=1}^K 1(z_i = k) (\log \pi_k + \log f(x_i \mid \mu_k)) \right) - \log q(z) \right] = \sum_z q(z) \left[ \left( \sum_{i=1}^n \sum_{k=1}^K 1(z_i = k) (\log \pi_k + \log f(x_i \mid \mu_k)) \right) - \log q(z) \right]$$

Lemma. $F(\theta, q) \leq \ell(\theta)$ for all q and for all $\theta$.
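When q factorises over data points as $q(z) = \prod_i Q_{i z_i}$, the free energy reduces to a finite sum and the lemma can be checked numerically; a sketch reusing the quantities from the earlier code (the clipping inside the entropy term simply treats $0 \log 0$ as 0):

```python
import numpy as np

def free_energy(x, Q, pis, mus, sigma):
    """F(theta, q) = E_q[log p(x, z | theta)] - E_q[log q(z)] for q(z) = prod_i Q[i, z_i]."""
    n, p = x.shape
    sq = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    log_pk = (np.log(pis)[None, :]
              - 0.5 * p * np.log(2 * np.pi * sigma ** 2)
              - sq / (2 * sigma ** 2))                        # log(pi_k f(x_i | mu_k))
    entropy = -(Q * np.log(np.clip(Q, 1e-300, None))).sum()  # -E_q[log q(z)]
    return (Q * log_pk).sum() + entropy

# The lemma in code: the free energy never exceeds the marginal log-likelihood.
assert free_energy(x, Q_hat, pis_hat, mus_hat, 1.0) <= log_lik(x, pis_hat, mus_hat, 1.0) + 1e-8
```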
EM Algorithm - Solving for q

Lemma. $F(\theta, q) = \ell(\theta)$ for $q(z) = p(z \mid x, \theta)$.

In combination with the previous lemma, this implies that $q(z) = p(z \mid x, \theta)$ maximizes $F(\theta, q)$ for fixed $\theta$, i.e. the optimal q is simply the conditional distribution of the latents given the data and that fixed $\theta$. In the mixture model,

$$q^*(z) = \frac{p(z, x \mid \theta)}{p(x \mid \theta)} = \frac{\prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_{z'} \prod_{i=1}^n \pi_{z'_i} f(x_i \mid \mu_{z'_i})} = \prod_{i=1}^n \frac{\pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_k \pi_k f(x_i \mid \mu_k)} = \prod_{i=1}^n p(z_i \mid x_i, \theta).$$
EM Algorithm - Solving for θ

Setting the derivative with respect to $\mu_k$ to 0:

$$\nabla_{\mu_k} F(\theta, q) = \sum_z q(z) \sum_{i=1}^n 1(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k) = \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k) = 0$$

This equation can be solved quite easily. E.g., for a mixture of Gaussians,

$$\mu_k = \frac{\sum_{i=1}^n q(z_i = k) x_i}{\sum_{i=1}^n q(z_i = k)}$$

If it cannot be solved exactly, we can use a gradient ascent algorithm (generalized EM):

$$\mu_k \leftarrow \mu_k + \alpha \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k).$$

A similar derivation gives the optimal $\pi_k$ as before.
EM Algorithm

Start with some initial parameters $(\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K$ and iterate for $t = 1, 2, \dots$:

Expectation step:

$$q^{(t)}(z_i = k) := p(z_i = k \mid x_i, \theta^{(t-1)}) = \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})}$$

Maximization step:

$$\pi_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k)}{n}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k) x_i}{\sum_{i=1}^n q^{(t)}(z_i = k)}$$

Theorem. The EM algorithm does not decrease the log-likelihood.
Proof: $\ell(\theta^{(t-1)}) = F(\theta^{(t-1)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t+1)}) = \ell(\theta^{(t)}).$

Under the additional assumption that the Hessians $\nabla^2_\theta F(\theta^{(t)}, q^{(t)})$ are negative definite with eigenvalues $< -\epsilon < 0$, $\theta^{(t)} \to \theta^*$ where $\theta^*$ is a local MLE.
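This monotonicity gives a cheap sanity check on any EM implementation: the marginal log-likelihood, evaluated after every iteration, should never go down. A sketch using the earlier helpers (the crude initialisation at the first three data points is only for illustration):

```python
import numpy as np

pis_t = np.full(3, 1.0 / 3)
mus_t = x[:3].copy()                                       # crude initialisation
lls = []
for _ in range(20):
    Q_t = responsibilities(x, pis_t, mus_t, sigma=1.0)     # E-step
    pis_t, mus_t = m_step(x, Q_t)                          # M-step
    lls.append(log_lik(x, pis_t, mus_t, sigma=1.0))
assert all(b >= a - 1e-8 for a, b in zip(lls, lls[1:]))    # monotone non-decreasing
```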
Notes on the Probabilistic Approach and the EM Algorithm

Some good things:
Guaranteed convergence to locally optimal parameters.
Formal reasoning about uncertainties, using both Bayes' theorem and maximum likelihood theory.
The rich language of probability theory can express a wide range of generative models, and algorithms for ML estimation can be derived in a straightforward way.

Some bad things:
Can get stuck in local optima, so multiple restarts are recommended.
Slower and more expensive than K-means.
The choice of K is still problematic, but a rich array of methods for model selection comes to the rescue.
Flexible Gaussian Mixture Models

We can allow each cluster to have its own mean and covariance structure to enable greater flexibility in the model. [Figure panels: different covariances; different but diagonal covariances; identical covariances; identical and spherical covariances.]
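These covariance-structure variants are available off the shelf in standard libraries; for example (a sketch, assuming scikit-learn is installed and reusing the sampled data `x` from earlier), GaussianMixture exposes them through its covariance_type option:

```python
from sklearn.mixture import GaussianMixture

for cov_type in ["full", "diag", "tied", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(x)
    print(cov_type, gmm.weights_.round(2), gmm.score(x))   # mixing proportions, avg log-lik
```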
Probabilistic PCA

A probabilistic model related to PCA (also known as sensible PCA) has the following generative model. Let $k < n, p$ be given. For $i = 1, 2, \dots, n$:

Let $Y_i$ be a (latent) k-dimensional normally distributed random variable with zero mean and identity covariance: $Y_i \sim \mathcal{N}(0, I_k)$.

We model the distribution of the i-th data point given $Y_i$ as a p-dimensional normal: $X_i \sim \mathcal{N}(\mu + L Y_i, \sigma^2 I)$,

where the parameters are a vector $\mu \in \mathbb{R}^p$, a matrix $L \in \mathbb{R}^{p \times k}$ and $\sigma^2 > 0$.

Tipping and Bishop, 1999
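A minimal sketch of this generative process (the function name and parameter values are illustrative), here with a 1-dimensional latent space embedded in 2-dimensional observations:

```python
import numpy as np

def sample_ppca(n, mu, L, sigma, seed=0):
    """Y_i ~ N(0, I_k);  X_i | Y_i = y_i ~ N(mu + L y_i, sigma^2 I_p)."""
    rng = np.random.default_rng(seed)
    p, k = L.shape
    y = rng.standard_normal((n, k))                         # latent coordinates
    x = mu + y @ L.T + sigma * rng.standard_normal((n, p))  # noisy embedding into R^p
    return x, y

x_ppca, y_lat = sample_ppca(500, mu=np.zeros(2), L=np.array([[3.0], [1.0]]), sigma=0.5)
```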
Probabilistic PCA: EM vs MLE

The EM algorithm can be used for ML estimation (see lecture notes), but PPCA can more directly give an MLE (which is not unique). Let $\lambda_1 \geq \dots \geq \lambda_p$ be the eigenvalues of the sample covariance and $V_{1:k} \in \mathbb{R}^{p \times k}$ the top k eigenvectors as before. Let $Q \in \mathbb{R}^{k \times k}$ be any orthogonal matrix. Then an MLE is given by:

$$\mu^{MLE} = \bar{x}, \qquad (\sigma^2)^{MLE} = \frac{1}{p - k} \sum_{j=k+1}^p \lambda_j,$$

$$L^{MLE} = V_{1:k}\, \mathrm{diag}\left( (\lambda_1 - (\sigma^2)^{MLE})^{1/2}, \dots, (\lambda_k - (\sigma^2)^{MLE})^{1/2} \right) Q.$$

However, EM can be faster, can be implemented online, can handle missing data and can be extended to more complicated models!

Tipping and Bishop, 1999
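Translating the closed form into code takes only a few lines; a sketch (illustrative name `ppca_mle`, with the rotation Q taken as the identity), applied to the synthetic PPCA data generated above:

```python
import numpy as np

def ppca_mle(x, k):
    """Closed-form PPCA MLE: mu, L and sigma^2 from the sample covariance eigendecomposition."""
    mu = x.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(x, rowvar=False))   # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]                     # sort descending
    sigma2 = lam[k:].mean()                            # average of the p - k discarded eigenvalues
    L = V[:, :k] * np.sqrt(lam[:k] - sigma2)           # V_{1:k} diag(lam_j - sigma^2)^{1/2}
    return mu, L, sigma2

mu_hat, L_hat, s2_hat = ppca_mle(x_ppca, k=1)          # recovers L up to sign/rotation
```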
Probabilistic PCA

[A sequence of figures illustrates PPCA geometrically: the latents and the principal subspace; the PCA projection; the PPCA latent prior and PPCA noise; and the PPCA posterior and PPCA projection. Figures from M. Sahani's UCL course on Unsupervised Learning.]
Mixture of Probabilistic PCAs

We have learnt two types of unsupervised learning techniques:
Dimensionality reduction, e.g. PCA, MDS, Isomap.
Clustering, e.g. K-means, linkage and mixture models.

Probabilistic models allow us to construct more complex models from simpler pieces. A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:

$$Z_i \sim \mathrm{Discrete}(\pi_1, \dots, \pi_K), \qquad Y_i \sim \mathcal{N}(0, I_d), \qquad X_i \mid Z_i = k, Y_i = y_i \sim \mathcal{N}(\mu_k + L y_i, \sigma^2 I_p)$$

This allows flexible modelling of covariance structure without using too many parameters.

Ghahramani and Hinton, 1996
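Combining the two earlier sampling sketches gives the generative process for this model (a sketch with illustrative names; the loading matrix L is shared across components here, exactly as written above):

```python
import numpy as np

def sample_mixture_ppca(n, pis, mus, L, sigma, seed=0):
    """Z_i ~ Discrete(pi);  Y_i ~ N(0, I_d);  X_i | Z_i = k, Y_i = y_i ~ N(mu_k + L y_i, sigma^2 I_p)."""
    rng = np.random.default_rng(seed)
    K, p = mus.shape
    d = L.shape[1]
    z = rng.choice(K, size=n, p=pis)                    # cluster memberships
    y = rng.standard_normal((n, d))                     # low-dimensional latent coordinates
    x = mus[z] + y @ L.T + sigma * rng.standard_normal((n, p))
    return x, z, y
```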
Further Reading

Hastie et al., Section 8.5
Bishop, Chapter 9
Roweis and Ghahramani: A Unifying Review of Linear Gaussian Models