Gaussian Mixture Models
1 Gaussian Mixture Models
David Rosenberg, Brett Bernstein (New York University), DS-GA 1003, April 26, 2017
2 Intro Question
3 Intro Question
Suppose we begin with a dataset $D = \{x_1, \dots, x_n\} \subset \mathbb{R}^2$ and we run k-means (or k-means++) to obtain $k$ cluster centers. Below we have drawn the cluster centers. If we are given a new $x \in \mathbb{R}^2$, we can assign it a label based on which cluster center is closest. What regions of the plane below correspond to each possible labeling?
4 Intro Solution
Note that the cells are disjoint (except at their borders) and convex. This can be thought of as a limitation of k-means: neither will be true for GMMs in general.
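The nearest-center labeling rule described above can be sketched in a few lines. This is only an illustration: the function name `assign_to_nearest_center` and the three centers are hypothetical, not the ones drawn on the slide.

```python
import numpy as np

def assign_to_nearest_center(points, centers):
    """For each row of `points`, return the index of the closest row of `centers`."""
    # Squared Euclidean distance from every point to every center: shape (n, k).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)

# Hypothetical cluster centers in the plane (assumed for illustration).
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
new_points = np.array([[1.0, 1.0], [3.0, 0.5], [0.5, 3.5]])
labels = assign_to_nearest_center(new_points, centers)
print(labels)
```

Each label region is an intersection of half-planes (one per competing center), which is why the cells come out convex.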
5 Gaussian Mixture Models
6 Yesterday's Intro Question
Consider the following probability model for generating data.
1. Roll a weighted k-sided die to choose a label $z \in \{1, \dots, k\}$. Let $\pi$ denote the PMF for the die.
2. Draw $x \in \mathbb{R}^d$ randomly from the multivariate normal distribution $\mathcal{N}(\mu_z, \Sigma_z)$.
Solve the following questions.
1. What is the joint distribution of $x, z$ given $\pi$ and the $\mu_z, \Sigma_z$ values?
2. Suppose you were given the dataset $D = \{(x_1, z_1), \dots, (x_n, z_n)\}$. How would you estimate the die weightings and the $\mu_z, \Sigma_z$ values?
3. How would you determine the label for a new datapoint $x$?
7 Yesterday's Intro Solution
1. The joint PDF/PMF is given by $p(x, z) = \pi(z) f(x; \mu_z, \Sigma_z)$, where
$$f(x; \mu_z, \Sigma_z) = \frac{1}{\sqrt{|2\pi\Sigma_z|}} \exp\left(-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)\right).$$
2. We could use maximum likelihood estimation. Our estimates are
$$n_z = \sum_{i=1}^n 1(z_i = z), \qquad \hat{\pi}(z) = \frac{n_z}{n}, \qquad \hat{\mu}_z = \frac{1}{n_z} \sum_{i: z_i = z} x_i, \qquad \hat{\Sigma}_z = \frac{1}{n_z} \sum_{i: z_i = z} (x_i - \hat{\mu}_z)(x_i - \hat{\mu}_z)^T.$$
3. $\arg\max_z p(x, z)$.
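As a rough sketch, the labeled-data estimates above translate directly into NumPy. The function name `fit_labeled_gmm`, the 0-based labels, and the toy data are assumptions made for illustration, and the code assumes every label occurs at least once.

```python
import numpy as np

def fit_labeled_gmm(X, z, k):
    """MLE of (pi, mu, Sigma) when labels z_i in {0, ..., k-1} are observed."""
    n, d = X.shape
    pi = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for c in range(k):
        Xc = X[z == c]                # points carrying label c
        n_c = Xc.shape[0]             # n_z in the slide's notation
        pi[c] = n_c / n
        mu[c] = Xc.mean(axis=0)
        centered = Xc - mu[c]
        # Divide by n_c (the MLE), not n_c - 1 (the unbiased estimate).
        Sigma[c] = centered.T @ centered / n_c
    return pi, mu, Sigma

# Hypothetical toy dataset with two labeled clusters.
X = np.array([[0.0, 0.0], [2.0, 0.0],
              [10.0, 10.0], [10.0, 12.0], [12.0, 10.0], [12.0, 12.0]])
z = np.array([0, 0, 1, 1, 1, 1])
pi, mu, Sigma = fit_labeled_gmm(X, z, k=2)
print(pi, mu, Sigma, sep="\n")
```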
8 Probabilistic Model for Clustering
Let's consider a generative model for the data. Suppose
1. There are $k$ clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster $z \in \{1, 2, \dots, k\}$.
2. Choose a point from the distribution for cluster $z$.
The clustering algorithm is then:
1. Use training data to fit the parameters of the generative model.
2. For each point, choose the cluster with the highest likelihood based on the model.
9 Gaussian Mixture Model (k = 3)
1. Choose $z \in \{1, 2, 3\}$.
2. Choose $x \mid z \sim \mathcal{N}(\mu_z, \Sigma_z)$.
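The two-step generative process can be simulated as follows. This is a minimal sketch: the function name `sample_gmm`, the mixing weights, means, and covariances are made-up values, and labels are 0-based.

```python
import numpy as np

def sample_gmm(n, pi, mu, Sigma, rng):
    """Draw n points: first a cluster label z, then x | z ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mu[c], Sigma[c]) for c in z])
    return x, z

# Hypothetical parameters for a 3-component mixture in the plane.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
Sigma = np.stack([np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.8], [0.8, 1.0]])])

rng = np.random.default_rng(0)
x, z = sample_gmm(500, pi, mu, Sigma, rng)
print(x.shape, np.bincount(z, minlength=3))
```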
10 Gaussian Mixture Model Parameters (k Components)
Cluster probabilities: $\pi = (\pi_1, \dots, \pi_k)$
Cluster means: $\mu = (\mu_1, \dots, \mu_k)$
Cluster covariance matrices: $\Sigma = (\Sigma_1, \dots, \Sigma_k)$
What if one cluster had many more points than another cluster?
11 Gaussian Mixture Model: Joint Distribution
Factorize the joint distribution:
$$p(x, z) = p(z)\,p(x \mid z) = \pi_z\, \mathcal{N}(x \mid \mu_z, \Sigma_z).$$
$\pi_z$ is the probability of choosing cluster $z$.
$x \mid z$ has distribution $\mathcal{N}(\mu_z, \Sigma_z)$.
The $z$ corresponding to $x$ is the true cluster assignment.
Suppose we know all the parameters of the model. Then we can easily compute the joint $p(x, z)$ and the conditional $p(z \mid x)$.
12 Latent Variable Model
We observe $x$. In the intro problem we had labeled data, but here we don't observe $z$, the cluster assignment.
Cluster assignment $z$ is called a hidden variable or latent variable.
Definition: A latent variable model is a probability model for which certain variables are never observed.
E.g., the Gaussian mixture model is a latent variable model.
13 The GMM Inference Problem
We observe $x$. We want to know $z$.
The conditional distribution of the cluster $z$ given $x$ is
$$p(z \mid x) = p(x, z) / p(x).$$
The conditional distribution is a soft assignment to clusters.
A hard assignment is
$$z^* = \arg\max_{z \in \{1, \dots, k\}} p(z \mid x).$$
So if we have the model, clustering is trivial.
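A minimal sketch of this inference step, assuming known parameters: the helper names `log_gaussian` and `posterior` and the two-component parameters are hypothetical, and the computation is done in log space purely for numerical stability.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """log N(x | mu, Sigma) for a single point x."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def posterior(x, pi, mu, Sigma):
    """Soft assignment p(z | x) = p(x, z) / p(x) over 0-based cluster labels."""
    log_joint = np.array([np.log(pi[c]) + log_gaussian(x, mu[c], Sigma[c])
                          for c in range(len(pi))])
    log_joint -= log_joint.max()      # subtract the max before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()        # dividing by the sum plays the role of p(x)

# Hypothetical two-component model with equal weights and identity covariances.
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [4.0, 0.0]])
Sigma = np.stack([np.eye(2), np.eye(2)])

soft_mid = posterior(np.array([2.0, 0.0]), pi, mu, Sigma)   # equidistant point
soft_left = posterior(np.array([0.0, 0.0]), pi, mu, Sigma)
hard_left = int(np.argmax(soft_left))                       # hard assignment
print(soft_mid, soft_left, hard_left)
```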
14 Mixture Models
15 Gaussian Mixture Model: Marginal Distribution
The marginal distribution for a single observation $x$ is
$$p(x) = \sum_{z=1}^k p(x, z) = \sum_{z=1}^k \pi_z\, \mathcal{N}(x \mid \mu_z, \Sigma_z).$$
Note that $p(x)$ is a convex combination of probability densities.
This is a common form for a probability model...
16 Mixture Distributions (or Mixture Models)
Definition: A probability density $p(x)$ represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities. That is,
$$p(x) = \sum_{i=1}^k w_i\, p_i(x),$$
where $w_i \ge 0$, $\sum_{i=1}^k w_i = 1$, and each $p_i$ is a probability density.
In our Gaussian mixture model, $x$ has a mixture distribution.
More constructively, let $S$ be a set of probability distributions:
1. Choose a distribution randomly from $S$.
2. Sample $x$ from the chosen distribution.
Then $x$ has a mixture distribution.
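One quick sanity check on the definition: a convex combination of densities is itself a density, so it should integrate to 1. The sketch below checks this numerically for a one-dimensional Gaussian mixture; the weights, means, variances, and the grid are arbitrary choices for illustration.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a univariate N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_i w_i p_i(x) with Gaussian components p_i."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

# Hypothetical mixture: weights are nonnegative and sum to 1.
weights = [0.2, 0.5, 0.3]
means = [-3.0, 0.0, 4.0]
variances = [1.0, 0.25, 2.0]

# Riemann-sum approximation of the integral over a wide grid.
grid = np.linspace(-30.0, 30.0, 60001)
dx = grid[1] - grid[0]
density = mixture_pdf(grid, weights, means, variances)
total_mass = float(np.sum(density) * dx)
print(total_mass)
```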
17 Learning in Gaussian Mixture Models
18 The GMM Learning Problem
Given data $x_1, \dots, x_n$ drawn from a GMM, estimate the parameters:
Cluster probabilities: $\pi = (\pi_1, \dots, \pi_k)$
Cluster means: $\mu = (\mu_1, \dots, \mu_k)$
Cluster covariance matrices: $\Sigma = (\Sigma_1, \dots, \Sigma_k)$
Once we have the parameters, we're done. Just do inference to get cluster assignments.
19 Estimating/Learning the Gaussian Mixture Model
One approach to learning is maximum likelihood: find parameter values that give the observed data the highest likelihood.
The model likelihood for $D = \{x_1, \dots, x_n\}$ is
$$L(\pi, \mu, \Sigma) = \prod_{i=1}^n p(x_i) = \prod_{i=1}^n \sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z, \Sigma_z).$$
As usual, we'll take our objective function to be the log of this:
$$J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left\{ \sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}.$$
20 Properties of the GMM Log-Likelihood
GMM log-likelihood:
$$J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left\{ \sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}.$$
Let's compare to the log-likelihood for a single Gaussian:
$$\sum_{i=1}^n \log \mathcal{N}(x_i \mid \mu, \Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu).$$
For a single Gaussian, the log cancels the exp in the Gaussian density. $\Rightarrow$ Things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation. $\Rightarrow$ Expression more complicated. No closed form expression for the MLE.
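Although there is no closed form for the maximizer, the objective $J$ itself is easy to evaluate. The sketch below is one way to do so with a log-sum-exp over components; the function name `gmm_log_likelihood` and the test data are assumptions. With $k = 1$ it should reduce to the single-Gaussian formula above.

```python
import numpy as np

def gmm_log_likelihood(X, pi, mu, Sigma):
    """J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    n, d = X.shape
    k = len(pi)
    log_joint = np.empty((n, k))      # entry (i, c) is log(pi_c N(x_i | mu_c, Sigma_c))
    for c in range(k):
        diff = X - mu[c]
        _, logdet = np.linalg.slogdet(Sigma[c])
        maha = np.sum(diff * np.linalg.solve(Sigma[c], diff.T).T, axis=1)
        log_joint[:, c] = np.log(pi[c]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Log-sum-exp over components: the sum inside the log does not cancel the exp.
    m = log_joint.max(axis=1, keepdims=True)
    log_px = m[:, 0] + np.log(np.sum(np.exp(log_joint - m), axis=1))
    return float(np.sum(log_px))

# Hypothetical data; a one-component "mixture" is just a single Gaussian.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
single = gmm_log_likelihood(X, np.array([1.0]), np.zeros((1, 2)), np.eye(2)[None])
closed_form = -2.0 * np.log(2 * np.pi) - 0.5   # -nd/2 log(2 pi) - n/2 log|I| - 1/2 * 1

# Two identical components with weights 1/2 give the same density.
split = gmm_log_likelihood(X, np.array([0.5, 0.5]), np.zeros((2, 2)),
                           np.stack([np.eye(2), np.eye(2)]))
print(single, closed_form, split)
```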
21 Issues with MLE for GMM
22 Identifiability Issues for GMM
Suppose we have found parameters
Cluster probabilities: $\pi = (\pi_1, \dots, \pi_k)$
Cluster means: $\mu = (\mu_1, \dots, \mu_k)$
Cluster covariance matrices: $\Sigma = (\Sigma_1, \dots, \Sigma_k)$
that are at a local optimum.
What happens if we shuffle the clusters? E.g., switch the labels for clusters 1 and 2.
We'll get the same likelihood. How many such equivalent settings are there?
Assuming all clusters are distinct, there are $k!$ equivalent solutions.
Not a problem per se, but something to be aware of.
23 Singularities for GMM
Consider the following GMM for 7 data points:
[Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.7.]
Let $\sigma^2$ be the variance of the skinny component.
What happens to the likelihood as $\sigma^2 \to 0$?
In practice, we end up in local optima that do not have this problem. Or keep restarting optimization until we do.
A Bayesian approach or regularization will also solve the problem.
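To see the answer numerically, the sketch below places a skinny component exactly on one data point and shrinks its variance. The 7 points, the broad component, and the equal weights are invented for illustration (they are not the data in Bishop's figure). The log-likelihood grows without bound as the variance shrinks.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a univariate N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Seven hypothetical 1-D data points.
x = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0])

def log_lik(var_skinny):
    """Log-likelihood of a 2-component mixture whose skinny component sits on x = 3."""
    broad = normal_pdf(x, 0.5, 4.0)            # covers all the data
    skinny = normal_pdf(x, 3.0, var_skinny)    # collapses onto a single point
    return float(np.sum(np.log(0.5 * broad + 0.5 * skinny)))

variances = [1e-1, 1e-2, 1e-4, 1e-8]
values = [log_lik(v) for v in variances]
for v, ll in zip(variances, values):
    print(f"sigma^2 = {v:g}: log-likelihood = {ll:.3f}")
```

The broad component keeps every other point's density bounded away from zero, so only the term for the point at 3 changes, and it diverges like $-\tfrac{1}{2}\log\sigma^2$.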
24 Gradient Descent / SGD for GMM
What about running gradient descent or SGD on
$$J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left\{ \sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}?$$
Can be done, but we need to be clever about it.
Each matrix $\Sigma_1, \dots, \Sigma_k$ has to be positive semidefinite.
How to maintain that constraint?
Rewrite $\Sigma_i = M_i M_i^T$, where $M_i$ is an unconstrained matrix.
Then $\Sigma_i$ is positive semidefinite.
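The reparameterization can be checked numerically: for any real matrix $M$, the product $M M^T$ is symmetric with nonnegative eigenvalues. The sketch below uses a hypothetical helper `covariance_from_unconstrained`; its optional `eps` jitter, which would make the result strictly positive definite, is an addition for illustration and not part of the slide.

```python
import numpy as np

def covariance_from_unconstrained(M, eps=0.0):
    """Map an unconstrained square matrix M to Sigma = M M^T (+ eps * I)."""
    return M @ M.T + eps * np.eye(M.shape[0])

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))          # any real entries are allowed
Sigma = covariance_from_unconstrained(M)

# eigvalsh is meant for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(Sigma)
print(eigenvalues)
```

A gradient step on $M$ can therefore never leave the feasible set, at the cost of a non-unique parameterization ($M$ and $MQ$ give the same $\Sigma$ for any orthogonal $Q$).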
25 The EM Algorithm for GMM
26 MLE for GMM
From yesterday's intro questions, we know that we can solve the MLE problem if the cluster assignments $z_i$ are known:
$$n_z = \sum_{i=1}^n 1(z_i = z), \qquad \hat{\pi}(z) = \frac{n_z}{n}, \qquad \hat{\mu}_z = \frac{1}{n_z} \sum_{i: z_i = z} x_i, \qquad \hat{\Sigma}_z = \frac{1}{n_z} \sum_{i: z_i = z} (x_i - \hat{\mu}_z)(x_i - \hat{\mu}_z)^T.$$
In the EM algorithm we will modify the equations to handle our evolving soft assignments, which we will call responsibilities.
27 Cluster Responsibilities: Some New Notation
Denote the probability that observed value $x_i$ comes from cluster $j$ by
$$\gamma_i^j = P(Z = j \mid X = x_i).$$
This is the responsibility that cluster $j$ takes for observation $x_i$.
Computationally,
$$\gamma_i^j = P(Z = j \mid X = x_i) = \frac{p(Z = j, X = x_i)}{p(x_i)} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^k \pi_c\, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}.$$
The vector $(\gamma_i^1, \dots, \gamma_i^k)$ is exactly the soft assignment for $x_i$.
Let $n_c = \sum_{i=1}^n \gamma_i^c$ be the number of points soft assigned to cluster $c$.
28 EM Algorithm for GMM: Overview
If we know $\pi$ and $\mu_j, \Sigma_j$ for all $j$, then we can easily find $\gamma_i^j = P(Z = j \mid X = x_i)$.
If we know the (soft) assignments, we can easily find estimates for $\pi$ and $\mu_j, \Sigma_j$ for all $j$.
Repeatedly alternate the previous 2 steps.
29 EM Algorithm for GMM: Overview
1. Initialize parameters $\mu$, $\Sigma$, $\pi$.
2. E step. Evaluate the responsibilities using current parameters:
$$\gamma_i^j = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^k \pi_c\, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)},$$
for $i = 1, \dots, n$ and $j = 1, \dots, k$.
3. M step. Re-estimate the parameters using responsibilities. [Compare with intro question.]
$$\mu_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^n \gamma_i^c x_i, \qquad \Sigma_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^n \gamma_i^c (x_i - \mu_c^{\text{new}})(x_i - \mu_c^{\text{new}})^T, \qquad \pi_c^{\text{new}} = \frac{n_c}{n},$$
where $n_c = \sum_{i=1}^n \gamma_i^c$.
4. Repeat from Step 2, until the log-likelihood converges.
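The four steps above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the course's reference code: the names `log_gaussians` and `em_gmm`, the fixed iteration count (instead of a convergence test), the synthetic data, and the initial parameters are all assumptions.

```python
import numpy as np

def log_gaussians(X, mu, Sigma):
    """Return an (n, k) array whose (i, c) entry is log N(x_i | mu_c, Sigma_c)."""
    n, d = X.shape
    out = np.empty((n, mu.shape[0]))
    for c in range(mu.shape[0]):
        diff = X - mu[c]
        _, logdet = np.linalg.slogdet(Sigma[c])
        maha = np.sum(diff * np.linalg.solve(Sigma[c], diff.T).T, axis=1)
        out[:, c] = -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    return out

def em_gmm(X, pi, mu, Sigma, n_iter=50):
    """Run a fixed number of EM iterations from the given initial parameters."""
    n, d = X.shape
    k = len(pi)
    log_liks = []
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, c], computed in log space.
        log_joint = np.log(pi) + log_gaussians(X, mu, Sigma)
        m = log_joint.max(axis=1, keepdims=True)
        log_px = m + np.log(np.sum(np.exp(log_joint - m), axis=1, keepdims=True))
        gamma = np.exp(log_joint - log_px)
        log_liks.append(float(log_px.sum()))   # log-likelihood of current parameters
        # M step: weighted versions of the labeled-data MLE formulas.
        n_c = gamma.sum(axis=0)
        pi = n_c / n
        mu = (gamma.T @ X) / n_c[:, None]
        Sigma = np.empty((k, d, d))
        for c in range(k):
            diff = X - mu[c]
            Sigma[c] = (gamma[:, c, None] * diff).T @ diff / n_c[c]
    return pi, mu, Sigma, gamma, log_liks

# Hypothetical data: two well-separated spherical clusters of 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.0, size=(100, 2))])
pi0 = np.array([0.5, 0.5])
mu0 = np.array([[1.0, 1.0], [5.0, 5.0]])
Sigma0 = np.stack([np.eye(2), np.eye(2)])
pi, mu, Sigma, gamma, log_liks = em_gmm(X, pi0, mu0, Sigma0)
print(pi, mu, log_liks[0], log_liks[-1], sep="\n")
```

In practice one would stop when the change in log-likelihood falls below a tolerance and guard against the singular solutions discussed earlier, for example by regularizing the covariance estimates.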
30 EM for GMM. Initialization. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]
31 EM for GMM. First soft assignment. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]
32 EM for GMM. First soft assignment. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]
33 EM for GMM. After 5 rounds of EM. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]
34 EM for GMM. After 20 rounds of EM. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]
35 Relation to K-Means
EM for GMM seems a little like k-means. In fact, there is a precise correspondence.
First, fix each cluster covariance matrix to be $\sigma^2 I$.
Then the density for each Gaussian only depends on distance to the mean.
As we take $\sigma^2 \to 0$, the update equations converge to doing k-means.
If you do a quick experiment yourself, you'll find:
Soft assignments converge to hard assignments.
This has to do with the tail behavior (exponential decay) of the Gaussian.
Can use k-means++ to initialize the parameters of the EM algorithm.
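The "quick experiment" might look like the following sketch. With every covariance fixed to $\sigma^2 I$, the Gaussian normalizing constants are identical across clusters and cancel in the responsibility ratio, leaving a softmax of $-\lVert x - \mu_c \rVert^2 / (2\sigma^2)$. The function name `responsibilities_isotropic`, the point, and the centers are hypothetical.

```python
import numpy as np

def responsibilities_isotropic(x, centers, pi, sigma2):
    """Responsibilities for one point when every covariance is sigma2 * I."""
    # The shared factor (2 pi sigma2)^(-d/2) cancels, so only distances matter.
    logits = np.log(pi) - np.sum((x - centers) ** 2, axis=1) / (2 * sigma2)
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical point and centers; the point is closer to the first center.
x = np.array([1.0, 0.0])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
pi = np.array([0.5, 0.5])

r_large = responsibilities_isotropic(x, centers, pi, sigma2=100.0)
r_small = responsibilities_isotropic(x, centers, pi, sigma2=0.01)
print(r_large)   # nearly uniform soft assignment
print(r_small)   # essentially a hard assignment to the nearest center
```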
36 Math Prerequisites for General EM Algorithm
37 Jensen's Inequality
Which is larger: $E[X^2]$ or $E[X]^2$?
38 Jensen's Inequality
Which is larger: $E[X^2]$ or $E[X]^2$?
Must be $E[X^2]$, since $\mathrm{Var}[X] = E[X^2] - E[X]^2 \ge 0$.
A more general result is true:
Theorem (Jensen's Inequality): If $f : \mathbb{R} \to \mathbb{R}$ is convex and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$. If $f$ is strictly convex, then we have equality iff $X = E[X]$ with probability 1 (i.e., $X$ is constant).
39 Proof of Jensen
Exercise: Suppose $X$ can take exactly two values: $x_1$ with probability $\pi_1$ and $x_2$ with probability $\pi_2$. Then prove Jensen's inequality.
40 Proof of Jensen
Exercise: Suppose $X$ can take exactly two values: $x_1$ with probability $\pi_1$ and $x_2$ with probability $\pi_2$. Then prove Jensen's inequality.
Let's compute $E[f(X)]$:
$$E[f(X)] = \pi_1 f(x_1) + \pi_2 f(x_2) \ge f(\pi_1 x_1 + \pi_2 x_2) = f(E[X]).$$
The inequality is exactly the definition of convexity, since $\pi_1 + \pi_2 = 1$.
For the general proof, what do we know is true about all convex functions $f : \mathbb{R} \to \mathbb{R}$?
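A tiny numerical instance of the two-point case, with $f(x) = x^2$ and arbitrarily chosen values and probabilities (all of them assumptions for illustration):

```python
def f(x):
    """A convex function: f(x) = x^2."""
    return x ** 2

# Hypothetical two-point distribution.
x1, x2 = 1.0, 3.0
p1, p2 = 0.25, 0.75

expected_f = p1 * f(x1) + p2 * f(x2)      # E[f(X)] = 0.25 * 1 + 0.75 * 9
f_of_expected = f(p1 * x1 + p2 * x2)      # f(E[X]) = 2.5^2
print(expected_f, f_of_expected)
```

Here the gap $E[f(X)] - f(E[X])$ is exactly $\mathrm{Var}[X]$, matching the variance argument used for $E[X^2]$ versus $E[X]^2$.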
41 Proof of Jensen
1. Let $e = E[X]$. (Remember $e$ is just a number.)
2. Since $f$ has a subgradient at $e$, there is an underestimating line $g(x) = ax + b$ that passes through the point $(e, f(e))$.
3. Then we have
$$E[f(X)] \ge E[g(X)] = E[aX + b] = aE[X] + b = ae + b = f(e) = f(E[X]).$$
4. If $f$ is strictly convex, then $f = g$ at exactly 1 point, so we have equality iff $X$ is constant.
42 KL-Divergence
Let $p(x)$ and $q(x)$ be probability mass functions (PMFs) on $\mathcal{X}$.
We want to measure how different they are.
The Kullback-Leibler or KL divergence is defined by
$$\mathrm{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$
(Assumes absolute continuity: $q(x) = 0$ implies $p(x) = 0$.)
Can also write
$$\mathrm{KL}(p \| q) = E_{x \sim p} \log \frac{p(x)}{q(x)}.$$
Note that the KL divergence is not symmetric and doesn't satisfy the triangle inequality.
43 Gibbs' Inequality
Theorem (Gibbs' Inequality): Let $p(x)$ and $q(x)$ be PMFs on $\mathcal{X}$. Then $\mathrm{KL}(p \| q) \ge 0$, with equality iff $p(x) = q(x)$ for all $x \in \mathcal{X}$.
Since
$$\mathrm{KL}(p \| q) = E_p\left[-\log\left(\frac{q(x)}{p(x)}\right)\right],$$
this is screaming for Jensen's inequality.
44 Gibbs' Inequality: Proof
$$\mathrm{KL}(p \| q) = E_p\left[-\log\left(\frac{q(x)}{p(x)}\right)\right] \ge -\log\left(E_p\left[\frac{q(x)}{p(x)}\right]\right) = -\log\left(\sum_{x: p(x) > 0} p(x)\frac{q(x)}{p(x)}\right) = -\log\left(\sum_{x: p(x) > 0} q(x)\right) \ge -\log\left(\sum_{x} q(x)\right) = -\log 1 = 0.$$
The first inequality is Jensen's inequality applied to the convex function $-\log$.
Since $-\log$ is strictly convex, we have equality iff $q/p$ is constant, i.e., $q = p$.
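The definition and both properties (nonnegativity, asymmetry) can be checked on small PMFs. The function name `kl_divergence` and the two example distributions are assumptions for illustration; terms with $p(x) = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)) for PMFs given as arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0                   # terms with p(x) = 0 contribute nothing
    if np.any(q[support] == 0):
        return float("inf")           # absolute continuity is violated
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Hypothetical PMFs on a two-element set.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
print(kl_pq, kl_qp, kl_pp)
```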
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated
More informationPattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM
Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationCS229 Lecture notes. Andrew Ng
CS229 Lecture notes Andrew Ng Part X Factor analysis When we have data x (i) R n that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting,
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationReview and Motivation
Review and Motivation We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into N clusters. Once we know which
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationBayesian Linear Regression [DRAFT - In Progress]
Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationBut if z is conditioned on, we need to model it:
Partially Unobserved Variables Lecture 8: Unsupervised Learning & EM Algorithm Sam Roweis October 28, 2003 Certain variables q in our models may be unobserved, either at training time or at test time or
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Contents in latter part Linear Dynamical Systems What is different from HMM? Kalman filter Its strength and limitation Particle Filter
More informationLecture 14. Clustering, K-means, and EM
Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.
More informationQuick Tour of Basic Probability Theory and Linear Algebra
Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationChapter 16. Structured Probabilistic Models for Deep Learning
Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe
More informationCOMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017
COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SOFT CLUSTERING VS HARD CLUSTERING
More informationTechnical Details about the Expectation Maximization (EM) Algorithm
Technical Details about the Expectation Maximization (EM Algorithm Dawen Liang Columbia University dliang@ee.columbia.edu February 25, 2015 1 Introduction Maximum Lielihood Estimation (MLE is widely used
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationMachine Learning for Signal Processing Bayes Classification
Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj - Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification
More informationVariational Autoencoder
Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational
More informationVariational Autoencoders (VAEs)
September 26 & October 3, 2017 Section 1 Preliminaries Kullback-Leibler divergence KL divergence (continuous case) p(x) andq(x) are two density distributions. Then the KL-divergence is defined as Z KL(p
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationK-means. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. November 19 th, Carlos Guestrin 1
EM Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 19 th, 2007 2005-2007 Carlos Guestrin 1 K-means 1. Ask user how many clusters they d like. e.g. k=5 2. Randomly guess
More informationAuto-Encoding Variational Bayes
Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower
More informationMixtures of Gaussians continued
Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationDiscrete Mathematics and Probability Theory Fall 2015 Lecture 21
CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More information