Advanced Machine Learning, Lecture 10: Mixture Models II


Advanced Machine Learning, Lecture 10: Mixture Models II
30.11.2015, Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de/, leibe@vision.rwth-aachen.de

Announcement
- Exercise sheet 2 is online, covering: Sampling (Rejection Sampling, Importance Sampling, Metropolis-Hastings); EM; Mixtures of Bernoulli distributions [today's topic].
- The exercise session will be on Wednesday, 07.12. Please submit your results by 06.12., midnight.

This Lecture: Advanced Machine Learning
- Regression approaches: Linear Regression, Regularization (Ridge, Lasso), Gaussian Processes
- Learning with latent variables: Probability Distributions, Approximate Inference, Mixture Models, EM and Generalizations
- Deep Learning: Neural Networks, CNNs, RNNs, RBMs, etc.

Topics of This Lecture
- The EM algorithm in general: Recap of general EM; Example: Mixtures of Bernoulli distributions; Monte Carlo EM
- Bayesian mixture models: Towards a full Bayesian treatment; Dirichlet priors; Finite mixtures; Infinite mixtures; Approximate inference (only as supplementary material)

Recap: Mixture of Gaussians / GMMs as Latent Variable Models
- Generative model: first pick a component j with probability p(j) = \pi_j, then draw x from the component density p(x|\theta_j). For three components, the marginal distribution of x is
  p(x|\theta) = \sum_{j=1}^{3} \pi_j p(x|\theta_j).
- We can write GMMs in terms of latent variables z (a small numerical sketch follows below).
- Advantage of this formulation: we have represented the marginal distribution in terms of latent variables z. Since p(x) = \sum_z p(x, z), there is a corresponding latent variable z_n for each data point x_n. We can now work with the joint distribution p(x, z) instead of the marginal distribution p(x). This will lead to significant simplifications.
Slide credit: Bernt Schiele. Image source: C.M. Bishop, 2006.
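To make the latent-variable view concrete, here is a minimal numerical sketch (not from the lecture; the three components and their weights are assumed purely for illustration). It evaluates the marginal p(x) = \sum_j \pi_j \mathcal{N}(x|\mu_j, \Sigma_j) by explicitly summing the joint p(x, z) over z, and also reads off the posterior over z.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D mixture with K = 3 components (illustrative values only).
pis = np.array([0.5, 0.3, 0.2])                       # mixing coefficients pi_j
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]

x = np.array([1.0, 1.0])

# Joint p(x, z = j) = pi_j * N(x | mu_j, Sigma_j) for each component j.
joint = np.array([pis[j] * multivariate_normal.pdf(x, mus[j], covs[j])
                  for j in range(3)])

# Marginal p(x) = sum_z p(x, z): summing out the latent variable recovers
# the usual mixture density sum_j pi_j N(x | mu_j, Sigma_j).
print("p(x) =", joint.sum())

# Posterior over the latent variable (the responsibilities).
print("p(z | x) =", joint / joint.sum())
```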

Recap: Sampling from a Gaussian Mixture
- MoG sampling: we can use ancestral sampling to generate random samples from a Gaussian mixture model:
  1. Generate a value from the marginal distribution p(z).
  2. Generate a value from the conditional distribution p(x|z).
- Repeating this yields samples from the joint p(x, z); keeping only the x values yields samples from the marginal p(x). (A sampling sketch follows this section.)
Image source: C.M. Bishop, 2006.

Recap: Gaussian Mixtures Revisited
- Applying the latent variable view of EM: the goal is to maximize the log-likelihood using the observed data X, which requires evaluating the responsibilities \gamma(z_{nk}).
- Suppose we were additionally given the values of the latent variables Z. In the corresponding graphical model for the complete data {X, Z}, the latent variables are observed, and the likelihood becomes straightforward to maximize.
Image source: C.M. Bishop, 2006.

Recap: Alternative View of EM
- In practice, however, we are not given the complete data set {X, Z}, but only the incomplete data X. All we can compute about Z is its posterior distribution p(Z|X, \theta).
- Since we cannot use the complete-data log-likelihood directly, we consider instead its expected value under the posterior distribution of the latent variables:
  Q(\theta, \theta^{old}) = \sum_Z p(Z|X, \theta^{old}) \log p(X, Z|\theta).
- This corresponds to the E-step of the EM algorithm. In the subsequent M-step, we then maximize this expectation to obtain the revised parameter set \theta^{new}.

Recap: General EM Algorithm
1. Choose an initial setting for the parameters \theta^{old}.
2. E-step: Evaluate p(Z|X, \theta^{old}).
3. M-step: Evaluate \theta^{new} = \arg\max_\theta Q(\theta, \theta^{old}), where Q(\theta, \theta^{old}) = \sum_Z p(Z|X, \theta^{old}) \log p(X, Z|\theta).
4. While not converged, let \theta^{old} \leftarrow \theta^{new} and return to step 2.

Recap: MAP-EM
- The EM algorithm can be adapted to find MAP solutions for models in which a prior p(\theta) is defined over the parameters. Only two changes are needed:
  2. E-step: Evaluate p(Z|X, \theta^{old}).
  3. M-step: Evaluate \theta^{new} = \arg\max_\theta \{ Q(\theta, \theta^{old}) + \log p(\theta) \}.
- Suitable choices for the prior will remove the ML singularities!

Gaussian Mixtures Revisited
- Maximize the likelihood: for the complete-data set {X, Z}, the likelihood has the form
  p(X, Z|\mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \mathcal{N}(x_n|\mu_k, \Sigma_k)^{z_{nk}}.
- Taking the logarithm, we obtain
  \log p(X, Z|\mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \{ \log \pi_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k) \}.
- Compared to the incomplete-data case, the order of the sum and the logarithm has been interchanged, which gives a much simpler solution to the ML problem: maximization w.r.t. a mean or covariance is exactly as for a single Gaussian, except that it involves only the subset of data points that are assigned to that component.
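A minimal sketch of ancestral sampling from a Gaussian mixture, matching the two-step recipe above (sample z from p(z), then x from p(x|z)); the mixture parameters and the helper name sample_mog are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D mixture with K = 3 components (illustrative values only).
pis = np.array([0.5, 0.3, 0.2])     # p(z = k) = pi_k
mus = np.array([-2.0, 0.0, 3.0])    # component means
sigmas = np.array([0.5, 1.0, 0.8])  # component standard deviations

def sample_mog(n_samples):
    # Step 1: draw the latent component indices z_n from the marginal p(z).
    z = rng.choice(len(pis), size=n_samples, p=pis)
    # Step 2: draw x_n from the conditional p(x | z_n).
    x = rng.normal(loc=mus[z], scale=sigmas[z])
    return x, z   # (x, z) are samples from the joint; x alone, from the marginal

x, z = sample_mog(1000)
print("empirical component frequencies:", np.bincount(z) / len(z))
print("empirical mean of x:", x.mean(), " expected:", np.dot(pis, mus))
```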

Gaussian Mixtures Revisited
- Maximization w.r.t. the mixing coefficients is more complex, since the \pi_k are coupled by the summation constraint \sum_k \pi_k = 1. Solving with a Lagrange multiplier gives (after a longer derivation) \pi_k = \frac{1}{N} \sum_{n=1}^{N} z_{nk}.
- The complete-data log-likelihood can thus be maximized trivially in closed form.

Gaussian Mixtures Revisited
- In practice, we do not have values for the latent variables. We therefore consider the expectation w.r.t. the posterior distribution of the latent variables instead. The posterior distribution takes the form
  p(Z|X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} [\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)]^{z_{nk}}
  and factorizes over n, so that the {z_n} are independent under the posterior.
- Expected value of the indicator variable z_{nk} under the posterior:
  E[z_{nk}] = \frac{\sum_{z_{nk}} z_{nk} [\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)]^{z_{nk}}}{\sum_{z_{nj}} [\pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)]^{z_{nj}}}
           = \frac{\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)}
           = \gamma(z_{nk}).

Gaussian Mixtures Revisited
- Continuing the estimation, the expected complete-data log-likelihood is therefore
  E_Z[\log p(X, Z|\mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \{ \log \pi_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k) \}.
- Maximizing this expectation yields exactly the update equations derived before: this is precisely the EM algorithm for Gaussian mixtures. (A compact implementation sketch follows this section.)

Summary So Far
- We have now seen a generalized EM algorithm, applicable to general estimation problems with latent variables; in particular, it is also applicable to mixtures of other base distributions.
- In order to get some familiarity with the general EM algorithm, let's apply it to a different class of distributions.

Mixtures of Bernoulli Distributions
- Discrete binary variables: consider D binary variables x = (x_1, ..., x_D)^T, each described by a Bernoulli distribution with parameter \mu_i, so that
  p(x|\mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}.
- Mean and covariance are given by E[x] = \mu and cov[x] = diag\{ \mu (1 - \mu) \}.
- The covariance is diagonal, i.e., the variables are modeled independently.
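The derivations above translate almost line by line into code. The following is a minimal EM sketch for a full-covariance Gaussian mixture, run on synthetic data that is assumed here for illustration; it mirrors the equations rather than being a robust implementation (no log-domain E-step, only a small ridge on the covariances).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Synthetic 2-D data from two assumed clusters (for illustration only).
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([4, 4], 0.7, size=(100, 2))])
N, D = X.shape
K = 2

# Initialization of theta = (pi, mu, Sigma).
pis = np.full(K, 1.0 / K)
mus = X[rng.choice(N, K, replace=False)]
Sigmas = np.array([np.eye(D) for _ in range(K)])

for it in range(50):
    # E-step: gamma(z_nk) = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: closed-form updates from the expected complete-data log-likelihood.
    Nk = gamma.sum(axis=0)                     # effective number of points per component
    mus = (gamma.T @ X) / Nk[:, None]          # mu_k = (1/N_k) sum_n gamma_nk x_n
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    pis = Nk / N                               # pi_k = N_k / N

print("mixing coefficients:", pis)
print("means:\n", mus)
```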

Mixtures of Bernoulli Distributions
- Mixtures of discrete binary variables: now consider a finite mixture of those distributions,
  p(x|\mu, \pi) = \sum_{k=1}^{K} \pi_k p(x|\mu_k), with p(x|\mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}.
- Mean and covariance of the mixture are given by
  E[x] = \sum_{k=1}^{K} \pi_k \mu_k,
  cov[x] = \sum_{k=1}^{K} \pi_k ( \Sigma_k + \mu_k \mu_k^T ) - E[x] E[x]^T, where \Sigma_k = diag\{ \mu_{ki} (1 - \mu_{ki}) \}.
- The covariance is no longer diagonal, so the model can capture dependencies between the variables.

Mixtures of Bernoulli Distributions
- Log-likelihood for the model: given a data set X = {x_1, ..., x_N},
  \log p(X|\mu, \pi) = \sum_{n=1}^{N} \log \{ \sum_{k=1}^{K} \pi_k p(x_n|\mu_k) \}.
- Again the same observation: the summation inside the logarithm makes direct maximization difficult.
- In the following, we will derive the EM algorithm for mixtures of Bernoulli distributions. This will show how we can derive EM algorithms in the general case.

EM for Bernoulli Mixtures
- Latent variable formulation: introduce a latent variable z = (z_1, ..., z_K)^T with 1-of-K coding.
- Conditional distribution of x: p(x|z, \mu) = \prod_{k=1}^{K} p(x|\mu_k)^{z_k}.
- Prior distribution for the latent variables: p(z|\pi) = \prod_{k=1}^{K} \pi_k^{z_k}.
- Again, we can verify that marginalizing z recovers the mixture: p(x|\mu, \pi) = \sum_z p(x|z, \mu) p(z|\pi) = \sum_k \pi_k p(x|\mu_k).

(Recap: the general EM algorithm as given above: initialize \theta^{old}; E-step: evaluate p(Z|X, \theta^{old}); M-step: \theta^{new} = \arg\max_\theta Q(\theta, \theta^{old}); iterate until convergence.)

EM for Bernoulli Mixtures: E-Step
- Complete-data likelihood: p(X, Z|\mu, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} [\pi_k p(x_n|\mu_k)]^{z_{nk}}.
- E-step: evaluate the responsibilities
  \gamma(z_{nk}) = E[z_{nk}] = \frac{\sum_{z_{nk}} z_{nk} [\pi_k p(x_n|\mu_k)]^{z_{nk}}}{\sum_{z_{nj}} [\pi_j p(x_n|\mu_j)]^{z_{nj}}} = \frac{\pi_k p(x_n|\mu_k)}{\sum_{j=1}^{K} \pi_j p(x_n|\mu_j)}.
- Note: the posterior distribution of the latent variables Z again has the same form as for Gaussian mixtures, where \gamma_j(x_n) = \frac{\pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}. (A code sketch of this E-step follows this section.)
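A small sketch of the Bernoulli-mixture E-step. The data and parameter values are assumed, and the responsibilities are computed in the log domain via logsumexp for numerical stability, which is equivalent to the literal ratio above.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)

# Assumed binary data: N points of D bits (illustrative only).
N, D, K = 200, 16, 3
X = rng.integers(0, 2, size=(N, D)).astype(float)

# Current parameter estimates (random initialization for the sketch).
pis = np.full(K, 1.0 / K)
mus = rng.uniform(0.25, 0.75, size=(K, D))   # mu_ki = P(x_i = 1 | component k)

def responsibilities(X, pis, mus):
    # log p(x_n | mu_k) = sum_i [ x_ni log mu_ki + (1 - x_ni) log(1 - mu_ki) ]
    log_px = X @ np.log(mus).T + (1.0 - X) @ np.log(1.0 - mus).T   # shape (N, K)
    log_joint = np.log(pis) + log_px                               # log pi_k + log p(x_n|mu_k)
    # gamma(z_nk) = exp( log_joint_nk - logsumexp_j log_joint_nj )
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

gamma = responsibilities(X, pis, mus)
print(gamma.shape, gamma.sum(axis=1)[:5])   # each row sums to 1
```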

EM for Bernoulli Mixtures: M-Step
- Complete-data log-likelihood:
  \log p(X, Z|\mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \{ \log \pi_k + \sum_{i=1}^{D} [ x_{ni} \log \mu_{ki} + (1 - x_{ni}) \log(1 - \mu_{ki}) ] \}.
- Expectation w.r.t. the posterior distribution of Z:
  E_Z[\log p(X, Z|\mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \{ \log \pi_k + \sum_{i=1}^{D} [ x_{ni} \log \mu_{ki} + (1 - x_{ni}) \log(1 - \mu_{ki}) ] \},
  where \gamma(z_{nk}) = E[z_{nk}] are again the responsibilities for each x_n.

EM for Bernoulli Mixtures: M-Step
- Remark: the \gamma(z_{nk}) occur in only two forms in the expectation:
  N_k = \sum_{n=1}^{N} \gamma(z_{nk}) and \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n.
- Interpretation: N_k is the effective number of data points associated with component k; \bar{x}_k is the responsibility-weighted mean of the data points softly assigned to component k.

EM for Bernoulli Mixtures: M-Step
- Maximize the expected complete-data log-likelihood w.r.t. the parameter \mu_k:
  \frac{\partial}{\partial \mu_k} E_Z[\log p(X, Z|\mu, \pi)]
    = \frac{\partial}{\partial \mu_k} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \{ \log \pi_k + [ x_n \log \mu_k + (1 - x_n) \log(1 - \mu_k) ] \}
    = \frac{1}{\mu_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n - \frac{1}{1 - \mu_k} \sum_{n=1}^{N} \gamma(z_{nk}) (1 - x_n) \overset{!}{=} 0
  \Rightarrow \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n = \bar{x}_k.

EM for Bernoulli Mixtures: M-Step
- Maximize the expected complete-data log-likelihood w.r.t. \pi_k under the constraint \sum_k \pi_k = 1. With a Lagrange multiplier,
  \arg\max_\pi E_Z[\log p(X, Z|\mu, \pi)] + \lambda ( \sum_{k=1}^{K} \pi_k - 1 ) \Rightarrow \pi_k = \frac{N_k}{N}.

Discussion
- Comparison with Gaussian mixtures: in contrast to Gaussian mixtures, there are no singularities in which the likelihood goes to infinity. This follows from the property of Bernoulli distributions that 0 \le p(x_n|\mu_k) \le 1.
- However, there are still problem cases when \mu_{ki} becomes 0 or 1, since the expectation contains the terms [ x_{ni} \log \mu_{ki} + (1 - x_{ni}) \log(1 - \mu_{ki}) ]. In practice, one needs to enforce a range [MIN_VAL, 1 - MIN_VAL] for \mu_{ki}. (See the implementation sketch after this section.)
- General remarks: Bernoulli mixtures are used in practice to represent binary data. The resulting model is also known as latent class analysis.
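Continuing the E-step sketch above, here is a minimal M-step and full EM loop for the Bernoulli mixture. The clipping of mu_ki to [MIN_VAL, 1 - MIN_VAL] follows the remark in the discussion; the value of MIN_VAL and the data are assumptions made for illustration.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(3)
MIN_VAL = 1e-6   # assumed clipping constant to keep mu_ki away from 0 and 1

N, D, K = 300, 20, 3
X = rng.integers(0, 2, size=(N, D)).astype(float)   # assumed binary data

pis = np.full(K, 1.0 / K)
mus = rng.uniform(0.25, 0.75, size=(K, D))

for it in range(30):
    # E-step: responsibilities gamma(z_nk), computed in the log domain.
    log_px = X @ np.log(mus).T + (1.0 - X) @ np.log(1.0 - mus).T
    log_joint = np.log(pis) + log_px
    gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

    # M-step: N_k and x_bar_k, then mu_k = x_bar_k and pi_k = N_k / N.
    Nk = gamma.sum(axis=0)
    mus = (gamma.T @ X) / Nk[:, None]
    mus = np.clip(mus, MIN_VAL, 1.0 - MIN_VAL)   # avoid log(0) in the next E-step
    pis = Nk / N

print("pi:", pis)
print("component means mu_k (first 5 dims):\n", mus[:, :5])
```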

Example: Handwritten Digit Recognition
- Binarized digit data (examples from a set of 600 digits).
- Means of a 3-component Bernoulli mixture after 10 EM iterations, compared with the ML result of a single multivariate Bernoulli distribution.
Image source: C.M. Bishop, 2006.

Monte Carlo EM
- EM procedure, M-step: maximize the expectation of the complete-data log-likelihood,
  Q(\theta, \theta^{old}) = \sum_Z p(Z|X, \theta^{old}) \log p(X, Z|\theta).
- For more complex models, we may no longer be able to compute this expectation analytically.
- Idea: use sampling to approximate the sum (or integral) by a finite sum over samples {Z^{(l)}} drawn from the current estimate of the posterior:
  Q(\theta, \theta^{old}) \approx \frac{1}{L} \sum_{l=1}^{L} \log p(X, Z^{(l)}|\theta).
- This procedure is called the Monte Carlo EM algorithm. (A toy sketch follows this section.)

Towards a Full Bayesian Treatment
- Mixture models: we have discussed mixture distributions with K components.
- So far, we have derived the ML estimates (EM) and introduced a prior p(\theta) over the parameters (MAP-EM).
- One question remains open: how do we set K? Let's also set a prior on the number of components...

Bayesian Mixture Models
- Let's be Bayesian about mixture models: place priors over our parameters. Again, introduce the variable z_n as an indicator of which component data point x_n belongs to.
- This is similar to the graphical model we've used before, but now \pi and \theta_k = (\mu_k, \Sigma_k) are also treated as random variables.
- What would be suitable priors for them?
Slide inspired by Yee Whye Teh.
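A toy Monte Carlo E-step sketch for a Gaussian mixture: assignments Z^{(l)} are sampled from the current posterior p(Z|X, \theta^{old}) and the expected complete-data log-likelihood is replaced by the sample average. For a GMM this is unnecessary (the exact E-step is available), so this is only meant to illustrate the mechanics; the data, parameters, and helper name complete_data_loglik are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Assumed data and current parameters theta_old of a 2-component 2-D GMM.
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.0, size=(100, 2))])
pis = np.array([0.5, 0.5])
mus = np.array([[0.5, 0.5], [2.5, 2.5]])
Sigmas = np.array([np.eye(2), np.eye(2)])
N, K, L = len(X), 2, 20   # L = number of posterior samples Z^(l)

# Exact posterior over assignments under theta_old (this is what we sample from).
dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                        for k in range(K)])
post = dens / dens.sum(axis=1, keepdims=True)

def complete_data_loglik(z, pis, mus, Sigmas):
    # log p(X, Z|theta) = sum_n [ log pi_{z_n} + log N(x_n | mu_{z_n}, Sigma_{z_n}) ]
    return sum(np.log(pis[z[n]]) +
               multivariate_normal.logpdf(X[n], mus[z[n]], Sigmas[z[n]])
               for n in range(N))

# Monte Carlo estimate of Q(theta, theta_old), evaluated here at theta = theta_old.
samples = [np.array([rng.choice(K, p=post[n]) for n in range(N)]) for _ in range(L)]
Q_mc = np.mean([complete_data_loglik(z, pis, mus, Sigmas) for z in samples])
print("Monte Carlo estimate of Q(theta, theta_old):", Q_mc)
```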

Bayesian Mixture Models
- Place priors over the parameters and introduce conjugate priors: a Dirichlet prior over the mixing proportions \pi (developed in the following slides) and a Normal-Inverse-Wishart prior over each (\mu_k, \Sigma_k).
Slide inspired by Yee Whye Teh.

Bayesian Mixture Models
- Full Bayesian treatment: given a dataset, we are interested in the posterior over the cluster assignments, where the likelihood is obtained by marginalizing over the parameters \theta.
- The posterior over assignments is intractable: the denominator requires summing over all possible partitions of the data into K groups!
- We therefore need efficient approximate inference methods.

Bayesian Mixture Models
- Let's examine this model more closely: What is the role of the Dirichlet priors? How can we perform efficient inference? What happens when K goes to infinity?
- This will lead us to an interesting class of models, Dirichlet Processes, which make it possible to express infinite mixture distributions: clustering that automatically adapts the number of clusters to the data and dynamically creates new clusters on the fly.

Sneak Preview: Dirichlet Process MoG
- Samples drawn from a DP mixture: more structure appears as more points are drawn.
Slide credit: Zoubin Ghahramani.

Recap: The Dirichlet Distribution
- Conjugate prior for the Categorical and the Multinomial distributions:
  Dir(\mu|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, with \alpha_0 = \sum_{k=1}^{K} \alpha_k.
- Symmetric version (with concentration parameter \alpha): \alpha_k = \alpha / K for all k.
- Properties (symmetric version):
  E[\mu_k] = \frac{\alpha_k}{\alpha_0} = \frac{1}{K},
  var[\mu_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)} = \frac{K - 1}{K^2 (\alpha + 1)},
  cov[\mu_j, \mu_k] = \frac{-\alpha_j \alpha_k}{\alpha_0^2 (\alpha_0 + 1)} = \frac{-1}{K^2 (\alpha + 1)}.
Image source: C. Bishop, 2006.

Dirichlet Samples
- Effect of the concentration parameter \alpha: it controls the sparsity of the resulting samples. (A sampling sketch follows this section.)
Slide credit: Erik Sudderth. Image source: Erik Sudderth.
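A quick numerical check of the symmetric Dirichlet (with \alpha_k = \alpha/K) and of the effect of the concentration parameter: small \alpha yields sparse samples, large \alpha yields near-uniform ones. The values of K and \alpha are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
K = 10

for alpha in [0.1, 1.0, 10.0, 100.0]:
    # Symmetric Dirichlet with parameters alpha_k = alpha / K.
    samples = rng.dirichlet(np.full(K, alpha / K), size=20000)

    # Empirical mean and variance vs. the closed-form properties above.
    print(f"alpha = {alpha:6.1f}: "
          f"mean ~ {samples[:, 0].mean():.3f} (theory {1 / K:.3f}), "
          f"var ~ {samples[:, 0].var():.4f} "
          f"(theory {(K - 1) / (K**2 * (alpha + 1)):.4f})")

    # Sparsity: with small alpha, most of the mass sits on a few components.
    print("    example sample:", np.round(samples[0], 3))
```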

Mixture Model with Dirichlet Priors
- Finite mixture of K components: p(x_n|\theta) = \sum_{k=1}^{K} \pi_k p(x_n|\theta_k).
- The distribution of the latent variables z_n given \pi is multinomial: p(z_1, ..., z_N|\pi) = \prod_{k=1}^{K} \pi_k^{N_k}, where N_k is the number of points assigned to component k.
- Assume the mixing proportions have a symmetric conjugate Dirichlet prior:
  p(\pi|\alpha) = Dir(\pi | \alpha/K, ..., \alpha/K) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \prod_{k=1}^{K} \pi_k^{\alpha/K - 1}.
Slide adapted from Zoubin Ghahramani. Image source: Zoubin Ghahramani.

Mixture Model with Dirichlet Priors
- Integrating out the mixing proportions \pi:
  p(z_1, ..., z_N|\alpha) = \int p(z_1, ..., z_N|\pi) p(\pi|\alpha) d\pi = \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(N_k + \alpha/K)}{\Gamma(\alpha/K)}.
- The integrand is again of Dirichlet form (this is the reason for using conjugate priors): the completed Dirichlet form integrates to 1.
Slide adapted from Zoubin Ghahramani.

Mixture Models with Dirichlet Priors
- Conditional probabilities: let's examine the conditional of z_n given all other variables,
  p(z_n = k | z_{-n}, \alpha) = \frac{N_{-n,k} + \alpha/K}{N - 1 + \alpha},
  where z_{-n} denotes all indices except n and N_{-n,k} is the number of other points currently assigned to component k.

Finite Dirichlet Mixture Models
- This is a very interesting result. Why?
  - We directly get a numerical probability, not a distribution.
  - The probability of joining a cluster mainly depends on the number of existing entries in that cluster: the more populous a class is, the more likely it is to be joined!
  - In addition, there is a base probability of joining an as-yet empty cluster.
- This result can be used directly in Gibbs sampling. (A sketch follows this section.)
Slide adapted from Zoubin Ghahramani.
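A sketch of how the collapsed conditional above would serve as the prior term in one Gibbs-sampling sweep over the assignments. The likelihood term (the probability of x_n under cluster k with its parameters integrated out), which a full collapsed Gibbs sampler would multiply in, is omitted here; the counts, hyperparameters, and the helper name prior_conditional are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

N, K, alpha = 50, 4, 2.0
z = rng.integers(0, K, size=N)        # assumed current assignments z_1 .. z_N

def prior_conditional(z, n, K, alpha):
    # p(z_n = k | z_{-n}, alpha) = (N_{-n,k} + alpha/K) / (N - 1 + alpha)
    counts = np.bincount(np.delete(z, n), minlength=K)   # N_{-n,k}
    return (counts + alpha / K) / (len(z) - 1 + alpha)

# One Gibbs sweep using the prior term only (a real sampler would also
# multiply each probability by the collapsed likelihood of x_n under cluster k).
for n in range(N):
    p = prior_conditional(z, n, K, alpha)
    z[n] = rng.choice(K, p=p)

print("cluster sizes after one sweep:", np.bincount(z, minlength=K))
```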

Infinite Dirichlet Mixture Models
- Conditional probabilities, finite K: p(z_n = k | z_{-n}, \alpha) = \frac{N_{-n,k} + \alpha/K}{N - 1 + \alpha}.
- Conditional probabilities, infinite K: taking the limit K \to \infty yields the conditionals
  p(z_n = k | z_{-n}, \alpha) = \frac{N_{-n,k}}{N - 1 + \alpha} if component k is represented (i.e., other points are assigned to it),
  p(z_n = \text{a new component} | z_{-n}, \alpha) = \frac{\alpha}{N - 1 + \alpha} for all unrepresented components combined: the left-over mass covers the countably infinite number of remaining indicator settings.
Slide adapted from Zoubin Ghahramani.

Discussion: Infinite Mixture Models
- What we have just seen is a first example of a Dirichlet Process. DPs allow us to work with models that have an infinite number of components.
- This raises a number of issues: How to represent infinitely many parameters? How to deal with permutations of the class labels? How to control the effective size of the model? How to perform efficient inference?
- More background is needed here. DPs are a very interesting class of models, but they would take us too far afield in this lecture. If you are interested in learning more about them, take a look at the Advanced ML slides from Winter 2012.

Next Lecture: Deep Learning

References and Further Reading
- More information about EM estimation is available in Chapter 9 of Bishop's book (recommended reading):
  Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- Original EM paper: A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, Vol. 39, 1977.
- EM tutorial: J.A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, ICSI, U.C. Berkeley, CA, USA.