Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9

Outline K-Means The Expectation Maximization Algorithm EM Example: Gaussian Mixture Models

Learning Parameters to Probability Distributions Assignment 3: given fully observed training data, setting parameters θ_i for Bayes nets is straightforward. However, in many settings not all variables are observed (labelled) in the training data: x_i = (x_i, h_i). e.g. Speech recognition: have speech signals, but not phoneme labels. e.g. Object recognition: have object labels (car, bicycle), but not part labels (wheel, door, seat). Unobserved variables are called latent variables. [Figure: output of a feature detector on two typical motorbike images; figs from Fergus et al.]

Latent Variable Models: Pros Statistically powerful, often good predictions. Many applications: Learning with missing data. Clustering: missing cluster label for data points. Principal Component Analysis: data points are generated in a linear fashion from a small set of unobserved components. (more later) Matrix Factorization, Recommender Systems: Assign users to unobserved user types, assign items to unobserved item types. Use similarity between user type and item type to predict the preference of a user for an item. Winner of the $1M Netflix challenge. If latent variables have an intuitive interpretation (e.g., action movies, factors), the model discovers new features.

Latent Variable Models: Cons From a user's point of view, like a black box if the latent variables don't have an intuitive interpretation. Statistically, hard to guarantee convergence to a correct model with more data (the identifiability problem). Computationally harder: usually no closed form for maximum likelihood estimates. However, the Expectation-Maximization algorithm provides a general-purpose local search algorithm for learning parameters in probabilistic models with latent variables.

Outline K-Means The Expectation Maximization Algorithm EM Example: Gaussian Mixture Models

Unsupervised Learning We will start with an unsupervised learning (clustering) problem: given a dataset {x_1, ..., x_N}, each x_i ∈ R^D, partition the dataset into K clusters. Intuitively, a cluster is a group of points that are close together and far from the others.

Distortion Measure Formally, introduce prototypes (or cluster centers) µ_k ∈ R^D. Use binary indicator variables r_nk, equal to 1 if point n is in cluster k and 0 otherwise (the 1-of-K coding scheme again). Find {µ_k}, {r_nk} to minimize the distortion measure J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_nk ||x_n − µ_k||^2
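To make the objective concrete, here is a minimal NumPy sketch (not from the lecture) that evaluates J for a given set of points, one-hot assignments, and centers; the names X, R, mu are illustrative.

```python
import numpy as np

def distortion(X, R, mu):
    """Distortion J = sum_n sum_k r_nk ||x_n - mu_k||^2.

    X  : (N, D) data points
    R  : (N, K) one-hot cluster assignments r_nk
    mu : (K, D) cluster centers
    """
    # squared distance from every point to every center, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((R * sq_dists).sum())
```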

Minimizing Distortion Measure Minimizing J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_nk ||x_n − µ_k||^2 directly is hard. However, two things are easy: if we know µ_k, minimizing J wrt r_nk; if we know r_nk, minimizing J wrt µ_k. This suggests an iterative procedure. Start with an initial guess for µ_k, then iterate two steps: minimize J wrt r_nk; minimize J wrt µ_k. Rinse and repeat until convergence.

Determining Membership Variables Step 1 in an iteration of K-means is to minimize the distortion measure J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_nk ||x_n − µ_k||^2 wrt the cluster membership variables r_nk. Terms for different data points x_n are independent, so for each data point set r_nk to minimize \sum_{k=1}^{K} r_nk ||x_n − µ_k||^2. Simply set r_nk = 1 for the cluster center µ_k with the smallest distance.

Determining Cluster Centers Step 2: fix r_nk, minimize J wrt the cluster centers µ_k. Switching the order of the sums, J = \sum_{k=1}^{K} \sum_{n=1}^{N} r_nk ||x_n − µ_k||^2, so we can minimize wrt each µ_k separately. Take the derivative and set it to zero: 2 \sum_{n=1}^{N} r_nk (x_n − µ_k) = 0, giving µ_k = \sum_n r_nk x_n / \sum_n r_nk, i.e. the mean of the datapoints x_n assigned to cluster k.

K-means Algorithm Start with an initial guess for µ_k. Iterate two steps: minimize J wrt r_nk (assign points to the nearest cluster center); minimize J wrt µ_k (set each cluster center to the average of the points in its cluster). Rinse and repeat until convergence.
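The two-step procedure above can be written down directly. Below is a minimal NumPy sketch under the assumptions stated in the comments (random data points as initial centers, empty clusters keep their old center); the function and variable names are illustrative, not the lecture's own code.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: alternately minimize J wrt assignments and centers."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # initial guess for mu_k
    assign = np.full(N, -1)
    for it in range(n_iters):
        # Step 1: minimize J wrt r_nk -- assign each point to its nearest center
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        new_assign = sq_dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                   # memberships unchanged: converged
        assign = new_assign
        # Step 2: minimize J wrt mu_k -- each center becomes the mean of its points
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:                    # keep old center if a cluster is empty
                mu[k] = members.mean(axis=0)
    return mu, assign
```

For example, mu, labels = kmeans(np.random.randn(500, 2), K=2) clusters 500 random 2-D points into two groups.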

K-means example Panels (a)-(i) show the algorithm running on a 2-D dataset, alternating the membership-assignment and center-update steps over successive iterations. At the last panel the next step doesn't change any memberships, so we stop.

K-means Convergence Repeat steps until no change in cluster assignments. For each step, the value of J either goes down, or we stop. There is a finite number of possible assignments of data points to clusters, so we are guaranteed to converge eventually. Note it may be a local minimum rather than a global minimum to which we converge.

K-means Example - Image Segmentation Original image; K-means clustering on pixel colour values; pixels in a cluster are coloured by the cluster mean. Represent each pixel (e.g. a 24-bit colour value) by a cluster number (e.g. 4 bits for K = 10), a compressed version. This technique is known as vector quantization: represent a vector (in this case from RGB, R^3) as a single discrete value.
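As an illustration of vector quantization, the sketch below (assuming the kmeans function from the earlier sketch and a synthetic stand-in for a real image) clusters pixel colours with K = 16, so each pixel can be stored as a 4-bit cluster index and reconstructed from the cluster means.

```python
import numpy as np

# Vector quantization sketch; assumes kmeans() from the earlier sketch.
H, W = 64, 64
image = np.random.rand(H, W, 3)               # stand-in for a real RGB image
pixels = image.reshape(-1, 3)                 # one R^3 colour vector per pixel

K = 16                                        # 16 clusters fit in 4 bits per pixel
mu, labels = kmeans(pixels, K)

compressed = labels.reshape(H, W)             # 4-bit cluster index per pixel
reconstructed = mu[labels].reshape(H, W, 3)   # each pixel coloured by its cluster mean
```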

K-means Generalized: the set-up Let's generalize the idea. Suppose we have the following set-up. X denotes all observed variables (e.g., data points). Z denotes all latent (hidden, unobserved) variables (e.g., cluster means). J(X, Z | θ) measures the goodness of an assignment of latent variables given the data points and parameters θ. e.g., J = negative distortion measure, parameters = assignment of points to clusters. It's easy to maximize J(X, Z | θ) wrt θ for fixed Z. It's easy to maximize J(X, Z | θ) wrt Z for fixed θ.

K-means Generalized: The Algorithm The fact that conditional maximization is simple suggests an iterative algorithm. 1. Guess an initial value for latent variables Z. 2. Repeat until convergence: 2.1 Find best parameter values θ given the current guess for the latent variables. Update the parameter values. 2.2 Find best value for latent variables Z given the current parameter values. Update the latent variable values.

Outline K-Means The Expectation Maximization Algorithm EM Example: Gaussian Mixture Models

EM Algorithm: The set-up We assume a probabilistic model, specifically the complete-data likelihood function p(X, Z | θ). Goodness of the model is the log-likelihood ln p(X, Z | θ). Key difference: instead of guessing a single best value for the latent variables given the current parameter settings, we use the conditional distribution p(Z | X, θ_old) over latent variables. Given a latent variable distribution, parameter values θ are evaluated by taking the expected goodness ln p(X, Z | θ) over all possible latent variable settings.

EM Algorithm: The procedure 1. Guess an initial parameter setting θ_old. 2. Repeat until convergence: 2.1 The E-step: evaluate p(Z | X, θ_old). (Ideally, find a closed form as a function of Z.) 2.2 The M-step: evaluate the function Q(θ, θ_old) = \sum_Z p(Z | X, θ_old) ln p(X, Z | θ), then maximize Q(θ, θ_old) wrt θ and update θ_old. This procedure is guaranteed to increase the data log-likelihood ln p(X | θ) = ln \sum_Z p(X, Z | θ) at each step, and therefore converges to a local log-likelihood maximum. More theoretical analysis in the text.
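The procedure can be summarized as a generic loop. The sketch below is only a skeleton: e_step and m_step are hypothetical model-specific callbacks (made concrete for Gaussian mixtures later), and the log-likelihood convergence test is one common choice, not part of the lecture's description.

```python
def em(x, theta_init, e_step, m_step, n_iters=100, tol=1e-6):
    """Generic EM skeleton with hypothetical model-specific helpers.

    e_step(x, theta) -> (posterior over Z, data log-likelihood ln p(x | theta))
    m_step(x, posterior) -> theta maximizing Q(theta, theta_old)
    """
    theta = theta_init
    prev_ll = -float("inf")
    for _ in range(n_iters):
        posterior, log_lik = e_step(x, theta)   # E-step: evaluate p(Z | X, theta_old)
        theta = m_step(x, posterior)            # M-step: maximize Q(theta, theta_old)
        if log_lik - prev_ll < tol:             # log-likelihood is non-decreasing
            break
        prev_ll = log_lik
    return theta
```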

Outline K-Means The Expectation Maximization Algorithm EM Example: Gaussian Mixture Models

Hard Assignment vs. Soft Assignment In the K-means algorithm, a hard assignment of points to clusters is made. However, for points near the decision boundary, this may not be such a good idea. Instead, we could think about making a soft assignment of points to clusters.

Gaussian Mixture Model The Gaussian mixture model (or mixture of Gaussians, MoG) models the data as a combination of Gaussians. Figure panels: (a) constant density contours, (b) marginal probability p(x), (c) surface plot. Widely used general approximation for multi-modal distributions.

Gaussian Mixture Model The figure shows a dataset generated by drawing samples from three different Gaussians.

Generative Model The mixture of Gaussians is a generative model. To generate a datapoint x_n, we first generate a value for a discrete variable z_n ∈ {1, ..., K}. We then generate a value x_n ~ N(x | µ_k, Σ_k) from the corresponding Gaussian.
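A small sketch of this generative process; the parameters pi, mus, Sigmas below are arbitrary illustrative values, not the lecture's. Draw the component index z_n from the mixing coefficients, then draw x_n from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D mixture with K = 3 components (parameters chosen arbitrarily)
pi = np.array([0.5, 0.3, 0.2])                        # mixing coefficients, sum to 1
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]]) # component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])  # component covariances

N = 500
z = rng.choice(len(pi), size=N, p=pi)                 # draw the latent component z_n
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x_n ~ N(mu_k, Sigma_k)
```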

Graphical Model Full graphical model using plate notation: mixing coefficients π, latent variable z_n, observation x_n, parameters µ, Σ, with the plate over the N data points. Note z_n is a latent variable, unobserved. Need to give conditional distributions p(z_n) and p(x_n | z_n). The one-of-K representation is helpful here: z_nk ∈ {0, 1}, z_n = (z_n1, ..., z_nK).

Graphical Model - Latent Component Variable Use a Bernoulli distribution for p(z_n), i.e. p(z_nk = 1) = π_k. Parameters to this distribution: {π_k}. Must have 0 ≤ π_k ≤ 1 and \sum_{k=1}^{K} π_k = 1. p(z_n) = \prod_{k=1}^{K} π_k^{z_nk}

Graphical Model - Observed Variable Use a Gaussian distribution for p(x_n | z_n). Parameters to this distribution: {µ_k, Σ_k}. p(x_n | z_nk = 1) = N(x_n | µ_k, Σ_k), so p(x_n | z_n) = \prod_{k=1}^{K} N(x_n | µ_k, Σ_k)^{z_nk}

Graphical Model - Joint distribution The full joint distribution is given by: p(X, Z) = \prod_{n=1}^{N} p(z_n) p(x_n | z_n) = \prod_{n=1}^{N} \prod_{k=1}^{K} π_k^{z_nk} N(x_n | µ_k, Σ_k)^{z_nk}

MoG Marginal over Observed Variables The marginal distribution p(x_n) for this model is: p(x_n) = \sum_{z_n} p(x_n, z_n) = \sum_{z_n} p(z_n) p(x_n | z_n) = \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k), a mixture of Gaussians.

MoG Conditional over Latent Variable To apply EM, we need the conditional distribution p(z_nk = 1 | x_n, θ), where θ are the model parameters. It is denoted by γ(z_nk) and can be computed as: γ(z_nk) ≡ p(z_nk = 1 | x_n) = p(z_nk = 1) p(x_n | z_nk = 1) / \sum_{j=1}^{K} p(z_nj = 1) p(x_n | z_nj = 1) = π_k N(x_n | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j). γ(z_nk) is the responsibility of component k for datapoint n.
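A minimal sketch of this computation, assuming SciPy's multivariate normal density is available; the helper name responsibilities and its arguments are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    N, K = X.shape[0], len(pi)
    weighted = np.zeros((N, K))
    for k in range(K):
        # numerator pi_k N(x_n | mu_k, Sigma_k) for every data point
        weighted[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalize over components
```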

EM for Gaussian Mixtures: E-step The complete-data log-likelihood is ln p(X, Z | θ) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_nk [ln π_k + ln N(x_n | µ_k, Σ_k)]. E step: calculate responsibilities using the current parameters θ_old: γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j). The z_n vectors assigning data point n to components are independent of each other. Therefore under the posterior distribution p(z_nk = 1 | x_n, θ) the expected value of z_nk is γ(z_nk).

EM for Gaussian Mixtures: M-step M step: Because the component assignments are independent across data points, the expectation of the complete-data log-likelihood is obtained by replacing each z_nk with its expected value γ(z_nk). So Q(θ, θ_old) = \sum_{n=1}^{N} \sum_{k=1}^{K} γ(z_nk) [ln π_k + ln N(x_n | µ_k, Σ_k)]. Maximizing Q(θ, θ_old) with respect to the model parameters is more or less straightforward.

EM for Gaussian Mixtures II Initialize parameters, then iterate: E step: calculate responsibilities using the current parameters, γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j). M step: re-estimate parameters using these γ(z_nk): N_k = \sum_{n=1}^{N} γ(z_nk), µ_k = (1 / N_k) \sum_{n=1}^{N} γ(z_nk) x_n, Σ_k = (1 / N_k) \sum_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T, π_k = N_k / N. Think of N_k as the effective number of points in component k.
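Putting the E and M steps together gives the sketch below. It assumes the responsibilities helper from the previous sketch, uses a simple initialization (uniform mixing coefficients, random data points as means, the data covariance for every component), and omits the numerical safeguards a real implementation would need (e.g. regularizing near-singular covariances).

```python
import numpy as np

def gmm_em(X, K, n_iters=100, seed=0):
    """Minimal EM-for-MoG sketch following the updates above (no safeguards)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                                  # uniform mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.stack([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk), using the helper defined earlier
        gamma = responsibilities(X, pi, mus, Sigmas)
        # M step: re-estimate parameters
        Nk = gamma.sum(axis=0)                                # effective number of points
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mus, Sigmas
```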

MoG EM - Example Same initialization as with K-means before. Often, K-means is actually used to initialize EM.

MoG EM - Example Calculate responsibilities γ(z_nk).

MoG EM - Example Calculate model parameters {π_k, µ_k, Σ_k} using these responsibilities.

MoG EM - Example Iteration 2.

MoG EM - Example Iteration 5.

MoG EM - Example Iteration 20 - converged.

EM - Summary EM finds a local maximum of the likelihood p(X | θ) = \sum_Z p(X, Z | θ). It iterates two steps: the E step fills in the missing variables Z (calculates their distribution); the M step maximizes the expected complete log-likelihood (expectation wrt the E-step distribution).

Conclusion Readings: Ch. 9.1, 9.2, 9.4 K-means clustering Gaussian mixture model What about K? Model selection: either cross-validation or Bayesian version (average over all values for K) Expectation-maximization, a general method for learning parameters of models when not all variables are observed