Lecture 3: Latent Variable Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July 25, 2006. Machine Learning Summer School, Taiwan.

Latent Variable Models. What to do when a variable z is always unobserved? Depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out, e.g. given y, x fit the model p(z, y|x) = p(z|y) p(y|x, w) p(w). (In other words, if it is a leaf node.) But if z is conditioned on, we need to model it, e.g. given y, x fit the model p(y|x) = Σ_z p(y|x, z) p(z).

Where Do Latent Variables Come From? Latent variables may appear naturally, from the structure of the problem, because something wasn't measured, because of faulty sensors, occlusion, privacy, etc. But also, we may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g. mixtures). (Figure: plate models (a) with a latent node Z_n above X_n and (b) with X_n alone.)

Clustering vs. Classification; Latent Factor Models vs. Regression. You can think of clustering as the problem of classification with missing class labels. You can think of factor models (such as factor analysis, PCA, ICA, etc.) as linear or nonlinear regression with missing inputs.

Why is Learning Harder? In fully observed iid settings, the probability model is a product, thus the log likelihood is a sum where terms decouple. (At least for directed models.)
l(θ; D) = log p(x, z|θ) = log p(z|θ_z) + log p(x|z, θ_x)
With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together via log Σ(·):
l(θ; D) = log Σ_z p(x, z|θ) = log Σ_z p(z|θ_z) p(x|z, θ_x)
(Just as with the partition function in undirected models.)

Learning with Latent Variables. The likelihood l(θ; D) = log Σ_z p(z|θ_z) p(x|z, θ_x) couples the parameters. (Figure: graphical models (a) and (b) showing how the latent node Z couples X1, X2, X3.) We can treat this as a black box probability function and just try to optimize the likelihood as a function of θ (e.g. gradient descent). However, sometimes taking advantage of the latent variable structure can make parameter estimation easier. Good news: soon we will see the EM algorithm, which allows us to treat learning with latent variables using fully observed tools. Basic trick: guess the values you don't know. Basic math: use convexity to lower bound the likelihood.

Mixture Models (1 discrete latent variable). The most basic latent variable model, with a single discrete node z. Allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space. Divide and conquer idea: use simple parts to build complex models (e.g. multimodal densities, or piecewise-linear regressions). (Figures: (a), (b) graphical models and example data/densities for mixture models.)

Mixture Densities. Exactly like a classification model, but the class is unobserved and so we sum it out. What we get is a perfectly valid density:
p(x|θ) = Σ_{k=1}^K p(z = k|θ_z) p(x|z = k, θ_k) = Σ_k α_k p_k(x|θ_k)
where the mixing proportions add to one: Σ_k α_k = 1. We can use Bayes' rule to compute the posterior probability of the mixture component given some data:
p(z = k|x, θ) = α_k p_k(x|θ_k) / Σ_j α_j p_j(x|θ_j)
These quantities are sometimes called the responsibilities.
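
To make the responsibility computation concrete, here is a minimal numpy sketch (not from the lecture); the function and variable names are illustrative, and the per-component density values are assumed to be given.

```python
import numpy as np

def mixture_density_and_responsibilities(alpha, component_densities):
    """alpha: (K,) mixing proportions summing to 1.
    component_densities: (K,) values p_k(x | theta_k) at a single point x.
    Returns the mixture density p(x | theta) and the responsibilities p(z = k | x, theta)."""
    weighted = alpha * component_densities   # alpha_k * p_k(x | theta_k)
    p_x = weighted.sum()                     # p(x | theta) = sum_k alpha_k p_k(x | theta_k)
    responsibilities = weighted / p_x        # Bayes' rule
    return p_x, responsibilities

# Example with 3 components evaluated at some point x (hypothetical density values).
alpha = np.array([0.5, 0.3, 0.2])
p_k = np.array([0.05, 0.40, 0.10])
p_x, r = mixture_density_and_responsibilities(alpha, p_k)
print(p_x, r, r.sum())                       # responsibilities sum to 1
```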

Clustering Example: Gaussian Mixture Models. Consider a mixture of K Gaussian components:
p(x|θ) = Σ_k α_k N(x|µ_k, Σ_k)
p(z = k|x, θ) = α_k N(x|µ_k, Σ_k) / Σ_j α_j N(x|µ_j, Σ_j)
l(θ; D) = Σ_n log Σ_k α_k N(x_n|µ_k, Σ_k)
Density model: p(x|θ) is a familiarity signal. Clustering: p(z|x, θ) is the assignment rule, l(θ) is the cost.
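
A sketch (not part of the original slides) of evaluating the GMM log likelihood l(θ; D) with a log-sum-exp over components; the helper names below are made up for illustration.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log of N(x | mu, Sigma) for each row of X (shape (N, d))."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, diff.T)                  # L^{-1} (x - mu)
    maha = np.sum(sol ** 2, axis=0)                   # (x - mu)^T Sigma^{-1} (x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def gmm_log_likelihood(X, alphas, mus, Sigmas):
    """l(theta; D) = sum_n log sum_k alpha_k N(x_n | mu_k, Sigma_k)."""
    logps = np.stack([np.log(a) + log_gaussian(X, m, S)
                      for a, m, S in zip(alphas, mus, Sigmas)])   # (K, N)
    m = logps.max(axis=0)
    return np.sum(m + np.log(np.exp(logps - m).sum(axis=0)))      # log-sum-exp per point
```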

Example: Mixture of Linear Regression Experts. Each expert generates data according to a linear function of the input plus additive Gaussian noise:
p(y|x, θ) = Σ_k α_k(x) N(y|β_k^T x, σ_k^2)
The gating function can be a softmax classification machine:
α_k(x) = p(z = k|x) = e^{η_k^T x} / Σ_j e^{η_j^T x}
Remember: we are not modeling the density of the inputs x.
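
As a rough illustration (assumed parameter shapes, not from the slides), the gating network and experts can be combined as below; the predictive mean is the gate-weighted average of the expert predictions.

```python
import numpy as np

def softmax_gates(x, etas):
    """alpha_k(x) = exp(eta_k . x) / sum_j exp(eta_j . x); etas has shape (K, d)."""
    scores = etas @ x
    scores -= scores.max()            # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def expert_mixture_mean(x, etas, betas):
    """Predictive mean of the mixture of linear regression experts:
    E[y | x] = sum_k alpha_k(x) * (beta_k . x), with betas of shape (K, d)."""
    return softmax_gates(x, etas) @ (betas @ x)
```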

Gradient Learning with Mixtures. We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting:
l(θ) = log p(x|θ) = log Σ_k α_k p_k(x|θ_k)
∂l/∂θ = (1/p(x|θ)) Σ_k α_k ∂p_k(x|θ_k)/∂θ
= Σ_k (α_k/p(x|θ)) p_k(x|θ_k) ∂log p_k(x|θ_k)/∂θ
= Σ_k α_k (p_k(x|θ_k)/p(x|θ)) ∂l_k/∂θ_k
= Σ_k α_k r_k ∂l_k/∂θ_k
In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients (α_k r_k = α_k p_k(x|θ_k)/p(x|θ) is the responsibility of component k).
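
One way to sanity-check this identity is a finite-difference comparison for the component means of a 1-D mixture; the snippet below is an illustrative sketch, not part of the original slides.

```python
import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mix_loglik(x, alphas, mus, var):
    return np.log(sum(a * norm_pdf(x, m, var) for a, m in zip(alphas, mus)))

x, var = 0.7, 1.0
alphas, mus = np.array([0.4, 0.6]), np.array([-1.0, 2.0])

# Responsibility-weighted gradient: dl/dmu_k = r_k * dlog p_k/dmu_k, r_k = alpha_k p_k / p
p_k = norm_pdf(x, mus, var)
r = alphas * p_k / np.sum(alphas * p_k)
analytic = r * (x - mus) / var      # dlog N(x | mu, var)/dmu = (x - mu)/var

# Finite-difference check of dl/dmu_k
eps = 1e-6
numeric = np.array([(mix_loglik(x, alphas, mus + eps * np.eye(2)[k], var)
                     - mix_loglik(x, alphas, mus, var)) / eps for k in range(2)])
print(analytic, numeric)            # should agree to roughly 1e-5
```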

Recap: Learning with Latent Variables. With latent variables, the probability contains a sum, so the log likelihood has all parameters coupled together:
l(θ; D) = log Σ_z p(x, z|θ) = log Σ_z p(z|θ_z) p(x|z, θ_x)
(we can also consider continuous z and replace Σ with ∫). If the latent variables were observed, parameters would decouple again and learning would be easy:
l(θ; D) = log p(x, z|θ) = log p(z|θ_z) + log p(x|z, θ_x)
One idea: ignore this fact, compute ∂l/∂θ, and do learning with a smart optimizer like conjugate gradient. Another idea: what if we use our current parameters to guess the values of the latent variables, and then do fully-observed learning? This back-and-forth trick might make optimization easier.

Expectation-Maximization (EM) Algorithm. Iterative algorithm with two linked steps:
E-step: fill in values of ẑ^t using p(z|x, θ^t).
M-step: update parameters using θ^{t+1} ← argmax_θ l(θ; x, ẑ^t).
The E-step involves inference, which we need to do at runtime anyway. The M-step is no harder than in the fully observed case. We will prove that this procedure monotonically improves l (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood (as any optimizer should). Note: EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data. EM is not a cost function such as maximum likelihood. EM is not a model such as mixture-of-Gaussians.
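
The two linked steps can be written as a generic loop. In the sketch below, e_step, m_step, and log_likelihood are hypothetical placeholders for whatever inference and fully observed fitting a particular model needs.

```python
def expectation_maximization(x, theta0, e_step, m_step, log_likelihood,
                             max_iters=100, tol=1e-6):
    """Generic EM loop.
    e_step(x, theta)         -> posterior over latent variables, p(z | x, theta)
    m_step(x, posterior)     -> argmax_theta of the expected complete log likelihood
    log_likelihood(x, theta) -> l(theta; x), used only to monitor convergence."""
    theta, prev_ll = theta0, -float("inf")
    for _ in range(max_iters):
        posterior = e_step(x, theta)      # E-step: fill in the latent variables
        theta = m_step(x, posterior)      # M-step: fully observed learning
        ll = log_likelihood(x, theta)
        if ll - prev_ll < tol:            # l never decreases, so this stops eventually
            break
        prev_ll = ll
    return theta
```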

Expected Complete Log Likelihood. For any distribution q(z), define the expected complete log likelihood:
l_q(θ; x) = ⟨l_c(θ; x, z)⟩_q = Σ_z q(z|x) log p(x, z|θ)
Amazing fact: l(θ) ≥ l_q(θ) + H(q), because of concavity of log:
l(θ; x) = log p(x|θ) = log Σ_z p(x, z|θ)
= log Σ_z q(z|x) [p(x, z|θ) / q(z|x)]
≥ Σ_z q(z|x) log [p(x, z|θ) / q(z|x)]
where the inequality is called Jensen's inequality. (It is only true for distributions: Σ_z q(z) = 1; q(z) ≥ 0.)

Lower Bounds and Free Energy. For fixed data x, define a functional called the free energy:
F(q, θ) ≡ Σ_z q(z|x) log [p(x, z|θ) / q(z|x)] ≤ l(θ)
The EM algorithm is coordinate ascent on F:
E-step: q^{t+1} = argmax_q F(q, θ^t)
M-step: θ^{t+1} = argmax_θ F(q^{t+1}, θ)
EM constructs sequential convex lower bounds. (Figure: the likelihood l(θ) and the lower bound F(θ, q^{t+1}), which touches the likelihood at θ^t.)

M-step: Maximization of the Expected l_c. Note that the free energy breaks into two terms:
F(q, θ) = Σ_z q(z|x) log p(x, z|θ) − Σ_z q(z|x) log q(z|x) = l_q(θ; x) + H(q)
(this is where its name comes from). The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy. Thus, in the M-step, maximizing with respect to θ for fixed q we only need to consider the first term:
θ^{t+1} = argmax_θ l_q(θ; x) = argmax_θ Σ_z q(z|x) log p(x, z|θ)
Usually optimizing l_c(θ) given both z and x is straightforward (e.g. class-conditional Gaussian fitting, linear regression).

E-step: Inferring the Latent Posterior. Claim: the optimum setting of q in the E-step is q^{t+1} = p(z|x, θ^t). This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification). Proof (easy): this setting saturates the bound l(θ; x) ≥ F(q, θ):
F(p(z|x, θ^t), θ^t) = Σ_z p(z|x, θ^t) log [p(x, z|θ^t) / p(z|x, θ^t)]
= Σ_z p(z|x, θ^t) log p(x|θ^t)
= log p(x|θ^t) Σ_z p(z|x, θ^t) = l(θ^t; x) · 1
Can also show this result using variational calculus or the fact that l(θ) − F(q, θ) = KL[q ∥ p(z|x, θ)].

Example: Mixtures of Gaussians. Recall: a mixture of K Gaussians:
p(x|θ) = Σ_k α_k N(x|µ_k, Σ_k)
l(θ; D) = Σ_n log Σ_k α_k N(x_n|µ_k, Σ_k)
Learning with the EM algorithm:
E-step: p^t_{kn} = N(x_n|µ^t_k, Σ^t_k), q^{t+1}_{kn} = p(z = k|x_n, θ^t) = α^t_k p^t_{kn} / Σ_j α^t_j p^t_{jn}
M-step: µ^{t+1}_k = Σ_n q^{t+1}_{kn} x_n / Σ_n q^{t+1}_{kn}
Σ^{t+1}_k = Σ_n q^{t+1}_{kn} (x_n − µ^{t+1}_k)(x_n − µ^{t+1}_k)^T / Σ_n q^{t+1}_{kn}
α^{t+1}_k = (1/M) Σ_n q^{t+1}_{kn}   (M = number of data points)
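
A compact numpy sketch of these updates (illustrative only; no numerical safeguards such as covariance regularization or log-domain arithmetic):

```python
import numpy as np

def em_step_gmm(X, alphas, mus, Sigmas):
    """One EM iteration for a mixture of Gaussians.
    X: (M, d) data; alphas: (K,); mus: (K, d); Sigmas: (K, d, d)."""
    M, d = X.shape
    K = len(alphas)

    # E-step: q_{kn} = alpha_k N(x_n | mu_k, Sigma_k) / sum_j alpha_j N(x_n | mu_j, Sigma_j)
    q = np.zeros((K, M))
    for k in range(K):
        diff = X - mus[k]
        inv = np.linalg.inv(Sigmas[k])
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[k]))
        q[k] = alphas[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
    q /= q.sum(axis=0, keepdims=True)

    # M-step: responsibility-weighted means, covariances, and mixing proportions
    Nk = q.sum(axis=1)                          # effective counts per component
    new_mus = (q @ X) / Nk[:, None]
    new_Sigmas = np.zeros_like(Sigmas)
    for k in range(K):
        diff = X - new_mus[k]
        new_Sigmas[k] = (q[k][:, None] * diff).T @ diff / Nk[k]
    new_alphas = Nk / M
    return new_alphas, new_mus, new_Sigmas
```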

EM for MOG. (Figure: panels (a)-(i) showing a mixture of Gaussians fit by EM at successive stages, L = 1, 4, 6, 8, 10, 12.)

Compare: K-means. The EM algorithm for mixtures of Gaussians is just like a soft version of the K-means algorithm. In the K-means E-step we do hard assignment:
c^{t+1}_n = argmin_k (x_n − µ^t_k)^T Σ_k^{-1} (x_n − µ^t_k)
In the K-means M-step we update the means as the weighted sum of the data, but now the weights are 0 or 1:
µ^{t+1}_k = Σ_n [c^{t+1}_n = k] x_n / Σ_n [c^{t+1}_n = k]
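
For comparison, a minimal sketch of the hard K-means updates, here assuming Σ_k = I so the E-step reduces to plain Euclidean distance (not part of the slides):

```python
import numpy as np

def kmeans_step(X, mus):
    """One K-means iteration: hard assignments, then mean updates.
    X: (N, d) data; mus: (K, d) current means."""
    # E-step: c_n = argmin_k ||x_n - mu_k||^2  (hard assignment)
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    c = dists.argmin(axis=1)

    # M-step: each mean is the average of the points assigned to it
    new_mus = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mus[k]
                        for k in range(len(mus))])
    return c, new_mus
```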

Continuous Latent Variables. In many models there are some underlying causes of the data. Mixture models use a discrete class variable: clustering. Sometimes, it is more appropriate to think in terms of continuous factors which control the data we observe. Geometrically, this is equivalent to thinking of a data manifold or subspace. (Figure: a plane spanned by λ_1, λ_2 through µ inside the space y_1, y_2, y_3.) To generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Factor Analysis. When we assume that the subspace is linear and that the underlying latent variable has a Gaussian distribution, we get a model known as factor analysis: data y (p-dim); latent variable x (k-dim):
p(x) = N(x|0, I)
p(y|x, θ) = N(y|µ + Λx, Ψ)
where µ is the mean vector, Λ is the p-by-k factor loading matrix, and Ψ is the sensor noise covariance (usually diagonal). Important: since the product of Gaussians is still Gaussian, the joint distribution p(x, y), the other marginal p(y), and the conditional p(x|y) are also Gaussian.
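
A small sketch of the generative process (assumed shapes; not from the slides): draw x from a unit Gaussian, map it through Λ, and add diagonal noise.

```python
import numpy as np

def sample_factor_analysis(mu, Lam, Psi, N, seed=0):
    """Draw N samples from the factor analysis model:
    x ~ N(0, I_k),  y | x ~ N(mu + Lam x, Psi)  with diagonal Psi.
    mu: (p,); Lam: (p, k); Psi: (p,) diagonal entries of the noise covariance."""
    rng = np.random.default_rng(seed)
    k = Lam.shape[1]
    X = rng.standard_normal((N, k))                       # latent factors
    noise = rng.standard_normal((N, len(mu))) * np.sqrt(Psi)
    Y = mu + X @ Lam.T + noise                            # observed data
    return X, Y
```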

FA = Constrained-Covariance Gaussian. Marginal density for factor analysis (y is p-dim, x is k-dim):
p(y|θ) = N(y|µ, ΛΛ^T + Ψ)   (integrate out x)
So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix: Cov[y] = ΛΛ^T + Ψ. In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.) Learning: how should we fit the ML parameters? It is easy to find µ: just take the mean of the data. From now on assume we have done this and re-centred y. What about the other parameters?
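
A quick numerical check of this constrained-covariance view (illustrative sketch): the sample covariance of data generated from the FA model should approach ΛΛ^T + Ψ.

```python
import numpy as np

p, k, N = 5, 2, 200000
rng = np.random.default_rng(1)
Lam = rng.standard_normal((p, k))
Psi = rng.uniform(0.1, 0.5, size=p)                       # diagonal sensor noise

X = rng.standard_normal((N, k))                           # latent factors x ~ N(0, I)
Y = X @ Lam.T + rng.standard_normal((N, p)) * np.sqrt(Psi)

S = np.cov(Y, rowvar=False)                               # sample covariance of y
print(np.max(np.abs(S - (Lam @ Lam.T + np.diag(Psi)))))   # small for large N
```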

EM for Factor Analysis. We will do maximum likelihood learning using the EM algorithm.
E-step: q^{t+1}_n = p(x_n|y_n, θ^t)
M-step: θ^{t+1} = argmax_θ Σ_n ∫ q^{t+1}_n(x_n|y_n) log p(y_n, x_n|θ) dx_n
For the E-step we need the conditional distribution (inference); for the M-step we need the expected log of the complete data:
E-step: q^{t+1}_n = p(x_n|y_n, θ^t) = N(x_n|m_n, V_n)
M-step: Λ^{t+1} = argmax_Λ Σ_n ⟨l_c(x_n, y_n)⟩_{q^{t+1}_n}, Ψ^{t+1} = argmax_Ψ Σ_n ⟨l_c(x_n, y_n)⟩_{q^{t+1}_n}
Inferring the posterior mean is just a linear operation, and the posterior covariance does not depend on the observed data:
m_n = β(y_n − µ), V = (I + Λ^T Ψ^{-1} Λ)^{-1}
where β can be computed beforehand given the model parameters.

EM Algorithm for Factor Analysis. First, set µ equal to the sample mean (1/N) Σ_n y_n, and subtract this mean from all the data. Now run the following iterations:
E-step: q^{t+1}_n = p(x_n|y_n, θ^t) = N(x_n|m_n, V_n), with
V_n = (I + Λ^T Ψ^{-1} Λ)^{-1}
m_n = V_n Λ^T Ψ^{-1} (y_n − µ)
M-step:
Λ^{t+1} = (Σ_n y_n m_n^T)(Σ_n V_n + m_n m_n^T)^{-1}
Ψ^{t+1} = (1/N) diag[Σ_n y_n y_n^T − Λ^{t+1} Σ_n m_n y_n^T]
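
A numpy sketch of one such iteration, assuming the data Y has already been centred and Ψ is stored as a vector of diagonal entries (illustrative and untuned, following the updates above):

```python
import numpy as np

def em_step_factor_analysis(Y, Lam, Psi):
    """One EM iteration for factor analysis on centred data Y (N, p).
    Lam: (p, k) loading matrix; Psi: (p,) diagonal noise variances."""
    N, p = Y.shape
    k = Lam.shape[1]
    Psi_inv = 1.0 / Psi

    # E-step: V = (I + Lam^T Psi^{-1} Lam)^{-1},  m_n = V Lam^T Psi^{-1} y_n
    V = np.linalg.inv(np.eye(k) + (Lam * Psi_inv[:, None]).T @ Lam)
    M = Y @ (Psi_inv[:, None] * Lam) @ V       # (N, k) posterior means m_n

    # M-step: expected sufficient statistics, then new Lam and diagonal Psi
    Eyx = Y.T @ M                              # sum_n y_n m_n^T
    Exx = N * V + M.T @ M                      # sum_n (V + m_n m_n^T)
    Lam_new = Eyx @ np.linalg.inv(Exx)
    Psi_new = np.diag(Y.T @ Y - Lam_new @ Eyx.T) / N   # keep only the diagonal
    return Lam_new, Psi_new
```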

Principal Component Analysis. In factor analysis, we can write the marginal density explicitly:
p(y|θ) = ∫ p(x) p(y|x, θ) dx = N(y|µ, ΛΛ^T + Ψ)
The noise Ψ must be restricted for the model to be interesting. (Why?) In factor analysis the restriction is that Ψ is diagonal (axis-aligned). What if we further restrict Ψ = σ^2 I (i.e. spherical)? We get the probabilistic principal component analysis (PPCA) model:
p(x) = N(x|0, I)
p(y|x, θ) = N(y|µ + Λx, σ^2 I)
where µ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ^2 is the global sensor noise.

PCA: Zero Noise Limit. The traditional PCA model is actually a limit as σ^2 → 0. The model we saw is actually called probabilistic PCA. However, the ML parameters Λ are the same. The only difference is the global sensor noise σ^2. In the zero noise limit inference is easier: orthogonal projection.
lim_{σ^2 → 0} Λ^T (ΛΛ^T + σ^2 I)^{-1} = (Λ^T Λ)^{-1} Λ^T
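
This limit is easy to verify numerically (illustrative sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((5, 2))                          # p = 5, k = 2

proj_zero_noise = np.linalg.inv(Lam.T @ Lam) @ Lam.T       # (Lam^T Lam)^{-1} Lam^T
for sigma2 in (1.0, 1e-2, 1e-4, 1e-6):
    proj = Lam.T @ np.linalg.inv(Lam @ Lam.T + sigma2 * np.eye(5))
    print(sigma2, np.max(np.abs(proj - proj_zero_noise)))  # gap shrinks with sigma^2
```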

Direct Fitting. For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques. But for (P)PCA, the ML parameters can be solved for directly: the kth column of Λ is the kth largest eigenvalue of the sample covariance S times the associated eigenvector, and the global sensor noise σ^2 is the average of the eigenvalues smaller than the kth one. This technique is also good for initializing FA. Actually, PCA is the limit as the ratio of the noise variance on the output to the prior variance on the latent variables goes to zero. We can either achieve this with zero noise or with infinite-variance priors.
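
A sketch of the closed-form fit in the Tipping and Bishop parameterization, where the kth column of Λ is the kth eigenvector scaled by sqrt(λ_k − σ^2) and σ^2 is the average of the discarded eigenvalues; the function name is illustrative and the exact column scaling differs slightly from the slide's informal description.

```python
import numpy as np

def fit_ppca_direct(Y, k):
    """Closed-form ML fit of probabilistic PCA (Tipping & Bishop parameterization).
    Y: (N, p) data; k: number of latent dimensions (k < p).
    Returns mu, Lam (p, k), and the global noise variance sigma2."""
    mu = Y.mean(axis=0)
    S = np.cov(Y - mu, rowvar=False)             # sample covariance
    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    sigma2 = evals[k:].mean()                    # average of discarded eigenvalues
    Lam = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return mu, Lam, sigma2
```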

Gaussians are Footballs in High-D. Recall the intuition that Gaussians are hyperellipsoids. Mean == centre of the football. Eigenvectors of the covariance matrix == axes of the football. Eigenvalues == lengths of the axes. In FA our noise football is an axis-aligned cigar (Ψ). In PPCA our noise football is a sphere of radius σ^2 (σ^2 I).

Scale Invariance in Factor Analysis. In FA the scale of the data is unimportant: we can multiply y_i by α_i without changing anything:
µ_i → α_i µ_i
Λ_ij → α_i Λ_ij (for all j)
Ψ_i → α_i^2 Ψ_i
However, the rotation of the data is important. FA looks for directions of large correlation in the data, so it is not fooled by large-variance noise.

Rotational Invariance in PCA. In PCA the rotation of the data is unimportant: we can multiply the data y by any rotation Q without changing anything:
µ → Qµ
Λ → QΛ
Ψ unchanged
However, the scale of the data is important. PCA looks for directions of large variance, so it will chase big noise directions.

Recap: EM Algorithm. A way of maximizing the likelihood function for latent variable models. Finds ML parameters when the original (hard) problem can be broken up into two (easy) pieces: 1. Estimate some missing or unobserved data from the observed data and current parameters. 2. Using this complete data, find the maximum likelihood parameter estimates. Alternate between filling in the latent variables using our best guess (posterior) and updating the parameters based on this guess:
E-step: q^{t+1} = p(z|x, θ^t)
M-step: θ^{t+1} = argmax_θ Σ_z q(z|x) log p(x, z|θ)
In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.