Lecture 8: Unsupervised Learning & EM Algorithm
Sam Roweis
October 28, 2003

Partially Unobserved Variables

Certain variables q in our models may be unobserved, either at training time, at test time, or both. If they are only occasionally unobserved they are missing data: e.g. undefined inputs, missing class labels, erroneous target values. In this case, we define a new cost function in which we integrate out the missing values at training or test time:

\ell(\theta; D) = \sum_{\text{complete}} \log p(x_c, y_c \mid \theta) + \sum_{\text{missing}} \log p(x_m \mid \theta)
               = \sum_{\text{complete}} \log p(x_c, y_c \mid \theta) + \sum_{\text{missing}} \log \sum_y p(x_m, y \mid \theta)

Variables which are always unobserved are called latent variables.

Missing Outputs

You can think of unsupervised learning as supervised learning in which all the outputs are missing: clustering == classification with missing labels; dimensionality reduction == regression with missing targets. Density estimation is actually very general and encompasses the two problems above and a whole lot more. Let's focus on the idea of unobserved variables...

Latent Variables

What to do when a variable z is always unobserved? It depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out: e.g. given y, x fit the model

p(z, y \mid x) = p(z \mid y)\, p(y \mid x, w)\, p(w)

But if z is conditioned on, we need to model it: e.g. given y, x fit the model

p(y \mid x) = \sum_z p(y \mid x, z)\, p(z)

Latent variables may appear naturally, from the structure of the problem. But we may also want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g. mixtures).
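For concreteness, here is a minimal sketch (invented numbers, discrete z and y, one fixed input x) of the second case above, marginalizing the latent variable out of p(y | x):

```python
import numpy as np

# Hypothetical discrete example: z takes 2 values, y takes 3 values,
# and we condition on one fixed input x.
p_z = np.array([0.3, 0.7])                 # p(z)
p_y_given_xz = np.array([[0.5, 0.3, 0.2],  # p(y | x, z=0)
                         [0.1, 0.1, 0.8]]) # p(y | x, z=1)

# Marginalize the latent variable: p(y | x) = sum_z p(y | x, z) p(z)
p_y_given_x = p_z @ p_y_given_xz
print(p_y_given_x)          # [0.22 0.16 0.62]
print(p_y_given_x.sum())    # 1.0 -- still a valid distribution over y
```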

Why is Learning Harder?

In fully observed settings, the probability model is a product, so the log likelihood is a sum whose terms decouple:

\ell(\theta; D) = \sum_n \log p(x_n, y_n \mid \theta) = \sum_n \log p(x_n \mid \theta_x) + \sum_n \log p(y_n \mid x_n, \theta_y)

With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together:

\ell(\theta; D) = \sum_n \log \sum_z p(x_n, z \mid \theta) = \sum_n \log \sum_z p(z \mid \theta_z)\, p(x_n \mid z, \theta_x)

Mixture Models

The most basic latent variable model, with a single discrete node. It allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space. Divide and conquer idea: use simple parts to build complex models (e.g. multimodal densities, or piecewise-linear regressions).

[Figure: a one-dimensional multimodal mixture density p(x).]

Learning with Latent Variables

The likelihood \ell(\theta) = \sum_n \log \sum_z p(z \mid \theta_z)\, p(x_n \mid z, \theta_x) couples the parameters. We can treat it as a black box probability function and just try to optimize the likelihood as a function of θ; we did this many times before by taking gradients. However, sometimes taking advantage of the latent variable structure can make parameter estimation easier. Good news: today we will see the EM algorithm, which allows us to treat learning with latent variables using fully observed tools. Basic trick: guess the values you don't know. Basic math: use convexity to lower bound the likelihood.

Mixture Densities

Exactly like a classification model, but the class is unobserved and so we sum it out. What we get is a perfectly valid density:

p(x \mid \theta) = \sum_{k=1}^{K} p(z = k \mid \theta_z)\, p(x \mid z = k, \theta_k) = \sum_k \alpha_k\, p_k(x \mid \theta_k)

where the mixing proportions add to one: \sum_k \alpha_k = 1. We can use Bayes' rule to compute the posterior probability of the mixture component given some data:

r_k(x) = p(z = k \mid x, \theta) = \frac{\alpha_k\, p_k(x \mid \theta_k)}{\sum_j \alpha_j\, p_j(x \mid \theta_j)}

These quantities are called responsibilities. You've seen them many times before; now you know their names!
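A minimal sketch of these responsibilities for a hypothetical two-component 1-D Gaussian mixture (the mixing proportions, means, and variances are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture: K = 2 components with mixing proportions alpha_k.
alpha = np.array([0.4, 0.6])
mu, sigma = np.array([-1.0, 2.0]), np.array([1.0, 0.5])

def responsibilities(x):
    """r_k(x) = alpha_k p_k(x) / sum_j alpha_j p_j(x)  (Bayes' rule)."""
    weighted = alpha * norm.pdf(x, loc=mu, scale=sigma)  # alpha_k p_k(x | theta_k)
    return weighted / weighted.sum()

print(responsibilities(0.0))   # mostly component 1 (mean -1)
print(responsibilities(1.8))   # mostly component 2 (mean 2)
```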

Learning with Mixtures

We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting:

\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \alpha_k\, p_k(x \mid \theta_k)

\frac{\partial \ell}{\partial \theta_k} = \frac{\alpha_k}{p(x \mid \theta)} \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k}
 = \frac{\alpha_k\, p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k}
 = r_k \frac{\partial \ell_k}{\partial \theta_k}

In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients.

Mixture of Linear Regression Experts

Each expert generates data according to a linear function of the input plus additive Gaussian noise:

p(y \mid x, \theta) = \sum_k \alpha_k(x)\, \mathcal{N}(y \mid \beta_k^\top x, \sigma_k^2)

The gate function can be a softmax classification machine:

\alpha_k(x) = p(z = k \mid x) = \frac{e^{\eta_k^\top x}}{\sum_j e^{\eta_j^\top x}}

Remember: we are not modeling the density of the inputs x. Learning? Gradient descent is one option; you have seen the gradients for this example before. In a minute you will see another one...

Conditional Mixtures: MOEs Revisited

Mixtures of Experts are also called conditional mixtures. Exactly like a class-conditional classification model, except the class is unobserved and so we sum it out:

p(y \mid x, \theta) = \sum_{k=1}^{K} p(z = k \mid x, \theta_z)\, p(y \mid z = k, x, \theta_k) = \sum_k \alpha_k(x \mid \theta_z)\, p_k(y \mid x, \theta_k)

where \sum_k \alpha_k(x) = 1 for all x. Harder: we must learn α_k(x) (unless it is chosen independent of x). The α_k(x) are exactly what we called the gating function. We can still use Bayes' rule to compute the posterior probability of the mixture component given some data:

p(z = k \mid x, y, \theta) = \frac{\alpha_k(x)\, p_k(y \mid x, \theta_k)}{\sum_j \alpha_j(x)\, p_j(y \mid x, \theta_j)}

Clustering Example: Gaussian Mixture Models

Consider a mixture of K Gaussian components:

p(x \mid \theta) = \sum_k \alpha_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)

p(z = k \mid x, \theta) = \frac{\alpha_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_j \alpha_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}

\ell(\theta; D) = \sum_n \log \sum_k \alpha_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

[Figure: a one-dimensional Gaussian mixture density p(x) with several components.]

Density model: p(x | θ) is a familiarity signal. Clustering: p(z | x, θ) is the assignment rule, ℓ(θ) is the cost.
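A minimal sketch of the mixture-of-experts conditional density just described; the gate weights η_k, expert weights β_k, and noise levels σ_k are invented for illustration. The gate is a softmax over η_k·x and each expert predicts β_k·x:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D input with a bias feature, K = 2 linear experts.
eta = np.array([[ 2.0, 0.0],    # gate weights eta_k (K x d)
                [-2.0, 0.0]])
beta = np.array([[ 1.0, 0.0],   # expert regression weights beta_k (K x d)
                 [-1.0, 3.0]])
sigma = np.array([0.5, 0.5])    # expert noise standard deviations

def gate(x):
    """alpha_k(x) = softmax(eta_k . x): which expert is responsible where."""
    a = eta @ x
    a -= a.max()                       # stabilize the softmax
    e = np.exp(a)
    return e / e.sum()

def p_y_given_x(y, x):
    """p(y | x) = sum_k alpha_k(x) N(y | beta_k . x, sigma_k^2)."""
    return np.sum(gate(x) * norm.pdf(y, loc=beta @ x, scale=sigma))

x = np.array([1.5, 1.0])               # [input, bias]
print(gate(x))                         # expert 1 dominates for positive inputs
print(p_y_given_x(1.5, x))             # density near expert 1's prediction beta_1 . x = 1.5
```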

Mixture of Gaussians Learning

We can learn mixtures of Gaussians using gradient descent. For example, the gradients of the means:

\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \alpha_k\, p_k(x \mid \theta_k)

\frac{\partial \ell}{\partial \theta_k} = r_k \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k}
\qquad
\frac{\partial \ell}{\partial \mu_k} = r_k\, \Sigma_k^{-1}(x - \mu_k)

Gradients of the covariance matrices are harder: they require derivatives of log determinants and quadratic forms. We must also ensure that the mixing proportions α_k are positive and sum to unity, and that the covariance matrices are positive definite.

Be Careful: logsum

Often you can easily compute b_k = log p(x | z = k, θ_k), but it will be very negative, say -10^6 or smaller. Now, to compute ℓ = log p(x | θ) you need to compute \log \sum_k e^{b_k}. Careful! Do not compute this by doing log(sum(exp(b))): you will get underflow and an incorrect answer. Instead do this: add a constant exponent B to all the values b_k such that the largest value comes close to the maximum exponent allowed by machine precision, B = MAXEXPONENT - log(K) - max_k(b_k), and compute log(sum(exp(b + B))) - B.

Example: if log p(x, z = 1) = -420 and log p(x, z = 2) = -420, what is log p(x) = log[p(x, z = 1) + p(x, z = 2)]? Answer: log[2 e^{-420}] = -420 + log 2.

Parameter Constraints

If we want to use general optimizers (e.g. conjugate gradient) to learn latent variable models, we often have to make sure the parameters respect certain constraints (e.g. \sum_k \alpha_k = 1, Σ_k positive definite). A good trick is to reparameterize these quantities in terms of unconstrained values. For mixing proportions, use the softmax:

\alpha_k = \frac{\exp(q_k)}{\sum_j \exp(q_j)}

For covariance matrices, use the Cholesky decomposition:

\Sigma^{-1} = A^\top A, \qquad |\Sigma|^{-1/2} = \prod_i A_{ii}

where A is upper triangular with a positive diagonal:

A_{ii} = \exp(r_i) > 0, \qquad A_{ij} = a_{ij}\ (j > i), \qquad A_{ij} = 0\ (j < i)

Recap: Learning with Latent Variables

With latent variables, the probability contains a sum, so the log likelihood has all parameters coupled together:

\ell(\theta; D) = \sum_n \log \sum_z p(x_n, z \mid \theta) = \sum_n \log \sum_z p(z \mid \theta_z)\, p(x_n \mid z, \theta_x)

(We can also consider continuous z and replace the sum with an integral.) If the latent variables were observed, the parameters would decouple again and learning would be easy:

\ell(\theta; D) = \sum_n \log p(x_n, z_n \mid \theta) = \sum_n \log p(z_n \mid \theta_z) + \sum_n \log p(x_n \mid z_n, \theta_x)

One idea: ignore this fact, compute ∂ℓ/∂θ, and do learning with a smart optimizer like conjugate gradient. Another idea: what if we use our current parameters to guess the values of the latent variables, and then do fully-observed learning? This back-and-forth trick might make optimization easier.
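The log-sum-exp trick above is easiest to see in code. A minimal sketch, using the common variant that shifts by max(b) rather than by a machine-precision constant; the value -800 is made up so that the naive computation actually underflows in double precision:

```python
import numpy as np

def logsumexp(b):
    """Compute log(sum(exp(b))) without underflow by shifting by max(b)."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

b = np.array([-800.0, -800.0])          # log p(x, z=k) for two hypothetical components
print(np.log(np.sum(np.exp(b))))        # -inf: exp(-800) underflows to 0 (NumPy warns about log(0))
print(logsumexp(b))                     # -799.3068... = -800 + log 2
```

In practice the same trick is available as scipy.special.logsumexp.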

Expectation-Maximization (EM) Algorithm

An iterative algorithm with two linked steps:

E-step: fill in values of \hat{z}^t using p(z \mid x, \theta^t).
M-step: update the parameters using \theta^{t+1} \leftarrow \arg\max_\theta \ell(\theta; x, \hat{z}^t).

The E-step involves inference, which we need to do at run time anyway. The M-step is no harder than in the fully observed case. We will prove that this procedure monotonically improves ℓ (or leaves it unchanged); thus it always converges to a local optimum of the likelihood (as any optimizer should). Note: EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data. EM is not a cost function such as maximum likelihood, and EM is not a model such as a mixture of Gaussians.

Complete & Incomplete Log Likelihoods

Observed variables x, latent variables z, parameters θ:

\ell_c(\theta; x, z) = \log p(x, z \mid \theta)

is the complete log likelihood. Usually optimizing ℓ_c(θ) given both z and x is straightforward (e.g. class-conditional Gaussian fitting, linear regression). With z unobserved, we need the log of a marginal probability,

\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)

which is the incomplete log likelihood.

Expected Complete Log Likelihood

For any distribution q(z), define the expected complete log likelihood:

\ell_q(\theta; x) = \langle \ell_c(\theta; x, z) \rangle_q \equiv \sum_z q(z \mid x) \log p(x, z \mid \theta)

Amazing fact: \ell(\theta) \ge \ell_q(\theta) + H(q), because of the concavity of log:

\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)
 = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
 \ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}

where the inequality is called Jensen's inequality. (It is only true for distributions: \sum_z q(z) = 1, q(z) > 0.)

Lower Bounds and Free Energy

For fixed data x, define a functional called the free energy:

F(q, \theta) \equiv \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le \ell(\theta)

The EM algorithm is coordinate ascent on F:

E-step: q^{t+1} = \arg\max_q F(q, \theta^t)
M-step: \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)
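The coordinate-ascent view suggests a generic EM skeleton. A minimal sketch, where log_likelihood, e_step, and m_step are hypothetical model-specific callables (not from any library); the assertion checks the monotone-improvement property stated above:

```python
def em(x, theta, log_likelihood, e_step, m_step, n_iters=100, tol=1e-8):
    """Generic EM loop: alternate q <- p(z | x, theta) and theta <- argmax E_q[log p(x, z | theta)].

    `log_likelihood`, `e_step`, and `m_step` are model-specific callables
    supplied by the user (hypothetical names, not from any particular library).
    """
    prev = log_likelihood(x, theta)
    for _ in range(n_iters):
        q = e_step(x, theta)          # posterior over latents at the current theta
        theta = m_step(x, q)          # maximize the expected complete log likelihood
        cur = log_likelihood(x, theta)
        assert cur >= prev - 1e-10    # each iteration can only improve l (up to round-off)
        if cur - prev < tol:          # converged to a local optimum
            break
        prev = cur
    return theta
```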

M-step: Maximization of the Expected ℓ_c

Note that the free energy breaks into two terms:

F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
 = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
 = \ell_q(\theta; x) + H(q)

(this is where its name comes from). The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy. Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

\theta^{t+1} = \arg\max_\theta \ell_q(\theta; x) = \arg\max_\theta \sum_z q(z \mid x) \log p(x, z \mid \theta)

EM Constructs Sequential Convex Lower Bounds

Consider the likelihood function and the function F(q^{t+1}, θ).

[Figure: the likelihood ℓ(θ) and the lower bound F(q^{t+1}, θ), which touches the likelihood at θ^t.]

E-step: Inferring the Latent Posterior

Claim: the optimal setting of q in the E-step is

q^{t+1} = p(z \mid x, \theta^t)

This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification). Proof (easy): this setting saturates the bound \ell(\theta; x) \ge F(q, \theta):

F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}
 = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t)
 = \log p(x \mid \theta^t) \sum_z p(z \mid x, \theta^t) = \ell(\theta^t; x) \cdot 1

One can also show this result using variational calculus, or the fact that \ell(\theta) - F(q, \theta) = \mathrm{KL}[q \,\|\, p(z \mid x, \theta)].

Partially Hidden Data

Of course, we can learn when there are missing (hidden) variables on some cases and not on others. In this case the cost function was:

\ell(\theta; D) = \sum_{\text{complete}} \log p(x_c, y_c \mid \theta) + \sum_{\text{missing}} \log \sum_y p(x_m, y \mid \theta)

Now you can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only. The M-step optimizes the log likelihood on the complete data plus the expected likelihood on the incomplete data, using the E-step.
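A minimal numeric sketch (a made-up joint p(x, z | θ) over a binary latent for a single observation) checking that F(q, θ) = ℓ(θ) − KL[q ‖ p(z | x, θ)], so the bound is saturated exactly when q is the posterior:

```python
import numpy as np

# Hypothetical joint over a binary latent z for one observed x: p(x, z | theta).
p_xz = np.array([0.12, 0.28])                     # p(x, z=0), p(x, z=1)
log_lik = np.log(p_xz.sum())                      # l(theta) = log p(x | theta)
posterior = p_xz / p_xz.sum()                     # p(z | x, theta)

def free_energy(q):
    """F(q, theta) = sum_z q(z) log [p(x, z | theta) / q(z)]."""
    return np.sum(q * (np.log(p_xz) - np.log(q)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.5, 0.5])                          # an arbitrary q, not the posterior
print(free_energy(q), log_lik - kl(q, posterior)) # equal, and both below log_lik
print(free_energy(posterior), log_lik)            # equal: the posterior saturates the bound
```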

Recap: EM Algorithm

A way of maximizing the likelihood function for latent variable models. It finds ML parameters when the original (hard) problem can be broken up into two (easy) pieces:

1. Estimate some missing or unobserved data from the observed data and the current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.

So we alternate between filling in the latent variables using our best guess (posterior) and updating the parameters based on this guess:

E-step: q^{t+1} = p(z \mid x, \theta^t)
M-step: \theta^{t+1} = \arg\max_\theta \sum_z q(z \mid x) \log p(x, z \mid \theta)

In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.

EM for MOG

[Figure: panels (a)-(i) showing EM fitting a mixture of Gaussians to 2-D data after L = 1, 4, 6, 8, 10, 12 iterations.]

Example: Mixtures of Gaussians

Recall a mixture of K Gaussians:

p(x \mid \theta) = \sum_k \alpha_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)
\ell(\theta; D) = \sum_n \log \sum_k \alpha_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

Learning with the EM algorithm:

E-step: p_{kn}^t = \mathcal{N}(x_n \mid \mu_k^t, \Sigma_k^t), \qquad q_{kn}^{t+1} = p(z = k \mid x_n, \theta^t) = \frac{\alpha_k^t\, p_{kn}^t}{\sum_j \alpha_j^t\, p_{jn}^t}

M-step: \mu_k^{t+1} = \frac{\sum_n q_{kn}^{t+1} x_n}{\sum_n q_{kn}^{t+1}}, \qquad \Sigma_k^{t+1} = \frac{\sum_n q_{kn}^{t+1} (x_n - \mu_k^{t+1})(x_n - \mu_k^{t+1})^\top}{\sum_n q_{kn}^{t+1}}, \qquad \alpha_k^{t+1} = \frac{1}{M} \sum_n q_{kn}^{t+1}

(where M is the number of data points)

Derivation of M-step

The expected complete log likelihood ℓ_q(θ; D) is

\sum_n \sum_k q_{kn} \left[ \log \alpha_k - \tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) - \tfrac{1}{2} \log |2\pi\Sigma_k| \right]

For fixed q we can optimize the parameters by setting these derivatives to zero:

\frac{\partial \ell_q}{\partial \mu_k} = \Sigma_k^{-1} \sum_n q_{kn} (x_n - \mu_k)
\frac{\partial \ell_q}{\partial \Sigma_k^{-1}} = \frac{1}{2} \sum_n q_{kn} \left[ \Sigma_k - (x_n - \mu_k)(x_n - \mu_k)^\top \right]
\frac{\partial \ell_q}{\partial \alpha_k} = \frac{1}{\alpha_k} \sum_n q_{kn} - \lambda \qquad (\lambda = M)

Facts used: \partial \log |A^{-1}| / \partial A^{-1} = A and \partial (x^\top A x) / \partial A = x x^\top.
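A minimal NumPy sketch of exactly these E- and M-step updates for a mixture of Gaussians, run on synthetic data; the initialization and the lack of covariance regularization are arbitrary choices for illustration, not part of the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a K-component Gaussian mixture, following the E/M updates above."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    alpha = np.full(K, 1.0 / K)
    mu = X[rng.choice(M, K, replace=False)]           # initialize means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities q_{kn} = alpha_k N(x_n | mu_k, Sigma_k) / sum_j ...
        p = np.stack([alpha[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)])              # shape (K, M)
        q = p / p.sum(axis=0, keepdims=True)
        # M-step: responsibility-weighted means, covariances, and mixing proportions
        Nk = q.sum(axis=1)                             # effective counts per component
        mu = (q @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (q[k, None, :] * diff.T) @ diff / Nk[k]
        alpha = Nk / M
    return alpha, mu, Sigma

# Tiny synthetic example with two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
alpha, mu, Sigma = em_gmm(X, K=2)
print(alpha, mu, sep="\n")
```

On well-separated clusters like these, the recovered means should land near (-3, -3) and (3, 3), with mixing proportions near 0.5 each.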

Compare: K-Means

The EM algorithm for mixtures of Gaussians is just like a soft version of the K-means algorithm with fixed priors and covariance. Instead of hard assignment in the E-step, we do soft assignment based on the softmax of the squared distance from each point to each cluster. Each centre is then moved to the weighted mean of the data, with weights given by the soft assignments. In K-means, the weights are 0 or 1.

E-step: d_{kn}^t = \tfrac{1}{2}(x_n - \mu_k^t)^\top \Sigma^{-1} (x_n - \mu_k^t), \qquad q_{kn}^{t+1} = \frac{\exp(-d_{kn}^t)}{\sum_j \exp(-d_{jn}^t)} = p(c_n^t = k \mid x_n, \mu^t)

M-step: \mu_k^{t+1} = \frac{\sum_n q_{kn}^{t+1} x_n}{\sum_n q_{kn}^{t+1}}

Variants

Sparse EM: do not recompute exactly the posterior probability on each data point under all models, because it is almost zero. Instead keep an "active list" which you update every once in a while.

Generalized (Incomplete) EM: it might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that improves the likelihood a bit (e.g. a gradient step).

A Report Card for EM

Some good things about EM: no learning rate parameter; very fast for low dimensions; each iteration guaranteed to improve the likelihood; adapts unused units rapidly.

Some bad things about EM: it can get stuck in local minima; both steps require considering all explanations of the data, which is an exponential amount of work in the dimension of θ.

EM is typically used with mixture models, for example mixtures of Gaussians or mixtures of experts. The missing data are the labels showing which sub-model generated each datapoint. Very common: also used to train HMMs, Boltzmann machines, ...
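A minimal sketch of the soft-versus-hard assignment contrast, with made-up centres and a shared spherical covariance so that d_{kn} reduces to a squared Euclidean distance:

```python
import numpy as np

def soft_assign(x, mu):
    """Soft E-step: softmax of negative squared distances (EM with fixed spherical covariance)."""
    d = 0.5 * np.sum((x - mu) ** 2, axis=1)
    e = np.exp(-(d - d.min()))            # stabilized softmax
    return e / e.sum()

def hard_assign(x, mu):
    """K-means E-step: weight 1 for the closest centre, 0 elsewhere."""
    d = np.sum((x - mu) ** 2, axis=1)
    q = np.zeros(len(mu))
    q[np.argmin(d)] = 1.0
    return q

mu = np.array([[0.0, 0.0], [4.0, 0.0]])   # two hypothetical centres
x = np.array([1.5, 0.0])                  # a point between them, closer to the first
print(soft_assign(x, mu))                 # about [0.88, 0.12] -- graded responsibility
print(hard_assign(x, mu))                 # [1., 0.]           -- winner takes all
```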