MACHINE LEARNING AND PATTERN RECOGNITION
Fall 2006, Lecture 8: Latent Variables, EM
Yann LeCun, The Courant Institute, New York University
http://yann.lecun.com


Latent Variables

Latent variables are unobserved random variables Z that enter into the energy function E(Y, Z, X, W). The input variable X is always observed, and Y must be predicted. The variable Z is latent: it is never observed. To obtain P(Y|X, W), we must marginalize the joint probability P(Y, Z|X, W) over Z:

P(Y|X, W) = \int P(Y, z|X, W) dz

The following discussion treats the case where an observation X is present. In the unsupervised case there is no observation, and we can simply remove the symbol X from all the slides below.
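As a concrete illustration of this marginalization, here is a minimal sketch in which Z is discrete, so the integral over z becomes a sum. The probability table is made up for illustration only.

```python
import numpy as np

# Hypothetical discrete example: Z takes three values, so the integral over z
# becomes a sum.  p_yz[y, z] stands for P(Y=y, Z=z | X, W) for one fixed X;
# the numbers are invented and already sum to 1.
p_yz = np.array([[0.10, 0.25, 0.05],    # Y = 0
                 [0.30, 0.20, 0.10]])   # Y = 1
p_y = p_yz.sum(axis=1)                  # P(Y | X, W) = sum over z of P(Y, z | X, W)
print(p_y)                              # [0.4 0.6]
```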

Latent Variables: example

Suppose we have a set of images of a Boeing 747 under various viewing angles (call the angle Z), and another set of images of an Airbus A-380, also under various viewing angles. Assume we are given a similarity function E(Y, Z, X), where Y is the label (Boeing or Airbus), Z is the latent variable (the viewing angle), and X is the image. For example, E(Airbus, 20, X) will give a low energy if X is similar to our prototype image of an Airbus under a 20-degree viewing angle. For instance, E could be defined as

E(Y, Z, X) = \|X - R_{YZ}\|^2

where R_{YZ} is our prototype image of plane Y at angle Z. When asked about the category of an image, we are never given the viewing angle, but knowing it would make the task simpler.
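A minimal sketch of this prototype-matching energy, assuming one prototype image per (label, quantized angle) pair. The prototype table, image size, and angle grid are hypothetical choices for illustration, not part of the lecture.

```python
import numpy as np

labels = ["Boeing", "Airbus"]
angles = np.arange(0, 360, 20)                       # quantized viewing angles
rng = np.random.default_rng(0)
prototypes = {(y, z): rng.normal(size=(32, 32))      # R_{YZ}: prototype image of plane y at angle z
              for y in labels for z in angles}

def energy(y, z, x):
    """E(Y, Z, X) = ||X - R_{YZ}||^2 : low when x resembles the prototype."""
    return np.sum((x - prototypes[(y, z)]) ** 2)

x = rng.normal(size=(32, 32))                        # some observed image
print(energy("Airbus", 20, x))
```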

Latent Variables: marginalization

In terms of the energy function, P(Y, Z|X, W) can be written as:

P(Y, Z|X, W) = \exp(-\beta E(Y, Z, X, W)) / \int\int \exp(-\beta E(y, z, X, W)) dz dy

Therefore, P(Y|X, W) = \int P(Y, z|X, W) dz becomes:

P(Y|X, W) = \int [ \exp(-\beta E(Y, z, X, W)) / \int\int \exp(-\beta E(y', z', X, W)) dz' dy' ] dz

Since the denominator does not depend on z:

P(Y|X, W) = \int \exp(-\beta E(Y, z, X, W)) dz / \int\int \exp(-\beta E(y', z', X, W)) dz' dy'

If Z is a multidimensional variable, this can be very difficult to compute.
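A small sketch of this Gibbs marginalization with Y and Z on finite grids, so both integrals become sums and can be computed stably in log space. The energy table is random, purely for illustration.

```python
import numpy as np
from scipy.special import logsumexp

# E[y, z] holds E(y, z, X, W) for one observation X (values are made up).
rng = np.random.default_rng(1)
E = rng.uniform(0.0, 5.0, size=(2, 18))   # 2 labels, 18 quantized angles
beta = 1.0

log_num = logsumexp(-beta * E, axis=1)    # log of  sum_z exp(-beta E(Y, z)),  per label Y
log_den = logsumexp(-beta * E)            # log of  sum_{y,z} exp(-beta E(y, z))
p_y_given_x = np.exp(log_num - log_den)   # P(Y | X, W)
print(p_y_given_x, p_y_given_x.sum())     # sums to 1
```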

Latent Variables: example of marginalization

E(Y, Z, X) = \|X - R_{YZ}\|^2

P(Y, Z|X, W) = \exp(-\beta \|X - R_{YZ}\|^2) / \int\int \exp(-\beta \|X - R_{yz}\|^2) dz dy

As a function of X, the numerator is a Gaussian centered at R_{YZ}, with a variance proportional to 1/\beta.

P(Airbus|X) = \sum_Z \exp(-\beta \|X - R_{Airbus,Z}\|^2) / \sum_Z [ \exp(-\beta \|X - R_{Boeing,Z}\|^2) + \exp(-\beta \|X - R_{Airbus,Z}\|^2) ]

It is a sum of Gaussians over Z.

Latent Variables: maximum likelihood inference

Very often, given an observation X, we merely want to know the value of Y that is the most likely:

Y* = argmax_Y P(Y|X, W)

Y* = argmax_Y [ \int \exp(-\beta E(Y, z, X, W)) dz / \int\int \exp(-\beta E(y', z', X, W)) dz' dy' ]

Since the denominator does not depend on Y, we can simply remove it:

Y* = argmax_Y \int \exp(-\beta E(Y, z, X, W)) dz

Taking the log and dividing by -\beta (which turns the max into a min), we get:

Y* = argmin_Y -(1/\beta) \log [ \int \exp(-\beta E(Y, z, X, W)) dz ]

This is the log-sum of the energies over all values of Z, also called the Helmholtz free energy of the ensemble of states as Z varies.
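A sketch of this inference rule on the same kind of discrete energy table as above: compute the per-label free energy with a log-sum-exp over Z and pick the label with the lowest value. The table is made up.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
E = rng.uniform(0.0, 5.0, size=(2, 18))          # E(y, z, X, W): 2 labels x 18 angles
beta = 1.0

# Helmholtz free energy per label: F(Y) = -(1/beta) log sum_z exp(-beta E(Y, z))
free_energy = -logsumexp(-beta * E, axis=1) / beta
y_star = int(np.argmin(free_energy))             # most likely label Y*
print(free_energy, y_star)
```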

Latent Variables: example of maximum likelihood

E(Y, Z, X) = \|X - R_{YZ}\|^2

Y* = argmin_Y -(1/\beta) \log [ \sum_z \exp(-\beta \|X - R_{Yz}\|^2) ]

Latent Variables: zero-temperature limit

Computing the most likely Y using the free energy,

Y* = argmin_Y -(1/\beta) \log [ \int \exp(-\beta E(Y, z, X, W)) dz ],

still requires computing a (possibly horrible) integral over Z. One possible shortcut is to let \beta go to infinity. Then, as we have seen before, the log-sum reduces to the min, hence:

\lim_{\beta \to \infty}: Y* = argmin_Y min_Z E(Y, Z, X, W)

In this case, inference is a lot simpler: to find the best value of Y, find the combination of values of Y and Z that minimizes the energy,

(Y*, Z*) = argmin_{Y,Z} E(Y, Z, X, W),

and return Y*.
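The short sketch below illustrates the limit numerically on a made-up energy table: as beta grows, the per-label free energy approaches min_z E, and inference reduces to a joint minimization over (Y, Z).

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(3)
E = rng.uniform(0.0, 5.0, size=(2, 18))        # E(y, z, X, W), values invented

for beta in (1.0, 10.0, 100.0):
    free_energy = -logsumexp(-beta * E, axis=1) / beta
    print(beta, free_energy)                   # approaches the row-wise minimum
print("min_z E:", E.min(axis=1))               # the beta -> infinity limit

y_star, z_star = np.unravel_index(np.argmin(E), E.shape)
print("zero-temperature inference returns Y =", int(y_star))
```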

Latent Variables: example of zero-temperature limit

E(Y, Z, X) = \|X - R_{YZ}\|^2

(Y*, Z*) = argmin_{Y,Z} \|X - R_{YZ}\|^2,

and return Y*.

Example: Mixture Models

We have K normalized densities P_k(Y|W_k), each with a positive coefficient \alpha_k (the coefficients sum to 1 over k), and a switch controlled by a discrete latent variable Z that picks one of the component densities. There is no input X, only an output Y (whose distribution is to be modeled) and a latent variable Z. The likelihood for one sample Y^i is:

P(Y^i|W) = \sum_k \alpha_k P_k(Y^i|W_k), with \sum_k \alpha_k = 1.

Using Bayes' rule, we can compute the posterior probability of each mixture component for a data point Y^i:

r_k(Y^i) = P(Z = k|Y^i, W) = \alpha_k P_k(Y^i|W_k) / \sum_j \alpha_j P_j(Y^i|W_j)

These quantities are called responsibilities.
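A minimal sketch of the responsibility computation for a one-dimensional Gaussian mixture; the mixture weights, means, and variances are arbitrary illustration values.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

alphas = np.array([0.3, 0.7])                         # mixing coefficients alpha_k
means = np.array([-2.0, 1.5])
variances = np.array([1.0, 0.5])

y = np.array([-2.1, 0.0, 1.4])                        # a few data points Y^i
weighted = alphas * gaussian_pdf(y[:, None], means, variances)   # alpha_k P_k(Y^i)
responsibilities = weighted / weighted.sum(axis=1, keepdims=True)
print(responsibilities)                               # each row sums to 1
```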

Learning a Mixture Model with Gradient

We can learn a mixture with gradient descent, but there are much better methods, as we will see later. The negative log-likelihood of the data is:

L = -\log \prod_i P(Y^i|W) = -\sum_i \log P(Y^i|W)

Let us consider the loss for one data point Y^i:

L^i = -\log P(Y^i|W) = -\log \sum_k \alpha_k P_k(Y^i|W_k)

\partial L^i / \partial W = -(1 / P(Y^i|W)) \sum_k \alpha_k \partial P_k(Y^i|W_k) / \partial W

Learning a Mixture Model with Gradient (cont.)

\partial L^i / \partial W = -(1 / P(Y^i|W)) \sum_k \alpha_k \partial P_k(Y^i|W_k) / \partial W
= -\sum_k (\alpha_k / P(Y^i|W)) P_k(Y^i|W_k) \partial \log P_k(Y^i|W_k) / \partial W
= -\sum_k (\alpha_k P_k(Y^i|W_k) / P(Y^i|W)) \partial \log P_k(Y^i|W_k) / \partial W
= -\sum_k r_k(Y^i) \partial \log P_k(Y^i|W_k) / \partial W

The gradient is the sum of the gradients of the individual components, weighted by the responsibilities.
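As a sketch of this formula, the code below evaluates the gradient of the per-sample negative log-likelihood of a one-dimensional Gaussian mixture with respect to the component means, written as a responsibility-weighted sum of per-component gradients. All parameter values are made up.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

alphas = np.array([0.3, 0.7])
means = np.array([-2.0, 1.5])
variances = np.array([1.0, 0.5])
y = 0.3                                            # one data point Y^i

comp = alphas * gaussian_pdf(y, means, variances)  # alpha_k P_k(y)
r = comp / comp.sum()                              # responsibilities r_k(y)

# For a Gaussian component, d log P_k / d mean_k = (y - mean_k) / var_k, so
# dL^i / d mean_k = -r_k(y) * (y - mean_k) / var_k
grad_means = -r * (y - means) / variances
print(grad_means)
```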

Example: Gaussian Mixture

P(Y|W) = \sum_k \alpha_k |2\pi V_k|^{-1/2} \exp( -(1/2) (Y - M_k)^T V_k^{-1} (Y - M_k) )

This is used a lot in speech recognition.
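A small sketch of evaluating this density for a two-component, two-dimensional Gaussian mixture; the means and covariances are arbitrary illustration values.

```python
import numpy as np
from scipy.stats import multivariate_normal

alphas = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]          # M_k
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]        # V_k

def mixture_density(y):
    # P(Y|W) = sum_k alpha_k N(Y; M_k, V_k)
    return sum(a * multivariate_normal.pdf(y, mean=m, cov=v)
               for a, m, v in zip(alphas, means, covs))

print(mixture_density(np.array([1.0, 0.5])))
```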

The Expectation-Maximization Algorithm

Optimizing likelihoods with gradient is the only option in some cases, but there is a considerably more efficient procedure known as EM. Every time we update the parameters W, the distribution over the latent variables Z must be updated as well (because it depends on W). The basic idea of EM is to keep the distribution over Z constant while we find the optimal W, then to recompute the new distribution over Z that results from the new W, and to iterate. This process is sometimes called coordinate descent.

EM: The Trick

The negative log-likelihood for a sample Y^i is:

L^i = -\log P(Y^i|W) = -\log \int P(Y^i, Z|W) dZ

For any distribution q(Z) we can write:

L^i = -\log \int q(Z) [ P(Y^i, Z|W) / q(Z) ] dZ

We now use Jensen's inequality, which says that for any concave function G (such as log):

G( \int p(z) f(z) dz ) \geq \int p(z) G(f(z)) dz

Applying it to the log inside the negative log-likelihood (the minus sign flips the inequality), we get:

L^i \leq F^i = -\int q(Z) \log [ P(Y^i, Z|W) / q(Z) ] dZ

EM

L^i \leq F^i = -\int q(Z) \log [ P(Y^i, Z|W) / q(Z) ] dZ

EM minimizes F^i by alternately finding the q(Z) that minimizes F^i (E-step) and then finding the W that minimizes F^i (M-step):

E-step: q(Z)^{t+1} <- argmin_q F^i(q, W^t)
M-step: W^{t+1} <- argmin_W F^i(q(Z)^{t+1}, W)
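A numeric sketch of the bound L^i <= F^i for a two-component, one-dimensional Gaussian mixture with a discrete latent Z (so the integral becomes a sum). The parameter values and the test distribution q are invented.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

alphas = np.array([0.4, 0.6])
means = np.array([-1.0, 2.0])
variances = np.array([1.0, 0.8])
y = 0.7                                             # one sample Y^i

joint = alphas * gaussian_pdf(y, means, variances)  # P(Y^i, Z=k | W)
L_i = -np.log(joint.sum())                          # -log P(Y^i | W)

q = np.array([0.5, 0.5])                            # an arbitrary q(Z)
F_i = -np.sum(q * np.log(joint / q))                # free energy bound
print(L_i, F_i, F_i >= L_i)                         # F_i upper-bounds L_i
```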

M Step

We can decompose the free energy:

F^i(q(Z), W) = -\int q(Z) \log [ P(Y^i, Z|W) / q(Z) ] dZ
= -\int q(Z) \log P(Y^i, Z|W) dZ + \int q(Z) \log q(Z) dZ

The first term is the expected energy under the distribution q(Z); the second term is the negative entropy of q(Z) and does not depend on W. So in the M-step, we only need to consider the first term when minimizing with respect to W:

W^{t+1} <- argmin_W -\int q(Z) \log P(Y^i, Z|W) dZ

E Step

Proposition: the value of q(Z) that minimizes the free energy is q(Z) = P(Z|Y^i, W), the posterior distribution over the latent variable given the sample and the current parameters.

Proof:

F^i(P(Z|Y^i, W), W) = -\int P(Z|Y^i, W) \log [ P(Y^i, Z|W) / P(Z|Y^i, W) ] dZ
= -\int P(Z|Y^i, W) \log P(Y^i|W) dZ
= -\log P(Y^i|W) \int P(Z|Y^i, W) dZ
= -\log P(Y^i|W) \cdot 1 = L^i

Since F^i \geq L^i for any q(Z) (by Jensen's inequality), this choice of q(Z) attains the bound and is therefore the minimizer.
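To tie the pieces together, here is a compact EM sketch for a one-dimensional mixture of two Gaussians: the E-step sets q(Z) to the posterior (the responsibilities), and the M-step re-estimates the parameters in closed form. The data and initialization are synthetic; this is an illustration under those assumptions, not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(1.5, 0.7, 300)])

alphas = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for t in range(50):
    # E-step: responsibilities r[i, k] = P(Z = k | Y^i, W)
    weighted = alphas * gaussian_pdf(data[:, None], means, variances)
    r = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: closed-form updates minimizing the expected negative log joint
    n_k = r.sum(axis=0)
    alphas = n_k / len(data)
    means = (r * data[:, None]).sum(axis=0) / n_k
    variances = (r * (data[:, None] - means) ** 2).sum(axis=0) / n_k
    nll = -np.log(weighted.sum(axis=1)).sum()   # NLL at the pre-update parameters
    if t % 10 == 0:
        print(t, nll)                           # never increases across iterations
```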