Latent Variable Models

Latent Variable Models. Stefano Ermon, Aditya Grover. Stanford University. Deep Generative Models, Lecture 5.

Recap of last lecture
1. Autoregressive models: chain-rule-based factorization is fully general; compact representation via conditional independence and/or neural parameterizations.
2. Autoregressive models, pros: easy to evaluate likelihoods; easy to train.
3. Autoregressive models, cons: requires an ordering; generation is sequential; cannot learn features in an unsupervised way.

Plan for today
1. Latent Variable Models: Variational EM

Latent Variable Models: Motivation
1. There is lots of variability in images x due to gender, eye color, hair color, pose, etc. However, unless images are annotated, these factors of variation are not explicitly available (latent).
2. Idea: explicitly model these factors using latent variables z.

Latent Variable Models: Motivation
1. Only the shaded variables x are observed in the data (pixel values).
2. Latent variables z correspond to high-level features.
   - If z is chosen properly, p(x | z) could be much simpler than p(x).
   - If we had trained this model, then we could identify features via p(z | x), e.g., p(EyeColor = blue | x).
3. Challenge: it is very difficult to specify these conditionals by hand.

Deep Latent Variable Models
1. z ~ N(0, I)
2. p(x | z) = N(µ_θ(z), Σ_θ(z)), where µ_θ, Σ_θ are neural networks.
3. Hope that after training, z will correspond to meaningful latent factors of variation (features): unsupervised representation learning.
4. As before, features can be computed via p(z | x).

Mixture of Gaussians: a Shallow Latent Variable Model
Mixture of Gaussians. Bayes net: z → x.
1. z ~ Categorical(1, ..., K)
2. p(x | z = k) = N(µ_k, Σ_k)
Generative process:
1. Pick a mixture component k by sampling z.
2. Generate a data point by sampling from that Gaussian.
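
A small sketch of this two-step generative process; the mixture weights, means, and covariances below are made-up values used only for illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                        # p(z = k), made up
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # made-up component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])

def sample(n):
    zs = rng.choice(len(pi), size=n, p=pi)            # step 1: pick a component z
    xs = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in zs])
    return xs, zs                                     # step 2: sample x from that Gaussian

xs, zs = sample(5)
print(zs)
print(xs)
```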

Mixture of Gaussians: a Shallow Latent Variable Model
Mixture of Gaussians:
1. z ~ Categorical(1, ..., K)
2. p(x | z = k) = N(µ_k, Σ_k)
3. Clustering: the posterior p(z | x) identifies the mixture component.
4. Unsupervised learning: we are hoping to learn from unlabeled data (an ill-posed problem).

Unsupervised learning

Unsupervised learning
Shown is the posterior probability that a data point was generated by the i-th mixture component, P(z = i | x).

Unsupervised learning
Unsupervised clustering of handwritten digits.

Unsupervised learning
Combine simple models into a more complex and expressive one:
p(x) = Σ_z p(x, z) = Σ_z p(z) p(x | z) = Σ_{k=1}^K p(z = k) N(x; µ_k, Σ_k)
where each N(x; µ_k, Σ_k) is a mixture component.
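
To make the sum over components concrete, here is a minimal sketch (reusing made-up weights pi, means, and covariances) that evaluates p(x) = Σ_k p(z = k) N(x; µ_k, Σ_k) and, via Bayes' rule, the posterior p(z = k | x) used for clustering on the earlier slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.5, 0.3, 0.2])                            # p(z = k), illustrative
mus = [np.zeros(2), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def marginal_and_posterior(x):
    # p(x, z = k) = p(z = k) N(x; mu_k, Sigma_k)
    joint = np.array([p * multivariate_normal.pdf(x, mean=m, cov=S)
                      for p, m, S in zip(pi, mus, Sigmas)])
    px = joint.sum()                                      # p(x) = sum_k p(x, z = k)
    return px, joint / px                                 # p(z = k | x) by Bayes' rule

px, posterior = marginal_and_posterior(np.array([3.5, 0.5]))
print(px, posterior)
```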

Variational Autoencoder
A mixture of an infinite number of Gaussians:
1. z ~ N(0, I)
2. p(x | z) = N(µ_θ(z), Σ_θ(z)), where µ_θ, Σ_θ are neural networks:
   µ_θ(z) = σ(Az + c) = (σ(a_1 z + c_1), σ(a_2 z + c_2)) = (µ_1(z), µ_2(z))
   Σ_θ(z) = diag(exp(σ(Bz + d))), i.e., a diagonal matrix with entries exp(σ(b_1 z + d_1)) and exp(σ(b_2 z + d_2))
   θ = (A, B, c, d)
3. Even though p(x | z) is simple, the marginal p(x) is very complex/flexible.
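
As a deliberately tiny sketch of this parameterization, the snippet below implements µ_θ(z) = σ(Az + c) and Σ_θ(z) = diag(exp(σ(Bz + d))) for a 1-dimensional z and 2-dimensional x; the randomly initialized A, B, c, d stand in for trained parameters θ and are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A, c = rng.normal(size=(2, 1)), rng.normal(size=2)
B, d = rng.normal(size=(2, 1)), rng.normal(size=2)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def p_x_given_z(z):
    mu = sigmoid(A @ z + c)                      # (mu_1(z), mu_2(z)) = sigmoid(Az + c)
    Sigma = np.diag(np.exp(sigmoid(B @ z + d)))  # diag(exp(sigmoid(Bz + d)))
    return mu, Sigma

z = rng.standard_normal(1)                       # z ~ N(0, I), here 1-dimensional
mu, Sigma = p_x_given_z(z)
x = rng.multivariate_normal(mu, Sigma)           # one sample from p(x | z)
print(mu, Sigma, x)
```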

Recap: Latent Variable Models
Allow us to define complex models p(x) in terms of simple building blocks p(x | z).
Natural for unsupervised learning tasks (clustering, unsupervised representation learning, etc.).
No free lunch: much more difficult to learn compared to fully observed, autoregressive models.

Marginal Likelihood
Suppose some pixel values are missing at train time (e.g., the top half).
Let X denote the observed random variables and Z the unobserved ones (also called hidden or latent).
Suppose we have a model for the joint distribution (e.g., a PixelCNN): p(X, Z; θ).
What is the probability p(X = x; θ) of observing a training data point x?
p(X = x; θ) = Σ_z p(X = x, Z = z; θ) = Σ_z p(x, z; θ)
We need to consider all possible ways to complete the image (fill in the green part).

Partially observed data
Suppose that our joint distribution is p(X, Z; θ).
We have a dataset D where, for each datapoint, the X variables are observed (e.g., pixel values) and the Z variables are never observed (e.g., cluster or class id): D = {x^(1), ..., x^(M)}.
Maximum likelihood learning:
ℓ(θ; D) = log Π_{x ∈ D} p(x; θ) = Σ_{x ∈ D} log p(x; θ) = Σ_{x ∈ D} log Σ_z p(x, z; θ)
Evaluating log Σ_z p(x, z; θ) can be intractable. Suppose we have 30 binary latent features, z ∈ {0, 1}^30: evaluating Σ_z p(x, z; θ) involves a sum with 2^30 terms. For continuous variables, log ∫_z p(x, z; θ) dz is often intractable.
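
The brute-force enumeration below makes the blow-up explicit: it evaluates log Σ_z p(x, z; θ) exactly for a toy joint (the Bernoulli prior and Gaussian-style likelihood are assumptions chosen only for illustration) over n = 10 binary latents, which already requires 2^10 terms; at n = 30 the same loop would need 2^30.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 10                                            # number of binary latent features
theta = rng.uniform(0.1, 0.9, size=n)             # toy parameters

def log_joint(x, z):
    # Toy p(x, z; theta): independent Bernoulli latents, Gaussian-style term for x.
    log_pz = np.sum(z * np.log(theta) + (1 - z) * np.log1p(-theta))
    log_px_given_z = -0.5 * np.sum((x - z) ** 2)
    return log_pz + log_px_given_z

x = rng.normal(size=n)
terms = [log_joint(x, np.array(z)) for z in itertools.product([0, 1], repeat=n)]
print(logsumexp(terms))                           # 2**10 = 1024 terms; 2**30 is infeasible
```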

Why is parameter learning in the presence of partially observed data challenging?
Likelihood function for fully observed data (easy to compute):
log p_θ(x) = Σ_i log p(x_i | x_pa(i))
Compare with the likelihood function for partially observed data (hard to compute):
Σ_z p_θ(x, z)
The likelihood function for partially observed data is not decomposable (by variable and parent assignment) and is not unimodal as a function of θ. We could still try gradient descent, but the likelihood is hard to compute, and so are its gradients ∇_θ (too many completions).

First attempt: Naive Monte Carlo
The likelihood function for partially observed data is hard to compute:
Σ_{z ∈ Z} p_θ(x, z) = |Z| Σ_{z ∈ Z} (1/|Z|) p_θ(x, z) = |Z| E_{z ~ Uniform(Z)}[p_θ(x, z)]
where Z is the set of all possible values of z. We can think of it as an (intractable) expectation. Monte Carlo to the rescue:
1. Sample z^(1), ..., z^(k) uniformly at random.
2. Approximate the expectation with a sample average:
Σ_z p_θ(x, z) ≈ |Z| (1/k) Σ_{j=1}^k p_θ(x, z^(j))
This works in theory but not in practice: for most z, p_θ(x, z) is very low (most completions don't make sense). A few terms are very large, but uniform random sampling will essentially never hit those likely completions. We need a clever way to select z^(j) to reduce the variance of the estimator.
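
A sketch of the naive estimator on the same kind of toy joint as above (again, the model is made up for illustration): sample z uniformly, average p_θ(x, z), and multiply by |Z| = 2^n. Re-running with different seeds shows the high variance the slide warns about.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 200                                    # 2**10 completions, 200 uniform samples
theta = rng.uniform(0.1, 0.9, size=n)
x = rng.normal(size=n)

def joint(z):
    pz = np.prod(theta ** z * (1 - theta) ** (1 - z))
    px_given_z = np.exp(-0.5 * np.sum((x - z) ** 2))
    return pz * px_given_z

zs = rng.integers(0, 2, size=(k, n))              # z^(1), ..., z^(k) ~ Uniform({0,1}^n)
estimate = (2 ** n) * np.mean([joint(z) for z in zs])
print(estimate)                                   # high variance: most z contribute ~0
```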

The Expectation Maximization (EM) Algorithm
Start with an initial (random) guess of the parameters θ^(0).
Repeat until convergence:
1. Complete ("hallucinate") the incomplete data (the z part) using the current parameters (E-step).
2. Train: update the parameters based on the completed data (M-step).
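
As a concrete instance of this recipe, here is a compact EM loop for a one-dimensional, two-component Gaussian mixture (the data and initial values are made up): the E-step computes responsibilities p(z = k | x_i) under the current parameters, and the M-step re-fits the parameters on the responsibility-weighted data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])   # toy data

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities p(z = k | x_i) under current parameters
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-fit parameters on the "completed" (responsibility-weighted) data
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
print(pi, mu, var)
```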

The Expectation Maximization Algorithm: Example
A Bayes net over binary variables A, B, C, D with structure A → C ← B, C → D, so that P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c). For example:
θ_a = .3, θ_b = .9
θ_{c | ā, b̄} = .83, θ_{c | ā, b} = .09, θ_{c | a, b̄} = .6, θ_{c | a, b} = .2
θ_{d | c̄} = .1, θ_{d | c} = .8
Posterior over a completion, e.g.:
P(b, c | a, d̄) = P(a, b, c, d̄) / P(a, d̄)
P(b̄, c | a, d̄) = P(a, b̄, c, d̄) / P(a, d̄)
Data instance: (a, ?, ?, d̄). How do we complete this example? For each possible completion:
STEP 1: Compute how likely the completion is (given the observed part), i.e., compute P(z | x); in this example, P(B, C | A = a, D = d̄):
P(b, c | a, d̄) = 0.3 · 0.9 · 0.2 · (1 − 0.8) / [0.3 · 0.9 · 0.2 · (1 − 0.8) + ⋯ + 0.3 · 0.1 · 0.4 · 0.9] = .0492
P(b̄, c | a, d̄) = 0.3 · 0.1 · 0.6 · (1 − 0.8) / [0.3 · 0.9 · 0.2 · (1 − 0.8) + ⋯ + 0.3 · 0.1 · 0.4 · 0.9] = .0164

The Expectation Maximization Algorithm
The dataset is now bigger and weighted. With binary variables, (a, ?, ?, d̄) corresponds to four weighted examples:
(a, b, c, d̄), weight = .0492 = P(b, c | a, d̄)
(a, b, c̄, d̄), weight = .8852 = P(b, c̄ | a, d̄)
(a, b̄, c, d̄), weight = .0164 = P(b̄, c | a, d̄)
(a, b̄, c̄, d̄), weight = .0492 = P(b̄, c̄ | a, d̄)
where each weight is the probability of that completion according to the current parameter estimates. After completion, the dataset is fully observed, so we can train as usual via gradient descent or in closed form (M-step):
θ^(t+1) = arg max_θ Σ_{m=1}^M E_{p(z_m | x_m; θ^t)}[log p(x_m, z_m; θ)]
Two problems:
1. p(z_m | x_m; θ^t) = p(x_m, z_m; θ^t) / p(x_m; θ^t) requires p(x_m; θ^t) = Σ_z p(x_m, z; θ^t) (exactly what you would need for gradient descent).
2. It still requires looking at all possible completions.
Solution: replace p(z | x; θ) with a tractable q(z) and do Monte Carlo.
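
The snippet below carries out this E-step numerically for the toy network above, assuming the structure A → C ← B, C → D implied by the parameters θ_a, θ_b, θ_{c|a,b}, θ_{d|c}; with the observation (A = a, D = d̄) it reproduces the completion weights .0492, .8852, .0164, .0492.

```python
import itertools

theta_a, theta_b = 0.3, 0.9
theta_c = {(0, 0): 0.83, (0, 1): 0.09, (1, 0): 0.60, (1, 1): 0.20}   # P(c = 1 | a, b)
theta_d = {0: 0.1, 1: 0.8}                                           # P(d = 1 | c)

def joint(a, b, c, d):
    p = (theta_a if a else 1 - theta_a) * (theta_b if b else 1 - theta_b)
    p *= theta_c[(a, b)] if c else 1 - theta_c[(a, b)]
    p *= theta_d[c] if d else 1 - theta_d[c]
    return p

a_obs, d_obs = 1, 0                      # observed part: A = a, D = d-bar
joints = {(b, c): joint(a_obs, b, c, d_obs)
          for b, c in itertools.product([1, 0], repeat=2)}
evidence = sum(joints.values())          # P(a, d-bar), summing over all completions
for (b, c), p in joints.items():
    print(f"B={b}, C={c}: weight {p / evidence:.4f}")
```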

The EM Algorithm
Suppose q(z) is any probability distribution over the hidden variables. Then:
D_KL(q(z) || p(z | x; θ)) = Σ_z q(z) log [q(z) / p(z | x; θ)]
= −Σ_z q(z) log p(z | x; θ) + Σ_z q(z) log q(z)
= −Σ_z q(z) log p(z | x; θ) − H(q), where H(q) = −Σ_z q(z) log q(z)
= −Σ_z q(z) log [p(z, x; θ) / p(x; θ)] − H(q)
= −Σ_z q(z) log p(z, x; θ) + Σ_z q(z) log p(x; θ) − H(q)
= −Σ_z q(z) log p(z, x; θ) + log p(x; θ) − H(q) ≥ 0

The EM Algorithm
Suppose q(z) is any probability distribution over the hidden variables:
D_KL(q(z) || p(z | x; θ)) = −Σ_z q(z) log p(z, x; θ) + log p(x; θ) − H(q) ≥ 0
Evidence lower bound (ELBO), which holds for any q:
log p(x; θ) ≥ Σ_z q(z) log p(z, x; θ) + H(q)
Equality holds if q = p(z | x; θ):
log p(x; θ) = Σ_z q(z) log p(z, x; θ) + H(q)
This is what we compute in the E-step of the EM algorithm.
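
A quick numerical sanity check of the bound on a tiny discrete example (the joint probabilities are made up): any q gives an ELBO at or below log p(x; θ), and plugging in the exact posterior makes the bound tight.

```python
import numpy as np

p_joint = np.array([0.10, 0.25, 0.05, 0.20])     # p(z, x; theta) at a fixed x, 4 values of z
log_px = np.log(p_joint.sum())                   # log p(x; theta)
posterior = p_joint / p_joint.sum()              # p(z | x; theta)

def elbo(q):
    # sum_z q(z) log p(z, x; theta) + H(q)
    return np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))

q_uniform = np.full(4, 0.25)
print(log_px, elbo(q_uniform), elbo(posterior))  # elbo(q_uniform) < log_px = elbo(posterior)
```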

The EM Algorithm
q(z) is an arbitrary probability distribution over z:
D_KL(q || p(z | x; θ)) = −Σ_z q(z) log p(z, x; θ) + log p(x; θ) − H(q)
Re-arranging, we can rewrite this as:
F[q, θ] := H(q) + Σ_z q(z) log p(z, x; θ) = log p(x; θ) − D_KL(q || p(z | x; θ))
We can interpret EM as coordinate ascent on F[q, θ]:
1. Initialize θ^(0)
2. q^(0) = arg max_q F[q, θ^(0)] = p(z | x; θ^(0))
3. θ^(1) = arg max_θ F[q^(0), θ] = arg max_θ Σ_z q^(0)(z) log p(z, x; θ)
4. q^(1) = arg max_q F[q, θ^(1)]
5. ...
The marginal likelihood never decreases:
log p(x; θ^(0)) = F[q^(0), θ^(0)] ≤ F[q^(0), θ^(1)] = log p(x; θ^(1)) − D_KL(q^(0) || p(z | x; θ^(1))) ≤ log p(x; θ^(1)) = F[q^(1), θ^(1)] ≤ ...

Coordinate ascent
Maximize along one direction at a time:
Initialize x_0, y_0
y_1 = arg max_y f(x_0, y)
x_1 = arg max_x f(x, y_1)
y_2 = arg max_y f(x_1, y)
...
The objective keeps improving:
f(x_0, y_0) ≤ f(x_0, y_1) ≤ f(x_1, y_1) ≤ f(x_1, y_2) ≤ ...
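
A toy instance of coordinate ascent (the objective f(x, y) = −(x − y)² − (x − 3)² is an arbitrary choice, not from the lecture), where each coordinate update has a closed form and the objective never decreases.

```python
def f(x, y):
    return -(x - y) ** 2 - (x - 3) ** 2

def argmax_y(x):          # maximizer of f over y for fixed x (closed form)
    return x

def argmax_x(y):          # maximizer of f over x for fixed y (closed form)
    return (y + 3) / 2

x, y = 0.0, 0.0
for _ in range(20):
    y = argmax_y(x)       # f(x, y) cannot decrease
    x = argmax_x(y)       # f(x, y) cannot decrease
print(x, y, f(x, y))      # approaches the maximizer x = y = 3
```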

Derivation of EM algorithm

The Evidence Lower Bound
What if the posterior p(z | x; θ) is intractable to compute in the E-step?
Suppose q(z; ϕ) is a (tractable) probability distribution over the hidden variables, parameterized by ϕ (variational parameters). For example: a Gaussian with mean and covariance specified by ϕ, a fully factored probability distribution, an FVSBN, etc.
For a fully factored distribution over binary unobserved variables:
q(z; ϕ) = Π_{unobserved variables z_i} (ϕ_i)^{z_i} (1 − ϕ_i)^{(1 − z_i)}
Note: conditioned on the bottom part of the image (x), choosing the pixels in z independently is not a terrible approximation.
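
A minimal sketch of such a fully factored q(z; ϕ) over binary unobserved variables (the number of missing variables and the values of ϕ are illustrative assumptions): each missing z_i is an independent Bernoulli(ϕ_i), so sampling and evaluating log q are trivial.

```python
import numpy as np

rng = np.random.default_rng(0)
n_missing = 8
phi = rng.uniform(0.05, 0.95, size=n_missing)    # variational parameters phi_i

def log_q(z):
    # log q(z; phi) = sum_i [ z_i log phi_i + (1 - z_i) log(1 - phi_i) ]
    return np.sum(z * np.log(phi) + (1 - z) * np.log1p(-phi))

def sample_q():
    # draw each missing z_i independently from Bernoulli(phi_i)
    return (rng.random(n_missing) < phi).astype(float)

z = sample_q()
print(z, log_q(z))
```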

The Evidence Lower Bound
log p(x; θ) ≥ Σ_z q(z; ϕ) log p(z, x; θ) + H(q(z; ϕ)) = L(x; θ, ϕ)   (the ELBO)
log p(x; θ) = L(x; θ, ϕ) + D_KL(q(z; ϕ) || p(z | x; θ))
The better q(z; ϕ) can approximate the posterior p(z | x; θ), the smaller D_KL(q(z; ϕ) || p(z | x; θ)) we can achieve, and the closer the ELBO will be to log p(x; θ).

The Variational EM Algorithm
q(z; ϕ) is an arbitrary probability distribution over z:
D_KL(q(z; ϕ) || p(z | x; θ)) = −Σ_z q(z; ϕ) log p(z, x; θ) + log p(x; θ) − H(q(z; ϕ))
Re-arranging, we can rewrite this as:
F[ϕ, θ] := H(q(z; ϕ)) + Σ_z q(z; ϕ) log p(z, x; θ) = log p(x; θ) − D_KL(q(z; ϕ) || p(z | x; θ))
Variational EM is coordinate ascent on F[ϕ, θ]:
1. Initialize θ^(0)
2. ϕ^(1) = arg max_ϕ F[ϕ, θ^(0)], which in general is not equal to p(z | x; θ^(0))
3. θ^(1) = arg max_θ F[ϕ^(1), θ] (can do Monte Carlo!)
4. ϕ^(2) = arg max_ϕ F[ϕ, θ^(1)], again not equal to p(z | x; θ^(1)) in general
5. ...
Unlike EM, variational EM is not guaranteed to reach a local maximum, and the marginal likelihood might decrease!
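
For intuition only, here is a runnable caricature of this coordinate-ascent loop. The model, the single shared ϕ for all datapoints, and the grid-search maximizations are all simplifying assumptions, not the lecture's recipe: z ~ Bern(0.5), x | z ~ Bern(θ) if z = 1 and Bern(1 − θ) otherwise, with q(z = 1) = ϕ forced to be the same for every datapoint, so q generally cannot match the exact per-example posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100).astype(float)                # toy binary observations

def F(phi, theta):
    # F[phi, theta] = sum_m E_q[log p(z, x_m; theta)] + H(q), with one shared phi
    h = -(phi * np.log(phi) + (1 - phi) * np.log(1 - phi))    # H(q) per datapoint
    ll_z1 = np.log(0.5) + x * np.log(theta) + (1 - x) * np.log(1 - theta)
    ll_z0 = np.log(0.5) + x * np.log(1 - theta) + (1 - x) * np.log(theta)
    return np.sum(phi * ll_z1 + (1 - phi) * ll_z0) + len(x) * h

grid = np.linspace(0.01, 0.99, 99)
theta = 0.7
for _ in range(10):
    phi = grid[np.argmax([F(p, theta) for p in grid])]        # maximize F over phi
    theta = grid[np.argmax([F(phi, t) for t in grid])]        # maximize F over theta
    print(round(phi, 2), round(theta, 2), round(F(phi, theta), 2))
```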

Potential issues with Variational EM algorithm (Figure adapted from tutorial by Sean Borman)

Summary
Latent variable models, pros: easy to build flexible models; suitable for unsupervised learning.
Latent variable models, cons: hard to evaluate likelihoods; hard to train via maximum likelihood.
Fundamentally, the challenge is that posterior inference p(z | x) is hard. It typically requires variational approximations.
Alternative: give up on KL-divergence and likelihood (GANs).