Discrete Latent Variable Models


Stefano Ermon, Aditya Grover (AI Lab), Stanford University
Deep Generative Models, Lecture 14

Summary

Major themes in the course:
- Representation: latent variable vs. fully observed
- Objective function and optimization algorithm: many divergences and distances, optimized via likelihood-free (two-sample test) or likelihood-based methods
- Evaluation of generative models
- Combining different models and variants

Plan for today: discrete latent variable modeling

Why should we care about discreteness?

Discreteness is all around us!
- Decision making: should I attend the CS 236 lecture or not?
- Structure learning

Why should we care about discreteness?

Many data modalities are inherently discrete:
- Networks
- Text
- DNA sequences, program source code, molecules, and lots more

Stochastic Optimization

Consider the following optimization problem:
$$\max_\phi \; E_{q_\phi(z)}[f(z)]$$

Recap example: think of $q(\cdot)$ as the inference distribution for a VAE,
$$\max_{\theta, \phi} \; E_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$

Gradients w.r.t. $\theta$ can be derived via linearity of expectation:
$$\nabla_\theta E_{q(z; \phi)}[\log p(z, x; \theta) - \log q(z; \phi)] = E_{q(z; \phi)}[\nabla_\theta \log p(z, x; \theta)] \approx \frac{1}{K} \sum_k \nabla_\theta \log p(z^k, x; \theta)$$

If $z$ is continuous, $q(\cdot)$ is reparameterizable, and $f(\cdot)$ is differentiable in $\phi$, then we can use the reparameterization trick to compute gradients w.r.t. $\phi$.

What if any of the above assumptions fail?
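
As a concrete illustration of the reparameterized case, here is a minimal sketch in PyTorch (the Gaussian $q_\phi$ and the toy objective $f$ are assumptions made for illustration, not from the slides):

```python
import torch

# Variational parameters phi = (mu, log_sigma) of a Gaussian q_phi(z)
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

def f(z):
    # Toy differentiable objective (an assumption for illustration)
    return -(z - 2.0) ** 2

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1),
# so gradients w.r.t. (mu, log_sigma) flow through the sample
eps = torch.randn(1000)
z = mu + log_sigma.exp() * eps

loss = -f(z).mean()      # Monte Carlo estimate of -E_q[f(z)]
loss.backward()
print(mu.grad, log_sigma.grad)
```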

Stochastic Optimization with REINFORCE

Consider the following optimization problem:
$$\max_\phi \; E_{q_\phi(z)}[f(z)]$$

In many problem scenarios, the reparameterization trick is inapplicable:
- Scenario 1: $f(\cdot)$ is non-differentiable in $\phi$, e.g., optimizing a black-box reward function in reinforcement learning
- Scenario 2: $q_\phi(z)$ cannot be reparameterized as a differentiable function of $\phi$ with respect to a fixed base distribution, e.g., discrete distributions

REINFORCE is a general-purpose solution to both of these scenarios. We will first analyze it in the context of reinforcement learning and then extend it to latent variable models with discrete latent variables.

REINFORCE for Reinforcement Learning

Example: pulling arms of slot machines. Which arm to pull?
- Set $A$ of possible actions, e.g., pull arm 1, arm 2, ..., etc.
- Each action $z \in A$ has a reward $f(z)$
- Randomized policy for choosing actions $q_\phi(z)$, parameterized by $\phi$. For example, $\phi$ could be the parameters of a multinomial distribution.

Goal: learn the parameters $\phi$ that maximize our earnings (in expectation):
$$\max_\phi \; E_{q_\phi(z)}[f(z)]$$

Policy Gradients

Want to compute the gradient with respect to $\phi$ of the expected reward
$$E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

$$\frac{\partial}{\partial \phi_i} E_{q_\phi(z)}[f(z)] = \sum_z \frac{\partial q_\phi(z)}{\partial \phi_i} f(z) = \sum_z q_\phi(z) \frac{1}{q_\phi(z)} \frac{\partial q_\phi(z)}{\partial \phi_i} f(z) = \sum_z q_\phi(z) \frac{\partial \log q_\phi(z)}{\partial \phi_i} f(z) = E_{q_\phi(z)}\left[\frac{\partial \log q_\phi(z)}{\partial \phi_i} f(z)\right]$$

REINFORCE Gradient Estimation

Want to compute the gradient with respect to $\phi$ of
$$E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

The REINFORCE rule is
$$\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z) \nabla_\phi \log q_\phi(z)]$$

We can now construct a Monte Carlo estimate: sample $z^1, \ldots, z^K$ from $q_\phi(z)$ and estimate
$$\nabla_\phi E_{q_\phi(z)}[f(z)] \approx \frac{1}{K} \sum_k f(z^k) \nabla_\phi \log q_\phi(z^k)$$

Assumption: the distribution $q(\cdot)$ is easy to sample from and its probabilities are easy to evaluate. Works for both discrete and continuous distributions.
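
As an illustration, here is a minimal sketch of this estimator for a categorical $q_\phi$ in PyTorch (the logit parameterization, the 4-arm setup, and the reward table are assumptions made for illustration, not from the slides):

```python
import torch

# Policy q_phi(z): a categorical distribution over 4 actions, phi = logits
logits = torch.zeros(4, requires_grad=True)

def f(z):
    # Black-box reward; need not be differentiable in phi (toy values, an assumption)
    rewards = torch.tensor([1.0, 3.0, 0.5, 2.0])
    return rewards[z]

K = 1000
dist = torch.distributions.Categorical(logits=logits)
z = dist.sample((K,))                     # z^1, ..., z^K ~ q_phi
log_q = dist.log_prob(z)                  # log q_phi(z^k)

# Surrogate loss whose gradient is the REINFORCE estimate (reward treated as a constant)
surrogate = -(f(z).detach() * log_q).mean()
surrogate.backward()
print(logits.grad)                        # Monte Carlo estimate of -grad_phi E[f(z)]
```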

Variational Learning of Latent Variable Models

To learn the variational approximation, we need to compute the gradient with respect to $\phi$ of
$$L(x; \theta, \phi) = \sum_z q_\phi(z \mid x) \log p(z, x; \theta) + H(q_\phi(z \mid x)) = E_{q_\phi(z \mid x)}[\log p(z, x; \theta) - \log q_\phi(z \mid x)]$$

The function inside the brackets also depends on $\phi$ (and $\theta$, $x$). Want to compute the gradient with respect to $\phi$ of
$$E_{q_\phi(z \mid x)}[f(\phi, \theta, z, x)] = \sum_z q_\phi(z \mid x) f(\phi, \theta, z, x)$$

The REINFORCE rule is
$$\nabla_\phi E_{q_\phi(z \mid x)}[f(\phi, \theta, z, x)] = E_{q_\phi(z \mid x)}\left[f(\phi, \theta, z, x) \nabla_\phi \log q_\phi(z \mid x) + \nabla_\phi f(\phi, \theta, z, x)\right]$$

We can now construct a Monte Carlo estimate of $\nabla_\phi L(x; \theta, \phi)$.

REINFORCE Gradient Estimates Have High Variance

Want to compute the gradient with respect to $\phi$ of
$$E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

The REINFORCE rule is
$$\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z) \nabla_\phi \log q_\phi(z)]$$

Monte Carlo estimate: sample $z^1, \ldots, z^K$ from $q_\phi(z)$,
$$\nabla_\phi E_{q_\phi(z)}[f(z)] \approx \frac{1}{K} \sum_k f(z^k) \nabla_\phi \log q_\phi(z^k) := f^{MC}(z^1, \ldots, z^K)$$

Monte Carlo estimates are unbiased:
$$E_{z^1, \ldots, z^K \sim q_\phi(z)}\left[f^{MC}(z^1, \ldots, z^K)\right] = \nabla_\phi E_{q_\phi(z)}[f(z)]$$

The plain estimator is almost never used in practice because of its high variance. The variance can be reduced via carefully designed control variates.

Control Variates

The REINFORCE rule is
$$\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z) \nabla_\phi \log q_\phi(z)]$$

Given any constant $B$ (a control variate),
$$\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[(f(z) - B) \nabla_\phi \log q_\phi(z)]$$

To see why,
$$E_{q_\phi(z)}[B \nabla_\phi \log q_\phi(z)] = B \sum_z q_\phi(z) \nabla_\phi \log q_\phi(z) = B \sum_z \nabla_\phi q_\phi(z) = B \nabla_\phi \sum_z q_\phi(z) = B \nabla_\phi 1 = 0$$

Monte Carlo estimates based on $f(z)$ and $f(z) - B$ have the same expectation. These estimates can, however, have different variances.

Control Variates

Suppose we want to compute $E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$.

Define
$$\tilde{f}(z) = f(z) + a\left(h(z) - E_{q_\phi(z)}[h(z)]\right)$$

- $h(z)$ is referred to as a control variate
- Assumption: $E_{q_\phi(z)}[h(z)]$ is known

Monte Carlo estimates of $\tilde{f}(z)$ and $f(z)$ have the same expectation,
$$E_{z^1, \ldots, z^K \sim q_\phi(z)}[\tilde{f}^{MC}(z^1, \ldots, z^K)] = E_{z^1, \ldots, z^K \sim q_\phi(z)}[f^{MC}(z^1, \ldots, z^K)]$$
but different variances. One can also try to learn and update the control variate during training.

Control Variates

We can derive an alternative Monte Carlo estimate for REINFORCE gradients based on control variates. Sample $z^1, \ldots, z^K$ from $q_\phi(z)$:
$$\nabla_\phi E_{q_\phi(z)}[f(z)] = \nabla_\phi E_{q_\phi(z)}\left[f(z) + a\left(h(z) - E_{q_\phi(z)}[h(z)]\right)\right]$$
$$\approx \frac{1}{K} \sum_k f(z^k) \nabla_\phi \log q_\phi(z^k) + a\left(\frac{1}{K} \sum_{k=1}^K h(z^k) - E_{q_\phi(z)}[h(z)]\right) := f^{MC}(z^1, \ldots, z^K) + a\left(h^{MC}(z^1, \ldots, z^K) - E_{q_\phi(z)}[h(z)]\right) := \tilde{f}^{MC}(z^1, \ldots, z^K)$$

What is $\mathrm{Var}(\tilde{f}^{MC})$ vs. $\mathrm{Var}(f^{MC})$?

Control Variates

Comparing $\mathrm{Var}(\tilde{f}^{MC})$ vs. $\mathrm{Var}(f^{MC})$:
$$\mathrm{Var}(\tilde{f}^{MC}) = \mathrm{Var}\left(f^{MC} + a\left(h^{MC} - E_{q_\phi(z)}[h(z)]\right)\right) = \mathrm{Var}(f^{MC} + a h^{MC}) = \mathrm{Var}(f^{MC}) + a^2 \mathrm{Var}(h^{MC}) + 2a \,\mathrm{Cov}(f^{MC}, h^{MC})$$

To get the optimal coefficient $a^*$ that minimizes the variance, take the derivative w.r.t. $a$ and set it to 0:
$$a^* = -\frac{\mathrm{Cov}(f^{MC}, h^{MC})}{\mathrm{Var}(h^{MC})}$$

Control Variates

Comparing $\mathrm{Var}(\tilde{f}^{MC})$ vs. $\mathrm{Var}(f^{MC})$:
$$\mathrm{Var}(\tilde{f}^{MC}) = \mathrm{Var}(f^{MC}) + a^2 \mathrm{Var}(h^{MC}) + 2a \,\mathrm{Cov}(f^{MC}, h^{MC})$$

Setting the coefficient $a = a^* = -\frac{\mathrm{Cov}(f^{MC}, h^{MC})}{\mathrm{Var}(h^{MC})}$:
$$\mathrm{Var}(\tilde{f}^{MC}) = \mathrm{Var}(f^{MC}) - \frac{\mathrm{Cov}(f^{MC}, h^{MC})^2}{\mathrm{Var}(h^{MC})} = \left(1 - \frac{\mathrm{Cov}(f^{MC}, h^{MC})^2}{\mathrm{Var}(h^{MC})\,\mathrm{Var}(f^{MC})}\right)\mathrm{Var}(f^{MC}) = \left(1 - \rho(f^{MC}, h^{MC})^2\right)\mathrm{Var}(f^{MC})$$

The correlation coefficient $\rho(f^{MC}, h^{MC})$ is between -1 and 1. For maximum variance reduction, we want $f^{MC}$ and $h^{MC}$ to be highly correlated.
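
A small numerical sketch of this variance reduction (the choice of distribution, $f$, and control variate $h$ are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=1.0, size=100000)

f = z ** 2                    # quantity whose expectation we estimate
h = z                         # control variate with known mean E[h] = 1.0
E_h = 1.0

# Optimal coefficient a* = -Cov(f, h) / Var(h)
a_star = -np.cov(f, h)[0, 1] / np.var(h)

f_plain = f                          # plain per-sample estimates
f_cv = f + a_star * (h - E_h)        # control-variate-corrected estimates

print("Var(plain):", f_plain.var())
print("Var(with control variate):", f_cv.var())
print("Both estimate E[f]:", f_plain.mean(), f_cv.mean())
```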

Neural Variational Inference and Learning (NVIL)

Latent variable models with discrete latent variables are often referred to as belief networks. The variational learning objective is the same as the ELBO:
$$L(x; \theta, \phi) = \sum_z q_\phi(z \mid x) \log p(z, x; \theta) + H(q_\phi(z \mid x)) = E_{q_\phi(z \mid x)}[\log p(z, x; \theta) - \log q_\phi(z \mid x)] := E_{q_\phi(z \mid x)}[f(\phi, \theta, z, x)]$$

Here, $z$ is discrete and hence we cannot use reparameterization.

Neural Variational Inference and Learning (NVIL)

NVIL (Mnih & Gregor, 2014) learns belief networks via REINFORCE + control variates. Learning objective:
$$L(x; \theta, \phi, \psi, B) = E_{q_\phi(z \mid x)}[f(\phi, \theta, z, x) - h_\psi(x) - B]$$

- Control variate 1: constant baseline $B$
- Control variate 2: input-dependent baseline $h_\psi(x)$
- Both $B$ and $\psi$ are learned via gradient descent

Gradient estimates w.r.t. $\phi$:
$$\nabla_\phi L(x; \theta, \phi, \psi, B) = E_{q_\phi(z \mid x)}\left[(f(\phi, \theta, z, x) - h_\psi(x) - B)\nabla_\phi \log q_\phi(z \mid x) + \nabla_\phi f(\phi, \theta, z, x)\right]$$
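
A minimal single-sample sketch of this gradient estimator in PyTorch (the tiny linear encoder/decoder, the uniform Bernoulli prior, the baseline network, and the loss bookkeeping are illustrative assumptions, not the architecture from the NVIL paper):

```python
import torch
import torch.nn as nn

x_dim, z_dim = 16, 8
enc = nn.Linear(x_dim, z_dim)           # q_phi(z|x): logits of independent Bernoullis
dec = nn.Linear(z_dim, x_dim)           # p_theta(x|z): logits of independent Bernoullis
baseline = nn.Linear(x_dim, 1)          # input-dependent baseline h_psi(x)
B = torch.zeros(1, requires_grad=True)  # constant baseline

x = torch.randint(0, 2, (32, x_dim)).float()   # a toy batch of binary data

q = torch.distributions.Bernoulli(logits=enc(x))
z = q.sample()                                          # discrete, not reparameterizable
log_q = q.log_prob(z).sum(-1)
prior = torch.distributions.Bernoulli(probs=torch.full((z_dim,), 0.5))
log_p = prior.log_prob(z).sum(-1) + torch.distributions.Bernoulli(logits=dec(z)).log_prob(x).sum(-1)
f = log_p - log_q                                       # single-sample ELBO term f(phi, theta, z, x)

signal = f.detach() - baseline(x).squeeze(-1) - B       # centered learning signal
model_loss = -(signal.detach() * log_q + f).mean()      # REINFORCE term + direct gradient through f
baseline_loss = (signal ** 2).mean()                    # fit h_psi and B to track f
(model_loss + baseline_loss).backward()
```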

Towards Reparameterized, Continuous Relaxations

Consider the following optimization problem:
$$\max_\phi \; E_{q_\phi(z)}[f(z)]$$

What if $z$ is a discrete random variable, e.g., categories or permutations?
- The reparameterization trick is not directly applicable
- REINFORCE is a general-purpose solution, but needs careful design of control variates
- Alternative: relax $z$ to a continuous random variable with a reparameterizable distribution

Gumbel Distribution

Setting: we are given i.i.d. samples $y_1, y_2, \ldots, y_n$ from some underlying distribution. How can we model the distribution of $g = \max\{y_1, y_2, \ldots, y_n\}$? E.g., predicting the maximum water level of a river from historical data in order to detect flooding.

The Gumbel distribution is very useful for modeling extreme, rare events, e.g., natural disasters, finance.

The CDF of a Gumbel random variable $g$ is parameterized by a location parameter $\mu$ and a scale parameter $\beta$:
$$F(g; \mu, \beta) = \exp\left(-\exp\left(-\frac{g - \mu}{\beta}\right)\right)$$

Note: if $g$ is a Gumbel(0, 1) r.v., then $\exp(-g)$ is an Exponential(1) r.v. Because its CDF is a double exponential, the Gumbel r.v. is often referred to as a doubly exponential r.v.

Categorical Distributions

Let $z$ denote a $k$-dimensional categorical random variable with distribution $q$ parameterized by class probabilities $\pi = \{\pi_1, \pi_2, \ldots, \pi_k\}$. We will represent $z$ as a one-hot vector.

Gumbel-Max reparameterization trick for sampling from categorical random variables:
$$z = \mathrm{one\_hot}\left(\arg\max_i \,(g_i + \log \pi_i)\right)$$
where $g_1, g_2, \ldots, g_k$ are i.i.d. samples drawn from Gumbel(0, 1).

In words, we can sample from Categorical($\pi$) by taking the arg max over the $k$ Gumbel-perturbed log-class-probabilities $g_i + \log \pi_i$.

Reparameterizable, since the randomness is transferred to a fixed Gumbel(0, 1) distribution!

Problem: arg max is non-differentiable w.r.t. $\pi$.
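
A minimal sketch of Gumbel-Max sampling (the specific $\pi$ is an assumption for illustration); Gumbel(0, 1) noise can be generated as $-\log(-\log U)$ with $U \sim \mathrm{Uniform}(0, 1)$:

```python
import torch

pi = torch.tensor([0.60, 0.25, 0.15])            # class probabilities (assumption)
u = torch.rand(10000, 3)
g = -torch.log(-torch.log(u))                     # i.i.d. Gumbel(0, 1) noise

idx = torch.argmax(g + torch.log(pi), dim=-1)     # Gumbel-Max trick
z = torch.nn.functional.one_hot(idx, num_classes=3)

# Empirical frequencies should be close to pi
print(z.float().mean(dim=0))
```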

Relaxing Categorical Distributions to Gumbel-Softmax

Gumbel-Max sampler (non-differentiable w.r.t. $\pi$):
$$z = \mathrm{one\_hot}\left(\arg\max_i \,(g_i + \log \pi_i)\right)$$

Key idea: replace the arg max with a softmax to get a Gumbel-Softmax random variable $\hat{z}$. The output of the softmax is differentiable w.r.t. $\pi$.

Gumbel-Softmax sampler (differentiable w.r.t. $\pi$):
$$\hat{z} = \mathrm{softmax}_i\left(\frac{g_i + \log \pi_i}{\tau}\right)$$
where $\tau > 0$ is a tunable parameter referred to as the temperature.
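
Continuing the sketch above, the relaxed sample replaces the hard arg max with a temperature-scaled softmax; the temperature and the toy downstream objective are arbitrary choices for illustration:

```python
import torch

pi = torch.tensor([0.60, 0.25, 0.15], requires_grad=True)
tau = 0.5                                            # temperature (arbitrary choice)

u = torch.rand(3)
g = -torch.log(-torch.log(u))                        # Gumbel(0, 1) noise

z_soft = torch.softmax((g + torch.log(pi)) / tau, dim=-1)   # relaxed, differentiable sample

values = torch.tensor([1.0, 3.0, 0.5])               # toy downstream objective (an assumption)
loss = -(z_soft * values).sum()
loss.backward()
print(pi.grad)                                       # nonzero: gradients flow through the relaxation
```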

Bias-Variance Tradeoff via Temperature Control

The Gumbel-Softmax distribution is parameterized by both the class probabilities $\pi$ and the temperature $\tau > 0$:
$$\hat{z} = \mathrm{softmax}_i\left(\frac{g_i + \log \pi_i}{\tau}\right)$$

The temperature $\tau$ controls the degree of the relaxation via a bias-variance tradeoff:
- As $\tau \to 0$, samples from Gumbel-Softmax($\pi$, $\tau$) are similar to samples from Categorical($\pi$). Pro: low bias in the approximation. Con: high variance in the gradients.
- As $\tau \to \infty$, samples from Gumbel-Softmax($\pi$, $\tau$) are similar to samples from Categorical($[\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k}]$), i.e., uniform over the $k$ categories.

Geometric Interpretation

Consider a categorical distribution with class probability vector $\pi = [0.60, 0.25, 0.15]$, and define a probability simplex with the one-hot vectors as vertices.
- For a categorical distribution, all probability mass is concentrated at the vertices of the probability simplex.
- Gumbel-Softmax places probability mass on points within the simplex (in the figure, lighter color intensity implies higher probability).

Gumbel-Softmax in Action

Original optimization problem:
$$\max_\phi \; E_{q_\phi(z)}[f(z)]$$
where $q_\phi(z)$ is a categorical distribution and $\phi = \pi$.

Relaxed optimization problem:
$$\max_\phi \; E_{q_\phi(\hat{z})}[f(\hat{z})]$$
where $q_\phi(\hat{z})$ is a Gumbel-Softmax distribution and $\phi = \{\pi, \tau\}$.

Usually, the temperature $\tau$ is explicitly annealed: start high for low-variance gradients and gradually reduce it to tighten the approximation.

Note that $\hat{z}$ is not a discrete category. If the function $f(\cdot)$ explicitly requires a discrete $z$, then we estimate straight-through gradients:
- Use the hard $z \sim \mathrm{Categorical}(\pi)$ for evaluating the objective in the forward pass
- Use the soft $\hat{z} \sim \mathrm{GumbelSoftmax}(\pi, \tau)$ for evaluating gradients in the backward pass
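
A minimal sketch of the straight-through estimator (hard one-hot in the forward pass, soft gradients in the backward pass); the toy objective is an assumption. PyTorch's torch.nn.functional.gumbel_softmax(..., hard=True) implements the same idea:

```python
import torch

logits = torch.log(torch.tensor([0.60, 0.25, 0.15])).requires_grad_()   # log pi
tau = 0.5

g = -torch.log(-torch.log(torch.rand(3)))                 # Gumbel(0, 1) noise
z_soft = torch.softmax((logits + g) / tau, dim=-1)         # relaxed sample (used in backward pass)
z_hard = torch.nn.functional.one_hot(z_soft.argmax(-1), num_classes=3).float()

# Straight-through: forward value is z_hard, gradient flows through z_soft
z_st = z_hard + z_soft - z_soft.detach()

values = torch.tensor([1.0, 3.0, 0.5])                     # toy f that expects a discrete one-hot z
loss = -(z_st * values).sum()
loss.backward()
print(logits.grad)
```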

Combinatorial, Discrete Objects: Permutations

For many applications that require ranking and matching data in an unsupervised manner, $z$ is represented as a latent permutation. A $k$-dimensional permutation $z$ is a ranked list of the $k$ indices $\{1, 2, \ldots, k\}$.

Stochastic optimization problem:
$$\max_\phi \; E_{q_\phi(z)}[f(z)]$$
where $q_\phi(z)$ is a distribution over $k$-dimensional permutations.

First attempt: view each permutation as a distinct category and relax the categorical distribution to Gumbel-Softmax. Infeasible, because the number of possible $k$-dimensional permutations is $k!$; Gumbel-Softmax does not scale to a combinatorially large number of categories.

Plackett-Luce (PL) Distribution

In many fields, such as information retrieval and social choice theory, we often want to rank our preferences over $k$ items. The Plackett-Luce (PL) distribution is a common modeling assumption for such rankings.

A $k$-dimensional PL distribution is defined over the set of permutations $S_k$ and parameterized by $k$ positive scores $s$.

Sequential sampler for the PL distribution (see the sketch below):
- Sample $z_1$ without replacement with probability proportional to the scores of all $k$ items: $p(z_1 = i) \propto s_i$
- Repeat for $z_2, z_3, \ldots, z_k$

PDF of the PL distribution:
$$q_s(z) = \frac{s_{z_1}}{Z} \cdot \frac{s_{z_2}}{Z - s_{z_1}} \cdot \frac{s_{z_3}}{Z - s_{z_1} - s_{z_2}} \cdots \frac{s_{z_k}}{Z - \sum_{i=1}^{k-1} s_{z_i}}$$
where $Z = \sum_{i=1}^k s_i$ is the normalizing constant.
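
A minimal sketch of the sequential sampler (the score vector is an assumption made for illustration):

```python
import torch

def sample_pl(s):
    """Sample one permutation from a Plackett-Luce distribution with positive scores s."""
    s = s.clone()
    perm = []
    for _ in range(len(s)):
        i = torch.multinomial(s, 1).item()   # pick an item proportionally to the remaining scores
        perm.append(i)
        s[i] = 0.0                           # sample without replacement
    return perm

s = torch.tensor([4.0, 2.0, 1.0, 0.5])       # positive scores (assumption)
print(sample_pl(s))                          # e.g., [0, 1, 3, 2]
```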

Relaxing the PL Distribution to Gumbel-PL

Gumbel-PL reparameterized sampler:
- Add i.i.d. standard Gumbel noise $g_1, g_2, \ldots, g_k$ to the log-scores: $\tilde{s}_i = g_i + \log s_i$
- Set $z$ to be the permutation that sorts the Gumbel-perturbed log-scores $\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_k$

Figure: (a) sequential sampler vs. (b) reparameterized sampler (squares and circles denote deterministic and stochastic nodes).

Challenge: the sorting operation is non-differentiable in its inputs.
Solution: use a differentiable relaxation. See the paper "Stochastic Optimization for Sorting Networks via Continuous Relaxations" for more details.
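
A minimal sketch of the reparameterized Gumbel-PL sampler; sorting the perturbed log-scores yields the same distribution over permutations as the sequential sampler above (the score vector is again an assumption):

```python
import torch

s = torch.tensor([4.0, 2.0, 1.0, 0.5])               # positive scores (assumption)

g = -torch.log(-torch.log(torch.rand(len(s))))        # i.i.d. Gumbel(0, 1) noise
perturbed = torch.log(s) + g                           # Gumbel-perturbed log-scores

z = torch.argsort(perturbed, descending=True)          # permutation that sorts them
print(z)                                               # distributed as Plackett-Luce(s)
```

The hard argsort here is still non-differentiable; the relaxation referenced on the slide replaces it with a continuous surrogate.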

Summary

- Discovering discrete latent structure (e.g., categories, rankings, matchings) has many applications
- Stochastic optimization w.r.t. parameterized discrete distributions is challenging
- REINFORCE is the general-purpose technique for gradient estimation, but suffers from high variance
- Control variates can help control the variance
- Continuous relaxations of discrete distributions offer a biased but reparameterizable alternative, trading some bias for significantly lower variance