Variational Inference in TensorFlow
Danijar Hafner (University College London, Google Brain)
Stanford CS 20, 2018-02-16

Outline: Variational Inference, TensorFlow Distributions, VAE in TensorFlow

Variational Inference

Learning unknown variables. Images consist of millions of pixels, but there is likely a much more compact representation of their content (objects, positions, etc.). Finding the mapping to this representation allows us to find semantically similar images and even to generate new images. We call the compact representation z and its corresponding pixels x, and this structure is the same for each of our N images.

Learning unknown variables. Assume a model in which the data x was generated from latent variables z. We want to find the latent variables that explain our data best, but we cannot directly maximize the data likelihood, since it depends on the latent variables and we don't know their values: θ* = argmax_θ P_θ(x) = argmax_θ ∫ P_θ(x|z) P(z) dz. We define a prior assumption P(z) about how z is distributed, and we are interested in the posterior belief P(z|x) that depends on the corresponding data point x.
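
The quantities on this slide written out as standard identities of the latent-variable model (for reference, not additional content from the talk):

    P_\theta(x, z) = P_\theta(x \mid z) \, P(z)                       % generative model
    P_\theta(x) = \int P_\theta(x \mid z) \, P(z) \, dz               % marginal likelihood, intractable in general
    P(z \mid x) = \frac{P_\theta(x \mid z) \, P(z)}{P_\theta(x)}      % posterior belief we want to infer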

Variational lower bound: Overview. Iteratively optimize an approximate posterior Q(z) until Q(z) ≈ P(z|x). The objective for Q(z) is the variational lower bound: ln P(x) ≥ E_Q(z)[ln P(x,z) − ln Q(z)] = E_Q(z)[ln P(x|z) + ln P(z) − ln Q(z)] = E_Q(z)[ln P(x|z)] − D_KL[Q(z) || P(z)]. This is a lower bound since the KL is non-negative. Choose a Q(z) that allows differentiable sampling, often a Gaussian distribution. The KL becomes a regularization term; it can be computed analytically or sampled.
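
The reasoning behind "this is a lower bound since the KL is non-negative", written out as the standard one-line derivation:

    \ln P(x) = \mathbb{E}_{Q(z)}[\ln P(x, z) - \ln Q(z)] + D_{\mathrm{KL}}[Q(z) \,\|\, P(z \mid x)]
             \ge \mathbb{E}_{Q(z)}[\ln P(x, z) - \ln Q(z)] \qquad \text{because } D_{\mathrm{KL}} \ge 0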

Variational lower bound: Algorithm. Initialize a fixed prior on the latents P(z), for example zero mean and unit variance. Initialize network weights θ and sufficient statistics μ and σ of Q(z). Remember the objective ln P(x) ≥ E_Q(z)[ln P_θ(x|z)] − D_KL[Q(z) || P(z)]. Iterate until convergence:
1. Sample z ~ Q(z) as z = σε + μ with ε ~ N(0, 1).
2. Compute the data likelihood P_θ(x|z) using the neural network.
3. Compute the KL between Q(z) and P(z) analytically.
4. Perform gradient ascent on the objective to optimize θ, μ, σ.
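
A minimal sketch of this loop for a single data point, anticipating the tf.contrib.distributions API introduced later in the talk; the decoder architecture and shapes here are illustrative assumptions, not code from the slides.

    import tensorflow as tf
    tfd = tf.contrib.distributions

    # Sufficient statistics of Q(z) for one data point, optimized directly.
    code_size = 2
    image = tf.placeholder(tf.float32, [1, 28, 28])
    mu = tf.get_variable('mu', [code_size])
    sigma = tf.nn.softplus(tf.get_variable('sigma_raw', [code_size]))

    prior = tfd.MultivariateNormalDiag(tf.zeros([code_size]), tf.ones([code_size]))
    posterior = tfd.MultivariateNormalDiag(mu, sigma)  # Q(z)

    z = posterior.sample(1)  # step 1: reparameterized sample, shape [1, code_size]
    hidden = tf.layers.dense(z, 100, tf.nn.relu)  # step 2: decoder network
    logit = tf.reshape(tf.layers.dense(hidden, 28 * 28), [-1, 28, 28])
    likelihood = tfd.Independent(tfd.Bernoulli(logit), 2)  # P_theta(x|z)

    kl = tfd.kl_divergence(posterior, prior)  # step 3: analytic KL
    elbo = tf.reduce_mean(likelihood.log_prob(image)) - kl
    optimize = tf.train.AdamOptimizer().minimize(-elbo)  # step 4: gradient ascent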

Amortized inference: Overview. We could learn the sufficient statistics of Q(z) for every data point via gradient descent, but that requires multiple gradient steps for every data point, even during evaluation. Instead, learn the result of this process using an encoder network Q(z|x). This assumes similarity in how latents are inferred across data points. Backpropagate to optimize the encoder weights instead of the posterior sufficient statistics.

Amortized inference: Algorithm. Initialize a fixed prior on the latents P(z), for example zero mean and unit variance. Initialize encoder weights ϕ and decoder weights θ. Remember the objective ln P(x) ≥ E_Q(z|x)[ln P_θ(x|z)] − D_KL[Q_ϕ(z|x) || P(z)]. Iterate until convergence:
1. Select a data point x and compute Q_ϕ(z|x) using the encoder.
2. Sample z ~ Q_ϕ(z|x) as z = σε + μ with ε ~ N(0, 1).
3. Compute the data likelihood P_θ(x|z) using the decoder.
4. Compute the KL between Q_ϕ(z|x) and P(z) analytically.
5. Perform gradient ascent on the objective to optimize θ and ϕ.
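
Step 2 written out explicitly in TensorFlow; this is what the reparameterized sample in the later slides does internally for a diagonal Gaussian. Layer sizes and shapes are illustrative assumptions.

    import tensorflow as tf

    # Encoder producing the sufficient statistics of Q(z|x) for a batch of inputs.
    inputs = tf.placeholder(tf.float32, [None, 28 * 28])
    hidden = tf.layers.dense(inputs, 100, tf.nn.relu)
    mean = tf.layers.dense(hidden, 2)
    stddev = tf.layers.dense(hidden, 2, tf.nn.softplus)

    # Step 2: z = sigma * eps + mu with eps ~ N(0, 1).
    epsilon = tf.random_normal(tf.shape(mean))
    z = mean + stddev * epsilon  # gradients flow back into the encoder weights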

Variational Auto-Encoder. Encoder for amortized inference Q(z|x). Decoder for the generative model P(x|z). Variational lower bound objective E_Q(z|x)[ln P(x|z)] − D_KL[Q(z|x) || P(z)]. Trained end-to-end via gradients using reparameterized sampling. (Kingma et al. 2014, Rezende et al. 2014)

Bayesian Neural Network. Independent latent Q(θ) is a diagonal Gaussian. Conditional generative model P_θ(y|x). Variational lower bound objective E_Q(θ)[ln P_θ(y|x)] − D_KL[Q(θ) || P(θ)]. Divide the KL term by the data set size, since the parameters are shared across the whole data set. Trained end-to-end via gradients using reparameterized sampling. (Blundell et al. 2015)
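
The per-example objective implied by "divide the KL term by the data set size", written out (a standard way to apply this advice, not a formula from the slide):

    \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{Q(\theta)}[\ln P_\theta(y_n \mid x_n)]
      \;-\; \frac{1}{N} \, D_{\mathrm{KL}}[Q(\theta) \,\|\, P(\theta)]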

TensorFlow Distributions

TensorFlow Distributions: Overview Probabilistic programming made easy! tfd = tf.contrib.distributions

TensorFlow Distributions: Overview. Probabilistic programming made easy!
tfd = tf.contrib.distributions
mean = tf.layers.dense(hidden, 10, None)
stddev = tf.layers.dense(hidden, 10, tf.nn.softplus)
dist = tfd.MultivariateNormalDiag(mean, stddev)

TensorFlow Distributions: Overview. Probabilistic programming made easy!
tfd = tf.contrib.distributions
mean = tf.layers.dense(hidden, 10, None)
stddev = tf.layers.dense(hidden, 10, tf.nn.softplus)
dist = tfd.MultivariateNormalDiag(mean, stddev)
samples = dist.sample()
dist.log_prob(samples)

TensorFlow Distributions: Overview. Probabilistic programming made easy!
tfd = tf.contrib.distributions
mean = tf.layers.dense(hidden, 10, None)
stddev = tf.layers.dense(hidden, 10, tf.nn.softplus)
dist = tfd.MultivariateNormalDiag(mean, stddev)
samples = dist.sample()
dist.log_prob(samples)
other = tfd.MultivariateNormalDiag(
    tf.zeros_like(mean), tf.ones_like(stddev))
tfd.kl_divergence(dist, other)
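
For two diagonal Gaussians, the divergence returned by tfd.kl_divergence has a closed form; against a standard normal it reads (standard result, for reference):

    D_{\mathrm{KL}}\big[\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\big\|\, \mathcal{N}(0, I)\big]
      = \tfrac{1}{2} \sum_i \big( \mu_i^2 + \sigma_i^2 - \ln \sigma_i^2 - 1 \big)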

TensorFlow Distributions: Regression example
tfd = tf.contrib.distributions
hidden = tf.layers.dense(inputs, 100, tf.nn.relu)
mean = tf.layers.dense(hidden, 10, None)
dist = tfd.MultivariateNormalDiag(mean, tf.ones_like(mean))

TensorFlow Distributions: Regression example
tfd = tf.contrib.distributions
hidden = tf.layers.dense(inputs, 100, tf.nn.relu)
mean = tf.layers.dense(hidden, 10, None)
dist = tfd.MultivariateNormalDiag(mean, tf.ones_like(mean))
loss = -dist.log_prob(label)  # Squared error
optimize = tf.train.AdamOptimizer().minimize(loss)
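
Why the comment says "squared error": for a unit-variance Gaussian, the negative log probability is the squared error up to a factor of one half and an additive constant (standard identity, with d the output dimensionality, 10 above):

    -\ln \mathcal{N}(y; \mu, I) = \tfrac{1}{2} \lVert y - \mu \rVert^2 + \tfrac{d}{2} \ln(2\pi)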

TensorFlow Distributions: Classification example
tfd = tf.contrib.distributions
hidden = tf.layers.dense(inputs, 100, tf.nn.relu)
logit = tf.layers.dense(hidden, 10, None)
dist = tfd.Categorical(logit)

TensorFlow Distributions: Classification example
tfd = tf.contrib.distributions
hidden = tf.layers.dense(inputs, 100, tf.nn.relu)
logit = tf.layers.dense(hidden, 10, None)
dist = tfd.Categorical(logit)
loss = -dist.log_prob(label)  # Cross entropy
optimize = tf.train.AdamOptimizer().minimize(loss)
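
The "# Cross entropy" comment can be checked directly: for integer class labels, the negative categorical log probability matches TensorFlow's built-in softmax cross entropy. A sketch, assuming the label and logit tensors from the slide above:

    # -dist.log_prob(label) is equivalent to the standard cross-entropy loss:
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=label, logits=logit)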

VAE in TensorFlow

VAE in TensorFlow: Overview
tfd = tf.contrib.distributions
images = tf.placeholder(tf.float32, [None, 28, 28])
prior = make_prior()
posterior = make_encoder(images)
dist = make_decoder(posterior.sample())

VAE in TensorFlow: Overview
tfd = tf.contrib.distributions
images = tf.placeholder(tf.float32, [None, 28, 28])
prior = make_prior()
posterior = make_encoder(images)
dist = make_decoder(posterior.sample())
elbo = dist.log_prob(images) - tfd.kl_divergence(posterior, prior)
optimize = tf.train.AdamOptimizer().minimize(-elbo)
samples = make_decoder(prior.sample(10)).mean()  # For visualization

VAE in TensorFlow: Prior & encoder
def make_prior(code_size=2):
    mean, stddev = tf.zeros([code_size]), tf.ones([code_size])
    return tfd.MultivariateNormalDiag(mean, stddev)

def make_encoder(images, code_size=2):
    images = tf.layers.flatten(images)
    hidden = tf.layers.dense(images, 100, tf.nn.relu)
    mean = tf.layers.dense(hidden, code_size)
    stddev = tf.layers.dense(hidden, code_size, tf.nn.softplus)
    return tfd.MultivariateNormalDiag(mean, stddev)

VAE in TensorFlow: Networks
def make_decoder(code, data_shape=[28, 28]):
    hidden = tf.layers.dense(code, 100, tf.nn.relu)
    logit = tf.layers.dense(hidden, np.prod(data_shape))
    logit = tf.reshape(logit, [-1] + data_shape)
    return tfd.Independent(tfd.Bernoulli(logit), len(data_shape))
The tfd.Independent(dist, 2) tells TensorFlow to treat the two innermost dimensions as data dimensions rather than batch dimensions. This means dist.log_prob(images) returns one number per image rather than per pixel. As the name tfd.Independent() says, it simply sums the pixel log probabilities.
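
A minimal training-loop sketch tying the previous slides together; the random binary images standing in for MNIST and the hyperparameters are assumptions for illustration, not part of the talk.

    import numpy as np
    import tensorflow as tf
    tfd = tf.contrib.distributions

    # Assumes make_prior, make_encoder, make_decoder from the slides above.
    images = tf.placeholder(tf.float32, [None, 28, 28])
    prior = make_prior()
    posterior = make_encoder(images)
    dist = make_decoder(posterior.sample())
    elbo = tf.reduce_mean(
        dist.log_prob(images) - tfd.kl_divergence(posterior, prior))
    optimize = tf.train.AdamOptimizer(0.001).minimize(-elbo)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(100):
            # Dummy binary images; substitute real MNIST batches in practice.
            batch = np.random.randint(0, 2, size=(64, 28, 28)).astype(np.float32)
            _, bound = sess.run([optimize, elbo], {images: batch})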

VAE in TensorFlow: Results

Bonus: BNN in TensorFlow
def define_network(images, num_classes=10):
    mean = tf.get_variable('mean', [28 * 28, num_classes])
    stddev = tf.get_variable('stddev', [28 * 28, num_classes])
    prior = tfd.MultivariateNormalDiag(
        tf.zeros_like(mean), tf.ones_like(stddev))
    posterior = tfd.MultivariateNormalDiag(mean, tf.nn.softplus(stddev))

Bonus: BNN in TensorFlow
def define_network(images, num_classes=10):
    mean = tf.get_variable('mean', [28 * 28, num_classes])
    stddev = tf.get_variable('stddev', [28 * 28, num_classes])
    prior = tfd.MultivariateNormalDiag(
        tf.zeros_like(mean), tf.ones_like(stddev))
    posterior = tfd.MultivariateNormalDiag(mean, tf.nn.softplus(stddev))
    bias = tf.get_variable('bias', [num_classes])  # Or Bayesian, too
    logit = tf.nn.relu(tf.matmul(images, posterior.sample()) + bias)
    return tfd.Categorical(logit), posterior, prior

Bonus: BNN in TensorFlow
def define_network(images, num_classes=10):
    mean = tf.get_variable('mean', [28 * 28, num_classes])
    stddev = tf.get_variable('stddev', [28 * 28, num_classes])
    prior = tfd.MultivariateNormalDiag(
        tf.zeros_like(mean), tf.ones_like(stddev))
    posterior = tfd.MultivariateNormalDiag(mean, tf.nn.softplus(stddev))
    bias = tf.get_variable('bias', [num_classes])  # Or Bayesian, too
    logit = tf.nn.relu(tf.matmul(images, posterior.sample()) + bias)
    return tfd.Categorical(logit), posterior, prior

dist, posterior, prior = define_network(images)
elbo = (tf.reduce_mean(dist.log_prob(label)) -
        tf.reduce_mean(tfd.kl_divergence(posterior, prior)))
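
The earlier Bayesian neural network slide recommends dividing the KL term by the data set size, which the bonus objective above omits. A sketch of the scaled objective, assuming the number of training examples is known; this is an illustrative addition, not code from the talk.

    # Scale the KL by the data set size, as suggested on the BNN slide.
    num_examples = 60000  # e.g. size of the MNIST training set (assumption)
    elbo = (tf.reduce_mean(dist.log_prob(label)) -
            tf.reduce_mean(tfd.kl_divergence(posterior, prior)) / num_examples)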

That's all