Deep Generative Models


Deep Generative Models. Durk Kingma, Max Welling. Deep Probabilistic Models Workshop, Wednesday, 1st of Oct, 2014. D.P. Kingma

Deep generative models. Transformations between Bayes nets and Neural nets: Transformations between Bayes Nets and Neural Nets, ICML 14. Deep generative models of images: Auto-Encoding Variational Bayes, ICLR 14. Semi-Supervised Learning with Deep Generative Models, NIPS 14; with Shakir & Danilo (DeepMind).

Part 1: Transformations between Bayes nets and Neural nets

Neural Nets and Deep Generative Models (1). Directed latent-variable models: can model complex (multimodal) distributions over observed variables. Feedforward neural nets: typically used for learning a single conditional, log p(y|x) = f(x, y), e.g. classification/regression with a deep net. But: not good for modelling complex distributions, e.g. multi-modal distributions.

Neural Nets and Deep Generative Models (2). Natural idea: neural nets as components of deep latent-variable models! Learn complex distributions over data. The inner loop of learning requires posterior inference, and exact inference is intractable. Efficient gradient-based approximate inference methods: Stochastic Variational Inference (biased but fast) [Hoffman & Blei 2012] [Kingma & Welling 2013] [DeepMind 2014]; sampling-based methods (unbiased but slower), e.g. HMC. Can we increase the efficiency of gradient-based posterior inference? p(z|x) = p(x, z)/p(x); p(z|x) ∝ p(x, z); ∇_z log p(z|x) = ∇_z log p(x, z).

Idea: Reparameterizations (Gaussian example). Centered form (CP): p_θ(z|pa) = N(z; µ, σ²I), with µ = f_θ(pa) (e.g. a neural net). Differentiable Non-Centered Form (DNCP): ε ~ N(0, I), z = f_θ(pa) + σε. Neural net perspective: a hidden unit with injected noise.
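
Concretely, the Gaussian case fits in a few lines of NumPy. This is a minimal sketch, not the presentation's code: the linear f and all shapes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def f(pa, W, b):
    # Stand-in "neural net" computing the mean from the parent variables
    # (a single linear layer for brevity).
    return W @ pa + b

pa = rng.normal(size=3)                      # parent variables
W, b = rng.normal(size=(2, 3)), np.zeros(2)
sigma = 0.5

# Centered form (CP): sample z directly from its conditional.
z_cp = rng.normal(loc=f(pa, W, b), scale=sigma)

# Differentiable non-centered form (DNCP): inject standard-normal noise,
# then transform it deterministically, so gradients w.r.t. W, b, sigma
# can flow through the sample.
eps = rng.normal(size=2)
z_dncp = f(pa, W, b) + sigma * eps

print(z_cp, z_dncp)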

Idea: Reparameterizations. Centered form (CP): p_θ(z|pa). Differentiable Non-Centered Form (DNCP): ε ~ p(ε), z = g(pa, ε, θ). Neural net perspective: a hidden unit with injected noise.

Reparameterizations Can be performed for a broad class of distributions

So: we have a choice. Which form to use for learning / inference?

Posterior correlation analysis (1). Squared posterior correlation ρ²: a second-order metric of posterior dependency between a pair of latent variables A and B.

Posterior correlation analysis (2). Squared correlations for CP. Squared correlations for DNCP.

Posterior correlation analysis (3). Inequality: all terms cancel, with a very simple result: ρ²_{pa(z)_i, z} > ρ²_{pa(z)_i, ε} when σ²_{z|pa(z)} < σ²_{z|c(z)}, i.e., DNCP leads to more efficient inference when the latent variable z is more strongly bound to its parents than to its children.

Posterior correlation analysis (4): a "beauty-and-beast" pair (illustration: Disney).

2D linear-Gaussian posterior example (figure): posterior contours in the centered coordinates (z1, z2) and the non-centered coordinates (ε1, ε2) for σz = 50, 1, 0.02. Squared correlations: CP ≈ 0.00, 0.41, 1.00; DNCP ≈ 0.58, 0.41, 0.01.
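
The same qualitative effect can be checked in a small toy model of my own construction (not necessarily the exact model behind the figure): z1 ~ N(0, 1), z2 | z1 ~ N(z1, σz²), x | z2 ~ N(z2, σx²), with z2 = z1 + σz·ε2 in the non-centered form. The sketch below computes both squared posterior correlations analytically; the exact numbers depend on σx and differ slightly from the figure, but the trend is the same.

import numpy as np

def squared_posterior_correlations(sigma_z, sigma_x=1.0):
    """Toy chain z1 -> z2 -> x. Returns (rho^2 between z1 and z2 in CP,
    rho^2 between z1 and eps2 in DNCP), both conditioned on observed x."""
    # Posterior precision of (z1, z2) given x (centered parameterization).
    prec_cp = np.array([[1 + 1 / sigma_z**2, -1 / sigma_z**2],
                        [-1 / sigma_z**2, 1 / sigma_z**2 + 1 / sigma_x**2]])
    # Posterior precision of (z1, eps2) given x, where z2 = z1 + sigma_z * eps2.
    prec_dncp = np.array([[1 + 1 / sigma_x**2, sigma_z / sigma_x**2],
                          [sigma_z / sigma_x**2, 1 + sigma_z**2 / sigma_x**2]])

    def rho2(prec):
        cov = np.linalg.inv(prec)
        return cov[0, 1]**2 / (cov[0, 0] * cov[1, 1])

    return rho2(prec_cp), rho2(prec_dncp)

for s in (50.0, 1.0, 0.02):
    cp, dncp = squared_posterior_correlations(s)
    print(f"sigma_z = {s:5}: rho^2 CP = {cp:.2f}, rho^2 DNCP = {dncp:.2f}")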

Robust HMC Sampler. The proposal distribution is a mixture of two proposal distributions, where we use ρ = 0.5 in experiments.

Experiment: HMC sampling with a Dynamic Bayesian Network (DBN)

Autocorrelation results on DBN

Part 2: The Variational Auto-Encoder

Deep latent-variable model: z ~ p(z), x ~ p(x|z).
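
A minimal sketch of ancestral sampling from such a model, with a randomly initialized two-layer decoder standing in for a trained p(x|z); the dimensions and network form are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
d_z, d_h, d_x = 2, 64, 784                     # e.g. MNIST-sized x

# Random decoder weights standing in for a trained generative network.
W1, b1 = rng.normal(0, 0.1, (d_h, d_z)), np.zeros(d_h)
W2, b2 = rng.normal(0, 0.1, (d_x, d_h)), np.zeros(d_x)

def decode(z):
    # p(x|z): Bernoulli means produced by a small MLP.
    h = np.tanh(W1 @ z + b1)
    return 1 / (1 + np.exp(-(W2 @ h + b2)))

z = rng.normal(size=d_z)                       # z ~ p(z) = N(0, I)
x = rng.binomial(1, decode(z))                 # x ~ p(x|z)
print(x.shape, x.sum())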

Monte Carlo EM

Monte Carlo EM - End result

MAP inference with L-BFGS

The Variational Auto-Encoder (ICLR 14). Inference model Q: x ~ p̃(x), z ~ q(z|x). Generative model P: z ~ p(z), x ~ p(x|z). Objective: L = −D_KL(p̃(x)q(z|x) ‖ p(x, z)) ≤ log p(x).

Why is this an auto-encoder? Reparameterized: x → Q(z|x) (with injected noise ε) → z → P(x|z) → x. L = −D_KL(p̃(x)q(z|x) ‖ p(x, z)) = E_{p̃(x)} E_{p(ε)}[log p(x|z) + log p(z) − log q(z|x)], where log p(x|z) is the reconstruction error and log p(z) − log q(z|x) are the regularization terms (dictated by the bound).
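
A minimal single-sample estimator of this bound, assuming a Gaussian q(z|x), a Bernoulli p(x|z), and stand-in encoder/decoder functions (not the authors' implementation); the log p(z) − log q(z|x) terms are collapsed into the closed-form Gaussian KL.

import numpy as np

rng = np.random.default_rng(2)
d_x, d_z = 784, 2

def encode(x):
    # Hypothetical encoder for q(z|x): returns mean and log-variance.
    return np.zeros(d_z), np.zeros(d_z)        # stand-in for a neural net

def decode(z):
    # Hypothetical decoder for p(x|z): returns Bernoulli means.
    return np.full(d_x, 0.5)                   # stand-in for a neural net

def elbo(x):
    mu, logvar = encode(x)
    eps = rng.normal(size=d_z)                 # reparameterization: z = mu + sigma * eps
    z = mu + np.exp(0.5 * logvar) * eps
    p = decode(z)
    log_px_z = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))   # reconstruction error
    # KL(q(z|x) || N(0, I)) in closed form: the regularization term.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1 - logvar)
    return log_px_z - kl

x = rng.binomial(1, 0.5, size=d_x).astype(float)
print(elbo(x))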

Variational Auto-Encoder trained on MNIST (2D Latent space)

3D latent space

Labeled faces in the wild

Labeled faces in the wild

Semi-supervised learning with deep generative models ("Model 2.5")

Approach 1: DLGM / VAE as feature extractor for a semi-supervised classifier. Inference model Q: x ~ p̃(x), z ~ q(z|x). Generative model P: z ~ p(z), x ~ p(x|z). L = −D_KL(p̃(x)q(z|x) ‖ p(x, z)).

Approach 2: DLGM / VAE as regularizer of a neural net classifier. Inference model Q: x ~ p̃(x), y, z ~ q(y|x)q(z|x, y). Generative model P: y ~ p(y), z ~ p(z), x ~ p(x|y, z). L = E_{p̃_l(x, y)}[log q(y|x)] + L_reg, with L_reg = −D_KL(p̃_l(x, y)q(z|x, y) ‖ p(x, y, z)) − D_KL(p̃_u(x)q(y|x)q(z|x, y) ‖ p(x, y, z)), where p̃_l is the labeled and p̃_u the unlabeled data distribution.

Approach 3 (stacked model). Q: x → z1 → (y, z2). P: (y, z2) → z1 → x.

Results. Table 1: Benchmark results of semi-supervised classification on MNIST with few labels (test error, %).

N labeled   NN      CNN     TSVM    EmbNN   CAE     MTC     M1               M2               M3
100         25.81   22.98   16.81   16.86   13.47   12.03*  11.82 (± 0.25)   11.97 (± 1.71)   3.54 (± 0.03)
600         11.44   7.68    6.16    5.97    6.3     5.13*   5.72 (± 0.049)   4.94 (± 0.13)    2.85 (± 0.1)
1000        10.7    6.45    5.38    5.73    4.77    3.64*   4.24 (± 0.07)    3.60 (± 0.56)    2.76 (± 0.30)
3000        6.04    3.35    3.45    3.59    3.22    2.57*   3.49 (± 0.04)    3.92 (± 0.63)    2.63 (± 0.24)

Analogical reasoning (will improve)

Next steps. SSL on larger images: SVHN, CIFAR-10 (this NIPS paper), ImageNet (future papers). SSL on videos: YouTube. Applications of analogical reasoning.

Thanks!

RMSProp. Initialization: m ← 0, v ← 0. Update(w, g): m ← β1·g + (1 − β1)·m; v ← β2·g² + (1 − β2)·v; w ← w − α·m/√v.
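
Written out as plain Python, this is a short sketch; the toy quadratic objective, step size, and β values are illustrative assumptions, and β1, β2 follow the slide's convention of weighting the newest sample.

import numpy as np

def rmsprop_step(w, g, m, v, alpha=0.01, beta1=0.1, beta2=0.001, eps=1e-8):
    m = beta1 * g + (1 - beta1) * m            # moving average of gradients
    v = beta2 * g**2 + (1 - beta2) * v         # moving average of squared gradients
    w = w - alpha * m / (np.sqrt(v) + eps)
    return w, m, v

# Toy objective f(w) = 0.5 * ||w||^2, so the gradient is simply w.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for _ in range(1000):
    g = w                                      # gradient of the toy objective
    w, m, v = rmsprop_step(w, g, m, v)
print(w)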

Estimation based on exponential moving averages. Assume g is a random variable. Goal: from exponential moving averages of i.i.d. draws g_i, estimate the moments E[g] and E[g²].
m_t = β1·g_t + (1 − β1)·m_{t−1} = β1 · Σ_{i=1..t} (1 − β1)^{t−i} · g_i   (as a function of past samples)
E[m_t] = E[ β1 · Σ_{i=1..t} (1 − β1)^{t−i} · g_i ]   (taking expectations)
       = E[g] · β1 · Σ_{i=1..t} (1 − β1)^{t−i}   (because i.i.d.)
       = E[g] · (1 − (1 − β1)^t)   (further simplification)
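
A quick numerical check of this identity (my own sketch, not part of the slides): simulate many i.i.d. gradient streams, run the exponential moving average, and compare the empirical mean of m_t against E[g]·(1 − (1 − β1)^t).

import numpy as np

rng = np.random.default_rng(3)
beta1, t_max, n_runs = 0.1, 50, 20000
true_mean = 2.0

g = rng.normal(true_mean, 1.0, size=(n_runs, t_max))   # i.i.d. gradient draws
m = np.zeros(n_runs)
for t in range(t_max):
    m = beta1 * g[:, t] + (1 - beta1) * m

print("empirical E[m_t]:", m.mean())
print("predicted       :", true_mean * (1 - (1 - beta1)**t_max))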

Fixed RMSProp (AdaM). Initialization: m ← 0, v ← 0. Update(w, g): m ← β1·g + (1 − β1)·m; v ← β2·g² + (1 − β2)·v; w ← w − α_t·m/√v, where α_t = α·√(1 − (1 − β2)^t) / (1 − (1 − β1)^t).
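
The fix changes only the step size. Below is a sketch of the bias-corrected update under the same conventions and toy objective as the RMSProp sketch above.

import numpy as np

def adam_like_step(w, g, m, v, t, alpha=0.01, beta1=0.1, beta2=0.001, eps=1e-8):
    m = beta1 * g + (1 - beta1) * m
    v = beta2 * g**2 + (1 - beta2) * v
    # Bias-correcting step size alpha_t from the slide.
    alpha_t = alpha * np.sqrt(1 - (1 - beta2)**t) / (1 - (1 - beta1)**t)
    w = w - alpha_t * m / (np.sqrt(v) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):                       # t starts at 1 so the correction is defined
    g = w                                      # gradient of the toy quadratic
    w, m, v = adam_like_step(w, g, m, v, t)
print(w)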

VAE, 10 epochs (figure): learning curves for β1 ∈ {1, 0.1} and β2 ∈ {0.01, 0.001, 0.0001}; smaller β2 is more like AdaGrad (infinite memory).

VAE, 100 epochs (figure): the same comparison over β1 ∈ {1, 0.1} and β2 ∈ {0.01, 0.001, 0.0001}; smaller β2 is more like AdaGrad (infinite memory).

Now possible: interpolation between AdaGrad and RMSProp. Initialization: m ← 0, v ← 0. Update(w, g): m ← β1·g + (1 − β1)·m; v ← β2·g² + (1 − β2)·v; w ← w − α_t·m/√v, where α_t = α·√(1 − (1 − β2)^t) / (1 − (1 − β1)^t). This algorithm behaves like AdaGrad when β1 = 1 and β2 = ε (infinitesimal).