Variational Inference via Stochastic Backpropagation

Kai Fan, February 27, 2016


Outline: Preliminaries, Stochastic Backpropagation, Variational Auto-Encoding, Related Work, Summary.

Preliminaries

Bayesian inference on a latent variable model: $y$ is the observed data, $x$ is the latent variable, and $p_\theta(x, y)$ is the probabilistic model. Purpose: we are (very) interested in inferring the posterior distribution $p_\theta(x \mid y)$, which enables learning the parameters of latent variable models, including deep learning models. Difficulty: $p_\theta(x \mid y) = p_\theta(x, y) / p_\theta(y)$ is most often intractable, because the marginal $p_\theta(y)$ requires integrating over $x$.

Non-variational approximate inference methods: a point estimate of $p_\theta(x \mid y)$ (MAP) is fast but prone to overfitting; Markov chain Monte Carlo (MCMC) is asymptotically unbiased but expensive, and convergence is slow to assess.

Variational inference: introduce a variational distribution $q_\phi(x)$ (or $q_\phi(x \mid y)$) to approximate the true posterior, where $\phi$ are the variational parameters. Objective: minimize the KL divergence $D_{\mathrm{KL}}(q_\phi(x \mid y) \,\|\, p_\theta(x \mid y))$ with respect to $\phi$; the divergence is zero exactly when $q_\phi(x \mid y) = p_\theta(x \mid y)$.

Lower bound: from the marginal log-likelihood to a lower bound,
$$\log p_\theta(y) = \mathbb{E}_{q}\!\left[\log \frac{p_\theta(y, x)}{q_\phi(x \mid y)}\right] + D_{\mathrm{KL}}\big(q_\phi(x \mid y) \,\|\, p_\theta(x \mid y)\big) \;\ge\; \mathbb{E}_{q}\big[\log p_\theta(y, x) - \log q_\phi(x \mid y)\big] \;\triangleq\; \mathcal{L}.$$
Objective: maximize the lower bound $\mathcal{L}$. A non-gradient-based optimization technique is mean-field VB with fixed-point equations: efficient, but intractable or not applicable in many cases.
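For completeness (not on the original slide), the decomposition follows from $p_\theta(y) = p_\theta(y, x) / p_\theta(x \mid y)$ and the non-negativity of the KL divergence:
$$\log p_\theta(y) = \mathbb{E}_{q_\phi(x \mid y)}\!\left[\log \frac{p_\theta(y, x)}{p_\theta(x \mid y)}\right] = \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p_\theta(y, x)}{q_\phi(x \mid y)}\right]}_{\mathcal{L}} + \underbrace{\mathbb{E}_{q}\!\left[\log \frac{q_\phi(x \mid y)}{p_\theta(x \mid y)}\right]}_{D_{\mathrm{KL}}(q_\phi \| p_\theta) \;\ge\; 0} \;\ge\; \mathcal{L}.$$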

Stochastic Backpropagation

Reparameterized gradient estimator: consider a lower bound of the general form $\mathcal{L} = \mathbb{E}_{q_\phi(x \mid y)}[f(y, x)]$. Monte Carlo gradient approximation at iteration $t$: sample $\epsilon_t$ from a base distribution $p(\epsilon)$; apply the transformation $x_t = g_\phi(\epsilon_t)$ such that $x_t \sim q_\phi(x \mid y)$; compute $\nabla_\phi f(y, x_t)$ to approximate $\nabla_\phi \mathcal{L}$. A reparameterization has to exist, e.g. for the Gaussian, Laplace, and Student's t distributions.
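A minimal sketch of this estimator (not the author's code) for a diagonal Gaussian $q_\phi = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a toy choice $f(y, x) = -\tfrac12\|y - x\|^2$, whose gradients can be written by hand; the function names and the choice of $f$ are illustrative assumptions.

```python
import numpy as np

def reparam_gradient(mu, log_sigma, y, n_samples=16, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the gradient of E_q[f(y, x)] w.r.t. (mu, log_sigma),
    with q = N(mu, diag(sigma^2)) and toy f(y, x) = -0.5 * ||y - x||^2."""
    sigma = np.exp(log_sigma)
    grad_mu, grad_log_sigma = np.zeros_like(mu), np.zeros_like(log_sigma)
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)      # sample from the base distribution
        x = mu + sigma * eps                     # reparameterization: x ~ q_phi
        df_dx = -(x - y)                         # gradient of f w.r.t. x
        grad_mu += df_dx                         # chain rule: dx/dmu = I
        grad_log_sigma += df_dx * eps * sigma    # chain rule: dx/dlog_sigma = sigma * eps
    return grad_mu / n_samples, grad_log_sigma / n_samples

mu, log_sigma = np.zeros(3), np.zeros(3)
y = np.array([1.0, -2.0, 0.5])
print(reparam_gradient(mu, log_sigma, y))   # grad_mu should be roughly y - mu
```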

Reparameterization Trick

Gaussian backpropagation: for $x \sim \mathcal{N}(\mu, C)$ we have the following identities,
$$\nabla_{\mu_i} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}\big[\nabla_{x_i} f(x)\big],$$
$$\nabla_{C_{ij}} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}\big[\nabla^2_{x_i, x_j} f(x)\big],$$
$$\nabla^2_{C_{i,j},\, C_{k,l}} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \tfrac{1}{4}\, \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}\big[\nabla^4_{x_i, x_j, x_k, x_l} f(x)\big],$$
$$\nabla^2_{\mu_i,\, C_{k,l}} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}\big[\nabla^3_{x_i, x_k, x_l} f(x)\big].$$
These give unbiased estimators of $\nabla^k \mathbb{E}[f]$ for $k = 1, 2$, but the higher-order derivatives need to be calculated with respect to $f$.
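As a quick sanity check (not from the slides), the sketch below verifies the first two identities by Monte Carlo for a quadratic $f(x) = x^\top A x + b^\top x$ with symmetric $A$, where the exact gradients are $\nabla_\mu \mathbb{E}[f] = 2A\mu + b$ and $\nabla_C \mathbb{E}[f] = A$; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = rng.standard_normal((d, d)); A = (A + A.T) / 2            # symmetric quadratic form
b = rng.standard_normal(d)
mu = rng.standard_normal(d)
M = rng.standard_normal((d, d)); C = M @ M.T + d * np.eye(d)  # a valid covariance
R = np.linalg.cholesky(C)

n = 200_000
x = mu + rng.standard_normal((n, d)) @ R.T                    # samples x ~ N(mu, C)

grad_f = 2 * x @ A + b        # per-sample gradient of f (A symmetric)
hess_f = 2 * A                # Hessian of f is constant for a quadratic

print("MC  E[grad f]   :", grad_f.mean(axis=0))
print("exact 2*A*mu + b:", 2 * A @ mu + b)
# Second identity: (1/2) E[Hessian of f] equals A exactly here, since the Hessian is constant.
print("(1/2) Hessian   :", 0.5 * hess_f, "\nexact A         :", A)
```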

Reparameterized Gaussian backpropagation: for $x \sim \mathcal{N}(\mu, RR^\top)$ we can write $x = \mu + R\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which yields new identities,
$$\nabla_{R}\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{d_z})}\big[\epsilon\, g^\top\big],$$
$$\nabla^2_{\mu, R}\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{d_z})}\big[\epsilon \otimes H\big],$$
$$\nabla^2_{R}\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{d_z})}\big[(\epsilon\epsilon^\top) \otimes H\big],$$
where $\otimes$ is the Kronecker product, and the gradient $g$ and Hessian $H$ of $f$ are evaluated at $\mu + R\epsilon$. Unbiased estimators are still easy to obtain, and Hessian-vector multiplication suffices thanks to the identity $(A \otimes B)\,\mathrm{vec}(V) = \mathrm{vec}(B V A^\top)$.

Bayesian logistic regression: prior $\mathcal{N}(0, \Lambda)$ with $\Lambda$ diagonal; variational distribution $q(\beta \mid \mu, D)$ with $D$ diagonal for simplicity. [Figure: lower bound versus training time and the fitted regression coefficients on the A9a dataset, comparing DSVI, L-BFGS SGVI, and HFSGVI.]

Variational Auto-Encoding

Model formulation: Gaussian latent variable with prior $p(x) = \mathcal{N}(0, I)$; generative model $p_\theta(y \mid x)$, characterized by a non-linear transformation, e.g. an MLP; recognition model $q_\phi(x \mid y) = \mathcal{N}(\mu, D)$, where $\phi = [\mu, D] = \mathrm{MLP}(y; W, b)$ and we denote $\psi = (W, b)$. Objective function: $\mathcal{L} = \log p(y \mid x) + \log p(x) - \log q(x \mid y)$, where $\log p(y \mid x)$ is the reconstruction error and $\log p(x) - \log q(x \mid y)$ acts as regularization. Unlike variational EM, $(\theta, \psi)$ are optimized simultaneously by a gradient-based algorithm.
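A minimal sketch of this formulation using PyTorch (the original work used its own implementation; the layer sizes, Bernoulli output likelihood, and names here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Gaussian prior p(x) = N(0, I), MLP decoder p_theta(y|x), MLP encoder q_phi(x|y)."""
    def __init__(self, y_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(y_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mu of q(x|y)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log of the diagonal D
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, y_dim))

    def forward(self, y):
        h = self.enc(y)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        x = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        logits = self.dec(x)
        # Reconstruction term: log p(y|x) for Bernoulli outputs.
        rec = -nn.functional.binary_cross_entropy_with_logits(logits, y, reduction='sum')
        # Regularization term: log p(x) - log q(x|y) = -KL(q || p), in closed form for Gaussians.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return -(rec - kl) / y.shape[0]             # negative lower bound, to be minimized

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # theta and psi updated jointly
y_batch = torch.rand(32, 784)                        # placeholder data batch
opt.zero_grad()
loss = model(y_batch)
loss.backward(); opt.step()
```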

Unrolled VAE. [Diagram: input → hidden encoder layer $h_e$ ($W_1, b_1$) → Gaussian latent layer $z \sim \mathcal{N}(\mu, \Sigma)$ parameterized by $(W_2, b_2)$ and $(W_3, b_3)$ → hidden decoder layer $h_d$ ($W_4, b_4$) → output ($W_5, b_5$), with an additional pair $(W_6, b_6)$ if the output is continuous.]

Back to backpropagation: fast gradient computation,
$$\nabla_{\psi_l}\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[\frac{\partial(\mu + R\epsilon)}{\partial \psi_l}^{\!\top} g\right],$$
$$\nabla^2_{\psi_{l_1}, \psi_{l_2}}\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[\frac{\partial(\mu + R\epsilon)}{\partial \psi_{l_1}}^{\!\top} H\, \frac{\partial(\mu + R\epsilon)}{\partial \psi_{l_2}} + g^\top \frac{\partial^2(\mu + R\epsilon)}{\partial \psi_{l_1}\, \partial \psi_{l_2}}\right].$$
The algorithmic complexity is $O(d_z^2)$ for both the first and second derivatives.

Back to backpropagation, continued. For any $F$, the Hessian-vector product can be obtained from gradients alone:
$$H_\psi v = \lim_{\gamma \to 0} \frac{\nabla F(\psi + \gamma v) - \nabla F(\psi)}{\gamma} = \left.\frac{\partial}{\partial \gamma} \nabla F(\psi + \gamma v)\right|_{\gamma = 0} = \left.\mathbb{E}_{\mathcal{N}(0, I)}\!\left[\frac{\partial}{\partial \gamma}\left(\frac{\partial(\mu(\psi') + R(\psi')\epsilon)}{\partial \psi'}^{\!\top} g\right)\right|_{\psi' = \psi + \gamma v}\right]_{\gamma = 0}.$$
Preconditioned conjugate gradient (PCG) only requires $H_\psi v$ to solve the linear system $Hp = g$. After $K$ iterations of PCG, the relative tolerance satisfies $e < \exp(-2K/\sqrt{c})$, where $c$ is the condition number of the matrix; thus $c$ can be nearly as large as $O(K^2)$. Complexity for each iteration: $O(K d\, d_z^2)$ vs. $O(d\, d_z^2)$.
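A minimal sketch (not the author's implementation) of a finite-difference Hessian-vector product plugged into a conjugate-gradient solve that never forms $H$ explicitly; the quadratic test objective and the step size `gamma` are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Toy objective F(psi) = 0.5 * psi^T A psi - b^T psi, so grad F(psi) = A psi - b and H = A.
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d)); A = A @ A.T + d * np.eye(d)   # symmetric positive definite
b = rng.standard_normal(d)
grad_F = lambda psi: A @ psi - b

def hvp(psi, v, gamma=1e-6):
    """Finite-difference Hessian-vector product (grad F(psi + gamma v) - grad F(psi)) / gamma."""
    return (grad_F(psi + gamma * v) - grad_F(psi)) / gamma

psi = np.zeros(d)
g = grad_F(psi)
H_op = LinearOperator((d, d), matvec=lambda v: hvp(psi, v))
p, info = cg(H_op, g, maxiter=50)        # solve H p = g using only Hessian-vector products
print("CG converged:", info == 0, " residual:", np.linalg.norm(A @ p - g))
```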

Theoretical perspective: if $f$ is an $L$-Lipschitz differentiable function and $\epsilon \sim \mathcal{N}(0, I_{d_z})$, then
$$\mathbb{E}\big[(f(\epsilon) - \mathbb{E}[f(\epsilon)])^2\big] \le \frac{L^2 \pi^2}{4}, \qquad P\!\left(\left|\frac{1}{M}\sum_{m=1}^{M} f(\epsilon_m) - \mathbb{E}[f(\epsilon)]\right| \ge t\right) \le 2 e^{-\frac{2 M t^2}{\pi^2 L^2}}.$$
In most applications, $M = 1$ is used for the Monte Carlo integration.

VAE experiments: manifold of the generative model, obtained by setting $d_z = 2$.

VAE experiments. [Figure: lower bound versus training time on the Frey Face and Olivetti Face datasets, with train/test curves for Ada, L-BFGS SGVI, and HFSGVI.]

Related Work

Semi-supervised VAE (NIPS 2014). Generative model: $p(\hat y) = \mathrm{Cat}(\hat y \mid \pi)$ for the class label $\hat y$, $p(x) = \mathcal{N}(0, I)$ for the latent, and an MLP likelihood $p(y \mid \hat y, x)$. Recognition model: $q(x \mid y, \hat y) = \mathcal{N}(\mu_\phi(y, \hat y), D_\phi(y))$ and $q(\hat y \mid y) = \mathrm{Cat}(\hat y \mid \pi_\phi(y))$, where the parameter functions are also MLPs.

Neural Variational Inference (ICML 2014): sigmoid belief networks. Generative model: $h_L \to h_{L-1} \to \dots \to h_1 \to y$; recognition model: reverse the arrow directions. A learning signal with a control variate is used for variance reduction, borrowing the idea from reinforcement learning:
$$\nabla_\phi \mathcal{L} = \mathbb{E}_q\big[(\log p_\theta(x, z) - \log q_\phi(z \mid x))\, \nabla_\phi \log q_\phi(z \mid x)\big] = \mathbb{E}_q\big[(\log p_\theta(x, z) - \log q_\phi(z \mid x) - C_\xi(x))\, \nabla_\phi \log q_\phi(z \mid x)\big],$$
and $(\theta, \phi, \xi)$ are learned jointly.
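A minimal sketch (illustrative, not the paper's model) of this score-function estimator with a baseline as control variate, for a factorized Bernoulli $q_\phi(z) = \mathrm{Bern}(\sigma(\phi))$ and a toy learning signal standing in for $\log p_\theta(x, z) - \log q_\phi(z \mid x)$; the baseline here is a simple running average rather than the learned $C_\xi(x)$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def score_function_grad(phi, learning_signal, baseline=0.0, n_samples=64,
                        rng=np.random.default_rng(0)):
    """MC estimate of E_q[(l(z) - baseline) * grad_phi log q_phi(z)],
    with q_phi(z) = prod_i Bern(z_i | sigmoid(phi_i))."""
    p = sigmoid(phi)
    grads = np.zeros_like(phi)
    for _ in range(n_samples):
        z = (rng.random(phi.shape) < p).astype(float)   # sample z ~ q_phi
        score = z - p                                    # grad_phi log q_phi(z) for logits phi
        grads += (learning_signal(z) - baseline) * score
    return grads / n_samples

# Toy learning signal, maximized when z matches the target pattern.
target = np.array([1.0, 0.0, 1.0, 1.0])
l = lambda z: -np.sum((z - target) ** 2)

phi, baseline = np.zeros(4), 0.0
for step in range(200):
    phi += 0.1 * score_function_grad(phi, l, baseline)   # gradient ascent on the bound
    baseline = 0.9 * baseline + 0.1 * l((sigmoid(phi) > 0.5).astype(float))
print("learned probabilities:", np.round(sigmoid(phi), 2))
```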

Dynamic modeling: DRAW (a dynamic VAE with LSTM, ICML 2015, reviewed previously); DSBN (NIPS 2015), whose generative model is similar to an HMM.

Dynamic modeling, continued. [Figure: heatmaps comparing hgcHMM (the presenter's model) with panels (g) NVI, (h) Doctor, and (i) GibbsEM.]

Bayesian Dark Knowledge (NIPS 2015). Teacher model: a deep neural network $T(y \mid x, \theta)$ with prior $p(\theta \mid \lambda)$; student model: a deep neural network $S(y \mid x, \omega)$ with prior $p(\omega \mid \gamma)$. Two-step training (or "distilled SGLD", the term used in the paper), given a mini-batch $(X, Y)$ of size $B$:
SGLD update of $\theta$:
$$\theta_{t+1} = \theta_t + \frac{\eta_t}{2}\left(\nabla_\theta \log p(\theta \mid \lambda) + \frac{N}{B} \sum_{x_i \in X} \nabla_\theta \log p(y_i \mid x_i, \theta)\right) + \mathcal{N}(0, \eta_t).$$
SGD update of $\omega$ using noisy inputs $\tilde X$ only, where $\tilde y_i$ is obtained by feeding $\tilde x_i$ to the current teacher model:
$$\omega_{t+1} = \omega_t - \rho_t\left(-\frac{1}{B} \sum_{\tilde x_i \in \tilde X} \nabla_\omega \log p(\tilde y_i \mid \tilde x_i, \omega) + \gamma\, \omega_t\right).$$
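A minimal sketch of the two-step update (a toy stand-in, not the paper's deep networks): a Bayesian logistic-regression teacher updated by SGLD and a logistic-regression student distilled on the teacher's predictions at perturbed inputs; all sizes, step sizes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 1000, 5, 32
X = rng.standard_normal((N, d))
true_w = rng.standard_normal(d)
sigmoid = lambda a: 1 / (1 + np.exp(-a))
Y = (rng.random(N) < sigmoid(X @ true_w)).astype(float)

theta = np.zeros(d)          # teacher parameters (one SGLD chain)
omega = np.zeros(d)          # student parameters
lam, gamma = 1.0, 1e-3       # teacher prior precision, student weight decay
eta, rho = 1e-3, 1e-2        # SGLD and SGD step sizes

for t in range(2000):
    idx = rng.choice(N, size=B, replace=False)
    xb, yb = X[idx], Y[idx]

    # SGLD step for the teacher: prior gradient + rescaled minibatch likelihood + noise.
    grad_lik = xb.T @ (yb - sigmoid(xb @ theta))
    grad_prior = -lam * theta
    theta += 0.5 * eta * (grad_prior + (N / B) * grad_lik) + np.sqrt(eta) * rng.standard_normal(d)

    # Distillation step: teacher's predictive probabilities at noisy inputs become soft targets.
    xt = xb + 0.01 * rng.standard_normal(xb.shape)
    y_soft = sigmoid(xt @ theta)
    grad_student = xt.T @ (y_soft - sigmoid(xt @ omega)) / B
    omega += rho * (grad_student - gamma * omega)    # ascent on soft log-lik with weight decay

print("teacher sample:", np.round(theta, 2))
print("student:       ", np.round(omega, 2))
```

Because the student is updated against a fresh teacher sample at every step, over time it effectively averages the teacher's SGLD posterior samples, which is the point of the distillation.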

Summary

Minimize the difference between the generative model and the recognition model, within the variational inference framework.