An Information Theoretic Interpretation of Variational Inference based on the MDL Principle and the Bits-Back Coding Scheme


Ghassen Jerfel
April 2017

As we will see in this talk, the Bayesian and information-theoretic views of variational inference provide complementary and mutually beneficial perspectives on the same problems, expressed in two different languages. More specifically, following the paper by Honkela and Valpola [4], we give an interpretation of variational inference in terms of the MDL principle, a theoretical framework for model evaluation based on code length, and of bits-back coding, a practical and efficient coding scheme that builds on the MDL principle. From this perspective, we can better understand the shortcomings of practical variational inference and the over-pruning phenomenon of variational autoencoders.

1 Variational Inference: The Bayesian Interpretation

In Bayesian statistical learning, parameter values have distributions too. Accordingly, for a given statistical model, we start with a prior distribution over the parameters that rationally incorporates some knowledge about the world [2] (whether this prior should be informative, non-informative, or depend on the data, as in empirical Bayes, is up for debate). Using Bayes' rule [Eq. 1], every piece of observed evidence updates the parameter distribution, resulting in a posterior distribution p(θ | x) that contains all of the information about the parameters:

    p(θ | x) = p(x | θ) p(θ) / p(x)    (1)

We can then make predictions by averaging over all possible parameter values, weighted by their posterior probabilities (marginalization) [Eq. 2]:

    p(x) = ∫_θ p(x | θ) p(θ) dθ    (2)

This use of full distributions is less prone to over-fitting and provides a natural way to compare and select models, since we can evaluate the posterior for each candidate model.

However, the posterior distribution is rarely tractable (logistic regression, models that use neural networks, and even mixtures of Gaussians all have intractable posteriors) and requires an approximation such as a maximum-a-posteriori (MAP) estimate or MCMC sampling. The most scalable, and currently standard, approximation is variational inference [5]. Variational inference proposes a variational (approximate) distribution, or a family thereof, over the parameters, and then minimizes the KL divergence (a popular measure of discrepancy between distributions) between the approximate posterior and the true posterior. Eventually, VI returns the variational parameters that best approximate the true posterior, and the variational distribution can be used in place of the true posterior for prediction and model selection [5]. The optimization objective is the KL divergence or, equivalently, the negative Evidence Lower BOund (ELBO).
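Before deriving the ELBO, here is a minimal illustration of Eqs. (1) and (2) in a conjugate toy model where the posterior and the marginalization are tractable in closed form (the model and numbers are assumptions for illustration, not an example from the paper):

    import numpy as np
    from scipy import stats

    # Beta-Bernoulli model: theta ~ Beta(a, b), x_i ~ Bernoulli(theta).
    a, b = 2.0, 2.0                      # assumed prior hyperparameters
    x = np.array([1, 0, 1, 1, 1, 0, 1])  # observed evidence

    # Eq. (1): the posterior p(theta | x) is Beta(a + #ones, b + #zeros).
    post = stats.beta(a + x.sum(), b + len(x) - x.sum())

    # Eq. (2): predictions marginalize over the posterior; here the posterior
    # predictive p(x_new = 1 | x) = E_post[theta] is available exactly.
    print(f"p(x_new = 1 | x) = {post.mean():.3f}")

For the intractable models mentioned above, no such closed form exists, which is what motivates the variational approximation derived next.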

The ELBO can be derived using Jensen's inequality:

    log p(x) = log ∫ p(x, θ) dθ
             = log ∫ q(θ) p(x, θ) / q(θ) dθ
             = log E_q[ p(x, θ) / q(θ) ]
             ≥ E_q[ log ( p(x, θ) / q(θ) ) ] = L[q; p, x]
             = E_q[ log p(x, θ) ] − E_q[ log q(θ) ]

    ELBO = L[q; p, x] = E_q[ log p(x | θ) p(θ) ] − E_q[ log q(θ) ]
                      = E_q[ log p(x | θ) p(θ) ] + H[q(θ)]
                      = E_q[ log p(x | θ) ] − KL(q(θ) ‖ p(θ))

However, variational inference is problematic from a purely Bayesian point of view:

A: The ELBO minimizes an ad hoc distance measure between an approximate posterior and the true posterior. It just so happens that this cost function is a convenient lower bound on the model evidence: log p(x) ≥ L[q; p, x].

B: It is not clear, from a Bayesian perspective, how approximate marginalization over the approximate posterior relates to other methods, since the approximation does not correspond to a proper loss function but only to a lower bound.

C: To minimize the expected loss of the log score log q − log p, we should use E_p[log p − log q] = KL(p ‖ q) rather than KL(q ‖ p) = E_q[log q − log p]. Since the KL divergence is asymmetric, there is no simple relation between the two forms.
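As a sanity check on this derivation, the following sketch estimates the ELBO by simple Monte Carlo for a small Gaussian model (the model, data, and variational parameters are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Model: theta ~ N(0, 1), x_i | theta ~ N(theta, 1); q(theta) = N(mu, sigma^2).
    x = rng.normal(1.5, 1.0, size=20)
    mu, sigma = 1.0, 0.5   # variational parameters (fixed here, not optimized)

    def log_prior(t):
        return -0.5 * (t**2 + np.log(2 * np.pi))

    def log_lik(t):
        return np.sum(-0.5 * ((x - t)**2 + np.log(2 * np.pi)))

    def log_q(t):
        return -0.5 * (((t - mu) / sigma)**2 + np.log(2 * np.pi * sigma**2))

    # Monte Carlo estimate of L[q; p, x] = E_q[log p(x, theta) - log q(theta)].
    samples = rng.normal(mu, sigma, size=5000)
    elbo = np.mean([log_lik(t) + log_prior(t) - log_q(t) for t in samples])
    print(f"ELBO estimate: {elbo:.2f}  (a lower bound on log p(x))")

Optimizing mu and sigma would tighten this bound, which is exactly what variational inference does.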

2 Minimum Description Length

Von Neumann is claimed to have said, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Building on the Minimum Message Length principle of Wallace [8], the Minimum Description Length (MDL) principle [7] was developed by Rissanen in 1978 as an information-theoretic formalization of Occam's razor, and it provides a criterion that can be seen as a tractable approximation to Kolmogorov complexity (which is uncomputable and language-dependent) [8].

The MDL principle states that the best model of some data is the one that minimizes the combined cost of describing the model and describing the misfit between the model and the data. The description here is a binary code from which the receiver can reconstruct the original data using the communicated model:

- The first part of the code describes the model: the number of bits it takes to describe the parameters.
- The second part of the code describes the model misfit: the number of bits it takes to describe the discrepancy between the correct output and the model's output on the training data.

Using this two-part code, the receiver can run the model and then use the misfit part to correct the model's output and recover the true output. The principle offers a natural trade-off between model complexity and descriptive adequacy (model fit, i.e., good compression) and links the resolution of this trade-off to the problem of finding the most efficient use of memory: the shortest code.
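To make the two-part code concrete, the following sketch scores polynomial models of increasing degree by a crude description length. The 32-bit parameter cost, the Gaussian misfit model, and the 0.01 quantization bin are assumptions chosen for illustration, not Rissanen's exact formulation:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 50)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=50)   # data generated by a degree-1 model

    def description_length(degree, bits_per_param=32, quantum=0.01):
        # Part 1: a fixed cost per parameter; Part 2: misfit bits under a Gaussian
        # residual model, with residuals quantized to bins of width `quantum`.
        coeffs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coeffs, x)
        sigma = resid.std() + 1e-9
        density = np.exp(-0.5 * (resid / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        misfit_bits = -np.sum(np.log2(density * quantum))
        return bits_per_param * (degree + 1) + misfit_bits

    for d in range(6):
        # The total length trades model complexity against fit; overly complex
        # models pay more in part 1 than they save in part 2.
        print(d, round(description_length(d), 1))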

The original framework focuses on evaluating code lengths and does not address how to construct the codes. However, we know from Shannon's coding theory [6] that data from a discrete distribution p(x) cannot be represented with fewer than E[−log₂ p(x)] bits, or E[−ln p(x)] nats, on average. Therefore, for any probability distribution it is possible to construct a code whose length in bits is essentially −log₂ p(x); the reverse is also true, in that any code implicitly defines a probability distribution. Searching for an efficient code thus reduces to searching for a good probability distribution, i.e., a good model. More concretely, we can write the expected description length as

    L(X) = E[ −log p(θ | H) − log p(X | θ, H) ]    (3)

In the probabilistic setting, this gives a new interpretation of the two-part code:

- First, the data are modeled with parameters θ. We encode θ first, since it captures the essence/structure of the data.
- Based on θ, we then encode the modeling error: how the information in X deviates from the structure described by θ.

This establishes a connection to the Bayesian inference literature, since minimizing this expected code length is equivalent to finding the MAP estimate of θ given the data X. Note that in the continuous case a finite code can only encode values up to a quantization tolerance ɛ, and the code length then becomes

    L(X) = E[ −log( p(θ | H) ɛ_θ ) − log( p(X | θ, H) ɛ_{X|θ} ) ]
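This correspondence between probabilities and code lengths can be made concrete. The sketch below (with an assumed Gaussian prior and an assumed tolerance ɛ = 0.01, purely for illustration) computes the approximate number of bits needed to encode a continuous parameter value up to the quantization tolerance:

    import numpy as np
    from scipy import stats

    eps = 0.01                    # assumed quantization tolerance for theta
    theta = 0.7                   # a parameter value to transmit
    prior = stats.norm(0.0, 1.0)  # an assumed prior p(theta | H)

    # Shannon: a value with probability p can be coded in about -log2(p) bits.
    # For a continuous theta, the probability of the quantization bin is roughly
    # density * eps, so the first part of the code in Eq. (3) costs about:
    bits_theta = -np.log2(prior.pdf(theta) * eps)
    print(f"~{bits_theta:.1f} bits to encode theta to within {eps}")

A prior that places more mass near the transmitted value shortens this part of the code, which is exactly the complexity term of the two-part description length.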

3 Bits-back Coding

Coined by Hinton and van Camp, bits-back coding [3] suggests using redundant codewords (a set of alternative parameter values instead of a single value) based on some auxiliary information (a distribution over those parameters). This leads to an overall shorter code length, because the receiver can re-run the same learning algorithm as the sender and recover the auxiliary information (the distribution), thus getting their bits back. The code length of the auxiliary information can therefore be subtracted from the total, leading to a more efficient coding scheme.

Hinton introduced bits-back in order to add noise to the parameters and thereby decrease the information they contain. While this sounds like adding information (the mean and variance of the noise), the resulting code is no longer than the original: in the MDL framework we do not need to transmit the noise itself, so its code can be subtracted. (View 1)

As a concrete example, consider a model with an arbitrary distribution q(θ) over the parameters. This distribution introduces redundant information (we now have multiple candidate values of θ). The original coding scheme would give the following total expected length:

    E_{q(θ)}[L(X)] = E_{q(θ)}[L(θ)] + E_{q(θ)}[L(X | θ)]
                   = −Σ_θ q(θ) log p(θ) − Σ_θ q(θ) log p(X | θ)    (4)

This coding scheme is inefficient, since this length is greater than the optimal L(X) derived earlier. The crucial observation is that the approximate posterior (which the receiver can recover by re-running the learning algorithm) can be used to transmit additional information, up to the Shannon entropy of q(θ):

    H(θ) = −Σ_θ q(θ) log q(θ)

In other words, H_q(θ | X) can be used to carry information other than the original X, and should not be counted in the total length of X when the receiver has access to q(θ | X). (View 2)

(View 3): q(θ | X) has already been communicated successfully, since the receiver has access to both the values of θ and X, so its cost should be subtracted from the true cost of communicating the model and the misfits.

(View 4): By running the same learning algorithm and observing the choices that were made (based on the transmitted values of X) and their results (the values of θ), the receiver can reconstruct q(θ | X), which is therefore not part of the cost of communicating the model.
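The following toy calculation (a two-value discrete parameter with made-up numbers, purely for illustration) compares the naive expected length of Eq. (4) with the length obtained after subtracting H_q(θ | X), and with the Shannon limit −log₂ p(X):

    import numpy as np

    p_theta = np.array([0.5, 0.5])      # assumed prior p(theta)
    p_x_given = np.array([0.2, 0.9])    # assumed p(X = x_obs | theta)
    q = np.array([0.1, 0.9])            # assumed approximate posterior q(theta | X)

    # Eq. (4): expected length of the naive two-part code under q
    naive = np.sum(q * -np.log2(p_theta)) + np.sum(q * -np.log2(p_x_given))
    # Bits-back: subtract the entropy of q, which can carry other information
    bits_back = naive - np.sum(-q * np.log2(q))
    # Shannon limit for transmitting X alone
    optimal = -np.log2(np.sum(p_theta * p_x_given))

    print(f"naive: {naive:.3f}  bits-back: {bits_back:.3f}  optimal: {optimal:.3f}")
    # bits_back >= optimal, with equality exactly when q equals the true posterior.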

Subtracting this entropy, the code length becomes:

    L_{q(θ)}(X) = E_{q(θ)}[L(X)] − H_q(θ | X)
                = Σ_θ q(θ | X) log [ q(θ | X) / ( p(θ) p(X | θ) ) ]
                = Σ_θ q(θ | X) log [ q(θ | X) / p(X, θ) ]
                = KL(q(θ | X) ‖ p(X, θ))
                = Σ_θ q(θ | X) log [ q(θ | X) / p(θ | X) ] − log p(X)
                = KL(q(θ | X) ‖ p(θ | X)) − log p(X)

Notice that the second line of this equation recovers the classic result of the original scheme [Eq. 3], which can be regarded as the special case where q(θ | X) is a point-mass distribution at the θ_0 that minimizes the description length (the MAP estimate). Finding the minimum-length code thus reduces to minimizing the KL divergence between the coding distribution (the approximate posterior) and the posterior distribution of the parameters. Moreover, this code length is optimal when q(θ | X) = p(θ | X), which gives a theoretical insight into the optimality of variational inference, even though we can almost never reach that equality in practice.

4 Combining Both Perspectives

The MDL and bits-back view provides a principled way to derive the cost function, the negative ELBO

    L_{q(θ)}(X) = E_{q(θ|X)}[ log q(θ | X) − log p(X, θ) ],

as the KL divergence between an approximate posterior and the true posterior (up to the constant −log p(X)), and it explains the gap in the lower bound as the KL divergence between p(θ | X) and q(θ | X), which is exactly what makes the code sub-optimal/inefficient. This resolves the two issues a purely Bayesian perspective finds problematic: the use of an ad hoc measure and the use of the reversed KL divergence.
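The identity in this derivation can be checked numerically on the same toy model used above (still an illustrative assumption): the bits-back code length equals KL(q(θ | X) ‖ p(θ | X)) − log p(X), i.e., the negative ELBO measured in bits:

    import numpy as np

    p_theta = np.array([0.5, 0.5])      # assumed prior
    p_x_given = np.array([0.2, 0.9])    # assumed likelihood of the observed X
    q = np.array([0.1, 0.9])            # assumed approximate posterior

    p_x = np.sum(p_theta * p_x_given)      # model evidence p(X)
    posterior = p_theta * p_x_given / p_x  # true posterior p(theta | X)

    neg_elbo = np.sum(q * np.log2(q / (p_theta * p_x_given)))  # E_q[log q - log p(X, theta)]
    kl_gap = np.sum(q * np.log2(q / posterior))                # KL(q || p(theta | X))
    print(f"{neg_elbo:.3f} == {kl_gap - np.log2(p_x):.3f}")    # equal, as the derivation shows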

5 Understanding Overpruning in Variational Autoencoders

Combining these complementary views can shed new light on practical problems in constructing and training models; variational autoencoders are one such model. The Variational Lossy Autoencoder [1] provides a bits-back interpretation of over-pruning in classic VAE models: it is not an optimization failure but is reasonable, and arguably even necessary, whenever the latent code is not needed for reconstruction (for example, when local information can be used instead). In that case, training places the posterior of the latent code at the prior, to avoid incurring the KL regularization cost and to reach the optimal code length for the specific X. This interpretation made it easier to integrate autoregressive structures into the decoder without fully ignoring the encoder/latent representation. It also suggests that annealing the KL loss is an ad hoc fix without a theoretical foundation, and should probably be avoided.
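As a practical illustration of this collapse (a common diagnostic, not something prescribed by the VLAE paper), one can monitor the per-dimension KL term of a VAE with a Gaussian latent code; dimensions whose KL stays near zero have been pruned and carry no information about X:

    import numpy as np

    def kl_per_dim(mu, log_var):
        # KL(N(mu, sigma^2) || N(0, 1)), averaged over the batch, one value per
        # latent dimension; near-zero entries indicate collapsed (pruned) dimensions.
        return 0.5 * np.mean(mu**2 + np.exp(log_var) - log_var - 1.0, axis=0)

    # Assumed encoder outputs for a batch of two examples and two latent dimensions:
    mu = np.array([[0.0, 1.2], [0.0, -0.8]])        # dimension 0 matches the prior mean
    log_var = np.array([[0.0, -1.0], [0.0, -1.5]])  # dimension 0 matches the prior variance
    print(kl_per_dim(mu, log_var))                  # ~[0, 0.8]: dimension 0 is pruned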

References

[1] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[2] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA, 2014.

[3] Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5-13. ACM, 1993.

[4] Antti Honkela and Harri Valpola. Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800-810, 2004.

[5] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

[6] Tom Richardson and Ruediger Urbanke. Modern Coding Theory. Cambridge University Press, 2008.

[7] Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, pages 416-431, 1983.

[8] Chris S. Wallace and David L. Dowe. Minimum message length and Kolmogorov complexity. The Computer Journal, 42(4):270-283, 1999.