Approximate Bayesian inference


Approximate Bayesian inference: Variational and Monte Carlo methods. Christian A. Naesseth

1 Exchange rate data
[plot: monthly exchange rate over roughly 120 months]

2 Image data
[example images]

1 Bayesian inference 2 Variational inference 3 Stochastic gradient variational inference 4 Rejection sampling variational inference 5 Variational sequential Monte Carlo

4 Bayesian inference
y - data
x - latent (unknown) variables
p(x, y) - probabilistic (generative) model
Posterior distribution: p(x | y) = p(x, y) / p(y)
Expectations: E_{p(x|y)}[ϕ(x)] = ∫ ϕ(x) p(x | y) dx

5 Examples of probabilistic generative models
Exchangeable data models: p(x, y_{1:T}) = µ(x) ∏_{t=1}^T g(y_t | x)
Applications: regression, classification, ...
State space models: p(x_{1:T}, y_{1:T}) = µ(x_1) ∏_{t=2}^T f(x_t | x_{t-1}) ∏_{t=1}^T g(y_t | x_t)
Applications: filtering, smoothing, forecasting, ...

6 Approximate Bayesian inference
Because p(x | y) is typically intractable, we approximate:
Variational methods - turn integration into optimization: E_{p(x|y)}[ϕ(x)] ≈ E_{q(x ; λ*)}[ϕ(x)]
Monte Carlo methods - random sampling: E_{p(x|y)}[ϕ(x)] ≈ (1/N) ∑_{i=1}^N ϕ(x^i), x^i ∼ p(x | y)
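As a concrete illustration of the Monte Carlo route, here is a minimal numpy sketch; the toy posterior, the test function ϕ, and the sample size are hypothetical stand-ins (the posterior is taken to be a known Gaussian purely so it can be sampled directly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: pretend p(x | y) = N(1.0, 0.5^2) so we can sample it,
# and take phi(x) = x^2.
def sample_posterior(n):
    return rng.normal(loc=1.0, scale=0.5, size=n)

def phi(x):
    return x ** 2

N = 10_000
x = sample_posterior(N)              # x^i ~ p(x | y)
mc_estimate = phi(x).mean()          # (1/N) sum_i phi(x^i)
exact = 1.0**2 + 0.5**2              # E[x^2] = mu^2 + sigma^2 for a Gaussian
print(mc_estimate, exact)            # the two agree up to O(1/sqrt(N)) error
```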

1 Bayesian inference 2 Variational inference 3 Stochastic gradient variational inference 4 Rejection sampling variational inference 5 Variational sequential Monte Carlo

8 Variational inference
[diagram: a family of distributions q(x ; λ), optimized from λ_init to the λ that minimizes KL(q(x ; λ) || p(x | y))]

9 Evidence lower bound
KL(q(x ; λ) || p(x | y)) = log p(y) − E_q[log p(x, y) − log q(x ; λ)]
log p(y) ≥ E_q[log p(x, y) − log q(x ; λ)] =: L(λ)   (the ELBO)
Maximizing the ELBO L(λ) minimizes the KL divergence between q(x ; λ) and p(x | y).
Coordinate ascent when conditionally conjugate.
Stochastic optimization using noisy ∇_λ L(λ) estimated with Monte Carlo methods.

10 Bayesian logistic regression
Data are pairs (y_t, z_t): z_t is a covariate, y_t is a binary label, θ is the regression coefficient.
The generative model is
p(θ, y_{1:T} | z_{1:T}) = N(θ ; 0, 1) ∏_{t=1}^T Bernoulli(y_t ; σ(θ z_t)),
where σ(·) is the logistic function mapping the reals to (0, 1).

11 VI for Bayesian logistic regression
Let's consider a single data pair (y, z).
Variational approximation: q(θ ; λ) = N(θ ; µ, σ²)
The ELBO is given by
L(µ, σ²) = E_q[log p(θ) + log p(y | θ, z) − log q(θ ; µ, σ²)]

12 VI for Bayesian logistic regression
L(µ, σ²) = E_q[log p(θ) + log p(y | θ, z) − log q(θ ; µ, σ²)]
= −(1/2)(µ² + σ²) + (1/2) log σ² + E_q[log p(y | θ, z)] + const.
= −(1/2)(µ² + σ²) + (1/2) log σ² + E_q[yzθ − log(1 + exp(zθ))]
= −(1/2)(µ² + σ²) + (1/2) log σ² + yzµ − E_q[log(1 + exp(zθ))]
We are stuck because of the remaining expectation. What can we do?

1 Bayesian inference 2 Variational inference 3 Stochastic gradient variational inference 4 Rejection sampling variational inference 5 Variational sequential Monte Carlo

14 Stochastic Optimization
Gradient-based optimization: solve ∇L(λ) = 0.
Stochastic approximation [Robbins and Monro, 1951]:
λ_{k+1} = λ_k + α_k ∇̂L(λ_k), with E[∇̂L(λ_k)] = ∇L(λ_k).
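A minimal sketch of this update in Python, assuming we already have some unbiased but noisy gradient estimator; the toy objective, the step-size schedule α_k = a/(k+1), and all names are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_gradient_ascent(grad_estimate, lam0, a=0.5, iters=2000):
    """Robbins-Monro iteration: lam_{k+1} = lam_k + alpha_k * grad_hat(lam_k)."""
    lam = float(lam0)
    for k in range(iters):
        alpha_k = a / (k + 1.0)   # satisfies sum alpha_k = inf, sum alpha_k^2 < inf
        lam = lam + alpha_k * grad_estimate(lam)
    return lam

# Toy usage: maximize L(lam) = -(lam - 3)^2 with a noise-corrupted gradient.
noisy_grad = lambda lam: -2.0 * (lam - 3.0) + rng.normal(scale=0.5)
print(stochastic_gradient_ascent(noisy_grad, lam0=0.0))   # close to 3, up to noise
```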

15 VI for Bayesian logistic regression
L(µ, σ²) = −(1/2)(µ² + σ²) + (1/2) log σ² + yzµ − E_q[log(1 + exp(zθ))]
∇_{µ,σ²} E_q[log(1 + exp(zθ))] = ∇_{µ,σ²} ∫ q(θ ; µ, σ²) log(1 + exp(zθ)) dθ
= E_q[∇_{µ,σ²} log q(θ ; µ, σ²) · log(1 + exp(zθ))]
≈ (1/N) ∑_{i=1}^N ∇_{µ,σ²} log q(θ^i ; µ, σ²) log(1 + exp(zθ^i)),
with θ^i ∼ q(θ ; µ, σ²). This gives an unbiased MC estimator of the gradient!
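A sketch of that estimator in numpy for a single pair (y, z), with q(θ ; µ, σ²) parameterized by (µ, log σ); the score functions are written out analytically and the closed-form terms of the gradient are added back in. The parameterization and all names are my own choices for the example, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
softplus = lambda u: np.logaddexp(0.0, u)          # log(1 + exp(u)), numerically stable

def elbo_grad_estimate(mu, log_sigma, y, z, N=100):
    """Score-function MC estimate of the ELBO gradient for q(theta) = N(mu, sigma^2)."""
    sigma = np.exp(log_sigma)
    theta = rng.normal(mu, sigma, size=N)           # theta^i ~ q(theta; mu, sigma^2)
    f = softplus(z * theta)                         # log(1 + exp(z * theta^i))
    score_mu = (theta - mu) / sigma**2              # d/dmu log q(theta^i)
    score_ls = (theta - mu)**2 / sigma**2 - 1.0     # d/dlog_sigma log q(theta^i)
    # closed-form part of the gradient plus the MC estimate of the awkward term
    g_mu = -mu + y * z - np.mean(score_mu * f)
    g_ls = -sigma**2 + 1.0 - np.mean(score_ls * f)
    return np.array([g_mu, g_ls])

print(elbo_grad_estimate(mu=0.0, log_sigma=0.0, y=1, z=2.0))
```

Plugging such an estimator into the stochastic gradient ascent loop above gives a complete, if crude, VI algorithm for this model.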

16 Stochastic gradient variational inference
Monte Carlo estimates of gradients to optimize the ELBO.
Score function: [Paisley et al., 2012, Ranganath et al., 2014]
∇_λ L(λ) = ∇_λ E_q[log p(x, y) − log q(x ; λ)] = E_q[∇_λ log q(x ; λ) (log p(x, y) − log q(x ; λ))]
Reparameterization: [Kingma and Welling, 2014, Rezende et al., 2014]
∇_λ L(λ) = ∇_λ E_q[log p(x, y) − log q(x ; λ)] = ∇_λ E_{s(ε)}[log p(h(ε, λ), y) − log q(h(ε, λ) ; λ)]
= E_{s(ε)}[∇_x (log p(x, y) − log q(x ; λ)) ∇_λ h(ε, λ)]
e.g. x = h(ε, µ, σ) = µ + σε, s(ε) = N(0, 1).
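For contrast, a sketch of the reparameterization estimator for a deliberately simple toy model of my own choosing (not from the slides): prior x ∼ N(0, 1), likelihood y | x ∼ N(x, 1), and q(x ; λ) = N(µ, σ²) with x = h(ε, λ) = µ + σε. The inner gradient ∇_x[log p(x, y) − log q(x ; λ)] is written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_elbo_grad(mu, sigma, y, N=100):
    """Reparameterization-gradient estimate of the ELBO for the toy model
    x ~ N(0, 1), y | x ~ N(x, 1), with q(x; lambda) = N(mu, sigma^2)."""
    eps = rng.normal(size=N)                  # eps ~ s(eps) = N(0, 1)
    x = mu + sigma * eps                      # x = h(eps, lambda)
    # d/dx [log p(x) + log p(y | x) - log q(x; lambda)], term by term
    inner = -x + (y - x) + (x - mu) / sigma**2
    g_mu = np.mean(inner * 1.0)               # dh/dmu = 1
    g_sigma = np.mean(inner * eps)            # dh/dsigma = eps
    return np.array([g_mu, g_sigma])

print(reparam_elbo_grad(mu=0.0, sigma=1.0, y=2.0))
```

In practice the inner gradient is obtained by automatic differentiation rather than by hand; the appeal of the estimator is that it typically has much lower variance than the score-function one.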

17 Reparameterization vs. score function
[plot: ELBO versus wall-clock time (log scale) for ADVI, BBVI, G-REP and RSVI]

18 Why doesn't reparameterization always work?
All random variables we simulate on our computers are ultimately transformations of uniforms.
Why can't we do reparameterization gradients for everything?
Often the transformations h(·) used in practice are not differentiable.

1 Bayesian inference 2 Variational inference 3 Stochastic gradient variational inference 4 Rejection sampling variational inference 5 Variational sequential Monte Carlo

20 Reparameterizing gamma
Gamma(λ, 1): a common distribution in VI, used to sample beta, Dirichlet, Student's t, chi-squared, ...
Simulation:
Propose x = h(ε, λ) = (λ − 1/3)(1 + ε / √(9λ − 3))³, ε ∼ N(0, 1)
Accept if u ≤ a(ε, λ), u ∼ U[0, 1]
Tricky to reparameterize because of the accept-reject step!
A simple method for generating gamma variables, Marsaglia and Tsang, 2000
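A sketch of the Marsaglia-Tsang sampler for Gamma(λ, 1) with λ ≥ 1, following the proposal h(ε, λ) above; the acceptance test is the one from the 2000 paper, and the wrapper itself is just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gamma(lam):
    """Marsaglia & Tsang (2000) sampler for Gamma(lam, 1), lam >= 1."""
    d = lam - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    while True:
        eps = rng.normal()                    # eps ~ N(0, 1)
        v = (1.0 + c * eps) ** 3              # proposal: x = h(eps, lam) = d * v
        if v <= 0.0:                          # outside the support, reject outright
            continue
        u = rng.uniform()                     # u ~ U[0, 1]
        if np.log(u) < 0.5 * eps**2 + d - d * v + d * np.log(v):
            return d * v                      # the accept-reject step

samples = np.array([sample_gamma(2.5) for _ in range(20_000)])
print(samples.mean())                         # Gamma(2.5, 1) has mean 2.5
```

It is exactly the accept-reject branch that breaks the clean "differentiate through h" story, since whether a given ε is accepted depends on λ.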

21 Our strategy: partial reparameterization
Idea: Use ε ∼ π(ε ; λ) (only weakly dependent on λ) such that x = h(ε, λ) ∼ q(x ; λ). Then:
∇_λ L(λ) = ∇_λ E_{π(ε ; λ)}[log p(h(ε, λ), y) − log q(h(ε, λ) ; λ)]
= E_{π(ε ; λ)}[∇_x (log p(x, y) − log q(x ; λ)) ∇_λ h(ε, λ)]   (reparameterization)
+ E_{π(ε ; λ)}[(log p(h(ε, λ), y) − log q(h(ε, λ) ; λ)) ∇_λ log π(ε ; λ)]   (correction)
Intuition: interpolating between score function gradients and reparameterization gradients.

22 Rejection sampling and reparameterization
Idea: Use h(ε, λ) from the proposal and let π(ε ; λ) be the distribution of the accepted ε.
Rejection sampling:
Propose x = h(ε, λ), ε ∼ s(ε)
Accept if u ≤ a(ε, λ) = q(h(ε, λ) ; λ) |dh/dε (ε, λ)| / (M_λ s(ε)), u ∼ U[0, 1]
Then:
π(ε ; λ) = ∫ π(ε, u ; λ) du = M_λ s(ε) ∫_0^1 1[u ≤ a(ε, λ)] du = q(h(ε, λ) ; λ) |dh/dε (ε, λ)|

23 Sparse gamma deep exponential family
w^l_{k,k'} ∼ Gamma(0.1, 0.1)
x^l_{n,k} ∼ Gamma(0.1, 0.1 / ∑_{k'} w^l_{k,k'} x^{l+1}_{n,k'})
y_{n,d} ∼ Poisson(∑_k w^0_{k,d} x^1_{n,k})
[plate diagram of the layered model]
Deep exponential families, Ranganath et al., 2015

24 Sparse gamma DEF Olivetti faces
[plot: ELBO versus wall-clock time (log scale) for ADVI, BBVI, G-REP and RSVI]
[Kucukelbir et al., 2015, Ranganath et al., 2014, Ruiz et al., 2016]

25 Variational Monte Carlo methods
Monte Carlo methods can enable variational inference for more complex (non-conjugate) models p(x, y).
But what if we can use Monte Carlo methods to also define more flexible q(x ; λ)?
I will illustrate how we can do this using sequential Monte Carlo for the state space model
p(x_{1:T}, y_{1:T}) = µ(x_1) ∏_{t=2}^T f(x_t | x_{t-1}) ∏_{t=1}^T g(y_t | x_t).

1 Bayesian inference 2 Variational inference 3 Stochastic gradient variational inference 4 Rejection sampling variational inference 5 Variational sequential Monte Carlo

27 Sequential Monte Carlo
SMC approximation: p̂(x_{1:t-1} | y_{1:t-1}) = ∑_{i=1}^N (w^i_{t-1} / ∑_l w^l_{t-1}) δ_{x^i_{1:t-1}}
Update:
Resample: a^i_{t-1} ∼ Discrete(w^j_{t-1} / ∑_l w^l_{t-1})
Propagate: x^i_t ∼ r(x_t | x^{a^i_{t-1}}_{t-1} ; λ)
Weight: w^i_t = f(x^i_t | x^{a^i_{t-1}}_{t-1}) g(y_t | x^i_t) / r(x^i_t | x^{a^i_{t-1}}_{t-1} ; λ)
[diagram: particle genealogies over time, SMC vs. SIS]
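A minimal numpy sketch of the resample-propagate-weight recursion for a toy linear-Gaussian state space model; the model, the bootstrap-style proposal r = f, and all variable names are my own illustrative choices. It also accumulates log p̂(y_{1:T}), which is what variational SMC below builds on.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc(y, N=100, phi=0.9, q_std=1.0, g_std=1.0):
    """Bootstrap-style SMC for x_t = phi*x_{t-1} + N(0, q_std^2), y_t = x_t + N(0, g_std^2).
    With proposal r = f (the transition), the incremental weight is just g(y_t | x_t)."""
    T = len(y)
    traj = np.zeros((N, T))                   # particle trajectories x_{1:t}^i
    log_Z = 0.0                               # accumulates log p_hat(y_{1:T})
    for t in range(T):
        if t == 0:
            x = rng.normal(0.0, q_std, size=N)            # x_1^i ~ mu(x_1)
        else:
            a = rng.choice(N, size=N, p=w)                # resample ancestor indices
            traj = traj[a]                                # carry the ancestry along
            x = phi * traj[:, t - 1] + rng.normal(0.0, q_std, size=N)
        traj[:, t] = x
        # unnormalized weights: w_t^i = g(y_t | x_t^i) for the bootstrap proposal
        log_w = -0.5 * ((y[t] - x) / g_std) ** 2 - 0.5 * np.log(2 * np.pi * g_std**2)
        m = log_w.max()
        log_Z += m + np.log(np.mean(np.exp(log_w - m)))   # log (1/N) sum_i w_t^i
        w = np.exp(log_w - m)
        w /= w.sum()
    return traj, w, log_Z

y = rng.normal(size=50)                       # stand-in observations
traj, w_T, log_Z = smc(y)
print(log_Z)                                  # log p_hat(y_{1:T})
```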

28 Variational Sequential Monte Carlo
Key idea: q(x_{1:T} ; λ) is defined by running SMC and sampling from p̂(x_{1:T} | y_{1:T}).
[plot: densities of p(x | y), the proposal r(x ; λ*), and the resulting q(x ; λ*)]

29 Variational Sequential Monte Carlo
Generative process of VSMC:
1. Run SMC to get {(x^i_{1:T}, w^i_T)}_{i=1}^N
2. Sample b_T ∼ Discrete(w^i_T / ∑_l w^l_T)
3. Return x_{1:T} = x^{b_T}_{1:T}
That is, x_{1:T} ∼ p̂(x_{1:T} | y_{1:T}) = ∑_{i=1}^N (w^i_T / ∑_l w^l_T) δ_{x^i_{1:T}}.
[diagram: particle trajectories over time]
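Given the toy `smc` sketch above, the VSMC generative process is just these three steps (purely illustrative, reusing `rng`, `y` and `smc` from that sketch):

```python
# 1. run SMC, 2. draw b_T from the final weights, 3. return that trajectory
traj, w_T, _ = smc(y, N=16)
b_T = rng.choice(len(w_T), p=w_T)
x_sample = traj[b_T]          # a single draw x_{1:T} ~ q(x_{1:T}; lambda)
```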

30 Distribution of VSMC
The distribution q is the marginal of x^{b_T}_{1:T} over all random variables x^{1:N}_{1:T}, a^{1:N}_{1:T-1}, b_T generated by VSMC:
q(x_{1:T} | y_{1:T} ; λ) = p(x_{1:T}, y_{1:T}) E[p̂(y_{1:T})^{-1} | x_{1:T}],
where p̂(y_{1:T}) = ∏_{t=1}^T (1/N) ∑_{i=1}^N w^i_t.

31 Surrogate ELBO
We cannot directly use the standard ELBO L(λ), because evaluating q pointwise is intractable. Consider instead
L̃(λ) := E[log ∏_{t=1}^T (1/N) ∑_{i=1}^N w^i_t] = E[log p̂(y_{1:T})].
Theorem 1 (Surrogate ELBO). log p(y_{1:T}) ≥ L(λ) ≥ L̃(λ).
Proof (idea): use the definition of L(λ) for q(x_{1:T} | y_{1:T} ; λ) and then Jensen's inequality.
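Since every SMC run already yields log p̂(y_{1:T}), a Monte Carlo estimate of the surrogate ELBO is simply an average over independent runs (still reusing the toy `smc` sketch from above):

```python
import numpy as np

surrogate_elbo = np.mean([smc(y, N=16)[2] for _ in range(50)])   # estimate of E[log p_hat(y_{1:T})]
print(surrogate_elbo)
```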

32 Stochastic Optimization
We optimize the surrogate ELBO L̃(λ) using stochastic gradient ascent:
∇_λ L̃(λ) = ∇_λ E[log p̂(y_{1:T})] = E[∇_λ log p̂(y_{1:T}) + log p̂(y_{1:T}) ∇_λ log φ].
Variance reduction using the reparameterization trick, Rao-Blackwellization, and control variates.

33 Theoretical Results
For some c(λ) < ∞,
KL(q(x_{1:T} ; λ) || p(x_{1:T} | y_{1:T})) ≤ log p(y_{1:T}) − E[log p̂(y_{1:T})] ≤ c(λ) / N.
With N = bT,
KL(q(x_{1:T} ; λ) || p(x_{1:T} | y_{1:T})) ≤ T σ²(λ) / (2b), for some σ²(λ) < ∞.

34 Experiments - Stochastic volatility
Data: 10 years of monthly returns for the exchange rates of 22 currencies with respect to USD.
Model:
x_t = µ + φ(x_{t-1} − µ) + v_t, v_t ∼ N(0, Q)
y_t = β exp(x_t / 2) e_t, e_t ∼ N(0, I)
Proposal: r(x_t | x_{t-1} ; λ, θ) ∝ f(x_t | x_{t-1} ; θ) N(x_t ; µ_t, Σ_t)
Results (ELBO):
Structured VI: 6905.1
N = 4:  IWAE 6911.2, VSMC 6921.6
N = 8:  IWAE 6912.4, VSMC 6935.8
N = 16: IWAE 6913.3, VSMC 6936.6
[plot: ELBO versus number of particles N (100 to 500) for VSMC and IWAE]

35 Experiments - Stochastic recurrent neural networks
Data: 105 motor cortex neurons simultaneously recorded in a macaque monkey.
Model:
x_t = µ_θ(x_{t-1}) + exp(σ_θ(x_{t-1}) / 2) v_t
y_t ∼ Poisson(exp(η_θ(x_t)))
Proposal: r(x_t | x_{t-1}, y_t) ∝ f(x_t | x_{t-1} ; θ) N(x_t ; µ_λ(y_t), e^{2σ_λ(y_t)}), where µ_λ(·), σ_λ(·) are inference networks.
[plot: ELBO versus iterations (×10³) for IWAE and VSMC with latent dimension d_x = 3, 5, 10]

36 Collaborators
Scott Linderman, Francisco Ruiz, Rajesh Ranganath, David Blei.
Work done while visiting Columbia University.
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms, Naesseth et al., 2017
Variational Sequential Monte Carlo, Naesseth et al., 2017

37 Take-home messages
Stochastic optimization opens up interesting new approximate Bayesian inference methods.
Monte Carlo methods → more flexible variational approximations with guarantees.
Variational methods → adaptation of Monte Carlo methods based on a global objective.
Contact: christian.a.naesseth@liu.se
Website: users.isy.liu.se/en/rt/chran60/index.html

38 Subsampling and Variational EM
Data subsampling: y_{1:T} = {y^j_{1:T}}_{j=1}^n,
L̃(λ) = ∑_{j=1}^n E[log p̂(y^j_{1:T})].
At each iteration, sample a mini-batch of data.
Variational EM:
log p_θ(y_{1:T}) ≥ L̃(λ, θ) = E[log p̂_θ(y_{1:T})].
Maximize L̃(λ, θ) jointly with respect to λ and θ.
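A small sketch of the data-subsampling estimate: with n independent sequences, rescaling a mini-batch average by n gives an unbiased estimate of the sum over all sequences (the stand-in data, the batch size, and the reuse of the toy `smc` and `rng` above are all illustrative assumptions):

```python
import numpy as np

def minibatch_surrogate_elbo(datasets, batch_size=8):
    """Unbiased mini-batch estimate of sum_j E[log p_hat(y^j_{1:T})]."""
    n = len(datasets)
    batch = rng.choice(n, size=batch_size, replace=False)
    return n / batch_size * sum(smc(datasets[j], N=16)[2] for j in batch)

datasets = [rng.normal(size=50) for _ in range(100)]   # stand-in sequences y^j_{1:T}
print(minibatch_surrogate_elbo(datasets))
```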