Approximate Bayesian inference

1 Approximate Bayesian inference: Variational and Monte Carlo methods
Christian A. Naesseth

2 Exchange rate data
[Figure: exchange rate data plotted against month]

3 Image data
[Figure: image data]

4 Outline
1. Bayesian inference
2. Variational inference
3. Stochastic gradient variational inference
4. Rejection sampling variational inference
5. Variational sequential Monte Carlo

5 Bayesian inference
y - data
x - latent (unknown) variables
p(x, y) - probabilistic (generative) model
Posterior distribution: $p(x \mid y) = \frac{p(x, y)}{p(y)}$
Expectations: $\mathbb{E}_{p(x \mid y)}[\phi(x)] = \int \phi(x)\, p(x \mid y)\, dx$

6 Examples of probabilistic generative models
Exchangeable data models: $p(x, y_{1:T}) = \mu(x) \prod_{t=1}^{T} g(y_t \mid x)$
Applications: regression, classification, ...
State space models: $p(x_{1:T}, y_{1:T}) = \mu(x_1) \prod_{t=2}^{T} f(x_t \mid x_{t-1}) \prod_{t=1}^{T} g(y_t \mid x_t)$
Applications: filtering, smoothing, forecasting, ...

7 Approximate Bayesian inference
Because $p(x \mid y)$ is typically intractable, we approximate:
Variational methods - turn integration into optimization: $\mathbb{E}_{p(x \mid y)}[\phi(x)] \approx \mathbb{E}_{q(x ; \lambda)}[\phi(x)]$
Monte Carlo methods - random sampling: $\mathbb{E}_{p(x \mid y)}[\phi(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \phi(x^i), \quad x^i \sim p(x \mid y)$
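
The Monte Carlo approximation above can be sketched in a few lines of Python. The toy model is an assumption made only for illustration (it is not from the slides): x ~ N(0, 1) and y | x ~ N(x, 1), chosen because the posterior N(y/2, 1/2) is known exactly and the estimate can be compared with the true value. The names `mc_expectation` and `draw_posterior` are hypothetical.

```python
import random

def mc_expectation(phi, sampler, n, rng):
    """Average phi over n independent draws produced by sampler(rng)."""
    return sum(phi(sampler(rng)) for _ in range(n)) / n

# Toy conjugate model (assumed for illustration, not from the slides):
# x ~ N(0, 1), y | x ~ N(x, 1), so p(x | y) = N(y / 2, 1 / 2) exactly.
y = 1.2
post_mean, post_var = y / 2.0, 0.5

def draw_posterior(rng):
    return rng.gauss(post_mean, post_var ** 0.5)

rng = random.Random(0)
estimate = mc_expectation(lambda x: x * x, draw_posterior, 20000, rng)
exact = post_mean ** 2 + post_var   # E[x^2] = mean^2 + variance
print(estimate, exact)
```

In realistic models exact posterior draws are unavailable, which is why the rest of the talk turns to variational and sequential Monte Carlo approximations.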

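9 Variational inference
[Figure: the variational family $q(x ; \lambda)$, with $\lambda$ moved from $\lambda_{\text{init}}$ so as to reduce $\mathrm{KL}(q(x ; \lambda) \,\|\, p(x \mid y))$]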

10 Evidence lower bound
$\mathrm{KL}(q(x ; \lambda) \,\|\, p(x \mid y)) = \log p(y) - \mathbb{E}_q[\log p(x, y) - \log q(x ; \lambda)]$
$\log p(y) \geq \mathbb{E}_q[\log p(x, y) - \log q(x ; \lambda)] = L(\lambda)$
Maximizing the ELBO $L(\lambda)$ minimizes the KL divergence between $q(x ; \lambda)$ and $p(x \mid y)$.
Coordinate ascent when the model is conditionally conjugate.
Stochastic optimization using noisy $\nabla_\lambda L(\lambda)$ estimated with Monte Carlo methods.
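
As a numerical check of the bound, the sketch below evaluates the ELBO in closed form for an assumed toy model (x ~ N(0, 1), y | x ~ N(x, 1), not from the slides) with a Gaussian q(x) = N(m, s2). The gap log p(y) - L should vanish when q equals the exact posterior N(y/2, 1/2) and be positive otherwise. The function names are hypothetical.

```python
import math

def log_evidence(y):
    """log p(y) for x ~ N(0, 1), y | x ~ N(x, 1): marginally y ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - y * y / 4.0

def elbo(m, s2, y):
    """E_q[log p(x, y) - log q(x)] for q(x) = N(m, s2), in closed form."""
    e_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m * m + s2)
    e_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((y - m) ** 2 + s2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return e_log_prior + e_log_lik + entropy

y = 1.2
gap_at_posterior = log_evidence(y) - elbo(y / 2.0, 0.5, y)   # KL is zero here
gap_elsewhere = log_evidence(y) - elbo(0.0, 1.0, y)          # KL is positive here
print(gap_at_posterior, gap_elsewhere)
```

The gap equals the KL divergence from the first identity on the slide, so maximizing the ELBO over (m, s2) recovers the exact posterior in this conjugate case.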

11 Bayesian logistic regression
Data are pairs $(y_t, z_t)$: $z_t$ is a covariate, $y_t$ is a binary label, and $\theta$ is the regression coefficient.
The generative model is
$p(\theta, y_{1:T} \mid z_{1:T}) = \mathcal{N}(\theta ; 0, 1) \prod_{t=1}^{T} \mathrm{Bernoulli}(y_t ; \sigma(\theta z_t))$,
where $\sigma(\cdot)$ is the logistic function mapping reals to (0, 1).

12 VI for Bayesian logistic regression
Let's consider a single data pair $(y, z)$.
Variational approximation: $q(\theta ; \lambda) = \mathcal{N}(\theta ; \mu, \sigma^2)$
The ELBO is given by
$L(\mu, \sigma^2) = \mathbb{E}_q[\log p(\theta) + \log p(y \mid \theta, z) - \log q(\theta ; \mu, \sigma^2)]$

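13 VI for Bayesian logistic regression
$L(\mu, \sigma^2) = \mathbb{E}_q[\log p(\theta) + \log p(y \mid \theta, z) - \log q(\theta ; \mu, \sigma^2)]$
$= -\frac{1}{2}(\mu^2 + \sigma^2) + \frac{1}{2} \log \sigma^2 + \mathbb{E}_q[\log p(y \mid \theta, z)] + \text{const.}$
$= -\frac{1}{2}(\mu^2 + \sigma^2) + \frac{1}{2} \log \sigma^2 + \mathbb{E}_q[y z \theta - \log(1 + \exp(z\theta))] + \text{const.}$
$= -\frac{1}{2}(\mu^2 + \sigma^2) + \frac{1}{2} \log \sigma^2 + y z \mu - \mathbb{E}_q[\log(1 + \exp(z\theta))] + \text{const.}$
We are stuck because of the remaining expectation, which has no closed form. What can we do?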

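15 Stochastic optimization
Gradient-based optimization: $\nabla_\lambda L(\lambda) = 0$
Stochastic approximation [Robbins and Monro, 1951]:
$\lambda_{k+1} = \lambda_k + \alpha_k \widehat{\nabla}_\lambda L(\lambda_k)$, with $\mathbb{E}[\widehat{\nabla}_\lambda L(\lambda_k)] = \nabla_\lambda L(\lambda_k)$.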
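
A minimal sketch of the stochastic approximation update, under assumptions that are not in the slides: the objective is the hypothetical L(λ) = -(λ - 3)²/2, the gradient is observed with unit-variance Gaussian noise, and the step sizes are α_k = 1/k, which satisfy the usual conditions (the α_k sum to infinity while their squares have a finite sum).

```python
import random

def stochastic_ascent(noisy_grad, lam0, n_steps, rng, a=1.0):
    """Iterate lam <- lam + alpha_k * noisy_grad(lam) with alpha_k = a / k."""
    lam = lam0
    for k in range(1, n_steps + 1):
        alpha = a / k
        lam += alpha * noisy_grad(lam, rng)
    return lam

def noisy_grad(lam, rng):
    """Unbiased gradient of the hypothetical objective -(lam - 3)^2 / 2."""
    return -(lam - 3.0) + rng.gauss(0.0, 1.0)

lam_hat = stochastic_ascent(noisy_grad, 0.0, 5000, random.Random(1))
print(lam_hat)  # close to the maximizer 3
```

Only unbiasedness of the gradient estimate is required, which is what makes Monte Carlo gradient estimates usable for optimizing the ELBO.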

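16 VI for Bayesian logistic regression
$L(\mu, \sigma^2) = -\frac{1}{2}(\mu^2 + \sigma^2) + \frac{1}{2} \log \sigma^2 + y z \mu - \mathbb{E}_q[\log(1 + \exp(z\theta))] + \text{const.}$
$\nabla \mathbb{E}_q[\log(1 + \exp(z\theta))] = \nabla \int q(\theta ; \mu, \sigma^2) \log(1 + \exp(z\theta))\, d\theta$
$= \mathbb{E}_q[\nabla \log q(\theta ; \mu, \sigma^2) \log(1 + \exp(z\theta))]$
$\approx \frac{1}{N} \sum_{i=1}^{N} \nabla \log q(\theta^i ; \mu, \sigma^2) \log(1 + \exp(z\theta^i))$, with $\theta^i \sim q(\theta ; \mu, \sigma^2)$.
Plugging this in gives $\widehat{\nabla} L(\mu, \sigma^2)$, an unbiased MC estimator of the gradient!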
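
The score-function estimator for the intractable term can be sketched as follows, here only for the derivative with respect to µ. The reference value is an addition for checking purposes: it uses the Gaussian identity d/dµ E_q[f(θ)] = E_q[f'(θ)] and a trapezoidal rule. All function names and the parameter values are hypothetical.

```python
import math
import random

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def score_grad_mu(mu, sigma, z, n, rng):
    """Score-function estimate of d/dmu E_q[log(1 + exp(z * theta))]."""
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(mu, sigma)
        score = (theta - mu) / sigma ** 2       # d/dmu log q(theta; mu, sigma^2)
        total += score * softplus(z * theta)
    return total / n

def quadrature_grad_mu(mu, sigma, z, half_width=10.0, m=4001):
    """Reference value E_q[z * sigmoid(z * theta)] by the trapezoidal rule."""
    lo = mu - half_width * sigma
    step = 2.0 * half_width * sigma / (m - 1)
    acc = 0.0
    for i in range(m):
        theta = lo + i * step
        dens = math.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, m - 1) else 1.0
        acc += weight * dens * z / (1.0 + math.exp(-z * theta))
    return acc * step

est = score_grad_mu(0.5, 1.0, 1.0, 50000, random.Random(0))
ref = quadrature_grad_mu(0.5, 1.0, 1.0)
print(est, ref)
```

The estimator needs only samples from q and the score of q, so it applies to any model, but its variance is typically high, hence the large sample size used here.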

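17 Stochastic gradient variational inference
Monte Carlo estimates of gradients to optimize the ELBO.
Score function [Paisley et al., 2012; Ranganath et al., 2014]:
$\nabla_\lambda L(\lambda) = \nabla_\lambda \mathbb{E}_q[\log p(x, y) - \log q(x ; \lambda)]$
$= \mathbb{E}_q[\nabla_\lambda \log q(x ; \lambda)\, (\log p(x, y) - \log q(x ; \lambda))]$
Reparameterization [Kingma & Welling, 2014; Rezende et al., 2014]:
$\nabla_\lambda L(\lambda) = \nabla_\lambda \mathbb{E}_q[\log p(x, y) - \log q(x ; \lambda)]$
$= \nabla_\lambda \mathbb{E}_{s(\varepsilon)}[\log p(h(\varepsilon, \lambda), y) - \log q(h(\varepsilon, \lambda) ; \lambda)]$
$= \mathbb{E}_{s(\varepsilon)}[\nabla_x (\log p(x, y) - \log q(x ; \lambda))\, \nabla_\lambda h(\varepsilon, \lambda)]$
e.g. $x = h(\varepsilon, \mu, \sigma) = \mu + \sigma \varepsilon$, $s(\varepsilon) = \mathcal{N}(0, 1)$.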
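
The reparameterization gradient can be sketched for the single-pair logistic regression ELBO, L(µ, σ²) = -(µ² + σ²)/2 + log σ + E_q[yzθ - log(1 + exp(zθ))] + const. The choices below are assumptions for illustration: the parameterization ρ = log σ, the step sizes 0.1 / k^0.6, the data pair (y, z) = (1, 1), and the names `reparam_grads` and `fit`.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def reparam_grads(mu, rho, y, z, n, rng):
    """Reparameterization estimate of the ELBO gradient in (mu, rho), sigma = exp(rho)."""
    sigma = math.exp(rho)
    g_mu, g_rho = 0.0, 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        theta = mu + sigma * eps                   # x = h(eps, lambda)
        d_theta = y * z - z * sigmoid(z * theta)   # d/dtheta log p(y | theta, z)
        g_mu += d_theta                            # dtheta/dmu = 1
        g_rho += d_theta * sigma * eps             # dtheta/drho = sigma * eps
    # Closed-form part: gradient of -(mu^2 + sigma^2) / 2 + rho.
    return -mu + g_mu / n, -sigma ** 2 + 1.0 + g_rho / n

def fit(y, z, steps=3000, n=10, seed=0):
    """Stochastic gradient ascent on the ELBO with decaying step sizes."""
    rng = random.Random(seed)
    mu, rho = 0.0, 0.0
    for k in range(1, steps + 1):
        alpha = 0.1 / k ** 0.6
        g_mu, g_rho = reparam_grads(mu, rho, y, z, n, rng)
        mu, rho = mu + alpha * g_mu, rho + alpha * g_rho
    return mu, math.exp(rho)

mu_hat, sigma_hat = fit(y=1.0, z=1.0)
print(mu_hat, sigma_hat)
```

Ten samples per step are enough here because the pathwise gradient has low variance, which is the practical argument for reparameterization over the score function when h is differentiable.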

18 Reparameterization vs. score function
[Figure: ELBO versus time [s] for ADVI, BBVI, G-REP, and RSVI]

19 Why doesn't reparameterization always work?
All random variables we simulate on our computers are ultimately transformations of uniforms. Why can't we do reparameterization gradients for everything?
Often the transformations $h(\cdot)$ used in practice are not differentiable.

21 Reparameterizing gamma
Gamma(λ, 1): a common distribution in VI, also used to sample beta, Dirichlet, Student's t, chi-squared, ...
Simulation:
Propose: $h(\varepsilon, \lambda) = \left(\lambda - \frac{1}{3}\right) \left(1 + \frac{\varepsilon}{\sqrt{9\lambda - 3}}\right)^3$, $\varepsilon \sim \mathcal{N}(0, 1)$
Accept-reject: $u \leq a(\varepsilon, \lambda)$?, $u \sim U[0, 1]$
Tricky to reparameterize because of the accept-reject step!
[A simple method for generating gamma variables, Marsaglia & Tsang, 2000]
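
The propose/accept-reject simulation can be sketched as below for λ ≥ 1. The transform h is the one on the slide; the explicit acceptance test is taken from the Marsaglia & Tsang method as I recall it (without their optional squeeze step), so treat it as an assumption rather than the talk's exact formulation. `sample_gamma` is a hypothetical name.

```python
import math
import random

def h(eps, lam):
    """Proposal transform: (lam - 1/3) * (1 + eps / sqrt(9 * lam - 3))^3."""
    return (lam - 1.0 / 3.0) * (1.0 + eps / math.sqrt(9.0 * lam - 3.0)) ** 3

def sample_gamma(lam, rng):
    """Draw from Gamma(lam, 1) for lam >= 1; returns (x, accepted eps)."""
    d = lam - 1.0 / 3.0
    while True:
        eps = rng.gauss(0.0, 1.0)
        v = (1.0 + eps / math.sqrt(9.0 * lam - 3.0)) ** 3
        if v <= 0.0:
            continue                      # proposal falls outside the support
        u = 1.0 - rng.random()            # uniform on (0, 1], safe for log
        # Acceptance test u <= a(eps, lam), written in log form.
        if math.log(u) < 0.5 * eps * eps + d - d * v + d * math.log(v):
            return h(eps, lam), eps

rng = random.Random(0)
draws = [sample_gamma(2.5, rng)[0] for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)  # Gamma(2.5, 1) has mean 2.5
```

The loop makes the difficulty concrete: the accepted ε no longer follows N(0, 1) but a distribution that depends on λ through the acceptance step.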

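22 Our strategy: partial reparameterization
Idea: Use $\varepsilon \sim \pi(\varepsilon ; \lambda)$ (only weakly dependent on $\lambda$) s.t. $x = h(\varepsilon, \lambda) \sim q(x ; \lambda)$.
Then:
$\nabla_\lambda L(\lambda) = \nabla_\lambda \mathbb{E}_{\pi(\varepsilon ; \lambda)}[\log p(h(\varepsilon, \lambda), y) - \log q(h(\varepsilon, \lambda) ; \lambda)]$
$= \mathbb{E}_{\pi(\varepsilon ; \lambda)}[\nabla_x (\log p(x, y) - \log q(x ; \lambda))\, \nabla_\lambda h(\varepsilon, \lambda)]$ (reparameterization)
$+ \mathbb{E}_{\pi(\varepsilon ; \lambda)}[(\log p(h(\varepsilon, \lambda), y) - \log q(h(\varepsilon, \lambda) ; \lambda))\, \nabla_\lambda \log \pi(\varepsilon ; \lambda)]$ (correction)
Intuition: Interpolating between score function gradients and reparameterization gradients.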
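
The two terms follow from the product rule once the expectation is written as an integral. The sketch below writes f(x) for log p(x, y) - log q(x ; λ) and treats f as a function of x only; this is a simplifying assumption, justified because the explicit λ-dependence of log q contributes a term with zero expectation (E_q[∇_λ log q(x ; λ)] = 0).

```latex
\begin{align*}
\nabla_\lambda \mathbb{E}_{\pi(\varepsilon ; \lambda)}\big[f(h(\varepsilon, \lambda))\big]
&= \nabla_\lambda \int \pi(\varepsilon ; \lambda)\, f(h(\varepsilon, \lambda))\, d\varepsilon \\
&= \int \pi(\varepsilon ; \lambda)\, \nabla_\lambda f(h(\varepsilon, \lambda))\, d\varepsilon
 + \int f(h(\varepsilon, \lambda))\, \nabla_\lambda \pi(\varepsilon ; \lambda)\, d\varepsilon \\
&= \mathbb{E}_{\pi(\varepsilon ; \lambda)}\big[\nabla_x f(x)\big|_{x = h(\varepsilon, \lambda)}\, \nabla_\lambda h(\varepsilon, \lambda)\big]
 + \mathbb{E}_{\pi(\varepsilon ; \lambda)}\big[f(h(\varepsilon, \lambda))\, \nabla_\lambda \log \pi(\varepsilon ; \lambda)\big]
\end{align*}
```

The last step uses ∇_λ π = π ∇_λ log π. If π does not depend on λ the second term vanishes and the pure reparameterization gradient remains; if h does not depend on λ the first term vanishes and the score function gradient remains.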

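23 Rejection sampling and reparameterization
Idea: Use $h(\varepsilon, \lambda)$ from the proposal. Let $\pi(\varepsilon ; \lambda)$ be the distribution of the accepted $\varepsilon$.
Rejection sampling:
Propose $x = h(\varepsilon, \lambda)$, $\varepsilon \sim s(\varepsilon)$
Accept if $u \leq a(\varepsilon, \lambda) = \frac{q(h(\varepsilon, \lambda) ; \lambda) \left| \frac{dh}{d\varepsilon}(\varepsilon, \lambda) \right|}{M_\lambda\, s(\varepsilon)}$, $u \sim U[0, 1]$
Then:
$\pi(\varepsilon ; \lambda) = \int_0^1 \pi(\varepsilon, u ; \lambda)\, du = M_\lambda\, s(\varepsilon) \int_0^1 \mathbb{1}[u \leq a(\varepsilon, \lambda)]\, du = q(h(\varepsilon, \lambda) ; \lambda) \left| \frac{dh}{d\varepsilon}(\varepsilon, \lambda) \right|$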

24 Sparse gamma deep exponential family
$w^{l}_{k,k'} \sim \mathrm{Gamma}(0.1, 0.1)$
$x^{l}_{n,k} \sim \mathrm{Gamma}\left(0.1, \frac{0.1}{\sum_{k'} w^{l}_{k,k'} x^{l+1}_{n,k'}}\right)$
$y_{n,d} \sim \mathrm{Poisson}\left(\sum_{k} w^{0}_{k,d}\, x^{1}_{n,k}\right)$
[Figure: graphical model of a deep exponential family, from Deep exponential families, Ranganath et al., 2015]

25 Sparse gamma DEF: Olivetti faces
[Figure: ELBO versus time [s] for ADVI, BBVI, G-REP, and RSVI]
[Kucukelbir et al., 2015; Ranganath et al., 2014; Ruiz et al., 2016]

26 Variational Monte Carlo methods
Monte Carlo methods can enable variational inference for more complex (non-conjugate) models $p(x, y)$.
But what if we can use Monte Carlo methods to also define more flexible $q(x ; \lambda)$?
I will illustrate how we can do this using sequential Monte Carlo for the state space model:
$p(x_{1:T}, y_{1:T}) = \mu(x_1) \prod_{t=2}^{T} f(x_t \mid x_{t-1}) \prod_{t=1}^{T} g(y_t \mid x_t)$

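28 Sequential Monte Carlo (SMC)
Approximation: $\hat{p}(x_{1:t-1} \mid y_{1:t-1}) = \sum_{i=1}^{N} \frac{w^{i}_{t-1}}{\sum_{l} w^{l}_{t-1}}\, \delta_{x^{i}_{1:t-1}}$
Update:
$a^{i}_{t-1} \sim \mathrm{Discrete}\left(w^{j}_{t-1} / \sum_{l} w^{l}_{t-1}\right)$
$x^{i}_{t} \sim r(x_t \mid x^{a^{i}_{t-1}}_{t-1} ; \lambda)$
$w^{i}_{t} = \frac{f(x^{i}_{t} \mid x^{a^{i}_{t-1}}_{t-1})\, g(y_t \mid x^{i}_{t})}{r(x^{i}_{t} \mid x^{a^{i}_{t-1}}_{t-1} ; \lambda)}$
[Figure: particles plotted against time, labeled SIS]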
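
A minimal sketch of the SMC recursion, under assumptions not taken from the slides: a scalar linear-Gaussian state space model (x_1 ~ N(0, 1), x_t = a x_{t-1} + v_t, y_t = x_t + e_t) and the bootstrap proposal r = f, for which the weight reduces to g(y_t | x_t). The model is chosen because a Kalman filter gives the exact log evidence to compare against. All names and parameter values are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def bootstrap_smc(ys, n, a, q, r, rng):
    """SMC with proposal r = f; returns particles, weights, log evidence estimate."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]          # x_1^i ~ mu = N(0, 1)
    ws = []
    log_evidence = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            # Resample ancestors in proportion to the weights, then propagate.
            parents = rng.choices(xs, weights=ws, k=n)
            xs = [rng.gauss(a * xp, math.sqrt(q)) for xp in parents]
        ws = [math.exp(log_normal_pdf(y, x, r)) for x in xs]
        log_evidence += math.log(sum(ws) / n)             # log (1/N) sum_i w_t^i
    return xs, ws, log_evidence

def kalman_log_evidence(ys, a, q, r):
    """Exact log p(y_{1:T}) for the same model, used only as a reference."""
    mean, var, total = 0.0, 1.0, 0.0
    for t, y in enumerate(ys):
        if t > 0:
            mean, var = a * mean, a * a * var + q          # predict
        total += log_normal_pdf(y, mean, var + r)
        gain = var / (var + r)
        mean, var = mean + gain * (y - mean), (1.0 - gain) * var   # update
    return total

a, q, r = 0.9, 0.5, 1.0
rng = random.Random(0)
x, ys = rng.gauss(0.0, 1.0), []
for t in range(20):
    if t > 0:
        x = a * x + rng.gauss(0.0, math.sqrt(q))
    ys.append(x + rng.gauss(0.0, math.sqrt(r)))

estimates = [bootstrap_smc(ys, 200, a, q, r, rng)[2] for _ in range(20)]
smc_mean = sum(estimates) / len(estimates)
exact = kalman_log_evidence(ys, a, q, r)
print(smc_mean, exact)
```

A learned proposal r(· ; λ) would replace the bootstrap choice, with the weight including the ratio f g / r as on the slide.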

29 Variational Sequential Monte Carlo
Key idea: $q(x_{1:T} ; \lambda)$ is defined by running SMC and sampling from $\hat{p}(x_{1:T} \mid y_{1:T})$.
[Figure: densities $p(x \mid y)$, $r(x ; \lambda)$, and $q(x ; \lambda)$ plotted against $x$]

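30 Variational Sequential Monte Carlo
Generative process (VSMC):
1. Run SMC to get $\{(x^{i}_{1:T}, w^{i}_{T})\}_{i=1}^{N}$
2. $b_T \sim \mathrm{Discrete}\left(w^{i}_{T} / \sum_{l} w^{l}_{T}\right)$
3. Return $x_{1:T} = x^{b_T}_{1:T}$
Equivalently, $x_{1:T} \sim \hat{p}(x_{1:T} \mid y_{1:T}) = \sum_{i=1}^{N} \frac{w^{i}_{T}}{\sum_{l} w^{l}_{T}}\, \delta_{x^{i}_{1:T}}$
[Figure: particle trajectories plotted against time]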
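
The three steps can be sketched as follows, again assuming a scalar linear-Gaussian model (x_1 ~ N(0, 1), x_t = a x_{t-1} + v_t, y_t = x_t + e_t) and the bootstrap proposal, neither of which is specified on the slide. Each particle carries its full ancestral path so that step 3 can return a whole trajectory. `vsmc_sample` and the observation values are hypothetical.

```python
import math
import random

def vsmc_sample(ys, n, a, q, r, rng):
    """Draw one trajectory x_{1:T} from the distribution q defined by SMC."""
    paths = [[rng.gauss(0.0, 1.0)] for _ in range(n)]      # x_1^i ~ N(0, 1)
    ws = []
    for t, y in enumerate(ys):
        if t > 0:
            # Step 1 (inside SMC): resample ancestors and extend their paths.
            ancestors = rng.choices(range(n), weights=ws, k=n)
            paths = [paths[i] + [rng.gauss(a * paths[i][-1], math.sqrt(q))]
                     for i in ancestors]
        # Bootstrap weights g(y_t | x_t), up to a constant that cancels
        # when the weights are normalized.
        ws = [math.exp(-0.5 * (y - p[-1]) ** 2 / r) for p in paths]
    # Step 2: b_T ~ Discrete(w_T / sum w_T).  Step 3: return that trajectory.
    b = rng.choices(range(n), weights=ws, k=1)[0]
    return paths[b]

rng = random.Random(0)
path = vsmc_sample([0.3, -0.1, 0.8, 1.1, 0.4], 50, 0.9, 0.5, 1.0, rng)
print(path)
```

Sampling from q is cheap, but its density is a marginal over all particles and ancestor indices, which is why the standard ELBO cannot be evaluated directly.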

31 Distribution of VSMC
The distribution $q$ is the marginal of $x^{b_T}_{1:T}$ over all random variables, $x^{1:N}_{1:T}$, $a^{1:N}_{1:T-1}$, $b_T$, generated by VSMC:
$q(x_{1:T} \mid y_{1:T} ; \lambda) = p(x_{1:T}, y_{1:T})\, \mathbb{E}\left[\hat{p}(y_{1:T})^{-1} \mid x_{1:T}\right]$,
where $\hat{p}(y_{1:T}) = \prod_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} w^{i}_{t}$.

32 Surrogate ELBO
We cannot directly use the standard ELBO, $L(\lambda)$, because evaluating $q$ pointwise is intractable. Consider instead:
$\tilde{L}(\lambda) = \mathbb{E}\left[\log \prod_{t=1}^{T} \left(\frac{1}{N} \sum_{i=1}^{N} w^{i}_{t}\right)\right] = \mathbb{E}[\log \hat{p}(y_{1:T})]$
Theorem 1 (Surrogate ELBO). $\log p(y_{1:T}) \geq L(\lambda) \geq \tilde{L}(\lambda)$.
Proof. (Idea) Use the definition of $L(\lambda)$ for $q(x_{1:T} \mid y_{1:T} ; \lambda)$ and then Jensen's inequality.
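
The outer inequality, E[log p̂(y)] ≤ log p(y), can be checked numerically in the simplest case T = 1, where SMC reduces to importance sampling. The sketch assumes a toy model that is not from the slides (x ~ N(0, 1), y | x ~ N(x, 1), proposal equal to the prior), for which log p(y) is known in closed form. The function names are hypothetical.

```python
import math
import random

def log_evidence_estimate(y, n, rng):
    """log of (1/N) sum_i w^i with x^i ~ N(0, 1) and w^i = N(y; x^i, 1)."""
    ws = [math.exp(-0.5 * (y - rng.gauss(0.0, 1.0)) ** 2) / math.sqrt(2.0 * math.pi)
          for _ in range(n)]
    return math.log(sum(ws) / n)

def surrogate_elbo(y, n, reps, rng):
    """Monte Carlo estimate of E[log p_hat(y)] from reps independent runs."""
    return sum(log_evidence_estimate(y, n, rng) for _ in range(reps)) / reps

y = 1.2
log_py = -0.5 * math.log(4.0 * math.pi) - y * y / 4.0   # y ~ N(0, 2) marginally
rng = random.Random(0)
bound_n1 = surrogate_elbo(y, 1, 2000, rng)
bound_n16 = surrogate_elbo(y, 16, 2000, rng)
print(bound_n1, bound_n16, log_py)
```

The bound tightens as N grows because p̂(y) is unbiased and its variance shrinks, so the Jensen gap between E[log p̂(y)] and log E[p̂(y)] = log p(y) closes.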

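33 Stochastic optimization
We optimize the surrogate ELBO $\tilde{L}(\lambda)$ using stochastic gradient ascent,
$\nabla_\lambda \tilde{L}(\lambda) = \nabla_\lambda \mathbb{E}[\log \hat{p}(y_{1:T})] = \mathbb{E}\left[\nabla_\lambda \log \hat{p}(y_{1:T}) + \log \hat{p}(y_{1:T})\, \nabla_\lambda \log \phi\right]$.
Variance reduction using the reparameterization trick, Rao-Blackwellization, and control variates.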

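34 Theoretical results
For some $c(\lambda) < \infty$,
$\mathrm{KL}\left(q(x_{1:T} ; \lambda) \,\|\, p(x_{1:T} \mid y_{1:T})\right) \leq \frac{c(\lambda)}{N}$.
With $N = bT$,
$\mathrm{KL}\left(q(x_{1:T} ; \lambda) \,\|\, p(x_{1:T} \mid y_{1:T})\right) \leq -\mathbb{E}\left[\log \frac{\hat{p}(y_{1:T})}{p(y_{1:T})}\right] \to \frac{\sigma^2(\lambda)}{2b}$ as $T \to \infty$,
for $\sigma^2(\lambda) < \infty$.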

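35 Experiments - Stochastic volatility
Data: 10 years of monthly returns for the exchange rate of 22 currencies with respect to USD
Model: $x_t = \mu + \phi(x_{t-1} - \mu) + v_t$, $v_t \sim \mathcal{N}(0, Q)$; $y_t = \beta \exp\left(\frac{x_t}{2}\right) e_t$, $e_t \sim \mathcal{N}(0, I)$
[Table: ELBO by method (Structured VI, IWAE, VSMC) for N = 4, N = 8, and N = 16]
[Figure: ELBO versus N for VSMC and IWAE]
Proposal: $r(x_t \mid x_{t-1} ; \lambda, \theta) \propto f(x_t \mid x_{t-1} ; \theta)\, \mathcal{N}(x_t ; \mu_t, \Sigma_t)$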

36 Experiments - Stochastic recurrent neural networks
Data: 105 motor cortex neurons simultaneously recorded in a macaque monkey
Model: $x_t = \mu_\theta(x_{t-1}) + \exp\left(\sigma_\theta(x_{t-1})/2\right) v_t$, $y_t \sim \mathrm{Poisson}\left(\exp(\eta_\theta(x_t))\right)$
[Figure: ELBO versus iterations ($10^3$) for IWAE and VSMC with $d_x = 3$, $d_x = 5$, and $d_x = 10$]
Proposal: $r(x_t \mid x_{t-1}, y_t) \propto f(x_t \mid x_{t-1} ; \theta)\, \mathcal{N}\left(x_t ; \mu_\lambda(y_t), e^{2\sigma_\lambda(y_t)}\right)$, where $\mu_\lambda(\cdot)$, $\sigma_\lambda(\cdot)$ are inference networks.

37 Collaborators
Scott Linderman, Francisco Ruiz, Rajesh Ranganath, David Blei
Work done while visiting Columbia University
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms, Naesseth et al., 2017
Variational Sequential Monte Carlo, Naesseth et al., 2017

38 Take-home messages
Stochastic optimization opens the way for interesting new approximate Bayesian inference methods.
Monte Carlo methods give more flexible variational approximations with guarantees.
Variational methods allow adaptation of Monte Carlo methods based on a global objective.
Contact: christian.a.naesseth@liu.se
Website: users.isy.liu.se/en/rt/chran60/index.html

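39 Subsampling and Variational EM
Data subsampling: $y_{1:T} = \{y^{j}_{1:T}\}_{j=1}^{n}$, $\tilde{L}(\lambda) = \sum_{j=1}^{n} \mathbb{E}\left[\log \hat{p}(y^{j}_{1:T})\right]$
At each iteration sample a mini-batch of data.
Variational EM: $\log p_\theta(y_{1:T}) \geq \tilde{L}(\lambda, \theta) = \mathbb{E}[\log \hat{p}_\theta(y_{1:T})]$
Maximize $\tilde{L}(\lambda, \theta)$ jointly with respect to $\lambda$, $\theta$.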
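
The subsampling step can be sketched generically: a sum over n sequences is replaced by a rescaled sum over a random mini-batch, which is unbiased for the full sum. In the sketch the per-sequence terms are fixed hypothetical numbers standing in for E[log p̂(y^j_{1:T})]; in practice each would itself be a noisy SMC estimate. `minibatch_estimate` is a hypothetical name.

```python
import random

def minibatch_estimate(terms, m, rng):
    """Unbiased estimate of sum(terms) from m indices drawn without replacement."""
    batch = rng.sample(range(len(terms)), m)
    return len(terms) / m * sum(terms[j] for j in batch)

# Hypothetical per-sequence contributions to the surrogate ELBO.
rng = random.Random(0)
terms = [rng.gauss(-5.0, 1.0) for _ in range(100)]
full = sum(terms)
average = sum(minibatch_estimate(terms, 10, rng) for _ in range(5000)) / 5000
print(average, full)
```

Because the mini-batch estimate is unbiased, its gradient can be used in the same stochastic gradient ascent scheme as the full objective, at a fraction of the cost per iteration.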
