Transport maps and dimension reduction for Bayesian computation Youssef Marzouk
1 Transport maps and dimension reduction for Bayesian computation Youssef Marzouk Massachusetts Institute of Technology Department of Aeronautics & Astronautics Center for Computational Engineering
2 Acknowledgments Joint work with: Matt Parno, Tarek Moselhy (MIT) Alessio Spantini, Tiangang Cui, Antti Solonen (MIT); Kody Law (KAUST); James Martin (UT-Austin); Luis Tenorio (Mines) Support from DOE Applied Math program (ASCR)
3 Inverse problems Example: subsurface flow and transport. Forward model (PDE): d = G(θ) + ε, where ε represents observation/model errors. Data d are typically limited, noisy, and indirect. Forward model G may be computationally intensive. Parameter θ is often high- or infinite-dimensional
4 Inverse problems Example: ice-ocean interactions. Pine Island Glacier (left) modeled with the MIT General Circulation Model (right) [P. Conrad, P. Heimbach, YM]
5 UQ for inverse problems We adopt a Bayesian approach:
p(θ | d) ∝ p(d | θ) p(θ)
(posterior ∝ likelihood × prior). Key idea: model parameters θ are treated as random variables
6 Computational challenges Extract information from the posterior by evaluating posterior expectations:
E_π[f] = ∫ f(θ) π(θ) dθ
Key strategies for making this computationally tractable: approximations of the forward model (spectral expansions, local interpolants, reduced-order models, multi-fidelity approaches); efficient and structure-exploiting sampling schemes to explore the posterior distribution
8 Outline 1. Transport maps for Bayesian computation 2. Dimension reduction and dimension-independent posterior sampling
9 Posterior sampling Markov chain Monte Carlo (MCMC) algorithms are the workhorse of Bayesian computation. Metropolis-Hastings kernel K(θ_t, ·):
1. Propose a sample θ′ ~ q(θ′ | θ_t)
2. Compute the acceptance probability: α(θ′, θ_t) = min{1, [π(θ′) q(θ_t | θ′)] / [π(θ_t) q(θ′ | θ_t)]}
3. Update the chain state: θ_{t+1} = θ′ with probability α, or θ_{t+1} = θ_t with probability 1 − α
Good mixing requires the design of effective proposal distributions q(θ′ | θ)
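The three Metropolis-Hastings steps above admit a very short implementation. A minimal pure-Python sketch, using a symmetric random-walk proposal and a standard normal as an illustrative stand-in for the posterior (all names and tuning values here are assumptions, not from the talk):

```python
import math
import random

def log_target(theta):
    # Unnormalized log-density of the target (a standard normal here,
    # standing in for the posterior pi(theta | d)).
    return -0.5 * theta * theta

def metropolis_hastings(log_pi, theta0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis: propose theta' ~ N(theta_t, step^2).

    The proposal is symmetric, so the Hastings ratio q(theta_t|theta')/q(theta'|theta_t)
    cancels and alpha = min(1, pi(theta') / pi(theta_t)).
    """
    rng = random.Random(seed)
    chain = [theta0]
    lp = log_pi(theta0)
    for _ in range(n_steps):
        prop = chain[-1] + step * rng.gauss(0.0, 1.0)
        lp_prop = log_pi(prop)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):   # accept
            chain.append(prop)
            lp = lp_prop
        else:                                                 # reject: repeat state
            chain.append(chain[-1])
    return chain

chain = metropolis_hastings(log_target, theta0=3.0, n_steps=20000)
burned = chain[2000:]                       # discard burn-in
mean = sum(burned) / len(burned)
var = sum((x - mean) ** 2 for x in burned) / len(burned)
```

For this target the chain's long-run mean and variance approach 0 and 1; the quality of mixing is exactly what the proposal-design question on the slide is about.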
10 Posterior sampling Markov chain Monte Carlo (MCMC) algorithms are the workhorse of Bayesian computation Effective = adapted to the target Can we transform proposals or targets for better sampling?
11 Optimal transport Deterministic coupling of two random variables, R ~ µ and Z ~ ν, via Z = T(R). Monge problem:
min_T ∫ c(r, T(r)) µ(dr), subject to T♯µ = ν
A unique and monotone solution exists for quadratic (and other) transport costs c(x, y) [Brenier 1991, McCann 1995]
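In one dimension the quadratic-cost Monge map is simply the monotone rearrangement T = F_ν⁻¹ ∘ F_µ, and for two Gaussians it is linear. A small sketch verifying the pushforward empirically (the distribution choices are illustrative assumptions):

```python
import random

# For 1-D distributions the quadratic-cost Monge map is the monotone
# rearrangement T = F_nu^{-1} o F_mu; for mu = N(m1, s1^2), nu = N(m2, s2^2)
# it reduces to the linear map below.
m1, s1 = 0.0, 1.0                  # reference measure mu
m2, s2 = 5.0, 2.0                  # target measure nu

def T(r):
    return m2 + (s2 / s1) * (r - m1)    # monotone; pushes mu forward to nu

rng = random.Random(3)
rs = [rng.gauss(m1, s1) for _ in range(50000)]
zs = [T(r) for r in rs]
z_mean = sum(zs) / len(zs)
z_std = (sum((z - z_mean) ** 2 for z in zs) / len(zs)) ** 0.5
```

The mapped samples have (approximately) the target's mean and standard deviation, as the pushforward condition T♯µ = ν requires.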
12 Transport maps: computation Alternatives to the Monge problem: seek triangular (Knothe-Rosenblatt) maps. Previous work: directly finding a map from prior to posterior [Moselhy & YM, JCP 2012]. Reference = prior or an uncorrelated multivariate normal; target = posterior:
max_T E[ log π(T(r)) + log det ∇_r T ]
The map has exploitable structure in high-dimensional inverse problems: in an appropriate basis, the map can be expressed as a low-rank perturbation of the identity
13 Combining transport maps with MCMC The previous optimization problem can be costly. The map must be represented in a finite basis (e.g., polynomials) and may thus be approximate. Can we still achieve exact posterior sampling? Key idea: combine map construction with MCMC. Posterior sampling + convex optimization: the transport map preconditions MCMC sampling, while posterior samples enable simpler map construction. Can also be understood in the framework of adaptive MCMC
14 Constructing a map from samples A candidate map T yields an approximation π̃ of the target distribution. Optimization objective:
min_T D_KL( π(θ) ‖ π̃(θ; T) ), equivalently max_T E_π[ log p(T(θ)) + log det D_θ T(θ) ]
Now π is the distribution on the map inputs and p is the distribution on the outputs, with p(r) = N(0, I). Samples from π approximate the expectation; p has useful structure
15 Constructing a map from samples Additional structure: seek a monotone triangular map (converges to the Knothe-Rosenblatt rearrangement):
T(θ) = ( T¹(θ₁), T²(θ₁, θ₂), …, T^D(θ₁, θ₂, …, θ_D) )
Let the target p(r) be multivariate normal with identity covariance. This yields a convex and separable optimization problem:
min_{T^j} (1/N) Σ_{i=1}^{N} [ ½ T^j(θ^{(i)})² − log ∂T^j/∂θ_j (θ^{(i)}) ], s.t. ∂T^j/∂θ_j (θ^{(i)}) ≥ d_min > 0, i ∈ {1, …, N}
Sample-average approximation (SAA) with N samples from π; linear representation of the map T (e.g., polynomial or RBF basis)
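As a toy instance of the SAA objective above, one can fit a one-dimensional linear map T(θ) = a + bθ (monotone whenever b > 0) to samples by gradient descent; since the objective is convex, plain gradient descent suffices. Everything below (target N(2, 0.5²), learning rate, iteration count) is an illustrative assumption, not the talk's implementation:

```python
import random

rng = random.Random(1)
# Samples theta^(i) from the distribution pi to be mapped: here N(2, 0.5^2).
# We minimize the SAA objective from the slide for a linear map T(t) = a + b t,
#   J(a, b) = (1/N) sum_i [ 0.5 * (a + b*theta_i)^2 - log b ],
# whose gradients depend only on the first two sample moments.
thetas = [2.0 + 0.5 * rng.gauss(0.0, 1.0) for _ in range(5000)]
N = len(thetas)
m1 = sum(thetas) / N                      # sufficient statistic: sample mean
m2 = sum(t * t for t in thetas) / N       # sufficient statistic: second moment

a, b = 0.0, 1.0
lr = 0.05
for _ in range(20000):                    # gradient descent on the convex objective
    ga = a + b * m1                       # dJ/da
    gb = a * m1 + b * m2 - 1.0 / b        # dJ/db (includes the -log b barrier term)
    a -= lr * ga
    b -= lr * gb

mapped = [a + b * t for t in thetas]      # pushforward of the samples
mm = sum(mapped) / N
mv = sum((x - mm) ** 2 for x in mapped) / N
```

At the optimum b = 1/s and a = −m/s for the sample mean m and standard deviation s, so the mapped samples are exactly standardized: mean 0, variance 1, matching the standard-normal target p(r).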
16 Map-accelerated MCMC Ingredient #1: static map. Idea: perform MCMC in the reference space, on a preconditioned density. A simple proposal in the reference space (e.g., random walk) corresponds to a more complex/tailored proposal on the target
17 Map-accelerated MCMC Ingredient #1: static map. A simple proposal q_r in the reference space acts on the pushforward of the target through the map, with acceptance probability
α = min{1, [ π(T⁻¹(r′)) |det D_r T⁻¹(r′)| q_r(r | r′) ] / [ π(T⁻¹(r)) |det D_r T⁻¹(r)| q_r(r′ | r) ]}
18 Map-accelerated MCMC Ingredient #1: static map. Equivalently, the same acceptance probability
α = min{1, [ π(T⁻¹(r′)) |det D_r T⁻¹(r′)| q_r(r | r′) ] / [ π(T⁻¹(r)) |det D_r T⁻¹(r)| q_r(r′ | r) ]}
corresponds to a more complex proposal acting directly on the target distribution
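Ingredient #1 can be sketched end-to-end: run a random walk in the reference space on the pushforward density π(T⁻¹(r)) |det D_r T⁻¹(r)|, then carry states back through T⁻¹. Here both the target N(3, 2²) and the (exact, linear) map are illustrative assumptions, chosen so the result is easy to check:

```python
import math
import random

# Target pi: N(3, 2^2) (illustrative). Suppose we have the exact map T with
# T_# pi = N(0, 1):  T(theta) = (theta - 3)/2,  so  T^{-1}(r) = 3 + 2 r.
def log_pi(theta):
    return -0.5 * ((theta - 3.0) / 2.0) ** 2

def T_inv(r):
    return 3.0 + 2.0 * r

LOG_DET_DTINV = math.log(2.0)                 # |d T^{-1}/d r| is constant here

def log_pushforward(r):
    # Log-density of the pushforward T_# pi at r: pi(T^{-1}(r)) |det D_r T^{-1}(r)|
    return log_pi(T_inv(r)) + LOG_DET_DTINV

rng = random.Random(7)
r = 0.0
lp = log_pushforward(r)
samples, accepted = [], 0
for _ in range(20000):
    r_prop = r + 0.8 * rng.gauss(0.0, 1.0)    # simple random walk in reference space
    lp_prop = log_pushforward(r_prop)
    if rng.random() < math.exp(min(0.0, lp_prop - lp)):
        r, lp = r_prop, lp_prop
        accepted += 1
    samples.append(T_inv(r))                  # map the current state back to target space

m = sum(samples) / len(samples)
sd = math.sqrt(sum((x - m) ** 2 for x in samples) / len(samples))
rate = accepted / 20000
```

With an accurate map the pushforward is close to N(0, I), so a fixed-scale random walk mixes well regardless of the target's location and scale; the recovered samples reproduce the target's mean and spread.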
19 Map-accelerated MCMC Ingredient #2: adaptive map. Update the map with each MCMC iteration: more samples from π give a more accurate E_π and hence a better T. Analogous to adaptive MCMC [Haario 2001, Andrieu 2006], but with a nonlinear transformation to capture non-Gaussian structure
21 Map-accelerated MCMC Ingredient #3: global proposals. Once the map becomes sufficiently accurate, we would like to avoid random-walk behavior. Mapped random walk proposal: a reference random walk q_r(r′ | r) = N(r, σ²I) induces q(θ′ | θ) = q_r(T(θ′) | T(θ)) |det D T(θ′)| on the target. But this is inefficient; the changing correlation structure of the target limits mixing
22 Map-accelerated MCMC Ingredient #3: global proposals. Once the map becomes sufficiently accurate, we would like to avoid random-walk behavior. Mapped independence proposal: a reference independence proposal q_r(r′ | r) = N(0, I) induces q(θ′ | θ) = q_r(T(θ′) | T(θ)) |det D T(θ′)| on the target
23 Map-accelerated MCMC Ingredient #3: global proposals. Once the map becomes sufficiently accurate, we would like to avoid random-walk behavior. Solution: delayed-rejection MCMC [Mira 2001]. First proposal = independent sample from p (global, more efficient); second proposal = random walk (local, more robust). The entire scheme is provably ergodic with respect to the exact posterior measure [Parno, YM; arXiv]. Requires enforcing a bi-Lipschitz condition on the maps, to preserve reasonable tail behavior of the target
24 Example: BOD model Goal: infer the parameters of a biochemical oxygen demand (BOD) model, given noisy observations; a simple two-parameter inference problem. A simple BOD model, often used in water quality monitoring, is
B(t) = θ₀ (1 − exp(−θ₁ t))
Flat priors; 20 noisy observations of B(t) are available for t ∈ [1, 5]. [Figures: posterior distribution over (θ₀, θ₁); final reference density p(r)]
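A sketch of this inference setup: forward model, synthetic data, and the flat-prior log-posterior. The "true" parameters and noise level below are illustrative assumptions, not values from the talk:

```python
import math
import random

def bod_forward(theta0, theta1, t):
    # BOD model from the slide: B(t) = theta0 * (1 - exp(-theta1 * t))
    return theta0 * (1.0 - math.exp(-theta1 * t))

# Synthetic data: 20 noisy observations on t in [1, 5]
# (parameter and noise values are illustrative assumptions)
rng = random.Random(0)
sigma = 0.01
t_obs = [1.0 + 4.0 * i / 19 for i in range(20)]
theta_true = (1.0, 0.1)
data = [bod_forward(*theta_true, t) + sigma * rng.gauss(0.0, 1.0) for t in t_obs]

def log_posterior(theta0, theta1):
    # Flat priors => log-posterior = Gaussian log-likelihood, up to a constant
    return -0.5 * sum(
        ((bod_forward(theta0, theta1, t) - y) / sigma) ** 2
        for t, y in zip(t_obs, data)
    )
```

Plugging this `log_posterior` into any of the samplers above reproduces the banana-like posterior geometry that makes this a standard test problem for map-accelerated MCMC.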
25 Example: BOD model Comparison of sampling efficiency for multiple MCMC schemes. [Bar charts of minimum ESS/sec and minimum ESS/evaluation for TM+MIX, TM+DRL, TM+DRG, NUTS, AMALA, DRAM, and smMALA: the transport-map schemes (TM+MIX ≈ 1.8e-1, TM+DRL ≈ 1.6e-1) outperform NUTS (≈ 6.6e-3), smMALA (≈ 2.8e-3), and the other samplers by one to two orders of magnitude]
26 Example: predator-prey model Goal: infer the ODE parameters given a few initial population observations. Model:
dP/dt = r P (1 − P/K) − s P Q / (a + P)
dQ/dt = u P Q / (a + P) − v Q
[Posterior predictive: prey population P(t) and predator population Q(t) versus time]
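The predator-prey ODE above can be integrated with a short fixed-step RK4 loop; the parameter values and initial populations below are illustrative assumptions, not the inferred values:

```python
def rhs(P, Q, r=0.6, K=100.0, s=1.2, a=25.0, u=0.5, v=0.3):
    # Right-hand side from the slide (parameter values are illustrative):
    #   dP/dt = r P (1 - P/K) - s P Q / (a + P)
    #   dQ/dt = u P Q / (a + P) - v Q
    dP = r * P * (1.0 - P / K) - s * P * Q / (a + P)
    dQ = u * P * Q / (a + P) - v * Q
    return dP, dQ

def rk4(P0, Q0, dt=0.01, T=10.0):
    """Classical fourth-order Runge-Kutta with a fixed step."""
    P, Q = P0, Q0
    traj = [(P, Q)]
    for _ in range(int(T / dt)):
        k1 = rhs(P, Q)
        k2 = rhs(P + 0.5 * dt * k1[0], Q + 0.5 * dt * k1[1])
        k3 = rhs(P + 0.5 * dt * k2[0], Q + 0.5 * dt * k2[1])
        k4 = rhs(P + dt * k3[0], Q + dt * k3[1])
        P += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        Q += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        traj.append((P, Q))
    return traj

traj = rk4(P0=50.0, Q0=5.0)
```

Wrapping `rk4` as the forward model G(θ), with θ = (P₀, Q₀, r, K, s, a, u, v), gives the eight-dimensional inference problem used in the efficiency comparisons that follow.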
27 Example: predator-prey model Goal: infer the ODE parameters given a few initial population observations. The posterior distribution has complex structure. [Figure: pairwise posterior marginals over P₀, Q₀, r, K, s, a, u, v]
28 Example: predator-prey model Sampling efficiency comparisons. Table 5: performance of MCMC samplers on the predator-prey parameter inference problem. For this 8-dimensional problem, the maximum correlation time and corresponding minimum effective sample size are provided; relative performance compared to DRAM is reported in the Rel. ESS/sec and Rel. ESS/eval columns. Methods compared: DRAM, smMALA, AMALA, TM+DRG, TM+DRL, TM+RWM, TM+MIX, TM+LA. [table values omitted]
29 Map-accelerated MCMC Important characteristics of the approach: map construction is easily parallelizable; (currently) requires no gradients of the posterior density; generalizes many current MCMC techniques. Adaptive Metropolis: the map enables non-Gaussian proposals, mixing local and global moves. Manifold MCMC [Girolami & Calderhead 2011]: the map defines a Riemannian metric; lines in the reference space become geodesics on this manifold. What about high dimensions? How to choose or enrich the space of triangular maps T?
30 Outline 1. Transport maps for Bayesian computation 2. Dimension reduction and dimension-independent posterior sampling
31 Dimension reduction Bayesian inverse problems are in principle infinite-dimensional (practically high-dimensional). But what controls the intrinsic dimensionality?
1. Smoothing from the prior distribution (e.g., correlation)
2. How many observations are used
3. How much the forward model filters the parameters (ill-posedness of the classical/deterministic problem)
More formally, #2 and #3 contribute to the low rank of the data-misfit Hessian H(x) = −∇²_x log p(d | x). Key idea: low dimensionality lies in the change from prior to posterior
32 Low-dimensional structure Algorithm design: identify/use directions along which posterior is most changed relative to the prior Developments along this theme: Linear Bayesian inverse problem: optimality results, fast posterior mean estimators Nonlinear Bayesian inverse problem: approximate posterior, explicit dimension reduction Infinite-dimensional setting: dimension-independent + likelihood-informed (DILI) samplers
33 Linear Bayesian IP Simple linear-Gaussian model:
d = G x + ε, x ~ N(0, Γ_pr), ε ~ N(0, Γ_obs); H := Gᵀ Γ_obs⁻¹ G
Exact posterior covariance:
Γ_pos = Γ_pr − Γ_pr Gᵀ Γ_d⁻¹ G Γ_pr, with Γ_d := Γ_obs + G Γ_pr Gᵀ
Posterior mean:
µ_pos = Γ_pos Gᵀ Γ_obs⁻¹ d
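The posterior covariance above agrees with the information form (Γ_pr⁻¹ + H)⁻¹ by the Woodbury identity, and the posterior mean has the equivalent Kalman-style form Γ_pr Gᵀ Γ_d⁻¹ d. A few lines of NumPy confirm both on a random instance (the dimensions and covariance choices are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                                   # parameter / data dimensions (illustrative)
G = rng.standard_normal((m, n))               # linear forward operator
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T + n * np.eye(n)            # SPD prior covariance
Gamma_obs = 0.1 * np.eye(m)                   # observation-noise covariance

# Information form: Gamma_pos = (Gamma_pr^{-1} + H)^{-1}, H = G^T Gamma_obs^{-1} G
H = G.T @ np.linalg.inv(Gamma_obs) @ G
Gamma_pos_info = np.linalg.inv(np.linalg.inv(Gamma_pr) + H)

# Form from the slide: Gamma_pos = Gamma_pr - Gamma_pr G^T Gamma_d^{-1} G Gamma_pr
Gamma_d = Gamma_obs + G @ Gamma_pr @ G.T
Gamma_pos = Gamma_pr - Gamma_pr @ G.T @ np.linalg.inv(Gamma_d) @ G @ Gamma_pr

d = rng.standard_normal(m)
mu_pos = Gamma_pos @ G.T @ np.linalg.inv(Gamma_obs) @ d       # posterior mean
mu_alt = Gamma_pr @ G.T @ np.linalg.inv(Gamma_d) @ d          # equivalent form
```

The slide's form makes the prior-to-posterior "update" explicit: the posterior covariance is the prior covariance minus a (typically low-rank) negative-semidefinite correction, which is exactly the structure exploited next.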
34 Linear Bayesian IP Consider approximations of the form
M_r := { Γ̂_pos = Γ_pr − K Kᵀ ≻ 0, rank(K) ≤ r }
Theorem: optimal approximations of the posterior covariance [Spantini et al., arXiv]. Optimality in the Förstner-Moonen metric, rather than Frobenius. Reinforces the importance of the prior-preconditioned Hessian [Flath, Ghattas, et al. 2011]
35 Linear Bayesian IP (Optimal covariance approximation) Let (δᵢ², ẑᵢ) be the eigenvalue-eigenvector pairs of the prior-preconditioned Hessian Ĥ = Γ_pr^{1/2} H Γ_pr^{1/2}. Then a Γ̂_pos ∈ M_r that minimizes d_F(Γ_pos, Γ̂_pos) is
Γ̂_pos = Γ_pr − Σ_{i=1}^{r} δᵢ² (1 + δᵢ²)⁻¹ ŵᵢ ŵᵢᵀ, where ŵᵢ = Γ_pr^{1/2} ẑᵢ
Here d_F is a Riemannian metric on the space of real symmetric positive definite matrices [Förstner & Moonen 2003]:
d_F²(A, B) = tr ln²(A^{−1/2} B A^{−1/2}) = Σᵢ ln²(λᵢ(A, B))
Also yields the optimal approximation of the Gaussian posterior in Kullback-Leibler divergence and in the Hellinger metric
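The optimal low-rank update can be sketched directly from the eigendecomposition of the prior-preconditioned Hessian. With m < n observations the Hessian has rank at most m, so the rank-m update already recovers the exact posterior covariance; the problem sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 2                                   # few data -> low-rank update (illustrative)
G = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
Gamma_pr = B @ B.T + n * np.eye(n)            # SPD prior covariance
Gamma_obs = np.eye(m)

H = G.T @ np.linalg.inv(Gamma_obs) @ G        # data-misfit Hessian, rank <= m
lam, V = np.linalg.eigh(Gamma_pr)
S = V @ np.diag(np.sqrt(lam)) @ V.T           # symmetric square root Gamma_pr^{1/2}

# Eigenpairs (delta_i^2, z_i) of the prior-preconditioned Hessian S H S
delta2, Z = np.linalg.eigh(S @ H @ S)
order = np.argsort(delta2)[::-1]              # largest delta_i^2 first
delta2, Z = delta2[order], Z[:, order]

def approx_posterior_cov(r):
    """Optimal rank-r update: Gamma_pr - sum_{i<=r} delta_i^2/(1+delta_i^2) w_i w_i^T."""
    W = S @ Z[:, :r]                          # w_i = Gamma_pr^{1/2} z_i
    return Gamma_pr - W @ np.diag(delta2[:r] / (1.0 + delta2[:r])) @ W.T

Gamma_pos = np.linalg.inv(np.linalg.inv(Gamma_pr) + H)   # exact posterior covariance
```

The factors δᵢ²/(1 + δᵢ²) show why directions with small δᵢ (weakly data-informed) can be dropped: their contribution to the prior-to-posterior update is negligible.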
36 Linear Bayesian IP Theorem: optimal approximations of the posterior mean [Spantini et al., arXiv]. Consider approximations in the form of a low-rank matrix applied to the observations:
µ̂_pos = A d, A ∈ A_r := { A : rank(A) ≤ r }
Find A ∈ A_r minimizing the Bayes risk under squared-error loss:
A* = argmin_{A ∈ A_r} E_{x,d} ‖A d − x‖²_{Γ_pos⁻¹}
Result: easily computed from a low-rank approximation of Γ_obs^{−1/2} G Γ_pr^{1/2}
37 Linear Bayesian IP [Plot: normalized error of the posterior-mean approximation µ̂_pos versus the order r of the approximating class]
38 Nonlinear Bayesian IP Nonlinear forward model: the Hessian depends on the parameter. Find a linear subspace that captures the likelihood-informed directions; use the posterior expectation of the prior-preconditioned Hessian, estimated from MCMC samples [Cui et al., Inverse Problems 2014]. Decompose
x = Γ_pr^{1/2} U_r x_r + Γ_pr^{1/2} (I − U_r U_rᵀ) x_⊥, x_⊥ ~ N(0, I)
and approximate
π(x | d) ≈ π̃(x | d) ∝ p(d | Γ_pr^{1/2} U_r x_r) p(x_r) π_⊥(x_⊥) = π_r(x_r | d) π_⊥(x_⊥)
where the complement factor π_⊥ is known analytically. Enables: accelerated MCMC in lower dimensions; variance reduction via Rao-Blackwellization
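The decomposition of a whitened draw is lossless by construction, since U_r U_rᵀ + (I − U_r U_rᵀ) = I. A short sketch, with a random orthonormal basis standing in for the actual likelihood-informed subspace (which in practice comes from the posterior-averaged prior-preconditioned Hessian):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 2                                   # ambient and LIS dimensions (illustrative)
B = rng.standard_normal((n, n))
Gamma_pr = B @ B.T + n * np.eye(n)
lam, V = np.linalg.eigh(Gamma_pr)
S = V @ np.diag(np.sqrt(lam)) @ V.T           # Gamma_pr^{1/2}

# Hypothetical LIS basis: r orthonormal columns in the whitened coordinates
U_r, _ = np.linalg.qr(rng.standard_normal((n, r)))

xi = rng.standard_normal(n)                   # whitened draw, xi ~ N(0, I)
x_r = U_r.T @ xi                              # likelihood-informed coordinates
x_perp = (np.eye(n) - U_r @ U_r.T) @ xi       # complement component
x = S @ (U_r @ x_r) + S @ x_perp              # decomposition from the slide
```

MCMC then only needs to explore the r coordinates x_r; the complement is handled analytically under the prior, which is what enables the Rao-Blackwellized variance reduction.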
39 Nonlinear Bayesian IP Posterior variance: elliptic inverse problem (4800 dimensions). [Panels: (a) LIS, (b) CS, (c) LIS + CS, (d) full-space MCMC]
40 DILI samplers What about exact sampling in the function-space setting? MH kernel:
ν(du, du′) = q(u, du′) µ(du), ν^T(du, du′) = q(u′, du) µ(du′)
α(u, u′) = min{1, dν^T/dν (u, u′)}
Need ν^T ≪ ν to define a valid acceptance probability. How? Use discretizations of a Langevin SDE that exactly preserve a Gaussian measure dominating the posterior. Dimension-independent + likelihood-informed (DILI): a general class of operator-weighted MCMC proposals on function space. Operators encode LIS structure; finite + infinite dimensions [Cui, Law, YM; arXiv]
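The prototypical proposal that exactly preserves a Gaussian reference measure is preconditioned Crank-Nicolson (pCN): u′ = √(1 − β²) u + β ξ with ξ drawn from the reference. Because the reference is invariant, the acceptance ratio involves only the likelihood, independent of the discretization dimension. A one-dimensional sanity check (the β value is an illustrative assumption):

```python
import math
import random

# pCN proposal: u' = sqrt(1 - beta^2) u + beta * xi, with xi ~ N(0, C).
# If u ~ N(0, C), then u' ~ N(0, C): the Gaussian reference is exactly
# preserved, so the MH acceptance ratio depends only on the likelihood.
beta = 0.5                                    # step parameter (illustrative)
rng = random.Random(4)

def pcn_step(u):
    return math.sqrt(1.0 - beta * beta) * u + beta * rng.gauss(0.0, 1.0)

us = [rng.gauss(0.0, 1.0) for _ in range(100000)]   # draws from N(0, 1)
vs = [pcn_step(u) for u in us]                      # one proposal step each
v_mean = sum(vs) / len(vs)
v_var = sum((v - v_mean) ** 2 for v in vs) / len(vs)
```

Invariance follows from (1 − β²) + β² = 1 for Gaussians; DILI generalizes this by replacing the scalar β with operators that take larger steps in the likelihood-informed directions.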
41 DILI samplers Sampling efficiency, elliptic PDE inverse problem (n = 1600). LI improves mixing significantly over pCN; DI and/or a particular choice of LIS improve it further. [Autocorrelation plots at two SNR levels (including SNR 50) for MGLI-Langevin, MGLI-Prior, LI-Langevin, LI-Prior, H-Langevin, PCN-RW]
42 DILI samplers Sampling efficiency, conditioned diffusion (n = 1000). DI and LI are both helpful. [Autocorrelation plots for MGLI-Langevin, MGLI-Prior, LI-Langevin, LI-Prior, H-Langevin, PCN-RW]
43 DILI samplers Sampling efficiency, conditioned diffusion (n = 1000). The posterior-averaged Hessian yields improvements over the MAP Hessian. [Autocorrelation plot: MAP-LIS vs. Adapt-LIS]
44 Ongoing work Transport-accelerated MCMC in high dimensions: combine transport maps with LIS ideas, so that the map's departure from the identity is limited to a few directions. Maps without MCMC: adaptive methods for high dimensions; maps from reference to target (rather than the reverse); enrichment via the first variation of the function-space optimization problem + progressive deflation. Maps for nonlinear filtering. Code (MUQ) available for download at: mituq/muq
45 References
Moselhy, T.; Marzouk, Y.M. Bayesian inference with optimal maps. J. Comput. Phys. 231(23): 7815-7850 (2012).
Cui, T.; Martin, J.; Marzouk, Y.M.; Solonen, A.; Spantini, A. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems 30 (2014).
Spantini, A.; Solonen, A.; Cui, T.; Martin, J.; Tenorio, L.; Marzouk, Y.M. Optimal low-rank approximation of linear Bayesian inverse problems. arXiv preprint (2014).
Cui, T.; Law, K.J.H.; Marzouk, Y.M. Dimension-independent likelihood-informed MCMC. arXiv preprint (2014).
Parno, M.; Marzouk, Y.M. Transport map accelerated MCMC. arXiv preprint (2014).
46 Nonlinear Bayesian IP [Backup figures: log-likelihood trace vs. MCMC iteration; autocorrelations of several LIS vectors; autocorrelation vs. CPU time (seconds) for subspace vs. full-space sampling]
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationSome Results on the Ergodicity of Adaptive MCMC Algorithms
Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship
More informationControl Variates for Markov Chain Monte Carlo
Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability
More information1 Geometry of high dimensional probability distributions
Hamiltonian Monte Carlo October 20, 2018 Debdeep Pati References: Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.11 (2011): 2. Betancourt, Michael. A conceptual
More informationAnswers and expectations
Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E
More informationAdaptive HMC via the Infinite Exponential Family
Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet
Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes
More informationLikelihood-free MCMC
Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte
More informationTransport maps for accelerated Bayesian computation. Matthew David Parno
Transport maps for accelerated Bayesian computation by Matthew David Parno S.M. Computation for Design and Optimization, Massachusetts Institute of Technology (2011) B.S. Applied Mathematics and Statistics,
More informationBayesian parameter estimation in predictive engineering
Bayesian parameter estimation in predictive engineering Damon McDougall Institute for Computational Engineering and Sciences, UT Austin 14th August 2014 1/27 Motivation Understand physical phenomena Observations
More informationExercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters
Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for
More informationBayesian Prediction of Code Output. ASA Albuquerque Chapter Short Course October 2014
Bayesian Prediction of Code Output ASA Albuquerque Chapter Short Course October 2014 Abstract This presentation summarizes Bayesian prediction methodology for the Gaussian process (GP) surrogate representation
More informationMultimodal Nested Sampling
Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationLecture 8: The Metropolis-Hastings Algorithm
30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationPattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods
Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs
More informationAnnealing Between Distributions by Averaging Moments
Annealing Between Distributions by Averaging Moments Chris J. Maddison Dept. of Comp. Sci. University of Toronto Roger Grosse CSAIL MIT Ruslan Salakhutdinov University of Toronto Partition Functions We
More informationApproximate Bayesian Computation: a simulation based approach to inference
Approximate Bayesian Computation: a simulation based approach to inference Richard Wilkinson Simon Tavaré 2 Department of Probability and Statistics University of Sheffield 2 Department of Applied Mathematics
More informationLecture 6: Bayesian Inference in SDE Models
Lecture 6: Bayesian Inference in SDE Models Bayesian Filtering and Smoothing Point of View Simo Särkkä Aalto University Simo Särkkä (Aalto) Lecture 6: Bayesian Inference in SDEs 1 / 45 Contents 1 SDEs
More informationIntroduction to Hamiltonian Monte Carlo Method
Introduction to Hamiltonian Monte Carlo Method Mingwei Tang Department of Statistics University of Washington mingwt@uw.edu November 14, 2017 1 Hamiltonian System Notation: q R d : position vector, p R
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationRandomize-Then-Optimize for Sampling and Uncertainty Quantification in Electrical Impedance Tomography
SIAM/ASA J. UNCERTAINTY QUANTIFICATION Vol. 3, pp. 1136 1158 c 2015 Society for Industrial and Applied Mathematics and American Statistical Association Randomize-Then-Optimize for Sampling and Uncertainty
More informationMarkov chain Monte Carlo
1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop
More informationMCMC and Gibbs Sampling. Kayhan Batmanghelich
MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction
More informationAn introduction to adaptive MCMC
An introduction to adaptive MCMC Gareth Roberts MIRAW Day on Monte Carlo methods March 2011 Mainly joint work with Jeff Rosenthal. http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops
More informationMonte Carlo methods for sampling-based Stochastic Optimization
Monte Carlo methods for sampling-based Stochastic Optimization Gersende FORT LTCI CNRS & Telecom ParisTech Paris, France Joint works with B. Jourdain, T. Lelièvre, G. Stoltz from ENPC and E. Kuhn from
More informationECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering
ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:
More informationUniversity of Toronto Department of Statistics
Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationA SCALED STOCHASTIC NEWTON ALGORITHM FOR MARKOV CHAIN MONTE CARLO SIMULATIONS
A SCALED STOCHASTIC NEWTON ALGORITHM FOR MARKOV CHAIN MONTE CARLO SIMULATIONS TAN BUI-THANH AND OMAR GHATTAS Abstract. We propose a scaled stochastic Newton algorithm ssn) for local Metropolis-Hastings
More informationEfficient Variational Inference in Large-Scale Bayesian Compressed Sensing
Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing George Papandreou and Alan Yuille Department of Statistics University of California, Los Angeles ICCV Workshop on Information
More information19 : Slice Sampling and HMC
10-708: Probabilistic Graphical Models 10-708, Spring 2018 19 : Slice Sampling and HMC Lecturer: Kayhan Batmanghelich Scribes: Boxiang Lyu 1 MCMC (Auxiliary Variables Methods) In inference, we are often
More information