Scalable algorithms for optimal experimental design for infinite-dimensional nonlinear Bayesian inverse problems

1 Scalable algorithms for optimal experimental design for infinite-dimensional nonlinear Bayesian inverse problems
Alen Alexanderian (Math/NC State), Omar Ghattas (ICES/UT-Austin), Noémi Petra (Applied Math/UC Merced), Georg Stadler (Courant/NYU)
Data Assimilation and Inverse Problems, Mathematics Institute, University of Warwick, Coventry, United Kingdom
February 23, 2016

2 The conceptual optimal experimental design problem
Data: $d \in \mathbb{R}^q$
Model parameter field: $m = \{m(x)\}_{x \in D}$
Bayesian inverse problem: use the data $d$ to infer the model parameter field $m$ in the Bayesian sense, i.e., update the prior state of knowledge on $m$.
Optimal experimental design: how to design the observation system for $d$ to optimally infer $m$?

3 Some relevant references on OED for PDEs
- E. Haber, L. Horesh, L. Tenorio, Numerical methods for experimental design of large-scale linear ill-posed inverse problems, Inverse Problems, 24.
- E. Haber, L. Horesh, L. Tenorio, Numerical methods for the design of large-scale nonlinear discrete ill-posed inverse problems, Inverse Problems, 26.
- E. Haber, Z. Magnant, C. Lucero, L. Tenorio, Numerical methods for A-optimal designs with a sparsity constraint for ill-posed inverse problems, Computational Optimization and Applications, 1-22.
- Q. Long, M. Scavino, R. Tempone, S. Wang, Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations, Computer Methods in Applied Mechanics and Engineering, 259:24-39.
- Q. Long, M. Scavino, R. Tempone, S. Wang, A Laplace method for under-determined Bayesian optimal experimental designs, Computer Methods in Applied Mechanics and Engineering, 285.
- F. Bisetti, D. Kim, O. Knio, Q. Long, R. Tempone, Optimal Bayesian experimental design for priors of compact support with application to shock-tube experiments for combustion kinetics, International Journal for Numerical Methods in Engineering.
- X. Huan, Y. M. Marzouk, Simulation-based optimal Bayesian experimental design for nonlinear systems, Journal of Computational Physics, 232.
- X. Huan, Y. M. Marzouk, Gradient-based stochastic optimization methods in Bayesian experimental design, International Journal for Uncertainty Quantification, 4(6).

4 How we define an optimal design
Given locations $x_i$ of possible sensor sites, choose optimal weights $w_i$ for a subset (using sparsity constraints):
$$ \text{design} := \{ x_1, \dots, x_{n_s};\; w_1, \dots, w_{n_s} \} $$
[Figure: ground truth + sensor locations; MAP solution]
Obtain data, solve the Bayesian inverse problem: data + likelihood + prior => posterior.
We seek an A-optimal design: minimize the average posterior variance.

5 Bayesian inverse problem in Hilbert space (A. Stuart, 2010)
$D$: bounded domain; $H = L^2(D)$; $m \in H$: parameter
Parameter-to-observable map: $f : H \to \mathbb{R}^q$
Additive Gaussian noise: $d = f(m) + \eta$, $\eta \sim \mathcal{N}(0, \Gamma_{\text{noise}})$
Likelihood: $\pi_{\text{like}}(d \mid m) \propto \exp\big\{ -\tfrac{1}{2} (f(m) - d)^T \Gamma_{\text{noise}}^{-1} (f(m) - d) \big\}$
Gaussian prior measure: $\mu_{\text{prior}} = \mathcal{N}(m_{\text{pr}}, \mathcal{C}_{\text{prior}})$
Bayes' theorem (in infinite dimensions): $\dfrac{d\mu_{\text{post}}}{d\mu_{\text{prior}}} \propto \pi_{\text{like}}(d \mid m)$
A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica, 2010.

6 The experimental design and a weighted inference problem
$$ \text{design} := \{ x_1, \dots, x_{n_s};\; w_1, \dots, w_{n_s} \} $$
Want $w_i \in \{0, 1\}$: relax to $0 \le w_i \le 1$, then threshold back to $w_i \in \{0, 1\}$.
$w$-weighted data likelihood:
$$ \pi_{\text{like}}(d \mid m; w) \propto \exp\Big\{ -\tfrac{1}{2} (f(m) - d)^T W_\sigma (f(m) - d) \Big\} $$
$f$: parameter-to-observable map, $W := \operatorname{diag}(w_i)$, $W_\sigma := \frac{1}{\sigma^2_{\text{noise}}} W$.
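
To make the role of the weights concrete, here is a minimal Python sketch (with hypothetical names) of the misfit term of the $w$-weighted likelihood. It assumes uncorrelated noise with a single variance $\sigma^2_{\text{noise}}$, as in the definition of $W_\sigma$ above.

```python
import numpy as np

def weighted_misfit(d, f_of_m, w, sigma_noise):
    """Misfit term of the w-weighted likelihood:
    0.5 * (f(m) - d)^T W_sigma (f(m) - d),  with W_sigma = diag(w) / sigma_noise^2."""
    r = f_of_m - d                         # residual at all candidate sensors
    return 0.5 * np.sum(w * r**2) / sigma_noise**2

# Toy usage: 5 candidate sensors, only sensors 0 and 3 active.
d = np.array([1.0, 0.2, -0.5, 0.7, 0.1])        # observed data
f_of_m = np.array([0.9, 0.3, -0.4, 0.8, 0.0])   # model predictions f(m)
w = np.array([1.0, 0.0, 0.0, 1.0, 0.0])         # design weights
print(weighted_misfit(d, f_of_m, w, sigma_noise=0.1))
```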

7 A-optimal design in Hilbert space
Covariance function: $c_{\text{post}}(x, y) = \operatorname{Cov}\{ m(x), m(y) \}$
Average posterior variance: $\dfrac{1}{|D|} \displaystyle\int_D c_{\text{post}}(x, x)\, dx$
Covariance operator: $[\mathcal{C}_{\text{post}} u](x) = \displaystyle\int_D c_{\text{post}}(x, y)\, u(y)\, dy$
Mercer's theorem: $\displaystyle\int_D c_{\text{post}}(x, x)\, dx = \operatorname{tr}(\mathcal{C}_{\text{post}})$
A-optimal design criterion: choose a sensor configuration to minimize $\operatorname{tr}(\mathcal{C}_{\text{post}})$.

8 Relationship to some other OED criteria
For Gaussian prior and noise, and a linear parameter-to-observable map:
Maximizing the expected information gain is equivalent to D-optimal design:
$$ \mathbb{E}_{\mu_{\text{prior}}}\big\{ \mathbb{E}_{d \mid m} \{ D_{\mathrm{KL}}(\mu_{\text{post}}, \mu_{\text{prior}}) \} \big\} = -\tfrac{1}{2} \log\det \Gamma_{\text{post}} + \tfrac{1}{2} \log\det \Gamma_{\text{prior}} $$
Minimizing the Bayes risk of the posterior mean is equivalent to A-optimal design:
$$ \mathbb{E}_{\mu_{\text{prior}}}\big\{ \mathbb{E}_{d \mid m} \{ \| m - m_{\text{post}}(d) \|^2 \} \big\} = \int\!\!\int \| m - m_{\text{post}}(d) \|^2\, \pi_{\text{like}}(d \mid m)\, dd\, \mu_{\text{prior}}(dm) = \operatorname{tr}(\Gamma_{\text{post}}) $$
For infinite dimensions, see: A. Alexanderian, P. Gloor, and O. Ghattas, On Bayesian A- and D-optimal experimental designs in infinite dimensions, Bayesian Analysis, to appear.
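
For the finite-dimensional linear-Gaussian case, a one-line sketch of why the Bayes risk of the posterior mean equals $\operatorname{tr}(\Gamma_{\text{post}})$:

```latex
% Sketch: in the linear-Gaussian case, conditionally on d the error
% m - m_post(d) is distributed as N(0, Gamma_post), independently of d, so
\mathbb{E}_{\mu_{\mathrm{prior}}}\,\mathbb{E}_{d\mid m}\,\|m - m_{\mathrm{post}}(d)\|^2
  = \mathbb{E}_{d}\,\mathbb{E}_{m\mid d}\,\|m - m_{\mathrm{post}}(d)\|^2
  = \mathbb{E}_{d}\,\operatorname{tr}(\Gamma_{\mathrm{post}})
  = \operatorname{tr}(\Gamma_{\mathrm{post}}).
```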

9 Challenges
- The infinite-dimensional Bayesian inverse problem is merely an inner problem within OED.
- We need the trace of the posterior covariance operator, which would be intractable in the large-scale setting.
- We approximate the posterior covariance by the inverse of the Hessian at the maximum a posteriori (MAP) point, but explicit construction of the Hessian is still very expensive.
- For a nonlinear inverse problem, the Hessian depends on the data, which are not available a priori.
- For efficient optimization, we need derivatives of the trace of the inverse Hessian with respect to the sensor weights.
- We seek scalable algorithms, i.e., those whose cost (in terms of forward PDE solves) is independent of the data dimension and the parameter dimension.

10 A-optimal design problem: Linear case
Linear parameter-to-observable map: $f(m) := Fm$
Gaussian prior measure: $\mu_{\text{prior}} = \mathcal{N}(m_{\text{prior}}, \mathcal{C}_{\text{prior}})$
Gaussian prior and noise $\Rightarrow$ Gaussian posterior:
$$ m_{\text{post}} = \arg\min_m\; \underbrace{\tfrac{1}{2} \big\| W^{1/2} (Fm - d) \big\|^2_{\Gamma_{\text{noise}}^{-1}} + \tfrac{1}{2} \big\langle \mathcal{C}_{\text{prior}}^{-1} (m - m_{\text{prior}}),\, m - m_{\text{prior}} \big\rangle}_{\mathcal{J}(m)\, :=\, \text{negative log-posterior}} $$
OED problem: $\mathcal{C}_{\text{post}}(w) = \mathcal{H}^{-1}(w)$, with $\mathcal{H}(w) = F^* W_\sigma F + \mathcal{C}_{\text{prior}}^{-1}$:
$$ \min_w\; \operatorname{tr}\Big[ \big( F^* W_\sigma F + \mathcal{C}_{\text{prior}}^{-1} \big)^{-1} \Big] + \text{penalty} $$
A. Alexanderian, N. Petra, G. Stadler, O. Ghattas, A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized $\ell_0$-sparsification, SIAM Journal on Scientific Computing, 36(5):A2122-A2148.
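
As a sanity check of the linear-case criterion, the sketch below (Python, hypothetical names) evaluates $\operatorname{tr}\big[(F^T W_\sigma F + C_{\text{prior}}^{-1})^{-1}\big]$ for a small dense problem. Explicit inverses are of course only feasible at toy scale; the large-scale setting is exactly what the rest of the talk addresses.

```python
import numpy as np

def a_optimal_objective(F, C_prior, w, sigma_noise):
    """tr[(F^T W_sigma F + C_prior^{-1})^{-1}] for a small, dense linear problem.
    F: (q x n) parameter-to-observable map, C_prior: (n x n) prior covariance,
    w: length-q design weights."""
    W_sigma = np.diag(w) / sigma_noise**2
    H = F.T @ W_sigma @ F + np.linalg.inv(C_prior)   # posterior precision
    return np.trace(np.linalg.inv(H))                # sum of posterior variances

# Toy usage: 4 candidate observations of a 3-dimensional parameter.
rng = np.random.default_rng(0)
F = rng.standard_normal((4, 3))
C_prior = np.eye(3)
print(a_optimal_objective(F, C_prior, w=np.array([1.0, 0.0, 1.0, 0.0]), sigma_noise=0.1))
```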

11 A-optimal design problem: Nonlinear case
Nonlinear case: Gaussian prior and noise do NOT imply a Gaussian posterior.
Use a Gaussian approximation to the posterior: for given $w$ and $d$,
- compute the maximum a posteriori probability (MAP) estimate $m_{\text{MAP}}(w; d) := \arg\min_m \mathcal{J}(m, w, d)$;
- form the Gaussian approximation to the posterior at the MAP point: $\mathcal{N}\big( m_{\text{MAP}}(w; d),\, \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d), w; d ] \big)$.
Problem: the data $d$ are not available a priori.
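
The following small Python sketch illustrates the two steps above on a hypothetical two-parameter nonlinear map (the functions f and f_jac are made up for illustration). It uses a Gauss-Newton approximation of the Hessian at the MAP point rather than the full Hessian that the slides work with.

```python
import numpy as np
from scipy.optimize import minimize

def f(m):                                   # toy nonlinear parameter-to-observable map
    return np.array([np.exp(m[0]) + m[1], m[0] * m[1]])

def f_jac(m):                               # its Jacobian
    return np.array([[np.exp(m[0]), 1.0],
                     [m[1],         m[0]]])

def neg_log_post(m, d, w, sigma, C_prior_inv, m_prior):
    r = f(m) - d
    return 0.5 * np.sum(w * r**2) / sigma**2 \
         + 0.5 * (m - m_prior) @ C_prior_inv @ (m - m_prior)

d = np.array([2.0, 0.5]); w = np.array([1.0, 1.0]); sigma = 0.1
C_prior_inv = np.eye(2); m_prior = np.zeros(2)

# MAP point for the given design w and data d.
m_map = minimize(neg_log_post, x0=m_prior,
                 args=(d, w, sigma, C_prior_inv, m_prior)).x

# Gauss-Newton Hessian at the MAP point; its inverse is the covariance
# of the Gaussian (Laplace) approximation of the posterior.
J = f_jac(m_map)
H = J.T @ np.diag(w / sigma**2) @ J + C_prior_inv
Gamma_post = np.linalg.inv(H)
print(m_map, np.trace(Gamma_post))
```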

12 A-optimal designs for nonlinear inverse problems
General formulation: minimize the average posterior variance:
$$ \min_w\; \operatorname{tr}\big( \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d), w; d ] \big) $$

13 A-optimal designs for nonlinear inverse problems
General formulation: minimize the expected average posterior variance:
$$ \min_w\; \mathbb{E}_d \big\{ \operatorname{tr}\big( \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d), w; d ] \big) \big\} $$

14 A-optimal designs for nonlinear inverse problems
General formulation: minimize the expected average posterior variance:
$$ \min_w\; \mathbb{E}_{\mu_{\text{prior}}} \mathbb{E}_{d \mid m} \big\{ \operatorname{tr}\big( \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d), w; d ] \big) \big\} $$

15 A-optimal designs for nonlinear inverse problems
General formulation: minimize the expected average posterior variance:
$$ \min_w\; \mathbb{E}_{\mu_{\text{prior}}} \mathbb{E}_{d \mid m} \big\{ \operatorname{tr}\big( \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d), w; d ] \big) \big\} + \gamma P(w) $$
$P(w)$: sparsifying penalty function.
In practice: get data from a few training models $m_1, \dots, m_{n_d}$, where the $m_i$ are draws from the prior.
Training data: $d_i = f(m_i) + \eta_i$, $i = 1, \dots, n_d$.

16 A-optimal designs for nonlinear inverse problems
General formulation: minimize the expected average posterior variance:
$$ \min_w\; \mathbb{E}_{\mu_{\text{prior}}} \mathbb{E}_{d \mid m} \big\{ \operatorname{tr}\big( \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d), w; d ] \big) \big\} + \gamma P(w) $$
$P(w)$: sparsifying penalty function.
In practice: get data from a few training models $m_1, \dots, m_{n_d}$ ($m_i$: draws from the prior); training data $d_i = f(m_i) + \eta_i$, $i = 1, \dots, n_d$.
The problem to solve in practice:
$$ \min_w\; \frac{1}{n_d} \sum_{i=1}^{n_d} \operatorname{tr}\big( \mathcal{H}^{-1}[ m_{\text{MAP}}(w; d_i), w; d_i ] \big) + \gamma P(w) $$
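
A minimal Python sketch of generating the training data used in the sample-average objective above; sample_prior, forward_map, and the other names are hypothetical placeholders for the problem-specific components.

```python
import numpy as np

def make_training_data(sample_prior, forward_map, sigma_noise, n_d, seed=0):
    """Draw n_d training models from the prior and synthesize noisy data
    d_i = f(m_i) + eta_i for the sample-average OED objective."""
    rng = np.random.default_rng(seed)
    models, data = [], []
    for _ in range(n_d):
        m_i = sample_prior(rng)                              # draw from the prior
        y_i = forward_map(m_i)                               # f(m_i)
        d_i = y_i + sigma_noise * rng.standard_normal(y_i.shape)  # add Gaussian noise
        models.append(m_i)
        data.append(d_i)
    return models, data

# Toy usage with a hypothetical 3-parameter linear model observed at 4 sensors.
F = np.random.default_rng(1).standard_normal((4, 3))
models, data = make_training_data(
    sample_prior=lambda rng: rng.standard_normal(3),
    forward_map=lambda m: F @ m,
    sigma_noise=0.1, n_d=5)
print(len(data), data[0].shape)
```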

17 Randomized trace estimation
$A \in \mathbb{R}^{n \times n}$ symmetric.
Trace estimator: $\operatorname{tr}(A) \approx \dfrac{1}{n_{\text{tr}}} \displaystyle\sum_{i=1}^{n_{\text{tr}}} z_i^T A z_i$, with $z_i$ random vectors.
Gaussian trace estimator: $z_i$ independent draws from $\mathcal{N}(0, I)$. For $z \sim \mathcal{N}(0, I)$:
$$ \mathbb{E}\{ z^T A z \} = \operatorname{tr}(A), \qquad \operatorname{Var}\{ z^T A z \} = 2 \| A \|_F^2 $$
Clustered eigenvalues $\Rightarrow$ good approximation with small $n_{\text{tr}}$. This is an efficient means of approximating the trace of the inverse Hessian.
M. Hutchinson, A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines (1990).
H. Avron and S. Toledo, Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix (2011).
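
A short Python sketch of the Gaussian trace estimator described above; it needs only matrix-vector products with $A$ (here via a hypothetical apply_A callback), which is what makes it attractive when $A$ is an implicitly defined inverse Hessian.

```python
import numpy as np

def gaussian_trace_estimate(apply_A, n, n_tr, seed=0):
    """Randomized (Gaussian) trace estimator: tr(A) ~ (1/n_tr) sum_i z_i^T A z_i.
    apply_A(v) returns A @ v, so A never needs to be formed explicitly."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_tr):
        z = rng.standard_normal(n)          # z ~ N(0, I)
        total += z @ apply_A(z)
    return total / n_tr

# Toy check against the exact trace of a random SPD matrix.
rng = np.random.default_rng(1)
B = rng.standard_normal((200, 200))
A = B @ B.T                                 # symmetric positive semi-definite
print(gaussian_trace_estimate(lambda v: A @ v, n=200, n_tr=50), np.trace(A))
```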

18 Optimization problem for computing A-optimal design
OED objective function:
$$ \Psi(w) = \frac{1}{n_d} \sum_{i=1}^{n_d} \operatorname{tr}\big[ \mathcal{H}^{-1}(w; d_i) \big] $$

19 Optimization problem for computing A-optimal design
OED objective function:
$$ \Psi(w) = \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \big\langle z_k,\, \mathcal{H}^{-1}(w; d_i)\, z_k \big\rangle $$

20 Optimization problem for computing A-optimal design
OED objective function:
$$ \Psi(w) = \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \big\langle z_k,\, \underbrace{\mathcal{H}^{-1}(w; d_i)\, z_k}_{y_{ik}} \big\rangle $$
Auxiliary variable: $y_{ik} = \mathcal{H}^{-1}(w; d_i)\, z_k$

21 Optimization problem for computing A-optimal design
OED objective function:
$$ \Psi(w) = \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \big\langle z_k,\, \underbrace{\mathcal{H}^{-1}(w; d_i)\, z_k}_{y_{ik}} \big\rangle $$
Auxiliary variable: $y_{ik} = \mathcal{H}^{-1}(w; d_i)\, z_k$
OED optimization problem:
$$ \min_w\; \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \langle z_k, y_{ik} \rangle + \gamma P(w) $$
where
$$ m_{\text{MAP}}(w; d_i) = \arg\min_{m \in H} \mathcal{J}(m, w; d_i), \quad i = 1, \dots, n_d $$
$$ \mathcal{H}\big( m_{\text{MAP}}(w; d_i), w; d_i \big)\, y_{ik} = z_k, \quad i = 1, \dots, n_d,\; k = 1, \dots, n_{\text{tr}} $$
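
A Python sketch of evaluating the randomized-trace objective $\Psi(w)$ with matrix-free Hessian solves by conjugate gradients. The hessian_matvec callback is a hypothetical stand-in; in the PDE setting each application of the Hessian involves incremental forward and adjoint solves, as detailed on the later slides.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def oed_objective(hessian_matvec, z_vectors, n_d):
    """Psi(w) = 1/(n_d * n_tr) * sum_{i,k} <z_k, y_ik>, with H(w; d_i) y_ik = z_k
    solved matrix-free by CG. hessian_matvec(i, v) applies the Hessian for
    training datum d_i to the vector v."""
    n = z_vectors[0].size
    n_tr = len(z_vectors)
    psi = 0.0
    for i in range(n_d):
        H_i = LinearOperator((n, n), matvec=lambda v, i=i: hessian_matvec(i, v))
        for z in z_vectors:
            y, _ = cg(H_i, z)              # y_ik = H^{-1}(w; d_i) z_k
            psi += z @ y
    return psi / (n_d * n_tr)

# Toy usage: the same SPD "Hessian" for every training datum.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50)); A = B @ B.T + 50 * np.eye(50)
zs = [rng.standard_normal(50) for _ in range(10)]
print(oed_objective(lambda i, v: A @ v, zs, n_d=3))
```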

22 Application: subsurface flow
Forward problem:
$$ -\nabla \cdot (e^m \nabla u) = f \;\; \text{in } D, \qquad u = g \;\; \text{on } \Gamma_D, \qquad e^m \nabla u \cdot n = h \;\; \text{on } \Gamma_N $$
$u$: pressure field; $m$: log-permeability field (inversion parameter).
[Figure. Left: true parameter; Right: pressure field]
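
To make the forward model concrete, here is a minimal 1D finite-difference analogue of the Darcy problem above, a sketch under simplified assumptions: Dirichlet conditions at both endpoints (the slide uses mixed Dirichlet/Neumann), arithmetic averaging of the conductivity at cell interfaces, and a dense solve.

```python
import numpy as np

def solve_darcy_1d(m, f, g_left, g_right, L=1.0):
    """Finite-difference solve of -(e^m u')' = f on (0, L) with
    u(0) = g_left and u(L) = g_right. m and f are nodal arrays."""
    n = m.size
    h = L / (n - 1)
    kappa = np.exp(m)
    k_half = 0.5 * (kappa[:-1] + kappa[1:])   # conductivity at cell interfaces
    A = np.zeros((n, n)); b = f.copy() * h**2
    for j in range(1, n - 1):
        A[j, j - 1] = -k_half[j - 1]
        A[j, j]     =  k_half[j - 1] + k_half[j]
        A[j, j + 1] = -k_half[j]
    A[0, 0] = 1.0;  b[0] = g_left             # Dirichlet boundary conditions
    A[-1, -1] = 1.0; b[-1] = g_right
    return np.linalg.solve(A, b)

# Toy usage: constant source, heterogeneous log-permeability.
x = np.linspace(0.0, 1.0, 101)
u = solve_darcy_1d(m=np.sin(2 * np.pi * x), f=np.ones_like(x), g_left=0.0, g_right=0.0)
print(u.max())
```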

23 Bayesian inverse problem: Gaussian approximation
The MAP point is the solution to
$$ \min_m\; \mathcal{J}(m, w; d) := \tfrac{1}{2} (Bu - d)^T W_\sigma (Bu - d) + \tfrac{1}{2} \langle m - m_{\text{prior}},\, m - m_{\text{prior}} \rangle_E $$
where
$$ -\nabla \cdot (e^m \nabla u) = f \;\; \text{in } D, \qquad u = g \;\; \text{on } \Gamma_D, \qquad e^m \nabla u \cdot n = 0 \;\; \text{on } \Gamma_N $$
$B$ is the observation operator.
Cameron-Martin inner product: $\langle m_1, m_2 \rangle_E = \langle \mathcal{C}_{\text{prior}}^{-1/2} m_1,\, \mathcal{C}_{\text{prior}}^{-1/2} m_2 \rangle$ for $m_1, m_2 \in \operatorname{range}(\mathcal{C}_{\text{prior}}^{1/2})$.
Covariance of the Gaussian approximation to the posterior: with $\bar m = m_{\text{MAP}}(w; d)$,
$$ \mathcal{C} = \mathcal{H}(\bar m, w; d)^{-1} = \big[ \nabla^2_m \mathcal{J}(\bar m, w; d) \big]^{-1} $$

24 Optimization problem for A-optimal design
$$ \min_{w \in [0,1]^{n_s}}\; \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \langle z_k, y_{ik} \rangle + \gamma P(w) $$
where, for $i = 1, \dots, n_d$ and $k = 1, \dots, n_{\text{tr}}$, the MAP optimality system (the last equation is $\nabla_m \mathcal{J}(m_i) = 0$):
$$ -\nabla \cdot (e^{m_i} \nabla u_i) = f $$
$$ -\nabla \cdot (e^{m_i} \nabla p_i) = -B^* W_\sigma (B u_i - d_i) $$
$$ \mathcal{C}_{\text{prior}}^{-1} (m_i - m_{\text{pr}}) + e^{m_i} \nabla u_i \cdot \nabla p_i = 0 $$
and the Hessian system $\mathcal{H}(m_i)\, y_{ik} = z_k$ (with incremental state/adjoint variables $v_{ik}$, $q_{ik}$):
$$ -\nabla \cdot (e^{m_i} \nabla v_{ik}) = \nabla \cdot (y_{ik}\, e^{m_i} \nabla u_i) $$
$$ -\nabla \cdot (e^{m_i} \nabla q_{ik}) = \nabla \cdot (y_{ik}\, e^{m_i} \nabla p_i) - B^* W_\sigma B v_{ik} $$
$$ \mathcal{C}_{\text{prior}}^{-1} y_{ik} + y_{ik}\, e^{m_i} \nabla u_i \cdot \nabla p_i + e^{m_i} \big( \nabla v_{ik} \cdot \nabla p_i + \nabla u_i \cdot \nabla q_{ik} \big) = z_k $$

25 The Lagrangian for the OED problem
$$ \mathcal{L}_{\text{OED}}\big( w, \{u_i\}, \{m_i\}, \{p_i\}, \{v_{ik}\}, \{q_{ik}\}, \{y_{ik}\}, \{u_i^*\}, \{m_i^*\}, \{p_i^*\}, \{v_{ik}^*\}, \{q_{ik}^*\}, \{y_{ik}^*\} \big) := \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \langle z_k, y_{ik} \rangle + \gamma P(w) $$
$$ + \sum_{i=1}^{n_d} \big[ \langle e^{m_i} \nabla u_i, \nabla u_i^* \rangle - \langle f, u_i^* \rangle - \langle h, u_i^* \rangle_{\Gamma_N} \big] $$
$$ + \sum_{i=1}^{n_d} \big[ \langle e^{m_i} \nabla p_i, \nabla p_i^* \rangle + \langle B^* W_\sigma (B u_i - d_i), p_i^* \rangle \big] $$
$$ + \sum_{i=1}^{n_d} \big[ \langle m_i - m_{\text{pr}}, m_i^* \rangle_E + \langle m_i^* e^{m_i} \nabla u_i, \nabla p_i \rangle \big] $$
$$ + \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \big[ \langle e^{m_i} \nabla v_{ik}, \nabla v_{ik}^* \rangle + \langle y_{ik}\, e^{m_i} \nabla u_i, \nabla v_{ik}^* \rangle \big] $$
$$ + \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \big[ \langle e^{m_i} \nabla q_{ik}, \nabla q_{ik}^* \rangle + \langle y_{ik}\, e^{m_i} \nabla p_i, \nabla q_{ik}^* \rangle + \langle B^* W_\sigma B v_{ik}, q_{ik}^* \rangle \big] $$
$$ + \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \big[ \langle y_{ik}^* e^{m_i} \nabla v_{ik}, \nabla p_i \rangle + \langle y_{ik}, y_{ik}^* \rangle_E + \langle y_{ik}^* e^{m_i} \nabla u_i, \nabla q_{ik} \rangle + \langle y_{ik}^* y_{ik}\, e^{m_i} \nabla u_i, \nabla p_i \rangle - \langle z_k, y_{ik}^* \rangle \big] $$

26 The adjoint OED problem and the gradient
The adjoint OED problem for $\{u_i^*\}, \{m_i^*\}, \{p_i^*\}, \{v_{ik}^*\}, \{q_{ik}^*\}, \{y_{ik}^*\}$:
$$ B^* W_\sigma B q_{ik}^* - \nabla \cdot (y_{ik}^* e^{m_i} \nabla p_i) - \nabla \cdot (e^{m_i} \nabla v_{ik}^*) = 0 $$
$$ e^{m_i} \nabla q_{ik}^* \cdot \nabla p_i + \mathcal{C}_{\text{prior}}^{-1} y_{ik}^* + y_{ik}^* e^{m_i} \nabla u_i \cdot \nabla p_i + e^{m_i} \nabla u_i \cdot \nabla v_{ik}^* = \frac{1}{n_d\, n_{\text{tr}}}\, z_k $$
$$ -\nabla \cdot (e^{m_i} \nabla q_{ik}^*) - \nabla \cdot (y_{ik}^* e^{m_i} \nabla u_i) = 0 $$
$$ B^* W_\sigma B p_i^* - \nabla \cdot (m_i^* e^{m_i} \nabla p_i) - \nabla \cdot (e^{m_i} \nabla u_i^*) = b_i^1 $$
$$ e^{m_i} \nabla p_i^* \cdot \nabla p_i + \mathcal{C}_{\text{prior}}^{-1} m_i^* + m_i^* e^{m_i} \nabla u_i \cdot \nabla p_i + e^{m_i} \nabla u_i \cdot \nabla u_i^* = b_i^2 $$
$$ -\nabla \cdot (e^{m_i} \nabla p_i^*) - \nabla \cdot (m_i^* e^{m_i} \nabla u_i) = b_i^3 $$
Gradient for the OED problem (elementwise products over the sensor entries):
$$ g(w) = \sum_{i=1}^{n_d} \Gamma_{\text{noise}}^{-1} (B u_i - d_i) \odot B p_i^* + \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \Gamma_{\text{noise}}^{-1} (B v_{ik}) \odot (B q_{ik}^*) $$
$q_{ik}^*$, $y_{ik}^*$, and $v_{ik}^*$ can be eliminated: $q_{ik}^* = v_{ik}$, $y_{ik}^* = y_{ik}$, $v_{ik}^* = q_{ik}$.

27 Computing the OED objective function and its gradient
Solve the inverse problems: $m_i(w)$, $u_i(m_i(w))$, and $p_i(m_i(w))$, $i = 1, \dots, n_d$.
Solve for $(v_{ik}, y_{ik}, q_{ik})$, $i = 1, \dots, n_d$, $k = 1, \dots, n_{\text{tr}}$:
$$ -\nabla \cdot (e^{m_i} \nabla v_{ik}) = \nabla \cdot (y_{ik}\, e^{m_i} \nabla u_i) $$
$$ -\nabla \cdot (e^{m_i} \nabla q_{ik}) = \nabla \cdot (y_{ik}\, e^{m_i} \nabla p_i) - B^* W_\sigma B v_{ik} $$
$$ \mathcal{C}_{\text{prior}}^{-1} y_{ik} + y_{ik}\, e^{m_i} \nabla u_i \cdot \nabla p_i + e^{m_i} \big( \nabla v_{ik} \cdot \nabla p_i + \nabla u_i \cdot \nabla q_{ik} \big) = z_k $$
Objective function evaluation: $\Psi(w) = \dfrac{1}{n_d\, n_{\text{tr}}} \displaystyle\sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \langle z_k, y_{ik} \rangle$
Solve for $(p_i^*, m_i^*, u_i^*)$, $i = 1, \dots, n_d$:
$$ B^* W_\sigma B p_i^* - \nabla \cdot (m_i^* e^{m_i} \nabla p_i) - \nabla \cdot (e^{m_i} \nabla u_i^*) = b_i^1 $$
$$ e^{m_i} \nabla p_i^* \cdot \nabla p_i + \mathcal{C}_{\text{prior}}^{-1} m_i^* + m_i^* e^{m_i} \nabla u_i \cdot \nabla p_i + e^{m_i} \nabla u_i \cdot \nabla u_i^* = b_i^2 $$
$$ -\nabla \cdot (e^{m_i} \nabla p_i^*) - \nabla \cdot (m_i^* e^{m_i} \nabla u_i) = b_i^3 $$
Gradient:
$$ g(w) = \sum_{i=1}^{n_d} \Gamma_{\text{noise}}^{-1} (B u_i - d_i) \odot B p_i^* - \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \Gamma_{\text{noise}}^{-1} (B v_{ik}) \odot (B v_{ik}) $$

28 Computational cost: the number of PDE solves
$r$: rank of the Hessian of the data misfit; independent of the mesh (beyond a certain resolution) and independent of the sensor dimension (beyond a certain dimension).
Cost (in PDE solves) of evaluating the OED objective function and gradient:
1. Cost of solving the inverse problems: $2\, r\, n_d\, N_{\text{Newton}}$
2. Cost of computing the OED objective: $2\, r\, n_{\text{tr}}\, n_d$
3. Cost of computing the OED gradient: $2\, n_{\text{tr}}\, n_d + 2\, r\, n_d$
A low-rank approximation to the misfit Hessian reduces the cost in (2) and (3) to $2\, r\, n_d$.
OED optimization problem: solved via a quasi-Newton interior-point method; the number of quasi-Newton iterations is insensitive to the parameter/sensor dimension.
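
The low-rank approximation mentioned in the last bullet can be computed with a standard randomized eigendecomposition; the sketch below (Python, matrix-free, with hypothetical names such as apply_Hmisfit and oversample) illustrates the idea, though it is not necessarily the exact variant used by the authors.

```python
import numpy as np

def randomized_low_rank(apply_Hmisfit, n, rank, oversample=10, seed=0):
    """Randomized truncated eigendecomposition of a symmetric PSD misfit
    Hessian, H_misfit ~ V diag(lam) V^T, using only matvecs. Each matvec
    with the true misfit Hessian costs two PDE solves, so the total cost
    is roughly 2*(rank + oversample) PDE solves."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, rank + oversample))
    Y = np.column_stack([apply_Hmisfit(Omega[:, j]) for j in range(Omega.shape[1])])
    Q, _ = np.linalg.qr(Y)                        # orthonormal basis for the range
    T = Q.T @ np.column_stack([apply_Hmisfit(Q[:, j]) for j in range(Q.shape[1])])
    lam, S = np.linalg.eigh(T)                    # small dense eigenproblem
    idx = np.argsort(lam)[::-1][:rank]
    return Q @ S[:, idx], lam[idx]                # eigenvectors V, eigenvalues lam

# Toy check on a matrix with a rapidly decaying spectrum.
n = 300
U, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
eigs = 10.0 ** (-np.arange(n) / 10.0)
H = (U * eigs) @ U.T
V, lam = randomized_low_rank(lambda v: H @ v, n, rank=20)
print(np.max(np.abs(lam - eigs[:20]) / eigs[:20]))   # relative eigenvalue error
```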

29 The prior and training models
Prior knowledge: parameter value at a few points, correlation.
[Figure: truth, prior mean, prior variance]
Prior mean: smooth least-squares fit to point measurements:
$$ m_{\text{pr}} = \arg\min_m\; \tfrac{1}{2} \langle m, A m \rangle + \frac{\alpha}{2} \sum_{i=1}^{N} \int_D \delta_i(x)\, \big( m(x) - m_{\text{true}}(x) \big)^2\, dx $$
Prior covariance: $\mathcal{C}_{\text{prior}} = \big( A + \alpha \sum_{i=1}^{N} \delta_i \big)^{-2}$, with $A m = -\nabla \cdot (D \nabla m)$.
Draws from the prior are used to generate the training data for OED.

30 Computing an optimal design with sparsification
Optimization problem:
$$ \min_w\; \frac{1}{n_d\, n_{\text{tr}}} \sum_{i=1}^{n_d} \sum_{k=1}^{n_{\text{tr}}} \langle z_k, y_{ik} \rangle + \gamma P_\varepsilon(w) $$
where
$$ m_{\text{MAP}}(w; d_i) = \arg\min_{m \in H} \mathcal{J}(m, w; d_i), \qquad \mathcal{H}\big( m_{\text{MAP}}(w; d_i), w; d_i \big)\, y_{ik} = z_k $$
Here $n_d = 5$, $n_{\text{tr}} = 20$.
Continuation $\ell_0$-penalty: $P_\varepsilon(w) := \sum_{i=1}^{n_s} f_\varepsilon(w_i)$; solve the problem with decreasing $\varepsilon$.
[Figures: the functions $f_\varepsilon$; the optimal weight vector $w$]
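
A small Python sketch of the continuation idea. The specific form of $f_\varepsilon$ below is a hypothetical choice (the slides do not spell it out), and solve_relaxed_oed and eps_schedule are placeholder names for the inner relaxed OED solver and the continuation schedule.

```python
import numpy as np

def f_eps(w, eps):
    """One possible smooth approximation of the l0 'indicator' on [0, 1]:
    f_eps(w) -> 1{w > 0} pointwise as eps -> 0 (hypothetical choice)."""
    return w**2 / (w**2 + eps**2)

def penalty(w, eps):
    """Continuation l0-penalty P_eps(w) = sum_i f_eps(w_i)."""
    return np.sum(f_eps(w, eps))

def solve_with_continuation(solve_relaxed_oed, w0, eps_schedule=(1.0, 0.1, 0.01)):
    """Solve a sequence of relaxed OED problems with decreasing eps,
    warm-starting each solve from the previous weights."""
    w = w0
    for eps in eps_schedule:
        w = solve_relaxed_oed(w, eps)
    return w

# Tiny illustration of how the penalty sharpens as eps decreases.
w = np.array([0.0, 0.05, 0.5, 1.0])
for eps in (1.0, 0.1, 0.01):
    print(eps, penalty(w, eps))
```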

31 Effectiveness of the optimal design
Numerical test: how well can we recover the true model from true data?
[Figure: Optimal, Random, Truth. Designs with 10 sensors; comparing the MAP point (top row) and the posterior variance (bottom row).]

32 Effectiveness of the optimal design
Numerical test: how well can we recover the true model from true data?
[Table: $\operatorname{tr}(\Gamma_{\text{post}}(w))$ and $E_{\text{rel}}(w)$ for random vs. optimal designs with 10 and 20 sensors]
$$ d = f(m_{\text{true}}) + \eta, \qquad \Gamma_{\text{post}}(w) := \mathcal{H}^{-1}\big( m_{\text{MAP}}(w; d), w; d \big), \qquad E_{\text{rel}}(w) := \frac{\| m_{\text{MAP}}(w; d) - m_{\text{true}} \|}{\| m_{\text{true}} \|} $$

33 Effectiveness of the optimal design
Numerical test: how well do we do on average?
[Table: $\Psi(w)$ and $E_{\text{rel}}(w)$ for random vs. optimal designs with 10 and 20 sensors]
$$ E_{\text{rel}}(w) = \frac{1}{n_d} \sum_{i=1}^{n_d} \frac{\| m_{\text{MAP}}(w; d_i) - m_i \|}{\| m_i \|}, \qquad \Psi(w) = \frac{1}{n_d} \sum_{i=1}^{n_d} \operatorname{tr}\big( \mathcal{H}^{-1}( m_{\text{MAP}}(w; d_i), w; d_i ) \big) $$
The $d_i$ are generated from $n_d = 50$ draws of $m_i$ from the prior (different from the $n_d = 5$ used to solve the OED problem).

34 Scalability with respect to parameter/sensor dimension
[Plots: cost of objective function/gradient evaluation and of the OED optimization (inner CG iterations, outer CG iterations, BFGS iterations) versus parameter dimension (500 to 5,000) and versus sensor dimension.]

35 Inversion of a realistic permeability field
$$ -\nabla \cdot (e^m \nabla u) = f \;\; \text{in } D, \qquad u = 0 \;\; \text{on } \Gamma_D, \qquad e^m \nabla u \cdot n = 0 \;\; \text{on } \Gamma_N $$
$\Gamma_D$: at the corners; $\Gamma_N$: the rest of the boundary; point source at the center.
Log-permeability field from the SPE Comparative Solution Project; injection well at the center, production wells at the four corners.
Parameter (and state) dimension: $n = 10{,}202$; potential sensor locations: $n_s = 128$.

36 SPE model: the prior
[Figure: prior mean, standard deviation, and a random draw]

37 SPE model: inference with the optimal design
[Figure: posterior samples, truth, posterior mean]

38 SPE model: Compare optimal vs. random design
[Figure: optimal and random designs with 32 sensors; the average posterior variance is reported for each design]
Observed data collected using the true parameter on a high-resolution mesh.

39 Challenges revisited
- Infinite-dimensional inference: proper choice of prior, mass-weighted inner product.
- Need the trace of the posterior covariance (inverse of the Hessian: large, dense, expensive matvecs): randomized trace estimator, Gaussian approximation in the nonlinear case.
- In a nonlinear inverse problem, the Hessian depends on the data (not available a priori): random draws from the prior to generate training data.
- Optimal design has combinatorial complexity in general: $w$-weighted formulation, relax weights to $0 \le w_j \le 1$, continuation to recover 0/1 weights.
- Conventional OED algorithms are intractable for large-scale problems: infinite-dimensional formulation, gradient-based optimization, scalable algorithms.
- The Gaussian assumption (on the posterior of the inverse problem) remains via the linearized parameter-to-observable map; current work is aimed at higher-order approximations.
A. Alexanderian, N. Petra, G. Stadler, and O. Ghattas, A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems, SIAM Journal on Scientific Computing, 38(1):A243-A272.
