Bayesian Inverse Problems, Data Assimilation and Localization
Xin T. Tong
National University of Singapore
ICIP, Singapore, 2018
Content
- What is a Bayesian inverse problem?
- What is data assimilation?
- What is the curse of dimensionality?
- Localization for algorithms.
Bayesian formulation

Deterministic inverse problem
- Given $h$ and data $y$, find the state $x$ so that $y = h(x)$.
- Tikhonov regularization solves instead
  $\min_x \|y - h(x)\|_R^2 + \|x\|_C^2$,
  where $R, C$ are PSD matrices and $\|x\|_C^2 = x^T C^{-1} x$.

Bayesian inverse problem
- Prior: $x \sim p_0 = N(0, C)$.
- Observe $y = h(x) + v$, $v \sim N(0, R)$, so $y \sim p_l(y|x)$.
- Bayes formula for the posterior distribution:
  $p(x|y) = \dfrac{p_0(x)\, p_l(y|x)}{\int p_0(x)\, p_l(y|x)\,dx} \propto p_0(x)\, p_l(y|x)$.
- Goal: obtain $p(x|y)$.
Connection to deterministic IP
- Posterior density: $p(x|y) \propto p_0(x)\, p_l(y|x) \propto \exp\!\big(-\tfrac12 \|x\|_C^2 - \tfrac12 \|y - h(x)\|_R^2\big)$.
- Maximum a posteriori (MAP) estimator:
  $\max_x p(x|y) \iff \min_x \|y - h(x)\|_R^2 + \|x\|_C^2$.
- The deterministic IP finds the MAP, but $p(x|y)$ can contain much richer information.
- It quantifies the uncertainty in $x$, given the data $y$.
Uncertainty quantification (UQ)
- Estimate the accuracy and robustness of the solution.
- Find all the local minima, i.e., likely states.
- Risk: what is the chance a hurricane hits a city?
- Streaming data and estimation on the fly.
- Cost: more complicated computation.
Monte Carlo

Direct method
- Can we simply compute $p(x|y)$? This involves quadrature (integration) and PDE solves.
- Not feasible if $\dim(x) > 5$; in practice $\dim(x) = 10 \sim 10^8$.
- One exception: if $p_0$ is Gaussian and $h$ is linear, the Kalman filter gives $p(x|y)$ as an explicit Gaussian.

Monte Carlo
- Generate samples $x^k \sim p(x|y)$, $k = 1, \dots, N$.
- Use sample statistics as an approximation.
- Local minima: clusters of samples.
- Sample variance: how uncertain are the estimators?
- Risk: what percentage of samples lead to bad outcomes?
Markov chain Monte Carlo (MCMC)
Given a target distribution $p(x)$, generate a sequence of samples $x^1, x^2, \dots, x^N$.

Standard MCMC steps
- Generate a proposal $x^* \sim q(x^k, \cdot)$.
- Accept with probability $\alpha(x^k, x^*) = \min\{1,\; p(x^*)\, q(x^*, x^k) / [p(x^k)\, q(x^k, x^*)]\}$.

Popular choices of proposals, with $\xi^k \sim N(0, I_n)$:
- RWM: $x^* = x^k + \sigma \xi^k$.
- MALA: $x^* = x^k + \tfrac{\sigma^2}{2} \nabla \log p(x^k) + \sigma \xi^k$.
- pCN: $x^* = \sqrt{1 - \beta^2}\, x^k + \beta \xi^k$.
Also emcee and Hamiltonian MCMC. $\sigma, \beta$ are tuning parameters.
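As a concrete illustration, here is a minimal random walk Metropolis sketch in Python/numpy; the function and variable names are illustrative, and `log_p` is any user-supplied log-density.

```python
import numpy as np

def rwm(log_p, x0, n_steps, sigma, rng=None):
    """Random walk Metropolis: propose x* = x^k + sigma*xi, accept with prob. min{1, p(x*)/p(x^k)}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    chain = np.empty((n_steps, x.size))
    n_accept = 0
    for k in range(n_steps):
        prop = x + sigma * rng.standard_normal(x.size)   # symmetric proposal, so q cancels in alpha
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:          # Metropolis accept/reject
            x, lp = prop, lp_prop
            n_accept += 1
        chain[k] = x
    return chain, n_accept / n_steps

# Example: sample the isotropic Gaussian N(0, I_d)
d = 100
chain, acc = rwm(lambda x: -0.5 * x @ x, np.zeros(d), n_steps=5000, sigma=d ** -0.5)
print("acceptance rate:", acc)
```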
How does MCMC work in high dimension?
- Sample the isotropic Gaussian $p = N(0, I_n)$.
- Measure of efficiency: integrated autocorrelation time (IACT), which measures how many iterations are needed to obtain an uncorrelated sample.
[Figure: IACT versus dimension for RWM, MALA, emcee and Hamiltonian MCMC; the IACT grows with dimension for all of these methods.]
Where is the problem?
Let's look at RWM, and assume $x^k = 0$.
- Propose $x^* = \sigma \xi^k$.
- Accept with probability $\exp(-\tfrac12 \sigma^2 \|\xi^k\|^2) \approx \exp(-\tfrac12 \sigma^2 d)$.
- If we keep $\sigma = 1$, we essentially never accept once $d > 20$.
- If we want acceptance at a constant rate, we need $\sigma \sim d^{-1/2}$. But then $x^* = x^k + \sigma \xi^k$ is highly correlated with $x^k$.
- Similarly for MALA, $\sigma \sim d^{-1/3}$; for Hamiltonian MCMC, $\sigma \sim d^{-1/4}$.
Is it possible to break this curse of dimensionality?
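A quick numerical check of this scaling (an illustration, not from the slides): for the target $N(0, I_d)$ and $x^k = 0$, estimate the average acceptance probability $E[\min\{1, \exp(-\tfrac12 \sigma^2 \|\xi\|^2)\}]$ for $\sigma = 1$ and $\sigma = d^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_accept(d, sigma, n=20000):
    """Monte Carlo estimate of E[min(1, exp(-0.5*sigma^2*||xi||^2))] with xi ~ N(0, I_d)."""
    xi = rng.standard_normal((n, d))
    return np.mean(np.minimum(1.0, np.exp(-0.5 * sigma**2 * np.sum(xi**2, axis=1))))

for d in [10, 100, 1000]:
    print(d, mean_accept(d, 1.0), mean_accept(d, d ** -0.5))
# sigma = 1: acceptance collapses to ~0 as d grows; sigma = d^{-1/2}: stays near exp(-1/2).
```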
How to beat the curse of dimensionality?

Gaussian structure
- Target $p \approx N(m, \Sigma)$, $X = m + \Sigma^{1/2} Z$.
- Approximate the area near the MAP with a Gaussian.
- High-dimensional Gaussian + low-dimensional non-Gaussian part.

Low effective dimension
- Components have diminishing uncertainty/importance.
- Example: Fourier modes of a PDE solution.
- Function space MCMC (Stuart 2010).

[Video: Lorenz system propagation, MIT AeroAstro.]
Data assimilation
Sequential inverse problem: $x_0 \sim p_0$,
  Hidden: $x_{n+1} = \Psi(x_n) + \xi_{n+1}$,
  Data: $y_n = h_n(x_n) + v_n$.
- Estimate $x_n$ based on $y_1, \dots, y_n$.
- Bayesian approach: find $p(x_n | y_1, y_2, \dots, y_n)$.
- Online update from $p(x_n | y_1, \dots, y_n)$ to $p(x_{n+1} | y_1, \dots, y_n, y_{n+1})$, without storing $y_1, \dots, y_n$; see the recursion below.
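For reference, the online update is the standard two-step filtering recursion: a forecast (Chapman–Kolmogorov) step followed by a Bayes update with the new observation.

```latex
\begin{align*}
\text{forecast:} \quad & p(x_{n+1} \mid y_{1:n}) = \int p(x_{n+1} \mid x_n)\, p(x_n \mid y_{1:n})\, dx_n, \\
\text{assimilate:} \quad & p(x_{n+1} \mid y_{1:n+1}) \propto p_l(y_{n+1} \mid x_{n+1})\, p(x_{n+1} \mid y_{1:n}).
\end{align*}
```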
Weather forecast
  Hidden: $x_{n+1} = \Psi(x_n) + \xi_{n+1}$,
  Data: $y_n = h(x_n) + v_n$.
- $x_n$: atmosphere and ocean state on day $n$.
- $\Psi$: PDE solver mapping yesterday's state to today's.
- $h$: partial observations made by satellites.
Main challenge: high dimension, $d \approx 10^6 \sim 10^8$.
Bayesian IP application
Inverse problem with streaming data: $x_0 \sim p_0$,
  Hidden: $x_{n+1} = x_n = x_0$,
  Data: $y_n = h_n(x_0) + v_n$.
- Estimate $x_n$ based on $y_1, \dots, y_n$.
Application: subsurface detection.
- $x_0$: subsurface structure.
- $y_n$: data gathered through the $n$-th experiment.
Kalman filter
Linear setting
  Hidden: $x_{n+1} = A_n x_n + B_n + \xi_{n+1}$,
  Data: $y_n = H x_n + v_n$.
Use Gaussians: $x_n | y_{1:n} \sim N(m_n, R_n)$.
- Forecast step: $\hat m_{n+1} = A_n m_n + B_n$, $\quad \hat R_{n+1} = A_n R_n A_n^T + \Sigma_n$.
- Assimilation step: apply the Kalman update rule
  $m_{n+1} = \hat m_{n+1} + G(\hat R_{n+1})(y_{n+1} - H \hat m_{n+1})$, $\quad R_{n+1} = K(\hat R_{n+1})$,
  $G(C) = C H^T (R + H C H^T)^{-1}$, $\quad K(C) = C - G(C) H C$.
Posterior at $t = n$ → forecast → prior + observation at $t = n+1$ → assimilate → posterior at $t = n+1$.
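A single forecast/assimilation cycle written out in Python/numpy, following the formulas above (variable names are illustrative):

```python
import numpy as np

def kalman_step(m, R, y, A, B, Sigma, H, R_obs):
    """One Kalman filter cycle for x_{n+1} = A x_n + B + xi, y_{n+1} = H x_{n+1} + v."""
    # Forecast step
    m_hat = A @ m + B
    R_hat = A @ R @ A.T + Sigma
    # Kalman gain G(C) = C H^T (R_obs + H C H^T)^{-1}
    G = R_hat @ H.T @ np.linalg.inv(R_obs + H @ R_hat @ H.T)
    # Assimilation step: mean update, and K(C) = C - G(C) H C
    m_new = m_hat + G @ (y - H @ m_hat)
    R_new = R_hat - G @ H @ R_hat
    return m_new, R_new
```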
Nonlinear filtering

Main drawbacks of the Kalman filter
- Difficult to deal with nonlinear dynamics/observations.
- Computation is expensive: $O(d^3)$.

Particle filter
- Can deal with nonlinearity.
- Tries to generate samples $x_n^k \sim p(x_n | y_1, \dots, y_n)$.
- Prohibitively expensive: $O(e^d)$.

Ensemble Kalman filter
- Combines the two ideas: represent Gaussian distributions with samples.
- Ensemble $\{x_n^k\}_{k=1}^K$ represents $N(\bar x_n, C_n)$ with
  $\bar x_n = \tfrac{1}{K} \sum_k x_n^k$, $\quad C_n = \tfrac{1}{K-1} \sum_k (x_n^k - \bar x_n) \otimes (x_n^k - \bar x_n)$.
Ensemble Kalman filter
- Forecast step:
  $\hat x_{n+1}^k = \Psi(x_n^k) + \xi_{n+1}^k$, $\quad \hat y_{n+1}^k = h_{n+1}(\hat x_{n+1}^k) + \zeta_{n+1}^k$.
- Assimilation step:
  $x_{n+1}^k = \hat x_{n+1}^k + G_{n+1}(y_{n+1} - \hat y_{n+1}^k)$.
- Gain matrix: $G_{n+1} = S_X S_Y^T (R + S_Y S_Y^T)^{-1}$,
  $S_X = \tfrac{1}{\sqrt{K-1}} [\hat x_{n+1}^{(1)} - \bar x_{n+1}, \dots, \hat x_{n+1}^{(K)} - \bar x_{n+1}]$.
- Complexity: $O(K^2 d)$.
Posterior at $t = n$ → forecast (Monte Carlo) → prior + observation at $t = n+1$ → assimilate → posterior at $t = n+1$.
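A basic perturbed-observation EnKF cycle in numpy, matching the formulas above; `Psi` (forecast model), `h` (observation map), and `Sigma_sqrt` (a square root of the model noise covariance) are assumed to be supplied by the user.

```python
import numpy as np

def enkf_step(X, y, Psi, h, Sigma_sqrt, R, rng):
    """One EnKF cycle. X: (d, K) ensemble, y: observation vector, R: obs. noise covariance."""
    d, K = X.shape
    p = len(y)
    # Forecast: push each member through the model and add model noise
    Xf = np.column_stack([Psi(X[:, k]) for k in range(K)])
    Xf += Sigma_sqrt @ rng.standard_normal((d, K))
    # Perturbed observations of each forecast member
    Yf = np.column_stack([h(Xf[:, k]) for k in range(K)])
    Yf += np.linalg.cholesky(R) @ rng.standard_normal((p, K))
    # Scaled anomalies S_X, S_Y
    SX = (Xf - Xf.mean(axis=1, keepdims=True)) / np.sqrt(K - 1)
    SY = (Yf - Yf.mean(axis=1, keepdims=True)) / np.sqrt(K - 1)
    # Gain G = S_X S_Y^T (R + S_Y S_Y^T)^{-1}, then assimilate
    G = SX @ SY.T @ np.linalg.inv(R + SY @ SY.T)
    return Xf + G @ (y[:, None] - Yf)
```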
Success of EnKF
Successful applications to weather forecasting:
- System dimension $d \approx 10^6$, ensemble size $K \approx 10^2$.
- Complexity of the Kalman filter: $O(d^3) = O(10^{18})$.
- Complexity of EnKF: $O(K^2 d) = O(10^{10})$.
- No computation of gradients.
Also finds applications in
- oil reservoir management,
- Bayesian IP,
- deep learning.
EnKF in high dimension
The theoretical dimension curse comes from another angle.
- The covariance of $x$ is estimated using the sample covariance
  $\bar x = \tfrac{1}{K} \sum_k x^k$, $\quad \hat C = \tfrac{1}{K-1} \sum_k (x^k - \bar x)(x^k - \bar x)^T$.
- If $x^k \sim N(0, \Sigma)$, then $\|\hat C - \Sigma\| = O(\sqrt{d/K})$ (Bai–Yin's law).
- Problematic if the sample size $K$ is much less than the dimension $d$.
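A quick illustration of this effect (not from the slides): with a fixed ensemble size $K$, the operator-norm error of the sample covariance of $N(0, I_d)$ grows roughly like $\sqrt{d/K}$.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50
for d in [10, 50, 200, 800]:
    X = rng.standard_normal((K, d))             # K samples of N(0, I_d)
    C_hat = np.cov(X, rowvar=False)             # sample covariance
    err = np.linalg.norm(C_hat - np.eye(d), 2)  # operator norm error
    print(f"d={d:4d}  error={err:5.2f}  sqrt(d/K)={np.sqrt(d / K):5.2f}")
```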
Local interaction
- High dimension often comes from dense grids.
- Interaction often is local, e.g., PDE discretization.
- Example: Lorenz 96 model
  $\dot x_i(t) = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F$, $\quad i = 1, \dots, d$.
- Information travels along the interactions, and is dissipated.
Sparsity: local covariance
- Correlation of Lorenz 96 depends on information propagation.
- Correlation decays quickly with the distance.
- The covariance is localized with a structure $\Phi$, e.g. $\Phi(x) = \rho^x$:
  $[\hat C]_{i,j} \approx \Phi(|i - j|)$,
  where $\Phi(x) \in [0, 1]$ is decreasing. The notion of distance can be general.
[Figure: correlation matrix of the Lorenz 96 model, showing fast decay away from the diagonal.]
Covariance localization
Implementation: Schur product with a mask,
  $[\hat C \circ D_L]_{i,j} = [\hat C]_{i,j} [D_L]_{i,j}$.
Use $\hat C \circ D_L$ as the covariance instead, where $[D_L]_{i,j} = \phi(|i - j|)$ with a radius $L$:
- Gaspari–Cohn matrix: $\phi(x) = \exp(-4x^2/L^2)\, \mathbf{1}_{x \le L}$.
- Cutoff matrix: $\phi(x) = \mathbf{1}_{x \le L}$.
Application in EnKF: use $\hat C_n \circ D_L$ instead of $\hat C_n$.
Essentially: assimilate information only from near neighbors.
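A sketch of the mask construction and Schur product in numpy. The classical Gaspari–Cohn taper is a specific fifth-order piecewise polynomial; below, the exponential form from the slide is used, and the names are illustrative.

```python
import numpy as np

def localization_mask(d, L, kind="cutoff"):
    """Build D_L with [D_L]_{ij} = phi(|i - j|): a hard cutoff, or an exp(-4x^2/L^2) taper."""
    dist = np.abs(np.arange(d)[:, None] - np.arange(d)[None, :])
    if kind == "cutoff":
        return (dist <= L).astype(float)
    return np.exp(-4.0 * dist**2 / L**2) * (dist <= L)

def localize(C_hat, L, kind="cutoff"):
    """Replace the sample covariance by its Schur (entrywise) product with the mask."""
    return C_hat * localization_mask(C_hat.shape[0], L, kind)
```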
Advantage with localization

Theorem (Bickel, Levina '08)
Let $x^1, \dots, x^K \sim N(0, \Sigma)$ with sample covariance matrix $C = \tfrac{1}{K} \sum_{k=1}^K x^k \otimes x^k$. There is a constant $c$ such that for any $t > 0$,
  $P(\|C \circ D_L - \Sigma \circ D_L\| > \|D_L\|_1 t) \le 8 \exp(2 \log d - cK \min\{t, t^2\})$.
- This indicates that $K \sim \|D_L\|_1^2 \log d$ is the necessary sample size.
- $\|D_L\|_1 = \max_j \sum_i |[D_L]_{i,j}|$ is independent of $d$; e.g., for the cutoff/banding matrix $[D^L_{\mathrm{cut}}]_{i,j} = \mathbf{1}_{|i-j| \le L}$, $\|D^L_{\mathrm{cut}}\|_1 \le 2L + 1$.

Theorem (T. '18)
If EnKF has a stable localized covariance structure, the required sample size is of order $\log d$.
From EnKF to MCMC?
- Localization for EnKF: update only in small local blocks. It works when the covariance has local structure.
- How do we apply localization to MCMC? How do we update $x$ component by component?
- Gibbs sampling implements this idea exactly (see the sketch below).
Write $x^i = [x_1^i, x_2^i, \dots, x_m^i]$, where each block $x_k^i$ can be of dimension $q$, so $d = qm$.
- Generate $x_1^{i+1} \sim p(x_1 | x_2^i, x_3^i, \dots, x_m^i)$.
- Generate $x_2^{i+1} \sim p(x_2 | x_1^{i+1}, x_3^i, \dots, x_m^i)$.
- ...
- Generate $x_m^{i+1} \sim p(x_m | x_1^{i+1}, x_2^{i+1}, \dots, x_{m-1}^{i+1})$.
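The sweep structure, as a generic sketch; the conditional sampler for each block is problem-specific and supplied by the caller (names are illustrative).

```python
import numpy as np

def gibbs(sample_conditional, x0_blocks, n_sweeps):
    """Systematic-scan Gibbs: x0_blocks is a list of m blocks; sample_conditional(j, blocks)
    draws block j from its full conditional given the current values of all other blocks."""
    blocks = [np.array(b, dtype=float) for b in x0_blocks]
    chain = []
    for _ in range(n_sweeps):
        for j in range(len(blocks)):
            # Blocks i < j already hold their new values; blocks i > j still hold old ones.
            blocks[j] = sample_conditional(j, blocks)
        chain.append(np.concatenate(blocks))
    return np.array(chain)
```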
How efficient is Gibbs?
First, test with $p = N(0, I_d)$:
- Generate $x_1^{i+1} \sim p(x_1 | x_2^i, x_3^i, \dots, x_m^i) = N(0, I_q)$.
- Generate $x_2^{i+1} \sim p(x_2 | x_1^{i+1}, x_3^i, \dots, x_m^i) = N(0, I_q)$.
- ...
- Generate $x_m^{i+1} \sim p(x_m | x_1^{i+1}, x_2^{i+1}, \dots, x_{m-1}^{i+1}) = N(0, I_q)$.
- Gibbs naturally exploits the independence between components.
- It works efficiently regardless of the dimension.
How about components with sparse/local dependence?
Block tridiagonal matrix
- Local covariance matrix $C$: $[C]_{i,j}$ decays to zero quickly as $|i - j|$ becomes large.
- Localized covariance matrix $C$: $[C]_{i,j} = 0$ when $|i - j| > L$, i.e., $C$ has bandwidth $2L$.
- We will see that "local" is a perturbation of "localized".
- We can choose $q = L$ in $x^i = [x_1^i, x_2^i, \dots, x_m^i]$; then $C$ is block tridiagonal.
Localized covariance

Theorem (Morzfeld, T., Marzouk)
Apply the Gibbs sampler with block-size $q$ to $p = N(m, C)$. Suppose $C$ is $q$-block-tridiagonal. Then the distribution of $x^k$ converges to $p$ geometrically fast in all coordinates, and we can couple $x^k$ with a sample $z \sim N(m, C)$ such that
  $E\|C^{-1/2}(x^k - z)\|^2 \le \beta^k d\,(1 + \|C^{-1/2}(x^0 - m)\|^2)$,
where
  $\beta \le \dfrac{2(1 - \mathcal{C}^{-1})^2 \mathcal{C}^4}{1 + 2(1 - \mathcal{C}^{-1})^2 \mathcal{C}^4}$,
with $\mathcal{C}$ being the condition number of $C$.
Localized covariance + mild conditions ⇒ dimension-free convergence.
Localized precision

Theorem (Morzfeld, T., Marzouk)
Apply the Gibbs sampler with block-size $q$ to $p = N(m, C)$. Suppose $\Sigma = C^{-1}$ is $q$-block-tridiagonal. Then the distribution of $x^k$ converges to $p$ geometrically fast in all coordinates, and we can couple $x^k$ with a sample $z \sim N(m, C)$ such that
  $E\|C^{-1/2}(x^k - z)\|^2 \le \beta^k d\,(1 + \|C^{-1/2}(x^0 - m)\|^2)$,
where
  $\beta \le \dfrac{\mathcal{C}(1 - \mathcal{C}^{-1})^2}{1 + \mathcal{C}(1 - \mathcal{C}^{-1})^2}$,
with $\mathcal{C}$ being the condition number of $C$.
Localized precision + mild conditions ⇒ dimension-free convergence.
Covariance vs. precision
Why consider both localized covariance and localized precision?
- A lemma in Bickel & Lindner 2012:
  localized covariance + mild conditions ⇒ local precision;
  localized precision + mild conditions ⇒ local covariance.
- We will see that "local" is a perturbation of "localized".
For the computation of the Gibbs sampler, localized precision is superior:
  $x_j^{k+1} \sim N\Big(m_j - \Omega_{j,j}^{-1} \sum_{i<j} \Omega_{j,i}(x_i^{k+1} - m_i) - \Omega_{j,j}^{-1} \sum_{i>j} \Omega_{j,i}(x_i^k - m_i),\; \Omega_{j,j}^{-1}\Big)$.
When $\Omega$ is sparse, most terms vanish, meaning fast computation.
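A sketch of this update for scalar blocks ($q = 1$) with a sparse precision matrix $\Omega$, using scipy's CSR format so each step touches only the nonzero neighbors of coordinate $j$; the interface is illustrative.

```python
import numpy as np
import scipy.sparse as sp

def gibbs_sparse_precision(Omega, m, n_sweeps, rng=None):
    """Gibbs sampling of N(m, Omega^{-1}) with scalar blocks:
    x_j | x_{-j} ~ N(m_j - Omega_jj^{-1} * sum_{i != j} Omega_ji (x_i - m_i), Omega_jj^{-1})."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = sp.csr_matrix(Omega)
    m = np.asarray(m, dtype=float)
    x = m.copy()
    for _ in range(n_sweeps):
        for j in range(len(m)):
            row = Omega.getrow(j)
            ojj = Omega[j, j]
            # Sum over the sparse row, then remove the i = j term
            s = np.sum(row.data * (x[row.indices] - m[row.indices])) - ojj * (x[j] - m[j])
            x[j] = m[j] - s / ojj + rng.standard_normal() / np.sqrt(ojj)
    return x
```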
Generalization
- Gibbs works for Gaussian sampling with localized covariance or precision.
- How about the Bayesian inverse problem
  $y = h(x) + v$, $\quad v \sim N(0, R)$, $\quad x \sim p_0 = N(m, C)$?
- If $h$ is linear, $p$ is also Gaussian, and Gibbs is directly applicable.
- What to do when $C$ etc. are not localized but only local?
Metropolis within Gibbs
Add Metropolis steps to incorporate the observation:
- Generate $x_1^* \sim p_0(x_1 | x_2^i, x_3^i, \dots, x_m^i)$.
- Accept it as $x_1^{i+1}$ with probability $\alpha_1(x_1^i, x_1^*, x_{2:m}^i)$,
  $\alpha_1(x_1^i, x_1^*, x_{2:m}^i) = \min\Big\{1, \dfrac{\exp(-\tfrac12 \|y - h(x^*)\|_R^2)}{\exp(-\tfrac12 \|y - h(x^i)\|_R^2)}\Big\}$,
  where $x^* = (x_1^*, x_{2:m}^i)$.
- Repeat for blocks $2, \dots, m$.
When $h$ has a dimension-free Lipschitz constant, $\|y - h(x^*)\|_R^2 - \|y - h(x^i)\|_R^2$ is independent of the dimension.
⇒ Dimension-independent acceptance rate. This suggests fast convergence, though a rigorous proof is unclear.
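A sketch of one MwG sweep for a Gaussian prior $N(m, \Omega^{-1})$ with scalar blocks and a generic nonlinear observation map `h`; the proposal is drawn from the prior conditional, so the prior terms cancel and only the likelihood ratio enters the acceptance probability. All names are illustrative.

```python
import numpy as np

def mwg_sweep(x, m, Omega, h, y, R_inv, rng):
    """One Metropolis-within-Gibbs sweep: propose each x_j from the prior conditional,
    accept with probability min{1, exp(-0.5||y - h(x*)||_R^2 + 0.5||y - h(x)||_R^2)}."""
    x = x.copy()
    r = y - h(x)
    log_like = -0.5 * r @ R_inv @ r
    for j in range(len(x)):
        ojj = Omega[j, j]
        s = Omega[j] @ (x - m) - ojj * (x[j] - m[j])        # sum_{i != j} Omega_ji (x_i - m_i)
        prop = x.copy()
        prop[j] = m[j] - s / ojj + rng.standard_normal() / np.sqrt(ojj)  # prior conditional draw
        r_prop = y - h(prop)
        log_like_prop = -0.5 * r_prop @ R_inv @ r_prop
        if np.log(rng.uniform()) < log_like_prop - log_like:  # accept/reject on the likelihood only
            x, log_like = prop, log_like_prop
    return x
```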
Localization
Often $\Omega$ and $h$ are local:
- $[\Omega]_{i,j}$ decays to zero quickly as $|i - j|$ increases;
- $[h(x)]_j$ depends significantly on only a few $x_i$.
Fast sparse computation is possible with localized parameters, so we localize: truncate the near-zero terms, $\Omega \to \Omega_L$, $h \to h_L$.
We call MwG with localization l-MwG.

Theorem (Morzfeld, T., Marzouk 2018)
The perturbation to the inverse problem is of order
  $\max\big\{\|\Omega - \Omega_L\|_1, \; \|(H - H_L)(H - H_L)^T\|_1\big\}$,
where $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^n |A_{i,j}|$.
Comparison with function space MCMC
Function space MCMC:
- Discretization refines, domain stays constant.
- Number of observations stays constant.
- Effective dimension stays constant.
- Low-rank priors.
- Low-rank prior-to-posterior update.
- Solved by dimension reduction.
MCMC for local problems:
- Domain size increases, discretization stays constant.
- Number of observations increases.
- Effective dimension increases.
- High-rank, sparse priors.
- High-rank prior-to-posterior update.
- Solved by localization.
Example I: image deblurring
- Truth: $x \sim N(0, \delta^{-1} L^{-2})$, where $L$ is the Laplacian.
- Defocus observation: $y = Ax + \eta$, $\eta \sim N(0, \lambda^{-1} I)$.
- Dimension is large: $O(10^4)$.
Example I: image deblurring
- The precision is sparse.
- The effective dimension is large.
Example II: Lorenz 96 inverse problem
- Truth: $x_0 \sim p_0$, where $p_0$ is the Gaussian climatology.
- $\Psi_t: x_0 \mapsto x_t$, where $\dot x_i = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + 8$.
- Observe every other coordinate of $x_t$: $y = H(\Psi_t(x_0)) + \xi$.
Summary
- Bayesian IP provides more information for UQ.
- It is often solved by MCMC algorithms.
- Data assimilation (DA): Bayesian IP with dynamical systems.
- EnKF: an efficient solver for high-dimensional DA.
- Localization: a useful technique in high-dimensional settings.
References
- Inverse problems: a Bayesian perspective. A. M. Stuart, Acta Numerica, 2010.
- Localization for MCMC: sampling high-dimensional posterior distributions with local structure. arXiv:1710.07747.
- Performance analysis of local ensemble Kalman filter. To appear in J. Nonlinear Science.

Thank you!