The Bayesian approach to inverse problems
1 The Bayesian approach to inverse problems
Youssef Marzouk
Department of Aeronautics and Astronautics, Center for Computational Engineering, Massachusetts Institute of Technology
7 July 2015
2 Statistical inference
Why is a statistical perspective useful in inverse problems?
- To characterize uncertainty in the inverse solution
- To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.
- To make probabilistic predictions
- To choose good observations or experiments
- To address questions of model error, model validity, and model selection
3 Bayesian inference
Bayes' rule:
$p(\theta \mid y) = \dfrac{p(y \mid \theta)\, p(\theta)}{p(y)}$
Key idea: model parameters θ are treated as random variables. (For simplicity, we let our random variables have densities.)
Notation:
- θ are model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
- p(θ) is the prior probability density
- $L(\theta) := p(y \mid \theta)$ is the likelihood function
- p(θ | y) is the posterior probability density
- p(y) is the evidence, or equivalently, the marginal likelihood
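A minimal sketch of Bayes' rule in action, not taken from the slides: a scalar parameter θ is discretized on a grid, and the posterior is formed as (likelihood × prior) / evidence. The Gaussian prior, the noise level, and the single datum are all made-up placeholders.

```python
# Sketch: Bayes' rule for a scalar parameter on a 1-D grid (toy model; all numbers assumed).
import numpy as np

theta = np.linspace(-5.0, 5.0, 2001)           # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.exp(-0.5 * theta**2 / 2.0**2)       # p(theta): N(0, 2^2), unnormalized
prior /= prior.sum() * dtheta

y, sigma = 1.3, 0.5                            # one observed datum and its noise std (made up)
likelihood = np.exp(-0.5 * (y - theta)**2 / sigma**2)   # L(theta) = p(y | theta), up to a constant

evidence = np.sum(likelihood * prior) * dtheta # p(y), the marginal likelihood (same constant absorbed)
posterior = likelihood * prior / evidence      # Bayes' rule: p(theta | y)

post_mean = np.sum(theta * posterior) * dtheta
print(f"posterior mean = {post_mean:.3f}")
```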
4 Bayesian inference: summaries of the posterior distribution
What information to extract?
- Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
- Posterior covariance or higher moments of θ
- Quantiles
- Credible intervals: C(y) such that $P[\theta \in C(y) \mid y] = 1 - \alpha$. Credible intervals are not uniquely defined by this condition; thus consider, for example, the HPD (highest posterior density) region.
- Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations
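A sketch of how these summaries are typically computed from posterior samples; the Gaussian draws below stand in for actual MCMC output and are purely illustrative.

```python
# Sketch: posterior summaries from samples (the samples here are a stand-in for MCMC draws).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.3, size=50_000)   # placeholder for posterior draws of theta | y

post_mean, post_var = samples.mean(), samples.var()
lo, hi = np.quantile(samples, [0.025, 0.975])            # 95% equal-tailed credible interval

# Highest-posterior-density (HPD) interval: shortest interval containing 95% of the samples.
s = np.sort(samples)
k = int(np.ceil(0.95 * len(s)))
widths = s[k:] - s[:-k]
i = int(np.argmin(widths))

print(f"mean = {post_mean:.3f}, variance = {post_var:.4f}")
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}], 95% HPD interval: [{s[i]:.3f}, {s[i + k]:.3f}]")
```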
5 Bayesian and frequentist statistics
Understanding both perspectives is useful and important. Key differences between these two statistical paradigms:
- Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods $p_\theta(y) \equiv p(y \mid \theta)$ but not priors p(θ) or posteriors. θ is not a random variable.
- In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators $\hat{\theta}(y)$ of θ.
- The estimator $\hat{\theta}$ is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.
6 Bayesian and frequentist statistics
Key differences (continued):
- Evaluate the quality of $\hat{\theta}$ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
- One common estimator is maximum likelihood: $\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta \, p(y \mid \theta)$. Here p(y | θ) defines a family of distributions indexed by θ.
- Link to the Bayesian approach: the MAP estimate maximizes a penalized likelihood.
- What about Bayesian versus frequentist prediction of $y_{\mathrm{new}} \mid y, \theta$?
  - Frequentist: plug-in or other estimators of $y_{\mathrm{new}}$
  - Bayesian: posterior prediction via integration
7 Bayesian inference: likelihood functions
- In general, p(y | θ) is a probabilistic model for the data.
- In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy.
- Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself.
Selected examples of likelihood functions (a sketch of the second follows below):
1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise
3. Nonlinear forward model with noise + model discrepancy
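A sketch of example 2 above: the log-likelihood for y = g(θ) + ε with ε ~ N(0, Γ_obs). The particular forward model g, the observation covariance, and the synthetic data are placeholders chosen only to make the snippet runnable.

```python
# Sketch: log-likelihood for a nonlinear forward model with additive Gaussian noise (toy g, assumed Gamma_obs).
import numpy as np

def g(theta):
    """Toy nonlinear forward model mapping parameters to predicted observations."""
    return np.array([np.sin(theta[0]) + theta[1], theta[0] * theta[1]])

Gamma_obs = 0.1**2 * np.eye(2)                 # observation-noise covariance (assumed)
L_obs = np.linalg.cholesky(Gamma_obs)

def log_likelihood(theta, y):
    """log p(y | theta) up to an additive constant."""
    r = y - g(theta)                            # data-model misfit
    z = np.linalg.solve(L_obs, r)               # whitened residual
    return -0.5 * np.dot(z, z)

y_obs = np.array([1.2, 0.3])                    # synthetic data (made up)
print(log_likelihood(np.array([0.5, 0.7]), y_obs))
```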
8 Bayesian inference: prior distributions
- In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role.
- Intuitive idea: assign lower probability to values of θ that you don't expect to see, higher probability to values of θ that you do expect to see.
Examples:
1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)
9 Gaussian process priors
- Key idea: any finite-dimensional distribution of the stochastic process $\theta(x, \omega): D \times \Omega \to \mathbb{R}$ is multivariate normal.
- In other words, θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x.
- Specify via a mean function and a covariance function:
  $E[\theta(x)] = \mu(x), \qquad E[(\theta(x) - \mu(x))(\theta(x') - \mu(x'))] = C(x, x')$
- Smoothness of the process is controlled by the behavior of the covariance function as $x' \to x$.
- Restrictions: stationarity, isotropy, ...
10 Example: stationary Gaussian random field priors
Prior is a stationary Gaussian random field $\theta(x, \omega): D \times \Omega \to \mathbb{R}$ with $D = [0, 1]$. [Figure: realizations with an exponential covariance kernel and with a Gaussian covariance kernel.] Both can be represented through a truncated Karhunen-Loève expansion,
$\theta(x, \omega) = \mu(x) + \sum_{i=1}^{K} \sqrt{\lambda_i}\, c_i(\omega)\, \phi_i(x).$
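A sketch, under assumed settings (1-D domain, exponential kernel, 20 retained modes), of drawing such a prior sample via a discrete Karhunen-Loève expansion computed from an eigendecomposition of the covariance matrix.

```python
# Sketch: sample a stationary GP prior on [0, 1] via a truncated, discretized KL expansion.
# The kernel, correlation length, grid, and number of modes K are all assumptions for illustration.
import numpy as np

x = np.linspace(0.0, 1.0, 200)
corr_len, sigma2 = 0.2, 1.0
C = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / corr_len)   # exponential covariance kernel

eigvals, eigvecs = np.linalg.eigh(C)            # discrete KL modes
idx = np.argsort(eigvals)[::-1]                 # sort modes by decreasing variance
eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]

K = 20                                          # number of retained KL modes
rng = np.random.default_rng(1)
c = rng.standard_normal(K)                      # i.i.d. N(0, 1) KL coefficients
theta = eigvecs[:, :K] @ (np.sqrt(eigvals[:K]) * c)   # sample with mu(x) = 0
print(theta.shape)
```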
11 Gaussian Markov random fields
- Key idea: discretize space and specify a sparse inverse covariance ("precision") matrix W:
  $p(\theta) \propto \exp\!\left(-\tfrac{1}{2}\, \gamma\, \theta^T W \theta\right)$, where γ controls the scale.
- Full conditionals $p(\theta_i \mid \theta_{-i})$ are available analytically and may simplify dramatically. Represent as an undirected graphical model.
- Example: $E[\theta_i \mid \theta_{-i}]$ is just an average of site i's nearest neighbors.
- Quite flexible; even used to simulate textures.
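A sketch of a first-order GMRF on a 1-D chain graph. The tridiagonal, graph-Laplacian-like precision, the small diagonal shift (added only to make the prior proper), and all numbers are assumptions for illustration; the last two lines check the nearest-neighbor form of the full conditional mean.

```python
# Sketch: GMRF prior p(theta) ∝ exp(-0.5 * gamma * theta^T W theta) on a 1-D chain (all settings assumed).
import numpy as np
import scipy.sparse as sp

n, gamma = 400, 25.0
main = 2.0 * np.ones(n); main[0] = main[-1] = 1.0            # chain-graph Laplacian diagonal
off = -np.ones(n - 1)
W = sp.diags([off, main, off], [-1, 0, 1]).tocsc() + 1e-4 * sp.eye(n, format="csc")
Q = (gamma * W).toarray()                                    # precision matrix (dense only for this small demo)

# Draw theta ~ N(0, Q^{-1}) using a Cholesky factor of the precision: Q = L L^T, theta = L^{-T} z.
rng = np.random.default_rng(3)
L = np.linalg.cholesky(Q)
theta = np.linalg.solve(L.T, rng.standard_normal(n))

# Full conditional mean of an interior site is (nearly) the average of its two neighbors.
i = n // 2
cond_mean = -(Q[i, :i] @ theta[:i] + Q[i, i + 1:] @ theta[i + 1:]) / Q[i, i]
print(cond_mean, 0.5 * (theta[i - 1] + theta[i + 1]))
```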
12 Priors through differential operators
- Key idea: return to the infinite-dimensional setting; again penalize roughness in θ(x).
- Stuart (2010): define the prior using fractional negative powers of the Laplacian, $A = -\triangle$:
  $\theta \sim N\!\left(\theta_0, \beta A^{-\alpha}\right)$
- Sufficiently large α (α > d/2), along with conditions on the likelihood, ensures that the posterior measure is well defined.
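A sketch of this construction on a 1-D grid: A is the standard finite-difference negative Laplacian with homogeneous Dirichlet boundary conditions, and the prior covariance β A^{-α} is built from its eigendecomposition. The boundary conditions, α, β, and the grid size are assumptions made for the demo.

```python
# Sketch: smoothness prior theta ~ N(0, beta * A^{-alpha}) with A = -Laplacian (1-D, Dirichlet BCs, all assumed).
import numpy as np

n = 300
h = 1.0 / (n + 1)
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / h**2

alpha, beta = 1.0, 1e-3                          # alpha > d/2 = 1/2 here
lam, V = np.linalg.eigh(A)                       # A = V diag(lam) V^T with lam > 0
Gamma_pr = beta * (V * lam**(-alpha)) @ V.T      # prior covariance beta * A^{-alpha}

rng = np.random.default_rng(4)
theta = V @ (np.sqrt(beta * lam**(-alpha)) * rng.standard_normal(n))  # one prior sample
print(Gamma_pr.shape, theta[:3])
```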
13 GPs, GMRFs, and SPDEs
In fact, all three types of Gaussian priors just described are closely connected.
- Linear fractional SPDE:
  $(\kappa^2 - \triangle)^{\beta/2}\, \theta(x) = \mathcal{W}(x), \quad x \in \mathbb{R}^d, \quad \beta = \nu + d/2, \ \kappa > 0, \ \nu > 0$
- Then θ(x) is a Gaussian field with Matérn covariance:
  $C(x, x') = \dfrac{\sigma^2}{2^{\nu - 1}\, \Gamma(\nu)}\, (\kappa \|x - x'\|)^{\nu}\, K_{\nu}(\kappa \|x - x'\|)$
- The covariance kernel is the Green's function of the differential operator:
  $(\kappa^2 - \triangle)^{\beta}\, C(x, x') = \delta(x - x')$
- ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared exponential covariance.
- Can construct a discrete GMRF that approximates the solution of the SPDE. (See Lindgren, Rue, Lindström, JRSSB 2011.)
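A sketch implementing the Matérn covariance above, with a numerical check that ν = 1/2 recovers the exponential kernel σ² exp(−κ r); the particular κ and grid values are arbitrary.

```python
# Sketch: Matern covariance C(r) = sigma^2 / (2^{nu-1} Gamma(nu)) * (kappa r)^nu * K_nu(kappa r),
# checked against the exponential kernel at nu = 1/2. Parameter values are arbitrary.
import numpy as np
from scipy.special import kv, gamma as gamma_fn

def matern_cov(r, kappa, nu, sigma2=1.0):
    r = np.asarray(r, dtype=float)
    c = np.empty_like(r)
    small = r < 1e-12
    c[small] = sigma2                                          # C(0) = sigma^2
    kr = kappa * r[~small]
    c[~small] = sigma2 / (2**(nu - 1) * gamma_fn(nu)) * kr**nu * kv(nu, kr)
    return c

r = np.linspace(0.0, 2.0, 5)
print(matern_cov(r, kappa=3.0, nu=0.5))          # should match the exponential kernel below
print(np.exp(-3.0 * r))
```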
14 Hierarchical Gaussian priors
[Figure from Calvetti & Somersalo, Inverse Problems 24 (2008): three realizations drawn from the prior (6) with constant variance θ_j = θ_0 (left), and from the corresponding prior where the variance is 100-fold larger at two points indicated by arrows (right).]
15 Hierarchical Gaussian priors
[Figures from Calvetti & Somersalo, Inverse Problems 24 (2008): approximations of the MAP estimate of the image (top rows) and of the variance (bottom rows) after successive iterations of the cyclic algorithm, using the GMRES and CGLS methods to compute the update of the image at each iteration step. With the CGLS iteration and an inverse gamma hyperprior, the value of the objective function levels off after five iterations.]
16 Non-Gaussian priors
Besov space $B^s_{pq}(T)$: expand
$\theta(x) = c_0 + \sum_{j=0}^{\infty} \sum_{h=0}^{2^j - 1} w_{j,h}\, \psi_{j,h}(x)$
and require
$\|\theta\|_{B^s_{pq}(T)} := \left( |c_0|^q + \sum_{j=0}^{\infty} 2^{jq(s + 1/2 - 1/p)} \Big( \sum_{h=0}^{2^j - 1} |w_{j,h}|^p \Big)^{q/p} \right)^{1/q} < \infty.$
Consider p = q = s = 1:
$\|\theta\|_{B^1_{11}(T)} = |c_0| + \sum_{j=0}^{\infty} \sum_{h=0}^{2^j - 1} 2^{j/2}\, |w_{j,h}|.$
Then the distribution of θ is a Besov prior if $\alpha c_0$ and $\alpha\, 2^{j/2} w_{j,h}$ are independent and Laplace(1). Loosely, $\pi(\theta) \propto \exp\!\left(-\alpha \|\theta\|_{B^1_{11}(T)}\right)$.
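A sketch of drawing one realization from this B¹₁₁ Besov prior, using a Haar wavelet basis on T = [0, 1]. The choice of Haar wavelets, the truncation level J, and the scale α are assumptions for illustration only; the coefficient scaling follows the Laplace(1) condition stated above.

```python
# Sketch: one B^1_11 Besov-prior realization with Haar wavelets (basis, level J, and alpha assumed).
import numpy as np

def haar(j, h, x):
    """Haar wavelet psi_{j,h}(x) = 2^{j/2} psi(2^j x - h), psi = +1 on [0, 1/2), -1 on [1/2, 1)."""
    t = 2.0**j * x - h
    return 2.0**(j / 2) * (((0.0 <= t) & (t < 0.5)).astype(float) - ((0.5 <= t) & (t < 1.0)).astype(float))

rng = np.random.default_rng(5)
alpha, J = 10.0, 8                                # prior scale and truncation level (assumed)
x = np.linspace(0.0, 1.0, 1024, endpoint=False)

c0 = rng.laplace(scale=1.0) / alpha               # alpha * c_0 ~ Laplace(1)
theta = np.full_like(x, c0)
for j in range(J):
    w = rng.laplace(scale=1.0, size=2**j) / (alpha * 2.0**(j / 2))   # alpha * 2^{j/2} w_{j,h} ~ Laplace(1)
    for h, w_jh in enumerate(w):
        theta += w_jh * haar(j, h, x)
print(theta[:5])
```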
17 Higher-level representations
Marked point processes, and more: Rue & Hurn, Biometrika 86 (1999).
18 Bayesian inference: hierarchical modeling
- One of the key flexibilities of the Bayesian construction!
- Hierarchical modeling has important implications for the design of efficient MCMC samplers (later in the lecture).
Examples (a sketch of the first follows below):
1. Unknown noise variance
2. Unknown variance of a Gaussian process prior (cf. choosing the regularization parameter)
3. Many more, as dictated by the physical models at hand
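A sketch of example 1: treating the noise variance σ² as unknown with an inverse-gamma hyperprior. Given θ, the conditional posterior of σ² is available in closed form, which is a typical building block of a Gibbs-style sampler; the linear model, the hyperprior parameters, and all numbers below are assumptions.

```python
# Sketch: unknown noise variance with an inverse-gamma hyperprior (conjugate conditional update).
#   y | theta, sigma^2 ~ N(G theta, sigma^2 I),  sigma^2 ~ InvGamma(a, b)
#   => sigma^2 | theta, y ~ InvGamma(a + m/2, b + 0.5 * ||y - G theta||^2)
import numpy as np

rng = np.random.default_rng(6)
m, n = 50, 3
G = rng.standard_normal((m, n))
theta_true = np.array([1.0, -0.5, 2.0])
y = G @ theta_true + 0.3 * rng.standard_normal(m)             # synthetic data, true sigma = 0.3

a, b = 2.0, 0.1                                               # inverse-gamma hyperprior parameters (assumed)
theta = theta_true                                            # pretend this is the current Gibbs state
a_post = a + 0.5 * m
b_post = b + 0.5 * np.sum((y - G @ theta)**2)

sigma2_draw = 1.0 / rng.gamma(shape=a_post, scale=1.0 / b_post)   # one InvGamma(a_post, b_post) draw
print(f"conditional posterior mean of sigma^2: {b_post / (a_post - 1):.4f}, one draw: {sigma2_draw:.4f}")
```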
19 Example: prior variance hyperparameter in an inverse diffusion problem
[Figure: posterior marginal density of the variance hyperparameter θ versus quality of data (e.g., ς = 10⁻¹ with 13 sensors, ς = 10⁻² with 25 sensors), contrasted with its hyperprior density. Regularization ∝ ς²/θ.]
20 The linear Gaussian model
A key building-block problem:
- Parameters $\theta \in \mathbb{R}^n$, observations $y \in \mathbb{R}^m$
- Forward model $f(\theta) = G\theta$, where $G \in \mathbb{R}^{m \times n}$
- Additive noise yields observations: $y = G\theta + \epsilon$, with $\epsilon \sim N(0, \Gamma_{\mathrm{obs}})$ independent of θ
- Endow θ with a Gaussian prior, $\theta \sim N(0, \Gamma_{\mathrm{pr}})$.
Posterior probability density:
$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) = L(\theta)\, p(\theta)$
$= \exp\!\left(-\tfrac{1}{2}(y - G\theta)^T \Gamma_{\mathrm{obs}}^{-1} (y - G\theta)\right) \exp\!\left(-\tfrac{1}{2}\theta^T \Gamma_{\mathrm{pr}}^{-1} \theta\right)$
$\propto \exp\!\left(-\tfrac{1}{2}(\theta - \mu_{\mathrm{pos}})^T \Gamma_{\mathrm{pos}}^{-1} (\theta - \mu_{\mathrm{pos}})\right)$
22 The linear Gaussian model
Posterior is again Gaussian:
$\Gamma_{\mathrm{pos}} = \left(G^T \Gamma_{\mathrm{obs}}^{-1} G + \Gamma_{\mathrm{pr}}^{-1}\right)^{-1} = \Gamma_{\mathrm{pr}} - \Gamma_{\mathrm{pr}} G^T \left(G \Gamma_{\mathrm{pr}} G^T + \Gamma_{\mathrm{obs}}\right)^{-1} G \Gamma_{\mathrm{pr}} = (I - KG)\, \Gamma_{\mathrm{pr}}$
$\mu_{\mathrm{pos}} = \Gamma_{\mathrm{pos}} G^T \Gamma_{\mathrm{obs}}^{-1} y$
- In the context of filtering, K is known as the (optimal) Kalman gain.
- $H := G^T \Gamma_{\mathrm{obs}}^{-1} G$ is the Hessian of the negative log-likelihood.
- How does low rank of H affect the structure of the posterior? How does H interact with the prior?
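A sketch that computes μ_pos and Γ_pos both ways, via the precision form and via the Kalman-gain form, and checks that the two expressions agree; G, the covariances, and the synthetic data are placeholders.

```python
# Sketch: linear-Gaussian posterior computed two equivalent ways (all inputs are placeholders).
import numpy as np

rng = np.random.default_rng(7)
m, n = 20, 50
G = rng.standard_normal((m, n))                                # forward-model matrix (placeholder)
Gamma_obs = 0.1**2 * np.eye(m)
Gamma_pr = np.eye(n)
y = G @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)  # synthetic data

# (a) Precision form: Gamma_pos = (G^T Gamma_obs^{-1} G + Gamma_pr^{-1})^{-1}
H = G.T @ np.linalg.solve(Gamma_obs, G)                        # Hessian of the negative log-likelihood
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_obs, y)

# (b) Kalman-gain form: K = Gamma_pr G^T (G Gamma_pr G^T + Gamma_obs)^{-1}, Gamma_pos = (I - K G) Gamma_pr
K = Gamma_pr @ G.T @ np.linalg.inv(G @ Gamma_pr @ G.T + Gamma_obs)
Gamma_pos_b = (np.eye(n) - K @ G) @ Gamma_pr

print(np.allclose(Gamma_pos, Gamma_pos_b))                     # the two expressions agree
print(mu_pos[:3])
```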
23 Likelihood-informed directions
Consider the Rayleigh ratio
$R(w) = \dfrac{w^T H w}{w^T \Gamma_{\mathrm{pr}}^{-1} w}.$
When R(w) is large, the likelihood dominates the prior in direction w. The ratio is maximized by solutions of the generalized eigenvalue problem
$H w = \lambda\, \Gamma_{\mathrm{pr}}^{-1} w.$
The posterior covariance can be written as a negative update along these likelihood-informed directions, and an approximation can be obtained by using only the r largest eigenvalues:
$\Gamma_{\mathrm{pos}} = \Gamma_{\mathrm{pr}} - \sum_{i=1}^{n} \dfrac{\lambda_i}{1 + \lambda_i}\, w_i w_i^T \;\approx\; \Gamma_{\mathrm{pr}} - \sum_{i=1}^{r} \dfrac{\lambda_i}{1 + \lambda_i}\, w_i w_i^T \qquad (1)$
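A sketch of equation (1) on a small random test problem. It uses scipy.linalg.eigh for the generalized eigenproblem H w = λ Γ_pr⁻¹ w, whose eigenvectors are returned normalized so that wᵀ Γ_pr⁻¹ w = 1, which is the normalization under which the identity holds; the test matrices are placeholders.

```python
# Sketch: likelihood-informed directions and the rank-r update of eq. (1) (test matrices are placeholders).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
m, n, r = 15, 60, 10
G = rng.standard_normal((m, n))
Gamma_obs = 0.05**2 * np.eye(m)
Gamma_pr = np.diag(np.linspace(0.5, 2.0, n))                   # a simple non-identity prior covariance

H = G.T @ np.linalg.solve(Gamma_obs, G)                        # Hessian of the negative log-likelihood
lam, W = eigh(H, np.linalg.inv(Gamma_pr))                      # H w = lambda * Gamma_pr^{-1} w (ascending)
lam, W = lam[::-1], W[:, ::-1]                                  # reorder by decreasing lambda

Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))        # exact posterior covariance
full_update = (W * (lam / (1.0 + lam))) @ W.T                  # all n directions
low_rank = (W[:, :r] * (lam[:r] / (1.0 + lam[:r]))) @ W[:, :r].T

print(np.allclose(Gamma_pos, Gamma_pr - full_update))          # True: eq. (1) with all n terms is exact
print(np.linalg.norm(Gamma_pos - (Gamma_pr - low_rank)))       # error of the rank-r truncation
```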
26 Optimality results for $\hat{\Gamma}_{\mathrm{pos}}$
It turns out that the approximation
$\hat{\Gamma}_{\mathrm{pos}} = \Gamma_{\mathrm{pr}} - \sum_{i=1}^{r} \dfrac{\lambda_i}{1 + \lambda_i}\, w_i w_i^T \qquad (2)$
is optimal in a class of loss functions $L(\hat{\Gamma}_{\mathrm{pos}}, \Gamma_{\mathrm{pos}})$ for approximations of the form $\hat{\Gamma}_{\mathrm{pos}} = \Gamma_{\mathrm{pr}} - KK^T$, where rank(K) ≤ r.¹
- $\hat{\Gamma}_{\mathrm{pos}}$ minimizes the Hellinger distance and the KL divergence between $N(\mu_{\mathrm{pos}}(y), \hat{\Gamma}_{\mathrm{pos}})$ and $N(\mu_{\mathrm{pos}}(y), \Gamma_{\mathrm{pos}})$.
- The results can also be used to devise efficient approximations of the posterior mean.
- λ = 1 means that the prior and the likelihood are roughly balanced. Truncate at λ = 0.1, for instance.
¹ For details, see Spantini et al., "Optimal low-rank approximations of Bayesian linear inverse problems."
30 Remarks on the optimal approximation
$\hat{\Gamma}_{\mathrm{pos}} = \Gamma_{\mathrm{pr}} - KK^T, \qquad KK^T = \sum_{i=1}^{r} \dfrac{\lambda_i}{1 + \lambda_i}\, w_i w_i^T$
- The form of the optimal update is widely used (Flath et al. 2011).
- Compute with Lanczos, randomized SVD, etc.
- Directions $\tilde{w}_i = \Gamma_{\mathrm{pr}}^{-1} w_i$ maximize the relative difference between prior and posterior variance:
  $\dfrac{\mathrm{Var}(\tilde{w}_i^T x) - \mathrm{Var}(\tilde{w}_i^T x \mid y)}{\mathrm{Var}(\tilde{w}_i^T x)} = \dfrac{\lambda_i}{1 + \lambda_i}$
- Using the Frobenius norm as a loss would instead yield directions of greatest absolute difference between prior and posterior variance.
31 A metric between covariance matrices
Förstner metric: let A, B ≻ 0, and let $(\sigma_i)$ be the generalized eigenvalues of (A, B); then
$d_F^2(A, B) = \mathrm{tr}\!\left[\ln^2\!\left(B^{-1/2} A B^{-1/2}\right)\right] = \sum_i \ln^2(\sigma_i)$
- Compare curvatures: $\sup_u \dfrac{u^T A u}{u^T B u} = \sigma_1$
- Invariance properties: $d_F(A, B) = d_F(A^{-1}, B^{-1})$ and $d_F(A, B) = d_F(M A M^T, M B M^T)$
- The Frobenius distance $\|A - B\|_F$ does not share the same properties.
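A sketch computing the Förstner distance from the generalized eigenvalues and numerically checking the two invariance properties listed above; the SPD test matrices and the (almost surely invertible) congruence matrix M are placeholders.

```python
# Sketch: Forstner distance between SPD matrices, with checks of its invariance properties.
import numpy as np
from scipy.linalg import eigh

def forstner(A, B):
    """d_F(A, B) = sqrt( sum_i ln^2(sigma_i) ), sigma_i the generalized eigenvalues of (A, B)."""
    sigma = eigh(A, B, eigvals_only=True)
    return np.sqrt(np.sum(np.log(sigma)**2))

rng = np.random.default_rng(9)
X = rng.standard_normal((5, 5)); A = X @ X.T + 5 * np.eye(5)   # placeholder SPD matrices
Y = rng.standard_normal((5, 5)); B = Y @ Y.T + 5 * np.eye(5)
M = rng.standard_normal((5, 5))                                # almost surely invertible

print(forstner(A, B))
print(forstner(np.linalg.inv(A), np.linalg.inv(B)))            # invariance under inversion
print(forstner(M @ A @ M.T, M @ B @ M.T))                      # invariance under congruence M (.) M^T
```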
32 Example: computerized tomography
X-rays travel from sources to detectors through an object of interest. The intensities from the sources are measured at the detectors, and the goal is to reconstruct the density of the object. [Figure: measurement geometry, showing intensity, detector, and pixel.]
This synthetic example is motivated by a real application: real-time X-ray imaging of logs that enter a saw mill for the purpose of automatic quality control.
33 Example: computerized tomography
Weaker data ⇒ faster decay of the generalized eigenvalues ⇒ lower-order approximations possible.
[Figure: generalized eigenvalues versus index i, and Förstner distance d_F versus rank of the update, for the limited-angle and full-angle cases.]
In the limited-angle case, roughly r = 200 is enough to get a good approximation (with full angle, r ≈ 800 is needed).
[Figure: prior and posterior fields compared with rank-50, rank-100, and rank-200 approximations.]
34 Example: computerized tomography
Approximation of the mean:
$\mu_{\mathrm{pos}}(y) = \Gamma_{\mathrm{pos}}\, G^T \Gamma_{\mathrm{obs}}^{-1}\, y \approx A_r\, y$
35 Questions yet to answer
- How to simulate from or explore the posterior distribution?
- How to make Bayesian inference computationally tractable when the forward model is expensive (e.g., a PDE) and the parameters are high- or infinite-dimensional?
- Downstream questions: model selection, optimal experimental design, decision-making, etc.