Course on Inverse Problems

California Institute of Technology
Division of Geological and Planetary Sciences
March 26 - May 25, 2007

Course on Inverse Problems
Albert Tarantola
Institut de Physique du Globe de Paris

Lesson XVI

CONCLUSION OF THE LAST LESSON (IN THREE SLIDES): Let p = ϕ(z, S) be the mapping that, to any medium z = {z(x)} and any source S = {S(x, t)}, associates the wavefield p = {p(x, t)} defined by the resolution of the acoustic wave equation with Initial Conditions of Rest,

$$ \frac{\exp(-z(x))}{C^2}\,\frac{\partial^2 p}{\partial t^2}(x,t) \;-\; \Delta p(x,t) \;=\; S(x,t) \quad ; \quad \text{(ICR)} . $$

We saw above the following result: For a fixed source field S = {S(x, t)}, the tangent linear mapping to the mapping p = ϕ(z, S) at $z_0 = \{z_0(x)\}$ is the linear mapping $\Phi_0$ that to any δz = {δz(x)} associates the δp = {δp(x, t)} that is the solution of the differential equation (with Initial Conditions of Rest)

$$ \frac{\exp(-z_0(x))}{C^2}\,\frac{\partial^2 \delta p}{\partial t^2}(x,t) \;-\; \Delta\,\delta p(x,t) \;=\; \Sigma(x,t) \quad ; \quad \text{(ICR)} , $$

where Σ(x, t) is the Born secondary source

$$ \Sigma(x,t) \;=\; \frac{\partial^2 p_0}{\partial t^2}(x,t)\,\frac{\exp(-z_0(x))}{C^2}\,\delta z(x) . $$
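To make the structure of this secondary source concrete, here is a small illustrative NumPy sketch (not part of the original lesson): it assumes the background wavefield is stored as an array p0[ix, it] on a regular space-time grid, and the names (C0, dt, and so on) are hypothetical.

```python
import numpy as np

def born_secondary_source(p0, z0, dz, dt, C0=1.0):
    """Sigma(x,t) = (d^2 p0/dt^2)(x,t) * exp(-z0(x)) / C0^2 * dz(x).

    p0 : array (nx, nt), background wavefield
    z0 : array (nx,),    background logarithmic velocity
    dz : array (nx,),    model perturbation
    """
    # second time derivative of the background wavefield (centered differences)
    d2p0_dt2 = np.zeros_like(p0)
    d2p0_dt2[:, 1:-1] = (p0[:, 2:] - 2.0 * p0[:, 1:-1] + p0[:, :-2]) / dt**2
    # space-dependent factor exp(-z0)/C0^2 * dz, broadcast over time
    factor = np.exp(-z0) / C0**2 * dz
    return d2p0_dt2 * factor[:, None]
```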

We now have the following result: The transpose mapping $\Phi_0^t$ is the linear mapping that to any $\hat{\delta p} = \{\hat{\delta p}(x, t)\}$ associates the $\hat{\delta z} = \{\hat{\delta z}(x)\}$ defined by the two equations written four slides ago. That is, (i) compute the wavefield π(x, t) that is the solution of the wave equation, with Final (instead of Initial) Conditions of Rest, and whose source function (at the right-hand side of the wave equation) is $\hat{\delta p}(x, t)$,

$$ \frac{\exp(-z_0(x))}{C^2}\,\frac{\partial^2 \pi}{\partial t^2}(x,t) \;-\; \Delta\pi(x,t) \;=\; \hat{\delta p}(x,t) \quad ; \quad \text{(FCR)} , $$

and (ii) evaluate

$$ \hat{\delta z}(x) \;=\; \frac{\exp(-z_0(x))}{C^2}\int dt\;\frac{\partial^2 p_0}{\partial t^2}(x,t)\,\pi(x,t) . $$
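Schematically (a hypothetical sketch, not the author's code), and assuming some routine solve_wave(z0, source, final_conditions=True) is available that integrates the same wave equation backward in time with final conditions of rest, the two-step evaluation of the transpose could look like this:

```python
import numpy as np

def apply_Phi_transpose(dp_hat, p0, z0, dt, solve_wave, C0=1.0):
    """Adjoint of the Born operator: dp_hat (nx, nt) -> dz_hat (nx,).

    (i)  back-propagate dp_hat with Final Conditions of Rest;
    (ii) time-integrate (d^2 p0/dt^2) * pi and scale by exp(-z0)/C0^2.
    solve_wave is an assumed external solver of the wave equation.
    """
    # (i) wavefield pi(x, t) generated by the "residual" source dp_hat
    pi = solve_wave(z0, dp_hat, final_conditions=True)

    # (ii) second time derivative of the background wavefield p0
    d2p0_dt2 = np.zeros_like(p0)
    d2p0_dt2[:, 1:-1] = (p0[:, 2:] - 2.0 * p0[:, 1:-1] + p0[:, :-2]) / dt**2

    # discrete version of the time integral, weighted by exp(-z0)/C0^2
    return np.exp(-z0) / C0**2 * np.sum(d2p0_dt2 * pi, axis=1) * dt
```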

The mapping p = ϕ(z, S) depends on the (logarithmic) velocity model z = {z(x)} and on the source model S = {S(x, t)}. We have examined the dependence of the mapping on z, but not yet on S. This dependence is linear, so we don't need to evaluate the tangent linear operator (it is the wave equation operator L itself, associated with initial conditions of rest). As for the transpose operator $L^t$, it equals L, except that it is associated with final conditions of rest.

And let's now move towards the setting of a least-squares optimization problem involving waveforms. Recall: we had some observable parameters o and some model parameters m; the forward modeling relation was $m \mapsto o = \psi(m)$; we introduced the tangent linear operator Ψ via

$$ \psi(m + \delta m) \;=\; \psi(m) + \Psi\,\delta m + \dots \;, $$

and we arrived at the iterative (steepest-descent) algorithm

$$ m_{k+1} \;=\; m_k \;-\; \mu\left(\, C_{\text{prior}}\,\Psi_k^t\,C_{\text{obs}}^{-1}\,\big(\psi(m_k) - o_{\text{obs}}\big) \;+\; (m_k - m_{\text{prior}}) \,\right) , $$

where the four typical elements $m_{\text{prior}}$, $C_{\text{prior}}$, $o_{\text{obs}}$ and $C_{\text{obs}}$ of a least-squares problem appear.
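In a purely illustrative dense-matrix form (an actual waveform problem would apply all these operators implicitly, and every name below is hypothetical), one iteration of this algorithm could read:

```python
import numpy as np

def steepest_descent_step(m_k, psi, Psi_k, m_prior, C_prior, o_obs, C_obs_inv, mu):
    """One iteration of
       m_{k+1} = m_k - mu * ( C_prior Psi_k^t C_obs^{-1} (psi(m_k) - o_obs)
                              + (m_k - m_prior) ).

    psi   : callable, forward modeling m -> o
    Psi_k : (n_obs, n_model) tangent linear operator at m_k (dense, for clarity)
    """
    residual = psi(m_k) - o_obs
    gradient_term = C_prior @ (Psi_k.T @ (C_obs_inv @ residual))
    return m_k - mu * (gradient_term + (m_k - m_prior))
```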

In a problem involving waveforms, it is better to consider the operator $m \mapsto o = \psi(m)$ as composed of two parts: an operator $m \mapsto p = \varphi(m)$ that to any model m associates the wavefield p, and an operator $p \mapsto o = \gamma(p)$ that to the wavefield p associates the observables o. The mapping p = ϕ(m) has been extensively studied in the last lesson (its tangent linear mapping and its transpose have been characterized). The mapping o = γ(p) can, for instance, correspond to the case where the observable parameters are the values of the wavefield at some points (a set of seismograms), or the observable parameters can be some more complicated function of the wavefield. We have o = γ(p) = γ(ϕ(m)) = ψ(m).

We have already introduced the series expansion $\psi(m + \delta m) = \psi(m) + \Psi\,\delta m + \dots$ Introducing Φ, the linear operator tangent to ϕ, and Γ, the linear operator tangent to γ, we can now also write

$$ \psi(m + \delta m) \;=\; \gamma\big(\varphi(m + \delta m)\big) \;=\; \gamma\big(\varphi(m) + \Phi\,\delta m + \dots\big) \;=\; \gamma\big(\varphi(m)\big) + \Gamma\,\Phi\,\delta m + \dots \;=\; \psi(m) + \Gamma\,\Phi\,\delta m + \dots \;, $$

so we have $\Psi = \Gamma\,\Phi$ and, therefore, $\Psi^t = \Phi^t\,\Gamma^t$.

The least-squares (steepest-descent) optimization algorithm then becomes

$$ m_{k+1} \;=\; m_k \;-\; \mu\left(\, C_{\text{prior}}\,\Phi_k^t\,\Gamma_k^t\,C_{\text{obs}}^{-1}\,\big(\gamma(\varphi(m_k)) - o_{\text{obs}}\big) \;+\; (m_k - m_{\text{prior}}) \,\right) . $$

We already know what $\Phi^t$ is (last lesson). Let us examine a couple of examples of what $\Gamma^t$ may be.

(i) The observable is one seismogram, i.e., the time-dependent value of the wavefield at one space location. The mapping $p \mapsto o = \gamma(p)$ then is, in fact, a linear mapping $p \mapsto o = \Gamma\,p$,

$$ p = \{\,p(x,t)\,\} \quad \xrightarrow{\;\Gamma\;} \quad o = \{\,p(x_0,t)\,\} . $$

We have the two duality products

$$ \langle\,\hat p\,,\,p\,\rangle = \int dv(x)\int dt\;\hat p(x,t)\,p(x,t) \;, \qquad \langle\,\hat o\,,\,o\,\rangle = \int dt\;\hat o(x_0,t)\,o(x_0,t) \;, $$

and the transpose $\Gamma^t$ must verify, for any $\hat{\delta o}$ and any $\delta p$,

$$ \langle\,\hat{\delta o}\,,\,\Gamma\,\delta p\,\rangle \;=\; \langle\,\Gamma^t\,\hat{\delta o}\,,\,\delta p\,\rangle \;, $$

i.e.,

$$ \int dt\;\hat{\delta o}(x_0,t)\,\delta p(x_0,t) \;=\; \int dv(x)\int dt\;\underbrace{(\Gamma^t\,\hat{\delta o})(x,t)}_{\delta(x-x_0)\,\hat{\delta o}(x_0,t)}\;\delta p(x,t) . $$

Therefore, the mapping $\hat{\delta o} \mapsto \hat{\delta p} = \Gamma^t\,\hat{\delta o}$ is the (linear) mapping that to any $\hat{\delta o}(x_0,t)$ associates the $\hat{\delta p}(x,t)$ given by

$$ \hat{\delta p}(x,t) \;=\; \delta(x-x_0)\;\hat{\delta o}(x_0,t) . $$

With this result in view, and the result demonstrated during the last lesson, we can now interpret all the operations of the steepest-descent algorithm

$$ m_{k+1} \;=\; m_k \;-\; \mu\left(\, C_{\text{prior}}\,\Phi_k^t\,\Gamma_k^t\,C_{\text{obs}}^{-1}\,\big(\gamma(\varphi(m_k)) - o_{\text{obs}}\big) \;+\; (m_k - m_{\text{prior}}) \,\right) . $$
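On a discrete grid this pair of mappings is straightforward to write down. Here is a hedged sketch (assuming the wavefield is stored as p[ix, it] with spatial step dx, so that the discrete counterpart of δ(x − x₀) is 1/dx at grid point ix0):

```python
import numpy as np

def Gamma(p, ix0):
    """Sample the wavefield at grid point ix0: o(t) = p(x0, t)."""
    return p[ix0, :].copy()

def Gamma_transpose(do_hat, nx, ix0, dx):
    """Adjoint: inject do_hat(t) at x0 as a discrete delta in space.
    The 1/dx factor is the discrete counterpart of delta(x - x0)."""
    dp_hat = np.zeros((nx, do_hat.size))
    dp_hat[ix0, :] = do_hat / dx
    return dp_hat
```

A quick dot-product test, comparing np.sum(Gamma(p, ix0) * do_hat) * dt with np.sum(Gamma_transpose(do_hat, nx, ix0, dx) * p) * dx * dt, is a convenient way to check that the two routines are indeed transposes with respect to the duality products above.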

(ii) The observable is the energy of one seismogram,

$$ o \;=\; E(x_0) \;=\; \int dt\;p(x_0,t)^2 . $$

The mapping $p \mapsto o = \gamma(p)$ is not linear. Let us evaluate the tangent linear mapping, using the standard expansion:

$$ \gamma(p + \delta p) \;=\; \int dt\;\big(p(x_0,t) + \delta p(x_0,t)\big)^2 \;=\; \int dt\;\big(p(x_0,t)^2 + 2\,p(x_0,t)\,\delta p(x_0,t) + \dots\big) \;=\; \int dt\;p(x_0,t)^2 \;+\; \int dt\;\underbrace{2\,p(x_0,t)}_{\Gamma(x_0,t)}\,\delta p(x_0,t) + \dots \;=\; \gamma(p) \;+\; \int dt\;\Gamma(x_0,t)\,\delta p(x_0,t) + \dots \;. $$

We have therefore identified the tangent linear operator Γ (to any δp it associates $\delta o = \int dt\;2\,p(x_0,t)\,\delta p(x_0,t)$) and the Fréchet derivative ($\Gamma(x_0,t) = 2\,p(x_0,t)$).

To identify the transpose operator $\Gamma^t$, we have here the two duality products

$$ \langle\,\hat p\,,\,p\,\rangle = \int dv(x)\int dt\;\hat p(x,t)\,p(x,t) \;, \qquad \langle\,\hat o\,,\,o\,\rangle = \hat o(x_0)\,o(x_0) . $$

The transpose $\Gamma^t$ must verify, for any $\hat{\delta o}$ and any $\delta p$,

$$ \langle\,\hat{\delta o}\,,\,\Gamma\,\delta p\,\rangle \;=\; \langle\,\Gamma^t\,\hat{\delta o}\,,\,\delta p\,\rangle \;, $$

i.e.,

$$ \hat{\delta o}(x_0)\int dt\;2\,p(x_0,t)\,\delta p(x_0,t) \;=\; \int dv(x)\int dt\;\underbrace{(\Gamma^t\,\hat{\delta o})(x,t)}_{2\,p(x_0,t)\,\delta(x-x_0)\,\hat{\delta o}(x_0)}\;\delta p(x,t) . $$

Therefore, the mapping $\hat{\delta o} \mapsto \hat{\delta p} = \Gamma^t\,\hat{\delta o}$ is the (linear) mapping that to any $\hat{\delta o}(x_0)$ associates the $\hat{\delta p}(x,t)$ given by

$$ \hat{\delta p}(x,t) \;=\; 2\,p(x_0,t)\;\delta(x-x_0)\;\hat{\delta o}(x_0) . $$

With this result in view, we can again interpret all the operations of the steepest-descent algorithm

$$ m_{k+1} \;=\; m_k \;-\; \mu\left(\, C_{\text{prior}}\,\Phi_k^t\,\Gamma_k^t\,C_{\text{obs}}^{-1}\,\big(\gamma(\varphi(m_k)) - o_{\text{obs}}\big) \;+\; (m_k - m_{\text{prior}}) \,\right) . $$
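With the same discrete conventions as before, a sketch of γ, Γ and Γᵗ for the energy observable (array layout and names again hypothetical) is:

```python
import numpy as np

def gamma_energy(p, ix0, dt):
    """Nonlinear observable: energy of the seismogram at grid point ix0."""
    return np.sum(p[ix0, :]**2) * dt

def Gamma_energy(p, dp, ix0, dt):
    """Tangent linear operator at p: do = integral of 2 p(x0,t) dp(x0,t) dt."""
    return np.sum(2.0 * p[ix0, :] * dp[ix0, :]) * dt

def Gamma_energy_transpose(p, do_hat, ix0, dx):
    """Adjoint: dp_hat(x,t) = 2 p(x0,t) delta(x - x0) do_hat  (discrete delta = 1/dx)."""
    dp_hat = np.zeros_like(p)
    dp_hat[ix0, :] = 2.0 * p[ix0, :] * do_hat / dx
    return dp_hat
```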

8.5 Fitting Waveforms to Obtain the Medium Properties

Before developing the theory in section 8.5.2 and examining the associated computer codes in section 8.5.3, let us first discuss the problem and have a look at the images.

8.5.1 The Story, in Short

We consider the propagation of acoustic (i.e., pressure) waves in a one-dimensional medium. Let x be a Cartesian coordinate and t be ordinary (Newtonian) time. When a pressure perturbation p(x, t) propagates in a medium where the velocity of the acoustic waves at point x is c(x), the pressure field p(x, t) satisfies the well-known (one-dimensional) acoustic wave equation

$$ \frac{1}{c(x)^2}\,\frac{\partial^2 p}{\partial t^2}(x,t) \;-\; \frac{\partial^2 p}{\partial x^2}(x,t) \;=\; S(x,t) , \tag{8.94} $$

where S(x, t) is the source function, i.e., the function representing the excitation that produces the acoustic waves. So, the medium is characterized by the velocity function c(x). We assume that the source field S(x, t) is perfectly known, and the goal is to use some observations of the pressure field (i.e., some observed seismograms) to infer properties of c(x).
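For concreteness, here is a minimal explicit finite-difference sketch of how equation (8.94) might be integrated with initial conditions of rest. It is not the code discussed in section 8.5.3: the boundary treatment is crude (edge values held at zero) and the stability condition max(c)·dt/dx ≤ 1 is simply assumed to hold.

```python
import numpy as np

def solve_acoustic_1d(c, S, dx, dt):
    """Explicit finite differences for  (1/c^2) p_tt - p_xx = S  with initial conditions of rest.

    c : (nx,)    velocity model
    S : (nx, nt) source field
    Returns p : (nx, nt) pressure wavefield.
    """
    nx, nt = S.shape
    p = np.zeros((nx, nt))          # p = 0 at the first two time steps: conditions of rest
    for it in range(1, nt - 1):
        lap = np.zeros(nx)
        lap[1:-1] = (p[2:, it] - 2.0 * p[1:-1, it] + p[:-2, it]) / dx**2
        # from the wave equation: p_tt = c^2 * (p_xx + S)
        p[:, it + 1] = (2.0 * p[:, it] - p[:, it - 1]
                        + dt**2 * c**2 * (lap + S[:, it]))
    return p
```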

8.5.1.1 A Priori Information

We are told that the actual (unknown) medium $c(x)_{\text{actual}}$ was created as a random realization of a log-Gaussian distribution, with known log-mean and known log-covariance. This means that the logarithm of the actual velocity, defined as

$$ z(x)_{\text{actual}} \;=\; \log\left(\frac{c(x)_{\text{actual}}}{c_0}\right)^{\!2} \;=\; 2\,\log\frac{c(x)_{\text{actual}}}{c_0} , \tag{8.95} $$

where $c_0$ is an arbitrary constant, was created as a random realization of a Gaussian distribution, with known mean and known covariance. Of course, the Gaussian model would be inappropriate for a positive quantity like c(x), while it is appropriate for the logarithmic velocity z(x). Figure 8.13 shows the mean of the prior Gaussian distribution (a linear function), and figure 8.14 shows the covariance function (an exponential covariance).

Figure 8.13: The mean of the a priori Gaussian distribution for the logarithmic velocity is the linear function z(x) = α + β x, with α = QQQ and β = XXX.

Figure 8.14: Instead of choosing an a priori covariance operator directly, I choose an operator L (simple to operate with), and define the covariance operator via C = L Lᵗ. The operator L is such that C is an exponential covariance (see section 1.8). This figure is the representation of the (prior) covariance function C(x, x′).

It is easy to generate pseudo-random samples of this a priori distribution (Gaussian for the logarithmic velocity, log-Gaussian for the velocity). Figure 8.15 displays half a dozen pseudo-random samples of the velocity function c(x), together with the log-mean of the prior distribution (red line). Those samples would be the starting point of a Monte Carlo resolution of the inference problem stated below, but this is not the way we follow here.
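One possible way of realizing the C = L Lᵗ construction and the prior sampling in a computer is sketched below. This is only an illustration: a dense Cholesky factor plays the role of L, and the values of α, β, the variance, and the correlation length are placeholders, not the ones behind figures 8.13-8.15.

```python
import numpy as np

def exponential_covariance(x, sigma, ell):
    """C(x, x') = sigma^2 * exp(-|x - x'| / ell)."""
    return sigma**2 * np.exp(-np.abs(x[:, None] - x[None, :]) / ell)

rng = np.random.default_rng(0)
x = np.arange(100.0)                                       # grid points 0..99
alpha, beta = -0.25, -0.015                                # placeholder alpha, beta of figure 8.13
z_prior = alpha + beta * x                                 # linear prior mean of the log-velocity
C_prior = exponential_covariance(x, sigma=0.2, ell=10.0)   # placeholder values
L = np.linalg.cholesky(C_prior + 1e-10 * np.eye(x.size))   # so that C_prior = L @ L.T

# pseudo-random prior samples of the logarithmic velocity, and of the velocity itself
z_samples = z_prior + (L @ rng.standard_normal((x.size, 6))).T
c_samples = np.exp(0.5 * z_samples)                        # since z = 2 log(c / c0), with c0 = 1
```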

Figure 8.15: Some pseudo-random samples of the a priori (log-Gaussian) distribution for the velocity, together with the log-mean of the prior distribution (red line) (this line is, in fact, the exponential of the function represented in figure 8.13). If enough of those samples were given, then, using statistical techniques, the log-mean could be evaluated, as well as the covariance function in figure 8.14.

8.5.1.2 Observations

When selecting one randomly generated sample of the a priori distribution just described (the sample displayed in figure 8.16), and when using the source function S(x, t) displayed in figure 8.17, the pressure wavefield displayed in figure 8.18 is generated (using the acoustic wave equation 8.94).

Figure 8.16: The true velocity model used to generate the observed seismograms in figure 8.19. For the exercise below, we are not supposed to have seen this true velocity model: we only know that it is one particular sample of the a priori log-Gaussian distribution.

Figure 8.17: The source function S(x, t): a point in space and the derivative of a Gaussian in time. For the exercise below, this source function is assumed known.

Figure 8.18: The pressure wavefield obtained when propagating, in the velocity model displayed in figure 8.16, the source function displayed in figure 8.17 (using initial conditions of rest). For the exercise below, we are not supposed to have seen the whole of this pressure wavefield: only the two noisy seismograms in figure 8.19 are available.

At two positions x₁ and x₂ (corresponding to grid points 45 and 87), two seismograms have recorded the pressure values as a function of time. The recorded values are contaminated by instrumental noise, simulated here as a white noise of variance QQQ. The two observed seismograms are displayed in figure 8.19. One must understand that the two figures 8.17 (the source field) and 8.19 (the two observed seismograms) are available for further work (our problem shall be to say as much as possible about the actual velocity function), but that the two figures 8.16 (the actual velocity function) and 8.18 (the actual pressure wavefield) are only here to show how the synthetic data were built, and are not available for later use.

Figure 8.19: The two seismograms observed at grid points 45 and 87, with instrumental noise added (white noise with variance QQQ).

It is sometimes useful to also display the residual seismograms, defined as the difference between the observed seismograms and those that would be predicted from the model at the center of the prior distribution (red line in figure 8.15). They are displayed in figure 8.20, and, although for the computer they will just correspond to an intermediate computation, we, humans, may already get a first indication of the difficulty of the optimization task to be faced later.

Figure 8.20: The initial residual seismograms (difference between the observed seismograms and those that would be predicted from the model at the center of the prior distribution [red line in figure 8.15]).

8.5.1.3 A Posteriori Information

Let us denote by z a logarithmic velocity model, by $p_{\text{obs}}$ the two observed seismograms (data), and by

$$ z \;\mapsto\; p(z) \tag{8.96} $$

the mapping corresponding to the prediction of the data. Let us also denote by $C_{\text{obs}}$ the covariance operator describing the noise in the data. The a priori information corresponds to the mean a priori model $z_{\text{prior}}$ and to the covariance operator $C_{\text{prior}}$. For this weakly nonlinear problem, the standard least-squares method applies. It tells us that the a posteriori information in the model space is also a Gaussian, with mean $z_{\text{post}}$ and covariance $C_{\text{post}}$. The posterior mean $z_{\text{post}}$ is to be obtained by searching for the point z at which the misfit function

$$ \chi^2(z) \;=\; \|\,p(z) - p_{\text{obs}}\,\|^2 \;+\; \|\,z - z_{\text{prior}}\,\|^2 \;=\; \big(p(z) - p_{\text{obs}}\big)^t\,C_{\text{obs}}^{-1}\,\big(p(z) - p_{\text{obs}}\big) \;+\; \big(z - z_{\text{prior}}\big)^t\,C_{\text{prior}}^{-1}\,\big(z - z_{\text{prior}}\big) \tag{8.97} $$

attains its minimum.
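In dense-matrix form (an illustrative sketch only, with the two seismograms flattened into a single data vector; in the actual code $C_{\text{obs}}^{-1}$ and $C_{\text{prior}}^{-1}$ would be applied without forming the matrices explicitly), the misfit (8.97) is simply:

```python
import numpy as np

def chi2(z, p_of_z, p_obs, C_obs_inv, z_prior, C_prior_inv):
    """Misfit (8.97): data term plus prior term."""
    r_data = p_of_z(z) - p_obs        # predicted minus observed seismograms
    r_model = z - z_prior
    return float(r_data @ (C_obs_inv @ r_data) + r_model @ (C_prior_inv @ r_model))
```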

This point is obtained using an iterative algorithm, each iteration requiring the propagation of the actual source in the current medium (using initial conditions of rest), and the propagation of a virtual source (the weighted seismogram residuals) using final conditions of rest (see the details in section 8.5.2). Once the point $z_{\text{post}}$ has been obtained, the posterior covariance operator can be obtained as

$$ C_{\text{post}} \;=\; \big(\,P_{\text{post}}^t\,C_{\text{obs}}^{-1}\,P_{\text{post}} \;+\; C_{\text{prior}}^{-1}\,\big)^{-1} , \tag{8.98} $$

where, if we introduce the linear tangent operator P (whose kernel is called the Fréchet derivative) via

$$ p(z + \delta z) \;=\; p(z) + P\,\delta z + \dots \;, \tag{8.99} $$

the operator $P_{\text{post}}$ is the value of P at the point $z_{\text{post}}$. When implementing all this in a computer (see section 8.5.3) one obtains a $z_{\text{post}}$ that is not represented here (its exponential is the red line in figure 8.22), and the posterior covariance operator displayed in figure 8.21.
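With the linear tangent operator at $z_{\text{post}}$ available as an explicit array P_post (again a purely illustrative dense sketch), equation (8.98) becomes:

```python
import numpy as np

def posterior_covariance(P_post, C_obs_inv, C_prior_inv):
    """C_post = (P_post^t C_obs^{-1} P_post + C_prior^{-1})^{-1}   (equation 8.98)."""
    H = P_post.T @ C_obs_inv @ P_post + C_prior_inv
    return np.linalg.inv(H)
```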

Figure 8.21: The posterior covariance operator (see the text for the discussion).

It is now easy (see 8.5.3) to generate pseudo-random samples of this a posteriori distribution (Gaussian for the logarithmic velocity, log-Gaussian for the velocity), and half a dozen of them are represented in figure 8.22. The comparison of figure 8.15 (prior samples) and figure 8.22 (posterior samples) informs us about the gain in information obtained through the use of the two seismograms. The posterior samples are much closer to each other than were the prior samples. All of them display a jump in velocity around grid point 30, and a bump around grid point 80 (this, of course, reflects two conspicuous properties of the true model in figure 8.16).

Figure 8.22: A few samples of the posterior distribution, to be compared with the samples of the prior distribution (in figure 8.15). The same random seeds have been used to generate the prior and the posterior samples (hence the similarity). The red line is the log-mean of all the samples (or, equivalently, the exponential of the mean of the logarithmic velocity samples, not displayed).
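Generating these posterior samples amounts to drawing from N(z_post, C_post). A hedged sketch follows; reusing the same standard normal draws w as for the prior samples reproduces the "same random seeds" pairing mentioned in the caption of figure 8.22.

```python
import numpy as np

def gaussian_samples(mean, cov, w):
    """Samples mean + L w, where cov = L L^t (Cholesky) and w has shape (n_samples, n)."""
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(cov.shape[0]))
    return mean + w @ L.T

# z_samples_post = gaussian_samples(z_post, C_post, w)   # same w as used for the prior samples
# c_samples_post = np.exp(0.5 * z_samples_post)          # back to velocity (again with c0 = 1)
```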


Current least-squares practice, unfortunately, is not to display the posterior samples in figure 8.22, but to just display the log-mean model (the red line), calling it the solution of the least-squares inversion. Properly speaking, the red line is not a model: the individual samples are possible models; the red line is just the log-average of all possible models. We may now look at the final residual seismograms. Figure 8.23 displays the difference between the seismograms predicted using the center of the posterior distribution and the observed seismograms. Note: I should perhaps suppress this figure 8.23 as, perhaps, only figure 8.24 matters. Figure 8.24 displays the residual seismograms: the difference between the seismograms predicted by the six posterior random realizations and the observed seismograms. Note: some of these residuals (for instance, for random model #6) are quite large. I don't fully understand this.

Figure 8.23: Final residual seismograms: difference between the seismograms predicted using the center of the posterior distribution and the observed seismograms. This is to be compared with figure 8.20.

Figure 8.24: Final residual seismograms: difference between the seismograms predicted by the six posterior random realizations and the observed seismograms. Note: I was expecting those residuals to be as small as those corresponding to the mean model (i.e., as small as the residuals in figure 8.23). Perhaps we see here the limits of the assumption that the posterior distribution is approximately Gaussian.