14 Linear inversion

In Chapter 12 we saw how the parametrization of a continuous model allows us to formulate a discrete linear relationship between data d and model m. With unknown corrections added to the model vector, this linear relationship remains formally the same if we write the physical model parameters as m_1 and the corrections as m_2 but combine both in one vector m:

A_1 m_1 + A_2 m_2 = Am = d.   (12.1 again)

Assuming we have M_1 model parameters and M_2 corrections, this is a system of N equations (data) and M = M_1 + M_2 unknowns. For more than one reason the solution of the system is not straightforward:

- Even if we do not include multiple measurements along the same path, many of the N rows will be dependent. Since the data always contain errors, this implies we cannot solve the system exactly, but have to minimize the misfit between Am and d. For this misfit we can define different norms, and we face a choice of options.
- Despite the fact that we have (usually) many more data than unknowns (i.e. N >> M), the system is almost certainly ill-posed in the sense that small errors in d can lead to large errors in m; a parameter m_j may be completely undetermined (A_ij = 0 for all i) if it represents a node that is far away from any raypath.
- We cannot escape making a subjective choice among an infinite set of equally satisfactory solutions by imposing a regularization strategy.
- For large M, the numerical computation of the solution has to be done with an iterative matrix solver which is often halted when a satisfactory fit is obtained. Such efficient shortcuts interfere with the regularization strategy.

We shall deal with each of these aspects in succession. Appendix D introduces some concepts of probability theory that are needed in this chapter.
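As a concrete illustration of this bookkeeping, the following NumPy sketch (my own, not from the text; all dimensions and matrices are invented) assembles the augmented system of (12.1) from a block A_1 of physical sensitivities and a block A_2 for the corrections:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M1, M2 = 50, 8, 2            # hypothetical: 50 data, 8 parameters, 2 corrections
    A1 = rng.normal(size=(N, M1))   # sensitivity to physical model parameters m1
    A2 = rng.normal(size=(N, M2))   # sensitivity to corrections m2 (e.g. origin times)
    A = np.hstack([A1, A2])         # the combined N x (M1 + M2) matrix of (12.1)
    m_true = rng.normal(size=M1 + M2)
    d = A @ m_true + 0.1 * rng.normal(size=N)   # synthetic data with noise
    print(A.shape)                  # (50, 10): N equations, M = M1 + M2 unknowns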

14.1 Maximum likelihood estimation and least squares

In experimental sciences, the most commonly used misfit criterion is the criterion of least squares, in which we minimize χ² ('chi square') as a function of the model:

χ²(m) = Σ_{i=1}^N ( Σ_{j=1}^M A_ij m_j − d_i )² / σ_i² = min,   (14.1)

where σ_i is the standard deviation in datum i; χ² is a direct measure of the data misfit, in which we weigh the misfits inversely with their standard errors σ_i. For uncorrelated and normally distributed errors, the principle of maximum likelihood leads naturally to the least squares definition of misfit. If there are no sources of bias, the expected value E(d_i) of d_i (the average of infinitely many observations of the same observable) is equal to the correct or error-free value. In practice, we have only one observation for each datum, but we usually have an educated guess at the magnitude of the errors. We almost always use a normal distribution for errors, and assume errors to be uncorrelated, such that the probability density is given by a Gaussian or normal distribution of the form:

P(d_i) = (1 / σ_i √(2π)) exp[ −(d_i − E(d_i))² / 2σ_i² ].   (14.2)

The joint probability density for the observation of an N-tuple of data with independent errors d = (d_1, d_2, ..., d_N) is found by multiplying the individual probability densities for each datum:

P(d) = Π_{i=1}^N (1 / σ_i √(2π)) exp[ −(d_i − E(d_i))² / 2σ_i² ].   (14.3)

If we replace the expected values in (14.3) with the predicted values from the model parameters, we obtain again a probability, but now one that is conditional on the model parameters taking the values m_j:

P(d|m) = Π_{i=1}^N (1 / σ_i √(2π)) exp[ −(d_i − Σ_j A_ij m_j)² / 2σ_i² ].   (14.4)

We usually assume that there are no extra errors introduced by the modelling (e.g. we ignore the approximation errors introduced by linearizations, neglect of anisotropy, or the shortcomings of ray theory etc.). In fact, if such modelling errors are also uncorrelated, unbiased and normally distributed, we can take them into account by including them in σ_i, but this is a big 'if'. See Tarantola [351] for a much more comprehensive discussion of this issue.
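Since the logarithm of (14.4) is −χ²/2 plus a constant that does not depend on the model, maximizing the likelihood and minimizing χ² are the same computation. A minimal sketch (assuming NumPy arrays A, m, d and an array sigma of standard deviations; an illustration, not the book's code):

    import numpy as np

    def chi_square(A, m, d, sigma):
        # Data misfit of eq. (14.1): sum of squared, error-weighted residuals.
        r = (A @ m - d) / sigma
        return np.sum(r**2)

    def log_likelihood(A, m, d, sigma):
        # Log of the Gaussian density (14.4), dropping the model-independent
        # normalization term -sum(log(sigma * sqrt(2*pi))).
        return -0.5 * chi_square(A, m, d, sigma)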

Clearly, one would like to have a model that is associated with a high probability for its predicted data vector. This leads to the definition of the likelihood function L for the model m given the observation of the data d:

L(m|d) = P(d|m) ∝ exp[ −½ χ²(m) ].

Thus, maximizing the likelihood for a model involves minimizing χ². Since this involves minimizing the sum of squares of data misfit, the method is more generally known as the method of least squares.

The strong point of the method of least squares is that it leads to very efficient methods of solving (12.1). Its major weakness is the reliance on a normal distribution of the errors, which may not always be the case. Because of the quadratic dependence on the misfit, outliers (misfits of several standard deviations) have an influence on the solution that may be out of proportion, which means that errors may dominate in the solution. For a truly normal distribution, large errors have such a low probability of occurrence that we would not worry about this. In practice, however, many data do suffer from outliers. For picked arrival times Jeffreys [146] had already observed that the data have a tail-like distribution that deviates from the Gaussian for large deviations from the mean t_m, mainly because a later arrival is misidentified as P or S:

P(t) = (1 − ε) (1/σ√(2π)) e^{−(t − t_m)²/2σ²} + ε g(t),

where the probability density g(t) varies slowly and where ε << 1. A simple method to bring the data distribution close to normal is to reject outliers with a delay that exceeds the largest delay time to be expected from reasonable effects of lateral heterogeneity. This decision can be made after a first trial inversion: for example, one may reject all data that leave a residual in excess of 3σ after a first inversion attempt.

If we divide all data and the corresponding row of A by their standard deviations, we end up with a data vector that is univariant, i.e. all standard deviations are equal to 1. Thus, without loss of generality, we may assume that the data are univariant, in which case we see from (14.1) that χ² is simply the squared length of the residual vector r = d − Am.
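The 3σ rejection strategy just described can be prototyped in a few lines. A sketch (my own; the threshold nsigma = 3 and the use of numpy.linalg.lstsq are illustrative assumptions, not prescriptions from the text):

    import numpy as np

    def reject_outliers(A, d, sigma, nsigma=3.0):
        # One trial inversion on error-scaled (univariant) data, then rejection
        # of all data whose residual exceeds nsigma standard deviations.
        Aw, dw = A / sigma[:, None], d / sigma
        m_trial, *_ = np.linalg.lstsq(Aw, dw, rcond=None)
        keep = np.abs(Aw @ m_trial - dw) <= nsigma
        return A[keep], d[keep], sigma[keep], m_trial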

From Figure 14.1 we see that r is then perpendicular to the subspace spanned by all vectors Ay (the range R(A) of A). For if it were not, we could add a δm to m such that Aδm reduces the length of r. Thus, for all y the dot product between r and Ay must be zero:

r · Ay = A^T r · y = A^T(d − Am) · y = 0,

where A^T is the transpose of A (i.e. (A^T)_ij = A_ji). Since this dot product is 0 for all y, clearly A^T(d − Am) = 0, or:

A^T A m = A^T d,   (14.5)

which is known as the set of 'normal equations' to solve the least-squares problem.

Fig. 14.1. If the data vector d does not lie in the range R(A) of A, the best we can do is to minimize the length of the residual vector r = d − Am. This implies that r must be perpendicular to any possible vector Ay.

Chi square is an essential statistical measure of the goodness of fit. In the hypothetical case that we satisfy every datum with a misfit of one standard deviation we find χ² = N; clearly values much higher than N are unwanted because the misfit is higher than could be expected from the knowledge of data errors, and values much lower than N indicate that the model is trying to fit the data errors rather than the general trend in the data. For example, if two very close rays have travel time anomalies differing by only 0.5 s and the standard deviation is estimated to be 0.7 s, we should accept that a smooth model predicts the same anomaly for each, rather than introducing a steep velocity gradient in the 3D model to try to satisfy the difference. Because we want χ² ≈ N, it is often convenient to work with the reduced χ², or χ²_red, which is defined as χ²/N, so that the optimum solution is found for χ²_red ≈ 1.

But how close should χ² be to N? Statistical theory shows that χ² itself has a variance of 2N, or a standard deviation of √(2N). Thus, for 1 000 000 data the true model would with 67% confidence be found in the interval χ² = 1 000 000 ± 1414. Such theoretical bounds are almost certainly too narrow because our estimates of the standard deviations σ_i are themselves uncertain. For example, if the true σ_i are equal to 0.9 but we use 1.0 to compute χ², our computed χ² itself is in error (i.e. too low) by almost 20%, and a model satisfying this level of misfit is probably not good enough. It is therefore important to obtain accurate estimates of the standard errors, e.g. using (6.2) or (6.12). Provided one is confident that the estimated standard errors are unbiased, one should still aim for a model that brings χ² very close to N, say to within 20 or 30%.

An additional help in deciding how close one wishes to be to a model that fits at a level given by χ² = N is to plot the tradeoff between the model norm and χ², sometimes called the L-curve, shown schematically in Figure 14.2.
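For a small dense system the normal equations (14.5) and the χ² acceptance test can be written directly; a sketch assuming univariant data (forming A^T A explicitly is harmless only at this toy scale; see the footnote in Section 14.7):

    import numpy as np

    def normal_equations_fit(A, d):
        # Solve A^T A m = A^T d, eq. (14.5), for univariant data (sigma_i = 1).
        m = np.linalg.solve(A.T @ A, A.T @ d)
        chi2 = np.sum((A @ m - d)**2)
        N = len(d)
        # chi^2 has expectation N and standard deviation sqrt(2N):
        acceptable = abs(chi2 - N) <= np.sqrt(2.0 * N)
        return m, chi2 / N, acceptable   # model, reduced chi^2, fit flag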

Fig. 14.2. The L- or tradeoff curve between χ² and model norm ‖m‖².

If the tradeoff curve shows that one could significantly reduce the norm of the model while paying only a small price in terms of an increase in χ² (point A in Figure 14.2), this is an indication that the standard errors in the data have been underestimated. For common data errors do not correlate between nearby stations, but the true delays should correlate even if the Earth's properties vary erratically (because of the overlap in finite-frequency sensitivity). The badly correlating data can only be fit by significantly increasing the norm and complexity of the model, which is what we see happening on the horizontal part of the tradeoff curve. Conversely, if we notice that a significant decrease in χ² can be obtained at the cost of only a minor increase in model norm (point B), this indicates an overestimate of data errors and tells us we may wish to accept a model with χ² < N. If the deviations required are unexpectedly large, this is an indication that the error estimation for the data may need to be revisited.

Depending on where on the L-curve we find that χ² = N, we find that we do or do not have a strong constraint on the norm of the model. If the optimal data fit is obtained close to point B where the L-curve is steep, even large changes in χ² have little effect on the model norm. On the other hand, near point A even large changes in the model give only a small improvement of the data fit. Both A and B represent unwanted situations, since at A we are trying to fit data errors, which leads to erratic features in the model, whereas at B we are damping too strongly. In a well designed tomography experiment, χ² ≈ N near the bend in the L-curve. We use the term 'model norm' here in a very general sense; one may wish to inspect the Euclidean ‖m‖² as well as more complicated norms that we shall encounter in Section 14.5.

In many cases one inverts different data groups that have uncorrelated errors. For example Montelli et al. [215] combined travel times from the ISC catalogues with cross-correlation travel times from broadband seismometers. The ISC set, with about 10^6 data, was an order of magnitude larger than the second data set (10^5), and a brute force least-squares inversion would give preference to the short period ISC data in cases where there are systematic incompatibilities. This is easily diagnosed by computing χ² for the individual data groups.
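A tradeoff curve like Figure 14.2 can be traced numerically by sweeping a damping weight and recording (‖m‖², χ²) pairs. The sketch below (mine, anticipating the norm damping ε_n of Section 14.4) does this for a dense toy system with univariant data:

    import numpy as np

    def lcurve(A, d, epsilons):
        # For each damping weight eps, solve the damped normal equations
        # (cf. eq. 14.16) and record the point (||m||^2, chi^2) of Fig. 14.2.
        AtA, Atd, M = A.T @ A, A.T @ d, A.shape[1]
        points = []
        for eps in epsilons:
            m = np.linalg.solve(AtA + eps**2 * np.eye(M), Atd)
            points.append((m @ m, np.sum((A @ m - d)**2)))
        return points   # plot chi^2 against ||m||^2 and look for the bend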

One would wish to weigh the data sets such that each group individually satisfies the optimal χ² criterion, i.e. if χ_i² designates the misfit for data set i with N_i data, one imposes χ_i² ≈ N_i for each data set. This may be accomplished by giving each data set equal weight and minimizing a weighted penalty function:

P = Σ_i (1/N_i) χ_i².

Note that this gives a solution that deviates from the maximum likelihood solution, and we should only resort to weighting if we suspect that important conditions are violated, especially those of zero mean, uncorrelated and normally distributed errors. More often, an imbalance for individual χ_i² simply reflects an over- or underestimation of the standard deviations for one particular group of data, and may prompt us to revisit our estimates for prior data errors.

Early tomographic studies often ignored a formal statistical appraisal of the goodness of fit, and merely quoted how much better a 3D tomographic model satisfies the data when compared to a 1D (layered or spherically symmetric) background or starting model, using a quantity named 'variance reduction', essentially the reduction in the Euclidean norm of the misfit vector. This reduction is as much a function of the fit of the 1D starting model as of the data fit itself (i.e. the same 3D model can have different variance reductions depending on the starting model) and is therefore useless as a statistical measure of quality for the tomographic model.

Exercises

Exercise 14.1   Derive the normal equations by differentiating the expression for χ² with respect to m_k for k = 1, ..., M. Assume univariant data (σ_i = 1).

Exercise 14.2   Why can we not conclude from (14.5) that Am = d?

14.2 Alternatives to least squares

In the parlance of mathematics, the squared Euclidean norm Σ_i |r_i|² is one of a class of Lebesgue norms defined by the power p used in the sum: (Σ_i |r_i|^p)^{1/p}. Thus, the Euclidean norm is also known as the L_2 norm because p = 2. Of special interest are the L_1 norm (p = 1) and the case p → ∞, which leads to minimizing the maximum among all |r_i|.

Instead of simply rejecting outliers, which always requires the choice of a hard bound for acceptance, we may downweight data that show a large misfit in a previous inversion attempt, and repeat the process until it converges. In 1898, the Belgian mathematician Charles Lagrange proposed such an iteratively weighted least-squares solution, by iteratively solving:

‖W_p Am − W_p d‖² = min,

where W_p is a diagonal matrix with elements (W_p)_ii = |r_i|^{(p−2)/2} and 0 ≤ p < 2, which are determined from the misfit r_i in datum i after the previous iteration. We can start with an unweighted inversion to find the first r. The choice p = 1 leads to the minimization of the L_1 norm, if it converges. The elements of the residual vector r_i vary with each iteration, and convergence is not assured, but the advantage is that the inversion makes use of the very efficient numerical tools available for linear least-squares problems. The method was introduced in geophysics by Scales et al. [304].
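A sketch of this iteratively reweighted scheme (my own; the guarding of zero residuals with a small tolerance is an implementation choice, and convergence is, as noted, not guaranteed):

    import numpy as np

    def irls(A, d, p=1.0, n_iter=10, tol=1e-8):
        # Iteratively reweighted least squares for the L_p misfit, 0 <= p < 2.
        # Each pass solves a weighted L2 problem; the row weights are the
        # diagonal of W_p, i.e. |r_i|^((p-2)/2), from the previous residuals.
        m, *_ = np.linalg.lstsq(A, d, rcond=None)       # unweighted start
        for _ in range(n_iter):
            r = A @ m - d
            w = np.maximum(np.abs(r), tol) ** ((p - 2.0) / 2.0)
            m, *_ = np.linalg.lstsq(A * w[:, None], d * w, rcond=None)
        return m

With p = 1 the effective objective after reweighting is Σ_i |r_i|, the L_1 misfit.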

14.3 Singular value decomposition

Though the least squares formalism handles the incompatibility problem of data in an overdetermined system, we usually find that A^T A has a determinant equal to zero, i.e. eigenvalues equal to zero, and its inverse does not exist. Even though in tomographic applications A^T A is often too large to be diagonalized, we shall analyse the inverse problem using singular values ('eigenvalues' of a non-square matrix), since this formalism gives considerable insight.

Let v_i be an eigenvector of A^T A with eigenvalue λ_i², so that A^T A v_i = λ_i² v_i. We may use squared eigenvalues because A^T A is symmetric and has only non-negative, real eigenvalues. Its eigenvectors are orthogonal. The choice of λ² instead of λ as eigenvalue is for convenience: the notation λ_i² avoids the occurrence of √λ_i later in the development. We can arrange all M eigenvectors as columns in an M × M matrix V and write (see Figure 14.3):

A^T A V = V Λ².   (14.6)

The eigenvectors are normalized such that V^T V = V V^T = I.

Fig. 14.3. (a) The original matrix system Am = d. (b) The eigenvalue problem for the least-squares matrix A^T A.

With (14.6) we can study the underdetermined nature of the problem Am = d, of which the least-squares solution is given by the system A^T A m = A^T d. The eigenvectors v_i span the M-dimensional model space, so m can be written as a linear combination of eigenvectors: m = V y. Since V is orthonormal, ‖m‖ = ‖y‖ and we can work with y instead of m if we wish to restrict the norm of the model. Using this:

A^T A V y = V Λ² y = A^T d,

or, multiplying on the left with V^T and using the orthogonality of V:

Λ² y = V^T A^T d.

Since Λ is diagonal, this gives y_i (and with that m = V y) simply by dividing the i-th component of the vector on the right by λ_i². But clearly, any y_i which is multiplied by a zero eigenvalue can take any value without affecting the data fit! We find the minimum norm solution, the solution with the smallest ‖y‖², by setting such components of y to 0. If we rank the eigenvalues λ_1² ≥ λ_2² ≥ ... ≥ λ_K² > 0, 0, ..., 0, then the last M − K columns of V belong to the nullspace of A^T A. We truncate the matrices V and Λ to an M × K matrix V_K and a K × K diagonal matrix Λ_K to obtain the minimum norm estimate:

m̂_min norm = V_K Λ_K^{-2} V_K^T A^T d.   (14.7)

Note that the inverse of Λ_K exists because we have removed the zero eigenvalues. The orthogonality of the eigenvectors still guarantees V_K^T V_K = I_K, but now V_K V_K^T ≠ I_M.

To see how errors in the data propagate into the model, we use the fact that (14.7) represents a linear transformation of data with a covariance matrix C_d. The posteriori covariance of transformed data Td is equal to T C_d T^T (see Equation 14.39 in Appendix D). In our case we have scaled the data such that C_d = I, so that the posteriori model covariance is:

C_m̂ = V_K Λ_K^{-2} V_K^T A^T I A V_K Λ_K^{-2} V_K^T = V_K Λ_K^{-2} Λ_K² Λ_K^{-2} V_K^T = V_K Λ_K^{-2} V_K^T.   (14.8)

Thus the posteriori variance of the estimate for parameter m_i is given by:

σ²_{m_i} = Σ_{j=1}^K V_ij² / λ_j².   (14.9)

To distinguish data uncertainty from model uncertainty we denote the model standard deviation as σ_{m_i} and the data standard deviation as σ_i.
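For matrices small enough to decompose, (14.7) and (14.9) translate directly into NumPy; a sketch (using numpy.linalg.svd rather than an explicit eigenvalue analysis of A^T A, with an assumed threshold lam_min defining 'zero' singular values):

    import numpy as np

    def min_norm_svd(A, d, lam_min=1e-8):
        # Minimum-norm solution (14.7) and posterior variances (14.9)
        # from a truncated SVD; data assumed univariant.
        U, lam, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(lam) V^T
        K = np.sum(lam > lam_min)              # keep only nonzero singular values
        UK, lamK, VK = U[:, :K], lam[:K], Vt[:K].T
        m_hat = VK @ ((UK.T @ d) / lamK)       # equals V_K Lam_K^-2 V_K^T A^T d
        var_m = (VK**2 / lamK**2).sum(axis=1)  # sigma^2_{m_i} of eq. (14.9)
        return m_hat, var_m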

Fig. 14.4. Mappings between the model space (left) and the data space (right). The range of A is indicated by the grey area within the data space. The range of the backprojection A^T is indicated by the grey area in the model space.

Equation (14.9) makes it clear that removing zero singular values is not sufficient, since the errors blow up as λ_j^{-2}, rendering the incorporation of small λ_j very dangerous. Dealing with small eigenvalues is known as regularization of the problem. Before we discuss this in more detail, we need to show the connection between the development given here and the theory of singular value decomposition which is more commonly found in the literature.

One way of looking at the system Am = d is to see the components m_i as weights in a summation of the columns of A to fit the data vector d. The columns make up the range of A in the data space (Figure 14.4). Similarly, the rows of A (the columns of A^T) make up the range of the backprojection A^T in the model space. The rest of the model space is the nullspace: if m is in the nullspace, Am = 0. Components in the nullspace do not contribute to the data fit, but add to the norm of m. We find the minimum norm solution by avoiding any components in the nullspace, in other words by selecting a model in the range of A^T:

m̂ = A^T y,

and find y by solving:

A A^T y = d.

The determinant of A A^T is likely to be zero, so just as in the case of least squares we shall wish to eliminate zero eigenvalues. Let the eigenvectors of A A^T be u_i with eigenvalues λ′_i²:

A A^T U = U Λ′².   (14.10)

Since A A^T is symmetric, the eigenvectors are orthogonal and we can scale them to be orthonormal, such that U^T U = U U^T = I.

Multiplying (14.10) on the left by A^T and grouping A^T U, we see that A^T U is a matrix of eigenvectors of A^T A:

A^T A (A^T U) = (A^T U) Λ′²,

and comparison with (14.6) shows that A^T u_i must be a constant times v_i, and λ′_i = λ_i. We choose the constant to be λ_i, so that

A^T U = V Λ.   (14.11)

Multiplying this on the left by A we obtain:

A A^T U = U Λ² = A V Λ,

or, dividing on the right by λ_i for all λ_i ≠ 0, and defining u_i λ_i = A v_i with a nullspace eigenvector v_i in case λ_i = 0:

A V = U Λ.   (14.12)

In the same way, by multiplying (14.12) on the right by V^T we find:

A = U Λ V^T,   (14.13)

which is the singular value decomposition of A. Note that in this development we have carefully avoided using the inverse of Λ, so there is no need to truncate it to exclude zero singular values. However, because the tail of the diagonal matrix Λ contains only zeroes, (14.13) is equivalent to the truncated version (Figure 14.5):

A = U Λ V^T = U_K Λ_K V_K^T.   (14.14)

Fig. 14.5. (a) The full eigenvalue problem for A A^T leads to a matrix Λ with small or zero eigenvalues on the diagonal. (b) Removing zero eigenvalues has no effect on A.

Exercises

Exercise 14.3   Show that the choice (14.11) indeed implies that U^T U = I. Hint: use (14.12).

Exercise 14.4   Show that m̂ = V_K Λ_K^{-1} U_K^T d is equivalent to m̂_min norm.

14.4 Tikhonov regularization

The truncation to include only nonzero singular values is an example of regularization of the inverse problem. Removing zero λ_i is not sufficient, however, since small singular values may give rise to large modelling errors, as shown by (14.9). This equation tells us that small errors in the data vector may cause very large excursions in model space in the direction of v_k if λ_k << 1. It thus seems wise to truncate V in (14.13) even further, and exclude eigenvectors belonging to small singular values. The price we pay is a small increase in χ², but we are rewarded by a significant reduction in the modelling error. We could apply a sharp cut-off by choosing K at some nonzero threshold level for the singular values. Less critical to the choice of threshold is a tapered cut-off. We show that the latter approach is equivalent to adding M equations of the form ε_n m_i = 0, with ε_n small, to the tomographic system. Such equations act as artificial data that bias the model parameters towards zero:

( A     )     ( d )
( ε_n I ) m = ( 0 ).   (14.15)

If the j-th column of A associated with parameter m_j has large elements, the addition of one additional constraint ε_n m_j = 0 will have very little influence. But the more m_j is underdetermined by the undamped system, the more the damping will push m_j towards zero. The least squares solution of (14.15) is:

(A^T A + ε_n² I) m = A^T d.   (14.16)

The advantage of the formulation (14.15) is that it can easily be solved iteratively, without a need for singular value decomposition. But the solution of (14.15) does have a simple representation in terms of singular values, and it is instructive to analyse it with SVD. If v_k is an eigenvector of A^T A with eigenvalue λ_k², then the damped matrix gives:

(A^T A + ε_n² I) v_k = (λ_k² + ε_n²) v_k,   (14.17)

and we see that the damped system has the same eigenvectors but with raised eigenvalues λ_k² + ε_n² > 0. The minimum norm solution (14.7) is therefore replaced by:

m̂_damped = V_K (Λ_K² + ε_n² I)^{-1} V_K^T A^T d,   (14.18)

with the posteriori model variance given by:

σ²_{m_i} = Σ_{j=1}^K V_ij² / (λ_j² + ε_n²).   (14.19)

Since there are no zero eigenvalues, we may set K = M, but of course this maximizes the variance and some truncation may still be needed. For simplicity, we assume a damping with the same ε_n everywhere on the diagonal.
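In the SVD picture, the only change relative to the truncated solution is that the inverse singular values 1/λ_j are replaced by the damped filter λ_j/(λ_j² + ε_n²); a sketch under the same assumptions as before:

    import numpy as np

    def tikhonov_svd(A, d, eps):
        # Damped (ridge) estimate (14.18) and its variances (14.19);
        # univariant data assumed.
        U, lam, Vt = np.linalg.svd(A, full_matrices=False)
        V = Vt.T
        filt = lam / (lam**2 + eps**2)        # damped inverse singular values
        m_hat = V @ (filt * (U.T @ d))        # V (Lam^2 + eps^2 I)^-1 Lam U^T d
        var_m = (V**2 / (lam**2 + eps**2)).sum(axis=1)   # eq. (14.19)
        return m_hat, var_m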

The method is often referred to as Tikhonov regularization, after its original discoverer [364]. Because one adds ε_n² to the diagonal of A^T A it is also known as 'ridge regression'. Spakman and Nolet [338] vary the damping factor ε_n along the diagonal. When corrections are part of the model, one should vary damping factors such that damping results in corrections that are reasonable in view of the prior uncertainty (for example, one would judge corrections as large as 100 km for hypocentral parameters usually unacceptable, and increase ε_n for those corrections).

A comparison of (14.19) with (14.9) shows that damped model errors blow up at most by a factor ε_n^{-1}. Thus, damping reduces the variance of the solution. This comes at a price, however: by discarding eigenvectors, we reduce our ability to shape the model. The small eigenvalues are usually associated with vectors that are strongly oscillating in space: the positive and negative parts cancel upon integration and the resulting integral (12.12) is small. Damping small eigenvalues is thus expected to lead to smoother models. However, even long-wavelength features of the model may be biased towards zero because of regularization. The fact that biased estimations produce smaller variances is a well known phenomenon in statistical estimation, and it is easily misunderstood: one can obtain a very small model parameter m_i with a very small posteriori variance σ_i², yet learn nothing about the model because the bias is of the order of the true m_i. We shall come back to this in the section on resolution, but first investigate a more powerful regularization method, based on Bayesian statistics.

Exercises

Exercise 14.5   Show that the minimization of ‖Am − d‖² + ε_n²‖m‖² leads to (14.16).

Exercise 14.6   In the L-curve for (14.18), indicate where ε_n = 0 and where ε_n → ∞.

14.5 Bayesian inference

The simple Tikhonov regularization by norm damping we introduced in the previous section, while reducing the danger of excessive error propagation, is usually not satisfactory from a geophysical point of view. At first sight, this may seem surprising: for, when the m_i represent perturbations with respect to a background model, the damping towards 0 is defensible if we prefer the model values given by the background model in the absence of any other information. However, if the information given by the data is unequally distributed, some parts of the model may be damped more than others, introducing an apparent structure in m that may be very misleading.

The error estimate (14.19) does not represent the full modelling error because it neglects the bias. In general, we would like the model to have a minimum of unwarranted structure, or detail. Jackson [145] and Tarantola [349], significantly extending earlier work by Franklin [105], introduced the Bayesian method into geophysical inversion to deal with this problem, named after the Reverend Thomas Bayes (1702-1761), a British mathematician whose theorem on joint probabilities is a cornerstone of this inference method. We shall give a brief exposé of Bayesian estimation for the case of N observations in a data vector d_obs.

Let P(m) be the prior probability density for the model m = (m_1, m_2, ..., m_M), e.g. a Gaussian probability of the form:

P(m) = (1 / (2π)^{M/2} (det C_m)^{1/2}) exp[ −½ m · C_m^{-1} m ].   (14.20)

Here, C_m is the prior covariance matrix for the model parameters. By 'prior' we mean that we generally have an idea of the allowable variations in the model values, e.g. how much the 3D Earth may differ from a 1D background model without violating more general laws of physics. We may express such knowledge as a prior probability density for the model values. The diagonal elements of C_m are the variances of that prior distribution. The off-diagonal elements reflect the correlation of model parameters; often it helps to think of them as describing the likely smoothness of the model. In a strict Bayesian philosophy such constraints may be subjective. This, however, is not to say that we may impose constraints following the whim of an arbitrary person. An experienced geophysicist may often develop a very good intuition of the prior uncertainty of model parameters, perhaps because he has done experiments in the laboratory on analogue materials, or because he has experience with tomographic inversions in similar geological provinces. We shall classify such defensible subjective notions to be 'objective' after all.

The random errors in our observations cause the observed data vector d_obs to deviate from the true (i.e. error-free) data. For the data we assume the normal distribution (14.2). Assuming the linear relationship Am = d has no errors (or incorporating those errors into σ_i as discussed before), we find the conditional probability density for the observed data, given a model m:

P(d_obs|m) = (1 / (2π)^{N/2} (det C_d)^{1/2}) exp[ −½ (Am − d_obs) · C_d^{-1} (Am − d_obs) ],   (14.21)

where C_d is the matrix with data covariance, usually taken to be diagonal with entries σ_i² because we have little knowledge about data correlations. Though we have an expression for the data probability P(d_obs|m), for the solution of the inverse problem we are more interested in the probability of the model, given the observed data d_obs. This is where Bayes' theorem is useful. It starts from the recognition that the joint probability can be split up in a conditional and marginal probability in two ways, assuming the probabilities for model and data are independent:

P(m, d_obs) = P(m|d_obs) P(d_obs) = P(d_obs|m) P(m),

from which we find Bayes' theorem:

P(m|d_obs) = P(d_obs|m) P(m) / P(d_obs).   (14.22)

Using (14.20) and (14.21):

P(m|d_obs) ∝ exp[ −½ (Am − d_obs) · C_d^{-1} (Am − d_obs) − ½ m · C_m^{-1} m ].

Thus, we obtain the maximum likelihood solution by minimizing:

(Am − d_obs) · C_d^{-1} (Am − d_obs) + m · C_m^{-1} m = χ²(m) + m · C_m^{-1} m = min,

or, differentiating with respect to m_i:

A^T C_d^{-1} (Am − d_obs) + C_m^{-1} m = 0.

One sees that this is again a system of normal equations belonging to the damped system:

( C_d^{-1/2} A )     ( C_d^{-1/2} d_obs )
( C_m^{-1/2}   ) m = ( 0                ).   (14.23)

Of course, if we have already scaled the data to be univariant the data covariance matrix is C_d = I. This simply shows that we are sooner or later obliged to scale the system with the data uncertainty. The prior smoothness constraint is unlikely to be a hard constraint, and in practice we face again a tradeoff between the data fit and the damping of the model, much as in Figure 14.2. We obtain a manageable flexibility in the tradeoff between smoothness of the model and χ² by scaling C_m^{-1/2} with a scaling factor ε. Varying ε allows us to tweak the model damping until χ² ≈ N. Equation (14.23) is thus usually encountered in the equivalent, simplified form:

( A            )     ( d )
( ε C_m^{-1/2} ) m = ( 0 ).   (14.24)

How should one specify C_m? The model covariance essentially tells us how model parameters are correlated. Usually, such correlations are only high for nearby parameters. Thus, C_m smoothes the model when operating on m. Conversely, C_m^{-1} roughens the model, and (14.24) expresses the penalization of those model elements that dominate after the roughening operation. The simplest roughening operator is the Laplacian ∇², which is zero when a model parameter is exactly the average of its neighbours. If we parametrize the model with tetrahedra or blocks, so that every node has well-defined nearest neighbours, we can minimize the difference between parameter m_i and the average of its neighbours (Nolet [235]):

½ Σ_i (1/N_i) Σ_{j∈N_i} (m_i − m_j)² = min,

where N_i is the set of N_i nearest neighbours of node i. Differentiating with respect to m_k gives M equations:

m_k − (1/N_k) Σ_{j∈N_k} m_j = 0,   (14.25)

in which we recognize the k-th row of C_m^{-1/2} m in (14.24).

One disadvantage of the system (14.24) is that it often converges much more slowly than the Tikhonov system (14.15) in iterative matrix solvers (VanDecar and Snieder [381]). The reason is that we are simultaneously solving a system arising from a set of integral equations, and the regularization system which involves finite-differencing. Without sacrificing the Bayesian philosophy, it is possible to transform (14.24) to a simple norm damping. Spakman and Nolet [338] introduce m′ = C_m^{-1/2} m. Inserting this into (14.24) we find:

( A C_m^{1/2} )      ( d )
( ε I         ) m′ = ( 0 ).   (14.26)

Though it is not practical to invert the matrix C_m^{-1/2} that is implicit in (14.25) to find an exact expression for C_m^{1/2}, many explicit smoothers of m′ may act as an appropriate correlation matrix C_m^{1/2} for regularization purposes. After inversion for m′, the tomographic model is obtained from the smoothing operation m = C_m^{1/2} m′. The system (14.26) has the same form as the Tikhonov regularization (14.15). Despite this resemblance, in my own experience the acceleration of convergence is only modest compared to inverting (14.24) directly.
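A sketch of how (14.24) with the roughener (14.25) might be assembled and solved iteratively (here with scipy.sparse.linalg.lsqr; the `neighbors` adjacency list and univariant data are assumptions of this illustration):

    import numpy as np
    from scipy.sparse import eye, vstack, csr_matrix
    from scipy.sparse.linalg import lsqr

    def bayesian_solve(A, d, neighbors, eps):
        # Augmented system (14.24): data equations on top, eps times the
        # nearest-neighbour roughening rows (14.25) as C_m^(-1/2) below.
        M = A.shape[1]
        R = eye(M, format="lil")       # row k of R: m_k - (1/N_k) sum_{j in N_k} m_j
        for k, nbrs in enumerate(neighbors):
            for j in nbrs:
                R[k, j] -= 1.0 / len(nbrs)
        system = vstack([csr_matrix(A), eps * R.tocsr()])
        rhs = np.concatenate([d, np.zeros(M)])
        return lsqr(system, rhs)[0]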

14.6 Information theory

Given the lack of resolution, geophysicists are condemned to accept the fact that there are infinitely many models that all satisfy the data within the error bounds. The Earth is a laboratory, but one that is very different from those in experimental physics, where we are taught to carefully design an experiment so that we have full control. Understandably, we feel unhappy with a wide choice of regularizations, resulting in our inability to come up with a unique outcome of the experiment. The temptation is always to resort to some higher if not metaphysical principle that allows us to choose the 'best' model among the infinite set before we start plotting tomographic cross-sections. It should be recognized that this simply replaces one subjective choice (that of a model) with another (that of a criterion). Though some tomographers religiously adhere to such metaphysical considerations, I readily confess to being an atheist. In my view, such external criteria are simply a matter of taste. As an example, the methods of regularization are related to concepts known from the field of information theory, notably to the concept of information entropy. We shall briefly look into this, but warn the reader that, in the end, there is no panacea for our fall from Paradise.

We start with a simple application of the concept of information entropy: suppose we have only one datum, a delay d_1 measured along a ray of length L. We then have a 1 × M system, or just one equation:

d_1 = ∫ m(r) ds = Σ_i m_i Δs_i.

As a thought experiment, assume that the segments Δs_i are of equal length Δs, and that we allow only one of them to cause the travel time anomaly. Which one? Information theory looks at this problem in the following way: let P_i be the probability that m_i ≠ 0. By the law of probabilities, Σ_i P_i = 1. Intuitively, we judge that in the absence of any other information, all P_i should be equal; if not, this would constitute additional information on the m_i. Formally, we may get to this conclusion by defining the information entropy:

I = −Σ_i P_i ln P_i,   (14.27)

which can be understood if we consider that a vanishing P_i contributes nothing to I, thus reducing the 'disorder' in the solution (note that if any P_i = 1, all others must be 0, again minimizing disorder).

We express our desire to have a solution with minimum unwarranted information as the desire to maximize I, while still satisfying Σ_i P_i = 1. Such problems are solved with the method of Lagrange multipliers. This method recognizes that the unconstrained maximum of I does not satisfy the constraint that Σ_i P_i = 1. So we relax the maximum condition by subtracting λ(Σ_i P_i − 1) from I and require:

I − λ(Σ_i P_i − 1) = max.

Since the subtracted factor is required to be zero, the function to maximize has not really changed as long as we satisfy that constraint. All we have done is add another dimension, or dependent variable, the Lagrange multiplier λ. We recover the original constraint by maximizing with respect to λ. Taking the derivative with respect to P_i now gives an equation that involves λ:

∂/∂P_i [ Σ_i P_i ln P_i + λ Σ_i P_i ] = 0,

or

ln P_i = −(1 + λ), i.e. P_i = e^{−1−λ}.

We find the Lagrange multiplier from the constraint:

Σ_i P_i = N e^{−1−λ} = 1, so that λ = ln N − 1,

or

P_i = e^{ln(1/N)} = 1/N.

Thus, if we impose the criterion of maximum entropy for the information in our model, all m_i are equally likely to contribute. The reasoning does not change much if we allow every m_i to contribute to the anomaly and again maximize (14.27). In the absence of further information, there is no reason to assume that one would contribute more than any other, and all are equal: m_i = d_1/(N Δs) = d_1/L. The smoothest model is the model with the highest information entropy.
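The multiplier result P_i = 1/N is easy to confirm numerically; a throwaway sketch (mine, not the book's) that maximizes the entropy (14.27) under the constraint Σ_i P_i = 1:

    import numpy as np
    from scipy.optimize import minimize

    N = 5
    def neg_entropy(P):                       # -I of eq. (14.27)
        return np.sum(P * np.log(np.maximum(P, 1e-300)))
    res = minimize(neg_entropy, np.random.dirichlet(np.ones(N)),
                   bounds=[(0.0, 1.0)] * N,
                   constraints=[{"type": "eq", "fun": lambda P: P.sum() - 1.0}])
    print(res.x)   # ~ [0.2 0.2 0.2 0.2 0.2], i.e. the uniform P_i = 1/N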

Such reasoning provides a 'higher' principle to justify the damping towards smooth models. Constable et al. [64] named the construction of the smoothest model that satisfies the data with the prescribed tolerance 'Occam's inversion', after the fourteenth century philosopher William of Occam, or Ockham, who advocated the principle that simple explanations are more likely than complicated ones and who applied what came to be known as 'Occam's razor' to eliminate unnecessary presuppositions. However, one should not assume that smooth models are free of presuppositions: in fact, if we apply (14.25) in (14.24) we arbitrarily impose that smooth structures are more likely than others. Artefacts may be suppressed, but so will sharp boundaries, e.g. the top of a subduction zone. Loris et al. [188], who invert for models that can be expanded with the fewest wavelets of a given wavelet basis, provide a variant on Occam's razor that is in principle able to preserve sharp features while eliminating unwarranted detail.

An interesting connection arises if we assume that sparse model parametrizations are a priori more probable than parametrizations with many basis functions. Assume that the prior model probability P(m) is inversely proportional to the number of basis functions with nonzero coefficients in an exponential fashion: P(m) ∝ e^{−K}, where K is the number of basis functions. If we insert this into Bayes' equation, we find that the maximum likelihood equation becomes:

ln χ²(m) + K = min,

which is Akaike's [1] criterion for the optimum selection of the number of parameters K, used in seismic tomography by Zollo et al. [422]. Note, however, that this criterion lacks a crucial element: it does not impose any restrictions on the shape of the basis functions. Presumably one could use it by ranking independently defined basis functions in order of increasing roughness, again appealing to William of Occam for his blessing.

14.7 Numerical considerations

With N often of the order of 10^6 data, and M only one order of magnitude smaller than N, the matrix system Am = d is gigantic in size. Some reduction in the number of rows N can be obtained by combining (almost) coincident raypaths into 'summary rays' (see Section 6.1). The correct way to do this is to sum the rows of all N_S data belonging to a summary ray group S into one new average row that replaces them in the matrix:

(1/N_S) Σ_{i∈S} Σ_{j=1}^M A_ij m_j = (1/N_S) Σ_{i∈S} d_i ± σ_S,   (14.28)

with the variance σ_S² equal to

σ_S² = (1/N_S²) Σ_{i∈S} σ_i² + σ_0².

Here, σ_0² is added to account for lateral variations within the summary ray that affect the variance of the sum. Gudmundsson et al. [126] analysed the relationship between the width of a bundle and the variance in teleseismic P delay times from the ISC catalogue.
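A sketch of the averaging in (14.28) for one summary-ray group (dense rows for clarity; σ_0 is the extra term discussed in the text):

    import numpy as np

    def summary_ray(A_rows, d_vals, sigmas, sigma0=0.0):
        # Combine the N_S rows of one summary-ray group into a single
        # averaged equation, eq. (14.28), with its standard deviation.
        NS = len(d_vals)
        row = np.mean(A_rows, axis=0)          # (1/N_S) sum of the matrix rows
        datum = np.mean(d_vals)                # (1/N_S) sum of the data
        var = np.sum(np.asarray(sigmas)**2) / NS**2 + sigma0**2
        return row, datum, np.sqrt(var)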

Care must be taken in defining the volume that defines the members of the summary ray. Events with a common epicentre but different depth provide important vertical resolution in the earthquake region and should often be treated separately. When using ray theory and large cells to parametrize the model we do not lose much information if we average over large volumes with a size comparable to the model cells. But the Fréchet kernels of finite-frequency theory show that the sensitivity narrows down near source and receiver, and summarizing may undo some of the benefits of a finite-frequency approach.

Summary rays are sometimes applied to counteract the effect of dominant ray trajectories on the model (which may lead to strong parameter correlations along the prevailing ray direction) by ignoring the reduction of the error in the average. However, this violates statistical theory if we seek the maximum likelihood solution for normally distributed errors. The uneven distribution of sensitivity is better fought using unstructured grids with adapted resolution, and smoothness damping using a correlation matrix C_m that promotes equal parameter correlation in all directions.

If the parametrization is local, many elements of A are zero. For a least-squares solution, A^T A has lost much of this sparseness, though, so we shall wish to avoid constructing A^T A explicitly.* We can obtain a large savings in memory space by only storing the nonzero elements of A. We do this row-wise; surprisingly, the multiplications Am and A^T d can both be done in row-order, using the following row-action algorithms:

p = Am:                       q = A^T d:
for i = 1, N                  for i = 1, N
    for j = 1, M                  for j = 1, M
        p_i = p_i + A_ij m_j          q_j = q_j + A_ij d_i

where only nonzero elements A_ij should take part. This often leads to complicated bookkeeping. Claerbout's [60] 'dot-product test':

q · Ap = A^T q · p

for random vectors p and q can be used as a first (though not conclusive) test to validate the coding.

* The explicit computation and use of A^T A is also unwise from the point of view of numerical stability, since its condition number (the measure of the sensitivity of the solution to data errors) is the square of that of A itself. For a discussion of this issue see Numerical Recipes [269].
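In Python, the two row-action loops and the dot-product test above could look as follows (a sketch using triplet storage of the nonzero elements; a production code would use a compressed sparse row format):

    import numpy as np

    def matvecs(rows, cols, vals, m, d, N, M):
        # Row-ordered p = A m and q = A^T d for a matrix stored as nonzero
        # triplets (i, j, A_ij); mirrors the two loops in the text.
        p, q = np.zeros(N), np.zeros(M)
        for i, j, a in zip(rows, cols, vals):   # only nonzero elements take part
            p[i] += a * m[j]
            q[j] += a * d[i]
        return p, q

    def dot_product_test(rows, cols, vals, N, M, seed=1):
        # Claerbout's test: q . (A p) must equal (A^T q) . p for random p, q.
        rng = np.random.default_rng(seed)
        p, q = rng.normal(size=M), rng.normal(size=N)
        Ap, Atq = matvecs(rows, cols, vals, p, q, N, M)
        return np.isclose(q @ Ap, Atq @ p)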

Early tomographic efforts in the medical and biological sciences led to a rediscovery of 'row-action' methods (Censor [44]). The early methods, however, had the disadvantage that they introduced an unwanted scaling into the problem that interferes with the optimal regularization one wishes to impose (see van der Sluis and van der Vorst [377] for a detailed analysis). Conjugate gradient methods work without implicit scaling. The stablest algorithm known today is LSQR, developed by Paige and Saunders [249] and introduced into seismic tomography by the author [233, 234]. We give a short derivation of LSQR.

The main idea of the algorithm is to develop orthonormal bases μ_k in model space, and ρ_k in data space. The first basis vector in data space, ρ_1, is simply in the direction of the data vector:

β_1 ρ_1 = d,

and μ_1 is the backprojection of ρ_1:

α_1 μ_1 = A^T ρ_1.

Coefficients α_i and β_i are normalization factors such that ‖ρ_i‖ = ‖μ_i‖ = 1. We find the second basis vector in data space by mapping μ_1 into data space, and orthogonalizing to ρ_1:

β_2 ρ_2 = Aμ_1 − (Aμ_1 · ρ_1) ρ_1 = Aμ_1 − α_1 ρ_1,

where we used Aμ_1 · ρ_1 = μ_1 · A^T ρ_1 = μ_1 · α_1 μ_1 = α_1. Similarly:

α_2 μ_2 = A^T ρ_2 − β_2 μ_1.

Although it would seem that we have to go through more and lengthier orthogonalizations as the basis grows, it turns out that (at least in theory, ignoring roundoff errors) the orthogonalization to the previous basis vector only is sufficient. For example, for ρ_3 we find β_3 ρ_3 = Aμ_2 − α_2 ρ_2. Taking the dot product with ρ_1, we find:

β_3 ρ_3 · ρ_1 = Aμ_2 · ρ_1 − α_2 ρ_2 · ρ_1 = μ_2 · A^T ρ_1 = μ_2 · (α_1 μ_1) = 0,

and ρ_3 is perpendicular to ρ_1. A similar proof by induction can be made for all ρ_k and μ_k in the iterative sequence:

β_{k+1} ρ_{k+1} = Aμ_k − α_k ρ_k,   (14.29)
α_{k+1} μ_{k+1} = A^T ρ_{k+1} − β_{k+1} μ_k.   (14.30)

If we expand the solution after k iterations:

m_k = Σ_{j=1}^k γ_j μ_j,

we require

Σ_{j=1}^k γ_j Aμ_j = d,

and with (14.29):

Σ_{j=1}^k γ_j (β_{j+1} ρ_{j+1} + α_j ρ_j) = β_1 ρ_1.

Taking the dot product of this with ρ_1 yields γ_1 = β_1/α_1, whereas subsequent factors are found by taking the product with ρ_k to give γ_k = −β_k γ_{k−1}/α_k.
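The recursion (14.29)-(14.30) is the Golub-Kahan bidiagonalization at the heart of LSQR. A bare sketch of the basis construction (no reorthogonalization and no solution update; for real problems one calls a library routine such as scipy.sparse.linalg.lsqr):

    import numpy as np

    def bidiagonalize(A, d, k):
        # Build k orthonormal basis vectors rho_i (data space) and
        # mu_i (model space) following eqs. (14.29)-(14.30).
        beta = np.linalg.norm(d)
        rho = d / beta                       # beta_1 rho_1 = d
        mu = A.T @ rho
        alpha = np.linalg.norm(mu)
        mu = mu / alpha                      # alpha_1 mu_1 = A^T rho_1
        rhos, mus = [rho], [mu]
        for _ in range(k - 1):
            rho = A @ mu - alpha * rho       # eq. (14.29)
            beta = np.linalg.norm(rho)
            rho = rho / beta
            mu = A.T @ rho - beta * mu       # eq. (14.30)
            alpha = np.linalg.norm(mu)
            mu = mu / alpha
            rhos.append(rho)
            mus.append(mu)
        return np.array(rhos), np.array(mus)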

14.8 Appendix D: Some concepts of probability theory and statistics

I assume the reader is familiar with discrete probabilities, such as the probability that a flipped coin will come up with head or tail. If added up for all possible outcomes, the sum of all probabilities is 1. This concept of probability cannot directly be applied to variables that can take any value within prescribed bounds. For such variables we use probability density. The probability density P(X_0) for a random variable X at X_0 is equal to the probability that X is within the interval X_0 ≤ X ≤ X_0 + dX, divided by dX. This can be extended to multiple variables. If P(d) is the probability density for the data in vector d, then the probability that we find the data within a small N-dimensional volume d^N d in data space is given by 0 ≤ P(d) d^N d ≤ 1. We only deal with normalized probability densities, i.e. integrated over all data:

∫ P(d) d^N d = 1.   (14.31)

Joint probability densities give the probability that two or more random variables take a particular value, e.g. P(m, d). If the distributions for the two variables are independent, the joint probability density is the product of the individual densities:

P(m, d) = P(m) P(d).   (14.32)

Conversely, one finds the marginal probability density of one of the variables by integrating out the second variable:

P(m) = ∫ P(m, d) d^N d.   (14.33)

The conditional probability density gives the probability of the first variable under the condition that the second variable has a given value, e.g. P(m|d_obs) gives the probability density for model m given an observed set of data in d_obs. The expectation or expected value E(X) of X is defined as the average over all values of X weighted by the probability density:

X̄ ≡ E(X) = ∫ P(X) X dX.   (14.34)

The expectation is a linear functional:

E(aX + bY) = a E(X) + b E(Y),   (14.35)

and for independent variables it is separable:

E(XY) = E(X) E(Y).   (14.36)

The variance is a measure of the spread of X around its expected value:

σ_X² = E[(X − X̄)²],   (14.37)

where σ_X itself is known as the standard deviation. The covariance between two random variables X and Y is defined as

Cov(X, Y) = E[(X − X̄)(Y − Ȳ)].   (14.38)

In the case of an N-tuple of variables this defines an N × N covariance matrix, with the variances on the diagonal. The covariance matrix of a linear combination of variables is found by applying the linearity (14.35). Consider a linear transformation x = T y. Since the spread of a variable does not change if we redefine the average as zero, we can assume that E(x_i) = 0 without loss of generality. Then:

Cov(x_i, x_j) = E[ (Σ_k T_ik y_k)(Σ_l T_jl y_l) ] = Σ_kl T_ik T_jl E(y_k y_l) = Σ_kl T_ik T_jl Cov(y_k, y_l),

or, in matrix notation:

C_x = T C_y T^T.   (14.39)
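Equation (14.39) is easily checked by simulation; this sketch (an arbitrary map T and covariance C_y, my own illustration) compares the sample covariance of x = T y with T C_y T^T:

    import numpy as np

    rng = np.random.default_rng(2)
    T = rng.normal(size=(3, 4))                   # an arbitrary linear map x = T y
    Cy = np.diag([1.0, 2.0, 0.5, 1.5])            # covariance of y
    y = rng.multivariate_normal(np.zeros(4), Cy, size=200_000)
    Cx_sampled = np.cov((y @ T.T).T)              # empirical covariance of x
    print(np.allclose(Cx_sampled, T @ Cy @ T.T, atol=0.05))   # True: eq. (14.39)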