Developing Learned Regularization for Geophysical Inversions


Developing Learned Regularization for Geophysical Inversions

by

Chad Hewson
B.Sc. (hon.), The University of British Columbia

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Department of Earth and Ocean Sciences)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 2003

© Chad Hewson, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

(Signature)
Department of Earth and Ocean Sciences
The University of British Columbia
Vancouver, Canada
Date

Abstract

Geophysical inverse problems can be posed as the minimization of an objective function where one term (φ_d) is a data misfit function and another (φ_m) is a model regularization. In current practice, φ_m is posed as a mathematical operator that includes at most limited prior information on the model, m. This research focusses on the specification of learned forms of φ_m from information on the model contained in a training set, M_T. This is accomplished via three routes: probabilistic, deterministic (model based) and the HT algorithm.

In order to adopt a pure probabilistic method for finding a learned φ_m, equivalence between Gibbs distributions and Markov random fields is established. As a result, the prior probability of any given model is reduced to the interactions of cells in a local neighbourhood, and Gibbs potentials are defined to represent these interactions. The case of the multivariate Gaussian is used because its normalization is expressible. φ_m is parameterized by a set of coefficients, θ, and these parameters are recovered via an optimization method given M_T. For non-Gaussian distributions, θ is recovered via MCMC sampling techniques, and a strategy to compare different forms of φ_m is introduced and developed.

The model based deterministic route revolves around independent, identically distributed (i.i.d.) assumptions on some filtered property of the model, z = f(m). Given samples of z, two methods of expressing its corresponding φ_m are developed. The first requires the expression of a generic distribution to which all the samples of z are assumed to belong. Methodology to translate z into usable data and recover the corresponding φ_m is developed. Although there are ramifications of the statistical assumptions, this method is shown to translate significant information on z into φ_m. Specifically, the shape of the φ_m functional is maintained and, as a result, the deterministic φ_m performs well in geophysical inversions. This method is compared with the parametrization of the generalized Gaussian (p-norm) for z. Agreement between the generic φ_m and the generalized Gaussian helps validate the specific choice of norm in the probabilistic route.

The HT algorithm is based on the notion that the geophysical forward operator should help determine the form of φ_m. The strategy of Haber and Tenorio [] is introduced and an algorithm for the recovery of θ is developed. Two examples are used to show one case where the HT algorithm is advantageous and one where it does not differ significantly from the probabilistic route.

Finally, a methodology to invert geophysical data with generic learned regularization is developed and a simple example is shown. For this example, the generic deterministic method is shown to transfer the most information from the training set to the recovered model. Difficulties with extremely non-linear objective functions due to learned regularization are discussed and research into more effective search algorithms is suggested.

Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
  The Forward Problem
  The Inverse Problem
    Non-Uniqueness
    Noisy Data in Inverse Problems
    Regularizing the Inverse Problem
  Typical Regularization Operators
    The Smallest Model
    The Flattest Model
    Combining Regularization
    Incorporating a Reference Model
  Choosing the Norm
  Incorporating Prior Knowledge
    Current Practices
    Correcting Mathematical Deficiencies
  Learned Regularization

2 Probabilistic Regularization
  Introduction
  Probability Theory in Regularization
    Bayes Theorem and Prior Distributions
    Specifying the Likelihood and Prior Distributions
    Neighbourhoods, Cliques and Markov Random Fields
  Learning Probabilistic Regularization Functionals
    Step (a): Obtain Prior Information and Create Training Set
    Step (b): Quantify Training Set in Gibbs Distribution
    Step (c): Probabilistic Inversion to Recover φ_m
  Search Algorithms
    The Newton Step
    Constrained Optimization: Positivity
  Synthetic Examples
    Generating Synthetic Models
    Recovering the α values
  Covariance Model Selection
    Theory
    Markov Chains
    MCMC sampling: The Metropolis-Hastings Algorithm
    MCMC sampling: Gibbs sampler
    Implementation
    Numerical example
  Heterogeneity in α
    Regularization in the O(M) Inversion Approach
    Cross-validation
    Example
  Non-Gaussian Distributions
    Normalization in the p-norm
    Approximating the Normalization
    Examples
    Comparing p-distributions
  Summary

3 Deterministic Regularization
  Introduction
  Theory
  Recovering R(z) as an Unknown Function
    Binning Strategies
    The Forward Model
    The Effect of the Normalization
    Assigning Data Errors
    Regularizing r
    Solving for r
    Testing the Algorithm
    Synthetic Example
  Parameterizing R(z) as a p-norm
    Numerical results
  Summary

4 The HT Algorithm
  Background Theory
    Basics
    Algorithm Construction
  Testing the HT Algorithm
    Cross-well Seismic Tomography
    The Gravity Experiment
  Summary

5 Comparing Regularization Functionals
  Concept
  Inversion with Generic Regularization
  Forms of R(m)
    Multivariate Gaussian
    The p-norm
    Generic R(z)
  The Hybrid Global-Local Search Algorithm
  Examples
    Training set 1: Box in Homogeneous Halfspace
  Summary

6 Summary and Future Work

Bibliography

List of Figures

A Typical Inversion Problem
Inversion results with different regularization. (a) True model, (b) Smallest model, (c) Flattest horizontal model, (d) Flattest vertical model, (e) Raypaths used to acquire data
Inversion results with combined regularization. (a) True model, (b) Result with default (α_s, α_x, α_z) and m_ref = 0, (c) First reference model, (d) Result with default (α_s, α_x, α_z) and m_ref from (c), (e) Second reference model, (f) Result with default (α_s, α_x, α_z) and m_ref from (e)
Comparison of L1 and L2 regularization functionals
Recovered models using different model norms. (a) True model, (b) L2 norm solution, (c) L1 norm solution
Flowchart outlining increasing sophistication for regularization in geophysical inversion
Example of a neighbourhood in a mesh for one cell (black). Its first order neighbourhood consists of the dark grey cells. The second order neighbourhood consists of the dark and light grey cells
Flowchart of complete inversion process
Examples of possible prior models for (a) increasing K value, (b) varying α_s, (c) varying α_x and (d) varying α_z
(a) Resulting C matrix for C = (W_s^T W_s)⁻¹, (b) C = (W_x^T W_x)⁻¹, (c) C = (W_z^T W_z)⁻¹, (d) C = (W_s^T W_s + W_x^T W_x + W_z^T W_z)⁻¹
Variation of covariance matrix with changing α_s. The variance of the model parameters decreases with increasing α_s
Flowchart outlining covariance model selection algorithm via Gibbs sampling
Different possibilities for p(θ_i | C_j)
Markov chains for a particular θ vector for three covariance models (a)-(c)
Results of O(M) parameter alpha inversion. (a) α_x recovered weights, (b) α_z recovered weights
Results of O(M) parameter alpha inversion with regularization. (a) Cross-validation values, (b) α_x recovered weights, (c) α_z recovered weights
MCMC sampling results for different p values. (a) α_x, (b) α_z, (c) α_s
(a) Example of training set, (b) covariance matrix, C, pertaining to the training set
Definition of dual grid for recovery of R(z): η is the data grid and r_i denotes the i-th model cell
Comparison of different error methods with true errors. ε represents the error value
Data for varying N
Resulting models for the data in the previous figure
Resulting models for varying Π value
(a) L1 sample histogram compared to true distribution, (b) Resulting data from samples, (c) Recovered and true models
Data and errors for (a) z_1, (b) z_2 and (c) z_3
Recovered and true R(z_i) for (a) z_1, (b) z_2, (c) z_3 and (d) resulting total Σ_i R_i(z_i) for deterministic regularization compared to true probabilistic value
Recovered (functional and p-distribution) and true R(z_i) for (a) z_1, (b) z_2 and (c) z_3
Recovered models for HT algorithm and multivariate Gaussian for set TS: (a)-(e) the five training models
Recovered models for HT algorithm and smallest model for set TS3: (a)-(e) the five training models
Recovered R(z) for TS using multivariate and deterministic methods
Four models obtained using a hybrid search approach with optimal p-norm regularization
Four models obtained using a hybrid search approach with generic R(z) regularization. Iteration 3 was the final result
(a) True model, (b)-(f) Recovered models using five learned regularization functionals from TS

List of Tables

Results of several 3-parameter α inversions. The asterisk indicates that the program ran to the maximum number of iterations
Comparison between optimization results and statistics of Markov chains
Statistics of Markov chains for different p-distributions
Results for optimal p-distributions
Results of HT algorithm and multivariate Gaussian for TS
Results of HT algorithm for gravity example. F_i indicates how many cells in layer i contained block values
Recovered and expected α ratios
Statistics of Markov chains for different p-distributions for TS. The asterisk indicates results are from the optimization code
Optimal p-norm parameters for TS
Numerical summary of hybrid search for m using optimal p-norm regularization
Numerical summary of hybrid search for m using generic R(z) regularization
Numerical evaluation of recovered models

Acknowledgements

Chapter 1

Introduction

In order to collect information about the composition of the Earth below the surface, geophysicists make use of remote sensing methods. Sometimes natural signals can be monitored or measured (e.g. gravitational or magnetic fields) and, in other cases, signals are measured due to artificial sources (e.g. high frequency electromagnetics). No matter what geophysical survey is conducted, the result is data that are directly dependent on the subsurface of the Earth. The geophysicist is then faced with the task of interpreting these data. Often, this interpretation will result in an inference of a subsurface physical property. For example, a gravity survey will result in a model of Earth density, while an electromagnetic survey may result in one of electrical conductivity. Thus, geophysical surveys are usually chosen based on the desired physical property information. The most important point, however, is that there exists a mathematical link between the measured quantity (data) and the physical property distribution (model). Figure 1.1 illustrates this link.

1.1 The Forward Problem

Being able to calculate data values from a particular model is referred to as the forward problem. Often the forward problem is represented as

    d = F[m]

where F[·] is called the forward operator. When the forward problem is linear, it may simply be written as d = Gm, where G contains the kernel information. A linear problem is easier to consider because the forward problem reduces to a simple matrix-vector multiplication. G is of size N by M, such that M model parameters yield N data. However, it is the rank of G that is more pertinent. For instance, if (after an eigenvalue decomposition of G) there are only K independent basis vectors, then there are only K independent data. The more independent data that are acquired, the more information about the model is obtained. Ideally, one would want to acquire M independent data values; this would mean that the entire model space is sampled.

Figure 1.1 illustrates this concept. On the left is a graphical representation of model space. Within model space there are M degrees of freedom. Suppose a geophysical experiment is run where N (N < M) data are acquired. Since N < M, it is apparent that the whole of model space has not been sampled. As a result, model space is divided into two categories: m_A is the activated portion of model space and m_UA is the un-activated (or annihilator) portion of model space. Since there are N kernels, we can represent the transition from model space to data space as d = F[m_A]. However, the kernels never sample the un-activated model space, so F[m_UA] = 0. The term annihilator is appropriate for this portion of model space because any amount of these model parameters can be added to a model without affecting the value of d.

Having a viable forward problem is vital in geophysics. It is used widely in survey design and in the interpretation stages of the experiment. It is often good practice to design models representing different scenarios so as to ensure that the chosen survey will adequately sample this area of model space.

Figure 1.1: A Typical Inversion Problem (schematic: model space, with activated portion m_A and annihilator portion m_UA, is mapped to data space by d = F[m_A], with F[m_UA] = 0).

After data have been collected, forward modelling can be used by the geophysicist to try to compose a model that fits the observed data well. Finally, the forward problem is a vital tool in the inversion process, where it is used to calculate predicted data for all proposed models.

1.2 The Inverse Problem

Above, the route from model space to data space in Figure 1.1 was introduced. Naturally, the opposite application is desirable: recover the model from the data. After performing a survey, the geophysicist is left with a set of numbers that relate to the subsurface. Although there are many methods of interpretation, inversion is by far the most sophisticated.

1.2.1 Non-Uniqueness

If we had access to M independent data values in a linear problem, inversion would be trivial: the solution could be obtained by m = G⁻¹d. However, this is rarely the case. Due to the desired resolution in most problems, the Earth has to be divided into many parameters; at a mineral exploration scale, this number can exceed 10,000. Regardless of the survey, it is unlikely that we could obtain that many independent data. As a result, geophysical inversion problems are in general under-determined (N << M), such that there is an inherent non-uniqueness in the problem. One can think of non-uniqueness as being influenced by the annihilator portion of model space.

If the annihilator portion of model space can assume any value, then there are infinitely many models that satisfy the problem. To illustrate, consider a small numerical problem with N = M = 3,

    d = Gm

where the true model is m = [1, 2, 3]^T. Since G in this case is square, one might think that the solution could be m = G⁻¹d. However, a closer look at G reveals that this is not the case. Consider the singular value decomposition G = USV^T, where U contains the data basis vectors, S contains the singular values and V contains the model basis vectors. Performing the SVD for the G used in this example reveals only two non-zero singular values. Thus, only the first two basis vectors are required to reconstruct G. In general terms, this can be written as

    G = U_p S_p V_p^T

where p refers to the number of non-zero singular values. Replacing G in the forward relation with this truncated form, the following solution for m is obtained:

    d = U_p S_p V_p^T m
    U_p^T d = S_p V_p^T m
    S_p⁻¹ U_p^T d = V_p^T m
    V_p S_p⁻¹ U_p^T d = V_p V_p^T m

m_c = V_p V_p^T m is defined as the pseudo-inverse solution (V_p V_p^T = I if model space were completely spanned). For the particular d and m defined above, the pseudo-inverse solution is not very accurate. Since only p basis vectors are used, the remaining model basis vector must be an annihilator for the problem: the product of the third model basis vector (v_3) with G is at machine precision and essentially zero. As a result, the solution can be represented as

    d = G(m_c + n v_3)

where n is any real number. Thus, there are infinitely many solutions to the problem and the problem is non-unique.
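
The numerical values of the thesis's example did not survive the text extraction, so the following is a minimal NumPy sketch of the same construction using a hypothetical rank-2 operator G; the matrix entries and the rank tolerance are assumptions, not the thesis's values.

    import numpy as np

    # Hypothetical rank-2 forward operator: the third row is the sum of the
    # first two, so only two data are independent.
    G = np.array([[1.0, 2.0, 1.0],
                  [2.0, 1.0, 3.0],
                  [3.0, 3.0, 4.0]])
    m_true = np.array([1.0, 2.0, 3.0])
    d = G @ m_true

    # Singular value decomposition G = U S V^T
    U, s, Vt = np.linalg.svd(G)
    p = int(np.sum(s > 1e-10 * s[0]))           # number of non-zero singular values
    print("effective rank:", p)                  # -> 2

    # Pseudo-inverse (minimum-norm) solution m_c = V_p S_p^-1 U_p^T d
    m_c = Vt[:p].T @ np.diag(1.0 / s[:p]) @ U[:, :p].T @ d
    print("pseudo-inverse solution:", m_c)

    # The remaining model basis vector v_3 is an annihilator: G v_3 ~ 0, so
    # m_c + n*v_3 fits the data exactly for any real n.
    v3 = Vt[p]
    print("||G v_3|| =", np.linalg.norm(G @ v3))
    for n in (0.0, 5.0, -20.0):
        print("misfit for n =", n, ":", np.linalg.norm(d - G @ (m_c + n * v3)))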

1.2.2 Noisy Data in Inverse Problems

One issue with the relation d = G(m_c + n v_3) is that, regardless of the choice of n, each model fits the data exactly. This is desirable if the data are perfectly accurate; however, in most geophysical inverse problems this is not the case. Firstly, inaccuracies in the measuring device must be considered. Secondly, there may be secondary geophysical signal from natural phenomena, man-made structures, and other bodies that are not of interest. Regardless of the sources, noise is a real problem in geophysical data, and one should attempt to account for it when modelling.

The easiest way to account for the noise in geophysical data is to model it with a statistical distribution, most commonly a Gaussian. Defining the noise random variable as ε, its distribution can be written as

    p(ε) = (1/C) exp(−ε² / (2σ²))

where σ is the standard deviation of the noise and C is a normalization constant. When inverse modelling, the goal was to produce a model m where d = Gm. Since there is noise in the data, it is more prudent to find models whose residual relates to the noise,

    ε = d − Gm.

Since there are N realizations of the noise, the probability of the noise vector is the product of the individual distributions,

    p(ε) = (1/C^N) exp( −Σ_{i=1}^{N} ε_i² / (2σ_i²) ).

Thus, to estimate the model, we seek a maximum likelihood estimate of the noise by re-parameterizing it in terms of the model:

    min_m φ_d = Σ_{i=1}^{N} (d_i − (Gm)_i)² / σ_i².

This is defined as the data misfit. It can be rewritten in matrix-vector form as

    φ_d = ||W_d (d − Gm)||²

where W_d contains 1/σ_i elements along its diagonal. A model obtained by minimizing φ_d alone will still reproduce the data exactly. In order to find a model that does not reproduce the data exactly, a desired value for φ_d must be specified. Statistically, any quantity y = Σ_{i=1}^{n} x_i², where the x_i are normally distributed, follows a chi-squared (χ²) distribution with n degrees of freedom, such that E[y] = n. Since φ_d is the sum of the squared, normalized noise values, its expected value is

    E[φ_d] = λN

with λ = 1: statistically, the misfit should equal the number of data, N. However, this depends largely on the accuracy of the assigned data errors as well as on the distribution of the noise, so λ is retained in case the user feels that the data must be fit more or less closely. Regardless, this acts as a constraint in the inverse problem, φ_d = λN. Thus, it seems natural that we must define some other criterion to guide the solution.
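
As a small illustration of the misfit measure and its chi-squared target, the sketch below computes φ_d for data contaminated with Gaussian noise of known standard deviation; the synthetic numbers are assumptions, not values from the thesis.

    import numpy as np

    def data_misfit(d_obs, d_pred, sigma):
        # phi_d = || W_d (d_obs - d_pred) ||^2 with W_d = diag(1 / sigma_i)
        r = (d_obs - d_pred) / sigma
        return float(r @ r)

    rng = np.random.default_rng(0)
    N = 200
    d_clean = np.linspace(0.0, 1.0, N)          # hypothetical noise-free data
    sigma = 0.05 * np.ones(N)
    d_obs = d_clean + sigma * rng.standard_normal(N)

    # If the predicted data reproduce the noise-free data, phi_d is a chi-squared
    # variable with N degrees of freedom, so E[phi_d] = N.
    print(data_misfit(d_obs, d_clean, sigma), "target ~", N)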

1.2.3 Regularizing the Inverse Problem

Although an inverse problem should be very data driven, the presence of noise causes one to be hesitant about reproducing the data exactly. Furthermore, the under-determined nature of geophysical inverse problems makes them mathematically challenging when misfit is the only criterion. As a result, some form of regularization is required. Earlier in the chapter, non-uniqueness was introduced and explained; the small example above showed that an infinite number of solutions could satisfy a problem with M = 3 parameters when one datum was a linear combination of the others. Thus, it seems natural to try to steer the solution towards a desirable area of model space. By this, it is meant that solutions should be chosen from models that exhibit a particular character, which limits the number of possible choices. Typically, the regularization can be expressed as

    φ_m = ||W_m m||²

where W_m is some discretized operator (standard choices are discussed in the next section). φ_m is referred to as the model norm. Thus, the goal is to find a model that has a minimum value of the model norm subject to fitting the data misfit to a certain degree. The best way of completing this task is to combine the misfit and the model norm in an objective function φ,

    φ = φ_d + β φ_m.

β is referred to as the regularization parameter and it controls the balance between the model norm and the data misfit. For every value of β there is a solution. The problem is posed as a minimization problem:

    min_m φ = φ_d + β φ_m   subject to   φ_d = λN.

The goal is to find the value of β where the objective function has a minimum with a desirable misfit value. For large values of β the misfit will be quite high, as the emphasis is placed on the regularization; conversely, smaller values of β yield small misfit values. There are several methods currently available to estimate an appropriate value of β []. The simplest, however, is to make use of the target misfit and line search for the optimal value of β, as in the sketch below.
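
A minimal sketch of this strategy for a linear problem follows; the cooling schedule, the starting β and the synthetic test problem are assumptions, since the thesis does not prescribe a specific implementation.

    import numpy as np

    def solve_tikhonov(G, d, Wd, Wm, beta):
        # Minimize ||Wd (d - G m)||^2 + beta ||Wm m||^2 via the normal equations.
        A = G.T @ Wd.T @ Wd @ G + beta * Wm.T @ Wm
        b = G.T @ Wd.T @ Wd @ d
        return np.linalg.solve(A, b)

    def beta_line_search(G, d, Wd, Wm, target, beta0=1e6, factor=0.5, n_max=80):
        # Decrease beta along a logarithmic path until phi_d reaches the target misfit.
        beta = beta0
        for _ in range(n_max):
            m = solve_tikhonov(G, d, Wd, Wm, beta)
            phi_d = float(np.sum((Wd @ (d - G @ m)) ** 2))
            if phi_d <= target:
                break
            beta *= factor
        return m, beta, phi_d

    # Hypothetical under-determined test problem (N = 20 data, M = 50 parameters).
    rng = np.random.default_rng(1)
    G = rng.standard_normal((20, 50))
    m_true = np.zeros(50)
    m_true[20:30] = 1.0
    sigma = 0.05
    d = G @ m_true + sigma * rng.standard_normal(20)
    Wd = np.eye(20) / sigma
    Wm = np.eye(50)                              # smallest-model regularization
    m_rec, beta, phi_d = beta_line_search(G, d, Wd, Wm, target=20.0)
    print("beta =", beta, " phi_d =", phi_d)     # phi_d close to N = 20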

1.3 Typical Regularization Operators

The last item remaining in the definition of an inverse problem is the form of the regularization. This is perhaps the most difficult question because one has to provide information about what type of model is desirable. Often there is no information about the subsurface other than the geophysical data. As a result, regularization in inverse problems to date has focussed on generic properties of geology. This has provided many useful results without adding any specific geologic information []. In this section, the standard choices for regularization are introduced and explained.

1.3.1 The Smallest Model

Until now, inversion regularization has depended on broad geological concepts. Sometimes these concepts are wholly conceived in a geological framework. In this case, however, part of the motivation for including smallness in the regularization is mathematical. Smallness is exactly as it sounds: it is a term that encourages smaller values of the model parameters. Consider a two-dimensional problem. The most general way of incorporating smallness is to define the norm (or size) of the model,

    φ_s = ∫∫ m(x, z)² dx dz.

Here an L2 norm is used to describe smallness. This is the easiest norm to use mathematically and is often appropriate. Discretizing the above integral, it can be approximated as

    φ_s ≈ Σ_{j=1}^{M} m_j² Δx_j Δz_j.

Rearranging the above into a norm and assuming the discretization error is negligible, the following expression is obtained:

    φ_s = ||W_s m||²

W_s is the smallness operator and has √(Δx_j Δz_j) elements along its diagonal. Since the regularization acts as a penalty function, smaller values of φ_s will be encouraged. The other important point about φ_s is that W_s is diagonal: if φ_m = φ_s and β is large, the inverse problem is tractable because W_s has a stable inverse. The limitation of using only φ_s is that the values of the model parameters themselves are regularized. Thus, the minimum norm solution is recovered without any regard to structure or adjacent parameter interaction. Using φ_s as regularization may be more appropriate for slightly under-determined or over-determined problems.

Figure 1.2 shows results of using different regularization with the same data set. Figure 1.2(e) displays the tomographic raypaths used in the data acquisition; the seismic tomography problem will be introduced later. In this problem, the number of data was much smaller than the number of model parameters. Figure 1.2(a) shows the true model, which represents a fault separating two geologic bodies overlying a halfspace. Figure 1.2(b) shows the result using φ_m = φ_s. Although this model fits the data to the specified degree, it is clear that the recovered model is not appropriate. One can see, however, that many of the model parameters are at or near zero, so the regularization has the desired effect.

1.3.2 The Flattest Model

The next type of standard regularization is derived from an assumption about geologic structure. The assumption is that the desired model will not have many large jumps between adjacent model parameters. That is, the derivative of the model in either direction should be as small as possible. This minimizes structure and promotes gradual change in the model parameter values. Again, this is expressible in an L2-norm sense (the effects of an L1 norm will be explored later) as

    φ_x = ∫∫ (dm/dx)² dx dz

for the horizontal derivative and

    φ_z = ∫∫ (dm/dz)² dx dz

for the vertical derivative.

Again, after discretizing the integrals, they can be re-written as vector norms,

    φ_x = ||W_x m||²
    φ_z = ||W_z m||².

For a 3-cell by 3-cell mesh, W_x and W_z are first-difference operators: each row of W_x contains a −1 and a +1 in the columns of a horizontally adjacent pair of cells, and each row of W_z does the same for vertically adjacent pairs, with the rows scaled by the appropriate ratio of dz and dx, assuming dx and dz are constant for every cell. Otherwise, each row has to be multiplied by a cell-centred dimensional ratio.

Figure 1.2(c) shows the result when the regularization is φ_x. The model is very flat in the lateral direction and the interface between the overlying layers and the basement is imaged well. Figure 1.2(d) shows the result when the regularization is φ_z. Due to the data, the fault boundary is not imaged well; however, the lateral boundary is now flattened out and the model has bigger, coarser jumps in the lateral direction.

1.3.3 Combining Regularization

Figure 1.2 illustrates how different regularization functions yield completely different solutions. Each solution has desirable characteristics, so it is a natural extension to try to combine them. The easiest way to do this is by noting that φ_s, φ_x and φ_z have similar forms. As a result, the regularization functions may be combined as

    φ_m = α_s φ_s + α_x φ_x + α_z φ_z
        = m^T (α_s W_s^T W_s + α_x W_x^T W_x + α_z W_z^T W_z) m
        = ||W_m m||².

The purpose of the α coefficients is to allow the user to control the relative weight of each property in the inversion. In current practice there is no robust way of determining these coefficients: often a set of default values is used for (α_s, α_x, α_z), or they are scaled to represent the spatial size of model space. (A short sketch constructing W_s, W_x and W_z and assembling the combined weighting follows below.) Figure 1.3(b) shows the result of this type of inversion. It is clear that it does an adequate job of reproducing the true model in Figure 1.3(a), but the values of the parameters are not correct, nor are the boundaries imaged strikingly well.
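
The sketch below constructs W_s, W_x and W_z for a small uniform mesh and assembles the combined weighting matrix. The square-root cell-size scalings, the cell ordering and the default α values are assumptions chosen to mimic the discretized integrals, not values taken from the thesis.

    import numpy as np

    def regularization_operators(nx, nz, dx, dz):
        # Build Ws, Wx, Wz for an nx-by-nz mesh with uniform cells
        # (model ordered so that cell (ix, iz) has index ix*nz + iz).
        M = nx * nz
        Ws = np.sqrt(dx * dz) * np.eye(M)

        rows_x = []
        for ix in range(nx - 1):              # horizontal first differences
            for iz in range(nz):
                r = np.zeros(M)
                r[ix * nz + iz] = -1.0
                r[(ix + 1) * nz + iz] = 1.0
                rows_x.append(np.sqrt(dz / dx) * r)
        Wx = np.array(rows_x)

        rows_z = []
        for ix in range(nx):                  # vertical first differences
            for iz in range(nz - 1):
                r = np.zeros(M)
                r[ix * nz + iz] = -1.0
                r[ix * nz + iz + 1] = 1.0
                rows_z.append(np.sqrt(dx / dz) * r)
        Wz = np.array(rows_z)
        return Ws, Wx, Wz

    # Combined model norm phi_m = m^T (a_s Ws^T Ws + a_x Wx^T Wx + a_z Wz^T Wz) m
    Ws, Wx, Wz = regularization_operators(nx=3, nz=3, dx=1.0, dz=1.0)
    alpha_s, alpha_x, alpha_z = 1e-3, 1.0, 1.0        # hypothetical default weights
    WtW = alpha_s * Ws.T @ Ws + alpha_x * Wx.T @ Wx + alpha_z * Wz.T @ Wz
    m = np.ones(9)
    print(float(m @ WtW @ m))   # a constant model has zero flatness penalty, only smallness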

Figure 1.2: Inversion results with different regularization. (a) True model, (b) Smallest model, (c) Flattest horizontal model, (d) Flattest vertical model, (e) Raypaths used to acquire data.

1.3.4 Incorporating a Reference Model

The last part of the standard regularization procedure involves incorporating a reference model. By this, it is meant that a model is specified that the recovered model should try to resemble. Until this point, everything has been referenced to a halfspace of zeros; that is, we were searching for a smallest model when referencing every parameter to zero. Evidently, it is possible that a more likely value for a parameter is non-zero, and that parameter's size should be measured as the distance from this value. The model norm can be amended simply to include a reference model:

    φ_m = ||W_m (m − m_ref)||².

How a reference model should be formulated is still to be discussed. However, Figure 1.3 presents two examples of reference models. The first, in Figure 1.3(c), is a model where information about the halfspace interface is well known. The result, in Figure 1.3(d), has a much clearer halfspace boundary than Figure 1.3(b). Figure 1.3(e) is a reference model where the fault boundary is known well. The inversion result, in Figure 1.3(f), recovers this boundary as well as the parameter values. However, it only improves the result in the upper half of the model, where the reference model was correct; resolution of the halfspace interface has been lost. Thus, as shown by this example, a reference model can have a profound effect on the inversion result.

Figure 1.3: Inversion results with combined regularization. (a) True model, (b) Result with default (α_s, α_x, α_z) and m_ref = 0, (c) First reference model, (d) Result with default (α_s, α_x, α_z) and m_ref from (c), (e) Second reference model, (f) Result with default (α_s, α_x, α_z) and m_ref from (e).

1.4 Choosing the Norm

All of the theory presented to this point has been based on choosing L2 norms to represent the regularization and the misfit. This was done because L2 norms are mathematically convenient: they are quadratic and easily expressible. However, the choice of this norm is completely arbitrary. In order to be considered a norm, a function must comply with the following rules:

1. f(x) = 0 if and only if x = 0
2. f(x + y) ≤ f(x) + f(y)
3. f(cx) = |c| f(x)

Figure 1.4: Comparison of L1 and L2 regularization functionals (penalty plotted against model value m).

These properties are referred to as positivity, sub-additivity and homogeneity. The most common norm is the Lp norm,

    ||x||_p = ( Σ_{i=1}^{N} |x_i|^p )^(1/p).

In geophysical inverse problems, focus is put primarily on the L1 and the squared L2 norm. Figure 1.4 compares an L1 and a squared L2 norm over a range of argument values. The y-axis can be interpreted as a penalty value that coincides with a model value on the x-axis. Both norms are centred on m = 0. The important point to notice is how quickly the penalty increases for the L2 norm compared to the L1. As a result, a solution involving L2 norms will try to keep all values close to the minimum. This is not the case for the L1 norm: there is a slow and gradual increase in the penalty function, so it may still promote more model values around its minimum but is not averse to a few larger parameters when the data require it. These are fundamental differences between the two norms. The L2 norm tends to smooth, while the L1 norm produces blocky solutions. Figure 1.5 shows results of using both the L2 norm (Figure 1.5(b)) and the L1 norm (Figure 1.5(c)). It is easy to see that the L1 norm has much sharper boundaries and is able to image the fault discontinuity better. Because of the nature of the true model, the L1 norm produces a superior result.

1.5 Incorporating Prior Knowledge

At this point, the general inverse problem has been defined. The recovered model depends heavily on the definition of the data misfit, φ_d, and the model norm, φ_m. Within each measure, several quantities are user-defined, including data errors, α coefficients and a reference model. Thus, the inversion depends heavily on prior information, no matter how specific it is. This section looks at current practices for quantifying and specifying this information and then introduces new techniques that will be developed in later chapters.

Figure 1.5: Recovered models using different model norms. (a) True model, (b) L2 norm solution, (c) L1 norm solution.

1.5.1 Current Practices

Often geological information is available in an area where geophysical data were collected. This information can take many different forms. It can be a surface map of rock type and other prominent geological features. It could include physical property measurements of the rocks. At its most useful level, geological information may be presented as a drill core log from a nearby location. It is also possible that the information is more conceptual: a geologist may favour an ideal model of the geology even if he cannot provide definitive physical information to prove it. All of this could be considered useful geological information.

However, prior information need not be only geological. In fact, multiple geophysical surveys are often conducted in an area, so a geophysicist can use results from one survey to help constrain another. There is also the possibility of geochemical information being available, which can likewise provide prior information for the geophysicist to include in the inversion process.

Even though there is often an abundance of information available, the most difficult task can be translating it into a useful form. This is the biggest limitation on prior information and is often a stumbling point. As a result, current practices often under-use this prior information. Figure 1.6 outlines several degrees of prior information use in a flowchart.

Figure 1.6: Flowchart outlining increasing sophistication for regularization in geophysical inversion (Level one: inversion as a black box with defaults used for all parameters; Level two: mathematical adaptation of φ_m and the reference halfspace values; Level three: φ_m and m_ref amended with geologic information; Level four: norm adaptation, with L1 norms replacing L2 for blocky solutions; beyond level four: prior information collected, organized and translated, and used as constraints for coefficient and norm optimization).

At the most basic level, many inversions are performed as a black-box operation. In this case, the user may not understand the concept of inversion. Often the inversion code will have a set of defaults for the user-defined parameters; these defaults are chosen because they work well most of the time. At this level, the regularization is purely mathematical. As a result, the inversion can produce a usable image with minimal incorporation of prior knowledge.

The second level of incorporating prior information into inversion still has a mathematical flavour, but the user is informed on the concepts used to develop the regularization. Here, φ_m is composed of mathematical operators that represent certain features. As shown earlier, regularization typically includes smallness and flatness. Without any hard geological evidence, a user may wish to include only smallness or flatness, or to introduce new mathematical norms such as smoothness (second derivative). Although this type of information does not come from hard evidence, a geoscientist may have reason to define φ_m in this way. After reviewing several inversions, the user may also adapt the reference model to a more appropriate base level, but will not have any information to create a more sophisticated reference model.

The third type of regularization definition directly incorporates geologic information. This is an extension of level two in that the α coefficients and m_ref are defined. The α coefficients may be weighted due to geological structure or geochemical analysis. However, these values are guessed (from a relative point of view) and are not optimal. The reference model is created from prior geologic information. It can be as simple as a halfspace if the information is not very detailed; conversely, it can be extremely detailed if there is an abundance of drill and physical property information. It should be noted that all the prior information at this level must be represented by the α coefficients and the reference model. This is not always an easy task.

The fourth regularization level is the modification of level three by changing the norms. This level is often reached after several inversions have been conducted with L2 norms. At this point, the geophysicist may feel that the smooth nature of the L2 norm is not appropriate. This often results in the use of the L1 norm that was introduced in the previous section.

Levels one to four of Figure 1.6 outline most current practices in regularizing geophysical inversion. What follows is a discussion of potential extensions to these practices that go beyond level four.

1.5.2 Correcting Mathematical Deficiencies

Often the mathematics of a geophysical survey can cause problems when the design of a regularization functional treats every cell (parameter) in the same way. This is especially a problem with potential fields, where cells further away from the receiver have relatively less signal effect than those closer. For instance, consider the measurement of a magnetic field,

    d_i = Σ_{j=1}^{M} g_ij κ_j.

Like any other linear relation, each datum is the sum of the kernels (g_ij) for every cell multiplied by its susceptibility value, κ_j. However, the kernels for each cell decay with distance as

    g_ij ∝ 1 / r_ij³

where r_ij is the distance between the data location and the model cell. Thus, a cell further from the receiver has less effect on the value of the datum than a closer cell with the same κ value. For it to have the same effect, the value of κ for the more distant cell must be larger. Since a default regularization functional encourages smaller model parameters, the recovered model will opt for a solution with a lot of structure near the surface. (This illustrates the equivalence inherent in potential fields: a small near-surface body can have the same signal as a larger, deeper body.) As a result, every cell is not given an equal chance of containing non-zero κ, and the regularization must account for this mathematical deficiency in the forward operator []. Consider the model norm φ_m = ||W_m m||². To include a correction for the forward operator, it is rewritten as

    φ_m = ||W_m Z m||²

where Z^T Z contains the deficiency correction w_j for each cell along its diagonal (and m is a vector of κ values). In the magnetics problem (surface data), the deficiency correction would be w_j = 1/z_j³, where z_j is the depth of the cell (the depth issue is of more concern than a general distance issue). Thus, Z would be defined as

    Z = diag(√w_1, ..., √w_M).

After applying the deficiency correction, the smallness penalty for a cell is w_j κ_j² δx_j δz_j as opposed to κ_j² δx_j δz_j, so a deep cell can take on a much larger κ value for the same penalty. Thus, all cells are equally likely to contain structure. This method can be applied to any problem that has this kind of deficiency. This theory is at the root of the idea that regularization in geophysical inversion should be linked to the forward operator.
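
A short sketch of the depth weighting described above follows; the cell depths used are hypothetical, and the power of 3 follows the surface-magnetics argument in the text.

    import numpy as np

    def depth_weighting(z_cells, power=3.0):
        # Z = diag(sqrt(w_j)) with w_j = 1 / z_j^power, so that Z^T Z carries the
        # correction w_j on its diagonal and ||W_s Z m||^2 penalizes deep and
        # shallow cells equally for the same signal contribution.
        w = 1.0 / z_cells ** power
        return np.diag(np.sqrt(w))

    z = np.array([5.0, 15.0, 25.0, 35.0])        # hypothetical cell-centre depths
    Z = depth_weighting(z)
    kappa = np.ones(4)                            # equal susceptibility in every cell
    print(np.round((Z @ kappa) ** 2, 6))          # deep cells incur a far smaller penalty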

1.6 Learned Regularization

Learned regularization functionals are defined as optimal model norms given a set of prior information on the physical property in question. Currently, as explained in the previous sections, methods to include prior information are limited. As a result, inversions are performed with regularization that does not promote optimal model structure. This research focusses on developing techniques to specify more informative regularization. The problem is attacked in three different ways, varying from a pure probabilistic approach to a pure deterministic approach. Each method is a viable tool to express regularization functionals that have been learned from prior knowledge. Because these regularization functionals are more specific, the level of detail and complexity of the functional increases. As a result, complementary issues such as optimization algorithms and search methods must become more sophisticated. The goal of this thesis is not to augment these aspects of inversion theory; rather, the goal is to investigate how learned regularization functionals can be generated. Chapters 2-4 outline the learned regularization methods, and application of the procedures is presented in Chapter 5. A generic geophysical model consisting of an anomalous body embedded in a uniform space is considered.

Learned Regularization I: Probabilistic Regularization

When thinking about regularization in a probabilistic sense, the natural extension is to introduce the idea of distributions (or probability density functions). A distribution of a random variable is a summary of the likelihood of its possible values. For instance, a perfect coin would be best described by a binomial distribution with p, the probability of obtaining one of the values, equal to 0.5. This simple example is easy to grasp because there are only two possible outcomes. In geophysical problems, we are often faced with predicting the most likely outcome for a parameter that physically has an unlimited range. Thus, we must consider a continuous distribution, p(x), for which the following properties must hold:

1. p(x) > 0 for all x
2. ∫ p(x) dx = 1

To summarize, there must be some probability (regardless of how small) that a particular value can occur, and these probabilities integrated over all possible values must equal 1. In order to pose geophysical regularization in terms of a probabilistic variable, we must propose a form for the distribution. The simplest way to do this is via a Gibbs distribution,

    p(m) = (1/Z) exp(−R(m)/T).

It is up to the scientist to determine desirable forms of R(m) and potential values of T. This can be as simple as using a multivariate Gaussian (Chapter 2). However, it is possible to define any form as long as the normalization value (or partition function), Z, can be calculated.

What makes the regularization learned is that R(m) is usually defined parametrically. That is, R = R(m; θ), where θ is a vector of parameters that have optimal values given a set of possible realizations of m. This set is defined as the training model set, M_T = {m_i^t}, where the individual training models were obtained from some information unrelated to the distribution or to the particular inversion. For instance, a geologist may have ideas of several possible Earth models that can be translated into physical property information and used to define the regularization. The key to this process is that a priori information is classified and becomes extremely useful in the inversion process.

Learned Regularization II: Deterministic Model Based Regularization

The model based deterministic approach to regularization differs from the probabilistic one in that we do not strictly follow the rules of probability. Instead, we assume that we have access to significant prior information on a filtered version of the model, represented as z. A key point is that we simply choose an appropriate norm for z. This could take any form that we choose; for instance, we may wish it to be represented by a p-norm or some generic norm. Since we have prior information on the variable z, we are able to parameterize the particular norm and recover the best fitting parameters. This approach is more practical in many ways because the connection between the model and z is ignored. Chapter 3 investigates its origins and presents its strengths and weaknesses.

Learned Regularization III: Deterministic Survey Dependent Regularization

In Section 1.5.2, the idea that deficiencies in the forward operator should be corrected for in the regularization was introduced. This was accomplished by analyzing the kernels that govern the forward model and counteracting any inherent decay or deficiency. However, in a paper by Haber and Tenorio [], deficiencies in the forward operator are accounted for in a learned regularization framework. The idea centres around creating synthetic data from the training model set, D_t = F[M_t]. Then, by using typical inversion methodology with φ_m = φ_m(θ), one can invert D_t to obtain a set of inverted models, {m_i}. The goal is to find the vector θ that produces inverted models that are closest to their progenitors in M_t. That is,

    min_θ ψ = Σ_i ||m_i(θ) − m_i^t||².

The method is contingent on the form of the regularization being specified. As a result, optimal parameters for a specific type of regularization are sought. Chapter 4 details the formulation of this approach and explores several synthetic examples.
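
A rough sketch of the HT idea for a linear forward operator is shown below. The exp(θ) parameterization of the α weights, the fixed β, the noise-free synthetic data and the use of scipy.optimize.minimize are simplifying assumptions, not details of Haber and Tenorio's formulation.

    import numpy as np
    from scipy.optimize import minimize

    def ht_objective(theta, G, training_models, Wd, ops, beta=1.0):
        # psi(theta) = sum_i || m_i(theta) - m_i^t ||^2, where m_i(theta) is the
        # Tikhonov inversion of the synthetic data d_i = G m_i^t with
        # W_m^T W_m = sum_k exp(theta_k) * W_k^T W_k (exp keeps the weights positive).
        WtW = sum(np.exp(t) * W.T @ W for t, W in zip(theta, ops))
        A = G.T @ Wd.T @ Wd @ G + beta * WtW
        GtWtWd = G.T @ Wd.T @ Wd
        psi = 0.0
        for m_t in training_models:
            d = G @ m_t                      # synthetic data from the training model
            m_rec = np.linalg.solve(A, GtWtWd @ d)
            psi += float(np.sum((m_rec - m_t) ** 2))
        return psi

    # Recover theta = (log alpha_s, log alpha_x, log alpha_z) with a generic optimizer;
    # G, Wd, Ws, Wx, Wz and the training set M_t are assumed to be built as in the
    # earlier sketches.
    # result = minimize(ht_objective, x0=np.zeros(3),
    #                   args=(G, M_t, Wd, (Ws, Wx, Wz)), method="Nelder-Mead")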

Chapter 2

Probabilistic Regularization

2.1 Introduction

In this chapter a probabilistic framework to represent geophysical prior information is introduced and developed. Specifically, we seek to represent the prior distribution of the geophysical model, m, as p(m) = (1/Π) exp(−R(θ, m)), where Π is a normalization and θ is a vector of parameters. Through probability theory, we seek a method of reducing the specification of R to interactions between neighbouring cells. The latter part of the chapter concentrates on using information in the form of training models to recover optimal values of θ. Depending on the specification of R, methods to attack this problem vary from routine optimization in a maximum likelihood framework to sampling techniques from MCMC theory. In the field of image processing there has been significant work done to learn prior distributions from the statistics of observed images ([9], []). These advanced techniques make use of Gibbs distributions, take a maximum entropy approach, and are based on selecting appropriate filters of m from a bank of possibilities. In this chapter, a simpler approach to representing prior information is sought. Upon conclusion of this chapter, the reader will be familiar with the statistical incorporation of geophysical information into learned regularization functionals. In addition, methodology for the selection of appropriate forms of R is introduced and developed.

2.2 Probability Theory in Regularization

In this section, the laws of probability as they apply to regularization are discussed. z is assumed to be a filtered quantity of the training set such that z = f(m).

2.2.1 Bayes Theorem and Prior Distributions

In probability theory, Bayes theorem is a relation between the conditional distributions of two variables:

    p(m|d) = p(d|m) p(m) / p(d).

If we wish to condition on z as well as m, this is amended as follows:

    p(m|d, z) = p(d|m, z) p(m, z) / p(d, z).

Above, we refer to p(d|m, z) as the likelihood, p(m, z) as the prior distribution and p(m|d, z) as the posterior distribution.

The main question remaining in the amended Bayes relation is how to expand p(m, z). Using conditional probabilities, the prior can be decomposed as

    p(m, z) = p(z|m) p(m) = p(m|z) p(z).

However, we are still required to define either p(m|z) or p(z|m). In order to do this, we must know what the definition of z is. For simplicity, we assume that the relation between z and m is linear, such that z = Fm, where F is a linear operator. One possible example of F is a derivative operator,

    z = Fm

where F is a discretized (un-normalized) first-order difference matrix, with each row containing a −1 and a +1 in adjacent columns (a sketch is given below). By writing out the form of the matrix we can draw two conclusions:

1. z is not independent of m.
2. The resulting set {z_i} is not independent.

The first conclusion is a result of our definition of z and is always true. The second depends on the form of F. Returning to the problem of defining our prior, the decomposition above requires the specification of a conditional and a marginal distribution. We assume that we have obtained information on z. Thus, we wish the prior distribution to be p(z) and, referring to the second equality in the decomposition, we are required to specify p(m|z). This is an interesting issue if considered carefully. We assume that F is a matrix where rows(F) < columns(F). For example, if F is the first-order difference operator above and m contains M elements, z contains M − 1 elements. Given a vector m, p(z|m) is simply a series of points in probability space. Let us consider the reverse process. Given a vector z, p(m|z) is an M-dimensional distribution given M − 1 constraints (and no boundary conditions). Without additional prior information on m, this distribution must cover R^M and is a uniform distribution. As a result, the decomposition is the equivalent of writing

    p(m, z) ∝ p(z).

Therefore, we can write the following Bayesian proportionality for the posterior:

    p(m|d, z) ∝ p(d|m, z) p(z).

The problem has been reduced to the definition of a prior distribution on z alone. The next task is to specify the types of distributions in this proportionality.
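
A tiny sketch of such an F follows (a first-order difference matrix; the example model values are arbitrary):

    import numpy as np

    def first_difference(M):
        # (M-1)-by-M un-normalized first-order difference operator F, so z = F m.
        F = np.zeros((M - 1, M))
        for i in range(M - 1):
            F[i, i] = -1.0
            F[i, i + 1] = 1.0
        return F

    m = np.array([1.0, 1.0, 4.0, 4.0, 2.0])
    F = first_difference(len(m))
    z = F @ m
    print(z)   # [ 0.  3.  0. -2.]  -- z depends on m, and neighbouring z_i share
               # model cells, so the set {z_i} is not independent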

2.2.2 Specifying the Likelihood and Prior Distributions

In order to demonstrate the complexities of using a probabilistic approach in our inversion, we must begin specifying individual distributions. For instance, the Gaussian (or L2) distribution has the advantage that it is a smooth distribution that carries an inherent sensitivity to extreme values. We shall define our likelihood as

    p(d|m, z) = (1/Π_L) exp( −½ (d − F[m])^T C⁻¹ (d − F[m]) )

where Π_L is a normalization constant that ensures unit area under the distribution and C is the data covariance. The reader should realize that the form of the likelihood could be anything; the Gaussian representation is by far the most common. (Note: the definition of a Gaussian likelihood is akin to assuming the presence of Gaussian noise in the data.)

The definition of the prior, p(z), is the most important. This will depend on the information made available by a geologist or some other scientist, and we must define this distribution very generally. The easiest way to do this is by defining a Gibbs distribution,

    p(z) = (1/Π_P) exp(−R(z)).

The notion of Gibbs distributions comes from thermodynamics, where R(z) is an energy function. (It is interesting to note that probability distributions take their exponential form due to thermodynamic constraints: for any two states, the total energy must be additive but the probability must be multiplicative.) We can define this energy function in any form (as long as it is finite). However, it is often more useful to define the energy in terms of local interactions of parameters. If this occurs, we have defined a Gibbs distribution as a Markov random field [] and, henceforth, we must consider the notion of neighbourhoods and cliques (see [] for derivation and references therein).

2.2.3 Neighbourhoods, Cliques and Markov Random Fields

An important part of understanding Gibbs distributions is being able to define a neighbourhood. Consider a 2-D mesh or lattice. Every cell i has at least two cells that directly border it. These cells comprise its first order neighbourhood. Thus, for the entire lattice, S, a neighbourhood is defined as N = {N_i | i ∈ S}, where N_i is the set of sites neighbouring point i. There are two criteria that a neighbourhood must obey:

a) A site cannot neighbour itself: i ∉ N_i
b) The neighbourhood relation is mutual: i ∈ N_i′ if and only if i′ ∈ N_i

Neighbourhoods on meshes and lattices have different orders. For example, the first order neighbourhood for a cell in the interior of a rectangular mesh consists only of the four cells that directly border it. When considering the second order neighbourhood, the cells that diagonally border the interior cell are also included, so the second order neighbourhood includes eight cells. This is illustrated in Figure 2.1.

Cliques are groups of cells within a neighbourhood that are all mutually neighbours. Thus, for a first order neighbourhood, the cliques for cell i consist of the single-site clique {i} together with the pair cliques (i, i′) for each i′ ∈ N_i. Figure 2.1 helps to make this notion concrete. Consider the black cell as cell i; the locations i′ are the dark grey cells.

Figure 2.1: Example of a neighbourhood in a mesh for one cell (black). Its first order neighbourhood consists of the dark grey cells. The second order neighbourhood consists of the dark and light grey cells.

Thus, the cliques for cell i are defined by a function of cell i alone as well as by interactions with the cells at the locations i′. If the neighbourhood were enlarged to second order, there would also be three- and four-cell interaction cliques.

Utilizing the definition of cliques and neighbourhoods, we can decompose our energy function into Gibbs potentials,

    R(z) = Σ_c V_c(z).

The subscript c is meant to imply summation over all the orders of cliques present in the neighbourhood. The function V_c(z) is referred to as a Gibbs potential. The form of V_c can vary as the size of the clique changes. This decomposition of R(z) gives the problem a local foundation and allows the equivalence to Markov random fields to be drawn. Given any lattice or set, S, a random field is defined as the labelling of a particular value at every cell. A random field that is defined by local interactions is called a Markov Random Field (MRF). z is an MRF if and only if

a) P(z) > 0 for all z (Positivity)
b) P(z_i | z_S[i]) = P(z_i | z_Ni) (Markovianity)

where z_S[i] denotes all values of z except z_i, and z_Ni are the values of z at the sites neighbouring i.
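
The following sketch enumerates first-order neighbourhoods and the corresponding cliques on a small rectangular mesh; the cell indexing is an implementation choice, not notation from the thesis.

    def first_order_neighbours(ix, iz, nx, nz):
        # First-order neighbourhood of cell (ix, iz) on an nx-by-nz mesh:
        # the cells that share an edge with it (up to four in the interior).
        candidates = [(ix - 1, iz), (ix + 1, iz), (ix, iz - 1), (ix, iz + 1)]
        return [(jx, jz) for jx, jz in candidates if 0 <= jx < nx and 0 <= jz < nz]

    def first_order_cliques(nx, nz):
        # Single-cell cliques plus all mutual pairs under the first-order system.
        singles = [((ix, iz),) for ix in range(nx) for iz in range(nz)]
        pairs = []
        for ix in range(nx):
            for iz in range(nz):
                for jx, jz in first_order_neighbours(ix, iz, nx, nz):
                    if (jx, jz) > (ix, iz):          # count each pair once
                        pairs.append(((ix, iz), (jx, jz)))
        return singles + pairs

    print(first_order_neighbours(1, 1, 3, 3))   # interior cell: 4 neighbours
    print(len(first_order_cliques(3, 3)))       # 9 single-cell + 12 pair cliques = 21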

Figure 2.2: Flowchart of the complete inversion process. Preliminary inversion: (a) obtain prior information and create training set, (b) quantify training set in a Gibbs distribution, (c) regularization inversion process (probabilistic or deterministic). Main inversion: (d) regularization specified, (e) field data measurements obtained, (f) geophysical inversion, (g) geophysical model estimate.

This formulation creates enormous flexibility within the prior specification. At its root, p(z) is specified by the definition of a neighbourhood, the resulting cliques and the specification of the Gibbs potentials. Inserting the potential decomposition into the Gibbs prior, the following relation is obtained:

    p(z) = (1/Π_P) exp( −Σ_c V_c(z) ).

We have been successful in changing the prior from a general Gibbs distribution into one that depends on local characteristics. This is an important result. Given a possibility for m, we require a method of assigning a probability. Before establishing the relationship between the Gibbs distribution and Markov random fields it was unclear how to proceed; now, we are able to use relationships between neighbouring cells to assess probability and define the prior. Once the prior has been specified, we may return to Bayes theorem and concentrate on estimating m. By using the theory in this section, this estimate of m is obtained via a pure probabilistic route. We now turn our attention to finding a form of Σ_c V_c(z) from prior information in a learned manner.

2.3 Learning Probabilistic Regularization Functionals

Figure 2.2 illustrates the learned regularization inversion process. There are two inversion steps outlined. In this section, the preliminary inversion step is documented. This involves the recovery of a regularization functional from prior information. What follows is a discussion of steps (a)-(c).

2.3.1 Step (a): Obtain Prior Information and Create Training Set

Prior information on our physical property model, m, could come from many different sources. For instance, a geologist could have a conceptual model of rock units or fault boundaries. There may be some geochemical data that suggest a certain mineral composition. Regardless, we assume that this information can be translated into information on the physical property in question. This information could be provided in a couple of different ways:

a) A model, m_ref, is provided as the mean of our distribution.

b) Several models, M_t = {m_i}, i = 1, ..., K, are provided in a training set and are supposed to be samples from p(m).

If we consider option (a) carefully, it is apparent that we are not provided with enough information to define the distribution wholly. The minimum number of moments needed to define a distribution is two (mean and variance for a Gaussian). Thus, we would be faced with the task of prescribing another moment independently. However, we could easily combine option (a) with information provided in option (b) to resolve this issue. The route involving moment prescription essentially reduces our problem to a parametrization of p(m). The simplest example, a Gaussian, is shown below:

    p(m) = (1/Π_G) exp( −½ (m − m_ref)^T C⁻¹ (m − m_ref) ).

If we had samples that we hypothesized were from the above distribution, we could recover the elements of the covariance matrix, C. The only problem with this approach is that we must prescribe the form of the distribution beforehand. The form of C dictates how large the neighbourhood and clique systems are.

It is apparent that our prior information must come in the form of (b). In this case, we have been given several models that a geologist has deemed probable. Figure 2.3 shows four examples of these types of models. In this example, the scientist has decided that we are most likely looking for a box model (within a halfspace). He has assigned relative model values to the box and the halfspace. Although these models are similar in type, there are differences in location and model values. Therefore, we must know what properties of m (recall z = f(m)) the training set represents. This is an issue that needs to be resolved by the geophysicist and we will address it in the next section. Having established how the information is obtained, we can move on to the quantification step.

2.3.2 Step (b): Quantify Training Set in Gibbs Distribution

The Gibbs distribution was developed as the most general way to express our prior distribution. Once we have obtained information such as that in Figure 2.3, we must be able to quantify it in some manner. To do this in a tractable manner, we must assume that all of the models within M_t are independent samples from p(m) (in this case, the training set contains information on m itself, such that z = m). Thus, we can safely propose that the joint probability of the entire set is simply the product of all the individual probabilities,

    p(M_t) = ∏_{i=1}^{K} p(m_i^t) = (1/Π_P)^K exp( −Σ_{i=1}^{K} Σ_c V_c(m_i^t) ).

Taking the logarithm of the previous equation, we obtain

    log p(M_t) = −( Σ_{i=1}^{K} Σ_c V_c(m_i^t) + K log Π_P ).

The above is the basis of the maximum likelihood estimator (MLE). We wish to maximize the probability, so we can simply transform this into a minimization problem.

Step (b): Quantify Training Set in Gibbs Distribution

The Gibbs distribution above was developed as the most general way to express our prior distribution. Once we have obtained information such as the training models just described, we must be able to quantify it in some manner. To do this in a tractable manner, we must assume that all of the models within M^t are independent samples from p(m) (in this case, our training set contains information on m such that z = m). Thus, we can safely propose that the joint probability of the entire set is simply the multiplication of all the individual probabilities:

p(M^t) = \prod_{i=1}^{K} p(m^t_i) = (\Pi_P)^K \, e^{-\sum_{i=1}^{K} \sum_c V_c(m^t_i)}

Taking the logarithm of the previous equation, we obtain

\log p(M^t) = -\sum_{i=1}^{K} \sum_c V_c(m^t_i) + K \log(\Pi_P)

The above is the basis of the maximum likelihood estimator (MLE). We wish to maximize the probability, so we can simply transform this into a minimization problem with the following objective function:

\psi = \sum_{i=1}^{K} \sum_c V_c(m^t_i) - K \log(\Pi_P)

In order to find the minimum of ψ, two items must be specified. First, we must decide on the form of V_c(m^t_i). Second, we must be able to define Π_P. Regardless, this is a general expression that uses training models to quantify a prior distribution.

Because of the multivariate nature of the distribution, the choices for V_c(m) are limited. If we wish to optimize ψ, the normalization must be expressible. As a result, we choose the simplest case: the multivariate Gaussian. To specify the multivariate Gaussian distribution we need to define

\sum_c V_c(m) = \frac{1}{2} m^T C^{-1} m

in order to fully specify the objective function. C^{-1} is the inverse covariance matrix. Within this matrix, the size of the neighbourhood and corresponding cliques are defined. Note that the reference model is omitted but can be inserted at any point (replace m with m - m_ref). In most geophysical inverse problems, desirable features of the recovered model include anomalous bodies and relative changes in physical property value. As a result, it is reasonable to limit the neighbourhood of m to first order (see the first-order neighbourhood figure shown earlier).
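To illustrate what a first-order neighbourhood implies computationally, the sketch below assembles sparse operators on a 2-D mesh that couple each cell only to itself and to its immediate horizontal and vertical neighbours; operators of this kind play the role of the W_s, W_x and W_z matrices that appear next. The mesh ordering and cell spacing are assumptions made for the sketch.

```python
import scipy.sparse as sp

def first_order_operators(nz, nx, delta=1.0):
    """Smallness and first-difference operators for an nz-by-nx mesh whose
    model is stored column by column (depth index fastest) as a length
    M = nz*nx vector. W_s is the identity; W_z and W_x difference vertically
    and horizontally adjacent cells, scaled by the cell spacing delta."""
    M = nz * nx
    Ws = sp.identity(M, format="csr")
    Dz = sp.diags([-1.0, 1.0], [0, 1], shape=(nz - 1, nz), format="csr") / delta
    Dx = sp.diags([-1.0, 1.0], [0, 1], shape=(nx - 1, nx), format="csr") / delta
    Wz = sp.kron(sp.identity(nx), Dz, format="csr")  # vertical neighbour differences
    Wx = sp.kron(Dx, sp.identity(nz), format="csr")  # horizontal neighbour differences
    return Ws, Wx, Wz

Ws, Wx, Wz = first_order_operators(12, 12)
```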

Thus, for one cell m_k the clique potentials would include a term for m_k alone as well as one for its difference with each neighbouring cell:

\sum_{c \ni k} V_c(m_k) = V_k(m_k) + \sum_{k'} V_{k,k'}(m_k, m_{k'}) = \alpha_k m_k^2 + \sum_{k'} \alpha_{kk'} \left( \frac{m_k - m_{k'}}{\delta} \right)^2

In the above equation, δ is included as the distance between locations k and k' in order to normalize the derivative, and k' is a neighbouring cell. α_k and α_{kk'} are the parameters of interest that act as weights within the different potentials. For instance, a relatively high α_{kk'} value will promote smaller derivative values in an inversion (it increases the penalty in the regularization). The value of α_{kk'} could be allowed to vary for every difference or be held constant in some fashion. Both cases will be examined in this chapter. For simplicity, the first case involves holding α_k constant for every cell as well as holding α_{kk'} constant for horizontal differences and vertical differences respectively. The resulting α vector is defined as α = (α_s, α_x, α_z) for a 2-D problem. Inserting these concepts into the Gaussian form above, the Gibbs potentials are defined for the multivariate Gaussian:

\sum_c V_c(m) = \frac{1}{2} m^T \left( \alpha_s W_s^T W_s + \alpha_x W_x^T W_x + \alpha_z W_z^T W_z \right) m

The operators W_s, W_x and W_z are as defined in an earlier chapter. It is now easy to see the relation between this expression and the quadratic form \frac{1}{2} m^T C^{-1} m. Thus,

C^{-1}(\alpha) = \alpha_s W_s^T W_s + \alpha_x W_x^T W_x + \alpha_z W_z^T W_z

is the inverse covariance matrix; for the purpose of optimization we write it as C^{-1}(α).

It was stated that two tasks must be completed before recovering our distributions. Firstly, we had to define our Gibbs potentials; this has been accomplished above. Secondly, the corresponding normalization Π_P must be defined. Since we have defined our distribution as a multivariate Gaussian, Π_P is expressible:

\Pi_P^{gauss} = \left[ (2\pi)^M \det(C(\alpha)) \right]^{-1/2}

Now we can fully define the objective function and obtain a MLE:

\psi = \frac{1}{2} \sum_{i=1}^{K} m_i^T C^{-1}(\alpha)\, m_i + K \log\!\left( (2\pi)^{M/2} \det(C(\alpha))^{1/2} \right)
    = \frac{1}{2} \sum_{i=1}^{K} m_i^T C^{-1}(\alpha)\, m_i + \frac{KM}{2} \log(2\pi) + \frac{K}{2} \log\!\left( \det(C(\alpha)) \right)
    = \frac{1}{2} \sum_{i=1}^{K} m_i^T C^{-1}(\alpha)\, m_i + \frac{KM}{2} \log(2\pi) - \frac{K}{2} \log\!\left( \det(C^{-1}(\alpha)) \right)
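A sketch of these two ingredients in code: assembling C^{-1}(α) from the three operators and evaluating the negative log-likelihood ψ of a training set. The function and variable names are mine, the models are assumed to be supplied as flattened vectors, and the dense log-determinant restricts this to small meshes.

```python
import numpy as np

def cinv(alpha, Ws, Wx, Wz):
    """Inverse covariance C^{-1}(alpha) = a_s Ws'Ws + a_x Wx'Wx + a_z Wz'Wz (dense)."""
    a_s, a_x, a_z = alpha
    return (a_s * Ws.T @ Ws + a_x * Wx.T @ Wx + a_z * Wz.T @ Wz).toarray()

def psi(alpha, models, Ws, Wx, Wz):
    """Negative log-likelihood of the training models under N(0, C(alpha)):
    0.5*sum_i m_i' C^{-1} m_i + 0.5*K*M*log(2*pi) - 0.5*K*log det(C^{-1})."""
    Ci = cinv(alpha, Ws, Wx, Wz)
    K, M = len(models), Ci.shape[0]
    _, logdet = np.linalg.slogdet(Ci)               # log det of C^{-1}(alpha)
    quad = sum(float(m @ Ci @ m) for m in models)   # models are flattened model vectors
    return 0.5 * quad + 0.5 * K * M * np.log(2.0 * np.pi) - 0.5 * K * logdet

# e.g. models = [m.flatten(order='F') for m in M_t]; psi((1e-2, 1.0, 1.0), models, Ws, Wx, Wz)
```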

The Gibbs distribution for this simple example has been defined and, as a result, the objective function for our preliminary inversion is completely tractable. Next, the inversion process must be investigated.

Step (c): Probabilistic Inversion to Recover φ_m

First, we shall write a formal minimization statement:

\min_{\alpha} \; \psi = \frac{1}{2} \sum_{i=1}^{K} m_i^T C^{-1}(\alpha)\, m_i - \frac{K}{2} \log\!\left( \det(C^{-1}(\alpha)) \right)

However we decide to represent C^{-1}(α) does not change how the minimization problem is approached. It is clear that we must somehow deal with the derivative of a (logged) determinant (the last term above). Fortunately, there is an analytic approach to this problem. The corresponding theorem by Hanche-Olsen [7] is outlined below.

Theorem. For any n-by-n matrix Φ(t), the determinant can be expressed as a multilinear function of its rows,

\det(\Phi) = f(\phi_1, \ldots, \phi_n)

where φ_i refers to a particular row of Φ. Thus, by successively holding all rows constant except one, the determinant can be expressed as a linear function of each of its arguments. Applying the derivative operator to this notion produces

\frac{d}{dt} \det(\Phi(t)) = f(\dot{\phi}_1, \phi_2, \ldots, \phi_n) + f(\phi_1, \dot{\phi}_2, \ldots, \phi_n) + \cdots + f(\phi_1, \phi_2, \ldots, \dot{\phi}_n)

where \dot{\phi}_1 symbolizes the derivative of the first row. Consider the first term of this sum when Φ(t) is the identity matrix: it reduces to the (1,1) entry of \dot{\Phi}, and summing all the terms gives

\frac{d}{dt} \det(\Phi(t)) = \mathrm{trace}(\dot{\Phi}(t)) \quad \text{when } \Phi(t) = I.

In order to develop a formula for non-identity matrices we can use determinant properties. For instance, if A is a constant invertible matrix,

\det(A\Phi) = \det(A)\det(\Phi) \quad \Rightarrow \quad \det(\Phi) = \frac{\det(A\Phi)}{\det(A)}

Using the two previous results together, the following holds:

\frac{d}{dt} \det(\Phi(t)) = \frac{1}{\det(A)} \, \mathrm{trace}(A \dot{\Phi}(t)) \quad \text{when } A\Phi(t) = I.

If Φ is invertible, we can substitute A = Φ^{-1} and rearrange the final result:

\frac{d}{dt} \det(\Phi(t)) = \det(\Phi(t)) \, \mathrm{trace}\!\left( \Phi^{-1}(t)\, \dot{\Phi}(t) \right)

Finally, the above can be re-written as

\frac{d}{dt} \log\!\left( \det(\Phi(t)) \right) = \mathrm{trace}\!\left( \Phi^{-1}(t)\, \dot{\Phi}(t) \right)
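This identity is easy to verify numerically, which is a useful sanity check before wiring it into the gradient. The sketch below compares a centred finite difference of log det Φ(t) against trace(Φ^{-1} dΦ/dt) on a small symmetric positive definite matrix path; the matrices, step sizes and seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
S = np.eye(6) + A @ A.T            # symmetric positive definite base matrix
dPhi = B + B.T                     # a fixed symmetric perturbation direction
Phi = lambda t: S + t * dPhi       # matrix-valued path Phi(t), SPD for small t

t, h = 0.02, 1e-6
fd = (np.linalg.slogdet(Phi(t + h))[1] - np.linalg.slogdet(Phi(t - h))[1]) / (2 * h)
tr = np.trace(np.linalg.solve(Phi(t), dPhi))   # trace(Phi^{-1}(t) dPhi/dt)
print(fd, tr)                                  # the two values agree to finite-difference accuracy
```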

This theorem proves to be extremely useful in our derivation: it provides an exact expression for the derivative we wish to take. Thus, by substituting C^{-1}(α) for Φ we can continue with the minimization:

g_j = \frac{d\psi}{d\alpha_j} = \frac{1}{2} \sum_{i=1}^{K} m_i^T \frac{\partial C^{-1}}{\partial \alpha_j} m_i - \frac{K}{2} \, \mathrm{trace}\!\left( C(\alpha) \, \frac{\partial C^{-1}}{\partial \alpha_j} \right)

g_j is the derivative of our objective function and is required for any search algorithm we may impose. To this point, the probabilistic inversion has been generally outlined. The only point remaining is to deal with the specific form for C^{-1}(α) and to tune the search algorithm accordingly.

Search Algorithms

In any optimization problem, a search algorithm is required to find the minimum of the objective function. The complexity of this algorithm often reflects the complexity of the problem: the number of parameters and the linearity of the objective function are two important issues to consider. In the problem of recovering weighting coefficients, we can have as few as two parameters. However, we may also have to deal with O(M) parameters. In addition, as the gradient expression shows, there is inherent non-linearity in the problem (the normalization constant ensures that). By incorporating second order information, the efficiency of the search can be improved. In this section, a Newton search approach is developed. As we shall see, incorporating simple bounds in this algorithm produces the gradient projection algorithm.

The Newton Step

The Newton step is based around incorporating curvature information to direct the search. Expanding the objective function about a location α, we obtain

\psi(\alpha + p) = \psi(\alpha) + g^T p + \frac{1}{2} p^T H(\alpha) p + \text{H.O.T.}

H is the second derivative matrix, called the Hessian (H_{ij} = \partial^2 \psi / \partial \alpha_i \partial \alpha_j), and p is a perturbation vector. Neglecting the higher order terms (H.O.T.), we search for a vector p that minimizes (see [3])

\hat{\Psi}(p) = \psi(\alpha) + g^T p + \frac{1}{2} p^T H(\alpha) p

The solution to \nabla_p \hat{\Psi} = 0 is

p = -H^{-1} g

The updated α vector is

\alpha_{k+1} = \alpha_k + \omega p

where ω ∈ (0, 1]. If \hat{\Psi} is a good representation of ψ then ω = 1; that is, the full Newton step is optimal. If \hat{\Psi} is a poor approximation, then a line search to find ω is required. We are guaranteed that for sufficiently small ω, ψ(α + ωp) < ψ(α) (i.e. p is a descent direction).
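Before turning to the Hessian, the gradient derived above can be sketched directly. It exploits the fact that ∂C^{-1}/∂α_j is simply W_j^T W_j because C^{-1}(α) is linear in α; the dense inverse restricts the sketch to small meshes, and the names follow the earlier sketches rather than any code from this work.

```python
import numpy as np

def gradient(alpha, models, Ws, Wx, Wz):
    """g_j = 0.5*sum_i m_i' B_j m_i - 0.5*K*trace(C(alpha) B_j), where
    B_j = dC^{-1}/dalpha_j = W_j'W_j is constant since C^{-1} is linear in alpha."""
    a_s, a_x, a_z = alpha
    Ci = (a_s * Ws.T @ Ws + a_x * Wx.T @ Wx + a_z * Wz.T @ Wz).toarray()
    C = np.linalg.inv(Ci)                        # covariance; dense inverse, small meshes only
    K = len(models)
    g = np.zeros(3)
    for j, W in enumerate((Ws, Wx, Wz)):
        B = (W.T @ W).toarray()                  # dC^{-1}/dalpha_j
        quad = sum(float(m @ B @ m) for m in models)
        g[j] = 0.5 * quad - 0.5 * K * np.trace(C @ B)
    return g
```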

Taking the derivative of the gradient expression, the Hessian can be obtained:

H_{jk} = -\frac{K}{2} \, \mathrm{trace}\!\left( \frac{\partial C}{\partial \alpha_k} \, \frac{\partial C^{-1}}{\partial \alpha_j} \right)

This expression is very difficult to obtain analytically because we lack a direct expression for C. Thus, the easiest approach is a numerical one. Because the Hessian is symmetric in this problem, the numerical approach is not extremely expensive: we calculate the derivative of C using finite differences,

\frac{\partial C}{\partial \alpha_k} \approx \frac{C(\alpha_k + \delta\alpha_k) - C(\alpha_k)}{\delta\alpha_k}

where δα_k is some small perturbation. Substituting this approximation into the Hessian expression results in a good approximation to the Hessian and an efficient way of calculating the step length. The line search below is used in all problems that solve the Newton system p = -H^{-1} g:

1. Set ψ_0 = ψ(α_k).
2. Set ω = 1.
3. Evaluate ψ_1 = ψ(α_k + ωp).
4. If ψ_1 < ψ_0, return; else reduce ω (e.g. ω = ω/2) and go to 3.

The above algorithm is used every time the Newton system is solved. What remains to be discussed is when to terminate the search for α. In theory, termination would occur when the gradient g of the updated model α_{k+1} equals zero. However, in practice this is difficult to obtain, so we set our termination criterion as

\frac{\|g\|}{\|g_0\|} < \epsilon

where g_0 is the initial gradient and ε is some small number. In addition, we check the size of the step and terminate the loop if it is below a threshold (tol):

\|\alpha_{k+1} - \alpha_k\| < \text{tol}
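The pieces above combine into one damped Newton update, sketched below under my reconstruction of the Hessian expression. The callables psi, grad and cov (the latter returning C = [C^{-1}(α)]^{-1}) stand in for the routines sketched earlier, Bs holds the three constant matrices W_j^T W_j, and the finite-difference step and halving factor are illustrative choices.

```python
import numpy as np

def newton_update(alpha, psi, grad, cov, Bs, K, d=1e-6):
    """One damped Newton update for alpha = (a_s, a_x, a_z). psi(alpha) and
    grad(alpha) are the objective and gradient, cov(alpha) returns the
    covariance C, and Bs = [Ws'Ws, Wx'Wx, Wz'Wz] as dense arrays."""
    alpha = np.asarray(alpha, dtype=float)
    g, C0 = grad(alpha), cov(alpha)
    H = np.zeros((3, 3))
    for k in range(3):
        a = alpha.copy()
        a[k] += d
        dC = (cov(a) - C0) / d                     # finite-difference dC/dalpha_k
        for j in range(3):
            H[j, k] = -0.5 * K * np.trace(dC @ Bs[j])
    p = np.linalg.solve(H, -g)                     # Newton direction p = -H^{-1} g
    psi0, omega = psi(alpha), 1.0                  # backtracking line search on omega
    while psi(alpha + omega * p) >= psi0 and omega > 1e-10:
        omega *= 0.5
    return alpha + omega * p
```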

Constrained Optimization: Positivity

For the covariance matrix C(α) we require α ≥ 0. Thus, there is a need to ensure that the solution recovered via the Newton algorithm above incorporates this constraint. Two methods of accomplishing this are proposed.

Avoidance

When dealing with problems with few parameters, we may be better off employing a policy of avoidance. That is, we can devote a relatively small amount of computing power to checking the step length generated by the Newton system. Negative values of α can be avoided by readjusting the maximum line search value ω. The maximum allowable size of ω is

\omega_{max} = \min\left\{ 1, \; \min_{i :\, p_i < 0} \left( -\frac{\alpha_i}{p_i} \right) \right\}

Replacing the initial ω = 1 with ω_max in the line search above ensures that negative values are avoided.

The Gradient Projection Algorithm

Although avoidance is simple and easy to use, we may require a more robust method (with guaranteed convergence) for problems with more parameters or those where the search path is difficult. The gradient projection algorithm (GPA) deals with simple bounds on parameters and defines an acceptable space of solutions, S [0]:

(P_S x)_i = \begin{cases} x_i, & x_i \in S \\ L, & x_i \notin S \end{cases}

L is our lower bound. In this problem, we may set L to some small number close to zero and let S be all real values greater than or equal to L. Thus, we apply this projection to our description of the updated model:

\alpha_{k+1} = P_S(\alpha_k + \omega p)

The gradient projection algorithm also searches for an appropriate ω. However, this is done slightly differently, by setting ω = β^t where β is some number in (0, 1). Thus, we express α_{k+1}(t) as the proposed model and search for a value of t that exhibits sufficient decrease:

\psi(\alpha_{k+1}(t)) - \psi(\alpha_k) \leq -\frac{\gamma}{\beta^t} \, \|\alpha_k - \alpha_{k+1}(t)\|^2

where γ is typically some small parameter. The algorithm begins by testing t = 0 and increments t until the sufficient decrease condition is satisfied. Once the condition is met, the algorithm tests for termination. If termination fails, then the updated model is used to calculate the new perturbation and we renew our search for t. Termination can be tested several ways. We employ the suggestion of Kelley [0] in our algorithm and define termination as

\|\alpha_k - \alpha_{k+1}(t=0)\| \leq \tau_a + \tau_r \, \|\alpha_0 - \alpha_0(t=0)\|

where τ_a and τ_r are absolute and relative error coefficients that can be tuned; small fixed values were used throughout this work. The advantage of the algorithm is that we do not have to speculate on the numerical decrease in the gradient. We do, however, have to check each parameter for its position relative to the preferred space. This could be costly for problems of higher dimensions.
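A sketch of the projection and the GPA inner line search follows. The lower bound L, β and γ are illustrative tuning values (this work tunes its own constants), and psi is assumed to be the objective routine sketched earlier.

```python
import numpy as np

def project(x, L=1e-8):
    """Project onto the feasible set {alpha_i >= L} for a small lower bound L."""
    return np.maximum(x, L)

def gpa_step(alpha, p, psi, beta=0.5, gamma=1e-4, t_max=30):
    """Try omega = beta**t for t = 0, 1, ... and accept the first projected
    step that satisfies the sufficient-decrease condition."""
    f0 = psi(alpha)
    for t in range(t_max):
        omega = beta ** t
        trial = project(alpha + omega * p)
        if psi(trial) - f0 <= -(gamma / omega) * np.sum((alpha - trial) ** 2):
            return trial
    return alpha    # no acceptable step was found; the outer loop should terminate
```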

Synthetic Examples

Generating Synthetic Models

To test the algorithm, synthetic models must be used. To do this, consider the multivariate Gaussian prior with mean m_ref and covariance C defined earlier. If we redefine δm = m - m_ref, the distribution takes the form

p(\delta m) = \Pi_G \, e^{-\frac{1}{2} \delta m^T C^{-1} \delta m}

Further, if the covariance matrix is positive definite (this requires that α_s > 0), a Cholesky decomposition such that C^{-1} = L^T L can be performed. Thus, by defining \hat{m} = L\,\delta m, the distribution becomes

p(\hat{m}) = \Pi_G \, e^{-\frac{1}{2} \hat{m}^T \hat{m}}

Now the distribution is defined in terms of \hat{m} with \hat{m} \sim N(0, I), and it is easy to create samples of \hat{m}. To create a model of M dimensions with mean m_ref and covariance C, one simply has to create an M-dimensional \hat{m} vector and perform the transformation

m = L^{-1} \hat{m} + m_{ref}

The natural question when considering this transformation is how to specify the reference model. In the synthetic case, it is not really a big issue: one can specify any reference model and the resulting models will tend to average to its value. However, this distribution is, in fact, the regularization used in the geophysical inversion. In cases where we do not have to generate the training models from a known α vector, we may choose our reference model to be the mean of the training set,

m_{ref} = \overline{M^t}

The choice of reference model impacts the objective function and should be considered carefully.

Since the synthetic approach is used to test and validate the algorithm, it is wise to ensure that the synthetic training set actually is produced by the intended covariance matrix. The best way to do this is to produce the approximate covariance matrix of the training set,

c_{ij} = \frac{1}{K} \sum_{k=1}^{K} \left( m_i^{(k)} - \mu_i \right)\left( m_j^{(k)} - \mu_j \right)

where K is the number of training models and m_i^{(k)} refers to the k-th sample of m_i. If the models are actually representative of the true covariance matrix, then c should converge to C. To quantify the fit, the infinity norm \|c - C\|_\infty / \|C\|_\infty was used; that is, the matrix is rearranged into a vector and the largest value is used as a norm. In order to smooth the curves, the average of this norm was taken over different training sets. Like any variance estimate, the fit improves as K increases. Panel (a) of the fit figure below shows the fit as a function of K: the α values are held fixed while the number of samples (training models) is varied over a wide range. As dictated by the Law of Large Numbers, the sample mean will converge to the true mean as K → ∞. The same follows for the covariance matrix. This is illustrated in the figure, as the value of the norm decreases as K increases.
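The sketch below ties the two pieces of this section together: it draws K synthetic models through the Cholesky factor of C^{-1} and then measures the infinity-norm misfit between the sample covariance and the true covariance. Ci is assumed to be a dense inverse covariance matrix (for instance from the cinv sketch earlier), and the names and normalization are mine.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_and_check(Ci, m_ref, K, rng=None):
    """Draw K models from N(m_ref, C) using C^{-1} = L'L, then compare the
    sample covariance c with C via the entry-wise misfit |c - C|_inf / |C|_inf."""
    rng = np.random.default_rng() if rng is None else rng
    L = cholesky(Ci)                               # upper triangular, Ci = L'L
    M = Ci.shape[0]
    X = np.empty((K, M))
    for k in range(K):
        m_hat = rng.standard_normal(M)             # m_hat ~ N(0, I)
        X[k] = m_ref + solve_triangular(L, m_hat)  # m = L^{-1} m_hat + m_ref
    C = np.linalg.inv(Ci)
    mu = X.mean(axis=0)
    c = (X - mu).T @ (X - mu) / K                  # sample covariance of the training set
    return X, np.abs(c - C).max() / np.abs(C).max()
```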

Figure: \|c - C\|_\infty / \|C\|_\infty for (a) increasing K, and for varying (b) α_s, (c) α_x and (d) α_z at a fixed K.

By the larger K values shown, the largest difference value is only on the order of a percent. One can conclude that the algorithm will perform better as K → ∞.

Although algorithm performance is better with higher K, it is wise to investigate the effects of the α parameters on covariance fit. Panels (b)-(d) of the fit figure illustrate this: one α parameter is varied while the other two are held constant at one. There is a smaller error with smaller α_s and larger α_x and α_z. However, the variation is relatively small over six orders of magnitude of α values. It is reasonable to conclude that the α values are not a major source of error when considering covariance fit. The number of samples, K, is the most influential parameter.

The first figure below shows what the covariance matrix looks like if it is constructed with W_s, W_x or W_z alone, as well as with a combination of the three. Formulations without W_s (panels (b) and (c)) result in large parameter variances, while those that include W_s (panels (a) and (d)) have significantly reduced variances. The second figure below demonstrates how these variances change with increasing α_s: the variances are reduced as α_s increases.

Figure: Resulting covariance matrix for (a) C = (W_s^T W_s)^{-1}, (b) C = (W_x^T W_x)^{-1}, (c) C = (W_z^T W_z)^{-1} and (d) C = (W_s^T W_s + W_x^T W_x + W_z^T W_z)^{-1}.

Figure: Variation of the covariance matrix with changing α_s (α_x and α_z held fixed). The variance of the model parameters decreases with increasing α_s.

Given the analysis of this section, we can be assured that our training sets are adequate representations of the true covariance models for the upcoming synthetic examples.
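The shrinking variances can be checked directly by inspecting the diagonal of C(α) as α_s grows. A minimal sketch, assuming the Ws, Wx, Wz operators from the earlier sketch are in scope and using arbitrary α_s values:

```python
import numpy as np

def max_variance(alpha, Ws, Wx, Wz):
    """Largest diagonal entry (largest parameter variance) of
    C(alpha) = [a_s Ws'Ws + a_x Wx'Wx + a_z Wz'Wz]^{-1}."""
    a_s, a_x, a_z = alpha
    Ci = (a_s * Ws.T @ Ws + a_x * Wx.T @ Wx + a_z * Wz.T @ Wz).toarray()
    return np.diag(np.linalg.inv(Ci)).max()

for a_s in (0.01, 0.1, 1.0, 10.0):
    print(a_s, max_variance((a_s, 1.0, 1.0), Ws, Wx, Wz))   # variances shrink as a_s grows
```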

Recovering the α values

Given the conclusions of the previous section, it is wise to test the algorithm with numerical experiments. In these examples, specific values of the α parameters are used; the only thing that is varied is the number of samples used. In order to compare results, a standard vector norm is used.
