AEROSOL MODEL SELECTION AND UNCERTAINTY MODELLING BY RJMCMC TECHNIQUE

Marko Laine (1), Johanna Tamminen (1), Erkki Kyrölä (1), and Heikki Haario (2)
(1) Finnish Meteorological Institute, Helsinki, Finland
(2) Lappeenranta University of Technology, Lappeenranta, Finland

Proc. Envisat Symposium 2007, Montreux, Switzerland, April 2007 (ESA SP-636, July 2007)

ABSTRACT

We apply Bayesian model selection techniques to the GOMOS inverse problem to obtain information on which type of aerosol model fits the data best, and to show how the uncertainty of the aerosol model can be included in the error estimates. We use reversible jump Markov chain Monte Carlo (RJMCMC), a general extension of the Metropolis-Hastings MCMC algorithm that can explore parameter spaces of varying or unknown dimensionality. This makes it suitable for the model determination problem. This work is related to Envisat AO NORM.

Key words: Envisat, GOMOS, aerosol model, Bayesian model selection, MCMC, RJMCMC.

1. INTRODUCTION

Advances in computer resources and algorithms have made it possible to consider increasingly complicated models for data. In the geophysical sciences the estimation of unknowns in large models is usually handled using linearizations and approximations that can affect the uncertainty estimates of the retrievals. Questions such as the effect of the modelling error or the effect of the discretization on the uncertainty of the results require careful analysis and are best studied with methods that do not rely on approximations.

Bayesian inference provides a unified and natural framework for considering uncertainty in the estimated values and in the model itself. In many cases, classical approximative estimation methods can be seen as special cases of Bayesian analyses. In Bayesian inference the uncertainty of the estimated value is the primary target of the investigation and, whenever computationally possible, the result of the analysis is the full multidimensional posterior probability density of the unknowns. Bayesian methods allow the study of many kinds of uncertainties, including uncertainty in the model itself. Prior information can be pooled from different sources and incorporated in a statistically correct way, and the correlation structure of the unknowns can be fully explored. For applying Bayesian inference to modelling we have the MCMC methods, which are computer-intensive methods for simulating possible values of the unknowns from their posterior probability distribution. For an account of applying Bayesian MCMC methods in geophysical applications see, for example, [1].

2. BAYESIAN MODEL SELECTION

Choosing the right model is a complicated matter. The problem clearly cannot be solved by purely statistical considerations. However, for a model to be valid it has to be able to predict the observations. Statistical methods are able to tell which of the possible solutions offers the best fit given the set of models to consider, the data observed, and the prior information that is available. In many cases the ground truth is unknown. We could have several speculative alternatives for the physical behaviour of the system, e.g. depending on some unknown state of nature at the location under consideration. Then it is reasonable to also model the uncertainty in the model, for example by introducing several alternative models and letting the data decide which of them to use. If no single model stands out, then this uncertainty must be taken into account in the results by averaging the predictions over the models according to their posterior weights.
We briefly introduce the main concepts of model determination in the Bayesian framework and discuss the various probability distributions of the unknowns that are involved. Let x stand for the unknown variables of primary interest and η^(k) for the extra unknown model parameters in the k:th model. We want to use the observed data y to estimate the unknowns x and η^(k) and also to infer the unknown model k. The index k is a label for a finite or countable set of pre-selected models. In the GOMOS example presented in section 6 the symbol x will stand for the constituent line densities and η^(k) for the aerosol cross section parameters of the four selected models k = 1,...,4. To apply Bayesian inference we need to assign prior probabilities jointly for all the unknowns, p(x, η^(k), k). By the rules of conditional probability the joint probability can be written as a product of conditional probabilities

p(x, η^(k), k) = p(x | η^(k), k) p(η^(k) | k) p(k),    (1)

which reveals the hierarchical nature of the unknowns. Priors can be given sequentially by first assigning probabilities for the unknown variables, p(x | η^(k), k), and the model parameters, p(η^(k) | k), in each model, and then prior probabilities for the different models, p(k). In addition, we must build the likelihood function, p(y | x, η^(k), k), giving the distribution of the observations y, from the forward model and a statistical description of the noise. The joint posterior distribution of the unknowns x, η^(k) and k given the data y follows from the Bayes formula and is a product of the likelihood and the priors:

p(x, η^(k), k | y) = p(y | x, η^(k), k) p(x | η^(k), k) p(η^(k) | k) p(k) / p(y).    (2)

Again, the hierarchical structure of the model is seen. The Bayes formula can be used to reverse the conditional probabilities: given the observed data y we want to infer the unknown x given the model (or set of models), the unknown model parameters η^(k), and also the unknown model k. The unconditional probability of the data, p(y), in the denominator of the Bayes formula is a normalizing constant that makes the resulting distribution a proper probability distribution. Formally it has to be calculated by integrating over all the unknowns, which is usually a high-dimensional task. This makes the explicit computation of the posterior distribution a challenging problem.

Let us next consider the problem of selecting the best model k from a set of competing models. Different models can be judged according to the evidence they give to the observations using the marginal model probabilities

p(y | k) = ∫ p(y | θ^(k), k) p(θ^(k) | k) dθ^(k),    (3)

where θ^(k) = (x, η^(k)) stands for the vector of all the unknowns except the model index k. The posterior model probabilities can be written with the aid of the Bayes formula as

p(k | y) = p(y | k) p(k) / p(y).    (4)

Model comparisons are typically done using posterior odds:

p(k_1 | y) / p(k_2 | y) = [p(y | k_1) / p(y | k_2)] [p(k_1) / p(k_2)],    (5)

where the first term on the right, p(y | k_1)/p(y | k_2), is called the Bayes factor, which gives the relative evidence for model k_1 with respect to model k_2 provided by the data y.

The calculation of the model probability p(k | y), or even of the evidence p(y | k), poses some challenges, especially if the class of models considered is large and if there is no natural hierarchy between the models that could be used. Several methods for the calculations have been proposed, either using approximations that avoid the problems of high-dimensional integration, or using the results of MCMC runs on single models. The RJMCMC method presented in this article allows a simple way of constructing the Bayes factors and posterior model probabilities from a single MCMC run done simultaneously on all the selected models.
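As a small numerical illustration of equations (3)-(5), the following Python snippet turns marginal likelihoods p(y | k) into posterior model probabilities and a Bayes factor. The evidence values here are hypothetical placeholders; in practice each would require evaluating the high-dimensional integral (3).

    import numpy as np

    # Hypothetical marginal likelihoods p(y|k) for four candidate models
    # (in practice each requires the high-dimensional integral of eq. (3)).
    evidence = np.array([2.0e-5, 1.5e-4, 3.0e-4, 2.5e-5])
    prior = np.full(4, 0.25)                      # equal prior model probabilities p(k)

    posterior = evidence * prior                  # numerator of eq. (4)
    posterior /= posterior.sum()                  # dividing by p(y) normalizes the probabilities

    bayes_factor_12 = evidence[0] / evidence[1]   # Bayes factor of model 1 vs. model 2, eq. (5)
    print(posterior, bayes_factor_12)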
3. MARKOV CHAIN MONTE CARLO (MCMC)

In the most general setting we are interested in the whole posterior distribution of the unknowns. Sometimes we only need some statistics of the distribution, such as the mean and standard deviation. These all lead to high-dimensional integration problems that generally have no closed form solutions. The exception is the maximum a posteriori (MAP) estimate, which can be calculated as the solution of an optimization problem of finding the peak of the posterior distribution. Numerical methods for calculating high-dimensional integrals have to resort to Monte Carlo methods for all but very moderately sized problems, say of dimension less than 4.

However, an estimation problem can be considered fully solved if we have an algorithm that can simulate values from the posterior distribution of the unknowns, as the integrals can then be replaced with sample averages which are easy to calculate. But even standard Monte Carlo techniques are in trouble when the dimension is high, because of the sparseness of high-dimensional spaces. The solution is to use methods based on high-dimensional random walks. The most important of these simulation algorithms is the Metropolis-Hastings algorithm (MH). It belongs to the set of methods called Markov chain Monte Carlo (MCMC), meaning that the resulting simulated values form a stochastic process where a new value can depend on the previous one, thus forming a kind of random walk in the multidimensional space of the unknowns. As the name of the method indicates, the result of the simulation is called a chain.

The MH chain is constructed using a proposal distribution that suggests new values for the unknown x. The value is accepted according to a simple rule that ensures the right distribution of the resulting chain. Adaptive methodologies for choosing the proposal distribution [2, 3, 4] allow for efficient simulations even in high-dimensional situations with a large number of unknowns. The MH algorithm has several useful generalizations and special cases for different purposes.

For the rest of this section we let θ stand for all the unknowns of our model, for example both the unknown state variables and the unknown model parameters. The basic idea behind MH is that instead of needing to compute the values of the posterior p(θ | y) directly, we only need to compute ratios of the posterior at two distinct parameter values, p(θ_2 | y)/p(θ_1 | y). This cancels out the normalising constant p(y) and also the parts of the likelihood

3 function p(y θ) that do not depend on the unknown θ. With a little more detail, the MH algorithm can be described as a method for generating a chain of possible parameter realisations θ,θ 1,... in such a way that it forms a sample from the posterior distribution p(θ y). Starting from an initial guess θ. In each step i of the algorithm the current value is θ i and we propose a new value θ using a proposal distribution q(θ i, ). As the notation suggests this proposal can depend on the current value θ i. Typically the proposal q is multi dimensional Gaussian distribution, centered at the current stage θ i. The new value is accepted using acceptance probability α(θ i,θ ) that is defined as ( α(θ i,θ ) = min 1, p(y θ )p(θ )q(θ,θ i ) p(y θ i )p(θ i )q(θ i,θ ) ), (6) in which case we set θ i+1 = θ. If θ is not accepted, the chain stays at the current value, that is θ i+1 = θ i, and the chain just repeats itself. If the proposal is symmetric, so q(θ 1,θ 2 ) = q(θ 2,θ 1 ), as it is case with the Gaussian density, we see that qs cancel out and a new value θ it is accepted unconditionally if it is better than the previous value, e.g. if p(θ y)/p(θ i y) > 1. If it is not better in the above sense, then θ is accepted as a new value with probability equal to the posterior ratio p(θ y)/p(θ i y). Using the standard Markov chain theory it is easy to show that this algorithm produces a chain of values that will eventually follow the posterior distribution p(θ y). "Eventually" means that we must allow some burn-in time to let the chain reach the limiting distribution. The MH algorithm can be thought as traveling uphill towards the peak of the posterior distribution, but occasionally taking steps downhill. The percentage of time spent at each altitude of the hill corresponds exactly to the probability of that height. After the MCMC run we have at our disposal a chain of values of the parameter vector. The inference about the unknowns is done with statistics calculated from the chain. The mean of the chain is the best point estimate for the unknown, a histogram or kernel density gives visual estimate of the uncertainty in estimates values, etc. If we think of the generated chain as a matrix where the number of rows corresponds to the size of the MCMC sample and the number of columns corresponds to the number of unknowns in the model, then each row is a possible realization of the model and these appear in correct proportions corresponding to the posterior distribution. 4. REVERSIBLE JUMP MCMC Reversible jump MCMC, RJMCMC is a general extension to the standard MH to allow the chain to explore unknowns of varying dimension [5]. This makes it possible to use MCMC methods for Bayesian model determination. Idea behind RJMCMC is that we can form the proposal distribution in such way that it can perform reversible jumps between spaces of different dimensions. This means for example that the random walk of the MH algorithm can simultaneously explore different models for the same data. The RJMCMC method is becoming a standard statistical machinery when doing analyses with complicated models. Its main advantage is its generality, it extend the standard MH to a variety of problems where the number of unknowns is also unknown, or when there are discrete choices in the model. In GOMOS profile retrieval such would be the aerosol cross section model used, or the number of constituents that are to be inverted from one occultation. 
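Before moving to the implementation, the random-walk Metropolis-Hastings update of section 3 can be sketched in a few lines of Python. This is a minimal illustration assuming a symmetric Gaussian proposal and working with the log-posterior for numerical stability; the target log-posterior passed in is a simple placeholder, not the GOMOS model.

    import numpy as np

    def metropolis(log_post, theta0, cov, n_samples, seed=None):
        """Random-walk Metropolis sampler with a Gaussian proposal N(theta_i, cov)."""
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        lp = log_post(theta)
        chain = np.empty((n_samples, theta.size))
        for i in range(n_samples):
            proposal = theta + rng.multivariate_normal(np.zeros(theta.size), cov)
            lp_new = log_post(proposal)
            # symmetric proposal: the q terms cancel; accept with prob min(1, posterior ratio)
            if np.log(rng.uniform()) < lp_new - lp:
                theta, lp = proposal, lp_new
            chain[i] = theta          # on rejection the chain repeats the current value
        return chain

    # usage with a standard bivariate Gaussian as a stand-in posterior
    chain = metropolis(lambda th: -0.5 * np.sum(th**2), np.zeros(2), 0.5 * np.eye(2), 5000)
    print(chain.mean(axis=0), chain.std(axis=0))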
The next section describes a version of the RJMCMC algorithm that makes it possible to write computer code that solves the model determination problem for a large class of problems typical in geophysical modelling.

5. IMPLEMENTING RJMCMC

The RJMCMC algorithm as introduced by Green [5] can be presented in a theoretical framework that extends the standard MH algorithm to a very general state space of the unknowns. We will not present the general theory here, but refer to [5]. Instead, we show how the method can be implemented in the case where we are contemplating several different models for the same data. This is also based on the work of Green [6] and is called automatic RJMCMC.

In automatic RJMCMC [5, 6] an MCMC sampler is constructed that automatically jumps between models. Suppose that for each model k the target posterior distribution can be approximated by a mean vector µ_k and a covariance matrix C_k = R_k^T R_k, where R_k is the Cholesky factor. We can then use the Gaussian distribution with mean µ_k and covariance matrix C_k, N(µ_k, C_k) (or some scaled version of it), as the proposal in the MH algorithm. Automatic RJMCMC uses the following transformation of the parameters between the models. Let θ^(k) again stand for the vector of all the unknowns in model k. In the MCMC setting we can consider it as one row of the chain matrix, representing a possible realization of the unknowns corresponding to model k. A scaled and normalized version of the chain values can be computed as

z_k = (θ^(k) − µ_k) R_k^{-1}.    (7)

If model j has the same dimension as model i, we have a simple transformation from model space i to model space j:

θ^(j) = µ_j + z_i R_j.    (8)

If the dimensions of the two models do not match, we either drop some components or add new dimensions using independent Gaussian random numbers, u ~ N(0, I), and

arrive at the transformations

θ^(j) = µ_j + [z_i]_1^{n_j} R_j       if n_i > n_j,
θ^(j) = µ_j + z_i R_j                 if n_i = n_j,    (9)
θ^(j) = µ_j + [z_i, u] R_j            if n_i < n_j.

Here [z]_1^j means the first j components of the vector z. Next we need a proposal distribution for choosing a new model. Let p(i, j) be the probability of proposing a jump to model j when the chain is currently in model i. So if the current model is i, the next model is chosen with a draw from the distribution p(i, ·). Assume that model j is drawn. Then the current parameter vector is transformed to the new model according to equation (9). The acceptance probability for the RJMCMC sampler can be written as

α(θ^(i), θ^(j)) = min( 1, [p(y | θ^(j), j) p(θ^(j), j) p(j, i) |R_j|] / [p(y | θ^(i), i) p(θ^(i), i) p(i, j) |R_i|] · g ),    (10)

where |R| is the determinant of the matrix R and g = φ(u) if n_i > n_j, g = φ(u)^{-1} if n_i < n_j, and g = 1 if n_i = n_j, φ being the probability density function of independent multidimensional Gaussian values, N(0, I).

This sampler is easy to implement, but its success depends on how well the Gaussian approximations provide decent proposals for the model-to-model moves. It is, however, typical in many geophysical applications to have uncertainty that is approximately Gaussian. This is the reason why classical estimation methods often work as well as they do. Using RJMCMC allows us to use Bayesian model selection methods and to incorporate prior information, such as positivity constraints, in a statistically sound manner. We also recover the nonlinear correlation structures between the uncertainties that are usually not found by anything other than fully Bayesian methods.
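A minimal Python sketch of the between-model move of equations (7)-(10) is given below, assuming the per-model Gaussian approximations (mean vectors mu[k] and Cholesky factors R[k]) have already been obtained. The function names and the way the log-posteriors and jump probabilities are passed in are illustrative choices, not the implementation used for GOMOS.

    import numpy as np

    LOG2PI = np.log(2.0 * np.pi)

    def propose_jump(theta_i, i, j, mu, R, rng):
        """Map the parameters of model i to a proposal for model j (eqs. 7-9).
        Returns the proposed parameter vector and log g of eq. (10)."""
        n_i, n_j = len(mu[i]), len(mu[j])
        z = np.linalg.solve(R[i].T, theta_i - mu[i])   # z_i = (theta_i - mu_i) R_i^{-1}, eq. (7)
        log_g = 0.0
        if n_i > n_j:                                  # dropping components: g = phi(u)
            u = z[n_j:]
            z = z[:n_j]
            log_g = -0.5 * (u @ u) - 0.5 * u.size * LOG2PI
        elif n_i < n_j:                                # padding with u ~ N(0, I): g = 1/phi(u)
            u = rng.standard_normal(n_j - n_i)
            z = np.concatenate([z, u])
            log_g = 0.5 * (u @ u) + 0.5 * u.size * LOG2PI
        return mu[j] + z @ R[j], log_g                 # eq. (9)

    def log_alpha(theta_i, theta_j, i, j, log_post, log_p_jump, R, log_g):
        """Log of the ratio inside eq. (10); accept the jump if log U < min(0, log_alpha).
        log_post[k] returns log p(y|theta,k) + log p(theta,k); |R_k| is the product of
        the diagonal of the Cholesky factor."""
        logdet_i = np.sum(np.log(np.abs(np.diag(R[i]))))
        logdet_j = np.sum(np.log(np.abs(np.diag(R[j]))))
        return (log_post[j](theta_j) - log_post[i](theta_i)
                + log_p_jump[j][i] - log_p_jump[i][j]
                + logdet_j - logdet_i + log_g)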
6. GOMOS AEROSOL MODEL

To demonstrate the use of RJMCMC, we apply the automatic RJMCMC method to the aerosol model in the GOMOS retrieval. The forward model is the standard GOMOS model for the spectral transmission according to Beer's law. The cross section that is used for the aerosol line density is, however, just an approximation of the underlying aerosol extinction process, which depends on many different factors. That is why an approximate analytical model of the wavelength dependency is used, typically a function that behaves like 1/λ, where λ is the wavelength. See [7] for a comparison of different aerosol extinction models for GOMOS inversion using simulated transmission data.

We consider four different aerosol cross section models: the standard 1/λ model (model 1), a second degree polynomial in λ (model 2), a 1/λ² dependence (model 3), and a second degree polynomial in 1/λ (model 4). The aerosol models are parametrized using the aerosol extinction at 500 nm (models 1 and 2) or at 300, 500 and 600 nm (models 3 and 4). A positivity prior is put on these values.

We concentrate on inverting the integrated line densities from the transmission spectra. This is called the spectral inversion step in the GOMOS literature. The vertical inversion that transforms the line densities to the actual constituent densities is a linear operation that is done after the line densities for all the heights have been inverted, and it is not considered here. Let N be the vector of integrated line densities of the constituents to be retrieved (O3, NO2, NO3, air, aerosols) and let the matrix α hold the corresponding cross sections. The cross section of the aerosols depends on the model parameters η^(k). The forward model for the observed transmission T is

T(λ) = exp( −α(η^(k)) N ) + ε(λ),  with ε(λ) ~ N(0, σ² w_λ²).    (11)

As the chosen aerosol model will affect the size of the residuals, the error variance is assumed to be of the form σ²(λ) = σ² w_λ², with known weights w_λ for each wavelength λ and a model-dependent unknown scalar σ², which is also estimated by MCMC. The likelihood function is thus seen to be

p(T | N, η^(k)) ∝ exp( −SS(N, η^(k)) / (2σ²) ),    (12)

where SS(N, η^(k)) is the weighted sum of squares function

SS(N, η^(k)) = Σ_λ [ ( T(λ) − exp(−α(η^(k)) N) ) / w_λ ]².    (13)

As for priors, only a positivity constraint for the line densities is used. As the cross section for air closely resembles that of the aerosols, a more informative prior for the density of air could be used to identify the aerosol model more easily. For the unknown error variance factor σ² a non-informative inverse Gamma prior is used. All four models are taken a priori to be equally likely.
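As a simplified illustration of the spectral-inversion likelihood (11)-(13), the Python sketch below evaluates the log-likelihood for a given set of line densities. The cross-section matrix, weights and data here are hypothetical stand-ins, not the operational GOMOS quantities.

    import numpy as np

    def log_likelihood(T_obs, N, alpha, w, sigma2):
        """Log of the likelihood (12) for one tangent height.
        alpha: (n_wavelengths, n_constituents) cross sections for the chosen aerosol model,
        N: integrated line densities, w: per-wavelength weights, sigma2: error variance factor."""
        T_model = np.exp(-alpha @ N)                   # Beer's law forward model, eq. (11)
        ss = np.sum(((T_obs - T_model) / w) ** 2)      # weighted sum of squares, eq. (13)
        return -0.5 * ss / sigma2                      # eq. (12), up to an additive constant

    # illustrative use with synthetic cross sections and noisy transmission data
    rng = np.random.default_rng(0)
    alpha = np.abs(rng.normal(size=(100, 5))) * 1e-20  # hypothetical cross sections
    N_true = np.abs(rng.normal(size=5)) * 1e19
    w = np.full(100, 0.01)
    T_obs = np.exp(-alpha @ N_true) + rng.normal(0.0, 0.01, size=100)
    print(log_likelihood(T_obs, N_true, alpha, w, sigma2=1.0))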

Results

The RJMCMC method has so far been tried on a limited number of selected occultations and no definitive analyses have been done. For each line of sight, and with a given aerosol model, the problem of inverting the line densities from the transmittance is a nonlinear problem with 5 unknowns. So this is a fairly easy problem, given appropriate initial guesses and if the noise level is not overwhelming. The estimation problem can be solved in the least squares sense as a nonlinear optimization problem using, e.g., the Levenberg-Marquardt method. This is also an easy problem for standard MCMC simulation.

Figure 1. RJMCMC runs are done for each height in one GOMOS occultation. The posterior model probabilities are calculated for the four models at each height. The colours show how the different cross section models get weight depending on the altitude. The colouring is the same as in figures 2 and 3: model 1 red, model 2 green, model 3 blue, model 4 magenta.

Figure 2. MCMC chain from the GOMOS line density inversion with different aerosol cross section models. The plot labelled "Aerosols" is the relative optical extinction at 500 nm. The last plot shows the posterior model probabilities p(k | y).

MCMC can even be extended to a one-step solution, where all the heights are solved simultaneously, with regularizing (smoothness) priors on the vertical structure of the profiles. To use RJMCMC for model selection we use the following strategy. First, for each occultation height and for each aerosol model, MCMC runs are done separately using the delayed rejection adaptive Metropolis algorithm [4] as a robust and efficient way to find the individual posterior distributions. From the MCMC chains of these runs the mean vectors and covariance matrices, together with their Cholesky factors, are calculated to produce the µ_i and R_i, i = 1,...,4, that are needed in the RJMCMC stage. Secondly, an RJMCMC run is done using the automatic RJMCMC algorithm. The resulting chains are visually investigated using 1-D plots like those in figure 2 to judge whether the chains have converged.

For model selection we calculate the relative times the RJMCMC sampler has spent in each model. In figure 1 the results for each altitude of the chosen occultation are shown. At most heights one model clearly stands out as the only one that fits the data, using posterior model probability as the criterion. As an illustration of model averaging we choose one altitude, at 18 km, where all four models have gained some posterior probability. Figures 2, 3 and 4 show results for this selected altitude. Figure 2 shows the MCMC chains of the line densities for one selected GOMOS occultation. The horizontal axis runs over the simulation indexes, the vertical axis being the simulated and accepted values of the line density for each constituent. The colour indicates in which model the algorithm is at each step. The plot in the lower left corner labelled "Aerosols" shows the relative aerosol extinction at 500 nm for all models. The last plot shows the relative times spent in each model. Of the MCMC simulation steps, model 1 is visited 129 times and models 2, 3 and 4 the remaining times, which makes the corresponding marginal posterior model probabilities p(k = i | y), i = 1,...,4, approximately 0.003, 0.321, 0.634 and 0.043.

Figure 3 gives the estimated posterior densities describing the uncertainties in the line densities. The marginal posteriors of the individual models are shown separately. The posteriors are combined to form an uncertainty estimate that also takes into account the uncertainty in the model. The posterior probabilities of the models are simply the relative times the chain has spent in each model. This depends on the given prior weights of the models and on how well each model fits the data compared to the others. In the present example all the models are taken a priori to be equally likely, so p(k) = 1/4 for k = 1,...,4. Model averaging is useful when we are not able to single out the best model.
The model is then a mixture of the different models, each weighted according to its posterior weight. The uncertainty in the model is taken into account in the predictions and in the posterior inference of the constituent values. Lastly, to study the aerosol cross sections estimated by the four models considered here, we calculate the optical extinctions for all the aerosol model parameters in the MCMC chain. Then, for each wavelength, we can examine the distribution of the extinction values realised in the chain and from these values calculate the posterior limits of the possible values. The resulting predictive envelopes are shown in figure 4.
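The post-processing described above reduces to simple bookkeeping over the RJMCMC chain. In the Python sketch below, model_chain and extinction_chain are hypothetical arrays holding, for each iteration, the index of the visited model and the corresponding aerosol extinction spectrum; the fabricated inputs only illustrate the mechanics.

    import numpy as np

    def model_probabilities(model_chain, n_models):
        """Posterior model probabilities as the relative times spent in each model."""
        counts = np.bincount(model_chain, minlength=n_models)
        return counts / counts.sum()

    def predictive_envelope(extinction_chain, quantiles=(0.05, 0.5, 0.95)):
        """Pointwise posterior quantiles of the extinction over the chain,
        one set of limits per wavelength (cf. figure 4)."""
        return np.quantile(extinction_chain, quantiles, axis=0)

    # illustrative use with a fabricated chain over 4 models and 50 wavelengths
    rng = np.random.default_rng(1)
    model_chain = rng.choice(4, size=5000, p=[0.1, 0.3, 0.5, 0.1])
    extinction_chain = rng.lognormal(mean=-3.0, sigma=0.2, size=(5000, 50))
    print(model_probabilities(model_chain, 4))
    print(predictive_envelope(extinction_chain).shape)   # (3, 50)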

Figure 3. Posterior density estimates of the constituent line densities calculated from the MCMC chains. The thicker line is the uncertainty coming from the averaged model that takes into account the model uncertainty.

Figure 4. Estimated aerosol extinctions for the selected altitude of the example given in the text. The solid line is the fitted median cross section. The grey areas correspond to the 5% and 95% posterior limits of the extinctions.

7. CONCLUSIONS

RJMCMC provides a simulation-based method for model determination and uncertainty analysis. If one model clearly stands out, then we can select it as the true model, but if the data do not give any indication of the right model, and no accurate prior for the true model is available, then the uncertainty in the model has to be taken into account in the model predictions. The study presented here is still a proof of concept for the RJMCMC method, and no comprehensive runs have been done yet that would allow definitive conclusions on the right GOMOS aerosol model. The study of aerosols in the GOMOS inversion is further complicated by the fact that, in addition to aerosols, part of the unmodelled variation in the GOMOS spectra is due to scintillation effects caused by turbulence. These effects are actively studied at FMI at the moment, and the current project will hopefully give useful methodological tools for these studies.

Acknowledgments

This work was done under financial support from the Finnish Funding Agency for Technology and Innovation (Tekes) within the project MASI (Modelling and Simulation).

REFERENCES

[1] J. Tamminen and E. Kyrölä. Bayesian solution for nonlinear and non-Gaussian inverse problem by Markov chain Monte Carlo method. Journal of Geophysical Research, 106(D13), 2001.
[2] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223-242, 2001.
[3] H. Haario, M. Laine, M. Lehtinen, E. Saksman, and J. Tamminen. MCMC methods for high dimensional inversion in remote sensing. Journal of the Royal Statistical Society, Series B, 66(3):591-607, 2004.
[4] H. Haario, M. Laine, A. Mira, and E. Saksman. DRAM: Efficient adaptive MCMC. Statistics and Computing, 16:339-354, 2006.
[5] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711-732, 1995.
[6] P. J. Green. Trans-dimensional Markov chain Monte Carlo. In P. J. Green, N. L. Hjort, and S. Richardson, editors, Highly Structured Stochastic Systems, number 27 in Oxford Statistical Science Series. Oxford University Press, 2003.
[7] F. Vanhellemont, D. Fussen, J. Dodion, C. Bingen, and N. Mateshvili. Choosing a suitable analytical model for aerosol extinction spectra in the retrieval of UV/visible satellite occultation measurements. Journal of Geophysical Research, 111(D23203), 2006.
