Markov chain Monte Carlo methods in atmospheric remote sensing


1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing
Johanna Tamminen, johanna.tamminen@fmi.fi
ESA Summer School on Earth System Monitoring and Modeling, July 30 - Aug 11, 2012, Frascati
July 31, 2012

2 / 45 Markov chain Monte Carlo method (with applications to atmospheric remote sensing)

Contents of this lecture:
- Nonlinear inverse problems
- Introduction to Markov chain Monte Carlo
- Implementing MCMC in practice: the adaptive Metropolis algorithm
- Examples of applying MCMC to atmospheric remote sensing

3 / 45 Solving nonlinear problems

The posterior distribution

  π(x) = p(x|y) = p_lh(y|x) p_pr(x) / ∫ p_lh(y|x) p_pr(x) dx

can be evaluated pointwise up to the normalizing constant:

  π(x) ∝ p_lh(y|x) p_pr(x).

The MAP estimate gives an estimate of the most probable value. Note on notation: from now on we denote with π both the target (posterior) distribution and its pdf.

4 / 45 What we typically also need:
- expectation (mean), covariance matrix
- probability regions (e.g., 90%)
- how probable is x ∈ A?

In the general case these require integration with respect to the posterior distribution:

  E(f(x)) = ∫ f(x) π(x) dx

[Figure: examples of (scaled) probability regions in GOMOS retrievals.]

5 / 45 As discussed yesterday, typical ways of solving non-linear problems in practice include:
- linearizing the problem
- assuming a close-to-Gaussian structure at the estimate
- finding the solution iteratively

6 / 45 Exploring the posterior vs. optimizing

[Figure by Erkki Kyrölä (FMI).]

7 / 45 What if my posterior distribution is like this?

[Figure: first two dimensions of the chain with 50% and 95% target contours; axes x_1 and x_2.]

In this case a Gaussian posterior distribution (mean and covariance) is not a good approximation.

8 / 45 Computation of the posterior distribution

Systematic exploration of the posterior distribution:
- Typically very time consuming and practically impossible in large dimensions.
- Studying the posterior distribution normally requires computing p(y) = ∫ p_lh(y|x) p_pr(x) dx, which means that the whole space of possible x values needs to be sampled.

Monte Carlo integration:
- Sample randomly X_1, ..., X_N from the posterior distribution.
- When the samples are independent and identically distributed, the law of large numbers holds:

  E(f(x)) = ∫ f(x) π(x) dx ≈ (1/N) Σ_{t=1}^N f(X_t)

Many variations exist for how to do the sampling. One of the methods is Markov chain Monte Carlo (MCMC) - the topic of this lecture.
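The Monte Carlo approximation above can be sketched in a few lines. This is a hypothetical stand-alone example (a Gaussian stands in for the posterior and f(x) = x² for the quantity of interest), not the GOMOS setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: take the "posterior" to be N(2, 0.5^2) and estimate
# E[f(x)] for f(x) = x^2 by the sample mean of i.i.d. draws.
samples = rng.normal(loc=2.0, scale=0.5, size=100_000)
estimate = np.mean(samples**2)

# By the law of large numbers the sample mean converges to the exact value
# E[x^2] = mu^2 + sigma^2 = 4.25.
print(estimate)
```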

9 / 45 Law of large numbers (LLN)

The LLN says that the sample average converges towards the expectation:

  (1/N) Σ_{t=1}^N X_t → E(X) when N → ∞

When the LLN holds, we can approximate the expectation with the sample mean:

  E(f(x)) = ∫ f(x) π(x) dx ≈ (1/N) Σ_{t=1}^N f(X_t)

10 / 45 Markov chain Monte Carlo technique

MCMC is based on sampling points X_t which are cleverly selected. In MCMC the sampled points form a Markov chain. Let us have a chain X = X_1, X_2, ...; the chain is a Markov chain if the probability to jump to X_{t+1} depends only on the previous point X_t:

  P(X_{t+1} | X_1, X_2, ..., X_t) = P(X_{t+1} | X_t)

The sampled points are thus not independent as in direct Monte Carlo sampling. However, the LLN can be shown to hold if the points are selected in a proper way.

11 / 45 Ergodicity of Markov chains

Assuming that the Markov chain:
- has the desired distribution π(x) as the stationary distribution (reversibility); the detailed balance equation holds:

  π(x_i) P(x_j | x_i) = π(x_j) P(x_i | x_j)

- samples properly all parts of the state space:
  - is aperiodic - does not repeat itself
  - is irreducible - all places can be reached

then the chain is ergodic and the LLN holds:

  (1/N) Σ_{t=1}^N f(X_t) → E_π(f(X)) when N → ∞

Standard MCMC algorithms are created so that the LLN holds for reasonable target distributions.

12 / 45 MCMC in practice

Several variants of MCMC exist, but the most common are:
- Metropolis algorithm (Metropolis et al. 1953, J. Chem. Phys.)
- Metropolis-Hastings algorithm (Hastings 1970, Biometrika)
- Gibbs sampling (Gelfand and Smith 1990, J. Amer. Stat. Assoc.)

13 / 45 Metropolis-Hastings algorithm

At each iteration step t:

Step 1) (Proposal step) Sample Z ~ q(· | X_t); here Z is a candidate point and q(· | X_t) is a selected proposal distribution.

Step 2) (Acceptance step) Accept the candidate point with the acceptance probability:

  α = min( 1, [π(Z) q(X_t | Z)] / [π(X_t) q(Z | X_t)] )
    = min( 1, [p_lh(y|Z) p_pr(Z) q(X_t | Z)] / [p_lh(y|X_t) p_pr(X_t) q(Z | X_t)] )

Put X_{t+1} = Z if accepted, X_{t+1} = X_t if rejected.

14 / 45 Metropolis algorithm

When the proposal distribution q is symmetric (e.g., Gaussian N(X_t, C_q) centered at the present point X_t):

Step 2) (Acceptance step) The acceptance probability is now simply:

  α = min( 1, π(Z) / π(X_t) ) = min( 1, [p_lh(y|Z) p_pr(Z)] / [p_lh(y|X_t) p_pr(X_t)] )

Put X_{t+1} = Z if accepted, X_{t+1} = X_t if rejected.
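The two steps above translate almost line-for-line into code. The sketch below is a minimal random-walk Metropolis sampler for a hypothetical one-dimensional target (a standard normal stands in for the unnormalized posterior π, not the GOMOS posterior); working with log-densities avoids numerical under/overflow:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Unnormalized log-density of the target, i.e. log pi(x) up to a constant.
    # A standard normal is used here as a stand-in.
    return -0.5 * x**2

def metropolis(log_pi, x0, n_steps, proposal_std):
    """Random-walk Metropolis with a symmetric Gaussian proposal N(X_t, proposal_std^2)."""
    chain = np.empty(n_steps)
    x, logp = x0, log_pi(x0)
    for t in range(n_steps):
        z = x + proposal_std * rng.normal()        # Step 1: propose a candidate
        logp_z = log_pi(z)
        # Step 2: accept with probability min(1, pi(z)/pi(x)); on the log scale
        # this is a comparison of the log-density difference against log(uniform).
        if np.log(rng.uniform()) < logp_z - logp:
            x, logp = z, logp_z
        chain[t] = x                               # a rejected proposal repeats the point
    return chain

chain = metropolis(log_target, x0=0.0, n_steps=50_000, proposal_std=2.4)
print(chain.mean(), chain.var())  # both approach the target's 0 and 1
```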

15 / 45 Note:
- The acceptance probability is selected so that detailed balance, π(x) P(y|x) = π(y) P(x|y), holds. This ensures that the chain converges towards the correct target distribution π.
- When the proposal distribution is correctly chosen, the LLN assures that the expectation can be approximated by the empirical mean (under mild conditions on π).
- Theoretically, any proposal q having the same support as π should work.
- Usually the proposal distribution q is selected so that it is easy to sample candidate points from it (e.g., Gaussian, or fixed spheres or regions around the present point).
- In practice, some proposals q are better than others.

16 / 45 Note (cont.):
- Computing the acceptance probability of a posterior distribution requires evaluating

  π(Z) / π(X_t) = [p_lh(y|Z) p_pr(Z)] / [p_lh(y|X_t) p_pr(X_t)]

  thus the scaling factor p(y) = ∫ p_lh(y|x) p_pr(x) dx does not need to be evaluated!
- The candidate point is always accepted if it is more probable. There is also a non-zero probability to accept less probable values.

17 / 45 [Figure: proposed points z_0, ..., z_10 illustrating accepted and rejected moves.]

The sampled chain is z_0, z_1, z_1, z_1, z_4, z_5, z_6, z_6, z_8, z_9, z_10, ... (a rejected proposal repeats the current point).

18 / 45 MCMC demo Posterior distribution by Monte Carlo sampling

19 / 45 MCMC demo Posterior distribution by Monte Carlo sampling: kernel estimate

20 / 45 [Figure: scatter of sampled points, aerosol line density (1/cm²) vs ozone line density (1/cm²), with 68.3%, 90%, 95%, and 99% contours.]

Example of the marginal posterior distribution of ozone and aerosol horizontal column densities in a GOMOS retrieval. The dots indicate sampled points and their distribution indicates the posterior distribution of the solution. The contour curves, computed from the sampled points using Gaussian kernel estimates, correspond to the 68.3, 90, 95, and 99% probability regions.

21 / 45 MCMC in practice

Even though the basic algorithm is very simple and, in theory, the LLN holds for e.g. a simple Gaussian proposal distribution, a few things need to be considered:
- Select a proposal distribution. This is an important topic and we'll discuss it more in the following slides.
- Select a starting point. In theory, any point will work. In practice it is worth selecting a reasonable point, e.g., the solution of an iterative algorithm.
- How long a chain is needed? This is difficult to say and depends on the target distribution and its dimension. It is important to be careful and get hands-on experience. Several convergence diagnostics also exist and can sometimes be useful.

22 / 45 MCMC in practice (cont.): burn-in

Typically the beginning of the chain is discarded as a burn-in period.

23 / 45 Proposal distribution

Efficient sampling requires a good proposal distribution:
- Nice sampling: good proposal
- Slowly converging chain: too small a proposal
- Chain does not move properly: too large a proposal

24 / 45 Good proposal

A good proposal gives a good balance between accepted and rejected points. This is typically monitored during sampling. In the Metropolis algorithm with a Gaussian target and a Gaussian proposal, the optimal acceptance ratio is

  a_r = (number of accepted points) / (number of proposed points) ≈ 0.234

when the dimension of the target is roughly larger than 6. The optimal acceptance ratio is slightly larger in smaller dimensions (e.g., about 0.35 in the 2D case). (See Gelman et al., Efficient Metropolis Jumping Rules, Bayesian Statistics 5, 1996.)

If the target is close to Gaussian, an acceptance ratio of 0.2-0.3 should be OK. If the distribution is strongly non-Gaussian, a smaller acceptance ratio is obtained even with a good proposal.
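The effect of the proposal width on the acceptance ratio is easy to demonstrate. The sketch below (a hypothetical 1-D standard-normal target, not a tuned GOMOS proposal) counts accepted moves for a too-small, a near-optimal, and a too-large Gaussian proposal:

```python
import numpy as np

rng = np.random.default_rng(2)

def acceptance_ratio(log_pi, x0, n_steps, proposal_std):
    """Run random-walk Metropolis and return the fraction of accepted proposals."""
    x, logp, accepted = x0, log_pi(x0), 0
    for _ in range(n_steps):
        z = x + proposal_std * rng.normal()
        logp_z = log_pi(z)
        if np.log(rng.uniform()) < logp_z - logp:
            x, logp = z, logp_z
            accepted += 1
    return accepted / n_steps

log_pi = lambda x: -0.5 * x**2   # standard-normal target (hypothetical example)
ratios = {s: acceptance_ratio(log_pi, 0.0, 20_000, s) for s in (0.1, 2.4, 25.0)}
# Too small a proposal accepts nearly everything but moves slowly; too large a
# proposal is almost always rejected; the middle one lands in the useful range.
print(ratios)
```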

25 / 45 For sampling a Gaussian distribution N(µ, C), the optimal Gaussian proposal in the Metropolis algorithm is:

  q(·) ~ N(·, c_d² C), where c_d = 2.4 / √d

(See Gelman et al., Efficient Metropolis Jumping Rules, Bayesian Statistics 5, 1996.)

26 / 45 Applying the Metropolis algorithm to GOMOS spectral inversion

Metropolis algorithm: an easy technique for sampling points X_1, ..., X_n from the target distribution:

  E_π(f(x)) ≈ (1/n) Σ_{i=1}^n f(X_i)

In practice, in the GOMOS case: the amount of data is enormous, and the posterior distributions vary a lot depending on atmospheric conditions and the noise level of the data, so no fixed proposal distribution works effectively for MCMC. Manual tuning of the proposal distribution is impossible. Adaptive and automatic MCMC is necessary!

27 / 45 Data/noise

[Figure: transmission (corrected for refraction) vs wavelength (250-650 nm), using a bright star and using a dim star.]

Strong variability in the signal-to-noise ratio depending on the type of star.

28 / 45 Variability in posterior distributions for different cases

[Figure: marginal posterior distributions of Air, Ozone, NO2, NO3, and Aerosols.]

Estimated 1σ (68.3%) probability limits for each gas. The top row corresponds to a typical star (magnitude 2) and the bottom row to a dim star (magnitude 4). The absolute values have been scaled so that the true value corresponds to 1, and the y axis represents the identifiability.

29 / 45 Several decades of variability depending on the altitude

[Figure: local ozone density profiles (sonde and GOMOS), local density 10^7-10^12 1/cm³ vs altitude 10-100 km.]

30 / 45 Adaptive MCMC

Idea: optimize the proposal distribution during sampling; learn from the past chain. One needs to be careful since ergodicity can be lost. One of the commonly used approaches is the Adaptive Metropolis algorithm (see Haario et al., An adaptive Metropolis algorithm, Bernoulli, 2001). Several variants of this algorithm also exist, but we concentrate here mainly on the original version.

31 / 45 Adaptive Metropolis (AM) algorithm - the idea

AM is simply the basic Metropolis algorithm with a Gaussian proposal distribution which is adapted during sampling. In AM the covariance matrix of the proposal distribution is updated based on the earlier sampled points. We define

  C_t = C_0                                         for t ≤ t_0
  C_t = c_d² Cov(X_0, ..., X_{t-1}) + c_d² ε I_d    for t > t_0

Here c_d = 2.4/√d as earlier. The additional term ε > 0 ensures that the distribution does not become singular.

32 / 45 The empirical covariance matrix determined by points x_0, ..., x_t ∈ R^d:

  Cov(x_0, ..., x_t) = (1/t) ( Σ_{i=0}^t x_i x_i^T - (t+1) x̄_t x̄_t^T ),

where the mean is

  x̄_t = (1/(t+1)) Σ_{i=0}^t x_i.

33 / 45 Implementing the AM algorithm: recursion formulas for the updates

Recursive formula for the mean x̄_t:

  x̄_{t+1} = ((t+1)/(t+2)) x̄_t + (1/(t+2)) X_{t+1}

Recursive formula for the covariance C_t:

  C_{t+1} = ((t-1)/t) C_t + (c_d²/t) ( t x̄_{t-1} x̄_{t-1}^T - (t+1) x̄_t x̄_t^T + X_t X_t^T )
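The batch covariance formula from the previous slide and the recursive mean update can be cross-checked numerically. A small sketch on random data, with NumPy's np.cov as the reference:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = rng.normal(size=(500, 3))          # points x_0, ..., x_t in R^3
t = len(xs) - 1

# Batch formula: Cov(x_0..x_t) = (1/t) (sum_i x_i x_i^T - (t+1) xbar_t xbar_t^T)
xbar = xs.mean(axis=0)
cov_batch = (xs.T @ xs - (t + 1) * np.outer(xbar, xbar)) / t

# Recursive mean: xbar_{t+1} = (t+1)/(t+2) xbar_t + 1/(t+2) x_{t+1},
# starting from xbar_0 = x_0.
mean_rec = xs[0].copy()
for k, x in enumerate(xs[1:]):          # after this step, mean_rec = mean of x_0..x_{k+1}
    mean_rec = (k + 1) / (k + 2) * mean_rec + x / (k + 2)

print(np.allclose(mean_rec, xbar), np.allclose(cov_batch, np.cov(xs.T)))
```

The covariance recursion on the slide can be verified the same way; in an actual sampler the recursions avoid storing and rescanning the whole chain at every step.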

34 / 45 Adaptive Metropolis algorithm

At each iteration step t > t_0:

Step 1) (Proposal step) Sample a candidate from the proposal distribution N(X_{t-1}, C_t).
Step 2) (Acceptance step) Accept using the Metropolis acceptance probability.
Step 3) Update the proposal distribution using the recursion formulas.
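Putting the pieces together, the steps above can be sketched as follows. This is a simplified illustration on a hypothetical correlated 2-D Gaussian target: for clarity the proposal covariance is recomputed from the stored chain every 100 steps with np.cov rather than with the recursion formulas, and the constants (t_0, ε) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_metropolis(log_pi, x0, n_steps, t0=1000, eps=1e-8):
    """Sketch of Adaptive Metropolis: Gaussian random-walk proposal whose
    covariance is c_d^2 (Cov(chain) + eps I) once t > t0."""
    d = len(x0)
    c2 = (2.4 / np.sqrt(d)) ** 2
    chain = np.empty((n_steps, d))
    x = np.asarray(x0, dtype=float)
    logp = log_pi(x)
    C = np.eye(d)                                   # initial proposal covariance C_0
    for t in range(n_steps):
        if t > t0 and t % 100 == 0:                 # Step 3: update the proposal
            C = c2 * (np.cov(chain[:t].T) + eps * np.eye(d))
        z = rng.multivariate_normal(x, C)           # Step 1: propose
        logp_z = log_pi(z)
        if np.log(rng.uniform()) < logp_z - logp:   # Step 2: Metropolis acceptance
            x, logp = z, logp_z
        chain[t] = x
    return chain

# Hypothetical target: zero-mean Gaussian with covariance [[1, .9], [.9, 1]].
Cinv = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
log_pi = lambda x: -0.5 * x @ Cinv @ x
chain = adaptive_metropolis(log_pi, np.zeros(2), 30_000)
print(np.cov(chain[5000:].T))   # approaches the target covariance
```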

35 / 45 Adaptive Metropolis algorithm (cont.)

The proposal distribution adapts a suitable size and/or shape by using the information in the previous states of the chain: a Gaussian proposal distribution N(X_t, C_t), where the covariance C_t depends on time. The chain is not a Markov chain, but the right ergodic properties can be proved (ref. Haario et al., 2001). The idea of the proof is that over time the proposal distribution converges and the adaptation diminishes.

[Figure: sampled chain with adapting proposal contours.]

36 / 45 AM: Adaptive Metropolis algorithm

[Figure: AM convergence towards the right target distribution. Green distribution: target. Solid line: AM.]

37 / 45 AP: Adaptive proposal algorithm (Haario et al., 1999)

[Figure: sampled chain with proposal contours.]

AP: the proposal distribution is adapted during sampling based on only the most recently sampled points.

38 / 45 AP: Adaptive proposal algorithm (cont.)

[Figure: AP convergence comparison.]

AP does not have the right convergence properties if the adaptation is continued throughout the sampling. If the adaptation is stopped after the burn-in period, then the AP algorithm is OK. In practice, AP works well for reasonably Gaussian targets also when the adaptation is continued.

39 / 45 GOMOS: sampling the posterior distribution using MCMC

Motivation:
- Visualization of the posterior distribution.
- Freedom in defining the prior: could we learn more by using more complicated priors than Gaussian? Positivity, for example?
- Freedom in defining the noise: could we learn more by using measurement noise structures other than Gaussian?
- Could the expectation be a more robust estimate than the MAP estimate?
- Can we study the identifiability in non-linear problems?
- Validation of the operative algorithm: is the posterior distribution really Gaussian - how well does the covariance matrix approximate the posterior distribution? Does the posterior distribution include local maxima - does the iterative Levenberg-Marquardt algorithm get stuck at these points?

40 / 45 Implementation - visualization of posterior distributions

An automatic and adaptive algorithm is essential for implementing MCMC in the real GOMOS application. In practice AM (and AP, an earlier version of it) was used for studying the non-linear GOMOS spectral inversion.

[Figure: simulated case. Two-dimensional kernel estimates of the marginal posterior distributions of the gases at 30 km. The x axis shows the ozone values and the y axis, from left: air, NO2, NO3, and aerosol. The contours refer to the 68.3, 90, 95, and 99% probability regions. The true values used in the forward model are denoted with dots.]

41 / 45 Example of implementing a noise structure other than Gaussian

By linearizing the GOMOS spectral inversion, the noise becomes non-Gaussian. The likelihood function after the linearization is:

  p(ỹ|x) = 1 / ( (2π)^{m/2} |C_ỹ|^{1/2} ) exp( -(1/2) Σ_{i=1}^m ( f(x; λ_i) - y(λ_i) )² / σ_i² )

This can easily be implemented in MCMC.

42 / 45 Simulated example: correct error characterization (non-Gaussian) improves the results.

43 / 45 Noise statistics: robust inversion

Uncertainties are larger in the lower part of the atmosphere. Robust l1-norm likelihood:

  p(y|x) ∝ exp( -√2 Σ_{i=1}^m |f_i(x) - y_i| / σ_i )

Example: robust inversion improves the results at low altitudes.
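The difference between the Gaussian and the robust l1-type likelihood is easy to see on a toy residual vector. This is a generic sketch (the √2/σ_i scaling follows the formula above; the residual values are made up), not GOMOS code:

```python
import numpy as np

def log_lh_gauss(res, sigma):
    # Gaussian log-likelihood (up to a constant): -(1/2) sum (r_i/sigma_i)^2
    return -0.5 * np.sum((res / sigma) ** 2)

def log_lh_l1(res, sigma):
    # Robust l1-type log-likelihood (up to a constant): -sqrt(2) sum |r_i|/sigma_i
    return -np.sqrt(2) * np.sum(np.abs(res) / sigma)

# A single outlying residual penalizes the Gaussian likelihood quadratically
# but the l1 likelihood only linearly, so an MCMC sampler using the l1 form
# is pulled around much less by outliers.
res = np.array([0.1, -0.2, 5.0])    # the last residual is an outlier
print(log_lh_gauss(res, 1.0), log_lh_l1(res, 1.0))
```

Either function can be dropped into the Metropolis acceptance step in place of log π, since only log-density differences are needed there.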

44 / 45 Examples - identifiability in the GOMOS retrieval

[Figure: marginal posterior densities of Air, Ozone, NO2, NO3, and Aerosol.]

Marginal posterior densities for the various retrieved constituents with a varying number of spectral data points: all 1417 data points (blue line), every second (red line), and every fourth data point (green line).

45 / 45 MCMC toolbox for Matlab

Available at: http://helios.fmi.fi/~lainema/mcmc
Developed by Marko Laine (FMI).