Inverse Problems in the Bayesian Framework


Daniela Calvetti
Case Western Reserve University, Cleveland, Ohio
Raleigh, NC, July 2016

Bayes Formula

Stochastic model: two random variables $X \in \mathbb{R}^n$, $B \in \mathbb{R}^m$, where $B$ is the observed quantity and $X$ is the quantity of primary interest.

Goal: find the posterior probability density,
$$\pi_X(x \mid b) = \text{density of } X, \text{ given the observation } B = b, \quad b = b_{\rm observed}.$$

Bayes Formula

Prior information: the prior density $\pi_X(x)$ encodes our prior information, or prior belief, about the possible values of $X$.

Likelihood: assuming that $X = x$, what does the forward model predict for the value distribution of $B$? Encode this information in $\pi_B(b \mid x)$.

Evidence: with prior and likelihood, compute
$$\pi_B(b) = \int_{\mathbb{R}^n} \pi_{XB}(x, b)\, dx = \int_{\mathbb{R}^n} \pi_B(b \mid x)\, \pi_X(x)\, dx.$$

Bayes formula for probability densities:
$$\pi_X(x \mid b) = \frac{\pi_B(b \mid x)\, \pi_X(x)}{\pi_B(b)}.$$

Linear Inverse Problems

Consider the problem of estimating $x$ from
$$b = Ax + e, \quad A \in \mathbb{R}^{m \times n}.$$
Stochastic extension: write $B = AX + E$, and assume the noise and prior models
$$X \sim \mathcal{N}(0, D), \quad E \sim \mathcal{N}(0, C).$$
Usually it is assumed that $X$ and $E$ are independent. In particular,
$$\mathbb{E}\{X E^T\} = \mathbb{E}\{X\}\, \mathbb{E}\{E\}^T = 0.$$

Linear Inverse Problems

However, independence is not necessary, and we may have
$$\mathbb{E}\{X E^T\} = R \in \mathbb{R}^{n \times m}.$$
Define a new random variable
$$Z = \begin{bmatrix} X \\ B \end{bmatrix} \in \mathbb{R}^{n+m}.$$
Covariance matrix of $Z$:
$$Z Z^T = \begin{bmatrix} X \\ B \end{bmatrix} \begin{bmatrix} X^T & B^T \end{bmatrix} = \begin{bmatrix} X X^T & X B^T \\ B X^T & B B^T \end{bmatrix}.$$
Compute the expectation of this matrix.

Linear Inverse Problems

Expectations:
$$\mathbb{E}\{X X^T\} = D,$$
$$\mathbb{E}\{X B^T\} = \mathbb{E}\{X (AX + E)^T\} = \mathbb{E}\{X X^T A^T + X E^T\} = \mathbb{E}\{X X^T\} A^T + \mathbb{E}\{X E^T\} = D A^T + R.$$
Furthermore,
$$\mathbb{E}\{B X^T\} = \mathbb{E}\{X B^T\}^T = A D + R^T,$$

Linear Inverse Problems

and, finally,
$$\mathbb{E}\{B B^T\} = \mathbb{E}\{(AX + E)(AX + E)^T\} = \mathbb{E}\{A X X^T A^T + E X^T A^T + A X E^T + E E^T\} = A D A^T + R^T A^T + A R + C.$$
Conclusion:
$$\mathrm{Cov}(Z) = \begin{bmatrix} D & D A^T + R \\ A D + R^T & A D A^T + R^T A^T + A R + C \end{bmatrix} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}.$$

Schur Complements and Conditioning

Given a Gaussian random variable
$$Z = \begin{bmatrix} X \\ B \end{bmatrix} \in \mathbb{R}^{n+m}$$
with covariance $\Gamma \in \mathbb{R}^{(n+m) \times (n+m)}$, what is the probability density of $X$, given $B = b$?

Schur Complements and Conditioning

Assume that $X \sim \mathcal{N}(0, \Gamma)$, where $\Gamma \in \mathbb{R}^{n \times n}$ is a given SPD matrix. Partitioning of $X$:
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \in \mathbb{R}^k \times \mathbb{R}^{n-k}.$$
Question: assume that $X_2 = x_2$ is observed. What is the conditional probability density of $X_1$,
$$\pi_{X_1}(x_1 \mid x_2) = ?$$

Schur Complements and Conditioning

Write $\pi_X(x) = \pi_{X_1, X_2}(x_1, x_2)$. Bayes formula: the distribution of the unknown part $x_1$, provided that $x_2$ is known, is
$$\pi_{X_1}(x_1 \mid x_2) \propto \pi_{X_1, X_2}(x_1, x_2), \quad x_2 = x_{2,\rm observed}.$$
In terms of the Gaussian density,
$$\pi_{X_1, X_2}(x_1, x_2) \propto \exp\left( -\frac{1}{2} x^T \Gamma^{-1} x \right). \tag{1}$$

Schur Complements and Conditioning

Partitioning of the covariance matrix:
$$\Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} \in \mathbb{R}^{n \times n}, \tag{2}$$
where
$$\Gamma_{11} \in \mathbb{R}^{k \times k}, \quad \Gamma_{22} \in \mathbb{R}^{(n-k) \times (n-k)}, \quad k < n,$$
and
$$\Gamma_{12} = \Gamma_{21}^T \in \mathbb{R}^{k \times (n-k)}.$$

Schur Complements and Conditioning

Precision matrix: $B = \Gamma^{-1}$. Partition $B$:
$$B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \in \mathbb{R}^{n \times n}. \tag{3}$$
For the quadratic form $x^T B x$ appearing in the exponential,
$$B x = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} B_{11} x_1 + B_{12} x_2 \\ B_{21} x_1 + B_{22} x_2 \end{bmatrix},$$

Schur Complements and Conditioning

$$x^T B x = \begin{bmatrix} x_1^T & x_2^T \end{bmatrix} \begin{bmatrix} B_{11} x_1 + B_{12} x_2 \\ B_{21} x_1 + B_{22} x_2 \end{bmatrix} = x_1^T \left( B_{11} x_1 + B_{12} x_2 \right) + x_2^T \left( B_{21} x_1 + B_{22} x_2 \right)$$
$$= x_1^T B_{11} x_1 + 2 x_1^T B_{12} x_2 + x_2^T B_{22} x_2$$
$$= \left( x_1 + B_{11}^{-1} B_{12} x_2 \right)^T B_{11} \left( x_1 + B_{11}^{-1} B_{12} x_2 \right) + \underbrace{x_2^T \left( B_{22} - B_{21} B_{11}^{-1} B_{12} \right) x_2}_{\text{independent of } x_1}.$$

Schur Complements and Conditioning

From this key equation for conditional densities it follows that
$$\pi_{X_1}(x_1 \mid x_2) \propto \exp\left( -\frac{1}{2} \left( x_1 + B_{11}^{-1} B_{12} x_2 \right)^T B_{11} \left( x_1 + B_{11}^{-1} B_{12} x_2 \right) \right).$$
Thus the conditional density is Gaussian, with mean and covariance matrix
$$\bar{x}_1 = -B_{11}^{-1} B_{12} x_2, \quad C = B_{11}^{-1}.$$
Question: how can these formulas be expressed in terms of $\Gamma$?

Schur Complements and Conditioning

Consider a partitioned SPD matrix $\Gamma \in \mathbb{R}^{n \times n}$. For any $v \in \mathbb{R}^k$, $v \neq 0$,
$$v^T \Gamma_{11} v = \begin{bmatrix} v^T & 0 \end{bmatrix} \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} \begin{bmatrix} v \\ 0 \end{bmatrix} > 0,$$
showing the positive definiteness of $\Gamma_{11}$. The same holds for $\Gamma_{22}$. In particular, $\Gamma_{11}$ and $\Gamma_{22}$ are invertible.

Schur Complements and Conditioning

To calculate the inverse of $\Gamma$, we solve the equation $\Gamma x = y$ in block form. By partitioning
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^k \times \mathbb{R}^{n-k}, \quad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \in \mathbb{R}^k \times \mathbb{R}^{n-k},$$
we have
$$\Gamma_{11} x_1 + \Gamma_{12} x_2 = y_1,$$
$$\Gamma_{21} x_1 + \Gamma_{22} x_2 = y_2.$$

Schur Complements and Conditioning

Eliminate $x_2$ from the second equation,
$$x_2 = \Gamma_{22}^{-1} \left( y_2 - \Gamma_{21} x_1 \right),$$
substitute back into the first equation:
$$\Gamma_{11} x_1 + \Gamma_{12} \Gamma_{22}^{-1} \left( y_2 - \Gamma_{21} x_1 \right) = y_1,$$
and, rearranging the terms,
$$\left( \Gamma_{11} - \Gamma_{12} \Gamma_{22}^{-1} \Gamma_{21} \right) x_1 = y_1 - \Gamma_{12} \Gamma_{22}^{-1} y_2.$$
Define the Schur complement of $\Gamma_{22}$:
$$\tilde{\Gamma}_{22} = \Gamma_{11} - \Gamma_{12} \Gamma_{22}^{-1} \Gamma_{21}.$$

Schur Complements and Conditioning

It can be shown that $\tilde{\Gamma}_{22}$ must be invertible, and therefore
$$x_1 = \tilde{\Gamma}_{22}^{-1} y_1 - \tilde{\Gamma}_{22}^{-1} \Gamma_{12} \Gamma_{22}^{-1} y_2.$$
Similarly, interchanging the roles of $x_1$ and $x_2$,
$$x_2 = \tilde{\Gamma}_{11}^{-1} y_2 - \tilde{\Gamma}_{11}^{-1} \Gamma_{21} \Gamma_{11}^{-1} y_1,$$
where
$$\tilde{\Gamma}_{11} = \Gamma_{22} - \Gamma_{21} \Gamma_{11}^{-1} \Gamma_{12}$$
is the Schur complement of $\Gamma_{11}$. In matrix form:
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \tilde{\Gamma}_{22}^{-1} & -\tilde{\Gamma}_{22}^{-1} \Gamma_{12} \Gamma_{22}^{-1} \\ -\tilde{\Gamma}_{11}^{-1} \Gamma_{21} \Gamma_{11}^{-1} & \tilde{\Gamma}_{11}^{-1} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}.$$

Schur Complements and Conditioning

Conclusion:
$$\Gamma^{-1} = \begin{bmatrix} \tilde{\Gamma}_{22}^{-1} & -\tilde{\Gamma}_{22}^{-1} \Gamma_{12} \Gamma_{22}^{-1} \\ -\tilde{\Gamma}_{11}^{-1} \Gamma_{21} \Gamma_{11}^{-1} & \tilde{\Gamma}_{11}^{-1} \end{bmatrix} = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}.$$
Thus the conditional density $\pi_{X_1}(x_1 \mid x_2)$ is a Gaussian,
$$\pi_{X_1}(x_1 \mid x_2) = \mathcal{N}(\bar{x}_1, C),$$
where
$$\bar{x}_1 = -B_{11}^{-1} B_{12} x_2 = \Gamma_{12} \Gamma_{22}^{-1} x_2, \quad C = B_{11}^{-1} = \tilde{\Gamma}_{22}.$$
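These identities are easy to sanity-check numerically. A minimal Matlab sketch (the matrix size and partition below are arbitrary illustrative choices, not from the slides):

% Numerical check of the Gaussian conditioning formulas
n = 4; k = 2;
M = randn(n); Gamma = M*M' + n*eye(n);       % a random SPD covariance
G11 = Gamma(1:k,1:k);    G12 = Gamma(1:k,k+1:n);
G21 = Gamma(k+1:n,1:k);  G22 = Gamma(k+1:n,k+1:n);
B   = inv(Gamma);                             % precision matrix
B11 = B(1:k,1:k);        B12 = B(1:k,k+1:n);
x2  = randn(n-k,1);                           % the observed part
xbar1 = G12*(G22\x2);                         % conditional mean
C     = G11 - G12*(G22\G21);                  % conditional covariance (Schur complement)
norm(xbar1 + B11\(B12*x2))                    % ~ 0: agrees with -B11^{-1} B12 x2
norm(C - inv(B11))                            % ~ 0: agrees with B11^{-1}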

Linear Inverse Problems

For simplicity, let us assume that $R = 0$. The posterior density $\pi_X(x \mid b)$ is a Gaussian density, with mean and covariance
$$\bar{x} = \Gamma_{12} \Gamma_{22}^{-1} b = D A^T \left( A D A^T + C \right)^{-1} b,$$
$$\Phi = \tilde{\Gamma}_{22} = \Gamma_{11} - \Gamma_{12} \Gamma_{22}^{-1} \Gamma_{21} = D - D A^T \left( A D A^T + C \right)^{-1} A D.$$

Example: Numerical differentiation

Estimate $g$ from noisy observations of its antiderivative,
$$f(t) = \int_0^t g(\tau)\, d\tau + \text{noise}.$$
Discretization:
$$b = A x + e, \quad \text{where} \quad A = \frac{1}{n} \begin{bmatrix} 1 & & & \\ 1 & 1 & & \\ \vdots & & \ddots & \\ 1 & 1 & \cdots & 1 \end{bmatrix}.$$
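In Matlab, this matrix takes one line; a sketch consistent with the discretization above:

n = 100;                 % number of grid points
A = tril(ones(n))/n;     % rectangle rule for the cumulative integral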

Stochastic extension

Stochastic model: $B = A X + E$. Model for the noise: independent components,
$$E_j \sim \mathcal{N}(0, \sigma^2), \quad 1 \leq j \leq n.$$
Probability density:
$$\pi_E(e) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{1}{2\sigma^2} \|e\|^2 \right).$$
Likelihood:
$$\pi_B(b \mid x) \propto \exp\left( -\frac{1}{2\sigma^2} \|b - A x\|^2 \right).$$

Autoregressive Prior Models

Discrete model:
$$x_j = g(t_j), \quad t_j = \frac{j}{n}, \quad 0 \leq j \leq n.$$
Consider two possible prior models:
1. We know that $x_0 = 0$, and believe that the absolute value of the slope of $g$ is bounded by some $m_1 > 0$.
2. We know that $x_0 = x_n = 0$, and believe that the curvature of $g$ is bounded by some $m_2 > 0$.

Autoregressive models

1. Slope:
$$g'(t_j) \approx \frac{x_j - x_{j-1}}{h}, \quad h = \frac{1}{n}.$$
Prior information: we believe that $|x_j - x_{j-1}| \leq h\, m_1$, with some uncertainty.
2. Curvature:
$$g''(t_j) \approx \frac{x_{j-1} - 2 x_j + x_{j+1}}{h^2}.$$
Prior information: we believe that $|x_{j-1} - 2 x_j + x_{j+1}| \leq h^2\, m_2$, with some uncertainty.

Autoregressive model

In both cases, we assume that $x_j$ is a realization of a random variable $X_j$. Boundary conditions:
1. $X_0 = 0$ with certainty; probabilistic model for $X_j$, $1 \leq j \leq n$.
2. $X_0 = X_n = 0$ with certainty; probabilistic model for $X_j$, $1 \leq j \leq n - 1$.

Autoregressive prior models

1. First order prior:
$$X_j = X_{j-1} + \gamma W_j, \quad W_j \sim \mathcal{N}(0, 1), \quad \gamma = h\, m_1.$$
2. Second order prior:
$$X_j = \frac{1}{2} \left( X_{j-1} + X_{j+1} \right) + \gamma W_j, \quad W_j \sim \mathcal{N}(0, 1), \quad \gamma = \frac{1}{2} h^2 m_2.$$

[Figure: one step of each model; the first order draw is centered at x_{j-1}, the second order draw at the mean of x_{j-1} and x_{j+1}, both with STD = γ.]

Matrix form: first order model

System of equations:
$$X_1 = X_1 - X_0 = \gamma W_1$$
$$X_2 - X_1 = \gamma W_2$$
$$\vdots$$
$$X_n - X_{n-1} = \gamma W_n$$
In matrix form, $L_1 X = \gamma W$, $W \sim \mathcal{N}(0, I_n)$, where
$$L_1 = \begin{bmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix} \in \mathbb{R}^{n \times n}, \quad X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}, \quad W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_n \end{bmatrix}.$$

Matrix form: second order model

System of equations:
$$X_2 - 2 X_1 = X_2 - 2 X_1 + X_0 = \gamma W_1$$
$$X_3 - 2 X_2 + X_1 = \gamma W_2$$
$$\vdots$$
$$-2 X_{n-1} + X_{n-2} = X_n - 2 X_{n-1} + X_{n-2} = \gamma W_{n-1}$$
In matrix form, $L_2 X = \gamma W$, $W \sim \mathcal{N}(0, I_{n-1})$, where
$$L_2 = \begin{bmatrix} -2 & 1 & & \\ 1 & -2 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & -2 \end{bmatrix} \in \mathbb{R}^{(n-1) \times (n-1)}, \quad X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_{n-1} \end{bmatrix}.$$

Prior density

Given a model
$$L X = \gamma W, \quad W \sim \mathcal{N}(0, I),$$
that is,
$$\pi_W(w) \propto \exp\left( -\frac{1}{2} \|w\|^2 \right),$$
we conclude that
$$\pi_X(x) \propto \exp\left( -\frac{1}{2\gamma^2} \|L x\|^2 \right) = \exp\left( -\frac{1}{2} x^T \left[ \frac{1}{\gamma^2} L^T L \right] x \right).$$
The inverse of the covariance matrix, i.e., the precision matrix, is
$$D^{-1} = \frac{1}{\gamma^2} L^T L.$$

Testing a Prior

Question: given a covariance $D$, how can we check if the prior corresponds to our expectations? Symmetric decomposition of the precision matrix (let $\gamma = 1$ for simplicity):
$$D^{-1} = L^T L.$$
We know that $W = L X \sim \mathcal{N}(0, I)$. Sampling of $X$:
1. Draw a realization $w \sim \mathcal{N}(0, I_n)$.
2. Set $x = L^{-1} w$.

Random draws from priors

Generate m draws from the prior using the Matlab command randn.

n = 100;              % number of discretization intervals
t = (0:1/n:1);        % grid on [0,1]
m = 5;                % number of draws

% First order model. Boundary condition X_0 = 0
L1 = diag(ones(1,n),0) - diag(ones(1,n-1),-1);
gamma = 1/n;          % m_1 = 1
W = gamma*randn(n,m);
X = L1\W;
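The second order model can be sampled the same way. A sketch continuing the script above (the boundary conditions X_0 = X_n = 0 are built into L2, and m_2 = 1 is assumed):

% Second order model. Boundary conditions X_0 = X_n = 0
L2 = -2*diag(ones(1,n-1),0) + diag(ones(1,n-2),1) + diag(ones(1,n-2),-1);
gamma2 = 1/(2*n^2);   % gamma = h^2 m_2 / 2 with m_2 = 1
W2 = gamma2*randn(n-1,m);
X2 = L2\W2;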

Plots of the random draws

[Figure: random draws from the two prior models.]

Belief envelopes

Diagonal elements of the posterior covariance:
$$\Gamma_{jj} = \eta_j^2 = \text{posterior variance of } X_j.$$
Posterior belief:
$$\bar{x}_j - 2\eta_j < X_j < \bar{x}_j + 2\eta_j$$
with posterior probability 95%.
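Given the posterior mean and covariance from the formulas above, the envelope takes two lines of Matlab (a sketch; the variables xbar, Phi, and the grid t are assumed to have been computed already):

eta = sqrt(diag(Phi));                                % pointwise posterior STDs
plot(t, xbar, t, xbar + 2*eta, '--', t, xbar - 2*eta, '--');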

Bayesian solution

[Figure: mean solution and 2 STD belief envelope.]

Linear Inverse Problems

An alternative (but equivalent) formula for the posterior. Prior:
$$\pi_X(x) \propto \exp\left( -\frac{1}{2} x^T D^{-1} x \right),$$
and likelihood:
$$\pi_B(b \mid x) \propto \exp\left( -\frac{1}{2} (b - A x)^T C^{-1} (b - A x) \right).$$
Posterior density:
$$\pi_X(x \mid b) \propto \pi_X(x)\, \pi_B(b \mid x) \propto \exp\left( -\frac{1}{2} Q(x) \right),$$

Linear Inverse Problems

Quadratic term in the exponential:
$$Q(x) = (b - A x)^T C^{-1} (b - A x) + x^T D^{-1} x.$$
Collect the terms of the same order in $x$ together:
$$Q(x) = x^T \underbrace{\left( A^T C^{-1} A + D^{-1} \right)}_{= M} x - 2 x^T A^T C^{-1} b + b^T C^{-1} b.$$
Complete the square:
$$Q(x) = \left( x - M^{-1} A^T C^{-1} b \right)^T M \left( x - M^{-1} A^T C^{-1} b \right) + \ldots$$

Linear Inverse Problems

Conclusion: the posterior mean and covariance have the alternative expressions
$$\bar{x} = \left( A^T C^{-1} A + D^{-1} \right)^{-1} A^T C^{-1} b$$
and
$$\Phi = \left( A^T C^{-1} A + D^{-1} \right)^{-1}.$$
The formula for $\bar{x}$ is also known as the Wiener filtered solution.
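Both expressions for the mean agree, which can be confirmed numerically. A Matlab sketch with illustrative random matrices (the sizes and covariances are arbitrary choices for the test):

m = 5; n = 8;
A = randn(m,n);
D = eye(n);  C = 0.1*eye(m);      % illustrative prior and noise covariances
b = randn(m,1);
x1 = D*A'*((A*D*A' + C)\b);       % covariance form: DA^T(ADA^T + C)^{-1} b
M  = A'*(C\A) + inv(D);
x2 = M\(A'*(C\b));                % precision (Wiener) form
norm(x1 - x2)                     % ~ 0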

Tikhonov Regularization Revisited

Consider the linear model
$$b = A x + e, \quad e \sim \mathcal{N}(0, C).$$
Likelihood:
$$\pi_B(b \mid x) \propto \exp\left( -\frac{1}{2} (b - A x)^T C^{-1} (b - A x) \right).$$
Assume a Gaussian prior,
$$X \sim \mathcal{N}(0, D),$$
or, in terms of densities,
$$\pi_X(x) \propto \exp\left( -\frac{1}{2} x^T D^{-1} x \right).$$

Tikhonov Regularization Revisited

Bayes formula:
$$\pi_X(x \mid b) \propto \pi_X(x)\, \pi_B(b \mid x) = \exp\left( -\frac{1}{2} x^T D^{-1} x - \frac{1}{2} (b - A x)^T C^{-1} (b - A x) \right).$$
By writing $D^{-1} = L^T L$, the negative of the exponent is
$$H(x) = \frac{1}{2} \left( \|b - A x\|_C^2 + \|L x\|^2 \right),$$
a Tikhonov functional, where $\|b - Ax\|_C^2 = (b - Ax)^T C^{-1} (b - Ax)$ denotes the covariance-weighted norm.

Tikhonov Regularization Revisited

Therefore,
$$x_{\rm MAP} = \mathrm{argmax} \left\{ \pi_X(x \mid b) \right\} = \mathrm{argmin} \left\{ \|b - A x\|_C^2 + \|L x\|^2 \right\}.$$
The Tikhonov regularization parameter is absorbed into the prior covariance matrix as well as the noise covariance matrix.
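Numerically, x_MAP is best computed as a stacked least squares problem rather than by forming the normal equations. A sketch for the white noise case C = σ²I, so that the weighted norm reduces to division by σ (A, b, L, and sigma as in the first order example above, where the blocks are conformable):

% minimize ||b - Ax||_C^2 + ||Lx||^2 as a least squares problem
xMAP = [A/sigma; L] \ [b/sigma; zeros(size(L,1),1)];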

Non-Gaussian models

The Gaussian models are insufficient when
- the forward model is non-linear,
- the prior is non-Gaussian,
- the noise is non-additive,
- the noise is non-Gaussian.

Monte Carlo Integration

Assume that a probability density $\pi_X$ is given in $\mathbb{R}^n$. Problem: estimate numerically an integral of the type
$$\mathbb{E}\{f(X)\} = \int_{\mathbb{R}^n} f(x)\, \pi_X(x)\, dx = ?$$
Expectation of $X$: $f(x) = x$. Covariance of $X$: $f(x) = (x - \bar{x})(x - \bar{x})^T$.

Monte Carlo Integration

Difficulties with numerical integration using quadrature methods:
- We may not know the support of $\pi_X$ (support: the set in which the function is not vanishing). Where should we put our quadrature points?
- If $n$ is large, an integration grid becomes huge: $K$ points per direction means $K^n$ grid points.
Try Monte Carlo integration!

Monte Carlo Integration

Example: given a two-dimensional set $\Omega \subset \mathbb{R}^2$, estimate the area of $\Omega$. Raindrop integration: assume that $\Omega \subset Q = [0, a] \times [0, b]$. Draw points from the uniform density over $Q$:
$$\left\{ x^1, x^2, \ldots, x^N \right\}, \quad x^j \sim \mathrm{Uniform}(Q).$$
Estimate of the area $|\Omega|$:
$$\frac{|\Omega|}{|Q|} = \frac{|\Omega|}{ab} \approx \frac{\#\{ \text{points } x^j \in \Omega \}}{N},$$
and solve for $|\Omega|$.

Monte Carlo Integration

The approximation corresponds to the Monte Carlo integral
$$\frac{|\Omega|}{|Q|} = \int_Q \chi_\Omega(x)\, \frac{1}{|Q|}\, dx \approx \frac{1}{N} \sum_{j=1}^N \chi_\Omega(x^j),$$
where
$$\chi_\Omega(x) = \begin{cases} 1 & \text{if } x \in \Omega, \\ 0 & \text{if } x \notin \Omega, \end{cases}$$
$1/N$ is the equal weight that every point $x^j$ has, and $\frac{1}{|Q|} \chi_Q(x)$ is the uniform density over $Q$.
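A Matlab version of the raindrop experiment, estimating the area of the unit disk inside Q = [-1,1] x [-1,1] (a sketch; the domain and sample size are illustrative, and the true answer is π):

N = 1e5;
x = -1 + 2*rand(2,N);              % uniform draws over Q
inOmega = sum(x.^2,1) <= 1;        % indicator of the unit disk
areaEst = 4*mean(inOmega)          % |Q| = 4; should be close to pi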

Monte Carlo Integration

Generalize: given a probability density $\pi_X$, write
$$\int_{\mathbb{R}^n} f(x)\, \pi_X(x)\, dx \approx \frac{1}{N} \sum_{j=1}^N f(x^j),$$
where $\left\{ x^1, x^2, \ldots, x^N \right\}$ is drawn independently from the probability distribution $\pi_X$. Problem: how does one draw from a probability density in $\mathbb{R}^n$?

Sampling and Markov chains: Random walk

A random walk is a process of moving around by taking random steps. The most elementary random walk:
1. Start at a point of your choice, $x_0 \in \mathbb{R}^n$.
2. Draw a random vector $w_1 \sim \mathcal{N}(0, I)$ and set $x_1 = x_0 + \sigma w_1$.
3. Repeat the process: set $x_{k+1} = x_k + \sigma w_{k+1}$, $w_{k+1} \sim \mathcal{N}(0, I)$.

Sampling and Markov chains: Random walk

In terms of random variables:
$$X_{k+1} = X_k + \sigma W_{k+1}, \quad W_{k+1} \sim \mathcal{N}(0, I_n).$$
The conditional density of $X_{k+1}$, given $X_k = x_k$, is
$$\pi(x_{k+1} \mid x_k) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{1}{2\sigma^2} \|x_{k+1} - x_k\|^2 \right) = q_k(x_k, x_{k+1}).$$
The function $q_k$ is called the $k$th transition kernel.

Sampling and Markov chains: Random walk

Since $q_0 = q_1 = q_2 = \ldots$, i.e., the step is always equally distributed, we call the random walk time invariant ($k$ = time). The chain $\{ X_k,\ k = 0, 1, \ldots \}$ of random variables is called a discrete time stochastic process. Particular feature: the probability distribution of $X_{k+1}$ depends on the past only through the previous member $X_k$:
$$\pi(x_{k+1} \mid x_0, x_1, \ldots, x_k) = \pi(x_{k+1} \mid x_k).$$
A stochastic process having this property is called a Markov chain.

Sampling and Markov chains: Random walk

Given an arbitrary transition kernel $q$ and a random variable $X$ with probability density $\pi_X(x) = p(x)$, generate a new random variable $Y$ by using the kernel $q(x, y)$, that is, $\pi(y \mid x) = q(x, y)$. Question: what is the probability density of this new variable? The answer is found by marginalization,
$$\pi_Y(y) = \int \pi(y \mid x)\, \pi_X(x)\, dx = \int q(x, y)\, p(x)\, dx.$$

Sampling and Markov chains: Random walk

If the probability density of the new variable is equal to that of the old one, that is,
$$\int q(x, y)\, p(x)\, dx = p(y),$$
then $p$ is called an invariant density of the transition kernel $q$. The classical problem in the theory of Markov chains is: given a transition kernel, find the corresponding invariant density.

Invariant density and sampling

Recall the sampling problem: given a probability density $p = p(x)$, generate a sample that is distributed according to it. If we had a transition kernel $q$ with invariant density $p$, generating such a sample from $p(x)$ would be easy: start with some $x_0$; draw $x_1$ from $q(x_0, x_1)$; in general, given $x_k$, draw $x_{k+1}$ from $q(x_k, x_{k+1})$. Rephrasing the sampling problem: given a probability density $p$, find a kernel $q$ such that $p$ is its invariant density.

Metropolis-Hastings algorithm

Given a transition density $y \mapsto K(x, y)$, where $x \in \mathbb{R}^n$ is the current point, consider the following Markov process: if $x \in \mathbb{R}^n$ is the current point, we have two possibilities:
1. Stay at $x$ with probability $r(x)$, $0 \leq r(x) < 1$;
2. Move by using the transition kernel $K(x, y)$.

Let $x$ and $y$ be realizations of the random variables $X$ and $Y$, where $\pi_X(x) = p(x)$ and $y$ is generated according to the algorithm above. Question: what is the probability density of $Y$?

Let $A$ be the event that we opt for moving from $x$, and $\bar{A}$ the event of staying put. The probability of $Y \in B \subset \mathbb{R}^n$, assuming a move, is
$$P\{ Y \in B, A \mid X = x \} = \int_B K(x, y)\, dy.$$
The kernel $K$ is scaled so that
$$P\{ A \mid X = x \} = P\{ Y \in \mathbb{R}^n, A \mid X = x \} = \int_{\mathbb{R}^n} K(x, y)\, dy = 1 - r(x). \tag{4}$$

On the other hand, if we stay put, $Y \in B$ happens only if $X \in B$:
$$P\{ Y \in B, \bar{A} \mid X = x \} = r(x)\, \chi_B(x) = \begin{cases} r(x) & \text{if } x \in B, \\ 0 & \text{if } x \notin B, \end{cases}$$
where $\chi_B$ is the characteristic function of $B$.

Hence, the total probability of arriving from $x$ into $B$ is
$$P\{ Y \in B \mid X = x \} = P\{ Y \in B, A \mid X = x \} + P\{ Y \in B, \bar{A} \mid X = x \} = \int_B K(x, y)\, dy + r(x)\, \chi_B(x).$$

Marginalize over $x$ and calculate the probability of $Y \in B$:
$$P\{ Y \in B \} = \int P\{ Y \in B \mid X = x \}\, p(x)\, dx$$
$$= \int \left( \int_B K(x, y)\, dy \right) p(x)\, dx + \int \chi_B(x)\, r(x)\, p(x)\, dx$$
$$= \int_B \left( \int p(x)\, K(x, y)\, dx \right) dy + \int_B r(x)\, p(x)\, dx$$
$$= \int_B \left( \int p(x)\, K(x, y)\, dx + r(y)\, p(y) \right) dy.$$

Since $P\{ Y \in B \} = \int_B \pi_Y(y)\, dy$, we must have
$$\pi_Y(y) = \int p(x)\, K(x, y)\, dx + r(y)\, p(y).$$
Our goal is then to find a kernel $K$ such that $\pi_Y(y) = p(y)$, that is,
$$p(y) = \int p(x)\, K(x, y)\, dx + r(y)\, p(y),$$
or, equivalently,
$$\left( 1 - r(y) \right) p(y) = \int p(x)\, K(x, y)\, dx.$$

Substituting (4) into this formula, with the roles of $x$ and $y$ interchanged, we obtain
$$\int p(y)\, K(y, x)\, dx = \int p(x)\, K(x, y)\, dx.$$
This equation is called the balance equation. It holds, in particular, if the integrands are equal,
$$p(y)\, K(y, x) = p(x)\, K(x, y).$$
The latter equation is known as the detailed balance equation.

Metropolis-Hastings algorithm

Start by selecting a proposal distribution, or candidate generating kernel, $q(x, y)$. The kernel should be chosen so that generating a Markov chain with it is easy; a Gaussian kernel is a popular choice.

Metropolis-Hastings algorithm

If $q$ satisfies the detailed balance equation, i.e.,
$$p(y)\, q(y, x) = p(x)\, q(x, y),$$
we are done, since $p$ is an invariant density. More likely, the equality does not hold. If
$$p(y)\, q(y, x) < p(x)\, q(x, y), \tag{5}$$
force the detailed balance equation to hold by defining $K$ as
$$K(x, y) = \alpha(x, y)\, q(x, y),$$
where $\alpha$ is chosen so that
$$p(y)\, \alpha(y, x)\, q(y, x) = p(x)\, \alpha(x, y)\, q(x, y).$$

Metropolis-Hastings algorithm

The factor $\alpha$ need not be symmetric, so let $\alpha(y, x) = 1$. Now the other factor is uniquely determined: we must have
$$\alpha(x, y) = \frac{p(y)\, q(y, x)}{p(x)\, q(x, y)} < 1.$$
Observe that if the inequality (5) goes the other way, we interchange the roles of $x$ and $y$, and let $\alpha(x, y) = 1$. In summary,
$$K(x, y) = \alpha(x, y)\, q(x, y), \quad \alpha(x, y) = \min\left\{ 1, \frac{p(y)\, q(y, x)}{p(x)\, q(x, y)} \right\}.$$

Metropolis-Hastings algorithm

The draws proceed in two phases, propose and then accept or reject:
1. Given $x$, draw $y$ using the transition kernel $q(x, y)$.
2. Calculate the acceptance ratio,
$$\alpha(x, y) = \frac{p(y)\, q(y, x)}{p(x)\, q(x, y)}.$$
3. Flip the $\alpha$-coin: draw $t \sim \mathrm{Uniform}([0, 1])$; if $\alpha > t$, accept $y$; otherwise stay where you are.

Metropolis-Hastings algorithm: Random walk proposal

If $q(x, y) = q(y, x)$, the algorithm simplifies:
1. Given $x$, draw $y$ using the transition kernel $q(x, y)$.
2. Calculate the acceptance ratio,
$$\alpha(x, y) = \frac{p(y)}{p(x)}.$$
3. Flip the $\alpha$-coin: draw $t \sim \mathrm{Uniform}([0, 1])$; if $\alpha > t$, accept $y$; otherwise stay where you are.

Random walk Metropolis-Hastings

1. Set the sample size $N$. Pick an initial point $x_1$. Set $k = 1$.
2. Propose a new point, $y = x_k + \delta w$, $w \sim \mathcal{N}(0, I)$.
3. Compute the acceptance ratio,
$$\alpha = \frac{\pi(y)}{\pi(x_k)}.$$
4. Flip the $\alpha$-coin: draw $\xi \sim \mathrm{Uniform}([0, 1])$;
4.1 if $\alpha \geq \xi$, accept: $x_{k+1} = y$;
4.2 if $\alpha < \xi$, stay put: $x_{k+1} = x_k$.
5. If $k < N$, increase $k \to k + 1$ and continue from step 2; else stop.

Two steps

Build the program in two steps:
1. Random walk sampling, no rejections.
2. Add the rejection step.

Sampling, no rejections

nsample = 10000;            % Sample size
x = [0;0];                  % Initial point
step = 0.1;                 % Step size of the random walk
Sample = NaN(2,nsample);    % For memory allocation
Sample(:,1) = x;
for j = 2:nsample
    y = x + step*randn(2,1);
    % Accept unconditionally
    x = y;
    Sample(:,j) = x;
end

Add the α-coin

Write the acceptance condition in logarithmic form:
$$\frac{\pi(y)}{\pi(x)} > t \iff \log \pi(y) - \log \pi(x) > \log t.$$

x = [0;0];                  % Initial point
step = 0.1;                 % Step size of the random walk
Sample = NaN(2,nsample);    % For memory allocation
Sample(:,1) = x;
logpx = ...                 % log of the (unnormalized) target density at x
for j = 2:nsample
    y = x + step*randn(2,1);
    logpy = ...             % log of the (unnormalized) target density at y
    t = rand;
    if logpy - logpx > log(t)   % accept
        x = y;
        logpx = logpy;
    end
    Sample(:,j) = x;
end

If a move is not accepted, the previous point has to be repeated:
$$\ldots, x_{k-1}, \underbrace{x_k, x_k, x_k}_{\text{rejections}}, x_{k+1}, x_{k+2}, \ldots$$
The acceptance rate tells the relative rate of acceptances. Too low an acceptance rate: the chain does not move. Too high an acceptance rate: the chain is essentially a random walk and learns nothing of the underlying distribution (cf. raising kids). What is a good acceptance rate? Rule of thumb: 15%-35%.
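The realized acceptance rate can be read off the stored chain by counting the moves; since a rejection repeats the previous column, a Matlab sketch is:

moves   = any(diff(Sample,1,2) ~= 0, 1);   % true wherever the chain moved
accRate = mean(moves)                       % aim for roughly 0.15 - 0.35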

Example: Inverse problem in chemical kinetics

Reversible single reaction pair,
$$\mathrm{A} \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} \mathrm{B}.$$
Data: with known initial values, measure $[\mathrm{A}](t_j)$, $1 \leq j \leq n$, for
$$t_{\min} = t_1 < t_2 < \cdots < t_n = t_{\max}.$$
The noisy observation model is
$$b_j = [\mathrm{A}](t_j) + e_j, \quad e_j = \text{additive noise}, \quad \text{noise level} = \sigma.$$
Inverse problem: estimate $k_1$ and $k_{-1}$.

Forward model: Mass Balance Equations

Denote
$$c_1(t) = [\mathrm{A}](t), \quad c_2(t) = [\mathrm{B}](t).$$
Assuming unit volume,
$$\frac{dc_1}{dt} = -k_1 c_1 + k_{-1} c_2, \quad c_1(0) = c_{01},$$
$$\frac{dc_2}{dt} = k_1 c_1 - k_{-1} c_2, \quad c_2(0) = c_{02},$$
or
$$\frac{dc}{dt} = K c, \quad \text{where} \quad c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}, \quad K = \begin{bmatrix} -k_1 & k_{-1} \\ k_1 & -k_{-1} \end{bmatrix}.$$

The eigenvalues of $K$ are
$$\lambda_1 = 0, \quad \lambda_2 = -(k_1 + k_{-1}),$$
with time constant
$$\tau = \frac{1}{k_1 + k_{-1}}.$$
Eigenvectors:
$$v_1 = \begin{bmatrix} 1 \\ \delta \end{bmatrix}, \quad v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad \delta = \frac{k_1}{k_{-1}}.$$
Solution:
$$c = \alpha v_1 + \beta v_2 e^{-t/\tau}, \quad \alpha, \beta \in \mathbb{R}.$$

The initial conditions imply
$$\alpha = \frac{c_{01} + c_{02}}{1 + \delta}, \quad \beta = \frac{\delta c_{01} - c_{02}}{1 + \delta}.$$
In particular,
$$c_1(t) = f(t; k) = \frac{c_{01} + c_{02}}{1 + \delta} + \frac{\delta c_{01} - c_{02}}{1 + \delta}\, e^{-t/\tau},$$
where
$$k = \begin{bmatrix} k_1 \\ k_{-1} \end{bmatrix}.$$
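The forward model fits in a single Matlab function handle (a sketch; the convention k(1) = k_1 and k(2) = k_{-1} is an illustrative choice):

f = @(t,k,c01,c02) (c01 + c02)/(1 + k(1)/k(2)) ...
    + (k(1)/k(2)*c01 - c02)/(1 + k(1)/k(2)) .* exp(-(k(1)+k(2))*t);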

Observation model

The data consist of $n$ measurements of $c_1(t)$ corrupted by additive noise,
$$b_j = f(t_j; k) + e_j, \quad 1 \leq j \leq n.$$
The observation errors $e_j$ are mutually independent and zero mean normally distributed, $e_j \sim \mathcal{N}(0, \sigma^2)$.

Likelihood density

Assume that the noise is Gaussian white noise:
$$\pi_{\rm noise}(e) \propto \exp\left( -\frac{1}{2\sigma^2} \|e\|^2 \right).$$
The likelihood density is
$$\pi(b \mid k) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^n \left( b_j - f(t_j; k) \right)^2 \right).$$

Posterior density

Flat prior over a box: we believe that
$$0 < k_1 \leq K_1, \quad 0 < k_{-1} \leq K_{-1},$$
with some reasonable upper bounds. Write
$$\pi_{\rm prior}(k) \propto \chi_{[0, K_1]}(k_1)\, \chi_{[0, K_{-1}]}(k_{-1}).$$
Posterior density by Bayes formula,
$$\pi(k \mid b) \propto \pi_{\rm prior}(k)\, \pi(b \mid k).$$
Contour plots of the posterior density?
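For the MCMC runs below it is convenient to evaluate the unnormalized log posterior directly. A sketch, assuming the handle f from above, data vectors t and b, and upper bounds K1 and Km1 (Km1 standing in for K_{-1}; all names are illustrative):

logprior = @(k) log(double(k(1) > 0 & k(1) <= K1 & k(2) > 0 & k(2) <= Km1));
loglik   = @(k) -sum((b - f(t,k,c01,c02)).^2)/(2*sigma^2);
logpost  = @(k) loglik(k) + logprior(k);   % -Inf outside the prior box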

Data

[Figure: noisy measurements of c_1(t). K_1 = 6, K_{-1} = 2, five uniformly distributed measurements of c_1(t) over the interval [t_min, t_max]; t_min = 0.1 τ, t_max = 4.1 τ (left), t_min = 5 τ, t_max = 9 τ (right).]

Posterior densities

[Figure: contour plots of the posterior density. K_1 = 6, K_{-1} = 2, five uniformly distributed measurements of c_1(t) over the interval [t_min, t_max]; t_min = 0.1 τ, t_max = 4.1 τ (left), t_min = 5 τ, t_max = 9 τ (right). The hair cross indicates the value used for data generation.]

MCMC exploration

Generate the data: define
$$t = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_n \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix},$$
where
$$b_j = \frac{c_{01} + c_{02}}{1 + \delta_{\rm true}} + \frac{\delta_{\rm true} c_{01} - c_{02}}{1 + \delta_{\rm true}}\, e^{-t_j/\tau_{\rm true}} + e_j,$$
$$\delta_{\rm true} = \frac{k_{1,\rm true}}{k_{-1,\rm true}}, \quad \tau_{\rm true} = \frac{1}{k_{1,\rm true} + k_{-1,\rm true}}.$$
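A Matlab sketch of the data generation, reusing the handle f from above (the true rates, initial concentrations, and noise level here are illustrative values, not taken from the slides):

ktrue = [2; 1];                             % illustrative true rates
tau   = 1/sum(ktrue);
t     = linspace(0.1*tau, 4.1*tau, 5)';     % five transient measurements
c01 = 3; c02 = 0; sigma = 0.05;             % illustrative initial data and noise
b = f(t, ktrue, c01, c02) + sigma*randn(size(t));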

Random walk Metropolis-Hastings

Start with the transient measurements. White noise proposal,
$$k_{\rm prop} = k + \delta w, \quad w \sim \mathcal{N}(0, I).$$
Choose first $\delta = 0.1$, with different initial points $k^0 = (1, 2)$ or $k^0 = (5, 0.1)$. The acceptance rates with these values are of the order of 45%.

Scatter plots

[Figure: scatter plots of the posterior sample. Transient measurements, t_min = 0.1 τ, t_max = 4.1 τ.]

Sample histories: first component

[Figure: sample histories of the first component. Initial value k_1 = 1 (left) and k_1 = 5 (right).]

Sample histories: second component

[Figure: sample histories of the second component. Initial value k_2 = 2 (left) and k_2 = 0.2 (right).]

Burn-in: first component

[Figure: the first 500 samples of the chain. Initial value k_1 = 1 (left) and k_1 = 5 (right).]

Burn-in: second component

[Figure: the first 500 samples of the chain. Initial value k_2 = 2 (left) and k_2 = 0.2 (right).]

Scatter plots: steady state measurement

[Figure: scatter plot of the posterior sample.] t_min = 5 τ, t_max = 9 τ. Use the same step size as before. Initial point (k_1, k_2) = (1, 2).

Sample histories, a.k.a. fuzzy worms

[Figure: sample histories of the two components.] Increase the step size from 0.1 to 1. Initial point (k_1, k_2) = (1, 2). The acceptance rate remains high, about 55%.

Scatter plots, steady state data

[Figure: scatter plot of the posterior sample.] Increase the step size from 0.1 to 1. Initial point (k_1, k_2) = (1, 2). The acceptance rate remains high, about 55%.

Fuzzy worms

[Figure: sample histories of the two components with the larger step size.]