Computationally Efficient MCMC for Hierarchical Bayesian Inverse Problems


1 Computationally Efficient MCMC for Hierarchical Bayesian Inverse Problems
Andrew Brown 1, Arvind Saibaba 2, Sarah Vallélian 2,3
SAMSI, January 28, 2017
Supported by a SAMSI Visiting Research Fellowship and NSF grant DMS to SAMSI
1 Department of Mathematical Sciences, Clemson University, Clemson, SC 29634, USA
2 Department of Mathematics, North Carolina State University, Raleigh, NC 27695, USA
3 Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA

2 Inverse Problems (schematic, courtesy of A. Saibaba): Parameters x → Model Ax + noise → Data d.
Fox, Haario, and Christen (2013)

3 Assume the process can be (approximately) represented as a linear operator, e.g., the discretization of an integral equation, the solution to a system of PDEs, or a multiple linear regression model.
In practice, the data are observed with measurement error. With additive noise, the model becomes $d = Ax + e$, where we assume, for instance, that $e \sim N(0, \mu^{-1} I)$.
We are concerned with cases in which this problem is not well behaved, e.g., poorly conditioned or with infinitely many solutions.

4 Regularized Inversion
In general,
$$\hat{x} = \arg\min_x \|Ax - d\|_2^2 + \lambda^2 \|L(x - x_0)\|_2^2 = (A^T A + \lambda^2 Q)^{-1}(A^T d + \lambda^2 Q x_0),$$
where $Q = L^T L$ and $x_0$ is a default solution.
Note:
$$\hat{x} = \arg\max_x \underbrace{k_1 \exp\left(-\tfrac{1}{2}\|Ax - d\|_2^2\right)}_{f(d \mid x)} \, \underbrace{k_2 \exp\left(-\tfrac{\lambda^2}{2}\|L(x - x_0)\|_2^2\right)}_{\pi(x \mid \lambda)}$$
The inverse problem admits a Bayesian interpretation.
Hoerl and Kennard (1970), Tikhonov and Arsenin (1977), Press et al. (2007), Fox et al. (2013)
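
To make the formula concrete, here is a minimal sketch of the regularized solve, assuming small dense matrices; tikhonov_solve and the toy problem below are illustrative and not from the talk.

```python
import numpy as np

def tikhonov_solve(A, d, L, x0, lam):
    """Solve (A^T A + lam^2 L^T L) x = A^T d + lam^2 L^T L x0."""
    Q = L.T @ L
    return np.linalg.solve(A.T @ A + lam**2 * Q, A.T @ d + lam**2 * Q @ x0)

# Toy usage (random forward operator, identity smoothing, zero default solution):
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 40)) / 40.0
x_hat = tikhonov_solve(A, d=A @ np.ones(40), L=np.eye(40), x0=np.zeros(40), lam=0.1)
```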

5 The Bayesian Machinery
Given prior information and a data-generating model, the goal is to update information about the parameters of interest, given the observed data, via Bayes' rule:
Posterior ∝ Likelihood × Prior
MAP estimator = posterior mode: $\arg\max_x \pi(x \mid d, \lambda) = \arg\max_x f(d \mid x)\,\pi(x \mid \lambda)$
The posterior distribution facilitates more complete inferences:
Other point estimators (posterior mean, posterior median, etc.)
In particular, it allows quantification of uncertainty about the estimators.
Berger (1985), Bernardo and Smith (1994), Robert (2007), Carlin and Louis (2009), Gelman et al. (2013), ...

6 Suppose $d \mid x, \mu \sim N(Ax, \mu^{-1} I)$ and $x \mid \sigma \sim N(0, \sigma^{-1} \Gamma)$.
With $\mu$ and $\sigma$ fixed, the posterior is $x \mid d, \sigma, \mu \sim N(m, \Sigma)$, where
$$\Sigma = (\mu A^T A + \sigma \Gamma^{-1})^{-1}, \qquad m = \Sigma \mu A^T d.$$
MAP = posterior mode = posterior mean:
$$\hat{x} = (\mu A^T A + \sigma \Gamma^{-1})^{-1} \mu A^T d = (A^T A + \lambda L^T L)^{-1} A^T d,$$
where $\lambda = \sigma/\mu$ and $\Gamma^{-1} = L^T L$.
Lindley and Smith (1972)
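
A minimal sketch of this conjugate posterior, assuming matrices small enough to form and factor the covariance directly (later slides address the case where this is too expensive); the function name and arguments are illustrative placeholders.

```python
import numpy as np

def gaussian_posterior(A, d, Gamma_inv, mu, sigma, rng):
    """Return posterior mean m, covariance Sigma, and one draw x ~ N(m, Sigma)."""
    Sigma = np.linalg.inv(mu * A.T @ A + sigma * Gamma_inv)
    m = Sigma @ (mu * A.T @ d)
    C = np.linalg.cholesky(Sigma)            # Sigma = C C^T
    x = m + C @ rng.standard_normal(len(m))
    return m, Sigma, x
```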

7 Regularization Parameter?
Often, λ is fixed a priori: cross validation, generalized cross validation, L-curves, etc.
Alternatively, assign µ and σ (or λ = σ/µ) a prior distribution:
The data have more freedom to determine plausible values
Uncertainty associated with the parameters is accounted for in a posteriori inferences
Challenge: the posterior distribution may not be available in closed form; Monte Carlo sampling, approximation methods, etc.
O'Sullivan (1986), Hansen (1992), Hansen and O'Leary (1993)

8 Gibbs Sampling
Suppose we have a k-dimensional parameter space, $\theta = (\theta_1, \ldots, \theta_k)^T$, from which we wish to sample the posterior distribution, $h(\theta) := \pi(\theta \mid y) \propto f(y \mid \theta)\pi(\theta)$.
Assuming the availability of the full conditional distributions, $\{p(\theta_i \mid \theta_{(-i)}, y) : i = 1, \ldots, k\}$, the Gibbs sampling algorithm is as follows:
1. Draw $\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \theta_3^{(t-1)}, \ldots, \theta_k^{(t-1)})$
2. Draw $\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_k^{(t-1)})$
...
k. Draw $\theta_k^{(t)} \sim p(\theta_k \mid \theta_1^{(t)}, \ldots, \theta_{k-1}^{(t)})$
Repeat for $t = 1, \ldots, T$. Under mild conditions, $\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_k^{(t)})^T$ converges to a draw from $\pi(\theta_1, \ldots, \theta_k \mid y)$.
Geman and Geman (1984), Gelfand and Smith (1990), Carlin and Louis (2009)
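
A generic sketch of the sweep just described, assuming the user supplies one full-conditional sampler per component; the interface is illustrative, not from the talk.

```python
import numpy as np

def gibbs_sampler(full_conditionals, theta0, T, rng):
    """full_conditionals[i](theta, rng) draws theta[i] given all other components."""
    theta = np.array(theta0, dtype=float)
    samples = np.empty((T, len(theta)))
    for t in range(T):
        for i in range(len(theta)):
            theta[i] = full_conditionals[i](theta, rng)   # uses the freshest values
        samples[t] = theta
    return samples
```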

9 Metropolis(-Hastings) Sampling
Algorithm: $x^{(l)}$ = current state, $x^*$ = proposed state,
$h(\cdot)$ = (unnormalized) density of the target distribution,
$q(\cdot \mid x^{(l)})$ = (unnormalized) density of the proposal distribution.
Set $x^{(l+1)} = x^*$ with probability $\alpha(x^{(l)}, x^*)$, else $x^{(l+1)} = x^{(l)}$, where
$$\alpha(x^{(l)}, x^*) = \min\left\{1, \frac{h(x^*)\, q(x^{(l)} \mid x^*)}{h(x^{(l)})\, q(x^* \mid x^{(l)})}\right\}$$
Random walk: want approximately 44% (23%) acceptance for univariate (multivariate) parameters.
Proposal distribution closely approximating the target distribution: want the acceptance rate as high as possible.
Gibbs sampling can be shown to be a special case of the Metropolis-Hastings algorithm with acceptance probability equal to 1.
Metropolis et al. (1953), Hastings (1970), Roberts et al. (1997), Rosenthal (2011)
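
A sketch of one Metropolis-Hastings update as stated above, worked in log space for numerical stability; log_target, propose, and log_proposal are user-supplied placeholders, not names from the talk.

```python
import numpy as np

def mh_step(x, log_target, propose, log_proposal, rng):
    """One MH update; log_proposal(a, b) evaluates log q(a | b)."""
    x_star = propose(x, rng)
    log_alpha = (log_target(x_star) + log_proposal(x, x_star)
                 - log_target(x) - log_proposal(x_star, x))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return x_star      # accept the proposal
    return x               # reject: chain stays at the current state
```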

10 Independence Sampling
The independence MH sampler proposes states from a density that is independent of the current state of the chain: $q(\theta^* \mid \theta^{(t-1)}) \equiv q(\theta^*) = g(\theta^*)$.
Acceptance ratio:
$$\frac{h(\theta^*)\, g(\theta^{(t-1)})}{h(\theta^{(t-1)})\, g(\theta^*)} = \frac{w(\theta^*)}{w(\theta^{(t-1)})},$$
where $w(\theta) \equiv h(\theta)/g(\theta)$.
Similar to the accept-reject algorithm:
1. Draw a candidate value $X \sim g$, with $h(x) \leq M g(x)$ for some $M \geq 1$ and all $x$
2. Accept the draw with probability $h(X)/(M g(X))$

11 Metropolis-within-Gibbs
Often, one or more of the conditional distributions needed for Gibbs sampling is not available.
Solution: substitute a Metropolis(-Hastings) step to approximate a draw from the conditional distribution.
The MH sub-chain is not irreducible with respect to the full parameter space, but it is a valid Markov chain with stationary distribution equal to the full conditional distribution (remaining parameters held fixed).
Convergence can be improved by running several sub-iterations at the MH step; it is more common to use a single MH draw, which still works.
Müller (1991), Neal (1998), Robert and Casella (2004), Gamerman and Lopes (2006), Carlin and Louis (2009)

12 After a sufficiently large number of iterations (the burn-in period), the simulated draws may be treated as realizations from the posterior of interest.
Common practice is to run multiple chains (3 or 5) starting from different values and compare the draws from each chain (after burn-in) to assess convergence.
Chains can be run in parallel on a computer to save processor time.
Examine trace plots of chain output; one can also use quantitative measures such as the (scalar and multivariate) potential scale reduction factor (PSRF).
The parameters can be partitioned into subvectors $\theta = (\theta_1^T, \theta_2^T, \ldots, \theta_P^T)^T$ and updated in blocks, $p(\theta_1 \mid \theta_2, \ldots, \theta_P, y)$, etc. This can speed up convergence when subsets of the parameters are highly correlated in the posterior.
Gelman and Rubin (1992), Brooks and Gelman (1998), Carlin and Louis (2009)
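
A sketch of the scalar PSRF mentioned above, using the standard between/within-chain variance construction; chains is assumed to be an (m, n) array of m post-burn-in chains of length n for a single parameter.

```python
import numpy as np

def psrf(chains):
    """Scalar potential scale reduction factor for one parameter."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = n * chain_means.var(ddof=1)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)               # values near 1 suggest convergence
```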

13 Hierarchical Model
$d \mid x, \mu \sim N_m(Ax, \mu^{-1} I)$
$x \mid \sigma \sim N_n(0, \sigma^{-1} \Gamma)$
$\mu \sim Ga(a_\mu, b_\mu)$
$\sigma \sim Ga(a_\sigma, b_\sigma)$
Full conditional distributions for Gibbs sampling:
$$x \mid \sigma, \mu, d \sim N_n\left((\mu A^T A + \sigma \Gamma^{-1})^{-1} \mu A^T d,\ (\mu A^T A + \sigma \Gamma^{-1})^{-1}\right)$$
$$\mu \mid x, \sigma, d \sim Ga\left(\tfrac{m}{2} + a_\mu,\ \tfrac{1}{2}\|Ax - d\|_2^2 + b_\mu\right)$$
$$\sigma \mid x, \mu \sim Ga\left(\tfrac{n}{2} + a_\sigma,\ \tfrac{1}{2}\|Lx\|_2^2 + b_\sigma\right)$$
where $L^T L = \Gamma^{-1}$.
Geman and Geman (1984), Gelfand and Smith (1990), Gelman and Rubin (1992), Brooks and Gelman (1998), Carlin and Louis (2009)
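
A sketch of one sweep of the block Gibbs sampler defined by these full conditionals, assuming dense matrices that can be factored directly; note that numpy's Gamma sampler takes shape and scale, so the rate parameters above are inverted.

```python
import numpy as np

def block_gibbs_sweep(x, mu, sigma, A, d, L, a_mu, b_mu, a_sig, b_sig, rng):
    """One sweep of the block Gibbs sampler for (x, mu, sigma)."""
    m, n = A.shape
    Gamma_inv = L.T @ L

    # x | mu, sigma, d ~ N(Sigma * mu * A^T d, Sigma) with Sigma^{-1} below.
    Sigma_inv = mu * A.T @ A + sigma * Gamma_inv
    C = np.linalg.cholesky(Sigma_inv)                       # Sigma_inv = C C^T
    mean = np.linalg.solve(Sigma_inv, mu * A.T @ d)
    x = mean + np.linalg.solve(C.T, rng.standard_normal(n)) # covariance Sigma_inv^{-1}

    # mu | x, d ~ Gamma(m/2 + a_mu, rate = 0.5*||Ax - d||^2 + b_mu)
    mu = rng.gamma(m / 2 + a_mu, 1.0 / (0.5 * np.sum((A @ x - d) ** 2) + b_mu))

    # sigma | x ~ Gamma(n/2 + a_sig, rate = 0.5*||Lx||^2 + b_sig)
    sigma = rng.gamma(n / 2 + a_sig, 1.0 / (0.5 * np.sum((L @ x) ** 2) + b_sig))
    return x, mu, sigma
```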

14 Example: 1D Image Restoration
Consider a Fredholm integral equation of the first kind:
$$\int_a^b K(s, t) f(t)\, dt \approx I_n(s) = \sum_{j=1}^n w_j K(s, t_j) f(t_j)$$
Quadrature yields $Ax + \text{noise} = d$, where $d = (I_n(s_1), \ldots, I_n(s_n))^T + \text{noise}$ and the desired solution is $x = (f(t_1), \ldots, f(t_n))^T$.
One-dimensional image restoration: $K$ = point spread function for an infinitely long slit, limits of integration $= \pm\pi/2$.
Shaw (1972), Hansen (2007)
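
A sketch of the midpoint-rule discretization described above; the specific slit kernel is taken from Shaw (1972) as implemented in Hansen's Regularization Tools and is an assumption here, since the slide does not write it out.

```python
import numpy as np

def shaw_forward_matrix(n):
    """Midpoint-rule discretization of the slit PSF kernel on [-pi/2, pi/2].

    Assumed kernel (Shaw, 1972): K(s, t) = (cos s + cos t)^2 (sin u / u)^2,
    with u = pi (sin s + sin t); quadrature weights w_j = pi/n.
    """
    h = np.pi / n
    pts = -np.pi / 2 + (np.arange(n) + 0.5) * h          # midpoints for s_i and t_j
    S, T = np.meshgrid(pts, pts, indexing="ij")
    U = np.pi * (np.sin(S) + np.sin(T))
    K = (np.cos(S) + np.cos(T)) ** 2 * np.sinc(U / np.pi) ** 2   # sinc handles u -> 0
    return h * K                                          # A[i, j] = w_j K(s_i, t_j)
```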

15 Prior on $x$ taken to be a Gaussian process with power exponential covariance function:
$$x \sim GP(0, \sigma^{-1} R(\cdot, \cdot)), \qquad R(x_i, x_j) = \exp\left\{-\frac{2|x_i - x_j|}{\pi}\right\}$$
Regularization determined by $\lambda = \sigma/\mu$, so we take $a_\mu, b_\mu$ to concentrate $\mu$ about 1 and assign $\sigma$ a vague prior.
Three chains run in parallel, 2,000 iterations each.
Convergence checked via trace plots and PSRF.
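
A sketch of building the prior correlation matrix from this covariance function, assuming the quadrature grid points are used as the locations (the slide writes the arguments generically as x_i, x_j).

```python
import numpy as np

def power_exp_correlation(pts):
    """Correlation matrix with R[i, j] = exp(-2 |t_i - t_j| / pi); pts is a 1-D grid."""
    dists = np.abs(pts[:, None] - pts[None, :])
    return np.exp(-2.0 * dists / np.pi)

# The prior precision entering the Gibbs sampler is Gamma^{-1} = R^{-1} (scaled by sigma),
# with L obtained from, e.g., a Cholesky factorization of R^{-1}.
```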

16 Example: 1D Image Restoration
Figure: Observed data (left panel) and true solution (right panel) for the one-dimensional image restoration example. The solid black line in the left panel represents the noise-free observations Ax.
Shaw (1972), Hansen (2007)

17 Estimates compared: MAP with posterior-mean $\hat\lambda$, MAP with LOO-CV-estimated $\lambda$, and the posterior mean.
Table: Relative error (RE) and root mean square error (RMSE) of the MAP and posterior mean estimates for the one-dimensional image restoration example.

18 Figure: Estimated marginal posterior densities for µ (left panel) and σ (right panel) for the one-dimensional image restoration example.

19 Approximate Sampling from Conditional Distributions
Sampling from the full conditional of $x$ requires $(\mu A^T A + \sigma \Gamma^{-1})^{-1/2} z$, $z \sim N(0, I)$.
When $x$ is high-dimensional, this is computationally expensive.
For difficult conditional distributions, it is common to use Metropolis-Hastings with simpler proposal distributions.
Proposed alternative: find a computationally cheap approximation, and correct for the approximation using M-H.
Metropolis et al. (1953), Hastings (1970), Tierney (1994), Rosenthal (2011), Gelman et al. (2013)

20 Note:
$$\Sigma = (\mu A^T A + \sigma L^T L)^{-1} = L^{-1}(\mu L^{-T} A^T A L^{-1} + \sigma I)^{-1} L^{-T}$$
When $A$ is poorly conditioned, we expect the spectrum of $L^{-T} A^T A L^{-1}$ to decay quickly, i.e.,
$$L^{-T} A^T A L^{-1} = V \Lambda V^T \approx V_k \Lambda_k V_k^T$$
$L^{-T} A^T A L^{-1}$ does not need to be explicitly computed, so the $k$ largest eigenvalues can be found relatively quickly.
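
A sketch of obtaining the k largest eigenpairs of the prior-preconditioned Hessian from matrix-vector products only, here via SciPy's Lanczos-based eigsh on a LinearOperator; solve_L and solve_LT stand in for whatever solves with L and L^T are available and are assumptions, not the talk's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def preconditioned_hessian_eigs(A, solve_L, solve_LT, k):
    """Top-k eigenpairs of H = L^{-T} A^T A L^{-1} without forming H explicitly.

    solve_L(v) returns L^{-1} v; solve_LT(v) returns L^{-T} v.
    """
    n = A.shape[1]

    def matvec(v):
        return solve_LT(A.T @ (A @ solve_L(v)))

    H = LinearOperator((n, n), matvec=matvec, dtype=float)
    vals, vecs = eigsh(H, k=k, which="LM")   # largest-magnitude eigenvalues
    order = np.argsort(vals)[::-1]           # sort descending
    return vals[order], vecs[:, order]
```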

21 Some algebra and the Woodbury formula yield
$$L^{-1}(\mu L^{-T} A^T A L^{-1} + \sigma I)^{-1} L^{-T} \approx \sigma^{-1} L^{-1}\left(I + \frac{\mu}{\sigma} V_k \Lambda_k V_k^T\right)^{-1} L^{-T} = \sigma^{-1} L^{-1}(I - V_k D V_k^T) L^{-T} =: \widehat{\Sigma},$$
where $D = \mathrm{diag}\left(\dfrac{\mu \lambda_j}{\mu \lambda_j + \sigma}\right)$.
Similarly, we can factor $\widehat{\Sigma} = G G^T$ with $G = \sigma^{-1/2} L^{-1}(I - V_k \widehat{D} V_k^T)$, where $\widehat{D} = \mathrm{diag}\left(1 \pm \sqrt{1 - D_{jj}}\right)$.

22 Suggests a Gaussian proposal distribution with an easy-to-compute covariance matrix and factorization:
$$x \mid \mu, \sigma, d \sim N\left(\widehat{\Sigma}(\mu A^T d),\ \widehat{\Sigma}\right)$$
Idea: inside the block Gibbs sampler, use the cheap proposal as an approximation to the target (full conditional) distribution of $x$ in an MH independence sampler.
Fast to sample from this distribution.
Fast to evaluate the likelihood function associated with this distribution.
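
A sketch of sampling from this low-rank proposal given the retained eigenpairs (lam, V); it applies the approximate covariance and its factor G only through solves with L, following the formulas on the previous slide (here the root 1 - sqrt(1 - D_jj) is used in G). The helper names are illustrative.

```python
import numpy as np

def lowrank_proposal_sample(A, d, solve_L, solve_LT, lam, V, mu, sigma, rng):
    """Draw from the approximate conditional N(Sigma_hat * mu * A^T d, Sigma_hat).

    Sigma_hat = sigma^{-1} L^{-1} (I - V D V^T) L^{-T},  D_jj = mu*lam_j / (mu*lam_j + sigma),
    factor  G = sigma^{-1/2} L^{-1} (I - V Dhat V^T),    Dhat_jj = 1 - sqrt(1 - D_jj).
    """
    D = mu * lam / (mu * lam + sigma)
    Dhat = 1.0 - np.sqrt(1.0 - D)

    def sigma_hat_mult(v):
        u = solve_LT(v)
        return solve_L(u - V @ (D * (V.T @ u))) / sigma

    def g_mult(z):
        return solve_L(z - V @ (Dhat * (V.T @ z))) / np.sqrt(sigma)

    mean = sigma_hat_mult(mu * A.T @ d)
    return mean + g_mult(rng.standard_normal(len(mean)))
```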

23 Proposition: The MH acceptance ratio can be computed as $\eta(z, x) = w(z)/w(x)$, where
$$w(x) = \exp\left(-\tfrac{1}{2} x^T (\Sigma^{-1} - \widehat{\Sigma}^{-1}) x\right).$$
Proposition: The MH acceptance ratio can be simplified to
$$\eta(z, x) = \exp\left\{-\frac{\mu}{2} \sum_{j=k+1}^n \lambda_j \left[(v_j^T L z)^2 - (v_j^T L x)^2\right]\right\}.$$
Note: if $z \pm x \perp_L v_j$ for $j > k$, then $\eta(z, x) = 1$.
B et al. (2017+)
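
A sketch of evaluating the simplified acceptance ratio on the log scale; it assumes the trailing eigenpairs (lam_tail, V_tail) for j > k are available, which a real implementation would avoid computing in full (the point of the method is that their contribution is negligible when the spectrum decays fast).

```python
import numpy as np

def log_accept_ratio(z, x, L, lam_tail, V_tail, mu):
    """log eta(z, x) = -(mu/2) * sum_{j>k} lam_j [ (v_j^T L z)^2 - (v_j^T L x)^2 ]."""
    pz = V_tail.T @ (L @ z)      # coefficients v_j^T L z, j = k+1, ..., n
    px = V_tail.T @ (L @ x)
    return -0.5 * mu * np.sum(lam_tail * (pz**2 - px**2))

# Inside the independence MH step: accept z with probability min(1, exp(log_eta)).
```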

24 Proposition: Given the current state of the chain $x$, the expected value and the variance of the acceptance ratio are given by
$$\mu_\eta := E_{z \mid x}[\eta(z, x)] = \frac{1}{N_1 w(x)}, \qquad \sigma_\eta^2 := V_{z \mid x}[\eta(z, x)] = \frac{1}{w^2(x)}\left(\frac{1}{N_2} - \frac{1}{N_1^2}\right),$$
where
$$N_l := \exp\left\{\frac{\mu^2}{2\sigma} \sum_{j=k+1}^n \frac{l\mu\lambda_j}{l\mu\lambda_j + \sigma}\, (d^T A L^{-1} v_j)^2\right\} \prod_{j>k}\left(1 + \frac{l\mu}{\sigma}\lambda_j\right)^{1/2}.$$
B et al. (2017+)

25 Implication: We can bound in probability the deviation of the acceptance ratio from its mean, given the current state $x$. E.g., by Chebyshev's inequality, for any state $x$,
$$\Pr_{z \mid x}\left(\eta(z, x) \in [\mu_\eta \pm 4.47\,\sigma_\eta]\right) \geq 0.95.$$

26 Example: 1-D Image Restoration (again)
Figure: (Left) Spectrum of the prior-preconditioned Hessian matrix for the one-dimensional image restoration example. (Right) Rejection rates computed with different eigenvalue truncations $k$, $\log_{10}$ scale: predicted rate $1 - \min(1, E[\eta])$ versus empirical rate $1 - E[\min(1, \eta)]$. The approximate 95% pointwise error upper bound is given by the dashed line. Note that the lower bound for the pointwise error is negative and therefore not plotted.

27 Proposition: The target density $h(x)$ and the proposal density $q(x)$ can be bounded as $h(x) \leq N_1 q(x)$ for all $x$, where $N_1 \geq 1$ is given by
$$N_1 := \exp\left\{\frac{\mu^2}{2\sigma} \sum_{j=k+1}^n \frac{\mu\lambda_j}{\mu\lambda_j + \sigma}\, (d^T A L^{-1} v_j)^2\right\} \prod_{j>k}\left(1 + \frac{\mu}{\sigma}\lambda_j\right)^{1/2}.$$
B et al. (2017+)

28 Implication: The subchain produced by our proposed sampler has stationary distribution $\pi(\cdot \mid d, \mu, \sigma)$ and is uniformly ergodic:
$$\|K^p(x, \cdot) - \pi(\cdot \mid d, \mu, \sigma)\|_{TV} \leq 2\left(1 - N_1^{-1}\right)^p \quad \forall x \in \operatorname{supp} \pi,$$
where $K^p(x, \cdot)$ is the $p$-step transition kernel starting from $x$ and $\|\cdot\|_{TV}$ denotes the total variation norm.
This suggests also that the approximating distribution $q(\cdot)$ can be used in an accept-reject algorithm instead of the independence MH sampler.
For a given candidate density, an independence MH algorithm is superior to an accept-reject algorithm in terms of sampling efficiency.
Robert and Casella (2004)

29 Example: EEG Source Localization
Distribution of source locations (dipoles) inside the brain generated by neuronal activity.
Current sources create electric potentials that are measured by electrodes on the scalp.
Model: $d = Ax + \text{noise}$
$d \in \mathbb{R}^m$ represents the electrode measurements at different locations along the scalp
$x \in \mathbb{R}^n$ represents the current sources on a discretized grid in the brain
$A \in \mathbb{R}^{m \times n}$, $m < n$, is the leadfield matrix determined by the conductivity and geometry of the head
The bioelectromagnetic inverse problem has no unique solution.
Hauk (2004)

30 Simulate randomly oriented dipoles located on an intracerebral source grid.
A realistic head model is used to account for the physics and create the leadfield matrix using the boundary element method.
Noise is added to the observed scalp potentials.
Software: German Gomez-Herrero EEG Tutorial: io/tutorials/dipoles/tutorial_dipoles.htm; Fieldtrip MATLAB Toolbox for EEG.
Oostendorp and Oosterom (1989), Hauk (2004)


32 dim(d) = 257, dim(x) = 1261; retain 257 eigenvalues.
Hyperparameters on µ set to concentrate about 1.
Hyperparameters on σ taken to be vague.
Prior on $x$ taken to be $N(0, \sigma^{-1} I)$ (the usual ridge penalty).
For both samplers, three chains are run in parallel for 2,000 iterations each.
Discard the first 500 iterations as the burn-in period.
Regularization term for MAP determined from the MCMC output.

33 λ = σ/µ
Parameter  PSRF
µ          1.00
σ
x          1.577

34 Figure: Solutions to the EEG inverse problem using block Gibbs (left panel), Hastings-within-Gibbs (middle panel), and MAP (right panel).

35 Efficiency
Algorithm      Wall Time (s)    CES
Block Gibbs
MHwG
CES = cost per effective sample = (integrated autocorrelation time) × (total computation time) / (length of chain)
Note: effective sample size = (chain length) / (autocorrelation time)
Kass et al. (1998), Carlin and Louis (2009), Fox and Norton (2016)
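
A sketch of the two quantities defined above, with the integrated autocorrelation time estimated from empirical autocorrelations of a scalar summary of the chain; the simple truncation rule is an illustrative choice, not the one used in the talk.

```python
import numpy as np

def integrated_autocorr_time(chain, max_lag=None):
    """IACT = 1 + 2 * sum of autocorrelations, truncated at the first negative value."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    max_lag = max_lag or n // 3
    acf = np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x) for k in range(1, max_lag)])
    first_neg = np.argmax(acf < 0) if np.any(acf < 0) else len(acf)
    return 1.0 + 2.0 * np.sum(acf[:first_neg])

def cost_per_effective_sample(chain, wall_time_s):
    """CES = IACT * total computation time / chain length = wall time / ESS."""
    tau = integrated_autocorr_time(chain)
    ess = len(chain) / tau
    return wall_time_s / ess
```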

36 Example: Computed Tomography
An X-ray is passed through a body from a source (s = 0) to a sensor (s = S) along a line $L(x, \omega)$ determined by angle and distance with respect to a fixed origin.
Intensity $I$ is attenuated due to absorption by the tissue through which the X-ray passes:
$$I(S) = I(0) \exp\left\{-\int_0^S \eta(x(s))\, ds\right\}$$
$z = -\log(I(S)/I(0))$ yields the Radon transform model for CT:
$$z(x, \omega) = \int_{L(x, \omega)} \eta(x(s))\, ds$$
Reconstructing $\eta(\cdot)$ creates an image of the body that was scanned.
Discretized version: $z = Ax + \text{noise}$.
Kaipio and Somersalo (2005), Bardsley (2011)

37 Simulated Computed Tomography Data
Use the Shepp-Logan phantom as the true image.
The target image is discretized so that dim(x) = .
Simulate observed data (Radon transform model) over discretized lines and angles so that dim(z) = 5000.
Data-generating model: $z = Ax + e$, where $e \sim N(0, \mu^{-1} I)$, $\mu^{-1/2} = 0.01\,\|Ax\|$.
Kaipio and Somersalo (2005), Rue and Held (2005), Bardsley (2011)


39 Only considering MH-within-Gibbs with 5000 < dim(x) eigenvalues retained (block Gibbs is not feasible).
Hyperparameters on µ set to concentrate about 1.
Hyperparameters on σ taken to be vague.
Prior on $x$ taken to be $N(0, (L^T L)^{-1})$, where $L$ = discrete Laplacian (GMRF), to enforce smoothness a priori.
One chain run for 2,000 iterations.
Discard the first 1000 iterations as the burn-in period.
Regularization term for MAP determined from the MCMC output.
Rue and Held (2005)

40 Figure: MAP estimate (left panel) and posterior mean (right panel).

41 Estimate         Relative Error    RMSE
MAP
Posterior Mean
Wall time for the MHwG sampler ≈ 45 min.
The posterior provides access to almost any estimator we want and quantifies the associated uncertainty.

42 Features of the Approach
The eigenvalue decomposition does not require explicitly computing $L^{-T} A^T A L^{-1}$, but only matrix-vector products.
Finding the eigenvalues is a precomputation before iterating. Once found, the proposal is cheap for any given µ and σ.
For ill-posed problems, the acceptance probability is close to (or exactly equal to) one.
We are still modeling the full dimension of x, not projecting onto a lower-dimensional subspace.
Exploiting the nature of the forward model.
This approach allows incorporation of strong or vague prior information about the solution through specification of the prior covariance (precision) matrix:
Prior information determined from fMRI can help to solve the EEG problem
Prior smoothness assumptions through a GP prior or Laplacian
Dale et al. (2000), Banerjee et al. (2008), Higdon et al. (2008), Banerjee et al. (2012)

43 Q: Which estimator to use?
A: It depends on what you want from your estimate. From a Bayesian (decision-theoretic) perspective, the posterior mode (MAP) is not always the optimal estimator:
Squared error loss (i.e., estimation/prediction) → posterior mean
Absolute error loss → posterior median
0-1 loss (i.e., thresholding) → posterior mode
etc.
The mean will not zero out elements of x in regularized solutions as much as the mode (but we can still determine thresholding rules via, e.g., k-means clustering).
"Estimators of sparse objects need not be sparse themselves in order to yield excellent performance." - Carvalho, Polson, and Scott (2010)
Bhattacharya et al. (2015)

44 Variance Component Priors
We use Gamma priors on µ and σ for convenience.
It has been noted that a Gamma prior for σ can be inappropriate for situations in which x is thought to be sparse: it can artificially force σ away from zero.
Alternatives include:
Jeffreys prior: $\pi(\mu, \sigma) \propto \mu^{-1}(\mu + \sigma)^{-1}$
Proper Jeffreys prior: $\pi(\mu, \sigma) \propto (\mu + \sigma)^{-2}$
Half-Cauchy: $\pi(\sigma^{1/2} \mid \mu^{1/2}) \propto \left(1 + \dfrac{\sigma}{\mu}\right)^{-1}$
Tiao and Tan (1965), Gelman (2006), Scott and Berger (2006), Polson and Scott (2010)

45 Global-Local Scale Mixtures
$d \sim N(Ax, \mu^{-1} I)$, $x_j \sim N(0, \eta^2 \tau_j^2)$, $j = 1, \ldots, n$.
The choice of prior on $\tau_j$ produces different sparsity properties. Examples:
Laplace prior (LASSO): $\tau_j^2 \sim \mathrm{Exp}(\lambda/2)$
t prior: $\tau_j^2 \sim \mathrm{InvGamma}(\lambda/2, \lambda/2)$
Horseshoe prior: $\tau_j \sim \mathrm{HalfCauchy}(0, \lambda)$
Generalized Double Pareto: $\tau_j \sim \mathrm{Exp}(\lambda^2/2)$, $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$
Spike-slab prior: $\tau_j = \gamma_j \lambda$, $\gamma_j \sim \mathrm{Bernoulli}(\pi)$
Etc.
How to extend the low-rank approximation approach to the global-local framework?
George and McCulloch (1993), Tibshirani (1996), Park and Casella (2008), Polson and Scott (2010), Carvalho et al. (2010), Armagan et al. (2013)

46 Brown, D. A., Saibaba, A. K., and Vallélian, S. (2017+), "Computationally efficient Markov chain Monte Carlo methods for hierarchical Bayesian inverse problems," submitted, arXiv.
Thank you!
