MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!)
AG Imaging Seminar, Felix Lucka, 26.06.2012

Sparsity Constraints in Inverse Problems

Current trend in high-dimensional inverse problems: sparsity constraints.
- Total Variation (TV) imaging: sparsity constraints on the gradient of the unknowns.
- Compressed sensing: high-quality reconstructions from a small amount of data if a sparse basis/dictionary is a priori known (e.g., wavelets).
(Some nice images here!)

Sparsity Constraints in Inverse Problems

Commonly applied formulation and analysis by means of variational regularization, mostly by incorporating L1-type norms:

$$\hat{u}_\alpha = \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ \| m - A u \|_2^2 + \alpha \| D u \|_1 \right\},$$

assuming additive Gaussian i.i.d. noise $\mathcal{N}(0, \sigma^2)$.

Notation:
- $m \in \mathbb{R}^k$: the given noisy measurement data.
- $u \in \mathbb{R}^n$: the unknowns to recover w.r.t. the chosen discretization.
- $A \in \mathbb{R}^{k \times n}$: discretization of the forward operator w.r.t. the domains of $u$ and $m$.
- $D \in \mathbb{R}^{l \times n}$: discrete formulation of the mapping onto the (potentially) sparse quantity.

Sparsity Constraints in Inverse Problems

Sparsity constraints relying on L1-type norms can also be formulated in the Bayesian framework.

Likelihood model: $M = A u + E$ with $E \sim \mathcal{N}(0, \sigma^2 I_k)$, i.e.,
$$p_{\mathrm{li}}(m \mid u) \propto \exp\left( -\frac{1}{2 \sigma^2} \| m - A u \|_2^2 \right).$$

Prior model:
$$p_{\mathrm{pr}}(u) \propto \exp\left( -\lambda \| D u \|_1 \right).$$

Resulting posterior:
$$p_{\mathrm{post}}(u \mid m) \propto \exp\left( -\frac{1}{2 \sigma^2} \| m - A u \|_2^2 - \lambda \| D u \|_1 \right).$$

[Figure: 1D densities for comparison: (a) Gaussian $\exp(-\frac{1}{2} u^2)$, (b) Laplacian $\exp(-|u|)$, (c) Cauchy $1/(1 + u^2)$.]
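To make the model concrete, here is a minimal sketch (not the author's code; all argument names are placeholders) of evaluating the unnormalized log-posterior defined above:

```python
# Minimal sketch: unnormalized log-posterior
#   log p_post(u | m) = -||m - A u||_2^2 / (2 sigma^2) - lam * ||D u||_1 + const.
# A, D, m, sigma, lam are placeholders for the quantities defined above.
import numpy as np

def log_posterior(u, m, A, D, sigma, lam):
    """Unnormalized log-posterior for the Gaussian likelihood / L1-type prior model."""
    residual = m - A @ u
    return -0.5 / sigma**2 * (residual @ residual) - lam * np.sum(np.abs(D @ u))
```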

Likelihood: $\exp\left( -\frac{1}{2 \sigma^2} \| m - A u \|_2^2 \right)$. Prior: $\exp\left( -\lambda \| u \|_1 \right)$ ($\lambda$ chosen via the discrepancy principle). Posterior: $\exp\left( -\frac{1}{2 \sigma^2} \| m - A u \|_2^2 - \lambda \| u \|_1 \right)$.

Sparsity Constraints in Inverse Problems

Direct correspondence to variational regularization via the maximum a posteriori (MAP) inference strategy:

$$\hat{u}_{\mathrm{MAP}} := \operatorname*{argmax}_{u \in \mathbb{R}^n} p_{\mathrm{post}}(u \mid m) = \operatorname*{argmax}_{u \in \mathbb{R}^n} \exp\left( -\frac{1}{2 \sigma^2} \| m - A u \|_2^2 - \lambda \| D u \|_1 \right) = \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ \| m - A u \|_2^2 + 2 \sigma^2 \lambda \| D u \|_1 \right\}$$

⇒ Properties of the MAP estimate (e.g., discretization invariance) are well understood.

Sparsity Constraints in Inverse Problems

But there is more to Bayesian inference:
- Conditional mean (CM) estimates
- Confidence interval estimates
- Conditional covariance estimates
- Histogram estimates
- Generalized Bayes estimators
- Marginalization
- Model selection or averaging
- Experiment design

Influence of sparsity constraints on these quantities: less well understood.

M. Lassas and S. Siltanen, 2004. Can one use total variation prior for edge-preserving Bayesian inversion?
M. Lassas, E. Saksman, and S. Siltanen, 2009. Discretization invariant Bayesian inversion and Besov space priors.
V. Kolehmainen, M. Lassas, K. Niinimäki, and S. Siltanen, 2012. Sparsity-promoting Bayesian inversion.

Sparsity Constraints in Inverse Problems

Key issue: Examining L1-type priors might help to further understand the relation between variational regularization theory and Bayesian inference.

Key problem: Bayesian inference relies on computing integrals w.r.t. the high-dimensional posterior $p_{\mathrm{post}}$. Standard Monte Carlo integration techniques break down for ill-posed, high-dimensional problems with sparsity constraints.

This talk summarizes partial results from: F. Lucka, 2012. Fast MCMC sampling for sparse Bayesian inference in high-dimensional inverse problems using L1-type priors. Submitted to Inverse Problems; arXiv:1206.0262v1.

Monte Carlo Integration in a Nutshell

$$\mathbb{E}[f(x)] = \int_{\mathbb{R}^n} f(x)\, p(x)\, \mathrm{d}x$$

Traditional Gauss-type quadrature: construct a suitable grid $\{x_i\}_i$ with weights $\omega_i$ w.r.t. $\omega(x) := p(x)$ and approximate by $\sum_{i=1}^K \omega_i f(x_i)$.
⇒ Grid construction and evaluation are infeasible in high dimensions.

Monte Carlo integration idea: generate a suitable grid $\{x_i\}_i$ w.r.t. $p(x)$ by drawing $x_i \sim p(x)$ and approximate by $\frac{1}{K} \sum_{i=1}^K f(x_i)$. By the law of large numbers,

$$\frac{1}{K} \sum_{i=1}^K f(x_i) \xrightarrow{K \to \infty} \mathbb{E}_{p(x)}[f(x)] = \int_{\mathbb{R}^n} f(x)\, p(x)\, \mathrm{d}x$$

in $L^1$, with rate $\mathcal{O}(K^{-1/2})$ (independent of $n$).
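As a toy illustration of the Monte Carlo idea (a sketch under the assumption that independent samples from $p$ are available; not part of the talk):

```python
# Plain Monte Carlo integration, assuming independent samples x_i ~ p(x) can be
# drawn directly (here p is a standard Gaussian in n dimensions).
import numpy as np

def mc_expectation(f, draw_sample, K):
    """Approximate E_p[f(x)] by the average of f over K independent samples."""
    return np.mean([f(draw_sample()) for _ in range(K)])

# Example: for x ~ N(0, I_n), E[||x||_2^2] = n; the error decays like O(K^{-1/2}).
rng = np.random.default_rng(0)
n = 10
estimate = mc_expectation(lambda x: x @ x, lambda: rng.standard_normal(n), K=10_000)
```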

Markov Chain Monte Carlo

Not able to draw independent samples? With $\{x_i\}_i$ being an ergodic Markov chain, it still works! Markov chain Monte Carlo (MCMC) methods are algorithms to construct such a chain:
- A huge number of MCMC methods exists; there is no universal method.
- Most methods rely on one of two basic schemes: Metropolis-Hastings (MH) sampling [Metropolis et al., 1953; Hastings, 1970] and Gibbs sampling [Geman & Geman, 1984].
- Posteriors from inverse problems seem to be special.

In this talk: comparison between the most basic variants of MH and Gibbs sampling for high-dimensional posteriors from inverse problem scenarios.

Symmetric Random-Walk Metropolis-Hastings Sampling

Given: a density $p(x)$, $x \in \mathbb{R}^n$, to sample from. Let $p_{\mathrm{pro}}(z)$ be a symmetric density in $\mathbb{R}^n$ and $x_0 \in \mathbb{R}^n$ an initial state. Define burn-in size $K_0$ and sample size $K$.

For $i = 1, \ldots, K_0 + K$ do:
1. Draw $z$ from $p_{\mathrm{pro}}(z)$ and set $y = x_{i-1} + z$.
2. Compute the acceptance ratio $r = p(y) / p(x_{i-1})$.
3. Draw $\theta \in [0, 1]$ from a uniform probability density.
4. If $r \geq \theta$, set $x_i = y$; else set $x_i = x_{i-1}$.

Return $x_{K_0 + 1}, \ldots, x_{K_0 + K}$.

- Requires one evaluation of $p(x)$ and one sample from $p_{\mathrm{pro}}$ per step; no real knowledge about $p$ is needed, not even its normalization.
- Black-box sampling algorithm; the most widely used one.
- Good performance requires careful tuning of $p_{\mathrm{pro}}$.
- Basis for very sophisticated sampling algorithms.
- Simulated annealing for global optimization works in the same way.

In this talk: $p_{\mathrm{pro}} = \mathcal{N}(0, \kappa^2 I_n)$; a sketch of this random-walk sampler is given below.
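The following is a structural sketch of the sampler just described with the Gaussian proposal (an illustration only, not the implementation used for the experiments; log_p, kappa, K0, K are user-supplied, and the acceptance test is done on the log scale for numerical stability):

```python
# Symmetric random-walk Metropolis-Hastings with proposal N(0, kappa^2 I_n).
# log_p is any unnormalized log-density (e.g. the log_posterior sketch above).
import numpy as np

def rw_metropolis_hastings(log_p, x0, kappa, K0, K, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    log_px = log_p(x)
    samples = []
    for i in range(K0 + K):
        y = x + kappa * rng.standard_normal(x.shape)   # draw z ~ N(0, kappa^2 I_n)
        log_py = log_p(y)
        # accept with probability min(1, p(y)/p(x)), evaluated on the log scale
        if np.log(rng.uniform()) <= log_py - log_px:
            x, log_px = y, log_py
        if i >= K0:                                     # discard the burn-in phase
            samples.append(x.copy())
    return np.array(samples)
```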

Evaluate the performance of a sampler via its autocorrelation function (acf): let $g : \mathbb{R}^n \to \mathbb{R}$ and $g_i := g(u_i)$, $i = 1, \ldots, K$; then

$$R(\tau) := \frac{1}{(K - \tau)\, \hat{\varrho}} \sum_{i=1}^{K - \tau} (g_i - \hat{\mu})(g_{i + \tau} - \hat{\mu}) \qquad (\text{lag-}\tau\text{ acf w.r.t. } g).$$

[Figure: (a) stochastic processes with correlation lengths 4, 16, and 32, and (b) their autocorrelation functions.]

A rapid decay of $R(\tau)$ means that samples decorrelate fast.

Temporal acf (tacf): the acf rescaled by the computation time per sample, $t_s$:
$$R^*(t) := R(t / t_s) \quad \text{for all } t = i\, t_s,\ i \in \{0, \ldots, K - 1\}.$$

Use $g(u) := \langle \nu_1, u \rangle$, where $\nu_1$ is the eigenvector corresponding to the largest eigenvalue of the covariance matrix of $p(x)$, to test the worst case.

[Figure: (c) stochastic processes with correlation lengths 4, 16, and 32, and (d) their autocorrelation functions.]
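A small sketch of how the empirical acf above can be computed (assuming $\hat{\varrho}$ is the sample variance, so that $R(0) = 1$; for the tacf, the lag axis is simply rescaled by the measured time per sample $t_s$):

```python
# Empirical lag-tau autocorrelation R(tau) of a scalar chain g_i = g(u_i).
import numpy as np

def autocorrelation(g, max_lag):
    g = np.asarray(g, dtype=float)
    K = len(g)
    mu, var = g.mean(), g.var()     # sample mean and variance (normalization)
    return np.array([np.sum((g[:K - tau] - mu) * (g[tau:] - mu)) / ((K - tau) * var)
                     for tau in range(max_lag + 1)])

# Temporal acf: R*(t) = R(t / t_s), i.e. plot R against tau * t_s instead of tau.
```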

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)
- Model of a charge-coupled device (CCD) in 1D.
- Unknown light intensity $\tilde{u} : [0, 1] \to \mathbb{R}_+$, an indicator function on $[\frac{1}{3}, \frac{2}{3}]$.
- Integrated into $k = 30$ CCD pixels covering $[\frac{1}{k+2}, \frac{k+1}{k+2}] \subset [0, 1]$.
- Noise is added.
- $\tilde{u}$ is reconstructed on a regular, $n$-dimensional grid.
- $D$ is the forward finite-difference operator with Neumann boundary condition.

$$p_{\mathrm{post}}(u \mid m) \propto \exp\left( -\frac{1}{2 \sigma^2} \| m - A u \|_2^2 - \lambda \| D u \|_1 \right)$$

[Figure: (e) the unknown function $\tilde{u}(t)$; (f) the measurement data $m$.]
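One possible way to assemble discrete operators for such a setting is sketched below; the grid conventions, pixel layout, scaling, and boundary handling are illustrative assumptions, not taken from the talk or the paper:

```python
# Hedged sketch of the discrete operators for a 1D CCD-type deblurring problem:
# A averages the n-point discretization of u over k coarse pixels (requires n >= k),
# and D is a forward finite-difference matrix (one convention for a Neumann-type BC).
import numpy as np

def ccd_forward_matrix(n, k):
    """Each of the k pixel rows averages the fine-grid values it covers."""
    A = np.zeros((k, n))
    edges = np.linspace(0, n, k + 1).astype(int)
    for i in range(k):
        A[i, edges[i]:edges[i + 1]] = 1.0 / (edges[i + 1] - edges[i])
    return A

def forward_difference_matrix(n):
    """(Du)_j = u_{j+1} - u_j for j = 1, ..., n-1; the boundary row is dropped."""
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D
```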

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Figure: autocorrelation plots $R(\tau)$ for the MH sampler and $n = 63$, with $\lambda = 100, 200, 400$ (lags up to $\tau = 10^5$).

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Figure: temporal autocorrelation plots $R^*(t)$ for the MH sampler, for $(n, \lambda) = (127, 280), (255, 400), (511, 560), (1023, 800)$ (time axis from $10^0$ to $10^6$ s, log scale).

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Results: the efficiency of MH samplers dramatically decreases when $\lambda$ or $n$ increase. Even for moderate $n$, most inference procedures become infeasible.

What else can we do?
- More sophisticated variants of MH sampling?
- Sample surrogate hyperparameter models?
- Try out the other basic scheme: Gibbs sampling.

Single-Component Gibbs Sampling

Given: a density $p(x)$, $x \in \mathbb{R}^n$, to sample from. Let $x_0 \in \mathbb{R}^n$ be an initial state. Define burn-in size $K_0$ and sample size $K$.

For $i = 1, \ldots, K_0 + K$ do:
1. Set $x_i := x_{i-1}$.
2. For $j = 1, \ldots, n$ do:
   (i) Draw $s$ randomly from $\{1, \ldots, n\}$ (random scan).
   (ii) Draw $(x_i)_s$ from the conditional, 1-dimensional density $p(\,\cdot \mid (x_i)_{[-s]})$.

Return $x_{K_0 + 1}, \ldots, x_{K_0 + K}$.

In order to be fast, one needs to be able
1. to compute the 1-dimensional conditional densities fast and explicitly, and
2. to sample from these 1-dimensional densities fast, robustly, and exactly.

Point 2 turned out to be rather nasty, involved, and time-consuming to implement; details can be found in the paper. A structural sketch of the scheme follows below.
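Below is a structural sketch of the random-scan single-component Gibbs sampler (the exact 1D conditional sampling used in the paper is replaced by a crude grid-based inverse-CDF stand-in; all names and the grid approach are illustrative assumptions):

```python
# Random-scan single-component Gibbs sampling. sample_conditional(x, s) must return
# a draw from the 1-dim conditional density of component s given the other components.
import numpy as np

def grid_conditional_sampler(log_p, grid, rng):
    """Crude stand-in: draw component s from p(x_s | x_[-s]) via a grid inverse CDF."""
    def sample(x, s):
        logs = np.array([log_p(np.concatenate([x[:s], [t], x[s + 1:]])) for t in grid])
        w = np.exp(logs - logs.max())
        cdf = np.cumsum(w)
        cdf /= cdf[-1]
        return grid[np.searchsorted(cdf, rng.uniform())]
    return sample

def single_component_gibbs(sample_conditional, x0, K0, K, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float).copy()
    n, samples = len(x), []
    for i in range(K0 + K):
        for _ in range(n):
            s = rng.integers(n)                 # random scan over the components
            x[s] = sample_conditional(x, s)
        if i >= K0:
            samples.append(x.copy())
    return np.array(samples)
```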

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Figure: autocorrelation plots $R(\tau)$ for the Gibbs sampler and $n = 63$, with $\lambda = 100, 200, 400$ (lags up to $\tau = 1000$).

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Figure: temporal autocorrelation plots $R^*(t)$ for $n = 63$, comparing the MH and Gibbs samplers for $\lambda = 100, 200, 400$ (time axis from 0 to 3 s).

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Figure: temporal autocorrelation plots $R^*(t)$ for the Gibbs sampler, for $(n, \lambda) = (127, 280), (255, 400), (511, 560), (1023, 800)$ (time axis from 0 to 4 s).

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

Figure: temporal autocorrelation plots $R^*(t)$ comparing the MH sampler ("MH Iso") and the random-scan Gibbs sampler ("RnGibbs") for $(n, \lambda) = (127, 280), (255, 400), (511, 560), (1023, 800)$ (time axis from $10^0$ to $10^6$ s, log scale).

Total Variation Deblurring Example in 1D (from Lassas & Siltanen, 2004)

The new sampler can be used to address theoretical questions. Lassas & Siltanen, 2004: for $\lambda_n \propto \sqrt{n + 1}$, the TV prior converges to a smoothness prior in the limit $n \to \infty$. Computing the CM estimate by MH sampling for $n = 63, 255, 1023, 4095$: even after a month of computation time, only partly satisfying results.

Figure: CM estimates computed with the Gibbs sampler on a comparable CPU, together with the true $u$: $n = 63$ ($K = 500\,000$, $K_0 = 200$, $t = 21$ min), $n = 255$ ($K = 100\,000$, $K_0 = 50$, $t = 16$ min), $n = 1023$ ($K = 50\,000$, $K_0 = 20$, $t = 45$ min), $n = 4095$ ($K = 10\,000$, $K_0 = 20$, $t = 34$ min), $n = 16383$ ($K = 5000$, $K_0 = 20$, $t = 1$ h 51 min), $n = 65535$ ($K = 1000$, $K_0 = 20$, $t = 3$ h 26 min).

Image Deblurring Example in 2D
- Unknown function $\tilde{u}$, measurement data $m$.
- Gaussian blurring kernel.
- Relative noise level of 10%.
- Reconstruction using $n = 511 \times 511 = 261\,121$ unknowns.

Image Deblurring Example in 2D

Figure: CM estimates computed by the MH sampler after (a) 1 h, (b) 5 h, and (c) 20 h of computation time.

Image Deblurring Example in 2D

Figure: CM estimates computed by the Gibbs sampler after (a) 1 h, (b) 5 h, and (c) 20 h of computation time.

Conclusions & Outlook
- MH is a black-box sampler. It may fail dramatically in specific scenarios. But this is not a general feature of MCMC!
- Gibbs samplers incorporate more posterior-specific information into the sampling and perform far better.
- Promising results in dimensions larger than any previously reported use for L1-type inverse problems ($n > 1\,000\,000$ still works...).
⇒ Results challenge common beliefs about MCMC in general.

Work to do:
- Real applications: sparse tomography using Besov space priors as in [Kolehmainen, Lassas, Niinimäki, Siltanen, 2012].
- Tackle theoretical questions, e.g., how the stair-casing effect of TV can be seen from a Bayesian perspective.
- Comparison to more sophisticated variants of MH and Gibbs schemes.
- Generalization to arbitrary $D$ in $\| D u \|_1$.

Thank you for your attention!

Full results and all details in: F. Lucka, 2012. Fast MCMC sampling for sparse Bayesian inference in high-dimensional inverse problems using L1-type priors. Submitted to Inverse Problems; arXiv:1206.0262v1.
- More sampling methods.
- 2D deblurring with $n = 511^2 = 261\,121$.
- Nasty details of the Gibbs sampler!
- Implementation and code.