Sample-based UQ via Computational Linear Algebra (not MCMC) Colin Fox fox@physics.otago.ac.nz Al Parker, John Bardsley March 2012

Contents: Example: sample-based inference in a large-scale problem. Sampling built on computational linear algebra. Gibbs sampling and Gauss-Seidel. Polynomial acceleration of GS ... and of MCMC. Sampling based on (P)CG, computing A^{-1/2}, BFGS, PGCG, for Gaussians.

Ocean circulation :: 2 samples from the posterior. Data on tracers, assert physics and observational models, infer abyssal advection. [Figure: two posterior samples of the oxygen field, run 1 and run 2.] McKeague, Nicholls, Speer, Herbei 2005, Statistical Inversion of South Atlantic Circulation in an Abyssal Neutral Density Layer

Gibbs sampling from normal distributions. Gibbs sampling repeatedly samples from (block) conditional distributions; a forward and reverse sweep simulates a reversible kernel (self-adjoint wrt π). Normal distributions:
π(x) = (det(A) / (2π)^n)^{1/2} exp{ -(1/2) x^T A x + b^T x }
with precision matrix A and covariance matrix Σ = A^{-1} (both SPD); wlog treat zero mean. Particularly interested in the case where A is sparse (GMRF) and n is large. When π^{(0)} is also normal, then so is the n-step distribution: A^{(n)} → A and Σ^{(n)} → Σ. In what sense is stochastic relaxation related to relaxation? What decomposition of A is this performing? (Gibbs sampling: Glauber 1963 heat-bath algorithm, Turcin 1971, Geman and Geman 1984)
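For reference, here is a minimal Cholesky-based sampler for N(0, A^{-1}), the baseline that appears in the flop-count comparison later in the talk. This is only a sketch, assuming a modest-size dense SPD precision matrix A; the function name is illustrative.

```python
import numpy as np

def cholesky_sample(A, rng=np.random.default_rng(0)):
    """Draw x ~ N(0, A^{-1}) by factoring the precision matrix A = R^T R
    and solving R x = z with z ~ N(0, I), so Cov(x) = R^{-1} R^{-T} = A^{-1}."""
    R = np.linalg.cholesky(A).T          # upper-triangular factor of A
    z = rng.standard_normal(A.shape[0])
    return np.linalg.solve(R, z)
```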

Matrix formulation of Gibbs sampling from N(0, A^{-1}). Let y = (y_1, y_2, ..., y_n)^T. Component-wise Gibbs updates each component in sequence from the (normal) conditional distributions. One sweep over all n components can be written
y^{(k+1)} = -D^{-1} L y^{(k+1)} - D^{-1} L^T y^{(k)} + D^{-1/2} z^{(k)}
where D = diag(A), L is the strictly lower triangular part of A, and z^{(k)} ~ N(0, I), i.e.
y^{(k+1)} = G y^{(k)} + c^{(k)}
where c^{(k)} is iid noise with zero mean and finite covariance. Spot the similarity to the Gauss-Seidel iteration for solving Ax = b:
x^{(k+1)} = -D^{-1} L x^{(k+1)} - D^{-1} L^T x^{(k)} + D^{-1} b
Goodman & Sokal 1989; Amit & Grenander 1991
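As a concrete illustration of the sweep written above, here is a minimal dense NumPy sketch (assuming an SPD precision matrix A; function names are illustrative) showing that one Gibbs sweep and one Gauss-Seidel sweep share the same triangular solve, differing only in the right-hand side.

```python
import numpy as np
from scipy.linalg import solve_triangular

def gibbs_sweep(A, y, rng):
    """One forward component-wise Gibbs sweep for N(0, A^{-1}), in matrix form:
    (D + L) y_new = -L^T y + D^{1/2} z, with z ~ N(0, I)."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)                     # strictly lower triangular part of A
    z = rng.standard_normal(len(y))
    rhs = -L.T @ y + np.sqrt(np.diag(A)) * z
    return solve_triangular(D + L, rhs, lower=True)

def gauss_seidel_sweep(A, x, b):
    """One Gauss-Seidel sweep for A x = b: (D + L) x_new = -L^T x + b."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)
    return solve_triangular(D + L, -L.T @ x + b, lower=True)
```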

Gibbs converges ⇔ solver converges.
Theorem 1. Let A = M - N with M invertible. The stationary linear solver
x^{(k+1)} = M^{-1} N x^{(k)} + M^{-1} b = G x^{(k)} + M^{-1} b
converges if and only if the random iteration
y^{(k+1)} = M^{-1} N y^{(k)} + M^{-1} c^{(k)} = G y^{(k)} + M^{-1} c^{(k)}
converges in distribution. Here c^{(k)} iid∼ π_n has zero mean and finite variance.
Proof. Both converge iff ϱ(G) < 1.
Convergent splittings generate convergent (generalized) Gibbs samplers. The mean converges with asymptotic convergence factor ϱ(G), the covariance with ϱ(G)^2.
Young 1971 Thm 3-5.1; Duflo 1997 Thm 2.3.18-4; Goodman & Sokal 1989; Galli & Gao 2001; Fox & Parker 2012
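A small numerical check of Theorem 1, assuming a toy strictly diagonally dominant precision matrix and the Jacobi splitting M = D, with the noise covariance Var(c^{(k)}) = M^T + N = 2D - A that targets N(0, A^{-1}) (as in the table on the next slide); all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[4., -1., 0.],
              [-1., 4., -1.],
              [0., -1., 4.]])          # toy SPD, strictly diagonally dominant
M = np.diag(np.diag(A))                # Jacobi splitting: M = D, N = M - A
N = M - A
G = np.linalg.solve(M, N)
print(np.max(np.abs(np.linalg.eigvals(G))))   # rho(G) < 1: solver and sampler converge

Vc = M.T + N                           # noise covariance targeting N(0, A^{-1}) (= 2D - A)
Rc = np.linalg.cholesky(Vc)
y, ys = np.zeros(3), []
for k in range(20000):
    c = Rc @ rng.standard_normal(3)
    y = G @ y + np.linalg.solve(M, c)  # the random iteration of Theorem 1
    if k > 100:                        # discard burn-in
        ys.append(y.copy())
print(np.cov(np.array(ys).T))          # approximately A^{-1}
print(np.linalg.inv(A))
```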

Average contractive iterated random functions: Diaconis & Freedman SIAM Review 1999; Stenflo JDEA 2012

Some not so common Gibbs samplers for N(0, A^{-1})

splitting/sampler | M | Var(c^{(k)}) = M^T + N | converge if
Richardson | (1/ω) I | (2/ω) I - A | 0 < ω < 2/ϱ(A)
Jacobi | D | 2D - A | A SDD
GS/Gibbs | D + L | D | always
SOR/B&F | (1/ω) D + L | ((2-ω)/ω) D | 0 < ω < 2
SSOR/REGS | (ω/(2-ω)) M_SOR D^{-1} M_SOR^T | (ω/(2-ω)) (M_SOR D^{-1} M_SOR^T + N_SOR^T D^{-1} N_SOR) | 0 < ω < 2

Want: convenient to solve Mu = r and to sample from N(0, M^T + N). The relaxation parameter ω can accelerate Gibbs. SSOR is a forwards and backwards sweep of SOR, giving a symmetric splitting.
SOR: Adler 1981; Barone & Frigessi 1990; Amit & Grenander 1991. SSOR: Roberts & Sahu 1997
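A minimal sketch of the SOR row of this table, assuming a dense SPD precision matrix; names are illustrative and a practical implementation would use sparse triangular solves. With ω = 1 it reduces to the GS/Gibbs row.

```python
import numpy as np

def sor_splitting(A, omega):
    """SOR splitting from the table: M = D/omega + L, N = M - A,
    with noise covariance Var(c) = M^T + N = ((2 - omega)/omega) * D."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)
    M = D / omega + L
    return M, M - A

def sor_gibbs(A, omega, n_sweeps, rng=np.random.default_rng(2)):
    """Generalized Gibbs sampler for N(0, A^{-1}) built on the SOR splitting."""
    M, N = sor_splitting(A, omega)
    d = np.diag(A)
    y = np.zeros(A.shape[0])
    for _ in range(n_sweeps):
        # c ~ N(0, ((2 - omega)/omega) D): independent scaled normals
        c = np.sqrt((2 - omega) / omega * d) * rng.standard_normal(len(d))
        y = np.linalg.solve(M, N @ y + c)
    return y
```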

A closer look at convergence. To sample from N(μ, A^{-1}) where Aμ = b: split A = M - N with M invertible, set G = M^{-1} N, and take c^{(k)} iid∼ N(0, M^T + N). The iteration
y^{(k+1)} = G y^{(k)} + M^{-1} (c^{(k)} + b)
implies
E(y^{(m)}) - μ = G^m [ E(y^{(0)}) - μ ]
and
Var(y^{(m)}) - A^{-1} = G^m [ Var(y^{(0)}) - A^{-1} ] (G^m)^T
(hence asymptotic average convergence factors ϱ(G) and ϱ(G)^2). Errors go down as the polynomial
P_m(I - G) = (I - (I - G))^m = (I - M^{-1} A)^m,   P_m(λ) = (1 - λ)^m
so solver and sampler have exactly the same error polynomial.
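The two moment recursions above can be checked exactly, without sampling, by propagating the covariance through the iteration; the sketch below does this for the GS/Gibbs splitting on a toy matrix, with all values illustrative.

```python
import numpy as np

# Exact propagation of the covariance recursion for the GS/Gibbs splitting,
# checking Var(y^(m)) - A^{-1} = G^m [Var(y^(0)) - A^{-1}] (G^m)^T on a toy 3x3 example.
A = np.array([[4., -1., 0.], [-1., 4., -1.], [0., -1., 4.]])
M = np.tril(A)                                   # M = D + L, N = M - A
N = M - A
G = np.linalg.solve(M, N)
K = np.linalg.solve(M, np.linalg.solve(M, (M.T + N).T).T)   # M^{-1}(M^T + N)M^{-T}
Ainv = np.linalg.inv(A)

V = np.eye(3)                                    # Var(y^(0)) = I
for m in range(1, 6):
    V = G @ V @ G.T + K                          # Var(y^(m)) from the iteration
    Gm = np.linalg.matrix_power(G, m)
    print(m, np.max(np.abs((V - Ainv) - Gm @ (np.eye(3) - Ainv) @ Gm.T)))   # ~1e-16
```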

Controlling the error polynomial. Writing the splitting as
A = (1/τ) M + ((1 - 1/τ) M - N)
gives the iteration operator G_τ = I - τ M^{-1} A and error polynomial P_m(λ) = (1 - τλ)^m. A sequence of parameters τ_1, τ_2, ..., τ_m gives the error polynomial
P_m(λ) = ∏_{l=1}^m (1 - τ_l λ)
... so we can choose the zeros of P_m! This gives a non-stationary solver, a non-homogeneous Markov chain.
Golub & Varga 1961, Golub & van Loan 1989, Axelsson 1996, Saad 2003, Fox & Parker 2012

The best (Chebyshev) polynomial: 10 iterations, factor of 300 improvement. Choose
1/τ_l = (λ_n + λ_1)/2 + ((λ_n - λ_1)/2) cos( π (2l + 1) / (2p) ),   l = 0, 1, 2, ..., p - 1
where λ_1 ≤ λ_n are the extreme eigenvalues of M^{-1} A.
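To make the idea concrete, here is a small sketch (illustrative values, trivial Richardson splitting M = I) that places 1/τ_l at the Chebyshev points above and compares a p-step non-stationary solver sweep against the best constant parameter. Note the caveat on the next slide: the first-order accelerated iteration is not numerically stable in general, so this is only a check of the error-polynomial idea for a solver, not the recommended sampler.

```python
import numpy as np

def chebyshev_taus(lam_min, lam_max, p):
    """Place 1/tau_l at the Chebyshev points of [lam_min, lam_max], as above."""
    l = np.arange(p)
    inv_tau = (lam_max + lam_min) / 2 \
        + (lam_max - lam_min) / 2 * np.cos(np.pi * (2 * l + 1) / (2 * p))
    return 1.0 / inv_tau

# Compare a p-step non-stationary sweep (zeros of P_m chosen as above) with the
# best constant parameter, for M = I on a toy SPD matrix with spectrum in [0.1, 2].
rng = np.random.default_rng(3)
Q = np.linalg.qr(rng.standard_normal((50, 50)))[0]
A = Q @ np.diag(np.linspace(0.1, 2.0, 50)) @ Q.T
b = rng.standard_normal(50)
x_star = np.linalg.solve(A, b)

p = 10
x_cheb = np.zeros(50)
for tau in chebyshev_taus(0.1, 2.0, p):
    x_cheb = x_cheb + tau * (b - A @ x_cheb)            # error poly: prod (1 - tau_l*lambda)
x_const = np.zeros(50)
for _ in range(p):
    x_const = x_const + (2.0 / 2.1) * (b - A @ x_const)  # best constant tau = 2/(l1 + ln)
print(np.linalg.norm(x_cheb - x_star), np.linalg.norm(x_const - x_star))
```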

Second-order accelerated sampler. The first-order accelerated iteration turns out to be unstable. Numerical stability, and optimality at each step, is given by the second-order iteration
y^{(k+1)} = (1 - α_k) y^{(k-1)} + α_k y^{(k)} + α_k τ_k M^{-1} (c^{(k)} - A y^{(k)})
with α_k and τ_k chosen so that the error polynomial satisfies the Chebyshev recursion.
Theorem 2. The 2nd-order solver converges ⇔ the 2nd-order sampler converges (given the correct noise distribution). The error polynomial is optimal, at each step, for both mean and covariance. The asymptotic average reduction factor (Axelsson 1996) is
σ = (1 - √(λ_1/λ_n)) / (1 + √(λ_1/λ_n)).
Axelsson 1996, Fox & Parker 2012

input: the SSOR splitting M, N of A; smallest eigenvalue λ_min of M^{-1}A; largest eigenvalue λ_max of M^{-1}A; relaxation parameter ω; initial state y^{(0)}; k_max
output: y^{(k)} approximately distributed as N(0, A^{-1})
set γ = (2/ω - 1)^{1/2}, δ = ((λ_max - λ_min)/4)^2, θ = (λ_max + λ_min)/2;
set α = 1, β = 2/θ, τ = 1/θ, τ_c = 2/τ - 1;
for k = 0, ..., k_max do
    if k = 0 then
        b = 2α - 1, a = τ_c b, κ = τ;
    else
        b = 2(1 - α)/β + 1, a = τ_c + (b - 1)(1/τ + 1/κ - 1), κ = β + (1 - α)κ;
    end
    sample z ~ N(0, I); c = γ b^{1/2} D^{1/2} z;
    x = y^{(k)} + M^{-1}(c - A y^{(k)});
    sample z ~ N(0, I); c = γ a^{1/2} D^{1/2} z;
    w = x - y^{(k)} + M^{-T}(c - A x);
    y^{(k+1)} = α(y^{(k)} - y^{(k-1)} + τ w) + y^{(k-1)};
    β = (θ - βδ)^{-1}; α = θβ;
end
Algorithm 1: Chebyshev accelerated SSOR sampling from N(0, A^{-1})

10 × 10 lattice (d = 100), sparse precision matrix
[A]_ij = 10^{-4} δ_ij + n_i if i = j;  -1 if i ≠ j and ||s_i - s_j||_2 ≤ 1;  0 otherwise.
[Figure: relative error vs flops (×10^6) for SSOR (ω = 1 and ω = 0.2122), Chebyshev-accelerated SSOR (ω = 1 and ω = 0.2122), and Cholesky.] 10^4 times faster.
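A sketch of how this sparse lattice precision matrix can be assembled, assuming the 2D first-order neighbourhood implied by the definition above (diagonal shift 10^{-4} plus the neighbour count n_i, -1 for neighbours); function and variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def lattice_precision(n_side):
    """Sparse precision matrix for the first-order lattice model above:
    diagonal 1e-4 + n_i (n_i = number of neighbours of site i),
    -1 for neighbouring sites (||s_i - s_j||_2 <= 1), 0 otherwise."""
    n = n_side * n_side
    G = sp.lil_matrix((n, n))
    idx = lambda i, j: i * n_side + j
    for i in range(n_side):
        for j in range(n_side):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n_side and 0 <= jj < n_side:
                    G[idx(i, j), idx(ii, jj)] = -1.0
    n_i = -np.asarray(G.sum(axis=1)).ravel()   # neighbour counts
    return G.tocsr() + sp.diags(1e-4 + n_i)

A = lattice_precision(10)                      # the d = 100 example above
print(A.shape, A.nnz)
```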

100 × 100 × 100 lattice (d = 10^6), sparse precision matrix. [Figure: 100 × 100 × 100 lattice.] Only used sparsity, no other special structure.

Accelerating MCMC (cartoon). MH-MCMC repeatedly simulates a fixed transition kernel P with πP = π via detailed balance (self-adjoint in π).
n-step distribution: π^{(n)} = π^{(n-1)} P = π^{(0)} P^n
n-step error in distribution: π^{(n)} - π = (π^{(0)} - π) P^n
error polynomial: P^n = (I - (I - P))^n, i.e. (1 - λ)^n
Hence convergence is geometric.
Polynomial acceleration (modify the iteration to give a desired error polynomial): choose x^{(i)} w.p. α_i, i.e. simulate π^{(0)} ∑_{i=1}^n α_i P^i, with error polynomial Q_n = ∑_{i=1}^n α_i (1 - λ)^i.

CG (Krylov space) sampling. [Figure: left panel, Gibbs sampling & Gauss-Seidel; right panel, CG sampling & optimization; both show iterates x^{(0)}, x^{(1)}, x^{(2)} and search directions p^{(0)}, p^{(1)} in the (x_1, x_2) plane.] Mutually conjugate vectors (wrt A) are independent directions:
V^T A V = D,   A^{-1} = V D^{-1} V^T
so if z ~ N(0, I) then x = V D^{-1/2} z ~ N(0, A^{-1}).
Schneider & Willsky 2003, Fox 2008, Parker & Fox SISC 2012

CG sampling algorithm. Apart from step 3, this is exactly (linear) CG optimization; Var(y_k) is the CG polynomial. Parker & Fox SISC 2012
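The algorithm itself is given in Parker & Fox (SISC 2012); the sketch below only illustrates the identity from the previous slide, assembled from a standard CG run by accumulating the A-conjugate search directions (dense, illustrative, exact-arithmetic reasoning). If CG terminates early, the sample covariance is only correct on the Krylov subspace explored, which is the over-smoothing noted on the next slide.

```python
import numpy as np

def cg_solve_and_sample(A, b, rng, tol=1e-12):
    """Run standard linear CG on A x = b and, along the way, accumulate
    y = sum_i p_i z_i / sqrt(p_i^T A p_i) over the A-conjugate search directions.
    In exact arithmetic, after n full steps y ~ N(0, A^{-1})."""
    n = len(b)
    x, y = np.zeros(n), np.zeros(n)
    r = b - A @ x
    p = r.copy()
    for _ in range(n):
        Ap = A @ p
        d = p @ Ap                                         # p^T A p
        y = y + (rng.standard_normal() / np.sqrt(d)) * p   # sampling step
        alpha = (r @ r) / d                                # usual CG updates follow
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:                    # early stop: y only covers
            break                                          # the explored Krylov subspace
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x, y
```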

CG sampling example. CG (by itself) over-smooths, so initialize Gibbs with CG. Parker & Fox SISC 2012, Schneider & Willsky 2003

Evaluating f(A) = A^{-1/2}. If z ~ N(0, I) then x = A^{-1/2} z ~ N(0, A^{-1}). Compute a rational approximation
x = A^{-1/2} z ≈ ∑_{j=1}^N α_j (A - σ_j I)^{-1} z,
accurate on the spectrum of A, or based on the Lanczos approximation
x = x_0 + β_0 V T^{-1/2} e_1.
Contour integration, continuous deformation (ODE): all using CG.
Simpson, Turner & Pettitt 2008; Aune, Eidsvik & Pokern STCO 2012
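A minimal sketch of the Lanczos route quoted above, under simplifying assumptions (fixed number of steps m, no reorthogonalization, no breakdown handling): build V and the tridiagonal T from m Lanczos steps on A started at z, then return ||z|| V T^{-1/2} e_1 as the approximate draw from N(0, A^{-1}).

```python
import numpy as np

def lanczos_inv_sqrt(A, z, m=30):
    """Approximate A^{-1/2} z by ||z|| V_m T_m^{-1/2} e_1 (illustrative only)."""
    n = len(z)
    V = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m)
    beta0 = np.linalg.norm(z)
    V[:, 0] = z / beta0
    for j in range(m):                       # plain Lanczos three-term recursion
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        if j + 1 < m:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    evals, evecs = np.linalg.eigh(T)                 # T = Q Lambda Q^T, Lambda > 0
    f_T_e1 = evecs @ (evals ** -0.5 * evecs[0])      # T^{-1/2} e_1
    return beta0 * V @ f_T_e1                        # ~ N(0, A^{-1}) if z ~ N(0, I)
```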

Sampling by BFGS. BFGS optimization of -log π builds an approximation to the inverse Hessian A^{-1}:
A^{-1} ≈ A_0^{-1} + ∑_{i=1}^N ρ_i s_i s_i^T
so if z_0 ~ N(0, A_0^{-1}) and z_i ~ N(0, 1) then
z_0 + ∑_{i=1}^N √ρ_i s_i z_i ~ N(0, A^{-1}).
Can propagate a sample covariance in large-scale EnKF ... though the CG sampler does a better job (we don't know why), and can seed the proposal in AM.
Veersé 1999; Auvinen, Bardsley, Haario & Kauranne 2010; Bardsley, Solonen, Parker, Haario & Howard IJUQ 2011
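A quick check of the low-rank sampling identity used above. The pairs (ρ_i, s_i) would come from BFGS secant updates as in the cited papers; here they are simply made-up values, and A_0 = I is assumed for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 5, 3
s = rng.standard_normal((N, n))                 # stand-in "BFGS" vectors s_i
rho = rng.uniform(0.1, 1.0, N)                  # stand-in coefficients rho_i > 0
target = np.eye(n) + sum(r * np.outer(si, si) for r, si in zip(rho, s))

samples = []
for _ in range(50000):
    x = rng.standard_normal(n)                  # z_0 ~ N(0, A_0^{-1}) with A_0 = I
    x = x + sum(np.sqrt(r) * si * rng.standard_normal() for r, si in zip(rho, s))
    samples.append(x)
# empirical covariance matches A_0^{-1} + sum_i rho_i s_i s_i^T up to Monte Carlo error
print(np.max(np.abs(np.cov(np.array(samples).T) - target)))
```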

Non-negativity constrained sampling by PGCG. Linear inverse problem b = Ax + η, η ~ N(0, σ^2 I), with hierarchical posterior
p(x, λ, δ | b) ∝ λ^{n/2 + α_λ - 1} δ^{n/2 + α_δ - 1} exp( -(λ/2) ||Ax - b||^2 - (δ/2) x^T C x - β_λ λ - β_δ δ )
and with probability mass projected to the constraint boundary in the metric of the Hessian of -log π:
x_k = arg min_{x ≥ 0} { (1/2) x^T (λ_k A^T A + δ_k C) x - x^T (λ_k A^T b + w) }
where w ~ N(0, λ_k A^T A + δ_k C). Compute by projected-gradient CG.
[Figure: example images.]
Bardsley & Fox IPSE 2011
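A minimal sketch of one constrained draw as written above, assuming λ A^T A + δ C is positive definite and small enough to factor densely. The talk computes the constrained minimizer by projected-gradient CG; here a generic bound-constrained solver (L-BFGS-B) stands in for that step, so this illustrates the structure of the update rather than the PGCG solver itself.

```python
import numpy as np
from scipy.optimize import minimize

def constrained_sample(A, b, C, lam, delta, rng):
    """One non-negativity constrained draw: sample w ~ N(0, H) with
    H = lam*A^T A + delta*C, then solve x = argmin_{x>=0} 1/2 x^T H x - x^T (lam*A^T b + w)."""
    H = lam * A.T @ A + delta * C
    w = np.linalg.cholesky(H) @ rng.standard_normal(H.shape[0])
    g = lam * A.T @ b + w
    fun = lambda x: 0.5 * x @ H @ x - x @ g
    jac = lambda x: H @ x - g
    res = minimize(fun, np.zeros(H.shape[0]), jac=jac, method="L-BFGS-B",
                   bounds=[(0, None)] * H.shape[0])   # stand-in for projected-gradient CG
    return res.x
```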

Some observations. In the Gaussian setting, stochastic relaxation is fundamentally equivalent to classical relaxation, so existing solvers (multigrid, fast multipole, parallel tools) can be used as samplers. More generally, accelerated samplers are not homogeneous chains, hence not standard MCMC; we need a convergence theory for non-stationary random iterations. Fast sampling looks like a good way to do UQ in large-scale problems: error = C n^{-o}, with o = 1/2 for Monte Carlo (not good) ... but we are never in the asymptotic regime. Further, C is much smaller than for lattice methods, especially when using importance sampling, and scales weakly with problem dimension.

Conclusion