
1 Monte Carlo simulation inspired by computational optimization. Colin Fox, Al Parker, John Bardsley. MCQMC, Feb 2012, Sydney

2 Sampling from π(x)

Consider: x is high-dimensional and π is expensive to compute. MH-MCMC repeatedly simulates a fixed transition kernel P with πP = π via detailed balance (self-adjoint in π).

n-step distribution: π^(n) = π^(n−1) P = π^(0) P^n
n-step error in distribution: π^(n) − π = (π^(0) − π) P^n
error polynomial: P^n = (I − (I − P))^n, i.e. (1 − λ)^n on each eigenvalue λ of I − P

Hence convergence is geometric. This was the state of stationary linear solvers in the 1950s, now considered very slow.

Polynomial acceleration (choose x^(i) w.p. α_i): simulate π^(0) Σ_{i=1}^n α_i P^i, with error polynomial Q_n = Σ_{i=1}^n α_i (1 − λ)^i

4 Gibbs sampling

Gibbs sampling¹ repeatedly samples from (block) conditional distributions. A forward and reverse sweep simulates a reversible kernel, as in MH-MCMC.

Normal distributions:
π(x) = √(det(A)/(2π)^n) exp{−(1/2) x^T A x + b^T x}
with precision matrix A and covariance matrix Σ = A^{−1} (both SPD); wlog treat zero mean.

Particularly interested in the case where A is sparse (a GMRF) and n is large.

When π^(0) is also normal, then so is the n-step distribution, with A^(n) → A and Σ^(n) → Σ.

In what sense is stochastic relaxation related to relaxation? What decomposition of A is it performing?

¹ Glauber 1963 (heat-bath algorithm), Turcin 1971, Geman and Geman 1984
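A minimal sketch of the componentwise Gibbs sweep for N(0, A^{−1}) (mine, not from the slides; dense numpy arrays for clarity, where a sparse GMRF would exploit the sparsity pattern):

```python
import numpy as np

def gibbs_sweep(A, y, rng):
    """One forward sweep of componentwise Gibbs for N(0, A^{-1}).

    The conditional of y_i given the rest is normal with
    mean -(1/A_ii) * sum_{j != i} A_ij y_j and variance 1/A_ii.
    """
    for i in range(len(y)):
        cond_var = 1.0 / A[i, i]
        cond_mean = -cond_var * (A[i] @ y - A[i, i] * y[i])
        y[i] = cond_mean + np.sqrt(cond_var) * rng.standard_normal()
    return y

rng = np.random.default_rng(0)
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])   # a small SPD precision matrix
y = np.zeros(3)
for _ in range(1000):               # sweeps of the sampler
    y = gibbs_sweep(A, y, rng)
```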

5 Gibbs samplers and equivalent linear solvers

Optimization...    Sampling...
Gauss-Seidel       Gibbs
Cheby-GS           Cheby-Gibbs
CG/Lanczos         Lanczos

6 Matrix formulation of Gibbs sampling from N(0, A^{−1})

Let y = (y_1, y_2, ..., y_n)^T. Componentwise Gibbs updates each component in sequence from the (normal) conditional distributions. One sweep over all n components can be written

y^(k+1) = −D^{−1} L y^(k+1) − D^{−1} L^T y^(k) + D^{−1/2} z^(k)

where D = diag(A), L is the strictly lower triangular part of A, and z^(k) ~ N(0, I). Equivalently

y^(k+1) = G y^(k) + c^(k)

where c^(k) is iid noise with zero mean and finite covariance: a stochastic AR(1) process, i.e. a first-order stationary iteration plus noise.

Goodman & Sokal, 1989
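A quick numerical check (an illustrative sketch, not from the slides) that the matrix form above reproduces the componentwise sweep when both are fed the same noise vector z:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
n = len(A)
d = np.diag(A)
z = rng.standard_normal(n)
y = np.ones(n)

# componentwise sweep, updating in place
y_comp = y.copy()
for i in range(n):
    y_comp[i] = (-(A[i] @ y_comp - A[i, i] * y_comp[i]) + np.sqrt(d[i]) * z[i]) / d[i]

# matrix form: (D + L) y_new = -L^T y + D^{1/2} z
y_mat = np.linalg.solve(np.tril(A), -np.triu(A, 1) @ y + np.sqrt(d) * z)

assert np.allclose(y_comp, y_mat)
```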

7 Matrix splitting form of stationary iterative methods

The splitting A = M − N converts the linear system Ax = b to Mx = Nx + b. If M is nonsingular,

x = M^{−1} N x + M^{−1} b.

Iterative methods compute successively better approximations by

x^(k+1) = M^{−1} N x^(k) + M^{−1} b = G x^(k) + g

Many splittings use the terms in A = L + D + U. Gauss-Seidel sets M = D + L:

x^(k+1) = −D^{−1} L x^(k+1) − D^{−1} L^T x^(k) + D^{−1} b

Spot the similarity to Gibbs:

y^(k+1) = −D^{−1} L y^(k+1) − D^{−1} L^T y^(k) + D^{−1/2} z^(k)

Goodman & Sokal, 1989; Amit & Grenander, 1991

9 Gibbs converges ⇔ solver converges

Theorem 1. Let A = M − N with M invertible. The stationary linear solver

x^(k+1) = M^{−1} N x^(k) + M^{−1} b = G x^(k) + M^{−1} b

converges if and only if the random iteration

y^(k+1) = M^{−1} N y^(k) + M^{−1} c^(k) = G y^(k) + M^{−1} c^(k)

converges in distribution. Here c^(k) ~ iid π_n, where π_n has zero mean and finite variance.

Proof. Both converge iff ϱ(G) < 1.

Convergent splittings generate convergent (generalized) Gibbs samplers. The mean converges with asymptotic convergence factor ϱ(G), the covariance with ϱ(G)².

Young 1971 Thm 3-5.1; Duflo 1997; Goodman & Sokal 1989; Galli & Gao 2001; F & Parker 2012
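An illustrative check of Theorem 1 (my own sketch): for the Gauss-Seidel/Gibbs splitting of a small SPD precision matrix, ϱ(G) < 1, so the solver converges and so does the sampler, with the stated factors for mean and covariance:

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
M = np.tril(A)                      # Gauss-Seidel / Gibbs: M = D + L
G = np.linalg.solve(M, M - A)       # iteration operator G = M^{-1} N
rho = max(abs(np.linalg.eigvals(G)))
print(rho)       # < 1: solver and sampler both converge
print(rho ** 2)  # asymptotic convergence factor of the sampler's covariance
```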

10 Some not-so-common Gibbs samplers for N(0, A^{−1})

splitting/sampler   M                                    Var(c^(k)) = M^T + N                                         converges if
Richardson          (1/ω) I                              (2/ω) I − A                                                  0 < ω < 2/ϱ(A)
Jacobi              D                                    2D − A                                                       A SDD
GS/Gibbs            D + L                                D                                                            always
SOR/B&F             (1/ω) D + L                          ((2 − ω)/ω) D                                                0 < ω < 2
SSOR/REGS           (ω/(2 − ω)) M_SOR D^{−1} M_SOR^T     (ω/(2 − ω)) (M_SOR D^{−1} M_SOR^T + N_SOR^T D^{−1} N_SOR)    0 < ω < 2

Want: convenient to solve Mu = r and to sample from N(0, M^T + N). The relaxation parameter ω can accelerate Gibbs. SSOR is a forward and a backward sweep of SOR, giving a symmetric splitting.

SOR: Adler 1981, Barone & Frigessi 1990, Amit & Grenander 1991; SSOR: Roberts & Sahu 1997
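For the SOR row both requirements are convenient: M is triangular, and M^T + N = ((2 − ω)/ω) D is diagonal, so the noise is sampled componentwise. A minimal sketch (mine, not the authors'; dense numpy, where a triangular solve would be used in practice):

```python
import numpy as np

def sor_gibbs_sweep(A, y, omega, rng):
    """One sweep of the SOR sampler: y <- G y + M^{-1} c.

    M = (1/omega) D + L, so M^T + N = ((2 - omega)/omega) D is diagonal
    and c ~ N(0, M^T + N) is cheap to sample.
    """
    d = np.diag(A)
    M = np.diag(d) / omega + np.tril(A, -1)
    c = np.sqrt((2.0 - omega) / omega * d) * rng.standard_normal(len(y))
    return np.linalg.solve(M, (M - A) @ y + c)
```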

11 A closer look at convergence

To sample from N(µ, A^{−1}) where Aµ = b: split A = M − N with M invertible, set G = M^{−1} N, and take c^(k) ~ iid N(0, M^T + N). The iteration

y^(k+1) = G y^(k) + M^{−1} (c^(k) + b)

implies

E(y^(m)) − µ = G^m [E(y^(0)) − µ]
Var(y^(m)) − A^{−1} = G^m [Var(y^(0)) − A^{−1}] (G^m)^T

(hence asymptotic average convergence factors ϱ(G) and ϱ(G)²).

Errors go down as the polynomial

P_m(I − G) = (I − (I − G))^m = (I − M^{−1} A)^m,   P_m(λ) = (1 − λ)^m

Solver and sampler have exactly the same error polynomial.

13 Controlling the error polynomial

The splitting

A = (1/τ) M − (((1 − τ)/τ) M + N)

gives the iteration operator G_τ = I − τ M^{−1} A and error polynomial P_m(λ) = (1 − τλ)^m. A sequence of parameters τ_1, τ_2, ..., τ_m gives the error polynomial

P_m(λ) = ∏_{l=1}^m (1 − τ_l λ)

... so we can choose the zeros of P_m! This gives a non-stationary solver ⇔ non-homogeneous Markov chain (equivalently, one can post-process the chain by taking a linear combination of states).

Golub & Varga 1961, Golub & van Loan 1989, Axelsson 1996, Saad 2003, F & Parker 2012

14 The best (Chebyshev) polynomial

[Plot: error polynomials over the spectrum; after 10 iterations, a factor of 300 improvement.]

Choose

1/τ_l = (λ_n + λ_1)/2 + ((λ_n − λ_1)/2) cos(π (2l + 1)/(2p)),   l = 0, 1, 2, ..., p − 1

where λ_1 ≤ λ_n are the extreme eigenvalues of M^{−1} A.
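A small sketch (with assumed eigenvalue bounds λ_1, λ_n; all values illustrative) that computes these τ_l and compares the resulting error polynomial against the best stationary one over [λ_1, λ_n]:

```python
import numpy as np

lam1, lamn, p = 0.05, 1.95, 10   # assumed extreme eigenvalues of M^{-1} A
l = np.arange(p)
tau = 1.0 / ((lamn + lam1) / 2 + (lamn - lam1) / 2 * np.cos(np.pi * (2 * l + 1) / (2 * p)))

lam = np.linspace(lam1, lamn, 1001)
cheby = np.prod(1.0 - np.outer(tau, lam), axis=0)   # Chebyshev error polynomial
fixed = (1.0 - 2.0 * lam / (lam1 + lamn)) ** p      # best fixed tau, p iterations
print(abs(cheby).max(), abs(fixed).max())           # Chebyshev is far smaller
```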

15 Second-order accelerated sampler

The first-order accelerated iteration turns out to be unstable. Numerical stability, and optimality at each step, is given by the second-order iteration

y^(k+1) = (1 − α_k) y^(k−1) + α_k y^(k) + α_k τ_k M^{−1} (c^(k) − A y^(k))

with α_k and τ_k chosen so the error polynomial satisfies the Chebyshev recursion.

Theorem 2. The 2nd-order solver converges ⇔ the 2nd-order sampler converges (given the correct noise distribution). The error polynomial is optimal, at each step, for both mean and covariance. The asymptotic average reduction factor (Axelsson 1996) is

σ = (1 − √(λ_1/λ_n)) / (1 + √(λ_1/λ_n))

Axelsson 1996, F & Parker 2012

16 Algorithm 1: Chebyshev accelerated SSOR sampling from N(0, A^{−1})

input: the SSOR splitting M, N of A; smallest eigenvalue λ_min of M^{−1} A; largest eigenvalue λ_max of M^{−1} A; relaxation parameter ω; initial state y^(0); k_max
output: y^(k) approximately distributed as N(0, A^{−1})

set γ = (2/ω − 1)^{1/2}, δ = ((λ_max − λ_min)/4)^2, θ = (λ_max + λ_min)/2;
set α = 1, β = 2/θ, τ = 1/θ, τ_c = 2/τ − 1;
for k = 0, 1, ..., k_max do
    if k = 0 then
        b = 2α − 1; a = τ_c b; κ = τ;
    else
        b = 2(1 − α)/β + 1; a = τ_c + (b − 1)(1/τ + 1/κ − 1); κ = β + (1 − α)κ;
    end
    sample z ~ N(0, I); c = γ b^{1/2} D^{1/2} z;
    x = y^(k) + M^{−1}(c − A y^(k));
    sample z ~ N(0, I); c = γ a^{1/2} D^{1/2} z;
    w = x − y^(k) + M^{−T}(c − A x);
    y^(k+1) = α(y^(k) − y^(k−1) + τ w) + y^(k−1);
    β = (θ − β δ)^{−1}; α = θ β;
end
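A line-by-line transcription of Algorithm 1 into Python, as a sketch under assumptions: dense numpy arrays, zero initial state, brute-force eigenvalue bounds for the SSOR-preconditioned matrix, and the M and M^T solves taken as forward/backward SOR sweeps. Its correctness inherits from the pseudocode above rather than from the authors' code:

```python
import numpy as np

def cheby_ssor_sample(A, omega, k_max, rng):
    """Chebyshev accelerated SSOR sampling from N(0, A^{-1}) (sketch)."""
    n = A.shape[0]
    d = np.diag(A)
    M = np.diag(d) / omega + np.tril(A, -1)                      # M_SOR: forward sweep
    M_ssor = (omega / (2 - omega)) * M @ np.diag(1.0 / d) @ M.T  # SSOR splitting matrix
    lam = np.linalg.eigvals(np.linalg.solve(M_ssor, A)).real     # brute-force bounds
    lam_min, lam_max = lam.min(), lam.max()

    gamma = np.sqrt(2.0 / omega - 1.0)
    delta = ((lam_max - lam_min) / 4.0) ** 2
    theta = (lam_max + lam_min) / 2.0
    alpha, beta = 1.0, 2.0 / theta
    tau = 1.0 / theta
    tau_c = 2.0 / tau - 1.0
    kappa = tau

    y_prev = np.zeros(n)   # y^(k-1)
    y = np.zeros(n)        # y^(k); zero initial state assumed
    for k in range(k_max + 1):
        if k == 0:
            b = 2.0 * alpha - 1.0
            a = tau_c * b
            kappa = tau
        else:
            b = 2.0 * (1.0 - alpha) / beta + 1.0
            a = tau_c + (b - 1.0) * (1.0 / tau + 1.0 / kappa - 1.0)
            kappa = beta + (1.0 - alpha) * kappa
        c = gamma * np.sqrt(b) * np.sqrt(d) * rng.standard_normal(n)
        x = y + np.linalg.solve(M, c - A @ y)                    # forward sweep
        c = gamma * np.sqrt(a) * np.sqrt(d) * rng.standard_normal(n)
        w = x - y + np.linalg.solve(M.T, c - A @ x)              # backward sweep
        y, y_prev = alpha * (y - y_prev + tau * w) + y_prev, y
        beta = 1.0 / (theta - beta * delta)
        alpha = theta * beta
    return y
```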

17 10 × 10 lattice (d = 100), sparse precision matrix

[A]_ij = 10^{−4} δ_ij + n_i   if i = j
         −1                   if i ≠ j and s_i ~ s_j
         0                    otherwise

where s_i ~ s_j denotes neighbouring lattice sites and n_i is the number of neighbours of s_i.

[Plot: relative error vs. flops for SSOR (ω = 1 and optimal ω) and Chebyshev-accelerated SSOR (ω = 1 and optimal ω), with Cholesky factorization for comparison.]

18 Lattice (d = 10^6), sparse precision matrix; only sparsity was used, no other special structure.

19 CG sampling

[Two panels on contours of a 2D quadratic, tracing iterates x^(0), x^(1), x^(2) and search directions p^(0), p^(1): Gibbs sampling & Gauss-Seidel (left) vs. CG sampling & optimization (right).]

Mutually conjugate vectors (w.r.t. A) are independent directions:

V^T A V = D  ⇒  A^{−1} = V D^{−1} V^T

so if z ~ N(0, I) then x = V D^{−1/2} z ~ N(0, A^{−1}).

Schneider & Willsky 2003; F 2008; Parker & F SISC 2012

20 CG sampling algorithm

Apart from step 3, this is exactly (linear) CG optimization; Var(y_k) is governed by the CG polynomial.

Parker & F SISC 2012
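The slide's algorithm listing did not survive transcription, so here is a sketch of a conjugate-direction sampler in that spirit (my reconstruction, following Parker & F SISC 2012 only loosely): run standard CG on Ax = b and, alongside the solution update, accumulate the conjugate directions with independent N(0, 1) weights scaled by 1/√(pᵀAp), so that Var(y) approximates A^{−1} on the Krylov subspace:

```python
import numpy as np

def cg_sample(A, b, k_max, rng, tol=1e-10):
    """Linear CG on A x = b; y accumulates a sample with Var(y) ~ A^{-1}."""
    x = np.zeros(len(b))
    y = np.zeros(len(b))
    r = b - A @ x
    p = r.copy()
    for _ in range(k_max):
        Ap = A @ p
        d = p @ Ap                                         # conjugate normalization p^T A p
        gamma = (r @ r) / d
        x = x + gamma * p                                  # usual CG solution update
        y = y + (rng.standard_normal() / np.sqrt(d)) * p   # the extra sampling step
        r_new = r - gamma * Ap
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x, y
```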

21 Best approximation property

Theorem 3 (Parker 2009). The covariance matrix Var(y_k | x_0, b_0) = V_k T_k^{−1} V_k^T has k non-zero eigenvalues, which are the Lanczos estimates of the eigenvalues of A^{−1}:

σ_i(Var(y_k | x_0, b_0)) = 1/θ_i^k.

The eigenvectors of Var(y_k | x_0, b_0) are the Ritz vectors V_k v_i, which estimate the eigenvectors of A. When ||r_k||_2 = 0, then Var(b_k | y_0, b_0) = V_k T_k V_k^T and the k non-zero eigenpairs of Var(b_k | x_0, b_0) are the Lanczos Ritz pairs (θ_i^k, V_k v_i).

Var(y_k | x_0, b_0) approximates A^{−1}, and Var(b_k | x_0, b_0) approximates A, in the eigenspaces corresponding to the extreme and well-separated eigenvalues of A.

Parker & F SISC 2012

22 CG sampling example

So CG (by itself) over-smooths: initialize Gibbs with CG.

Parker & F SISC 2012; Schneider & Willsky 2003

23 Some observations

In the Gaussian setting, Gibbs ⇔ Gauss-Seidel:
- stochastic relaxation is fundamentally equivalent to classical relaxation
- existing solvers (multigrid, fast multipole, parallel tools) can be used as is
- sampling has the same computational cost as solving... and is sometimes cheaper

More generally, acceleration of convergence in mean and covariance is not limited to Gaussian targets... and has been applied in some settings.
