Computation fundamentals of discrete GMRF representations of continuous domain spatial models

Finn Lindgren

September 23, 2015 (v0.2.2)

Abstract

The fundamental formulas and algorithms for Bayesian spatial statistics using continuous domain GMRF models are presented. The construction of the models as approximate SPDE solutions is not covered. The results are applicable to non-spatial GMRF models as well, but the important property of observations being local or non-local is only discussed in the spatial context.

Contents

1 Basic spatial model in GMRF representation
  1.1 Bayesian hierarchical formulation
  1.2 Joint distribution
  1.3 Conditional/posterior distribution (or Kriging)
2 Expectations, covariances and samples
  2.1 Solving with the posterior precision
  2.2 Prior and posterior variances
  2.3 Leave-one-out cross-validation
  2.4 Sampling from the prior
  2.5 Sampling from the posterior
    2.5.1 Direct method
    2.5.2 Least squares
    2.5.3 Sampling with conditioning by Kriging
3 Likelihood evaluation
  3.1 Conditional marginal data likelihood
  3.2 Posterior parameter likelihood and total marginal data likelihood
  3.3 A note on non-Gaussian data and INLA
4 Linear constraints
  4.1 Non-interacting hard constraints
  4.2 Conditioning by Kriging
  4.3 Conditional covariances
  4.4 Constrained likelihood evaluation

1 Basic spatial model in GMRF representation

Basic linear spatial model:

- Spatial basis expansion model $u(s) = \sum_{k=1}^{m} \psi_k(s) u_k$, with $m$ basis functions $\psi_k(s)$ with local (compact) support, and GMRF coefficients $(u_1, \dots, u_m) = u \sim N(\mu_u, Q_u^{-1})$, where $Q_u$ is derived from an SPDE construction with parameters $\theta_u$, and $\mu_u$ is usually $0$.
- Covariate matrix $B$, with $p$ coefficients $b \sim N(\mu_b, Q_b^{-1})$. Typically $\mu_b = 0$ and $Q_b = \tau_b I_p$ with $\tau_b \approx 0$, i.e. a vague prior.
- $n$ observations $y = (y_1, \dots, y_n)$ at locations $s_i$, with additive zero mean Gaussian noise $e_i$, where $(e_1, \dots, e_n) = \epsilon \sim N(0, Q_\epsilon^{-1})$. Typically $Q_\epsilon = \tau_\epsilon I$.

Define $A$ so that $A_{ij} = \psi_j(s_i)$. The observation model is then $y = Bb + Au + \epsilon$. Since the covariate effects and the spatial field are jointly Gaussian, it is often practical to join them into a single vector $x = (u, b)$, and
\[
\mu_x = \begin{bmatrix} \mu_u \\ \mu_b \end{bmatrix},
\quad
Q_x = \begin{bmatrix} Q_u & 0 \\ 0 & Q_b \end{bmatrix},
\quad
A_x = \begin{bmatrix} A & B \end{bmatrix},
\quad
y = A_x x + \epsilon .
\]

1.1 Bayesian hierarchical formulation

Written as a full Bayesian hierarchical model, the model is
\begin{align}
\theta &= (\theta_u, \tau_b, \tau_\epsilon) \sim \pi(\theta) , \\
(x \mid \theta) &\sim N(\mu_x, Q_x^{-1}) , \\
(y \mid \theta, x) &\sim N(A_x x, Q_\epsilon^{-1}) .
\end{align}
It will be important to note that, for local basis functions (and at most Markov dependent noise $\epsilon$), $A^\top Q_\epsilon A$ is sparse, and that
\[
A_x^\top Q_\epsilon A_x =
\begin{bmatrix}
A^\top Q_\epsilon A & A^\top Q_\epsilon B \\
B^\top Q_\epsilon A & B^\top Q_\epsilon B
\end{bmatrix}
\]
is therefore also sparse if $p$ is small.

1.2 Joint distribution

The joint distribution for the observations and the latent variables $(y, x)$ is given by
\[
\begin{bmatrix} y \\ x \end{bmatrix}
\sim
N\left(
\begin{bmatrix} A_x \mu_x \\ \mu_x \end{bmatrix},
\begin{bmatrix}
Q_\epsilon & -Q_\epsilon A_x \\
-A_x^\top Q_\epsilon & Q_x + A_x^\top Q_\epsilon A_x
\end{bmatrix}^{-1}
\right) .
\]
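The block structure above translates directly into sparse matrix operations. The following is a minimal R sketch (not from the original note) of how the joint objects might be assembled with the Matrix package; the sizes, values and object names (Qu, Qb, Qe, A, B, Ax, Qx) are illustrative stand-ins rather than the output of an actual SPDE construction.

    library(Matrix)

    m <- 50; p <- 2; n <- 100
    ## Stand-in sparse prior precision for u (an SPDE construction in practice):
    Qu <- bandSparse(m, k = c(0, 1),
                     diagonals = list(rep(2.1, m), rep(-1, m - 1)),
                     symmetric = TRUE)
    tau_b <- 1e-6                        # vague prior precision for covariate effects
    Qb <- Diagonal(p, tau_b)
    tau_e <- 4                           # observation noise precision
    Qe <- Diagonal(n, tau_e)

    ## A maps basis weights to observation locations (one basis function per
    ## observation in this toy version); B holds covariate values.
    A <- sparseMatrix(i = 1:n, j = sample(1:m, n, replace = TRUE), x = 1,
                      dims = c(n, m))
    B <- cbind(1, rnorm(n))

    Ax   <- cbind(A, Matrix(B))           # A_x = [A  B]
    Qx   <- forceSymmetric(bdiag(Qu, Qb)) # Q_x = blockdiag(Q_u, Q_b)
    AtQA <- t(Ax) %*% Qe %*% Ax           # sparse when p is small (Section 1.1)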

1.3 Conditional/posterior distribution (or Kriging)

The conditional distribution for $x$ given $y$ and $\theta$ is given by
\begin{align}
(x \mid \theta, y) &\sim N(\mu_{x|y}, Q_{x|y}^{-1}) , \\
Q_{x|y} &= Q_x + A_x^\top Q_\epsilon A_x , \\
\mu_{x|y} &= \mu_x + Q_{x|y}^{-1} A_x^\top Q_\epsilon (y - A_x \mu_x) .
\end{align}
The proof is by completing the square in the exponent of the Gaussian conditional density expression
\[
\pi(x \mid \theta, y) = \frac{\pi(x, y \mid \theta)}{\pi(y \mid \theta)} \propto \pi(x \mid \theta)\, \pi(y \mid \theta, x) .
\]
Note that the elements of $\mu_{x|y}$ are the basis function coefficients and covariate effect estimates in the universal Kriging predictor, and that we have arrived at it without going the long way around via the traditional covariance based Kriging equations and matrix inversion lemmas.

2 Expectations, covariances and samples

Direct calculations and simulation (for models smaller than approximately $m = 10^6$) are typically done with sparse Cholesky decomposition. Reordering methods are required to achieve sparsity (CAMD, nested dissection, etc.). Here, assume for ease of notation that the nodes are already in a suitable order, and that both the prior and posterior precision decompositions are sparse:
\[
Q_x = L_x L_x^\top , \qquad Q_{x|y} = L_{x|y} L_{x|y}^\top ,
\]
where all $L$ are lower triangular sparse matrices. See Sections 2.5.2 and 4 for examples of how to handle models where only the prior precision has a sparse Cholesky decomposition. In several instances, the needed adjustments for the use of iterative equation solvers are also mentioned.

2.1 Solving with the posterior precision

Kriging requires solving a linear system
\[
Q_{x|y} z = w , \qquad z = L_{x|y}^{-\top} \left( L_{x|y}^{-1} w \right) ,
\]
requiring one forward solve and one backward solve. Several systems with the same posterior precision can be solved simultaneously by having a multi-column $w$. Often, practical implementations avoid actually transposing the large sparse matrices, effectively doing
\[
z = \left\{ \left( L_{x|y}^{-1} w \right)^\top L_{x|y}^{-1} \right\}^\top
\]
instead. The Kriging predictor (the posterior expectation) is obtained for $w = A_x^\top Q_\epsilon (y - A_x \mu_x)$, with $\mu_{x|y} = \mu_x + z$.
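A minimal R continuation of the sketch above (illustrative names, not code from the note), computing the Kriging predictor $\mu_{x|y}$ with one forward and one backward solve on the sparse Cholesky factor; as in the text, the nodes are assumed to be in a suitable order already.

    y    <- rnorm(n)                       # stand-in observations
    mu_x <- rep(0, m + p)                  # prior mean

    Qxy <- forceSymmetric(Qx + AtQA)       # posterior precision Q_{x|y}
    L   <- t(chol(Qxy))                    # lower triangular factor, Q_{x|y} = L L'
    w   <- t(Ax) %*% Qe %*% (y - Ax %*% mu_x)
    z   <- solve(t(L), solve(L, w))        # forward solve, then backward solve
    mu_xy <- mu_x + as.vector(z)           # posterior mean / Kriging predictor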

2.2 Prior and posterior variances

Even though the full covariance matrix is too large to be stored, the variances and covariances of neighbouring Markov nodes can be calculated using an algorithm related to the Cholesky decomposition. In R-INLA this is available as S = inla.qinv(Q), where
\[
S_{ij} =
\begin{cases}
\left( Q^{-1} \right)_{ij} & \text{when } Q_{ij} \neq 0 \text{ or } (i,j) \text{ is in the Cholesky infill, and} \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
\]
This means that the posterior neighbour covariances $\mathrm{Cov}(x_i, x_j \mid \theta, y)$ for neighbours $(i,j)$ are obtained by calling inla.qinv() with the posterior precision $Q_{x|y}$.

Provided that the desired posterior predictions $A_\text{pred}\, x$ have the same sparse connectivity structure as the prior model itself, e.g. when $A_\text{pred} = A_x$, the sparse symmetric $S$ matrix is sufficient to evaluate the marginal posterior predictive variances as the diagonal of $A_\text{pred} S A_\text{pred}^\top$:
\[
\mathrm{Var}\left( (A_\text{pred}\, x)_i \right) = \sum_{j=1}^{m+p} (A_\text{pred})_{ij} \sum_{k=1}^{m+p} S_{jk} (A_\text{pred})_{ik} ,
\]
which in R would typically be implemented as rowSums(A * (A %*% S)).

When the desired predictions are dense relative to the model (e.g. when predicting the overall spatial average), solves with $L_{x|y}$ are required. A small number of predictions (i.e. when the number of rows of $A_\text{pred}$ is small) can be evaluated simultaneously with

    LAt <- solve(L, t(A))
    colSums(LAt * LAt)

Data predictive marginal variances, for $A_\text{pred}\, x + \epsilon_\text{pred}$, are obtained by adding $\tau_\epsilon^{-1}$ to the variances for $A_\text{pred}\, x$.

2.3 Leave-one-out cross-validation

In the case of conditionally independent observations, expressions for the expectations and variances of leave-one-out cross-validation predictive distributions can be obtained via rank one modification of the full posterior precision matrix, and are obtained for free by manipulating the above expressions. Let $S_{x|y}$ be the neighbour covariance matrix obtained from (1) applied to $Q_{x|y}$, and define
\[
E_i = (A_x \mu_{x|y})_i , \qquad V_i = (A_x)_{i\cdot}\, S_{x|y}\, (A_x)_{i\cdot}^\top .
\]
Then the predictive distribution for observation $i$ based on the complete data is
\[
\left( (A_x x)_i \mid \theta, y \right) \sim N(E_i, V_i) ,
\]
and the leave-one-out predictive distribution is
\[
\left( (A_x x)_i \mid \theta, \{ y_j,\; j \neq i \} \right) \sim N(E_{(i)}, V_{(i)}) ,
\]
where
\[
E_{(i)} = y_i - \frac{y_i - E_i}{1 - q_i V_i} , \qquad
V_{(i)} = \frac{V_i}{1 - q_i V_i} ,
\]
and $q_i$ is the conditional precision of observation $i$, from $Q_\epsilon = \mathrm{diag}(q_1, \dots, q_n)$.
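A short R sketch of these leave-one-out quantities, assuming the INLA package is installed (inla.qinv() is the partial inverse routine referred to above) and that $Q_\epsilon = \tau_\epsilon I$ as in the toy example; all other names continue the earlier illustrative sketches.

    library(INLA)

    S_xy <- inla.qinv(Qxy)                  # neighbour covariances of x given y, theta
    E <- as.vector(Ax %*% mu_xy)            # full-data predictive means E_i
    V <- rowSums(Ax * (Ax %*% S_xy))        # full-data predictive variances V_i
    q <- rep(tau_e, n)                      # conditional observation precisions q_i

    E_loo <- y - (y - E) / (1 - q * V)      # leave-one-out predictive means
    V_loo <- V / (1 - q * V)                # leave-one-out predictive variances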

2.4 Sampling from the prior

Let $w \sim N(0, I_{m+p})$. Then $x = \mu_x + L_x^{-\top} w$ is a sample from the prior model for $x$. Multiple samples are obtained by repeated backward solves.

2.5 Sampling from the posterior

2.5.1 Direct method

First compute $\mu_{x|y}$ as in Section 2.1, and then let $w \sim N(0, I_{m+p})$. Then $x = \mu_{x|y} + L_{x|y}^{-\top} w$ is a sample from the posterior for $x$ given $y$. Multiple samples are obtained by repeated backward solves.

2.5.2 Least squares

In some models the posterior precision matrix does not have a sparse Cholesky decomposition, but iterative linear algebra methods may still be used to solve systems with $Q_{x|y}$. If the prior precision can be decomposed as $Q_x = H_x^\top H_x$ for some sparse matrix $H_x$ (not necessarily a Cholesky factor), then one approach to sampling is least squares. Given that the posterior expectation has already been calculated, using an iterative solution to the equations in Section 1.3, a sample of $x$ conditionally on $y$ is given by the solution to
\[
Q_{x|y} (x - \mu_{x|y}) = \begin{bmatrix} H_x^\top & A_x^\top L_\epsilon \end{bmatrix} w ,
\]
where $w \sim N(0, I_{m+p+n})$ and $Q_\epsilon = L_\epsilon L_\epsilon^\top$.
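An R sketch of the direct sampling recipes in Sections 2.4 and 2.5.1 (illustrative, continuing the objects defined in the earlier sketches): with $Q = L L^\top$, a sample is mu + solve(t(L), w), i.e. one backward solve per sample.

    Lx <- t(chol(Qx))                                         # prior factor L_x
    x_prior <- mu_x + as.vector(solve(t(Lx), rnorm(m + p)))   # sample from the prior

    x_post <- mu_xy + as.vector(solve(t(L), rnorm(m + p)))    # sample from x | y

    ## Several posterior samples at once, one backward solve per column:
    W <- matrix(rnorm((m + p) * 10), m + p, 10)
    X_post <- mu_xy + as.matrix(solve(t(L), W))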

2.5.3 Sampling with conditioning by Kriging

The conditioning by Kriging approach to sampling is to construct samples from the prior, and then correct them to become samples from the posterior. For GMRF models this is less efficient than the direct approach in Section 2.5.1, except when some observations are non-local, which leads to a dense posterior precision. In its traditional form, the samples are constructed as follows:
\begin{align}
x_0 &= \mu_x + L_x^{-\top} w_x , & w_x &\sim N(0, I_{m+p}) , \\
y_0 &= y + L_\epsilon^{-\top} w_y , & w_y &\sim N(0, I_n) , \\
x &= x_0 - Q_x^{-1} A_x^\top \left( A_x Q_x^{-1} A_x^\top + Q_\epsilon^{-1} \right)^{-1} (A_x x_0 - y_0) ,
\end{align}
where the final expression is the covariance based Kriging equation. For anything but very small $n$, the inner matrix is dense. Using the Woodbury identity, the expression can be rewritten to recover the precision based equation from Section 1.3,
\[
x = x_0 + \left( Q_x + A_x^\top Q_\epsilon A_x \right)^{-1} A_x^\top Q_\epsilon (y_0 - A_x x_0)
  = x_0 + Q_{x|y}^{-1} A_x^\top Q_\epsilon (y_0 - A_x x_0) ,
\]
which is calculated as before with forward and backward solves on $L_{x|y}$. If $N$ samples as well as the posterior mean are desired, conditioning by Kriging requires $N$ solves with $L_x$ and $L_\epsilon$ (to obtain $x_0$ and $y_0$) and $2 + 2N$ solves with $L_{x|y}$. The direct method only requires $2 + N$ solves with $L_{x|y}$, and no solves with $L_x$ or $L_\epsilon$, which means that it is nearly always preferable to conditioning by Kriging. The exception is non-local observations, which can be handled separately as soft constraints; see Section 4.

3 Likelihood evaluation

The expensive part of evaluating likelihoods is calculating precision determinants, but they come at no extra cost given the Cholesky factors:
\begin{align}
\log \pi(x \mid \theta) &= -\frac{m+p}{2} \log(2\pi) + \frac{1}{2} \log \det(Q_x) - \frac{1}{2} (x - \mu_x)^\top Q_x (x - \mu_x) , \\
\log \det(Q_x) &= 2 \sum_{j=1}^{m+p} \log (L_x)_{jj} .
\end{align}
Corresponding expressions hold for $\pi(y \mid \theta, x)$ and $\pi(x \mid \theta, y)$, using $L_\epsilon$ and $L_{x|y}$. Marginal data likelihoods require more effort.

3.1 Conditional marginal data likelihood

For any arbitrary $x$,
\[
\pi(y \mid \theta) = \frac{\pi(\theta, y, x)}{\pi(\theta)\, \pi(x \mid \theta, y)}
= \frac{\pi(x \mid \theta)\, \pi(y \mid \theta, x)}{\pi(x \mid \theta, y)} .
\]
In practice this is evaluated for $x = \mu_{x|y}$, which gives better numerical stability than $x = 0$. Written explicitly,
\begin{align}
\log \pi(y \mid \theta) ={}& -\frac{n}{2} \log(2\pi) + \frac{1}{2} \log \det(Q_x) + \frac{1}{2} \log \det(Q_\epsilon) - \frac{1}{2} \log \det(Q_{x|y}) \\
& - \frac{1}{2} (\mu_{x|y} - \mu_x)^\top Q_x (\mu_{x|y} - \mu_x)
  - \frac{1}{2} (y - A_x \mu_{x|y})^\top Q_\epsilon (y - A_x \mu_{x|y}) ,
\end{align}
where an anticipated third quadratic form, from $\pi(x \mid \theta, y)$, has been eliminated due to the choice of $x$. Comparing with a standard marginal Gaussian log-likelihood expression, the log-determinant and quadratic form can be identified,
\begin{align}
\log \det(Q_{y|\theta}) &= \log \det(Q_x) + \log \det(Q_\epsilon) - \log \det(Q_{x|y}) , \\
(y - A_x \mu_x)^\top Q_{y|\theta} (y - A_x \mu_x) &= (\mu_{x|y} - \mu_x)^\top Q_x (\mu_{x|y} - \mu_x) + (y - A_x \mu_{x|y})^\top Q_\epsilon (y - A_x \mu_{x|y}) ,
\end{align}
even though the marginal precision matrix $Q_{y|\theta}$ itself is dense and cannot be stored.
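A brief R sketch of the Section 3.1 formula, continuing the illustrative objects from the earlier sketches: the log-determinants are read off the diagonals of the sparse Cholesky factors, and $Q_\epsilon = \tau_\epsilon I$ is assumed as in the toy example.

    logdet_Qx  <- 2 * sum(log(diag(Lx)))
    logdet_Qe  <- n * log(tau_e)              # log det of Q_eps = tau_e * I
    logdet_Qxy <- 2 * sum(log(diag(L)))

    r1 <- mu_xy - mu_x                        # enters the prior quadratic form
    r2 <- as.vector(y - Ax %*% mu_xy)         # enters the noise quadratic form
    log_lik_y <- -0.5 * n * log(2 * pi) +
      0.5 * (logdet_Qx + logdet_Qe - logdet_Qxy) -
      0.5 * as.numeric(t(r1) %*% Qx %*% r1) -
      0.5 * tau_e * sum(r2^2)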

3.2 Posterior parameter likelihood and total marginal data likelihood

Using the result for the conditional marginal data likelihood, we can express the posterior parameter likelihood as
\[
\pi(\theta \mid y) = \frac{\pi(\theta, y)}{\pi(y)} = \frac{\pi(\theta)\, \pi(y \mid \theta)}{\pi(y)} ,
\]
where the only unknown is the total marginal data likelihood $\pi(y)$, which can be approximated with numerical integration over $\theta$,
\[
\pi(y) = \int \pi(\theta)\, \pi(y \mid \theta) \, d\theta .
\]

3.3 A note on non-Gaussian data and INLA

For non-Gaussian data, a Laplace approximation can be used, which in essence involves replacing the non-Gaussian conditional posterior $\pi(x \mid \theta, y)$ with a Gaussian approximation in the expression for $\pi(y \mid \theta)$, taking $x$ as the mode of $\pi(x \mid \theta, y)$. The INLA implementation finds the mode of $\pi(\theta \mid y)$ using numerical optimisation, and further approximates the marginal posteriors for the latent variables $x_k$, $\pi(x_k \mid y)$, with numerical integration of
\[
\pi(x_k \mid y) = \int \pi(x_k \mid \theta, y)\, \pi(\theta \mid y) \, d\theta ,
\]
where a further Laplace approximation of $\pi(x_k \mid \theta, y)$ is made (as well as skewness corrections, not discussed here),
\[
\pi(x_k \mid \theta, y) = \left. \frac{\pi(x_k, x_{(k)} \mid \theta, y)}{\pi(x_{(k)} \mid \theta, y, x_k)} \right|_{x_{(k)} = \widehat{x}_{(k)}} ,
\]
where $x_{(k)}$ denotes all latent variables except $x_k$, evaluated at the conditional mode $\widehat{x}_{(k)}$. This combination of Laplace approximations is the nested Laplace part of INLA. In the case of Gaussian data, the only approximation in the method is the numerical integration itself.
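The integration over $\theta$ in Section 3.2 is typically carried out on a grid or a small set of design points. Below is a minimal R sketch of a one-dimensional grid version; log_joint() is a hypothetical helper standing in for $\log \pi(\theta) + \log \pi(y \mid \theta)$ as computed in Section 3.1 (replaced here by a placeholder density so the snippet runs on its own).

    ## Hypothetical helper: in a real model this would rebuild Q_eps, Q_{x|y}
    ## etc. for the given theta and reuse the Section 3.1 computation.
    log_joint <- function(theta) dnorm(theta, mean = log(4), sd = 1, log = TRUE)

    h <- 0.1
    theta_grid <- seq(-2, 6, by = h)
    log_vals <- sapply(theta_grid, log_joint)

    ## log pi(y) via the log-sum-exp trick, times the grid spacing:
    log_py <- max(log_vals) + log(sum(exp(log_vals - max(log_vals)))) + log(h)
    post_theta <- exp(log_vals - log_py)    # pi(theta | y) evaluated on the grid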

4 Linear constraints

There are two main classes of linear constraints:

1. Hard constraints, which fall into two subclasses:
   (a) Interacting hard constraints, where linear combinations of nodes are constrained.
   (b) Non-interacting hard constraints, where each constraint acts only on a single node, in effect specifying those nodes explicitly.
2. Soft constraints, where linear combinations are specified with some uncertainty, equivalent to posterior conditioning on observations of these linear combinations. Observations are treated under this framework when they break the Markov structure in the posterior distribution, e.g. by affecting all the nodes in the field directly.

All these cases can be written in matrix form as
\[
A_c x + \epsilon_c = e_c ,
\]
where $\epsilon_c = 0$ in the hard constraint cases, and $\epsilon_c \sim N(0, Q_c^{-1})$ in the soft constraint case. The aim is to sample from and evaluate properties of the conditional distribution for $(x \mid e_c)$, when
\begin{align}
x &\sim N(\mu, Q^{-1}) , \\
(e_c \mid x) &\sim N(A_c x, Q_c^{-1}) ,
\end{align}
where $x$ comes from a known GMRF model, typically either a prior or a posterior distribution. Let $r$ denote the number of constraints, i.e. the number of rows in $A_c$ and the number of elements in $e_c$.

4.1 Non-interacting hard constraints

Non-interacting hard constraints are easiest to handle, as the constrained nodes can be completely removed from all calculations in a preprocessing step, since they are in effect known, deterministic values, and the conditional precision for the remaining nodes can be trivially extracted as a block from the joint precision matrix. Without loss of practical generality, we can assume that all scaling is done in $e_c$, so that each row of $A_c$ has a single non-zero entry that is equal to $1$. For ease of notation, assume that the nodes are ordered so that the unconstrained nodes $x_U$ are followed by the constrained $x_C$, so that the joint distribution is
\[
\begin{bmatrix} x_U \\ x_C \end{bmatrix}
\sim
N\left(
\begin{bmatrix} \mu_U \\ \mu_C \end{bmatrix},
\begin{bmatrix} Q_{UU} & Q_{UC} \\ Q_{CU} & Q_{CC} \end{bmatrix}^{-1}
\right) .
\]
The conditional distribution for the unconstrained nodes is given by
\begin{align}
(x_U \mid x_C = e_c) &\sim N\left( \mu_{U|C}, Q_{UU}^{-1} \right) , \\
\mu_{U|C} &= \mu_U - Q_{UU}^{-1} Q_{UC} (e_c - \mu_C) .
\end{align}
The constrained mean and constrained samples are obtained with the same forward and backward solving techniques as in Section 1.3 and Section 2.5.1, using the Cholesky decomposition of $Q_{UU}$. The joint constrained distribution can be written as a degenerate Normal distribution,
\[
\left( \begin{bmatrix} x_U \\ x_C \end{bmatrix} \,\middle|\, x_C = e_c \right)
\sim
N\left(
\begin{bmatrix} \mu_{U|C} \\ e_c \end{bmatrix},
\begin{bmatrix} Q_{UU}^{-1} & 0 \\ 0 & 0 \end{bmatrix}
\right) .
\]
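An R sketch of Section 4.1 (illustrative, continuing the earlier toy objects): the last two nodes of $x$ are fixed to given values, and the conditional mean of the remaining nodes is obtained from the unconstrained block of the precision matrix.

    C_idx <- c(m + p - 1, m + p)              # constrained node indices
    U_idx <- setdiff(seq_len(m + p), C_idx)   # unconstrained node indices
    e_c   <- c(0.5, -0.5)                     # fixed values for the constrained nodes

    Q_UU <- Qx[U_idx, U_idx]
    Q_UC <- Qx[U_idx, C_idx]
    L_UU <- t(chol(forceSymmetric(Q_UU)))

    ## mu_{U|C} = mu_U - Q_UU^{-1} Q_UC (e_c - mu_C), via forward/backward solves:
    rhs   <- Q_UC %*% (e_c - mu_x[C_idx])
    mu_UC <- mu_x[U_idx] - as.vector(solve(t(L_UU), solve(L_UU, rhs)))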

4.2 Conditioning by Kriging

Conditioning on local observations preserves the Markov structure, and Markov breaking conditioning is therefore done as a final step, when computing posterior means and sampling from the posterior, as well as from the prior. Using the conditioning by Kriging introduced in Section 2.5.3, conditioning on linear constraints or on non-local observations can be written in a common framework. Let $Q_c = L_c L_c^\top$ be the Cholesky decomposition of $Q_c$ in the soft constraint case, and for notational convenience define $Q_c^{-1} = 0$ and $L_c^{-1} = 0$ for hard constraints.

The method proceeds by constructing a sample from the unconstrained model, and then correcting for the constraints:
\begin{align}
x^* &= \mu + L^{-\top} w , & w &\sim N(0, I_{m+p}) , \\
e_c^* &= e_c + L_c^{-\top} w_c , & w_c &\sim N(0, I_r) , \\
x &= x^* - Q^{-1} A_c^\top \left( A_c Q^{-1} A_c^\top + Q_c^{-1} \right)^{-1} (A_c x^* - e_c^*) .
\end{align}
In order to obtain the conditional mean instead of a sample, set $x^* = \mu$ and $e_c^* = e_c$. The dense but small inner $r \times r$ matrix can be evaluated using a single forward solve with $L$,
\[
A_c Q^{-1} A_c^\top + Q_c^{-1} = \left( L^{-1} A_c^\top \right)^\top \left( L^{-1} A_c^\top \right) + Q_c^{-1} .
\]
Defining $\widetilde{A}_c = L^{-1} A_c^\top$, the full expression becomes
\[
x = x^* - L^{-\top} \widetilde{A}_c \left( \widetilde{A}_c^\top \widetilde{A}_c + Q_c^{-1} \right)^{-1} (A_c x^* - e_c^*) ,
\]
where the remaining inner $r \times r$ solve is evaluated using e.g. Cholesky decomposition, and the result is fed into a final backward solve with $L$.

In order to avoid unnecessary recalculations for multiple samples, further precomputation can be used. A fully Cholesky based approach, that also simplifies later calculations, is
\begin{align}
\widetilde{L}_c \widetilde{L}_c^\top &= \widetilde{A}_c^\top \widetilde{A}_c + L_c^{-\top} L_c^{-1} , \\
B &= L^{-\top} \widetilde{A}_c \widetilde{L}_c^{-\top} , \\
x &= x^* - B\, \widetilde{L}_c^{-1} (A_c x^* - e_c^*) ,
\end{align}
where $\widetilde{L}_c$ is a dense $r \times r$ Cholesky factor and $B$ is dense $(m+p) \times r$. For models that require iterative solves for $Q$, instead of Cholesky decomposition, the pre-calculations are modified slightly,
\begin{align}
\widetilde{A}_c &= Q^{-1} A_c^\top , \\
\widetilde{L}_c \widetilde{L}_c^\top &= A_c \widetilde{A}_c + L_c^{-\top} L_c^{-1} , \\
B &= \widetilde{A}_c \widetilde{L}_c^{-\top} ,
\end{align}
but the final step for constructing $x$ remains unchanged.

4.3 Conditional covariances

From the conditioning by Kriging algorithm we obtain
\[
\mathrm{Cov}(x \mid e_c) = Q^{-1} - B B^\top ,
\]
which can be combined with the neighbour covariance algorithm from Section 2.2 to yield the conditional covariances between nodes that are neighbours in the unconstrained model,
\[
(S_c)_{ij} =
\begin{cases}
S_{ij} - \sum_k B_{ik} B_{jk} & \text{when } Q_{ij} \neq 0 \text{ or } (i,j) \text{ is in the Cholesky infill, and} \\
0 & \text{otherwise.}
\end{cases}
\]
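A sketch of the soft constraint conditioning by Kriging of Section 4.2 in R (illustrative, continuing the earlier toy objects): a single constraint stating that the sum of the $u$ coefficients is observed to be $0$ with precision $100$; Bmat and Lt_c correspond to $B$ and $\widetilde{L}_c$ above.

    A_c    <- Matrix(c(rep(1, m), rep(0, p)), nrow = 1, sparse = TRUE)
    ec_obs <- 0
    Q_c    <- Diagonal(1, 100)            # soft constraint precision
    L_c    <- t(chol(Q_c))

    At_c <- solve(Lx, t(A_c))             # A~_c = L^{-1} A_c', size (m+p) x r
    M    <- forceSymmetric(t(At_c) %*% At_c + solve(Q_c))   # inner r x r matrix
    Lt_c <- t(chol(M))                    # dense r x r factor L~_c
    Bmat <- solve(t(Lx), At_c %*% solve(t(Lt_c)))            # B = L^{-T} A~_c L~_c^{-T}

    ## Correct an unconstrained prior sample x_star into a sample from (x | e_c):
    x_star  <- mu_x + as.vector(solve(t(Lx), rnorm(m + p)))
    ec_star <- ec_obs + as.vector(solve(t(L_c), rnorm(1)))
    x_cond  <- x_star -
      as.vector(Bmat %*% solve(Lt_c, as.vector(A_c %*% x_star) - ec_star))
    ## Section 4.3: the conditional covariance correction is Q^{-1} - Bmat %*% t(Bmat).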

4.4 Constrained likelihood evaluation

The constrained likelihood is given by
\[
\pi(x \mid e_c) = \frac{\pi(x)\, \pi(e_c \mid x)}{\pi(e_c)} ,
\]
where $\pi(e_c \mid x)$ is a degenerate density in the hard constraint case. The marginal distributions for $x$ and $e_c$ are
\begin{align}
x &\sim N(\mu, Q^{-1}) , \\
e_c &\sim N(A_c \mu, \widetilde{L}_c \widetilde{L}_c^\top) ,
\end{align}
and the densities can be evaluated with the usual approach, with
\begin{align}
\log \pi(x) &= -\frac{m+p}{2} \log(2\pi) + \log \det(L) - \frac{1}{2} (x - \mu)^\top Q (x - \mu) , \\
\log \pi(e_c) &= -\frac{r}{2} \log(2\pi) - \log \det(\widetilde{L}_c) - \frac{1}{2} (e_c - A_c \mu)^\top \widetilde{L}_c^{-\top} \widetilde{L}_c^{-1} (e_c - A_c \mu) .
\end{align}
In the hard constraint case, Section 2.3.3 of Rue and Held (2005) provides the degenerate conditional density through
\[
\log \pi(e_c \mid x) = \log \pi(A_c x \mid x) = -\frac{1}{2} \log \det(A_c A_c^\top) ,
\]
which can be evaluated using the Cholesky decomposition of the $r \times r$ matrix $A_c A_c^\top$. In the soft constraint case,
\[
\log \pi(e_c \mid x) = -\frac{r}{2} \log(2\pi) + \log \det(L_c) - \frac{1}{2} (e_c - A_c x)^\top Q_c (e_c - A_c x) .
\]

References

Rue, H. and L. Held (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC.