Computation fundamentals of discrete GMRF representations of continuous domain spatial models

Finn Lindgren

September 23, 2015, v0.2.2

Abstract

The fundamental formulas and algorithms for Bayesian spatial statistics using continuous domain GMRF models are presented. The construction of the models as approximate SPDE solutions is not covered. The results are applicable to non-spatial GMRF models as well, but the important property of observations being local or non-local is only discussed in the spatial context.

Contents

1 Basic spatial model in GMRF representation 2
1.1 Bayesian hierarchical formulation 2
1.2 Joint distribution 2
1.3 Conditional/posterior distribution (or Kriging) 3
2 Expectations, covariances, and samples 3
2.1 Solving with the posterior precision 3
2.2 Prior and posterior variances 4
2.3 Leave-one-out cross-validation 4
2.4 Sampling from the prior 5
2.5 Sampling from the posterior 5
2.5.1 Direct method 5
2.5.2 Least squares 5
2.5.3 Sampling with conditioning by Kriging 5
3 Likelihood evaluation 6
3.1 Conditional marginal data likelihood 6
3.2 Posterior parameter likelihood and total marginal data likelihood 7
3.3 A note on non-Gaussian data and INLA 7
4 Linear constraints 7
4.1 Non-interacting hard constraints 8
4.2 Conditioning by Kriging 8
4.3 Conditional covariances 9
4.4 Constrained likelihood evaluation 10
1 Basic spatial model in GMRF representation

Basic linear spatial model:

- Spatial basis expansion model $u(s) = \sum_{k=1}^m \psi_k(s) u_k$, with $m$ basis functions $\psi_k(s)$ with local (compact) support, and GMRF coefficients $(u_1, \dots, u_m)^\top = u \sim N(\mu_u, Q_u^{-1})$, where $Q_u$ is derived from an SPDE construction with parameters $\theta_u$, and $\mu_u$ is usually $0$.
- Covariate matrix $B$, with $p$ coefficients $b \sim N(\mu_b, Q_b^{-1})$. Typically $\mu_b = 0$ and $Q_b = \tau_b I_p$ with small $\tau_b$.
- $n$ observations $y = (y_1, \dots, y_n)^\top$ at locations $s_i$, with additive zero mean Gaussian noise $e_i$, where $(e_1, \dots, e_n)^\top = \epsilon \sim N(0, Q_\epsilon^{-1})$. Typically $Q_\epsilon = \tau_\epsilon I$.
- Define $A$ so that $A_{ij} = \psi_j(s_i)$. The observation model is then $y = Bb + Au + \epsilon$.

Since the covariate effects and the spatial field are jointly Gaussian, it is often practical to join them into a single vector $x = (u, b)$, with
\[
\mu_x = \begin{bmatrix} \mu_u \\ \mu_b \end{bmatrix}, \qquad
Q_x = \begin{bmatrix} Q_u & 0 \\ 0 & Q_b \end{bmatrix}, \qquad
A_x = \begin{bmatrix} A & B \end{bmatrix}, \qquad
y = A_x x + \epsilon .
\]

1.1 Bayesian hierarchical formulation

Written as a full Bayesian hierarchical model, the model is
\[
\theta = (\theta_u, \tau_b, \tau_\epsilon) \sim \pi(\theta),
\]
\[
(x \mid \theta) \sim N(\mu_x, Q_x^{-1}),
\]
\[
(y \mid \theta, x) \sim N(A_x x, Q_\epsilon^{-1}).
\]
It will be important to note that, for local basis functions (and at most Markov dependent noise $\epsilon$), $A^\top Q_\epsilon A$ is sparse, and that
\[
A_x^\top Q_\epsilon A_x =
\begin{bmatrix}
A^\top Q_\epsilon A & A^\top Q_\epsilon B \\
B^\top Q_\epsilon A & B^\top Q_\epsilon B
\end{bmatrix}
\]
is therefore also sparse if $p$ is small.

1.2 Joint distribution

The joint distribution for the observations and the latent variables, $(y, x)$, is given by
\[
\begin{bmatrix} y \\ x \end{bmatrix}
\sim N\left(
\begin{bmatrix} A_x \mu_x \\ \mu_x \end{bmatrix},
\begin{bmatrix}
Q_\epsilon & -Q_\epsilon A_x \\
-A_x^\top Q_\epsilon & Q_x + A_x^\top Q_\epsilon A_x
\end{bmatrix}^{-1}
\right).
\]
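The model assembly above can be sketched with a small dense example. The hat-function basis, the intercept covariate, and all parameter values below are hypothetical stand-ins; a real model would use an SPDE-derived $Q_u$ and sparse matrices throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: m basis functions, p covariates, n observations.
m, p, n = 4, 1, 5
s_obs = rng.uniform(0.0, 3.0, size=n)

def psi(k, s):
    # Piecewise linear "hat" function centred at k: local (compact) support.
    return np.maximum(0.0, 1.0 - np.abs(s - k))

# Observation matrix A_ij = psi_j(s_i); sparse in practice, since each
# location s_i only touches a few local basis functions.
A = np.column_stack([psi(k, s_obs) for k in range(m)])
B = np.ones((n, p))                      # intercept-only covariate matrix

# Stand-in precisions (a real Q_u comes from the SPDE construction)
Q_u = (np.diag(2.0 * np.ones(m))
       + np.diag(-1.0 * np.ones(m - 1), 1)
       + np.diag(-1.0 * np.ones(m - 1), -1))
Q_b = 1e-3 * np.eye(p)                   # vague prior: small tau_b
Q_eps = 4.0 * np.eye(n)                  # Q_eps = tau_eps * I

# Joined latent vector x = (u, b): block precision and A_x = [A B]
Q_x = np.block([[Q_u, np.zeros((m, p))],
                [np.zeros((p, m)), Q_b]])
A_x = np.hstack([A, B])

# A_x^T Q_eps A_x: dense only in the small p x p covariate block
increment = A_x.T @ Q_eps @ A_x
assert increment.shape == (m + p, m + p)
assert np.allclose(increment, increment.T)
```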
1.3 Conditional/posterior distribution (or Kriging)

The conditional distribution for $x$ given $y$ and $\theta$ is given by
\[
(x \mid \theta, y) \sim N(\mu_{x|y}, Q_{x|y}^{-1}),
\]
\[
Q_{x|y} = Q_x + A_x^\top Q_\epsilon A_x,
\]
\[
\mu_{x|y} = \mu_x + Q_{x|y}^{-1} A_x^\top Q_\epsilon (y - A_x \mu_x).
\]
The proof is by completing the square in the exponent of the Gaussian conditional density expression
\[
\pi(x \mid \theta, y) = \frac{\pi(x, y \mid \theta)}{\pi(y \mid \theta)} \propto \pi(x \mid \theta)\, \pi(y \mid \theta, x).
\]
Note that the elements of $\mu_{x|y}$ are the basis function coefficients and covariate effect estimates in the universal Kriging predictor, and that we've arrived at it without going the long way around via the traditional covariance based Kriging equations and matrix inversion lemmas.

2 Expectations, covariances, and samples

Direct calculation and simulation (for models smaller than approximately $m = 10^6$) is typically done with sparse Cholesky decomposition. Reordering methods are required to achieve sparsity (CAMD, nested dissection, etc.). Here, assume for ease of notation that the nodes are already in a suitable order, and that both the prior and posterior precision decompositions are sparse,
\[
Q_x = L_x L_x^\top, \qquad Q_{x|y} = L_{x|y} L_{x|y}^\top,
\]
where all $L$ are lower triangular sparse matrices. See Sections 2.5.2 and 4 for examples of how to handle models where only the prior precision has a sparse Cholesky decomposition. In several instances, the needed adjustments for the use of iterative equation solvers are also mentioned.

2.1 Solving with the posterior precision

Kriging requires solving a linear system
\[
Q_{x|y} z = w, \qquad z = L_{x|y}^{-\top} \left( L_{x|y}^{-1} w \right),
\]
requiring one forward solve and one backward solve. Several systems with the same posterior precision can be solved simultaneously by having a multi-column $w$. Practical implementations often avoid actually transposing the large sparse matrices, effectively doing
\[
z = \left( (L_{x|y}^{-1} w)^\top L_{x|y}^{-1} \right)^\top
\]
instead. The Kriging predictor (the posterior expectation) is obtained for $w = A_x^\top Q_\epsilon (y - A_x \mu_x)$, with $\mu_{x|y} = \mu_x + z$.
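To check the posterior precision and mean formulas, and the forward/backward solve pattern, here is a small dense NumPy sketch; all sizes and parameter values are arbitrary toy choices, and a real implementation would use sparse Cholesky factors instead.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
m, n = 6, 8                              # toy latent size and data size

# Stand-in prior precision (tridiagonal, diagonally dominant, hence SPD)
Q_x = (np.diag(2.0 * np.ones(m))
       + np.diag(-0.9 * np.ones(m - 1), 1)
       + np.diag(-0.9 * np.ones(m - 1), -1))
A_x = rng.normal(size=(n, m))            # dense stand-in for the sparse A_x
Q_eps = 3.0 * np.eye(n)
mu_x = np.zeros(m)
y = rng.normal(size=n)

# Posterior precision Q_{x|y} = Q_x + A_x^T Q_eps A_x and its Cholesky factor
Q_post = Q_x + A_x.T @ Q_eps @ A_x
L = np.linalg.cholesky(Q_post)           # Q_post = L L^T, L lower triangular

# Kriging predictor: one forward and one backward solve for Q_post z = w
w = A_x.T @ Q_eps @ (y - A_x @ mu_x)
v = solve_triangular(L, w, lower=True)       # forward solve:  L v = w
z = solve_triangular(L.T, v, lower=False)    # backward solve: L^T z = v
mu_post = mu_x + z

# Agrees with a direct dense solve of Q_post z = w
assert np.allclose(mu_post, mu_x + np.linalg.solve(Q_post, w))
```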
2.2 Prior and posterior variances

Even though the full covariance matrix is too large to be stored, the variances, and the covariances between neighbouring Markov nodes, can be calculated using an algorithm related to the Cholesky decomposition. In R-INLA this is available as S=inla.qinv(Q), where
\[
S_{ij} =
\begin{cases}
(Q^{-1})_{ij}, & \text{when } Q_{ij} \neq 0 \text{ or } (i,j) \text{ is in the Cholesky infill, and} \\
0, & \text{otherwise.}
\end{cases}
\tag{1}
\]
This means that the posterior neighbour covariances $\mathrm{Cov}(x_i, x_j \mid \theta, y)$ for neighbours $(i,j)$ are obtained by calling inla.qinv() with the posterior precision $Q_{x|y}$.

Provided that the desired posterior predictions $A_{\mathrm{pred}}\, x$ have the same sparse connectivity structure as the prior model itself, e.g. when $A_{\mathrm{pred}} = A_x$, the sparse symmetric $S$ matrix is sufficient to evaluate the marginal posterior predictive variances, as the diagonal of $A_{\mathrm{pred}} S A_{\mathrm{pred}}^\top$:
\[
\mathrm{Var}\left( (A_{\mathrm{pred}}\, x)_i \right) = \sum_{j=1}^{m+p} (A_{\mathrm{pred}})_{ij} \sum_{k=1}^{m+p} S_{jk} (A_{\mathrm{pred}})_{ik},
\]
which in R would typically be implemented as rowSums(A * (A %*% S)).

When the desired predictions are dense relative to the model (e.g. when predicting the overall spatial average), solves with $L_{x|y}$ are required. A small number of predictions (i.e. when the number of rows of $A_{\mathrm{pred}}$ is small) can be evaluated simultaneously with
LAt <- solve(L, t(A))
colSums(LAt * LAt)

Data predictive marginal variances, for $A_{\mathrm{pred}}\, x + \epsilon_{\mathrm{pred}}$, are obtained by adding $\tau_\epsilon^{-1}$ to the variances for $A_{\mathrm{pred}}\, x$.

2.3 Leave-one-out cross-validation

In the case of conditionally independent observations, expressions for the expectations and variances of leave-one-out cross-validation predictive distributions can be obtained via rank one modification of the full posterior precision matrix, and are obtained essentially for free by manipulating the above expressions. Let $S_{x|y}$ be the neighbour covariance matrix obtained from (1) applied to $Q_{x|y}$, and define
\[
E_i = (A_x \mu_{x|y})_i, \qquad V_i = (A_x)_{i,\cdot}\, S_{x|y}\, (A_x)_{i,\cdot}^\top .
\]
Then the predictive distribution for observation $i$ based on the complete data is
\[
\left( (A_x x)_i \mid \theta, y \right) \sim N(E_i, V_i),
\]
and the leave-one-out predictive distribution is
\[
\left( (A_x x)_i \mid \theta, \{ y_j,\ j \neq i \} \right) \sim N(E_{(i)}, V_{(i)}),
\]
where
\[
E_{(i)} = y_i - \frac{y_i - E_i}{1 - q_i V_i}, \qquad
V_{(i)} = \frac{V_i}{1 - q_i V_i},
\]
and $q_i$ is the conditional precision of observation $i$, from $Q_\epsilon = \mathrm{diag}(q_1, \dots, q_n)$.
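The leave-one-out formulas can be checked against an explicit refit without observation $i$. This dense sketch uses arbitrary toy values, and a full dense inverse stands in for the covariance entries that inla.qinv() would provide.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 7                              # toy sizes

Q_x = 2.0 * np.eye(m)                    # stand-in prior precision, mu_x = 0
A = rng.normal(size=(n, m))
q = np.full(n, 4.0)                      # conditional observation precisions
Q_eps = np.diag(q)
y = rng.normal(size=n)

# Full posterior for x
Q_post = Q_x + A.T @ Q_eps @ A
mu_post = np.linalg.solve(Q_post, A.T @ Q_eps @ y)
S_post = np.linalg.inv(Q_post)           # dense stand-in for covariance entries

i = 2
E_i = (A @ mu_post)[i]
V_i = A[i] @ S_post @ A[i]

# Rank-one leave-one-out formulas
E_loo = y[i] - (y[i] - E_i) / (1.0 - q[i] * V_i)
V_loo = V_i / (1.0 - q[i] * V_i)

# Verify against an actual refit without observation i
keep = np.arange(n) != i
Q_loo = Q_x + A[keep].T @ np.diag(q[keep]) @ A[keep]
mu_loo = np.linalg.solve(Q_loo, A[keep].T @ np.diag(q[keep]) @ y[keep])
assert np.allclose(E_loo, A[i] @ mu_loo)
assert np.allclose(V_loo, A[i] @ np.linalg.inv(Q_loo) @ A[i])
```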
2.4 Sampling from the prior

Let $w \sim N(0, I_{m+p})$. Then
\[
x = \mu_x + L_x^{-\top} w
\]
is a sample from the prior model for $x$. Multiple samples are obtained by repeated backward solves.

2.5 Sampling from the posterior

2.5.1 Direct method

First compute $\mu_{x|y}$ as in Section 2.1, and then let $w \sim N(0, I_{m+p})$. Then
\[
x = \mu_{x|y} + L_{x|y}^{-\top} w
\]
is a sample from the posterior for $x$ given $y$. Multiple samples are obtained by repeated backward solves.

2.5.2 Least squares

In some models the posterior precision matrix does not have a sparse Cholesky decomposition, but iterative linear algebra methods may still be used to solve systems with $Q_{x|y}$. If the prior precision can be decomposed as $Q_x = H^\top H$ for some sparse matrix $H$ (not necessarily a Cholesky factor), then one approach to sampling is least squares. Given that the posterior expectation $\mu_{x|y}$ has already been calculated using an iterative solution to the equations in Section 1.3, a sample of $x$ conditionally on $y$ is given by the solution to
\[
Q_{x|y} \left( x - \mu_{x|y} \right) = \begin{bmatrix} H^\top & A_x^\top L_\epsilon \end{bmatrix} w,
\]
where $w \sim N(0, I_{m+p+n})$.

2.5.3 Sampling with conditioning by Kriging

The conditioning by Kriging approach to sampling is to construct samples from the prior, and then correct them to become samples from the posterior. For GMRF models this is less efficient than the direct approach in Section 2.5.1, except when some observations are non-local, which leads to a dense posterior precision. In its traditional form, the samples are constructed as follows:
\[
x^* = \mu_x + L_x^{-\top} w_x, \qquad w_x \sim N(0, I_{m+p}),
\]
\[
y^* = y + L_\epsilon^{-\top} w_y, \qquad w_y \sim N(0, I_n),
\]
\[
x = x^* - Q_x^{-1} A_x^\top \left( A_x Q_x^{-1} A_x^\top + Q_\epsilon^{-1} \right)^{-1} \left( A_x x^* - y^* \right),
\]
where the final expression is the covariance based Kriging equation. For anything but very small $n$, the inner matrix is dense. Using the Woodbury identity, the expression can be rewritten to recover the precision based equation from Section 1.3,
\[
x = x^* + \left( Q_x + A_x^\top Q_\epsilon A_x \right)^{-1} A_x^\top Q_\epsilon \left( y^* - A_x x^* \right)
  = x^* + Q_{x|y}^{-1} A_x^\top Q_\epsilon \left( y^* - A_x x^* \right),
\]
which is calculated as before, with forward and backward solves on $L_{x|y}$.

If $N$ samples as well as the posterior mean are desired, conditioning by Kriging requires $N$ solves each with $L_x$ and $L_\epsilon$ (to obtain $x^*$ and $y^*$), and $2 + 2N$ solves with $L_{x|y}$. The direct method only requires $2 + N$ solves with $L_{x|y}$, and no solves with $L_x$ or $L_\epsilon$, which means that it is nearly always preferable to conditioning by Kriging. The exception is non-local observations, which can be handled separately as soft constraints; see Section 4.

3 Likelihood evaluation

The expensive part of evaluating likelihoods is calculating precision determinants, but they come at no extra cost given the Cholesky factors:
\[
\log \pi(x \mid \theta) = -\frac{m+p}{2} \log(2\pi) + \frac{1}{2} \log\det(Q_x) - \frac{1}{2} (x - \mu_x)^\top Q_x (x - \mu_x),
\]
\[
\log\det(Q_x) = 2 \sum_{j=1}^{m+p} \log (L_x)_{jj}.
\]
Corresponding expressions hold for $\pi(y \mid \theta, x)$ and $\pi(x \mid \theta, y)$, using $L_\epsilon$ and $L_{x|y}$. Marginal data likelihoods require more effort.

3.1 Conditional marginal data likelihood

For any arbitrary $x$,
\[
\pi(y \mid \theta) = \frac{\pi(x \mid \theta)\, \pi(y \mid \theta, x)}{\pi(x \mid \theta, y)}.
\]
In practice this is evaluated for $x = \mu_{x|y}$, which gives better numerical stability than $x = 0$. Written explicitly,
\[
\log \pi(y \mid \theta) = -\frac{n}{2} \log(2\pi)
+ \frac{1}{2} \log\det(Q_x) + \frac{1}{2} \log\det(Q_\epsilon) - \frac{1}{2} \log\det(Q_{x|y})
\]
\[
\quad - \frac{1}{2} (\mu_{x|y} - \mu_x)^\top Q_x (\mu_{x|y} - \mu_x)
- \frac{1}{2} (y - A_x \mu_{x|y})^\top Q_\epsilon (y - A_x \mu_{x|y}),
\]
where an anticipated third quadratic form, from $\pi(x \mid \theta, y)$, has been eliminated due to the choice of $x$. Comparing with a standard marginal Gaussian log-likelihood expression, the log-determinant and quadratic form can be identified,
\[
\log\det(Q_{y|\theta}) = \log\det(Q_x) + \log\det(Q_\epsilon) - \log\det(Q_{x|y}),
\]
\[
(y - A_x \mu_x)^\top Q_{y|\theta} (y - A_x \mu_x)
= (\mu_{x|y} - \mu_x)^\top Q_x (\mu_{x|y} - \mu_x)
+ (y - A_x \mu_{x|y})^\top Q_\epsilon (y - A_x \mu_{x|y}),
\]
even though the marginal precision matrix $Q_{y|\theta}$ itself is dense and cannot be stored.
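The determinant and quadratic-form identities above can be verified numerically against the dense marginal distribution $y \mid \theta \sim N(A_x \mu_x,\ A_x Q_x^{-1} A_x^\top + Q_\epsilon^{-1})$; a small NumPy sketch with arbitrary toy values:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
m, n = 4, 6                              # toy sizes

Q_x = 2.0 * np.eye(m)
A = rng.normal(size=(n, m))
Q_eps = 3.0 * np.eye(n)
mu_x = rng.normal(size=m)
y = rng.normal(size=n)

Q_post = Q_x + A.T @ Q_eps @ A
mu_post = mu_x + np.linalg.solve(Q_post, A.T @ Q_eps @ (y - A @ mu_x))

def logdet(M):
    return np.linalg.slogdet(M)[1]

# Conditional marginal data likelihood, evaluated at x = mu_post
logp = (-n / 2 * np.log(2 * np.pi)
        + 0.5 * logdet(Q_x) + 0.5 * logdet(Q_eps) - 0.5 * logdet(Q_post)
        - 0.5 * (mu_post - mu_x) @ Q_x @ (mu_post - mu_x)
        - 0.5 * (y - A @ mu_post) @ Q_eps @ (y - A @ mu_post))

# Compare with the dense marginal y | theta (not storable at scale)
cov_y = A @ np.linalg.inv(Q_x) @ A.T + np.linalg.inv(Q_eps)
logp_direct = multivariate_normal(mean=A @ mu_x, cov=cov_y).logpdf(y)
assert np.allclose(logp, logp_direct)
```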
3.2 Posterior parameter likelihood and total marginal data likelihood

Using the result for the conditional marginal data likelihood, we can express the posterior parameter likelihood as
\[
\pi(\theta \mid y) = \frac{\pi(\theta, y)}{\pi(y)} = \frac{\pi(\theta)\, \pi(y \mid \theta)}{\pi(y)},
\]
where the only unknown is the total marginal data likelihood $\pi(y)$, which can be approximated with numerical integration over $\theta$,
\[
\pi(y) = \int \pi(\theta)\, \pi(y \mid \theta)\, d\theta.
\]

3.3 A note on non-Gaussian data and INLA

For non-Gaussian data, a Laplace approximation can be used, which in essence involves replacing the non-Gaussian conditional posterior $\pi(x \mid \theta, y)$ with a Gaussian approximation in the expression for $\pi(y \mid \theta)$, taking $x$ as the mode of $\pi(x \mid \theta, y)$. The INLA implementation finds the mode of $\pi(\theta \mid y)$ using numerical optimisation, and further approximates the marginal posteriors for the latent variables $x_k$, $\pi(x_k \mid y)$, with numerical integration of
\[
\pi(x_k \mid y) = \int \pi(x_k \mid \theta, y)\, \pi(\theta \mid y)\, d\theta,
\]
where a further Laplace approximation of $\pi(x_k \mid \theta, y)$ is made (as well as skewness corrections not discussed here),
\[
\widetilde{\pi}(x_k \mid \theta, y)
= \left. \frac{\pi(x_k, x_{(k)} \mid \theta, y)}{\widetilde{\pi}(x_{(k)} \mid \theta, y, x_k)} \right|_{x_{(k)} = x^*_{(k)}},
\]
where $x_{(k)}$ denotes all latent variables except $x_k$, evaluated at the conditional mode $x^*_{(k)}$. This combination of Laplace approximations is the nested Laplace part of INLA. In the case of Gaussian data, the only approximation in the method is the numerical integration itself.

4 Linear constraints

There are two main classes of linear constraints:

1. Hard constraints, which fall into two subclasses:
(a) Interacting hard constraints, where linear combinations of nodes are constrained.
(b) Non-interacting hard constraints, where each constraint acts only on a single node, in effect specifying those nodes explicitly.
2. Soft constraints, where linear combinations are specified with some uncertainty, equivalent to posterior conditioning on observations of these linear combinations. Observations are treated under this framework when they break the Markov structure in the posterior distribution, e.g. by affecting all the nodes in the field directly.

All these cases can be written on matrix form as
\[
A_c x + \epsilon_c = e_c,
\]
where $\epsilon_c = 0$ in the hard constraint cases, and $\epsilon_c \sim N(0, Q_c^{-1})$ in the soft constraint case. The aim is to sample from, and evaluate properties of, the conditional distribution for $(x \mid e_c)$, when
\[
x \sim N(\mu, Q^{-1}), \qquad (e_c \mid x) \sim N(A_c x, Q_c^{-1}),
\]
where $x$ comes from a known GMRF model, typically either a prior or a posterior distribution. Let $r$ denote the number of constraints, i.e. the number of rows in $A_c$ and the number of elements in $e_c$.

4.1 Non-interacting hard constraints

Non-interacting hard constraints are easiest to handle, as the constrained nodes can be completely removed from all calculations in a preprocessing step, since they are in effect known, deterministic values, and the conditional precision for the remaining nodes can be trivially extracted as a block from the joint precision matrix. Without loss of practical generality, we can assume that all scaling is done in $e_c$, so that each row of $A_c$ has a single non-zero entry that is equal to $1$. For ease of notation, assume that the nodes are ordered so that the unconstrained nodes $x_U$ are followed by the constrained nodes $x_C$, so that the joint distribution is
\[
\begin{bmatrix} x_U \\ x_C \end{bmatrix}
\sim N\left(
\begin{bmatrix} \mu_U \\ \mu_C \end{bmatrix},
\begin{bmatrix} Q_{UU} & Q_{UC} \\ Q_{CU} & Q_{CC} \end{bmatrix}^{-1}
\right).
\]
The conditional distribution for the unconstrained nodes is given by
\[
(x_U \mid x_C = e_c) \sim N\left( \mu_{U|C},\ Q_{UU}^{-1} \right), \qquad
\mu_{U|C} = \mu_U - Q_{UU}^{-1} Q_{UC} (e_c - \mu_C).
\]
The constrained mean and constrained samples are obtained with the same forward and backward solving techniques as in Section 1.3 and Section 2.5.1, using the Cholesky decomposition of $Q_{UU}$. The joint constrained distribution can be written as a degenerate Normal distribution,
\[
\left( \begin{bmatrix} x_U \\ x_C \end{bmatrix} \,\middle|\, x_C = e_c \right)
\sim N\left(
\begin{bmatrix} \mu_{U|C} \\ e_c \end{bmatrix},
\begin{bmatrix} Q_{UU}^{-1} & 0 \\ 0 & 0 \end{bmatrix}
\right).
\]

4.2 Conditioning by Kriging

Conditioning on local observations preserves the Markov structure, and Markov breaking conditioning is therefore done as a final step, when computing posterior means and sampling from the posterior, as well as from the prior.
Using the conditioning by Kriging introduced in Section 2.5.3, conditioning on linear constraints, or on non-local observations, can be written in a common framework. Let $Q_c = L_c L_c^\top$ be the Cholesky decomposition of $Q_c$ in the soft constraint case, and for notational convenience define $Q_c^{-1} = 0$ and $L_c^{-1} = 0$ for hard constraints.
The method proceeds by constructing a sample from the unconstrained model, and then correcting for the constraints:
\[
x^* = \mu + L^{-\top} w, \qquad w \sim N(0, I_{m+p}),
\]
\[
e_c^* = e_c + L_c^{-\top} w_c, \qquad w_c \sim N(0, I_r),
\]
\[
x = x^* - Q^{-1} A_c^\top \left( A_c Q^{-1} A_c^\top + Q_c^{-1} \right)^{-1} \left( A_c x^* - e_c^* \right).
\]
In order to obtain the conditional mean instead of a sample, set $x^* = \mu$ and $e_c^* = e_c$. The dense but small inner $r \times r$ matrix can be evaluated using a single forward solve with $L$,
\[
A_c Q^{-1} A_c^\top + Q_c^{-1} = \left( L^{-1} A_c^\top \right)^\top \left( L^{-1} A_c^\top \right) + Q_c^{-1}.
\]
Defining $\widetilde{A}_c = L^{-1} A_c^\top$, the full expression becomes
\[
x = x^* - L^{-\top} \widetilde{A}_c \left( \widetilde{A}_c^\top \widetilde{A}_c + Q_c^{-1} \right)^{-1} \left( A_c x^* - e_c^* \right),
\]
where the remaining inner $r \times r$ solve is evaluated using e.g. Cholesky decomposition, and the result is fed into a final backward solve with $L$.

In order to avoid unnecessary recalculations for multiple samples, further precomputation can be used. A fully Cholesky based approach that also simplifies later calculations is
\[
\widetilde{L}_c \widetilde{L}_c^\top = \widetilde{A}_c^\top \widetilde{A}_c + L_c^{-\top} L_c^{-1},
\]
\[
B = L^{-\top} \widetilde{A}_c \widetilde{L}_c^{-\top},
\]
\[
x = x^* - B \widetilde{L}_c^{-1} \left( A_c x^* - e_c^* \right),
\]
where $\widetilde{L}_c$ is a dense $r \times r$ Cholesky factor and $B$ is a dense $(m+p) \times r$ matrix. For models that require iterative solves for $Q$ instead of Cholesky decomposition, the pre-calculations are modified slightly,
\[
\widetilde{A}_c = Q^{-1} A_c^\top,
\]
\[
\widetilde{L}_c \widetilde{L}_c^\top = A_c \widetilde{A}_c + L_c^{-\top} L_c^{-1},
\]
\[
B = \widetilde{A}_c \widetilde{L}_c^{-\top},
\]
but the final step for constructing $x$ remains unchanged.

4.3 Conditional covariances

From the conditioning by Kriging algorithm we obtain
\[
\mathrm{Cov}(x \mid e_c) = Q^{-1} - B B^\top,
\]
which can be combined with the neighbour covariance algorithm from Section 2.2 to yield the conditional covariances between nodes that are neighbours in the unconstrained model,
\[
(S_c)_{ij} =
\begin{cases}
S_{ij} - \sum_k B_{ik} B_{jk}, & \text{when } Q_{ij} \neq 0 \text{ or } (i,j) \text{ is in the Cholesky infill, and} \\
0, & \text{otherwise.}
\end{cases}
\]
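The $B$-based precomputation can be checked against covariance-based Gaussian conditioning. This is a dense sketch of the soft constraint case with arbitrary toy matrices; a real implementation would use sparse factors for $L$.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(8)
m, r = 6, 2                              # toy sizes: m nodes, r constraints

M = rng.normal(size=(m, m))
Q = M @ M.T + m * np.eye(m)              # SPD stand-in for the model precision
L = np.linalg.cholesky(Q)                # Q = L L^T
A_c = rng.normal(size=(r, m))
Q_c = 5.0 * np.eye(r)                    # soft constraint precision
mu = rng.normal(size=m)
e_c = rng.normal(size=r)

# Atilde = L^{-1} A_c^T via one forward solve, then the r x r inner factor
Atilde = solve_triangular(L, A_c.T, lower=True)
Ltilde = np.linalg.cholesky(Atilde.T @ Atilde + np.linalg.inv(Q_c))

# B = L^{-T} Atilde Ltilde^{-T}, dense (m x r)
Ltilde_inv = solve_triangular(Ltilde, np.eye(r), lower=True)
B = solve_triangular(L.T, Atilde @ Ltilde_inv.T, lower=False)

# Conditional mean: run the correction step with x* = mu, e_c* = e_c
x_mean = mu - B @ solve_triangular(Ltilde, A_c @ mu - e_c, lower=True)

# Reference results from covariance-based Gaussian conditioning
Qinv = np.linalg.inv(Q)
inner = A_c @ Qinv @ A_c.T + np.linalg.inv(Q_c)
mu_ref = mu + Qinv @ A_c.T @ np.linalg.solve(inner, e_c - A_c @ mu)
cov_ref = Qinv - Qinv @ A_c.T @ np.linalg.solve(inner, A_c @ Qinv)

assert np.allclose(x_mean, mu_ref)
assert np.allclose(Qinv - B @ B.T, cov_ref)  # Cov(x | e_c) = Q^{-1} - B B^T
```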
4.4 Constrained likelihood evaluation

The constrained likelihood is given by
\[
\pi(x \mid e_c) = \frac{\pi(x)\, \pi(e_c \mid x)}{\pi(e_c)},
\]
where $\pi(e_c \mid x)$ is a degenerate density in the hard constraint case. The marginal distributions for $x$ and $e_c$ are
\[
x \sim N(\mu, Q^{-1}), \qquad e_c \sim N\left( A_c \mu,\ \widetilde{L}_c \widetilde{L}_c^\top \right),
\]
and the densities can be evaluated with the usual approach, with
\[
\log \pi(x) = -\frac{m+p}{2} \log(2\pi) + \log\det(L) - \frac{1}{2} (x - \mu)^\top Q (x - \mu),
\]
\[
\log \pi(e_c) = -\frac{r}{2} \log(2\pi) - \log\det(\widetilde{L}_c) - \frac{1}{2} (e_c - A_c \mu)^\top \widetilde{L}_c^{-\top} \widetilde{L}_c^{-1} (e_c - A_c \mu).
\]
In the hard constraint case, Section 2.3.3 of Rue and Held (2005) provides the degenerate conditional density through
\[
\log \pi(e_c \mid x) = \log \pi(A_c x \mid x) = -\frac{1}{2} \log\det\left( A_c A_c^\top \right),
\]
which can be evaluated using the Cholesky decomposition of the $r \times r$ matrix $A_c A_c^\top$. In the soft constraint case,
\[
\log \pi(e_c \mid x) = -\frac{r}{2} \log(2\pi) + \log\det(L_c) - \frac{1}{2} (e_c - A_c x)^\top Q_c (e_c - A_c x).
\]

References

Rue, H. and L. Held (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC.