Gaussian Markov Random Fields: Theory and Applications


1 Gaussian Markov Random Fields: Theory and Applications. Håvard Rue, Department of Mathematical Sciences, NTNU, Norway. September 26, 2008

2 Part I Definition and basic properties

3 Outline I Introduction Why? Outline of this course What is a GMRF? The precision matrix Definition of a GMRF Example: Auto-regressive process Why are GMRFs important? Main features of GMRFs Properties of GMRFs Interpretation of elements of Q Markov properties Conditional density Specification through full conditionals Proof via Brook's lemma

4 Introduction Why? Why is it a good idea to learn about Gaussian Markov random fields (GMRFs)? That is a good question! What's there to learn? Isn't it all just a Gaussian? That is also a good question!

5 Introduction Why? Why? Gaussians (most often in the form of a GMRF) are used extensively in statistical models. However, their excellent computational properties are often not exploited, partly due to a lack of knowledge of general-purpose and near-optimal numerical algorithms.

6 Introduction Why? Why? Fast(er) MCMC-based inference. Recent developments (the really nice stuff...): near-instant (compared to MCMC) approximate Bayesian inference is often possible. It is deterministic, with a relative error that, in practice, amounts to no error.

7 Introduction Why? Longitudinal mixed effects model: Epil-example from BUGS

8 Introduction Why? Longitudinal mixed effects model: Epil-example from BUGS... In this example x = (a_0, a_Base, a_Trt, a_BT, a_Age, a_V4, {b1_j}, {b_jk}) is a GMRF. Two (hyper-)parameters: Precision(b1) and Precision(b). Poisson observations.

9 Outline of this course Outline of this course: Day I Lecture 1 GMRFs: definition and basic properties Lecture 2 Simulation algorithms for GMRFs Lecture 3 Numerical methods for sparse matrices Lecture 4 Intrinsic GMRFs Lecture 5 Hierarchical GMRF models and MCMC algorithms for such models Lecture 6 Case-studies

10 Outline of this course Outline of this course: Day II Lecture 1 Motivation for Approximate Bayesian inference Lecture 2 INLA: Integrated Nested Laplace Approximations Lecture 3 The inla-program: examples Lecture 4 Using R-interface to the inla-program (Sara Martino) Lecture 5 Case studies with R (Sara Martino) Lecture 6 Discussion

11 Outline of this course What is a GMRF? What is a Gaussian Markov random field (GMRF)? A GMRF is a simple construct: a normally distributed random vector x = (x_1, ..., x_n)^T with additional Markov properties: x_i ⊥ x_j | x_{-ij} means that x_i and x_j are conditionally independent (CI).

12 Outline of this course The precision matrix If x_i ⊥ x_j | x_{-ij} for a set of pairs {i, j}, then we need to constrain the parametrisation of the GMRF. Covariance matrix: difficult. Precision matrix: easy.

13 Outline of this course The precision matrix Conditional independence and the precision matrix The density of a zero-mean Gaussian is π(x) ∝ |Q|^(1/2) exp(−½ x^T Q x). Constraining the parametrisation to obey the CI properties: Theorem. x_i ⊥ x_j | x_{-ij} ⟺ Q_ij = 0

14 Outline of this course The precision matrix Use an (undirected) graph G = (V, E) to represent the CI properties: V vertices: 1, 2, ..., n; E edges {i, j}: no edge between i and j if x_i ⊥ x_j | x_{-ij}, an edge between i and j otherwise.

15 Outline of this course Definition of a GMRF Definition of a GMRF Definition (GMRF) A random vector x = (x_1, ..., x_n)^T is called a GMRF wrt the graph G = (V = {1, ..., n}, E) with mean µ and precision matrix Q > 0, iff its density has the form π(x) = (2π)^(−n/2) |Q|^(1/2) exp(−½ (x − µ)^T Q (x − µ)) and Q_ij ≠ 0 ⟺ {i, j} ∈ E for all i ≠ j.

16 Outline of this course Example: Auto-regressive process Simple example of a GMRF Auto-regressive process of order 1: x_t | x_{t−1}, ..., x_1 ~ N(φ x_{t−1}, 1), t = 2, ..., n, and x_1 ~ N(0, (1 − φ²)^{−1}). Tridiagonal precision matrix

Q =
[  1     −φ
  −φ   1 + φ²   −φ
         ⋱        ⋱        ⋱
                −φ    1 + φ²   −φ
                        −φ       1  ]
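The following is a minimal numerical sketch (not from the slides; NumPy assumed) that builds this tridiagonal AR(1) precision matrix and checks that it is the inverse of the familiar AR(1) covariance matrix with entries φ^|i−j| / (1 − φ²).

```python
# Sketch: construct the AR(1) precision matrix and verify Q = Sigma^{-1}.
import numpy as np

def ar1_precision(n, phi):
    Q = np.zeros((n, n))
    Q[0, 0] = Q[-1, -1] = 1.0
    for i in range(1, n - 1):
        Q[i, i] = 1.0 + phi**2
    for i in range(n - 1):
        Q[i, i + 1] = Q[i + 1, i] = -phi
    return Q

n, phi = 6, 0.7
Q = ar1_precision(n, phi)
Sigma = np.array([[phi**abs(i - j) / (1 - phi**2) for j in range(n)] for i in range(n)])
print(np.allclose(Q @ Sigma, np.eye(n)))   # True: the tridiagonal Q is the precision matrix
```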

17 Outline of this course Why are GMRFs important? Usage of GMRFs (I) Structural time-series analysis Autoregressive models. Gaussian state-space models. Computational algorithms based on the Kalman filter and its variants. Analysis of longitudinal and survival data temporal GMRF priors state-space approaches spatial GMRF priors used to analyse longitudinal and survival data.

18 Outline of this course Why are GMRFs important? Usage of GMRFs (II) Graphical models A key model Estimate Q and its (associated) graph from data. Often used in a larger context. Semiparametric regression and splines Model a smooth curve in time or a surface in space. Intrinsic GMRF models and random walk models Discretely observed integrated Wiener processes are GMRFs GMRFs models for coefficients in B-splines.

19 Outline of this course Why are GMRFs important? Usage of GMRFs (III) Image analysis The first main area for spatial models Image restoration using the Wiener filter. Texture modelling and texture discrimination. Segmentation Deformable templates Object identification 3D reconstruction Restoring ultrasound images

20 Outline of this course Why are GMRFs important? Usage of GMRFs (IV) Spatial statistics Latent GMRF model analysis of spatial binary data Geostatistics using GMRFs Analysis of data in social sciences and spatial econometrics Spatial and space-time epidemiology Environmental statistics Inverse problems

21 Outline of this course Main features of GMRFs Main features Analytically tractable. Modelling using conditional independence. Merging GMRFs using conditioning (hierarchical models). Unified framework for understanding: representation and computation using numerical methods for sparse matrices. Fits nicely into the MCMC world: can construct fast and reliable block-MCMC algorithms.

22 Properties of GMRFs Constraining the parametrisation to obey the CI properties: Theorem. x_i ⊥ x_j | x_{-ij} ⟺ Q_ij = 0

23 Properties of GMRFs Proof Start with the factorisation theorem. Theorem. x ⊥ y | z ⟺ π(x, y, z) = f(x, z) g(y, z) (1) for some functions f and g, and for all z with π(z) > 0. Example For π(x, y, z) ∝ exp(x + xz + yz) on some bounded region, we see that x ⊥ y | z. However, this is not the case for π(x, y, z) ∝ exp(xyz).

24 Properties of GMRFs Proof... π(x_i, x_j, x_{-ij}) ∝ exp(−½ Σ_{k,l} x_k Q_kl x_l) = exp(−½ x_i x_j (Q_ij + Q_ji) − ½ Σ_{{k,l} ≠ {i,j}} x_k Q_kl x_l). The first term involves x_i x_j iff Q_ij ≠ 0; the second term does not involve x_i x_j. We now see that π(x_i, x_j, x_{-ij}) = f(x_i, x_{-ij}) g(x_j, x_{-ij}) for some functions f and g iff Q_ij = 0. Use the factorisation theorem.

25 Properties of GMRFs Interpretation of elements of Q Interpretation of elements of Q Let x be a GMRF wrt G = (V, E) with mean µ and precision matrix Q > 0. Then E(x_i | x_{-i}) = µ_i − (1/Q_ii) Σ_{j: j∼i} Q_ij (x_j − µ_j), Prec(x_i | x_{-i}) = Q_ii, and Corr(x_i, x_j | x_{-ij}) = −Q_ij / √(Q_ii Q_jj), i ≠ j.
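As a quick sanity check of these formulas, the sketch below (NumPy assumed; all names illustrative) compares Corr(x_i, x_j | x_{-ij}) = −Q_ij/√(Q_ii Q_jj) with the correlation obtained by brute-force conditioning on the covariance scale.

```python
# Sketch: verify the interpretation of Q numerically for a small zero-mean GMRF.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)              # some SPD precision matrix
Sigma = np.linalg.inv(Q)

i, j = 0, 1
print(-Q[i, j] / np.sqrt(Q[i, i] * Q[j, j]))   # -Q_ij / sqrt(Q_ii Q_jj)

# Brute force: condition (x_i, x_j) on the remaining components via the covariance.
a, b = [i, j], [2, 3]
S = Sigma[np.ix_(a, a)] - Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)]) @ Sigma[np.ix_(b, a)]
print(S[0, 1] / np.sqrt(S[0, 0] * S[1, 1]))    # the same number
```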

26 Properties of GMRFs Markov properties Markov properties Let x be a GMRF wrt G = (V, E). Then the following are equivalent. The pairwise Markov property: x_i ⊥ x_j | x_{-ij} if {i, j} ∉ E and i ≠ j.

27 Properties of GMRFs Markov properties Markov properties Let x be a GMRF wrt G = (V, E). Then the following are equivalent. The local Markov property: x_i ⊥ x_{−{i, ne(i)}} | x_{ne(i)} for every i ∈ V.

28 Properties of GMRFs Markov properties Markov properties Let x be a GMRF wrt G = (V, E). Then the following are equivalent. The global Markov property: x_A ⊥ x_B | x_C for all disjoint sets A, B and C where C separates A and B, and A and B are non-empty.

29 Properties of GMRFs Conditional density (Induced) subgraph Let A ⊂ V; G^A denotes the graph restricted to A: remove all nodes not belonging to A, and all edges where at least one node does not belong to A. Example A = {1, 2}: then V^A = {1, 2} and E^A = {{1, 2}}.

30 Properties of GMRFs Conditional density Conditional density I Let V = A ∪ B where A ∩ B = ∅, and

x = (x_A, x_B),   µ = (µ_A, µ_B),   Q = [ Q_AA  Q_AB
                                          Q_BA  Q_BB ].

Result: x_A | x_B is then a GMRF wrt the subgraph G^A with parameters µ_{A|B} and Q_{A|B} > 0, where µ_{A|B} = µ_A − Q_AA^{−1} Q_AB (x_B − µ_B) and Q_{A|B} = Q_AA.

31 Properties of GMRFs Conditional density Conditional density II µ_{A|B} = µ_A − Q_AA^{−1} Q_AB (x_B − µ_B) and Q_{A|B} = Q_AA. We have explicit knowledge of Q_{A|B}: it is the principal submatrix Q_AA. Conditioning does not change the structure within A; the subgraph G^A only removes the nodes outside A and their edges. The conditional mean only depends on nodes in A ∪ ne(A). If Q_AA is sparse, then µ_{A|B} is the solution of a sparse linear system: Q_AA (µ_{A|B} − µ_A) = −Q_AB (x_B − µ_B)

32–34 Properties of GMRFs Conditional density Conditional density III (figures only)

35 Properties of GMRFs Specification through full conditionals Specification through full conditionals An alternative to specifying a GMRF by its mean and precision matrix is to specify it implicitly through the full conditionals {π(x_i | x_{-i}), i = 1, ..., n}. This approach was pioneered by Besag (1974, 1975). Also known as conditional autoregressions or CAR models.

36 Properties of GMRFs Specification through full conditionals Example: AR-process Specify full conditionals: E(x_1 | x_{-1}) = 0.3 x_2, Prec(x_1 | x_{-1}) = 1; E(x_3 | x_{-3}) = 0.2 x_2 + … x_4, Prec(x_3 | x_{-3}) = 2; and so on.

37 Properties of GMRFs Specification through full conditionals Example: Spatial model Specify full conditionals for each i (assuming zero mean): E(x_i | x_{-i}) = Σ_j w_ij x_j, Prec(x_i | x_{-i}) = κ_i

38 Properties of GMRFs Specification through full conditionals Potential problems Even though the full conditionals are well defined, are we sure that the joint density exists and is unique? What are the compatibility conditions required for a joint GMRF to exist with the prescribed Markov properties?

39 Properties of GMRFs Specification through full conditionals In general Specify the full conditionals as normals with E(x_i | x_{-i}) = µ_i − Σ_{j: j∼i} β_ij (x_j − µ_j) (2) and Prec(x_i | x_{-i}) = κ_i > 0 (3) for i = 1, ..., n, for some {β_ij, i ≠ j}, and vectors µ and κ. Clearly, the neighbourhood structure ∼ is defined implicitly by the nonzero terms of {β_ij}.

40 Properties of GMRFs Specification through full conditionals E(x_i | x_{-i}) = µ_i − Σ_{j: j∼i} β_ij (x_j − µ_j) and Prec(x_i | x_{-i}) = κ_i > 0. There must be a joint density π(x) with these as its full conditionals. Comparing term by term with (2): Q_ii = κ_i and Q_ij = κ_i β_ij. Q must be symmetric, i.e., κ_i β_ij = κ_j β_ji. We have a candidate for a joint density provided Q > 0.

41 Properties of GMRFs Proof via Brook's lemma Lemma (Brook's lemma) Let π(x) be the density for x ∈ R^n and define Ω = {x ∈ R^n : π(x) > 0}. Let x, x' ∈ Ω; then

π(x)/π(x') = ∏_{i=1}^{n} π(x_i | x_1, ..., x_{i−1}, x'_{i+1}, ..., x'_n) / π(x'_i | x_1, ..., x_{i−1}, x'_{i+1}, ..., x'_n)
           = ∏_{i=1}^{n} π(x_i | x'_1, ..., x'_{i−1}, x_{i+1}, ..., x_n) / π(x'_i | x'_1, ..., x'_{i−1}, x_{i+1}, ..., x_n).

42 Properties of GMRFs Proof via Brook's lemma Assume µ = 0 and fix x' = 0. Then

log π(x)/π(0) = −½ Σ_{i=1}^{n} κ_i x_i² − Σ_{i=2}^{n} Σ_{j=1}^{i−1} κ_i β_ij x_i x_j   (4)

and

log π(x)/π(0) = −½ Σ_{i=1}^{n} κ_i x_i² − Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} κ_i β_ij x_i x_j.   (5)

Since (4) and (5) must be identical, κ_i β_ij = κ_j β_ji for i ≠ j, and

log π(x) = const − ½ Σ_{i=1}^{n} κ_i x_i² − ½ Σ_{i≠j} κ_i β_ij x_i x_j;

hence x is zero-mean multivariate normal provided Q > 0.

43 Part II Simulation algorithms for GMRFs

44 Outline I Introduction Simulation algorithms for GMRFs Task Summary of the simulation algorithms Basic numerical linear algebra Cholesky factorisation and the Cholesky triangle Solving linear equations Avoid computing the inverse Unconditional sampling Simulation algorithm Evaluating the log-density Conditional sampling Canonical parameterisation Conditional distribution in the canonical parameterisation Sampling from a canonical parameterised GMRF

45 Outline II Sampling under hard linear constraints Introduction General algorithm (slow) Specific algorithm with not too many constraints (fast) Example Evaluating the log-density Sampling under soft linear constraints Introduction Specific algorithm with not too many constraints (fast) The algorithm Evaluate the log-density Example

46 Introduction Sparse precision matrix Recall that E(x_i − µ_i | x_{-i}) = −(1/Q_ii) Σ_{j∼i} Q_ij (x_j − µ_j) and Prec(x_i | x_{-i}) = Q_ii. In most cases: the total number of neighbours is O(n), so only O(n) of the n² terms in Q will be non-zero. Use this to construct exact simulation algorithms for GMRFs, using numerical algorithms for sparse matrices.

47–49 Introduction Example of a typical precision matrix (figures). Each node has, on average, 7 neighbours.

50 Simulation algorithms for GMRFs Task Simulation algorithms for GMRFs Can we take advantage of the sparse structure of Q? It is faster to factorise a sparse Q compared to a dense Q. The speedup depends on the pattern in Q, not only the number of non-zero terms. Our task Formulate all algorithms to use only sparse matrices. Unconditional simulation Conditional simulation Condition on a subset of variables Condition on linear constraints Condition on linear constraints with normal noise Evaluation of the log-density in all cases.

51 Simulation algorithms for GMRFs Summary of the simulation algorithms The result In most cases, the cost is O(n) for temporal GMRFs, O(n^{3/2}) for spatial GMRFs, and O(n²) for spatio-temporal GMRFs, including evaluation of the log-density. Conditioning on k linear constraints adds O(k³). These are general algorithms depending only on the graph G, not on the numerical values in Q. The core is numerical algorithms for sparse matrices.

52 Basic numerical linear algebra Cholesky factorisation and the Cholesky triangle Cholesky factorisation If A > 0 is an n × n positive definite matrix, then there exists a unique Cholesky triangle L, such that L is a lower triangular matrix and A = L L^T. Computing L costs n³/3 flops. This factorisation is the basis for solving systems like Ax = b, or AX = B for k right-hand sides, or equivalently, computing x = A^{−1} b or X = A^{−1} B.

53 Basic numerical linear algebra Solving linear equations Algorithm 1 Solving Ax = b where A > 0 1: Compute the Cholesky factorisation, A = L L^T 2: Solve L v = b 3: Solve L^T x = v 4: Return x Step 2 is called forward substitution and costs O(n²) flops. The solution v is computed in a forward loop: v_i = (1/L_ii)(b_i − Σ_{j=1}^{i−1} L_ij v_j), i = 1, ..., n (6)

54 Basic numerical linear algebra Solving linear equations Algorithm 1 Solving Ax = b where A > 0 1: Compute the Cholesky factorisation, A = L L^T 2: Solve L v = b 3: Solve L^T x = v 4: Return x Step 3 is called back substitution and costs O(n²) flops. The solution x is computed in a backward loop: x_i = (1/L_ii)(v_i − Σ_{j=i+1}^{n} L_ji x_j), i = n, ..., 1 (7)

55 Basic numerical linear algebra Avoid computing the inverse To compute A^{−1} B, where B is an n × k matrix, we compute the solution X_j of A X_j = B_j for each of the k columns of B. Algorithm 2 Solving AX = B where A > 0 1: Compute the Cholesky factorisation, A = L L^T 2: for j = 1 to k do 3: Solve L v = B_j 4: Solve L^T X_j = v 5: end for 6: Return X
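A minimal sketch of Algorithms 1-2 using standard dense routines (NumPy/SciPy assumed); a production implementation would of course use sparse or banded factorisations as discussed later.

```python
# Sketch: solve A x = b (or A X = B column by column) via A = L L^T.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def chol_solve(A, B):
    L = cholesky(A, lower=True)                      # A = L L^T
    V = solve_triangular(L, B, lower=True)           # L V = B   (forward substitution)
    return solve_triangular(L.T, V, lower=False)     # L^T X = V (back substitution)

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
print(chol_solve(A, b), np.linalg.solve(A, b))       # the two solutions agree
```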

56 Unconditional sampling Simulation algorithm Sample x ~ N(µ, Q^{−1}) If Q = L L^T and z ~ N(0, I), then x defined by L^T x = z has covariance Cov(x) = Cov(L^{−T} z) = (L L^T)^{−1} = Q^{−1}. Algorithm 3 Sampling x ~ N(µ, Q^{−1}) 1: Compute the Cholesky factorisation, Q = L L^T 2: Sample z ~ N(0, I) 3: Solve L^T v = z 4: Compute x = µ + v 5: Return x
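A sketch of Algorithm 3 with dense linear algebra (NumPy/SciPy assumed): sample z ~ N(0, I), solve L^T v = z, add the mean, and check the empirical covariance against Q^{-1}.

```python
# Sketch: sample x ~ N(mu, Q^{-1}) via the Cholesky triangle of Q.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_gmrf(mu, Q, rng):
    L = cholesky(Q, lower=True)                    # Q = L L^T
    z = rng.standard_normal(len(mu))               # z ~ N(0, I)
    v = solve_triangular(L.T, z, lower=False)      # L^T v = z, so Cov(v) = Q^{-1}
    return mu + v

rng = np.random.default_rng(1)
Q = np.array([[2., -1., 0.], [-1., 2., -1.], [0., -1., 2.]])
xs = np.array([sample_gmrf(np.zeros(3), Q, rng) for _ in range(20000)])
print(np.round(np.cov(xs.T), 2))                   # close to Q^{-1}
print(np.round(np.linalg.inv(Q), 2))
```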

57 Unconditional sampling Evaluating the log-density The log-density is log π(x) = −(n/2) log 2π + Σ_{i=1}^{n} log L_ii − ½ (x − µ)^T Q (x − µ), where the last quadratic form defines q := (x − µ)^T Q (x − µ). If x has just been sampled, then q = z^T z; otherwise, compute this term as u = x − µ, v = Q u, q = u^T v.

58 Conditional sampling Canonical parameterisation The canonical parameterisation A GMRF x wrt G having a canonical parameterisation (b, Q) has density π(x) ∝ exp(−½ x^T Q x + b^T x), (8) i.e., precision matrix Q and mean µ = Q^{−1} b. Write this as x ~ N_C(b, Q). (9) The relation to the Gaussian distribution is N(µ, Q^{−1}) = N_C(Qµ, Q).

59 Conditional sampling Conditional distribution in the canonical parameterisation Conditional simulation of a GMRF Decompose x as

(x_A, x_B) ~ N( (µ_A, µ_B), [ Q_AA  Q_AB ; Q_BA  Q_BB ]^{−1} ).

Then x_A | x_B has canonical parameterisation

x_A − µ_A | x_B ~ N_C( −Q_AB (x_B − µ_B), Q_AA ).   (10)

Simulate using an algorithm for the canonical parameterisation.

60 Conditional sampling Sampling from a canonically parameterised GMRF Sample x ~ N_C(b, Q) Recall that N_C(b, Q) = N(Q^{−1} b, Q^{−1}), so we need to compute the mean as well. Algorithm 4 Sampling x ~ N_C(b, Q) 1: Compute the Cholesky factorisation, Q = L L^T 2: Solve L w = b 3: Solve L^T µ = w 4: Sample z ~ N(0, I) 5: Solve L^T v = z 6: Compute x = µ + v 7: Return x
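A corresponding sketch of Algorithm 4 (NumPy/SciPy assumed): two extra triangular solves give the mean µ = Q^{-1} b, after which sampling proceeds as in Algorithm 3.

```python
# Sketch: sample x ~ N_C(b, Q) = N(Q^{-1} b, Q^{-1}).
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_canonical(b, Q, rng):
    L = cholesky(Q, lower=True)
    w = solve_triangular(L, b, lower=True)         # L w = b
    mu = solve_triangular(L.T, w, lower=False)     # L^T mu = w, so mu = Q^{-1} b
    z = rng.standard_normal(len(b))
    v = solve_triangular(L.T, z, lower=False)      # v ~ N(0, Q^{-1})
    return mu + v

rng = np.random.default_rng(5)
Q = np.array([[2., -1.], [-1., 2.]]); b = np.array([1., 0.])
print(np.mean([sample_canonical(b, Q, rng) for _ in range(20000)], axis=0))
print(np.linalg.solve(Q, b))                       # empirical mean is close to Q^{-1} b
```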

61 Sampling under hard linear constraints Introduction Sampling from x | Ax = e A is a k × n matrix, 0 < k < n, with rank k; e is a vector of length k. This case occurs quite frequently: a sum-to-zero constraint corresponds to k = 1, A = 1^T and e = 0. We will denote this problem sampling under a hard constraint. Note that π(x | Ax) is singular with rank n − k.

62 Sampling under hard linear constraints Introduction The linear constraint makes the conditional distribution Gaussian, but it is singular, as the rank of the constrained covariance matrix is n − k; more care must be exercised when sampling from this distribution. We have that E(x | Ax = e) = µ − Q^{−1} A^T (A Q^{−1} A^T)^{−1} (Aµ − e) (11) and Cov(x | Ax = e) = Q^{−1} − Q^{−1} A^T (A Q^{−1} A^T)^{−1} A Q^{−1}. (12) This is typically a dense-matrix case, which must be solved using general O(n³) algorithms (see the next frame for details).

63 Sampling under hard linear constraints General algorithm (slow) We can sample from this distribution as follows. As the covariance matrix is singular, we compute the eigenvalues and eigenvectors and write the covariance matrix as V Λ V^T, where the columns of V are the eigenvectors and Λ is a diagonal matrix with the eigenvalues on the diagonal. This corresponds to a different factorisation than the Cholesky triangle, but any matrix C which satisfies C C^T = Σ will do. Note that k of the eigenvalues are zero. We can produce a sample by computing v = C z, where C = V Λ^{1/2} and z ~ N(0, I), and then adding the conditional mean. We can compute the log-density as log π(x | Ax = e) = −((n − k)/2) log 2π − ½ Σ_{i: Λ_ii > 0} log Λ_ii − ½ (x − µ*)^T Σ* (x − µ*), (13) where µ* is (11) and Σ* = V Λ* V^T with (Λ*)_ii = Λ_ii^{−1} if Λ_ii > 0 and zero otherwise. In total, this is a quite computationally demanding procedure, as the algorithm is not able to take advantage of the sparse structure of Q.

64 Sampling under hard linear constraints Specific algorithm with not too many constraints (fast) Conditioning via kriging There is an alternative algorithm which corrects for the constraints at nearly no cost if k ≪ n. Let x ~ N(µ, Q^{−1}), then compute x* = x − Q^{−1} A^T (A Q^{−1} A^T)^{−1} (Ax − e). (14) Now x* has the correct conditional distribution! A Q^{−1} A^T is a k × k matrix, hence its factorisation is fast to compute for small k.

65 Sampling under hard linear constraints Specific algorithm with not too many constraints (fast) Algorithm 5 Sampling x | Ax = e when x ~ N(µ, Q^{−1}) 1: Compute the Cholesky factorisation, Q = L L^T 2: Sample z ~ N(0, I) 3: Solve L^T v = z 4: Compute x = µ + v 5: Compute V_{n×k} = Q^{−1} A^T with Algorithm 2 6: Compute W_{k×k} = A V 7: Compute U_{k×n} = W^{−1} V^T using Algorithm 2 8: Compute c = Ax − e 9: Compute x* = x − U^T c 10: Return x* If z = 0 in Algorithm 5, then x* is the conditional mean. The extra cost is only O(k³) for large k!
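The kriging correction itself is only a few lines; the sketch below (NumPy assumed, dense solves for clarity, illustrative names) applies it to a sum-to-zero constraint.

```python
# Sketch: correct an unconditional sample x so that A x = e holds exactly.
import numpy as np

def condition_on_hard_constraint(x, Q, A, e):
    V = np.linalg.solve(Q, A.T)                    # V = Q^{-1} A^T      (n x k)
    W = A @ V                                      # W = A Q^{-1} A^T    (k x k)
    c = A @ x - e
    return x - V @ np.linalg.solve(W, c)           # x* = x - Q^{-1} A^T W^{-1} (A x - e)

rng = np.random.default_rng(2)
Q = np.diag([1., 2., 4.])                          # independent components, as in the example below
x = rng.standard_normal(3)
A, e = np.ones((1, 3)), np.zeros(1)                # sum-to-zero constraint
xstar = condition_on_hard_constraint(x, Q, A, e)
print(xstar.sum())                                 # ~0 up to rounding error
```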

66 Sampling under hard linear constraints Example Example Let x_1, ..., x_n be independent Gaussian variables with variances σ_i² and means µ_i. To sample x conditioned on Σ_i x_i = 0, we first sample x_i ~ N(µ_i, σ_i²), i = 1, ..., n, and compute the constrained sample x* as x_i* = x_i − c σ_i² (15), where c = Σ_j x_j / Σ_j σ_j².

67 Sampling under hard linear constraints Evaluating the log-density The log-density can be rapidly computed using the following identity: π(x | Ax) = π(x) π(Ax | x) / π(Ax). (16) Each term on the rhs is easier to compute than the lhs. π(x)-term: this is a GMRF and the log-density is easy to compute using L from Algorithm 5, step 1.

68 Sampling under hard linear constraints Evaluating the log-density The log-density can be rapidly computed using the identity π(x | Ax) = π(x) π(Ax | x) / π(Ax). (16) π(Ax | x)-term: this is a degenerate density, which is either zero or a constant, log π(Ax | x) = −½ log |A A^T|. (17) The determinant of a k × k matrix can be found from its Cholesky factorisation.

69 Sampling under hard linear constraints Evaluating the log-density The log-density can be rapidly computed using the identity π(x | Ax) = π(x) π(Ax | x) / π(Ax). (16) π(Ax)-term: Ax is Gaussian with mean Aµ and covariance matrix A Q^{−1} A^T, with the Cholesky triangle of this k × k matrix available from Algorithm 5, step 7.

70 Sampling under soft linear constraints Introduction Sampling from x | Ax = ε Let x be a GMRF which is observed through e, where e | x ~ N(Ax, Σ_ε): e is a vector of length k < n, A is a k × n matrix of rank k, and Σ_ε > 0 is the covariance matrix of the noise. The conditional for x is log π(x | e) = −½ (x − µ)^T Q (x − µ) − ½ (e − Ax)^T Σ_ε^{−1} (e − Ax) + const, (18) so x | e ~ N_C(Qµ + A^T Σ_ε^{−1} e, Q + A^T Σ_ε^{−1} A). (19) This precision matrix is most often a full matrix and the nice sparse structure of Q is lost.

71 Sampling under soft linear constraints Introduction Example If we observe the sum of the x_i's with unit-variance noise, the posterior precision is Q + 1 1^T, which is a dense matrix. In general, we would have to sample from (18) using a general algorithm, which is computationally expensive for large n.

72 Sampling under soft linear constraints Specific algorithm with not too many constraints (fast) Alternative approach This is a similar problem to sampling under a hard constraint Ax = e, but now e is stochastic: Ax = ε, where ε ~ N(e, Σ_ε). (20) We denote this case sampling under a soft constraint. If we extend (14) to x* = x − Q^{−1} A^T (A Q^{−1} A^T + Σ_ε)^{−1} (Ax − ε), (21) where ε and x are samples from their respective distributions, ...then x* has the correct conditional distribution.

73 Sampling under soft linear constraints The algorithm Algorithm 6 Sampling x | Ax = ε when ε ~ N(e, Σ_ε) and x ~ N(µ, Q^{−1}) 1: Compute the Cholesky factorisation, Q = L L^T 2: Sample z ~ N(0, I) 3: Solve L^T v = z 4: Compute x = µ + v 5: Compute V_{n×k} = Q^{−1} A^T using Algorithm 2 and L 6: Compute W_{k×k} = A V + Σ_ε 7: Compute U_{k×n} = W^{−1} V^T using Algorithm 2 8: Sample ε ~ N(e, Σ_ε) 9: Compute c = Ax − ε 10: Compute x* = x − U^T c 11: Return x* When z = 0 and ε = e, then x* is the conditional mean.

74 Sampling under soft linear constraints Evaluate the log-density The log-density is computed via π(x | e) = π(x) π(e | x) / π(e). (22) All Cholesky triangles required are available from the simulation algorithm. π(x)-term: this is a GMRF. π(e | x)-term: e | x is Gaussian with mean Ax and covariance Σ_ε. π(e)-term: e is Gaussian with mean Aµ and covariance matrix A Q^{−1} A^T + Σ_ε.

75 Sampling under soft linear constraints Example Example Let x_1, ..., x_n be independent Gaussian variables with variances σ_i² and means µ_i. We now observe e ~ N(Σ_i x_i, σ_ε²). To sample from π(x | e) we first sample x_i ~ N(µ_i, σ_i²), i = 1, ..., n, and then ε ~ N(e, σ_ε²). A conditional sample x* is then x_i* = x_i − c σ_i², where c = (Σ_j x_j − ε) / (Σ_j σ_j² + σ_ε²).

76 Part III Numerical methods for sparse matrices

77 Outline I Introduction Cholesky factorisation The Cholesky triangle Interpretation The zero-pattern in L Example: A simple graph Example: auto-regressive processes Band matrices Bandwidth is preserved Cholesky factorisation for band-matrices Reordering schemes Introduction Reordering to band-matrices Reordering using the idea of nested dissection What to do with sparse matrices?

78 Outline II A numerical case-study Introduction GMRF-models in time GMRF-models in space GMRF-models in time space

79 Introduction Numerical methods for sparse matrices We have shown that computations on GMRFs can be expressed such that the main tasks are 1. compute the Cholesky factorisation of Q = L L^T, and 2. solve L v = b and L^T x = z. The second task is much faster than the first, but sparsity will be of advantage here as well.

80 Introduction The goal is to explain why a sparse Q allows for fast factorisation, how we can take advantage of it, why we gain if we permute the vertices before factorising the matrix, and how statisticians can benefit from recent research in this area by numerical mathematicians. At the end we present a small case study factorising some typical matrices for GMRFs, using classical and more recent methods for factorising matrices.

81 Cholesky factorisation How to compute the Cholesky factorisation Q_ij = Σ_{k=1}^{j} L_ik L_jk, i ≥ j. Define v_i = Q_ij − Σ_{k=1}^{j−1} L_ik L_jk, i ≥ j. Then L_jj² = v_j, and L_ij L_jj = v_i for i > j. If we know {v_i} for fixed j, then L_jj = √v_j and L_ij = v_i / √v_j, for i = j + 1, ..., n. This gives the jth column of L.

82 Cholesky factorisation Cholesky factorization of Q > 0 Algorithm 7 Computing the Cholesky triangle L of Q 1: for j = 1 to n do 2: v_{j:n} = Q_{j:n, j} 3: for k = 1 to j − 1 do v_{j:n} = v_{j:n} − L_{j:n, k} L_{jk} 4: L_{j:n, j} = v_{j:n} / √v_j 5: end for 6: Return L The overall process involves n³/3 flops.
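A direct transcription of Algorithm 7 into code (NumPy assumed), kept dense and unoptimised so the correspondence with the pseudo-code stays obvious.

```python
# Sketch: column-wise Cholesky factorisation, Q = L L^T.
import numpy as np

def cholesky_columns(Q):
    n = Q.shape[0]
    L = np.zeros_like(Q, dtype=float)
    for j in range(n):
        v = Q[j:, j].copy()                  # v_{j:n} = Q_{j:n, j}
        for k in range(j):                   # subtract contributions of earlier columns
            v -= L[j:, k] * L[j, k]
        L[j:, j] = v / np.sqrt(v[0])         # L_{j:n, j} = v_{j:n} / sqrt(v_j)
    return L

Q = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
L = cholesky_columns(Q)
print(np.allclose(L @ L.T, Q))               # True
```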

83 The Cholesky triangle Interpretation Interpretation of L (I) Let Q = L L^T; then the solution of L^T x = z, where z ~ N(0, I), is N(0, Q^{−1}) distributed. Since L is lower triangular, x_n = z_n / L_nn, x_{n−1} = (z_{n−1} − L_{n,n−1} x_n) / L_{n−1,n−1}, ...

84 The Cholesky triangle Interpretation Interpretation of L (II) Theorem Let x be a GMRF wrt the labelled graph G, with mean µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q. Then for i ∈ V, E(x_i | x_{(i+1):n}) = µ_i − (1/L_ii) Σ_{j=i+1}^{n} L_ji (x_j − µ_j) and Prec(x_i | x_{(i+1):n}) = L_ii².

85 The Cholesky triangle The zero-pattern in L Determine the zero-pattern in L (I) Theorem Let x be a GMRF wrt G, with mean µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q and define for 1 ≤ i < j ≤ n the set F(i, j) = {i + 1, ..., j − 1, j + 1, ..., n}, which is the future of i except j. Then x_i ⊥ x_j | x_{F(i,j)} ⟺ L_ji = 0. If we can verify that L_ji is zero, we do not have to compute it when factorising Q.

86 The Cholesky triangle The zero-pattern in L Proof Assume µ = 0 and fix 1 ≤ i < j ≤ n. Theorem 11 gives that π(x_{i:n}) ∝ exp(−½ Σ_{k=i}^{n} (L_kk x_k + Σ_{j'=k+1}^{n} L_{j'k} x_{j'})²) = exp(−½ x_{i:n}^T Q^{(i:n)} x_{i:n}), where Q^{(i:n)}_{ij} = L_ii L_ji. Then x_i ⊥ x_j | x_{F(i,j)} ⟺ L_ii L_ji = 0, which is equivalent to L_ji = 0 since L_ii > 0 as Q^{(i:n)} > 0.

87 The Cholesky triangle The zero-pattern in L Determine the zero-pattern in L (II) The global Markov property provides a simple and sufficient criterion for checking whether L_ji = 0. Corollary If F(i, j) separates i < j in G, then L_ji = 0. Corollary If i ∼ j then F(i, j) does not separate i and j. The idea is simple: use the global Markov property to check if L_ji = 0, and compute only the non-zero terms in L, so that Q = L L^T.

88–89 The Cholesky triangle Example: A simple graph Example (the precision matrix Q and its Cholesky triangle L are shown as figures)

90 The Cholesky triangle Example: auto-regressive processes Example: AR(1)-process x_t | x_{1:(t−1)} ~ N(φ x_{t−1}, σ²), t = 1, ..., n (the tridiagonal Q and its Cholesky triangle L are shown as figures)

91 Band matrices Bandwidth is preserved Bandwidth is preserved Similarly, for an AR(p)-process, Q has bandwidth p and L has lower bandwidth p. Theorem Let Q > 0 be a band matrix with bandwidth p and dimension n; then the Cholesky triangle of Q has (lower) bandwidth p. ⇒ ...easy to modify existing Cholesky-factorisation code to use only entries where |i − j| ≤ p.

92 Band matrices Cholesky factorisation for band-matrices Avoid computing L_ij and reading Q_ij for |i − j| > p. Algorithm 8 Band-Cholesky factorization of Q with bandwidth p 1: for j = 1 to n do 2: λ = min{j + p, n} 3: v_{j:λ} = Q_{j:λ, j} 4: for k = max{1, j − p} to j − 1 do 5: i = min{k + p, n} 6: v_{j:i} = v_{j:i} − L_{j:i, k} L_{jk} 7: end for 8: L_{j:λ, j} = v_{j:λ} / √v_j 9: end for 10: Return L The cost is now n(p² + 3p) flops, assuming n ≫ p.

93 Reordering schemes Introduction Reorder the vertices We can permute the vertices: select one of the n! possible permutations and define the corresponding permutation matrix P, such that i^P = P i, where i = (1, ..., n)^T, is the new ordering of the vertices. Choose P, if possible, such that Q^P = P Q P^T (23) is a band matrix with a small bandwidth.

94 Reordering schemes Introduction It is impossible in general to obtain the optimal permutation; n! is too large! A sub-optimal ordering will do as well. Solve Qµ = b as follows: compute b^P = P b, solve Q^P µ^P = b^P, and map the solution back, µ = P^T µ^P.
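A sketch of this permute-solve-map-back recipe using SciPy's reverse Cuthill-McKee ordering (a different bandwidth-reduction heuristic than the Gibbs-Poole-Stockmeyer algorithm used later in the case study, but the same idea); the random sparse Q here is purely illustrative.

```python
# Sketch: reorder a sparse SPD matrix to reduce bandwidth, solve, and map the solution back.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import spsolve

n = 200
rng = np.random.default_rng(3)
G = sp.random(n, n, density=0.02, random_state=42).tocsr()
W = G + G.T
W.data[:] = 1.0                                   # symmetric 0/1 "graph" matrix
Q = (W + sp.diags(np.asarray(W.sum(axis=1)).ravel() + 1.0)).tocsr()  # diagonally dominant, so SPD

perm = reverse_cuthill_mckee(Q, symmetric_mode=True)
QP = Q[perm, :][:, perm]                          # Q^P = P Q P^T

def bandwidth(M):
    i, j = M.nonzero()
    return int(np.abs(i - j).max())

print(bandwidth(Q), bandwidth(QP))                # the bandwidth typically drops a lot

b = rng.standard_normal(n)
muP = spsolve(QP.tocsc(), b[perm])                # solve Q^P mu^P = b^P, with b^P = P b
mu = np.empty(n); mu[perm] = muP                  # map back: mu = P^T mu^P
print(np.allclose(Q @ mu, b))                     # True
```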

95–96 Reordering schemes Reordering to band-matrices Reordering to band-matrices (figures)

97 Reordering schemes Reordering to band-matrices Reordering to band-matrices Factorisation of Q^P required about … seconds; solving Qµ = b required about … seconds, on a 1200 MHz laptop.

98 Reordering schemes Reordering using the idea of nested dissection More optimal reordering schemes

99 Reordering schemes Reordering using the idea of nested dissection Nested dissection reordering (I) The idea generalises as follows. Select a (small) set of nodes whose removal divides the graph into two disconnected subgraphs of almost equal size. Order the chosen nodes after ordering all the nodes in both subgraphs. Apply this procedure recursively to the nodes in each subgraph. Costs in the spatial case: factorisation O(n^{3/2}), fill-in O(n log n). Optimal in the order sense.

100 Reordering schemes Reordering using the idea of nested dissection Nested dissection reordering (II)

101 What to do with sparse matrices? Statisticians and numerical methods for sparse matrices Gupta (2002) summarises his findings on recent advances for sparse linear solvers:... recent sparse solvers have significantly improved the state of the art of the direct solution of general sparse systems.... recent years have seen some remarkable advances in the general sparse direct-solver algorithms and software. Good news We just use their software

102 A numerical case-study Introduction A numerical case-study of typical GMRFs Divide GMRF models into three categories. 1. GMRF models in time or on a line: auto-regressive models and models for smooth functions. Neighbours to x_t are then those {x_s} such that |s − t| ≤ p. 2. Spatial GMRF models: regular lattice, or irregular lattice induced by a tessellation or regions of land. Neighbours to x_i (spatial index i) are those j spatially close to i, where close is defined from its context. 3. Spatio-temporal GMRF models. Often an extension of spatial models to also include dynamic changes.

103 A numerical case-study Introduction Also include some global nodes: let x be a GMRF with a common mean µ and assume µ ~ N(0, σ²), x | µ ~ N(µ1, Q^{−1}); (24) ...then (x, µ) is also a GMRF, where the node µ is a neighbour of all the x_i's.

104 A numerical case-study Introduction Two different algorithms 1. The band-cholesky factorisation (BCF) as in Algorithm 8. Here we use the LAPACK-routines DPBTRF and DTBSV, for the factorisation and the forward/back-substitution, respectively, and the Gibbs-Poole-Stockmeyer algorithm for bandwidth reduction. 2. The Multifrontal Supernodal Cholesky factorisation (MSCF) implementation in the library TAUCS using the nested dissection reordering from the library METIS. Both solvers are available in the GMRFLib-library

105 A numerical case-study Introduction The tasks we want to investigate are 1. factorising Q into L L^T, and 2. solving L L^T µ = b. Producing a random sample from the GMRF is half the cost of solving the linear system in step 2. All tests reported here were conducted on a 1200 MHz laptop with 512 MB memory running Linux. New machines are nearly 3–10 times as fast...

106 A numerical case-study GMRF-models in time GMRF models in time Let Q be a band matrix with bandwidth p and dimension n. For such a problem, using BCF will be (theoretically) optimal, as the fill-in will be zero. (Table: CPU time to factorise and solve, for n = 10³, 10⁴, 10⁵ and p = 5, 25.) Long and thin are fast! The MSCF is less optimal for band matrices: fill-in and more complicated data structures.

107 A numerical case-study GMRF-models in time Add 10 global nodes: (Table: CPU time to factorise and solve, for n = 10³, 10⁴, 10⁵ and p = 5, 25.) The fill-in is about pn. The nested dissection ordering gives good results in all cases considered so far.

108 A numerical case-study GMRF-models in space Spatial GMRF models Regular lattice

109 A numerical case-study GMRF-models in space Spatial GMRF models Irregular lattice

110 A numerical case-study GMRF-models in space (Table: CPU time to factorise and solve with the BCF and MSCF methods for three lattice sizes.) For the largest lattice the MSCF really outperforms the BCF. The reason is the O(n^{3/2}) cost for the MSCF compared to O(n²) for the BCF.

111 A numerical case-study GMRF-models in time space Spatio-temporal GMRF models Spatio-temporal GMRF models are often an extension of spatial GMRF models to account for time variation. Use this model and the graph of Germany, for T = 10 and T = 100.

112 A numerical case-study GMRF-models in time space 10 global nodes (Table: CPU time to factorise and solve, for T = 10 and T = 100.) The results show a quite heavy dependency on T. Spatio-temporal GMRFs are more demanding than spatial ones: O(n²) for an n^{1/3} × n^{1/3} × n^{1/3} cube.

113 Part IV Intrinsic GMRFs

114 Outline I Introduction Definition of IGMRFs Improper GMRFs IGMRFs of first order Forward differences On the line with regular locations On the line with irregular locations Why IGMRFs are useful in applications IGMRFs of first order on irregular lattices IGMRFs of first order on regular lattices IGMRFs of second order RW2 model for regular locations The CRWk model for regular and irregular locations IGMRFs of general order on regular lattices IGMRFs of second order on regular lattices

115 Introduction Intrinsic GMRFs (IGMRF) IGMRFs are improper, i.e., they have precision matrices not of full rank. Often used as prior distributions in various applications. Of particular importance are IGMRFs that are invariant to any trend that is a polynomial of the locations of the nodes up to a specific order.

116 Definition of IGMRFs Improper GMRFs Improper GMRF Definition Let Q be an n × n SPSD matrix with rank n − k > 0. Then x = (x_1, ..., x_n)^T is an improper GMRF of rank n − k with parameters (µ, Q) if its density is π(x) = (2π)^{−(n−k)/2} (|Q|*)^{1/2} exp(−½ (x − µ)^T Q (x − µ)). (25) Further, x is an improper GMRF wrt the labelled graph G = (V, E), where Q_ij ≠ 0 ⟺ {i, j} ∈ E for all i ≠ j. Here |·|* denotes the generalised determinant: the product of the non-zero eigenvalues.

117 Definition of IGMRFs Improper GMRFs Markov properties The Markov properties are interpreted as those obtained from the limit of a proper density. Let the columns of A^T span the null space of Q and define Q(γ) = Q + γ A^T A. (26) The conditional distributions under Q(γ) tend to the corresponding ones under Q as γ → 0; in particular, E(x_i | x_{-i}) = µ_i − (1/Q_ii) Σ_{j∼i} Q_ij (x_j − µ_j) is interpreted as the γ → 0 limit.

118 IGMRFs of first order IGMRF of first order An intrinsic GMRF of first order is an improper GMRF of rank n − 1 where Q 1 = 0. The condition Q 1 = 0 means that Σ_j Q_ij = 0 for i = 1, ..., n.

119 IGMRFs of first order Let µ = 0, so E(x_i | x_{-i}) = −(1/Q_ii) Σ_{j: j∼i} Q_ij x_j (27), and −Σ_{j: j∼i} Q_ij / Q_ii = 1. The conditional mean of x_i is a weighted mean of its neighbours, with weights summing to one. There is no shrinking towards an overall level. Many IGMRFs are constructed such that the deviation from the overall level is a smooth curve in time or a smooth surface in space.

120 IGMRFs of first order Forward differences Definition (Forward difference) Define the first-order forward difference of a function f(·) as Δf(z) = f(z + 1) − f(z). Higher-order forward differences are defined recursively: Δ^k f(z) = Δ(Δ^{k−1} f(z)), so Δ² f(z) = f(z + 2) − 2 f(z + 1) + f(z) (28), and in general for k = 1, 2, ..., Δ^k f(z) = (−1)^k Σ_{j=0}^{k} (−1)^j (k choose j) f(z + j).

121 IGMRFs of first order Forward differences For a vector z = (z_1, z_2, ..., z_n)^T, Δz has elements Δz_i = z_{i+1} − z_i, i = 1, ..., n − 1. The forward difference of kth order is an approximation to the kth derivative of f(z): f'(z) = lim_{h→0} (f(z + h) − f(z)) / h.

122 IGMRFs of first order On the line with regular locations IGMRFs of first order on the line The location of node i is i (think "time"). Assume independent increments Δx_i ~ iid N(0, κ^{−1}), i = 1, ..., n − 1, (29) so x_j − x_i ~ N(0, (j − i) κ^{−1}) for i < j. (30) If the intersection between {i, ..., j} and {k, ..., l} is empty for i < j and k < l, then Cov(x_j − x_i, x_l − x_k) = 0. (31) These properties coincide with those of a Wiener process.

123 IGMRFs of first order On the line with regular locations The density for x is π(x | κ) ∝ κ^{(n−1)/2} exp(−(κ/2) Σ_{i=1}^{n−1} (Δx_i)²) (32) = κ^{(n−1)/2} exp(−(κ/2) Σ_{i=1}^{n−1} (x_{i+1} − x_i)²), (33) or π(x) ∝ κ^{(n−1)/2} exp(−(κ/2) x^T R x). (34)

124 IGMRFs of first order On the line with regular locations The structure matrix is

R =
[  1  −1
  −1   2  −1
       −1   2  −1
            ⋱    ⋱    ⋱
                 −1   2  −1
                      −1   1 ]   (35)

The n − 1 independent increments ensure that the rank of Q is n − 1. Denote this model by RW1(κ), or RW1 for short.
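The structure matrix above is just R = D^T D for the (n−1) × n first-difference matrix D; the following sketch (NumPy assumed) uses this to confirm the two claims: R 1 = 0 and rank(R) = n − 1.

```python
# Sketch: RW1 precision Q = kappa * D^T D built from first differences.
import numpy as np

def rw1_precision(n, kappa=1.0):
    D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # (D x)_i = x_{i+1} - x_i
    return kappa * D.T @ D

Q = rw1_precision(6)
print(Q.astype(int))                               # the structure matrix R (kappa = 1)
print(np.allclose(Q @ np.ones(6), 0))              # invariant to adding a constant
print(np.linalg.matrix_rank(Q))                    # n - 1 = 5
```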

125 IGMRFs of first order On the line with regular locations Full conditionals x_i | x_{-i}, κ ~ N(½ (x_{i−1} + x_{i+1}), 1/(2κ)), 1 < i < n. (36) Alternative interpretation: fit a first-order polynomial p(j) = β_0 + β_1 j (37) locally through the points (i − 1, x_{i−1}) and (i + 1, x_{i+1}). The conditional mean is p(i).

126 IGMRFs of first order On the line with regular locations Example Samples with n = 99 and κ = 1, obtained by conditioning on the constraint Σ_i x_i = 0.

127 IGMRFs of first order On the line with regular locations Example Marginal variances Var(x i ) for i = 1,..., n.

128 IGMRFs of first order On the line with regular locations Example Corr(x n/2, x i ) for i = 1,..., n

129 IGMRFs of first order On the line with irregular locations The first-order RW for irregular locations The location of x_i is s_i, but s_{i+1} − s_i is not constant. Assume s_1 < s_2 < ... < s_n and let δ_i := s_{i+1} − s_i. (38)

130 IGMRFs of first order On the line with irregular locations Wiener process Consider x_i as the realisation of a Brownian motion in continuous time, i.e. a Wiener process W(t), at time s_i. Definition (Wiener process) A Wiener process with precision κ is a continuous-time stochastic process W(t), t ≥ 0, with W(0) = 0 and such that the increments W(t) − W(s) are Gaussian with mean 0 and variance (t − s)/κ for any 0 ≤ s < t. Furthermore, increments over non-overlapping time intervals are independent. For κ = 1, this process is called a standard Wiener process.

131 IGMRFs of first order On the line with irregular locations Brownian bridge The full conditional is in this case known as the Brownian bridge: E(x_i | x_{-i}, κ) = (δ_i x_{i−1} + δ_{i−1} x_{i+1}) / (δ_{i−1} + δ_i) (39) and Prec(x_i | x_{-i}, κ) = κ (1/δ_{i−1} + 1/δ_i). (40) Here κ is a precision parameter.

132 IGMRFs of first order On the line with irregular locations The precision matrix is now, for 1 < i < n,

Q_ij = κ × { 1/δ_{i−1} + 1/δ_i   if j = i,
             −1/δ_i              if j = i + 1,
             0                   otherwise (apart from symmetry).   (41)

A proper correction at the boundary gives the remaining diagonal terms Q_11 = κ/δ_1 and Q_nn = κ/δ_{n−1}.

133 IGMRFs of first order On the line with irregular locations The joint density is π(x | κ) ∝ κ^{(n−1)/2} exp(−(κ/2) Σ_{i=1}^{n−1} (x_{i+1} − x_i)² / δ_i), (42) and is invariant to the addition of a constant.

134 IGMRFs of first order On the line with irregular locations The interpretation of the RW1 model as a discretely observed Wiener process justifies the corrections needed. The underlying model is the same; it is only observed differently.

135 IGMRFs of first order Why IGMRFs are useful in applications When the mean is only locally constant Alternative IGMRF: π(x) ∝ κ^{(n−1)/2} exp(−(κ/2) Σ_{i=1}^{n} (x_i − x̄)²), (43) where x̄ is the empirical mean of x.

136 IGMRFs of first order Why IGMRFs are useful in applications Assume n is even and x_i = 0 for 1 ≤ i ≤ n/2, x_i = 1 for n/2 < i ≤ n. (44) Thus x is locally constant with two levels. Evaluating the density at this configuration under the RW1 model and the alternative (43), we obtain κ^{(n−1)/2} exp(−κ/2) and κ^{(n−1)/2} exp(−nκ/8), (45) respectively. The log-ratio of the densities is then of order O(n).

137 IGMRFs of first order Why IGMRFs are useful in applications The RW1 model only penalises the local deviation from a constant level. The alternative penalises the global deviation from a constant level. This local behaviour is advantageous in applications if the mean level of x is approximately or locally constant. A similar argument also applies for polynomial IGMRFs of higher order, constructed using forward differences of order k as independent Gaussian increments.

138 IGMRFs of first order IGMRFs of first order on irregular lattices First order IGMRFs on irregular lattices Consider the map of the 544 regions in Germany. Two regions are neighbours if they share a common border.

139 IGMRFs of first order IGMRFs of first order on irregular lattices Between neighbouring regions i and j, say, we define an independent Gaussian increment x_i − x_j ~ N(0, κ^{−1}). (46) Assuming independent increments yields π(x) ∝ κ^{(n−1)/2} exp(−(κ/2) Σ_{i∼j} (x_i − x_j)²). (47) Here i ∼ j denotes the set of all unordered pairs of neighbours. The number of increments i ∼ j is larger than n, but the rank of the corresponding precision matrix is still n − 1.

140 IGMRFs of first order IGMRFs of first order on irregular lattices There are hidden constraints in the increments due to the more complicated geometry on a lattice than on the line. Example Let n = 3, where all nodes are neighbours. Then x_1 − x_2 = ε_1, x_2 − x_3 = ε_2, and x_3 − x_1 = ε_3, where ε_1, ε_2 and ε_3 are the increments. This implies that ε_1 + ε_2 + ε_3 = 0, which is the hidden linear constraint.

141 IGMRFs of first order IGMRFs of first order on irregular lattices Let n_i denote the number of neighbours of region i. The precision matrix Q is Q_ij = κ × { n_i if i = j; −1 if i ∼ j; 0 otherwise }, (48) and x_i | x_{-i}, κ ~ N((1/n_i) Σ_{j: j∼i} x_j, 1/(n_i κ)). (49)
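The precision matrix in (48) is easy to assemble from a 0/1 adjacency matrix; the sketch below (NumPy assumed, toy graph) also confirms the rank deficiency of one for a connected graph.

```python
# Sketch: first-order IGMRF (ICAR) precision Q = kappa * (diag(n_i) - W) on an irregular graph.
import numpy as np

def icar_precision(W, kappa=1.0):
    n_i = W.sum(axis=1)                    # number of neighbours of each region
    return kappa * (np.diag(n_i) - W)

W = np.array([[0, 1, 0, 0],                # toy map: 4 regions on a line
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Q = icar_precision(W)
print(Q @ np.ones(4))                      # zeros: invariant to adding a constant
print(np.linalg.matrix_rank(Q))            # n - 1 = 3 for a connected graph
```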

142 IGMRFs of first order IGMRFs of first order on irregular lattices Example

143 IGMRFs of first order IGMRFs of first order on irregular lattices In general there is no longer an underlying continuous stochastic process that we can relate to this density. If we change the spatial resolution or split a region into two new ones, we change the model.

144 IGMRFs of first order IGMRFs of first order on irregular lattices Weighted variants Incorporate symmetric weights w_ij for each pair of adjacent nodes i and j, for example w_ij = 1/d(i, j). Assuming independent increments x_i − x_j ~ N(0, 1/(w_ij κ)), (50) then π(x) ∝ κ^{(n−1)/2} exp(−(κ/2) Σ_{i∼j} w_ij (x_i − x_j)²). (51)

145 IGMRFs of first order IGMRFs of first order on irregular lattices The precision matrix is Q_ij = κ × { Σ_{k: k∼i} w_ik if i = j; −w_ij if i ∼ j; 0 otherwise }, (52) and the full conditional of x_i has mean and precision Σ_{j: j∼i} x_j w_ij / Σ_{j: j∼i} w_ij and κ Σ_{j: j∼i} w_ij, (53) respectively.

146 IGMRFs of first order IGMRFs of first order on regular lattices First order IGMRFs on regular lattices For a lattice I_N with n = n_1 n_2 nodes, let i = (i_1, i_2) denote the node in the i_1-th row and i_2-th column. Use the nearest four sites of i as its neighbours: (i_1 + 1, i_2), (i_1 − 1, i_2), (i_1, i_2 + 1), (i_1, i_2 − 1). The precision matrix is Q_ij = κ × { n_i if i = j; −1 if i ∼ j; 0 otherwise }, (54) and the full conditionals for x_i are x_i | x_{-i}, κ ~ N((1/n_i) Σ_{j: j∼i} x_j, 1/(n_i κ)). (55)

147 IGMRFs of first order IGMRFs of first order on regular lattices Limiting behaviour This model has an important property: the process converges to the de Wijs process, a Gaussian process with variogram log(distance). This is important: there is a limiting process. The de Wijs process has a dense precision matrix, whereas the IGMRF is very sparse!

148 IGMRFs of second order RW2 model for regular locations The RW2 model for regular locations Let s_i = i for i = 1, ..., n, with a constant distance between consecutive nodes. Use the second-order increments Δ² x_i ~ iid N(0, κ^{−1}) (56) for i = 1, ..., n − 2, to define the joint density of x: π(x) ∝ κ^{(n−2)/2} exp(−(κ/2) Σ_{i=1}^{n−2} (x_i − 2 x_{i+1} + x_{i+2})²) (57) = κ^{(n−2)/2} exp(−(κ/2) x^T R x). (58)

149 IGMRFs of second order RW2 model for regular locations The structure matrix is

R =
[  1  −2   1
  −2   5  −4   1
   1  −4   6  −4   1
        ⋱    ⋱    ⋱    ⋱    ⋱
             1  −4   6  −4   1
                  1  −4   5  −2
                       1  −2   1 ]   (59)

Verify directly that Q S_1 = 0, where the columns of S_1 span the constant and the linear trend, and that the rank of Q is n − 2. This is an IGMRF of second order: invariant to adding a line to x. It is known as the second-order random walk model, denoted by RW2(κ) or simply the RW2 model.
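Analogously to the RW1 case, R here is D2^T D2 for the (n−2) × n second-difference matrix D2; the sketch below (NumPy assumed) checks that R annihilates both constants and linear trends and has rank n − 2.

```python
# Sketch: RW2 structure matrix R = D2^T D2 from second differences.
import numpy as np

def rw2_structure(n):
    D2 = np.zeros((n - 2, n))
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]          # x_i - 2 x_{i+1} + x_{i+2}
    return D2.T @ D2

R = rw2_structure(7)
line = np.arange(7, dtype=float)                    # a first-order polynomial in the locations
print(np.allclose(R @ np.ones(7), 0), np.allclose(R @ line, 0))   # True True
print(np.linalg.matrix_rank(R))                     # n - 2 = 5
```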

150 IGMRFs of second order RW2 model for regular locations Remarks This is the RW2 model defined and used in the literature. We cannot extend it consistently to the case where the locations are irregular. Similar problems occur if we increase the resolution from n to 2n locations, say. This is in contrast to the RW1 model. There exists an alternative (somewhat more involved) formulation with the desired continuous time interpretation.

151 IGMRFs of second order RW2 model for regular locations The conditional mean and precision are E(x_i | x_{-i}, κ) = (4/6)(x_{i+1} + x_{i−1}) − (1/6)(x_{i+2} + x_{i−2}) (60) and Prec(x_i | x_{-i}, κ) = 6κ, (61) respectively, in the interior (2 < i < n − 1).

152 IGMRFs of second order RW2 model for regular locations Consider the second-order polynomial p(j) = β_0 + β_1 j + ½ β_2 j² (62) and compute the coefficients by a local least-squares fit to the points (i − 2, x_{i−2}), (i − 1, x_{i−1}), (i + 1, x_{i+1}), (i + 2, x_{i+2}). (63) It turns out that E(x_i | x_{-i}, κ) = p(i).

153 IGMRFs of second order RW2 model for regular locations Samples with n = 99 and κ = 1 by conditioning on the constraints.

154 IGMRFs of second order RW2 model for regular locations Marginal variances Var(x i ) for i = 1,..., n.

155 IGMRFs of second order RW2 model for regular locations Corr(x n/2, x i ) for i = 1,..., n

156 The CRWk model for regular and irregular locations Random walk models: irregular locations Locations s 1 < s 2 <... < s n. Discretely observed (integrated) Wiener process Augment with the velocities Valid for any order Connection to spline-models

157 The CRWk model for regular and irregular locations Random walk models: irregular locations Locations s_1 < s_2 < ... < s_n. Discretely observed (integrated) Wiener process. Augment with the velocities. Valid for any order. Connection to spline models. The precision matrix is block tridiagonal:

[ A_1       B_1
  B_1^T     A_2 + C_1   B_2
            B_2^T       A_3 + C_2   B_3
                          ⋱           ⋱          B_{n−1}
                                      B_{n−1}^T  C_{n−1} ]

158 The CRWk model for regular and irregular locations Random walk models: irregular locations Locations s_1 < s_2 < ... < s_n. Discretely observed (integrated) Wiener process. Augment with the velocities. Valid for any order. Connection to spline models. With δ_i = s_{i+1} − s_i, the 2 × 2 blocks are

A_i = [ 12/δ_i³    6/δ_i²  ;   6/δ_i²   4/δ_i ],
B_i = [ −12/δ_i³   6/δ_i²  ;  −6/δ_i²   2/δ_i ],
C_i = [ 12/δ_i³   −6/δ_i²  ;  −6/δ_i²   4/δ_i ].

159 IGMRFs of general order on regular lattices IGMRFs of higher order on regular lattices The precision matrix is orthogonal to a polynomial design matrix of a certain degree. Definition (IGMRF of order k in dimension d) An IGMRF of order k in dimension d is an improper GMRF of rank n − m_{k−1,d} where Q S_{k−1,d} = 0. Here S_{k−1,d} is the design matrix of the polynomials of degree up to k − 1 in the d-dimensional locations, and m_{k−1,d} is its number of columns.

160 IGMRFs of second order on regular lattices A second-order IGMRF in two dimensions Consider a regular lattice I_N in d = 2 dimensions where s_{i_1} = i_1 and s_{i_2} = i_2. Choose the independent increments x_{(i_1+1, i_2)} + x_{(i_1−1, i_2)} + x_{(i_1, i_2+1)} + x_{(i_1, i_2−1)} − 4 x_{(i_1, i_2)}. (64) The motivation for this choice is that (64) equals (Δ²_{i_1} + Δ²_{i_2}) x_{(i_1−1, i_2−1)}, (65) which is invariant to adding a first-order polynomial p_{1,2}(i_1, i_2) = β_00 + β_10 i_1 + β_01 i_2.

161 IGMRFs of second order on regular lattices The precision matrix (apart from boundary effects) has non-zero elements given by the stencil of (Δ²_{i_1} + Δ²_{i_2})² = Δ⁴_{i_1} + 2 Δ²_{i_1} Δ²_{i_2} + Δ⁴_{i_2}, (66) which is a difference approximation to the biharmonic differential operator (∂²/∂x² + ∂²/∂y²)² = ∂⁴/∂x⁴ + 2 ∂⁴/(∂x² ∂y²) + ∂⁴/∂y⁴. (67) The fundamental solution of the biharmonic equation (∂⁴/∂x⁴ + 2 ∂⁴/(∂x² ∂y²) + ∂⁴/∂y⁴) φ(x, y) = 0 (68) is the thin plate spline.

162 IGMRFs of second order on regular lattices Full conditionals in the interior: E(x_i | x_{-i}) is given by the stencil (69) with weights 1/20 × (8 for each of the four nearest neighbours, −2 for each of the four diagonal neighbours, −1 for each of the four second-order neighbours along the axes), and Prec(x_i | x_{-i}) = 20κ. (70)

163 IGMRFs of second order on regular lattices Alternative IGMRFs in two dimensions Using only the nearest-neighbour terms (71) to obtain a difference approximation to ∂²/∂x² + ∂²/∂y² (72) is not optimal. The discretisation error is different at 45 degrees to the main directions, hence we should expect a directional effect.

164 IGMRFs of second order on regular lattices Use an isotropic approximation (e.g. the Mehrstellen stencil) (73); the resulting full conditionals have E(x_i | x_{-i}) given by the corresponding stencil (74) and Prec(x_i | x_{-i}) = 13κ. (75)

165–166 IGMRFs of second order on regular lattices Samples (figures).

167 Part V Hierarchical GMRF models

168 Outline I Hierarchical GMRFs models A simple example Classes of hierarchical GMRF models A brief introduction to MCMC Blocking Rate of convergence Example One-block algorithm Block algorithms for hierarchical GMRF models General setup The one-block algorithm The sub-block algorithm GMRF approximations Merging GMRFs Introduction

169 Outline II Why is it so?

170 Hierarchical GMRFs models Hierarchical GMRF models Characterised through several stages of observables and parameters. A typical scenario is as follows. Stage 1 Formulate a distributional assumption for the observables, dependent on latent parameters. For a time series of binary observations y, we may assume y_i ~ B(p_i), i = 1, ..., n, and that the observations are conditionally independent.

171 Hierarchical GMRFs models Hierarchical GMRF models Characterised through several stages of observables and parameters. A typical scenario is as follows. Stage 2 Assign a prior model, i.e. a GMRF, for the unknown parameters, here p_i. Choose an autoregressive model for the logit-transformed probabilities x_i = logit(p_i).

172 Hierarchical GMRFs models Hierarchical GMRF models Characterised through several stages of observables and parameters. A typical scenario is as follows. Stage 3 Assign priors to the unknown parameters (or hyperparameters) of the GMRF, e.g. the precision parameter κ controlling the strength of dependency. Further stages if needed.

173 Hierarchical GMRFs models A simple example Tokyo rainfall data Stage 1 Binomial data: y_i ~ Binomial(2, p(x_i)) for most days, and Binomial(1, p(x_i)) otherwise.

174 Hierarchical GMRFs models A simple example Tokyo rainfall data Stage 2 Assume a smooth latent x: x ~ RW2(κ), logit(p_i) = x_i.

175 Hierarchical GMRFs models A simple example Tokyo rainfall data Stage 3 Gamma(α, β)-prior on κ

176 Hierarchical GMRFs models Classes of hierarchical GMRF models Classes of hierarchical GMRF models Normal data: block-MCMC. Non-normal data that allow for a normal-mixture representation (Student-t distribution; logistic and Laplace for binary regression): block-MCMC with auxiliary variables. Non-normal data (Poisson and others...): block-MCMC with GMRF approximations.

177 A brief introduction to MCMC MCMC Construct a Markov chain θ^(1), θ^(2), ..., θ^(k), ... that converges (under some conditions) to π(θ). 1. Start with θ^(0) where π(θ^(0)) > 0. Set k = 1. 2. Generate a proposal θ* from some proposal kernel q(θ* | θ^(k−1)). Set θ^(k) = θ* with probability α = min{1, [π(θ*) q(θ^(k−1) | θ*)] / [π(θ^(k−1)) q(θ* | θ^(k−1))]}; otherwise set θ^(k) = θ^(k−1). 3. Set k = k + 1 and go back to 2.
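For concreteness, here is a minimal sketch of the scheme above for a symmetric random-walk proposal on a one-dimensional target; the target density and all names are illustrative, not part of the course material.

```python
# Sketch: Metropolis-Hastings with a random-walk (Metropolis) proposal.
import numpy as np

def metropolis_hastings(log_pi, theta0, n_iter, step, rng):
    theta, chain = theta0, np.empty(n_iter)
    for k in range(n_iter):
        prop = theta + step * rng.standard_normal()    # symmetric proposal: q-terms cancel
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(theta):
            theta = prop                               # accept with probability alpha
        chain[k] = theta
    return chain

rng = np.random.default_rng(4)
chain = metropolis_hastings(lambda t: -0.5 * t**2, 0.0, 5000, 1.0, rng)   # target N(0, 1)
print(chain.mean(), chain.var())                       # roughly 0 and 1
```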

178 A brief introduction to MCMC Depending on the specific choice of the proposal kernel q(θ* | θ), very different algorithms result. When q(θ* | θ) does not depend on the current value of θ, the proposal is called an independence proposal. When q(θ* | θ) = q(θ | θ*) we have a Metropolis proposal; these include so-called random-walk proposals. The rate of convergence towards π(θ) and the degree of dependence between successive samples of the Markov chain (mixing) will depend on the chosen proposal.

179 A brief introduction to MCMC Single-site algorithms Most MCMC algorithms have been based on updating each scalar component θ_i, i = 1, ..., p, of θ conditional on θ_{−i}: apply the MH-algorithm in turn to every component θ_i of θ with arbitrary proposal kernels q_i(θ_i* | θ_i, θ_{−i}). As long as we update each component of θ, this algorithm will converge to the target distribution π(θ).

180 A brief introduction to MCMC However... Such single-site updating can be disadvantageous if parameters are highly dependent in the posterior distribution π(θ). The problem is that the Markov chain may move around very slowly in its target (posterior) distribution. A general approach to circumvent this problem is to update parameters in larger blocks θ_j: a vector of components of θ. The choice of blocks is often controlled by what is possible to do in practice. Ideally, we should choose a small number of blocks with large dependence within the blocks but with less dependence between blocks.

181 Blocking Blocking Assume the following: θ = (κ, x), where κ is the precision and x is a GMRF. Two-block approach: sample x ~ π(x | κ), and sample κ ~ π(κ | x). There is often strong dependence between κ and x in the posterior; this is resolved using a joint update of (x, κ). Why this modification is important and why it works is discussed next.

182 Blocking Rate of convergence Rate of convergence Let θ^(1), θ^(2), ... denote a Markov chain with target distribution π(θ) and initial value θ^(0) ~ π(θ). The rate of convergence ρ measures how quickly E(h(θ^(t)) | θ^(0)) approaches the stationary value E(h(θ)) for all square π-integrable functions h(·). Let ρ be the minimum number such that for all h(·) and for all r > ρ, lim_{k→∞} E[(E(h(θ^(k)) | θ^(0)) − E(h(θ)))² r^{−2k}] = 0. (76)

183 Blocking Example Example Let x be a first-order autoregressive process x_t − µ = γ(x_{t−1} − µ) + ν_t, t = 2, ..., n, (77) where |γ| < 1, {ν_t} are iid normals with zero mean and variance σ², and x_1 ~ N(µ, σ²/(1 − γ²)). Let γ, σ², and µ be fixed parameters. At each iteration a single-site Gibbs sampler will sample x_t from the full conditional π(x_t | x_{−t}) for t = 1, ..., n:

x_t | x_{−t} ~ N(µ + γ(x_2 − µ), σ²)                                        for t = 1,
x_t | x_{−t} ~ N(µ + γ/(1 + γ²) (x_{t−1} + x_{t+1} − 2µ), σ²/(1 + γ²))       for t = 2, ..., n − 1,
x_t | x_{−t} ~ N(µ + γ(x_{n−1} − µ), σ²)                                     for t = n.

184 Blocking Example For large n, the rate of convergence is ρ = 4γ²/(1 + γ²)². (78) For γ close to one the rate of convergence can be slow: if γ = 1 − δ for small δ > 0, then ρ = 1 − δ² + O(δ³). To circumvent this problem, we may update x in one block. This is possible as x is a GMRF, and yields immediate convergence.

185 Blocking Example Relax the assumption of fixed hyperparameters. Consider a hierarchical formulation where the mean µ of x_t is unknown and assigned a standard normal prior: µ ~ N(0, 1) and x | µ ~ N(µ1, Q^{−1}), where Q is the precision matrix of the GMRF x | µ. The joint density of (µ, x) is normal. We have two natural blocks, µ and x.

186 Blocking Example A two-block Gibbs sampler updates µ and x with samples from their full conditionals: µ^(k) | x^(k−1) ~ N(1^T Q x^(k−1) / (1 + 1^T Q 1), (1 + 1^T Q 1)^{−1}) (79) and x^(k) | µ^(k) ~ N(µ^(k) 1, Q^{−1}). The presence of the hyperparameter µ will slow down the convergence compared to the case when µ is fixed. Due to the nice structure of (79) we can characterise explicitly the marginal chain of {µ^(k)}.

187 Blocking Example Theorem The marginal chain µ^(1), µ^(2), ... from the two-block Gibbs sampler defined in (79), started in equilibrium, is a first-order autoregressive process µ^(k) = φ µ^(k−1) + ε_k, where φ = 1^T Q 1 / (1 + 1^T Q 1) and ε_k ~ iid N(0, 1 − φ²).

188 Blocking Example Proof. It follows directly that the marginal chain µ^(1), µ^(2), ... is a first-order autoregressive process. The coefficient φ is found by computing the covariance at lag 1:

Cov(µ^(k), µ^(k+1)) = E(µ^(k) µ^(k+1))
  = E(µ^(k) E(µ^(k+1) | µ^(k)))
  = E(µ^(k) E(E(µ^(k+1) | x^(k)) | µ^(k)))
  = E(µ^(k) E(1^T Q x^(k) / (1 + 1^T Q 1) | µ^(k)))
  = [1^T Q 1 / (1 + 1^T Q 1)] Var(µ^(k)),

which is known to be φ times the variance Var(µ^(k)) for a first-order autoregressive process.

189 Blocking Example For our model, φ = [n(1 − γ)²/σ²] / [1 + n(1 − γ)²/σ²] = 1 − Var(x_t)(1 − γ²)/(n(1 − γ)²) + O(1/n²). When n is large, φ is close to 1 and the chain will both mix and converge slowly even though we use a two-block Gibbs sampler. Note that increasing the variance of x_t improves the convergence. However, φ → 1 as n → ∞, which is bad.

190 Blocking Example What is going on? (a) Trace of µ (k), and (b) the pairs (µ (k), 1 T Qx (k) ), with µ (k) on the horizontal axis.

191 Blocking One-block algorithm One-block algorithm So far: blocking improves mixing mainly within the block. If there is strong dependence between blocks, the MCMC algorithm may still suffer from slow convergence. Update (µ, x) jointly by delaying the accept/reject step until x also is updated: µ* ~ q(µ* | µ^(k−1)), x* | µ* ~ N(µ* 1, Q^{−1}), (80) then accept/reject (µ*, x*) jointly. Here, q(µ* | µ^(k−1)) can be a simple random-walk proposal or some other suitable proposal distribution.

192 Blocking One-block algorithm What is going on? (a) Trace of µ (k), and (b) the pairs (µ (k), 1 T Qx (k) ), with µ (k) on the horizontal axis.

193 Blocking One-block algorithm With a symmetric µ-proposal, α = min{1, exp(−½ ((µ*)² − (µ^(k−1))²))}. (81) Only the marginal density of µ is needed in (81): we effectively integrate x out of the target density. The minor modification of delaying the accept/reject step until x is updated as well can give a large improvement. In this case: a random walk on a one-dimensional density.

194 Block algorithms for hierarchical GMRF models General setup A more general setup: hyperparameters θ (low dimension); a GMRF x | θ of size n; observe x with data y. The posterior is π(x, θ | y) ∝ π(θ) π(x | θ) π(y | x, θ). Assume we are able to sample from π(x | θ, y), i.e., the full conditional of x is a GMRF.

195 Block algorithms for hierarchical GMRF models The one-block algorithm The one-block algorithm The following proposal updates (θ, x) in one block:
θ* ~ q(θ* | θ^(k−1)), x* ~ π(x | θ*, y).   (82)
The proposal (θ*, x*) is then accepted/rejected jointly. We denote this as the one-block algorithm.

196 Block algorithms for hierarchical GMRF models The one-block algorithm Consider the θ-chain: we are in fact sampling from the posterior marginal π(θ | y) using the proposal θ* ~ q(θ* | θ^(k−1)). The dimension of θ is typically low (1-5, say), so the proposed algorithm should not experience any serious mixing problems.

197 Block algorithms for hierarchical GMRF models The one-block algorithm The one-block algorithm is not always feasible The one-block algorithm is not always feasible for the following reasons: 1. The full conditional of x can be a GMRF with a precision matrix that is not sparse. This will prohibit a fast factorisation, hence a joint update is feasible but not computationally efficient. 2. The data can be non-normal so the full conditional of x is not a GMRF and sampling x using (82) is not possible (in general).

198 Block algorithms for hierarchical GMRF models The sub-block algorithm ...not sparse precision matrix These cases can often be approached using sub-blocks of (θ, x), the sub-block algorithm. Assume a natural splitting exists for both θ and x into (θ_a, x_a), (θ_b, x_b) and (θ_c, x_c). (83) The sets a, b, and c do not need to be disjoint. One class of examples where such an approach is fruitful is (geo-)additive models, where a, b, and c represent three different covariate effects with their respective hyperparameters.

199 Block algorithms for hierarchical GMRF models GMRF approximations ...non-Gaussian full conditionals Auxiliary variables can help achieve Gaussian full conditionals: logit and probit regression models for binary and multi-categorical data, and Student-t_ν distributed observations. GMRF approximations can approximate π(x | θ, y) using a second-order Taylor expansion. Prominent example: Poisson regression. Such approximations can be surprisingly accurate in many cases and can be interpreted as integrating x approximately out of π(x, θ | y).

200 Merging GMRFs Introduction Merging GMRFs using conditioning (I) µ ~ N(0, 1); x | µ ~ AR(1) with mean µ; z | x ~ N(x, I); y | z ~ N(z, I). The joint vector (µ, x, z, y) is a GMRF, and its conditional given y is a GMRF.

201 Merging GMRFs Introduction Merging GMRFs using conditioning (I) µ ~ N(0, 1); x | µ ~ AR(1) with mean µ; z | x ~ N(x, I); y | z ~ N(z, I). The joint vector (µ, x, z, y) is a GMRF, and its conditional given y is a GMRF. Additional hyperparameters θ.

202 Merging GMRFs Introduction Merging GMRFs using conditioning (II)

203 Merging GMRFs Why is it so? Merging GMRFs using conditioning (III) If
x ~ N(0, Q⁻¹), y | x ~ N(x, K⁻¹), z | x, y ~ N(y, H⁻¹),
then
Prec(x, y, z) = [ Q + K   −K      0
                  −K      K + H   −H
                  0       −H      H ],
which is sparse if Q, K and H are.
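A small sketch showing how this joint precision could be assembled with SciPy sparse matrices, assuming Q, K and H are given as sparse symmetric positive-definite matrices (names illustrative):

```python
import scipy.sparse as sp

# Assemble Prec(x, y, z) from the three model blocks on slide 203.
# The result inherits the sparsity of Q, K and H.
def joint_precision(Q, K, H):
    Z = None  # scipy.sparse.bmat treats None as a zero block
    return sp.bmat([[Q + K, -K,    Z ],
                    [-K,    K + H, -H],
                    [Z,     -H,    H ]], format="csc")
```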

204 Part VI Applications

205 Outline I Normal data Munich rental data Normal-mixture response models Introduction Hierarchical-t formulations Binary regression Summary Non-normal response models GMRF approximation Joint analysis of diseases

206 Normal data Normal data: Munich rental guide Response variable y_i: rent per m². Covariates: location, floor space, year of construction, and various indicator variables, such as no central heating, no bathroom, large balcony. German law: an increase in the rent is based on an average rent of comparable flats.

207 Normal data Munich rental data Spatial regression model y_i ~ N(µ + x^S(i) + x^C(i) + x^L(i) + z_iᵀβ, 1/κ_y). x^S(i): floor space (continuous RW2); x^C(i): year of construction (continuous RW2); x^L(i): location (first-order IGMRF); β: parameters for the indicator variables. 380 spatial locations, 2035 observations. Sum-to-zero constraint on the spatial IGMRF model and the year-of-construction covariate.

208 Normal data Munich rental data Inference using MCMC Sub-block approach: (x^S, κ_S), (x^C, κ_C), (x^L, κ_L), and (β, µ, κ_y). Update one block at a time, using κ_L* ~ q(κ_L* | κ_L), x^L* ~ π(x^L | κ_L*, the rest), and then accept/reject (κ_L*, x^L*) jointly.

209 Normal data Munich rental data Log-RW proposal for precisions It is convenient to propose new precisions using κ_new = κ_old · f, where f ~ π(f) ∝ 1 + 1/f for f in [1/F, F] and F > 1. With this choice q(κ_old | κ_new)/q(κ_new | κ_old) = 1, and Var(κ_new | κ_old) ∝ (κ_old)².
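A sketch of this scaling proposal (illustrative names; here the factor f is drawn by simple rejection sampling from the density ∝ 1 + 1/f on [1/F, F]):

```python
import numpy as np

def sample_scaling_factor(F, rng):
    """Draw f with density proportional to 1 + 1/f on [1/F, F] (rejection sampling)."""
    while True:
        f = rng.uniform(1.0 / F, F)
        # the density is bounded by 1 + F on [1/F, F], so accept with prob (1 + 1/f)/(1 + F)
        if rng.uniform() < (1.0 + 1.0 / f) / (1.0 + F):
            return f

def propose_precision(kappa_old, F, rng):
    """Scaling proposal kappa_new = kappa_old * f; the proposal ratio
    q(kappa_old | kappa_new) / q(kappa_new | kappa_old) equals 1,
    so it cancels in the Metropolis-Hastings acceptance probability."""
    return kappa_old * sample_scaling_factor(F, rng)
```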

210 Normal data Munich rental data Full conditional π(x^L | the rest) Introduce fake data ỹ, with ỹ_i = y_i − (µ + x^S(i) + x^C(i) + z_iᵀβ). The full conditional of x^L is
π(x^L | the rest) ∝ exp(−(κ_L/2) Σ_{i∼j} (x^L_i − x^L_j)²) exp(−(κ_y/2) Σ_k (ỹ_k − x^L(k))²).
The data ỹ do not introduce extra dependence between the x^L_i's, as ỹ_i acts as a noisy observation of x^L_i.

211 Normal data Munich rental data Denote by n_i the number of neighbours of location i, and let L(i) = {k : x^L(k) = x^L_i}, with size |L(i)|. The full conditional of x^L is a GMRF with parameters (Q⁻¹b, Q), where b_i = κ_y Σ_{k ∈ L(i)} ỹ_k and
Q_ij = κ_L n_i + κ_y |L(i)| if i = j, −κ_L if i ∼ j, and 0 otherwise.
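One possible way to assemble (b, Q) from a neighbourhood structure, assuming (hypothetically) that neighbours[i] lists the neighbours of location i and loc[k] gives the location index of observation k:

```python
import numpy as np
import scipy.sparse as sp

# Build the canonical parameters (b, Q) of the full conditional of x^L on slide 211.
# neighbours: list of neighbour lists for the m locations; loc: location index of each observation;
# ytilde: the "fake data" defined on slide 210.
def full_conditional_xL(neighbours, loc, ytilde, kappa_L, kappa_y):
    m = len(neighbours)
    counts = np.bincount(loc, minlength=m)                        # |L(i)|
    b = kappa_y * np.bincount(loc, weights=ytilde, minlength=m)   # b_i = kappa_y * sum of ytilde at location i
    Q = sp.lil_matrix((m, m))
    for i, nb in enumerate(neighbours):
        Q[i, i] = kappa_L * len(nb) + kappa_y * counts[i]
        for j in nb:
            Q[i, j] = -kappa_L
    return b, Q.tocsc()
```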

212 Normal data Munich rental data Results Floor space

213 Normal data Munich rental data Results Year of construction

214 Normal data Munich rental data Results Location

215 Normal-mixture response models Introduction Normal-mixture representation Theorem (Kelker, 1971) If x has density f(x) symmetric around 0, then there exist independent random variables z and v, with z standard normal, such that x = z/v iff the derivatives of f satisfy (−d/dy)^k f(√y) ≥ 0 for y > 0 and for k = 1, 2,.... Examples: Student-t, Logistic and Laplace.

216 Normal-mixture response models Introduction Corresponding mixing distribution for the precision parameter λ that generates these distributions as scale mixtures of normals:
Student-t_ν: λ ~ G(ν/2, ν/2);
Logistic: λ = 1/(2K)², where K is Kolmogorov-Smirnov distributed;
Laplace: λ = 1/(2E), where E is exponentially distributed.
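A quick numerical check of the Student-t row of this correspondence (Python/NumPy; note that NumPy's gamma sampler is parametrised by a scale, so rate ν/2 becomes scale 2/ν):

```python
import numpy as np

rng = np.random.default_rng(1)
nu, n = 5.0, 200_000

# Student-t_nu as a scale mixture of normals:
# lambda ~ Gamma(nu/2, rate=nu/2), x | lambda ~ N(0, 1/lambda)
lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
x_mix = rng.normal(size=n) / np.sqrt(lam)

x_t = rng.standard_t(df=nu, size=n)          # direct Student-t draws for comparison
print(np.quantile(x_mix, [0.05, 0.5, 0.95]))
print(np.quantile(x_t, [0.05, 0.5, 0.95]))   # the two sets of quantiles should roughly agree
```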

217 Normal-mixture response models Hierarchical-t formulations RW1 with t_ν increments Replace the assumption of normally distributed increments by a Student-t_ν distribution to allow for larger jumps in the sequence x. Introduce n − 1 independent G(ν/2, ν/2) scale-mixture variables λ_i: Δx_i | λ_i ~ N(0, (κλ_i)⁻¹), i = 1,..., n − 1.

218 Normal-mixture response models Hierarchical-t formulations Observe data y_i ~ N(x_i, κ_y⁻¹) for i = 1,..., n. The posterior density for (x, λ) is π(x, λ | y) ∝ π(x | λ) π(λ) π(y | x). Note that x | (y, λ) is now a GMRF, while λ_1,..., λ_{n−1} | (x, y) are conditionally independent gamma distributed with parameters (ν + 1)/2 and (ν + κ(Δx_i)²)/2. Use the sub-block algorithm with blocks (θ, x) and λ.
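The gamma full conditional for λ translates into a one-line Gibbs update; a sketch with illustrative names:

```python
import numpy as np

# One Gibbs update of the scale-mixture variables lambda given x, for the RW1-with-t-increments model.
# Assumed (hypothetical) inputs: x (latent vector), kappa (increment precision), nu (degrees of freedom).
def update_lambda(x, kappa, nu, rng):
    dx = np.diff(x)                                  # RW1 increments, length n - 1
    shape = (nu + 1.0) / 2.0
    rate = (nu + kappa * dx**2) / 2.0
    return rng.gamma(shape=shape, scale=1.0 / rate)  # lambda_i | x ~ Gamma((nu+1)/2, rate=(nu+kappa*dx_i^2)/2)
```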

219 Normal-mixture response models Binary regression Example: Binary regression GMRF x and Bernoulli data y_i ~ B(g⁻¹(x_i)), where g(p) = log(p/(1 − p)) (logit link) or g(p) = Φ⁻¹(p) (probit link). Equivalent representation using auxiliary variables w for the probit link: ɛ_i ~ iid N(0, 1), w_i = x_i + ɛ_i, and y_i = 1 if w_i > 0, 0 otherwise.

220 Normal-mixture response models Binary regression MCMC Inference The full conditional for the GMRF prior is a GMRF. The full conditional for the auxiliary variables factorises, π(w | ...) = ∏_i π(w_i | ...). Use the sub-block algorithm with blocks (θ, x) and w.
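Since π(w | ...) factorises into univariate truncated normals, the w-update is straightforward; a sketch using SciPy (illustrative names):

```python
import numpy as np
from scipy.stats import truncnorm

# One update of the auxiliary variables w in the probit model:
# w_i | x, y_i is N(x_i, 1) truncated to (0, inf) if y_i = 1 and to (-inf, 0] if y_i = 0.
# Assumed (hypothetical) inputs: x (current GMRF sample), y (0/1 data), rng.
def update_w(x, y, rng):
    lower = np.where(y == 1, -x, -np.inf)   # standardized truncation bounds
    upper = np.where(y == 1, np.inf, -x)
    return x + truncnorm.rvs(lower, upper, size=x.shape, random_state=rng)
```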

221 Normal-mixture response models Binary regression Simple example: Tokyo rainfall data Stage 1 Binomial data y_i ~ Binomial(2, p(x_i)) or Binomial(1, p(x_i)).

222 Normal-mixture response models Binary regression Simple example: Tokyo rainfall data Stage 2 Assume a smooth latent x: x ~ RW2(κ), logit(p_i) = x_i.

223 Normal-mixture response models Binary regression Simple example: Tokyo rainfall data Stage 3 Gamma(α, β)-prior on κ

224 Normal-mixture response models Binary regression Simple example: Tokyo rainfall data (Figure: precision, block vs. single-site updating.)

225 Normal-mixture response models Binary regression Simple example: Tokyo rainfall data (Figure: probability, block vs. single-site updating.)

226 Normal-mixture response models Summary These auxiliary-variable methods work better than one might think. The dependence between the blocks (θ, x) and w is not that strong. Nearly as good as with normal data.

227 Non-normal response models GMRF approximation Non-normal response models Example of a full conditional:
π(x | y) ∝ exp(−½ xᵀQx − Σ_i exp(x_i)) ≈ exp(−½ (x − µ)ᵀ(Q + diag(c_i))(x − µ)).
Construct the GMRF approximation: locate the mode x*, expand to second order to obtain the GMRF approximation. The graph is unchanged.

228 Non-normal response models GMRF approximation Why use the mode?

229 Non-normal response models GMRF approximation How to find the mode? Numerical gradient-search methods: evaluate the gradient and the Hessian in O(n) flops; faster for large GMRFs. Newton-Raphson method: take the current x as the initial value, Taylor-expand around x, compute the mean of the GMRF approximation, and repeat until convergence. Further complications arise with linear constraints Ax = b.
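A minimal sketch of such a Newton-Raphson mode search for the toy full conditional on slide 227, assuming Q is a SciPy sparse SPD precision matrix (function name and convergence settings are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Newton-Raphson mode search for
#   log pi(x | y) = -0.5 x^T Q x - sum_i exp(x_i) + const.
# Each step solves the sparse system (Q + diag(c)) x_new = b, i.e. computes the mean of the
# Gaussian obtained by a second-order expansion of the non-quadratic term around the current x.
def gmrf_approximation_mode(Q, x0, n_iter=20, tol=1e-10):
    x = x0.copy()
    for _ in range(n_iter):
        c = np.exp(x)                       # minus the second derivative of -exp(x_i)
        b = c * (x - 1.0)                   # linear term from the expansion: b_i = exp(x_i)(x_i - 1)
        H = Q + sp.diags(c)                 # precision of the Gaussian approximation
        x_new = splu(H.tocsc()).solve(b)    # mean of the approximation
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, H
        x = x_new
    return x, H
```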

230 Non-normal response models GMRF approximation Taylor or not? (I) Consider a non-quadratic function g(x): g(x) = g(x_0) + (x − x_0)g′(x_0) + ½(x − x_0)² g″(x_0) + .... The error is zero at x_0 and typically increases with |x − x_0|.

231 Non-normal response models GMRF approximation Taylor or not? (II) Alternative approximation: g(x) ≈ ĝ(x) = a + bx − ½cx², (84) where a, b and c are such that g(x_0) = ĝ(x_0) and g(x_0 ± h) = ĝ(x_0 ± h). More uniformly distributed error; better for distribution approximation ... if h is a rough guess of the region of interest. Preference for numerical approximations of derivatives over analytical expressions.
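A sketch of fitting ĝ by matching g at the three points x_0 − h, x_0, x_0 + h instead of using derivatives (illustrative names):

```python
import numpy as np

# Fit g_hat(x) = a + b*x - 0.5*c*x**2 by matching g at x0 - h, x0, x0 + h.
def fit_quadratic_through_three_points(g, x0, h):
    xs = np.array([x0 - h, x0, x0 + h])
    # Solve the 3x3 linear system for (a, b, c) from g_hat(xs) = g(xs).
    A = np.column_stack([np.ones(3), xs, -0.5 * xs**2])
    a, b, c = np.linalg.solve(A, g(xs))
    return a, b, c

# Example: approximate g(x) = -exp(x) around x0 = 0 with h = 1.
a, b, c = fit_quadratic_through_three_points(lambda x: -np.exp(x), 0.0, 1.0)
```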

232 Non-normal response models Joint analysis of diseases Joint analysis of diseases SMR Oral SMR Lung

233 Non-normal response models Joint analysis of diseases Separate analysis: model y_ij: number of cases in area i for cancer type j; y_ij ~ P(e_ij exp(η_ij)), where η_ij is the log relative risk in area i for disease j and the e_ij are constants. Decompose the log relative risk as η = µ1 + u + v, where µ is the overall mean, u is a spatially structured component (IGMRF), and v is an unstructured component (random effects).

234 Non-normal response models Joint analysis of diseases Separate analysis: model Posterior, with x = (µ, u, η):
π(x, κ | y) ∝ κ_v^(n/2) κ_u^((n−1)/2) exp(−½ xᵀQx) exp(Σ_i [y_i η_i − e_i exp(η_i)]) π(κ),
Q = [ κ_µ + nκ_v   κ_v 1ᵀ          −κ_v 1ᵀ
      κ_v 1        κ_u R + κ_v I   −κ_v I
      −κ_v 1       −κ_v I          κ_v I ].
Q is a (2n + 1) × (2n + 1) matrix, where n = 544. The spatially structured term u has a sum-to-zero constraint, 1ᵀu = 0.

235 Non-normal response models Joint analysis of diseases Separate analysis: MCMC Joint update of all parameters: κ_u* ~ q(κ_u* | κ_u), κ_v* ~ q(κ_v* | κ_v), x* ~ π(x | κ*, y).

236 Non-normal response models Joint analysis of diseases Separate analysis: Results Oral Lung

237 Non-normal response models Joint analysis of diseases Joint analysis: Model Oral cavity and lung cancer both relate to tobacco smoking (u_1); oral cancer also relates to alcohol consumption (u_2). Latent shared and specific spatial components:
η_1 | u_1, u_2, µ, κ ~ N(µ_1 + δu_1 + u_2, κ_{η_1}⁻¹ I),
η_2 | u_1, u_2, µ, κ ~ N(µ_2 + δ⁻¹u_1, κ_{η_2}⁻¹ I).

238 Non-normal response models Joint analysis of diseases Joint analysis: MCMC Update all parameters in one block: simple (log-)RW proposals for the hyperparameters; sample (µ_1, η_1, u_1, µ_2, η_2, u_2) from the GMRF approximation; accept/reject jointly.

239 Non-normal response models Joint analysis of diseases Joint analysis: Results tobacco alcohol
