Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models

Christopher Paciorek, Department of Statistics, University of California, Berkeley, and Department of Biostatistics, Harvard School of Public Health

Version 1.0: December 2009

Spatial models for areal data, such as disease mapping based on aggregated disease counts in administrative districts, commonly employ Markov random field (MRF) models, in particular the conditional specification of a Markov random field known as the conditional autoregressive (CAR) model. While CAR and MRF refer to the same model structure, one often thinks of CAR models in terms of the conditional distributions for each of the random variables given the random variables in neighboring areas. A comprehensive reference for Gaussian MRFs (GMRFs) is Rue and Held (2005). The text below also draws from Banerjee, Carlin, and Gelfand (2004, Sections 3.2, 3.3, 5.4).

One of the major appeals of the MRF specification is that the precision matrix for the random field is very sparse, which enables efficient computation through well-developed sparse matrix routines, such as those in the spam package in R. Here I highlight some of the features of GMRFs in spatial modeling, in particular technical aspects related to the specification of intrinsic GMRFs, which are improper models. I've found that impropriety in the intrinsic GMRF can be hard to grasp, and I hope to lay it out clearly and relatively briefly here in one place.

Basic spatial model structure (standard CAR model)

A basic GMRF model for a spatial collection of random variables used to model areal data, with each random variable representing one of $n$ spatial areas that partition a given domain, $\theta = \{\theta(s_1), \ldots, \theta(s_n)\}$, can be specified in terms of the (scaled) precision, $Q$:
$$\theta \sim N_n(0, \tau^2 Q^{-1}).$$
Here I use the inverse of $Q$ only conceptually, because in most cases $Q$ is of less than full rank, as discussed below. A non-zero mean, including the effect of covariates, can be specified separately as part of the larger model as needed.

A standard form for an MRF model for areal data is to specify that any pair of elements of $\theta$, say $\theta_i$ and $\theta_j$, be independent given the remaining elements if the areas represented by the elements do not share a spatial border. This conditional independence corresponds to the $ij$th element of $Q$ being zero. One then usually takes a weight of one for each pair that shares a common border. The resulting $Q$ has $Q_{ij} = -1$ for areas that share a common border (or are considered neighbors under some other criterion) and $Q_{ii}$ equal to the number of neighbors for area $i$, $m_i$.
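As a concrete illustration, here is a minimal R sketch (mine, not part of the vignette) that builds $Q$ for a hypothetical four-area map laid out as a 2 x 2 grid, treating areas that share a border as neighbors:

    ## Adjacency matrix W for a hypothetical 2 x 2 grid of four areas:
    ## area 1 borders areas 2 and 3; area 4 borders areas 2 and 3.
    W <- matrix(c(0, 1, 1, 0,
                  1, 0, 0, 1,
                  1, 0, 0, 1,
                  0, 1, 1, 0), nrow = 4, byrow = TRUE)
    m <- rowSums(W)      # m_i: the number of neighbors of each area
    Q <- diag(m) - W     # Q_ii = m_i and Q_ij = -1 for neighboring pairs
    Q %*% rep(1, 4)      # all zeros: Q1 = 0, foreshadowing the impropriety below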

The resulting full conditionals specify the standard CAR model:
$$\theta_i \mid \theta_{-i} \sim N\Big(\sum_{j \in N(i)} \theta_j / m_i, \; \tau^2 / m_i\Big),$$
where $N(i)$ is the set of neighbors of area $i$. Note that $Q$ can be expressed as $Q = D^{-1}(I - B) = D^{-1} - W$, where $D$ is a diagonal matrix with elements $1/m_i$ and $B$ is related to the proximity (or adjacency) matrix, $W$, by $B_{ij} = W_{ij}/W_{i+}$, where $W_{i+}$ is the sum of the $i$th row. Here $W$ is simply a matrix of all zeroes, except that $W_{ij} = 1$ where areas $i$ and $j$ share a border, so $W_{i+} = m_i$. The diagonal of $W$ is all zeroes. More generally, whether two areas are neighbors can be determined by another rule, such as one based on distance, and the weights in $W$ do not have to be ones; these are subjective user choices.

MRF approximation to a thin plate spline

Rue and Held (2005) and Yue and Speckman (2008) describe a more flexible MRF model that approximates a thin plate spline, specified on a regular grid. Rather than using a proximity matrix with ones only for nearest, cardinal neighbors, the proximity matrix, $W$, has the value 8 for pairs that are nearest (first-order) neighbors in the cardinal directions, the value $-2$ for pairs that are nearest neighbors on the diagonal, and the value $-1$ for pairs that are second-order neighbors in the cardinal directions. For this model the generalization of the number of neighbors is $W_{i+} = 4(8) + 4(-2) + 4(-1) = Q_{ii} = 20$. Boundary corrections are needed for grid cells on the boundary or one cell away from the boundary. For interior cells, the full conditional mean of $\theta_i$ given the neighbors is $8/20$ times the sum of the nearest cardinal neighbors, minus $2/20$ times the sum of the nearest diagonal neighbors, minus $1/20$ times the sum of the second-nearest cardinal neighbors, with conditional variance $\tau^2/20$. Rue and Held (2005, p. 114) describe the derivation of this model based on the forward difference analogue of penalizing the derivatives of a surface to derive the thin plate spline.

Note that this approach generalizes a second-order random walk model in one dimension. In one dimension, this gives us a model specified in time order as
$$\theta_t \mid \theta_{\text{past}} \sim N(2\theta_{t-1} - \theta_{t-2}, \tau^2),$$
or, based on the full conditionals, as
$$\theta_t \mid \theta_{-t} \sim N\Big(\tfrac{4}{6}(\theta_{t+1} + \theta_{t-1}) - \tfrac{1}{6}(\theta_{t+2} + \theta_{t-2}), \; \tau^2/6\Big)$$
for times at least two away from the first or last time, with the necessary boundary corrections otherwise (see Rue and Held (2005, p. 110)). The elements of the $t$th row of $Q$ (for $2 < t < n - 1$) are zeroes except $Q_{t,t-2} = Q_{t,t+2} = 1$, $Q_{t,t-1} = Q_{t,t+1} = -4$, and $Q_{tt} = 6$. This model is also given in the Ice example in the BUGS manual.

Impropriety

The matrix $Q$ is of less than full rank. For the simple CAR model, one eigenvalue of the precision matrix is zero, corresponding to the linear combination $\sum_i \theta_i$. The simple CAR model can be seen as a joint distribution,
$$P(\theta) \propto \exp\Big(-\frac{1}{2\tau^2} \sum_{i \sim j} w_{ij} (\theta_i - \theta_j)^2\Big),$$
where the sum is over pairs of neighboring areas. This distribution is invariant to the addition of a constant to all the $\theta$s ($Q\mathbf{1} = 0$) and is therefore improper, not giving a finite variance for the linear combination $\sum_i \theta_i$. For the thin plate spline MRF, there are three zero eigenvalues, corresponding to the sum and to linear terms in the cardinal directions, with the prior being invariant to the addition of a plane to the $\theta$s. This means that the prior does not constrain these linear combinations of the $\theta$s, which have infinite variance.
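To make the second-order random walk and its rank deficiency concrete, here is a minimal R sketch (my illustration, not from the vignette; the length n = 8 is arbitrary). It builds the RW2 precision matrix as the cross-product of the second-difference matrix, which reproduces the interior rows and the boundary corrections automatically:

    ## RW2 precision matrix Q = t(D2) %*% D2, where D2 is the
    ## (n-2) x n second-difference matrix with rows (1, -2, 1).
    n <- 8
    D2 <- diff(diag(n), differences = 2)
    Q <- crossprod(D2)
    Q[4, ]                               # interior row: 0 1 -4 6 -4 1 0 0
    round(eigen(Q, symmetric = TRUE)$values, 10)  # last two eigenvalues are zero:
                                         # the prior ignores constant and linear trends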

The solution to this issue is to use the pseudoinverse when one needs to compute solutions to linear systems of the form $Q^{-1}x$. In this setting, the following generalized inverse satisfies the Moore-Penrose conditions and is the unique pseudoinverse. First, compute the eigendecomposition, $Q = \Gamma \Lambda \Gamma^T$. Then the pseudoinverse is $Q^{-1} = Q^+ = \Gamma \Lambda^+ \Gamma^T$, where $\Lambda^+$ is a diagonal matrix with $\lambda_i^+ = 1/\lambda_i$ for $\lambda_i \neq 0$ and $\lambda_i^+ = 0$ when $\lambda_i = 0$. Numerically, one takes a diagonal element to be zero whenever $\lambda_i$ is within some tolerance of zero.

What are we doing statistically? We are forcing the appropriate linear combinations to have zero variance, thereby giving us a proper distribution on the reduced-dimension subspace. To allow flexibility in modeling, we need only add additional parameters to the mean model for $\theta$ to take the place of the omitted linear combinations. This gives us a proper prior without sacrificing flexibility. The implicit model using the pseudoinverse is $\theta \sim N_{n-c}(0, \tau^2 Q^+)$, where $c$ is the number of constraints (the number of zero eigenvalues of $Q$).

What about the normalizing constant in this context? When we enforce the constraint(s), we make use of $\Lambda^+$, so the generalized determinant is $|Q^+|_+ = \prod_{i=1}^{n-c} \lambda_i^+$, which makes sense because we are working in a reduced-dimension subspace and ignore the eigenvalues corresponding to the constrained portion of the space. Accordingly, the appropriate normalizing constant for these specifications involves $(\tau^2)^{-(n-c)/2}$ rather than $(\tau^2)^{-n/2}$, since the prior is proper in the subspace. Of course, since $Q$ contains no parameters, we need not compute $|Q^+|_+$, but only make use of $(\tau^2)^{-(n-c)/2}$; this reasoning also justifies setting $|Q^+|_+ / |Q^+|_+ = 1$ in MCMC acceptance ratio calculations.

I'll say a few more words on singular precision and covariance matrices. With a singular precision matrix, there is at least one linear combination, $\Gamma_n^T \theta$, where $\Gamma_n$ is the last eigenvector, that has zero contribution to the prior density, because $\lambda_n = 0$. The linear combination(s) is ignored. Thus we have a density function, but it cannot be integrated because the normalizing constant is infinite, and the prior has nothing to say about the probability density of at least one linear combination. In contrast, with a singular covariance matrix, there is at least one linear combination that has zero variance, thereby imposing a constraint on at least one linear combination of any realization, $\Gamma_n^T \theta \equiv 0$. The constraint on the linear combination must be satisfied, and any $\theta$ that does not satisfy this constraint has zero density.

This impropriety carries over into the marginalized model for the data. For example, suppose we have $Y \sim N(X\beta + \theta, V_Y)$, with an MRF prior for $\theta$, $\theta \sim N(0, \tau^2 Q^{-1})$, again using $Q^{-1}$ only conceptually, and where $X\beta$ contains terms that take the place of the constrained linear combinations. Marginalizing over $\theta$, we have an improper prior predictive distribution for $Y$, $Y \sim N(X\beta, \Sigma)$ where $\Sigma = V_Y + \tau^2 Q^{-1}$, but $\Sigma$ does not exist because the inverse of $Q$ does not exist, with $c$ linear combinations having infinite variance. This improper prior predictive distribution makes sense: we cannot generate from the prior on $\theta$, so we cannot generate from the predictive distribution for the data. However, the posterior for $\theta$ will be proper for $n > c$, as will the posterior predictive distribution. Note that in the constrained space, the prior for $Y$ is proper, so we can generate $Y$ with constraints on the appropriate linear combinations by generating realizations of $\theta$ with the constraints. This involves using the pseudoinverse to get the predictive distribution, $Y \sim N_n(X\beta, V_Y + \tau^2 Q^+)$, which is the prior predictive when the linear constraints on $\theta$ are imposed in the prior through the pseudoinverse.
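To illustrate both the pseudoinverse and constrained generation, here is a minimal R sketch (mine, not part of the vignette), reusing the RW2 precision matrix $Q$ and length $n$ built above and taking $\tau = 1$:

    ## Moore-Penrose pseudoinverse of Q via the eigendecomposition,
    ## zeroing eigenvalues within a numerical tolerance of zero.
    eig <- eigen(Q, symmetric = TRUE)
    lambda_plus <- ifelse(eig$values > 1e-10, 1 / eig$values, 0)
    Qplus <- eig$vectors %*% diag(lambda_plus) %*% t(eig$vectors)
    range(Q %*% Qplus %*% Q - Q)    # ~0: Moore-Penrose condition Q Q+ Q = Q

    ## Generate theta from N(0, tau^2 Q+): the c = 2 constrained linear
    ## combinations (constant and linear trends here) are forced to zero.
    tau <- 1
    z <- rnorm(n)
    theta <- tau * eig$vectors %*% (sqrt(lambda_plus) * z)
    crossprod(eig$vectors[, n], theta)  # ~0: a constrained combination has zero variance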

Two solutions to identifiability

Solution 1

In most cases we are not concerned with generating from the prior predictive, and attention is focused on the posterior for $\theta$ and the other parameters. Here it is simplest to work on the precision scale, with zero prior precision (infinite prior variance) on the $c$ linear combinations, and with those linear combinations not included in $X\beta$ but instead included implicitly as part of $\theta$. The conditional posterior for $\theta$ has variance $V_{\theta|Y} = (V_Y^{-1} + \tau^{-2} Q)^{-1}$, which is full rank, with the prior contributing no information about the $c$ linear combinations. This is equivalent to putting flat, improper priors on those linear combinations. One can sample from the conditional for $\theta$, and the $c$ linear combinations will be unconstrained by the prior but informed by the data.

In many cases we would integrate over $\theta$. We can express the marginal precision for $Y$, based on completing the square, as $\Sigma^{-1} = V_Y^{-1} - V_Y^{-1}(V_Y^{-1} + \tau^{-2} Q)^{-1} V_Y^{-1}$. This precision matrix has $c$ zero eigenvalues, inherited from the improper prior for $\theta$. The normalizing constant for the marginal density of $Y$ can be expressed in terms of $|\Sigma|^{1/2} = |V_Y|^{1/2} \, |V_{\theta|Y}^{-1}|^{1/2} \, (\tau^2)^{(n-c)/2} / |Q|_+^{1/2}$, where $|Q|_+$ (the product of the non-zero eigenvalues of $Q$) is a constant and need not be computed. Thus the marginalized likelihood can be computed even though the distribution for $Y$ is improper (but proper in a reduced-dimension subspace), since we do not need to compute $\Sigma$. Note that the marginal likelihood ignores $c$ linear combinations of $Y$ because their precisions are zero, thereby taking these combinations to have no information about the remaining parameters after marginalization.

Solution 2

In some cases, one may wish to enforce the constraint(s) in the prior and then include the constrained linear combinations as parameters in $X\beta$, an alternative that also gives identifiability. When sampling $\theta$ in an MCMC (potentially off-line if it has been integrated out), one can use one of two approaches to sample $\theta$ such that the constraints are obeyed. One solution is to set the last $c$ eigenvalues of $V_{\theta|Y}$ to zero, while a potentially more efficient approach, which allows one to use sparse matrix routines and avoid the eigendecomposition, is conditioning by kriging, as described in Rue and Held (2005, p. 37). A common ad hoc approach is to instead sample $\theta$ without the constraint and then impose the constraint empirically, for example setting $\theta_{\text{new}} = \theta - \bar{\theta}\mathbf{1}$ for the constraint on the mean. Unfortunately, this is not the same as sampling under the constraint (see Rue and Held (2005, p. 36)), so it does not preserve the posterior as the stationary distribution of the MCMC, although in practice the departure may be minimal.

Enforcing full propriety via an autoregressive parameter

As discussed in Banerjee et al. (2004), one can instead enforce propriety by taking $Q = D^{-1} - \rho W$, with constraints on $\rho$ (between the inverses of the smallest and largest eigenvalues of $D^{1/2} W D^{1/2}$). Here $\rho$ is analogous to the AR(1) autocorrelation parameter, and its presence makes the prior proper, allowing us to generate $\theta$ from the MRF distribution in the simple CAR setting. However, the interpretation of this model structure is troublesome, as the resulting full conditional mean in the simple CAR setting is $\rho$ times the average of the neighbors, $\rho \sum_{j \in N(i)} \theta_j / m_i$, causing shrinkage toward the overall mean (zero in my simplified setting here). In a spatial setting this is generally unappealing, in contrast to the time series context. Also, in many empirical examples the estimate of $\rho$ ends up being nearly one. Hence, to my mind, the intrinsic model is more appealing, with a clean interpretation as a proper prior in a subspace, either with enforced constraints or with an improper prior on the unconstrained linear combinations.
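As a sketch of this proper CAR variant (my illustration, not from the vignette), reusing the four-area adjacency matrix W and neighbor counts m from the first sketch, with arbitrary values ρ = 0.9 and τ² = 1:

    ## Proper CAR precision Q(rho) = D^{-1} - rho * W, with D = diag(1/m_i),
    ## so D^{-1} = diag(m_i). The admissible range for rho runs between the
    ## inverses of the smallest and largest eigenvalues of D^{1/2} W D^{1/2}.
    Dhalf <- diag(1 / sqrt(m))
    ev <- eigen(Dhalf %*% W %*% Dhalf, symmetric = TRUE)$values
    c(lower = 1 / min(ev), upper = 1 / max(ev))
    rho <- 0.9
    Q_rho <- diag(m) - rho * W                 # full rank for rho within the bounds
    theta <- backsolve(chol(Q_rho), rnorm(4))  # draw from N(0, Q_rho^{-1})

Because $Q(\rho)$ is full rank, one can draw $\theta$ directly, in contrast to the intrinsic case above.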

References

Banerjee, S., Carlin, B., and Gelfand, A. (2004), Hierarchical Modeling and Analysis for Spatial Data, Boca Raton, Florida: Chapman & Hall.

Rue, H. and Held, L. (2005), Gaussian Markov Random Fields: Theory and Applications, Boca Raton: Chapman & Hall.

Yue, Y. and Speckman, P. (2008), Nonstationary spatial Gaussian Markov random fields, Technical report, University of Missouri, Department of Statistics.