Bayesian hierarchical models for spatially misaligned data in R

Methods in Ecology and Evolution 24, 5, 54 523 doi:./24-2x.29

APPLICATION

Andrew O. Finley 1*, Sudipto Banerjee 2 and Bruce D. Cook 3

1 Department of Forestry, Michigan State University, 26 Natural Resources Building, East Lansing, MI 424-222, USA; 2 Division of Biostatistics, School of Public Health, University of Minnesota, A46 Mayo Building, MMC 33, 42 Delaware Street S.E., Minneapolis, MN 55455, USA; and 3 Biospheric Sciences Laboratory, National Aeronautics and Space Administration, Goddard Space Flight Center, Code 6, Greenbelt, MD 277, USA

Summary

1. Spatial misalignment occurs when at least one of multiple outcome variables is missing at an observed location. For spatial data, prediction of these missing observations should be informed by within-location association among outcomes and by proximate locations where measurements were recorded.

2. This study details and illustrates a Bayesian regression framework for modelling spatially misaligned multivariate data. Particular attention is paid to developing valid probability models capable of estimating parameter posterior distributions and propagating uncertainty through to the outcomes' predictive distributions at locations where some or all of the outcomes are not observed.

3. Models and associated software are presented for both Gaussian and non-Gaussian outcomes. Model parameter and predictive inference within the proposed framework is illustrated using a synthetic data set and a forest inventory data set.

4. The proposed Markov chain Monte Carlo samplers were written in C++ and leverage R's foreign language interface to call FORTRAN BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) libraries for efficient matrix computations. The models are implemented in the spMisalignLM and spMisalignGLM functions within the spBayes R package available via the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org).
Key-words: multivariate, misalignment, missingness, Gaussian spatial process, linear model of coregionalization, Markov chain Monte Carlo

*Correspondence author. E-mail: finleya@msu.edu

Introduction

Investment in long-term monitoring networks and advancement in sensor technologies are creating data-rich environments that provide extraordinary opportunities to understand the complexity of large and spatially indexed ecological data. Building such understanding often requires the analysis of spatially indexed data sets with multiple variables measured at each location. In such settings, it is commonly posited that there is association between the measurements at a given location as well as association among measurements across locations. In ecological analysis, we often seek inference about the association among these multiple variables or wish to predict their values at new locations. For example, consider the analysis of (i) species co-occurrence where species presence/absence or abundance is recorded at each location, for example Ovaskainen, Hottola & Siitonen (2); (ii) soil nutrient impact on local tree growth and competition where soil nutrient measurements coincide with tree inventory locations, for example Baribault, Kobe & Finley (22); or (iii) relationship between multiple environmental stressors and measures of focal species fitness, for example Swope & Parker (22). In each case, development of a statistical model typically requires the full set of outcomes, for example species presence/absence, and covariates, for example soil nutrients or environmental stressors, at a set of locations. Given such multivariate settings, it is common that different subsets of the outcome variables, or covariates, are available at different locations. In the statistical literature, this situation is referred to as spatial misalignment.
Following the examples above, say observers record the presence/absence of different subsets of species at different locations, or, for a subset of locations, only some of the soil nutrients or plant stressors were measured, perhaps because of different sampling protocols or because data were drawn from different databases. In such cases, it is necessary to impute or predict the value of the missing observations. Note that if there is misalignment among the covariates, then we might view them as outcomes in a model used to predict their missing observations. Regardless of where the misalignment occurs, these predictions should be informed by the within-location association among variables and by proximate locations where measurements were recorded.

24 The Authors. Methods in Ecology and Evolution 24 British Ecological Society

Further, it is common to seek prediction for the entire set of outcomes at new locations where no measurements were recorded. In both cases, an assessment of prediction uncertainty is also often desired.

Here, we consider point-point misalignment, as distinct from point-areal misalignment, which refers to the situation where some variables are referenced by points while others have been aggregated over spatial regions. Although the term point-areal misalignment is used in the literature, following Gotway & Young (22), we prefer to classify this as a change-of-support problem. See also Mugglin, Carlin & Gelfand (2), Gelfand, Zhu & Carlin (2), Zhu, Carlin & Gelfand (23) and the references therein for methods for change-of-support problems.

A salient feature of our data is that every location generates, at most, only one replicate of the multiple outcomes. For empirical estimation of the association among these outcomes using sampling-based multivariate analysis methods, one must consider the observations at different spatial locations as independent replicates. This will, however, preclude estimation of the spatial associations. Under this setting, inference on associations should deploy fully model-based approaches using the flexibility of spatial stochastic processes.

Existing model-based methods for handling spatial point-point misalignment primarily aim to align disparate variables by accounting for additional uncertainty when kriging or other smoothing methods are used to align the spatially referenced data (Madsen, Ruppert & Altman 2; Buonaccorsi 29; Gryparis et al. 29; Paciorek et al. 29; Szpiro et al. 2; Lopiano, Young & Gotway 2 23). These approaches build conditional regression-like models where the marginal distribution of the first outcome is specified, followed by the conditional distribution of the second outcome given the first and so on.
This approach is easily interpretable and ensures the validity of the resulting joint distributions from the process realizations. However, the approach is more suitable when the number of outcomes is small and there is a natural ordering that would suggest the sequence for constructing the conditional distributions. Settings such as ours lack such information on ordering, so joint modelling of the outcomes is preferable to avoid the explosion in models emerging from alternate ordering schemes. Joint models attempt to directly construct cross-covariance functions that describe the covariances between different outcomes at two, possibly different, locations. A model-based Bayesian approach for point-point misalignment was presented in Banerjee & Gelfand (22). More recently, joint modelling of point patterns and misaligned covariates is considered in Illian, Sorbye & Rue (22), while Ren & Banerjee (23) considered modelling spatial misalignment using a class of spatial latent factor models.

While the problem of spatial misalignment is ubiquitous, software to implement model-based analysis of such data is absent. Our current work focuses upon point-point misalignment and extends and integrates some of the aforementioned methodological work into a Bayesian hierarchical modelling framework. In addition, we demonstrate how this is implemented in our spBayes package for the R statistical programming language and environment.

Multivariate spatial regression with misalignment

Let $S_1, S_2, \ldots, S_m$ denote sets comprising $n_1, n_2, \ldots, n_m$ locations where $m$ outcomes have been observed. We collect all observations for the first outcome into an $n_1 \times 1$ column vector $y_1$, those for the second outcome into an $n_2 \times 1$ column vector $y_2$, and so on until we collect observations corresponding to the $m$-th outcome into an $n_m \times 1$ column vector $y_m$. Each of these is stacked into an $N \times 1$ column vector $y$, where $N = \sum_{i=1}^{m} n_i$.
The covariates corresponding to the $i$-th outcome $y_i$ are collected into an $n_i \times p_i$ matrix $X_i$, and we let $\beta_i$ denote the $p_i \times 1$ regression slope vector associated with $X_i$. The other key ingredient in the multivariate spatial regression model is the vector of unobserved spatial random effects. For any location $s$, indexed by some coordinate frame, we have a spatial random effect $w_i(s)$ associated with the $i$-th outcome $y_i(s)$ for $i = 1, 2, \ldots, m$. We collect the random effects corresponding to the $i$-th outcome into an $n_i \times 1$ vector $w_i$ so that it corresponds to $y_i$. The multivariate spatial linear regression model is given by

$$y_i = X_i \beta_i + w_i + \epsilon_i, \quad i = 1, 2, \ldots, m, \qquad (1)$$

where $\epsilon_i$ is an $n_i \times 1$ column vector of zero-centred residual random errors corresponding to the $i$-th outcome such that the covariance between an element in $\epsilon_i$ and an element in $\epsilon_j$ is zero whenever $i$ and $j$ correspond to different outcomes. Two elements within $\epsilon_i$ represent random errors associated with the $i$-th outcome measured at two different locations. The covariances between any two such elements and the variances of each element in $\epsilon_i$ are placed as off-diagonal and diagonal entries in an $n_i \times n_i$ matrix $\Psi_i$, which is the variance-covariance matrix of $\epsilon_i$. Each of the $\epsilon_i$'s is assumed to be normally distributed, independent of the others, with mean zero and variance-covariance matrix $\Psi_i$.

Model (1) can be extended to accommodate non-Gaussian outcomes such as (i) binary data modelled using logit or probit regression, and (ii) count data modelled using Poisson regression. Diggle, Tawn & Moyeed (99) unify the use of generalized linear models in spatial data contexts. See also Lin et al. (2), Kamman & Wand (23), and Banerjee, Carlin & Gelfand (24). Essentially, we replace model (1) with the assumption that $E[y_i(s)]$ is linear on a transformed scale, that is, $g(E[y_i(s)]) = x_i(s)^{\top} \beta_i + w_i(s)$, where $g(\cdot)$ is a suitable link function and $x_i(s)$ is the $p_i \times 1$ vector that includes outcome- and location-specific covariates.
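The stacking of misaligned outcomes described above can be sketched in base R. This is a minimal illustration under stated assumptions: the three outcomes, their sizes, and the names y.list, X.list and X are hypothetical, and arranging the design matrices block-diagonally is one convenient way to write the stacked model, not necessarily how the package stores it.

```r
# Illustrative only: three hypothetical outcomes observed at different
# numbers of locations (names and sizes are made up for this sketch).
set.seed(1)
n <- c(5, 3, 4)   # n_1, n_2, n_3: locations per outcome
p <- c(2, 1, 3)   # p_1, p_2, p_3: covariates per outcome (incl. intercept)

# One response vector and one design matrix per outcome.
y.list <- lapply(n, rnorm)
X.list <- list(cbind(1, rnorm(5)),
               matrix(1, 3, 1),
               cbind(1, rnorm(4), rnorm(4)))

# Stack the outcome vectors into a single N x 1 vector, N = sum(n).
y <- unlist(y.list)
N <- length(y)

# Assumed layout: block-diagonal design so that row block i only
# involves the regression coefficients of outcome i.
X <- matrix(0, N, sum(p))
row.end <- cumsum(n)
col.end <- cumsum(p)
for (i in seq_along(n)) {
  rows <- (row.end[i] - n[i] + 1):row.end[i]
  cols <- (col.end[i] - p[i] + 1):col.end[i]
  X[rows, cols] <- X.list[[i]]
}
```

With this layout, the stacked coefficient vector has length sum(p), and the off-diagonal blocks of X are zero because each outcome has its own covariates and slopes.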
Spatial association is captured by the spatial effects, that is, the $w_i$'s in (1). Any two entries in $w_i$ correspond to the spatial random effects for outcome $i$ from two different locations. These are assumed to be associated or correlated based upon a function of the separation or distance between the two locations. The essence of multivariate spatial modelling is to prescribe these covariances in such a way that the joint distribution of the $w_i$'s, for $i = 1, 2, \ldots, m$, in (1) is a multivariate normal distribution. The key modelling ingredient here is a multivariate spatial process, see, for example, Chiles & Delfiner (999), Cressie & Wikle (2), and Banerjee, Carlin & Gelfand (24). In our context, the multivariate spatial process is an infinite collection of $m \times 1$ vectors $w(s)$ indexed by spatial coordinates $s$ residing in two or three dimensional Euclidean space. The spatial random effects arise as a finite subset of this set indexed by the locations where the outcomes have been observed. A spatial process is well-defined whenever any finite collection of random effects has a legitimate probability distribution. When these distributions always belong to a multivariate normal family, we say that the spatial process is a Gaussian process.

In (1), each $w_i$ is an $n_i \times 1$ vector of spatial random effects collected over the locations where outcome $i$ has been observed. The covariance among the outcomes' spatial random effects provides learning about missing observations. The details on constructing and estimating the covariance among spatial random effects are given in Appendix S1. In brief, say we wish to model the covariance between spatial random effects corresponding to two different outcomes at two different locations. That is, for outcomes $i$ and $j$, and locations $s_k$ and $s_l$, we must specify $\mathrm{cov}\{w_i(s_k), w_j(s_l)\}$ in a manner that will ensure a legitimate probability distribution for the joint distribution of $\{w_i : i = 1, 2, \ldots, m\}$. This covariance is specified using a spatial cross-covariance function that is constructed using outcome-specific spatial correlation functions which include parameters to control the random effects' spatial dependence, for example rate of spatial decay. Given parameter estimates, the cross-covariance functions provide inference about how outcomes covary in space, after accounting for covariates, and inform prediction.

We adopt the Bayesian paradigm for inference, see, for example, Gelman et al. (24), and build hierarchical models by modelling the parameters using probability distributions.
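One common way to obtain a legitimate cross-covariance function, and the one named in the key-words, is the linear model of coregionalization. The base R sketch below is a hedged illustration of that idea for two outcomes. It assumes an exponential correlation function and made-up values for the matrix A, the decay parameters phi and the coordinates; the construction actually used by the software is described in the appendix and may differ in detail.

```r
# Illustrative sketch: cross-covariance for m = 2 outcomes under a linear
# model of coregionalization, w(s) = A v(s), with independent unit-variance
# latent processes v_r(s) having exponential correlation exp(-phi_r * d).
# A, phi and the coordinates are made-up values, not estimates from the paper.
set.seed(42)
coords <- cbind(runif(6), runif(6))      # 6 locations in the unit square
D <- as.matrix(dist(coords))             # pairwise distances
A <- matrix(c(1, 0.8, 0, 0.6), 2, 2)     # lower-triangular, K = A A'
phi <- c(6, 3)                           # decay parameters of v_1 and v_2

# cov{w_i(s_k), w_j(s_l)} = sum_r A[i, r] * A[j, r] * exp(-phi[r] * d_kl)
cross.cov <- function(i, j) {
  A[i, 1] * A[j, 1] * exp(-phi[1] * D) +
    A[i, 2] * A[j, 2] * exp(-phi[2] * D)
}

# Joint covariance of (w_1, w_2) over all locations, ordered by outcome.
Sigma <- rbind(cbind(cross.cov(1, 1), cross.cov(1, 2)),
               cbind(cross.cov(2, 1), cross.cov(2, 2)))

K <- A %*% t(A)                          # covariance at a single location
rho.12 <- cov2cor(K)[1, 2]               # within-location cross-correlation
min.eig <- min(eigen(Sigma, symmetric = TRUE)$values)
```

Because Sigma is a sum of positive semi-definite terms built from valid correlation matrices, its smallest eigenvalue (min.eig) is positive here, which is what "legitimate probability distribution" requires of the joint covariance.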
Inference about the regression slopes, the spatial random effects, and the variances and covariances is based on Markov chain Monte Carlo (MCMC) sampling from posterior distributions.

As noted in the Introduction, a primary aim of our analysis is interpolation and prediction. Following terminology used in Banerjee & Gelfand (22), when we estimate the value of an outcome at a location where some of the other outcomes have been observed, we call it interpolation. When we seek to estimate the value of an outcome at a new location, where none of the outcomes have been observed, we call it prediction. In sampling-based Bayesian inference, we draw samples from the posterior predictive distributions of the outcome variable at unobserved locations given the observed data. The posterior predictive distribution is in fact the posterior distribution of $y_i(s_0)$ given $y$, where $s_0$ is the location we want to interpolate or predict. Additional details can be found in Appendix S1.

Software implementation

The models described in the preceding section are available in the spBayes (version .3-) R package spMisalignLM and spMisalignGLM functions for Gaussian and non-Gaussian outcomes, respectively. These functions are written in C++ and leverage R's foreign language interface to call FORTRAN BLAS (Basic Linear Algebra Subprograms, see Blackford et al. 22) and LAPACK (Linear Algebra Package, see Anderson et al. 999) libraries for efficient matrix computations. A heavy reliance on BLAS and LAPACK functions allows the software to leverage multiprocessor/core machines via threaded implementations of BLAS and LAPACK, for example Intel's Math Kernel Library (MKL; http://software.intel.com/en-us/intel-mkl). Use of MKL, or similar threaded libraries, can dramatically reduce sampler run-times. For example, the illustrative analyses offered in subsequent sections were conducted using R, and hence spBayes, compiled with MKL on an Intel Ivy Bridge i7 quad-core processor with hyperthreading.
The use of these parallel matrix operations results in a near linear speedup in the MCMC sampler's run-time with the number of CPUs. In addition to Appendix S1, Finley, Banerjee & Gelfand (23) provide specifics on efficient implementation of the multivariate Gaussian process parameter estimation.

Illustrative analyses

SYNTHETIC DATA

We consider a synthetic data set comprising three outcome variables observed over unique and common locations within a unit square domain. The analysis of these data demonstrates how the strength of correlation between the outcomes' spatial random effects and the range of spatial dependence influence the accuracy and precision of prediction and interpolation. The R code to reproduce this and subsequent analyses is available in Finley, Banerjee & Cook (24).

Following model (1) and using the true parameter values given in the first column of Table 1, we generated outcomes at all locations in Fig. 1(a). These outcomes are shown in Fig. 1(b-d). Outcome observations were then subsampled to create misalignment following the design in Fig. 1(a). Here, each circle contains those locations where the given outcome, identified by the circle's number, is observed. Regions where the circles overlap identify those locations where multiple outcomes were observed. The true spatial cross-covariances used to generate the data can be converted to cross-correlations to facilitate interpretation. These correlations are provided in Table 1 and also displayed in their respective regions of overlap in Fig. 1(a).

Fig. 1. (a) Locations of observed and unobserved outcome variables. Data associated with each outcome are observed within its respective circle, indicated by numbers 1, 2 and 3. Intersecting regions contain locations where two or more outcomes are observed. Surfaces for outcomes 1, 2 and 3 are given in (b), (c) and (d), respectively.

Table 1. Parameter values used to generate the synthetic data in the column labelled True, along with spMisalignLM estimated parameter posterior distribution 50 (2.5, 97.5) percentiles. The $\beta_{\cdot,0}$ correspond to the outcomes' regression intercepts, $\rho$ is the cross-correlation between the outcomes' spatial random effects, $\phi$ is the spatial cross-correlation decay parameter, and $\Psi$ is the non-spatial residual variance associated with each outcome. Subscripts indicate the associated outcome variable.

Parameter         True / Estimate
$\beta_{1,0}$     (67, 57)
$\beta_{2,0}$     5 496 (333, 639)
$\beta_{3,0}$     9 (99, 29)
$\rho_{1,2}$      6 ( 97, 66)
$\rho_{1,3}$      9 7 (63, 97)
$\rho_{2,3}$      4 62 ( 7, 7)
$\phi_1$          6 73 (439, 42)
$\phi_2$          6 223 (472, 2947)
$\phi_3$          6 297 (534, 252)
$\Psi_1$          5 (2, 2)
$\Psi_2$          4 (, 2)
$\Psi_3$          5 (2, 27)

Given spatially misaligned data, the spMisalignLM function called in the R code below generates posterior samples from the parameters of the posited model. This function takes each outcome's symbolic regression model and the locations where data are observed. Additionally, parameter starting values, prior distributions, MCMC Metropolis algorithm proposal distribution variances, the spatial correlation function and the number of desired MCMC samples are also passed to the spMisalignLM function. A full explanation of argument syntax and output is available in the function's manual available via CRAN.
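Posterior samples of the kind produced by an MCMC sampler are typically reduced to a median and a 95% credible interval per parameter. The base R sketch below shows one way to do this. The sample matrix, the parameter names and the burn-in length are hypothetical stand-ins, not output from the package, and the paper itself reports using the coda package for this step.

```r
# Hypothetical posterior samples: rows are MCMC iterations, columns are
# parameters. These are simulated stand-ins, not output from spBayes.
set.seed(3)
samples <- cbind(beta.1.0 = rnorm(5000, mean = 1, sd = 0.2),
                 phi.1    = runif(5000, min = 3, max = 30))

# Discard an assumed burn-in period before summarizing.
burn.in <- 1000
post <- samples[(burn.in + 1):nrow(samples), ]

# Median and 95% credible interval bounds for each parameter.
summ <- t(apply(post, 2, quantile, probs = c(0.5, 0.025, 0.975)))
round(summ, 2)
```

Each row of summ holds the 50, 2.5 and 97.5 percentiles for one parameter, which is the layout used in the parameter tables.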

The resulting MCMC samples were summarized using functions in the coda package and displayed in Table 1. Here, we can see that the parameters' estimated 95% credible intervals include the true parameter values. As we will see in the subsequent data analysis, Penobscot Experimental Forest LiDAR and biomass data, the parameter estimates associated with the spatial random effects' cross-correlations can be used to explore hypotheses about association after accounting for the impact of covariates. Also, one can look to the spatial cross-correlation decay parameters to make inference about the geographical range of dependence among observations.

Given the spMisalignLM object m.miss and spatial coordinates with associated covariates, one can interpolate and predict using the spPredict function. In the code below, spPredict is used to generate posterior predictive samples for all three outcomes at all locations in Fig. 1(a). Figures 2 and 3 show the median and dispersion of the resulting posterior predictive distributions. The interpolated and predicted outcomes shown in Fig. 2(a-c) closely approximate the observed data in Fig. 1(b-d).

Fig. 2. Misalignment model posterior predictive distribution median surfaces for outcomes 1, 2 and 3 in (a), (b) and (c), respectively.

Fig. 3. Misalignment model posterior predictive distribution uncertainty surfaces for outcomes 1, 2 and 3 in (a), (b) and (c), respectively.

Table 2. Univariate spatial regression model prediction and misalignment model interpolation performance. Performance metrics are (i) root mean squared error (RMSE) between the observed and predicted or interpolated outcomes; (ii) mean width between the lower and upper 95% posterior predictive distribution credible intervals (CI width); and (iii) the percentage of observations covered by their respective 95% credible interval (CI cover).

Columns: univariate outcome 1, 2, 3; misalignment outcome 1, 2, 3
RMSE:      34 6 6 22 5 2
CI width:  79 63 679 9 422 53
CI cover:  9459 9369 9652 9369 92

We summarize the prediction uncertainty using the width between the lower and upper 95% posterior predictive credible intervals, given in Fig. 3(a), (b) and (c) for outcomes 1, 2 and 3, respectively. These surfaces show that stronger cross-correlation between outcomes results in more precise interpolation. For example, the spatial random effects associated with outcome 1 are strongly correlated with those of outcomes 2 and 3, that is, estimated cross-correlation of 6 and 7, respectively. As a result, Fig. 3(a) shows greater precision in interpolation of outcomes 2 and 3 when outcome 1 is observed (notice the lighter colours in circles 2 and 3). In contrast, when the cross-correlation is weak, there is less information available to inform interpolation. For example, a cross-correlation of 62 between outcomes 2 and 3 results in only marginal improvement in interpolation precision in either Fig. 3(b) or (c).
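The three performance metrics used to compare models (RMSE, mean 95% credible interval width, and interval coverage) can be computed directly from a matrix of posterior predictive samples. The sketch below is a minimal base R illustration on simulated stand-in data. The object names, the sizes and the use of the posterior median as the point prediction are assumptions made for this example.

```r
# Illustrative sketch of the three performance metrics, computed from
# simulated stand-ins for posterior predictive samples (not the paper's data).
set.seed(7)
n.loc <- 200                                   # held-out locations
n.samp <- 1000                                 # posterior predictive draws
mu <- rnorm(n.loc, mean = 10, sd = 2)          # latent mean at each location
y.obs <- mu + rnorm(n.loc)                     # held-out observations

# Rows are locations, columns are posterior predictive draws.
y.rep <- matrix(rnorm(n.loc * n.samp, mean = mu, sd = 1), n.loc, n.samp)

# Per-location 2.5, 50 and 97.5 percentiles (a 3 x n.loc matrix).
q <- apply(y.rep, 1, quantile, probs = c(0.025, 0.5, 0.975))

rmse <- sqrt(mean((y.obs - q[2, ])^2))         # median as point prediction
ci.width <- mean(q[3, ] - q[1, ])              # mean 95% interval width
ci.cover <- 100 * mean(y.obs >= q[1, ] & y.obs <= q[3, ])  # percent covered
```

A well-calibrated model should give ci.cover close to 95, while lower rmse and smaller ci.width indicate more accurate and more precise predictions, respectively.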
Fig. 4. Penobscot Experimental Forest LiDAR and sample plot data extent and locations (a). Misalignment model posterior distribution median (black point symbol) and 95% credible intervals for Gp95 and BIO in (b) and (c), respectively. Panel (a) legend: PEF study area; LVIS Lp25 and Lp95 extent; G-LiHT Gp95 observations; sample plot BIO observations; axes in km. Panel (b) axes: fitted Gp95 versus observed Gp95. Panel (c) axes: fitted BIO versus observed BIO.

We are in a prediction setting when none of the outcomes are observed at a given location. In Fig. 1(a), prediction occurs for all locations outside of the three circles. In the absence of covariates, prediction is only informed by proximate observed locations. The stronger the spatial dependence, the more information for prediction is gleaned from observed locations. For example, the spatial decay point estimate for outcome 1 is 73, which corresponds to an effective spatial range of 3 domain distance units (where we define effective spatial range as the distance at which the spatial correlation drops to 5). The result of this relatively long spatial range is that predictions made just outside of circle 1 show more precise posterior predictive intervals (notice the halo around the circle).

To assess the usefulness of estimating the covariance among the outcomes' spatial random effects for interpolation, we compare the misalignment model results to predictions generated by outcome-specific univariate spatial regressions. The univariate models are equivalent to model (1) but assume there is no covariance among the outcomes' random effects. These univariate models can be fit using the spLM function in spBayes.

Table 3. Estimated parameter posterior distribution 50 (2.5, 97.5) percentiles for the Penobscot Experimental Forest misalignment model. The $\beta_{\cdot,0}$ correspond to the outcomes' regression intercepts, $\rho_{Gp95,BIO}$ is the cross-correlation between the outcomes' spatial random effects, and $\Psi$ is the non-spatial residual variance associated with each outcome. Estimates for $\phi_{Gp95}$ and $\phi_{BIO}$ have been transformed to their respective effective spatial range in km.

Parameter               Estimate
$\beta_{Gp95,0}$        6 (245, 99)
$\beta_{Gp95,Lp25}$     45 (2, 6)
$\beta_{Gp95,Lp95}$     4 ( 3, 5)
$\beta_{BIO,0}$         93 (95, 53)
$\Psi_{Gp95}$           75 (2, 9)
$\Psi_{BIO}$            4 (26, 9)
$\rho_{Gp95,BIO}$       36 (6, 9)
Gp95 eff. range (km)    299 (3, 46)
BIO eff. range (km)     49 (7, 54)
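The effective spatial range, the distance at which the spatial correlation drops to a small threshold, has a closed form for simple correlation functions. The sketch below assumes an exponential correlation function exp(-phi * d) and the conventional threshold of 0.05. Both are assumptions made for illustration, as are the example decay values and the helper name eff.range, and other correlation functions would need a different formula.

```r
# Effective range for an assumed exponential correlation exp(-phi * d):
# solve exp(-phi * d) = threshold for d, giving d = -log(threshold) / phi.
eff.range <- function(phi, threshold = 0.05) {
  -log(threshold) / phi
}

phi <- c(6, 12, 24)          # made-up decay parameters
r <- eff.range(phi)          # larger phi gives a shorter effective range

# Check: the correlation evaluated at the effective range equals the threshold.
corr.at.range <- exp(-phi * r)
```

Since -log(0.05) is approximately 3, the effective range under these assumptions is roughly 3 / phi, which is a handy rule of thumb when choosing prior bounds for a decay parameter from a plausible range of distances.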
Summaries of prediction performance are given in Table 2 and show the misalignment model improves prediction accuracy and precision for each outcome, as reflected by lower RMSE and narrower 95% credible intervals compared to those of the univariate model. PENOBSCOT EXPERIMENTAL FOREST LIDAR AND BIOMASS DATA This illustrative analysis considers data from a 6-ha area within the US Forest Service Penobscot Experimental Forest (PEF; http://www.fs.fed.us/ne/durham/455/penobsco.htm), ME, USA. The PEF has been studied extensively beginning in the 95s and is under active forest management as part of several long-term silvicultural experiments. A variety of forest variables are recorded on over 6 permanent georeferenced sample plots across the PEF. Light Detection and Ranging (LiDAR) data from the National Aeronautics and Space Administration (NASA) airborne Laser Vegetation Imaging Sensor (LVIS; http://lvis.gsfc.nasa.gov), and LiDAR, hyperspectral and thermal (G-LiHT; Cook et al. 23) sensors are also available for the PEF. The objectives of this illustrative analysis are to produce predictive maps, with associated uncertainty, of (i) forest canopy height metrics from sparsely sampled LiDAR, for example, G- LiHT, and (ii) forest variables measured at forest sample plots. For brevity, we consider only a subset of the available PEF data. The location and extent of these data are show in Fig. 4(a) and include: (a) 2 (b) 6 (km) 5 5 5 (km) 5 4 2 6 5 2 5 2 4 (km) (km) (c) 5 (d) 4 2 (km) 5 5 2 (km) 5 (km) 5 5 2 (km) 6 4 2 Fig. 5. Penobscot Experimental Forest misalignment model posterior predictive distribution summary surfaces. Posterior median for Gp95 and BIO given in (a) and (b), respectively. Range between the lower and upper 95% credible intervals for Gp95 and BIO given in (c) and (d), respectively. 24 The Authors. Methods in Ecology and Evolution 24 British Ecological Society, Methods in Ecology and Evolution, 5, 54 523

- forest canopy height 25th and 95th percentiles, labelled Lp25 and Lp95, respectively, measured in 2003 using LVIS at a 25-m-diameter footprint across the extent of the study area;
- forest canopy height 95th percentile, labelled Gp95, measured in 2012 using the G-LiHT sensor at a 25-m-diameter footprint along a single transect across the study area;
- metric tons of live above-ground tree biomass per ha, BIO, estimated at each of the 7 permanent sample plots between 2 and 22.

Here, we are interested in predicting both Gp95 and BIO at a fine spatial resolution across the study area. We expect a positive relationship between Gp95, which is a proxy for canopy height, and BIO. Further, although the forest structure has changed since 2003 due to timber harvesting, the complete-coverage LVIS Lp25 and Lp95 variables might explain some variability in the more current G-LiHT Gp95, and we therefore use these metrics as covariates in the subsequent regression. This model is specified in the accompanying analysis code, along with parameter starting values, prior distributions, MCMC algorithm specifics and the spatial correlation function. Although not shown, variogram analysis of univariate non-spatial model residuals and other exploratory data analysis tools can help guide the choice of prior distributions and associated hyperparameters for the spatial and non-spatial covariances. Again, a full explanation of argument syntax and output is given in the function's manual page, available via CRAN. The resulting MCMC samples were summarized using functions in the coda package and are displayed in Table 3. Here, we see that the LVIS Lp25 covariate explains a substantial portion of the variability in G-LiHT Gp95, that is, the 95% credible interval of β Gp95,Lp25 does not include zero.
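The kind of posterior summary reported in Table 3, a median with a 95% credible interval and a check on whether that interval includes zero, can be sketched in Python as follows. This is a minimal stand-in and not the coda functions used for the analysis; the helper names, the burn-in length and the simulated draws are all hypothetical.

```python
import random


def percentile(sorted_draws, p):
    """Linear-interpolation percentile (p in [0, 100]) of pre-sorted draws."""
    if not sorted_draws:
        raise ValueError("need at least one draw")
    pos = (len(sorted_draws) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    frac = pos - lo
    return sorted_draws[lo] * (1.0 - frac) + sorted_draws[hi] * frac


def summarize(draws, burn_in=0):
    """Median and equal-tail 95% interval of the draws kept after burn-in."""
    kept = sorted(draws[burn_in:])
    lower = percentile(kept, 2.5)
    median = percentile(kept, 50.0)
    upper = percentile(kept, 97.5)
    return {
        "median": median,
        "lower": lower,
        "upper": upper,
        # True when the whole interval lies on one side of zero.
        "excludes_zero": lower > 0 or upper < 0,
    }


# Simulated draws standing in for one regression coefficient's MCMC chain.
rng = random.Random(1)
chain = [rng.gauss(0.4, 0.1) for _ in range(5000)]
print(summarize(chain, burn_in=1000))
```

An interval that excludes zero corresponds to the statement made for the Lp25 coefficient; an interval that straddles zero corresponds to the weaker evidence reported for Lp95.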
Given timber harvesting activity in the study area over the 9 years between the LiDAR measurements, the lack of relationship between the sensors' 95th canopy height percentiles is not too surprising. The long effective spatial ranges estimated for Gp95 and BIO suggest there is substantial spatial structure among the residuals. The effective spatial ranges are calculated using the parameter estimates of the cross-covariance and spatial correlation functions; see Finley, Banerjee & Cook (2014) and Gelfand et al. (2004, p. 292). Further, the spatial random effects of Gp95 and BIO are moderately correlated, ρ Gp95,BIO 36. Estimating this cross-correlation is useful for exploring hypotheses about the strength and direction of association among the outcomes' residual spatial structure, perhaps after accounting for some covariates. In this analysis, we could say there is a positive and significant correlation (that is, the credible interval does not include zero) between the residual spatial structure of Gp95 and BIO.

Given the spMisalignLM object m.miss and spatial coordinates with associated covariates, one can interpolate and predict using the spPredict function. In the supplemental analysis code (Finley, Banerjee & Cook 2014), spPredict is used to generate posterior predictive samples for Gp95 and BIO at all 226 locations where Lp25 and Lp95 were observed. Surfaces of the median of the resulting posterior predictive distributions, and of the width between their lower and upper 95% credible interval bounds, are given for Gp95 and BIO in Fig. 5. The posterior predictive medians shown in Fig. 5(a) and (b) closely approximate the observed data; see, for example, model fitted versus observed values in Fig. 4(b) and (c). However, more pertinent to this illustration, Fig. 5(c) and (d)

show narrowing of the posterior predictive distribution at and near locations of interpolation for the respective outcome. For example, the narrowing of the posterior predictive distributions for predicted Gp95 at and near observed BIO locations is clearly seen in Fig. 5(c). Similarly, Fig. 5(d) shows that the posterior predictive distributions for BIO narrow within and adjacent to the G-LiHT transect where Gp95 is observed.

Discussion and summary

The multivariate model should yield improved predictive inference, over univariate models, in settings where there is moderate-to-strong covariance among the outcomes' spatial random effects and where the spatial range of dependence is sufficiently long to allow observations to contribute information across locations. The development in the section Multivariate spatial regression with misalignment, and the subsequent analyses, assume a constant covariance among outcomes over the domain. This assumption might be reasonable in many settings. However, a more flexible model would pursue a non-stationary formulation of the cross-covariance matrix; see, for example, Guhaniyogi et al. (2013). Such non-stationary cross-covariance models could improve inference about changing patterns in the strength and direction of the correlation between outcomes at broad spatial scales. In addition to improving prediction and interpolation in some settings, the multivariate misalignment model could be useful in designing efficient monitoring efforts. For example, if one had an a priori estimate of the covariance among outcomes, or could learn about this covariance through an initial sampling effort, then resources could be directed to an appropriate level of sampling for subsets of the outcomes. This represents a very active area of work that builds upon a rich literature on sampling designs for spatiotemporal environmental data; see, for example, Mateu & Müller (2013).
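The opening claim of this discussion, that gains over univariate models depend on both the cross-correlation and the spatial range, can be illustrated with a deliberately simplified Python calculation. It assumes two Gaussian random effects whose cross-correlation at distance d is rho * exp(-phi * d); this separable exponential form is an assumption made only for the sketch and is not the cross-covariance model fitted by spMisalignLM. The function name and all numeric values are hypothetical.

```python
import math


def remaining_variance_fraction(rho, phi, distance):
    """Share of one effect's variance left after conditioning on the other.

    For two jointly Gaussian variables with correlation c, the conditional
    variance is (1 - c**2) times the marginal variance. Here c is ASSUMED
    to be rho * exp(-phi * distance).
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    if phi <= 0 or distance < 0:
        raise ValueError("phi must be positive and distance non-negative")
    cross_corr = rho * math.exp(-phi * distance)
    return 1.0 - cross_corr ** 2


# Hypothetical settings: weak vs strong cross-correlation, short vs long range.
for rho in (0.2, 0.8):
    for phi in (3.0, 0.3):
        frac = remaining_variance_fraction(rho, phi, distance=1.0)
        print(f"rho={rho}, phi={phi}: remaining variance fraction {frac:.3f}")
```

Under these assumptions the variance reduction is largest when rho is large in magnitude and phi is small (a long range), and it vanishes as either the cross-correlation or the spatial dependence weakens.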
Further development of the multivariate misalignment model for inference about spatiotemporal processes is a logical next step and would likely find application for exploring complex and dynamic ecological processes.

Acknowledgements

This work was supported by National Science Foundation Grants DMS- 669, EF-3739, EF-2474 and EF-253225, as well as NASA Carbon Monitoring System grants.

Data accessibility

Data deposited in the Dryad repository: http://datadryad.org/resource/doi:. 56dryad.3g9s2

References

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., et al. (999) LAPACK Users' Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA. ISBN -97-447-.
Banerjee, S. & Gelfand, A.E. (22) Prediction, interpolation and regression for spatially misaligned data sets. Sankhya Series A, 64, 227 245.
Banerjee, S., Carlin, B.P. & Gelfand, A.E. (24) Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall/CRC Press, Boca Raton, FL.
Baribault, T., Kobe, R.K. & Finley, A.O. (22) Tropical tree growth is correlated with soil phosphorus, potassium, and calcium, though not for legumes. Ecological Monographs, 2, 9 23.
Blackford, S.L., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., et al. (22) An Updated Set of Basic Linear Algebra Subprograms (BLAS). Transactions on Mathematical Software, 2, 35 5.
Buonaccorsi, J.P. (29) Measurement Error: Models, Methods and Applications. Chapman & Hall/CRC, Boca Raton, FL.
Chiles, J.P. & Delfiner, P. (999) Geostatistics: Modelling Spatial Uncertainty. Wiley, New York.
Cook, B.D., Corp, L.W., Nelson, R.F., Middleton, E.M., Morton, D.C., McCorkel, J.T., et al. (23) NASA Goddard's Lidar, Hyperspectral and Thermal (G-LiHT) airborne imager. Remote Sensing, 5, 445 466.
Cressie, N.A.C. & Wikle, C.K. (2) Statistics for Spatio-Temporal Data. Wiley, New York.
Diggle, P.J., Tawn, J.A. & Moyeed, R.A. (99) Model-based geostatistics (with discussion).
Journal of the Royal Statistical Society, Series C (Applied Statistics), 47, 299 35.
Finley, A.O., Banerjee, S. & Gelfand, A.E. (23) spBayes for large univariate and multivariate point-referenced spatio-temporal data models. arXiv:3. 92[stat.CO].
Finley, A.O., Banerjee, S. & Cook, B.D. (24) Data from: Bayesian hierarchical models for spatially misaligned data in R. Methods in Ecology and Evolution. doi:.56/dryad.3g9s2
Gelfand, A.E., Zhu, L. & Carlin, B.P. (2) On the change of support problem for spatio-temporal data. Biostatistics, 2, 3 45.
Gelfand, A.E., Schmidt, A.M., Banerjee, S. & Sirmans, C.F. (24) Nonstationary multivariate process modelling through spatially varying coregionalization (with discussion). TEST, 3, 263 32.
Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. (24) Bayesian Data Analysis, 2nd edn. Chapman and Hall/CRC Press, Boca Raton, FL.
Gotway, C.A. & Young, L.J. (22) Combining incompatible spatial data. Journal of the American Statistical Association, 97, 632 64.
Gryparis, A., Paciorek, C.J., Zeka, A., Schwartz, J. & Coull, B.A. (29) Measurement error caused by spatial misalignment in environmental epidemiology. Biostatistics,, 25 274.
Guhaniyogi, R., Finley, A.O., Banerjee, S. & Kobe, R.K. (23) Modeling complex spatial dependencies: low-rank spatially-varying cross-covariances with application to soil nutrient data. Journal of Agricultural, Biological, and Environmental Statistics,, 274 29.
Illian, J.B., Sorbye, S.H. & Rue, H. (22) A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA). The Annals of Applied Statistics, 6, 499 53.
Kamman, E.E. & Wand, M.P. (23) Geoadditive models. Applied Statistics, 52,.
Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R. & Klein, B. (2) Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Annals of Statistics, 2, 57 6.
Lopiano, K.K., Young, L.J. & Gotway, C.A.
(2) A comparison of errors in variables methods for use in regression models with spatially misaligned data. Statistical Methods in Medical Research, 2, 29 47.
Lopiano, K.K., Young, L.J. & Gotway, C.A. (23) Estimated generalized least squares in spatially misaligned regression models with Berkson error. Biostatistics, 4, 737 75.
Madsen, L., Ruppert, D. & Altman, N.S. (2) Regression with spatially misaligned data. Environmetrics, 9, 453 467.
Mateu, J. & Müller, W.G. (23) Spatio-Temporal Design: Advances in Efficient Data Acquisition. John Wiley & Sons, Ltd., West Sussex.
Mugglin, A.S., Carlin, B.P. & Gelfand, A.E. (2) Fully model-based approaches for spatially misaligned data. Journal of the American Statistical Association, 95, 77 7.
Ovaskainen, O., Hottola, J. & Siitonen, J. (2) Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 2, 254 252.
Paciorek, C.J., Yanosky, J.D., Puett, R.C., Laden, F. & Suh, H.H. (29) Practical large-scale spatio-temporal modeling of particulate matter concentrations. The Annals of Applied Statistics, 3, 37 397.
Ren, Q. & Banerjee, S. (23) Hierarchical factor models for large spatially misaligned data: a low-rank predictive process approach. Biometrics, 69, 9 3.
Swope, S.M. & Parker, I.M. (22) Complex interactions among biocontrol agents, pollinators, and an invasive weed: a structural equation modeling approach. Ecology, 22, 222 234.

Szpiro, A.A., Sheppard, L. & Lumley, T. (2) Efficient measurement error correction with spatially misaligned data. Biostatistics, 2, 6 623.
Zhu, L., Carlin, B.P. & Gelfand, A.E. (23) Hierarchical regression with misaligned spatial data: relating ambient ozone and pediatric asthma ER visits in Atlanta. Environmetrics, 4, 537 557.

Received 5 November 2013; accepted 26 February 2014
Handling Editor: Bob O'Hara

Supporting Information

Additional Supporting Information may be found in the online version of this article.

Appendix S. Misalignment model specification.

© 2014 The Authors. Methods in Ecology and Evolution © 2014 British Ecological Society, Methods in Ecology and Evolution, 5, 514-523