Spatial Inference of Nitrate Concentrations in Groundwater


Dawn Woodard
Operations Research & Information Engineering, Cornell University

Joint work with Robert Wolpert, Duke Univ. Dept. of Statistical Science and School of the Environment, and Michael O'Connell, Waratah Corporation, Durham, NC

Outline
1. Nitrates in Groundwater
2. The Data
3. Existing Approaches for Pollutant Estimation
4. Bayesian Moving-Average Models
5. Results
6. Conclusions and Future Work

Nitrates in Groundwater
- High levels of nitrates in groundwater can cause health and environmental problems.
- Nitrate contamination in groundwater can be due to agricultural fertilization, septic systems, etc.

Nitrates in Groundwater
Measurements of nitrates in groundwater have been obtained over the Mid-Atlantic states [Ator 1998].
[Figure: map of well measurements; legend: > 8.3 mg/l, mid range, < 0.75 mg/l]

Nitrate Estimation
We desire geographic interpolation of nitrate levels. Distinct regulatory goals require inference at distinct geographic scales:
- fine scale,
- regulatory units (e.g. counties),
- hydrologic units (e.g. watersheds),
as well as distinct risk measures:
- average nitrate concentration,
- probability of exceeding a threshold, averaged by region,
- maximum nitrate concentration occurring in each region.

Nitrate Estimation
1. We wish to perform inference for multiple scales and risk measures without refitting the model or resorting to ad hoc aggregation.
2. We need the uncertainty associated with all estimates of risk measures.
3. We desire a nonparametric approach.

Model Summary
- We use a nonparametric spatial statistical model for nitrate concentrations at all locations.
- Bayesian approach: the unknown nitrate concentration and its averages over various regions are all treated as random variables, for which we can compute expected values (best overall estimates) and probabilities of exceeding specified thresholds.

Data Summary
- Nitrate measurements from 929 wells in the Mid-Atlantic states
- Taken between 1985 and 1996

Existing Approaches for Pollutant Estimation
- Pollutant concentrations can be estimated separately for each region. This leads to unreliable estimates for regions with few measurements.
- When inference is desired at a single spatial partition (e.g. counties), lattice models can be used.

Existing Approaches for Pollutant Estimation
Kriging allows smooth spatial interpolation. It models the pollutant concentration Λ(x) at location x ∈ X as

$$\log \Lambda(x) = \sum_{j=1}^{J} X_j(x)\,\beta_j + Z(x),$$

where Z(x) is a mean-zero Gaussian process.

Existing Approaches for Pollutant Estimation
A kriged surface with only an intercept term β_0:
[Figure: map of the kriged nitrate surface; axes: longitude, latitude]
- The interpolated surface is reasonable.
- However, the confidence intervals are very wide in many locations, even where there is much data.
[Figure: two maps, "Lower Bound" and "Upper Bound", of the kriging confidence limits; axes: longitude, latitude]

Existing Approaches for Pollutant Estimation
- The Gaussian process model makes strong assumptions about the distribution of the nitrate concentration.
- The wide confidence intervals may be due to the data violating these assumptions.
- Let's look at a nonparametric alternative.

Moving-Average Models
Ickstadt and Wolpert (1997) and Wolpert and Ickstadt (1998) introduced methods for interpolating intensities of spatial point processes by modeling the intensity Λ(x) as a moving average of an underlying stochastic process.
The approach has been used in non-point-process applications:
- identifying proteins in mass spectroscopy [House, Clyde, and Wolpert 2006],
- inferring temporal fluctuations in sulfur dioxide pollution [Tu 2006].

Moving-Average Models
The concentration Λ(x) at location x ∈ X is modeled as

$$\Lambda(x) = \sum_{j=1}^{J} X_j(x)\,\beta_j + \sum_{m=1}^{M} k(x, s_m)\,\gamma_m$$

for k(x, s) a kernel function on X × S. The number M, locations s_m, and magnitudes γ_m of the mixture components are uncertain; so are the coefficients β_j.

Moving-Average Models
Interpretation of the spatial portion of the model, $\sum_m k(x, s_m)\,\gamma_m$, for pollutant level estimation:
- the pollutant surface is the sum of an unknown number of point sources with unknown locations and magnitudes,
- where the pollutant spreads out from each source in a manner consistent with the kernel k(·, ·).

Moving-Average Models
The ith measurement Y_i is assumed to have a log-normal distribution centered at Λ(x_i):

$$\log Y_i \sim \mathrm{Normal}\big(\log \Lambda(x_i),\, \sigma^2\big)$$

The measurement variance σ² is unknown.
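The measurement model above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `log_likelihood` and its arguments are hypothetical, the surface values Λ(x_i) are assumed to be supplied already evaluated at the well locations, and the density is taken on the log scale (the density of log Y_i, so no Jacobian term in y appears).

```python
import math

def log_likelihood(y, lam, sigma2):
    """Sum over i of the Normal(log(lam_i), sigma2) log-density at log(y_i).

    y      -- measured concentrations Y_i (positive numbers)
    lam    -- model concentrations Lambda(x_i) at the same wells (assumed given)
    sigma2 -- measurement variance sigma^2 on the log scale
    """
    if len(y) != len(lam):
        raise ValueError("y and lam must have the same length")
    total = 0.0
    for y_i, lam_i in zip(y, lam):
        resid = math.log(y_i) - math.log(lam_i)
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - resid * resid / (2.0 * sigma2)
    return total

# Two wells whose measurements match the surface exactly, with sigma^2 = 1.
print(log_likelihood([1.0, 2.0], [1.0, 2.0], 1.0))
```

Given the surface values, this evaluation is a single pass over the N measurements.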

Moving-Average Models
The kernel is specified as

$$k(x, s) = \exp\left\{ -\frac{1}{2d^2}\, \lVert x - s \rVert^2 \right\},$$

where d > 0 is a constant. This decreases smoothly as a function of distance from the center s.
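A minimal sketch of this kernel and of the spatial term Σ_m k(x, s_m) γ_m follows. The names `gaussian_kernel` and `surface` are hypothetical, locations are assumed to be plain coordinate tuples with Euclidean distance, and the source locations, magnitudes, and value of d in the example are made up for illustration.

```python
import math

def gaussian_kernel(x, s, d):
    """k(x, s) = exp(-||x - s||^2 / (2 d^2)) for points given as coordinate tuples."""
    sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
    return math.exp(-sq_dist / (2.0 * d * d))

def surface(x, sources, magnitudes, d):
    """Spatial term sum_m k(x, s_m) * gamma_m (no covariates)."""
    return sum(g * gaussian_kernel(x, s, d) for s, g in zip(sources, magnitudes))

# Two illustrative point sources; values are not from the nitrate analysis.
sources = [(0.0, 0.0), (3.0, 4.0)]
magnitudes = [2.0, 1.0]
print(surface((0.0, 0.0), sources, magnitudes, d=5.0))
```

Evaluating `surface` at N locations with M sources costs N × M kernel evaluations.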

Prior Specification
The parameters are σ, β, M, {s_m}, {γ_m}. For the nitrates analysis we do not include covariates, so β is not in the model. Following common practice we assign, for fixed α_σ > 0 and ρ_σ > 0,

$$\sigma^{2} \sim \mathrm{Gamma}(\alpha_\sigma, \rho_\sigma)$$

Prior Specification
The spatial term in the model can be rewritten as

$$\sum_{m=1}^{M} k(x, s_m)\,\gamma_m = \int_S k(x, s)\,\Gamma(ds),
\qquad\text{where}\qquad
\Gamma(ds) = \sum_{m=1}^{M} \gamma_m\, \delta_{s_m}(ds)$$

is a discrete measure on S.
- Under some reasonable assumptions (e.g. Γ(A) and Γ(B) are independent for disjoint sets A, B ⊂ S), our prior on Γ must be a Lévy random field.
- We use the well-known gamma random field on a bounded set S ⊂ ℝ^d.

Prior Specification
The gamma random field prior for Γ(ds) implies that, for fixed α, ρ, ε > 0:
- The number of mixture components M has a Poisson distribution,
  $$M \sim \mathrm{Poisson}\big(\alpha\,|S|\,E_1(\rho\epsilon)\big),$$
  where E_1 is the exponential integral function.
- Conditional on M, the locations s_m are independently uniformly distributed on S.
- The magnitudes γ_m are independently distributed with a truncated gamma distribution, having density
  $$f(\gamma) \propto \gamma^{-1} e^{-\rho\gamma}\, \mathbf{1}(\gamma > \epsilon).$$
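The three prior statements above translate directly into a sampler. The sketch below is an illustration under stated assumptions, not the authors' implementation: S is taken to be a rectangle, E_1 is computed from its power series (adequate only for modest arguments), the Poisson draw uses Knuth's multiplication method (suitable for small to moderate means), and the truncated magnitudes are drawn by rejection from a shifted exponential proposal. All function names and the parameter values in the example are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def exp1(x, terms=60):
    """Exponential integral E_1(x) by its power series; adequate for modest x > 0."""
    total = 0.0
    term = 1.0  # holds (-x)^k / k!
    for k in range(1, terms + 1):
        term *= -x / k
        total += term / k
    return -EULER_GAMMA - math.log(x) - total

def sample_poisson(mean, rng):
    """Knuth's multiplication method; fine for small-to-moderate means."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_magnitude(rho, eps, rng):
    """Draw from the density proportional to g^-1 exp(-rho g) on g > eps."""
    while True:
        g = eps + rng.expovariate(rho)  # proposal: exponential shifted to start at eps
        if rng.random() < eps / g:      # accept with probability eps / g <= 1
            return g

def sample_prior(alpha, rho, eps, bounds, rng):
    """One prior draw of (sources, magnitudes); bounds = ((x0, x1), (y0, y1))."""
    (x0, x1), (y0, y1) = bounds
    area = (x1 - x0) * (y1 - y0)  # |S| for a rectangular S (an assumption)
    m = sample_poisson(alpha * area * exp1(rho * eps), rng)
    sources = [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(m)]
    magnitudes = [sample_magnitude(rho, eps, rng) for _ in range(m)]
    return sources, magnitudes

# Illustrative constants only; the slides do not report the values used.
rng = random.Random(0)
sources, magnitudes = sample_prior(0.5, 1.0, 0.1, ((0.0, 10.0), (0.0, 5.0)), rng)
print(len(sources), min(magnitudes, default=None))
```

The rejection step works because the target density divided by the proposal density is proportional to 1/γ, which is largest at γ = ε; its acceptance rate drops as ρε becomes small, so a different sampler may be preferable in that regime.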

Prior Specification
The constants d, α, ρ, ε, α_σ, and ρ_σ are given reasonable values using expert knowledge and information in the data.

Prior Specification
These choices lead to prior surfaces Λ(x) like the one shown. There are some areas with high nitrate concentrations; these areas have random (unknown) locations a priori.
[Figure: map of one prior draw of the surface Λ(x); axes: longitude, latitude]

Nitrate Inferences
The posterior mean surface for the nitrate concentration:
[Figure: map of the posterior mean nitrate concentration; axes: longitude, latitude]
- There are hot spots in the Chesapeake region, southeast Pennsylvania, etc.
- There are low-nitrate areas in West Virginia, eastern North Carolina, etc.
- Areas with sparse or no data have expected concentration close to the prior mean of 4.4 mg/l.

Nitrate Inferences
The posterior standard deviation of the nitrate concentration:
[Figure: map of the posterior standard deviation; axes: longitude, latitude]
- This is a measure of estimation uncertainty.
- Areas with sparse or no data have high uncertainty.
- Most areas with numerous measurements have low uncertainty.
- The spotty quality of the figure is due to noise from the computational method, and could be eliminated with, e.g., parallelization.

Nitrate Inferences
Average nitrate concentrations over counties can be obtained:
[Figure: county map of average nitrate concentration; axes: longitude, latitude; legend: > 5 mg/l, mid range, < 1 mg/l]

Nitrate Inferences
So can the probability that the nitrate concentration exceeds the regulatory limit, averaged by county:
[Figure: county map of exceedance probability; axes: longitude, latitude; legend: > 8 %, mid range, < 2 %]
- This probability is low in most regions with a lot of data.
- This probability is equal to its prior value of 5% in regions with sparse or no data.
- Again there is a hot spot in the Chesapeake region.
- The green counties around the edge are due to edge effects of the model.
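Regional summaries of this kind can be computed from posterior draws of the surface without refitting the model. The sketch below assumes, hypothetically, that each posterior draw has already been evaluated on a set of grid points inside one region; the function name `region_summaries`, the toy draws, and the threshold value are illustrative and not taken from the analysis.

```python
def region_summaries(draws, threshold):
    """Posterior mean of the regional average, and region-averaged exceedance probability.

    draws[k][g] -- concentration at grid point g of the region in posterior draw k
    threshold   -- concentration limit to compare against (same units as draws)
    """
    n_draws = len(draws)
    n_grid = len(draws[0])
    # Regional average concentration in each draw, then averaged over draws.
    region_means = [sum(draw) / n_grid for draw in draws]
    post_mean = sum(region_means) / n_draws
    # P(concentration > threshold) at each grid point, averaged over the region.
    exceed = sum(1 for draw in draws for value in draw if value > threshold)
    avg_exceed_prob = exceed / (n_draws * n_grid)
    return post_mean, avg_exceed_prob

# Four toy draws at two grid points in one region (made-up numbers, mg/l).
toy_draws = [[2.0, 12.0], [4.0, 6.0], [11.0, 13.0], [3.0, 5.0]]
print(region_summaries(toy_draws, threshold=10.0))  # (7.0, 0.375)
```

Because the same draws serve any partition of the grid, counties and watersheds can be summarized from one model fit.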

Conclusions
- The Bayesian moving-average model allows inference of a variety of risk measures at a variety of spatial scales.
- Uncertainty measures are available for all these estimates.
- The model is nonparametric.
- It has a desirable interpretation in the context of pollutant level estimation.

Conclusions
The moving-average model has a computational advantage over kriging for large data sets:
- Likelihood evaluation for the moving-average model is O(NM), where N is the number of data points and M is the number of mixture components.
- Likelihood evaluation is O(N³) for kriging.
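As a rough illustration of the gap, take the N = 929 wells of this data set and a purely hypothetical M = 100 mixture components (the slides do not report M). Ignoring constant factors, the two operation counts are:

```latex
NM = 929 \times 100 = 92{,}900,
\qquad
N^3 = 929^3 = 801{,}765{,}089 \approx 8.0 \times 10^{8},
\qquad
\frac{N^3}{NM} = \frac{N^2}{M} \approx 8.6 \times 10^{3}.
```

The ratio N²/M grows quadratically in N for fixed M, so the advantage widens as more wells are added.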

Future Work
- Covariates such as climatic, geologic, and land use factors could be added to the nitrates analysis.
- The fixed kernels could be replaced with kernels that have priors on the scale and eccentricity. This would allow the model to capture, e.g., pollutant point sources that spread out more in one direction than another due to flow patterns.
- Additional risk measures could be computed, e.g. the percentage of the population exposed to nitrate levels above 10 mg/l.

References
- Ator, S. W. (1998). Nitrate and pesticide data for waters of the Mid-Atlantic region. USGS Open File Report 98-158, Reston, VA: U.S. Geological Survey.
- House, L. L., Clyde, M. A., and Wolpert, R. L. (2006). Nonparametric models for peak identification and quantification in mass spectroscopy, with application to MALDI-TOF. Discussion Paper 2006-24, Duke Univ. Dept. of Statistical Science.
- Ickstadt, K. and Wolpert, R. L. (1997). Multiresolution assessment of forest inhomogeneity, in Case Studies in Bayesian Statistics, Vol. III, NY: Springer-Verlag, pp. 371-386.
- Tu, C. (2006). Bayesian nonparametric modeling using Lévy process priors with applications for function estimation, time series modeling, and spatio-temporal modeling. PhD thesis, Duke Univ. Dept. of Statistical Science.
- Wolpert, R. L. and Ickstadt, K. (1998). Poisson/gamma random field models for spatial statistics. Biometrika, 85, 251-267.