Approximating likelihoods for large spatial data sets

By Michael Stein, Zhiyi Chi, and Leah Welty

Summary by Jessi Cisewski
April 28, 2009

1 Introduction

Given a set of spatial data, one often wishes to estimate its covariance structure. For practical purposes, it is usually necessary to propose a parametric model for the variogram; once the parametric model is selected, one must estimate its parameters. Given a parametric structure, maximum likelihood and Bayesian methods are natural choices for estimation. Nonparametric methods are also available, but the focus of Stein et al. [1] is on likelihood-based methods for estimating the parameters of the covariance structure of a Gaussian random field. The issue with likelihood-based methods is that when the observations are irregularly sited, the computational effort can be large: even for Gaussian random fields, each likelihood evaluation requires O(n^3) operations (Stein et al.).

Vecchia (1988) [2] proposed a method for approximating the likelihood of spatial data sets using the basic fact that any joint density can be written as a product of conditional densities. Let Z = (Z_1, ..., Z_n) have joint density p(z; \phi), where p is a generic density and \phi is an unknown vector-valued parameter. Partition Z into subvectors Z_1, ..., Z_b and write Z^{(j)} = (Z_1, ..., Z_j). Then

    p(z; \phi) = p(z_1; \phi) \prod_{j=2}^{b} p(z_j \mid z^{(j-1)}; \phi).    (1)

Vecchia (1988) replaced each conditioning vector Z^{(j)}, for j = 1, ..., b-1, by a subvector S^{(j)} of Z^{(j)}, giving the approximation

    p(z; \phi) \approx p(z_1; \phi) \prod_{j=2}^{b} p(z_j \mid s^{(j-1)}; \phi).
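To make the factorization concrete, here is a minimal sketch of Vecchia's approximation for the simplified case of a zero-mean Gaussian field with single-observation prediction vectors; the paper itself works with REML, blocks of observations, and an unknown mean. The names exp_cov and vecchia_loglik, the site ordering, and the nearest-neighbor conditioning rule in the demo are my illustrative assumptions, not the paper's construction.

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov(A, B, theta1, theta2):
    # exponential model from the paper: cov{Z(x), Z(y)} = theta2 * exp(-theta1 |x - y| / theta2)
    return theta2 * np.exp(-theta1 * cdist(A, B) / theta2)

def vecchia_loglik(z, X, cov, cond_sets):
    """Sum of log p(z_j | z_{S^(j-1)}; phi) over sites, for a zero-mean
    Gaussian field with single-observation prediction vectors.
    cond_sets[j] holds the indices of the conditioning subvector."""
    ll = 0.0
    for j in range(len(z)):
        c = cond_sets[j]
        k_jj = cov(X[j:j + 1], X[j:j + 1])[0, 0]
        if len(c) == 0:
            mu, v = 0.0, k_jj                  # unconditional term for the first site
        else:
            K_cc = cov(X[c], X[c])
            k_cj = cov(X[c], X[j:j + 1])[:, 0]
            w = np.linalg.solve(K_cc, k_cj)
            mu = w @ z[c]                      # conditional (kriging) mean
            v = k_jj - w @ k_cj                # conditional variance
        ll += -0.5 * (np.log(2.0 * np.pi * v) + (z[j] - mu) ** 2 / v)
    return ll

# toy run: simulate a small field, then condition each site on its
# 8 nearest previously ordered sites
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 20.0, size=(200, 2))
K = exp_cov(X, X, 0.5, 1.0)
z = np.linalg.cholesky(K + 1e-10 * np.eye(200)) @ rng.standard_normal(200)
cond_sets = [np.array([], dtype=int)] + [
    np.argsort(cdist(X[:j], X[j:j + 1])[:, 0])[:8] for j in range(1, 200)]
print(vecchia_loglik(z, X, lambda A, B: exp_cov(A, B, 0.5, 1.0), cond_sets))
```

With cond_sets[j] = np.arange(j) this recovers the exact log-likelihood via (1); shrinking the conditioning sets is what buys the computational savings.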

Stein et al. develop this approximation for restricted maximum likelihood (REML), which is used to estimate the parameters of the covariance structure. REML is preferred over ML for this type of estimation because it does not rely on estimation of the mean function. Standard ML methods use the estimated mean parameters in the estimation of the covariance parameters but act as if the mean parameters were known; because they do not account for the uncertainty in the estimated mean parameters, they tend to underestimate the variance.

2 Methods

The methods developed in this paper are for Gaussian random fields, so the general framework is based on the supposition that Z ~ N{F\beta, K(\theta)}. Here F is a known n x p matrix of rank p and \beta is a p-dimensional vector of unknown coefficients. The paper is concerned with estimating the parameters \theta of the covariance matrix K. The assumptions are that the covariance matrix K(\theta) is positive definite for all \theta in \Theta, that the subvector Z_j has length n_j, and that the rank of F_1 (the rows of F corresponding to Z_1) is p. In addition, the best linear unbiased predictor (BLUP) of Z_j in terms of Z^{(j-1)} is assumed to exist for all \theta in \Theta.

In the methodology section, the authors first express the restricted likelihood in terms of the errors of the BLUPs of Z_j based on Z^{(j-1)}, where Z^{(j-1)} = (Z_1, ..., Z_{j-1}). Let W_j(\theta) = B_j(\theta) Z, where B_j(\theta) is the n_j x n matrix such that W_j(\theta) is the vector of errors of the BLUP of Z_j in terms of Z^{(j-1)}. The result is the following proposition.

Proposition 1. The restricted log-likelihood of \theta in terms of Z is given by

    rl(\theta; Z) = -\frac{n - p}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{b} [\log\det\{V_j(\theta)\} + W_j' V_j(\theta)^{-1} W_j],

where V_j(\theta) is the covariance matrix of W_j(\theta).

The approximation to the restricted likelihood has the same form as Proposition 1, but W_j for j > 1 is the error of the BLUP of Z_j based on the subvector S^{(j-1)} of Z^{(j-1)}.

Now that the form of the approximation to the restricted log-likelihood has been developed, there are several issues to address regarding its use. The lengths of the prediction vectors Z_j and conditioning vectors S^{(j-1)} must be decided, along with which elements of Z^{(j-1)} should be included in S^{(j-1)}. Vecchia (1988) noted, and Stein et al. agree, that any statistical method used to make these decisions would itself be based on unknown parameters. Fortunately, Stein et al. provide a way to check the efficiency of the approximation by using the well-developed theory of estimating equations. This validation method is explained in the next section, which covers the numerical results.
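Before turning to the numerical results: the building block of Proposition 1 is the BLUP error W_j and its variance V_j. Here is a minimal sketch for a single-observation prediction vector under the unknown mean F\beta, obtained by solving the universal-kriging system. blup_error is a hypothetical helper of mine, not the paper's B_j(\theta) construction, and the conditioning set must contain at least p observations for the system to be nonsingular.

```python
import numpy as np

def blup_error(z, F, K, pred, cond):
    """Error of the BLUP of the single observation z[pred] from z[cond]
    under E(Z) = F beta with beta unknown. Solves the universal-kriging
    system [[K_cc, F_c], [F_c', 0]] (lam, mu) = (k_c, f_p); the BLUP is
    lam' z[cond] and its error variance is K_pp - lam' k_c - mu' f_p."""
    cond = np.asarray(cond)
    K_cc = K[np.ix_(cond, cond)]
    k_c = K[cond, pred]
    F_c = F[cond]                      # (len(cond), p)
    f_p = F[pred]                      # (p,)
    p = F.shape[1]
    A = np.block([[K_cc, F_c], [F_c.T, np.zeros((p, p))]])
    rhs = np.concatenate([k_c, f_p])
    sol = np.linalg.solve(A, rhs)
    lam, mu = sol[:len(cond)], sol[len(cond):]
    W = z[pred] - lam @ z[cond]        # BLUP error
    V = K[pred, pred] - lam @ k_c - mu @ f_p
    return W, V
```

The approximate restricted log-likelihood then accumulates -(1/2){log V_j + W_j^2 / V_j} over these errors for j >= 2, together with the -(1/2)(n - p) log(2\pi) constant and the j = 1 term from the first block, as in Proposition 1.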

3 Numerical Results

As noted in the paper, there are many possible networks and designs that could be studied to observe the effect of the approximated restricted likelihood on covariance parameter estimation under prediction and conditioning vectors of varying lengths. The authors consider an observation network of 1000 sites selected at random from 10,000 points in the plane, each selected point then being perturbed by adding a random point in [-0.25, 0.25]^2. In the first cases considered, the prediction vector Z_j contains only one observation for each j > 1; the paper also considers designs with longer prediction vectors. The conditioning vectors are allowed lengths m = 8, 16, and 32, and three cases are considered for the distance from the elements of the conditioning vector to the prediction vector: all m chosen as nearest neighbors, three-quarters of the m as nearest neighbors, and half of the m as nearest neighbors. The covariance structures for the random fields are the exponential model,

    cov{Z(x), Z(y)} = \theta_2 \exp(-\theta_1 |x - y| / \theta_2),

and the power law variogram model,

    (1/2) var{Z(x) - Z(y)} = \theta_2 |x - y|^{\theta_1}.
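For concreteness, here is a sketch of the observation network and a D(m, m')-style conditioning rule. I assume the 10,000 candidate points form a 100 x 100 unit-spacing grid (consistent with the description above), the sites are taken in the order given, and the even-spread-by-distance-rank rule for the m - m' non-nearest sites is my stand-in for the paper's exact construction in its Section 4.2.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# 1000 sites chosen at random from 10,000 candidate points (assumed here
# to be a 100 x 100 unit-spacing grid), each perturbed by U[-0.25, 0.25]^2
grid = np.array([(i, k) for i in range(100) for k in range(100)], float)
sites = grid[rng.choice(10_000, size=1000, replace=False)]
sites = sites + rng.uniform(-0.25, 0.25, size=sites.shape)

def conditioning_sets(X, m, m_prime):
    """D(m, m')-style sets: among the previously ordered sites, the m'
    nearest neighbors of site j plus m - m' more distant past sites,
    here spread evenly by distance rank (a stand-in for the paper's rule)."""
    sets = [np.array([], dtype=int)]
    for j in range(1, len(X)):
        order = np.argsort(cdist(X[:j], X[j:j + 1])[:, 0])  # past sites by distance
        near = order[:m_prime]
        far_pool = order[m_prime:]
        n_far = min(m - m_prime, len(far_pool))
        if n_far > 0:
            idx = np.linspace(0, len(far_pool) - 1, n_far).astype(int)
            sets.append(np.concatenate([near, far_pool[idx]]))
        else:
            sets.append(near)  # for small j the set is simply all past sites
    return sets

# e.g. m = 16 with half of the conditioning points as nearest neighbors
sets = conditioning_sets(sites, m=16, m_prime=8)
```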

I have included the table of relative efficiencies for the exponential model (Table 1) to clarify the design and procedure developed in the paper. The summary of results for the power law variogram model has the same form (except that efficiencies for a non-constant mean function are also displayed); I direct the reader to the paper for those numerical results.

The results displayed in Table 1 reflect the efficiency of parameter estimation under the approximation to the restricted likelihood. As alluded to in the methodology section, the theory of estimating equations leads to what is referred to as the robust information measure, which provides a method for comparing the efficiency of estimation under the approximate restricted likelihood with estimation under the exact restricted likelihood. It is shown that if there is no restriction on the conditioning vector S^{(j-1)}, we can take S^{(j-1)} = Z^{(j-1)}, in which case the robust information measure is the Fisher information matrix. Hence the robust information measure based on the approximation reflects the extent to which information is lost by using the approximation. The numbers displayed in Table 1 are the ratios, expressed as percentages, of the robust information measures to the Fisher information matrix; the Fisher information matrix can be calculated exactly because only 1000 observations are considered. Numbers close to 100 percent reflect more efficient estimation.

As expected, designs with longer conditioning vectors performed better than designs with shorter conditioning vectors. What is important to note in the tables of results is which proportion of nearest neighbors to more distant observations does best, and under which lengths of conditioning vectors. For example, Table 1 shows that component 1 performs similarly under the various designs for a given conditioning vector length, whereas component 2's performance depends strongly on the particular design. The general interpretation for component 2 is that when component 1 is smaller (\theta_1 = 0.02 or 0.1), it does not perform well when the conditioning vector is composed only of nearest neighbors; however, for larger values of component 1 (\theta_1 = 0.5 or 2), it performs competitively or even better when only nearest neighbors are considered. There are many more interesting results of this nature included in the paper.

Table 1. Relative efficiencies of estimators using approximate restricted likelihoods compared with the exact REML estimators, as measured by the diagonal elements of the inverse information matrices, based on 1000 observations as described in Section 4.1 of the paper. The designs used for the approximate likelihoods are of the form D(m, m') as defined in Section 4.2 of the paper; the covariance function is \theta_2 \exp(-\theta_1 d / \theta_2), where d is the interpoint distance, with \theta_2 = 1 in all cases. Entries are relative efficiencies (%).

    theta_1  m'/m     Component 1             Component 2
                      m=8    m=16   m=32      m=8    m=16   m=32
    0.02     1        87.3   94.8   97.5      24.8   33.6   44.7
             0.75     91.1   97.4   99.2      71.0   83.1   90.5
             0.5      85.7   95.2   98.6      74.7   82.9   90.9
    0.1      1        83.1   91.6   95.3      35.3   48.9   61.9
             0.75     89.0   96.6   99.0      58.6   78.4   89.8
             0.5      88.0   95.8   98.9      72.9   85.8   93.7
    0.5      1        79.7   89.1   94.2      63.1   78.9   88.4
             0.75     77.6   89.3   94.4      58.2   78.2   88.8
             0.5      81.9   92.2   96.2      69.1   81.6   91.0
    2        1        84.2   91.6   95.6      88.6   94.1   97.0
             0.75     81.8   89.8   94.6      83.9   92.1   96.0
             0.5      79.8   88.8   94.1      80.5   90.4   95.2
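As a sketch of how percentages like those in Table 1 can be produced, the following assumes Monte Carlo estimates of the approximate score's variability and sensitivity matrices are available (the paper computes the corresponding quantities analytically); godambe_information and relative_efficiencies are hypothetical helpers implementing the standard sandwich (Godambe) information from estimating-equation theory.

```python
import numpy as np

def godambe_information(scores, sensitivities):
    """Robust (Godambe) information G = H' J^{-1} H for an unbiased
    estimating equation: J = Var(score) is the variability matrix and
    H = E[-d(score)/d(theta)] is the sensitivity matrix, here estimated
    from simulations. scores: (n_sims, k); sensitivities: (n_sims, k, k)."""
    J = np.cov(scores.T)
    H = sensitivities.mean(axis=0)
    return H.T @ np.linalg.solve(J, H)

def relative_efficiencies(G_approx, I_exact):
    """Table-1-style percentages: diagonal of the inverse exact (Fisher)
    information over diagonal of the inverse robust information."""
    v_exact = np.diag(np.linalg.inv(I_exact))
    v_approx = np.diag(np.linalg.inv(G_approx))
    return 100.0 * v_exact / v_approx
```

When S^{(j-1)} = Z^{(j-1)}, the approximate score coincides with the exact restricted score, G reduces to the Fisher information, and every efficiency is 100%.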

4 Conclusion

While this paper develops a number of interesting ideas, and addresses many more concepts than I include in my presentation or this summary, the authors note that there is considerable room for further development of these ideas. Vecchia (1988) devised a clever way to approximate the likelihood of large spatial data sets, making likelihood-based methods feasible for such data. Stein et al. extend these methods in several ways, including applying the approximation to REML, and develop a method to verify the efficiency of these approximations under various designs using estimating equations and the robust information measure. Stein et al. also discuss at several points in the paper that designs that are good for estimation are not necessarily good for prediction, and vice versa; hence, whether the goal is estimation or prediction must be considered when determining the design.

References

[1] Stein, M. L., Chi, Z. and Welty, L. J. (2004) Approximating likelihoods for large spatial data sets. J. R. Statist. Soc. B, 66, 275-296.

[2] Vecchia, A. V. (1988) Estimation and model identification for continuous spatial processes. J. R. Statist. Soc. B, 50, 297-312.