P-spline ANOVA-type interaction models for spatio-temporal smoothing

Dae-Jin Lee¹ and María Durbán¹

¹ Department of Statistics, Universidad Carlos III de Madrid, Spain. E-mail: dae-jin.lee@uc3m.es and mdurban@est-econ.uc3m.es

Abstract: In recent years, spatial and spatio-temporal modelling have become an important area of research in many fields (epidemiology, environmental studies, disease mapping, ...). However, most of the models developed are constrained by the large amounts of data available. We propose the use of penalized splines (P-splines) in a mixed model framework for smoothing spatio-temporal data. Our approach allows the consideration of interaction terms which can be decomposed as a sum of smooth functions, in a way analogous to an ANOVA decomposition. The properties of the basis used for regression allow the use of algorithms that can handle large amounts of data. We show that identifiability problems can be avoided by imposing the same constraints as in a factorial design. We illustrate the methodology with ozone levels in Europe over the period 1999-2005.

Keywords: P-splines; Mixed models; Spatio-temporal data; Space-time interactions; Smooth-ANOVA models.

1 Spatio-temporal smoothing with P-splines

Suppose we have normally distributed spatio-temporal data observed at geographical locations $s = (x_1, x_2)$ and measured over time periods $x_t$. The response $y_{ijt}$ is thus indexed by spatial location and by time. A smooth model for the data is given by

$$y = B\theta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I), \tag{1}$$

where $\theta$ is the vector of coefficients and $B$ is a regression basis constructed from products of B-spline bases. Currie et al. (2006) developed an approach based on Kronecker products, known as generalized linear array models (GLAM), for data on a grid. When the data are scattered (as is the case for spatial data), Eilers et al. (2006) proposed the use of the row-wise Kronecker product, or box product, of the individual bases (denoted by $\Box$).
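The row-wise Kronecker (box) product mentioned above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function name `box_product`, the sizes, and the random matrices standing in for the marginal B-spline bases are all assumptions made for the example.

```python
import numpy as np

def box_product(A, B):
    """Row-wise Kronecker product: row i of the result is kron(A[i], B[i])."""
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of rows")
    n = A.shape[0]
    # Outer product of matching rows, flattened so that B's index runs fastest.
    return (A[:, :, None] * B[:, None, :]).reshape(n, -1)

rng = np.random.default_rng(0)
n, c1, c2 = 5, 3, 4            # hypothetical sizes
B1 = rng.random((n, c1))       # stand-in for the basis in x_1 (assumption)
B2 = rng.random((n, c2))       # stand-in for the basis in x_2 (assumption)
Bs = box_product(B1, B2)       # spatial basis, n x (c1 * c2)
```

Each of the n scattered locations contributes one row, so the spatial basis has n rows rather than the n² rows that a full Kronecker product of the two marginal bases would have.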
Most common approaches to spatio-temporal smoothing assume an additive function for the temporal dimension, ignoring the interaction between space and time (MacNab and Dean, 2001; Kneib and Fahrmeir, 2006). This formulation implies a spatio-temporal covariance structure made up of separable terms for the spatial and temporal components respectively, which could be too simplistic in some situations. As an alternative, we propose non-separable models of the form

$$\hat{y} = f(x_1, x_2, x_t), \tag{2}$$

which explicitly consider the interaction between space and time. The regression basis for the 3d interaction model (2) is

$$B = (B_1 \,\Box\, B_2) \otimes B_t = B_s \otimes B_t, \quad \text{of dimension } nt \times c_1 c_2 c_3, \tag{3}$$

where $B_s = B_1 \Box B_2$ is the row-wise Kronecker product of the two spatial marginal bases, and $B_1$, $B_2$ and $B_t$ are the marginal B-spline bases of dimensions $n \times c_1$, $n \times c_2$ and $t \times c_3$ respectively. Model (2) with the basis (3) fits easily into the GLAM framework. We can express the data compactly by replacing $y$, of length $nt \times 1$, by the matrix $Y$ of dimension $t \times n$, and the coefficient vector $\theta$, of length $c_1 c_2 c_3 \times 1$, by an array of coefficients $\Theta$ of dimension $c_3 \times c_1 c_2$. In matrix notation, the model can be written as

$$E[Y] = B_t \Theta B_s'. \tag{4}$$

Smoothness is imposed via the penalty matrix $P$, based on second-order difference matrices $D_1$, $D_2$ and $D_t$. The penalty term in three dimensions is

$$P = \lambda_1 D_1'D_1 \otimes I_{c_2} \otimes I_{c_3} + \lambda_2 I_{c_1} \otimes D_2'D_2 \otimes I_{c_3} + \lambda_t I_{c_1} \otimes I_{c_2} \otimes D_t'D_t, \tag{5}$$

which places a penalty over each dimension of the array $\Theta$. For the spatio-temporal case, the penalty (5) allows spatial anisotropy, with a different amount of smoothing for longitude and latitude ($\lambda_1 \neq \lambda_2$), and a separate amount for the temporal component ($\lambda_t$).

The mixed model representation of P-splines consists in setting up a new basis which allows the reparameterization of (1) and its associated penalty into a mixed model of the form

$$y = X\beta + Z\alpha + \epsilon, \qquad \alpha \sim N(0, G), \quad \epsilon \sim N(0, \sigma^2 I), \tag{6}$$

where $G$ is a diagonal matrix which depends on the smoothing parameters $\lambda_1$, $\lambda_2$ and $\lambda_t$. Following an approach similar to Currie et al. (2006), and using the properties of the Kronecker and row-wise Kronecker products, it can be shown that, using the singular value decomposition (SVD) of (5), the penalty becomes block-diagonal and the basis and coefficients are reparameterized into $B \rightarrow [X : Z]$ and $\theta \rightarrow (\beta' : \alpha')'$.
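The array form (4) and the penalty (5) can be checked numerically with a small sketch. The code below is illustrative only and makes several assumptions: random matrices stand in for the bases $B_s$ and $B_t$, the sizes and smoothing parameters are arbitrary, $\Theta$ is filled from $\theta$ column by column (time index fastest), and the helper names `diff_penalty` and `penalty_3d` are hypothetical.

```python
import numpy as np

def diff_penalty(c, order=2):
    """D'D for the order-th difference matrix D acting on c coefficients."""
    D = np.diff(np.eye(c), n=order, axis=0)   # (c - order) x c
    return D.T @ D

def penalty_3d(c1, c2, c3, lam1, lam2, lamt):
    """Penalty (5): one Kronecker term per dimension of the coefficient array."""
    I1, I2, I3 = np.eye(c1), np.eye(c2), np.eye(c3)
    return (lam1 * np.kron(np.kron(diff_penalty(c1), I2), I3)
            + lam2 * np.kron(np.kron(I1, diff_penalty(c2)), I3)
            + lamt * np.kron(np.kron(I1, I2), diff_penalty(c3)))

rng = np.random.default_rng(1)
n, t, c1, c2, c3 = 6, 5, 4, 4, 5       # hypothetical sizes
Bs = rng.random((n, c1 * c2))          # stand-in for B_1 box B_2 (assumption)
Bt = rng.random((t, c3))               # stand-in for B_t (assumption)

# Array form (4): Theta holds theta column by column (time index fastest).
theta = rng.standard_normal(c1 * c2 * c3)
Theta = theta.reshape(c3, c1 * c2, order="F")
mu_vec = np.kron(Bs, Bt) @ theta       # uses the full nt x c1c2c3 basis
mu_arr = Bt @ Theta @ Bs.T             # t x n, never forms the large basis

# Penalty (5) and the dimension of its null space (the unpenalized part).
P = penalty_3d(c1, c2, c3, lam1=1.0, lam2=2.0, lamt=0.5)
eigvals = np.linalg.eigvalsh(P)
n_unpenalized = int(np.sum(eigvals < 1e-8 * eigvals.max()))
```

With second-order differences each marginal penalty leaves a constant and a linear term unpenalized, so the null space of $P$ has dimension 2 × 2 × 2 = 8 here. In a mixed model reparameterization of this kind, that unpenalized part is what ends up in the fixed-effects matrix $X$, while the penalized directions form $Z$.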
2 Smooth-ANOVA decomposition models

Sometimes the interest lies in fitting complex models with functional form given by

$$\hat{y} = f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{1,2}(x_1, x_2) + f_{1,2,3}(x_1, x_2, x_3), \tag{7}$$

where $f_1$, $f_2$ and $f_3$ are smooth functions for the main effects ($x_1$, $x_2$ and $x_3$), $f_{1,2}$ is the 2d interaction effect for $(x_1, x_2)$ and $f_{1,2,3}$ is the 3d interaction effect. Chen (1993) proposed Smoothing Spline Analysis-of-Variance (SS-ANOVA) decompositions to model main effects and interactions which can be interpreted as in classical ANOVA. In contrast, the approach presented in this paper allows a more computationally efficient methodology based on low-rank penalized splines. Wood (2006) also considers Smooth-ANOVA decompositions with P-splines, and notes the need to impose constraints to maintain model identifiability. However, it is not clear how these constraints are imposed or how the basis for each component of the decomposition is constructed. In this paper we use the properties of the SVD of the penalty (5) to show how to fit each component of the model, and we establish an intuitive connection with the usual ANOVA. In the case of spatio-temporal data this interpretation may be very useful, since we can model not only the main effects of latitude and longitude (or the effects of other covariates) but also the spatial effects (2-way interactions) and, especially, the interaction between space and time (3-way interactions).

The bases $X$ and $Z$ of the mixed model representation can be expanded to represent the 3d model as the sum of smooth main and interaction terms, as in (7). However, this representation does not give independent, separate penalties: it has only the three smoothing parameters $\lambda_1$, $\lambda_2$ and $\lambda_t$, one for each dimension of the model, with penalty matrix given in (5), and no separate parameters for the interaction terms. Alternatively, ANOVA-type models which explicitly consider a different amount of smoothing for each smooth function in (7) can be considered.
The corresponding new B-spline regression matrix would not be of full rank, because it contains linearly dependent columns, and the model would not be identifiable. The identifiability problem can be avoided by removing the columns of the basis of $f(x_1)$ which are repeated in those for $f(x_1, x_2)$ and $f(x_1, x_2, x_3)$, and so on. Therefore, we need to impose constraints so that model (7) is identifiable. We show that these constraints are applied to the P-spline regression coefficients $\theta$, and are exactly equivalent to those applied in a 3-way factorial design, i.e.

Main effects:
$$\sum_i \theta^{(1)}_i = \sum_j \theta^{(2)}_j = \sum_t \theta^{(3)}_t = 0 \tag{8}$$

2-way interactions:
$$\sum_{i,j} \theta^{(12)}_{ij} = \sum_{i,t} \theta^{(13)}_{it} = \sum_{j,t} \theta^{(23)}_{jt} = 0 \tag{9}$$

3-way interactions:
$$\sum_{i,j,t} \theta^{(123)}_{ijt} = 0 \tag{10}$$
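The rank deficiency and the effect of the main-effect constraints (8) can be illustrated with a toy additive model. The sketch below rests on assumptions that are not taken from the paper: it uses degree-1 B-splines (hat functions) instead of the authors' marginal bases, simulated coordinates on [0, 1], an explicit intercept column, and the hypothetical helpers `hat_basis` and `sum_to_zero`.

```python
import numpy as np

def hat_basis(x, c):
    """Degree-1 B-spline (hat function) basis with c equally spaced knots on [0, 1]."""
    knots = np.linspace(0.0, 1.0, c)
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

def sum_to_zero(c):
    """c x (c - 1) matrix whose columns span {theta : sum(theta) = 0}."""
    return np.vstack([np.eye(c - 1), -np.ones((1, c - 1))])

rng = np.random.default_rng(2)
n, c1, c2 = 200, 6, 6                    # hypothetical sizes
x1, x2 = rng.random(n), rng.random(n)    # simulated scattered coordinates
B1, B2 = hat_basis(x1, c1), hat_basis(x2, c2)

# Unconstrained additive basis: the rows of each block sum to one, so both
# blocks reproduce the constant and one column is linearly dependent.
B_raw = np.hstack([B1, B2])
rank_raw = np.linalg.matrix_rank(B_raw)

# Constrained basis: intercept plus main effects whose coefficients sum to
# zero, as in (8). Every column now carries separate information.
B_con = np.hstack([np.ones((n, 1)), B1 @ sum_to_zero(c1), B2 @ sum_to_zero(c2)])
rank_con = np.linalg.matrix_rank(B_con)
```

In this toy setting `B_raw` has 12 columns but rank 11, while `B_con` has 11 columns and full column rank. The same idea extends to the interaction terms through (9) and (10), although the basis construction used in the paper is based on the SVD of the penalty rather than on this direct substitution.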

[Figure 1, four panels: (a) spatial trend for June 1999; (b) "Smoothed spatial trend for Dec. 1999"; (c) "Ozone levels for selected locations", O3 against time; (d) "Smoothed temporal trend for selected locations", f(time) against time. Legend in (c) and (d): Spain, Finland, France, UK.]

FIGURE 1. 3d P-spline model: (a) spatial trend for June 1999, (b) spatial trend for December 1999. Different symbols mark the stations where monthly average measurements are available for the period 1999-2005 and the stations with missing data. (c) Time series plot of a sample of stations in four countries, which reflects the seasonality and temporal patterns in the data. (d) Smoothed temporal trends for the four stations selected. The vertical solid line corresponds to June 1999 and the dashed line to December 1999.

3 Application to ozone levels in Europe

We apply the proposed methodology to the analysis of ozone air pollution levels (in µg/m³) over Europe from 1999 to 2005. The data are collected by the EMEP monitoring network, which includes 126 stations in 28 countries. Ozone is reported hourly at each monitoring station. We consider monthly averages on a regular temporal pattern but, due to the limited number of sites available, we selected a sample of 70 monitoring stations covering 15 countries. Data can be obtained at www.emep.int, and further information and annual reports about air pollution trends are available on the European Environment Agency (EEA) web site (www.eea.europa.eu).

We fitted a 2d P-spline model for the spatial component with an additive smooth function for time, which does not consider space-time interaction, i.e. $f(x_1, x_2) + f(x_t)$, and the 3d P-spline interaction model (2). In addition, P-spline ANOVA models were fitted using the appropriate constraints proposed in the previous section, depending on the interaction terms included in the model. The model selection criterion was the Akaike Information Criterion (AIC). In general, the interaction models attained better AIC values.

Figure 1 shows the results for the 3d space-time interaction model: (a) and (b) are the fitted spatial trends for two periods (June and December of 1999). Both the spatial trend pattern and the overall level differ between the two periods, reflecting a seasonal variation which is very common in environmental data. Figure 1(c) shows this cyclic pattern in the data for selected monitoring stations in Spain, Finland, France and the UK. As reported by the EEA for ozone levels, summer periods show the highest values, in contrast to winter months. Finally, Figure 1(d) shows the smooth function for the time covariate, i.e. $f(x_t)$, for the four selected stations.

Concluding remarks

We presented a computationally efficient methodology for multidimensional smoothing. The ANOVA-type models are an attractive alternative because they can be interpreted as decompositions into smooth functions whose bases are identifiable. In our P-spline approach, the mixed model representation and the decomposition of the basis allow more flexibility than existing SS-ANOVA models.
The analysis of the ozone level data showed that a model where the time dimension is additive could ignore important features in the data.

References

Chen, Z. (1993). Fitting Multivariate Regression Functions by Interactions Spline Models. J. R. Statist. Soc. B, 55, 473-491.
Currie, I. D., Durbán, M. and Eilers, P. H. C. (2006). Generalized linear array models with applications to multidimensional smoothing. J. R. Statist. Soc. B, 68, 1-22.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible Smoothing with B-Splines and Penalties. Statistical Science, 11, 89-121.

Eilers, P. H. C., Currie, I. D. and Durbán, M. (2006). Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis, 50(1), 61-76.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.
Kneib, T. and Fahrmeir, L. (2006). Structured Additive Regression for Categorical Space-Time Data: A Mixed Model Approach. Biometrics, 62, 109-118.
MacNab, Y. C. and Dean, C. B. (2001). Autoregressive Spatial Smoothing and Temporal Spline Smoothing for Mapping Rates. Biometrics, 57, 949-956.
Wood, S. N. (2006). Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models. Biometrics, 62, 1025-1036.