P-spline ANOVA-type interaction models for spatio-temporal smoothing

Dae-Jin Lee¹ and María Durbán¹

¹ Department of Statistics, Universidad Carlos III de Madrid, SPAIN. e-mail: dae-jin.lee@uc3m.es and mdurban@est-econ.uc3m.es

Abstract: In recent years, spatial and spatio-temporal modelling have become an important area of research in many fields (epidemiology, environmental studies, disease mapping, ...). However, the applicability of most of the models developed is limited by the large amounts of data available. We propose the use of penalized splines (P-splines) in a mixed model framework for smoothing spatio-temporal data. Our approach allows the consideration of interaction terms, which can be decomposed as a sum of smooth functions in the same way as an ANOVA decomposition. The properties of the bases used for regression allow the use of algorithms that can handle large amounts of data. We show that imposing the same constraints as in a factorial design makes it possible to avoid identifiability problems. We illustrate the methodology with European ozone levels for the period 1999-2005.

Keywords: P-splines; Mixed models; Spatio-temporal data; Space-time interactions; Smooth-ANOVA models.

1 Spatio-temporal smoothing with P-splines

Suppose we have normally distributed spatio-temporal data observed at geographical locations $s = (x_1, x_2)$ and measured over time periods $x_t$. The response $y_{ijt}$ is then indexed by spatial location and by time. A smooth model for the data is given by

$$ y = B\theta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I), \tag{1} $$

where $\theta$ is the vector of coefficients and $B$ is a regression basis constructed from products of B-spline bases. Currie et al. (2006) developed an approach based on Kronecker products, known as generalized linear array models (GLAM), for data on a grid. When data are scattered (as is the case for spatial data), Eilers et al. (2006) proposed the use of the row-wise Kronecker product, or box product, of the individual bases (denoted by $\Box$).
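As an illustration of the box product for scattered data, the following NumPy sketch may be useful. It is not taken from the paper: the function name `box_product` and the random stand-in matrices `B1` and `B2` are assumptions made only for this example, and real marginal bases would be B-spline bases evaluated at the station coordinates.

```python
import numpy as np

def box_product(A, B):
    """Row-wise Kronecker (box) product: row i equals kron(A[i], B[i])."""
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of rows")
    # Outer product of matching rows, flattened so the B index varies fastest.
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

# Hypothetical stand-ins for two marginal bases evaluated at 5 scattered
# locations, with 3 and 4 columns respectively.
rng = np.random.default_rng(1)
B1 = rng.random((5, 3))
B2 = rng.random((5, 4))

Bs = box_product(B1, B2)  # 5 x 12: one row per location

# Equivalent form: (B1 kron 1') * (1' kron B2), multiplied elementwise.
Bs_alt = np.kron(B1, np.ones((1, 4))) * np.kron(np.ones((1, 3)), B2)

rows_match = all(np.allclose(Bs[i], np.kron(B1[i], B2[i])) for i in range(5))
print(Bs.shape, rows_match, np.allclose(Bs, Bs_alt))
```

Unlike the full Kronecker product, which would have 25 rows here, the box product keeps one row per observation, which is what scattered locations require.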
Most of the common approaches to spatio-temporal smoothing assume an additive function for the temporal dimension, ignoring the interaction between space and time (MacNab and Dean, 2001; Kneib and Fahrmeir, 2006). This formulation implies a spatio-temporal covariance structure given by separable terms for the spatial and temporal components respectively. This could be too simplistic in some situations. As an alternative, we propose non-separable models of the form

$$ \hat{y} = f(x_1, x_2, x_t), \tag{2} $$

which explicitly consider the interaction between space and time. The regression basis for the 3d interaction model (2) is

$$ B = (B_1 \Box B_2) \otimes B_t = B_s \otimes B_t, \quad \text{of dimension } nt \times c_1 c_2 c_3, \tag{3} $$

where $B_s = B_1 \Box B_2$ is the box product of the spatial marginal bases, and $B_1$, $B_2$ and $B_t$ are the marginal B-spline bases of dimensions $n \times c_1$, $n \times c_2$ and $t \times c_3$ respectively. Model (2) with the basis given by (3) can easily be set in the GLAM framework. We can express the data in compact notation by replacing $y$, of length $nt \times 1$, by the matrix $Y$ of dimension $t \times n$, and the coefficient vector $\theta$, of length $c_1 c_2 c_3 \times 1$, by an array of coefficients $\Theta$ of dimension $c_3 \times c_1 c_2$, so that $y = \mathrm{vec}(Y)$ and $\theta = \mathrm{vec}(\Theta)$. In matrix notation, the model can be written as

$$ E[Y] = B_t \Theta B_s'. \tag{4} $$

Smoothness is imposed via the penalty matrix $P$, based on second-order difference matrices $D_1$, $D_2$ and $D_t$. The penalty term in three dimensions is

$$ P = \lambda_1 D_1' D_1 \otimes I_{c_2} \otimes I_{c_3} + \lambda_2 I_{c_1} \otimes D_2' D_2 \otimes I_{c_3} + \lambda_t I_{c_1} \otimes I_{c_2} \otimes D_t' D_t, \tag{5} $$

which places a penalty on each dimension of the array $\Theta$. For the spatio-temporal case, the penalty (5) allows spatial anisotropy, with a different amount of smoothing for longitude and latitude ($\lambda_1 \neq \lambda_2$), and a separate amount for the temporal component ($\lambda_t$).

The mixed model representation of P-splines consists of setting up a new basis which allows the reparameterization of (1) and its associated penalty as a mixed model of the form

$$ y = X\beta + Z\alpha + \epsilon, \quad \alpha \sim N(0, G), \quad \epsilon \sim N(0, \sigma^2 I), \tag{6} $$

where $G$ is a diagonal matrix which depends on the smoothing parameters $\lambda_1$, $\lambda_2$ and $\lambda_t$. Following a similar approach to Currie et al. (2006), and using the properties of the Kronecker and row-wise Kronecker products, it can be shown that, using the singular value decomposition (SVD) of (5), the penalty becomes block-diagonal, and the basis and coefficients are reparameterized as $B \to [X : Z]$ and $\theta \to (\beta' : \alpha')'$.
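The basis (3), the array form (4) and the penalty (5) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the marginal bases are random stand-ins, the helper names `box_product` and `diff_penalty`, the dimensions and the values of the smoothing parameters are all hypothetical, and the array form assumes column-stacking so that the fitted values are `vec(Bt Θ Bs')`.

```python
import numpy as np

def box_product(A, B):
    """Row-wise Kronecker (box) product: row i equals kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

def diff_penalty(c, order=2):
    """D'D for the difference matrix D of the given order on c coefficients."""
    D = np.diff(np.eye(c), n=order, axis=0)  # (c - order) x c
    return D.T @ D

# Hypothetical sizes: n locations, t time points, c1, c2, c3 basis columns.
rng = np.random.default_rng(0)
n, t = 6, 5
c1, c2, c3 = 4, 3, 4
B1, B2, Bt = rng.random((n, c1)), rng.random((n, c2)), rng.random((t, c3))

# Basis (3): box product in space, Kronecker product with time.
Bs = box_product(B1, B2)          # n x c1*c2
B = np.kron(Bs, Bt)               # n*t x c1*c2*c3

# Array form (4): same fitted values without ever forming B.
Theta = rng.random((c3, c1 * c2))
theta = Theta.reshape(-1, order="F")                    # column-stacking vec
fit_full = B @ theta
fit_glam = (Bt @ Theta @ Bs.T).reshape(-1, order="F")
same_fit = np.allclose(fit_full, fit_glam)

# Penalty (5): one smoothing parameter per dimension (values are arbitrary).
lam1, lam2, lamt = 1.0, 10.0, 0.1
I1, I2, I3 = np.eye(c1), np.eye(c2), np.eye(c3)
P = (lam1 * np.kron(np.kron(diff_penalty(c1), I2), I3)
     + lam2 * np.kron(np.kron(I1, diff_penalty(c2)), I3)
     + lamt * np.kron(np.kron(I1, I2), diff_penalty(c3)))

null_dim = P.shape[0] - np.linalg.matrix_rank(P)
print(B.shape, same_fit, null_dim)
```

With second-order differences each marginal penalty has a two-dimensional null space, so the null space of `P` has dimension 2 × 2 × 2 = 8 in this sketch. These unpenalized directions are the ones carried by the fixed part $X\beta$ in a representation such as (6).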
2 Smooth-ANOVA decomposition models

Sometimes the interest lies in fitting complex models with functional form given by

$$ \hat{y} = f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{1,2}(x_1, x_2) + f_{1,2,3}(x_1, x_2, x_3), \tag{7} $$

where $f_1$, $f_2$ and $f_3$ are smooth functions for the main effects ($x_1$, $x_2$ and $x_3$), $f_{1,2}$ is the 2d-interaction effect for $(x_1, x_2)$ and $f_{1,2,3}$ is the 3d-interaction effect. Chen (1993) proposed Smoothing Spline Analysis-of-Variance (SS-ANOVA) decompositions to model main effects and interactions, which can be interpreted as in classical ANOVA. In contrast, the approach presented in this paper provides a more computationally efficient methodology based on low-rank penalized splines. Wood (2006) also considers smooth-ANOVA decompositions with P-splines, and notes the need to impose constraints to maintain model identifiability. However, it is not clear how these constraints are imposed or how the bases for each component of the decomposition are constructed. In this paper we use the properties of the SVD of the penalty (5) to show how to fit each component of the model, and we establish an intuitive connection with the usual ANOVA. In the case of spatio-temporal data this interpretation may be very useful, since we can model not only the main effects of latitude and longitude (or the effects of other covariates) but also the spatial effects (2-way interactions) and especially the interaction between space and time (3-way interactions).

The bases $X$ and $Z$ of the mixed model representation can be expanded to allow the representation of the 3d model as the sum of smooth main and interaction terms, as in (7). However, this representation does not provide independent, separate penalties for each term: it has three smoothing parameters, $\lambda_1$, $\lambda_2$ and $\lambda_t$, one for each dimension of the model, with penalty matrix given in (5), but no separate parameters for the interaction terms. Alternatively, ANOVA-type models which explicitly consider a different amount of smoothing for each smooth function in (7) can be considered.
The corresponding new B-spline regression matrix would not be of full rank, because of its linearly dependent columns, and the model would not be identifiable. The identifiability problem can be avoided by removing the columns of the basis for $f_1(x_1)$ which are repeated in those for $f_{1,2}(x_1, x_2)$ and $f_{1,2,3}(x_1, x_2, x_3)$, and so on. Therefore, we need to impose constraints so that model (7) is identifiable. We show that these constraints are applied to the P-spline regression coefficients $\theta$, and are exactly equivalent to those applied in a 3-way factorial design, i.e.

Main effects:
$$ \sum_{i} \theta^{(1)}_{i} = \sum_{j} \theta^{(2)}_{j} = \sum_{t} \theta^{(3)}_{t} = 0 \tag{8} $$

2-way interactions:
$$ \sum_{i,j} \theta^{(12)}_{ij} = \sum_{i,t} \theta^{(13)}_{it} = \sum_{j,t} \theta^{(23)}_{jt} = 0 \tag{9} $$

3-way interactions:
$$ \sum_{i,j,t} \theta^{(123)}_{ijt} = 0 \tag{10} $$
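The rank deficiency described above can be checked numerically. The sketch below is an illustration under assumptions, not the procedure used in the paper: the helper `bspline_basis` (a plain Cox-de Boor recursion on equally spaced knots), the helper `box_product`, the simulated coordinates and the number of segments are all hypothetical choices. It stacks two main-effect bases next to their 2d interaction basis and shows that the main-effect columns add nothing to the column space.

```python
import numpy as np

def bspline_basis(x, xl, xr, nseg, deg=3):
    """B-spline basis on equally spaced knots; returns len(x) x (nseg + deg)."""
    dx = (xr - xl) / nseg
    knots = xl + dx * np.arange(-deg, nseg + deg + 1)
    xc = x[:, None]
    # Degree 0: indicator of each knot interval.
    B = ((knots[:-1] <= xc) & (xc < knots[1:])).astype(float)
    # Cox-de Boor recursion up to the requested degree.
    for d in range(1, deg + 1):
        left = (xc - knots[:-d - 1]) / (knots[d:-1] - knots[:-d - 1])
        right = (knots[d + 1:] - xc) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def box_product(A, B):
    """Row-wise Kronecker (box) product: row i equals kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

# Simulated scattered coordinates on the unit square (hypothetical data).
rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.random(n), rng.random(n)

B1 = bspline_basis(x1, 0.0, 1.0, nseg=3)   # main effect of x1
B2 = bspline_basis(x2, 0.0, 1.0, nseg=3)   # main effect of x2
B12 = box_product(B1, B2)                  # 2d interaction basis
c1, c2 = B1.shape[1], B2.shape[1]

# Rows of a B-spline basis sum to one, so each main-effect basis is a
# linear combination of the interaction columns.
B1_from_B12 = B12 @ np.kron(np.eye(c1), np.ones((c2, 1)))
B2_from_B12 = B12 @ np.kron(np.ones((c1, 1)), np.eye(c2))

stacked = np.hstack([B1, B2, B12])
rank_stacked = np.linalg.matrix_rank(stacked)
rank_B12 = np.linalg.matrix_rank(B12)
print(stacked.shape[1], rank_stacked, rank_B12)
```

The stacked matrix has more columns than its rank, which is the redundancy that constraints such as (8)-(10) are meant to remove.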

FIGURE 1. 3d P-spline model: (a) spatial trend for June 1999, (b) spatial trend for December 1999. Separate symbols denote the stations where monthly average measurements are available for the period 1999-2005 and the stations with missing data. (c) Time series plot of ozone levels (O3) for a sample of stations in four countries (Spain, Finland, France and the UK), which reflects the seasonality and temporal patterns in the data. (d) Smoothed temporal trends, f(time), for the four stations selected. The vertical solid line corresponds to June 1999 and the dashed line to December 1999.

3 Application to ozone levels in Europe

We apply the proposed methodology to the analysis of air pollution by ozone (in μg/m³) over Europe from 1999 to 2005. The data are collected by the EMEP monitoring network, which includes 126 stations in 28 countries. Ozone is reported hourly at each monitoring station. We consider monthly averages in a regular temporal pattern but, due to the limited number of sites available, we selected a sample of 70 monitoring stations covering 15 countries. Data can be obtained at www.emep.int, and further information and annual reports about air pollution trends are available on the European Environment Agency (EEA) web site (www.eea.europa.eu).

We fitted a 2d P-spline model for the spatial component with an additive smooth function for time, which does not consider space-time interaction, i.e. $f(x_1, x_2) + f(x_t)$, and the 3d P-spline interaction model (2). In addition, P-spline ANOVA models were fitted with the appropriate constraints proposed in the previous section, depending on the interaction terms included in the model. The model selection criterion was the Akaike Information Criterion (AIC). In general, better AIC values were obtained for the interaction models.

Figure 1 shows the results for the 3d space-time interaction model: (a) and (b) are the fitted spatial trends for two periods (June and December 1999). Both the spatial trend pattern and the overall level differ between the two periods, reflecting a seasonal variation which is very common in environmental data. Figure 1(c) shows this cyclic pattern in the data for selected monitoring stations in Spain, Finland, France and the UK. As reported by the EEA for ozone levels, summer periods show the highest values, in contrast to winter months. Finally, Figure 1(d) shows the smooth function of the time covariate, i.e. $f(x_t)$, for the four selected stations.

Concluding remarks

We presented a computationally efficient methodology for multidimensional smoothing. The ANOVA-type models are an attractive alternative because they can be interpreted as decompositions of smooth functions and because their bases are identifiable. In our P-spline approach, the mixed model representation and the decomposition of the basis allow more flexibility than existing SS-ANOVA models.
The analysis of the ozone data showed that a model in which the time dimension is additive could ignore important features of the data.

References

Chen, Z. (1993). Fitting Multivariate Regression Functions by Interaction Spline Models. J. R. Statist. Soc. B, 55, 473-491.

Currie, I. D., Durbán, M. and Eilers, P. H. C. (2006). Generalized linear array models with applications to multidimensional smoothing. J. R. Statist. Soc. B, 68, 1-22.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible Smoothing with B-Splines and Penalties. Statistical Science, 11, 89-121.

Eilers, P. H. C., Currie, I. D. and Durbán, M. (2006). Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis, 50(1), 61-76.

Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.

Kneib, T. and Fahrmeir, L. (2006). Structured Additive Regression for Categorical Space-Time Data: A Mixed Model Approach. Biometrics, 62, 109-118.

MacNab, Y. C. and Dean, C. B. (2001). Autoregressive Spatial Smoothing and Temporal Spline Smoothing for Mapping Rates. Biometrics, 57, 949-956.

Wood, S. N. (2006). Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models. Biometrics, 62, 1025-1036.