Geostatistical interpolation using copulas

WATER RESOURCES RESEARCH, VOL. 44,, doi: /2007wr006115, 2008

András Bárdossy¹ and Jing Li¹

¹ Institute of Hydraulic Engineering, University of Stuttgart, Stuttgart, Germany.

Received 13 April 2007; revised 15 February 2008; accepted 12 March 2008; published 24 July 2008.

[1] In many applications of geostatistical methods, the dependence structure of the investigated parameter is described solely with the variogram or covariance functions, which are susceptible to measurement anomalies and imply the assumption of Gaussian dependence. Moreover, the kriging variance reflects only observation density, data geometry and the variogram model. To address these problems, we borrow the idea of copulas to describe the dependence structure without the influence of the marginal distribution. The methodology and basic hypotheses for the application of copulas as geostatistical methods are discussed, and the Gaussian copula as well as a non-Gaussian copula are used in this paper. Copula parameters are estimated by dividing the observations into multipoint subsets and subsequently maximizing the corresponding likelihood function. The interpolation is carried out with two different copulas, where the expected and median values are calculated from the copulas conditioned on the nearby observations. The full conditional copulas provide the estimation distributions for the unobserved locations and can be used to define confidence intervals which depend on both the observation geometry and the observed values. Observations of a large scale groundwater quality measurement network in Baden-Württemberg are used to demonstrate the methodology. Five groundwater quality parameters are investigated: chloride, nitrate, pH, sulfate and dissolved oxygen. All five parameters show non-Gaussian dependence. The copula-based interpolation results of the five parameters are compared to the results of conventional ordinary and indicator kriging. Different statistical measures including mean squared error, relative differences and probability scores are used to compare cross validation and split sampling results of the interpolation methods. The non-Gaussian copulas give better results than the geostatistical interpolations. Validation of the confidence intervals shows that they are more realistic than the estimation variances obtained by ordinary kriging.

Citation: Bárdossy, A., and J. Li (2008), Geostatistical interpolation using copulas, Water Resour. Res., 44,, doi: /2007wr006115.

Copyright 2008 by the American Geophysical Union.

1. Introduction

[2] Interpolation is a classical and frequently encountered problem in spatial data analysis. The unknown value of the variable under study at an unsampled location has to be estimated from point observations. There are many different methods for interpolation. Some of them are based on simple procedures such as the nearest neighbor or the inverse distance method. Others are based on statistical assumptions like those used in the many different kriging procedures.

[3] One advantage of the geostatistical procedures is that they deliver information on the quality of interpolation. Unfortunately this quality is often just a function of the observation density and the variogram model. Anomalies in the observations, and zones with high or low variability, are usually not considered in the calculation of estimation variances. Some corrections were introduced using the proportional effect correction [see Journel and Huijbregts]. Recently developed methods in geostatistics use multipoint statistics for simulation, and a large number of simulations to obtain confidence intervals [Journel and Zhang, 2006].
[4] Copulas, which are new to spatial statistics, are standardized multivariate distributions with uniform marginals that can be used to describe the dependence structure of multivariate distributions separately from their univariate marginals. There are several applications of copulas in the financial sector [e.g., Embrechts et al., 2001] where the dependence between extremes plays a very important role. Malevergne and Sornette [2003] showed that the usual Gaussian type of dependence is not appropriate to estimate financial risks, since it underestimates the dependence of extremes. There are several hydrological applications, most of them related to the analysis of extremes [Salvadori et al., 2007; Favre et al., 2004; Poulin et al., 2007].

[5] The departure of the dependence structure from Gaussian dependence has been discussed in the geostatistical context too [Journel and Alabert, 1989; Gomez-Hernandez and Wen, 1998]. In a recent application in hydrology, copulas were used by Bárdossy [2006] to describe the spatial variability of groundwater quality parameters, and the departure from Gaussian dependence was also confirmed.

[6] The purpose of this paper is to introduce a copula-based model for describing spatial variability, and to use this methodology for the interpolation of groundwater quality parameters. The paper is divided into six sections. After the introduction, the basic methodology is briefly described and a new type of non-Gaussian copula is introduced. The third section describes the assessment of the multivariate copulas based on observations. In the fourth section, conditional copulas corresponding to unsampled locations are derived. An example is discussed to demonstrate that the combination of the conditional copulas and the marginals leads to the full conditional distributions and to different interpolation results. In the fifth section the methodology is applied to a large scale groundwater quality observation network in Baden-Württemberg. Section 6 summarizes the results and conclusions of the paper.

2. Methodology

[7] In this section basic definitions and ideas of how to apply copulas to investigate spatial variability are presented. Some of them are discussed by Bárdossy [2006], but for the sake of completeness the essential information is summarized.

2.1. Basic Methodology

[8] Only a short introduction on copulas is given here. The interested reader is referred to Joe [1997] or Nelsen [1999] for further details.

[9] A copula is defined as a distribution function on the $n$ dimensional unit cube. All marginal distributions are uniform on $[0, 1]$. Formally

$$C : [0,1]^n \rightarrow [0,1]$$

$$C(u^{(i)}) = u_i \quad \text{if} \quad u^{(i)} = (1, \ldots, 1, u_i, 1, \ldots, 1)$$

[10] For every $n$ dimensional hypercube in the unit hypercube the corresponding probability has to be non negative:

$$\sum_{i=0}^{2^n-1} (-1)^{\,n-\sum_{k=1}^{n} j_k}\, C(u_1 + j_1\Delta_1, \ldots, u_n + j_n\Delta_n) \ge 0 \tag{1}$$

if $0 \le u_k \le u_k + \Delta_k \le 1$ for $k = 1, \ldots, n$, where $j_k \in \{0, 1\}$ and $i = \sum_{k=1}^{n} j_k 2^{k-1}$.

[11] Copulas and multivariate distributions are linked to each other by Sklar's theorem [Sklar, 1959]. Sklar proved that each multivariate distribution $F(t_1, \ldots, t_n)$ can be represented with the help of a copula:

$$F(t_1, \ldots, t_n) = C\left(F_{t_1}(t_1), \ldots, F_{t_n}(t_n)\right) \tag{2}$$

where $F_{t_i}(t)$ represents the $i$-th one dimensional marginal distribution of the multivariate distribution. If the distribution is continuous then the copula $C$ in (2) is unique. Copulas can be constructed from distribution functions, as described by Nelsen [1999]:

$$C(\mathbf{u}) = C(u_1, \ldots, u_n) = F\left(F_1^{-1}(u_1), \ldots, F_n^{-1}(u_n)\right) \tag{3}$$

[12] The advantage of using a copula is that it is invariant to strictly increasing monotonic transformations of the variables. Thus the frequent dilemma of whether or not to transform the data (for example by taking natural logarithms) does not occur in this case.

[13] An interesting and important property of a copula is whether the corresponding dependence is the same for high and low values. A bivariate copula expresses a symmetrical dependence if:

$$C(u, v) = C(1-u, 1-v) - 1 + u + v \tag{4}$$

which means that the copula density is symmetrical with respect to the secondary diagonal $u = 1 - v$ of the unit square and the copula density $c$ fulfills:

$$c(u, v) = c(1-u, 1-v) \tag{5}$$

[14] Readers interested in more general details are referred to Joe [1997], Nelsen [1999], or Salvadori et al. [2007].

2.2. Basic Hypothesis

[15] For geostatistical purposes we assume that the parameter of interest corresponds to the realization of a random function. For our study we restrict the random function to a finite set $N$. (The cardinality of this set $N$ can be arbitrarily large.)
The set $N$ should be a dense set of locations in the area of interest $D$ including the locations of the observations. As only a single realization is observed at a limited number of points, further assumptions on the random function have to be made.

[16] 1. Stationarity: for each set of points $\{x_1, \ldots, x_k\} \subset N$ and vector $h$ such that $\{x_1 + h, \ldots, x_k + h\} \subset N$, and for each set of possible values $w_1, \ldots, w_k$:

$$P\left(Z(x_1) < w_1, \ldots, Z(x_k) < w_k\right) = P\left(Z(x_1 + h) < w_1, \ldots, Z(x_k + h) < w_k\right) \tag{6}$$

This implies that for any two selected points at locations $x$ and $x + h$ in the investigated domain, the bivariate spatial copula $C_s$ corresponding to the random variables assigned to these points is assumed to be a function of the vector $h$ separating the two points. This copula is assumed to be independent of the location $x$:

$$C_s(h, u, v) = P\left(F_Z(Z(x)) < u, F_Z(Z(x+h)) < v\right) = C\left(F_Z(Z(x)), F_Z(Z(x+h))\right) \tag{7}$$

[17] 2. The parametrization of the copula should enable the multivariate copula corresponding to any selected $n$ points to reflect their spatial configuration.

[18] 3. The parametrization should allow arbitrarily strong dependence. This means that for a set of very close points in space, there should be a parametrization of the copula which is close to full dependence (Fréchet upper bound).

[19] Most of the frequently used high dimensional copulas described in the literature do not fulfill the above conditions. For example the multivariate Fairlie-Gumbel-Morgenstern copula does not fulfill the third condition, as the maximum possible correlation decreases with an increasing number of points. Elliptical copulas (based on elliptical distributions [Fang et al., 2002]) can also be used for the description of spatial variability. The reason for not taking them for our example is that their dependence is fully symmetrical [Abdous et al., 2005].

[20] In Bárdossy [2006], a non-Gaussian copula derived from the noncentered multivariate chi-square distribution was discussed. In this paper a similar procedure, namely first applying nonmonotonic transformations of the marginals and subsequently using equation (3), is adopted to construct a so-called v-transformed normal copula, which is also non-Gaussian and, like the Gaussian copula, fulfills all the above conditions.

[21] The two copulas (Gaussian copula and v-transformed normal copula) correspond to different kinds of multivariate dependence. The main difference between them is whether or not the dependence differs between high, medium, and low values. In the following, the construction of the v-transformed normal copula is presented.

2.3. The v-transformed Multivariate Normal Copula

[22] High dimensional copulas with nonsymmetrical dependence of their bivariate marginals can easily be obtained by nonmonotonic transformations of the multivariate elliptical distributions. Let $Y \sim N(\mathbf{0}, \Gamma)$ be an $n$ dimensional normal random variable with mean $\mathbf{0}^T = (0, \ldots, 0)$ and correlation matrix $\Gamma$. All marginals are supposed to have unit variance. Let $X$ be defined for each coordinate $j = 1, \ldots, n$ as:

$$X_j = \begin{cases} k\,(Y_j - m) & \text{if } Y_j \ge m \\ m - Y_j & \text{if } Y_j < m \end{cases} \tag{8}$$

[23] Here $k$ is a positive constant and $m$ is an arbitrary real number. When $k = 1$ this transformation leads to the multivariate noncentered chi-square distribution.
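The transformation in equation (8) can be tried out numerically. The following is a minimal sketch, not the authors' implementation: it applies the transformation to standard normal samples and compares the resulting empirical distribution with the closed form $\Phi(x/k + m) - \Phi(-x + m)$ that the paper states as equation (9). The function names, the sample size and the parameter values are illustrative assumptions.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf() and pdf()


def v_transform(y: float, m: float, k: float) -> float:
    """Equation (8): k * (y - m) on the upper branch, m - y on the lower one."""
    return k * (y - m) if y >= m else m - y


def marginal_cdf(x: float, m: float, k: float) -> float:
    """P(X < x) = Phi(x / k + m) - Phi(-x + m) for x > 0, and 0 otherwise."""
    if x <= 0.0:
        return 0.0
    return STD_NORMAL.cdf(x / k + m) - STD_NORMAL.cdf(-x + m)


def empirical_cdf(x: float, m: float, k: float,
                  n_samples: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(X < x), using a seeded generator."""
    rng = random.Random(seed)
    hits = sum(v_transform(rng.gauss(0.0, 1.0), m, k) < x
               for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    m, k = 0.5, 2.0  # hypothetical parameter values
    for x in (0.5, 1.0, 2.0):
        print(x, marginal_cdf(x, m, k), empirical_cdf(x, m, k))
```

Both branches of the transformation map onto non-negative values, which is why two different values of $Y_j$ contribute to each value of $X_j$ and why the distribution function is a difference of two normal terms.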
All one dimensional marginals of $X$ are identical and have the distribution function:

$$H_1(x) = P(X < x) = \Phi\left(\frac{x}{k} + m\right) - \Phi(-x + m) \tag{9}$$

and density:

$$h_1(x) = \frac{1}{k}\,\varphi\left(\frac{x}{k} + m\right) + \varphi(-x + m) \tag{10}$$

$\Phi$ being the distribution function of the standard normal distribution and $\varphi$ the corresponding density function. The corresponding multivariate distribution function is:

$$H_n(x_1, \ldots, x_n) = P(X_1 < x_1, \ldots, X_n < x_n) = \sum_{i=0}^{2^n-1} (-1)^{\sum_{j=1}^{n} i_j}\, \Phi_\Gamma(\mathbf{z}_i + \mathbf{m}) \tag{11}$$

where $\Phi_\Gamma$ is the distribution function of the $n$ dimensional normal distribution with correlation matrix $\Gamma$, $\mathbf{m} = (m, \ldots, m)^T$,

$$\mathbf{z}_i^T = \left(b\big((-1)^{i_1}\big)\,x_1, \ldots, b\big((-1)^{i_n}\big)\,x_n\right) \tag{12}$$

$$i = \sum_{j=1}^{n} i_j 2^{j-1}, \quad i_j = 0 \text{ or } 1 \tag{13}$$

and

$$b(x) = \begin{cases} -1 & \text{if } x < 0 \\ \dfrac{1}{k} & \text{if } x > 0 \end{cases} \tag{14}$$

[24] Thus the density $h_n$ has the form:

$$h_n(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\,|\Gamma|^{1/2}} \sum_{i=0}^{2^n-1} \left(\frac{1}{k}\right)^{n-\sum_{j=1}^{n} i_j} \exp\left(-\frac{1}{2}\,(\mathbf{z}_i + \mathbf{m})^T\, \Gamma^{-1}\, (\mathbf{z}_i + \mathbf{m})\right) \tag{15}$$

[25] The density of the v-transformed multivariate normal copula can be calculated from the joint and marginal densities:

$$c_n(u_1, \ldots, u_n) = \frac{h_n(x_1, \ldots, x_n)}{h_1(x_1)\,h_1(x_2) \cdots h_1(x_n)} \tag{16}$$

with $u_i = H_1(x_i)$. As mentioned before, when $k = 1$ the transformation is a symmetrical modulus transformation, therefore the noncentered chi-square copula is a special case of the v-transformed multivariate normal copula for $k = 1$. Furthermore, the effect of the nonmonotonic transformation vanishes with $m \rightarrow \pm\infty$ and the resulting copula converges to a Gaussian copula. In this sense, the Gaussian copula is also a special case of the v-transformed normal copula for $m = \infty$. Figure 1 shows the bivariate copula densities for different combinations of the values of $k$, $m$ and the correlation $\rho$. Notice that when $k = 1$ it reverts to a chi-square copula, and when $k = 1$ and $m = 4$ it approaches a Gaussian copula. This copula is related to the Gaussian split random variable [Abdous et al., 2003].

[26] The v-transformed normal copula has the strongest dependence for the high values. The opposite dependence structure can be obtained by simply taking the copula:

$$c'_n(u_1, \ldots, u_n) = c_n(1 - u_1, \ldots, 1 - u_n) \tag{17}$$

Figure 1. Bivariate v-transformed normal copula density with different combinations of $k$, $m$, and $\rho$ values.

3. Parameter Estimation

[27] The multivariate Gaussian copula is identified through its correlation matrix $\Gamma$. The v-transformed normal copula is parametrized by the two transformation parameters $m$ and $k$ and the correlation matrix $\Gamma$ of the corresponding normal variable $Y$. Because of the stationarity, the correlation between any two points can be written as a function of the separating vector $h$. Then for any set of observations $x_1, \ldots, x_n$ the correlation matrix $\Gamma$ of the normal variable $Y$ can be written as:

$$\Gamma = \left(\rho_{i,j}\right)_{1,1}^{n,n} \tag{18}$$

where $\rho_{i,j}$ only depends on the vector $h$ separating the points $x_i$ and $x_j$:

$$\rho_{i,j} = R(x_i - x_j) = R(h) \tag{19}$$

[28] The correlation function is a positive linear combination of valid correlation functions:

$$R(h) = \sum_{j=0}^{J} D_j\, \rho_j(h, L_j), \qquad \sum_{j=0}^{J} D_j = 1 \tag{20}$$

[29] Here $\rho_j(h, L)$ denotes a valid (positive definite) correlation function such as the exponential or the correlation function corresponding to the spherical variogram. The parameter $L$ stands for the range (correlation length) of the correlation function. The number of functions $J$ is usually 2 or 3. One of the functions usually describes the pure random fluctuations ($L = 0$) while the others describe the variability on different scales. For the Gaussian copula the parameters of the correlation function (19) have to be estimated. For the v-transformed normal copula the parameters of the correlation function (19) corresponding to $Y$ and the transformation parameters ($m$ and $k$) have to be assessed on the basis of the available observations. Notice that in the derivation of the v-transformed normal copula, all the components of the vector $\mathbf{m}$ are assumed to be equal.

[30] In geostatistics, parameters are often estimated on the basis of empirical variograms or spatial covariance functions. The problem with these approaches is that the estimation of these functions is not based on independent samples (as each observation enters a number of pairs). An approach using different numbers of points is adopted here. The observation set is divided into subsets of different sizes. For each of these subsets and a given parametrization of the copula the likelihood can be obtained. As the sets are disjoint, the overall likelihood is the product of the individual ones.
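The idea of multiplying copula densities over disjoint subsets can be sketched in a few lines. The example below is a deliberately reduced illustration under several assumptions that are not taken from the paper: all subsets have size two, the copula is Gaussian, the correlation model is a nugget plus one exponential component (a two-term instance of equation (20)), and the subsets are built by a greedy nearest-neighbor pairing. All function names are hypothetical, and the observation locations are assumed to be distinct.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def correlation(h: float, nugget: float, length: float) -> float:
    """R(h) = D0 * r0(h) + D1 * exp(-h / L) with D0 = nugget and D0 + D1 = 1.

    r0 is the pure nugget term: 1 at h = 0 and 0 otherwise.
    """
    if h == 0.0:
        return 1.0
    return (1.0 - nugget) * math.exp(-h / length)


def log_gaussian_copula_density(u: float, v: float, rho: float) -> float:
    """Log-density of the bivariate Gaussian copula with correlation rho."""
    a, b = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    one_minus = 1.0 - rho * rho
    return (-0.5 * math.log(one_minus)
            - (rho * rho * (a * a + b * b) - 2.0 * rho * a * b)
            / (2.0 * one_minus))


def make_disjoint_pairs(points):
    """Greedy nearest-neighbor pairing; every point is used at most once."""
    remaining = list(range(len(points)))
    pairs = []
    while len(remaining) >= 2:
        i = remaining.pop(0)
        j = min(remaining, key=lambda c: math.dist(points[i], points[c]))
        remaining.remove(j)
        pairs.append((i, j))
    return pairs


def log_likelihood(points, u_values, pairs, nugget, length):
    """Sum of log copula densities, i.e. the log of the product over subsets."""
    total = 0.0
    for i, j in pairs:
        rho = correlation(math.dist(points[i], points[j]), nugget, length)
        total += log_gaussian_copula_density(u_values[i], u_values[j], rho)
    return total


if __name__ == "__main__":
    # Hypothetical locations and marginal CDF values F_Z(z) of the observations.
    points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    u_values = [0.9, 0.85, 0.2, 0.25]
    pairs = make_disjoint_pairs(points)
    # Crude grid search over the correlation length for a fixed nugget.
    best = max((0.5, 1.0, 2.0, 5.0, 10.0),
               key=lambda L: log_likelihood(points, u_values, pairs, 0.1, L))
    print(pairs, best)
```

Working with the sum of log-densities rather than the product itself avoids numerical underflow when the number of subsets is large. Larger subsets would require the multivariate copula density and a numerical optimizer in place of the grid search.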
Each subset $S_k$ consists of $n(k) \ge 2$ (not necessarily equal) observations:

$$S_k = \left\{x_{k,1}, \ldots, x_{k,n(k)}\right\}, \quad k = 1, \ldots, K \tag{21}$$

[31] They are disjoint:

$$S_k \cap S_j = \emptyset \quad \text{if } k \ne j \tag{22}$$

and cover all observation points $S$:

$$\bigcup_{k=1}^{K} S_k = S \tag{23}$$

[32] For any of the sets $S_k$, the likelihood of the parameter vector (for the v-transformed copula $\theta = (m, k, D_0, D_1, L_1, \ldots)$) can be calculated by calculating the corresponding copula density:

$$c(S_k; \theta) = c\left(F_Z(Z(x_{k,1})), \ldots, F_Z(Z(x_{k,n(k)})); \theta\right) \tag{24}$$

[33] Then the likelihood

$$L\left(\theta \mid Z(x_1), \ldots, Z(x_n)\right) = \prod_{k=1}^{K} c(S_k; \theta) \tag{25}$$

can be maximized with respect to the parameters $\theta$.

[34] The sets $S_k$ can be selected randomly in order to avoid possible biases caused by preferential sampling. However, taking sets with small diameters ($\max_{i,j} d(x_{k,j}, x_{k,i}) < D$) is reasonable as the interpolation is carried out using close points.

4. Interpolation

[35] The purpose of interpolation is to estimate the value of the random function $Z$ at locations $x$ without observations. Linear estimators are used in kriging to obtain a good estimator of the expected value at the unsampled location. Copulas allow the estimation of the full conditional distribution of the variable $Z$ at the required site $x$:

$$F_n(x, z) = P\left(Z(x) < z \mid Z(x_1) = z_1, \ldots, Z(x_n) = z_n\right) \tag{26}$$

[36] Here $n$ is the total number of observations. For a multivariate distribution this can be written with the help of the corresponding conditional copula $C_{x,n}$:

$$F_n(x, z) = C_{x,n}\left(F_Z(z) \mid u_1 = F_Z(z_1), \ldots, u_n = F_Z(z_n)\right) \tag{27}$$

[37] In order to reduce computational problems (the calculation of the v-copula would require the calculation of a sum of $2^n$ terms), conditional distributions are restricted to local neighborhoods:

$$F_{n(x)}(x, z) = C_{x,n(x)}\left(F_Z(z) \mid u_1 = F_Z\big(z(x_{n(1)})\big), \ldots, u_{n(x)} = F_Z\big(z(x_{n(x)})\big)\right) \tag{28}$$

where $n(x) \le n$ and the points $x_{n(1)}, \ldots, x_{n(x)}$ are observations in the neighborhood of $x$. This restriction does not have a strong influence on the results, as demonstrated in section 5.3.

4.1. General Procedure of Interpolation Using Copulas

[38] The general procedure of interpolation using copulas is as follows.

[39] 1. The observed values $z_i$ are transformed to $z'_i$ by

$$z'_i = H_{Z'}^{-1}\left(F_Z(z_i)\right) \tag{29}$$

where $Z$ indicates the variable assigned to the observations and $F_Z$ indicates the corresponding marginal distribution, while $Z'$ denotes the transformed variable and $H_{Z'}$ denotes the corresponding univariate distribution specific to the multivariate joint distribution which is used to generate the applied copula model (see equation (3)).

[40] 2. The $n$ observation points closest to the interpolation point $x$ are selected, and the transformed values $z'_i$ of their observed values are used to calculate the conditional copula density at the interpolation point over the whole range of values of $u$ ($u \in [0, 1]$). It can be derived that the conditional copula density is:

$$c_{x,n}\left(u \mid u_1 = H_{Z'}(z'_1), \ldots, u_n = H_{Z'}(z'_n)\right) = \frac{h_{n+1}(z', z'_1, \ldots, z'_n)}{h_1(z')\,h_1(z'_1) \cdots h_1(z'_n)} \cdot \frac{1}{c_n(u_1, u_2, \ldots, u_n)} \tag{30}$$

where $c_{x,n}$ is the conditional copula density at location $x$ conditioned on $n$ observations, $h_{n+1}$ denotes the $n + 1$ dimensional joint density of the transformed variable $Z'$, $h_1$ denotes the corresponding marginal density and $c_n$ indicates the multivariate copula density corresponding to the $n$ observations. Since $c_n$ is constant over the range of $u$ at one target point and does not influence the estimation results calculated in the following step, the second term of equation (30) can be neglected. Thus the following equation is often used instead:

$$c_{x,n}\left(u \mid u_1 = H_{Z'}(z'_1), \ldots, u_n = H_{Z'}(z'_n)\right) \propto \frac{h_{n+1}(z', z'_1, \ldots, z'_n)}{h_1(z')\,h_1(z'_1) \cdots h_1(z'_n)} \tag{31}$$

For Gaussian copulas, however, it is computationally more efficient to use equation (32) instead of equation (31):

$$c_{x,n}\left(u \mid u_1 = H_{Z'}(z'_1), \ldots, u_n = H_{Z'}(z'_n)\right) \propto \frac{1}{\sqrt{|\Gamma|}} \exp\left(-\frac{1}{2}\, \mathbf{y}^T \left(\Gamma^{-1} - I\right) \mathbf{y}\right) \tag{32}$$

where $\mathbf{y} = (z', z'_1, \ldots, z'_n)$ and $\Gamma$ is the correlation matrix of the normal variable.

Figure 2. Configuration of the interpolation examples and the corresponding CDF values for input 1 (left) and input 2 (right).

[41] 3. The expected value $z^*_a$ and the median value $z^*_m$ are typical statistics calculated as interpolators. However, as the full conditional distribution is calculated, other estimators can be defined too. The former is calculated from the observed values weighted by the corresponding conditional copula using equation (33). The latter is the observed value corresponding to the 50% quantile of the conditional copula and is calculated by equation (34).

$$z^*_a = \int_0^1 F_Z^{-1}(u)\; c_{x,n}\left(u \mid u_1 = H_{Z'}(z'_1), \ldots, u_n = H_{Z'}(z'_n)\right) du \tag{33}$$

$$z^*_m = F_Z^{-1}(u), \quad u = C_{x,n}^{-1}(0.5) \tag{34}$$

[42] In this procedure, the relation between the observation and interpolation points is calculated as pure dependence using the spatial copulas, irrespective of the marginal distribution of the observed values. Since the transformation made in step 1 is rank preserving, the underlying copula of the variables remains the same. The dependence of the original variables can therefore be modeled by the assumed theoretical copula. The calculation should be done via the transformed variables because they possess the marginal distributions which are required in equations (31) and (32). In the last step, the original marginal distribution is reproduced by its substitution in equation (33) or equation (34) when the interpolators are estimated. During this final substitution, the ranks are again preserved.

Figure 3. Distribution density of estimated quantile values from Gaussian copula (left hand side) and v-transformed normal copula (right hand side).
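The three steps of the general procedure can be sketched for the Gaussian copula, where the conditional distribution of the normal score at the target is itself normal. The code below is an illustrative sketch under stated assumptions rather than the authors' implementation: the inputs are already the marginal CDF values $F_Z(z_i)$ of the neighbors, the correlation function `corr` and the inverse marginal `inv_marginal` are supplied by the caller, the target does not coincide with an observation, and the integral in equation (33) is approximated with a midpoint rule. All names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def conditional_normal_score(target, points, u_obs, corr):
    """Mean and standard deviation of the target's normal score (steps 1-2)."""
    y = [STD_NORMAL.inv_cdf(u) for u in u_obs]  # step 1: z'_i = Phi^-1(F_Z(z_i))
    gamma = [[corr(math.dist(p, q)) for q in points] for p in points]
    r0 = [corr(math.dist(target, p)) for p in points]
    w = solve(gamma, r0)
    mean = sum(wi * yi for wi, yi in zip(w, y))
    var = 1.0 - sum(wi * ri for wi, ri in zip(w, r0))
    return mean, math.sqrt(var)


def conditional_copula_density(u, mean, sd):
    """Density of U at the target given the neighbors (Gaussian copula)."""
    y0 = STD_NORMAL.inv_cdf(u)
    return NormalDist(mean, sd).pdf(y0) / STD_NORMAL.pdf(y0)


def interpolate(target, points, u_obs, corr, inv_marginal, n_grid=2000):
    """Step 3: expected value (eq. 33, midpoint rule) and median (eq. 34)."""
    mean, sd = conditional_normal_score(target, points, u_obs, corr)
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    expected = sum(inv_marginal(u) * conditional_copula_density(u, mean, sd)
                   for u in grid) / n_grid
    median = inv_marginal(STD_NORMAL.cdf(mean))
    return expected, median


if __name__ == "__main__":
    # Hypothetical setup: four neighbors, exponential correlation, Exp(1) marginal.
    pts = [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (-1.0, 1.0)]
    u_obs = [0.9, 0.3, 0.1, 0.7]
    print(interpolate((0.0, 0.0), pts, u_obs,
                      corr=lambda h: math.exp(-h / 3.0),
                      inv_marginal=lambda u: -math.log(1.0 - u)))
```

The same marginal-free conditional density serves any marginal distribution: only `inv_marginal` changes when the estimator is mapped back to the scale of the observations.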

Figure 4. Distribution density of estimated real values (with superposition of the marginal distributions) from Gaussian copula (left hand side) and v-transformed normal copula (right hand side).

4.2. Interpolation Using the Normal Copula

[43] If one uses the copula of the multivariate normal distribution for interpolation, then this procedure is practically identical to interpolation using the normal score transformation.

4.3. Interpolation Using the v-transformed Normal Copula

[44] The interpolation with the v-transformed copula is done by inserting equations (10) and (15) into equation (31) to calculate the conditional distribution of the target variable at the target location. Because of the number of summands in (15), only interpolation using a limited neighborhood can be carried out.

4.4. Properties of the Copula Based Interpolation

[45] The copula based interpolation shares many properties with kriging, but it also has some additional ones.

[46] 1. The interpolation is exact; for every observation the interpolation returns the observed value.

[47] 2. The interpolation depends on the configuration of the observations and the target point.

[48] 3. The interpolation delivers full conditional distributions, thus confidence intervals can be identified.

[49] 4. The conditional distribution depends mainly on the observations near the target point. Distant observations have a minor effect on the calculated conditional distribution.

[50] 5. The interpolator depends on the configuration of the observations, the observed values and the marginal distribution.
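The insertion of equations (10) and (15) into equation (31) can be written out for the smallest possible neighborhood, a single neighbor, where the sum in (15) has only four terms and the conditional copula density equals the bivariate copula density. The sketch below makes that simplifying assumption; the bisection used to invert equation (9) and all function names are illustrative choices, not part of the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def h1(x, m, k):
    """Marginal density of the transformed variable, equation (10)."""
    return STD_NORMAL.pdf(x / k + m) / k + STD_NORMAL.pdf(-x + m)


def H1(x, m, k):
    """Marginal distribution function, equation (9), for x >= 0."""
    return STD_NORMAL.cdf(x / k + m) - STD_NORMAL.cdf(-x + m)


def H1_inverse(u, m, k):
    """Invert H1 by bisection for 0 < u < 1 (H1 is increasing for x > 0)."""
    lo, hi = 0.0, 1.0
    while H1(hi, m, k) < u:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if H1(mid, m, k) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def binormal_pdf(a, b, rho):
    """Density of the standard bivariate normal with correlation rho."""
    q = (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))


def h2(x0, x1, m, k, rho):
    """Equation (15) for n = 2: one term per combination of branches."""
    total = 0.0
    for y0, jac0 in ((x0 / k + m, 1.0 / k), (m - x0, 1.0)):
        for y1, jac1 in ((x1 / k + m, 1.0 / k), (m - x1, 1.0)):
            total += jac0 * jac1 * binormal_pdf(y0, y1, rho)
    return total


def v_copula_density(u0, u1, m, k, rho):
    """Bivariate v-transformed copula density, equation (16) with n = 2.

    With one neighbor this is also the conditional density of u0 given u1.
    """
    x0, x1 = H1_inverse(u0, m, k), H1_inverse(u1, m, k)
    return h2(x0, x1, m, k, rho) / (h1(x0, m, k) * h1(x1, m, k))


if __name__ == "__main__":
    # Hypothetical parameters; compare the density in the two corners.
    print(v_copula_density(0.9, 0.9, 0.0, 1.0, 0.8),
          v_copula_density(0.1, 0.1, 0.0, 1.0, 0.8))
```

Comparing the density near the upper and lower corners of the unit square shows the asymmetry that distinguishes this copula from the Gaussian one. With $n$ neighbors the double loop becomes a loop over $2^{n+1}$ branch combinations, which illustrates why the neighborhood has to stay small.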
[51] In the next section properties 4 and 5 are illustrated with the help of two hypothetical examples.

4.5. Examples of Interpolation Using Different Copula Models

[52] It must be pointed out that, unlike in the traditional kriging methods, the way in which the dependence varies across the distribution plays a significant role in the estimation using copulas. The following hypothetical examples are used to illustrate this point.

[53] In the examples, the value at one target point is estimated from the four surrounding points. Figure 2 shows the configuration of the test field. Two sets of cumulative probability values of the hypothetical sampled locations are given as input information, which are listed separately in the subfigures of Figure 2.

[54] From the symmetry of the configuration and the equal average of the two input sets, it is easy to infer that kriging gives the same estimator, namely 0.5, for both cases. However, what do the copulas produce? To investigate this problem, a normal copula and a v-transformed normal copula are used to interpolate the value at the target point.

[55] First, the density functions of the uniform variable $U$ at the target point $x$ are estimated by the two copula models and can be compared in Figure 3. The dashed line corresponds to the first input set, while the solid line corresponds to the second one. The expression of the estimated density function $f_n$ is shown in equation (35). In this case $n = 4$.

$$f_n(x, u) = c_{x,n}\left(u \mid U_1 = u_1, \ldots, U_n = u_n\right) \tag{35}$$

[56] When the Gaussian copula is applied, the resulting curves are symmetrical and coincide. There are three main reasons for this. First, the configuration of the test field is symmetrical. Second, for both input sets, the two values greater than 0.5 and the two values less than 0.5 are mirrored with respect to 0.5. Last but not least, the dependence structure represented by Gaussian copulas is symmetrical. Therefore the influences of the large and the small values balance out and produce the middle value 0.5 as the best estimator.

[57] In contrast, when the v-transformed normal copula is used, the difference between the estimated density curves of the two cases is prominent. This can be explained by the asymmetrical dependence structure represented by the v-transformed normal copula, which in this test has its maximum at the upper corner. This means that the large values have a stronger association and the density curves are no longer symmetrical. Moreover, the distributions are more skewed to the upper tail for the second case, since the gaps between the large and small input values are much wider than in the first case.

[58] Next, the estimated quantile values are transformed to the hypothetical observed values. In other words, the uniform variable $U$ is transformed to the random variable $Z$ which is assigned to the hypothetical data set and assumed to follow a certain distribution.

[59] In the following test, two marginal distributions, namely a normal distribution ($N(1, 1)$) and an exponential distribution ($E(1)$), are used for both input sets in order to show the influence of the marginal distributions on the interpolation results.

[60] The first panel in Figure 4 shows the estimated density curves of $Z$ at the target location obtained from the Gaussian copula, while the second panel shows those obtained from the v-transformed normal copula. From these curves, estimators of any statistic (the median, the mode, etc.) can be obtained. It can be observed that, for the Gaussian copula, the density curves differ for the same input CDF set with different marginal distributions, but coincide for different input CDF sets with the same marginal distribution. In comparison, for the v-transformed normal copula, all the resulting curves differ from each other.

[61] Taken together, interpolation using copulas is influenced not only by the configuration, but also by the marginal distribution and by the dependence structure, which varies across the whole distribution. By considering more facets and providing the whole estimation distribution instead of one estimator, copulas offer a more informative way of interpolation than the traditional kriging methods.

Figure 5. Groundwater quality observation network in Baden-Württemberg. Locations marked with green triangles are used for the split sampling validation of the interpolation methods.

Table 1. Statistics of the Five Groundwater Quality Parameters. Columns: pH, Sulfate, Nitrate, Chloride, Dissolved Oxygen. Rows: Mean, Median, Standard deviation, Skewness. [numerical values not preserved in this copy]

5. Application

[62] An extensive data set consisting of more than 2500 measurements of groundwater quality parameters of the near surface groundwater layer in Baden-Württemberg was used to illustrate the methodology. Five quality parameters, namely chloride, nitrate, pH, sulfate and dissolved oxygen, were selected for this study. A geostatistical investigation of the data was carried out and reported by Bárdossy et al. [1997] and Bárdossy et al. [2003]. Figure 5 shows the groundwater quality observation network. Table 1 shows the basic statistics for the five selected groundwater quality parameters. Note the high skewness of chloride and sulfate.

5.1. Parameter Estimation

[63] In this application, no marked anisotropy was observed, thus only the isotropic case was dealt with. Therefore the scalar $h$ was used instead of the vector $h$.

[64] Parameter estimation was carried out by partitioning the observation set into subsets $S_k$ and subsequently maximizing the likelihood. Partition sets with different numbers of observations were selected. The diameter of the sets was kept small, as only neighboring observations are considered for interpolation. The likelihood function corresponding to the selected partition was numerically maximized. Seven parameters $m$, $k$, $D_0$, $D_1$, $D_2$, $L_1$, $L_2$ were optimized. The correlation length $L_0 = 0$, which means that $D_0$ represents the so-called nugget effect. The parameters depend only very slightly on the selection of $S_k$. Figure 6 shows the logarithm of the likelihood function for nitrate as a function of $m$ and $k$ (optimized for the remaining five parameters) for two different partitions of the observations. For reasons of symmetry only the case $m \ge 0$ is shown. Note the similar structure of the likelihood function. The maximum is located in the same region in both cases. The lower likelihoods corresponding to values $m > 3$ indicate that the v-transformed normal copula fits the observed data better than the normal copula. Similar behavior was observed for all other groundwater quality parameters too.

[65] The model parameters for the v-transformed normal copula are listed in Table 2. Notice that $D_0$ refers to the sill of a nugget effect component, while $D_1$, $D_2$, $L_1$, and $L_2$ refer to the sills and ranges of two spherical components of the complex correlation model for $R(h)$. A general form of the complex correlation model is described in equation (20) in section 3. The asymmetry of the dependence is denoted as negative if the multivariate copula had to be turned upside down as described in equation (17).

5.2. Empirical Copulas

[66] Empirical copulas can be used to explore spatial dependence. Because of the stationarity assumption a copula can be assigned to a given configuration of points. The simplest case is the construction of bivariate empirical copulas. Higher dimensional empirical copulas are difficult to visualize, and require a large number of observations. However, for the case of training images considered in multipoint geostatistical methods [Journel and Zhang, 2006] higher dimensional empirical copulas can also be assessed. The calculation of empirical bivariate copulas was already described by Bárdossy [2006].
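For orientation, a bivariate empirical copula density for a given separation distance can be approximated with a simple histogram of rank-transformed pairs. The sketch below is a generic construction written for illustration and should not be read as the exact procedure of Bárdossy [2006]: the rank transform $u_i = \text{rank}_i / (n + 1)$, the distance tolerance `tol`, the number of bins and the assumption of untied values are all choices made here.

```python
import math


def rank_transform(values):
    """Map observations to (0, 1) through their ranks: u_i = rank_i / (n + 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    u = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (len(values) + 1)
    return u


def empirical_copula_density(points, values, h, tol, n_bins=5):
    """Histogram estimate of the bivariate copula density at separation ~h."""
    u = rank_transform(values)
    counts = [[0] * n_bins for _ in range(n_bins)]
    n_pairs = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(math.dist(points[i], points[j]) - h) <= tol:
                a = min(int(u[i] * n_bins), n_bins - 1)
                b = min(int(u[j] * n_bins), n_bins - 1)
                # Each pair is entered in both orders so that the
                # estimate is symmetric in its two arguments.
                counts[a][b] += 1
                counts[b][a] += 1
                n_pairs += 2
    if n_pairs == 0:
        raise ValueError("no pairs found at this separation distance")
    cell_area = 1.0 / (n_bins * n_bins)
    return [[c / (n_pairs * cell_area) for c in row] for row in counts]


if __name__ == "__main__":
    # Hypothetical transect with steadily increasing values.
    points = [(float(i), 0.0) for i in range(50)]
    values = [float(i) for i in range(50)]
    for row in empirical_copula_density(points, values, h=1.0, tol=0.1):
        print([round(c, 2) for c in row])
```

A density value above 1 in a cell indicates that pairs fall there more often than they would under independence. Because every observation enters many pairs, the pairs are strongly dependent, so such a histogram is an exploratory tool rather than a basis for a formal goodness-of-fit test.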
Empirical copulas were calculated for all groundwater quality parameters for a set of different separation distances. The theoretical bivariate marginal copulas corresponding to the parameters listed in Table 2 were also derived. Figure 7 shows a pair of empirical and theoretical copulas for pH at a separation distance of 6 km. The two copulas are very similar. Note that the empirical bivariate marginal copulas were not used for the parameter estimation. The comparison cannot be used directly to judge the goodness of fit, as the empirical copulas were estimated from a highly dependent sample.

Table 2. Estimated Model Parameters for v-transformed Normal Copula
Rows: m, k, D_0, D_1, D_2, L_1 (m), L_2 (m), Asymmetry
Columns: Chloride, Nitrate, pH, Sulfate, Dissolved Oxygen
Asymmetry: positive (Chloride), positive (Nitrate), negative (pH), negative (Sulfate), negative (Dissolved Oxygen)

Neighborhood Selection

[67] As mentioned in section 4.4, if the number of observations n used in the interpolation is too large, calculating the conditional copula density becomes computationally demanding. However, a sufficient number of neighboring points should be kept in order to provide a good estimator. On the other hand, too many points can be redundant because of a screening effect similar to that in kriging. A test of neighborhood selection was therefore carried out. As shown in Figure 8, the estimated density functions are already nearly identical for n = 9. It therefore suffices to choose any number of neighboring points greater than nine. In the investigation of groundwater quality in Baden-Württemberg, n = 12 neighbors were chosen for the interpolation of all the groundwater quality parameters.
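The effect of the neighborhood size can be explored with a small numerical experiment. The sketch below is restricted to the Gaussian copula, for which the conditional distribution in normal-score space is again normal, so that its mean and variance summarize it completely. The exponential correlation function and its range are stand-ins chosen for this example (the paper fits a nugget plus two spherical components), and the function names are hypothetical. The v-transformed copula requires evaluating the full conditional density and is not covered here.

```python
import numpy as np


def exp_corr(d, corr_range):
    # Hypothetical exponential correlation model, used only for illustration.
    return np.exp(-np.asarray(d, dtype=float) / corr_range)


def conditional_normal(obs_xy, obs_y, target_xy, n, corr_range=5.0):
    """Conditional mean and variance at target_xy from the n nearest normal scores."""
    obs_xy = np.asarray(obs_xy, dtype=float)
    obs_y = np.asarray(obs_y, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)

    # Distances from every observation to the target; keep the n nearest.
    d_target = np.linalg.norm(obs_xy - target_xy, axis=1)
    idx = np.argsort(d_target)[:n]
    xy = obs_xy[idx]

    # Correlations among the neighbours and between neighbours and target.
    d_obs = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    R = exp_corr(d_obs, corr_range)
    r = exp_corr(d_target[idx], corr_range)

    # Standard Gaussian conditioning: weights solve R w = r.
    w = np.linalg.solve(R, r)
    mean = float(w @ obs_y[idx])
    var = float(1.0 - w @ r)
    return mean, var
```

Calling `conditional_normal` for increasing n (for example 3, 6, 9, 12) at a fixed target shows how quickly the conditional mean and variance stabilize; once additional neighbors change them only marginally, those neighbors are effectively screened.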

Figure 6. Contour plot of the likelihood as a function of the parameters m and k of the v-transformed normal copula for nitrate using 6 point neighborhoods (left) and 11 point neighborhoods (right).

Figure 7. Bivariate marginal copula densities corresponding to the separation distance of 6 km for pH: empirical (left) and theoretical (right).

5.4. Interpolation

[68] Interpolation was carried out on a regular grid of 300 m × 300 m resolution for the whole state of Baden-Württemberg. Four different interpolators were considered.

[69] 1. Interpolation with the multivariate v-transformed normal copula.
[70] 2. Interpolation with the multivariate Gaussian copula.
[71] 3. Ordinary kriging.
[72] 4. Indicator kriging. Note that different variograms were used at the different thresholds b. The relationship between the indicator variogram corresponding to a cutoff b and the bivariate marginal copula,

$$\gamma_b(h) = F_Z(b) - C_S\big(h, F_Z(b), F_Z(b)\big) \qquad (36)$$

derived by Bárdossy [2006], is used for this purpose. Here C_S(h, u, v) denotes the theoretical bivariate marginal copula corresponding to the vector h.

[73] Figure 9 shows the interpolation maps obtained from the Gaussian copula and the v-transformed normal copula for
nitrate. The interpolated maps are different; however, because of the large number of observation points and the small scale of the figures, these differences are difficult to see. Figure 10 shows the interpolation maps obtained from the v-transformed normal copula for chloride using the mean and the median estimators.

[74] To make the comparison more visible, the differences between the maps obtained by the methods were also calculated. Figure 11 shows the maps of the differences between the v-transformed normal copula results and ordinary kriging for pH and sulfate, respectively. Note that the differences are concentrated in areas with high and low values. The reason is that the v-transformed copula interpolation treats high and low values differently, while in ordinary kriging all values are treated the same way (kriging weights do not depend on the values, only on the locations and the variogram).

Figure 8. Distribution densities of estimated quantile values (left hand side) and real values (right hand side) at one location of the studied field using different numbers of neighboring points for interpolation.

Figure 9. Interpolation maps for nitrate using Gaussian copula (left) and v-transformed normal copula (right).

Figure 10. Interpolation maps for chloride using v-transformed normal copula mean (left) and median (right).

Figure 11. Estimation differences between different methods: ordinary kriging and v-transformed normal copula for pH (left) and sulfate (right).

Confidence Intervals

[75] Since interpolation using copulas provides the estimation distribution of the parameters of interest, confidence intervals of the estimators can also be obtained. In
Figure 12 the widths of the 90% confidence intervals are shown for the parameters dissolved oxygen and pH. They were calculated by subtracting the estimators corresponding to the 5th percentile from the estimators corresponding to the 95th percentile. One can see that near the observations the confidence intervals are narrow, while in other regions they become much wider. The structures of the two maps are very different even though the observation locations are the same. The highest uncertainty of the pH estimates is in the Black Forest area (west side of the state of Baden-Württemberg), where acidification of the groundwater body is a serious problem. In contrast, the uncertainty corresponding to dissolved oxygen is low in the same area, as it is unproblematic for this parameter. The estimation variances calculated using ordinary kriging, by comparison, would lead to very similar patterns for both parameters. The differences between the two copula-based maps are the consequence of the very different dependence structures of the two parameters.

Figure 12. Length of the 90% confidence intervals obtained from the v-transformed normal copula for dissolved oxygen (left) and pH (right).

Cross Validation and Split Sampling Results

[76] In order to test how well the model fits the observed data, cross validation was carried out for the four selected interpolation methods. For the multivariate v-transformed normal copula two estimators, the mean and the median, were considered.

[77] Further, a dense, uniformly distributed set was selected for a split sampling validation. Figure 5 shows the locations of the control set. Parameter values were estimated at each point of the control set using the rest of the set. The results were analyzed with different criteria. Besides the conventional mean absolute and root mean squared errors, two criteria in the probability space were considered:

[78] 1. MAE (mean absolute error).
[79] 2. RMSE (square root of the mean squared error).
[80] 3.
LTIF (linear error in probability space). The quantity L is the mean absolute difference between the estimated and the observed values in the probability space [Ward and Folland, 1991]:

$$L = \frac{1}{n} \sum_{i=1}^{n} \left| F_Z\big(z(x_i)\big) - F_Z\big(z^*(x_i)\big) \right| \qquad (37)$$

An advantage of this quantity is that it does not depend on the magnitude of the variable values. The possible values of L range from 0 (perfect) to 1 (improper).

[81] 4. Misclassification rate: for a selected percentile (85% in our case) the proportion of misclassified points with regard to the exceedance of the threshold was calculated. The reason for taking this measure is that for groundwater quality parameters the exceedance of a certain threshold is often of great interest. For the sake of comparability the same quantile was selected for all parameters.

[82] As one can see in Table 3, the copula-based approaches give better results than ordinary or indicator kriging. The mean is the best estimator in the sense of RMSE, while the median is the best for MAE. The v-transformed normal copula based estimators are better than those corresponding to the Gaussian copula. Results of the cross validation and the split sampling are very similar.

[83] As the copula-based approaches and indicator kriging provide an estimate of the full conditional distribution, these can also be verified. For this purpose the value of the conditional distribution function for the observation was
calculated for each point of the control set in the split sampling case:

$$u_i^* = F_{Z,x_i}\big(z(x_i)\big) \qquad (38)$$

[84] The u_i^* values should follow a uniform distribution. Figure 13 shows the distributions for the parameter pH for the v-transformed normal copula, the Gaussian copula and indicator kriging. One can see that the v-transformed normal copula leads to the most uniform-like distribution. A chi-square test was carried out for each parameter and interpolator to test whether the distribution can be assumed to be uniform. For three parameters (chloride, dissolved oxygen and pH) the hypothesis of a uniform distribution was not rejected for the v-transformed normal copula. For the Gaussian copula and indicator kriging the hypothesis was rejected at the 95% level for all parameters. For nitrate and sulfate the v-copula based results were also rejected, but the calculated chi-square values were much lower than those obtained for the other methods.

Table 3. Results of Cross Validation and Split Sampling for Different Interpolation Methods
Columns: Parameter; Method; Cross Validation (MAE, RMSE, LTIF, CLE); Split Sampling (MAE, RMSE, LTIF, CLE)
Parameters: Chloride, Nitrate, pH, Dissolved Oxygen, Sulfate
Methods (for each parameter): V mean, V median, Gauss, O. Kriging, I. Kriging

Figure 13. Histogram of the value of the observation in the estimated distribution function for pH using the v-transformed normal copula (left), the Gaussian copula (middle) and indicator kriging (right).

[85] As pointed out above, the confidence intervals corresponding to the estimators are very different. The cross validation approach was also used to validate the confidence intervals. For this purpose the number of observations falling into the 75% and the 90% confidence intervals was calculated. The results are shown in Table 4. As one can see
the most reliable confidence intervals are obtained by the v-copulas, followed by the normal copulas. The kriging estimation errors are nearly useless; they are a mere index of the spatial distribution of the observation points [Journel and Alabert, 1989]. The only reasonable confidence intervals for kriging are obtained for dissolved oxygen. The reason for this is the nearly normal distribution of the parameter. The indicator kriging intervals are far too optimistic.

Table 4. Results of Cross Validation for the Confidence Intervals of the Different Interpolation Methods
Columns: Parameter; v-copula (75%, 90%); N-copula (75%, 90%); O-Kriging (75%, 90%); I-Kriging (75%, 90%)
Rows: Chloride, Nitrate, pH, Dissolved Oxygen, Sulfate

6. Summary and Discussion

[86] In this paper an interpolation method based on copulas was introduced.

[87] 1. The proposed method is based on the separation of the marginal distributions from the dependence structure. The dependence structure is described using copulas.

[88] 2. An interpolation method was developed for two different parametric copulas, the normal and the v-transformed normal. The conditional distribution (conditioned on the observed values at the observation points) for unobserved locations was derived from the conditional copula and the marginal distribution.

[89] 3. The copula-based interpolator is nonlinear, and the estimated value depends both on the configuration of the observation and target points and on the observed values.

[90] 4. For the case study the copula-based interpolators gave better cross validation results than ordinary and indicator kriging.

[91] 5. The copula-based interpolation allows the calculation of confidence intervals. Cross validation results indicate that these intervals are more realistic than those based on the estimation variance obtained by ordinary or indicator kriging.
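The checks on the conditional distributions reported above can be sketched in a few lines of Python. Given the values of the estimated conditional distribution function at the observations (the u_i^* of equation (38)), the sketch computes a chi-square statistic against the uniform distribution and the fraction of observations that fall inside a central confidence interval. The function names and the number of histogram classes are assumptions made for this example; the paper does not state how many classes were used in its test.

```python
import numpy as np


def pit_chi_square(u, bins=10):
    """Chi-square statistic of the values u against the uniform distribution."""
    u = np.asarray(u, dtype=float)
    counts, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    expected = len(u) / bins  # equal expected count in every class
    return float(((counts - expected) ** 2 / expected).sum())


def interval_coverage(u, level):
    """Fraction of observations inside the central `level` interval.

    An observation lies between the (1 - level) / 2 and (1 + level) / 2
    quantiles of its conditional distribution exactly when its u value
    lies between those two probabilities.
    """
    u = np.asarray(u, dtype=float)
    lo = (1.0 - level) / 2.0
    hi = 1.0 - lo
    return float(np.mean((u >= lo) & (u <= hi)))
```

For a well-calibrated interpolator the statistic stays below the relevant chi-square quantile (with `bins - 1` degrees of freedom), and `interval_coverage(u, 0.75)` and `interval_coverage(u, 0.90)` are close to 0.75 and 0.90. Coverage well below the nominal level corresponds to intervals that are too optimistic.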
[92] The copula-based estimation can be extended to conditional stochastic simulation using direct sequential simulation, with the value at each point drawn from the conditional copula of that point.

[93] In this paper the marginal distributions were considered to be the same for each point of the domain under study. This condition can be relaxed by assigning different marginals corresponding, for example, to different geological formations or soft information.

[94] Conditional multivariate copulas corresponding to any finite set of points conditioned on n observations can be written explicitly. Equation (32) as well as equations (31) and (16) allow the specification of the copulas. This means that the conditional distribution of blocks of any shape and size can be approximated. The inverse problem, that is, the identification of the multivariate structure from block values and the consideration of block values for interpolation, is more difficult and could be assessed through simulation.

[95] Acknowledgments. Research leading to this paper was supported by the German Science Foundation (DFG) under grant GRK 1398 and BA 1150/12-1. The authors thank the reviewers for their constructive comments.

References

Abdous, B., K. Ghoudi, and B. Rémillard (2003), Nonparametric weighted symmetry test, Can. J. Stat., 31(4).
Abdous, B., C. Genest, and B. Rémillard (2005), Dependence properties of meta-elliptical distributions, in Statistical Modeling and Analysis for Complex Data Problems, edited by P. Duchesne and B. Rémillard, pp. 1-15, Springer, New York.
Bárdossy, A. (2006), Copula-based geostatistical models for groundwater quality parameters, Water Resour. Res., 42, W11416, doi: /2005WR
Bárdossy, A., U. Haberlandt, and J. Grimm-Strele (1997), Interpolation of groundwater quality parameters using additional information, geoENV I, Geostat. Environ. Appl., 47.
Bárdossy, A., H. Giese, J. Grimm-Strele, and K. Barufke (2003), SIMIK + GIS - implementierte Interpolation von Grundwasserparametern mit Hilfe von Landnutzungs- und Geologiedaten, Hydrol. Wasserbewirtsch., 47.
Embrechts, P., A. J. McNeil, and D. Straumann (2001), Correlation and Dependency in Risk Management: Properties and Pitfalls, Cambridge Univ. Press, New York.
Fang, H.-B., K.-T. Fang, and S. Kotz (2002), The meta-elliptical distributions with given marginals, J. Multivariate Anal., 82.
Favre, A.-C., S. E. Adlouni, L. Perreault, N. Thiémonge, and B. Bobée (2004), Multivariate hydrological frequency analysis using copulas, Water Resour. Res., 40, W01101, doi: /2003wr
Gomez-Hernandez, J., and X. Wen (1998), To be or not to be multi-Gaussian? A reflection on stochastic hydrogeology, Adv. Water Resour., 21.
Joe, H. (1997), Multivariate Models and Dependence Concepts, CRC Press, Boca Raton, Fla.
Journel, A. G., and F. Alabert (1989), Non-Gaussian data expansion in the Earth sciences, Terra Nova, 1.
Journel, A., and C. Huijbregts (1978), Mining Geostatistics, Elsevier, London.
Journel, A., and T. Zhang (2006), The necessity of a multi-point prior model, Math. Geol., 38.
Malevergne, Y., and D. Sornette (2003), Testing the Gaussian copula hypothesis for financial assets dependences, Quant. Finan., 3.
Nelsen, R. (1999), An Introduction to Copulas, Springer, New York.
Poulin, A., D. Huard, A.-C. Favre, and S. Pugin (2007), Importance of tail dependence in bivariate frequency analysis, J. Hydrol. Eng., 12.
Salvadori, G., C. D. Michele, N. Kottegoda, and R. Rosso (2007), Extremes in Nature: An Approach Using Copulas, Springer, New York.
Sklar, A. (1959), Fonctions de répartition à n dimensions et leurs marges, Publ. Inst. Stat. Paris, 8.
Ward, M., and C. Folland (1991), Prediction of seasonal rainfall in the Nordeste of Brazil using eigenvectors of sea-surface temperature, Int. J. Climatol., 11.

A. Bárdossy and J. Li, Institute of Hydraulic Engineering, University of Stuttgart, Pfaffenwaldring 61, D Stuttgart, Germany. (bardossy@iws.uni-stuttgart.de; jing.li@iws.uni-stuttgart.de)
