Online publication date: 04 September 2010



This article was downloaded by: [University of South Carolina] On: 11 January 2011 Access details: [subscription number ] Publisher Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: Registered office: Mortimer House, Mortimer Street, London W1T 3JH, UK

International Journal of Geographical Information Science
Publication details, including instructions for authors and subscription information:

Local entropy map: a nonparametric approach to detecting spatially varying multivariate relationships
Diansheng Guo a
a Department of Geography, University of South Carolina, Columbia, South Carolina, USA

Online publication date: 04 September 2010

To cite this Article Guo, Diansheng(2010) 'Local entropy map: a nonparametric approach to detecting spatially varying multivariate relationships', International Journal of Geographical Information Science, 24: 9, To link to this Article: DOI: / URL:

Full terms and conditions of use: This article may be used for research, teaching and private study purposes. Any substantial or systematic reproduction, re-distribution, re-selling, loan or sub-licensing, systematic supply or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.

International Journal of Geographical Information Science
Vol. 24, No. 9, September 2010

Local entropy map: a nonparametric approach to detecting spatially varying multivariate relationships

Diansheng Guo*
Department of Geography, University of South Carolina, Columbia, South Carolina, USA
*Email: guod@sc.edu
© 2010 Taylor & Francis

(Received 5 December 2008; final version received 19 December 2009)

The relationship between two or more variables may change over the geographic space. The change can be in parameter values (e.g., regression coefficients) or even in relation forms (e.g., linear, quadratic, or exponential). Existing local spatial analysis methods often assume a relationship form (e.g., a linear regression model) for all regions and focus only on the change in parameter values. Therefore, they may not be able to discover local relationships of different forms simultaneously. This research proposes a nonparametric approach, a local entropy map, which does not assume a prior relationship form and can detect the existence of multivariate relationships regardless of their forms. The local entropy map calculates an approximation of the Rényi entropy for the multivariate data in each local region (in the geographic space). Each local entropy value is then converted to a p-value by comparing it with a distribution of permutation entropy values for the same region. All p-values (one for each local region) are processed by several statistical tests to control the multiple-testing problem. Finally, the testing results are mapped and allow analysts to locate and interactively examine significant local relationships. The method is evaluated with a series of synthetic data sets and a real data set.

Keywords: local analysis; entropy; minimum spanning tree; scan statistics; spatial data mining

1. Introduction

One of the major challenges to spatial data analysis and geographic data mining is that relationships between two or more variables may change over the geographic space. This is related to the problem of spatial non-stationarity: the measurement of a relationship depends in part on where the measurement is taken (Fotheringham et al. 2002). To address the spatial non-stationarity problem, a number of local spatial analysis methods have been developed, including local point pattern analysis (Openshaw et al. 1987), univariate local statistics (Getis and Ord 1992, Ord and Getis 1995), local indicators of spatial association (LISA) (Anselin 1995), and multivariate local analysis such as the Geographically Weighted Regression (GWR) (Brunsdon et al. 1996, Fotheringham et al. 2002).

Existing local spatial analysis methods have been successfully applied in various application domains. Univariate local statistics are useful in detecting hot spots or spatial clusters, but they mainly focus on the spatial distribution of the values of a single variable. Local bivariate or multivariate analysis methods can examine spatially varying relationships between two or more variables, but they usually assume a prior relationship form, which is often a

linear correlation (e.g., high–high and low–low associations) or a regression model. For example, GWR examines the spatial variation of parameter values for a linear regression model (or any chosen model) while assuming that the multivariate relationship in different regions conforms to the same model. For complex spatial analysis tasks, however, both the parameter values of a specific relation form and the relation form itself may change across space. Assuming the same relationship form for all local regions may miss important local relationships whose forms differ from the assumed one.

This research proposes a nonparametric local analysis approach, a local entropy map, which is able to simultaneously detect local multivariate relationships of various forms. It first calculates an approximation of the Rényi entropy for the multivariate data in each local region (in the geographic space) and then uses a permutation-based approach to convert each local entropy value to a p-value. All p-values (one for each local region) are processed by several statistical tests to control the multiple-testing problem. Finally, the results of significance testing are mapped to allow visual detection and examination of local relationships through interactive exploration. The method is robust to noise and outliers and invariant to linear transformations of the variables. Evaluations are carried out with a series of synthetic data sets (including a bivariate data set and several multivariate data sets with up to six variables) and the US 2004 presidential election data.

The rest of the paper is organized as follows. Related research is reviewed in Section 2. A detailed explanation of the proposed method is provided in Section 3. Evaluation results with synthetic data sets and the US election data are presented in Sections 4 and 5, respectively. Section 6 concludes with discussions and a summary.

2. Related research

2.1.
Local spatial analysis

Spatial analysis is often concerned with the identification of relationships that may vary over the geographic space, especially when a huge number of spatial observations are collected over a large geographic area (Openshaw et al. 1987, Getis and Ord 1992, Anselin 1995, Getis and Ord 1996, Fotheringham and Brunsdon 1999, Fotheringham et al. 2000). Existing local spatial analysis methods can be classified into two major categories: (1) those on univariate or bivariate spatial associations; and (2) those on local versions of multivariate regression analysis. Below is a brief review of the two types of local analysis methods. Readers are referred to Anselin 1995, Getis and Ord 1996, and Fotheringham et al. for more detailed reviews.

A number of local spatial statistics have been developed to examine univariate or bivariate spatial associations, including the local G statistics (Getis and Ord 1992, Ord and Getis 1995), local Moran's I, and a more generalized class of LISA (Anselin 1995). Such local spatial statistics are primarily used to detect univariate spatial clusters (e.g., concentration of high or low values) and examine how such clusters vary from place to place. The local Moran's I may also be extended to measure the association between one variable at a location and another variable at a neighborhood of that location (Anselin et al. 2006). Nevertheless, existing local statistics, univariate or bivariate, rely on a measure that is designed to capture a specific type of relationship (e.g., high–high and low–low associations) and will therefore ignore or miss other types of relationships.

GWR extends traditional regression analysis by allowing local rather than global parameters to be estimated (Brunsdon et al. 1996, Fotheringham et al. 2002). For each point in the geographic space, a regression analysis is performed on the data in the neighborhood

around that point. The neighborhood (or bandwidth) can be defined in several different ways (Fotheringham et al. 2002). Prior to GWR, other local extensions of regression analysis had also been developed, for example, the spatial expansion method (Casetti 1972). A limitation of existing local regression methods, however, is the assumption of a relationship form. Regression fitting is also sensitive to outliers and variable transformations (such as a logarithmic transformation).

To capture non-linear relationships with regression, there are several existing options: (1) perform variable transformation (such as a logarithmic transformation) and then use the transformed variables in the regression analysis; (2) use a non-linear regression model (Bates and Watts 1988) or generalized linear models (GLMs) (Nelder and Wedderburn 1972); or (3) use smoothing-based local fitting approaches, such as the locally weighted regression (LOWESS) (Cleveland and Devlin 1988) or generalized additive models (Hastie and Tibshirani 1990), which are further developments of the generalized linear models. The first two options, if extended for local spatial analysis, still require prior knowledge of a relationship form (such as choosing a link function for a GLM) and assume that the form does not change across space. Although the third option provides greater flexibility with local smoothing curves, its configuration requires several subjective decisions such as the choice of smoother and the size of the local neighborhood in the attribute space (Hastie and Tibshirani 1990). Moreover, there is always a risk of overfitting the data with such local smoothing approaches (Hastie and Tibshirani 1990).

2.2. Multiple-testing problems

To identify local patterns, a method often needs to evaluate the presence and significance of patterns for each local neighborhood.
When a large number of significance tests are carried out simultaneously, the multiple-testing problem occurs. The problem refers to the fact that, when many tests are performed at the same time, the probability of incorrectly rejecting at least one null hypothesis is much higher than the nominal significance level indicates. For example, if a test rejects a null hypothesis at a significance level of 0.05, there is a 5% probability that a true null hypothesis is incorrectly rejected. However, if 100 such tests (which, for simplicity, are assumed to be independent from each other) are performed at the same time and all null hypotheses are true, then it is almost certain that one or more null hypotheses will be incorrectly rejected: the probability is $1 - 0.95^{100} \approx 0.994$, and the expected number of false rejections is five. In this case, when we have five rejected null hypotheses, it is difficult to tell whether they arose merely by chance or are indeed significant. Therefore, local analysis methods need to adjust for the multiple-testing problem.

In the statistics literature, a number of methods have been proposed to control the multiple-testing problem (see Castro and Singer 2006 for a detailed review). Two groups of methods are commonly used: (1) controlling the family-wise error rate (FWER); or (2) controlling the false-discovery rate (FDR). These two groups of methods are briefly introduced below.

The FWER is the probability of erroneously rejecting even one of the true null hypotheses among all the m tests. If the FWER is to be controlled at some level $\alpha$, then each of the m tests has to be carried out at a much lower level. The simplest FWER approach is the Bonferroni method, which simply sets the level for each test at $\alpha/m$. The Bonferroni method is the most conservative correction for multiple testing. Holm (1979) proposed a step-down procedure to control FWER (as opposed to a single-step procedure such as the Bonferroni method).
Consider testing m hypotheses $H_1, H_2, \dots, H_m$ based on the corresponding p-values $P_1, P_2, \dots, P_m$. Let $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$ be the ordered p-values and denote their corresponding hypotheses by $H_{(1)}, H_{(2)}, \dots, H_{(m)}$. Holm's method starts testing from the smallest p-value $P_{(1)}$. Let k be the smallest i for which $P_{(i)} > \alpha/(m - i + 1)$; then reject all $H_{(i)}$, $i < k$.

Benjamini and Hochberg (1995) proposed an alternative approach that controls the FDR, which is the expected proportion of rejected hypotheses that are incorrectly rejected. Consider testing $H_1, H_2, \dots, H_m$ based on the corresponding p-values $P_1, P_2, \dots, P_m$. Let $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$ be the ordered p-values and their corresponding hypotheses be $H_{(1)}, H_{(2)}, \dots, H_{(m)}$. To control the FDR at a level $\alpha$, let k be the largest i for which $P_{(i)} \le \alpha(i/m)$; then reject all $H_{(i)}$, $i \le k$. FDR-based correction is less conservative and may reject more null hypotheses than an FWER-based method does, as long as the proportion of incorrect rejections is controlled.

Both of the above correction procedures (FWER or FDR) assume that the hypotheses are independent from each other. When dependence exists among the hypotheses to be tested, the above procedures may need further adjustments (Getis and Ord 2000, Benjamini and Yekutieli 2001). Such dependence is more likely to exist in local spatial analysis because of spatial autocorrelation and the overlapping of local neighborhoods (Castro and Singer 2006). To correct the Bonferroni method for the overlapping problem, Getis and Ord (2000) use $\alpha/v$ (instead of $\alpha/m$) as the control level, where v is the estimated number of non-overlapping local regions.

To perform a hypothesis test, one needs to calculate a p-value, which is the probability of obtaining a pattern at least as extreme as the observed one when the null hypothesis is true. To derive a p-value, two pieces of information are needed: a measure of the observed pattern and the distribution of measure values under the null hypothesis. The measure value is then compared with the distribution to derive its p-value.
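The two stepwise procedures described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names `bonferroni_reject`, `holm_reject`, and `bh_reject` are hypothetical, the tests are assumed to be independent, and rejection uses "less than or equal to" the threshold.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Single-step Bonferroni: reject when p <= alpha / m."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def holm_reject(pvalues, alpha=0.05):
    """Holm (1979) step-down procedure; returns indices of rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    rejected = []
    for rank, idx in enumerate(order, start=1):
        # Stop at the first ordered p-value exceeding alpha / (m - rank + 1).
        if pvalues[idx] > alpha / (m - rank + 1):
            break
        rejected.append(idx)
    return sorted(rejected)


def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg (1995) procedure; returns indices of rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank i with P_(i) <= alpha * i / m (0 means reject none)
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

On the same set of p-values, `bh_reject` returns at least as many rejections as `holm_reject`, which reflects the statement above that FDR control is less conservative than FWER control.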
For a traditional statistical test (such as Student's t-test), both the measure calculation and the distribution (e.g., the t-distribution) are well defined and given. However, for local multivariate spatial analysis, it remains an open research challenge to design an effective measure of patterns and to obtain the distribution of expected values of the measure under the null hypothesis. The research presented in this paper proposes both a new entropy-based multivariate relationship measure and a permutation-based approach to construct an empirical distribution of the measure values for significance testing.

2.3. Entropy-based analysis

Entropy-based approaches have been widely used in many disciplines. The concept of entropy has its roots in thermodynamics and statistical physics. Intuitively, entropy represents the amount of uncertainty or randomness about a system or phenomenon. In information theory, entropy quantifies the uncertainty or randomness of a system, such as the Shannon entropy (Shannon 1948) and its generalization, the Rényi entropy (Rényi 1960, Aczel and Daroczy 1975, Ben-Bassat and Raviv 1978). For a real-valued d-dimensional data space $R^d$, the Rényi entropy is defined as

$$H_\lambda = \frac{1}{1-\lambda}\log\left(\int_{R^d} f(x)^{\lambda}\,dx\right), \quad \lambda \ge 0,\ \lambda \ne 1, \qquad (1)$$

where x is a d-dimensional vector, f(x) is the probability density function, and $\lambda \ge 0$ is the order of the Rényi entropy. When $\lambda \to 1$, the Rényi entropy $H_\lambda$ converges to the Shannon entropy:

$$H = -\int_{R^d} f(x)\log f(x)\,dx. \qquad (2)$$

A major challenge in using entropy for exploratory data analysis is that the probability density function, which is critical in calculating the entropy, is often unknown. A common solution to this problem in the literature is to construct a contingency table by binning (or cutting) each continuous variable into intervals (Guo 2003). The derived contingency table is then used as an approximation of the multivariate probability density function. However, such a contingency table is not a reliable approximation of the true data distribution, especially when the number of data points is limited, which is true for local spatial analysis since only a small local neighborhood is analyzed at a time. Different binning methods may produce dramatically different contingency tables and thus different probability density estimations.

The research on the probability theory of minimum spanning trees (MST) points to another direction for entropy estimation (Beardwood et al. 1959, Steele 1988). Steele (1988) shows that the total length of a minimum spanning tree, constructed with a set of multivariate points, is related to the unknown probability density function for those data points. Therefore, it is possible to estimate the Rényi entropy of a set of multivariate data points using its minimum spanning tree as a surrogate of its probability density function. This new direction for entropy estimation has been recently adopted for image comparison and registration (Hero and Michel 1999, Hero et al. 2002). Steele's (1988) theorem is introduced in Section 3.2.

The concept of entropy has long been used in geographical research (Medvedkov 1967, Marchand 1972, Batty 1976).
Recent efforts of using entropy in spatial analysis include, for example, measuring spatial information in maps (Li and Huang 2002), comparing categorical maps (Remmel and Csillag 2006), and examining multi-scale causality in spatial systems (Phillips 2005). Most existing entropy-based approaches focus on (1) deriving a global measure of spatial information and/or (2) analyzing categorical data (e.g., soil types, map symbols, or classified data). The proposed local entropy map in this paper makes two main contributions. First, it can calculate the multivariate entropy of continuous variables (instead of categorical variables) without assuming a probability density function or binning data into classes. Second, it performs local analysis (instead of global analysis) to detect significant local multivariate patterns without assuming a regression model. Researchers in computer vision have recently proposed a robot navigation approach based on entropy map (Arbel and Ferrie 2001), which measures the ambiguity of target recognition at each observation location and thus can be used to guide a mobile observer (robot) to navigate along an optimal route to focus on the target. Although similar in names, the entropy map approach in computer vision is fundamentally different from the proposed local entropy map in two main aspects. First, the goal of the former is to discriminate (or recognize) a known target from other objects based on their differences. The entropy at an observational position indicates how easily the target can be uniquely recognized from its surrounding environment. Second, the entropy calculation of the former approach is based on a posteriori probability density function derived from training data (i.e., experiments conducted beforehand at each observation point). 
The local entropy map approach proposed in this paper, on the other hand, is an exploratory analysis method that can detect the existence of significant local relationships without knowing the probability function.

3. Local entropy map: detecting local multivariate relationships using entropy

The research presented in this paper proposes a nonparametric approach, local entropy map, which combines the Rényi entropy, permutation-based distribution estimation, and statistical tests to detect the existence of significant multivariate local relationships, regardless of their forms. The approach involves the following four steps:

1. Estimating the multivariate Rényi entropy for each local neighborhood;
2. Using a permutation-based approach to construct an empirical distribution of entropy values for each local neighborhood under the null hypothesis, which is used to convert the local entropy value to a p-value;
3. Processing the p-values with a set of statistical tests, including the Bonferroni, Bonferroni adjusted for spatial dependence, and FDR;
4. Creating a local entropy map to show the significance level of each local region and allow interactive examination of significant local multivariate relationships.

The above steps are presented in the following subsections, respectively. Before moving on to the methodological details, it is necessary to explain what constitutes a good multivariate relationship. Two variables have a good relationship if one variable is dependent on the other. In other words, a good relationship means that one may predict, more or less, the value of one variable if given the value of the other variable. Figure 1 shows six different relationships between variables G and H, with three good relationships (a–c) and three null relationships (d–f). Among the good relations, pattern (a) is linear, (b) is quadratic, and (c) is non-smooth and non-monotonic. Note that existing bivariate statistics, such as the Pearson, Spearman, or Kendall indices, cannot capture the three good relationships (a–c) simultaneously.

Figure 1. Illustrative examples of good relationships (a–c) and null relationships (d–f).

For each of the three

null relationships, variables G and H are independent from each other. Specifically, pattern (d) is a random distribution, pattern (e) is a Gaussian distribution, and pattern (f) has two extreme values on G and a random distribution for the rest. Similarly, a good relationship for three or more variables would form a hypersurface in the multivariate space so that the value of the dependent variable can be roughly determined by the values of predictor variables.

3.2. Rényi entropy approximation with power-weighted MST

Let $X = \{x_i\}$ ($x_i \in R^d$, $1 \le i \le n$) be n independent observations in a continuous d-dimensional space (hereafter attribute space). Each observation is fixed to a location in the geographic space. A spanning tree T of X is a tree that has exactly n − 1 edges, which connect all n data points in the attribute space. Let $\{T_i\}$ be all the possible spanning trees of X. The length of an edge $e = (x_i, x_j)$ ($i \ne j$) is denoted by $|e|$, which is the Euclidean distance between $x_i$ and $x_j$ in the attribute space. A power-weighted minimum spanning tree $M_\alpha$ is defined as:

$$M_\alpha(x_1, x_2, \dots, x_n) = \min_{\{T_i\}} \sum_{e \in T_i} |e|^{\alpha}, \qquad (3)$$

where $\alpha$ is the edge power and the minimum is over all spanning trees $\{T_i\}$ of the data set. In other words, $M_\alpha(x_1, x_2, \dots, x_n)$ is the total power-weighted edge length of the shortest spanning tree in $\{T_i\}$. According to Steele (1988), for $0 < \alpha < d$, with probability 1 we have:

$$\lim_{n \to \infty} n^{-(d-\alpha)/d}\, M_\alpha(x_1, x_2, \dots, x_n) = c(\alpha, d)\int_{R^d} f(x)^{(d-\alpha)/d}\,dx, \qquad (4)$$

where f denotes the probability density function of the absolutely continuous attribute space and $c(\alpha, d)$ denotes a strictly positive constant that depends only on the edge power $\alpha$ and the dimensionality d. Equation (4) indicates that the total length of the minimum spanning tree is related to the unknown density function and therefore, as shown next, is closely related to the Rényi entropy of a multivariate data distribution (Steele 1988).
Combining Equations (1) and (4), we can remove the requirement of a density function f and directly approximate the Rényi entropy of a multivariate data set from its minimum spanning tree length:

$$H_\lambda \approx \frac{1}{1-\lambda}\log\left(\frac{M_\alpha(x_1, x_2, \dots, x_n)}{c\,n^{\lambda}}\right), \qquad (5)$$

where $\lambda = (d-\alpha)/d$, $0 < \alpha < d$, and c is a constant that depends only on $\alpha$ and d. Because $0 < \lambda < 1$, the factor $1/(1-\lambda)$ is positive, so Equation (5) increases monotonically with the normalized tree length. Since this research converts each local entropy value to a p-value against a distribution of permutation entropies under the null hypothesis (i.e., the absolute entropy value is not directly used for significance testing), the monotonic logarithmic transformation and the constant factor in Equation (5) can be omitted. Therefore, the proposed approach uses the normalized power-weighted MST length, $M_\alpha(x_1, x_2, \dots, x_n)/n^{(d-\alpha)/d}$, as a surrogate approximation of Rényi entropy to measure local multivariate relationships and conduct further significance testing. The edge power $\alpha$ should be configured within the range (0, d).
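The surrogate measure above can be sketched with a plain Prim's algorithm. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `power_weighted_mst_length` and `entropy_surrogate` are hypothetical, points are given as equal-length tuples, and the O(n²) search is chosen for brevity rather than speed. Because raising edge lengths to a positive power preserves their order, the tree that minimizes the power-weighted sum is an ordinary Euclidean MST.

```python
import math


def power_weighted_mst_length(points, alpha):
    """Sum of |e|**alpha over the edges of a minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known cost of attaching each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Attach the cheapest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v]) ** alpha
                if w < best[v]:
                    best[v] = w
    return total


def entropy_surrogate(points, alpha=1.0):
    """Normalized MST length M_alpha / n**((d - alpha) / d)."""
    n, d = len(points), len(points[0])
    if not 0 < alpha < d:
        raise ValueError("edge power alpha must lie in (0, d)")
    return power_weighted_mst_length(points, alpha) / n ** ((d - alpha) / d)
```

For example, nine points lying on a diagonal line give a smaller surrogate value than nine points spread over a 3 × 3 grid of the same extent, in line with the idea that a strong relationship lowers the entropy.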


Figure 2. A bivariate example for entropy calculation with minimum spanning trees.

Figure 2 illustrates the procedure to calculate the multivariate Rényi entropy for a local region. During the data preprocessing step, each variable is linearly normalized to the range [0, 1], that is, with its minimum equal to 0 and its maximum equal to 1. Note that this normalization step is optional but desirable, especially when different variables have dramatically different value ranges. A local region (or neighborhood) is defined as the k nearest neighbors of a location $s_i$, including $s_i$ itself (Figure 2a). The observations in the local region, denoted as $X(s_i, k) = \{x_{i1}, x_{i2}, \dots, x_{ik}\}$, form a distribution in the attribute space (Figure 2b), for which a minimum spanning tree is constructed (Figure 2c). Algorithms for constructing an MST can be found in Baase and Gelder (2000). The normalized MST length, $M_\alpha(x_{i1}, x_{i2}, \dots, x_{ik})/k^{(d-\alpha)/d}$, is then used to approximate its Rényi entropy. The effects and configuration of k and $\alpha$ are discussed in Section 4.

For the six data distributions in Figure 1, the entropies for patterns (a), (b), and (c) will be much smaller than that of pattern (d). Note that smaller entropy values indicate less uncertainty and thus possibly stronger patterns. However, although (d), (e), and (f) are all null relations (i.e., the variables are independent from each other), their entropy values are not similar, with patterns (e) and (f) having much smaller entropy values than pattern (d). This indicates that the Rényi entropy alone is not enough to distinguish dependent relations from independent relations. For example, a clustered (Figure 1e) or skewed distribution (Figure 1f) may also generate a small entropy value.
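The preprocessing and local-region construction described above might look as follows. This is a hedged sketch: `min_max_normalize` and `local_region` are hypothetical helper names, locations are assumed to be coordinate tuples, and a brute-force distance sort stands in for whatever neighbor search an actual implementation would use.

```python
import math


def min_max_normalize(column):
    """Linearly rescale one variable to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def local_region(locations, i, k):
    """Indices of the k nearest neighbors of location i, with i itself listed first."""
    # The (j != i) flag sorts i ahead of any other point at distance zero.
    key = lambda j: (j != i, math.dist(locations[i], locations[j]))
    return sorted(range(len(locations)), key=key)[:k]
```

The attribute vectors of the returned indices form the local data set whose minimum spanning tree is then measured.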
Therefore, this research further converts each local entropy value to a p-value, according to an estimated empirical distribution of local entropy values under the null hypothesis (i.e., no dependence between the two variables) established through permutations in the attribute space. In other words, given a local entropy value, we need to find out the probability that such an entropy value exists while the variables are actually independent from each other.

3.3. Permutation-based distribution estimation and testing

To find out the p-value of a local entropy value, we need to establish the distribution of local entropy values under the null hypothesis. A permutation-based approach creates a large number of random permutations of the multivariate attribute data in the local region and calculates an entropy value for each permutation. Suppose a local region has k multivariate data points $\{x_1, x_2, \dots, x_k\}$ and each point $x_i$ is a vector of values for d variables $\{v_1, v_2, \dots, v_d\}$. Let $v_d$ be the dependent variable and the rest be independent variables. The permutation procedure shuffles the values of $v_d$ among the k vectors while keeping the values of other variables unchanged. Thus it creates a new data set in which variable $v_d$ is


independent from other variables. This permutation procedure by design ensures that the marginal distribution for each variable remains the same as in the original data since it only shuffles existing values of $v_d$. The Rényi entropy for each permutation is calculated and recorded. The above permutation step is repeated many (e.g., 999) times to obtain an empirical distribution of local entropy values under the null hypothesis. Figure 3 illustrates the steps to obtain the permutation entropy distribution, using the same example data shown in Figure 2. This permutation procedure is similar to the one used by Kulldorff et al. (2005) to detect space-time interactions.

Figure 3. Permutations and empirical distribution under the null hypothesis.

Figure 4. An empirical distribution of entropy values derived with permutations. The normal distribution curve is defined by the mean and standard deviation of the permutation entropy values.

Figure 4 shows the permutation entropy distribution for a local neighborhood. Given such a distribution, there are two options to find out the p-value of the actual local entropy value: the percentile approach and the normal-distribution-based approach. Using the percentile approach, one can simply count how many of the permutation entropy values are lower than the actual entropy value. For example, if the actual value is between the 50th

and 51st lowest values among the 999 permutation entropies, then its p-value is With the normal-distribution-based approach, one assumes that the permutation entropy distribution conforms to a normal distribution. The mean and standard deviation of the permutation entropy values are then used to configure a normal distribution curve and find out the p-value, which is the cumulative probability at the given local entropy value (Figure 4). For the analysis of the US election data, more than 3000 local regions are evaluated and thus over 3000 permutation distributions are constructed, among which 99.8% conform to a normal distribution according to Kolmogorov tests. Experiments with both synthetic data and real data also show that the two approaches give very similar results. For the remainder of this paper, the normal distribution approach is used.

Let us look at the six examples in Figure 1 again. For (a), (b), and (c), permuting the values of the two variables will produce a much more random distribution and thus significantly higher entropy values. On the other hand, the data distributions for (d), (e), or (f) will not change much after each permutation and thus entropy values will remain similar to the original value. Therefore, the p-values for (a), (b), and (c) will be very low while the p-values for (d), (e), or (f) will be much higher. Note that, with the permutation, the approach is robust to outliers or extreme values, and invariant under data transformations (such as a logarithmic transformation).

3.4. Controlling the multiple-testing problem

Now that we have converted each local entropy value to a p-value, the next step is to determine which p-values are significant. In other words, in which local regions should the null hypothesis be rejected? If the multiple-testing problem is ignored, we can simply use a specified significance level, say P = 0.05, and declare all p-values that are lower than P significant.
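As a recap of the conversion from a local entropy value to a p-value (Section 3.3), the sketch below shuffles the dependent variable and computes both p-value options. It is illustrative only: `permutation_p_values` is a hypothetical name, the `statistic` argument stands for any entropy-like measure (for instance a normalized MST length), and dividing the count by the number of permutations in the percentile option is an assumption, since the text only says to count the lower values.

```python
import random
import statistics


def permutation_p_values(rows, statistic, n_perm=999, seed=0):
    """rows: list of d-tuples whose last element is the dependent variable v_d.

    statistic: callable mapping a list of rows to an entropy-like value.
    Returns (percentile p-value, normal-approximation p-value).
    """
    rng = random.Random(seed)
    observed = statistic(rows)
    predictors = [tuple(r[:-1]) for r in rows]
    dependent = [r[-1] for r in rows]
    null_values = []
    for _ in range(n_perm):
        shuffled = dependent[:]
        rng.shuffle(shuffled)  # breaks the link between v_d and the other variables
        permuted = [p + (v,) for p, v in zip(predictors, shuffled)]
        null_values.append(statistic(permuted))
    # Percentile option (assumed form): share of permutation values below the observed one.
    p_percentile = sum(v < observed for v in null_values) / n_perm
    # Normal option: cumulative probability at the observed value.
    # NormalDist.cdf raises StatisticsError if all permutation values are identical.
    fitted = statistics.NormalDist(statistics.mean(null_values),
                                   statistics.stdev(null_values))
    return p_percentile, fitted.cdf(observed)
```

Because only the last column is shuffled, the marginal distribution of every variable is unchanged in each permuted data set, as the procedure requires.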
However, the multiple-testing problem does exist and should be controlled. If r local regions are examined for local patterns, there are r hypotheses being tested simultaneously. Section 2.2 reviewed existing approaches for controlling the multiple-testing problem. Instead of adopting a single method, this research uses a set of different methods to process the p-values so that their results can be compared and related to each other. The adjusting methods adopted by this research include the Bonferroni, Bonferroni adjusted for spatial dependence, and the FDR methods. Essentially, given a significance level P, each method sets its own critical value for rejecting null hypotheses:

• The Bonferroni method simply sets the critical value at P/r, that is, p-values that are lower than P/r will be declared significant and their corresponding hypotheses will be rejected. In this research, r = n, that is, a local region is created around each data point.
• The Bonferroni method adjusted for spatial overlapping (dependence) sets the critical value at P/v, where v is the estimated number of non-overlapping local regions (Getis and Ord 2000). This research uses v = n/k, where n is the total number of data points and k is the number of data points covered by each local region.
• The FDR method first sorts all p-values into ascending order, $P_{(1)} \le P_{(2)} \le \dots \le P_{(r)}$. Given a specified significance level P, let j be the largest i for which $P_{(i)} \le P \cdot i/r$; then declare all $P_{(i)}$ ($i \le j$) significant and reject their corresponding hypotheses. In other words, the FDR method sets its critical value at $P_{(j)}$.
• If the multiple-testing problem is ignored, the critical value will be set at the given level P, which is 0.05 in this research.
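The four critical values listed above can be computed as in the following sketch. The names `critical_values` and `significance_class` are hypothetical, and scoring each p-value by the number of critical values it passes is a simplification introduced here for illustration, not necessarily the exact class definition used for the map; the sketch also uses "less than or equal to" throughout.

```python
def critical_values(pvalues, k, level=0.05):
    """Critical values of the four methods for r = len(pvalues) local regions."""
    r = len(pvalues)  # one local region per data point, so r = n
    v = r / k         # estimated number of non-overlapping regions
    fdr = 0.0         # stays 0 when no ordered p-value passes the FDR rule
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= level * i / r:
            fdr = p   # ends as P_(j), j being the largest passing rank
    return {
        "bonferroni": level / r,
        "bonferroni_spatial": level / v,
        "fdr": fdr,
        "uncorrected": level,
    }


def significance_class(p, crit):
    """Number of critical values that p passes: 0 (none) to 4 (all)."""
    return sum(p <= c for c in crit.values())
```

With ten regions of five points each and a level of 0.05, the Bonferroni critical value is 0.005 and the spatially adjusted one is 0.025, so the adjusted test is less strict, as intended.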

Figure 5. Controlling the multiple-testing problem with several methods. The four critical values separate the p-values into five classes, which are mapped with darker colors representing more significant patterns.

With the above four different methods (including the last one that does not control the multiple-testing problem), four different critical values can be obtained to classify the p-values of local entropies into five classes, each represented by a shade of gray (Figure 5). With the colors, a local entropy map can be rendered to visualize the spatial distribution of significant local patterns (see Figures 7–9 and 12). Such a mixed-method approach not only helps understand the relations between different statistical tests but also visualizes different levels of significance and shows the gradual change of patterns over space.

Putting it all together: the local entropy map

The local entropy map method introduced in previous sections can be summarized and implemented with the following procedure:

Input: $D = \{y_i, x_i\}$, $|X| = n$, where $x_i$ is a vector of dimensionality d, observed at location $s_i$.

Steps:
1. (Optional) Normalize each variable separately to the range [0, 1];
2. For each location (or spatial unit) $s_i$:
   (a) Find its k nearest neighbors (including $s_i$) in the geographic space;
   (b) Construct an MST with the local multivariate data $\{x_{i1}, x_{i2}, \dots, x_{ik}\}$ and edge power α;
   (c) Estimate the local entropy e with the normalized MST length: $M_\alpha(x_{i1}, x_{i2}, \dots, x_{ik}) / k^{(d-\alpha)/d}$;
   (d) Randomly generate 999 permutations (see Section 3.3 for details);
   (e) For each permutation:
       (i) Construct an MST with edge power α;
       (ii) Calculate and record the entropy;
   (f) Convert the local entropy value e to a p-value according to the distribution of the 999 permutation entropy values;
3.
Significance testing of the p-values using four different testing methods;

Output: A local entropy map, with colors representing the results of statistical testing.

Without using a spatial index, the computational complexity of the entire procedure is $O(rn \log n + rPk^2 \log k)$, where r is the number of local regions, n is the total number of observations, k is the number of nearest neighbors used, and P is the number of permutations for each local test. Both k and P are constants. If a local region is constructed around each observation, then r = n. The overall complexity is primarily determined by two components: (1) the search for k nearest neighbors for each local region, $O(rn \log n)$; and (2) the repetitive construction of MST for each permutation (and each neighborhood),


$O(rPk^2 \log k)$. To speed up the first component, spatial index structures can be employed. Since both k and P are constants, the second component is scalable with large data sets. A grid-based (or raster-based) approach may be used to reduce r, the number of local regions to be examined (Fotheringham et al. 2000).

4. Experiments with synthetic data sets

To evaluate the performance of the proposed method and understand the effects of parameters k and α, experiments with several synthetic bivariate and multivariate data sets are carried out. GWR is also applied for comparison.

Experiments with bivariate data

The first synthetic data set has two variables and 1500 data points (Figure 6e). Each data point also has a spatial location. Three bivariate relationships are generated (Figure 6a–c): (A) a linear relationship (consisting of 243 points); (B) a quadratic relationship (consisting of 300 points); (C) a logarithmic relationship (consisting of 151 points). All other 806 points are from a random distribution (Figure 6d).

Figure 6. A synthetic data set: (a) a linear relationship; (b) a quadratic relationship; (c) a logarithmic relationship; (d) random data; (e) the bivariate space of data points from the above three relationships and the random data; and (f) the geographic space. Data points in regions (a), (b), and (c) in the geographic space are from the linear relationship (a), quadratic relationship (b), and the log relationship (c), respectively. Points outside the three regions are random data (d).

Moreover, points for each of the three relationships are confined in a different geographic area (Figure 6f): relation A is


contained in an elongated area; relation B occupies a circular area; and relation C is present in an irregular polygonal region. Points outside those three areas are all from a random distribution.

Figure 7 shows four local entropy maps of the synthetic data, each configured with the same α value (0.5) but with a different k value (ranging from 35 to 100). In general, all four maps effectively highlight the three regions where the two variables have a strong dependent relationship. The k value affects the resulting entropy maps in two ways. On the one hand, the estimation of entropy values is more robust with a larger k value because more data points can better represent a relationship. On the other hand, a smaller k value helps achieve a better spatial resolution in detecting local relationships.

Figure 8 shows the results with four different α values (ranging from 1.5 to 0.05) while k is fixed to 50. Decreasing the α value will reduce the length difference between long edges and short edges and therefore the result will be more robust to noise and extreme values. The map with α = 1.5 captures the core of the three regions but misses the points near the periphery. The maps with α = 0.5 and 0.05 outline all three regions with high fidelity. The gray shades of points in Figures 7 and 8 represent the significance level of local relationships (see Figure 5 for the meaning of colors).

Figure 7. Local entropy maps of the bivariate synthetic data for different k values (with α = 0.5). See Table 1 for more experiment results of different α and k values.
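To make the roles of k and α more concrete, the local statistic for a single neighborhood can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function names are hypothetical, SciPy is assumed to be available, the p-value uses the normal distribution approach, and the permutation scheme (shuffling each variable independently within the neighborhood) is an assumption, since the details of Section 3.3 are not reproduced here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from scipy.stats import norm


def normalized_mst_length(points, alpha):
    """Normalized power-weighted MST length, M_alpha / k^((d - alpha) / d)."""
    points = np.asarray(points, dtype=float)
    k, d = points.shape
    # For alpha > 0, raising edge lengths to the power alpha preserves their
    # order, so the MST can be built on plain Euclidean distances.
    # Caveat: zero distances (duplicate points) are treated as missing edges.
    distances = squareform(pdist(points))
    mst = minimum_spanning_tree(distances)
    return np.sum(mst.data ** alpha) / k ** ((d - alpha) / d)


def permutation_p_value(points, alpha=0.5, n_permutations=999, seed=0):
    """P-value of the local entropy statistic via a fitted normal curve."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    observed = normalized_mst_length(points, alpha)

    permuted = np.empty(n_permutations)
    for t in range(n_permutations):
        # Assumed scheme: shuffle every variable independently, which
        # destroys any dependence between the variables.
        shuffled = np.column_stack([rng.permutation(col) for col in points.T])
        permuted[t] = normalized_mst_length(shuffled, alpha)

    # Cumulative probability of the observed value under the normal curve
    # fitted to the permutation values.
    return norm.cdf(observed, loc=permuted.mean(), scale=permuted.std(ddof=1))
```

With a smaller `alpha`, long and short MST edges contribute more equally to the sum, which is the mechanism behind the robustness to noise described above. A larger neighborhood (more rows in `points`) gives a more stable estimate at the cost of spatial resolution.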


Figure 8. Local entropy maps of the bivariate synthetic data for different α values (with k = 50). See Table 1 for more experiment results of different α and k values.

A more comprehensive experiment is then conducted to compare the results of different combinations of α and k values, with the above synthetic bivariate data. Instead of visual inspection, Type I and Type II errors are calculated for each result. Type I error, that is, false positive or false discovery, is the number of points outside the three areas (see Figure 6f) that are declared significant based on the spatially adjusted Bonferroni method. Type II error, that is, false negative or missed patterns, is the count of points inside the three areas that are declared insignificant based on the spatially adjusted Bonferroni method. Table 1 shows the results for the combination of four different k values (35, 50, 75, and 100) and seven different α values (ranging up to 2.0). Several trends can be observed in the table: (1) with α being constant, a larger k value will generate more false discoveries but miss fewer patterns; (2) with k being constant, a smaller α value will also generate more false discoveries but miss fewer patterns; and (3) in terms of total error, a smaller k (e.g., 35–50) with α ranging between 0.25 and 0.5 gives better results. It is interesting to notice that, with k being equal, when α gets smaller (e.g., < 0.5), the results become stable and statistically identical. This confirms that, when α approaches zero, the Rényi entropy becomes the Shannon entropy.

It should be emphasized that although different combinations of α and k values generate different numbers of errors as shown in Table 1, each of the configurations can effectively identify the three local areas that have good local relationships. A major portion of the Type I

error is due to the edge effect, especially for larger k values. With different α values we can control the sensitivity to noise and extreme values. As shown in Figures 7 and 8 and Table 1, the effects of different α and k are highly predictable and the result is not very sensitive to small changes in those parameters. For example, with k = 35 to 50 and α ranging from 0.05 to 0.5, the local entropy maps are very similar to each other.

Table 1. Comparison of Type I and Type II errors for different α and k configurations. For each k, the smallest total error (i.e., the sum of Type I and II errors) is highlighted in bold. Columns: Type I and Type II errors for k = 35, 50, 75, and 100. Rows: seven α values.

Experiments with multivariate data

To further evaluate the proposed approach for more than two variables, a series of multivariate synthetic data sets were generated, each with a different number of variables, ranging from 3 to 6. To be able to compare with the bivariate result, we use the same three geographic areas, with each having the same type of pattern, same number of points, and same sets of locations. The main difference is that each relationship now has more independent variables. For example, to generate data for the linear relationship with six variables (one dependent and five independent variables), we first generate a random value for each independent variable, then use the linear function to calculate the dependent variable value with the five independent variable values, and finally add a random error (within a certain range, same as in the bivariate data) to the value of the dependent variable. Visually, a bivariate linear relationship forms a line in the 2D attribute space while a multivariate linear relationship forms a hyperplane in the multidimensional space. Similarly, a quadratic or logarithmic relationship forms a curved hypersurface.
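The generation steps for the multivariate linear relationship can be illustrated with a short Python sketch. The function name, the coefficients, the uniform [0, 1] range of the independent variables, and the uniform error range are all assumptions made for illustration; the text does not state the actual values used.

```python
import numpy as np


def make_linear_relationship(n_points, coefficients, noise_half_width, rng):
    """Generate [y, x_1, ..., x_m] rows for a noisy linear relationship.

    Hypothetical helper: y = x . coefficients + error, with the error drawn
    uniformly from [-noise_half_width, +noise_half_width] (an assumption).
    """
    coefficients = np.asarray(coefficients, dtype=float)
    # Step 1: a random value for each independent variable.
    x = rng.uniform(0.0, 1.0, size=(n_points, coefficients.size))
    # Step 2: the dependent variable from the linear function.
    y = x @ coefficients
    # Step 3: add a bounded random error to the dependent variable.
    y = y + rng.uniform(-noise_half_width, noise_half_width, size=n_points)
    return np.column_stack([y, x])


# Example: 243 points, one dependent and five independent variables.
data = make_linear_relationship(
    243, coefficients=np.full(5, 0.2), noise_half_width=0.05,
    rng=np.random.default_rng(0),
)
```

The resulting points lie close to a hyperplane in the six-dimensional attribute space, with the error range controlling how far they scatter from it.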
Table 2. Comparison of Type I and Type II errors for different number of variables (d). Columns: Type I (I) and Type II (II) errors for d = 3, 4, 5, and 6 under k = 50 and k = 75. Rows: five α values.

Table 2 shows the local entropy results for four multivariate data sets (d = 3, 4, 5, and 6) with different α and k values. Similar trends can be observed: (1) with α being constant, a larger k value will generate relatively more false discoveries but miss fewer patterns; (2) with k being


constant, a smaller α value will generate more false discoveries but miss fewer patterns; and (3) with both α and k being constant, the more variables involved, the more pattern points are missed (i.e., more Type II errors). Therefore, k should increase as the number of variables increases. This is expected since, because of the curse of dimensionality, more data points are needed to faithfully represent a relationship in higher-dimensional space. This is a common challenge for almost all data analysis methods (e.g., see Figure 10 for its effect on GWR). Nevertheless, in the multivariate data of six variables, the local entropy map still effectively detects the three areas that have significant relationships of different forms (Figure 9).

Figure 9. Local entropy maps of a multivariate synthetic data set, which has six variables (including one dependent variable and five independent variables). See Table 2 for more results with different numbers of variables and different α and k values.

Comparison with GWR

To demonstrate the advantage of the local entropy map in terms of detecting different forms of relationships simultaneously, the GWR method (available in ArcGIS 9.3) is also applied to both the bivariate data set and the multivariate data set used in previous

experiments. For each application of the GWR, the adaptive Gaussian kernel is used and its bandwidth is configured with two different options: (1) using the AICc (the corrected Akaike Information Criterion) method to find the optimal bandwidth; and (2) setting the bandwidth as 100 nearest neighbors, which is deliberately set larger than the k value (e.g., 50 or 75) in a local entropy map since GWR assigns much less weight to farther data points.

Figure 10 maps the local residual squares (r-square) of the GWR results. The top two maps are for the bivariate data, in which GWR can detect the linear relationship but misses the other two (although it partially captures the logarithmic relation as a linear one). The bottom two maps show the GWR results for the multivariate data with six variables. With the automatic bandwidth selection by AICc, GWR can only capture a small part of the linear relationship. Comparing Figures 7 and 9 with Figure 10, we can see that the local entropy map is at least as powerful as the GWR in detecting multivariate linear relationships, but more importantly, it can also detect other types of relationships at the same time.

Figure 10. Top-left: the GWR result of the bivariate data (same as in Figures 6–8), using the adaptive Gaussian kernel and the AICc method. Top-right: the GWR result of the bivariate data, using the adaptive Gaussian kernel and a bandwidth of 100 nearest neighbors. Bottom-left: the GWR result of the multivariate data (same as in Figure 9), using the adaptive Gaussian kernel and AICc method. Bottom-right: the GWR result of the same multivariate data with the adaptive Gaussian kernel and a bandwidth of 100 nearest neighbors.

5. Application with election data analysis

The 2004 US presidential election data at the county level are used to assess the proposed approach in real applications. For the simplicity of visual inspection, only two variables are used: the percentage of votes for Bush and the per capita income for each county. There are 3111 counties for the continental 48 states. A local neighborhood for a county C is the k nearest counties of C (including C itself), according to centroid-to-centroid distances.

The global bivariate data distribution is shown in Figure 11. Each dimension (variable) is linearly scaled and has seven value ticks, which are the nested mean values. Nested means are recursively calculated by first calculating the mean of all data values, cutting the data into two halves at the mean, finding the mean for each half, and so on. Since nested means adapt to the global data distribution, they can serve as the global context when only local data points are shown (see Figures 13 and 14).

From the scatterplot in Figure 11, we do not see any obvious global relationship between the two variables. However, its local entropy map (k = 50, α = 0.5) shows that the two variables have strong relationships in some local regions (see Figure 12). The local entropy map can detect the existence of a strong local relationship but would not be able to show what type of relationship it is. Therefore, the local entropy map is linked to a scatterplot so that users can interactively select a local neighborhood and visually inspect the local multivariate relationship in the scatterplot (see Figures 13 and 14).

Figure 13 shows four selected local regions that have significant bivariate relationships as indicated by the local entropy map. The corresponding local bivariate data are shown in the scatterplot on the right side of each map.
Figure 11. Two variables from the 2004 US presidential election data for 3111 counties.

From the entropy map, we can see that the significance levels of the four selected local regions are different, with the top two (California and Texas) being the most significant and the bottom one being relatively less significant (Maine). This is confirmed by the patterns shown in their corresponding scatterplots: in California, the percentage of votes for Bush has a strong negative linear relationship with the per capita income; in Texas, the relationship is equally strong but exactly the opposite; in Maine (and surrounding counties), it is visually obvious in the entropy map that


the relationship is weaker than those in California and Texas. Moreover, the local entropy map also successfully detects the existence of a different relationship in the North and South Carolina area, where the relationship is positively linear for medium-low income values but becomes negatively linear for high income values.

Figure 14 shows three local regions that have no significant bivariate relations as indicated by the local entropy map. From their corresponding scatterplots we can see that there is indeed no obvious dependence between the two variables. In addition to the examples shown in Figures 13 and 14, an extensive (and almost exhaustive) inspection of other local regions and their local bivariate relationships confirms that the local entropy map can not only detect the existence of dependent local relationships but also reliably differentiate relationships of different significance levels.

Figure 12. Local entropy map: percentage voted for Bush and per capita income.

6. Conclusion and discussions

This paper presents a nonparametric local analysis approach, the local entropy map, which does not assume a prior relationship model and can simultaneously discover different forms of local multivariate relationships. It estimates the Rényi entropy and uses its derived p-values as a measure of the goodness of local multivariate relationships. The p-values of local entropies are processed with a set of statistical testing methods to control the multiple-testing problem and to produce a local entropy map that allows interactive examination of local patterns.

The presented method requires two algorithmic parameters, k and α, which are the size of the local neighborhood and the edge power for the minimum spanning tree.
The k value is configured by balancing two considerations: (1) it should be large enough so that a relationship can be faithfully represented; and (2) it should be as small as possible to maintain a good spatial resolution in examining local patterns. The experiments shown in this paper suggest that k between 35 and 50 is a reasonable configuration for bivariate


analysis (d = 2) and should increase accordingly when more variables are used. For a data set of many variables (e.g., > 10), feature selection (to identify a small subset of variables) or dimension reduction (e.g., principal component analysis [PCA]) may be needed to reduce the number of variables before applying the local entropy map.

The α value also has two effects. On the one hand, a smaller α value will reduce the length difference between long and short edges and therefore is more robust to noise and extreme values. On the other hand, a larger α value can help discriminate subtle differences in patterns (but will be sensitive to noisy data). The experiments with the synthetic data set indicate that α = 0.5 or smaller is generally better in the presence of noise.

Figure 13. Selected examples of significant local relationships.
