A Hypothesis-Free Multiple Scan Statistic with Variable Window


Biometrical Journal 50 (2008) 2, DOI: /bimj

L. Cucala*

Institut de Mathématiques de Toulouse, Université Paul Sabatier, Toulouse Cedex 9, France

Received 7 August 2007, revised 14 January 2008, accepted 28 January 2008

Summary

In this article we propose a new technique for identifying clusters in temporal point processes. It relies on the comparison between all the m-order spacings and is totally independent of any alternative hypothesis. A recursive procedure is introduced which makes it possible to identify multiple clusters independently. This new scan statistic seems to be more efficient than the classical scan statistic for detecting and recovering cluster alternatives. These results have applications in epidemiological studies of rare diseases.

Key words: Disease alarm; Epidemiology; Scan statistics; Spacings.

1 Introduction

Let $X_1, \dots, X_n$ be independent and identically distributed random variables that denote the times of occurrence of $n$ events observed in an interval $(0, T)$. Without loss of generality, let $T = 1$ throughout this paper. The objective of this work is to identify, if they exist, the subintervals in which the number of events is abnormally high, usually called clusters. The absence of clusters corresponds to the null hypothesis $H_0$ that the events are uniformly distributed on $(0, 1)$.

This problem typically arises in epidemiological applications when one wants to assess the outbreak of an unknown disease and to analyse, for example, whether it is infectious or dependent on a seasonal environmental factor. In these cases, it is crucial to recover the clusters (or clustering zone) as precisely as possible.

Many procedures are applicable to aggregated data, that is when the observation period is divided into subintervals and only the number of events within each interval is known.
See for example the test introduced by Tango (1984), which relies on the division of the time interval into equal subintervals. These procedures may also be used when individual data are available, but the loss of information may be substantial and the test result depends on the arbitrarily chosen division. Moreover, as monitoring tools improve and computational capacities increase, methods for individual data become more and more relevant.

The most popular of the methods for individual data is the scan statistic, first introduced by Naus (1965). Originally it was simply the maximum number of events observed within an interval of fixed length $d$. Later on, Nagarwalla (1996) extended the method so as to compare intervals with different lengths: he introduced the scan statistic with variable window, also called the variable scan statistic. The test derived from this statistic is the generalized likelihood-ratio test for a uniform distribution against a piecewise-constant alternative. However, we may wonder whether this test is powerful against different clustering alternatives, sometimes more realistic, such as bell-shaped densities. A non-parametric method, due to Kelsall and Diggle (1995), relies on a kernel intensity estimation of the point process and identifies the clusters as the intervals in which the intensity estimate is higher than expected under the null hypothesis. Many other tests, so-called global tests, are powerful for rejecting the uniform hypothesis but do not identify the most significant clusters. Many of these tests rely on spacings (Pyke, 1965). Recently, Molinari, Bonaldi and Daurès (2001) introduced a cluster detection method based on applying a piecewise-constant regression model to the spacings derived from the times of occurrence of events. This method allows for multiple cluster detection but does not take into account the dependence between spacings. Moreover, like the variable scan statistic, it seems better suited to piecewise-constant cluster alternatives.

In this article, we introduce a multiple cluster detection method based on spacings and not relying on any particular alternative hypothesis. Section 2 introduces the test statistic, which is based on a new concentration index. We then study the distribution of this statistic under the null hypothesis in Section 3. In Section 4 we present a multiple procedure that makes it possible to detect several clusters. The value of the method is assessed by applying it to simulated and real data in Section 5. The paper concludes with a short discussion in Section 6.

* e-mail: cucala@cict.fr
© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

2 A Hypothesis-Free Scan Statistic

Let $X_1, \dots, X_n$ be defined as in the Introduction. We assume that the hypothesis $H_0$ is satisfied throughout this section. Let
$$0 = X_{(0)} \le X_{(1)} \le \dots \le X_{(n)} \le X_{(n+1)} = 1$$
denote the order statistics associated to $(X_1, \dots, X_n)$ and
$$D_i = X_{(i)} - X_{(i-1)}, \quad i = 1, \dots, n+1,$$
denote the associated (one-order) spacings. Since a single event cannot be considered as a cluster, the potential clusters are all the intervals $[X_{(i)}, X_{(j)}]$, $1 \le i < j \le n$. We need a concentration index in order to compare them. The one introduced by Nagarwalla (1996) relies on the generalized likelihood-ratio test between $H_0$ and a piecewise-constant density hypothesis.
We introduce another concentration index which is based only on $H_0$. Let
$$D_{i,j} = X_{(j)} - X_{(i)} = \sum_{k=i+1}^{j} D_k, \quad 1 \le i < j \le n,$$
denote the $(j-i)$-order spacings associated to $(X_1, \dots, X_n)$: they are beta distributed with parameters $a = j - i$ and $b = n + 1 - j + i$ (David, 1981). Let also
$$U_{i,j} = B_{\mathrm{inc}}(D_{i,j};\, j-i,\, n+1-j+i), \quad 1 \le i < j \le n,$$
where $B_{\mathrm{inc}}(\cdot\,;\, j-i,\, n+1-j+i)$ stands for the so-called incomplete Beta function, that is the distribution function of $D_{i,j}$. All the random variables $U_{i,j}$ are uniformly distributed on $(0, 1)$. Moreover, the smaller $U_{i,j}$ is, the greater the concentration of events on the interval $[X_{(i)}, X_{(j)}]$. Thus, a concentration index on the interval $[X_{(i)}, X_{(j)}]$ is the quantity
$$I_{\mathrm{HF}}(i,j) = 1/U_{i,j}$$
and the hypothesis-free (HF in short) scan statistic,
$$L_{\mathrm{HF}} = \sup_{1 \le i < j \le n} I_{\mathrm{HF}}(i,j),$$
is the maximum concentration observed on the whole domain.
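As an illustration, the HF scan statistic can be computed directly from its definition. The following Python sketch is not the author's R program; the function names `beta_cdf_int` and `hf_scan` are hypothetical. It assumes event times already scaled to (0, 1), and it evaluates the incomplete Beta function through the standard identity for integer parameters, $B_{\mathrm{inc}}(x;\, a,\, n+1-a) = P(\mathrm{Bin}(n, x) \ge a)$, so that only the standard library is needed.

```python
import math


def beta_cdf_int(x, a, n):
    """Distribution function at x of a Beta(a, n + 1 - a) variable, a integer.

    Uses the binomial-tail identity P(Bin(n, x) >= a); intended for moderate n.
    """
    return sum(
        math.comb(n, k) * x**k * (1.0 - x) ** (n - k) for k in range(a, n + 1)
    )


def hf_scan(times):
    """Return (L_HF, i, j): the HF scan statistic and the 1-based indices
    i < j of the interval [X_(i), X_(j)] that maximises I_HF(i, j) = 1 / U_ij."""
    x = sorted(times)
    n = len(x)
    best = (0.0, None, None)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            d_ij = x[j - 1] - x[i - 1]          # (j - i)-order spacing
            u_ij = beta_cdf_int(d_ij, j - i, n)  # uniform on (0, 1) under H0
            index = math.inf if u_ij == 0.0 else 1.0 / u_ij
            if index > best[0]:
                best = (index, i, j)
    return best


if __name__ == "__main__":
    # Five events, three of them tightly grouped around 0.51.
    print(hf_scan([0.05, 0.5, 0.51, 0.52, 0.95]))
```

On this small example the maximum is attained on the interval between the second and the fourth events. The double loop evaluates every pair $(i, j)$, which is acceptable for sample sizes such as $n = 100$ but would need refinement for much larger data sets.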

3 The Distribution of $L_{\mathrm{HF}}$ under $H_0$

The second step of a cluster detection method consists in testing whether the maximal concentration observed on the domain is significant. For this, we need to know, or at least to estimate, the distribution of this maximal concentration under the null hypothesis. Since the spacings are all correlated, even under $H_0$, finding the distribution of spacings-based statistics has always been tedious. For example, many authors have studied the distribution of the scan statistic with fixed window, and it was finally tabulated by Huntington and Naus (1975). Letting the window vary makes the problem even more difficult. Nagarwalla (1996) did not study the distribution of his variable scan statistic $L_{\mathrm{scan}}$, and his test is based on empirical quantiles obtained by a Monte Carlo procedure. However, in this section, we show how a technique introduced by Huffer and Lin (2001) can be used for deriving the distribution of $L_{\mathrm{HF}}$ under $H_0$. Note that this technique can also be used for deriving the distribution of the variable scan statistic $L_{\mathrm{scan}}$ (see Appendix A for more details).

We wish to express
$$P(t, n) = P_0(L_{\mathrm{HF}} < t),$$
where $P_0$ stands for the probability under the null hypothesis. Let $b_{\alpha,n,m}$ denote the quantile of the beta distribution such that $B_{\mathrm{inc}}(b_{\alpha,n,m};\, m,\, n+1-m) = \alpha$. We get that
$$\begin{aligned}
P(t, n) &= P_0\Big(\sup_{1 \le i < j \le n} I_{\mathrm{HF}}(i,j) < t\Big) \\
&= P_0\big(U_{i,j} > 1/t,\ \forall\, 1 \le i < j \le n\big) \\
&= P_0\big(D_{i,j} > b_{1/t,n,j-i},\ \forall\, 1 \le i < j \le n\big) \\
&= P_0\Big(\sum_{k=i+1}^{j} D_k > b_{1/t,n,j-i},\ \forall\, 1 \le i < j \le n\Big).
\end{aligned}$$
Now, Huffer and Lin (2001) have introduced an algorithm for computing the joint distribution of general linear combinations of uniform spacings (i.e. spacings derived from the uniform distribution). This algorithm consists in a repeated and systematic use of two basic recursions and finally leads to a sum of simpler components which can be easily expressed in closed form.
Using this algorithm, we get for example that
$$P(1/t, 3) = \begin{cases}
0 & \text{if } 1/t < 0, \\
1 - (1 - b_{t,3,2})^2\,[1 - 4 b_{t,3,2} + 6 b_{t,3,1}] & \text{if } 1/t \in [0, 2/3], \\
1 - (1 - 2 b_{t,3,1})^3 & \text{if } 1/t \in (2/3, 7/8], \\
1 & \text{if } 1/t > 7/8.
\end{cases}$$
For larger values of $n$, the algorithm can only be applied by a computer as it deals with high-dimensional matrices. However, these matrices must be arranged in a specific form (sometimes called lower block triangular form) through the permutation and the deletion of certain rows and columns. Translating this algorithm into a program is not straightforward and unfortunately the C program provided by the authors only works with simpler examples. Therefore, in the following, we use empirical quantiles obtained by a Monte Carlo procedure to carry out all the tests related to $L_{\mathrm{HF}}$ and $L_{\mathrm{scan}}$.

4 A Multiple Procedure

In this section we introduce a method to identify multiple clustering. This procedure can be used either with the HF scan statistic (as presented here) or with the variable scan statistic. The nominal level of the global test for $H_0$ is set to $\alpha$. Let $q_{n,\alpha}$ denote the quantile of $L_{\mathrm{HF}}$ such that $P(q_{n,\alpha}, n) = 1 - \alpha$. The density function of the random variables $X_1, \dots, X_n$ is denoted $f(\cdot)$. Let $m = \min_{t \in (0,1)} f(t)$. The clustering zone is $C = \{t \in (0, 1) : f(t) > m\}$ and the null hypothesis corresponds to $C = \emptyset$. The estimator of the clustering zone, denoted $\hat{C}$, is initially set to $\emptyset$.

The first step consists in computing $L_{\mathrm{HF}}$ from $X_1, \dots, X_n$. Let $i^*$ and $j^*$ denote the indices such that $L_{\mathrm{HF}} = I_{\mathrm{HF}}(i^*, j^*)$. We assume that the interval $[X_{(i^*)}, X_{(j^*)}]$ is a significant cluster, that is $L_{\mathrm{HF}} > q_{n,\alpha}$. In order to look for another significant cluster, we need to transform the data. Indeed, we do not want to analyse all the distances between events located in the cluster $[X_{(i^*)}, X_{(j^*)}]$. This is why we introduce a new set of random variables $\{X^{(2)}_k,\ k = 1, \dots, n - j^* + i^*\}$ defined by
$$X^{(2)}_k = \begin{cases}
\dfrac{X_{(k)}}{T^*} & \text{if } 1 \le k \le i^*, \\[2ex]
\dfrac{X_{(k + j^* - i^*)} - X_{(j^*)} + X_{(i^*)}}{T^*} & \text{if } i^* + 1 \le k \le n - j^* + i^*,
\end{cases}$$
where $T^* = 1 - X_{(j^*)} + X_{(i^*)}$. This data transformation is illustrated by Figure 1 for $n = 6$, $i^* = 2$ and $j^* = 4$.

[Figure 1: The multiple procedure. Step 1: the spacings D1, D2, D5, D6, D7 lying outside the cluster, whose total length is T*. Step 2: the rescaled spacings D1/T*, D2/T*, D5/T*, D6/T*, D7/T*.]

The spacings between these new data are all the original spacings located outside the cluster $[X_{(i^*)}, X_{(j^*)}]$, scaled so that their sum equals 1. In fact, if $C \subseteq [X_{(i^*)}, X_{(j^*)}]$, the random variables $\{X^{(2)}_k,\ k = 1, \dots, n - j^* + i^*\}$ are distributed as order statistics of a $(0, 1)$-uniform sample. The proof is given in Appendix B. Therefore, we look for any cluster in this new data set using the HF scan statistic $L_{\mathrm{HF}}$ again. This procedure is repeated as long as $L_{\mathrm{HF}}$ remains significant.

Note that this procedure could identify a first cluster $C_1$ and then a second one $C_2$ containing the first one. This just means that the concentration is more significant in $C_1$ but still high in $C_2 \setminus C_1$. This suggests that the density function $f(\cdot)$ may be bell-shaped.

Recently, Zhang, Kulldorff and Assunção (2007) introduced a multiple procedure adapted to spatial scan statistics (Kulldorff, 1997).
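The recursive procedure of this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the author's implementation: all function names are hypothetical, the test $L_{\mathrm{HF}} > q_{n,\alpha}$ is carried out by comparing a Monte Carlo p-value with $\alpha$ (with an add-one estimator, which is our choice), and each detected cluster is reported through the original times attached to its two end events.

```python
import math
import random


def beta_cdf_int(x, a, n):
    """Beta(a, n + 1 - a) distribution function via P(Bin(n, x) >= a)."""
    return sum(
        math.comb(n, k) * x**k * (1.0 - x) ** (n - k) for k in range(a, n + 1)
    )


def hf_scan(x):
    """Return (L_HF, i*, j*) for sorted times x, with 1-based indices."""
    n = len(x)
    best = (0.0, None, None)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            u = beta_cdf_int(x[j - 1] - x[i - 1], j - i, n)
            index = math.inf if u == 0.0 else 1.0 / u
            if index > best[0]:
                best = (index, i, j)
    return best


def remove_cluster(x, i, j):
    """Collapse [X_(i), X_(j)] and rescale so the remaining spacings sum to 1."""
    shift = x[j - 1] - x[i - 1]
    t_star = 1.0 - shift
    return [v / t_star for v in x[:i]] + [(v - shift) / t_star for v in x[j:]]


def mc_p_value(observed, n, n_sim, rng):
    """Monte Carlo p-value of L_HF under H0 for n events (add-one estimator)."""
    exceed = 0
    for _ in range(n_sim):
        sim = sorted(rng.random() for _ in range(n))
        if hf_scan(sim)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


def multiple_procedure(times, alpha=0.05, n_sim=99, seed=0):
    """Return a list of (start, end, p_value), one entry per significant step."""
    rng = random.Random(seed)
    data = sorted(times)
    # Original time carried by each remaining point; the collapsed cluster is
    # represented by its left end event X_(i*).
    labels = list(data)
    clusters = []
    while len(data) >= 2:
        stat, i, j = hf_scan(data)
        p = mc_p_value(stat, len(data), n_sim, rng)
        if p > alpha:
            break
        clusters.append((labels[i - 1], labels[j - 1], p))
        data = remove_cluster(data, i, j)
        labels = labels[:i] + labels[j:]
    return clusters


if __name__ == "__main__":
    background = [0.05 + 0.1 * k for k in range(10)]
    burst = [0.5 + 0.001 * k / 19 for k in range(20)]
    print(multiple_procedure(background + burst, seed=1))
```

With $n = 6$, $i^* = 2$ and $j^* = 4$, `remove_cluster` returns four points whose five spacings are D1, D2, D5, D6 and D7 divided by T*, as in Figure 1. A later cluster that starts at the collapsed point is reported from the left end of the earlier cluster, so it contains that earlier cluster.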
The procedure of Zhang, Kulldorff and Assunção simply consists in removing the most significant cluster and carrying on with the procedure. Note that doing the same in the temporal setting is not appropriate, since the data to analyse in step two would be $\{X^{(2)}_1, \dots, X^{(2)}_{i^*-1}, X^{(2)}_{i^*+1}, \dots, X^{(2)}_{n-j^*+i^*}\}$, which are not distributed as order statistics of a uniform sample, even if $[X_{(i^*)}, X_{(j^*)}]$ coincides exactly with $C$.

5 Data Analysis

In this section we compare the results obtained by the test based on the HF scan statistic with those of the test based on the variable scan statistic. The multiple procedure introduced in the previous section is used both with the HF scan statistic $L_{\mathrm{HF}}$ (it is denoted $ML_{\mathrm{HF}}$) and with the variable scan statistic $L_{\mathrm{scan}}$ (it is denoted $ML_{\mathrm{scan}}$).

5.1 Simulated data

We apply the different cluster detection methods to data sets containing $n = 100$ events. We compare the results obtained by the HF and variable scan statistics, $L_{\mathrm{HF}}$ and $L_{\mathrm{scan}}$, and by the multiple procedures derived from these statistics, $ML_{\mathrm{HF}}$ and $ML_{\mathrm{scan}}$. Note that the variable scan statistic depends on the parameter $n_0$, the minimal number of events in a cluster, which must be set before applying the procedure. Here, we set this parameter to $n_0 = 5$, which is the arbitrary value chosen by Nagarwalla (1996).

5.1.1 One-cluster detection

The events are simulated according to different one-cluster alternatives. A flat cluster is obtained using the density function
$$f_1(x) = \frac{1}{0.8 + 0.2r}\, \mathbf{1}_{(0,0.4] \cup [0.6,1)}(x) + \frac{r}{0.8 + 0.2r}\, \mathbf{1}_{(0.4,0.6)}(x).$$
A bell-shaped cluster is obtained using the density function
$$f_2(x) = \frac{15}{2r + 13}\, \mathbf{1}_{(0,0.4] \cup [0.6,1)}(x) + \frac{15}{2r + 13}\, \{1 + (r - 1)[1 - 100(x - 0.5)^2]\}\, \mathbf{1}_{(0.4,0.6)}(x).$$
In both cases, the parameter $r$ is the ratio between the maximum density and the minimum density on $(0, 1)$. These two functions are plotted in Figure 2 for different values of $r$.

We compare the power of the global tests associated with each method. Moreover, we check whether the significant clusters exhibited by the different methods match the true cluster, that is the interval in which the density is higher, $C = (0.4, 0.6)$ in our examples. Let $\nu(\cdot)$ denote the Lebesgue measure and let $\bar{A} = (0, 1) \setminus A$ for all $A \subset (0, 1)$. A true positive index is given by $TP = \nu(C \cap \hat{C})$. A true negative index is given by $TN = \nu(\bar{\hat{C}} \cap \bar{C})$. Finally, a correspondence index between the true and the estimated clustering zone is given by $I = TP + TN$.

For each alternative, 1000 data sets are simulated. The nominal level of the global test is set to 5%. All the tests conducted in this section rely on a Monte Carlo procedure with a number of simulations equal to [...]. Tables 1 and 2 give the results obtained with two different cluster shapes. The bold values are the best results obtained among all the methods. By definition, the multiple procedures have the same power as the scan statistics they are derived from. Their true positive index is larger and their true negative index smaller.
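To make the simulation design concrete, the sketch below draws samples from $f_1$ and $f_2$ and evaluates the indices TP, TN and I for an estimated clustering zone given as a union of disjoint intervals. It is only an illustration in Python: the function names are hypothetical, the flat density is sampled as a two-component mixture and the bell-shaped one by rejection sampling, which are our choices rather than the author's.

```python
import random


def f2(x, r):
    """Bell-shaped one-cluster density on (0, 1) with max/min ratio r."""
    base = 15.0 / (2.0 * r + 13.0)
    if 0.4 < x < 0.6:
        return base * (1.0 + (r - 1.0) * (1.0 - 100.0 * (x - 0.5) ** 2))
    return base


def sample_f1(n, r, rng):
    """Draw n sorted events from the flat-cluster density f1."""
    p_inside = 0.2 * r / (0.8 + 0.2 * r)  # probability mass of (0.4, 0.6)
    out = []
    for _ in range(n):
        if rng.random() < p_inside:
            out.append(rng.uniform(0.4, 0.6))
        else:
            u = rng.uniform(0.0, 0.8)       # uniform on the outside region,
            out.append(u if u < 0.4 else u + 0.2)  # skipping over (0.4, 0.6)
    return sorted(out)


def sample_f2(n, r, rng):
    """Draw n sorted events from f2 by rejection from the uniform density."""
    f_max = f2(0.5, r)
    out = []
    while len(out) < n:
        x = rng.random()
        if rng.random() * f_max <= f2(x, r):
            out.append(x)
    return sorted(out)


def overlap(a, b):
    """Length of the intersection of two intervals given as (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def indices(c_hat, c_true=(0.4, 0.6)):
    """Return (TP, TN, I) for c_hat, a list of disjoint (start, end) intervals."""
    tp = sum(overlap(interval, c_true) for interval in c_hat)
    length_hat = sum(end - start for start, end in c_hat)
    # TN is the length of the region lying outside both C and C-hat.
    tn = 1.0 - (c_true[1] - c_true[0]) - length_hat + tp
    return tp, tn, tp + tn


if __name__ == "__main__":
    rng = random.Random(0)
    events = sample_f2(100, 3.0, rng)
    print(len(events), indices([(0.35, 0.55)]))
```

With $C = (0.4, 0.6)$, TP is at most 0.2 and TN at most 0.8, so $I = 1$ only when the estimated zone coincides with the true one.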
We first note that the HF scan statistic is always more powerful than the variable scan statistic, even when the cluster alternative is piecewise-constant. The multiple procedure $ML_{\mathrm{HF}}$ always achieves the best true positive index and the variable scan statistic $L_{\mathrm{scan}}$ the best true negative index. It seems that the HF concentration index gives more weight to larger intervals containing a great number of events than the concentration index derived by Nagarwalla (1996). Thus, the HF scan statistic usually selects larger clusters than the variable scan statistic. Since the correspondence index is always larger for the HF methods, these techniques seem to be better suited to cluster detection than the classical scan statistic.

[Figure 2: One-cluster alternative functions. Two panels showing f against x for r = 2, r = 3 and r = 5.]

[Table 1: Tests applied to a simulated bell cluster. Empirical Power, TP, TN and I of $L_{\mathrm{HF}}$, $ML_{\mathrm{HF}}$, $L_{\mathrm{scan}}$ and $ML_{\mathrm{scan}}$ for each value of r.]

[Table 2: Tests applied to a simulated flat cluster. Empirical Power, TP, TN and I of $L_{\mathrm{HF}}$, $ML_{\mathrm{HF}}$, $L_{\mathrm{scan}}$ and $ML_{\mathrm{scan}}$ for each value of r.]

5.1.2 Multiple cluster detection

Two bell-shaped clusters are obtained using the density function
$$\begin{aligned}
f_3(x) ={}& \frac{15}{4r + 11}\, \mathbf{1}_{(0,0.2] \cup [0.4,0.6] \cup [0.8,1)}(x) \\
&+ \frac{15}{4r + 11}\, \{1 + (r - 1)[1 - 100(x - 0.3)^2]\}\, \mathbf{1}_{(0.2,0.4)}(x) \\
&+ \frac{15}{4r + 11}\, \{1 + (r - 1)[1 - 100(x - 0.7)^2]\}\, \mathbf{1}_{(0.6,0.8)}(x),
\end{aligned}$$
which is plotted in Figure 3 for different values of $r$.

[Figure 3: Two-cluster alternative functions. f against x for r = 2, r = 3 and r = 5.]

[Table 3: Tests applied to two simulated bell clusters. Empirical Power, TP, TN and I of $L_{\mathrm{HF}}$, $ML_{\mathrm{HF}}$, $L_{\mathrm{scan}}$ and $ML_{\mathrm{scan}}$ for each value of r.]

The simulation study is otherwise exactly the same. Table 3 gives the results. These results confirm the efficiency of the HF methods compared to the classical ones. They also point out the benefit of the multiple procedure when the clustering zone is the union of disjoint sets.

5.2 Real data

5.2.1 Knox data set

The same methods are now applied to a classical data set published by Knox (1959) and describing cases of two birth defects, oesophageal atresia and tracheo-oesophageal fistula, observed in a hospital in Birmingham, UK, during 2191 days, between 1950 and [...]. The observation days corresponding to the 35 events are plotted in Figure 4. There is a high concentration of events around the middle of the observation period and another one at the end. However, comparing these two possible clusters and, moreover, assessing their significance is not possible by visual inspection.

Before applying the cluster detection methods, the observation days are scaled so that they lie in $[0, 1]$. The variable scan statistic $L_{\mathrm{scan}}$ is computed for two different values of the parameter $n_0$, the minimal number of events in a cluster. Table 4 gives the most significant clusters $\hat{C}$ and the associated p-value for each method.

[Figure 4: Knox data set.]

We set the nominal level to $\alpha =$ [...]. The HF scan method identifies a significant cluster from event 8 to event 22: it corresponds to the high concentration in the middle of the observation period. This interval also maximises the concentration index used by Nagarwalla (1996). However, the significance values for $L_{\mathrm{scan}}$ associated with this cluster depend strongly on the parameter $n_0$. Choosing $n_0 = 5$, the arbitrary value recommended by Nagarwalla (1996), leads to significant clustering. Nevertheless, a logical choice for $n_0$ is 2, since two events may constitute a cluster if they are close enough. This is the implicit choice made when we use $ML_{\mathrm{HF}}$. But making this choice for $ML_{\mathrm{scan}}$ leads to rejecting the significance of the cluster. Therefore, the results obtained with the variable scan method are quite difficult to interpret, whereas the HF scan method clearly indicates significant clustering.

Table 4 Tests applied to Knox data set.

Method                              | $\hat{C}$                      | p-value
$ML_{\mathrm{HF}}$                  | [1233, 1491]                   | [...]
                                    | [1233, 1491] ∪ [2049, 2174]    | [...]
$ML_{\mathrm{scan}}$ ($n_0 = 2$)    | [1233, 1491]                   | [...]
$ML_{\mathrm{scan}}$ ($n_0 = 5$)    | [1233, 1491]                   | [...]
                                    | [1233, 1491] ∪ [2049, 2174]    | [...]

A second cluster is detected at the end of the period, between event 29 and event 35, but its p-value is too high, whatever the method. However, it is important to note that this could be the beginning of a more significant cluster that is not detected because the study stops at day [...].

5.2.2 Lancashire data set

We now apply the methods to a data set describing the location of cancers of the larynx and of the lung recorded between 1973 and 1984 in the Chorley-Ribble area, Lancashire, UK (Diggle et al., 1990). In Figure 5, the lung cancer cases are represented by smaller dots and the larynx cancer cases by larger dots. The downward-pointing triangle represents a disused incinerator around which the larynx cancer cases are suspected to aggregate. Since the lung cancer cases seem to be less dependent on any environmental factor, they are used as control data.

[Figure 5: Cancer cases in Lancashire.]

Let $X_1, \dots, X_n$ denote the Euclidean distances from the $n$ larynx cancer cases to the incinerator and $Y_1, \dots, Y_m$ denote the Euclidean distances from the $m$ lung cancer cases to the incinerator. We would like to test whether the $X_i$'s are distributed as the $Y_i$'s or whether they cluster around 0. First, we need to transform the data into
$$T_i = \int_{-\infty}^{X_i} \hat{f}_h(t)\, dt, \quad 1 \le i \le n,$$
where $\hat{f}_h(\cdot)$ is a kernel density estimate computed from the $Y_i$'s, and $h$ is a bandwidth given by a cross-validation method (Scott, 1992). Then the different scan methods are applied to $T_1, \dots, T_n$, and Table 5 gives the most significant cluster and the associated p-value for each one.

Table 5 Tests applied to Lancashire data set.

Method                              | $\hat{C}$  | p-value
$ML_{\mathrm{HF}}$                  | [1, 4]     | [...]
$ML_{\mathrm{scan}}$ ($n_0 = 2$)    | [1, 3]     | [...]
$ML_{\mathrm{scan}}$ ($n_0 = 5$)    | [1, 5]     | [...]

We set the nominal level to $\alpha = 0.01$. The HF scan method identifies a very significant cluster containing the 4 larynx cancer cases closest to the incinerator. On the other hand, the variable scan method gives contradictory results, depending on the value of the parameter $n_0$. When $n_0 = 2$, a significant cluster containing the closest 3 larynx cancer cases is exhibited. But when $n_0 = 5$, it does not conclude that there is significant clustering. Indeed, the fifth closest larynx cancer case is further from the incinerator than many lung cancer cases. These results confirm that the variable scan statistic is very efficient when one knows a priori the number of events in the clusters to detect. When this is not the case, which happens most of the time, the HF scan method should be preferred.

Note that this data set has previously been studied by Kelsall and Diggle (1995): they compared the kernel density estimates based on the $X_i$'s and the $Y_i$'s but they obtained a non-significant result. We believe that this comes from using a statistic reflecting the global discrepancy between $H_0$ and the data set over the whole observation domain. By contrast, all the scan statistics rely on a local discrepancy measure in the possible clusters and are thus more likely to detect small clusters.

6 Discussion

The presented method allows one to detect several clusters in temporal data without assuming anything about the clustering structure, and without setting any parameter.
The widely used variable scan statistic (Nagarwalla, 1996) also makes it possible to compare intervals of different lengths, but it appears that the parameter $n_0$, the minimal number of events in a cluster, plays a major role. The HF scan statistic is much less influenced by this minimal number of events, so that it can be considered as parameter-free. Moreover, its empirical results are better.

When applying the variable scan statistic $L_{\mathrm{scan}}$, practitioners often rank all intervals $[X_{(i)}, X_{(j)}]$ according to the concentration index introduced by Nagarwalla (1996). They then consider that the significant clusters are all the intervals where the concentration is higher than the $\alpha$-quantile of $L_{\mathrm{scan}}$. This leads to a conservative procedure. The multiple procedure we introduce here makes it possible to detect multiple clusters independently and gives a significance value to each cluster. Moreover, this procedure does not restrict the performance of the single test when only one cluster is present.

It is sometimes necessary to take into account the inhomogeneity of the observed population, as is usually done in the spatial setting (Kulldorff, 1997). The adaptation of the HF scan statistic to inhomogeneity is obtained through the transformation introduced in the Lancashire data set analysis, which is similar to the spacings transformation introduced by Molinari, Bonaldi and Daurès (2001).

In this article, we have only discussed scan statistics used for retrospective disease surveillance, that is statistics regarding all intervals as potential clusters. Recently, Kulldorff (2001) introduced a prospective scan statistic that considers only the time intervals that are still alive, that is those including the end of the observation period. This statistic is based on the variable scan statistic, but the adaptation to the HF scan statistic is straightforward and could provide an efficient tool for regular time periodic disease surveillance.
Even if the theory underlying the HF scan statistic relies on individual data, the concentration index $I_{\mathrm{HF}}$ may also quantify the concentration on a subinterval when we do not know the locations of the events, but only how many there are. Thus, it could also be used for analysing aggregated data.

Finally, a possible extension of this method to spatial or spatio-temporal clustering problems is to transform the data as proposed by Dematteï, Molinari and Daurès (2007) to construct a distance between events. This is the subject of future work.

Appendix

Appendix A

Let $c_{\alpha,n,m}$ denote the quantity such that $f_{\mathrm{scan}}(c_{\alpha,n,m};\, n,\, m) = \alpha$, where
$$f_{\mathrm{scan}}(t;\, n,\, m) = \left(\frac{m+1}{nt}\right)^{m+1} \left(\frac{n-m-1}{n(1-t)}\right)^{n-m-1} \mathbf{1}\left\{\frac{m+1}{n} > t\right\}.$$
The concentration index on the interval $[X_{(i)}, X_{(j)}]$ related to the variable scan statistic is
$$I_{\mathrm{scan}}(i,j) = f_{\mathrm{scan}}(D_{i,j};\, n,\, j-i).$$
Thus, we get that
$$\begin{aligned}
P_0(L_{\mathrm{scan}} < t) &= P_0\Big(\sup_{1 \le i < j \le n} I_{\mathrm{scan}}(i,j) < t\Big) \\
&= P_0\big(f_{\mathrm{scan}}(D_{i,j};\, n,\, j-i) < t,\ \forall\, 1 \le i < j \le n\big) \\
&= P_0\big(D_{i,j} > c_{t,n,j-i},\ \forall\, 1 \le i < j \le n\big) \\
&= P_0\Big(\sum_{k=i+1}^{j} D_k > c_{t,n,j-i},\ \forall\, 1 \le i < j \le n\Big).
\end{aligned}$$

Appendix B

Let $N_C$ denote the random number of events located in the clustering zone $C$. Given $N_C = n_C$, there are $(n - n_C)$ events, denoted by $Y_1 \le \dots \le Y_{n-n_C}$, uniformly distributed on $\bar{C} = (0, 1) \setminus C$. Let $Z_k = \nu((0, Y_k) \cap \bar{C})$, $k = 1, \dots, n - n_C$. The random variables $Z_1, \dots, Z_{n-n_C}$ are order statistics associated with an $(n - n_C)$ uniform sample on $(0, \nu(\bar{C}))$. Since $C \subseteq [X_{(i^*)}, X_{(j^*)}]$, we get that $X_{(i^*)} = Z_{i^*}$ and $X_{(j^*)} = Z_{j^* - n_C}$. Let
$$\tilde{D}_k = Z_k - Z_{k-1}, \quad k = 1, \dots, n - n_C + 1,$$
denote the spacings associated with $Z_1, \dots, Z_{n-n_C}$, where $Z_0 = 0$ and $Z_{n-n_C+1} = \nu(\bar{C})$. Since these uniform spacings are interchangeable random variables, the random variables
$$\tilde{X}_k = \begin{cases}
\displaystyle\sum_{l=1}^{k} \tilde{D}_l & \text{if } 1 \le k \le i^*, \\[2ex]
\displaystyle\sum_{l=1}^{i^*} \tilde{D}_l + \sum_{l=j^*-n_C+1}^{k+j^*-i^*} \tilde{D}_l & \text{if } i^* + 1 \le k \le n - n_C - j^* + i^*,
\end{cases}$$
are order statistics associated with an $(n - n_C - j^* + i^*)$ uniform sample on $(0, T^*)$. Scaling these random variables by $T^*$ leads to the expected result.

Acknowledgements

This research project is within the scope of my Ph.D. thesis work under the direction of Christine Thomas-Agnan. I gratefully thank her for her constant help and support. All programs are written in the R language and are available on demand.

Conflict of Interests Statement

The author has declared no conflict of interest.

References

David, H. A. (1981). Order statistics. Second edition. New York: Wiley.
Dematteï, C., Molinari, N., and Daurès, J. P. (2007). Arbitrarily shaped multiple spatial cluster detection for case event data. Computational Statistics and Data Analysis 51.
Diggle, P. J., Gatrell, A. C., and Lovett, A. A. (1990). Modelling the prevalence of cancer of the larynx in part of Lancashire: a new methodology for spatial epidemiology. In: R. W. Thomas (ed.), Spatial epidemiology, 21. London: Pion.
Huffer, F. W. and Lin, C. T. (2001). Computing the joint distribution of general linear combinations of spacings or exponential variates. Statistica Sinica 11.
Huntington, R. J. and Naus, J. I. (1975). A simpler expression for k-th nearest neighbor coincidence probabilities. Annals of Probability 5.
Kelsall, J. E. and Diggle, P. J. (1995). Kernel estimation of relative risk. Bernoulli 1.
Knox, G. (1959). Secular pattern of congenital oesophageal atresia. British Journal of Preventive Social Medicine 13.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics. Theory and Methods 6.
Kulldorff, M. (2001). Prospective time-periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society, Series A 164.
Molinari, N., Bonaldi, C., and Daurès, J. P. (2001). Multiple temporal cluster detection. Biometrics 57.
Nagarwalla, N. (1996). A scan statistic with a variable window. Statistics in Medicine 15.
Naus, J. I. (1965). The distribution of the size of the maximum cluster of points on a line. Journal of the American Statistical Association 61.
Pyke, R. (1965). Spacings (with discussion). Journal of the Royal Statistical Society, Series B 27.
Scott, D. W. (1992). Multivariate density estimation. Theory, practice, and visualization. New York: Wiley.
Tango, T. (1984). The detection of disease clustering in time. Biometrics 40.
Zhang, Z., Kulldorff, M., and Assunção, R. (2007). Spatial scan statistics adjusted for multiple clusters. Preprint.
