A Separability Index for Distance-based Clustering and Classification Algorithms

Journal of Machine Learning Research 1 (2000) 1-48. Submitted April 30, 2010; Published 10/00.

A Separability Index for Distance-based Clustering and Classification Algorithms

Arka P. Ghosh, Ranjan Maitra and Anna D. Peterson
Department of Statistics, Iowa State University, Ames, IA
apghosh@iastate.edu, maitra@iastate.edu, ericksad@iastate.edu

Editor:

Abstract

We propose a separability index that captures the degree of difficulty in a hard clustering problem where each observation is generated from one of K different p-variate Gaussian distributions (K ≥ 2 and p ≥ 1). This index is motivated by the intuition that, in general, an observation from a Gaussian distribution should be closer (in terms of a suitable metric) to its own mean than to the mean of a different Gaussian distribution. The purpose of this index is to capture the difficulty of clustering a dataset based on the value of the index for that particular dataset. We explore several properties of the index and make adjustments to it so that its values closely track the Adjusted Rand index. Furthermore, this index leads to the development of a data-simulation algorithm that generates datasets having a specified value of the index. This algorithm is particularly useful for systematically generating datasets with varying degrees of clustering difficulty (captured by their index values), which can be used as a platform for comparing different clustering algorithms. We also discuss an estimated version of the index calculated for some well-known datasets containing class information for each observation. For these datasets with multiple classes, we illustrate the use of a pairwise (between-cluster) version of the index to summarize the distinctiveness of each class relative to the others.

Keywords: separation index, partitional clustering algorithm, clustering complexity, multivariate visualization, exact-c-separation, Iris dataset
1. Introduction

There is a large body of literature (Hartigan, 1985; Ramey, 1985; Kaufman and Rousseeuw, 1990; Everitt et al., 2001; McLachlan and Basford, 1988; Fraley and Raftery, 2002; Kettenring, 2006; Baudry et al., 2008; Hennig, 2010) dedicated to clustering, i.e., the difficult problem of categorizing data into groups of similar observations, but no method uniformly outperforms the others. Many algorithms perform well in some settings but not in others. More importantly, in many cases the settings in which an algorithm works well or poorly are not well understood. Thus there is a need for a systematic study of the performance of any clustering algorithm, and also for evaluating the effectiveness of new methods using the same objective criterion.

©2000 Ghosh, Maitra and Peterson.

Many researchers therefore evaluate the performance of a proposed clustering technique by comparing its performance on classification datasets. A partial listing of commonly used datasets in this regard includes textures (Brodatz, 1966), wine (et al., 1991), Iris (Anderson, 1935), crabs (Campbell, 1974), image (Newman et al., 1998), E. coli (Nakai and Kanehisa, 1991), as well as Ruspini's (1970) dataset. The approach of evaluating performance on select classification datasets, while useful in assessing performance, does not provide a systematic and comprehensive understanding of the performance of clustering algorithms in a wide variety of scenarios. Thus, there is a need for the ability to simulate datasets of different clustering difficulty and to calibrate the performance of a clustering algorithm under different scenarios. In order to do so, we need to define an appropriate measure to quantify the clustering complexity of a dataset. Such an index should ideally be amenable to calibrating clustering algorithms in high as well as low dimensions, and for small as well as large datasets.

1.1 Background and Related Work

There has been some attempt at providing objective methods for performance evaluations in terms of separability indices. Milligan (1985) proposed an algorithm that generates well-separated clusters from normal distributions within the constraint of a bounded range, and that also has provisions for including scatter (Maitra and Ramler, 2009), non-informative dimensions, outliers, and normally distributed errors added to each observation. The algorithm is much used (Milligan et al., 1983; Milligan and Cooper, 1986, 1988; Balakrishnan et al., 1994; Brusco and Cradit, 2001) because it is easily implemented with properties similar to real datasets, but McIntyre and Blashfield (1980) observed that both increasing the variance and adding outliers increases the variance of the clusters, which increases the degree of overlap in unpredictable and differing ways.
Thus, this method is incapable of accurately indexing clustering complexities. As a result, Milligan's algorithm does not provide a way to generate datasets of a guaranteed clustering complexity. Steinley and Henson (2005) proposed the OCLUS algorithm for generating clusters based on known (asymptotic) overlap by having the user provide a design matrix and an order matrix: the former indicates the clusters that are desired to overlap with each other, and the latter dictates the ordering of observations. The design matrix is set up such that a cluster can overlap with at most two other clusters. The order matrix indicates the ordering of clusters in each dimension. Although the idea of using overlap as a way to generate clusters is appealing, the algorithm has some restrictions, such as the forced independence between dimensions and the constraints on the design matrix. Qiu and Joe (2006a) proposed a separation index between pairs of clusters in a univariate framework, which was extended to higher dimensions by optimally projecting all pairs of the multivariate normal clusters onto one-dimensional space and then calculating the index based on the resulting projections. Their separation index is then defined as the maximum of all of the pairwise index values. The index is the basis of their cluster generation algorithm (Qiu and Joe, 2006b) and their corresponding R package clusterGeneration (formerly, GenClus). However, their summary index is based only on the two nearest groups, so that variation between other groups may not impact the index. Additionally, the attempt to characterize separation between several multi-dimensional clusters by means of the best univariate projection clearly loses substantial information, and thus the resulting statements on

cluster overlap are very partial and can be misleading. The approach involving projections is also computationally intensive and practically unusable for very high dimensions. Likas and Verbeek (2003) and Verbeek et al. (2003a,b) demonstrated the performance of their clustering algorithms using simulation datasets based on the concept (or a variant) of the c-separation between clusters proposed by Dasgupta (1999). Developed in the context of learning Gaussian mixtures, Dasgupta (1999) defined two Gaussian distributions N(µ_1, Σ_1) and N(µ_2, Σ_2) in R^n to be c-separated if ‖µ_1 − µ_2‖ ≥ c √(n max(λ_max(Σ_1), λ_max(Σ_2))), where λ_max(Σ_i) is the largest eigenvalue of Σ_i. Maitra (2009) formalized this index by requiring equality for at least one pair (i, j) of clusters (called exact-c-separation) and used it for generating datasets to calibrate some partitional initialization algorithms. Maitra (2009) also pointed out some shortcomings of using exact-c-separation, owing to the inherent nature of the index, which ignores the relative orientations of the cluster dispersions. Other factors (such as mixing proportions) beyond separation also impact the degree of difficulty of clustering; the latter is thus only partially captured by exact-c-separation. More recently, Maitra and Melnykov (2010) proposed a method for generating Gaussian mixture distributions according to a summarized overlap measure between any two component pairs. The overlap between any two clusters was defined as the unweighted sum of their individual misclassification probabilities. They also provided open-source C software (C-MixSim) and an R package (MixSim) for generating clusters corresponding to different desired overlap characteristics.

In contrast to many of the existing indices and simulation algorithms, their methodology does not impose any restriction on the parameters of the distributions, but it was derived entirely in the context of mixture models and model-based clustering algorithms. Thus, their methods were specifically geared toward soft clustering and model-based clustering scenarios. In this paper, we complement the scenario of Maitra and Melnykov (2010) and derive a separation index (Section 2) for distance-based partitioning and hard clustering algorithms. Our index is motivated by the intuition that, for any two well-separated groups, the majority of observations should be closer to their own center than to the other center. In other words, we expect that the distance of most observations to their correct center should be smaller than their distances to the other group centers. We use Gaussian-distributed clusters and Euclidean or Mahalanobis distance to simplify our theoretical calculations. We investigate and fine-tune this index in the context of homogeneous spherically symmetric clusters in Section 3.1 and also develop a summary for multiple groups. The methodology is then studied in the general case in Section 3.2. In that section, we also introduce the parallel empirical distribution plot, or PED-plot, to display grouped multivariate data and use it to demonstrate the performance of our methodology in multiple dimensions. Our index can be used to quantify class distinctions in grouped data, and we illustrate this application in Section 4 in the context of several classification datasets. The main paper concludes with some discussion. The paper also has an appendix detailing the algorithm that uses the developed index to generate datasets of desired clustering complexity, and another appendix with two supporting tables for the material in Sections 3.1 and 3.2.

2. Methodological Development

Consider a dataset S = {X_1, X_2, ..., X_n}. The objective of hard clustering or fixed-partitioning algorithms is to group the observations into hard categories C_1, C_2, ..., C_K such that some objective function measuring the quality of fit is minimized. For instance, if the objective function is specified in terms of minimizing the total distance of each observation from some characteristic of its assigned partition, then we may define the objective function as

O_K = Σ_{i=1}^{n} Σ_{k=1}^{K} I(X_i ∈ C_k) D_k(X_i),   (1)

where D_k(X_i) is the distance of X_i from the k-th partition. We consider a distance measure used for fixed-partitioning algorithms that is of the form

D_k(X_i) = (X_i − µ_k)′ ∆_k (X_i − µ_k),   (2)

where ∆_k is a non-negative definite matrix of dimension p × p. We consider two special cases here. In the first case, ∆_k = I_p, the identity matrix of order p × p, so that D_k(X_i) reduces to the squared Euclidean distance between X_i and µ_k, and solving (1) involves finding partitions C_1, C_2, ..., C_K of S such that the sum of the squared Euclidean distances of each observation to the center of its assigned partition is minimized. The popular k-means algorithm (Hartigan, 1975; Hartigan and Wong, 1979) provides locally optimal solutions to this minimization problem. In the second scenario, ∆_k = Σ_k^{−1}, where Σ_k is the dispersion matrix of the k-th partition; then D_k(X_i) is the Mahalanobis distance between X_i and µ_k. There is actually a convenient model-based interpretation for the setup above, which we adopt in this paper. Specifically, let X_1, X_2, ..., X_n be independent p-variate observations with X_i ∼ N_p(µ_{ζ_i}, Σ_{ζ_i}), where ζ_i ∈ {1, 2, ..., K} for i = 1, 2, ..., n. Here we assume that the µ_k's are all distinct and that n_k is the number of observations in cluster k.
Then the log-density for an observation X from this model can be written as

log f(X) = Σ_{k=1}^{K} I(X ∈ C_k) log φ(X; µ_k, Σ_k),   (3)

where C_k is the sub-population indexed by the N_p(µ_k, Σ_k) density, I(X ∈ C_k) is an indicator function specifying whether observation X belongs to the k-th group with mean µ_k and covariance matrix Σ_k, and φ(X; µ_k, Σ_k) ∝ |Σ_k|^{−1/2} exp(−½ (X − µ_k)′ Σ_k^{−1} (X − µ_k)) is the p-dimensional multivariate normal density, k = 1, ..., K. When Σ_k = I_p, maximizing the log-likelihood with respect to the parameters ζ_i, i = 1, 2, ..., n, and the µ_k's is equivalent to solving (1) with D_k(X_i) as the squared Euclidean distance. We use the above formalism to develop, in theoretical terms, a separation index that quantifies the separation between any two clusters and relates it to the difficulty in recovering the true partition C_1, C_2, ..., C_K of a dataset. We begin by defining a preliminary index.

2.1 A preliminary separation index between two clusters

Consider the case with K = 2 groups, labeled C_j and C_l (j, l ∈ {1, 2}, j ≠ l). Define

Y_{j,l}(X) = D_j(X) − D_l(X), where X ∈ C_l,   (4)

and Y_{l,j}(X) similarly. Using the modeling formulation in (3) above, Y_{j,l}(X) is a random variable representing the difference between the squared distances of X ∈ C_l to the center of C_j and to the center of C_l. For distance-based classification methods, Pr[Y_{j,l}(X) < 0] is the probability that an observation is classified into C_j when in fact it is from C_l. Intuitively, one expects that since the X_i's with ζ_i = l belong to C_l, most of these n_l observations will be closer to the mean of C_l than to the mean of C_j. Based on this observation, we define the index in terms of the probability that an α-fraction of the observations are closer to the incorrect cluster center. In other words, we find the probability that an order statistic (say, the ⌊n_l α⌋-th, for α ∈ (0, 1)) of {Y_{j,l}(X_i) : ζ_i = l, i = 1, ..., n} is less than 0. (Here ⌊x⌋ is the greatest integer smaller than x.) We denote this probability by p_{j,l}. To simplify notation, we will assume that n_l α is an integer. Therefore,

p_{j,l} = Σ_{i = n_l α}^{n_l} (n_l choose i) Pr[Y_{j,l}(X) < 0]^i Pr[Y_{j,l}(X) > 0]^{n_l − i},   (5)

and p_{l,j} is defined similarly. Since both of these probabilities can be extremely small, we raise their average to an inverse power of a function of n_j and n_l. We define the pairwise index as

I^{j,l} = 1 − ( (p_{l,j} + p_{j,l}) / 2 )^{1/n_{j,l}},   (6)

where n_{j,l} = n_{l,j} = √((n_l² + n_j²)/2). Note that our preliminary index incorporates class sizes, which is desirable since misclassification rates are affected by the relative sizes of the groups. Also, when n_j = n_l = n, then n_{j,l} = n. (At this stage we call this index preliminary, because we will modify it with suitable adjustments obtained through simulations in Section 3.) We use the form of the index in (6) to provide an interpretable measure: the index I^{j,l} takes values in [0, 1], with a value close to unity indicating a well-separated dataset and values closer to zero indicating that the two groups have substantial overlap.
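As a quick numerical sketch (ours, not from the paper's software), the index in (6) can be computed directly from the misclassification probabilities q_{j,l} = Pr[Y_{j,l}(X) < 0] and q_{l,j} = Pr[Y_{l,j}(X) < 0] once those are known; the function name and interface here are illustrative assumptions.

```python
import math

def pairwise_index(q_jl, q_lj, n_j, n_l, alpha=0.75):
    """Preliminary separation index I^{j,l} of (6).

    q_jl = Pr[Y_{j,l}(X) < 0] for X from cluster l (q_lj analogously).
    p_{j,l} of (5) is the binomial tail probability that at least
    floor(n_l * alpha) of the n_l observations from cluster l are
    closer to the wrong center.
    """
    def p_tail(q, n):
        m = math.floor(n * alpha)
        return sum(math.comb(n, i) * q**i * (1.0 - q)**(n - i)
                   for i in range(m, n + 1))

    p_jl = p_tail(q_jl, n_l)   # observations from cluster l
    p_lj = p_tail(q_lj, n_j)   # observations from cluster j
    n_pair = math.sqrt((n_j**2 + n_l**2) / 2.0)  # n_{j,l}: root-mean-square size
    return 1.0 - ((p_jl + p_lj) / 2.0) ** (1.0 / n_pair)
```

Small misclassification probabilities push the index toward 1, while q near 1/2 (heavily overlapping clusters) pulls it toward 0, as the discussion above suggests.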
The calculation of I^{j,l} requires knowledge of the distribution of the random variable Y_{j,l}(X) (and also of Y_{l,j}(X)), which depends on the choice of the distance function in (2). The following result summarizes the distribution of Y_{j,l}(X) for the different cases. (Note that here and for the remainder of this paper, the notation X =d Y means that X has the same distribution as Y.)

Theorem 1 For any l, j ∈ {1, ..., K}, let Y_{j,l}(X) be as defined in (4), for distance metrics D_j and D_l as defined in (2). Further assume X ∼ N_p(µ_l, Σ_l). Then for specific choices of ∆_j, ∆_l in (2), the following hold:

(a) For the Euclidean distance (∆_j = ∆_l = I),

Y_{j,l}(X) ∼ N( (µ_l − µ_j)′(µ_l − µ_j), 4 (µ_l − µ_j)′ Σ_l (µ_l − µ_j) ).   (7)

(b) For the Mahalanobis distance (∆_j = Σ_j^{−1}, ∆_l = Σ_l^{−1}), when both clusters have identical covariance structures, i.e., Σ_l = Σ_j ≡ Σ, then

Y_{j,l}(X) ∼ N( (µ_l − µ_j)′ Σ^{−1} (µ_l − µ_j), 4 (µ_l − µ_j)′ Σ^{−1} (µ_l − µ_j) ).   (8)

(c) For the Mahalanobis distance (∆_j = Σ_j^{−1}, ∆_l = Σ_l^{−1}), when the two clusters do not have identical covariance structures (i.e., Σ_l ≠ Σ_j), let λ_1, λ_2, ..., λ_p be the eigenvalues of Σ̃_j = Σ_l^{1/2} Σ_j^{−1} Σ_l^{1/2}, with corresponding eigenvectors γ_1, γ_2, ..., γ_p. Then

Y_{j,l}(X) =d Σ_{i: λ_i ≠ 1} [ (λ_i − 1) U_i − λ_i δ_i² / (λ_i − 1) ] + Σ_{i: λ_i = 1} λ_i δ_i (2 W_i + δ_i),   (9)

where the U_i's are independent non-central χ² random variables with one degree of freedom and non-centrality parameter λ_i² δ_i² / (λ_i − 1)², with δ_i = γ_i′ Σ_l^{−1/2} (µ_l − µ_j) for i = 1, 2, ..., p, independent of the W_i's, which are independent N(0, 1) random variables for i ∈ {i : λ_i = 1}.

Proof For any p-variate vector X,

D_j(X) − D_l(X) = (X − µ_j)′ ∆_j (X − µ_j) − (X − µ_l)′ ∆_l (X − µ_l)
 = X′(∆_j − ∆_l)X + 2X′(∆_l µ_l − ∆_j µ_j) + (µ_j′ ∆_j µ_j − µ_l′ ∆_l µ_l).   (10)

Therefore, when ∆_j = ∆_l = I, Y_{j,l}(X) = D_j(X) − D_l(X) = 2X′(µ_l − µ_j) + µ_j′µ_j − µ_l′µ_l, where X ∼ N_p(µ_l, Σ_l). Simple algebra completes the proof of part (a). A similar calculation using (10), when ∆_j = ∆_l = Σ^{−1}, shows that Y_{j,l}(X) = 2X′Σ^{−1}(µ_l − µ_j) + µ_j′Σ^{−1}µ_j − µ_l′Σ^{−1}µ_l, where X ∼ N_p(µ_l, Σ). This completes the proof of part (b) using arguments similar to those above. For part (c), let ξ ∼ N_p(0, I). Since X =d Σ_l^{1/2} ξ + µ_l, using (10) we have

Y_{j,l}(X) = X′(Σ_j^{−1} − Σ_l^{−1})X + 2X′(Σ_l^{−1}µ_l − Σ_j^{−1}µ_j) + µ_j′Σ_j^{−1}µ_j − µ_l′Σ_l^{−1}µ_l
 =d (Σ_l^{1/2}ξ + µ_l)′(Σ_j^{−1} − Σ_l^{−1})(Σ_l^{1/2}ξ + µ_l) + 2(Σ_l^{1/2}ξ + µ_l)′(Σ_l^{−1}µ_l − Σ_j^{−1}µ_j) + µ_j′Σ_j^{−1}µ_j − µ_l′Σ_l^{−1}µ_l
 = ξ′(Σ̃_j − I)ξ + 2ξ′ Σ_l^{1/2} Σ_j^{−1} (µ_l − µ_j) + (µ_l − µ_j)′ Σ_j^{−1} (µ_l − µ_j),   (11)

where Σ̃_j = Σ_l^{1/2} Σ_j^{−1} Σ_l^{1/2}. Let the spectral decomposition of Σ̃_j be Σ̃_j = Γ_j Λ_j Γ_j′, where Λ_j is a diagonal matrix containing the eigenvalues λ_1, λ_2, ..., λ_p of Σ̃_j, and Γ_j is an orthogonal matrix containing the eigenvectors γ_1, γ_2, ..., γ_p of Σ̃_j. Since Z ≡ Γ_j′ξ ∼ N_p(0, I) as well, we get from (11) that

Y_{j,l}(X) =d Σ_{i=1}^{p} [ (λ_i − 1) Z_i² + 2 λ_i δ_i Z_i + λ_i δ_i² ],   (12)

where δ_i = γ_i′ Σ_l^{−1/2} (µ_l − µ_j), i = 1, ..., p, are as in the statement of the theorem. Depending on the values of λ_i, one can simplify the terms in (12). If λ_i > 1: (λ_i − 1)Z_i² + 2λ_iδ_iZ_i + λ_iδ_i² = (√(λ_i − 1) Z_i + λ_iδ_i/√(λ_i − 1))² − λ_iδ_i²/(λ_i − 1), where the squared term is distributed as (λ_i − 1) times a non-central χ²_1 random variable with non-centrality parameter λ_i²δ_i²/(λ_i − 1)². If λ_i < 1: (λ_i − 1)Z_i² + 2λ_iδ_iZ_i + λ_iδ_i² = −(√(1 − λ_i) Z_i − λ_iδ_i/√(1 − λ_i))² − λ_iδ_i²/(λ_i − 1), where the squared term is distributed as (1 − λ_i) times a non-central χ²_1 random variable with the same non-centrality parameter λ_i²δ_i²/(λ_i − 1)². In the case λ_i = 1, (λ_i − 1)Z_i² + 2λ_iδ_iZ_i + λ_iδ_i² = λ_iδ_i(2Z_i + δ_i). Incorporating these expressions and rearranging terms in (12), we get (9). This completes the proof of the theorem.

Remark 2 Theorem 1 enables us to compute Pr[Y_{j,l}(X) < 0] for the different choices of the distance measure in (2). For the first two cases (parts (a) and (b) of the theorem), this involves calculating a Gaussian probability. For the third case (part (c)), it involves calculating the probability of a linear combination of independent non-central χ² and Gaussian random variables, for which we use Algorithm AS 155 of Davies (1980). Once Pr[Y_{j,l}(X) < 0] and Pr[Y_{l,j}(X) < 0] are computed, the index can be calculated from (6) using p_{j,l} and p_{l,j} computed from (5).

2.1.1 Properties of our Preliminary Index

In this section, we highlight some properties of our preliminary index with regard to achievable values and scaling, which we will use in achieving our second objective of simulating random configurations satisfying a desired value of our index.
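As a concrete illustration of Remark 2, the sketch below computes Pr[Y_{j,l}(X) < 0] in two of the cases: a closed form for part (a) under spherical covariance, and plain Monte Carlo on representation (12) as a stand-in for Davies' Algorithm AS 155 in part (c). Function names and the Monte Carlo substitution are our own assumptions, not the paper's software.

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_y_neg_euclidean(mu_l, mu_j, sigma2):
    """Theorem 1(a) with Sigma_l = sigma2 * I:
    Y ~ N(||mu_l - mu_j||^2, 4 * sigma2 * ||mu_l - mu_j||^2),
    so Pr[Y < 0] = Phi(-||mu_l - mu_j|| / (2 * sigma))."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_l, mu_j)))
    return phi(-d / (2.0 * math.sqrt(sigma2)))

def prob_y_neg_general(lambdas, deltas, n_draws=100_000, seed=7):
    """Theorem 1(c): Monte Carlo on representation (12),
    Y = sum_i [(lambda_i - 1) Z_i^2 + 2 lambda_i delta_i Z_i + lambda_i delta_i^2],
    with Z_i iid N(0,1); a simulation-based alternative to AS 155."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_draws):
        y = 0.0
        for lam, d in zip(lambdas, deltas):
            z = rng.gauss(0.0, 1.0)
            y += (lam - 1.0) * z * z + 2.0 * lam * d * z + lam * d * d
        if y < 0.0:
            count += 1
    return count / n_draws
```

When all λ_i = 1, (12) reduces to a Gaussian, so the two routines can be cross-checked against each other.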
We start with

Theorem 3 Fix c > 0 and α ∈ (0, 1), and consider two clusters C_l and C_j, where each C_i consists of n_i p-variate Gaussian observations with mean cµ_i and covariance Σ_i, i ∈ {j, l}. Then for specific choices of the distance metric, the following properties hold for I^{j,l}:

(a) For the Euclidean distance (∆_j = ∆_l = I), define θ_i ≡ Φ(√n_i (1 − 2α)), i ∈ {j, l}. Then for large n_j, n_l,

lim_{c→0} I^{j,l} = 1 − ((θ_j + θ_l)/2)^{1/n_{j,l}} ≡ I_0^{j,l}  and  lim_{c→∞} I^{j,l} = 1.   (13)

(b) For the special case of the Mahalanobis distance with identical covariance structure (i.e., ∆_j = ∆_l = Σ^{−1}), define θ_j, θ_l as in part (a). Then for large n_j, n_l,

lim_{c→0} I^{j,l} = 1 − ((θ_j + θ_l)/2)^{1/n_{j,l}} ≡ I_0^{j,l}  and  lim_{c→∞} I^{j,l} = 1.   (14)

(c) For the general Mahalanobis distance (i.e., ∆_j = Σ_j^{−1}, ∆_l = Σ_l^{−1} with Σ_l ≠ Σ_j), let λ_1, λ_2, ..., λ_p be the eigenvalues of Σ_l^{1/2} Σ_j^{−1} Σ_l^{1/2}, and let Z_1, ..., Z_p be independent N(0, 1) random variables. Define, for i ∈ {j, l}, θ_i ≡ Φ( √n_i (κ − α) / √(κ(1 − κ)) ), where κ = Pr[ Σ_{i=1}^{p} (λ_i − 1) Z_i² < 0 ]. Then for large n_j, n_l,

lim_{c→0} I^{j,l} = 1 − ((θ_j + θ_l)/2)^{1/n_{j,l}} ≡ I_0^{j,l}  and  lim_{c→∞} I^{j,l} = 1.   (15)

Proof First note that, for large n_l, the normal approximation to the binomial probabilities gives

p_{j,l} = Σ_{i = n_l α}^{n_l} (n_l choose i) Pr[Y_{j,l}(X) < 0]^i Pr[Y_{j,l}(X) > 0]^{n_l − i} ≈ Φ( √n_l (Pr[Y_{j,l}(X) < 0] − α) / √(Pr[Y_{j,l}(X) < 0] Pr[Y_{j,l}(X) > 0]) ).   (16)

For part (a), note from Theorem 1(a) that Y_{j,l}(X) ∼ N( c²(µ_l − µ_j)′(µ_l − µ_j), 4c²(µ_l − µ_j)′Σ_l(µ_l − µ_j) ). Therefore,

Pr[Y_{j,l}(X) < 0] = Φ( −c (µ_l − µ_j)′(µ_l − µ_j) / (2 √((µ_l − µ_j)′ Σ_l (µ_l − µ_j))) ).   (17)

Taking limits on both sides of (17) and using the continuity of Φ(·), we get

lim_{c→0} Pr[Y_{j,l}(X) < 0] = Φ(0) = 0.5  and  lim_{c→∞} Pr[Y_{j,l}(X) < 0] = 0.   (18)

This, together with (16) and the definition of I^{j,l} in (6), completes the proof of part (a). The proof of part (b) is very similar to that of part (a) and is omitted. For part (c), note from Theorem 1(c) that

Y_{j,l}(X) =d Σ_{i=1}^{p} [ (λ_i − 1) Z_i² + 2 λ_i δ_i c Z_i + λ_i δ_i² c² ],   (19)

where δ_i, i = 1, ..., p, are as defined there (using the fact that the means are cµ_j and cµ_l here). Then, by continuity arguments, it follows that

lim_{c→0} Pr[Y_{j,l}(X) < 0] = lim_{c→0} Pr[ Σ_{i=1}^{p} ((λ_i − 1) Z_i² + 2 λ_i δ_i c Z_i + λ_i δ_i² c²) < 0 ] = Pr[ Σ_{i=1}^{p} (λ_i − 1) Z_i² < 0 ] = κ.   (20)

Note that the right side of (19) is a quadratic in c with leading coefficient Σ_{i=1}^{p} λ_i δ_i² > 0. Hence, for large values of c (in particular, when c is larger than the largest root of the quadratic), the right side of (19) is positive. Taking limits on both sides of (19) and using continuity arguments, we get

lim_{c→∞} Pr[Y_{j,l}(X) < 0] = 0.   (21)

This completes the proof of Theorem 3.

Note that in the setting of Theorem 3, I^{j,l}, when viewed as a function of the parameter c > 0, is a continuous function. This is summarized in the following corollary, which follows immediately from the proof of Theorem 3.

Corollary 4 Fix α ∈ (0, 1). For any c > 0, let I^{j,l}(c) = I^{j,l} be our preliminary index of separation between two groups C_l and C_j, each having n_i p-variate Gaussian observations with mean cµ_i and covariance Σ_i, i ∈ {j, l}. For the different choices of the distance metric, let I_0^{j,l} be as defined in the three parts of Theorem 3. Then I^{j,l}(c) is a continuous function of c, with its range containing (I_0^{j,l}, 1).

From Theorem 1 (see Remark 2), one can compute the value of the preliminary index given values for the cluster means and dispersions. In real datasets, we can apply some clustering method to obtain the cluster memberships, estimate the cluster means and variance structures, and compute an estimated version of the index; see Section 4 for one such application. The above corollary also enables us to generate datasets having specific values of the index.

2.2 Generating Grouped Data According to Our Index

From Corollary 4, it is clear that any value in (I_0^{j,l}, 1) is a possible value of I^{j,l} for a given set of model parameters (cµ, Σ, n_1, n_2), obtained by suitably choosing the scaling parameter c > 0. This leads to the data-generation algorithm described in Appendix A.
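A minimal sketch of this idea, for two equal-sized spherical Gaussian clusters with Euclidean distance (our own illustration; the algorithm in Appendix A is more general): since I^{j,l}(c) is continuous and increasing in c here, simple bisection recovers the scaling that attains a target value.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def index_for_scale(c, mu_j, mu_l, sigma2, n, alpha=0.75):
    """Preliminary index I^{j,l} for means c*mu_j, c*mu_l, spherical
    covariance sigma2 * I (Euclidean distance, equal cluster sizes n)."""
    d = c * math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_j, mu_l)))
    q = phi(-d / (2.0 * math.sqrt(sigma2)))      # Pr[Y_{j,l}(X) < 0], Theorem 1(a)
    m = math.floor(n * alpha)
    p = sum(math.comb(n, i) * q**i * (1.0 - q)**(n - i) for i in range(m, n + 1))
    return 1.0 - p ** (1.0 / n)                   # n_{j,l} = n when n_j = n_l = n

def find_scale(target, mu_j, mu_l, sigma2, n, lo=1e-6, hi=50.0, tol=1e-8):
    """Bisection on c: the index is continuous in c (Corollary 4) and
    increasing here, so the target scaling is bracketed by [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if index_for_scale(mid, mu_j, mu_l, sigma2, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, `find_scale(0.8, [0, 0], [1, 0], 1.0, 100)` returns the multiplier c for which two 100-point clusters at cµ_1, cµ_2 have preliminary index 0.8.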
The main idea is that, for a given target value of the index, we start with some initial set of parameters p, (µ, Σ, n_1, n_2), and compute our index for this initial configuration. We then find c > 0 iteratively so that the data with parameters (cµ, Σ, n_1, n_2) attain the target value. The algorithm described in Appendix A gives a more general version of this for multiple clusters, as well as for the adjusted version of the index discussed in Section 3.

3. Illustrations and Adjustments

In this section, we provide some illustrative examples obtained by simulating realizations from groups under different scenarios, using the preliminary index and the algorithm in Appendix A. For each simulated dataset, we relate the index to the Adjusted Rand index (R) of Hubert and Arabie (1985), obtained by clustering the observations in each dataset and comparing the result with the true classification. By design, R takes a value of zero on average if all observations are assigned to groups completely at random. A perfect grouping of the observations matching the truth, on the other hand, yields R its highest value of unity. Our simulation results indicate some shortcomings in the preliminary index, so we study and propose a series of adjustments to make it a better measure of clustering complexity.

3.1 Homogeneous Spherical Clusters

In this section, we assume that all groups have the same spherical dispersion structure, i.e., Σ_k = σ²I for all k = 1, 2, ..., K.

3.1.1 The Two-Groups Case

Figures 1a-y display simulated datasets for different values of our preliminary index I ≡ I^{1,2}, generated using the algorithm in Appendix A with α = 0.75. In each case, 100 observations were generated from each of two groups separated according to I. In the figures, color is used to distinguish the true grouping. For each simulated dataset, we also grouped the observations using the k-means algorithm (with K = 2), initialized with the true group means (in order to minimize the effects of poor initialization). Thus, for each dataset, the resulting classification, displayed via plotting character in the corresponding Figures 1a-y, represents the ideal grouping obtained under the best-case scenario for the k-means algorithm. Figure 1 demonstrates that as the value of I increases, the clusters become more separated. This is confirmed by the values of R. The classifications of the datasets in Figures 1a-e have the lowest R, between 0.61 and 0.74. Each subsequent row down has a range of R higher than that of the previous row. Thus, there is support for our measure of separation (I) vis-a-vis clustering difficulty. The 25 simulated datasets presented in Figures 1a-y give some evidence for I as a qualitative indication of clustering difficulty; however, a more comprehensive analysis is needed to use I as a quantitative measure. Thus, we conducted a simulation study to investigate the validity of our index as a quantitative surrogate for clustering complexity.
Specifically, we simulated 25 datasets, each having observations from two groups, at each of 15 evenly-spaced values of I (using α = 0.75) in (0, 1), for nine different combinations of the number of observations per cluster and dimension. We used the k-means algorithm initialized with the true cluster means, and computed R of the subsequent clustering relative to the truth. Figures 2a-c display the results for p = 10, 100 and 1000, respectively. For each figure, the x-axis represents I and the y-axis represents the obtained R. Color denotes the number of observations in each dataset for each setting. In each figure, the shaded region for each color denotes the spread between the first and third quartiles of R based on the 25 datasets. From the figures, we note that I tracks R very well, providing support for our index as a surrogate for clustering complexity. However, the relationship is not linear. There is also more variability in R when the number of observations is low. Further, the bands representing the spread between the first and third quartiles of R often do not overlap when we change dimensions. Thus, there is some inconsistency in the relationship between our preliminary index I and R across dimensions. Indeed, this inconsistency is most apparent when the number of observations is less than the number of dimensions. We are thus led to investigate possible adjustments to our preliminary index for n and p.
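For reference, the Adjusted Rand index R used throughout this section can be computed from two labelings as follows; this is a standard sketch of the Hubert and Arabie (1985) formula, not code from the paper.

```python
import math
from collections import Counter

def adjusted_rand(truth, pred):
    """Adjusted Rand index R of Hubert and Arabie (1985): agreement between
    two partitions, corrected for chance (0 on average for random labels,
    1 for a perfect match)."""
    n = len(truth)

    def pairs(m):
        return math.comb(m, 2)

    # Contingency table of joint label counts, and its marginals.
    cells = Counter(zip(truth, pred))
    sum_cells = sum(pairs(c) for c in cells.values())
    sum_a = sum(pairs(c) for c in Counter(truth).values())
    sum_b = sum(pairs(c) for c in Counter(pred).values())

    expected = sum_a * sum_b / pairs(n)     # chance-expected agreement
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_cells - expected) / (max_index - expected)
```

Note that R is invariant to relabeling of the clusters, which is why the k-means labels can be compared directly with the true classes.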

[Figure 1: Simulated datasets at five different values of I. Colors represent the true cluster class and symbols represent the cluster class assigned by the k-means algorithm, initialized at the true cluster centers. Above each set of plots is the value of I; below each individual plot is the corresponding R.
I = 0.6: (a) R = 0.74, (b) R = 0.72, (c) R = 0.64, (d) R = 0.70, (e) R = 0.61
I = 0.7: (f) R = 0.76, (g) R = 0.86, (h) R = 0.88, (i) R = 0.83, (j) R = 0.76
I = 0.8: (k) R = 0.94, (l) R = 0.96, (m) R = 0.94, (n) R = 0.86, (o) R = 0.85
I = 0.9: (p) R = 0.98, (q) R = 0.98, (r) R = —, (s) R = 1.0, (t) R = 0.96
I = 1.0: (u) R = 1.0, (v) R = 1.0, (w) R = 1.0, (x) R = 1.0, (y) R = 1.0]

[Figure 2: Plots for K = 2 clusters comparing I and R for α = 0.75: (a) p = 10, (b) p = 100, (c) p = 1000. The x-axis is I and the y-axis is R. The three colors designate the number of observations per cluster, with n_1 = n_2. The lower and upper bounds of the bands represent the first and third quartiles of R calculated on 25 datasets for several different I. R is calculated using the k-means algorithm initialized at the true cluster centers.]

An Initial Adjustment to I for Group Size and Dimension. To understand further the relationship between I and R, we simulated 25 datasets each from two homogeneous spherical groups for all combinations of (n_1, n_2, p), where n_1 ≤ n_2 (assumed without loss of generality) are the numbers of observations in the first and second groups, p is in the set {2, 4, 5, 10, 20, 50, 100, 200, 500, 1000}, and there were fourteen combinations of n_1 and n_2: (n_1, n_2) ∈ {(20, 20), (50, 50), (75, 75), (100, 100), (200, 200), (500, 500), (1000, 1000), (30, 100), (20, 50), (60, 75), (90, 100), (150, 250), (50, 600), (100, 1000)}. We simulated datasets according to ten values of I evenly spaced between 0 and 1, and for each of α ∈ {0.25, 0.5, 0.75}. This means that we had a total of 105,000 simulated two-cluster datasets. For each of these datasets, we again found R based on the k-means algorithm initialized with the true cluster means. Through extensive exploratory work, we investigated several relationships between I and R. Out of these, the following multiplicative relationship between I and R performed the best:

log((R + 1)/(1 − R)) ≈ exp(δ_α(n_1, n_2, p)) (log(1/(1 − I)))^{θ_α(n_1, n_2, p)},

where δ_α(n_1, n_2, p) = Σ_{1 ≤ i ≤ j ≤ k ≤ 4} ζ_{ω_i, ω_j, ω_k, α} log(ω_i) log(ω_j) log(ω_k), θ_α(n_1, n_2, p) = Σ_{1 ≤ i ≤ j ≤ 4} β_{ω_i, ω_j, α} log(ω_i) log(ω_j), and (ω_1, ω_2, ω_3, ω_4) = (n_1, n_2, p, e).   (22)

Then, for each of the three tuning parameters (α = 0.25, 0.50 and 0.75), we fit the linear model

log(log((R + 1)/(1 − R))) ≈ θ_α(n_1, n_2, p) log(log(1/(1 − I))) + δ_α(n_1, n_2, p),   (23)

where θ_α(n_1, n_2, p) and δ_α(n_1, n_2, p) are as defined in (22). This relationship has the property that a large value of I corresponds to R close to 1, while values of I close to zero correspond to Rs that are also close to zero. Table 2 in Appendix B shows that the general trend of fitted values either increases or decreases for each parameter estimate as α increases. The sum of the estimates of the four three-way interaction terms for the n's is close to zero for each α. This suggests that, when n_1 = n_2, the cubic term for the n's is not different from zero. Therefore, based on the fitted values in the left panel of Table 2, we define R_{I,α} as follows:

R_{I,α} = [ exp( exp(δ_α(n_1, n_2, p)) (log(1/(1 − I)))^{θ_α(n_1, n_2, p)} ) − 1 ] / [ exp( exp(δ_α(n_1, n_2, p)) (log(1/(1 − I)))^{θ_α(n_1, n_2, p)} ) + 1 ].   (24)

We call R_{I,α} our initial adjusted index. We now investigate its performance in indexing clustering complexity in a framework similar to that of Figures 1 and 2. Figure 3 illustrates two-cluster datasets simulated in a similar manner as in Figure 1, but using our initial adjusted index R_{I,0.75}. Similar is the case with Figure 4, which mimics the setup of Figure 2, with the only difference being that datasets are generated here using R_{I,0.75} (instead of I, as in Figure 2). In both cases, we note that the agreement between the numerical values of R and R_{I,0.75} is substantially improved. In particular, Figures 3a-y show that the range of actual values of R contains the value of R_{I,0.75}. Also, Figures 4a-c demonstrate that R_{I,0.75} tracks R very well, providing support for R_{I,α} as a surrogate of clustering complexity. This support is consistent across different numbers of observations and dimensions. (We note that while we report figures here only for α = 0.75, results using the values α = 0.25 and α = 0.5 that we have tried also support this relationship.)
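The transformation in (24) is straightforward to apply once θ_α and δ_α have been estimated. A small sketch follows, with theta and delta as placeholders for the fitted coefficients of Table 2 (which we do not reproduce here):

```python
import math

def adjusted_index(I, theta, delta):
    """Initial adjusted index R_{I,alpha} of (24): map the preliminary
    index I through the fitted relationship (22). theta and delta stand
    for theta_alpha(n1, n2, p) and delta_alpha(n1, n2, p); their values
    come from the fitted model, so the ones used here are hypothetical."""
    t = math.exp(delta) * (math.log(1.0 / (1.0 - I))) ** theta
    return (math.exp(t) - 1.0) / (math.exp(t) + 1.0)
```

By construction the output lies in [0, 1) for I ∈ [0, 1) and is increasing in I; in the degenerate case theta = 1, delta = 0, the map reduces to I/(2 − I).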
Note, however, that Figure 4 continues to indicate greater variability in the obtained R when there are fewer observations in a group. This is expected, because smaller sample sizes lead to results with higher variability.

3.1.2 The Case with Many Groups (K ≥ 2)

So far we have analyzed and developed our index for datasets with observations in two groups (i.e., K = 2). In this section, we extend it to the general case of K ≥ 2. Clearly, separation between different pairs of clusters impacts clustering difficulty. Indeed, there are \(\binom{K}{2}\) pairwise indices of the type discussed in Section 3.1.1. We investigated several possibilities for summarizing clustering difficulty based on these pairwise indices; however, they all possessed several drawbacks. For instance, the average pairwise separation was found to be artificially inflated when compared to R, while the minimum of the pairwise separability indices was overly influenced by the presence of two close groups. Therefore, we also investigated an adaptation of the summarized multiple Jaccard similarity index proposed by Maitra (2010). The Jaccard coefficient of similarity (Jaccard, 1901) measures similarity or overlap between two species or populations. This was extended by Maitra (2010) for summarizing many pairwise indices for the case of multiple populations. Note that both the Jaccard index and its summarized multiple version address similarity or overlap between populations, while our pairwise index measures separability. This needs to be adapted for our case. Therefore, we defined R^{ii}_{I,α} = 0 for i = 1, …, K. Further, for each

Figure 3: Simulated datasets at different R_{I,0.75}. [This page consists of a 5 × 5 grid of scatterplots, panels (a)-(y); rows correspond to R_{I,0.75} = 0.6, 0.7, 0.8, 0.9 and 1.0, and the k-means R printed below the panels ranges from 0.59 to 1.0.] The colors represent the true cluster class and the symbols represent the cluster class obtained using the k-means algorithm. Above each row is the value of R_{I,0.75}, and below each plot we give the R based on the k-means algorithm.

Figure 4: Plots for K = 2 clusters comparing R_{I,0.75} and R, for (a) p = 10, (b) p = 100, (c) p = 1000. The x-axis is R_{I,α} and the y-axis is R. The three colors designate the number of observations per cluster, with n_1 = n_2. The bands in the plots are created in a framework similar to Figure 2.

pair 1 ≤ i < j ≤ K of clusters, let R^{ij}_{I,α} = R^{ji}_{I,α} ≡ R_{I,α}, i.e., the adjusted index defined using clusters C_i and C_j. Also, let Υ = ((R^{ij}_{I,α}))_{1 ≤ i,j ≤ K} be the matrix of corresponding R^{ij}_{I,α}s. Then we define the following summarized index for K clusters:

\[
R_{I,\alpha} = 1 - \frac{\lambda_{(1)}(J_K - \Upsilon) - 1}{K - 1},   \tag{25}
\]

where J_K is a K × K matrix with all entries equal to unity, and λ_{(1)}(J_K − Υ) is the largest eigenvalue of the matrix (J_K − Υ). Maitra (2010) motivates his summary using principal components analysis (PCA) in the context of a correlation matrix, where the first principal component is that orthogonal projection of the dataset that captures the greatest variability in the K coordinates. Like his summary, our summary index (25) has some very appealing properties. When the matrix of pairwise separation indices is J_K − I_K, i.e., R^{ij}_{I,α} = 1 for all i ≠ j, then the first (largest) eigenvalue captures only a 1/K proportion of the total of the eigenvalues. In this case, R_{I,α} = 1. On the other hand, when every element is zero, there is perfect overlap between all groups, and only the largest eigenvalue is non-zero. In this case R_{I,α} = 0. Finally, when K = 2, R_{I,α} is consistent with the (sole) pairwise index R^{ij}_{I,α}.

Final Adjustments to the Separation Index. While the summarized index in (25) provides a summarized pairwise measure that is appealing in some special cases, we still need to investigate its performance as a surrogate measure of clustering complexity. We therefore performed a detailed simulation study of the relationship between the summarized R_{I,α} and R.
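The eigenvalue summary in (25) is straightforward to compute; the following sketch (with illustrative pairwise matrices, not values from the paper) also verifies the three boundary properties just noted:

```python
import numpy as np

def summarized_index(Upsilon):
    """Summarize a K x K symmetric matrix of pairwise adjusted indices
    (zero diagonal) via the largest eigenvalue of J_K - Upsilon, as in (25)."""
    K = Upsilon.shape[0]
    J = np.ones((K, K))
    lam_max = np.linalg.eigvalsh(J - Upsilon)[-1]  # eigvalsh sorts ascending
    return 1.0 - (lam_max - 1.0) / (K - 1.0)

K = 4
sep = np.ones((K, K)) - np.eye(K)  # perfect separation: all pairwise indices 1
ovl = np.zeros((K, K))             # perfect overlap: all pairwise indices 0
```

With these inputs, `summarized_index(sep)` is 1, `summarized_index(ovl)` is 0, and for K = 2 the summary reduces to the single pairwise index, matching the properties claimed for (25).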
Specifically, we generated 25 K-cluster datasets, each with 100 dimensions, at each of 10 different values of R_{I,0.75} between 0 and 1, for each of 9 different combinations of the number of observations per cluster and the number of clusters: K = 3, 5 and 10 clusters, with equal numbers of observations in each group (i.e., n_k ≡ n_0), where n_0 ∈ {20, 100, 1000}. For each dataset, we used the k-means algorithm initialized at the true cluster centers, and computed the R of the resulting clustering.
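The evaluation loop just described (k-means started at the true means, scored by the adjusted Rand index R) can be sketched with plain numpy; the cluster means, spread, and sample sizes below are illustrative choices, not the paper's simulation design:

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Plain Lloyd's algorithm, initialized at the supplied centers
    (the paper initializes k-means at the true cluster means)."""
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.stack([X[labels == k].mean(0) for k in range(len(centers))])
    return labels

def adjusted_rand(a, b):
    """Adjusted Rand index R between two integer label vectors."""
    ct = np.zeros((a.max() + 1, b.max() + 1))      # contingency table
    np.add.at(ct, (a, b), 1)
    comb2 = lambda x: x * (x - 1) / 2.0            # "choose 2", elementwise
    s = comb2(ct).sum()
    sa, sb = comb2(ct.sum(1)).sum(), comb2(ct.sum(0)).sum()
    expected = sa * sb / comb2(ct.sum())
    return (s - expected) / (0.5 * (sa + sb) - expected)

# three well-separated spherical Gaussian clusters (illustrative configuration)
rng = np.random.default_rng(1)
K, n0 = 3, 100
means = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([rng.normal(m, 1.0, size=(n0, 2)) for m in means])
truth = np.repeat(np.arange(K), n0)

labels = kmeans(X, means.copy())
R = adjusted_rand(truth, labels)
```

For well-separated groups like these, R is close to 1; lowering the separation between the means drives R toward 0, which is exactly the behavior the index I is calibrated to track.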

Figure 5: Plots for various numbers of clusters comparing R_{I,0.75} and R, for (a) K = 3, (b) K = 5, (c) K = 10. The x-axis is R_{I,α} and the y-axis is R. The three colors designate the number of observations n_1 = n_2 = n for n ∈ {20, 100, 1000}. The bands in the plots are created in a framework similar to Figure 2.

Figure 6: Plots comparing I and R for the case where α = 0.75, for (a) K = 3, (b) K = 5, (c) K = 10. The x-axis is I and the y-axis is R. The three colors designate the number of observations n_1 = n_2 = n for n ∈ {20, 100, 1000}. The bands in the plots are created in a framework similar to Figure 2.

Figures 5a-c plot the interquartile ranges of R for each combination of K and n_0, with R_{I,α} on the x-axis and R on the y-axis. Figures 5a-c demonstrate that R_{I,α} tracks R. In addition, the relationship between R_{I,α} and R is consistent for different numbers of observations. However, it is also clear that the exact relationship between R_{I,α} and R depends on K, so some more adjustments may be called for. To study this issue further, we simulated 25 datasets for each combination of ten different dimensions p in the set {2, 4, 5, 10, 20, 50, 100, 200, 500, 1000}, seven different combinations of numbers of observations n_i = n_j ≡ n_0 for all i, j, where n_0 is in the set {20, 50, 75, 100, 200, 500, 1000}, and ten

different values of R_{I,α} evenly spaced between 0 and 1, three different tuning parameters α = 0.25, 0.5, 0.75, and four different numbers of clusters per dataset, K = 3, 5, 7 and 10. For each dataset we calculated R. Using the R from each of these datasets, and for each combination of (p, α, K), we fit the multiplicative model:

\[
R \approx R_{I,\alpha}^{\beta_{k,p}}.   \tag{26}
\]

Using the fitted values of β_{k,p} for each combination of dimension and number of clusters, we then fit the following simple linear model separately for each tuning parameter:

\[
\beta_{k,p} = \eta_0 + \eta_1 k + \eta_2 k^2 + \eta_3 p + \eta_4 p^2 + \eta_5 kp.   \tag{27}
\]

The fitted values are presented in the left panel of Table 3 in Appendix B. Thus the final version of our index, after all adjustments, for the case of homogeneous spherical clusters is

\[
I = R_{I,\alpha}^{\beta_{k,p}}.   \tag{28}
\]

Figures 6a-c are constructed similarly to Figures 5a-c, except that the datasets are now generated using the algorithm in Appendix A with I instead of R_{I,0.75}. Correspondingly, the x-axis on these figures now represents I. Note that the effect of K has been largely addressed and the relationship between I and R is largely consistent across dimensions.

Illustrations. We now present, in Figures 7a-y, some simulated datasets to demonstrate possible configurations obtained using the algorithm in Appendix A for five different values of I. In each case we used α = 0.75 and K = 4. Each group was assumed to have 100 observations, and the true classes are distinguished by color in the figures. The different symbols in the plots represent the classification obtained using the k-means algorithm with K = 4, initialized with the true cluster means. We note that the clusters become better separated as I increases. Clustering complexity also decreases, as confirmed by the computed values of R. The datasets in the first row have the lowest I. Each subsequent row down has a range of R higher than those in the previous row.
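The final adjustment (26)-(28) is a simple power transform of the summarized adjusted index; the sketch below uses placeholder η coefficients for illustration only, since the actual fitted values are those reported in Table 3:

```python
def beta_kp(K, p, eta):
    # Model (27): beta_{k,p} = eta0 + eta1*k + eta2*k^2 + eta3*p + eta4*p^2 + eta5*k*p
    e0, e1, e2, e3, e4, e5 = eta
    return e0 + e1 * K + e2 * K ** 2 + e3 * p + e4 * p ** 2 + e5 * K * p

def final_index(R_adj, K, p, eta):
    # Models (26)/(28): the final index raises the summarized adjusted index
    # R_{I,alpha} to the fitted power beta_{k,p}
    return R_adj ** beta_kp(K, p, eta)

# hypothetical coefficients (eta0..eta5) for illustration only; the paper's
# fitted values appear in the left panel of its Table 3
eta = (1.0, 0.02, 0.0, 0.001, 0.0, 0.0)
```

Since 0 < R_{I,α} < 1, any positive exponent β_{k,p} keeps the final index in (0, 1); exponents above 1 shrink the index, which is how the K- and p-dependence seen in Figure 5 is absorbed.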
In each row we see that I is within the range of the actual R index values. This provides further evidence that the transformed version of our index, in the form of I, is similar to R. We conclude this section with a few comments. Note that I is strictly between 0 and 1. Further, let us denote by I(c) the value of I that is defined using (23), (26) and (28), but with I_{i,j} replaced by I^{(c)}_{i,j} as in Corollary 4, for the cluster C_i with n_i p-variate Gaussian observations with mean cμ_i and covariance Σ_i, i ∈ {1, …, K}. The following corollary implies that we can generate datasets with any value of the final index between I(0) and unity using the algorithm in Appendix A.

Corollary 5 Fix α ∈ {0.25, 0.5, 0.75} and let c > 0. Then, for positive θ_α(n_1, n_2, p) and β_{k,p}, I(c) is a continuous function of c whose range contains (I(0), 1). In addition, I(c) is an increasing function of c.

Proof The result follows directly from Theorem 3 and Corollary 4.

Note that we found the adjustments in (23) and (28) empirically through simulations. In all of the cases we considered, θ_α(n_1, n_2, p) and β_{k,p} are positive, and thus Corollary 5 holds. If we desire adjustments for n_1, n_2, K or p that are very different from the cases considered in our simulations, further simulations may need to be conducted; in that case, other fitted estimates, as in Appendix B, would need to be found.

Figure 7: Four-component simulated datasets at different I. [This page consists of a 5 × 5 grid of scatterplots, panels (a)-(y); rows correspond to I = 0.6, 0.7, 0.8, 0.9 and 1.0, and the k-means R printed below the panels ranges from 0.54 to 1.0.] Above each row of plots is the value of I. The colors represent the true cluster class and the symbols represent the cluster class obtained using the k-means algorithm. The R using the k-means algorithm initialized at the true cluster centers is displayed below each plot.

3.2 Nonhomogeneous Ellipsoidal Clusters

In this section, we assume that Σ_k is any general nonnegative definite covariance matrix for each cluster k (k = 1, …, K).

3.2.1 The Two-Groups Case

Figures 8a-y display simulated datasets for different values of our preliminary index I, using the algorithm in Appendix A and α = 0.75. In each case, 100 observations were generated from each of two groups separated according to I. In these figures, color distinguishes the true grouping. For each simulated dataset, we also grouped the observations using hierarchical clustering with Ward's criterion (Ward, 1963), cutting the tree hierarchy at K = 2. The resulting classification is displayed via plotting character in the corresponding Figures 8a-y. The groupings of the datasets in Figures 8a-e have the lowest R, between 0.59 and 0.77, but each subsequent row down has R values that are, on average, higher than those in previous rows. In general, therefore, Figures 8a-y demonstrate that as the value of I increases, the clusters become more separated. This coincides with an increase in the values of R. Thus lower values of I correspond to clustering problems of higher difficulty, while higher values of I are associated with higher values of R and hence with lower clustering complexity. We also investigated further the relationship between I and R in this case. Specifically, we simulated 25 datasets each from two nonhomogeneous populations, each with arbitrary ellipsoidal dispersion structures, for all possible combinations of (n_1, n_2, p, I), where n_1 ≤ n_2 (assumed w.l.o.g.) are the numbers of observations in the two groups. In our experiments, p ∈ {2, 4, 5, 10, 20, 50, 100, 200, 500, 1000}, while (n_1, n_2) took each of the fourteen combinations in {(20, 20), (50, 50), (75, 75), (100, 100), (200, 200), (350, 350), (500, 100), (30, 100), (20, 50), (60, 75), (90, 100), (150, 250), (50, 600), (100, 500)}.
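The grouping step described above can be sketched with SciPy's implementation of Ward's linkage; the two ellipsoidal Gaussian groups below use illustrative means and covariances rather than output of the Appendix A algorithm:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
n1, n2 = 100, 100
# two ellipsoidal (nonhomogeneous) Gaussian groups; the means and covariances
# are illustrative choices, not taken from the paper
cov1 = np.array([[4.0, 1.5], [1.5, 1.0]])
cov2 = np.array([[1.0, -0.5], [-0.5, 2.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov1, n1),
               rng.multivariate_normal([10.0, 10.0], cov2, n2)])

Z = linkage(X, method="ward")                     # Ward (1963) criterion
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at K = 2
```

The resulting `labels` can then be compared against the true grouping with an adjusted Rand index routine to obtain the R used throughout this section.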
We simulated 25 datasets according to each of ten values of I spaced evenly between 0 and 1, and for each α ∈ {0.25, 0.5, 0.75}. Each dataset was partitioned into two groups using hierarchical clustering with Ward (1963)'s linkage, and the resulting partition was evaluated against the true grouping of the simulated dataset in terms of R. We noticed nonlinearity in the general relationship between I and R, so as in Section 3.1.1 we explored a relationship between I and R along with the parameters p, n, and α. Similar to (22), (23), (26) and (27), we found appropriate adjustments and defined R_{I,α} for this case. (See the right panel of Table 2 in Appendix B for the fitted values.)

3.2.2 The Case with Many Groups (K ≥ 2)

Summarizing the index for K > 2 groups raises the same issues outlined in Section 3.1.2. We propose adapting Maitra (2010)'s summarized multiple Jaccard similarity index in the same manner as before, resulting in (25), noting that here R_{I,α} is calculated within the setting of nonhomogeneous ellipsoidal clusters. As in Section 3.1.2, we conducted an extensive simulation experiment to study the relationship between R_{I,α} and R. We simulated 25 datasets for each combination of p, the n_i s, α, R_{I,α} and K, where p ∈ {2, 4, 5, 10, 20, 50, 100, 200}, α ∈ {0.25, 0.5, 0.75}, n_i = n_j ≡ n_0 for all i, j with n_0 ∈ {20, 50, 75, 100, 200, 500}, K = 3, 5, 7, 10, and R_{I,α} evenly spaced over ten values in (0, 1). We partitioned each simulated dataset using hierarchical clustering with Ward (1963)'s linkage and calculated the R of the resulting partitioning relative to the truth. We used the results to adjust our index as

Figure 8: Simulated datasets at different I in the heterogeneous case. [This page consists of a 5 × 5 grid of scatterplots, panels (a)-(y); rows correspond to I = 0.6, 0.7, 0.8, 0.9 and 1.0, and the R printed below the panels ranges from 0.59 to 1.0.] The colors represent the true cluster class and the symbols represent the cluster class obtained using hierarchical clustering with Ward's linkage. Under each plot we give the R based on hierarchical clustering with Ward's linkage.

in (26) and (27). The final summary index I has a form similar to (28), with coefficient estimates provided in the right panel of Table 3 in Appendix B. Figure 9 presents results of simulation experiments in the same spirit as Figures 5 and 6, except that the datasets are now generated using the algorithm in Appendix A with the I obtained here. Note, as in Figure 6, that the effect of K has been largely addressed and the relationship between I and R is largely consistent across dimensions.

Figure 9: Plots comparing I and R for the heterogeneous case, for (a) K = 3, (b) K = 5, (c) K = 10. The x-axis is I and the y-axis is R. R is calculated using hierarchical clustering with Ward (1963)'s linkage. In all cases α = 0.75. The three colors designate the number of dimensions p ∈ {10, 50, 200}. The bands in the plots are created in a framework similar to Figure 2.

3.3 The Impact of α on the Relationship between I and R

Our discussion in this paper has been in the context of α = 0.75. Here, we study the relationship between I and R for α = 0.25 and 0.5. Using the same setup as in Figure 9 and for each α, we simulated 25 datasets at each of 15 evenly-spaced values of I ∈ (0, 1), for several combinations of p and K. For this set of experiments, n_k = 100 for all k. As before, we partitioned each simulated dataset using hierarchical clustering with Ward (1963)'s linkage, and calculated the R of the resulting partition relative to the truth. Figure 10 displays the relationship between I and R for α = 0.25 and α = 0.5. Together with the results for α = 0.75 presented in Figure 9, we see that the relationship between I and R is consistent. Thus, it appears that the relationship holds at least for interior values of α in (0, 1), regardless of its exact value, and that the tuning parameter has little effect on the clustering difficulty indexed by I. This also explains our exclusion of α from the notation I.
3.4 Illustrative Examples

We now provide some illustrations of the range of multi-clustered datasets that can be obtained using the algorithm in Appendix A for different values of the summarized index I. We first display realizations obtained in two dimensions, and then move on to higher dimensions.

Figure 10: Plots comparing I and R for the heterogeneous case, for α = 0.25 (panels (a)-(c): K = 3, 5, 10) and α = 0.5 (panels (d)-(f): K = 3, 5, 10). The x-axis is I and the y-axis is R. The three colors designate the number of dimensions p ∈ {10, 50, 200}. α is displayed above each plot. The bands in the plots are created in a framework similar to Figure 9.

Figures 11a-y mimic the setup of Figure 7 for nonhomogeneous ellipsoidal clusters. Here the observations are grouped using hierarchical clustering with Ward's criterion (Ward, 1963), cutting the tree hierarchy at K = 4. The groupings of the datasets in Figures 11a-e have the lowest R, between 0.57 and 0.67, but each subsequent row down has R values that are, on average, higher than those in previous rows. In general, therefore, Figures 11a-y demonstrate that as the value of I increases, the clusters become more separated. This coincides with an increase in the values of R. Thus lower values of I correspond to clustering problems of higher difficulty, while higher values of I are associated with higher values of R and hence with lower clustering complexity. Figures 11a-y demonstrate the various possible configurations obtained using the algorithm in Appendix A for five different values of I. Note that in some cases only two of the four groups are well-separated, while in other cases none of the groups are well-separated.


More information

The EM Algorithm applied to determining new limit points of Mahler measures

The EM Algorithm applied to determining new limit points of Mahler measures Contro and Cybernetics vo. 39 (2010) No. 4 The EM Agorithm appied to determining new imit points of Maher measures by Souad E Otmani, Georges Rhin and Jean-Marc Sac-Épée Université Pau Veraine-Metz, LMAM,

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

Symbolic models for nonlinear control systems using approximate bisimulation

Symbolic models for nonlinear control systems using approximate bisimulation Symboic modes for noninear contro systems using approximate bisimuation Giordano Poa, Antoine Girard and Pauo Tabuada Abstract Contro systems are usuay modeed by differentia equations describing how physica

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

<C 2 2. λ 2 l. λ 1 l 1 < C 1

<C 2 2. λ 2 l. λ 1 l 1 < C 1 Teecommunication Network Contro and Management (EE E694) Prof. A. A. Lazar Notes for the ecture of 7/Feb/95 by Huayan Wang (this document was ast LaT E X-ed on May 9,995) Queueing Primer for Muticass Optima

More information

6 Wave Equation on an Interval: Separation of Variables

6 Wave Equation on an Interval: Separation of Variables 6 Wave Equation on an Interva: Separation of Variabes 6.1 Dirichet Boundary Conditions Ref: Strauss, Chapter 4 We now use the separation of variabes technique to study the wave equation on a finite interva.

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Optimal Control of Assembly Systems with Multiple Stages and Multiple Demand Classes 1

Optimal Control of Assembly Systems with Multiple Stages and Multiple Demand Classes 1 Optima Contro of Assemby Systems with Mutipe Stages and Mutipe Demand Casses Saif Benjaafar Mohsen EHafsi 2 Chung-Yee Lee 3 Weihua Zhou 3 Industria & Systems Engineering, Department of Mechanica Engineering,

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased PHYS 110B - HW #1 Fa 2005, Soutions by David Pace Equations referenced as Eq. # are from Griffiths Probem statements are paraphrased [1.] Probem 6.8 from Griffiths A ong cyinder has radius R and a magnetization

More information

Generalized multigranulation rough sets and optimal granularity selection

Generalized multigranulation rough sets and optimal granularity selection Granu. Comput. DOI 10.1007/s41066-017-0042-9 ORIGINAL PAPER Generaized mutigranuation rough sets and optima granuarity seection Weihua Xu 1 Wentao Li 2 Xiantao Zhang 1 Received: 27 September 2016 / Accepted:

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Mat 1501 lecture notes, penultimate installment

Mat 1501 lecture notes, penultimate installment Mat 1501 ecture notes, penutimate instament 1. bounded variation: functions of a singe variabe optiona) I beieve that we wi not actuay use the materia in this section the point is mainy to motivate the

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

On the evaluation of saving-consumption plans

On the evaluation of saving-consumption plans On the evauation of saving-consumption pans Steven Vanduffe Jan Dhaene Marc Goovaerts Juy 13, 2004 Abstract Knowedge of the distribution function of the stochasticay compounded vaue of a series of future

More information

Statistical Inference, Econometric Analysis and Matrix Algebra

Statistical Inference, Econometric Analysis and Matrix Algebra Statistica Inference, Econometric Anaysis and Matrix Agebra Bernhard Schipp Water Krämer Editors Statistica Inference, Econometric Anaysis and Matrix Agebra Festschrift in Honour of Götz Trenker Physica-Verag

More information

17 Lecture 17: Recombination and Dark Matter Production

17 Lecture 17: Recombination and Dark Matter Production PYS 652: Astrophysics 88 17 Lecture 17: Recombination and Dark Matter Production New ideas pass through three periods: It can t be done. It probaby can be done, but it s not worth doing. I knew it was

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Tracking Control of Multiple Mobile Robots

Tracking Control of Multiple Mobile Robots Proceedings of the 2001 IEEE Internationa Conference on Robotics & Automation Seou, Korea May 21-26, 2001 Tracking Contro of Mutipe Mobie Robots A Case Study of Inter-Robot Coision-Free Probem Jurachart

More information

Smoothness equivalence properties of univariate subdivision schemes and their projection analogues

Smoothness equivalence properties of univariate subdivision schemes and their projection analogues Numerische Mathematik manuscript No. (wi be inserted by the editor) Smoothness equivaence properties of univariate subdivision schemes and their projection anaogues Phiipp Grohs TU Graz Institute of Geometry

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn Automobie Prices in Market Equiibrium Berry, Pakes and Levinsohn Empirica Anaysis of demand and suppy in a differentiated products market: equiibrium in the U.S. automobie market. Oigopoistic Differentiated

More information

NON PARAMETRIC STATISTICS OF DYNAMIC NETWORKS WITH DISTINGUISHABLE NODES

NON PARAMETRIC STATISTICS OF DYNAMIC NETWORKS WITH DISTINGUISHABLE NODES NON PARAMETRIC STATISTICS OF DYNAMIC NETWORKS WITH DISTINGUISHABLE NODES DANIEL FRAIMAN 1,2, NICOLAS FRAIMAN 3, AND RICARDO FRAIMAN 4 ABSTRACT. The study of random graphs and networks had an exposive deveopment

More information

International Journal of Mass Spectrometry

International Journal of Mass Spectrometry Internationa Journa of Mass Spectrometry 280 (2009) 179 183 Contents ists avaiabe at ScienceDirect Internationa Journa of Mass Spectrometry journa homepage: www.esevier.com/ocate/ijms Stark mixing by ion-rydberg

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/Q10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of using one or two eyes on the perception

More information

Support Vector Machine and Its Application to Regression and Classification

Support Vector Machine and Its Application to Regression and Classification BearWorks Institutiona Repository MSU Graduate Theses Spring 2017 Support Vector Machine and Its Appication to Regression and Cassification Xiaotong Hu As with any inteectua project, the content and views

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997 Stochastic utomata etworks (S) - Modeing and Evauation Pauo Fernandes rigitte Pateau 2 May 29, 997 Institut ationa Poytechnique de Grenobe { IPG Ecoe ationae Superieure d'informatique et de Mathematiques

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channes arxiv:cs/060700v1 [cs.it] 6 Ju 006 Chun-Hao Hsu and Achieas Anastasopouos Eectrica Engineering and Computer Science Department University

More information

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference

More information

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method High Spectra Resoution Infrared Radiance Modeing Using Optima Spectra Samping (OSS) Method J.-L. Moncet and G. Uymin Background Optima Spectra Samping (OSS) method is a fast and accurate monochromatic

More information

Lecture 6: Moderately Large Deflection Theory of Beams

Lecture 6: Moderately Large Deflection Theory of Beams Structura Mechanics 2.8 Lecture 6 Semester Yr Lecture 6: Moderatey Large Defection Theory of Beams 6.1 Genera Formuation Compare to the cassica theory of beams with infinitesima deformation, the moderatey

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

A Comparison Study of the Test for Right Censored and Grouped Data

A Comparison Study of the Test for Right Censored and Grouped Data Communications for Statistica Appications and Methods 2015, Vo. 22, No. 4, 313 320 DOI: http://dx.doi.org/10.5351/csam.2015.22.4.313 Print ISSN 2287-7843 / Onine ISSN 2383-4757 A Comparison Study of the

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

Coded Caching for Files with Distinct File Sizes

Coded Caching for Files with Distinct File Sizes Coded Caching for Fies with Distinct Fie Sizes Jinbei Zhang iaojun Lin Chih-Chun Wang inbing Wang Department of Eectronic Engineering Shanghai Jiao ong University China Schoo of Eectrica and Computer Engineering

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

FRIEZE GROUPS IN R 2

FRIEZE GROUPS IN R 2 FRIEZE GROUPS IN R 2 MAXWELL STOLARSKI Abstract. Focusing on the Eucidean pane under the Pythagorean Metric, our goa is to cassify the frieze groups, discrete subgroups of the set of isometries of the

More information

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR 1 Maximizing Sum Rate and Minimizing MSE on Mutiuser Downink: Optimaity, Fast Agorithms and Equivaence via Max-min SIR Chee Wei Tan 1,2, Mung Chiang 2 and R. Srikant 3 1 Caifornia Institute of Technoogy,

More information

Applied Nuclear Physics (Fall 2006) Lecture 7 (10/2/06) Overview of Cross Section Calculation

Applied Nuclear Physics (Fall 2006) Lecture 7 (10/2/06) Overview of Cross Section Calculation 22.101 Appied Nucear Physics (Fa 2006) Lecture 7 (10/2/06) Overview of Cross Section Cacuation References P. Roman, Advanced Quantum Theory (Addison-Wesey, Reading, 1965), Chap 3. A. Foderaro, The Eements

More information

Emmanuel Abbe Colin Sandon

Emmanuel Abbe Colin Sandon Detection in the stochastic bock mode with mutipe custers: proof of the achievabiity conjectures, acycic BP, and the information-computation gap Emmanue Abbe Coin Sandon Abstract In a paper that initiated

More information

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS TONY ALLEN, EMILY GEBHARDT, AND ADAM KLUBALL 3 ADVISOR: DR. TIFFANY KOLBA 4 Abstract. The phenomenon of noise-induced stabiization occurs

More information