A Separability Index for Distance-based Clustering and Classification Algorithms


IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. ?, NO. ?, JANUARY 201?

A Separability Index for Distance-based Clustering and Classification Algorithms

Arka P. Ghosh, Ranjan Maitra and Anna D. Peterson

Abstract: We propose a separability index that quantifies the degree of difficulty in a hard clustering problem under assumptions of a multivariate Gaussian distribution for each cluster. A preliminary index is first defined and several of its properties are explored both theoretically and numerically. Adjustments are then made to this index so that the final refinement is also interpretable in terms of the Adjusted Rand Index between a true grouping and its hypothetical idealized clustering, taken as a surrogate of clustering complexity. Our derived index is used to develop a data-simulation algorithm that generates samples according to a prescribed value of the index. This algorithm is particularly useful for systematically generating datasets with varying degrees of clustering difficulty, which can be used to evaluate the performance of different clustering algorithms. The index is also shown to be useful in providing a summary of the distinctiveness of classes in grouped datasets.

Index Terms: Separation index, multivariate Jaccard index, clustering complexity, exact-c-separation, radial visualization plots

1 INTRODUCTION

There is a large body of literature [1], [2], [3], [4], [5], [6], [7], [8], [9] dedicated to clustering observations in a dataset into homogeneous groups, but no method uniformly outperforms the others. Many algorithms perform well in some settings but not so well in others. Further, the settings where algorithms work well or poorly are very often not well understood. Thus there is a need for a systematic study of the performance of any clustering algorithm, and also for evaluating the effectiveness of new methods using the same objective criterion.
Many researchers evaluate the performance of a proposed clustering technique by comparing its performance on classification datasets like textures [10], wine [11], Iris [12], crabs [13], image [14], E. coli [15] or [16]'s dataset. While evaluating a clustering algorithm through its performance on select classification datasets is useful, it does not provide a systematic and comprehensive understanding of its strengths and weaknesses over many scenarios. Thus, there is a need for ways to simulate datasets of different clustering difficulty and to calibrate the performance of a clustering algorithm under different conditions. In order to do so, we need to index the clustering complexity of a dataset appropriately. There have been some attempts at generating clustered data of different clustering complexity in terms of separability indices. [17] proposed a much-used algorithm [18], [19], [20], [21], [22] that generates well-separated clusters from normal distributions over bounded ranges, with provisions for including scatter [23], non-informative dimensions, and outliers. However, [24] observed that both increasing the variance and adding outliers increase the degree of overlap in unpredictable and differing ways, and thus, this method is incapable of accurately generating indexed clustered data.

The authors are with the Department of Statistics, Iowa State University, Ames, IA, USA. © 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. Manuscript received June 21, 2010; revised October 21, 2010. First published xxxxxxxx x, xxxx, current version published yyyyyyyy y, yyyy. R. Maitra's and A. D. Peterson's research was partially supported by the National Science Foundation under Grant Nos. CAREER DMS and VIGRE, respectively. Digital Object Identifier
[25] proposed the OCLUS algorithm for generating clusters based on known (asymptotic) overlap by having the user provide a design matrix and an order matrix: the former indicates the (at most) triplets of clusters that are desired to be overlapping with each other, while the latter dictates the ordering of clusters in each dimension. Although the idea of using overlap in generating clustered data is appealing, the algorithm has constraints beyond the structure of the design matrix above: for instance, independence between dimensions is also required. A separation index between any two univariate Gaussian clusters was proposed by [26]. For higher dimensions, they also used the same index on the 1-D transformation obtained after optimally projecting the two multivariate normal clusters onto 1-D space. For multiple clusters, their overall separation index (also the basis for their cluster generation algorithm [27]) is the maximum of all $\binom{n}{2}$ pairwise indices, and is thus quite impervious to variations between other groups that are not in this maximizing pair. Additionally, characterizing separation between multi-dimensional clusters by means of the best 1-D projection loses substantial information: thus, resulting statements on cluster overlap can be very misleading. Finding the optimal 1-D projection is also computationally intensive and impractical for very high dimensions. [28], [29], [30] demonstrated the performance of their clustering algorithms using simulation datasets generated using the concept (or a variant) of c-separation between clusters proposed by [31], which defines two Gaussian distributions $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ in $\mathbb{R}^n$ as c-separated if $\|\mu_1 - \mu_2\| \ge c\sqrt{n \max(\lambda_{\max}(\Sigma_1), \lambda_{\max}(\Sigma_2))}$, where $\lambda_{\max}(\Sigma_i)$ is the largest eigenvalue of $\Sigma_i$. [32] formalized the concept to

exact-c-separation by requiring equality for at least one pair $(i, j)$ of clusters, and used it for generating datasets to calibrate some partitional initialization algorithms. He also pointed out some inherent shortcomings that originate from ignoring the relative orientations of the cluster dispersions. More recently, [33] proposed a method for generating Gaussian mixture distributions according to some summary measure of overlap between every component pair, defined as the unweighted sum of the probabilities of their individual misclassification rates. They also provided open-source C software (C-MixSim) and an R package (MixSim) for generating clusters corresponding to desired overlap characteristics. In contrast to many of the existing indices and simulation algorithms, their methodology does not impose any restriction on the parameters of the distributions, but it was derived entirely in the context of mixture models and model-based clustering algorithms. Thus, their methods are specifically geared toward soft clustering and model-based clustering scenarios. In this paper, we complement [33]'s scenario and derive a separation index (Section 2) for distance-based partitioning and hard clustering algorithms. Our index is motivated by the intuition that for any two well-separated groups, the majority of observations should be closer to their own center than to the other. We use Gaussian-distributed clusters and Euclidean and Mahalanobis distances to simplify our theoretical calculations. The preliminary index is investigated and fine-tuned in the context of homogeneous spherical clusters in Section 3.1 and then extended to the case of multiple groups. The methodology is then studied for the general case in Section 3.2. Our derived index can be used to quantify class distinctions in grouped data, and we illustrate this application in Section 4 in the context of several classification datasets. The main paper concludes with some discussion.
The paper also has an appendix detailing the algorithm that uses our index to generate datasets of desired clustering complexity.

2 METHODOLOGICAL DEVELOPMENT

Consider a dataset $S = \{X_1, X_2, \ldots, X_n\}$. The objective of hard clustering or fixed-partitioning algorithms is to group the observations into hard categories $C_1, C_2, \ldots, C_K$ such that some objective function measuring the quality of fit is minimized. If the objective function is specified in terms of minimizing the total distance of each observation from some characteristic of the assigned partition, then it is defined as

$O_K = \sum_{i=1}^{n} \sum_{k=1}^{K} I(X_i \in C_k)\, D_k(X_i)$,  (1)

where $D_k(X_i)$ is the distance of $X_i$ from the center of the $k$-th partition, assumed to be of the form

$D_k(X_i) = (X_i - \mu_k)' \Delta_k (X_i - \mu_k)$,  (2)

where $\Delta_k$ is a non-negative definite matrix of dimension $p \times p$. We consider two special cases here: in the first case, $\Delta_k = I_p$, the identity matrix of order $p \times p$, so that $D_k(X_i)$ reduces to the squared Euclidean distance, and solving (1) involves finding partitions $C_1, C_2, \ldots, C_K$ in $S$ such that the sum of the squared Euclidean distances of each observation to the center of its assigned partition is minimized. The popular k-means algorithm [34], [35] provides locally optimal solutions to this minimization problem. In the second scenario, $\Delta_k = \Sigma_k^{-1}$, where $\Sigma_k$ is the dispersion matrix of the $k$th partition. Then $D_k(X_i)$ is the Mahalanobis distance between $X_i$ and $\mu_k$. In this paper, we adopt a convenient model-based formulation for the setup above. In this formulation, $X_1, X_2, \ldots, X_n$ are independent $p$-variate observations with $X_i \sim N_p(\mu_{\zeta_i}, \Sigma_{\zeta_i})$, where $\zeta_i \in \{1, 2, \ldots, K\}$ for $i = 1, 2, \ldots, n$. Here we assume that the $\mu_k$s are all distinct and that $n_k$ is the number of observations in cluster $k$.
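As a concrete illustration of the objective in (1)–(2), the computation for a fixed hard partition can be sketched in a few lines. This is our own minimal sketch (the function names and the toy data are illustrative, not from the paper); taking each $\Delta_k = I_p$ recovers the squared-Euclidean objective that k-means locally minimizes.

```python
import numpy as np

def distance_sq(x, mu, delta):
    """D_k(x) = (x - mu)' Delta_k (x - mu), as in eq. (2)."""
    d = x - mu
    return float(d @ delta @ d)

def objective(X, labels, centers, deltas):
    """O_K of eq. (1): total distance of each observation from the
    center of its assigned partition."""
    return sum(distance_sq(x, centers[k], deltas[k])
               for x, k in zip(X, labels))

# Toy data: two points, two clusters, Euclidean case (Delta_k = I_p).
X = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = [0, 1]
centers = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
deltas = [np.eye(2), np.eye(2)]
O_K = objective(X, labels, centers, deltas)  # 0 + 1 = 1.0
```

With $\Delta_k = \Sigma_k^{-1}$ in place of the identity matrices, the same function computes the Mahalanobis version of the objective.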
Then the density of the $X_i$s is given by $f(x) = \sum_{k=1}^{K} I(X \in C_k)\,\phi(x; \mu_k, \Sigma_k)$, where $C_k$ is the subpopulation indexed by the $N_p(\mu_k, \Sigma_k)$ density and $I(X \in C_k)$ is an indicator function specifying whether observation $X$ belongs to the $k$th group having the $p$-dimensional multivariate normal density $\phi(x; \mu_k, \Sigma_k) = (2\pi)^{-\frac{p}{2}}\,|\Sigma_k|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k)\right)$, $k = 1, \ldots, K$. When $\Sigma_k = I_p$, maximizing the log-likelihood with respect to the parameters $\zeta_i$, $i = 1, 2, \ldots, n$, and the $\mu_k$s is equivalent to solving (1) with $D_k(X_i)$ as the Euclidean distance. Using this formalism, we develop, in theoretical terms, a separation index that quantifies separation between any two clusters and relates it to the difficulty in recovering the true partition $C_1, C_2, \ldots, C_K$ of a dataset. We begin by defining a preliminary index.

2.1 A preliminary separation index between two clusters

Consider the case with $K = 2$ groups, labeled $C_j$ and $C_\ell$ (for $j \ne \ell \in \{1, 2\}$). Define

$Y_{j,\ell}(X) = D_j(X) - D_\ell(X)$, where $X \in C_\ell$,  (3)

and $Y_{\ell,j}(X)$ similarly. Using the modeling formulation above, $Y_{j,\ell}(X)$ is a random variable which represents the difference in squared distances of $X \in C_\ell$ to the center of $C_j$ and to the center of $C_\ell$. For distance-based classification methods, $\Pr[Y_{j,\ell}(X) < 0]$ is the probability that an observation is classified into $C_j$ when in fact the observation is from $C_\ell$. Intuitively, one expects that since the $X_i$ with $\zeta_i = \ell$, $i = 1, \ldots, n$, belong to $C_\ell$, most of these $n_\ell$ observations will be closer to the mean of $C_\ell$ than to the mean of $C_j$. Based on this observation, we define the index in terms of the probability that an $\alpha$ fraction of the observations is closer to the incorrect cluster center. In other words, we find the probability that an order statistic (say, the $\lfloor n_\ell \alpha \rfloor$-th, for $\alpha \in (0, 1)$) of $\{Y_{j,\ell}(X_i) : \zeta_i = \ell, i = 1, \ldots, n\}$ is less than 0. (Here $\lfloor x \rfloor$ is the greatest integer smaller than $x$.) We specify this probability as $p_{j,\ell}$. To simplify notation we will assume that $n_\ell \alpha$ is an integer.
Therefore,

$p_{j,\ell} = \sum_{i=\lfloor n_\ell \alpha \rfloor}^{n_\ell} \binom{n_\ell}{i} \Pr[Y_{j,\ell}(X) < 0]^i \, \Pr[Y_{j,\ell}(X) > 0]^{n_\ell - i}$,  (4)

and $p_{\ell,j}$ is defined similarly. Since both these probabilities can be extremely small, we take their average to an inverse power of a function of $n_j$ and $n_\ell$. We define the pairwise index as

$I^{j,\ell} = \left(1 - \frac{1}{2}\left(p_{\ell,j} + p_{j,\ell}\right)\right)^{1/n_{j,\ell}}$,  (5)

where $n_{j,\ell} = n_{\ell,j} = \sqrt{(n_\ell^2 + n_j^2)/2}$. Note that this index incorporates the class sizes, which is desirable since misclassification rates are affected by the relative sizes of groups. Also, when $n_j = n_\ell = n$ then $n_{j,\ell} = n$. The index $I^{j,\ell}$ takes values in $[0, 1]$, with a value close to unity indicating a well-separated dataset and values closer to zero indicating that the two groups have substantial overlap. We call the index in (5) the preliminary index, because we will make suitable adjustments to it, obtained through simulations, in Section 3. The calculation of $I^{j,\ell}$ requires knowledge of the distribution of the random variables defined in (3), which is summarized in the theorem below for the different cases. Here and for the remainder of this paper, the notation $X \stackrel{d}{=} Y$ means that $X$ has the same distribution as $Y$.

Theorem 1: For any $\ell, j \in \{1, \ldots, K\}$, let $Y_{j,\ell}(X)$ be as defined in (3), for distance metrics $D_j$ and $D_\ell$ as defined in (2). Further assume $X \sim N_p(\mu_\ell, \Sigma_\ell)$. Then for specific choices of $\ell, j$ and for $\mu_{j\ell} = \mu_\ell - \mu_j$ the following hold:

a) For the Euclidean distance ($\Delta_j = \Delta_\ell = I$), $Y_{j,\ell}(X) \sim N(\mu_{j\ell}'\mu_{j\ell},\; 4\mu_{j\ell}'\Sigma_\ell\mu_{j\ell})$.

b) For the Mahalanobis distance ($\Delta_j = \Sigma_j^{-1}$, $\Delta_\ell = \Sigma_\ell^{-1}$), when both clusters have identical covariance structures, i.e. $\Sigma_\ell = \Sigma_j \equiv \Sigma$, then $Y_{j,\ell}(X) \sim N(\mu_{j\ell}'\Sigma^{-1}\mu_{j\ell},\; 4\mu_{j\ell}'\Sigma^{-1}\mu_{j\ell})$.

c) For the Mahalanobis distance ($\Delta_j = \Sigma_j^{-1}$, $\Delta_\ell = \Sigma_\ell^{-1}$), when the two clusters do not have identical covariance structures (i.e. $\Sigma_\ell \ne \Sigma_j$), let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of $\Sigma_{j\ell} \equiv \Sigma_\ell^{\frac{1}{2}}\Sigma_j^{-1}\Sigma_\ell^{\frac{1}{2}}$ and let the corresponding eigenvectors be given by $\gamma_1, \gamma_2, \ldots, \gamma_p$.
Then $Y_{j,\ell}(X)$ has the distribution of

$\sum_{i=1:\,\lambda_i \ne 1}^{p} (\lambda_i - 1)\left[U_i - \frac{\lambda_i \delta_i^2}{(\lambda_i - 1)^2}\right] + \sum_{i=1:\,\lambda_i = 1}^{p} \lambda_i \delta_i (2W_i + \delta_i)$,

where the $U_i$s are independent non-central $\chi^2$ random variables with one degree of freedom and non-centrality parameter given by $\lambda_i^2 \delta_i^2/(\lambda_i - 1)^2$, with $\delta_i = \gamma_i' \Sigma_\ell^{-\frac{1}{2}}(\mu_\ell - \mu_j)$ for $i \in \{1, 2, \ldots, p\} \cap \{i : \lambda_i \ne 1\}$, independent of the $W_i$s, which are independent $N(0, 1)$ random variables, for $i \in \{1, 2, \ldots, p\} \cap \{i : \lambda_i = 1\}$.

Proof: For any $p$-variate vector $X$,

$D_j(X) - D_\ell(X) = (X - \mu_j)'\Delta_j(X - \mu_j) - (X - \mu_\ell)'\Delta_\ell(X - \mu_\ell) = X'(\Delta_j - \Delta_\ell)X + 2X'(\Delta_\ell\mu_\ell - \Delta_j\mu_j) + \mu_j'\Delta_j\mu_j - \mu_\ell'\Delta_\ell\mu_\ell$.  (6)

Therefore, when $\Delta_j = \Delta_\ell = I$, $Y_{j,\ell}(X) = D_j(X) - D_\ell(X) = 2X'(\mu_\ell - \mu_j) + \mu_j'\mu_j - \mu_\ell'\mu_\ell$, where $X \sim N_p(\mu_\ell, \Sigma_\ell)$. Simple algebra completes the proof of part a). A similar calculation using (6), when $\Delta_j = \Delta_\ell = \Sigma^{-1}$, shows that $Y_{j,\ell}(X) = 2X'\Sigma^{-1}(\mu_\ell - \mu_j) + \mu_j'\Sigma^{-1}\mu_j - \mu_\ell'\Sigma^{-1}\mu_\ell$, where $X \sim N_p(\mu_\ell, \Sigma)$. This completes the proof of part b) using similar arguments as above. For part c), let $\xi \sim N_p(0, I)$. Since $X \stackrel{d}{=} \Sigma_\ell^{\frac{1}{2}}\xi + \mu_\ell$, using (6), we have

$Y_{j,\ell}(X) = X'(\Sigma_j^{-1} - \Sigma_\ell^{-1})X + 2X'(\Sigma_\ell^{-1}\mu_\ell - \Sigma_j^{-1}\mu_j) + \mu_j'\Sigma_j^{-1}\mu_j - \mu_\ell'\Sigma_\ell^{-1}\mu_\ell$
$\stackrel{d}{=} (\Sigma_\ell^{\frac{1}{2}}\xi + \mu_\ell)'(\Sigma_j^{-1} - \Sigma_\ell^{-1})(\Sigma_\ell^{\frac{1}{2}}\xi + \mu_\ell) + 2(\Sigma_\ell^{\frac{1}{2}}\xi + \mu_\ell)'(\Sigma_\ell^{-1}\mu_\ell - \Sigma_j^{-1}\mu_j) + \mu_j'\Sigma_j^{-1}\mu_j - \mu_\ell'\Sigma_\ell^{-1}\mu_\ell$
$= \xi'(\Sigma_{j\ell} - I)\xi + 2\xi'\Sigma_\ell^{\frac{1}{2}}\Sigma_j^{-1}(\mu_\ell - \mu_j) + (\mu_\ell - \mu_j)'\Sigma_j^{-1}(\mu_\ell - \mu_j)$,  (7)

where $\Sigma_{j\ell} = \Sigma_\ell^{\frac{1}{2}}\Sigma_j^{-1}\Sigma_\ell^{\frac{1}{2}}$. Let the spectral decomposition of $\Sigma_{j\ell}$ be given by $\Sigma_{j\ell} = \Gamma_{j\ell}\Lambda_{j\ell}\Gamma_{j\ell}'$, where $\Lambda_{j\ell}$ is a diagonal matrix containing the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of $\Sigma_{j\ell}$, and $\Gamma_{j\ell}$ is an orthogonal matrix containing the eigenvectors $\gamma_1, \gamma_2, \ldots, \gamma_p$ of $\Sigma_{j\ell}$. Since $Z \equiv \Gamma_{j\ell}'\xi \sim N_p(0, I)$ as well, we get from (7), using $\Sigma_\ell^{\frac{1}{2}}\Sigma_j^{-1} = \Gamma_{j\ell}\Lambda_{j\ell}\Gamma_{j\ell}'\Sigma_\ell^{-\frac{1}{2}}$ and $\Sigma_j^{-1} = \Sigma_\ell^{-\frac{1}{2}}\Gamma_{j\ell}\Lambda_{j\ell}\Gamma_{j\ell}'\Sigma_\ell^{-\frac{1}{2}}$, that

$Y_{j,\ell}(X) \stackrel{d}{=} (\Gamma_{j\ell}'\xi)'(\Lambda_{j\ell} - I)(\Gamma_{j\ell}'\xi) + 2(\Gamma_{j\ell}'\xi)'\Lambda_{j\ell}\Gamma_{j\ell}'\Sigma_\ell^{-\frac{1}{2}}(\mu_\ell - \mu_j) + (\mu_\ell - \mu_j)'\Sigma_\ell^{-\frac{1}{2}}\Gamma_{j\ell}\Lambda_{j\ell}\Gamma_{j\ell}'\Sigma_\ell^{-\frac{1}{2}}(\mu_\ell - \mu_j)$
$\stackrel{d}{=} \sum_{i=1}^{p} \left[(\lambda_i - 1)Z_i^2 + 2\lambda_i\delta_i Z_i + \lambda_i\delta_i^2\right]$,  (8)

where $\delta_i$, $i = 1, \ldots, p$, are as in the statement of the theorem. Depending on the values of $\lambda_i$, one can simplify the expression in (8).
If $\lambda_i > 1$: $(\lambda_i - 1)Z_i^2 + 2\lambda_i\delta_i Z_i + \lambda_i\delta_i^2 = (\lambda_i - 1)\left(Z_i + \frac{\lambda_i\delta_i}{\lambda_i - 1}\right)^2 - \frac{\lambda_i\delta_i^2}{\lambda_i - 1}$, which is distributed as a $(\lambda_i - 1)\,\chi^2_{1,\,\lambda_i^2\delta_i^2/(\lambda_i - 1)^2} - \frac{\lambda_i\delta_i^2}{\lambda_i - 1}$ random variable. If $\lambda_i < 1$: $(\lambda_i - 1)Z_i^2 + 2\lambda_i\delta_i Z_i + \lambda_i\delta_i^2 = -(1 - \lambda_i)\left(Z_i + \frac{\lambda_i\delta_i}{\lambda_i - 1}\right)^2 - \frac{\lambda_i\delta_i^2}{\lambda_i - 1}$, which is distributed as a $-(1 - \lambda_i)\,\chi^2_{1,\,\lambda_i^2\delta_i^2/(\lambda_i - 1)^2} - \frac{\lambda_i\delta_i^2}{\lambda_i - 1}$ random variable. In the case of $\lambda_i = 1$, $(\lambda_i - 1)Z_i^2 + 2\lambda_i\delta_i Z_i + \lambda_i\delta_i^2 = 2\lambda_i\delta_i Z_i + \lambda_i\delta_i^2 = \lambda_i\delta_i(2Z_i + \delta_i)$. The proof follows from incorporating these expressions and rearranging terms in (8).

Remark 2: Computing $\Pr[Y_{j,\ell}(X) < 0]$ in cases a) and b) of the theorem involves calculating Gaussian probabilities. For the third case (part c) of the theorem), we see that this involves calculating the probability of a linear combination of independent non-central $\chi^2$ and Gaussian variables, for which we use Algorithm AS 155 of [36]. Once $\Pr[Y_{j,\ell}(X) < 0]$ and $\Pr[Y_{\ell,j}(X) < 0]$ are computed, the index can be calculated from (5) using $p_{j,\ell}$ and $p_{\ell,j}$ computed from (4).

Properties of our Preliminary Index

In this section, we highlight some properties of our preliminary index with regard to achievable values and scaling that will be used for our second objective of simulating random configurations satisfying a desired value of our index.
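As a small numerical companion to Theorem 1a and eqs. (4)–(5), the two-cluster computation for the Euclidean case can be sketched as follows. The function names are ours, and SciPy's normal and binomial distributions stand in for the probability calculations (the heterogeneous case c) would instead need the linear-combination-of-$\chi^2$ routine of [36]).

```python
import numpy as np
from scipy.stats import norm, binom

def misclass_prob(mu_l, mu_j, sigma_l):
    """Pr[Y_{j,l}(X) < 0] for X ~ N_p(mu_l, Sigma_l), Euclidean case:
    Theorem 1a gives Y ~ N(m'm, 4 m' Sigma_l m) with m = mu_l - mu_j."""
    m = np.asarray(mu_l, float) - np.asarray(mu_j, float)
    return float(norm.cdf(-(m @ m) / (2.0 * np.sqrt(m @ sigma_l @ m))))

def preliminary_index(mu_j, mu_l, sigma_j, sigma_l, n_j, n_l, alpha):
    """Preliminary pairwise index I^{j,l} of eqs. (4)-(5)."""
    q_jl = misclass_prob(mu_l, mu_j, sigma_l)   # X from C_l nearer to C_j
    q_lj = misclass_prob(mu_j, mu_l, sigma_j)   # X from C_j nearer to C_l
    # eq. (4): Pr[at least floor(n*alpha) of the n observations are
    # closer to the wrong centre] = binomial upper tail
    p_jl = float(binom.sf(int(n_l * alpha) - 1, n_l, q_jl))
    p_lj = float(binom.sf(int(n_j * alpha) - 1, n_j, q_lj))
    n_pair = np.sqrt((n_j ** 2 + n_l ** 2) / 2.0)           # n_{j,l}
    return (1.0 - 0.5 * (p_lj + p_jl)) ** (1.0 / n_pair)    # eq. (5)
```

Widely separated means shrink the misclassification probability, hence each $p$, pushing the index toward 1; nearly coincident means push it down toward the floor $I^{j,\ell}_0$ of Theorem 3 below.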

Theorem 3: Fix $c > 0$ and $\alpha \in (0, 1)$, and consider two clusters $C_\ell$ and $C_j$. Each $C_i$ has $n_i$ $p$-variate Gaussian observations with mean $c\mu_i$ and covariance $\Sigma_i$, $i \in \{j, \ell\}$. Then for specific choices of the distance metric, the following properties hold:

a) For the Euclidean distance ($\Delta_j = \Delta_\ell = I$), define $\theta_i \equiv \Phi(\sqrt{n_i}(1 - 2\alpha))$, $i \in \{j, \ell\}$, and $I^{j,\ell}_0 \equiv \left(1 - \frac{\theta_j + \theta_\ell}{2}\right)^{1/n_{j,\ell}}$. Then for large $n_j, n_\ell$,

$\lim_{c \to 0} I^{j,\ell} \approx I^{j,\ell}_0$ and $\lim_{c \to \infty} I^{j,\ell} = 1$.  (9)

b) For the special case of the Mahalanobis distance with identical covariance structures (i.e., $\Delta_j = \Delta_\ell = \Sigma^{-1}$), define $\theta_j$, $\theta_\ell$ and $I^{j,\ell}_0$ as in part a). Then for large $n_j, n_\ell$,

$\lim_{c \to 0} I^{j,\ell} \approx I^{j,\ell}_0$ and $\lim_{c \to \infty} I^{j,\ell} = 1$.  (10)

c) For the general Mahalanobis distance (i.e., $\Delta_j = \Sigma_j^{-1}$, $\Delta_\ell = \Sigma_\ell^{-1}$ with $\Sigma_\ell \ne \Sigma_j$), let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of $\Sigma_\ell^{\frac{1}{2}}\Sigma_j^{-1}\Sigma_\ell^{\frac{1}{2}}$ and let $Z_1, \ldots, Z_p$ be independent $N(0, 1)$ random variables. Define $\kappa \equiv \Pr\left[\sum_{i=1}^{p}(\lambda_i - 1)Z_i^2 < 0\right]$ and, for $i \in \{j, \ell\}$, $\theta_i \equiv \Phi\left(\frac{\sqrt{n_i}(\kappa - \alpha)}{\sqrt{\kappa(1 - \kappa)}}\right)$. Also define $I^{j,\ell}_0 \equiv \left(1 - \frac{\theta_j + \theta_\ell}{2}\right)^{1/n_{j,\ell}}$. Then for large $n_j, n_\ell$,

$\lim_{c \to 0} I^{j,\ell} \approx I^{j,\ell}_0$ and $\lim_{c \to \infty} I^{j,\ell} = 1$.  (11)

Proof: First note that, for large $n_\ell$ and by the normal approximation to binomial probabilities,

$p_{j,\ell} = \sum_{i=\lfloor n_\ell\alpha \rfloor}^{n_\ell} \binom{n_\ell}{i} \Pr[Y_{j,\ell}(X) < 0]^i \Pr[Y_{j,\ell}(X) > 0]^{n_\ell - i} \approx \Phi\left(\frac{\sqrt{n_\ell}\,(\Pr[Y_{j,\ell}(X) < 0] - \alpha)}{\sqrt{\Pr[Y_{j,\ell}(X) < 0]\,\Pr[Y_{j,\ell}(X) > 0]}}\right)$.  (12)

We get from Theorem 1a) that $Y_{j,\ell}(X) \sim N\left(c^2(\mu_\ell - \mu_j)'(\mu_\ell - \mu_j),\; 4c^2(\mu_\ell - \mu_j)'\Sigma_\ell(\mu_\ell - \mu_j)\right)$ and hence

$\Pr[Y_{j,\ell}(X) < 0] = \Phi\left(-\frac{c\,(\mu_\ell - \mu_j)'(\mu_\ell - \mu_j)}{2\sqrt{(\mu_\ell - \mu_j)'\Sigma_\ell(\mu_\ell - \mu_j)}}\right)$.

Taking limits of both sides and using the continuity of $\Phi(\cdot)$, we get

$\lim_{c \to 0} \Pr[Y_{j,\ell}(X) < 0] = 0.5$, and $\lim_{c \to \infty} \Pr[Y_{j,\ell}(X) < 0] = 0$.

This, together with (12) and the definition of $I^{j,\ell}$ in (5), completes the proof of part a). The proof of part b) is very similar to that of part a) and is omitted. For part c), note that from Theorem 1c),

$Y_{j,\ell}(X) \stackrel{d}{=} \sum_{i=1}^{p}\left[(\lambda_i - 1)Z_i^2 + 2\lambda_i\delta_i c Z_i + \lambda_i\delta_i^2 c^2\right]$,  (13)

where $\delta_i$, $i = 1, \ldots, p$, are as defined there (and using the fact that the means are $c\mu_j$ and $c\mu_\ell$ here).
Then by continuity arguments, it follows that

$\lim_{c \to 0} \Pr[Y_{j,\ell}(X) < 0] = \Pr\left[\sum_{i=1}^{p}(\lambda_i - 1)Z_i^2 < 0\right] = \kappa$.

Note that the right side of (13) is a quadratic in $c$ with leading coefficient $\sum_{i=1}^{p}\lambda_i\delta_i^2 > 0$. Hence, for large values of $c$ (in particular, when $c$ is larger than the largest root of the quadratic equation), the right side in (13) is positive. Taking limits on both sides of (13) and using continuity arguments, we get $\lim_{c \to \infty} \Pr[Y_{j,\ell}(X) < 0] = 0$. This completes the proof of Theorem 3.

The following corollary is immediate from the proof above.

Corollary 4: Fix $\alpha \in (0, 1)$. For any $c > 0$, let $I^{j,\ell}(c) = I^{j,\ell}$ be our preliminary index of separation between two groups $C_\ell$ and $C_j$, each having $n_i$ $p$-variate Gaussian observations with mean $c\mu_i$ and covariance $\Sigma_i$, $i \in \{j, \ell\}$. For the different choices of the distance metric, let $I^{j,\ell}_0$ be as defined in the three parts of Theorem 3. Then $I^{j,\ell}(c)$ is a continuous function of $c$ with its range containing $(I^{j,\ell}_0, 1)$.

From Theorem 1 (see Remark 2) and Corollary 4, one can compute the value of the preliminary index for given values of the cluster means and dispersions. In real datasets, we can apply some clustering method to obtain the cluster memberships, estimate the cluster means and the variance structures, and compute an estimated version of the index. See Section 4 for one such application.

2.2 Generating Data with given values of the Index

From Corollary 4, it is clear that any value in $(I^{j,\ell}_0, 1)$ is a possible value of $I^{j,\ell}$ for a given set of model parameters $(c\mu, \Sigma, n_1, n_2)$, by suitably choosing the scaling parameter $c > 0$. This leads to the data-generation algorithm described in the appendix. The main idea is that, for a given target value of the index, we start with some initial set of parameters $(p, \mu, \Sigma, n_1, n_2)$ and compute our index for this initial configuration. We then find $c > 0$ iteratively so that the data with parameters $(c\mu, \Sigma, n_1, n_2)$ attain the target value.
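The scaling idea can be sketched for the two-cluster Euclidean case: since $I^{j,\ell}(c)$ is continuous with range containing $(I^{j,\ell}_0, 1)$, a simple bisection on $c$ reaches any attainable target. This is our own illustrative sketch (function names are ours), not the more general multi-cluster algorithm of the appendix.

```python
import numpy as np
from scipy.stats import norm, binom

def index_at_scale(c, mu_j, mu_l, sigma, n, alpha):
    """I^{j,l}(c) for means c*mu_j, c*mu_l, common covariance sigma and
    n observations per cluster (Theorem 1a with eqs. (4)-(5))."""
    m = c * (np.asarray(mu_l, float) - np.asarray(mu_j, float))
    q = float(norm.cdf(-(m @ m) / (2.0 * np.sqrt(m @ sigma @ m))))
    p = float(binom.sf(int(n * alpha) - 1, n, q))
    return (1.0 - p) ** (1.0 / n)      # n_{j,l} = n when n_j = n_l

def scale_for_target(target, mu_j, mu_l, sigma, n, alpha,
                     lo=1e-9, hi=1e9, iters=200):
    """Bisection for the c > 0 at which I^{j,l}(c) attains the target;
    I^{j,l}(c) is increasing in c for this configuration."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if index_at_scale(mid, mu_j, mu_l, sigma, n, alpha) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Two hundred bisection steps on $(10^{-9}, 10^9)$ pin $c$ far below any practical tolerance; a production version would instead stop once $|I(c) - \text{target}|$ is small enough.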
The algorithm described in the appendix gives a more general version of this for multiple clusters, as well as for the adjusted version of the index discussed in Section 3.

3 ILLUSTRATIONS AND ADJUSTMENTS

In this section, we provide some illustrative examples obtained by simulating realizations from groups under different scenarios using the preliminary index and the algorithm in the appendix. We study the relationship of realizations obtained at these different values of the index with clustering difficulty, which we measure in terms of the Adjusted Rand index ($R$) of [37], obtained by clustering the observations in each dataset and comparing the result with the true classification. Note that by design, $R$ takes an average value of zero if all observations are assigned to groups completely at random. A perfect grouping of the observations matching the truth, on the other hand, yields $R$ its highest value of unity.

3.1 Homogeneous Spherical Clusters

Here, it is assumed that all groups have the same spherical dispersion structure, i.e., $\Sigma_k = \sigma^2 I$ for all $k \in \{1, 2, \ldots, K\}$. In this section, the idealized $R$ is calculated on each simulated dataset by comparing the true classification with that obtained

using the k-means algorithm of [35] started with the true (known) group means (in order to eliminate initialization issues in the obtained clustering). The obtained clustering may thus be regarded as the best possible grouping, and its degree of ability (as measured by $R$) to recover the true classes may be considered an indication of the clustering complexity of the simulated dataset.

The Two-Groups Case

[Figure 1 here] Fig. 1. Simulated datasets at three different values of I (rows at I = 0.6: panels a–c with R = 0.83, 0.86, 0.83; I = 0.8: panels d–f with R = 0.96, 0.96, 0.92; I = 1.0: panels g–i with R = 1.0 each). Colors and symbols represent the true cluster class. Above each set of plots we give the value of I and below each individual plot we give the obtained R.

Figures 1a–i display simulated datasets for different values of our preliminary index $I \equiv I^{1,2}$ with $\alpha = 0.75$. See [38] for plots of additional realizations and at other values of $I$. In each case, 100 observations were generated from each of two groups separated according to $I$. In the figures, color and plotting character are used to distinguish the true grouping. Figure 1 demonstrates that as the value of $I$ increases, the clusters become more separated. This is confirmed by the values of $R$. The classifications of the datasets in Figures 1a–c have the lowest $R$, between 0.83 and 0.86. Each subsequent row down has a range of $R$ higher than those in the previous row. Thus, there is some qualitative support for our measure of separation ($I$) as an indicator of difficulty; however, a more comprehensive analysis is needed before using $I$ as a quantitative measure. Therefore, we conducted a simulation study to investigate the validity of our index as a quantitative surrogate for clustering complexity.
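Since $R$ is used throughout as the yardstick for clustering complexity, we recall its computation; the following is a standard contingency-table implementation of the Adjusted Rand Index of [37] (the function name is ours).

```python
from math import comb
from collections import Counter

def adjusted_rand_index(truth, estimate):
    """Adjusted Rand Index R: averages zero under random assignment
    and equals one when the estimated grouping matches the true one."""
    n = len(truth)
    cells = Counter(zip(truth, estimate))          # contingency table n_ij
    sum_ij = sum(comb(v, 2) for v in cells.values())
    sum_a = sum(comb(v, 2) for v in Counter(truth).values())
    sum_b = sum(comb(v, 2) for v in Counter(estimate).values())
    expected = sum_a * sum_b / comb(n, 2)          # chance-expected agreement
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Note that $R$ is invariant to label permutations: a relabeled but otherwise perfect grouping still scores 1, while splitting each true group across the estimated ones drives $R$ toward (or below) zero.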
Specifically, we simulated 25 datasets, each having observations from two groups, at each of 15 evenly-spaced values of $I$ (using $\alpha = 0.75$) in $(0, 1)$, for nine different combinations of numbers of observations per cluster and dimensions. For each dataset obtained at each value of $I$, we calculated $R$ as outlined above. Figures 2a–c display the results for $p$ = 10, 100 and 1000, respectively, with $I$ on the x-axis and the corresponding $R$ on the y-axis. Color is used to indicate the number of observations per group in each dataset setting. In each figure, the shaded region for each color denotes the spread between the first and third quartiles of $R$ based on the 25 datasets. From the figures, we note that $I$ tracks $R$ very well, again providing qualitative support for our index as a surrogate for clustering complexity. However, the relationship is not linear. There is also more variability in $R$ when the number of observations is low. Further, the bands representing the spread between the first and third quartiles of $R$ often do not overlap under a change in dimension. Thus, there is some inconsistency in the relationship between our preliminary index $I$ and $R$ across dimensions. Indeed, this inconsistency is most apparent when the number of observations in the dataset is less than the number of dimensions. This inconsistency is not terribly surprising, and is in line with the so-called curse of dimensionality and the need for larger sample sizes with increasing $p$ [39] to obtain the same kinds of clustering performance as with lower dimensions. In order to maintain interpretability across dimensions, we therefore investigate adjustments to our preliminary index to account for the effect of $n$ and $p$.
We pursue this course in the remainder of this section.

An Initial Adjustment to I for group size and dimension

To understand further the relationship between $I$ and $R$, we simulated 25 datasets each with observations from two homogeneous spherical groups for all combinations of $(n_1, n_2, p)$, where $n_1 \le n_2$ (assumed without loss of generality) were the numbers of observations in the first and second groups, with $p \in \{2, 4, 5, 10, 20, 50, 100, 200, 500, 1000\}$ and $(n_1, n_2) \in \{\ldots, (75, 75), (100, 100), \ldots, (1000, 1000), (30, 100), \ldots, (60, 75), (90, 100), \ldots, (100, 1000)\}$. We simulated datasets according to ten values of $I$ evenly spaced between 0 and 1, for $\alpha = 0.75$. For each of these 1000 datasets thus obtained, we again obtained $R$ using k-means. We explored several relationships between $I$ and $R$ very extensively. Of these explorations, the following multiplicative relationship between $I$ and $R$ was found to perform the best:

$\log\left(\frac{1}{1 - R}\right) \approx \exp(\delta_\alpha)\left[\log\left(\frac{1}{1 - I}\right)\right]^{\theta_\alpha}$,  (14)

where $\delta_\alpha = \sum_{1 \le i \le j \le k \le 4} \zeta_{\omega_i, \omega_j, \omega_k} \log(\omega_i)\log(\omega_j)\log(\omega_k)$, $\theta_\alpha = \sum_{1 \le i \le j \le 4} \beta_{\omega_i, \omega_j} \log(\omega_i)\log(\omega_j)$, and $(\omega_1, \omega_2, \omega_3, \omega_4) = (n_1, n_2, p, e)$. Then for $\alpha = 0.75$ we fit the linear model

$\log\left[\log\left(\frac{1}{1 - R}\right)\right] \approx \theta_\alpha \log\left[\log\left(\frac{1}{1 - I}\right)\right] + \delta_\alpha$.  (15)
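Inverting the fitted relationship gives the adjusted index in closed form; a minimal sketch, assuming eq. (14) takes the form $\log(1/(1-R)) \approx \exp(\delta_\alpha)[\log(1/(1-I))]^{\theta_\alpha}$ and treating the fitted coefficients $\delta_\alpha$, $\theta_\alpha$ as given inputs (the values used below are illustrative, not the Table 1 estimates):

```python
import math

def adjusted_index(I, delta_alpha, theta_alpha):
    """Initial adjusted index R_{I,alpha}: solve eq. (14),
    log(1/(1-R)) = exp(delta_alpha) * log(1/(1-I))**theta_alpha,
    for R, giving R = 1 - exp(-exp(delta_alpha) * log(1/(1-I))**theta_alpha)."""
    g = math.exp(delta_alpha) * math.log(1.0 / (1.0 - I)) ** theta_alpha
    return 1.0 - math.exp(-g)
```

By construction, the map sends $I = 0$ to $R = 0$ and $I \to 1$ to $R \to 1$, and with $\delta_\alpha = 0$, $\theta_\alpha = 1$ it reduces to the identity.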

[Figure 2 here] Fig. 2. Plots for K = 2 clusters comparing R (y-axis) against I (x-axis) for α = 0.75, in panels a) p = 10, b) p = 100, c) p = 1000. The three colors designate the number of observations per cluster, such that n_1 = n_2. The lower and upper bounds of the bands represent the first and third quartiles of R calculated on 25 datasets for several different I.

The estimates for the coefficients in (15) were obtained using least squares, so as to obtain the best possible fit of $I$ to $R$, and are displayed in Table 1 under the columns labeled "hom". (See [38] for parameter estimates obtained at $\alpha$ = 0.25, 0.50.) This relationship has the property that a large value of $I$ corresponds to an $R$ that is close to 1, while values of $I$ that are close to zero correspond to $R$s that are also close to zero. The sum of the estimates for the four three-way interaction terms for the $n$s is close to zero for each $\alpha$. This suggests that, when $n_1 = n_2$, the cubed term for the $n$s is not that different from zero. Therefore, based on the parameter estimates in the left panel of Table 1, we define $R_{I,\alpha}$ as follows:

$R_{I,\alpha} = 1 - \exp\left(-\exp(\delta_\alpha)\left[\log\left(\frac{1}{1 - I}\right)\right]^{\theta_\alpha}\right)$.  (16)

We call $R_{I,\alpha}$ our initial adjusted index. We now investigate its performance in indexing clustering complexity in a framework similar to that of Figures 1 and 2. Figure 3 illustrates two-cluster datasets simulated in a similar manner as in Figure 1, but using our initial adjusted index $R_{I,0.75}$. Similar is the case with Figure 4, which mimics the setup of Figure 2 (with the only difference being that the datasets here are generated using $R_{I,0.75}$ instead of $I$). In both cases, we note that the agreement between the numerical values of $R$ and $R_{I,0.75}$ is substantially improved. In particular, Figures 3a–i show that the range of actual values of $R$ contains the value of $R_{I,0.75}$. Also, Figures 4a–c demonstrate that $R_{I,0.75}$ tracks $R$ very well. Similar is the case with $\alpha$ = 0.25 and 0.50 (
see [38]), providing support for the use of $R_{I,\alpha}$ as a surrogate of clustering complexity. This support is consistent across different numbers of observations and dimensions. Note, however, that Figure 4 continues to indicate greater variability in the obtained $R$ when there are fewer observations in a group. This is expected, because smaller sample sizes lead to results with higher variability.

[Figure 3 here] Fig. 3. Simulated datasets at different R_{I,0.75} (top of each row): R_{I,0.75} = 0.6 (panels a–c, with obtained R = 0.65, 0.64, 0.55), R_{I,0.75} = 0.8 (panels d–f, R = 0.79, 0.81, 0.83) and R_{I,0.75} = 1.0 (panels g–i, R = 1.0 each). Color and symbol represent the true class. Obtained R values are provided below each plot.

The Case with Many Groups (K > 2)

So far we have analyzed and developed our index for two-class datasets. In this section, we extend it to the general multi-group case. Clearly, separation between different pairs of clusters impacts clustering difficulty. We investigated several

possibilities for summarizing clustering difficulty based on all $\binom{K}{2}$ pairwise indices; however, they all possessed several drawbacks. For instance, the average pairwise separation is typically high because many of the $R_{I,\alpha}$s are close to 1, while the minimum is overly influenced by the presence of (only) two close groups. Therefore, we investigated an adaptation of the summarized multiple Jaccard similarity index proposed by [40]. The Jaccard coefficient of similarity [41] measures similarity or overlap between two species or populations. This was extended by [40] for summarizing many pairwise indices in the case of multiple populations. Note that both the Jaccard index and its summarized multiple version address similarity or overlap between populations, while our pairwise index measures separability. This needs to be adapted for our case. Therefore, we define $R^{ii}_{I,\alpha} = 0$ for $i = 1, \ldots, K$. Further, for each pair $1 \le i \ne j \le K$ of clusters, let $R^{ij}_{I,\alpha} = R^{ji}_{I,\alpha} \equiv R_{I,\alpha}$, i.e., the adjusted index defined using clusters $C_i$ and $C_j$. Also, let $\Upsilon = ((R^{ij}_{I,\alpha}))_{1 \le i,j \le K}$ be the matrix of the corresponding $R^{ij}_{I,\alpha}$s.

[Table 1 here] TABLE 1. Estimated parameter values (coefficients ζ and β of eq. (15), indexed by n_1, n_2 and p) to adjust the index for n and p when α = 0.75, for clusters with homogeneous spherical (hom) and general heterogeneous (het) dispersion structures. Fonts in the table represent bounds on the p-values (bold and underlined for a p-value < 0.001, bold and italic for a p-value < 0.01, bold for a p-value < 0.05, italic for a p-value < 0.1, and regular font otherwise).

[Figure 4 here] Fig. 4. Plots for K = 2 clusters comparing R against R_{I,0.75}, in panels a) p = 10, b) p = 100, c) p = 1000. Colors and bands are as in Figure 2.
Then we define the following summarized index for $K$ clusters:

$\bar{R}_{I,\alpha} = 1 - \frac{\lambda_{(1)}(J_K - \Upsilon) - 1}{K - 1}$,  (17)

where $J_K$ is a $K \times K$ matrix with all entries equal to unity, and $\lambda_{(1)}(J_K - \Upsilon)$ is the largest eigenvalue of the matrix $J_K - \Upsilon$. [40] motivates his summary using principal components analysis (PCA) in the context of a correlation matrix, where the first principal component is the orthogonal projection of the dataset that captures the greatest variability in the $K$ coordinates. Like his summary, our summary index (17) has some very appealing properties. When the matrix of pairwise separation indices is $J_K - I_K$, i.e., $R^{ij}_{I,\alpha} = 1$ for all $i \ne j$, the first (largest) eigenvalue captures only a $1/K$ proportion of the total of the eigenvalues. In this case, $\bar{R}_{I,\alpha} = 1$. On the other hand, when every element in the matrix is zero, there is perfect overlap between all groups, and only the largest eigenvalue is non-zero. In this case $\bar{R}_{I,\alpha} = 0$. Finally, when $K = 2$, $\bar{R}_{I,\alpha}$ is consistent with the (sole) pairwise index $R^{ij}_{I,\alpha}$.

Final Adjustments to the Separation Index

While (17) provides a summarized pairwise measure that is appealing in some special cases, we still need to investigate its performance as a surrogate measure of clustering complexity. We therefore performed a detailed simulation study of the relationship between the summarized $\bar{R}_{I,\alpha}$ and $R$. Specifically, we generated 25 $K$-cluster datasets for $K$ = 3, 5 and 10, each with 100 dimensions, at each of 10 different values of $\bar{R}_{I,0.75}$ between 0 and 1, for equal numbers of observations in each group (i.e., $n_k \equiv n_0$), where $n_0 \in \{\ldots, 100, 1000\}$. For each dataset, we used the k-means algorithm and computed the $R$ of the resulting clustering. Figures 5a–c plot the interquartile ranges of $R$ for each combination of $K$ and $n_0$, with $\bar{R}_{I,\alpha}$ on the x-axis and $R$ on the y-axis. Figures 5a–c demonstrate that $\bar{R}_{I,\alpha}$ tracks $R$. In addition, the relationship between $\bar{R}_{I,\alpha}$ and $R$ is consistent for different

(a) K = 3 (b) K = 5 (c) K = 10
Fig. 5. Plots of R against R_{I,0.75}. The three colors designate numbers of observations per cluster, set to be equal and in {…, 100, 1000}. Other aspects of the plots are as in Figure 2.

(a) K = 3 (b) K = 5 (c) K = 10
Fig. 6. Plots of R against I when α = 0.75. Other aspects of the plot are as in Figure 5.

numbers of observations. However, it is also clear that the exact relationship between R_{I,α} and R depends on K, so some more adjustments are called for. To study this issue further, we simulated 25 datasets for each combination of p ∈ {2, 4, 5, 10, …, 100, …, 1000}, n_i = n_j ≡ n_0 for all i, j, where n_0 ∈ {…, 75, 100, …, 1000}, at ten evenly-spaced values of R_{I,α} in (0, 1), α ∈ {0.25, 0.5, 0.75} and K ∈ {3, 5, 7, 10}. For each dataset we calculated R. Using the R from each of these datasets, and for each combination of (p, α, K), we fit the multiplicative model:

$R \approx R_{I,\alpha}^{\beta_{k,p}}$.   (18)

Using the parameter estimates for β_{k,p} for each combination of dimension and number of clusters, we then fit the following linear model separately for each tuning parameter:

$\beta_{k,p} = \eta_0 + \eta_1 k + \eta_2 k^2 + \eta_3 p + \eta_4 p^2 + \eta_5 kp$.   (19)

Parameter estimates for this model for the case of α = 0.75 are presented in Table 2 (see [38] for the parameter estimates when α ∈ {0.25, 0.5}). Thus the final version of our index, after all adjustments, for the case of homogeneous spherical clusters is

$I = R_{I,\alpha}^{\beta_{k,p}}$.   (20)

TABLE 2
Table of estimated parameter values to adjust the index for K and p when α = 0.75, for clusters with homogeneous spherical (hom) and general heterogeneous (het) dispersion structures. For the estimated parameters, two of the p-values are < 0.01 and the rest are smaller still. For any two constants a and b, aE-b means a × 10^{-b}. Rows: hom, het; columns: η_0 through η_5.
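The computation in (17) and the power adjustment in (18)-(20) are straightforward to carry out numerically. The sketch below is a minimal illustration, not the authors' code: `summarized_index`, its argument `upsilon` (playing the role of the matrix Υ) and `beta` (a fitted exponent β_{k,p}, which in practice would come from (19) with the estimates in Table 2) are our own hypothetical names.

```python
import numpy as np

def summarized_index(upsilon, beta):
    """Summarize a K x K matrix of pairwise separation indices
    (symmetric, zero diagonal) into a single value via (17), then
    apply the multiplicative adjustment of (18)-(20).

    `upsilon` and `beta` are hypothetical inputs standing in for the
    matrix Upsilon and a fitted exponent beta_{k,p}.
    """
    K = upsilon.shape[0]
    J = np.ones((K, K))
    # eigvalsh returns eigenvalues in ascending order; take the largest
    lam1 = np.linalg.eigvalsh(J - upsilon)[-1]
    r = 1.0 - (lam1 - 1.0) / (K - 1.0)
    return r ** beta

# Boundary cases discussed after (17):
K = 4
perfect = np.ones((K, K)) - np.eye(K)  # every pairwise index equals 1
overlap = np.zeros((K, K))             # complete overlap between all groups
print(summarized_index(perfect, beta=1.2))  # -> 1.0
print(summarized_index(overlap, beta=1.2))  # -> 0.0
```

The two printed cases reproduce the boundary behavior discussed after (17): all pairwise indices equal to 1 give a summarized value of 1, and an all-zero matrix gives 0.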

Figures 6a-c are constructed similarly to Figures 5a-c, except that the datasets are now generated using the algorithm in the appendix (and using I instead of R_{I,0.75}). Note that the effect of K has been largely addressed and the relationship between I and R is fairly similar across dimensions.

I = 0.6: (a) R = 0.57 (b) R = 0.58 (c) R = 0.58
I = 0.8: (d) R = 0.80 (e) R = 0.82 (f) R = 0.76
I = 1.0: (g) R = 1.0 (h) R = 1.0 (i) R = 1.0
Fig. 7. Four-component simulated datasets at different I-values, with color and plotting character used to represent the true class. Obtained R-values are displayed below each plot.

Illustrations: We now present, in Figures 7a-i, some simulated four-class datasets to demonstrate possible configurations obtained using the algorithm in the appendix for three different values of I. Additional realizations, and at different values of I, are in [38]. In each case we used α = 0.75. The different colors and characters in the figures represent the true classes. Note that the clusters are better separated as I increases. Clustering complexity also decreases, as confirmed by the computed values of R. The datasets in the first row have the lowest I, with each subsequent row down having a range of R higher than those in the previous row. In each row we see that I is within the range of the actual R index values. This provides further evidence that the transformed version of our index in the form of I is similar to R. We conclude this section with a few comments. Note that I is strictly between 0 and 1. Further, let us denote by I(c) the value of I that is defined using (15), (18) and (20), but with I_{i,j} replaced by I_{i,j}(c) as in Corollary 4 for the cluster C_i with n_i p-variate Gaussian observations with mean cµ_i and covariance Σ_i, i ∈ {1, ..., K}. The following corollary implies that we can generate datasets with any value of the final index between I(0) and unity using the algorithm in the appendix.
Corollary 5: Fix α.
Let c > 0. Then, for positive θ_α and β_{k,p}, I(c) is a continuous function whose range contains (I(0), 1). In addition, I(c) is an increasing function of c.
Proof: The result follows directly from Theorem 3 and Corollary 4.
Note that we found the adjustments in (15) and (20) empirically through simulations. In all of the cases we considered, θ_α and β_{k,p} are positive and thus Corollary 5 holds. Ideally, we should conduct further simulations if we desire adjustments for n_i, n_j, K or p that are very different from the cases considered in our simulations. In this case we need to find other parameter estimates as in Tables 1 and 2.

3.2 Clusters with General Ellipsoidal Dispersions
In this section, we assume that Σ_k is any general nonnegative definite covariance matrix, for each k ∈ {1, ..., K}. In this case, the k-means algorithm is no longer applicable, so we used hierarchical clustering with Euclidean distance and Ward's criterion [42] to obtain the tree, which was subsequently cut to yield K clusters. The R-value of this grouping relative to the truth was taken to indicate the difficulty of clustering. The focus in this section is therefore to broadly relate our preliminary index and its adjustments to the R thus obtained.

3.2.1 The Two-Groups Case
Figures 8a-i display simulated datasets for different values of our preliminary index I, using the algorithm in the appendix with α = 0.75. In each case, 100 observations were generated from each of two groups separated according to I. In these figures, color and character distinguish the true grouping. For each simulated dataset, we also obtained R as described above. Figures 8a-c have the lowest R, between 0.56 and 0.67, but each subsequent row down has R values, on the average, higher than in previous rows. In general, therefore, Figures 8a-i demonstrate that as the value of I increases, the clusters become more separated. This coincides with an increase in the values of R.
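This calibration of separation is what Corollary 5 enables in practice: since I(c) is continuous and increasing in c, a prescribed index value can be located by simple bisection on c. The sketch below illustrates the idea with a toy monotone stand-in; `calibrate_c` and `index_of_c` are our own hypothetical names, and in the actual algorithm `index_of_c` would evaluate the final index for cluster means scaled to cµ_i as in Corollary 4.

```python
import math

def calibrate_c(index_of_c, target, c_lo=0.0, c_hi=1.0, tol=1e-8):
    """Locate the scaling c at which the continuous, increasing map
    c -> I(c) of Corollary 5 attains a prescribed target value.

    `index_of_c` is a hypothetical callable standing in for c -> I(c).
    """
    # grow the upper bracket until the target value is covered
    while index_of_c(c_hi) < target:
        c_hi *= 2.0
    # standard bisection on the monotone map
    while c_hi - c_lo > tol:
        mid = 0.5 * (c_lo + c_hi)
        if index_of_c(mid) < target:
            c_lo = mid
        else:
            c_hi = mid
    return 0.5 * (c_lo + c_hi)

# Toy monotone stand-in for I(c), for illustration only.
toy = lambda c: 1.0 - math.exp(-c)
c_star = calibrate_c(toy, 0.5)
print(round(toy(c_star), 6))  # -> 0.5
```

Monotonicity guarantees the bisection converges to the unique c attaining the target, provided the target lies in the range (I(0), 1) of Corollary 5.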
Thus lower values of I correspond to clustering problems of higher difficulty, while higher values of I are associated with higher values of R and hence lower clustering complexity. Similar to Section 3.1, we investigated further the relationship between I and R in the case of nonhomogeneous groups. Specifically, we simulated 25 datasets of observations from two nonhomogeneous populations, each with arbitrary ellipsoidal dispersion structures, for all possible combinations of (n_1, n_2, p, I), where n_1 ≤ n_2 (assumed w.l.o.g.), and n_1, n_2 are the numbers of observations in the two groups. In our experiments, p ∈ {2, 4, 5, 10, …, 100, …, 1000} and (n_1, n_2) ranged over fourteen combinations of equal and unequal group sizes. We simulated 25 datasets according to each of ten values of I spaced evenly between 0 and 1, and for α = 0.75. Each

I = 0.6: (a) R = 0.64 (b) R = 0.56 (c) R = 0.67
I = 0.8: (d) R = 0.81 (e) R = 0.81 (f) R = 0.83
I = 1.0: (g) R = 1.0 (h) R = 1.0 (i) R = 1.0
Fig. 8. Simulated datasets at different I in the case with general ellipsoidally-dispersed groups. Color and plotting character represent the true groups. Under each plot we also report the R comparing the true grouping with that obtained using hierarchical clustering with Ward's linkage and K = 2.

dataset was partitioned into two groups using hierarchical clustering with [42]'s linkage, and the resulting partition was evaluated against the true grouping of the simulated dataset in terms of R. We noticed nonlinearity in the general relationship between I and R, so, as in Section 3.1.1, we explored a relationship between I and R along with the parameters p and n. Similar to (14), (15), (18) and (19), we found appropriate adjustments and defined R_{I,α} for this case. (See the columns labeled het of Table 1 for the estimated values when α = 0.75, and [38] for the estimates obtained when α equals 0.25 and 0.5.)

3.2.2 The Case with Many Groups (K > 2)
Summarizing the index for K > 2 groups brings the same issues outlined in Section 3.1.2. We propose adapting [40]'s summarized multiple Jaccard similarity index in the same manner as before, but noting that R_{I,α} is calculated within the setting of nonhomogeneous ellipsoidal clusters. As in Section 3.1.2, we conducted an extensive simulation experiment to study the relationship between R_{I,α} and R. We simulated 25 datasets for each combination of p, the n_i's, α, R_{I,α} and K, where p ∈ {2, 4, 5, 10, …, 100, …}, α = 0.75, n_i = n_j = n_0 for all i, j with n_0 ∈ {…, 75, 100, …}, K ∈ {3, 5, 7, 10} and R_{I,α} evenly-spaced over ten values in (0, 1). We partitioned each simulated dataset using hierarchical clustering with [42]'s linkage and calculated the R of the resulting partitioning relative to the truth.
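The clustering-and-scoring step used throughout this section can be reproduced with standard tools. The snippet below is an illustrative sketch on toy data (not the paper's simulated datasets), assuming scipy and scikit-learn are available: Ward's criterion builds the tree, the tree is cut at K clusters, and the partition is compared to the truth via the adjusted Rand index.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated Gaussian groups (toy illustrative data).
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(8.0, 1.0, size=(50, 2))])
truth = np.repeat([0, 1], 50)

# Ward's criterion on Euclidean distances; cut the tree at K = 2 clusters.
tree = linkage(X, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")

# Adjusted Rand index of the partition relative to the true grouping.
print(adjusted_rand_score(truth, labels))  # -> 1.0 for this wide separation
```

With less separation between the group means, the recovered partition degrades and the adjusted Rand index falls below 1, which is exactly the quantity the simulations above track against the index.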
We used the results to adjust our index as in (18) and (19) and obtained the final summary I, of a form similar to (20), with coefficient estimates provided in Table 2. Figure 9 presents results of simulation experiments in the same spirit as Figures 5 and 6, except that the datasets are now generated using the algorithm in the appendix with I as described here. Note that, as in Figure 6, the effect of K has been largely addressed and the relationship between I and R is largely consistent across dimensions. Thus the curse of dimensionality no longer affects our index, and as such it is interpretable across dimensions.

3.3 Illustrative Examples
We now provide some illustrations of the range of multi-clustered datasets that can be obtained using the algorithm in the appendix and for different values of the summarized index

(a) K = 3 (b) K = 5 (c) K = 10
Fig. 9. Plot of R against I for the case with general ellipsoidal clusters. The three colors represent the dimensions p of the simulated datasets. Other aspects of the plot are as in Figure 2.

I. We first display realizations obtained in two dimensions, and then move on to higher dimensions. Figures 10a-i mimic the setup of Figure 7 for nonhomogeneous ellipsoidal clusters. Here the observations are grouped using hierarchical clustering with Ward's criterion [42] and cutting the tree hierarchy at K = 4. The groupings of the datasets in Figures 10a-c have the lowest R, between 0.54 and 0.64, but each subsequent row down has R values, on the average, higher than in previous rows. In general, therefore, Figures 10a-i demonstrate that as the value of I increases, the clusters become more separated. This coincides with an increase in the values of R. Thus lower values of I correspond to clustering problems of higher difficulty, while higher values of I are associated with higher values of R and hence lower clustering complexity. Figures 10a-i demonstrate the various possible configurations obtained using the algorithm in the appendix for three different values of I. Note that in some cases only two of the four groups are well-separated, while in other cases none of the groups are well-separated. We used Radviz, or radial visualization plots [43], to display multi-dimensional multi-class datasets in Figure 11. We display three realizations each of five-dimensional five-clustered datasets, obtained using our algorithm with I values of 0.6, 0.8 and 1. Additional simulated realizations and other values of I are presented in [38]. Below each plot we provide the R obtained upon comparing the true partitioning with that obtained using hierarchical clustering with [42]'s linkage.

I = 0.6: (a) R = 0.59 (b) R = 0.60 (c) R = 0.60
I = 0.8: (d) R = 0.84 (e) R = 0.75 (f) R = 0.79
I = 1: (g) R = 1 (h) R = 1 (i) R = 1
Fig. 11. Radial visualization plots of 5-dimensional, 5-component datasets at different values of I.

I = 0.6: (a) R = 0.54 (b) R = 0.64 (c) R = 0.61
I = 0.8: (d) R = 0.76 (e) R = 0.84 (f) R = 0.80
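Radial visualization plots of the kind in Figure 11 are available directly in pandas. The sketch below uses toy five-dimensional Gaussian clusters as stand-in data (our own construction, not the paper's algorithm): each variable is placed as an anchor on a circle, and every observation is drawn at the anchor-weighted position of its normalized coordinates.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
from pandas.plotting import radviz
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Five toy 5-dimensional Gaussian clusters (illustrative stand-in data).
frames = []
for k in range(5):
    center = rng.uniform(-3.0, 3.0, size=5)
    df = pd.DataFrame(rng.normal(center, 1.0, size=(40, 5)),
                      columns=[f"x{j}" for j in range(5)])
    df["cluster"] = k
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

# Radial visualization plot [43], colored by the class column.
ax = radviz(data, "cluster")
plt.savefig("radviz.png")
```

Well-separated clusters appear as distinct wedges or blobs in the disc, while overlapping clusters mix near the center, mirroring the qualitative reading of Figure 11.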
From Figure 11 we see that clusters with higher values of I are, in general, more separated than datasets with lower values of I. The clusters with I = 0.6 appear to overlap quite a bit; however, the datasets corresponding to I = 1 seem more separated. Additionally, I is close to R in all cases. We close this section by commenting on the role of α. Our preliminary index I_α depends on the choice of α. However, our final adjustments, which ensure that the final index I is approximately linear in R, mean that the value of α is not that crucial in our calculations.

I = 1.0: (g) R = 1.0 (h) R = 1.0 (i) R = 1.0
Fig. 10. Simulated datasets at different I in the heterogeneous case for K = 4, drawn in the same spirit as Figure 8.

4 APPLICATION TO GROUPED DATA
In this section we illustrate some applications of our index to datasets which contain class information for each observation. The datasets studied here are the textures [10], wine [11], Iris [12], crabs [13], image [14], E. coli [15] and [16]'s synthetic two-dimensional datasets.

4.1 Estimating I for classification datasets
Our first application uses I as a measure of how well-separated the groups in each dataset are, on the whole, and how they relate to the difficulty in clustering or classification.

TABLE 3
The estimated index I (Î), the index derived by [27] (QJ) and exact-c-separation (c) are presented with the adjusted Rand indices (R_Q, R_E) obtained using quadratic discriminant analysis (QDA) and EM-clustering, respectively, on the datasets (S): (i) textures, (ii) Ruspini, (iii) wine, (iv) image, (v) crabs, (vi) Iris and (vii) E. coli. Columns: S, n, p, K, Î, R_Q, R_E, QJ, c.

(a) wine (b) Iris (c) crabs

Table 3 provides summaries of our estimated values of I (Î) for each dataset. We use the true classification and the actual observations to obtain estimates of the mean and covariance of each cluster. Using the estimated mean, covariance and number of observations from each cluster, we find estimates of I for α = 0.75 and using the Mahalanobis distance. We also calculated R based on the clusterings obtained using quadratic discriminant analysis (QDA) and EM-clustering, the latter done using the R package MClust. The corresponding calculated R's are called R_Q and R_E, respectively. Note that our EM algorithms were initialized using the estimated means and covariance matrices and the actual cluster size proportions. Therefore we consider the final clustering to be the best we could possibly do over all other choices of initialization values. Table 3 compares the estimated values of Î with the R's evaluating the classification of the corresponding dataset done using QDA and model-based clustering. The datasets are ordered in Table 3 from the largest to the smallest Î value. For each dataset, we also calculated [27]'s index as well as exact-c-separation. With the exception of the Iris and image datasets, higher values of Î correspond to higher values of both R_Q and R_E. The relationship between R̂ and c-separation is not as clear. The textures dataset, for example, has R̂ = 1 but a very small c-separation. There also does not appear to be much relationship between the index of [27] and R.
Both c-separation and [27] do, however, pick up image as the most difficult dataset to group.

4.2 Indexing Distinctiveness of Classes
In this section, we discuss the use of our index to summarize the distinctiveness of each class with respect to the others. To illustrate this point, we estimate the pairwise indices R_{I,α} for each pair of classes in each dataset. These pairwise indices are presented in Figure 12. For each dataset, we display the value quantitatively and qualitatively by means of a color map. Darker values indicate well-separated groups with index values closer to 1, while lighter regions represent pairs of groups that are harder to separate. The map of these pairwise indices provides us with an idea of the groups that are easier or harder to separate.

(d) E. coli (e) image
Fig. 12. Plots displaying the pairwise indices R̂_{I,0.75} for five commonly used classification datasets.

In the Iris dataset example, R̂_{I,0.75} = 1 for species 1 (I. Setosa) and 2 (I. Versicolor) and for species groups 1 (I. Setosa) and 3 (I. Virginica). This indicates very well-separated groups. On the other hand, R̂_{I,0.75} = 0.7 for the species groups 2 and 3 (I. Versicolor and I. Virginica). This suggests that the difficulty in classifying the Iris dataset is largely due to the similarities between the species I. Virginica and I. Versicolor. The wine dataset is similar to the Iris dataset in that most of the difficulty in classification or clustering appears to be due to just two groups, as evidenced by the small R̂_{I,0.75} between Groups 1 and 2; all other pairs of groups in the wine dataset have larger values of R̂_{I,0.75}. The pairwise indices of the crabs dataset also produce some interesting insights into the difficulty of classification or clustering. Note that this dataset has male and female crabs, each of orange and blue colored crabs. An interesting finding is that R̂_{I,0.75} is above 0.99 for all pairs of categories except for the two pairs of crab groups having the same color.
For the blue male and female crabs, we have R̂_{I,0.75} = 0.61, while R̂_{I,0.75} = 0.90 for the orange male and female crabs. This suggests that colors separate the crab groups better than gender, and thus the difficulty in clustering this dataset is largely due to the small differences between genders. Both the Ruspini dataset and the textures dataset have R̂_{I,0.75} = 1 for all pairs of groups and are not displayed, for brevity. Figure 12d displays the pairwise values of R̂_{I,0.75} for the 5 groups in the E. coli dataset. Once again, groups 2 and 3 are very difficult to separate, since R̂_{I,0.75} is small between these two groups. Figure 12e displays the pairwise values of R̂_{I,0.75} for the seven groups in the image dataset. Once again, there is very little separation between groups 3 and 5, corresponding to images of foliage and a window, as well as groups 4 and 6, corresponding to images of cement and a path. All other pairs of groups appear very well-separated. Our pairwise index thus provides an idea of how separated each individual grouping is relative to the others. Then, we can use our index to characterize the relationships between
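Color maps of pairwise indices like those in Figure 12 are simple to render once the K × K matrix of R̂_{I,0.75} values is in hand. The sketch below uses the Iris-style values quoted above (1, 1, 0.7) purely for illustration; a grayscale map is chosen so that, as in Figure 12, darker cells indicate better-separated pairs.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Hypothetical pairwise separability values in the style of Figure 12
# (Iris-like: classes 1 vs 2 and 1 vs 3 fully separated, 2 vs 3 not).
pairwise = np.array([[0.0, 1.0, 1.0],
                     [1.0, 0.0, 0.7],
                     [1.0, 0.7, 0.0]])

fig, ax = plt.subplots()
# "Greys" maps 0 to white and 1 to black, so darker = better separated.
im = ax.imshow(pairwise, cmap="Greys", vmin=0.0, vmax=1.0)
for i in range(3):
    for j in range(3):
        ax.text(j, i, f"{pairwise[i, j]:.2f}",
                ha="center", va="center", color="tab:red")
fig.colorbar(im)
fig.savefig("pairwise_map.png")
```

Annotating each cell with its numeric value gives the combined quantitative-and-qualitative display described above.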

Arka P. Ghosh, Ranjan Maitra and Anna D. Peterson, Department of Statistics, Iowa State University, Ames, IA 50011-1210, USA.


More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Lecture 6: Moderately Large Deflection Theory of Beams

Lecture 6: Moderately Large Deflection Theory of Beams Structura Mechanics 2.8 Lecture 6 Semester Yr Lecture 6: Moderatey Large Defection Theory of Beams 6.1 Genera Formuation Compare to the cassica theory of beams with infinitesima deformation, the moderatey

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased PHYS 110B - HW #1 Fa 2005, Soutions by David Pace Equations referenced as Eq. # are from Griffiths Probem statements are paraphrased [1.] Probem 6.8 from Griffiths A ong cyinder has radius R and a magnetization

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

SydU STAT3014 (2015) Second semester Dr. J. Chan 18

SydU STAT3014 (2015) Second semester Dr. J. Chan 18 STAT3014/3914 Appied Stat.-Samping C-Stratified rand. sampe Stratified Random Samping.1 Introduction Description The popuation of size N is divided into mutuay excusive and exhaustive subpopuations caed

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM REVIEW Vo. 49,No. 1,pp. 111 1 c 7 Society for Industria and Appied Mathematics Domino Waves C. J. Efthimiou M. D. Johnson Abstract. Motivated by a proposa of Daykin [Probem 71-19*, SIAM Rev., 13 (1971),

More information

Optimal Control of Assembly Systems with Multiple Stages and Multiple Demand Classes 1

Optimal Control of Assembly Systems with Multiple Stages and Multiple Demand Classes 1 Optima Contro of Assemby Systems with Mutipe Stages and Mutipe Demand Casses Saif Benjaafar Mohsen EHafsi 2 Chung-Yee Lee 3 Weihua Zhou 3 Industria & Systems Engineering, Department of Mechanica Engineering,

More information

Generalized multigranulation rough sets and optimal granularity selection

Generalized multigranulation rough sets and optimal granularity selection Granu. Comput. DOI 10.1007/s41066-017-0042-9 ORIGINAL PAPER Generaized mutigranuation rough sets and optima granuarity seection Weihua Xu 1 Wentao Li 2 Xiantao Zhang 1 Received: 27 September 2016 / Accepted:

More information

17 Lecture 17: Recombination and Dark Matter Production

17 Lecture 17: Recombination and Dark Matter Production PYS 652: Astrophysics 88 17 Lecture 17: Recombination and Dark Matter Production New ideas pass through three periods: It can t be done. It probaby can be done, but it s not worth doing. I knew it was

More information

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channes arxiv:cs/060700v1 [cs.it] 6 Ju 006 Chun-Hao Hsu and Achieas Anastasopouos Eectrica Engineering and Computer Science Department University

More information

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method High Spectra Resoution Infrared Radiance Modeing Using Optima Spectra Samping (OSS) Method J.-L. Moncet and G. Uymin Background Optima Spectra Samping (OSS) method is a fast and accurate monochromatic

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Investigation on spectrum of the adjacency matrix and Laplacian matrix of graph G l

Investigation on spectrum of the adjacency matrix and Laplacian matrix of graph G l Investigation on spectrum of the adjacency matrix and Lapacian matrix of graph G SHUHUA YIN Computer Science and Information Technoogy Coege Zhejiang Wani University Ningbo 3500 PEOPLE S REPUBLIC OF CHINA

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

On the evaluation of saving-consumption plans

On the evaluation of saving-consumption plans On the evauation of saving-consumption pans Steven Vanduffe Jan Dhaene Marc Goovaerts Juy 13, 2004 Abstract Knowedge of the distribution function of the stochasticay compounded vaue of a series of future

More information

Primal and dual active-set methods for convex quadratic programming

Primal and dual active-set methods for convex quadratic programming Math. Program., Ser. A 216) 159:469 58 DOI 1.17/s117-15-966-2 FULL LENGTH PAPER Prima and dua active-set methods for convex quadratic programming Anders Forsgren 1 Phiip E. Gi 2 Eizabeth Wong 2 Received:

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

Multiple Beam Interference

Multiple Beam Interference MutipeBeamInterference.nb James C. Wyant 1 Mutipe Beam Interference 1. Airy's Formua We wi first derive Airy's formua for the case of no absorption. ü 1.1 Basic refectance and transmittance Refected ight

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION J. Korean Math. Soc. 46 2009, No. 2, pp. 281 294 ORHOGONAL MLI-WAVELES FROM MARIX FACORIZAION Hongying Xiao Abstract. Accuracy of the scaing function is very crucia in waveet theory, or correspondingy,

More information

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017 In-pane shear stiffness of bare stee deck through she finite eement modes G. Bian, B.W. Schafer June 7 COLD-FORMED STEEL RESEARCH CONSORTIUM REPORT SERIES CFSRC R-7- SDII Stee Diaphragm Innovation Initiative

More information

Symbolic models for nonlinear control systems using approximate bisimulation

Symbolic models for nonlinear control systems using approximate bisimulation Symboic modes for noninear contro systems using approximate bisimuation Giordano Poa, Antoine Girard and Pauo Tabuada Abstract Contro systems are usuay modeed by differentia equations describing how physica

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

Absolute Value Preconditioning for Symmetric Indefinite Linear Systems

Absolute Value Preconditioning for Symmetric Indefinite Linear Systems MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.mer.com Absoute Vaue Preconditioning for Symmetric Indefinite Linear Systems Vecharynski, E.; Knyazev, A.V. TR2013-016 March 2013 Abstract We introduce

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS FORECASTING TEECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODES Niesh Subhash naawade a, Mrs. Meenakshi Pawar b a SVERI's Coege of Engineering, Pandharpur. nieshsubhash15@gmai.com

More information

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997 Stochastic utomata etworks (S) - Modeing and Evauation Pauo Fernandes rigitte Pateau 2 May 29, 997 Institut ationa Poytechnique de Grenobe { IPG Ecoe ationae Superieure d'informatique et de Mathematiques

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information

The EM Algorithm applied to determining new limit points of Mahler measures

The EM Algorithm applied to determining new limit points of Mahler measures Contro and Cybernetics vo. 39 (2010) No. 4 The EM Agorithm appied to determining new imit points of Maher measures by Souad E Otmani, Georges Rhin and Jean-Marc Sac-Épée Université Pau Veraine-Metz, LMAM,

More information

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn Automobie Prices in Market Equiibrium Berry, Pakes and Levinsohn Empirica Anaysis of demand and suppy in a differentiated products market: equiibrium in the U.S. automobie market. Oigopoistic Differentiated

More information

Interactive Fuzzy Programming for Two-level Nonlinear Integer Programming Problems through Genetic Algorithms

Interactive Fuzzy Programming for Two-level Nonlinear Integer Programming Problems through Genetic Algorithms Md. Abu Kaam Azad et a./asia Paciic Management Review (5) (), 7-77 Interactive Fuzzy Programming or Two-eve Noninear Integer Programming Probems through Genetic Agorithms Abstract Md. Abu Kaam Azad a,*,

More information

arxiv: v1 [math.ca] 6 Mar 2017

arxiv: v1 [math.ca] 6 Mar 2017 Indefinite Integras of Spherica Besse Functions MIT-CTP/487 arxiv:703.0648v [math.ca] 6 Mar 07 Joyon K. Boomfied,, Stephen H. P. Face,, and Zander Moss, Center for Theoretica Physics, Laboratory for Nucear

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f( q ) = exp ( βv( q )), which is obtained by summing over

More information

Preconditioned Locally Harmonic Residual Method for Computing Interior Eigenpairs of Certain Classes of Hermitian Matrices

Preconditioned Locally Harmonic Residual Method for Computing Interior Eigenpairs of Certain Classes of Hermitian Matrices MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.mer.com Preconditioned Locay Harmonic Residua Method for Computing Interior Eigenpairs of Certain Casses of Hermitian Matrices Vecharynski, E.; Knyazev,

More information

Coded Caching for Files with Distinct File Sizes

Coded Caching for Files with Distinct File Sizes Coded Caching for Fies with Distinct Fie Sizes Jinbei Zhang iaojun Lin Chih-Chun Wang inbing Wang Department of Eectronic Engineering Shanghai Jiao ong University China Schoo of Eectrica and Computer Engineering

More information

Two-Stage Least Squares as Minimum Distance

Two-Stage Least Squares as Minimum Distance Two-Stage Least Squares as Minimum Distance Frank Windmeijer Discussion Paper 17 / 683 7 June 2017 Department of Economics University of Bristo Priory Road Compex Bristo BS8 1TU United Kingdom Two-Stage

More information

Biometrics Unit, 337 Warren Hall Cornell University, Ithaca, NY and. B. L. Raktoe

Biometrics Unit, 337 Warren Hall Cornell University, Ithaca, NY and. B. L. Raktoe NONISCMORPHIC CCMPLETE SETS OF ORTHOGONAL F-SQ.UARES, HADAMARD MATRICES, AND DECCMPOSITIONS OF A 2 4 DESIGN S. J. Schwager and w. T. Federer Biometrics Unit, 337 Warren Ha Corne University, Ithaca, NY

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS TONY ALLEN, EMILY GEBHARDT, AND ADAM KLUBALL 3 ADVISOR: DR. TIFFANY KOLBA 4 Abstract. The phenomenon of noise-induced stabiization occurs

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

Applied Nuclear Physics (Fall 2006) Lecture 7 (10/2/06) Overview of Cross Section Calculation

Applied Nuclear Physics (Fall 2006) Lecture 7 (10/2/06) Overview of Cross Section Calculation 22.101 Appied Nucear Physics (Fa 2006) Lecture 7 (10/2/06) Overview of Cross Section Cacuation References P. Roman, Advanced Quantum Theory (Addison-Wesey, Reading, 1965), Chap 3. A. Foderaro, The Eements

More information