Clustering Ambiguity: An Overview

1 Clustering Ambiguity: An Overview
John D. MacCuish, Norah E. MacCuish
3rd Joint Sheffield Conference on Chemoinformatics, April 23, 2004

2 Outline
- The Problem: Clustering Ambiguity and Chemoinformatics
- Preliminaries: bit strings, measures, similarity distributions
- Ties in Proximities and, more generally, Decision Ties
- Clustering Algorithms and Decision Ties
- Examples: Taylor-Butina (leader algorithm); K-means and K-modes; Jarvis-Patrick; RNN Hierarchical (Ward's, Complete Link, Group Average)
- Remarks

3 Clustering Ambiguity Problem
- Where: clustering algorithms that find distinct groups in data. However, a quantitative decision process ("Idiot Proof") may lead to ambiguous results.
- Symptom: permuting the input data gives different results. Namely, the clustering is not stable with respect to input order.
- Ambiguity: it is not clear what belongs to what group.
- Distinct from: fingerprint collisions (different compounds, same fingerprints); precision.

4 Clustering Applications and Binary Fingerprints
- Lead selection in HTS data
- Diversity analysis
- Lead hopping
- Compound acquisition decisions
- Etc.
Downs, G. M.; Barnard, J. M. Clustering Methods and Their Uses in Computational Chemistry. Reviews in Computational Chemistry, Vol. 18; Lipkowitz, K. B. and Boyd, D. B., Eds.; Wiley-VCH: New York, 2002, 1-40.

5 Binary Fingerprints
[Figure: a compound's structural descriptors (e.g., Cl, CH3, NH2) are encoded as bits in a fingerprint.]
Fixed-length bit strings such as Daylight, MDL, BCI, etc.

6 Common (Dis)Similarity Coefficients
Tanimoto, Euclidean, Cosine, Hamann, Tversky

7 Simple Bit String Similarity Measure Properties
- Symmetric (e.g., Tanimoto): similarity from A to B is the same as the similarity from B to A.
- Asymmetric (e.g., Tversky): similarity from A to B is not necessarily the same as the similarity from B to A. See: Asymmetric Clustering of Compound Data, MacCuish and MacCuish, Chemometrics and Chemoinformatics, ACS Symposium Series, in press.
- Metric (e.g., Euclidean): satisfies the triangle inequality.
- Non-metric (e.g., Soergel): does not satisfy the triangle inequality. Note, the square root of the Soergel does satisfy the triangle inequality for binary bit strings. Gower and Legendre, Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 1986, 3, 5-48.
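The symmetric versus asymmetric distinction can be checked on a toy pair of bit strings. The following is a minimal sketch, assuming fingerprints are held as Python integers; the function names and the example bit patterns are illustrative and not from the talk.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer fingerprint."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |A and B| / |A or B| (1.0 if both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def tversky(a: int, b: int, alpha: float, beta: float) -> float:
    """Tversky similarity: common / (common + alpha*only_a + beta*only_b)."""
    common = popcount(a & b)
    only_a = popcount(a & ~b)  # bits set in a but not in b
    only_b = popcount(b & ~a)  # bits set in b but not in a
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0


a, b = 0b11100, 0b00110  # one bit in common, two only in a, one only in b

print(tanimoto(a, b), tanimoto(b, a))            # same value both ways
print(tversky(a, b, 1, 0), tversky(b, a, 1, 0))  # differ when alpha != beta
```

With alpha = beta = 1 the Tversky form reduces to the Tanimoto, which is why the asymmetry only appears when the two weights differ.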

8 Tie in Proximity
[Figure: three chemical structures; both pairwise distances shown are Euclidean Dist = .16.]
One structure (or cluster!) equidistant from two others.

9 Are Proximity Ties Common?
Example: binary fingerprints with the Tanimoto.
- There are 2^5 = 32 bit strings of length 5.
- All possible Tanimoto similarities for distinct bit strings of length 5: 0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5.
- These are all reduced fractions with denominators of 5 or less: the Farey sequence of order N, where N is 5 (without the final term 1, which distinct strings cannot reach).
- There are just 10 such distinct similarities, but 496 all-pairs similarities between the 32 distinct strings.
And the distribution is
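The length-5 example is small enough to enumerate directly. A minimal sketch, assuming bit strings are represented as the integers 0 to 31 and using exact fractions so that equal similarities compare equal; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

N_BITS = 5


def tanimoto_fraction(a: int, b: int) -> Fraction:
    """Exact Tanimoto similarity of two distinct integer-coded bit strings."""
    union = bin(a | b).count("1")   # > 0 because a != b
    common = bin(a & b).count("1")
    return Fraction(common, union)


# Frequency of each similarity value over all pairs of distinct strings.
counts = Counter(
    tanimoto_fraction(a, b) for a, b in combinations(range(2 ** N_BITS), 2)
)

print("pairs:", sum(counts.values()))
print("distinct similarities:", len(counts))
for value in sorted(counts):
    print(value, counts[value])
```

Swapping N_BITS for a larger value shows how slowly the number of distinct similarities grows compared with the number of pairs, which is the source of the ties.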

10 All Possible Tanimoto Similarities for Bit Strings of Length 5
[Figure: histogram of the frequency of similarities vs. Tanimoto similarity (0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5), with the average frequency marked.]

11 Finite Number of Proximities
- How many possible Tanimoto similarities are there given N bits in a fixed-length fingerprint? $\frac{3N^2}{\pi^2} + O(N \log N)$. Namely, the sum of the number of reduced fractions with denominators up to N. (Proof of above expected bound, 1883)
- How many possible Euclidean similarities? N + 1
- How many possible Cosine similarities? No known closed form in terms of N. Any number theorists in the house?
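The count of reduced fractions with denominators up to N can be computed exactly and compared with the leading term of the bound. A minimal sketch, assuming the count is taken as the length of the Farey sequence of order N (0 and 1 included), obtained from a sieve for Euler's totient; the function names are illustrative.

```python
import math


def farey_length(n: int) -> int:
    """Number of reduced fractions in [0, 1] with denominator <= n."""
    phi = list(range(n + 1))          # phi[k] will become Euler's totient of k
    for p in range(2, n + 1):
        if phi[p] == p:               # untouched so far, hence p is prime
            for m in range(p, n + 1, p):
                phi[m] -= phi[m] // p
    return 1 + sum(phi[1:])           # the extra 1 accounts for 0/1


def leading_term(n: int) -> float:
    """Leading term 3 N^2 / pi^2 of the asymptotic count."""
    return 3 * n * n / math.pi ** 2


for n in (5, 64, 166, 512, 1024):
    print(n, farey_length(n), round(leading_term(n)))
```

For N = 5 this gives 11, one more than the 10 similarities available to distinct strings, because the value 1 is only reached by identical strings.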

12 For Fingerprints of Size 1024
- How many possible Tanimoto similarities? ~329,000
- How many Euclidean similarities? 1,025
- How many Cosine similarities? In the low millions (empirical estimate)

13 Exact Discrete Distributions vs Probabilistic Discrete Distributions
[Figure: four histograms of frequency of similarities. Panels: all possible Tanimoto similarities for bit strings of length 5; Tanimoto similarities for 380 NCI actives, Daylight fingerprints w/ 512 bits; all possible Ochiai similarities for bit strings of length 5; Ochiai (1-Cosine) similarities for 380 NCI actives, Daylight fingerprints w/ 512 bits.]

14 Clustering and Ties in Proximity
Measures with small numbers of possible similarities (e.g., Euclidean), or distributions that lead to this same effect (e.g., Tanimoto, Ochiai), are prone to the problem of ties in proximity in clustering. This can affect derived measures as well, such as the square error of Ward's merging criterion.
Algorithms for Clustering Data, Jain and Dubes. Godden, et al., JCICS, 2001, 40, and MacCuish, et al., JCICS, 2001, 41 (1).
Namely, we are clustering in a space that is a rigid lattice of proximities and/or derived measures rather than a continuum. (Note: typically, for the lengths of the binary descriptors of the vendors mentioned, this lattice is far more coarse than the lattice that would be created by the typical floating point machine representation of real numbers.)
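How often proximities tie can be measured on any fingerprint set by counting repeated values among the pairwise similarities. A minimal sketch on synthetic data: the random 64-bit fingerprints, the bit density, and the seed are assumptions made purely for illustration, not data from the talk.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import combinations


def random_fingerprint(n_bits: int, density: float, rng: random.Random) -> int:
    """Synthetic fingerprint: each bit is set independently with the given density."""
    return sum(1 << i for i in range(n_bits) if rng.random() < density)


def tie_summary(fingerprints):
    """Return (number of pairs, distinct Tanimoto values, pairs sharing a value)."""
    counts = Counter()
    for a, b in combinations(fingerprints, 2):
        union = bin(a | b).count("1")
        common = bin(a & b).count("1")
        counts[Fraction(common, union) if union else Fraction(1)] += 1
    n_pairs = sum(counts.values())
    tied_pairs = sum(c for c in counts.values() if c > 1)
    return n_pairs, len(counts), tied_pairs


rng = random.Random(0)
fps = [random_fingerprint(64, 0.3, rng) for _ in range(100)]
n_pairs, n_distinct, tied_pairs = tie_summary(fps)
print(n_pairs, n_distinct, tied_pairs)
```

With 100 fingerprints there are 4,950 pairs but far fewer reduced fractions with denominator at most 64, so repeated values are unavoidable regardless of the seed.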

15 In the literature beware of: "We resolve ties arbitrarily"

16 Decision Ties in Clustering Algorithms
- A simple decision tie: a tie in proximity.
- Other decision ties may be algorithm dependent (and can occur even with continuous data).
- In practice most decision ties lead to cluster ambiguity: an inability to discriminate non-disjoint (overlapping) clusters. Namely, disjoint clusters don't reflect the amount of ambiguity identified by decision ties, as the resulting non-disjoint clustering suggests.

17 Algorithms

18 Taylor-Butina (TB) Leader or Exclusion Cluster Sampling Algorithm
1. Create a thresholded nearest neighbor table.
2. Find true singletons: all those compounds with an empty nearest neighbor list.
3. Find the compound with the largest nearest neighbor list (the representative compound or centrotype). It and its neighbors become a group and are excluded from further consideration: these compounds are removed from all nearest neighbor lists.
4. Repeat 3 until no compounds exist with a non-empty nearest neighbor list.
Taylor, JCICS, 1995, 35; Butina, JCICS, 1999, 39.
Optional:
1. Assign remaining compounds, false singletons, to the group that contains their nearest neighbor.
2. Use other criteria to break exclusion region ties.
3. Use asymmetric measures.
4. Can be made to return overlapping clusters.
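Steps 2 to 4 above can be sketched in a few lines, with the exclusion-region ties recorded rather than silently resolved. This is a minimal illustration, not the authors' implementation: it assumes the thresholded nearest neighbor table is already available as a symmetric dict of sets, and it breaks ties by sorted name, which is exactly the kind of arbitrary choice the talk warns about. All names are hypothetical.

```python
def taylor_butina(neighbor_table):
    """Disjoint Taylor-Butina clustering that also reports representative ties.

    neighbor_table maps each compound to the set of compounds within the
    threshold (itself excluded); it is assumed to be symmetric.
    """
    remaining = {c: set(nbrs) for c, nbrs in neighbor_table.items()}

    # Step 2: true singletons have an empty nearest neighbor list.
    true_singletons = sorted(c for c, nbrs in remaining.items() if not nbrs)
    for c in true_singletons:
        del remaining[c]

    clusters, tie_events = [], []
    # Steps 3 and 4: repeatedly take the compound with the largest list.
    while any(remaining.values()):
        largest = max(len(nbrs) for nbrs in remaining.values())
        candidates = sorted(c for c, nbrs in remaining.items() if len(nbrs) == largest)
        if len(candidates) > 1:
            tie_events.append(candidates)      # exclusion region tie
        representative = candidates[0]         # arbitrary choice: first by name
        members = {representative} | remaining[representative]
        clusters.append((representative, sorted(members)))
        for m in members:                      # exclude the whole group
            remaining.pop(m, None)
        for nbrs in remaining.values():
            nbrs -= members

    false_singletons = sorted(remaining)       # neighbors were all excluded
    return clusters, true_singletons, false_singletons, tie_events


table = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": set(),
}
clusters, true_singletons, false_singletons, tie_events = taylor_butina(table)
print(clusters)
print(true_singletons, false_singletons)
print(tie_events)
```

In this toy table, compounds a, c, and d all have two neighbors, so the first representative is chosen among three tied candidates; choosing c instead of a would give a different clustering.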

19 Tie Cases in TB Algorithm
[Figure: exclusion regions, with diameter set by the threshold value, drawn around representative compounds. Labeled cases: exclusion region tie; false singleton tie (false singleton, which region?); true singleton.]
- May form an ambiguous clustering if the sum of minimum distances is also tied.
- False singleton tie, but regions not ambiguous: no need to sum minimum distances.

20 K-Means and K-Modes and Overlapping Versions
Continuous K-means with fingerprints (convert binary to real 0.0s, 1.0s):
1. Choose k seed centroids from the data set (e.g., quasi-randomly via a 1D Halton sequence).
2. Find nearest neighbors to the centroids -- TIES HERE -- overlapping.
3. Recompute new centroids.
4. Repeat 2 and 3 until no compounds change groups or an iteration limit is reached.
K-modes with fingerprints (fingerprints remain binary):
1. Choose k seed modes from the data set (or by the frequency of categories method, etc.).
2. Find nearest neighbors to the modes (Euclidean, Tanimoto, etc.) -- TIES HERE.
3. Recompute new modes (simple matching coefficient).
4. Same as 4 in continuous K-means.
Continuous K-means, Los Alamos Science, Faber, Kelly, White, 1994
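The tie in step 2 can be made visible by returning every nearest centroid for a compound instead of a single one. A minimal sketch, assuming fingerprints have been converted to tuples of 0.0/1.0 values and squared Euclidean distance is used; the function name, tolerance, and toy data are illustrative.

```python
def assign_with_ties(points, centroids, tol=1e-12):
    """For each point, list the indices of all centroids at the minimum distance."""
    assignments = []
    for p in points:
        # Squared Euclidean distance to every centroid.
        d = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
        d_min = min(d)
        assignments.append([k for k, value in enumerate(d) if value - d_min <= tol])
    return assignments


points = [
    (1.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 1.0),
    (1.0, 0.0, 1.0, 0.0),   # equidistant from both seed centroids
]
centroids = [points[0], points[1]]   # seeds chosen from the data set

assignments = assign_with_ties(points, centroids)
print(assignments)
print("tied points:", [i for i, a in enumerate(assignments) if len(a) > 1])
```

An overlapping version of the algorithm would keep a tied point in every listed group; a disjoint version has to pick one, and that pick depends on centroid order.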

21 Jarvis-Patrick
Two common versions:
1. Kmin: fixed-length (k) NN list -- TIES HERE: kth neighbor tied. If two compounds' NN lists have j neighbors in common, those compounds are in the same group.
2. Pmin: fixed-length (k) NN list -- TIES HERE: kth neighbor tied. If two compounds' NN lists have a percentage, p, of neighbors in common, those compounds are in the same group.
Improvements to Daylight Clustering, Delaney, Bradshaw, MUG 04
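The kth-neighbor tie arises when more compounds share the kth smallest distance than there are slots left in the fixed-length list. A minimal sketch of detecting it for one compound, assuming its distances to the other compounds are given as a dict; the function name and the toy distances are hypothetical.

```python
def kth_neighbor_tie(distances, k):
    """Split one compound's neighbors into certain members and tied candidates.

    Returns (certain, tied, is_ambiguous): neighbors strictly closer than the
    kth distance, neighbors exactly at the kth distance, and whether the tied
    neighbors outnumber the slots remaining in a length-k list.
    """
    ranked = sorted(distances.items(), key=lambda item: (item[1], item[0]))
    if len(ranked) <= k:
        return [name for name, _ in ranked], [], False
    kth_distance = ranked[k - 1][1]
    certain = [name for name, d in ranked if d < kth_distance]
    tied = [name for name, d in ranked if d == kth_distance]
    open_slots = k - len(certain)
    return certain, tied, len(tied) > open_slots


distances = {"b": 0.2, "c": 0.25, "d": 0.25, "e": 0.4}
print(kth_neighbor_tie(distances, 2))   # c and d compete for the last slot
print(kth_neighbor_tie(distances, 1))   # no tie: b is the unique nearest neighbor
```

Whichever of the tied candidates ends up in the list changes the neighbor counts shared with other compounds, and so can change group membership.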

22 Reciprocal Nearest Neighbors (RNN) Hierarchical Clustering
Ward's, Group Average, and Complete Link clustering algorithms can use RNN as a fast O(N^2) method of obtaining the hierarchy, vs O(N^3).
Murtagh, A survey on recent advances in hierarchical clustering algorithms, Computer Journal, 26, 1983
The RNN form of these algorithms contains specific decision ties unique to this method. The resulting ambiguity can be quantified by enumerating decision tie events.

23 RNN Algorithm Decision Ties
1. Form a nearest neighbor (NN) chain until an RNN (each is the NN of the other) is found.
   - What if there is more than one NN? Ties in proximity (or merging criterion) problem, increasing the ambiguity.
   - What if in turn there is more than one RNN? Ties in proximity and algorithm decision tie problem, more ambiguity.
2. Use a merge criterion other than the criterion used in the algorithm to choose the RNN in this case: decreases the ambiguity.
   - What if the result of this new criterion is also tied? Another algorithm decision tie, increasing the ambiguity.
For hierarchical algorithms that return overlapping clusterings based solely on ties in proximity, see Nicolaou, MacCuish, and Tamura, A new multi-domain clustering algorithm for lead discovery that exploits ties in proximity, Proceedings from the 13th Euro-QSAR, Prous Science, 2000.
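The two questions in step 1 can be illustrated on a small distance matrix by keeping every tied nearest neighbor and listing all reciprocal pairs. This is a minimal sketch of the tie detection only, not a full NN-chain implementation; the function names and the toy matrix are hypothetical.

```python
def nearest_neighbor_sets(dist):
    """For each item, the set of all items at its minimum distance (ties kept)."""
    n = len(dist)
    nn = []
    for i in range(n):
        d_min = min(dist[i][j] for j in range(n) if j != i)
        nn.append({j for j in range(n) if j != i and dist[i][j] == d_min})
    return nn


def reciprocal_pairs(nn):
    """All pairs (i, j), i < j, where each is a nearest neighbor of the other."""
    return sorted(
        (i, j) for i in range(len(nn)) for j in nn[i] if i < j and i in nn[j]
    )


dist = [
    [0, 1, 1, 3],
    [1, 0, 2, 3],
    [1, 2, 0, 3],
    [3, 3, 3, 0],
]
nn = nearest_neighbor_sets(dist)
pairs = reciprocal_pairs(nn)
print(nn)      # items 0 and 3 each have more than one nearest neighbor
print(pairs)   # two candidate RNN pairs compete for item 0
print("items with tied NN:", sum(1 for s in nn if len(s) > 1))
```

Counting such events at every merge is one way to enumerate decision ties; here items 1 and 2 both form a reciprocal pair with item 0, so the first merge already depends on input order.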

24 How can we address this problem?

25 Levels of Ambiguity
[Figure: two groups with considerable overlap, or smaller, more distinct groups.]
Or, the difficulty of making sense of large numbers of overlapping clusters where the intersections are large.

26 Distinct Clusters vs. Overlapping Clusters
- Distinct (disjoint) clusters: just one clustering of many possible.
- Overlapping clusters: a result of combining all decision ties.
- Fewer decision ties, less ambiguity: overlapping clusters, but understandable.
- More decision ties, more ambiguity: overlapping clusters, but difficult to understand.

27 An Ambiguity Index Defined with TB Algorithm
The difference between the disjoint and non-disjoint results of Taylor's algorithm can give us a sense of the ambiguity inherent in clustering fingerprints at a given Tanimoto or Tversky threshold. Many simple indices can be defined. We use an index that reflects the number of shared compounds in the non-disjoint clustering.
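One of the many simple indices of this kind is the fraction of compounds that fall in more than one cluster of the non-disjoint result. The sketch below is a hypothetical example of such an index; the talk does not give the exact definition the authors used, so the formula and names here are assumptions.

```python
from collections import Counter


def shared_fraction(overlapping_clusters):
    """Fraction of clustered compounds that belong to more than one cluster."""
    membership = Counter(
        compound for cluster in overlapping_clusters for compound in set(cluster)
    )
    if not membership:
        return 0.0
    shared = sum(1 for n in membership.values() if n > 1)
    return shared / len(membership)


# Toy non-disjoint clustering: compound "c" sits in two exclusion regions.
overlapping = [{"a", "b", "c"}, {"c", "d"}, {"e"}]
print(shared_fraction(overlapping))
```

A disjoint clustering always scores 0 under this definition, so the value measures how much the overlapping result departs from any single disjoint clustering.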

28 Clustering Ambiguity with Taylor's Algorithm
[Figure: ambiguity index (increasing) vs. Tanimoto threshold for 380 NCI-HIV actives, with curves for MACCS 166 bits, MACCS 320 bits, and Daylight 512 bits. Annotation: approx. 10% of compounds shared among clusters.]

29 Jarvis-Patrick Results Summary
- The number of proximity ties is significant in both algorithms when reasonable values for k, j, and p are chosen, on par with that of Taylor's and RNN algorithms.
- Kmin typically has more ties in general, though it is hard to make a one-to-one comparison with Pmin.

30 K-means, K-modes Results Summary
K-means:
- Typically a small number of ties, depending on K, on just the first iteration.
- Rarely are there ties after the first iteration.
- Very little overlap when the algorithm converges.
- Ambiguity confounded by the local optima problem.
K-modes with frequency method:
- Fewer ties overall than even K-means.
- But ties occur more frequently in subsequent iterations.
- Again, very little overlap when the algorithm converges.
- Ambiguity confounded by the local optima problem.

31 Level Selection and Ambiguity in Hierarchical Clustering
[Figure: dendrogram annotated "Best level?" and "Ambiguity?"; plots of Kelley level selection values (level selection heuristic) and of total ambiguity in the form of ties (ambiguity index), each vs. number of clusters.]

32 Ambiguity Index for Hierarchical RNN Class Algorithms
Count the number of decision ties as a rough estimate of the ambiguity. Use this in conjunction with level selection techniques (e.g., Kelley's), where the objective is to find the best non-trivial level selection value with the lowest ambiguity index.
Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 1996, 9.
Wild, D. J.; Blankley, C. J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping Using Ward's Clustering. J. Chem. Inf. Comput. Sci. 2000, 40.

33 Two Ward's Clusterings with Euclidean Distance: same data (482 NCI-HIV), entered in a different order
[Figure: two dendrograms. Best Kelley cuts: 63 clusters vs. 136 clusters.]
Similar groups at the top of the dendrogram mask very different groups below.

34 Complete Link -- Direct Understanding of Level
Using Tanimoto as the measure, we can inspect ambiguity and the similarity level or threshold directly. Namely, check ambiguity at various Tanimoto similarity thresholds (levels) common in the field: 0.7, 0.85

35 Two Complete Link Clusterings with Soergel Measure: same data (482 NCI-HIV), entered in a different order
[Figure: two dendrograms. Best Kelley cuts: 24 clusters. At 0.7 similarity: 255 clusters (not all the same) for each clustering.]
Same problem.

36 Conclusions
DON'T PANIC: Clustering is Good!
- Ambiguity is important in terms of choosing K clusters.
- Combining level selection information with ambiguity information can help to make sense of results.
- Modifying algorithms to use a secondary grouping criterion when faced with decision ties can help reduce the ambiguity, often providing tighter, more useful clusterings. This is data, measure, and algorithm dependent, however!
BUT BE CAREFUL!
- Is it important for your application?
- Determining ambiguity or adding secondary grouping criteria can have significant computational cost.
- In general, the choice of bit string length, measures, and algorithms can all lead to differing amounts of ambiguity.

37 Future Work
- Further work on ambiguity indices. Ideally: (FPLength) x (Measures) x (Algorithms) x (FindK) x (DataSetSize) x (DataSetDiversity)
- Explore other algorithms

38 Acknowledgements
- John Bradshaw, Daylight CIS
- John Blankley, Pfizer (Retired)
- John Barnard, BCI
- David Wild, Wild Ideas
- OpenEye Scientific Software, Inc.
This talk can be found at
