Clustering Ambiguity: An Overview


1 Clustering Ambiguity: An Overview. John D. MacCuish, Norah E. MacCuish. 3rd Joint Sheffield Conference on Chemoinformatics, April 23, 2004

2 Outline. The Problem: Clustering Ambiguity and Chemoinformatics. Preliminaries: bit strings, measures, similarity distributions. Ties in proximities and, more generally, decision ties. Clustering algorithms and decision ties. Examples: Taylor-Butina (leader algorithm); K-means and K-modes; Jarvis-Patrick; RNN hierarchical (Ward's, complete link, group average). Remarks.

3 Clustering Ambiguity Problem. Where: clustering algorithms that find distinct groups in data. However, a quantitative decision process ("idiot proof") may lead to ambiguous results. Symptom: permuting the input data gives different results; namely, the clustering is not stable with respect to input order. Ambiguity: it is not clear what belongs to what group. Distinct from: fingerprint collisions (different compounds, same fingerprints); precision.

4 Clustering Applications and Binary Fingerprints. Lead selection in HTS data; diversity analysis; lead hopping; compound acquisition decisions; etc. Downs, G. M.; Barnard, J. M. Clustering Methods and Their Uses in Computational Chemistry. Reviews in Computational Chemistry, Vol. 18; Lipkowitz, K. B., Boyd, D. B., Eds.; Wiley-VCH: New York, 2002; 1-40.

5 Binary Fingerprints. [Figure: a chemical structure with Cl, CH3, and NH2 groups is encoded, via descriptors, as a bit string.] Fixed-length bit strings such as Daylight, MDL, BCI, etc.

6 Common (Dis)Similarity Coefficients: Tanimoto, Euclidean, Cosine, Hamann, Tversky

7 Simple Bit String Similarity Measure Properties. Symmetric (e.g., Tanimoto): the similarity from A to B is the same as the similarity from B to A. Asymmetric (e.g., Tversky): the similarity from A to B is not necessarily the same as the similarity from B to A. Clustering Compound Data: Asymmetric Clustering of Compound Data, MacCuish and MacCuish, Chemometrics and Chemoinformatics, ACS Symposium Series, in press. Metric (e.g., Euclidean): satisfies the triangle inequality. Non-metric (e.g., Soergel): does not satisfy the triangle inequality. Note, the square root of the Soergel does satisfy the triangle inequality for binary bit strings. Gower and Legendre, Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 1986, 3, 5-48.
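The symmetric/asymmetric distinction can be checked numerically. The following is a minimal sketch, assuming fingerprints are stored as Python integers used as bit sets; the function names and the example fingerprints are illustrative and do not come from the talk.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Symmetric: bits in common over bits set in either fingerprint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def tversky(a: int, b: int, alpha: float, beta: float) -> float:
    """Asymmetric unless alpha == beta: weights bits unique to a and to b."""
    common = popcount(a & b)
    only_a = popcount(a & ~b)
    only_b = popcount(b & ~a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0


a, b = 0b11100, 0b00110
print(tanimoto(a, b), tanimoto(b, a))                    # same in both directions
print(tversky(a, b, 1.0, 0.0), tversky(b, a, 1.0, 0.0))  # direction matters
```

With alpha = beta = 1 the Tversky value reduces to the Tanimoto value, which is one way to sanity-check an implementation.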

8 Tie in Proximity. [Figure: three chemical structures; two of the pairwise distances are each labeled Euclidean Dist = .16.] One structure (or cluster!) equidistant from two others.

9 Are Proximity Ties Common? Example: binary fingerprints with the Tanimoto. There are 2^5 = 32 bit strings of length 5. The possible Tanimoto similarities between distinct bit strings of length 5 are: 0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5. These are all the reduced fractions with denominators of 5 or less, that is, the Farey sequence of order N = 5 (without the value 1, which requires identical strings). There are just 10 such distinct similarities, but 496 pairwise similarities among the 32 distinct strings. And the distribution is:
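The counts on this slide can be reproduced by brute force. This is a small sketch, assuming bit strings are represented as integers and using exact fractions so that equal similarities compare equal; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def tanimoto_exact(a: int, b: int) -> Fraction:
    """Exact Tanimoto similarity of two distinct bit strings (so a | b != 0)."""
    return Fraction(bin(a & b).count("1"), bin(a | b).count("1"))


def similarity_distribution(n_bits: int) -> Counter:
    """Frequency of each Tanimoto value over all pairs of distinct bit strings."""
    strings = range(2 ** n_bits)
    return Counter(tanimoto_exact(a, b) for a, b in combinations(strings, 2))


dist = similarity_distribution(5)
print(len(dist), "distinct similarities over", sum(dist.values()), "pairs")
for value in sorted(dist):
    print(value, dist[value])
```

Because so many pairs share so few values, most pairs are tied with many other pairs, which is the point the slide makes.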

10 All Possible Tanimoto Similarities for Bit Strings of Length 5. [Figure: histogram of frequency of similarities vs. Tanimoto similarity (0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5), with the average frequency marked.]

11 Finite Number of Proximities. How many possible Tanimoto similarities are there given N bits in a fixed-length fingerprint? $\frac{3N^2}{\pi^2} + O(N \log N)$. Namely, the sum of the number of reduced fractions with denominators up to N. (Proof of above expected bound, 1883.) How many possible Euclidean similarities? N + 1. How many possible Cosine similarities? No known closed form in terms of N. Any number theorists in the house?

12 For Fingerprints of Size 1024. How many possible Tanimoto similarities? ~319,000 (consistent with 3N^2/π^2 at N = 1024). How many Euclidean similarities? 1,025. How many Cosine similarities? In the low millions (empirical estimate).
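The Tanimoto and Euclidean counts can be computed directly. The sketch below assumes that the number of possible Tanimoto values for N bits is the length of the Farey sequence of order N (including the value 1 for identical fingerprints), obtained as one plus the sum of Euler totients; the helper names are illustrative.

```python
from math import log, pi


def totients(n: int) -> list:
    """Euler's totient phi(k) for k = 0..n, computed with a sieve."""
    phi = list(range(n + 1))
    for p in range(2, n + 1):
        if phi[p] == p:  # p is prime: no smaller prime has reduced phi[p]
            for m in range(p, n + 1, p):
                phi[m] -= phi[m] // p
    return phi


def count_tanimoto_values(n_bits: int) -> int:
    """Reduced fractions c/u with 0 <= c <= u and 1 <= u <= n_bits."""
    return 1 + sum(totients(n_bits)[1:])


def count_euclidean_values(n_bits: int) -> int:
    """Two bit strings can differ in 0..n_bits positions."""
    return n_bits + 1


n = 1024
print(count_tanimoto_values(n), round(3 * n * n / pi ** 2))
print(count_euclidean_values(n))
```

For N = 5 this count is 11, one more than the 10 values listed for distinct strings, because the value 1 is included here.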

13 Exact Discrete Distributions vs Probabilistic Discrete Distributions. [Figure: four histograms of frequency of similarities. Exact: all possible Tanimoto similarities for bit strings of length 5; all possible Ochiai similarities for bit strings of length 5. Empirical: Tanimoto similarities and Ochiai (1-Cosine) similarities for 380 NCI actives, Daylight fingerprints with 512 bits.]

14 Clustering and Ties in Proximity. Measures with small numbers of possible similarities (e.g., Euclidean), or distributions that lead to this same effect (e.g., Tanimoto, Ochiai), are prone to the problem of ties in proximity in clustering. This can affect derived measures as well, such as the squared error of Ward's merging criterion. Algorithms for Clustering Data, Jain and Dubes; Godden et al., JCICS, 2001, 40; MacCuish et al., JCICS, 2001, 41 (1). Namely, we are clustering in a space that is a rigid lattice of proximities and/or derived measures rather than a continuum. (Note: typically, for the lengths of the binary descriptors of the vendors mentioned, this lattice is far coarser than the lattice that would be created by the typical floating-point machine representation of real numbers.)

15 In the literature, beware of: "We resolve ties arbitrarily."

16 Decision Ties in Clustering Algorithms. A simple decision tie: a tie in proximity. Other decision ties may be algorithm dependent (and can occur even with continuous data). In practice, most decision ties lead to cluster ambiguity: an inability to discriminate non-disjoint (overlapping) clusters. Namely, disjoint clusters don't reflect the amount of ambiguity identified by decision ties as the resulting non-disjoint clustering suggests.

17 Algorithms

18 Taylor-Butina (TB) Leader or Exclusion Cluster Sampling Algorithm
1. Create a thresholded nearest neighbor table.
2. Find true singletons: all those compounds with an empty nearest neighbor list.
3. Find the compound with the largest nearest neighbor list (the representative compound or centrotype). It and its neighbors become a group and are excluded from further consideration: these compounds are removed from all nearest neighbor lists.
4. Repeat 3 until no compounds exist with a non-empty nearest neighbor list.
Taylor, JCICS, 1995, 35; Butina, JCICS, 1999, 39.
Optional: 1. Assign remaining compounds (false singletons) to the group that contains their nearest neighbor; 2. Use another criterion to break exclusion region ties; 3. Use asymmetric measures; 4. Can be made to return overlapping clusters.
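The four steps can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: fingerprints are integers used as bit sets, the measure is Tanimoto, and a tie for the largest neighbor list is broken arbitrarily by lowest input index while being counted as a decision tie. The name `taylor_butina` and the toy fingerprints are made up for the example.

```python
def tanimoto(a: int, b: int) -> float:
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


def taylor_butina(fps, threshold):
    n = len(fps)
    # Step 1: thresholded nearest neighbor table.
    nbrs = {i: {j for j in range(n)
                if j != i and tanimoto(fps[i], fps[j]) >= threshold}
            for i in range(n)}
    # Step 2: true singletons have an empty neighbor list from the start.
    true_singletons = [i for i in range(n) if not nbrs[i]]
    remaining = {i: set(s) for i, s in nbrs.items() if s}
    clusters, tie_events = [], 0
    while True:
        best = max((len(s) for s in remaining.values()), default=0)
        if best == 0:
            break  # Step 4: no non-empty neighbor lists are left.
        # Step 3: the largest list wins; several candidates is a decision tie.
        candidates = sorted(i for i, s in remaining.items() if len(s) == best)
        if len(candidates) > 1:
            tie_events += 1
        rep = candidates[0]  # arbitrary tie-break: lowest input index
        members = {rep} | remaining[rep]
        clusters.append((rep, sorted(members)))
        for m in members:
            remaining.pop(m, None)
        for s in remaining.values():
            s -= members
    false_singletons = sorted(remaining)  # had neighbors, but all were taken
    return clusters, true_singletons, false_singletons, tie_events


fps = [0b0001111, 0b0011110, 0b0111100, 0b1111000]  # a chain of neighbors
forward = taylor_butina(fps, 0.5)
backward = taylor_butina(fps[::-1], 0.5)
print(forward)
print(backward)
```

In this toy chain the two middle compounds tie for the largest neighbor list, so reversing the input order puts a different end compound in the cluster and leaves the other end as a false singleton.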

19 Tie Cases in TB Algorithm. [Figure: exclusion regions around representative compounds, with diameter set by the threshold value, showing an exclusion region tie, a false singleton tie ("False singleton: which region?"), and a true singleton. Captions: "May form ambiguous clustering if sum of minimum distances is also tied"; "False singleton tie, but regions not ambiguous, no need to sum minimum distances".]

20 K-Means and K-Modes, and Overlapping Versions
Continuous K-means with fingerprints (convert binary to real 0.0s and 1.0s):
1. Choose k seed centroids from the data set (e.g., quasi-randomly via a 1D Halton sequence).
2. Find nearest neighbors to the centroids -- TIES HERE -- (overlapping).
3. Recompute new centroids.
4. Repeat from 2 until no neighbors change groups or some iteration threshold is reached.
K-modes with fingerprints (fingerprints remain binary):
1. Choose k seed modes from the data set (or by a frequency-of-categories method, etc.).
2. Find nearest neighbors to the modes (Euclidean, Tanimoto, etc.) -- TIES HERE.
3. Recompute new modes (simple matching coefficient).
4. Same as step 4 in continuous K-means.
Continuous K-means, Los Alamos Science, Faber, Kelly, White, 1994
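The assignment step (step 2) is where the ties arise. The sketch below covers only that step, under the assumption that a point is kept in every cluster whose centroid is at the minimum squared Euclidean distance; the tolerance parameter `tol` and the function name are choices made for this illustration.

```python
def assign_with_ties(points, centroids, tol=1e-12):
    """For each point, return the indices of all centroids at minimum distance."""
    assignments = []
    for p in points:
        d2 = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
        best = min(d2)
        assignments.append([k for k, d in enumerate(d2) if d - best <= tol])
    return assignments


# Binary fingerprints converted to real 0.0s and 1.0s; seeds taken from the data.
points = [(1.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 1.0, 0.0)]
centroids = points[:2]
assignments = assign_with_ties(points, centroids)
ties = sum(1 for a in assignments if len(a) > 1)
print(assignments, "ties:", ties)
```

A disjoint K-means would silently pick one of the tied centroids, typically the first in index order; returning all of them is one way to obtain an overlapping variant and to count tie events per iteration.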

21 Jarvis-Patrick
Two common versions:
1. Kmin: fixed-length (k) NN list -- TIES HERE: kth neighbor tied. If the NN lists of two compounds have j neighbors in common, those compounds are in the same group.
2. Pmin: fixed-length (k) NN list -- TIES HERE: kth neighbor tied. If the NN lists of two compounds have a percentage, p, of neighbors in common, those compounds are in the same group.
Improvements to Daylight Clustering, Delaney, Bradshaw, MUG 04
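A kth-neighbor tie can be detected when the NN lists are built. The following sketch implements the Kmin rule as stated on the slide (at least j shared neighbors puts two compounds in the same group) and flags compounds whose kth and (k+1)th neighbors have equal similarity. It assumes integer bit-set fingerprints and exact fractions; all names are illustrative, and production Jarvis-Patrick implementations often add further conditions.

```python
from fractions import Fraction


def tanimoto(a: int, b: int) -> Fraction:
    union = bin(a | b).count("1")
    return Fraction(bin(a & b).count("1"), union) if union else Fraction(1)


def nn_lists(fps, k):
    """Fixed-length NN lists plus a flag for a tie at the kth position."""
    lists, tied = [], []
    for i, fi in enumerate(fps):
        ranked = sorted(((tanimoto(fi, fj), j) for j, fj in enumerate(fps) if j != i),
                        key=lambda t: (-t[0], t[1]))  # ties broken by input index
        lists.append({j for _, j in ranked[:k]})
        tied.append(len(ranked) > k and ranked[k - 1][0] == ranked[k][0])
    return lists, tied


def jarvis_patrick_kmin(fps, k, j_min):
    lists, tied = nn_lists(fps, k)
    n = len(fps)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if len(lists[a] & lists[b]) >= j_min:
                parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), tied


fps = [0b0001111, 0b0011110, 0b0111100, 0b1111000]
print(nn_lists(fps, 1)[1])             # which compounds have a tied kth neighbor
print(jarvis_patrick_kmin(fps, 2, 2))
```

When a flag is set, the content of that NN list depends on the tie-break (here, input index), so the resulting groups may change if the input is permuted.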

22 Reciprocal Nearest Neighbors (RNN) Hierarchical Clustering. Ward's, group average, and complete link clustering algorithms can use RNN as a fast method of obtaining the hierarchy: $O(N^2)$ vs $O(N^3)$. Murtagh, A survey of recent advances in hierarchical clustering algorithms, Computer Journal, 26, 1983. The RNN form of these algorithms contains specific decision ties unique to this method. The resulting ambiguity can be quantified by enumerating decision tie events.

23 RNN Algorithm Decision Ties
1. Form a nearest neighbor (NN) chain until an RNN pair (each is the NN of the other) is found. What if there is more than one NN? This is the ties in proximity (or merging criterion) problem, increasing the ambiguity. What if, in turn, there is more than one RNN? This is both a ties in proximity and an algorithm decision tie problem: more ambiguity.
2. In this case, use a merge criterion other than the one used in the algorithm to choose the RNN, decreasing the ambiguity. What if the result of this new criterion is also tied? Another algorithm decision tie, increasing the ambiguity.
For hierarchical algorithms that return overlapping clusterings based solely on ties in proximity, see Nicolaou, MacCuish, and Tamura, A new multi-domain clustering algorithm for lead discovery that exploits ties in proximity, Proceedings from the 13th Euro-QSAR, Prous Science, 2000, pp
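The chain-growing step and its tie events can be sketched as follows. This covers only step 1 on individual points (no merging or cluster distance updates) and assumes a precomputed symmetric distance matrix; the names `nearest` and `grow_chain` are made up for the illustration. When the previous chain element is among the tied nearest neighbors it is preferred, which guarantees that the chain terminates.

```python
def nearest(dist, i):
    """All indices at minimum distance from point i (more than one means a tie)."""
    others = [j for j in range(len(dist)) if j != i]
    best = min(dist[i][j] for j in others)
    return [j for j in others if dist[i][j] == best]


def grow_chain(dist, start):
    """Follow nearest neighbors from start until a reciprocal pair is found."""
    chain, ties = [start], 0
    while True:
        cands = nearest(dist, chain[-1])
        if len(cands) > 1:
            ties += 1  # tie in proximity: the choice below is arbitrary
        if len(chain) > 1 and chain[-2] in cands:
            return (chain[-2], chain[-1]), ties  # reciprocal nearest neighbors
        chain.append(cands[0])


positions = [0, 1, 2, 5]  # points on a line; point 1 is equidistant from 0 and 2
dist = [[abs(a - b) for b in positions] for a in positions]
print(grow_chain(dist, 0))
print(grow_chain(dist, 3))
```

Starting the chain from different points yields different first merges here, each with one recorded tie, which is the input-order dependence described above.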

24 How can we address this problem?

25 Levels of Ambiguity. Two groups with considerable overlap... Or, smaller, more distinct groups. Or, the difficulty of making sense of large numbers of overlapping clusters where the intersections are large.

26 Distinct Clusters. Distinct (disjoint) clusters: just one clustering of many possible. Overlapping clusters: a result of combining all decision ties. [Figure: overlapping clusters that are still understandable (fewer decision ties, less ambiguity) contrasted with overlapping clusters that are difficult to understand (more decision ties, more ambiguity).]

27 An Ambiguity Index Defined with TB Algorithm. The difference between the disjoint and non-disjoint results of Taylor's algorithm can give us a sense of the ambiguity inherent in clustering fingerprints at a given Tanimoto or Tversky threshold. Many simple indices can be defined. We use an index that reflects the number of shared compounds in the non-disjoint clustering.
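One simple index of this kind is sketched below. The talk does not give the exact formula, so this is an assumption for illustration only: the fraction of clustered compounds that appear in more than one cluster of the non-disjoint result.

```python
from collections import Counter


def ambiguity_index(overlapping_clusters) -> float:
    """Fraction of compounds that belong to more than one cluster (hypothetical index)."""
    membership = Counter(c for cluster in overlapping_clusters for c in set(cluster))
    if not membership:
        return 0.0
    shared = sum(1 for count in membership.values() if count > 1)
    return shared / len(membership)


print(ambiguity_index([{0, 1, 2}, {1, 2, 3}]))  # two of four compounds are shared
print(ambiguity_index([{0, 1}, {2, 3}]))        # disjoint clustering
```

A disjoint clustering scores 0 under this definition, and the value grows as more compounds are shared among clusters.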

28 Clustering Ambiguity with Taylor's Algorithm: 380 NCI-HIV Actives. [Figure: ambiguity index (increasing) vs. Tanimoto threshold for MACCS 166-bit, MACCS 320-bit, and Daylight 512-bit fingerprints. Annotation: approx. 10% of compounds shared among clusters.]

29 Jarvis-Patrick Results Summary. The number of proximity ties is significant in both versions when reasonable values for k, j, and p are chosen, on par with that of Taylor's and the RNN algorithms. Kmin typically has more ties in general, though it is hard to make a one-to-one comparison with Pmin.

30 K-means, K-modes Results Summary. K-means: typically a small number of ties (depending on K), on just the first iteration. Rarely are there ties after the first iteration. Very little overlap when the algorithm converges. Ambiguity confounded by the local optima problem. K-modes with frequency method: fewer ties overall than even K-means, but ties occur more frequently in subsequent iterations. Again, very little overlap when the algorithm converges. Ambiguity confounded by the local optima problem.

31 Best Level? Level Selection and Ambiguity in Hierarchical Clustering. [Figure: two plots against number of clusters: Kelley level selection values (level selection heuristic) and total ambiguity in the form of ties (ambiguity index).]

32 Ambiguity Index for Hierarchical RNN Class Algorithms. Count the number of decision ties as a rough estimate of the ambiguity. Use this in conjunction with level selection techniques (e.g., Kelley's), where the objective is to find the best non-trivial level selection value with the lowest ambiguity index. Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 1996, 9. Wild, D. J.; Blankley, C. J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping Using Ward's Clustering. J. Chem. Inf. Comput. Sci. 2000, 40.

33 Two Ward's Clusterings with Euclidean Distance: same data (482 NCI-HIV), entered in a different order. [Figure: two dendrograms with best Kelley cuts at 63 clusters and 136 clusters.] Similar groups at the top of the dendrogram mask very different groups below.

34 Complete Link -- Direct Understanding of Level. Using Tanimoto as the measure, we can inspect ambiguity and the similarity level or threshold directly. Namely, check ambiguity at various Tanimoto similarity thresholds (levels) common in the field: 0.7, 0.85.

35 Two Complete Link Clusterings with Soergel Measure: same data (482 NCI-HIV), entered in a different order. [Figure: two dendrograms. Best Kelley cuts: 24 clusters. At 0.7 similarity: 255 clusters (not all the same).] For each clustering: same problem.

36 Conclusions. DON'T PANIC: clustering is good! Ambiguity is important in terms of choosing K clusters. Combining level selection information with ambiguity information can help to make sense of results. Modifying algorithms to use a secondary grouping criterion when faced with decision ties can help reduce the ambiguity, often providing tighter, more useful clusterings (data, measure, and algorithm dependent, however). BUT BE CAREFUL! Is it important for your application? Determining ambiguity or adding secondary grouping criteria can have significant computational cost. In general, the choice of bit string length, measures, and algorithms can all lead to differing amounts of ambiguity.

37 Future Work. Further work on ambiguity indices. Ideally: (FPLength) x (Measures) x (Algorithms) x (FindK) x (DataSetSize) x (DataSetDiversity). Explore other algorithms.

38 Acknowledgements. John Bradshaw, Daylight CIS; John Blankley, Pfizer (retired); John Barnard, BCI; David Wild, Wild Ideas; OpenEye Scientific Software, Inc. This talk can be found at


Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13 Indexes for Multimedia Data 13 Indexes for Multimedia

More information

Introduction to Signal Detection and Classification. Phani Chavali

Introduction to Signal Detection and Classification. Phani Chavali Introduction to Signal Detection and Classification Phani Chavali Outline Detection Problem Performance Measures Receiver Operating Characteristics (ROC) F-Test - Test Linear Discriminant Analysis (LDA)

More information

Introduction to Chemoinformatics

Introduction to Chemoinformatics Introduction to Chemoinformatics www.dq.fct.unl.pt/cadeiras/qc Prof. João Aires-de-Sousa Email: jas@fct.unl.pt Recommended reading Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH

More information
