Clustering Ambiguity: An Overview
1 Clustering Ambiguity: An Overview. John D. MacCuish, Norah E. MacCuish. 3rd Joint Sheffield Conference on Chemoinformatics, April 23, 2004
2 Outline. The Problem: Clustering Ambiguity and Chemoinformatics. Preliminaries: bit strings, measures, similarity distributions. Ties in Proximities and, more generally, Decision Ties. Clustering Algorithms and Decision Ties. Examples: Taylor-Butina (leader algorithm); K-means and K-modes; Jarvis-Patrick; RNN Hierarchical (Ward's, Complete Link, Group Average). Remarks.
3 Clustering Ambiguity Problem. Where: clustering algorithms that find distinct groups in data. However, a quantitative decision process ("idiot proof") may lead to ambiguous results. Symptom: permuting the input data gives different results; namely, the clustering is not stable with respect to input order. Ambiguity: it is not clear what belongs to what group. Distinct from: fingerprint collisions (different compounds, same fingerprints); precision.
4 Clustering Applications and Binary Fingerprints: lead selection in HTS data, diversity analysis, lead hopping, compound acquisition decisions, etc. Downs, G. M.; Barnard, J. M. Clustering Methods and Their Uses in Computational Chemistry. Reviews in Computational Chemistry, Vol. 18; Lipkowitz, K. B., Boyd, D. B., Eds.; Wiley-VCH: New York, 2002; 1-40.
5 Binary Fingerprints. [Figure: a structure bearing Cl, CH3, and NH2 groups is encoded, descriptor by descriptor, into a bit string.] Fixed-length bit strings such as Daylight, MDL, BCI, etc.
6 Common (Dis)Similarity Coefficients: Tanimoto, Euclidean, Cosine, Hamann, Tversky
7 Simple Bit String Similarity Measure Properties. Symmetric (e.g., Tanimoto): similarity from A to B is the same as the similarity from B to A. Asymmetric (e.g., Tversky): similarity from A to B is not necessarily the same as the similarity from B to A. Clustering Compound Data: Asymmetric Clustering of Compound Data, MacCuish and MacCuish, Chemometrics and Chemoinformatics, ACS Symposium Series, in press. Metric (e.g., Euclidean): satisfies the triangle inequality. Non-metric (e.g., Soergel): does not satisfy the triangle inequality. Note: the square root of the Soergel does satisfy the triangle inequality for binary bit strings. Gower and Legendre, Metric and Euclidean properties of dissimilarity coefficients, Journal of Classification, 1986, 3, 5-48.
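The symmetric versus asymmetric distinction on this slide can be illustrated with a small sketch. Fingerprints are represented here as Python sets of on-bit positions, and the convention that two empty fingerprints have similarity 1.0 is an assumption of this example, not something stated in the talk.

```python
def tanimoto(a: set, b: set) -> float:
    """Symmetric: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0  # empty-vs-empty: assumed 1.0


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky index: |A & B| / (|A & B| + alpha*|A - B| + beta*|B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 1.0  # same assumed convention


# Illustrative fingerprints: B's on-bits are a subset of A's.
a = {0, 1, 2, 3}
b = {0, 1}

tan_ab, tan_ba = tanimoto(a, b), tanimoto(b, a)
# With alpha=1, beta=0 only bits unique to the first argument are penalized.
tv_ab = tversky(a, b, alpha=1.0, beta=0.0)
tv_ba = tversky(b, a, alpha=1.0, beta=0.0)
print(tan_ab, tan_ba, tv_ab, tv_ba)
```

With these weights the Tversky value depends on the direction of the comparison, while the Tanimoto value does not.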
8 Tie in Proximity. [Figure: three chemical structures; two of them are each at Euclidean distance 0.16 from the third.] One structure (or cluster!) equidistant from two others.
9 Are Proximity Ties Common? Example: binary fingerprints with the Tanimoto. Consider all bit strings of length 5 (32 distinct strings). The possible Tanimoto similarities for distinct bit strings of length 5 are: 0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5. These are all reduced fractions with denominators of 5 or less: the Farey sequence of order N, where N is 5, without its final term 1, which only identical strings attain. There are just 10 such distinct similarities, but 496 all-pairs similarities between the 32 distinct strings. And the distribution is:
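The counts on this slide can be checked by brute force. The following is a minimal sketch, assuming bit strings are stored as Python integers and similarities are kept as exact fractions so that equal values are recognized as ties; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def tanimoto(a: int, b: int) -> Fraction:
    """Exact Tanimoto similarity of two distinct bit strings stored as ints."""
    common = bin(a & b).count("1")
    union = bin(a | b).count("1")  # nonzero whenever a != b
    return Fraction(common, union)


def similarity_distribution(n_bits: int) -> Counter:
    """Frequency of each Tanimoto value over all pairs of distinct strings."""
    strings = range(2 ** n_bits)
    return Counter(tanimoto(a, b) for a, b in combinations(strings, 2))


distribution = similarity_distribution(5)
for value in sorted(distribution):
    print(value, distribution[value])
```

Because 496 pairs share only 10 possible values, ties in proximity are unavoidable in this small space.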
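10 All Possible Tanimoto Similarities for Bit Strings of Length 5. [Figure: histogram of the frequency of each similarity value, with the average frequency marked; x-axis: Tanimoto similarity (0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5); y-axis: frequency of similarities.]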
11 Finite Number of Proximities. How many possible Tanimoto similarities are there given N bits in a fixed-length fingerprint? $\frac{3N^2}{\pi^2} + O(N \log N)$, namely, the sum of the number of reduced fractions with denominators up to N. (Proof of above expected bound, 1883.) How many possible Euclidean similarities? N + 1. How many possible Cosine similarities? No known closed form in terms of N. Any number theorists in the house?
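For small N these counts can be verified directly. The sketch below counts reduced fractions with denominators up to N, including 0 and 1, and counts distinct Euclidean distances by brute force. It relies on the fact that for binary strings the squared Euclidean distance equals the Hamming distance. The function names are illustrative.

```python
from math import gcd, pi


def count_tanimoto_values(n_bits: int) -> int:
    """Number of reduced fractions p/q with 0 <= p <= q <= n_bits."""
    return sum(
        1
        for q in range(1, n_bits + 1)
        for p in range(q + 1)
        if gcd(p, q) == 1
    )


def leading_term(n_bits: int) -> float:
    """Leading term 3 N^2 / pi^2 of the asymptotic count."""
    return 3 * n_bits ** 2 / pi ** 2


def count_euclidean_values(n_bits: int) -> int:
    """Distinct Euclidean distances between bit strings (brute force, small N only)."""
    strings = range(2 ** n_bits)
    # The square root is monotone, so distinct Hamming distances
    # correspond one-to-one to distinct Euclidean distances.
    return len({bin(a ^ b).count("1") for a in strings for b in strings})


print(count_tanimoto_values(5), leading_term(5), count_euclidean_values(5))
```

For N = 5 the fraction count is one more than the ten values listed for distinct strings, because it also includes similarity 1 for identical strings.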
12 For Fingerprints of Size 1024. How many possible Tanimoto similarities? ~329,000. How many Euclidean similarities? 1,025. How many Cosine similarities? In the low millions (empirical estimate).
13 Exact Discrete Distributions vs Probabilistic Discrete Distributions. [Figure: four histograms of frequency of similarities. Exact: all possible Tanimoto similarities for bit strings of length 5; all possible Ochiai similarities for bit strings of length 5. Empirical: 380 NCI actives, Daylight fingerprints w/ 512 bits, Tanimoto similarity; 380 NCI actives, Daylight fingerprints w/ 512 bits, Ochiai (1-Cosine) similarity.]
14 Clustering and Ties in Proximity. Measures with small numbers of possible similarities (e.g., Euclidean), or distributions that lead to this same effect (e.g., Tanimoto, Ochiai), are prone to the problem of ties in proximity in clustering. This can affect derived measures as well, such as the square error of Ward's merging criterion. Algorithms for Clustering Data, Jain and Dubes. Godden, et al., JCICS, 2001, 40, and MacCuish, et al., JCICS, 2001, 41 (1). Namely, we are clustering in a space that is a rigid lattice of proximities and/or derived measures rather than a continuum. (Note: typically, for the lengths of the binary descriptors of the vendors mentioned, this lattice is far more coarse than the lattice that would be created by the typical floating point machine representation of real numbers.)
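A tie in proximity is easy to detect if the nearest-neighbor search returns every compound at the minimum distance rather than only the first one found. The sketch below is a minimal illustration using Hamming distance on integer-coded bit strings (which orders pairs the same way as Euclidean distance on binary data); the data and function names are made up for the example.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-coded bit strings."""
    return bin(a ^ b).count("1")


def nearest_neighbors(index: int, fps: list) -> tuple:
    """Return (minimum distance, all indices tied at that distance)."""
    dists = [(hamming(fps[index], fp), j) for j, fp in enumerate(fps) if j != index]
    best = min(d for d, _ in dists)
    return best, [j for d, j in dists if d == best]


# Illustrative 5-bit fingerprints.
fps = [0b10110, 0b10100, 0b00110, 0b01001]
best, tied = nearest_neighbors(0, fps)
print(best, tied)  # more than one index in `tied` signals a tie in proximity
```

An implementation that keeps only the first minimum would silently pick whichever tied neighbor appears first in the input order.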
15 In the literature beware of: "We resolve ties arbitrarily"
16 Decision Ties in Clustering Algorithms. A simple decision tie: a tie in proximity. Other decision ties may be algorithm dependent (and can occur even with continuous data). In practice most decision ties lead to cluster ambiguity: an inability to discriminate non-disjoint (overlapping) clusters. Namely, disjoint clusters don't reflect the amount of ambiguity identified by decision ties as the resulting non-disjoint clustering suggests.
17 Algorithms
18 Taylor-Butina (TB) Leader or Exclusion Cluster Sampling Algorithm. 1. Create a thresholded nearest neighbor table. 2. Find true singletons: all those compounds with an empty nearest neighbor list. 3. Find the compound with the largest nearest neighbor list (representative compound or centrotype). This compound and its neighbors become a group and are excluded from further consideration: these compounds are removed from all nearest neighbor lists. 4. Repeat 3 until no compounds exist with a non-empty nearest neighbor list. Taylor, JCICS, 1995, 35; Butina, JCICS, 1999, 39. Optional: 1. Assign remaining compounds, false singletons, to the group that contains their nearest neighbor; 2. Use other criteria to break exclusion region ties; 3. Use asymmetric measures; 4. Can be made to return overlapping clusters.
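The four steps above can be sketched as follows. This is an illustrative implementation, not the authors' code: fingerprints are sets of on-bit positions, the function name `taylor_butina` is hypothetical, and ties for the largest neighbor list are broken by lowest input index, which is exactly the kind of arbitrary choice the talk warns about. The sketch also counts how often such a tie occurs.

```python
def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 1.0  # assumed convention


def taylor_butina(fps: list, threshold: float):
    n = len(fps)
    # Step 1: thresholded nearest neighbor table.
    nbrs = {
        i: {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    }
    # Step 2: true singletons have an empty neighbor list from the start.
    singletons = [i for i in range(n) if not nbrs[i]]
    for i in singletons:
        del nbrs[i]
    clusters, tie_events = [], 0
    # Steps 3-4: repeatedly extract the compound with the largest list.
    while any(nbrs.values()):
        largest = max(len(v) for v in nbrs.values())
        candidates = sorted(i for i, v in nbrs.items() if len(v) == largest)
        if len(candidates) > 1:
            tie_events += 1  # exclusion region tie
        leader = candidates[0]  # assumption: lowest index, i.e. input order
        members = {leader} | nbrs[leader]
        clusters.append(sorted(members))
        for m in members:
            nbrs.pop(m, None)
        for v in nbrs.values():
            v -= members
    false_singletons = sorted(nbrs)  # left over with emptied lists
    return clusters, singletons, false_singletons, tie_events


# Illustrative data: a chain of four nested fingerprints.
fps = [{0, 1, 2}, {0, 1, 2, 3}, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 4, 5}]
clusters, singletons, false_singletons, tie_events = taylor_butina(fps, 0.7)

# Same data entered in reverse order; indices mapped back to the originals.
order = [3, 2, 1, 0]
permuted = taylor_butina([fps[i] for i in order], 0.7)
permuted_clusters = [sorted(order[i] for i in c) for c in permuted[0]]
print(clusters, permuted_clusters, tie_events)
```

In this example the two middle compounds tie for the largest neighbor list, so the cluster that is formed depends on the input order.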
19 Tie Cases in TB Algorithm. [Figure: exclusion regions (diameter set by threshold value) around representative compounds, with a true singleton outside them. Labeled cases: exclusion region tie; false singleton tie (false singleton, which region?), which may form an ambiguous clustering if the sum of minimum distances is also tied; false singleton tie, but regions not ambiguous, no need to sum minimum distances.]
20 K-Means and K-Modes and Overlapping Versions. Continuous K-means with fingerprints (convert binary to real 0.0s, 1.0s): 1. Choose k seed centroids from data set (e.g., quasi-randomly via 1D Halton sequence). 2. Find nearest neighbors to the centroids -- TIES HERE -- Overlapping. 3. Recompute new centroids. 4. Repeat 2 until no neighbors change groups or some iterative threshold. K-modes with fingerprints (fingerprints remain binary): 1. Choose k seed modes from data set (or frequency of categories method, etc.). 2. Find nearest neighbors to the modes (Euclidean, Tanimoto, etc.) -- TIES HERE. 3. Recompute new modes (simple matching coefficient). 4. Same as 4 in continuous K-means. Continuous K-means, Los Alamos Science, Faber, Kelly, White, 1994.
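The "TIES HERE" step in both variants is the assignment of each fingerprint to its nearest centroid or mode. A minimal sketch of a tie-aware assignment for the K-modes case follows, using Hamming distance on integer-coded bit strings; the function name and data are illustrative. Returning every tied mode is one way to obtain the overlapping version mentioned above.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def assign_with_ties(fps: list, modes: list) -> list:
    """For each fingerprint, list every mode at the minimum distance."""
    assignments = []
    for fp in fps:
        dists = [hamming(fp, m) for m in modes]
        best = min(dists)
        assignments.append([k for k, d in enumerate(dists) if d == best])
    return assignments


# Illustrative 5-bit modes and fingerprints.
modes = [0b11000, 0b00011]
fps = [0b11100, 0b00111, 0b10001]
assignments = assign_with_ties(fps, modes)
print(assignments)  # a list with more than one mode index marks a tie
```

A disjoint implementation would keep only one entry per fingerprint, typically the first, which hides the tie.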
21 Jarvis-Patrick. Two common versions: 1. Kmin: fixed length, k, NN list -- TIES HERE -- kth neighbor tied. If two compound NN lists have j neighbors in common, those compounds are in the same group. 2. Pmin: fixed length, k, NN list -- TIES HERE -- kth neighbor tied. If two compound NN lists have a percentage, p, of neighbors in common, those compounds are in the same group. Improvements to Daylight Clustering, Delaney, Bradshaw, MUG 04.
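The kth-neighbor tie can be flagged when the fixed-length list is built. The sketch below is illustrative only: it takes one row of a dissimilarity matrix, truncates the sorted neighbors to length k, and reports whether the first excluded neighbor is exactly as close as the last included one. The Kmin test follows the wording of the slide (at least j neighbors in common); the function names are hypothetical.

```python
def knn_with_tie_flag(index: int, dist_row: list, k: int) -> tuple:
    """Return (k nearest neighbor indices, True if the kth neighbor is tied)."""
    ranked = sorted((d, j) for j, d in enumerate(dist_row) if j != index)
    neighbors = [j for _, j in ranked[:k]]
    # Tie: the (k+1)th candidate is at the same distance as the kth.
    tied = len(ranked) > k and ranked[k][0] == ranked[k - 1][0]
    return neighbors, tied


def share_j_neighbors(list_a: list, list_b: list, j: int) -> bool:
    """Kmin grouping test as stated on the slide: j neighbors in common."""
    return len(set(list_a) & set(list_b)) >= j


# Illustrative dissimilarities from compound 0 to compounds 0..4.
row = [0.0, 0.2, 0.3, 0.3, 0.5]
neighbors, tied = knn_with_tie_flag(0, row, k=2)
print(neighbors, tied)
```

Here compounds 2 and 3 are equally distant, so which one enters the list of length 2 is decided by index order alone.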
22 Reciprocal Nearest Neighbors (RNN) Hierarchical Clustering. Ward's, Group Average, and Complete Link clustering algorithms can use RNN as a fast method of obtaining the hierarchy: $O(N^2)$ vs $O(N^3)$. Murtagh, A survey on recent advances in hierarchical clustering algorithms, Computer Journal, 26, 1983. The RNN form of these algorithms contains specific decision ties unique to this method. The resulting ambiguity can be quantified by enumerating decision tie events.
23 RNN Algorithm Decision Ties. 1. Form a nearest neighbor (NN) chain until an RNN (each is the NN of the other) is found. What if there is more than one NN? -- Ties in Proximity (or merging criterion) problem, increasing the ambiguity. What if in turn there is more than one RNN? Ties in Proximity and Algorithm Decision Tie problem, more ambiguity. 2. Use a merge criterion other than the one used in the algorithm to choose the RNN in this case -- decrease the ambiguity. What if the result of this new criterion is also tied? Another Algorithm Decision Tie, increasing the ambiguity. For hierarchical algorithms that return overlapping clusterings based solely on ties in proximity, see Nicolaou, MacCuish, and Tamura, A new multi-domain clustering algorithm for lead discovery that exploits ties in proximity, Proceedings from the 13th Euro-QSAR, Prous Science, 2000, pp
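The two questions in step 1 can be made concrete by enumerating nearest-neighbor sets and reciprocal pairs on a small dissimilarity matrix. The sketch below is not the NN-chain algorithm itself, only an illustrative enumeration of the tie events it would face; the function name and matrix are made up for the example.

```python
def reciprocal_nearest_neighbors(dist: list) -> tuple:
    """Return (all RNN pairs, number of items whose nearest neighbor is tied)."""
    n = len(dist)
    nn = []
    for i in range(n):
        best = min(dist[i][j] for j in range(n) if j != i)
        # Keep every neighbor at the minimum distance, not just the first.
        nn.append({j for j in range(n) if j != i and dist[i][j] == best})
    pairs = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if j in nn[i] and i in nn[j]
    ]
    nn_ties = sum(1 for s in nn if len(s) > 1)
    return pairs, nn_ties


# Illustrative symmetric dissimilarity matrix for four items.
dist = [
    [0, 1, 1, 3],
    [1, 0, 2, 3],
    [1, 2, 0, 3],
    [3, 3, 3, 0],
]
pairs, nn_ties = reciprocal_nearest_neighbors(dist)
print(pairs, nn_ties)
```

Item 0 is a reciprocal nearest neighbor of both item 1 and item 2, so the first merge is a decision tie and the resulting hierarchy depends on which pair is chosen.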
24 How can we address this problem?
25 Levels of Ambiguity. Two groups with considerable overlap... or smaller, more distinct groups. Or, the difficulty of making sense of large numbers of overlapping clusters where the intersections are large.
26 Distinct Clusters vs Overlapping Clusters. Distinct (disjoint) clusters: just one clustering of many possible. Overlapping clusters: a result of combining all decision ties. [Figure: overlapping clusters, but understandable; overlapping clusters, but difficult to understand.] Fewer decision ties mean less ambiguity; more decision ties mean more ambiguity.
27 An Ambiguity Index Defined with TB Algorithm. The difference between the disjoint and non-disjoint results of Taylor's algorithm can give us a sense of the ambiguity inherent in clustering fingerprints at a given Tanimoto or Tversky threshold. Many simple indices can be defined. We use an index that reflects the number of shared compounds in the non-disjoint clustering.
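The talk does not give the formula for its index, so the following is only one hypothetical index of the kind described: the fraction of clustered compounds that appear in more than one cluster of a non-disjoint clustering. The function name and the example clustering are assumptions made for illustration.

```python
from collections import Counter


def shared_fraction(overlapping_clusters: list) -> float:
    """Fraction of compounds that belong to more than one cluster (hypothetical index)."""
    counts = Counter(c for cluster in overlapping_clusters for c in set(cluster))
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)


# Illustrative non-disjoint clustering: compound 2 is shared by two clusters.
example = [[0, 1, 2], [2, 3], [4]]
index_value = shared_fraction(example)
print(index_value)
```

A disjoint clustering always scores 0.0 under this definition, so the value measures how much the non-disjoint result departs from it.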
28 Clustering Ambiguity with Taylor's Algorithm, 380 NCI-HIV Actives. [Figure: ambiguity index (increasing) vs Tanimoto threshold for MACCS 166 bits, MACCS 320 bits, and Daylight 512 bits; annotation: approx. 10% of compounds shared among clusters.]
29 Jarvis-Patrick Results Summary. The number of proximity ties is significant in both algorithms when reasonable values for k, j, and p are chosen -- on par with that of Taylor's and RNN algorithms. Kmin typically has more ties in general, though it is hard to make a one-to-one comparison with Pmin.
30 K-means, K-modes Results Summary. K-means: typically a small number of ties, depending on K, on just the first iteration. Rarely are there ties after the first iteration. Very little overlap when the algorithm converges. Ambiguity confounded by the local optima problem. K-modes with frequency method: fewer ties overall than even K-means, but ties occur more frequently in subsequent iterations. Again, very little overlap when the algorithm converges. Ambiguity confounded by the local optima problem.
31 Level Selection and Ambiguity in Hierarchical Clustering. [Figure: two plots against number of clusters. Level selection heuristic: Kelley level selection values, annotated "Best level?" and "Ambiguity?". Ambiguity index: total ambiguity in the form of ties.]
32 Ambiguity Index for Hierarchical RNN Class Algorithms. Count the number of decision ties as a rough estimate of the ambiguity. Use this in conjunction with level selection techniques (e.g., Kelley's), where the objective is to find the best non-trivial level selection value with the lowest ambiguity index. Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 1996, 9. Wild, D. J.; Blankley, C. J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping Using Ward's Clustering, J. Chem. Inf. Comput. Sci. 2000, 40.
33 Two Ward's Clusterings with Euclidean Distance: same data (482 NCI-HIV), entered in a different order. Best Kelley cuts: 63 clusters vs 136 clusters. Similar groups at the top of the dendrogram mask very different groups below.
34 Complete Link -- Direct Understanding of Level. Using Tanimoto as the measure, we can inspect ambiguity and the similarity level or threshold directly. Namely, check ambiguity at various Tanimoto similarity thresholds (levels) common in the field: 0.7, 0.85.
35 Two Complete Link Clusterings with Soergel Measure: same data (482 NCI-HIV), entered in a different order. Best Kelley cuts: 24 clusters. At 0.7 similarity: 255 clusters (not all the same). For each clustering, the same problem.
36 Conclusions. DON'T PANIC: clustering is good! Ambiguity is important in terms of choosing K clusters. Combining level selection information with ambiguity information can help to make sense of results. Modifying algorithms to use secondary grouping criteria when faced with decision ties can help reduce the ambiguity, often providing tighter, more useful clusterings -- data, measure, and algorithm dependent, however! BUT BE CAREFUL! Is it important for your application? Determining ambiguity or adding secondary grouping criteria can have significant computational cost. In general, the choice of bit string length, measures, and algorithms can all lead to differing amounts of ambiguity.
37 Future Work. Further work on ambiguity indices, ideally across (FPLength) x (Measures) x (Algorithms) x (FindK) x (DataSetSize) x (DataSetDiversity). Explore other algorithms.
38 Acknowledgements. John Bradshaw (Daylight CIS); John Blankley (Pfizer, retired); John Barnard (BCI); David Wild (Wild Ideas); OpenEye Scientific Software, Inc. This talk can be found at
More informationOverview of clustering analysis. Yuehua Cui
Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationDr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre
Dr. Sander B. Nabuurs Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre The road to new drugs. How to find new hits? High Throughput
More informationQSAR Modeling of Human Liver Microsomal Stability Alexey Zakharov
QSAR Modeling of Human Liver Microsomal Stability Alexey Zakharov CADD Group Chemical Biology Laboratory Frederick National Laboratory for Cancer Research National Cancer Institute, National Institutes
More informationk-means clustering mark = which(md == min(md)) nearest[i] = ifelse(mark <= 5, "blue", "orange")}
1 / 16 k-means clustering km15 = kmeans(x[g==0,],5) km25 = kmeans(x[g==1,],5) for(i in 1:6831){ md = c(mydist(xnew[i,],km15$center[1,]),mydist(xnew[i,],km15$center[2, mydist(xnew[i,],km15$center[3,]),mydist(xnew[i,],km15$center[4,]),
More informationCheminformatics analysis and learning in a data pipelining environment
Molecular Diversity (2006) 10: 283 299 DOI: 10.1007/s11030-006-9041-5 c Springer 2006 Review Cheminformatics analysis and learning in a data pipelining environment Moises Hassan 1,, Robert D. Brown 1,
More informationCSE 546 Final Exam, Autumn 2013
CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,
More informationRepresentation of molecular structures. Coutersy of Prof. João Aires-de-Sousa, University of Lisbon, Portugal
Representation of molecular structures Coutersy of Prof. João Aires-de-Sousa, University of Lisbon, Portugal A hierarchy of structure representations Name (S)-Tryptophan 2D Structure 3D Structure Molecular
More informationAnalysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing
Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver
More informationReview of Statistics 101
Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods
More informationThis is a repository copy of Chemoinformatics techniques for data mining in files of two-dimensional and three-dimensional chemical molecules.
This is a repository copy of Chemoinformatics techniques for data mining in files of two-dimensional and three-dimensional chemical molecules. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/8425/
More informationA Pseudo-Boolean Set Covering Machine
A Pseudo-Boolean Set Covering Machine Pascal Germain, Sébastien Giguère, Jean-Francis Roy, Brice Zirakiza, François Laviolette, and Claude-Guy Quimper Département d informatique et de génie logiciel, Université
More informationChemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller
Chemogenomic: Approaches to Rational Drug Design Jonas Skjødt Møller Chemogenomic Chemistry Biology Chemical biology Medical chemistry Chemical genetics Chemoinformatics Bioinformatics Chemoproteomics
More informationErrors. Intensive Computation. Annalisa Massini 2017/2018
Errors Intensive Computation Annalisa Massini 2017/2018 Intensive Computation - 2017/2018 2 References Scientific Computing: An Introductory Survey - Chapter 1 M.T. Heath http://heath.cs.illinois.edu/scicomp/notes/index.html
More informationChemical Space: Modeling Exploration & Understanding
verview Chemical Space: Modeling Exploration & Understanding Rajarshi Guha School of Informatics Indiana University 16 th August, 2006 utline verview 1 verview 2 3 CDK R utline verview 1 verview 2 3 CDK
More informationMolecular Similarity Searching Using Inference Network
Molecular Similarity Searching Using Inference Network Ammar Abdo, Naomie Salim* Faculty of Computer Science & Information Systems Universiti Teknologi Malaysia Molecular Similarity Searching Search for
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationData Preprocessing. Cluster Similarity
1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M
More informationMolecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems.
Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems. Roberto Todeschini Milano Chemometrics and QSAR Research Group - Dept. of
More informationBayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory
Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology
More informationComputational Learning Theory
1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number
More informationUniversal Similarity Measure for Comparing Protein Structures
Marcos R. Betancourt Jeffrey Skolnick Laboratory of Computational Genomics, The Donald Danforth Plant Science Center, 893. Warson Rd., Creve Coeur, MO 63141 Universal Similarity Measure for Comparing Protein
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationImago: open-source toolkit for 2D chemical structure image recognition
Imago: open-source toolkit for 2D chemical structure image recognition Viktor Smolov *, Fedor Zentsev and Mikhail Rybalkin GGA Software Services LLC Abstract Different chemical databases contain molecule
More informationDATA MINING WITH DIFFERENT TYPES OF X-RAY DATA
DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA 315 C. K. Lowe-Ma, A. E. Chen, D. Scholl Physical & Environmental Sciences, Research and Advanced Engineering Ford Motor Company, Dearborn, Michigan, USA
More informationPath Testing and Test Coverage. Chapter 9
Path Testing and Test Coverage Chapter 9 Structural Testing Also known as glass/white/open box testing Structural testing is based on using specific knowledge of the program source text to define test
More informationRiemannian Metric Learning for Symmetric Positive Definite Matrices
CMSC 88J: Linear Subspaces and Manifolds for Computer Vision and Machine Learning Riemannian Metric Learning for Symmetric Positive Definite Matrices Raviteja Vemulapalli Guide: Professor David W. Jacobs
More informationEncoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models. Igor V.
Encoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models Igor V. Tetko IBPC, Ukrainian Academy of Sciences, Kyiv, Ukraine and Institute
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More information