Chemical Space: Modeling Exploration & Understanding

Similar documents
A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

Machine Learning Concepts in Chemoinformatics

Improving usability and accessibility of cheminformatics tools for chemists through cyberinfrastructure and education

Integrated Cheminformatics to Guide Drug Discovery

Structure-Activity Modeling - QSAR. Uwe Koch

Chemical Data Retrieval and Management

Exploring the black box: structural and functional interpretation of QSAR models.

Structural interpretation of QSAR models a universal approach

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database

Using Self-Organizing maps to accelerate similarity search

Using Web Technologies for Integrative Drug Discovery

CDK & Mass Spectrometry

Orbital Development Kit

Encoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models. Igor V.

DivCalc: A Utility for Diversity Analysis and Compound Sampling

Machine learning for ligand-based virtual screening and chemogenomics!

Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME

Course in Data Science

(S1) (S2) = =8.16

Introduction. OntoChem

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

An Integrated Approach to in-silico

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Data Mining in the Chemical Industry. Overview of presentation

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Navigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland

Chemical Space. Space, Diversity, and Synthesis. Jeremy Henle, 4/23/2013

est Drive K20 GPUs! Experience The Acceleration Run Computational Chemistry Codes on Tesla K20 GPU today

Integrative Metabolic Fate Prediction and Retrosynthetic Analysis the Semantic Way

Pose and affinity prediction by ICM in D3R GC3. Max Totrov Molsoft

Linear and Logistic Regression. Dr. Xiaowei Huang

Global Scene Representations. Tilke Judd

Drug Informatics for Chemical Genomics...

Capturing Chemistry. What you see is what you get In the world of mechanism and chemical transformations

1 [15 points] Frequent Itemsets Generation With Map-Reduce

(e.g.training and prediction set, algorithm, ecc...). 2.9.Availability of another QMRF for exactly the same model: No other information available

Kristin P. Bennett. Rensselaer Polytechnic Institute

KNIME-based scoring functions in Muse 3.0. KNIME User Group Meeting 2013 Fabian Bös

Applying the Semantic Web to Computational Chemistry

QSAR Modeling of Human Liver Microsomal Stability Alexey Zakharov

Semi Supervised Distance Metric Learning

Coefficient Symbol Equation Limits

Practical QSAR and Library Design: Advanced tools for research teams

Structural biology and drug design: An overview

Fragment based drug discovery in teams of medicinal and computational chemists. Carsten Detering

Cross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic

Building blocks for automated elucidation of metabolites: Machine learning methods for NMR prediction

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

A New Space for Comparing Graphs

The PhilOEsophy. There are only two fundamental molecular descriptors

Similarity Search. Uwe Koch

Using Phase for Pharmacophore Modelling. 5th European Life Science Bootcamp March, 2017

MultiscaleMaterialsDesignUsingInformatics. S. R. Kalidindi, A. Agrawal, A. Choudhary, V. Sundararaghavan AFOSR-FA

Analytical data, the web, and standards for unified laboratory informatics databases

Modeling Mutagenicity Status of a Diverse Set of Chemical Compounds by Envelope Methods

Chemoinformatics and information management. Peter Willett, University of Sheffield, UK

Translating Methods from Pharma to Flavours & Fragrances

Materials Informatics: Statistical Modeling in Material Science

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

KNIME applications at Syngenta

ALOGPS is a free on-line program to predict lipophilicity and aqueous solubility of chemical compounds

What is a property-based similarity?

Rick Ebert & Joseph Mazzarella For the NED Team. Big Data Task Force NASA, Ames Research Center 2016 September 28-30

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

Large scale classification of chemical reactions from patent data

Performing a Pharmacophore Search using CSD-CrossMiner

BLAST: Target frequencies and information content Dannie Durand

Efficient query evaluation

Predictive analysis on Multivariate, Time Series datasets using Shapelets

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

Reaxys The Highlights

Gene Ontology and overrepresentation analysis

ArchaeoKM: Managing Archaeological data through Archaeological Knowledge

Computational chemical biology to address non-traditional drug targets. John Karanicolas

Characterization of Pharmacophore Multiplet Fingerprints as Molecular Descriptors. Robert D. Clark 2004 Tripos, Inc.

Kinome-wide Activity Models from Diverse High-Quality Datasets

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Drug Design 2. Oliver Kohlbacher. Winter 2009/ QSAR Part 4: Selected Chapters

Xia Ning,*, Huzefa Rangwala, and George Karypis

In silico pharmacology for drug discovery

CheS-Mapper 2.0 for visual validation of (Q)SAR models

Organometallics & InChI. August 2017

An Introduction to Machine Learning

Statistics Toolbox 6. Apply statistical algorithms and probability models

DATA MINING LECTURE 10

QSAR in Green Chemistry

GIS at UCAR. The evolution of NCAR s GIS Initiative. Olga Wilhelmi ESIG-NCAR Unidata Workshop 24 June, 2003

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

Building innovative drug discovery alliances. Just in KNIME: Successful Process Driven Drug Discovery

Molecular Descriptors Theory and tips for real-world applications

Application integration: Providing coherent drug discovery solutions

Spatializing Research Hypotheses a long- term research vision for

Behavioral Data Mining. Lecture 2

Keywords: anti-coagulants, factor Xa, QSAR, Thrombosis. Introduction

cheminformatics toolkits: a personal perspective

Transcription:

verview Chemical Space: Modeling Exploration & Understanding Rajarshi Guha School of Informatics Indiana University 16 th August, 2006

utline verview 1 verview 2 3 CDK R

utline verview 1 verview 2 3 CDK R

verview Goals of Cheminformatics Research at IU Extend the state of the art in cheminformatics Extract chemistry, not just R 2, q 2 et al. Provide intuitive and efficient ways to handle chemical information Supply expertise to bench chemists and non-computational chemists

verview Statistical Modeling of Chemical Information QSAR model development Lazy regression Ensemble descriptor selection Wavelet-based spectral descriptors Interpretation of QSAR models Measuring model applicability

verview Cheminformatics Algorithms R- curves utlier detection Cluster cardinality LSH based applications Ensemble descriptor selection Interpretation techniques

verview Tools and Pipelines for Cheminformatics Contributions to the CDK Packages for R rcdk spe fingerprint spectral clustering rpubchem (to come) Automated QSAR pipeline (collaboration with P&G) Development of workflows

utline verview 1 verview 2 3 CDK R

verview QSAR Model Interpretation Why interpret? Predictive ability is useful for screening Interpretation provides extra value Interpretability might be more useful than predictive ability Interpretability depends on modeling technique and descriptors involved

verview QSAR Model Interpretation The trade-offs in interpretation Interpretability is usually a trade-off with accuracy LS models are easily interpretable, not always accurate C models are usually black boxes, more accurate Some methods (RF) lie in between

verview Detailed Interpretation Methods LS Models Build a good model Run it through PLS The X-loadings indicate which descriptors are important magnitude lets us rank them sign lets us indicate the nature of their effects Allows us to identify effects of descriptors on a molecule-wise basis C Models Analogous to PLS based interpretations Linearizes the C Information is lost Resultant interpretations match the corresponding interpretations for an LS model quite well Guha, R. et al., J. Chem. Inf. Model., 2005, 45, 321 333 Guha, R. et al., J. Chem. Inf. Comput. Sci., 2004, 44, 1440 1449

verview Extensions of Interpretation Methods The C interpretation uses significant approximations and does not make full use of biases Develop interpretation techniques for other methods, RF in particular Ensemble descriptor selection to choose a subset that is simultaneously good for multiple model types Use these methods on real datasets

verview QSAR Model Applicability Model Validation Goal is to test the reliability of the model Ensures that the model is not due to chance factors Based on dataset used to develop the model Model Applicability Goal is to test the applicability of the model to new compounds Tells us: The model will predict the activity well (or not) Similar to confidence measures

verview What is Model Applicability Question? How will a model perform when faced with molecules that it has not been trained on or validated with? Aspects Similarity to the TSET? Can we consider a global chemistry space? Structural or statistical similarity? Quantitative or qualitative?

verview How to Assess Model Applicability Define Model Performance Performance is measured by prediction residuals. The model performs well on a new molecule if it predicts its activity with low residual error. Correlate X With Performance X could be similarity between a query molecule and the original training set X could be derived from a cluster membership approach Alternatively, predict performance itself Guha, R.; Jurs, P.C; J. Chem. Inf. Model., 2005, 45, 65 73

verview earest eighbor Methods Traditional k methods are simple, fast, intuitive Applications in regression & classification diversity analysis Can be misleading if the nearest neighbor is far away R- methods may be more suitable

verview R- Curves for Diversity Analysis The question... Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset? The answer... R- curves D max max pairwise distance for molecule in dataset do R 0.01 D max while R D max do Find s within radius R Increment R end while end for Plot count vs. R Guha, R., et al., J. Chem. Inf. Model., 2006, 46, 1713 1772

verview R- Curves for Diversity Analysis The question... Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset? The answer... R- curves D max max pairwise distance for molecule in dataset do R 0.01 D max while R D max do Find s within radius R Increment R end while end for Plot count vs. R umber of eighbors umber of eighbors 0 50 100 150 200 250 0 50 100 150 200 250 R R Guha, R., et al., J. Chem. Inf. Model., 2006, 46, 1713 1772

verview Characterizing R- Curves

H2 H 2 H H H H H H H H H H H2 S H2 H H H H H H H H H H H H 2 H 2 H H H H H H H 2 H H H 2 H H H H S S H H S H H S H H2 H H H H H H2 H H H H H2 H H H H H H H H2 H2 verview R- Curves and utliers 3904 R maxs 20 40 60 80 1861 3904 0 1000 2000 3000 4000 Serial umber A single plot identifies the location characteristics of all the molecules 1861 H Kazius, J.; McGuire, R.; Bursi, R.; J. Med. Chem. 2005, 48, 312 320

verview R- Curves and Clusters 251 84 0 5 10 15 88 137 251 84 0 50 100 150 200 250 300 0 50 100 150 200 250 300 88 0 50 100 150 200 250 300 0 50 100 150 200 250 300 137 5 0 5 10 15 20 25 Smoothed R- Curves R- curves are indicative of the number of clusters

verview R- Curves and Clusters Counting the steps Essentially a curve matching problem All points will not be indicative of the number of clusters ot applicable for concentric clusters Approaches Hausdorff / Fréchet distance - requires canonical curves RMSE from distance matrix Slope analysis

verview R- Curves and Their Slopes 0 50 100 150 200 250 300 251 0 50 100 150 200 250 300 84 Slope of the R Curve 0 2 4 6 8 10 12 14 251 Slope of the R Curve 0 5 10 15 20 84 0 50 100 150 200 250 300 88 0 50 100 150 200 250 300 137 Slope of the R Curve 0 5 10 15 88 Slope of the R Curve 0 5 10 15 137 Smoothed first derivative of the R- Curves

verview R- Curves and Their Slopes 0 50 100 150 200 250 300 251 0 50 100 150 200 250 300 84 Slope of the R Curve 0 2 4 6 8 10 12 14 251 Slope of the R Curve 0 5 10 15 20 84 0 50 100 150 200 250 300 88 0 50 100 150 200 250 300 137 Slope of the R Curve 0 5 10 15 88 Slope of the R Curve 0 5 10 15 137 Smoothed first derivative of the R- Curves Identifying peaks identifies the number of clusters Automated picking can identify spurious peaks

verview Slope Analysis of R- Curves Procedure for i in molecules do Evaluate R- curve F smoothed R- curve Evaluate F Smooth F root,i no. of roots of F end for cluster = [max( root ) + 1]/2 Possible improvements Sample from the collection of R- curves Improve handling of concentric clusters 0 50 100 150 200 250 300 0 2 4 6 8 0.5 0.0 0.5 R Curve D' D''

Preliminary Results verview Simulated 2D data Predicted k, followed by kmeans clustering using k Investigated similar values of k k ASW 4 0.61 3 0.74 5 0.70 k ASW 3 0.44 2 0.65 4 0.47 k ASW 4 0.64 3 0.48 5 0.56 ASW - average silhouette width, higher is better; k - number of clusters

ntologies verview What is an ontology? Defines a controlled vocabulary (keywords) Defines a set of relationships between them Allows for meaning to be added to data and algorithms automated inference of relationships An example is the Gene ntology

verview ntologies in Chemistry What s available? o really comprehensive ontology available Some work is on to add chemistry semantics to biology-related ontologies What can we do? Start small - focus on one area of chemistry, descriptors The CDK currently provides an ontology for implemented descriptors Vocabulary includes author literature reference class (molecular, atomic, bond) type (geometric, electronic,... )

verview How Can We Use ntologies? Allow interoperability of software Different program use different naming schemes Programs may be open- or close-source Annotation via dictionaries allows us to make conclusions regarding descriptors, suggest similar descriptors etc. Utilize expertise from chemists Some descriptors are useful for certain properties Chemists who use models can tag descriptors as useful for a property ver time the tagging can be indicative of the utility of certain descriptors

utline verview CDK R 1 verview 2 3 CDK R

verview CDK R The Chemistry Development Kit What does it do? Reads multiple file formats Calculates molecular descriptors (in progress) Topologicals (Chi, Kappa,... ) Geometric (Gravitational indices, MI,... ) Hybrid (CPSA) Global (BCUT, WHIM, RDF,... ) Provides access to statistical engines (R & Weka) Fingerprint calculation, 2D structure diagrams Kabsch alignment 3D coordinates and force fields (in progress) Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2111 2120 Steinbeck, C. et al., J. Chem. Inf. Sci., 2003, 43, 493 500

Contributions verview CDK R Main areas of contributions Descriptor framework QSAR modeling framework (interface to R) Web service functionality ther areas include umerical surface areas Rigid alignment Descriptor ontology Build system QA, debugging, support http://almost.cubic.uni-koeln.de/cdk/cdk top

verview Cheminformatics and R CDK R What is R? An open-source statistical environment for model development algorithm prototyping pen source version of Splus Splus code runs (mostly) unchanged on R Provides a wide array of mathematical and statistical functionality Linear models (LS, robust regression, GLM, PLS) eural networks, random forests, SM, SVM Clustering methods (kmeans, agnes, pam,... ) ptimization routines Database interfaces When working with chemical data it would be nice to have access to cheminformatics functionality inside R

verview Cheminformatics and R CDK R What is R? An open-source statistical environment for model development algorithm prototyping pen source version of Splus Splus code runs (mostly) unchanged on R Provides a wide array of mathematical and statistical functionality Linear models (LS, robust regression, GLM, PLS) eural networks, random forests, SM, SVM Clustering methods (kmeans, agnes, pam,... ) ptimization routines Database interfaces When working with chemical data it would be nice to have access to cheminformatics functionality inside R

verview CDK R Some Cheminformatics Related Packages Connecting R and the CDK R can access Java code via the SJava package This allows us to use CDK functionality within R The rcdk package provides user friendly wrappers to CDK classes and methods The result is that we can stay inside R and handle molecules directly Fingerprints The fingerprint package handles binary fingerprint data from CDK and ME Calculates similarity matrices using the Tanimoto metric Converts binary fingeprints to Euclidean vectors Guha, R., CDK ews, 2005, 2, 2 6

R Miscellanea verview CDK R R as a Web Service A number of packages are available to access R via the web An effort is also underway to provide an explicit SAP interface Very simple to access a remote R process via RServe Currently under investigation as our statistical backend for workflows

Summary verview CDK R Broadly focused on 2 areas... Modeling Predictive model development Model interpreation and applicability Algorithm development Exploring nearest neighbor methods Descriptor selection and interpretation Dictionaries and ontologies Underlying motiviation is the extraction of chemistry from the numbers and making it available Collaborations... More brains are useful Always on the lookout for data

Summary verview CDK R Broadly focused on 2 areas... Modeling Predictive model development Model interpreation and applicability Algorithm development Exploring nearest neighbor methods Descriptor selection and interpretation Dictionaries and ontologies Underlying motiviation is the extraction of chemistry from the numbers and making it available Collaborations... More brains are useful Always on the lookout for data

Summary verview CDK R Broadly focused on 2 areas... Modeling Predictive model development Model interpreation and applicability Algorithm development Exploring nearest neighbor methods Descriptor selection and interpretation Dictionaries and ontologies Underlying motiviation is the extraction of chemistry from the numbers and making it available Collaborations... More brains are useful Always on the lookout for data

verview CDK R