An Integrated Approach to in-silico Screening
Joseph L. Durant Jr., Douglas R. Henry, Maurizio Bronzetti, and David A. Evans
MDL Information Systems, Inc., 14600 Catalina St., San Leandro, CA 94577

Goals of in-silico screening
- Lead discovery/design
- Lead optimization
- Maximize potency
- Minimize toxicity
- Optimize ADME characteristics
- Fail-early paradigm: try to predict clinical success/failure
- Maximize use of prior information

Challenges of in-silico screening
- Combinatorial explosion of chemical and biological data
- Must ensure data currency, validity, and cleanliness
- Must accommodate multiple heterogeneous sources of data
- The wide expanse of chemical space dictates a wide variety of tools to analyze it
- Need an integrated chemical informatics framework to use the tools efficiently and conveniently

Chemical Informatics Framework
- Databases, data marts, and data warehouses of chemical structure information, and the software to build and manage them (including in-house, commercial, and internet data sources)
- Software tools to extract, normalize, enumerate, and compute properties from structures
- Analysis applications to visualize, model, and predict activity from structure (structural data mining)

Chemical Informatics: Database
- Chemical warehousing, e.g., Reagent Selector, Compound Warehouse
- Oracle 8i architecture
- Star schema: one or more large fact tables linked to several smaller dimension tables
- Serviced by schedulers and database daemon processes
- Scalable from project data mart (100K structures) to corporate warehouse (10M structures)
- Multiple structure sources, including commercial databases (MDDR, ACD, Beilstein), in-house, and online data sources

Chemical Informatics: Software Tools
- Cheshire: MDL chemical scripting language
- Object-oriented, Java-like syntax; associative arrays for structures, queries, data; collections
- Perceive and set atom and bond properties, rings, fragments, Sgroup data
- Supports mapping and structure-walking applications
- Exposes MDL's SSKeys; allows user-defined similarity searches

Cheshire: An Example
Purpose: normalize nitro group representation

    function no2s() {
        var q1 = createmol("N+(=O)-O");
        var ic = 0;
        if ((c = map(q1)).count() > 0) {
            c.set(a_charge, 0);
            c.set(b_type, b_type_double);
            ic = ic + 1;
        }
        return (ic);
    }

Topological Distances

Gasteiger Charges

Common Substructures Algorithm
[Structure diagrams not reproduced]
- Find the stereocenters
- Collect the stereocenters, atoms alpha to them, and rings connected to them
- Compare the new structure with those already collected
- Is it unique? (No: add a hit to the list)
- Is it a substructure of an already existing structure (or vice versa)? (Yes: add a hit to the substructure's list and remove the "superstructure"; No: add it to the list)

Structural Diversity

SSKeys
Structure-based keys
- 166 bits: subset of unique and identified features
- 960 bits: optimized for searching; contains many-to-one mappings
- 2275 bits: many-to-one mappings removed

Modal Fingerprints
- Capture information about key bits set by a collection of molecules
- Threshold
- Weighting
- Similarity searches:
  N(q&t)/N(q) ("sub"-similarity)
  N(q&t)/N(q|t) (Tanimoto)
See also: Shemetulskis, et al., J. Chem. Inf. Comput. Sci. 1996, 36, 862-871.
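The two similarity measures above can be sketched in a few lines. This is a minimal illustration, not MDL's implementation: fingerprints are modeled as Python sets of bit positions, and the modal fingerprint is assumed to keep every bit that is set in at least a chosen fraction (`threshold`) of the molecules. The function names and the example bit values are hypothetical.

```python
def modal_fingerprint(fingerprints, threshold=0.5):
    """Return the bits set in at least `threshold` of the fingerprints.

    Assumption: this thresholding rule is one plausible reading of the
    slide, not a confirmed description of the original software.
    """
    fps = list(fingerprints)
    if not fps:
        return set()
    counts = {}
    for fp in fps:
        for bit in fp:
            counts[bit] = counts.get(bit, 0) + 1
    cutoff = threshold * len(fps)
    return {bit for bit, n in counts.items() if n >= cutoff}


def sub_similarity(q, t):
    """N(q&t)/N(q): fraction of the query bits also set in the target."""
    return len(q & t) / len(q) if q else 0.0


def tanimoto(q, t):
    """N(q&t)/N(q|t): shared bits over all bits set in either fingerprint."""
    union = len(q | t)
    return len(q & t) / union if union else 0.0


# Hypothetical key bits for three active molecules and one candidate.
actives = [{1, 2, 3, 5}, {1, 2, 4, 5}, {1, 2, 5, 8}]
modal = modal_fingerprint(actives, threshold=0.6)
candidate = {1, 2, 5, 9}
print(modal, sub_similarity(modal, candidate), tanimoto(modal, candidate))
```

Sub-similarity ignores extra bits in the target, so it favors targets that contain the whole query; Tanimoto penalizes those extra bits.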

Chemical Informatics: Analysis Application
- Visualization
- Unsupervised data mining: clustering
- Supervised data mining: classification, decision trees, recursive partitioning
- QSAR modeling: regression, neural networks, genetic algorithms
- Pharmacophore searching and docking

Chemical Informatics: Clustering
- Goal: rapid, simple UI, select cluster count
- Possibilities: hierarchical, Jarvis-Patrick, monothetic, k-means, binning
- Choice: sample-based k-means
- Fast, scalable, controllable, uses 166 keys
- Published algorithm: L. Kaufman & P. J. Rousseeuw, Finding Groups in Data, Wiley, 1991 (win-www.uia.ac.be/u/statis/programs)
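The idea behind sample-based clustering can be sketched as follows: cluster only a random sample, then assign every structure to the nearest resulting centroid. This is a generic illustration under stated assumptions, not the published algorithm cited above and not the MDL implementation; the key vectors are modeled as lists of 0/1 values and all names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=20, rng=None):
    """Plain Lloyd-style k-means; returns the final centroids."""
    rng = rng or random.Random(0)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = []
        for i, group in enumerate(groups):
            if group:
                updated.append([sum(col) / len(group) for col in zip(*group)])
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def sample_based_kmeans(points, k, sample_size, seed=0):
    """Cluster a random sample, then label every point by nearest centroid."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, rng=rng)
    return [nearest(p, centroids) for p in points]


# Hypothetical 4-bit key vectors for six structures.
keys = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
labels = sample_based_kmeans(keys, k=2, sample_size=6)
print(labels)
```

Because only the sample is clustered iteratively, the cost of the expensive step depends on the sample size rather than on the full database, which is what makes the approach scale.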

Chemical Informatics: Classification
- Goal: rapid, simple UI, interpretable results
- Possibilities: C4.5, FIRM, CART, various Bayesian methods, etc.
- Choices: C4.5/C4.5rules, FIRM
- Available, fast, simple I/O
- Published algorithms:
  C4.5: J. R. Quinlan, C4.5, Morgan Kaufmann, 1992.
  FIRM: D. Hawkins, UMN Statistics Dept. (ftp://ftp.stat.umn.edu/pub/firm/ibmfirm.exe)
  Others: WEKA, I. H. Witten et al. (pure Java) (http://www.cs.waikato.ac.nz/ml/weka/)
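Decision-tree programs of this kind choose, at each node, the descriptor that best separates the classes. The sketch below computes plain information gain for binary structural keys; C4.5 itself normalizes this quantity into a gain ratio, so this illustrates the underlying idea rather than reimplementing C4.5. The data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, key):
    """Entropy reduction from splitting on whether `key` is set.

    `rows` is a list of sets, each holding the key bits set for one structure.
    """
    with_key = [lab for row, lab in zip(rows, labels) if key in row]
    without_key = [lab for row, lab in zip(rows, labels) if key not in row]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_key, without_key) if part)
    return entropy(labels) - remainder


def best_key(rows, labels, keys):
    """Key with the highest information gain among `keys`."""
    return max(keys, key=lambda k: information_gain(rows, labels, k))


# Hypothetical data: key 1 is set only in the actives.
rows = [{1, 2}, {1, 3}, {2}, {3}]
labels = ["active", "active", "inactive", "inactive"]
print(best_key(rows, labels, [1, 2, 3]))
```

A tree is grown by applying this selection recursively to the two subsets produced by each split, which is also why the resulting rules stay interpretable: each one is a short list of keys that are present or absent.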

Example: 5-HT3 Antagonists
- Bind to the ligand-gated ion-channel receptor for serotonin.
- Therapeutically useful in treatment of emesis, anxiety, schizophrenia, drug dependence, and age-related memory loss.
- Often show mixed agonist/antagonist properties, depending on site of action.
- Example refs:
  Cappelli, et al., J. Med. Chem., 1998, 41, 728-741
  Rival, et al., J. Med. Chem., 1998, 41, 311-317

Approaches to Library Construction
- Reagent based
- Product based

Reagent Based Library Construction
- Build around a known scaffold
[Scaffold structure diagram not reproduced]
Y. Rival, et al., J. Med. Chem. 1998, 41, 311-317.

Arylpiperazine Interactions
[Structure diagram not reproduced; annotated regions: limited space (three sites), attractive short-range interactions, aromatic interactions]

One-Step Chemistry
[Reaction scheme not reproduced]

Diversity Around The Ring
ACD (Available Chemicals Directory)
- 70 imidoyl chlorides
- 26 with Mw < 250
- Small molecular volume substitution at R2 & R3
- Electron withdrawing & donating substitution at R1
[Structure diagram not reproduced: imidoyl chloride with substituents R1, R2, R3]

Selecting Reagents
- Substructure search
- Filter on M. Wt.
- Cluster filtered reagents

Product Based Library Construction
- Use modal fingerprints to characterize active compounds
- Choose a backbone (quinoline) for the compounds
- Search ACDSC for quinoline-containing molecules; refine the search with the modal fingerprint

Clustering Extracted Structures
- Cluster analysis on substructure keys
- Drilldown into cluster

Decision Tree Classification
Method:
- Select 200 5-HT3 antagonists from MDDR20001
- Select an equal number of most similar structures lacking reported 5-HT3 activity
- Use 166 User keys as descriptors
- Run a variety of classification programs, using 10-fold cross-validation
Example results:

    Method        % correct   Number of rules
    C4.5          91          14
    C4.5Rules     68          2
    Ripper        85          6
    Naïve Bayes   76          --
    Regression    77          --
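The 10-fold cross-validation used in the method above can be sketched as follows. This is a generic illustration, not the procedure built into any of the programs listed: the structures are shuffled, dealt into ten folds, and each fold is held out once while a model is trained on the rest. The `train` and `predict` callables and the majority-class baseline are hypothetical placeholders.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train, predict, k=10, seed=0):
    """Fraction of items classified correctly when each fold is held out once."""
    correct = 0
    for held_out in k_fold_indices(len(items), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        model = train([items[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(predict(model, items[i]) == labels[i] for i in held_out)
    return correct / len(items)


# Hypothetical baseline: always predict the most common training label.
def train_majority(train_items, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


# 200 actives plus 200 similar inactives, as in the method described above.
items = list(range(400))
labels = ["active"] * 200 + ["inactive"] * 200
folds = k_fold_indices(len(items))
accuracy = cross_validate(items, labels, train_majority, predict_majority)
print(len(folds), accuracy)
```

Every structure is predicted exactly once by a model that never saw it during training, so the reported percentage estimates performance on unseen compounds rather than fit to the training set.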

Classification Results
Training set classification results:

                  166 Keys             960 Keys
    Method        correct   # rules    correct   # rules
    C4.5          81%       80         84%       67
    C4.5Rules     68        2          69        12
    FIRM          73        5          78        7

- 402 structures (204 active, 198 inactive)
- Default program parameters
- 10-fold cross-validation results

Library Evaluation
- ACDSC: 1.4M-structure screening database
- Pick structures similar to the 960-key modal fingerprint (m.f.p.) and those similar to the 2275-key m.f.p.
- Test structures against decision tree models
Number of structures predicted to be active, by similarity basis:

    Model   960-key m.f.p.   2275-key m.f.p.
    C4.5    343/2908         379/3067
    FIRM    193/2908         472/3067
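The counts above are easier to compare as percentages. The short calculation below assumes each entry reads as "structures predicted active / structures selected by similarity"; that reading is an interpretation of the slide, and the variable names are hypothetical.

```python
# (predicted active, structures selected) per model and similarity basis,
# copied from the table above.
predicted = {
    ("C4.5", "960-key m.f.p."): (343, 2908),
    ("C4.5", "2275-key m.f.p."): (379, 3067),
    ("FIRM", "960-key m.f.p."): (193, 2908),
    ("FIRM", "2275-key m.f.p."): (472, 3067),
}


def hit_rate(n_active, n_selected):
    """Percentage of the selected structures predicted to be active."""
    return 100.0 * n_active / n_selected


rates = {name: round(hit_rate(*counts), 1) for name, counts in predicted.items()}
for (model, basis), rate in rates.items():
    print(f"{model:5s} {basis:16s} {rate:5.1f}%")
```

On this reading, C4.5 predicts roughly 12% of either selection to be active, while FIRM's rate more than doubles between the 960-key and 2275-key selections.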

Conclusions
- The 166- and 960-key sets performed similarly; 960 keys were marginally better for training set predictions.
- Decision tree performance correlated with rule size. FIRM gave the best performance/rule-size ratio.
- Very little overlap was observed between keys selected by the decision tree programs and those selected in the modal fingerprints, suggesting the approaches are complementary.
- The 960-key set can be expanded to 2275 unrolled keys; similarity to these keys and to the 960-key set yields different selections of structures, which behave differently in decision tree predictions.

Future Work
- Run classifications using the 2275-key set.
- Investigate other classifiers (kNN, Bayesian, CART, etc.)
- Investigate the relationship between similarity to the modal fingerprint and level of biological activity.
- Develop classification and fingerprint models for ADME properties.
- Utilize query features in modal fingerprints....

Acknowledgments
- Doug Henry, David Evans, and Maurizio Bronzetti
- Richard Briggs, Burt Leland, and the rest of the Cheshire Team
- Ali Ozkabak and the rest of the Reagent Selector team