Data Mining in the Chemical Industry. Overview of presentation

Similar documents
Chemoinformatics and information management. Peter Willett, University of Sheffield, UK

Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME

Introduction to Chemoinformatics and Drug Discovery

Machine Learning Concepts in Chemoinformatics

JCICS Major Research Areas

An Integrated Approach to in-silico

Solved and Unsolved Problems in Chemoinformatics

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

Early Stages of Drug Discovery in the Pharmaceutical Industry

Navigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland

Computational Methods and Drug-Likeness. Benjamin Georgi und Philip Groth Pharmakokinetik WS 2003/2004

Introduction. OntoChem

Similarity methods for ligandbased virtual screening

Introduction to Chemoinformatics

Data Quality Issues That Can Impact Drug Discovery

AMRI COMPOUND LIBRARY CONSORTIUM: A NOVEL WAY TO FILL YOUR DRUG PIPELINE

EMPIRICAL VS. RATIONAL METHODS OF DISCOVERING NEW DRUGS

In silico pharmacology for drug discovery

Biologically Relevant Molecular Comparisons. Mark Mackey

Data Analysis in the Life Sciences - The Fog of Data -

FRAUNHOFER IME SCREENINGPORT

Using AutoDock for Virtual Screening

The Changing Requirements for Informatics Systems During the Growth of a Collaborative Drug Discovery Service Company. Sally Rose BioFocus plc

FROM MOLECULAR FORMULAS TO MARKUSH STRUCTURES

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing

Chemoinformatics and Drug Discovery

Receptor Based Drug Design (1)

The Case for Use Cases

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Structural biology and drug design: An overview

Structure-Activity Modeling - QSAR. Uwe Koch

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Drug Informatics for Chemical Genomics...

Building innovative drug discovery alliances. Just in KNIME: Successful Process Driven Drug Discovery

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

Similarity Search. Uwe Koch

QSAR in Green Chemistry

Introduction to Chemoinformatics

Capturing Chemistry. What you see is what you get In the world of mechanism and chemical transformations

Bioisosteres in Medicinal Chemistry

Retrieving hits through in silico screening and expert assessment M. N. Drwal a,b and R. Griffith a

Introducing a Bioinformatics Similarity Search Solution

In Silico Investigation of Off-Target Effects

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

Dispensing Processes Profoundly Impact Biological, Computational and Statistical Analyses

Comprehensive Chemoinformatics since Web-based, client/server, and toolkit approaches. Native Oracle (cartridge) and Microsoft technology.

DivCalc: A Utility for Diversity Analysis and Compound Sampling

Ultra High Throughput Screening using THINK on the Internet

Cheminformatics Role in Pharmaceutical Industry. Randal Chen Ph.D. Abbott Laboratories Aug. 23, 2004 ACS

COMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA

Pooling Experiments for High Throughput Screening in Drug Discovery

Chemical library design

Analysis of Activity Landscapes, Activity Cliffs, and Selectivity Cliffs. Jürgen Bajorath Life Science Informatics University of Bonn

has its own advantages and drawbacks, depending on the questions facing the drug discovery.

QSAR Modeling of Human Liver Microsomal Stability Alexey Zakharov

Information Extraction from Chemical Images. Discovery Knowledge & Informatics April 24 th, Dr. Marc Zimmermann

Searching Substances in Reaxys

Ignasi Belda, PhD CEO. HPC Advisory Council Spain Conference 2015

Interactive Feature Selection with

Development of a Structure Generator to Explore Target Areas on Chemical Space

Using Self-Organizing maps to accelerate similarity search

Large scale classification of chemical reactions from patent data

Design and Synthesis of the Comprehensive Fragment Library

Analyzing Small Molecule Data in R

QSAR of Microtubule Stabilizing Dictyostatins

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

DISCOVERING new drugs is an expensive and challenging

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Dr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre

Performing a Pharmacophore Search using CSD-CrossMiner

Bioinformatics Workshop - NM-AIST

Fast similarity searching making the virtual real. Stephen Pickett, GSK

Control Strategies for Small Molecule Components of Antibody-Drug Conjugates

Structure-Based Drug Discovery An Overview

Representation of molecular structures. Coutersy of Prof. João Aires-de-Sousa, University of Lisbon, Portugal

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

Chemical Databases: Encoding, Storage and Search of Chemical Structures

Building blocks for automated elucidation of metabolites: Machine learning methods for NMR prediction

The Schrödinger KNIME extensions

This is a repository copy of Chemoinformatics techniques for data mining in files of two-dimensional and three-dimensional chemical molecules.

Use of data mining and chemoinformatics in the identification and optimization of high-throughput screening hits for NTDs

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility

Reaxys Medicinal Chemistry Fact Sheet

COMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE

Chemoinformatics: the first half century

Molecular Graphics. Molecular Graphics Expt. 1 1

PROVIDING CHEMINFORMATICS SOLUTIONS TO SUPPORT DRUG DISCOVERY DECISIONS

Canonical Line Notations

Clustering Ambiguity: An Overview

Statistical concepts in QSAR.

Exploring the chemical space of screening results

Molecular Complexity Effects and Fingerprint-Based Similarity Search Strategies

Advanced Medicinal Chemistry SLIDES B

Patent Searching using Bayesian Statistics

Translating Methods from Pharma to Flavours & Fragrances

RECENT TRENDS IN PHARMACEUTICAL CHEMISTRY FOR DRUG DISCOVERY

Plan. Day 2: Exercise on MHC molecules.

Developing CAS Products for Substructure Searching by Chemists. Linda Toler

Progress of Compound Library Design Using In-silico Approach for Collaborative Drug Discovery

Transcription:

Data Mining in the Chemical Industry Glenn J. Myatt, Ph.D. Partner, Myatt & Johnson, Inc. glenn.myatt@gmail.com verview of presentation verview of the chemical industry Example of the pharmaceutical industry Chemistry-based data mining Parsing, representation, matching, generating descriptors, data mining approaches Examples of how data mining is used within the chemical industry Identifying diverse compounds Analysis of high throughput screening Safety prediction

verview of the chemical industry Chemical industry overview The chemical industry refers to an industry involved in the production of chemicals The industry includes: petrochemicals agrochemicals pharmaceuticals polymers paints oleochemicals 2

Pharmaceutical industry World s largest manufacturing industry Pharmaceutical sales: $357 billion (2) USA(4%); Europe(24%); Japan(3%) R&D Time and Costs 3.6 years to discover a new drug Rising costs of drug discovery 976 - $54 million 987 - $23 million 22 - $82 million nly /3 of known diseases can be treated effective Examples of drugs Zyflo by Abbott Anti-asthmatic agent Cozaar by Merck Treatment for hypertension and congestive heart failure Protonix by Wyeth Anti-ulcer agent Prozac by Lilly Antidepressive agent 3

Drug discovery process Discovery Development 6.5 years Approval.9 years ID Submission DA Submission Target Discovery (2 years) Phase I Lead identification (9 months) Phase II Lead optimization (8 months) Phase III Pre-clinical (2 years) Phase IV Target Hits Leads Candidates Target discovery A target is a protein that plays a role in the disease The goal of drug discovery is to identify a new chemical (drug) that modifies the behavior of the protein 4

Lead identification Pharmaceutical companies have historical collections of chemicals ( million approx.) These chemicals will be screened against the target assay (test that indicates whether a chemical modifies the behavior of the target) A chemical showing a positive response is considered a hit at this stage Lead series (sets of similar chemicals) will be uncovered through an analysis of the data (both chemical and biological screening data) High throughput screening The test is performed on small plates The results of the testing are automatically read The process of screening the chemicals is automated 5

Screening results By analyzing the screening data, lead series can be identified Class Biological screening data Chemical 8. 8. 7.92 7.66 6.92 6.49 6.8 Lead optimization Lead series Design new compounds Screen Synthesize or acquire compounds 6

Lead optimization Synthesize close analogs to determine how substituents affect the biological response data Lead optimization ptimize over multiple rows of data: () primary screening data (2) selected screening data (3) ADME screening data 7

Pre-clinical Use of model system Absorption Distribution Metabolism Elimination Toxicity (safety) ID submission and patent application Prior to clinical trials submit ID to Regulatory Agencies (FDA) Allow company to conduct a clinical trial Patent Exclusivity period from approval date 8

Drug development Issues 6 3 CE A p p ro vals 5 4 3 2 CE approvals CE approvals fit line R&D expenditure 962 967 972 977 982 987 992 997 22 2 R& D E xpenditures (Billions $) 9

Compounds withdrawals Source: IBM Why drugs fail?

Issues Making sense of large volumes of data Genomics, HTS, Lead optimization, ADME(T), Clinical trials Integration of data across silos Accelerating the pace of drug discovery Reducing compound attrition Chemistry-based data mining

Data mining in chemistry - Chemoinformatics Journals Journal of Chemical Information and Computer Sciences Journal of Computer-Aided Molecular Design Journal of Molecular Graphics and Modelling Books An Introduction to Chemoinformatics (Hardcover) by Andrew R. Leach, Valerie J. Gillet Chemoinformatics: A Textbook (Hardcover) by Johann Gasteiger (Editor), Thomas Engel (Editor) Meetings International Conference on Chemoinformatics Discovery Knowledge & Informatics 27 The Fourth Joint Sheffield Conference on Chemoinformatics Virtual Discovery. Computer-Aided Drug Design and Screening Web sites http://www.chemoinf.com/ http://www.iainm.demon.co.uk/indexnew.htm Issues to consider when data mining chemical data Parsing Chemical data Related information Matching Exact Substructure Similarity Descriptors Data mining methods 2

How to convert a chemical into a form to be read by a computer H Cl Computer readable representations of chemicals Describe the connection table in a computer readable form: MLFILE and SD File Contain information on the atoms (including coordinates), bonds, connections and associated information SMILES, WL, CML,. 3

MolFile example Cl H Structure- Structure- ACD/Labs933757 V2 6.75-7.6823. C 6.75-8.7337. C 5.645-7.567. C 5.645-9.2593. C 4.254-7.6823. C 4.254-8.7337. C 7.8959-8.7337. C 7.8959-7.6824. C 6.9855-7.567. 3.49-9.3. Cl 3 2 4 3 5 2 4 6 2 5 6 8 7 2 9 8 2 2 9 2 7 6 M ED Matching chemical structures Aromaticity Cl Cl Representation + A B A B Stereochemistry H Tautomerism H A B A B 4

eed to annotate atoms and bonds (and the whole compound) to effectively search C H C C C Cl Acyclic. C C C Cl Acyclic Hs= C Cyclic Aromatic Hs= All atoms and bonds can be annotated with additional information Graphs odes (atoms) and edges (bonds) have properties on-calculated E.g.; Charge, bond type, atoms type,. Calculated Cyclic, number of hydrogens,. 5

Perception of chemical features Rings and chains Aromaticity Stereocenters lefinic Double Bonds Hydrogens Hybridisation levels Canonicalization Symmetry. Hydrogen perception example ow compound is in a graph we can easily determine the number hydrogens using the atom type, attached bonds and charge Cl 3 2 7 4 5 8 6 9 Atom# Type Hydrogens Cl 2 C 3 C 4 C 5 C 6 C 7 C 8 C 9 6

Exact matching Exact Search H H 2 Query Chemicals selected from a database Substructure queries Structure drawing packages Add additional restriction on the atom and bonds: Cyclic/acyclic umber of hydrogens Closed to substitution.. MLFILE 7

MLFILE representation of a query -ISIS- 293732D Aromatic bond Ch Bond in chain only 999 V2 4.8-2.883..7444-2.825. C.739-3.6357. C 2.4482-4.5. C 3.63-3.6424. C 3.645-2.86. C 2.4547-2.453. C 3.875-2.3958. C 5.525-2.397. C 6.225-2.842. C 5 6 3 4 6 7 2 7 2 2 3 2 6 8 8 4 4 5 2 9 9 2 M ED Query Substructure search example Aromatic bond Ch Bond in chain only A B E C D 8

Substructure search example Query Molecular Descriptors umber of hydrogen bond acceptors umber of hydrogen bond donors umber of rings umber of rotatable bonds Molecular weight Hydrophobicity Molar refractivity Topological indices Kappa shape indices Electrotopological state indices Polar surface area 2D fingerprints Atom-pairs and topological torsions Pharmacophore keys 9

Single Acyclic on-terminal Rotatable Bonds Rotatable bond Calculating rotatable bonds Rotatable bonds = Rotatable bonds = Rotatable bonds = 6 Rotatable bonds = Rotatable bonds = 3 2

2D Fingerprints Makes use of a dictionary of fragments (pre-defined substructures) Ak Any H Describing chemicals using fragment descriptors Fragment dictionary H H... Chemicals H H H H H H H 2. 2

Data Mining Methods Clustering Decision trees Principal component analysis Support vector machines k Decision forests Bayesian networks Genetic algorithms Identifying diverse chemicals Selecting diverse chemicals from commercially available chemical collections is often used to supplement in-house screening sets 22

Approach to identifying diverse chemicals Select and calculate chemical descriptors Determine the similarity between chemicals Cluster chemicals based on these descriptors Select representative chemicals from each group generated Describing chemicals using fragment descriptors Fragment dictionary H H... Chemicals H H H H H H H 2. 23

Similarity Tanimoto is one example SAB = c / (a + b - c) a is the number of bits set to one in A b is the the number of bits set to one in B c is the number of bits that are in both A and B Calculating similarity Fragment dictionary H H... Chemicals H H H H H H H 2. S = c / (a + b - c) S = / (5 + 3 - ) S =.4 24

Using 7 chemicals to illustrate B D G F E C A This type of analysis is usually performed on tens of thousands of chemicals Fingerprint table Chemicals Fragment dictionary ids 25

Hierarchical agglomerative clustering based on the fingerprints C D A B E F G Uses of the Euclidean distance and clusters using the average linkage joining rule Generating clusters C D Cluster 3 A B Cluster E F Cluster 2 G Adjusts the distance cut-off to change number of clusters 26

Cluster results at.7 cut-off Cluster B A Cluster 2 G F E Cluster 3 D C Selecting a representative from each cluster B G D 27

Analyzing HTS data Use of decision trees Supervised learning approach Partitions the set based on the descriptors Uses the biological data to determine the groups Used to quickly identify biologically interesting groups of chemicals 28

Generating decision trees. Find the most significant feature to split Feature-A 2. Partition the set according to those compounds containing the feature and those without 3. Partition each child node until a threshold is reached Whole Set Feature-A has the highest score Feature-B has the highest score Chemicals without feature-a Chemicals with feature-a Feature-C has the highest score Compounds without feature-b Compounds with feature-b Compounds without feature-c Compounds with feature-c Using 7 chemicals to illustrate B D G F E C A This type of analysis is usually performed on hundreds of thousands of chemicals 29

Potency data Structure ID Data A 4.44 B 3.5 C 5.2 D 5.4 E 5.23 F 9.33 G 8.7 Fingerprints Chemical acridine sec-amine(h) nitro pyridine alcohol thiane(h), 4-oxo- imidazole A B C D E F G 3

Decision Tree RP Tree G 8.7 F 9.33 3

RP Tree 4.44 A B 3.5 Predicting chemical safety 32

Safety prediction example Prepare data Integrate, normalize data, generate descriptors Prune descriptors Remove constants, desciptors lacking in information Understand chemical space Subset data Understand mechanisms of action Build and optimize model(s) Assess models Evaluate quality Combine models Applicability domain Apply to untested chemicals, where chemical is within the domain of the model Building model (dev tox) Structural descriptors Response: positive/negative/equivocal/unknown Use a classification tree model http://ecb.jrc.it/qsar/home.php?cteu=/qsar/information_sources/information_datasets.php 33

Table used to build prediction model Classification tree 34

Classification tree assessment The quality of the predictive model is ultimately dependent on the quality of the data. In this example, the model is not very good at predicting safe chemicals since the original data lacked negative data points. Summary The chemical industry generates information about chemicals and its relationship to drug potency, safety, agrochemicals, Data mining is used extensively to accelerate the development of new products Representing and describing chemicals is a large part of the challenge of data mining chemical information 35