Data Mining in the Chemical Industry Glenn J. Myatt, Ph.D. Partner, Myatt & Johnson, Inc. glenn.myatt@gmail.com verview of presentation verview of the chemical industry Example of the pharmaceutical industry Chemistry-based data mining Parsing, representation, matching, generating descriptors, data mining approaches Examples of how data mining is used within the chemical industry Identifying diverse compounds Analysis of high throughput screening Safety prediction
verview of the chemical industry Chemical industry overview The chemical industry refers to an industry involved in the production of chemicals The industry includes: petrochemicals agrochemicals pharmaceuticals polymers paints oleochemicals 2
Pharmaceutical industry World s largest manufacturing industry Pharmaceutical sales: $357 billion (2) USA(4%); Europe(24%); Japan(3%) R&D Time and Costs 3.6 years to discover a new drug Rising costs of drug discovery 976 - $54 million 987 - $23 million 22 - $82 million nly /3 of known diseases can be treated effective Examples of drugs Zyflo by Abbott Anti-asthmatic agent Cozaar by Merck Treatment for hypertension and congestive heart failure Protonix by Wyeth Anti-ulcer agent Prozac by Lilly Antidepressive agent 3
Drug discovery process Discovery Development 6.5 years Approval.9 years ID Submission DA Submission Target Discovery (2 years) Phase I Lead identification (9 months) Phase II Lead optimization (8 months) Phase III Pre-clinical (2 years) Phase IV Target Hits Leads Candidates Target discovery A target is a protein that plays a role in the disease The goal of drug discovery is to identify a new chemical (drug) that modifies the behavior of the protein 4
Lead identification Pharmaceutical companies have historical collections of chemicals ( million approx.) These chemicals will be screened against the target assay (test that indicates whether a chemical modifies the behavior of the target) A chemical showing a positive response is considered a hit at this stage Lead series (sets of similar chemicals) will be uncovered through an analysis of the data (both chemical and biological screening data) High throughput screening The test is performed on small plates The results of the testing are automatically read The process of screening the chemicals is automated 5
Screening results By analyzing the screening data, lead series can be identified Class Biological screening data Chemical 8. 8. 7.92 7.66 6.92 6.49 6.8 Lead optimization Lead series Design new compounds Screen Synthesize or acquire compounds 6
Lead optimization Synthesize close analogs to determine how substituents affect the biological response data Lead optimization ptimize over multiple rows of data: () primary screening data (2) selected screening data (3) ADME screening data 7
Pre-clinical Use of model system Absorption Distribution Metabolism Elimination Toxicity (safety) ID submission and patent application Prior to clinical trials submit ID to Regulatory Agencies (FDA) Allow company to conduct a clinical trial Patent Exclusivity period from approval date 8
Drug development Issues 6 3 CE A p p ro vals 5 4 3 2 CE approvals CE approvals fit line R&D expenditure 962 967 972 977 982 987 992 997 22 2 R& D E xpenditures (Billions $) 9
Compounds withdrawals Source: IBM Why drugs fail?
Issues Making sense of large volumes of data Genomics, HTS, Lead optimization, ADME(T), Clinical trials Integration of data across silos Accelerating the pace of drug discovery Reducing compound attrition Chemistry-based data mining
Data mining in chemistry - Chemoinformatics Journals Journal of Chemical Information and Computer Sciences Journal of Computer-Aided Molecular Design Journal of Molecular Graphics and Modelling Books An Introduction to Chemoinformatics (Hardcover) by Andrew R. Leach, Valerie J. Gillet Chemoinformatics: A Textbook (Hardcover) by Johann Gasteiger (Editor), Thomas Engel (Editor) Meetings International Conference on Chemoinformatics Discovery Knowledge & Informatics 27 The Fourth Joint Sheffield Conference on Chemoinformatics Virtual Discovery. Computer-Aided Drug Design and Screening Web sites http://www.chemoinf.com/ http://www.iainm.demon.co.uk/indexnew.htm Issues to consider when data mining chemical data Parsing Chemical data Related information Matching Exact Substructure Similarity Descriptors Data mining methods 2
How to convert a chemical into a form to be read by a computer H Cl Computer readable representations of chemicals Describe the connection table in a computer readable form: MLFILE and SD File Contain information on the atoms (including coordinates), bonds, connections and associated information SMILES, WL, CML,. 3
MolFile example Cl H Structure- Structure- ACD/Labs933757 V2 6.75-7.6823. C 6.75-8.7337. C 5.645-7.567. C 5.645-9.2593. C 4.254-7.6823. C 4.254-8.7337. C 7.8959-8.7337. C 7.8959-7.6824. C 6.9855-7.567. 3.49-9.3. Cl 3 2 4 3 5 2 4 6 2 5 6 8 7 2 9 8 2 2 9 2 7 6 M ED Matching chemical structures Aromaticity Cl Cl Representation + A B A B Stereochemistry H Tautomerism H A B A B 4
eed to annotate atoms and bonds (and the whole compound) to effectively search C H C C C Cl Acyclic. C C C Cl Acyclic Hs= C Cyclic Aromatic Hs= All atoms and bonds can be annotated with additional information Graphs odes (atoms) and edges (bonds) have properties on-calculated E.g.; Charge, bond type, atoms type,. Calculated Cyclic, number of hydrogens,. 5
Perception of chemical features Rings and chains Aromaticity Stereocenters lefinic Double Bonds Hydrogens Hybridisation levels Canonicalization Symmetry. Hydrogen perception example ow compound is in a graph we can easily determine the number hydrogens using the atom type, attached bonds and charge Cl 3 2 7 4 5 8 6 9 Atom# Type Hydrogens Cl 2 C 3 C 4 C 5 C 6 C 7 C 8 C 9 6
Exact matching Exact Search H H 2 Query Chemicals selected from a database Substructure queries Structure drawing packages Add additional restriction on the atom and bonds: Cyclic/acyclic umber of hydrogens Closed to substitution.. MLFILE 7
MLFILE representation of a query -ISIS- 293732D Aromatic bond Ch Bond in chain only 999 V2 4.8-2.883..7444-2.825. C.739-3.6357. C 2.4482-4.5. C 3.63-3.6424. C 3.645-2.86. C 2.4547-2.453. C 3.875-2.3958. C 5.525-2.397. C 6.225-2.842. C 5 6 3 4 6 7 2 7 2 2 3 2 6 8 8 4 4 5 2 9 9 2 M ED Query Substructure search example Aromatic bond Ch Bond in chain only A B E C D 8
Substructure search example Query Molecular Descriptors umber of hydrogen bond acceptors umber of hydrogen bond donors umber of rings umber of rotatable bonds Molecular weight Hydrophobicity Molar refractivity Topological indices Kappa shape indices Electrotopological state indices Polar surface area 2D fingerprints Atom-pairs and topological torsions Pharmacophore keys 9
Single Acyclic on-terminal Rotatable Bonds Rotatable bond Calculating rotatable bonds Rotatable bonds = Rotatable bonds = Rotatable bonds = 6 Rotatable bonds = Rotatable bonds = 3 2
2D Fingerprints Makes use of a dictionary of fragments (pre-defined substructures) Ak Any H Describing chemicals using fragment descriptors Fragment dictionary H H... Chemicals H H H H H H H 2. 2
Data Mining Methods Clustering Decision trees Principal component analysis Support vector machines k Decision forests Bayesian networks Genetic algorithms Identifying diverse chemicals Selecting diverse chemicals from commercially available chemical collections is often used to supplement in-house screening sets 22
Approach to identifying diverse chemicals Select and calculate chemical descriptors Determine the similarity between chemicals Cluster chemicals based on these descriptors Select representative chemicals from each group generated Describing chemicals using fragment descriptors Fragment dictionary H H... Chemicals H H H H H H H 2. 23
Similarity Tanimoto is one example SAB = c / (a + b - c) a is the number of bits set to one in A b is the the number of bits set to one in B c is the number of bits that are in both A and B Calculating similarity Fragment dictionary H H... Chemicals H H H H H H H 2. S = c / (a + b - c) S = / (5 + 3 - ) S =.4 24
Using 7 chemicals to illustrate B D G F E C A This type of analysis is usually performed on tens of thousands of chemicals Fingerprint table Chemicals Fragment dictionary ids 25
Hierarchical agglomerative clustering based on the fingerprints C D A B E F G Uses of the Euclidean distance and clusters using the average linkage joining rule Generating clusters C D Cluster 3 A B Cluster E F Cluster 2 G Adjusts the distance cut-off to change number of clusters 26
Cluster results at.7 cut-off Cluster B A Cluster 2 G F E Cluster 3 D C Selecting a representative from each cluster B G D 27
Analyzing HTS data Use of decision trees Supervised learning approach Partitions the set based on the descriptors Uses the biological data to determine the groups Used to quickly identify biologically interesting groups of chemicals 28
Generating decision trees. Find the most significant feature to split Feature-A 2. Partition the set according to those compounds containing the feature and those without 3. Partition each child node until a threshold is reached Whole Set Feature-A has the highest score Feature-B has the highest score Chemicals without feature-a Chemicals with feature-a Feature-C has the highest score Compounds without feature-b Compounds with feature-b Compounds without feature-c Compounds with feature-c Using 7 chemicals to illustrate B D G F E C A This type of analysis is usually performed on hundreds of thousands of chemicals 29
Potency data Structure ID Data A 4.44 B 3.5 C 5.2 D 5.4 E 5.23 F 9.33 G 8.7 Fingerprints Chemical acridine sec-amine(h) nitro pyridine alcohol thiane(h), 4-oxo- imidazole A B C D E F G 3
Decision Tree RP Tree G 8.7 F 9.33 3
RP Tree 4.44 A B 3.5 Predicting chemical safety 32
Safety prediction example Prepare data Integrate, normalize data, generate descriptors Prune descriptors Remove constants, desciptors lacking in information Understand chemical space Subset data Understand mechanisms of action Build and optimize model(s) Assess models Evaluate quality Combine models Applicability domain Apply to untested chemicals, where chemical is within the domain of the model Building model (dev tox) Structural descriptors Response: positive/negative/equivocal/unknown Use a classification tree model http://ecb.jrc.it/qsar/home.php?cteu=/qsar/information_sources/information_datasets.php 33
Table used to build prediction model Classification tree 34
Classification tree assessment The quality of the predictive model is ultimately dependent on the quality of the data. In this example, the model is not very good at predicting safe chemicals since the original data lacked negative data points. Summary The chemical industry generates information about chemicals and its relationship to drug potency, safety, agrochemicals, Data mining is used extensively to accelerate the development of new products Representing and describing chemicals is a large part of the challenge of data mining chemical information 35