Large scale classification of chemical reactions from patent data


Large scale classification of chemical reactions from patent data
Gregory Landrum
NIBR Informatics, Basel, Novartis Institutes for BioMedical Research
10th International Conference on Chemical Structures / 10th German Conference on Chemoinformatics

Outline
- Public data sources and reactions
- Fingerprints for reactions
- Validation:
  - Machine learning
  - Clustering
- Application: models for predicting yield

Public data sources in cheminformatics: an aside at the beginning
Publicly available data sources for small molecules and their biological activities/interactions: PDB, PubChem, ChEMBL, etc.
Publicly available data sources for the chemistry behind how those molecules were actually made (i.e. reactions): pretty much nothing until recently.
Plenty of data is locked up in large commercial databases and pharmaceutical companies' ELNs; very little is in the open.
The public/open point is important for collaboration and reproducibility.

A large, public source of chemical reactions
Not just what we made, but how we made it
Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]
Reactions classified using NameRxn, when possible, into 318 standard types: >599,000 classified reactions [2]
[1] Lowe DM: Extraction of chemical structures and reactions from the literature. PhD thesis. University of Cambridge: Cambridge, UK; 2012.
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software): http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/

More about the classes
[Figure: frequency of reaction classes]
20 most common classes:
Count   Class   Name
44675   2.1.2   Carboxylic acid + amine reaction
39297   1.7.9   Williamson ether synthesis
28194   2.1.1   Amide Schotten-Baumann
26739   1.3.7   Chloro N-arylation
22400   1.6.2   Bromo N-alkylation
20465   7.1.1   Nitro to amino
20405   1.6.4   Chloro N-alkylation
17226   6.2.2   CO2H-Me deprotection
16602   6.1.1   N-Boc deprotection
16021   6.2.1   CO2H-Et deprotection
12952   1.2.1   Aldehyde reductive amination
12250   2.2.3   Sulfonamide Schotten-Baumann
10659   11.9    Separation
 8538   3.1.5   Bromo Suzuki-type coupling
 7261   1.7.7   Mitsunobu aryl ether synthesis
 7102   6.3.7   Methoxy to hydroxy
 7071   3.3.1   Sonogashira coupling
 6472   3.1.1   Bromo Suzuki coupling
 6383   1.8.5   Thioether synthesis
 5791   9.1.6   Hydroxy to chloro

Got the reactions, what about reaction fingerprints?
Criteria for them to be useful:
Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
Question 2: are similar reactions similar according to the fingerprints?
Test: do related reactions cluster together?

Our toolbox: the RDKit
Open-source C++ toolkit for cheminformatics
Wrappers for Python (2.x), Java, C#
Functionality:
- 2D and 3D molecular operations
- Descriptor generation for machine learning
- PostgreSQL database cartridge for substructure and similarity searching
- KNIME nodes
- IPython integration
- Lucene integration (experimental)
Supports Mac/Windows/Linux
Releases every 6 months
Business-friendly BSD license
Code: https://github.com/rdkit
http://www.rdkit.org

Similarity and reactions
What are we talking about?
These two reactions are both type 1.2.5 Ketone reductive amination
It's obvious that these are the same, right?

Got the reactions, what about reaction fingerprints?
Start simple: use difference fingerprints:
$$FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i \qquad FP_{\mathrm{Prods}} = \sum_{i \in \mathrm{Products}} FP_i$$
$$FP_{\mathrm{Rxn}} = FP_{\mathrm{Prods}} - FP_{\mathrm{Reacts}}$$
Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821-832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163-1184 (2009).
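The difference fingerprint on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RDKit implementation used in the talk: each molecule's fingerprint is assumed to be a plain dictionary mapping a feature key to a count, and the function names and feature keys are hypothetical.

```python
from collections import Counter


def sum_fingerprints(fingerprints):
    """Add up count fingerprints (dicts of feature -> count) for a set of molecules."""
    total = Counter()
    for fp in fingerprints:
        total.update(fp)  # Counter.update adds counts key by key
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """Return FP_Rxn = FP_Prods - FP_Reacts, keeping only non-zero entries."""
    reacts = sum_fingerprints(reactant_fps)
    prods = sum_fingerprints(product_fps)
    diff = {}
    for key in set(reacts) | set(prods):
        delta = prods[key] - reacts[key]  # missing keys count as 0
        if delta != 0:
            diff[key] = delta
    return diff


# Toy example with made-up feature keys (assumption, for illustration only).
reactants = [{"C=O": 1, "C-C": 2}, {"N-H": 2}]
products = [{"C-N": 1, "C-C": 2, "N-H": 1}]
rxn_fp = difference_fingerprint(reactants, products)
```

Features that are unchanged between reactants and products cancel out, so the result describes only what the reaction changed.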

Refine the fingerprints a bit
Text-mined reactions often include catalysts, reagents, or solvents in the reactants
Explore two options for handling this:
1. Decrease the weight of reactant molecules where too many of the bits are not present in the product fingerprint
2. Decrease the weight of reactant molecules where too many atoms are unmapped
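A rough sketch of the first option follows. The slide does not give the weighting function or the cutoff, so the overlap measure, the `min_overlap` threshold, and the `low_weight` value below are all assumptions chosen for illustration; fingerprints are again assumed to be dictionaries of feature counts.

```python
def bit_overlap(reactant_fp, product_fp):
    """Fraction of the reactant's set bits that also appear in the product fingerprint."""
    reactant_bits = {k for k, v in reactant_fp.items() if v}
    if not reactant_bits:
        return 0.0
    product_bits = {k for k, v in product_fp.items() if v}
    return len(reactant_bits & product_bits) / len(reactant_bits)


def reactant_weight(reactant_fp, product_fp, min_overlap=0.2, low_weight=0.1):
    """Down-weight a 'reactant' that shares too few bits with the product.

    min_overlap and low_weight are hypothetical values, not taken from the talk.
    """
    if bit_overlap(reactant_fp, product_fp) >= min_overlap:
        return 1.0
    return low_weight


# Toy example: the second "reactant" shares nothing with the product,
# as a solvent or catalyst typically would not.
product = {"a": 1, "b": 2, "c": 1}
true_reactant = {"a": 1, "b": 1}
probable_solvent = {"x": 1, "y": 1}
weights = [reactant_weight(fp, product) for fp in (true_reactant, probable_solvent)]
```

The second option could be handled the same way, with the overlap replaced by the fraction of the molecule's atoms that carry an atom-map number.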

Are the fingerprints useful?
Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
Question 2: are similar reactions similar according to the fingerprints?
Test: do related reactions cluster together?

Machine learning and chemical reactions
Validation set: the 68 reaction types with at least 2000 instances from the patent data set
- Resolution reaction types removed (e.g. 11.9 Separation and 11.1 Chiral separation)
- Final: 66 reaction types
Process:
- Training set is 200 random instances of each reaction type
- Test set is 800 random instances of each reaction type
- Learning: random forest (scikit-learn)
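The per-class sampling described on this slide might look like the sketch below. It assumes that the training and test instances are drawn without overlap and that classes with too few instances are skipped; the slide states neither, and the function name and seed are hypothetical. The selected indices would then be used to fit and evaluate a random forest.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Pick n_train training and n_test test indices at random for each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train + n_test:
            continue  # assumption: skip classes that are too small
        picked = rng.sample(members, n_train + n_test)  # sampling without replacement
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Small toy example: classes A and B are large enough, class C is not.
toy_labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 3
train_idx, test_idx = split_per_class(toy_labels, n_train=2, n_test=3, seed=1)
```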

Learning reaction classes
Results for test data, overall:
Recall: 0.94
Precision: 0.94
Accuracy: 0.94
For a 66-class classifier, this looks pretty good!
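For reference, overall figures of this kind can be computed from true and predicted labels as sketched below. The slide does not say how the per-class values were averaged; the sketch assumes an unweighted (macro) average and simply skips classes that were never predicted when averaging precision. With a balanced test set such as the one used here, accuracy and macro-averaged recall coincide.

```python
from collections import Counter
from statistics import mean


def overall_metrics(y_true, y_pred):
    """Macro-averaged recall and precision plus accuracy (averaging is an assumption)."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls = [correct[c] / n for c, n in true_count.items()]
    precisions = [correct[c] / n for c, n in pred_count.items()]
    return {
        "recall": mean(recalls),
        "precision": mean(precisions),
        "accuracy": sum(correct.values()) / len(y_true),
    }


# Toy two-class example with equal class sizes.
metrics = overall_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```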

Learning reaction classes
Confusion matrix for test data
~94% accuracy; much of the confusion is between related types
[Figure: confusion matrix; labelled classes: Bromo Suzuki coupling, Bromo N-arylation, Bromo Suzuki-type coupling]

Are the fingerprints useful?
Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
Question 2: are similar reactions similar according to the fingerprints?
Test: do related reactions cluster together?

Clustering reactions
Reaction similarity validation set: the 66 most common reaction types from the patent data set
Look at the homogeneity of clusters with at least 10 members
[Figure: example cluster members, each labelled 1.2.5 Ketone reductive amination]
Interpretation: <30% of clusters are <90% homogeneous
Interpretation: <40% of clusters are <80% homogeneous
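One way to compute the cluster statistic quoted on this slide is sketched below. The slide does not define homogeneity; the sketch assumes it is the fraction of a cluster's members that belong to the cluster's most common reaction class, and the function names are hypothetical.

```python
from collections import Counter


def cluster_homogeneity(labels):
    """Fraction of members that share the cluster's most common class (assumed definition)."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def fraction_below(clusters, threshold, min_size=10):
    """Fraction of clusters with at least min_size members whose homogeneity is below threshold."""
    big = [c for c in clusters if len(c) >= min_size]
    if not big:
        return 0.0
    below = sum(1 for c in big if cluster_homogeneity(c) < threshold)
    return below / len(big)


# Toy example: each cluster is the list of class labels of its members.
toy_clusters = [
    ["1.2.5"] * 10,                # fully homogeneous
    ["1.2.5"] * 8 + ["1.2.1"] * 2, # 80% homogeneous
    ["1.2.5"] * 3,                 # too small, ignored
]
share_below_90 = fraction_below(toy_clusters, 0.9)
share_below_80 = fraction_below(toy_clusters, 0.8)
```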

Using the fingerprints
Can we help classify the remaining 600K reactions?
- Apply the 66-class random forest to generate class predictions for the unclassified reactions in order to find reactions we missed
- Cluster the unclassified reactions, look for big clusters, and (manually) assign classes to them
Both of these approaches have been successful

Predicting yields
The data set includes text-mined yield information as well as calculated yields.
For modeling: prefer the text-mined value, but take the calculated one if that's the only thing available
Look at stats for the 93 reaction classes that have at least 500 members with yields, a min yield > 0 and a max yield < 110%:

Predicting yields
Look at the most populated classes:

Try building models for yield
Start with class 7.1.1 Nitro to amino
Break into low-yield (<50%) and high-yield (>70%) classes. 14% are low-yield
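The yield handling described so far (prefer the text-mined value, fall back to the calculated one, then bin into low and high) can be sketched as follows. It assumes missing yields are represented as `None` and that reactions between the two cutoffs are left out of the model; the function names are hypothetical.

```python
def pick_yield(text_mined, calculated):
    """Prefer the text-mined yield; use the calculated one only if that is all there is."""
    return text_mined if text_mined is not None else calculated


def yield_class(yield_percent, low_cut=50.0, high_cut=70.0):
    """Label a yield as 'low' (<50%) or 'high' (>70%); anything in between gets no label."""
    if yield_percent is None:
        return None
    if yield_percent < low_cut:
        return "low"
    if yield_percent > high_cut:
        return "high"
    return None  # assumption: intermediate yields are excluded from the model


# Toy records of (text-mined yield, calculated yield), in percent.
records = [(85.0, 60.0), (None, 30.0), (60.0, None), (None, None)]
classes = [yield_class(pick_yield(t, c)) for t, c in records]
```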

Try building models for yield: things that don't work
Try building a random forest using the atom-pair based reaction fingerprints
That's performance on the training set

Try building models for yield: things that don't work
Try building a random forest using the atom-pair based reactant fingerprints
That's performance on the training set

Try building models for yield: things that don't work?
Look at the ROC curve for the training-set data
[Figure: ROC curve, annotated "first wrong low-yield prediction" and "nine wrong low-yield predictions"]
The model is doing a great job of ordering compounds, but a bad job of classifying compounds

Unbalanced data and ensemble classifiers: an aside
Usual decision rule for a two-class ensemble classifier: take the result that the majority of the models (decision trees for random forests) vote for. That's a decision boundary of 0.5.
If the data set is unbalanced, why should we expect balanced behavior from the classifier?
Idea: use the composition of the training set to decide what the decision boundary should be.
For example: if the data set is ~20% low yield, then assign low yield to any example where at least 20% of the trees say low yield.
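The shifted decision rule can be sketched as below. The votes are assumed to be a list with one entry per tree (1 for "low yield", 0 for "high yield"), and the function names are hypothetical. In scikit-learn, the tree-averaged class probabilities returned by `RandomForestClassifier.predict_proba` could play the role of the vote fraction.

```python
def low_yield_fraction(training_labels):
    """Share of 'low' examples in the training set, used as the decision boundary."""
    return sum(1 for label in training_labels if label == "low") / len(training_labels)


def classify(tree_votes, boundary=0.5):
    """Assign 'low' when at least `boundary` of the trees vote for low yield."""
    fraction_low = sum(tree_votes) / len(tree_votes)
    return "low" if fraction_low >= boundary else "high"


# Toy example: 3 of 10 trees vote "low yield" on a data set that is 20% low yield.
votes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
training_labels = ["low"] * 2 + ["high"] * 8

default_call = classify(votes)  # usual 0.5 boundary
shifted_call = classify(votes, boundary=low_yield_fraction(training_labels))
```

With the default boundary the example is called high yield; with the boundary moved to the training-set composition the same votes produce a low-yield call.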

Try building models for yield: getting close to working
Try building a random forest using the atom-pair based reactant fingerprints
That's performance on the training set
What about moving the decision boundary to 0.2 to reflect the unbalanced data set?
Starting to look OK. What about the test set?

Try building models for yield: getting close to working
Results from a random forest using the atom-pair based reactant fingerprints with the shifted decision boundary (test set)
Not too terrible.

Try building models for yield: some more models
Aldehyde reductive amination (no shift): test set
Williamson ether synthesis (boundary 0.3): test set

Try building models for yield: some more models
Chloro N-alkylation (no shift): test set
Chloro N-alkylation (0.4 shift): test set

Wrapping up
Dataset: 1+ million reactions text-mined from patents (publicly available) with reaction classes assigned
Fingerprints: weighted atom-pair delta and functional-group delta fingerprints implemented using the RDKit
Fingerprint validation:
- Multiclass random-forest classifier ~94% accurate
- Similarity measure works: similar reactions cluster together
Combination of clustering + functional group analysis allows identification of new reaction classes
We're also able to use the fingerprints to build reasonable models for yield

Acknowledgements
NextMove Software: Roger Sayle, Daniel Lowe
NIBR: Anna Pelliccioli, Sereina Riniker, Mike Tarselli

Advertising
3rd RDKit User Group Meeting, 22-24 October 2014, Merck KGaA, Darmstadt, Germany
Talks, "talktorials", lightning talks, social activities, and a hackathon on the 24th.
Registration: http://goo.gl/z6qzwd
Full announcement: http://goo.gl/zum2wm
We're looking for speakers. Please contact greg.landrum@gmail.com