Large scale classification of chemical reactions from patent data

1 Large scale classification of chemical reactions from patent data
Gregory Landrum, NIBR Informatics, Basel, Novartis Institutes for BioMedical Research
10th International Conference on Chemical Structures / 10th German Conference on Chemoinformatics

2 Outline
- Public data sources and reactions
- Fingerprints for reactions
- Validation: machine learning, clustering
- Application: models for predicting yield

3 Public data sources in cheminformatics: an aside at the beginning
- Publicly available data sources for small molecules and their biological activities/interactions: PDB, PubChem, ChEMBL, etc.
- Publicly available data sources for the chemistry behind how those molecules were actually made (i.e. reactions): pretty much nothing until recently.
- Plenty of data is locked up in large commercial databases and in pharmaceutical companies' ELNs; very little is in the open.
- The public/open point is important for collaboration and reproducibility.

4 A large, public source of chemical reactions
Not just what we made, but how we made it.
- Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]
- Reactions classified using namerxn, when possible, into 318 standard types: > classified reactions [2]
[1] Lowe DM: Extraction of chemical structures and reactions from the literature. PhD thesis. University of Cambridge: Cambridge, UK;
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software)

5 More about the classes
[Figure: frequency of reaction classes]
20 most common classes:
- Carboxylic acid + amine reaction
- Williamson ether synthesis
- Amide Schotten-Baumann
- Chloro N-arylation
- Bromo N-alkylation
- Nitro to amino
- Chloro N-alkylation
- CO2H-Me deprotection
- N-Boc deprotection
- CO2H-Et deprotection
- Aldehyde reductive amination
- Sulfonamide Schotten-Baumann
- Separation
- Bromo Suzuki-type coupling
- Mitsunobu aryl ether synthesis
- Methoxy to hydroxy
- Sonogashira coupling
- Bromo Suzuki coupling
- Thioether synthesis
- Hydroxy to chloro

6 Got the reactions, what about reaction fingerprints?
Criteria for them to be useful:
- Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
  Test: can we use them with a machine-learning approach to build a reaction classifier?
- Question 2: are similar reactions similar with the fingerprints?
  Test: do related reactions cluster together?

7 Our toolbox: the RDKit
- Open-source C++ toolkit for cheminformatics
- Wrappers for Python (2.x), Java, C#
- Functionality:
  - 2D and 3D molecular operations
  - Descriptor generation for machine learning
  - PostgreSQL database cartridge for substructure and similarity searching
  - Knime nodes
  - IPython integration
  - Lucene integration (experimental)
- Supports Mac/Windows/Linux
- Releases every 6 months
- Business-friendly BSD license
- Code:

8 Similarity and reactions
What are we talking about? These two reactions are both of type Ketone reductive amination.
It's obvious that these are the same, right?

9 Similarity and reactions
What are we talking about? These two reactions are both of type Ketone reductive amination.
It's obvious that these are the same, right?

10 Got the reactions, what about reaction fingerprints?
Start simple: use difference fingerprints:
$$FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$$
$$FP_{\mathrm{Prods}} = \sum_{i \in \mathrm{Products}} FP_i$$
$$FP_{\mathrm{Rxn}} = FP_{\mathrm{Prods}} - FP_{\mathrm{Reacts}}$$
Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, (2009).
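The difference fingerprint above can be sketched in plain Python. This is only an illustration under stated assumptions: each molecule fingerprint is a sparse count vector (a `Counter` mapping a feature name to a count), the feature names are hand-written toy values rather than real atom pairs, and the function names `sum_fingerprints` and `difference_fingerprint` are hypothetical, not RDKit API.

```python
from collections import Counter


def sum_fingerprints(fingerprints):
    """Add sparse count fingerprints (feature -> count) component-wise."""
    total = Counter()
    for fp in fingerprints:
        total.update(fp)  # Counter.update adds counts
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """FP_Rxn = sum(product FPs) - sum(reactant FPs), keeping negative counts."""
    reacts = sum_fingerprints(reactant_fps)
    prods = sum_fingerprints(product_fps)
    diff = {}
    for key in set(reacts) | set(prods):
        delta = prods[key] - reacts[key]  # missing keys count as 0
        if delta != 0:
            diff[key] = delta
    return diff


# Toy features (assumed for illustration) for ketone + amine -> amine
ketone = Counter({"C-C": 2, "C=O": 1})
amine = Counter({"C-N": 1, "N-H": 2})
product = Counter({"C-C": 2, "C-N": 2, "N-H": 1})

rxn_fp = difference_fingerprint([ketone, amine], [product])
print(rxn_fp)
```

The subtraction is done by hand because `Counter`'s own `-` operator discards non-positive counts, and the negative entries (features lost in the reaction) are part of the signal.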

11 Refine the fingerprints a bit
Text-mined reactions often include catalysts, reagents, or solvents in the reactants.
Explore two options for handling this:
1. Decrease the weight of reactant molecules where too many of the bits are not present in the product fingerprint
2. Decrease the weight of reactant molecules where too many atoms are unmapped
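Option 1 could look roughly like the following sketch. The talk does not give the actual weighting scheme, so the overlap measure, the cutoff `min_overlap`, and the reduced weight `low_weight` are all assumptions chosen for illustration, as are the function names and toy features.

```python
from collections import Counter


def overlap_fraction(reactant_fp, product_fp):
    """Fraction of the reactant's distinct features that also occur in the products."""
    if not reactant_fp:
        return 0.0
    shared = sum(1 for key in reactant_fp if product_fp.get(key, 0) > 0)
    return shared / len(reactant_fp)


def weighted_reactant_sum(reactant_fps, product_fp, min_overlap=0.5, low_weight=0.1):
    """Sum reactant fingerprints, down-weighting likely reagents or solvents.

    min_overlap and low_weight are illustrative values, not taken from the talk.
    """
    total = Counter()
    for fp in reactant_fps:
        weight = 1.0 if overlap_fraction(fp, product_fp) >= min_overlap else low_weight
        for key, count in fp.items():
            total[key] += weight * count
    return total


product = Counter({"C-C": 2, "C-N": 2, "N-H": 1})
ketone = Counter({"C-C": 2, "C=O": 1})       # half of its features appear in the product
solvent = Counter({"C-Cl": 2, "Cl-Cl": 1})   # none of its features appear in the product

weighted = weighted_reactant_sum([ketone, solvent], product)
print(weighted)
```

Option 2 would follow the same pattern with the overlap replaced by the fraction of mapped atoms, which requires atom-mapping information that this toy representation does not carry.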

12 Are the fingerprints useful?
- Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
  Test: can we use them with a machine-learning approach to build a reaction classifier?
- Question 2: are similar reactions similar with the fingerprints?
  Test: do related reactions cluster together?

13 Machine learning and chemical reactions
Validation set:
- The 68 reaction types with at least 2000 instances from the patent data set
- Resolution reaction types removed (e.g. Separation and 11.1 Chiral separation)
- Final: 66 reaction types
Process:
- Training set is 200 random instances of each reaction type
- Test set is 800 random instances of each reaction type
- Learning: random forest (scikit-learn)
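The sampling step described above can be sketched with the standard library alone. The function name `split_per_class`, the fixed seed, and the choice to keep training and test instances disjoint are assumptions; the slide only states the per-class counts.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Return (train_idx, test_idx) with a fixed number of random instances per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train + n_test:
            raise ValueError(f"class {label!r} has only {len(members)} instances")
        picked = rng.sample(members, n_train + n_test)  # sampling without replacement
        train_idx.extend(picked[:n_train])
        test_idx.extend(picked[n_train:])
    return train_idx, test_idx


# Toy data: 3 classes with 2000 instances each
labels = [cls for cls in ("A", "B", "C") for _ in range(2000)]
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))
```

The resulting index lists would then select rows of the fingerprint matrix for fitting and evaluating a classifier such as scikit-learn's `RandomForestClassifier`.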

14 Learning reaction classes
Results for test data, overall:
- Recall: 0.94
- Precision: 0.94
- Accuracy:
For a 66-class classifier, this looks pretty good!
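For reference, here is one way overall recall, precision, and accuracy can be computed for a multiclass classifier. The slide does not say how the per-class values were averaged, so macro-averaging (an unweighted mean over classes) is an assumption here, and the labels are toy data.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    return tp, fp, fn


def macro_scores(y_true, y_pred):
    """Return (macro recall, macro precision, accuracy)."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    recalls, precisions = [], []
    for c in classes:
        actual = tp[c] + fn[c]
        predicted = tp[c] + fp[c]
        recalls.append(tp[c] / actual if actual else 0.0)
        precisions.append(tp[c] / predicted if predicted else 0.0)
    accuracy = sum(tp.values()) / len(y_true)
    return sum(recalls) / len(classes), sum(precisions) / len(classes), accuracy


y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
recall, precision, accuracy = macro_scores(y_true, y_pred)
print(recall, precision, accuracy)
```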

15 Learning reaction classes
Confusion matrix for test data: ~94% accuracy; much of the confusion is between related types.
[Figure: confusion matrix; labeled classes: Bromo Suzuki coupling, Bromo N-arylation, Bromo Suzuki-type coupling]

16 Are the fingerprints useful?
- Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
  Test: can we use them with a machine-learning approach to build a reaction classifier?
- Question 2: are similar reactions similar with the fingerprints?
  Test: do related reactions cluster together?

17 Clustering reactions
Reaction similarity validation set: the 66 most common reaction types from the patent data set.
Look at the homogeneity of clusters with at least 10 members.
[Figure: panels labeled Ketone reductive amination; Integration]
Interpretation: <30% of clusters are <90% homogeneous
Interpretation: <40% of clusters are <80% homogeneous
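A small sketch of the homogeneity check follows. The slide does not define homogeneity, so this assumes it means the fraction of a cluster's members that belong to the cluster's most common reaction class; the function names and the toy clusters are illustrative.

```python
from collections import Counter


def cluster_homogeneity(labels):
    """Fraction of cluster members that share the most common reaction class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def fraction_below(clusters, threshold, min_size=10):
    """Fraction of clusters with >= min_size members whose homogeneity is < threshold."""
    big = [c for c in clusters if len(c) >= min_size]
    if not big:
        return 0.0
    return sum(1 for c in big if cluster_homogeneity(c) < threshold) / len(big)


clusters = [
    ["Ketone reductive amination"] * 10,
    ["Ketone reductive amination"] * 8 + ["Aldehyde reductive amination"] * 2,
    ["Bromo Suzuki coupling"] * 3,  # fewer than 10 members, so it is ignored
]

print(fraction_below(clusters, 0.9))
print(fraction_below(clusters, 0.8))
```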

18 Using the fingerprints
Can we help classify the remaining 600K reactions?
- Apply the 66-class random forest to generate class predictions for the unclassified reactions in order to find reactions we missed.
- Cluster the unclassified reactions, look for big clusters, and (manually) assign classes to them.
Both of these approaches have been successful.

19 Predicting yields
The data set includes text-mined yield information as well as calculated yields.
For modeling: prefer the text-mined value, but take the calculated one if that's the only thing available.
Look at stats for the 93 reaction classes that have at least 500 members with yields, a min yield > 0 and a max yield < 110%:
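The yield-selection rule and the class filter can be written down as below. The function names are hypothetical, and reading the min/max conditions as per-class statistics is one interpretation of the slide, not something it states outright.

```python
def pick_yield(text_mined, calculated):
    """Prefer the text-mined yield; fall back to the calculated one; else None."""
    return text_mined if text_mined is not None else calculated


def class_qualifies(yields, min_members=500, max_allowed=110.0):
    """True if a class has enough members with yields, min yield > 0, max yield < 110."""
    known = [y for y in yields if y is not None]
    return len(known) >= min_members and min(known) > 0 and max(known) < max_allowed


print(pick_yield(85.0, 90.0))   # text-mined value wins
print(pick_yield(None, 90.0))   # only the calculated value is available
print(class_qualifies([50.0] * 500))
```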

20 Predicting yields
Look at the most populated classes:

21 Try building models for yield
Start with class nitro to amino.
Break into low-yield (<50%) and high-yield (>70%) classes. 14% are low-yield.
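The binning into low-yield and high-yield classes amounts to the following sketch. Dropping reactions in the 50% to 70% band and computing the low-yield share over the kept reactions are assumptions; the slide gives only the two cutoffs.

```python
def yield_label(yield_pct, low_cutoff=50.0, high_cutoff=70.0):
    """Return 'low', 'high', or None for the excluded middle band."""
    if yield_pct < low_cutoff:
        return "low"
    if yield_pct > high_cutoff:
        return "high"
    return None


yields = [10.0, 45.0, 60.0, 75.0, 80.0, 90.0, 95.0]  # toy values
labels = [yield_label(y) for y in yields]
kept = [lab for lab in labels if lab is not None]
low_fraction = kept.count("low") / len(kept)
print(labels, low_fraction)
```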

22 Try building models for yield: things that don't work
Try building a random forest using the atom-pair based reaction fingerprints.
That's performance on the training set.

23 Try building models for yield: things that don't work
Try building a random forest using the atom-pair based reactant fingerprints.
That's performance on the training set.

24 Try building models for yield: things that don't work?
Look at the ROC curve for the training-set data.
[Figure: ROC curve, annotated "first wrong low-yield prediction" and "nine wrong low-yield predictions"]
The model is doing a great job of ordering compounds, but a bad job of classifying compounds.
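The gap between ordering and classifying can be shown with a toy example. The scores below are invented: every low-yield example outranks every high-yield one (so the ROC AUC is 1.0), yet none reaches the default 0.5 cutoff. The helper names `roc_auc` and `recall_at` are illustrative.

```python
def roc_auc(scores, is_positive):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def recall_at(scores, is_positive, threshold):
    """Fraction of positives whose score reaches the threshold."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    return sum(1 for s in pos if s >= threshold) / len(pos)


# Invented "fraction of trees voting low-yield" scores; True marks a low-yield example
scores = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
is_low = [True, True, True, False, False, False, False]

print(roc_auc(scores, is_low))
print(recall_at(scores, is_low, 0.5))
print(recall_at(scores, is_low, 0.35))
```

With a perfect ranking, the only thing wrong is where the cutoff sits, which is why moving the threshold can rescue the classifier without retraining it.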

25 Unbalanced data and ensemble classifiers: an aside
Usual decision rule for a two-class ensemble classifier: take the result that the majority of the models (decision trees for random forests) vote for. That's a decision boundary = 0.5.
If the dataset is unbalanced, why should we expect balanced behavior from the classifier?
Idea: use the composition of the training set to decide what the decision boundary should be.
For example: if the data set is ~20% low yield, then assign low yield to any example where at least 20% of the trees say low yield.
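The shifted decision rule can be sketched as follows, assuming each tree casts a hard vote of "low" or "high". The function names are hypothetical. In scikit-learn, the closest readily available quantity is the class probability from `RandomForestClassifier.predict_proba`, which averages per-tree probabilities rather than counting hard votes.

```python
def low_yield_fraction(train_labels):
    """Share of low-yield examples in the training set, used as the decision boundary."""
    return sum(1 for y in train_labels if y == "low") / len(train_labels)


def predict_with_boundary(tree_votes, boundary=0.5):
    """Assign 'low' when at least `boundary` of the trees vote 'low'.

    With boundary=0.5 this is the usual majority rule (ties go to 'low' here).
    """
    frac_low = sum(1 for v in tree_votes if v == "low") / len(tree_votes)
    return "low" if frac_low >= boundary else "high"


train_labels = ["low"] * 20 + ["high"] * 80   # ~20% low yield
boundary = low_yield_fraction(train_labels)

votes = ["low"] * 3 + ["high"] * 7            # 30% of the trees say low yield
print(predict_with_boundary(votes))             # default 0.5 boundary
print(predict_with_boundary(votes, boundary))   # boundary from class composition
```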

26 Try building models for yield: getting close to working
Try building a random forest using the atom-pair based reactant fingerprints.
That's performance on the training set.
What about moving the decision boundary to 0.2 to reflect the unbalanced data set?
Starting to look OK. What about the test set?

27 Try building models for yield: getting close to working
Results on the test set from a random forest using the atom-pair based reactant fingerprints with the shifted decision boundary.
Not too terrible.

28 Try building models for yield: some more models
- Aldehyde reductive amination (no shift): test set
- Williamson ether synthesis (boundary 0.3): test set

29 Try building models for yield: some more models
- Chloro N-Alkylation (no shift): test set
- Chloro N-Alkylation (0.4 shift): test set

30 Wrapping up
- Dataset: 1+ million reactions text-mined from patents (publicly available) with reaction classes assigned
- Fingerprints: weighted atom-pair delta and functional-group delta fingerprints implemented using the RDKit
- Fingerprint validation:
  - Multiclass random-forest classifier ~94% accurate
  - Similarity measure works: similar reactions cluster together
- Combination of clustering + functional group analysis allows identification of new reaction classes
- We're also able to use the fingerprints to build reasonable models for yield

31 Acknowledgements
- NextMove Software: Roger Sayle, Daniel Lowe
- NIBR: Anna Pelliccioli, Sereina Riniker, Mike Tarselli

32 Advertising
3rd RDKit User Group Meeting, October 2014, Merck KGaA, Darmstadt, Germany
Talks, "talktorials", lightning talks, social activities, and a hackathon on the 24th.
Registration:
Full announcement:
We're looking for speakers. Please contact greg.landrum@gmail.com


More information

Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods

Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods J. Chem. Inf. Model. 2010, 50, 979 991 979 Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods Yevgeniy Podolyan, Michael A. Walters, and George Karypis*, Department

More information

CDK & Mass Spectrometry

CDK & Mass Spectrometry CDK & Mass Spectrometry October 3, 2011 1/18 Stephan Beisken October 3, 2011 EBI is an outstation of the European Molecular Biology Laboratory. Chemistry Development Kit (CDK) An Open Source Java TM Library

More information

DATA ANALYTICS IN NANOMATERIALS DISCOVERY

DATA ANALYTICS IN NANOMATERIALS DISCOVERY DATA ANALYTICS IN NANOMATERIALS DISCOVERY Michael Fernandez OCE-Postdoctoral Fellow September 2016 www.data61.csiro.au Materials Discovery Process Materials Genome Project Integrating computational methods

More information

Expanding the scope of literature data with document to structure tools PatentInformatics applications at Aptuit

Expanding the scope of literature data with document to structure tools PatentInformatics applications at Aptuit Expanding the scope of literature data with document to structure tools PatentInformatics applications at Aptuit Alfonso Pozzan Computational and Analytical Chemistry Drug Design and Discovery Department

More information

Early Stages of Drug Discovery in the Pharmaceutical Industry

Early Stages of Drug Discovery in the Pharmaceutical Industry Early Stages of Drug Discovery in the Pharmaceutical Industry Daniel Seeliger / Jan Kriegl, Discovery Research, Boehringer Ingelheim September 29, 2016 Historical Drug Discovery From Accidential Discovery

More information

Bridging the Dimensions:

Bridging the Dimensions: Bridging the Dimensions: Seamless Integration of 3D Structure-based Design and 2D Structure-activity Relationships to Guide Medicinal Chemistry ACS Spring National Meeting. COMP, March 13 th 2016 Marcus

More information

FREQUENTLY ASKED QUESTIONS ABOUT SINGLE AND DOUBLE UNKNOWN ANALYSES

FREQUENTLY ASKED QUESTIONS ABOUT SINGLE AND DOUBLE UNKNOWN ANALYSES FREQUENTLY ASKED QUESTIONS ABOUT SINGLE AND DOUBLE UNKNOWN ANALYSES TABLE OF CONTENTS 1. ON PREPARATION FOR THE EXPERIMENTS 2 2. ON PHYSICAL PROPERTIES..3 3. ON SOLUBILITY TESTS USING ACID-BASE CHEMISTRY...4

More information

Chemical Space: Modeling Exploration & Understanding

Chemical Space: Modeling Exploration & Understanding verview Chemical Space: Modeling Exploration & Understanding Rajarshi Guha School of Informatics Indiana University 16 th August, 2006 utline verview 1 verview 2 3 CDK R utline verview 1 verview 2 3 CDK

More information

FIRST EXAMINATION. Name: CHM 332

FIRST EXAMINATION. Name: CHM 332 ame: CM 332 FIRST EXAMIATI All answers should be written on the exam in the spaces provided. Clearly indicate your answers in the spaces provided; if I have to guess as to what or where your answer is,

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Chemists are from Mars, Biologists from Venus. Originally published 7th November 2006

Chemists are from Mars, Biologists from Venus. Originally published 7th November 2006 Chemists are from Mars, Biologists from Venus Originally published 7th November 2006 Chemists are from Mars, Biologists from Venus Andrew Lemon and Ted Hawkins, The Edge Software Consultancy Ltd Abstract

More information

DANIEL WILSON AND BEN CONKLIN. Integrating AI with Foundation Intelligence for Actionable Intelligence

DANIEL WILSON AND BEN CONKLIN. Integrating AI with Foundation Intelligence for Actionable Intelligence DANIEL WILSON AND BEN CONKLIN Integrating AI with Foundation Intelligence for Actionable Intelligence INTEGRATING AI WITH FOUNDATION INTELLIGENCE FOR ACTIONABLE INTELLIGENCE in an arms race for artificial

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes dr. Petra Kralj Novak Petra.Kralj.Novak@ijs.si 7.11.2017 1 Course Prof. Bojan Cestnik Data preparation Prof. Nada Lavrač: Data mining overview Advanced

More information

EASTERN ARIZONA COLLEGE General Chemistry II

EASTERN ARIZONA COLLEGE General Chemistry II EASTERN ARIZONA COLLEGE General Chemistry II Course Design 2013-2014 Course Information Division Science Course Number CHM 152 (SUN# CHM 1152) Title General Chemistry II Credits 4 Developed by Phil McBride,

More information

Suggested solutions for Chapter 14

Suggested solutions for Chapter 14 s for Chapter 14 14 PRBLEM 1 Are these molecules chiral? Draw diagrams to justify your answer. 2 C 2 C Reinforcement of the very important criterion for chirality. Make sure you understand the answer.

More information

FARMINGDALE STATE COLLEGE DEPARTMENT OF CHEMISTRY. CONTACT HOURS: Lecture: 3 Laboratory: 4

FARMINGDALE STATE COLLEGE DEPARTMENT OF CHEMISTRY. CONTACT HOURS: Lecture: 3 Laboratory: 4 FARMINGDALE STATE COLLEGE DEPARTMENT OF CHEMISTRY COURSE OUTLINE: COURSE TITLE: Prepared by: Dr. M. DeCastro September 2011 Organic Chemistry II COURSE NUMBER: CHM 271 CREDITS: 5 CONTACT HOURS: Lecture:

More information

OECD QSAR Toolbox v.3.0

OECD QSAR Toolbox v.3.0 OECD QSAR Toolbox v.3.0 Step-by-step example of how to categorize an inventory by mechanistic behaviour of the chemicals which it consists Background Objectives Specific Aims Trend analysis The exercise

More information

Functional Group Fingerprints CNS Chemistry Wilmington, USA

Functional Group Fingerprints CNS Chemistry Wilmington, USA Functional Group Fingerprints CS Chemistry Wilmington, USA James R. Arnold Charles L. Lerman William F. Michne James R. Damewood American Chemical Society ational Meeting August, 2004 Philadelphia, PA

More information

The Molecule Cloud - compact visualization of large collections of molecules

The Molecule Cloud - compact visualization of large collections of molecules Ertl and Rohde Journal of Cheminformatics 2012, 4:12 METHODOLOGY Open Access The Molecule Cloud - compact visualization of large collections of molecules Peter Ertl * and Bernhard Rohde Abstract Background:

More information

Wiley ChemPlanner predicts experimentally verified synthesis routes in medicinal chemistry

Wiley ChemPlanner predicts experimentally verified synthesis routes in medicinal chemistry Wiley ChemPlanner predicts experimentally verified synthesis routes in medicinal chemistry Simone-Alexandra Stark, University of Regensburg Reinhard Neudert, Wiley Richard Threlfall, Wiley Wiley ChemPlanner

More information

10. Amines (text )

10. Amines (text ) 2009, Department of Chemistry, The University of Western Ontario 10.1 10. Amines (text 10.1 10.6) A. Structure and omenclature Amines are derivatives of ammonia (H 3 ), where one or more H atoms has been

More information

DivCalc: A Utility for Diversity Analysis and Compound Sampling

DivCalc: A Utility for Diversity Analysis and Compound Sampling Molecules 2002, 7, 657-661 molecules ISSN 1420-3049 http://www.mdpi.org DivCalc: A Utility for Diversity Analysis and Compound Sampling Rajeev Gangal* SciNova Informatics, 161 Madhumanjiri Apartments,

More information

Similarity Search. Uwe Koch

Similarity Search. Uwe Koch Similarity Search Uwe Koch Similarity Search The similar property principle: strurally similar molecules tend to have similar properties. However, structure property discontinuities occur frequently. Relevance

More information

Linear and Logistic Regression. Dr. Xiaowei Huang

Linear and Logistic Regression. Dr. Xiaowei Huang Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics

More information

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money How to evaluate credit scorecards - and why using the Gini coefficient has cost you money David J. Hand Imperial College London Quantitative Financial Risk Management Centre August 2009 QFRMC - Imperial

More information

Solved and Unsolved Problems in Chemoinformatics

Solved and Unsolved Problems in Chemoinformatics Solved and Unsolved Problems in Chemoinformatics Johann Gasteiger Computer-Chemie-Centrum University of Erlangen-Nürnberg D-91052 Erlangen, Germany Johann.Gasteiger@fau.de Overview objectives of lecture

More information

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility Dr. Wendy A. Warr http://www.warr.com Warr, W. A. A Short Review of Chemical Reaction Database Systems,

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Identification of functional groups in the unknown Will take in lab today

Identification of functional groups in the unknown Will take in lab today Qualitative Analysis of Unknown Compounds 1. Infrared Spectroscopy Identification of functional groups in the unknown Will take in lab today 2. Elemental Analysis Determination of the Empirical Formula

More information

Loudon Chapter 23 Review: Amines Jacquie Richardson, CU Boulder Last updated 4/22/2018

Loudon Chapter 23 Review: Amines Jacquie Richardson, CU Boulder Last updated 4/22/2018 This chapter is about the chemistry of nitrogen. We ve seen it before in several places, but now we can look at several reactions that are specific to nitrogen. Amines can be subdivided based on how many

More information