Extracting Knowledge from Reaction Databases: Developments from InfoChem


1 Extracting Knowledge from Reaction Databases: Developments from InfoChem
CICAG meeting, 3rd July 2013
Stephanie North, Allyl Consulting Ltd, representing InfoChem in the UK
H. Kraut, H. Matuszczyk, H. Saller, J. Eiblmaier, P. Loew
InfoChem GmbH, Landsberger Strasse 408, 81241 Munich, Germany

2 Outline
Setting the scene: millions of reactions
  - Rapid growth of published reaction databases, e.g. SPRESI with 4.2 M reactions
  - Electronic Laboratory Notebooks (ELNs) now provide corporate reaction databases
  - Reaction knowledge is a key driver for chemists' innovation and decision-making processes, but how best to unlock and exploit this resource?
  - InfoChem's CLASSIFY technology provides the solution
Today's presentation
  - Introduce InfoChem
  - Describe the reaction classification concept
  - Show how CLASSIFY works and analyse the results
  - Benefits of extending CLASSIFY to corporate reaction databases

3 InfoChem at a Glance
Company
  - specializes in chemoinformatics
  - founded in 1989
  - based in Munich, Germany
  - majority owned by Springer Science + Business Media subsidiary InfoChimia since 1990
People
  - 24 full-time employees (Munich office)
  - 2 representatives in UK
  - 3 representatives in Sweden
  - 60 freelance abstractors (residing offshore)
Business Areas
  - Software products
  - Projects
  - Text/Data mining
  - Database building
Revenues: approx. 3.3 million Euros (2012)
  - Software products (30%)
  - Database building (35%)
  - Projects (12%)
  - Text/Data mining (23%)

4 Business Areas
InfoChem Core Technologies
Software
  - ICFSE, ICCARTRIDGE, ICCHEMDESK
  - ICMAP, CLASSIFY, ICNameRXN
  - ICSYNTH, ICEDIT, ICTOOLS
Services
  - Project development
  - Consulting
  - Database building
Content
  - SPRESI web, SPRESImobile
  - Chemisches Zentralblatt Structural Database
Text/Data Mining
  - Chemical entity recognition
  - Name to structure conversion
  - Image to structure conversion

5 Fundamentals of Reaction Classification: Concept
What does it do?
  - Organises and analyses large reaction databases according to the type or class of transformation
How does it work?
  - Groups similar transformation types together, sharing the same reaction centres and the same environment
What are the benefits?
  - Gain knowledge about the type, quality and diversity of reactions to aid the inventive process and decision making

6 Background: Driving Force for Classification
Status 1988: 1.8 M reactions
  - Technology limitations of the systems at the time: REACCS, ORAC and SynLib handled ca. 100K reactions
  - InfoChem acquired SPRESI with 1.8 million reactions
  - How could we provide 1.8 million reactions to customers?
  - How could the reactions be reduced or selected so that the systems could handle them?
[Chart not reproduced. Labels: 25k, 39k, 66k, 65k/y; ORAC (6.5), SynLib (2.2), REACCS (6.1), ChemInform (>=1991), ZIC/InfoChem]

7 Concept of Reaction Classification
  - Develop a chemically intelligent way to choose an example reaction from a given class of reactions
  - Group similar reactions, e.g. in the way chemistry is published
  - Select one example transformation to represent the group
  - The result is a smaller, selective database consisting of one example for each transformation type
[Image: http://www.prestatraining.com/wp-content/uploads/2011/06/istock_000014706389xsmall.jpg]

8 CLASSIFY classification algorithm: how does it work?
  - For each reaction, determine and map the reaction centres
  - Calculate atom hash codes for the reaction centre
  - The atoms included depend on the extent of the reaction centre's chemical environment considered, resulting in different-sized search hit lists (clusters):
      0-Sphere (BROAD): reaction centres only; large clusters
      1-Sphere (MEDIUM): reaction centres plus alpha atoms, excluding hydrogens
      2-Sphere (NARROW): reaction centres plus beta atoms, excluding hydrogens and consecutive sp3 atoms; small clusters
  - The sum of the hash codes from these calculations provides the unique InfoChem Reaction Classification Code. Three codes are assigned to each reaction.
[Structure diagrams illustrating the three spheres not reproduced]
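The sphere idea can be illustrated with a small Python sketch. This is not InfoChem's algorithm: the graph representation, the SHA-1 based `atom_hash`, and the function names are assumptions made for illustration. It handles only one side of a reaction and omits the NARROW rule about consecutive sp3 atoms.

```python
import hashlib
from collections import deque


def atoms_within(adjacency, centres, radius):
    """Return the atoms at most `radius` bonds away from any reaction centre."""
    seen = {c: 0 for c in centres}
    queue = deque(centres)
    while queue:
        atom = queue.popleft()
        if seen[atom] == radius:
            continue  # do not expand beyond the requested sphere
        for nbr in adjacency[atom]:
            if nbr not in seen:
                seen[nbr] = seen[atom] + 1
                queue.append(nbr)
    return set(seen)


def atom_hash(element, neighbour_elements):
    """Toy hash of an atom and its in-sphere neighbours (hypothetical scheme)."""
    key = element + "|" + ",".join(sorted(neighbour_elements))
    return int(hashlib.sha1(key.encode()).hexdigest()[:8], 16)


def class_code(elements, adjacency, centres, radius):
    """Sum the atom hashes over the sphere; non-centre hydrogens are skipped."""
    centre_set = set(centres)
    sphere = {
        a for a in atoms_within(adjacency, centres, radius)
        if elements[a] != "H" or a in centre_set
    }
    total = 0
    for a in sphere:
        nbrs = [elements[n] for n in adjacency[a] if n in sphere]
        total += atom_hash(elements[a], nbrs)
    return total


# Hypothetical fragment C-C-N-O with the nitrogen as the only reaction centre.
elements = ["C", "C", "N", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
codes = {
    "broad": class_code(elements, adjacency, [2], 0),
    "medium": class_code(elements, adjacency, [2], 1),
    "narrow": class_code(elements, adjacency, [2], 2),
}
```

Because the hashes are summed over a set of atoms, the code does not depend on atom numbering, so two records of the same transformation drawn differently fall into the same cluster.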

9 Classification results
  - Large database, e.g. SPRESI
  - CLASSIFY: each reaction is processed and its ClassCode determined
  - Cluster according to ClassCodes: groups of reaction types
  - Select one sample for each ClassCode: smaller representative database, e.g. ChemReact
  - Further analysis of clusters: extract knowledge from classification
[Figure not reproduced: list of example ClassCodes with the number of reactions (# RXNs) in each cluster, on a scale from 13,000 down to 1]

10 ClassCodes: representative database creation
  - Apply NARROW classification to SPRESI
  - Most frequent ClassCode: reduction of a nitro group, 13,794 examples
  - ONE representative sample reaction per ClassCode from SPRESI is used to create the ChemReact database
  - Selection criteria: yield, journal, year of publication etc.
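A minimal sketch of the "one representative per ClassCode" step follows. The record fields and the ranking (highest yield first, then most recent year) are assumptions; the slide names yield, journal and year of publication as criteria but does not say how they are weighted.

```python
from collections import defaultdict


def select_representatives(reactions):
    """Cluster reactions by ClassCode and keep one sample per cluster.

    Ranking is an assumption: highest yield wins, ties go to the newest year.
    """
    clusters = defaultdict(list)
    for rxn in reactions:
        clusters[rxn["class_code"]].append(rxn)
    return {
        code: max(members, key=lambda r: (r["yield"], r["year"]))
        for code, members in clusters.items()
    }


# Hypothetical records; the ids, codes and values are invented for illustration.
reactions = [
    {"id": "r1", "class_code": "A", "yield": 82, "year": 1999},
    {"id": "r2", "class_code": "A", "yield": 95, "year": 1987},
    {"id": "r3", "class_code": "A", "yield": 95, "year": 2005},
    {"id": "r4", "class_code": "B", "yield": 40, "year": 2010},
]
representatives = select_representatives(reactions)
```

The resulting dictionary has one entry per ClassCode, which mirrors how a smaller database such as ChemReact is derived from a larger one such as SPRESI.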

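11 Sample reactions from ClassCodes with high populations
Columns: ClassCode; reaction examples in SPRESI (narrow); ChemReact sample reaction
ClassCode values shown: 3257410 82969498 2945604 35478524 2969666 48127528
Reaction examples in SPRESI (narrow): 13,794; 8,989; 8,947
[ChemReact sample reaction drawings not reproduced]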

12 Analysis: Frequency distribution of SPRESI ClassCodes (narrow)
[Figure not reproduced: population of reactions per ClassCode plotted against ClassCodes, with marked points at 13,794, 8,989, 3,818 and 2,476 examples]
Database diversity
  - Only 126 ClassCodes have > 1,000 example reactions
  - The first 180 ClassCodes combined contain 348,927 reactions
  - Core chemistry lies within a fairly narrow range of reaction types
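Figures such as "ClassCodes with more than 1,000 examples" and "reactions covered by the first 180 ClassCodes" can be computed from a plain list of ClassCodes. The sketch below uses invented toy data and hypothetical function and key names.

```python
from collections import Counter


def diversity_summary(class_codes, threshold, top_n):
    """Summarise a ClassCode frequency distribution.

    class_codes: list with one ClassCode entry per reaction.
    threshold:   count above which a ClassCode is considered highly populated.
    top_n:       number of most frequent ClassCodes to accumulate.
    """
    counts = Counter(class_codes)
    ranked = counts.most_common()  # (code, count), most frequent first
    covered = sum(n for _, n in ranked[:top_n])
    return {
        "classes": len(counts),
        "over_threshold": sum(1 for _, n in ranked if n > threshold),
        "top_n_reactions": covered,
        "top_n_share": covered / len(class_codes),
    }


# Toy data: three ClassCodes with 5, 3 and 1 reactions.
toy_codes = ["a"] * 5 + ["b"] * 3 + ["c"]
summary = diversity_summary(toy_codes, threshold=2, top_n=2)
```

With the toy data, two ClassCodes exceed the threshold and the top two cover 8 of the 9 reactions.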

13 Classification of SPRESI to ChemReact since 1988

Release   SPRESI    ChemReact
1988      1,800 K   370 K
1991      2,300 K   400 K
2002      3,500 K   450 K
2012      4,260 K   524 K

[Charts not reproduced: number of reactions (Rxn x 10^-3) per release, 1988 to 2012, for ChemReact alone and for SPRESI and ChemReact together]
  - Steady increase in diversity of organic synthesis specific reactions
  - Growth of ChemReact is not proportional to that of the SPRESI database (overall ca. 1.4x vs. 2.4x)
  - New reaction references in the literature do not necessarily generate new ClassCodes (chemistry)
  - Organic synthesis is described by a comparatively small number of reaction types
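The overall growth factors quoted on this slide follow from the release figures. A short check, with variable names chosen here for illustration:

```python
# Release sizes in thousands of reactions, as given on the slide:
# year -> (SPRESI, ChemReact)
releases = {
    1988: (1800, 370),
    1991: (2300, 400),
    2002: (3500, 450),
    2012: (4260, 524),
}

first, last = releases[1988], releases[2012]
spresi_growth = last[0] / first[0]      # 4260 / 1800
chemreact_growth = last[1] / first[1]   # 524 / 370

# ChemReact size as a fraction of SPRESI for each release.
share = {year: chem / spresi for year, (spresi, chem) in releases.items()}
```

Rounded to one decimal place these are 2.4 for SPRESI and 1.4 for ChemReact, and the ChemReact share of SPRESI falls from about 21% in 1988 to about 12% in 2012.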

14 Classification for Corporate Reaction Databases
  - Organisations are making significant investments in ELNs to capture intellectual property
  - ELNs generate large repositories of corporate reactions
  - Accessibility of reactions depends on the type of ELN
  - How to benefit from and unlock this valuable resource for improved decision making and reaction planning?
  - CLASSIFY: a solution to automatically categorize and analyse corporate reactions
  - ICCartridge and Fast Search Engine for storage and fast retrieval of corporate chemical reactions and data

15 Generation of ELN Classified Reaction database
ELN reactions -> Extract reactions -> RD Files -> CLASSIFY -> RD Files enriched by ClassCodes -> Oracle: classified reaction database
Integrate with InfoChem Cartridge and Fast Search Engine
InfoChem GmbH. InfoChem Copyright 2013

16 Possible Frequency Distribution of a Corporate database
[Figure not reproduced: population of reactions per ClassCode (0 to 16,000) plotted against ClassCodes for a corporate database, with the SPRESI database as benchmark. Annotations: strengths; possible gap or weakness; emerging or obscure chemistry; quality analysis of singletons]

17 CLASSIFY unlocks corporate reaction knowledge
Analysis of ClassCodes and their frequency distribution provides knowledge about:
  - Diversity, strengths and gaps in corporate reactions
  - Optimum yields, reagents and conditions for a reaction type
Low-frequency reactions may indicate
  - Emerging valuable new reactions
  - Obscure but useful chemistry not often repeated
Database quality
  - Few singletons (one reaction in a ClassCode) indicate that the database has a certain quality standard
  - Many singletons may indicate data quality problems
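Two of these indicators are easy to sketch: the share of singleton ClassCodes, and ClassCodes that are well populated in a benchmark database but absent from the corporate one. The function names, the cutoff parameter and the toy data are assumptions for illustration, not part of CLASSIFY.

```python
from collections import Counter


def singleton_fraction(class_codes):
    """Fraction of ClassCodes that contain exactly one reaction."""
    counts = Counter(class_codes)
    if not counts:
        return 0.0
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)


def gaps_against_benchmark(corporate_codes, benchmark_codes, min_benchmark_count):
    """ClassCodes common in the benchmark but missing from the corporate set."""
    corporate = Counter(corporate_codes)
    benchmark = Counter(benchmark_codes)
    return sorted(
        code for code, n in benchmark.items()
        if n >= min_benchmark_count and corporate[code] == 0
    )


# Toy data: one ClassCode entry per reaction.
corporate = ["A", "A", "B", "C"]
benchmark = ["A"] * 3 + ["D"] * 3 + ["E"]
fraction = singleton_fraction(corporate)                # 2 of 3 ClassCodes
gaps = gaps_against_benchmark(corporate, benchmark, 2)  # only "D" qualifies
```

What counts as "few" or "many" singletons is a judgement for the database owner; the slide gives no numeric threshold.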

18 Singleton reactions: a deeper look
  - New/unique/weird but valid chemistry
  - Erroneous or inconsistent reaction or data entry: roles, missing reactants, different solvent/catalyst/reagent/reactant definitions
  - Inconsistent normalization rules
  - Inadequate or incorrect mapping
  - Multistep reactions recorded as overall reactions, e.g. the overall step in a 13-step reaction: Da-Ming Gou and Ching-Shih Chen, Synthesis of L-α-phosphatidyl-D-myo-inositol 3,4,5-trisphosphate, an important intracellular signalling molecule, J. Chem. Soc., Chem. Commun., 1994, 2125 (SPRESI Doc-ID 3396558)

19 Summary of CLASSIFY Benefits
Diversity of reaction types identifies strengths and gaps in corporate knowledge
  - Does the business need to focus on alternative reaction classes?
  - Location of failed internal reactions (how/why) to drive reaction improvement
Quality: singleton reaction types may indicate erroneous data
  - Drive data quality improvements and business rule compliance
Improved route design and innovation
  - Maximum value obtained from corporate reaction expertise
  - Analysis of reactants and conditions within a reaction class to determine best practice, leading to simplification and economy of processes
  - Chemists alerted to emerging or novel chemistry, thus speeding discovery
  - Enables selection of viable reactions and efficient route maps
Knowledge sharing
  - Only way to link and compare reaction databases
  - Reaction knowledge and expertise shared across the organisation

20 Acknowledgments: InfoChem Team
Hans Kraut, Henry Matuszczyk, Heinz Saller, Josef Eiblmaier, Valentina Eigner-Pitto, Ulf Frieske, Peter Loew
Look out for the publication that is underway
Thank you