Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?

1 / 23 Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge? Josef Eiblmaier, Hans Kraut, Sascha Hausberg, Peter Loew

2 / 23 Outline
» ChemDraw files: Relevance and the Challenge
» Approach
» Projects
» InfoChem ChemProspector
» Wiley Smart Article
» Thieme Science of Synthesis Update / Pharmaceutical Substances
» Conclusion / Outlook
Image: cora / PIXELIO, www.pixelio.de

3 / 23 Patents, Journal Articles and MRWs: a Buried Treasure?
» Chemical structures (images)
» Chemical names/fragments (text)
» Markush structures (text, images, CDX)
» Chemical structures (CDX files)
» Reactions (CDX files)

4 / 23 Manuscript, Article, Database
Diagram labels: Publishing; Manuscript submission; Manual indexing; Database production (e.g. SciFinder, Reaxys, SPRESI, e-EROS, ...)

5 / 23 CDX Scheme vs. Database Record
ChemDraw file
» Purpose: presentation / publishing, no search
» Unstructured
» Structures: no strict rules
» General rules: none
Database
» Purpose: search / retrieval
» Structured
» Structures: strict rules
» Database rules: strict
Roles in the database record: Reactant, Product, Reagent, Solvent, Catalyst
Labels in the example scheme: SOCl2; LiOH; H2O, THF; Pd(OAc)2; ClCO2Et, Et3N; Acetone, H2O
Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
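To make the contrast concrete, the following is a minimal Python sketch of what a structured record might look like once the roles in a drawn scheme have been assigned. The names `Role`, `Component` and `ReactionRecord` are illustrative assumptions, not InfoChem's data model and not the RD file format; the role assignments in the example step are likewise assumed for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    REACTANT = "reactant"
    PRODUCT = "product"
    REAGENT = "reagent"
    SOLVENT = "solvent"
    CATALYST = "catalyst"


@dataclass(frozen=True)
class Component:
    label: str  # text as drawn in the scheme, e.g. "THF"
    role: Role  # role assigned during processing


@dataclass
class ReactionRecord:
    components: list[Component] = field(default_factory=list)

    def by_role(self, role: Role) -> list[str]:
        """Return the labels of all components with the given role."""
        return [c.label for c in self.components if c.role is role]


# Hypothetical step: only the labels written at the arrow are filled in,
# and the roles are assumed for the sake of the example.
step = ReactionRecord([
    Component("LiOH", Role.REAGENT),
    Component("H2O", Role.SOLVENT),
    Component("THF", Role.SOLVENT),
])

print(step.by_role(Role.SOLVENT))
```

In a drawing, "LiOH" and "THF" are just text near an arrow; in a record like this, a query such as "all reactions run in THF" becomes a simple filter on role and label.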

6 / 23 CDX Scheme Processing, what does that mean?
ICSchemeProcessor converts a CDX scheme into:
» Chemical structures (SD files)
» Reactions (RD files)
» Conditions (RD files): Reagent, Solvent, Catalyst
Labels in the example scheme: SOCl2; LiOH; H2O, THF; Pd(OAc)2; ClCO2Et, Et3N; Acetone, H2O
Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)

7 / 23 But: CDX files, often an optical illusion!
Authors are very inventive in pursuit of a perfect layout! Appearances are deceiving!
» Usage of graphical symbols: Polymer supports, Heteroatoms C Grid:

8 / 23 Optical illusions 2
» Unresolvable labels: Labels not defined; Element symbols used as R-group labels; Ambiguous fragment labels (e.g. molecular formula)
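The three label problems named on this slide can be illustrated with a small classifier. This is a sketch under stated assumptions, not the logic of ICSchemeProcessor: the element and alias tables are tiny illustrative subsets, the fragment notation with `*` as attachment point is a choice made here, and the function name `classify_label` is hypothetical.

```python
import re

# Illustrative subsets only; a real system would need complete tables.
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Y", "W", "U"}
ALIASES = {"Me": "*C", "Et": "*CC", "Ph": "*c1ccccc1", "OMe": "*OC"}

# Two or more "capital letter + optional lowercase + optional count" tokens,
# e.g. "C6H5": looks like a molecular formula, so connectivity is unknown.
# The tokens are not checked against the element table in this sketch.
FORMULA_LIKE = re.compile(r"(?:[A-Z][a-z]?\d*){2,}")


def classify_label(label: str, defined_rgroups: set[str]) -> str:
    """Classify an atom label found in a drawing.

    defined_rgroups holds the labels that the scheme or caption defines
    as R-groups (e.g. {"R1", "Y"}).
    """
    if label in defined_rgroups and label in ELEMENTS:
        return "ambiguous"   # element symbol used as an R-group label
    if label in defined_rgroups:
        return "r-group"
    if label in ELEMENTS:
        return "element"
    if label in ALIASES:
        return "alias"
    if FORMULA_LIKE.fullmatch(label):
        return "ambiguous"   # formula-like fragment label
    return "undefined"       # label not defined anywhere


for lab in ("Y", "Ph", "C6H5", "R5"):
    print(lab, classify_label(lab, {"Y"}))
```

The point of the sketch is that "Y" cannot be resolved from the label alone: whether it is yttrium or a generic substituent depends on definitions elsewhere in the scheme or text.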

9 / 23 Optical illusions 3» Variable points of attachment

10 / 23 Optical illusions 4» Reaction arrows / forked arrows / brackets

11 / 23 Approach Gerd Altmann / PIXELIO, www.pixelio.de

12 / 23 Approach
» The algorithmic approach: a set of rules is applied in the software (generic, not project-specific). The software should recognize all cases that might occur! Rules are project (title) specific, so drawing conventions must not change, otherwise further development is necessary; manual post-correction is required (cost- and time-intensive); the problem is infinite, and unprecedented issues cannot be handled
» The templating approach: the software is developed to recognize a defined set of problems (PS); all content must be manually pre-templated (cost-intensive) according to the capabilities of the software
» The hybrid approach: depending on the source, the focus can be placed on either approach

13 / 23 Templating
» Templating: Guidelines for authors and typesetters; Syntax definitions for tables, R-groups etc.; Syntax rules for captions; Reaction arrangement, forked arrows; Rules for reaction conditions (reactants, catalysts, solvents, yields, temperature)

14 / 23 Examples:
» Algorithmic detection of features
» Resolution of repeating groups
» Enumeration of R-groups
» Resolution of aliases/labels: source-specific alias databases, continuously extended
» Table enumeration: compound enumeration; reaction factual data (caption/yield)
» Variable points of attachment
» Forked arrows
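Table enumeration, as listed above, can be sketched as filling the R-group placeholders of a generic structure once per table row. The example below is a minimal illustration under assumptions: structures are written as SMILES-like strings with `{R1}`-style placeholders (a convention chosen here, not a CDX feature), the function name `enumerate_table` is hypothetical, and the compound ids and substituents are invented example data.

```python
def enumerate_table(template: str, rows: list[dict[str, str]]) -> dict[str, str]:
    """Return {compound id: structure} with placeholders filled per table row.

    Each row needs an "id" key; every other key names an R-group.
    """
    structures = {}
    for row in rows:
        structure = template
        for key, value in row.items():
            if key == "id":
                continue
            structure = structure.replace("{" + key + "}", value)
        if "{" in structure:
            # A placeholder survived: the table does not define this R-group.
            raise ValueError(f"unresolved R-group in row {row['id']}: {structure}")
        structures[row["id"]] = structure
    return structures


# Invented example: a generic ester with two variable positions.
template = "c1ccc({R1})cc1C(=O)O{R2}"
rows = [
    {"id": "1a", "R1": "Cl", "R2": "C"},
    {"id": "1b", "R1": "Br", "R2": "CC"},
]
print(enumerate_table(template, rows))
```

Raising an error for an unfilled placeholder, rather than emitting a partial structure, mirrors the "labels not defined" problem: a missing table cell should surface as a reportable issue, not as a silently wrong compound.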

15 / 23 Projects

16 / 23 Successful Application of CDX Processing: Chemistry Enrichment Workflow* (Wiley Smart Article)
*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, 14-17 October, Berlin

17 / 23 Templating*
Diagram labels: Author's CDX File; CDX Template; Enumerated structures; Templating; ICSchemeProcessor; CDX-Templating Guidelines (Structures)
*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, 14-17 October, Berlin

18 / 23 Workflow Science of Synthesis Update
[Figure: example reaction scheme (compounds 39 and 40, R5 substituents) processed by ICSchemeProcessor according to the CDX-Templating Guidelines (Reactions). Diagram labels: Scheme; Error Report; Correct / extend process; Scheme correction not possible; Manual data entry]
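The labels in this workflow diagram suggest a loop: process the scheme, report errors, correct and retry, and fall back to manual data entry when correction is not possible. The following is a minimal sketch of that control flow only. All names (`Outcome`, `run_workflow`, the `process` and `correct` callbacks, `max_rounds`) are assumptions for illustration; the toy callbacks stand in for the real processor and for a human editor.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    scheme_id: str
    route: str          # "automatic", "corrected" or "manual"
    errors: list[str]   # remaining error report, empty on success


def run_workflow(
    scheme_id: str,
    scheme: str,
    process: Callable[[str], list[str]],
    correct: Callable[[str, list[str]], Optional[str]],
    max_rounds: int = 3,
) -> Outcome:
    """Process a scheme; on errors try correction, else route to manual entry."""
    errors = process(scheme)
    if not errors:
        return Outcome(scheme_id, "automatic", [])
    for _ in range(max_rounds):
        fixed = correct(scheme, errors)
        if fixed is None:        # scheme correction not possible
            break
        scheme = fixed
        errors = process(scheme)
        if not errors:
            return Outcome(scheme_id, "corrected", [])
    return Outcome(scheme_id, "manual", errors)


# Toy stand-ins: "?" marks an undefined label in a scheme string.
def toy_process(scheme: str) -> list[str]:
    return ["undefined label"] if "?" in scheme else []


def toy_correct(scheme: str, errors: list[str]) -> Optional[str]:
    return scheme.replace("?", "R5") if scheme.startswith("fixable") else None


for sid, text in [("s1", "clean"), ("s2", "fixable ?"), ("s3", "broken ?")]:
    print(run_workflow(sid, text, toy_process, toy_correct))
```

Keeping the error report on the `manual` outcome means the person doing manual data entry sees why automatic processing failed.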

19 / 23 Sample Pharmaceutical Substances Update Source: Thieme Pharmaceutical Substances, Abiraterone

20 / 23 Conclusion
» As much algorithmic processing as possible is desirable: generic (can be applied to other content as well); cheaper (humans cost!)
» 100% conversion (without human interaction) is never possible
» Solutions are project / source specific
» The relevance of automatic extraction will continuously increase
» Authors / publishers play an essential role in a successful conversion

21 / 23 Acknowledgements
» Wiley: Michael Forster, Reinhard Neudert
» Thieme: Guido Herrmann, Rolf Hoppe, Klaus Köberlein
» InfoChem: Hans Kraut, Sascha Hausberg, Thomas Menke, Manuela Rauh, Fanny Irlinger, Huyen Nguyen, Dagmar Kunzmann

22 / 23 Thank you! Thomas Link / Flickr

23 / 23 Questions?