Information Extraction from Chemical Images. Discovery Knowledge & Informatics, April 24th, 2006. Dr. Marc Zimmermann


Information Extraction from Chemical Images. Discovery Knowledge & Informatics, April 24th, 2006. Dr. Marc Zimmermann

Available Chemical Information: textbooks, reports, patents, databases, scientific journals and publications, websites.

Representations of Chemical Compounds
  Name (trivial, trade, brand, INN, USAN)
  Registration numbers (CAS, NCI, Beilstein)
  Formal description (sum formula, SMILES)
  Chemical nomenclature (IUPAC, CAS, InChI)
  Depictions

Example: Aspirin
  Name: Acetylsalicylic acid, Aspirin, Bayer, Colfarit, Dolean PH 8, Duramax, Ecotrin, ...
  CAS: 50-78-2
  SID: 35870
  Formula: C9H8O4
  IUPAC Name: 2-acetoxybenzoic acid
  SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
  InChI: 1.12Beta/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12)
  Depiction: (structure image)
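As a small illustration of how a formal description can be processed by rule, the sketch below parses the sum formula from the aspirin example and derives a molecular weight. The function names and the rounded atomic weights are assumptions made for this example; nested groups and isotopes are not handled.

```python
import re

# Rounded standard atomic weights; only the elements needed for the example.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "O": 15.999}

def parse_formula(formula):
    """Return {element: count} for a flat sum formula such as 'C9H8O4'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def molecular_weight(formula):
    """Sum atomic weights over all atoms in the formula."""
    return sum(ATOMIC_WEIGHTS[el] * n for el, n in parse_formula(formula).items())

print(parse_formula("C9H8O4"))
print(round(molecular_weight("C9H8O4"), 2))
```

A sum formula alone does not identify a compound; it only constrains the atom counts, which is why the slide lists it next to SMILES and InChI.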

Information Extraction Methods
  Names: dictionary based
  Registration numbers: databases
  Formal descriptions: rule based
  Depictions: chemical OCR
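The slide pairs registration numbers with database lookup. Before such a lookup, candidate CAS numbers can be located in text and filtered with the CAS check digit, as in the following sketch. The regular expression and function names are assumptions for illustration, not part of chemocr.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

def is_valid_cas(cas):
    """Check the CAS check digit: weighted digit sum (from the right) modulo 10."""
    match = CAS_PATTERN.fullmatch(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

def find_cas_numbers(text):
    """Return all substrings that look like CAS numbers and pass the check digit."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text) if is_valid_cas(m.group(0))]

print(find_cas_numbers("Aspirin (CAS 50-78-2) and a mistyped 50-78-3."))
```

The check digit only rejects malformed candidates; whether a valid number refers to the expected compound still has to be confirmed against a database.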

Representing a Chemical Compound. How much information do you want to include? Atoms present; connections between atoms (bond types); isotopes; charges; stereochemical configuration. (Figure: example structure with an isotope label and charged atoms.)

Modeling of Chemicals as Graphs. Why use graph theory? It is an established mathematical field; graphs can be easily represented in computers; algorithms already exist for comparison, searching, etc. Unlike humans, computers aren't very good at pattern recognition. Similar or same?
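The "similar or same?" question can be made concrete with a labeled graph and an isomorphism test. The sketch below is a deliberately naive version, assuming heavy atoms only and a brute-force search over atom permutations; it is meant to show the idea, not an algorithm suitable for real molecules.

```python
from itertools import permutations

def make_graph(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j, order) tuples."""
    edges = {frozenset((i, j)): order for i, j, order in bonds}
    return atoms, edges

def same_structure(g1, g2):
    """Brute-force labeled graph isomorphism (factorial cost, tiny graphs only)."""
    atoms1, edges1 = g1
    atoms2, edges2 = g2
    if sorted(atoms1) != sorted(atoms2) or len(edges1) != len(edges2):
        return False
    n = len(atoms1)
    for perm in permutations(range(n)):
        # Skip mappings that would change an element symbol.
        if any(atoms1[i] != atoms2[perm[i]] for i in range(n)):
            continue
        mapped = {frozenset(perm[i] for i in edge): order
                  for edge, order in edges1.items()}
        if mapped == edges2:
            return True
    return False

ethanol = make_graph(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_renumbered = make_graph(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
dimethyl_ether = make_graph(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(same_structure(ethanol, ethanol_renumbered))  # same molecule, different numbering
print(same_structure(ethanol, dimethyl_ether))      # same formula, different connectivity
```

The factorial cost of this search is one face of the "many graph algorithms are inherently slow" problem; practical toolkits rely on canonical numbering or pruned backtracking instead.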

Computer Representation. A typical example: MDL MOL file (SDF). For more information on MDL formats, see http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
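To show what such a file holds, here is a minimal sketch of a reader for the V2000 connection table, which uses fixed-width columns. The sample record (ethanol, heavy atoms only) and the function name are made up for this example; only the counts line, the atom block, and the bond block are read, and property lines are ignored.

```python
# Hypothetical sample record; built line by line so column widths stay exact.
SAMPLE_MOLFILE = "\n".join([
    "ethanol, heavy atoms only (hypothetical example)",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

def parse_molfile(text):
    """Return (atoms, bonds) from a V2000 molfile string.

    atoms: list of (symbol, x, y, z); bonds: list of (atom1, atom2, order),
    with atom numbers starting at 1 as in the file.
    """
    lines = text.splitlines()
    counts = lines[3]  # three header lines precede the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds

atoms, bonds = parse_molfile(SAMPLE_MOLFILE)
print(atoms)
print(bonds)
```

Reading by column position rather than splitting on whitespace matters because adjacent fields may touch, for example when a negative coordinate fills its whole column.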

Disadvantages of Using Graphs. Many graph algorithms are inherently slow. The analogy between chemical structures and graphs is not perfect. Realities of chemical structures cause problems: aromaticity, stereochemistry, tautomerism, inorganic compounds, macromolecules and polymers, incompletely-defined substances.

Good News. There is only a limited number of chemical drawing tools (and these use templates): ChemDraw (CambridgeSoft), ChemSketch (ACD), ISIS/Draw (MDL), Java applets (ChemAxon), ... Result: reduced complexity.

chemocr: Reconstruction of Chemical Compounds
  1. Document
  2. Depiction
  3. Reconstruction
  4. SDF file

  -ISIS-  09230315072D
 27 29  0  0  0  0  0  0  0  0999 V2000
   -0.9348   -0.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9359   -1.2274    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2211   -1.6402    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4953   -1.2269    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4925   -0.3964    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2229    0.0128    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0750   -1.8084    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0708   -2.6334    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7875   -1.3917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5042   -1.8034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2162   -1.3874    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2120   -0.5611    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4899   -0.1526    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7808   -0.5709    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0042   -0.3417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.0083   -1.5959    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
    4.4125   -1.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2375   -1.0542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.4792   -3.2167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2167   -2.3917    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
    5.0125   -2.1750    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4167   -2.6042    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.4292   -3.1875    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    5.2250   -3.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0125   -3.9000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3458   -3.2125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.8875   -3.9292    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0

CSR (Compound Structure Reconstruction)
  Workflow components (flowchart): raster images, page segmentation, image preprocessing, connected components, component classifier, vectorizer, approx. graph matcher, molecular graph converter
  Supporting modules: superatoms OCR, common fragments module, chemical rules module, s-atom database, machine learning tool, machine learning tool, manual curation tool, chemical cartridge, molecule database

Preprocessing Steps: page segmentation; image extraction; image conversion (image restoration, adaptive binarization, ...).
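Adaptive binarization can be sketched as a local-mean threshold: a pixel counts as ink when it is clearly darker than its neighborhood, which tolerates uneven page brightness. The window radius, the offset, and the function name below are assumptions for this example; the slide does not state which method chemocr uses.

```python
def adaptive_binarize(gray, radius=1, offset=10):
    """gray: 2D list of 0-255 intensities. Returns a 2D list of 0/1 (1 = ink)."""
    height, width = len(gray), len(gray[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighborhood, clipped at the image border.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if gray[y][x] < local_mean - offset:
                result[y][x] = 1
    return result

# A dark vertical stroke (60) on a background that fades from 200 to 120.
page = [
    [200, 200, 60, 120, 120],
    [200, 200, 60, 120, 120],
    [200, 200, 60, 120, 120],
]
print(adaptive_binarize(page))
```

A single global threshold of 150 would also mark the darker right-hand background as ink; the local comparison keeps only the stroke.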

Connected Component Analysis: building an image tree; using adaptive nested TreeMaps.
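The basic operation behind this step, grouping touching ink pixels into components, can be sketched with a breadth-first flood fill. This is a generic labeling routine with assumed function names; it does not reproduce the image tree or the nested TreeMap structure named on the slide.

```python
from collections import deque

def connected_components(binary):
    """Group 8-connected foreground pixels; returns a list of [(row, col), ...]."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return min(rows), min(cols), max(rows), max(cols)

image = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
components = connected_components(image)
print(len(components), [bounding_box(p) for p in components])
```

Bounding boxes like these are a natural input for a later classification step, since a long thin box suggests a bond and a compact one suggests a character.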

Component Classification
  1. Raster image
  2. Extract features
  3. Classify as: single bonds, double bonds, thick chirals, dotted chirals, text
  4. Manual curation

Atomtype Reconstruction
  1. Train new characters
  2. Define new superatoms
  3. Expand superatoms
  Need for a chemically intelligent OCR
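Superatom handling can be pictured as a lookup from recognized text labels to structure fragments, with unknown labels routed to a curator who defines them. The table entries, the SMILES fragments, and the function name below are assumptions for illustration and do not reflect the contents of the actual superatom database.

```python
# Illustrative superatom table: label -> SMILES fragment (attachment at the first atom).
SUPERATOMS = {
    "Me": "C",
    "Et": "CC",
    "OMe": "OC",
    "COOH": "C(=O)O",
    "Ph": "c1ccccc1",
}

def split_known_unknown(labels, table=SUPERATOMS):
    """Expand labels found in the table and collect those that still need defining."""
    known = {label: table[label] for label in labels if label in table}
    unknown = [label for label in labels if label not in table]
    return known, unknown

ocr_labels = ["OMe", "COOH", "Boc"]
known, unknown = split_known_unknown(ocr_labels)
print(known, unknown)

# "Define new superatoms": a curator supplies the missing fragment once.
SUPERATOMS["Boc"] = "C(=O)OC(C)(C)C"
print(split_known_unknown(ocr_labels)[1])
```

After the new entry is added, the same labels expand without manual intervention, which is how the table grows with each curated data set.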

Vectorization: fixing vectorization errors using relative neighborhood graphs; disconnections; antiparallel double bonds; fixing bond lengths; dubious links. Need for a chemically intelligent vectorizer.
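A relative neighborhood graph joins two points unless a third point is closer to both of them than they are to each other. The sketch below builds that graph over stroke endpoints and then flags short graph edges that are not strokes as possible disconnections. The gap tolerance, the sample coordinates, and the way the graph is used here are assumptions; the slide names the technique but not the details.

```python
from itertools import combinations
from math import dist

def relative_neighborhood_graph(points):
    """Return index pairs (i, j) such that no third point is closer to both."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in range(len(points)) if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Endpoints of two vectorized strokes with a small gap between them.
points = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (2.0, 0.0)]
strokes = {(0, 1), (2, 3)}

rng_edges = relative_neighborhood_graph(points)
gap_tolerance = 0.2  # assumed threshold, in the same units as the coordinates
gaps = [
    (i, j) for i, j in rng_edges
    if (i, j) not in strokes and dist(points[i], points[j]) < gap_tolerance
]
print(rng_edges)
print(gaps)
```

The pairwise test with a scan over all other points costs cubic time, which is acceptable for the few dozen endpoints of a single depiction.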

Graph Matching: using a line graph representation; searching for subgraph isomorphism; database with common fragments; decomposition network for fragments; recognizing new fragments; graph matching as a solution for mapping bridged ring systems.
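In a line graph, each bond becomes a node and two nodes are joined when the bonds share an atom, which suits vectorized drawings where lines are detected before atoms. The following sketch shows only that conversion, with an assumed function name; the subgraph search and fragment database are not covered.

```python
from itertools import combinations

def line_graph(bonds):
    """bonds: list of (atom_i, atom_j). Returns pairs of bond indices sharing an atom."""
    return [
        (a, b)
        for a, b in combinations(range(len(bonds)), 2)
        if set(bonds[a]) & set(bonds[b])
    ]

chain = [(0, 1), (1, 2), (2, 3)]     # four atoms in a row
star = [(0, 1), (0, 2), (0, 3)]      # three bonds meeting at one atom
triangle = [(0, 1), (1, 2), (0, 2)]  # three-membered ring

print(line_graph(chain))
print(line_graph(star))
print(line_graph(triangle))
```

The star and the triangle produce the same line graph, so a matcher working on this representation needs extra information, such as atom labels or junction points, to tell them apart.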

Manual Curation of Errors: editing bonds; reconstruction score.

Post Processing: workflow plugin technology; 2D beautify; file format conversion; 2D to 3D conversion; name generation; property calculation / prediction.

A Real Challenge. Data set with ~7,600 depictions of natural products, used to get new scaffolds and superatoms, to incorporate the CSR workflow into a grid service, and to add a database interface. Current status: ~3,400 fully reconstructed! But we need more real training sets (i.e., pictures and the solved structure).

Future Work. Incompletely-defined substances: unknown stereochemistry, unknown attachment position, unknown repetition. (Figure: example structures with NH2, Cl, OH, and a repeat count n.)

Markush ("Generic") Structures and Reaction Schemes: shorthand for describing sets of structures with common features; structures with R-groups; very important in chemical patents; can be used to describe combinatorial libraries; can be used as queries in database searches. (Figure: scaffold bearing OH, R1, and R2, with lists of R-group substituents including halogens and alkyl chains.)
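The link between a Markush structure and a combinatorial library can be shown by enumerating every combination of R-group choices. The scaffold SMILES template and the substituent lists below are hypothetical, loosely modeled on the halogens and alkyl groups in the figure; plain string substitution is used, so no chemistry toolkit validates the results.

```python
from itertools import product

# Hypothetical phenol scaffold with two substitution points.
SCAFFOLD = "Oc1cc({R1})ccc1{R2}"
R_GROUPS = {
    "R1": ["C", "CC"],         # methyl, ethyl
    "R2": ["Cl", "Br", "I"],   # halogens
}

def enumerate_markush(scaffold, r_groups):
    """Return one SMILES string per combination of R-group choices."""
    names = sorted(r_groups)
    return [
        scaffold.format(**dict(zip(names, choice)))
        for choice in product(*(r_groups[name] for name in names))
    ]

library = enumerate_markush(SCAFFOLD, R_GROUPS)
print(len(library))
print(library)
```

The library size is the product of the list lengths, which is why patents with several long R-group lists cover very large structure sets and are usually searched as queries rather than enumerated.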

The Mission: Combination of CSR and Text Mining
  Image Analysis / Structure Reconstruction: -CH3, -CH2-CH3, -CH2-CNHS, -COOH
  Text Analysis / Entity Recognition: cytochrome inhibition, PPAR activation, stability in serum, side effect, blood-brain barrier
  Combined: -CH3: PPAR activation; -CH2-CH3: cytochrome inhibition; -CH2-CNHS: stability in serum; -COOH: side effect
  Reconstruction of published Chem-, Pharm- and PatentSpace

The Team (in the order of appearance): Tanja Fey, Le Thuy Bui Thi, Christoph Friedrich, Yuan Wang, Maria-Elena Algorri, Miguel Alvarez, Wei Wang.

CSR Software. Demo available. CSR can extract chemical depictions from various image sources and convert them into SD files, which can be used in nearly all chemical software; it allows reconstructed molecules to be modified in a structure editor; it maintains the superatom and bond (single, double, triple, or chiral) information; and it accepts user curation and scoring schemas to improve its performance.