Large Scale Evaluation of Chemical Structure Recognition. 4th Text Mining Symposium in Life Sciences, October 10, 2006. Dr.

Large Scale Evaluation of Chemical Structure Recognition
4th Text Mining Symposium in Life Sciences, October 10, 2006
Dr.

Overview
- Brief introduction
- Chemical Structure Recognition (chemocr)
- Manual conversion of images
- Upscaling and automation
- Protocol database and parameter evaluation
- 2 methods of validation
- Test and benchmark data sets
- Examples, results and lessons learned

Chemical Structure Recognition: an Overview
[Figure: numbered pipeline, steps 1 to 5, with the labels Document, Depiction, Reconstruction, SDF file, in silico Chemistry]
SDF file excerpt (header, counts line and the first 12 atom records):
    created from /home/marc/workspace/csr/results/csr/examples/us2005182053/US2005182053_result.pnm
    MZCSRv0.5010050621162D 0.00000 0.00000 0
    26 28 0 1 0 0 0 0 0999 V2000
    204.0000 102.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    275.0000 61.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    201.0000 59.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    422.0000 178.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    311.0000 164.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    384.0000 165.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    447.0000 144.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    383.0000 123.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    131.0000 60.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    239.0000 123.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    349.0000 218.0000 0.0000 R# 0 0 0 0 0 0 0 0 0 0 0 0
    447.0000 207.0000 0.0000 R# 0 0 0 0 0 0 0 0 0 0 0 0
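Each atom record in the SDF excerpt on this slide carries x, y and z coordinates, an element symbol and twelve flag fields. The following is a minimal sketch of reading such records in Python. It assumes whitespace-separated fields as they appear in this transcript (real V2000 molfiles are fixed-width, so a production reader would slice by column), and the names `Atom` and `parse_atom_record` are hypothetical, not part of CSR.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    symbol: str


def parse_atom_record(record: str) -> Atom:
    """Parse one whitespace-separated atom record: x y z symbol flags..."""
    fields = record.split()
    if len(fields) < 4:
        raise ValueError(f"not an atom record: {record!r}")
    x, y, z = (float(value) for value in fields[:3])
    return Atom(x, y, z, fields[3])


# Two records copied from the slide; "R#" marks an R-group placeholder atom.
records = [
    "204.0000 102.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "349.0000 218.0000 0.0000 R# 0 0 0 0 0 0 0 0 0 0 0 0",
]
atoms = [parse_atom_record(r) for r in records]
r_groups = [a for a in atoms if a.symbol == "R#"]
print(len(atoms), len(r_groups))
```

Filtering on the `R#` symbol is one simple way to spot structures that still contain unresolved R-groups before they are passed on to other software.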

The chemocr Process
chemocr is a multi-step process:
1. image preprocessing
2. image conversion
3. semantic entity recognition
4. chemical structure assembly
5. reconstruction validation
6. post-processing
- for each step a specific module has been implemented
- modules can be assembled into workflows
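The idea of assembling per-step modules into a workflow can be sketched as a list of functions applied in order. This is only an illustration under stated assumptions: the modules below are placeholders that record their own name, and `make_step` and `run_workflow` are hypothetical helpers, not the CSR implementation.

```python
from typing import Callable, Dict, List

# Each module takes the shared state dict and returns it, possibly extended.
Module = Callable[[Dict[str, object]], Dict[str, object]]


def make_step(name: str) -> Module:
    """Build a placeholder module that only records that it ran."""
    def step(state: Dict[str, object]) -> Dict[str, object]:
        history = list(state.get("history", []))
        history.append(name)
        return {**state, "history": history}
    return step


def run_workflow(modules: List[Module], state: Dict[str, object]) -> Dict[str, object]:
    """Apply the modules in order, passing the state from one to the next."""
    for module in modules:
        state = module(state)
    return state


STEP_NAMES = [
    "image preprocessing",
    "image conversion",
    "semantic entity recognition",
    "chemical structure assembly",
    "reconstruction validation",
    "post processing",
]
workflow = [make_step(name) for name in STEP_NAMES]
result = run_workflow(workflow, {"image": "example.png"})
print(result["history"])
```

Keeping every module behind the same signature is what makes it possible to reorder, skip or swap steps when a different workflow is needed.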

Look and Feel of CSR
[Screenshot: input image and reconstructed molecule]

The Interactive User Mode
- a graphical user interface has been developed
- the user can trigger each module separately
- curators and editors let the user intervene in the process
- the main advantages are:
  - full control of the process
  - easier than redrawing the image
  - teaching and learning capabilities of the system

Adding New Modules
- using Java APIs and RPCs
- workflow plugin technology
Items shown on the slide: Marvin MSketch, beautify 2D, file format conversion, 2D to 3D conversion, name generation, property calculation / prediction, Open Babel
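A plugin mechanism of this kind is often built around a registry that maps a name to a callable. The sketch below shows the pattern in Python purely for illustration; the slide states that CSR itself uses Java APIs and RPCs, and the registry, the decorator and the toy property are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a plugin name to a callable on a molecule record.
PLUGINS: Dict[str, Callable[[dict], dict]] = {}


def register(name: str):
    """Decorator that adds a function to the plugin registry under `name`."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator


@register("property calculation")
def heavy_atom_count(molecule: dict) -> dict:
    """Toy property: count atoms that are not hydrogen."""
    count = sum(1 for symbol in molecule["atoms"] if symbol != "H")
    return {**molecule, "heavy_atoms": count}


molecule = {"atoms": ["C", "C", "O", "H", "H"]}
result = PLUGINS["property calculation"](molecule)
print(result["heavy_atoms"])
```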

The Distributed Batch Mode: Scaling Up the Process
Setting up the batch mode:
- a specific workflow is predefined
- a suitable parameter set is chosen
- each image becomes one job, which is sent to one computer
- all results are assembled
Advantages:
- large speed-up
- fewer human resources needed
- vast number of results
Disadvantages:
- no control
- errors occur
- checking the results is time-consuming
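The one-image-one-job pattern can be sketched locally with a worker pool standing in for the cluster. This is an assumption-laden illustration: `reconstruct` is a placeholder for a real reconstruction job, the file names are invented, and a thread pool replaces the grid scheduler used in the actual setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def reconstruct(image_path: str, parameters: Dict[str, float]) -> Dict[str, object]:
    """Placeholder for one reconstruction job; a real job would run the workflow."""
    return {"image": image_path, "parameters": parameters, "status": "done"}


def run_batch(image_paths: List[str], parameters: Dict[str, float],
              workers: int = 4) -> List[Dict[str, object]]:
    """Submit one job per image and assemble the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(reconstruct, path, parameters) for path in image_paths]
        return [future.result() for future in futures]


images = [f"image_{i:03d}.png" for i in range(5)]
results = run_batch(images, {"text_area": 40.0})
print(len(results), results[0]["image"])
```

Because every job receives the same parameter set and no job depends on another, the work parallelizes trivially; the cost named on the slide is that nobody inspects the individual results as they are produced.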

Many Images, Many Parameters?
Geometric constraints:
- chiral: # segments, orientation
- bond: min length, max length
- character: area, size, shape
Parameter values are either:
- predefined for image sets (currently 4)
- estimated from the image itself
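A predefined parameter set for one image class could be represented as a small validated record. The sketch below is hypothetical throughout: the field names, the pixel values and the `ParameterSet` class are illustrative and not taken from CSR; only the idea of minimum and maximum bond lengths and character areas comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    """Geometric constraints for one image set; all values are placeholders."""
    bond_min_length: float
    bond_max_length: float
    char_min_area: float
    char_max_area: float

    def __post_init__(self) -> None:
        if not 0 < self.bond_min_length <= self.bond_max_length:
            raise ValueError("bond length bounds must satisfy 0 < min <= max")
        if not 0 < self.char_min_area <= self.char_max_area:
            raise ValueError("character area bounds must satisfy 0 < min <= max")

    def is_plausible_bond(self, length: float) -> bool:
        """True if a line segment of this length could be a bond."""
        return self.bond_min_length <= length <= self.bond_max_length


# Hypothetical predefined set for thin, clean drawings (units: pixels).
thin_clean = ParameterSet(bond_min_length=10, bond_max_length=80,
                          char_min_area=20, char_max_area=400)
print(thin_clean.is_plausible_bond(41.0), thin_clean.is_plausible_bond(150.0))
```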

Technical Solution for Upscaling
[Diagram labels: job description (sh-script), Sun Grid Engine cluster, license server, CSR workstation, DB server, query / analysis, MySQL database]

Protocol Database for the Reconstruction Process
[Figure: ER diagram]
1. session information
2. workflow configuration
3. inputs and outputs
Example query: show me all molecules which have an atom error after changing the text_area parameter
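The example query on this slide can be sketched against a toy schema. Everything about the schema is an assumption made for illustration: the table and column names are invented, the data is made up, and Python's built-in `sqlite3` stands in for the MySQL database used in the real setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE session (id INTEGER PRIMARY KEY, changed_parameter TEXT);
    CREATE TABLE result (
        session_id INTEGER REFERENCES session(id),
        molecule TEXT,
        error_class TEXT  -- NULL when no error was recorded
    );
""")
conn.executemany("INSERT INTO session VALUES (?, ?)",
                 [(1, None), (2, "text_area"), (3, "bond_max_length")])
conn.executemany("INSERT INTO result VALUES (?, ?, ?)", [
    (1, "mol_a", None), (1, "mol_b", None),
    (2, "mol_a", "atom"), (2, "mol_b", None),
    (3, "mol_b", "atom"),
])

# Molecules with an atom error in sessions where text_area was the changed parameter.
rows = conn.execute("""
    SELECT r.molecule
    FROM result AS r
    JOIN session AS s ON s.id = r.session_id
    WHERE s.changed_parameter = ? AND r.error_class = ?
    ORDER BY r.molecule
""", ("text_area", "atom")).fetchall()
molecules = [row[0] for row in rows]
print(molecules)
```

Recording the changed parameter per session is what makes such a question answerable after the fact; without the protocol, an error could not be traced back to the parameter change that caused it.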

Result Validation Using Training and Test Data
[Diagram labels: image corpora, CSR reconstruction, reconstruction, test molecules, result validation, result analysis]
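Comparing a reconstruction against a known test molecule can be sketched with simple order-independent invariants. This is a deliberately weak stand-in, assuming a toy molecule representation: equal element and bond-order counts are necessary but not sufficient for identity, and a real validation would compare the molecular graphs. The "bond error" label and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Molecule = Dict[str, list]  # {"atoms": [symbols], "bonds": [(i, j, order)]}


def invariants(mol: Molecule) -> Tuple[Counter, Counter]:
    """Element counts and bond-order counts; not a full graph comparison."""
    atoms = Counter(mol["atoms"])
    bonds = Counter(order for _, _, order in mol["bonds"])
    return atoms, bonds


def validate(reconstruction: Molecule, reference: Molecule) -> List[str]:
    """Return the error classes found; an empty list means no difference was detected."""
    rec_atoms, rec_bonds = invariants(reconstruction)
    ref_atoms, ref_bonds = invariants(reference)
    errors = []
    if rec_atoms != ref_atoms:
        errors.append("atom error")
    if rec_bonds != ref_bonds:
        errors.append("bond error")
    return errors


reference = {"atoms": ["C", "C", "O"], "bonds": [(0, 1, 1), (1, 2, 2)]}
reconstruction = {"atoms": ["C", "C", "C"], "bonds": [(0, 1, 1), (1, 2, 2)]}
print(validate(reconstruction, reference))
print(validate(reference, reference))
```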

Validation Classes: A Closer Look
[Figure labels: validation test, reconstruction / test molecule]

Reconstruction Error Prediction as an Alert System
Result validation can only be used if the molecule is already known or an expert is checking the result:
- good for bug fixing and training of the process
- can't be used for the data generation process
The batch mode needs a different strategy:
- identify and predict reconstruction errors
- alert the user only if interaction is needed
- choose a threshold for the precision

Error Prediction: the Theory
Prediction and recognition can be based on:
- the use of chemical knowledge bases
- image properties, i.e. measure the complexity of the problem
- instance-based machine learning, i.e. teach the system
The main goal is to assemble a reconstruction score without knowing the correct solution:
$R_{\text{score}} = w_1 \cdot \text{complexity} + w_2 \cdot \text{chemical likelihood} + w_3 \cdot \text{known errors}$
Alert if $R_{\text{score}} < T$.
The weights $w_i$ can be set by regression analysis.
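The weighted score and the threshold test can be sketched in a few lines. The weights, feature values and threshold below are invented for illustration, and the sign convention (higher score means a more trustworthy reconstruction) is an assumption; the slide only states that the weights can be obtained by regression analysis, which this sketch does not perform.

```python
from typing import Sequence


def reconstruction_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of (complexity, chemical likelihood, known errors) features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def needs_alert(score: float, threshold: float) -> bool:
    """Alert the user when the score falls below the threshold T."""
    return score < threshold


# Hypothetical weights and feature values; signs chosen so that higher is better.
weights = (-0.5, 1.0, -0.25)
features = (0.4, 0.9, 2.0)  # complexity, chemical likelihood, known errors
score = reconstruction_score(features, weights)
print(round(score, 3), needs_alert(score, threshold=0.5))
```

With these numbers the score is about 0.2, which is below the threshold of 0.5, so the reconstruction would be flagged for user interaction.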

Established Error Classes
Chemical knowledge bases:
- OCR errors and unknown super atoms
- valence checking
- known scaffolds
Image properties:
- strange bond drawings (size, angles, ...)
- pixel density, size of connected components
- complexity
Instance-based machine learning (IBL):
- atom and bond distributions
- Lipinski score (i.e. drug-like)
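Valence checking is the easiest of these error classes to illustrate: sum the bond orders at each atom and compare against the maximum the element normally allows. The sketch below assumes a simplified table for neutral atoms and a toy atom and bond list; it is not the check used in CSR, and unknown symbols such as super atoms are simply skipped.

```python
from typing import Dict, List, Tuple

# Simplified maximum valences for neutral atoms; charged or hypervalent
# species (e.g. nitro groups, sulfones) would need a richer table.
MAX_VALENCE: Dict[str, int] = {"H": 1, "C": 4, "N": 3, "O": 2,
                               "F": 1, "Cl": 1, "Br": 1, "I": 1}


def valence_violations(atoms: List[str], bonds: List[Tuple[int, int, int]]) -> List[int]:
    """Return indices of atoms whose summed bond orders exceed the table value."""
    used = [0] * len(atoms)
    for begin, end, order in bonds:
        used[begin] += order
        used[end] += order
    return [
        index for index, symbol in enumerate(atoms)
        if symbol in MAX_VALENCE and used[index] > MAX_VALENCE[symbol]
    ]


# Carbon 0 carries bond orders 2 + 2 + 1 = 5, e.g. after a misread double bond.
atoms = ["C", "O", "O", "C"]
bonds = [(0, 1, 2), (0, 2, 2), (0, 3, 1)]
print(valence_violations(atoms, bonds))
```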

The Results: Current Status
Test sets (timeline 2005, 2006, 2007): 100x, 100x, 7600x, 750,000x
Measurement:
- correct: input identical to output
- incorrect: more than one error
Results:
- thin & clean: > 95%
- thick & clean: > 75%
- challenging mixture: > 55%
- patents: > ??%

Not Too Bad...
- simple bridges
- large molecules
- salts and chirals

Questionable
- extremely long bonds
- new bond types
- close connections
- SMILES and the like

Really Bad...
- overlaps
- bad quality
- thick bridges
- complicated bridged rings
- amino acid orientation

Patent Images...
- R-groups and captions: [ ], file format SDF
- reaction schema: [?], file format RDF
- Markush structures: [ ]

So We Got an Error Reported
- need perfect reconstruction: start the molecule editor
- need it for indexing and retrieval: use similarity and substructure searches
- error to be tolerated? specify a reporting threshold
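Similarity searching tolerates small reconstruction errors because it ranks by overlap rather than requiring an exact match. A minimal sketch using the Tanimoto coefficient on feature sets follows; the fragment strings, the library and the threshold are hypothetical, and a real system would use chemical fingerprints from a cheminformatics toolkit.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_search(query: FrozenSet[str], library: Dict[str, FrozenSet[str]],
                      threshold: float) -> List[Tuple[str, float]]:
    """Return library entries at or above the threshold, best match first."""
    hits = [(name, tanimoto(query, features)) for name, features in library.items()]
    return sorted((hit for hit in hits if hit[1] >= threshold),
                  key=lambda hit: hit[1], reverse=True)


# Hypothetical fragment features; a real system would use chemical fingerprints.
library = {
    "reference": frozenset({"c1ccccc1", "C=O", "N-C", "O-C"}),
    "unrelated": frozenset({"S=O", "P-O"}),
}
# A reconstruction with one misread fragment still overlaps with the reference.
query = frozenset({"c1ccccc1", "C=O", "N-C", "Cl-C"})
hits = similarity_search(query, library, threshold=0.5)
print(hits)
```

Here three of the five distinct features are shared, giving a coefficient of 0.6, so the imperfect reconstruction still retrieves the reference entry.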

A Glimpse at the Future: Multi-Modal Extraction from Patents
[Diagram labels: ProMiner (NER); proteins, protein families, protein complex, compound, process, drug class, disease, pathways; PatentMine by FhG; CSR; PatentSpace; graphical mining]

Lessons Learned
- a generic chemocr framework has been established
- there is not, and there will not be, a one-size-fits-all solution
- CSR can be adapted and optimized (parameters, error models, image preprocessing, ...)
- although we have looked into many examples, we have not yet seen all sorts of image sources (e.g. legacy of old documents)
- we will continuously improve our methods as new challenges come along
You can get hands-on experience with CSR in an evaluation project. SCAI provides: training, installation support, bug fixing, fitting CSR to the data, long-term research agenda

The Team (in the order of appearance)
* Tanja Fey, Le Thuy Bui Thi, Christoph Friedrich*, Yuan Wang, Maria-Elena Algorri*, Miguel Alvarez, Angelika Weihermüller*, Wei Wang, Peter Kral*, Carina Haupt*
*) currently improving CSR

CSR Online Demo Available During the Break
CSR can extract chemical depictions from various image sources and convert them into SMILES and SD files, which can be further used in nearly all chemical software; it allows for the modification of reconstructed molecules by a structure editor; it maintains the superatom and bond (single, double, triple, or chiral) information; and it accepts user curation in each stage and scoring schema to improve its performance.