Challenges in Next Generation Scientific and Patent Information Mining

Similar documents
Mining for Chemistry in Text and Images. A Real-World Example revealing the Challenge, Scope, Limitation and Usability of the current Technology

Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?

Extracting Knowledge from Reaction Databases: Developments from InfoChem

Information Extraction from Chemical Images. Discovery Knowledge & Informatics April 24 th, Dr. Marc Zimmermann

Imago: open-source toolkit for 2D chemical structure image recognition

Large Scale Evaluation of Chemical Structure Recognition 4 th Text Mining Symposium in Life Sciences October 10, Dr.

CLRG Biocreative V

FROM MOLECULAR FORMULAS TO MARKUSH STRUCTURES

CHEMDRAW ULTRA ITEC107 - Introduction to Computing for Pharmacy. ITEC107 - Introduction to Computing for Pharmacy 1

Structure-based approaches to the indexing and retrieval of patent chemistry. Tim Miller Head of Research May 2010

Chemical Structure Reconstruction with chemocr

Chemical Journal Publishing in an Online World. Jason Wilde, Publisher Physical Sciences Nature Publishing Group ACS Spring Meeting 2009

Demonstration: Searching patents based on chemical structure using SciFinder

Progresses in Automated Chemical Structure Recognition in Text and Images

Acknowledgments xiii Preface xv. GIS Tutorial 1 Introducing GIS and health applications 1. What is GIS? 2

Computation Theory Finite Automata

Stoichiometry. Before You Read. Chapter 10. Chapter 11. Review Vocabulary. Define the following terms. mole. molar mass.

Chemical Text Mining for Current Awareness of Pharmaceutical Patents. Daniel Lowe and Roger Sayle NextMove Software Cambridge, UK

Chapter 5. Chemistry for Changing Times, Chemical Accounting. Lecture Outlines. John Singer, Jackson Community College. Thirteenth Edition

ISO INTERNATIONAL STANDARD. Geographic information Metadata Part 2: Extensions for imagery and gridded data

Describing Chemical Reactions

Automated Identification and Conversion of Chemical Names to Structure

THIS IS A NEW SPECIFICATION

Friday 10 June 2016 Afternoon Time allowed: 1 hour 30 minutes

Lab Activity 4: Structures and Properties of Aliphatic Compounds

Reaction classification, an enduring success story

ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature

Advanced Subsidiary Unit 1: The Core Principles of Chemistry

Automata Theory, Computability and Complexity

(a) Draw the electron density map for a chlorine molecule to show covalent bonding. (1)

Style guide for chemical structures

Chapter Introduction Lesson 1 Understanding Chemical Reactions Lesson 2 Types of Chemical Reactions Lesson 3 Energy Changes and Chemical Reactions

ICM-Chemist How-To Guide. Version 3.6-1g Last Updated 12/01/2009

SPECIMEN. day June 20XX Morning/Afternoon MAXIMUM MARK 70. AS Level Chemistry B (Salters) H033/02 Chemistry in depth SAMPLE MARK SCHEME

Chem101 - Lecture 5. Chemical Equations Chemical equations are used to describe chemical reactions.

Chapter 2 The text above the third display should say Three other examples.

TILING PROOFS OF SOME FIBONACCI-LUCAS RELATIONS. Mark Shattuck Department of Mathematics, University of Tennessee, Knoxville, TN , USA

GCSE (9 1) Combined Science (Chemistry) A (Gateway Science) J250/09 Paper 9, C1-C3 and CS7 (PAGs C1-C5) (Higher Tier)

Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patent Texts

Chemistry: The Central Science. Chapter 8: Basic Concepts of Chemical Bonding

1 hour 30 minutes plus your additional time allowance

POC via CHEMnetBASE for Identifying Unknowns

ISO INTERNATIONAL STANDARD. Geographic information Metadata Part 2: Extensions for imagery and gridded data

Handling Human Interpreted Analytical Data. Workflows for Pharmaceutical R&D. Presented by Peter Russell

E-EROS TUTORIAL Encyclopedia of Reagents for Organic Synthesis at the UIUC Chemistry Library

Chapter 8. Basic Concepts of Chemical Bonding

M10/4/CHEMI/SP2/ENG/TZ1/XX+ CHEMISTRY. Wednesday 12 May 2010 (afternoon) Candidate session number. 1 hour 15 minutes INSTRUCTIONS TO CANDIDATES

Advanced Subsidiary Unit 1: The Core Principles of Chemistry

This document consists of 12 printed pages and a Data Sheet for Chemistry.

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility

2. Which of the following salts form coloured solutions when dissolved in water? I. Atomic radius II. Melting point III.

POC via CHEMnetBASE for Identifying Unknowns

CHEMISTRY 2815/01 Trends and Patterns

Hierachical Name Entity Recognition

CHEM J-3 June 2014

(DO NOT WRITE ON THIS TEST)

Cherry Hill Tuition A Level Chemistry OCR (A) Paper 2 THIS IS A NEW SPECIFICATION

Marvin. Sketching, viewing and predicting properties with Marvin - features, tips and tricks. Gyorgy Pirok. Solutions for Cheminformatics

CHEM 120: Introduction to Inorganic Chemistry. Chapters Covered and Test dates. The mole concept and atoms. Avogadro s number

Chemistry 11 Unit 1 Safety in the Laboratory. Chemistry 11 Unit 2 Introduction to Chemistry

The Case for Use Cases

THE DIGITAL TERRAIN MAP LIBRARY: AN EXPLORATIONIST S RESOURCE

EQUILIBRIUM CONSTANT, K eq or K. The Law of Chemical Equilibrium: (Guldberg & Waage, 1864)

Science 10: CHEMISTRY Review Questions

ISIS/Draw "Quick Start"

Chemical Ontologies. Chemical Ontologies. ChemAxon UGM May 23, 2012

Available online at Analele Stiintifice ale Universitatii Al. I. Cuza din Iasi Seria Geologie 58 (1) (2012) 53 58

NAME: Chapter 12, Test 1: Chemical Bonding. Total Question(s): 20 Here are the quiz answers, in review:

ALE 4. Effect of Temperature and Catalysts on the Rate of a Chemical Reaction

Technical Specifications. Form of the standard

THE VIBRATIONAL SPECTRA OF A POLYATOMIC MOLECULE (Revised 3/27/2006)

Chemical Reactions and Equations

CS Sampling and Aliasing. Analog vs Digital

Tuning Color Through Substitution

Internet Appendix for Sentiment During Recessions

Calculating Bond Enthalpies of the Hydrides

REGI. Claims REGIstry file: Dictionary of specific chemical compounds for crossfile searching with the IFI/Plenum patent databases

Paper Reference. Friday 16 January 2009 Morning Time: 1 hour

ECE Lecture 1. Terminology. Statistics are used much like a drunk uses a lamppost: for support, not illumination

The Rate Law: Reactant Concentration and Rate. Relating Reactant Concentrations and Rate

Chapter 4: Covalent Bonding and Chemical Structure Representation

Rate Equations and Kp

S T A T I O N 1 E X O T H E R M I C / E N D O T H E R M I C P R O C E S S E S

Identification and Characterization of an Isolated Impurity Fraction: Analysis of an Unknown Degradant Found in Quetiapine Fumarate

SUPPLEMENTARY INFORMATION

Name: Class Period: Due Date: Unit 2 It s Elemental Test Review

Which of the following is NOT an example of a chemical reaction? A. Milk turning sour B. Food being digested C. A match burning D.

Chemistry Key Concepts - Atomic structure

Ch 1: Introduction: Matter and Measurement

Structure Drawing. March Use the New Features. Keyboard Shortcuts and Paste from ChemDraw

SciFinder introduction and a view behind the scene how this tool is made

CS 2336 Discrete Mathematics

Deterministic Finite Automaton (DFA)

Science of Synthesis Guided Examples

A (Mostly) Correctly Formatted Sample Lab Report. Brett A. McGuire Lab Partner: Microsoft Windows Section AB2

Unit 1, Lesson 07: Introduction to Covalent Bonding and the Octet Rule

The Silacyclobutene Ring: An Indicator of Triplet State Baird-Aromaticity

Buffer Data Capture. Exercise 4:

OBJECT BASED IMAGE ANALYSIS FOR URBAN MAPPING AND CITY PLANNING IN BELGIUM. P. Lemenkova

MARIYA INTERNATIONAL SCHOOL. Work sheet I. Term II. Level 8 Chemistry [MCQ] Name: THE PRTIODIC TABLE OF ELEMENTS

Transcription:

1 / 30 Challenges in Next Generation Scientific and Patent Information Mining Josef Eiblmaier (InfoChem), Hans Kraut (InfoChem), Larisa Isenko (InfoChem), Heinz Saller (InfoChem), Peter Loew (InfoChem)

2 / 30 Outline» Patents, a buried treasure?» Chemical NER» CDX work-up» Image recognition» Combining the techniques: ChemProspector

3 / 30 No. Applications Filed (Selected Patent Offices) Source: EPO (http://www.epo.org/searching/asian/trends_de.html), USPTO (http://www.uspto.gov/web/offices/ac/ido/oeip/taf/us_stat.htm)

4 / 30 Patents, a Buried Treasure?» Scientific knowledge contained in patents comprises about 70 80 percent of all scientific knowledge» In most cases, a patent is the first publication in the research process» Important for competitor monitoring, technology assessment, R&D portfolio management» Access to external technological knowledge

5 / 30 Chemical Pharmaceutical Patents: a Buried Treasure? Chemical structures (images) Chemical names/fragments (text) Markush structures (text, images, CDX) Chemical structures (CDX files) Reactions (images, CDX files)

6 / 30 Chemical Named Entity Extraction: the Compulsory Exercise?» Named entity extraction (ICANNOTATOR) - Dictionary based - Morphology based - Context based» Name to structure conversion» Tagging of source documents - Highlighting - Enrichment with additional attributes - Linking (deep linking, cross document)

7 / 30 Challenges NER: OCR ed Documents Scanned source document: OCR results: Tile structure of the only reprcsentativcs o f unsyrnrnctrical 3,4'- and 4,5-disubstitutcd 2,2'-bithiophcncs 3,4'- dibromo-2.2'-bithiophcne (17) [29] and 4-(2-thienyl)-5-phcnyl-2,2'-bithiophene (18) [30] was determined. In the moleculc of the first compound slight deviation from planarity is observed, and as a result the torsion angle S,~,- C~:~-Cn.~-S~n is 175.(I~ The bond lengths and the angles of the heterocycle arc comparable with the values determined lbr thc other bithiophenes. The heterocyclcs in thc 2,2'-bithiophene fragmcnt of compound 18 are

8 / 30 Challenges NER: Digital Born PDF s PDF document: PdfToText results: [7a,8a,3,4 ]-N -(Phenyl)succinimido-6,14-endo-etheno.

9 / 30 CDX Workup: Nothing is as it Seems!» For various sources CS ChemDraw files are available on large scale MRW s (e.g. Science of Synthesis) Journals (e.g. Thieme, Wiley, Springer) Patents (e.g. USPTO: 2000 - present)» The contained structures/reactions can be worked-up automatically (ICSchemeProcessor)» But be careful: nothing is as it seems!

10 / 30 CDX Work-up: Nothing is as it Seems!» Reaction arrows / forked arrows / brackets

11 / 30 CDX Work-up: Nothing is as it Seems!» Unresolvable labels Labels not defined Element symbols used as R-group labels Ambiguous fragment labels (e.g. molecular formula)

12 / 30 CDX Work-up: Nothing is as it seems!» Variable points of attachment

13 / 30 CDX Work-up: Nothing is as it Seems!» R-group enumeration in the CDX file unstructured in the text structured in the text as table (XML)

14 / 30 CDX Work-up: Nothing is as it Seems!» Non chemical objects C-C bonds!

15 / 30 CDX Work-up: Solution/Approach ICSchemeProcessor» Algorithmic detection of features Repeating groups Aliases, R-groups within CDX files» Guidelines for authors and typesetters Syntax definitions for tables, R-groups etc. Syntax rules for captions Reaction arrangement, forked arrows Location and colour rules for reaction conditions (reactants, catalysts, solvents)

16 / 30 Image Recognition: the Supreme Challenge» In many cases only raster images are available» Multiple step process: page segmentation image classification vectorization, OCR, reconstruction» Part of ChemProspector: ICImg2Struct InfoChem proprietary development, started mid 2010

17 / 30 Image Recognition ICImg2Struct: Status and Evaluation» Used benchmark set: OSRA validation set (USPTO) 5,735 chemical structures (images and associated MOL files) (http://cactus.nci.nih.gov/osra/uspto-validation.zip)» Degree of match: InfoChem SimilarityMCD 100% correct: 4,234 structures (74%) 90-100%: 722 structures (12%) 0 90%: 779 structures (14%)

18 / 30 Original Image ICImg2Struct SimilarityMCD 100% 100% 96.3%

19 / 30 Image Recognition: Variable Points of Attachment ICImg2Struct ICImg2Struct ICImg2Struct

20 / 30 Image Recognition: More Challenges» Wavy bonds» Atom numbers» Brackets» Chlorine, Iodine» Circles» Fused characters» Charges» Crossing bonds» Variable bonds»

21 / 30 Image Recognition: Even More Challenges

22 / 30 Combining the Techniques: ChemProspector» Main emphasis: The automatic extraction of Markush Structures from patent documents» Research SME-project within the THESEUS research program» THESEUS: New Technologies for the Internet of Services» Duration THESEUS: five years (2007-2011)» Duration ChemProspector: July 2009 end of 2011

23 / 30 Main Challenges Image classifier Chemical recognition ICANNOTATOR Markush-Parser SemanticParser

24 / 30 ChemProspector: Approach OCR ICANNOTATOR Page segmentation Markush-Parser SemanticParser Persistence Layer Image classifier Chemical recognition

25 / 30 Results: Example Image classifier Chemical recognition ICANNOTATOR Markush-Parser SemanticParser

26 / 30 Results: US2003162298 A1

27 / 30 Results: US 20050154025 A1

28 / 30 Acknowledgements» The InfoChem Team» The German Federal Ministry of Economy and Technology (BMWi)

29 / 30 Thank you!

30 / 30 Questions?