ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature

ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature
Callum Court, Molecular Engineering, University of Cambridge
Supervisor: Dr Jacqueline Cole

Overview
1. Introduction
2. Previous work
3. Challenges
4. Overview of the toolkit
5. Applications

Introduction
- Approximately 20,000 new compounds and properties were published in 10,000 biomedical chemistry journals in 2013 alone [1]
- Ideally, we would compile all available scientific data into a database of material properties
- There is too much data to extract manually

Introduction
- Scientific results are typically presented in papers, patents, theses, etc.
- These contain unstructured and semi-structured data in the form of text, tables, captions and figures, which are not readily interpretable by machines
- Modern machine learning and natural language processing (NLP) techniques provide the means for automated information extraction

Previous work
Large-scale data mining for materials discovery:
- The Materials Genome Initiative [2]
- The Harvard Clean Energy Project [3]
- The Materials Project [4]
Text-mining tools for the chemistry domain:
- ChemicalTagger [5]
- ChemEx Project [6]

Previous work
- Previous methods tend to focus on predicting chemical properties confined to a particular field of research (photovoltaics, batteries, etc.)
- All would be well complemented by a generic method for generating databases of materials properties in a domain-independent way

Challenges
- Although the scientific literature is relatively formulaic and structured, text-mining it is very difficult
- Each sub-domain of science has its own specific terminology and abbreviations
- These conventions can vary between papers (and perhaps between sections)
- Sentences and paragraphs cannot be processed in isolation, because information is spread across multiple sections

ChemDataExtractor (CDE)
- A comprehensive toolkit for the automated extraction of chemical information from scientific documents
- Full extraction of melting points, glass transitions, UV-Vis absorption spectra and more
- Full source code and documentation are available under the MIT license at www.chemdataextractor.org

Document processing
- This stage converts differing file types into a single consistent structure consisting of abstracts, paragraphs, figures, captions and tables
- This enables all subsequent stages to perform in the same way regardless of the initial document type
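As a rough illustration of the idea, the sketch below reads two input formats into one shared structure. It is not CDE's actual reader code: the `Element` and `Document` classes, the two reader functions and the tag-to-kind mapping are all hypothetical names chosen for this example, and only the Python standard library is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List


@dataclass
class Element:
    kind: str  # e.g. "title", "paragraph" or "caption"
    text: str


@dataclass
class Document:
    elements: List[Element] = field(default_factory=list)


def read_plain_text(raw: str) -> Document:
    """Blank-line-separated blocks become paragraphs; the first block is the title."""
    blocks = [b.strip() for b in raw.split("\n\n") if b.strip()]
    return Document([Element("title" if i == 0 else "paragraph", b)
                     for i, b in enumerate(blocks)])


class _SimpleHTMLReader(HTMLParser):
    # Hypothetical mapping from HTML tags to element kinds.
    TAG_KINDS = {"h1": "title", "p": "paragraph", "figcaption": "caption"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._kind = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_KINDS:
            self._kind = self.TAG_KINDS[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._kind is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._kind is not None and self.TAG_KINDS.get(tag) == self._kind:
            text = " ".join("".join(self._buffer).split())  # normalise whitespace
            if text:
                self.elements.append(Element(self._kind, text))
            self._kind = None


def read_html(raw: str) -> Document:
    reader = _SimpleHTMLReader()
    reader.feed(raw)
    reader.close()
    return Document(reader.elements)
```

Because both readers return the same `Document` type, any later stage can iterate over `doc.elements` without knowing whether the input was plain text or HTML.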

Natural language processing
The key stage of the CDE pipeline, where relationships and information are extracted from the text of the document:
1. Tokenization
2. Part-of-speech tagging
3. Entity recognition
4. Phrase parsing
5. Information extraction
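The five steps can be sketched end to end on a single sentence. This is a toy illustration using only the Python standard library, not the models or grammars used by CDE: the token pattern, the tiny part-of-speech lexicon, the chemical-name dictionary and the single melting-point rule are all assumptions made for the example.

```python
import re

TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|°C|[A-Za-z][A-Za-z0-9\-]*|[^\sA-Za-z0-9]")
POS_LEXICON = {"the": "DT", "of": "IN", "was": "VBD", "is": "VBZ"}  # toy lexicon
CHEMICAL_NAMES = {"benzophenone", "naphthalene"}  # hypothetical dictionary


def tokenize(sentence):
    """1. Tokenization: split into numbers, units, words and punctuation."""
    return TOKEN_RE.findall(sentence)


def pos_tag(tokens):
    """2. Part-of-speech tagging by lookup (numbers are CD, unknown words NN)."""
    tagged = []
    for tok in tokens:
        if re.fullmatch(r"\d+(?:\.\d+)?", tok):
            tagged.append((tok, "CD"))
        else:
            tagged.append((tok, POS_LEXICON.get(tok.lower(), "NN")))
    return tagged


def recognise_entities(tagged):
    """3. Entity recognition: dictionary hits become chemical mentions (CM)."""
    return [(tok, pos, "CM" if tok.lower() in CHEMICAL_NAMES else "O")
            for tok, pos in tagged]


def parse_melting_point(entities):
    """4. Phrase parsing: one rule that needs 'melting point', a chemical
    mention, and a number directly followed by °C in the same sentence."""
    words = [tok.lower() for tok, _, _ in entities]
    has_property = any(words[i:i + 2] == ["melting", "point"]
                       for i in range(len(words) - 1))
    compound = next((tok for tok, _, ent in entities if ent == "CM"), None)
    value = next(
        (float(tok) for (tok, pos, _), (nxt, _, _) in zip(entities, entities[1:])
         if pos == "CD" and nxt == "°C"),
        None,
    )
    if has_property and compound is not None and value is not None:
        return compound, value
    return None


def extract(sentence):
    """5. Information extraction: run steps 1-4 and build a structured record."""
    parsed = parse_melting_point(recognise_entities(pos_tag(tokenize(sentence))))
    if parsed is None:
        return None
    compound, value = parsed
    return {"compound": compound, "property": "melting point",
            "value": value, "units": "°C"}
```

Calling `extract("The melting point of benzophenone was 48.5 °C.")` yields a record for benzophenone, while a sentence with no property phrase yields `None`. A real system would replace each lookup table with a trained model or a fuller grammar.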

Table parsing
- Tables are an ideal source for retrieving structured data
- This stage treats tables as highly condensed forms of text
- Specialised rules are used to parse table headers and columns in the same way as normal text
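One way to picture rule-based table parsing is shown below: a header rule splits each column heading into a property and a unit, and a cell rule pulls out the number. This is a minimal sketch under stated assumptions, not CDE's table parser. The header pattern, the `PROPERTY_ALIASES` dictionary and the convention that the first column holds the compound label are all hypothetical.

```python
import re

# Header rule: "<name> / <unit>", "<name> (<unit>)" or "<name> [<unit>]".
HEADER_RE = re.compile(
    r"^\s*(?P<name>[^/()\[\]]+?)\s*[/(\[]\s*(?P<unit>[^)\]]+?)\s*[)\]]?\s*$"
)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")
# Hypothetical mapping from header abbreviations to property names.
PROPERTY_ALIASES = {
    "mp": "melting point",
    "tg": "glass transition",
    "λmax": "absorption maximum",
}


def parse_header(cell):
    """Return (property, unit) for a recognised header cell, else None."""
    match = HEADER_RE.match(cell)
    if match is None:
        return None
    prop = PROPERTY_ALIASES.get(match.group("name").lower())
    return (prop, match.group("unit")) if prop else None


def parse_table(header, rows):
    """Turn each recognised (row, column) cell into one structured record."""
    columns = [parse_header(cell) for cell in header]
    records = []
    for row in rows:
        label = row[0].strip()  # assumption: first column is the compound label
        for column, cell in zip(columns[1:], row[1:]):
            number = NUMBER_RE.search(cell)
            if column is None or number is None:
                continue  # unrecognised header or non-numeric cell
            prop, unit = column
            records.append({"compound": label, "property": prop,
                            "value": float(number.group()), "units": unit})
    return records
```

For example, a table with the header `["Compound", "Mp / °C", "λmax (nm)"]` and two data rows produces four records, one per numeric cell.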

Interdependency resolution
- Finally, all information from the natural language processor and the table processor is brought together
- This stage resolves the interdependencies between different sections and compiles all information into a set of structured records
- These records can be easily compiled into a database
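A common interdependency is a compound that is named once in the text, for example "Benzophenone (1a)", and then referred to only by its label in a table. The sketch below shows one simple way such records could be merged. The label pattern and the record layout are assumptions for illustration, and CDE's own resolution logic may differ.

```python
import re
from collections import defaultdict

# Assumed convention: a name followed by a label in parentheses, e.g. "Benzophenone (1a)".
LABEL_DEF_RE = re.compile(r"\b([A-Za-z][A-Za-z\-]+)\s+\((\d+[a-z]?)\)")


def find_label_definitions(text):
    """Map each label (e.g. '1a') to the compound name that introduces it."""
    return {label: name for name, label in LABEL_DEF_RE.findall(text)}


def resolve(records, label_to_name):
    """Merge records from all sections under one key per compound."""
    merged = defaultdict(dict)
    for record in records:
        # Replace a label with its full name where a definition is known.
        key = label_to_name.get(record["compound"], record["compound"])
        merged[key][record["property"]] = (record["value"], record["units"])
    return dict(merged)
```

With a label definition found in the text, a table record for `1a` and a text record for `Benzophenone` end up in the same merged entry, which is then ready to be written to a database row.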

Performance
- Evaluation was performed on a set of 50 chemistry articles sourced from the Royal Society of Chemistry, the American Chemical Society and Elsevier
- Precision: the fraction of retrieved records that are correct
- Recall: the fraction of correct records that are retrieved
- F-score: the harmonic mean of precision and recall
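The three metrics follow directly from these definitions. The short function below computes them for two sets of records; the function name and the set-based representation of records are choices made for this example, not part of the evaluation described on the slide.

```python
def precision_recall_f(retrieved, correct):
    """Return (precision, recall, F-score) for two collections of records."""
    retrieved, correct = set(retrieved), set(correct)
    true_positives = len(retrieved & correct)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

As a worked case, if 4 records are retrieved, 5 are correct and 3 are in both sets, precision is 3/4, recall is 3/5 and the F-score is 2/3.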

Applications
Auto-generated databases of material properties can have great utility in materials science:
1. Materials or drug discovery
2. Property prediction
3. Compound identification
4. Research design

Current work
- Work is currently under way to enhance the capability of CDE to extract properties reported in the physics literature
- In particular, the focus is on extracting magnetic properties, with the aim of creating large auto-generated databases of magnetic properties

The Snowball algorithm
- The rule-based approach to phrase parsing is highly inefficient
- The Snowball algorithm [7] is a semi-supervised machine learning approach to probabilistic phrase parsing
- Initial results demonstrate a large increase in precision and F-score for CDE when a Snowball step is included in the pipeline
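The core bootstrapping loop of a Snowball-style extractor can be sketched as follows. This is a heavily simplified illustration, not the full algorithm of [7] and not CDE's implementation: it omits pattern clustering and pattern confidence scoring, it assumes entity recognition has already produced `(compound, middle context, value)` candidates, and the similarity threshold is an arbitrary choice.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def snowball(candidates, seeds, threshold=0.8, rounds=2):
    """Grow a set of (compound, value) tuples from a few seed tuples.

    candidates: list of (compound, middle_context, value) triples.
    seeds: known-good (compound, value) tuples.
    """
    known = set(seeds)
    for _ in range(rounds):
        # Learn patterns from the contexts of tuples already believed correct.
        patterns = [Counter(middle.lower().split())
                    for compound, middle, value in candidates
                    if (compound, value) in known]
        new = set()
        for compound, middle, value in candidates:
            if (compound, value) in known:
                continue
            context = Counter(middle.lower().split())
            # Accept a candidate whose context resembles any learned pattern.
            if any(cosine(context, p) >= threshold for p in patterns):
                new.add((compound, value))
        if not new:
            break  # nothing learned this round, so stop early
        known |= new
    return known
```

Starting from one seed tuple, a candidate whose context reads almost the same as the seed's context is accepted, while a candidate from an unrelated sentence is rejected. Each accepted tuple contributes its own context as a pattern in the next round, which is what lets the set grow from a small amount of labelled data.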

Summary
- ChemDataExtractor provides a complete pipeline for automatically extracting chemical data from the scientific literature in a domain-independent way
- The overall system achieves a high F-score of over 90% when applied to the chemistry literature
- Further enhancements may push this score even higher and make the system better suited to the physics domain
- This has great potential for use in materials science research

References
[1] Rémy D. Hoffmann, Arnaud Gohier, and Pavel Pospisil. Data mining in drug discovery. Wiley-VCH, 2013. Chap. 5.
[2] National Science and Technology Council (US). Materials genome initiative for global competitiveness. Executive Office of the President, National Science and Technology Council, 2011.
[3] Roberto Olivares-Amaya et al. Accelerated computational discovery of high-performance materials for organic photovoltaics by means of cheminformatics. Energy & Environmental Science 4.12 (2011).
[4] Anubhav Jain et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials 1.1 (2013).
[5] Lezan Hawizy et al. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics 3.1 (2011).
[6] Atima Tharatipyakul et al. ChemEx: information extraction system for chemical data curation. BMC Bioinformatics 13.17 (2012).
[7] Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. Proceedings of the fifth ACM conference on Digital libraries. ACM, 2000.