Expanding the scope of literature data with document to structure tools: PatentInformatics applications at Aptuit

Alfonso Pozzan
Computational and Analytical Chemistry, Drug Design and Discovery Department, Aptuit Verona (Italy)
Chemaxon 2014 User Group Meeting, Budapest, 20th-21st May 2014

Verona: Major Site Facilities
Drug Design & Discovery; Preclinical Bioscience; API Development & Manufacture; Solid State Chemistry; Solid & Oral Dosage Form Development; Physical & Analytical Chemistry; Clinical Sciences

Background
In recent decades, the amount of freely or easily available information on the biological activity of drug-like molecules has steadily increased. In the past, most published small-molecule bioactivity data were available only through commercial products (SciFinder, CrossFire, AureusScience, ...). Small companies and academic institutions can now extend the scope of knowledge-based drug discovery thanks to these initiatives:
ChEMBL
DrugBank
BindingDB
PubChem
ChemSpider
In the case of ChEMBL, for example, the core data are manually extracted from the full text of peer-reviewed scientific publications in a variety of journals, such as Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry Letters and Journal of Natural Products. For patents, it is now possible to search more than 20M structures using the SureChem Open application.

Chemaxon Document to Structure (D2S)

Case study
Recent Orexin1/Orexin2 antagonist patents (23, of which 4 in Japanese*):
WO2013182972A, US2012165331, WO2013139730, WO2013127913A1, WO2013119639, WO2013092893A1, WO13092893, WO2013068935, WO2010048012, WO2010048016A, WO2008147518A1, WO2010048013A1, WO2004026866A1, WO2013050938, WO2013059163, WO2013062858, WO2013062857, WO2013059222, WO2011006960A1, WO2012039371A1*, WO2012081692A1*, WO2012153729A1*, WO2013123240A1*
Structure extraction was performed with Chemaxon molconvert (6.2.1) and Igor Filippov's OSRA (2.0.0), all running on an IBM X3950 Linux (RH) server:
    molconvert mrv patent.pdf -o patent.mrv
The initial raw structures numbered ~30k (no uniqueness check).
A simple filter (MW < 600, NumHeavy > 6, ClogP exists) gives ~13k compounds (no uniqueness check).
* Japanese patents were translated into English using the Google Patents translation capabilities.
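
Running the extraction over a batch of patents can be sketched as follows. This is only an illustration: it reuses the command line quoted on the slide, with the output option written as -o (an assumption, since the dash is missing in the transcript), and the helper names and file paths are hypothetical.

```python
from pathlib import PurePosixPath


def build_molconvert_command(pdf_path):
    """Build the molconvert call that writes an .mrv file next to the PDF."""
    pdf = PurePosixPath(pdf_path)
    out = pdf.with_suffix(".mrv")
    # Same shape as the command on the slide: molconvert mrv <pdf> -o <mrv>
    return ["molconvert", "mrv", str(pdf), "-o", str(out)]


def build_batch(pdf_paths):
    """Return one command (as an argument list) per patent PDF."""
    return [build_molconvert_command(p) for p in pdf_paths]


if __name__ == "__main__":
    # Hypothetical file locations, for illustration only.
    for cmd in build_batch(["patents/WO2013050938.pdf", "patents/US2012165331.pdf"]):
        print(" ".join(cmd))
```

Each argument list could then be handed to subprocess.run on a machine where molconvert is installed.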

Removing not relevant structures
Automatic structure extraction from a batch of documents (patents) yields a number of structures that are not relevant, as well as some artifacts. Running Markush-based substructure searches (SSS) for each patent would be accurate but very time consuming. We would like to have in place a sustainable, unsupervised process that can be repeated for each target of interest. However, simple filters can be used effectively:
A) Remove artifacts with MW = 0. Typical artifacts are words such as halogen, alkyl or aryl; these usually have MW = 0 and can be easily removed (~3k entries).
B) Remove molecules containing atoms other than {C, N, O, P, S, F, Cl, Br, I}: [!$([#6,#7,#8,#15,#16,#9,#17,#35,#53])], either = 1 (~6k entries).
C) Remove molecules that do not contain any carbon atom: [$([#6])], Carbon = 0 (~12k entries).
Combining A or B or C results in 18k entries removed from the initial list.
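
A minimal sketch of the three filters, assuming each extracted entry is available as a record with a molecular weight and a list of heavy-atom element symbols. The slide applies SMARTS patterns in a cheminformatics toolkit; here a plain element check stands in for them, and the record layout, function names and sample entries are hypothetical.

```python
ALLOWED_ELEMENTS = {"C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}


def is_artifact(record):
    """Filter A: words such as 'alkyl' are converted to entries with MW = 0."""
    return record["mw"] == 0


def has_disallowed_atom(record):
    """Filter B: at least one heavy atom outside the allowed element set."""
    return any(el not in ALLOWED_ELEMENTS for el in record["elements"])


def has_no_carbon(record):
    """Filter C: the entry contains no carbon atom at all."""
    return "C" not in record["elements"]


def keep(record):
    """An entry is removed if A or B or C applies."""
    return not (is_artifact(record) or has_disallowed_atom(record) or has_no_carbon(record))


# Toy entries for illustration only (heavy atoms listed, hydrogens omitted).
records = [
    {"name": "alkyl", "mw": 0.0, "elements": []},
    {"name": "sodium chloride", "mw": 58.44, "elements": ["Na", "Cl"]},
    {"name": "ammonia", "mw": 17.03, "elements": ["N"]},
    {"name": "acetamide", "mw": 59.07, "elements": ["C", "C", "N", "O"]},
]
kept_names = [r["name"] for r in records if keep(r)]
```

Keeping the three tests as separate functions makes it easy to report how many entries each one removes, as the slide does.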

Removing not relevant structures
[Figure, log scale: structures color coded by molecular complexity, defined as the number of unique linear fragments (1 to 7 atoms in length) of each molecule*]
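
The complexity measure named in the caption can be illustrated with a small sketch. It assumes a molecule is given as a list of heavy-atom symbols plus a list of bonds (index pairs), and that a linear fragment is a simple path of 1 to 7 atoms identified by its sequence of element symbols, read in either direction. Bond orders are ignored here, which is a simplification and not necessarily how the original descriptor was computed.

```python
def linear_fragments(atoms, bonds, max_len=7):
    """Return the set of unique linear fragments (paths of 1..max_len atoms)."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    found = set()

    def extend(path):
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse describe the same fragment.
        found.add(min(labels, labels[::-1]))
        if len(path) == max_len:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # simple paths only, no atom visited twice
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found


def complexity(atoms, bonds):
    """Molecular complexity as the number of unique linear fragments."""
    return len(linear_fragments(atoms, bonds))


# Ethanol heavy atoms: C-C-O
ethanol_complexity = complexity(["C", "C", "O"], [(0, 1), (1, 2)])
```

Small, symmetric molecules produce few distinct paths, which is why a minimum fragment count is a convenient way to discard simple reagents.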

Removing not relevant structures
The following cut-offs were added:
Fragments for each molecule >= 130 (this filter helps remove lower-complexity symmetric reagents)
NumHeavy >= 22
NumDuplicates <= 3
This left us with ~1200 unique structures (1878 non-unique structures from the patents).
At this point we were left with some questions:
Did we retrieve and keep all the relevant structures from all patents?
What is the level of recognition/conversion for each document?
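
The cut-offs can be sketched as a single pass over the extracted records. The record fields are hypothetical, and NumDuplicates is assumed here to mean the number of times the same structure (same "key") occurs in the extracted set; the slide does not define it.

```python
from collections import Counter


def apply_cutoffs(records, min_fragments=130, min_heavy=22, max_duplicates=3):
    """Keep records that pass the fragment, heavy-atom and duplicate cut-offs."""
    occurrences = Counter(r["key"] for r in records)
    return [
        r for r in records
        if r["fragments"] >= min_fragments
        and r["num_heavy"] >= min_heavy
        and occurrences[r["key"]] <= max_duplicates
    ]


def unique_keys(records):
    """Collapse surviving records to their unique structures."""
    return sorted({r["key"] for r in records})


# Toy data for illustration only.
records = (
    [{"key": "A", "fragments": 150, "num_heavy": 25}] * 2    # passes
    + [{"key": "B", "fragments": 200, "num_heavy": 30}] * 5  # too many duplicates
    + [{"key": "C", "fragments": 90, "num_heavy": 24}]       # too few fragments
    + [{"key": "D", "fragments": 140, "num_heavy": 20}]      # too few heavy atoms
)
surviving = apply_cutoffs(records)
unique = unique_keys(surviving)
```

Counting surviving records and unique keys separately mirrors the distinction between non-unique and unique structures reported on the slide.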

Retrospective analysis
All 23 initial patents were represented at the end of the analysis. The number of structures extracted from each patent varies widely, which led us to a more detailed analysis at the single-patent level. The figure shows the number of structures originating from each patent that reached the end of the conversion and filtering process.
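
A per-patent tally of this kind could be produced as sketched below, assuming each surviving record carries the identifier of the patent it came from. The field name, the patent labels and the counts are placeholders, not data from the study.

```python
from collections import Counter


def structures_per_patent(records, all_patents):
    """Count surviving structures per patent, including patents with none."""
    counts = Counter(r["patent"] for r in records)
    return {p: counts.get(p, 0) for p in all_patents}


def missing_patents(per_patent):
    """Patents that are not represented at the end of the filtering."""
    return [p for p, n in per_patent.items() if n == 0]


# Placeholder identifiers and records, for illustration only.
all_patents = ["patent_A", "patent_B", "patent_C"]
records = [
    {"patent": "patent_A"},
    {"patent": "patent_A"},
    {"patent": "patent_C"},
]
per_patent = structures_per_patent(records, all_patents)
not_represented = missing_patents(per_patent)
```

An empty result from missing_patents corresponds to the statement that every initial patent is still represented after filtering.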

Some examples of recognized and unrecognized structures from the same patent (WO2004026866A1)
[Figure: one structure that was not recognized, and one structure that was recognized and converted]
This seems to be more an OCR issue than a text2structure one.

From D2S to PatentInformatics
Workflow: Patents 2D -> stereocentre expansion -> pKa fix -> conformational analysis -> Patents 3D -> pharmacophore screening, shape similarity
This is a very simple patent mining platform where 3D structures derived from the Orexin patents example have been annotated with various descriptors, including the RMS fit with the OX1 and OX2 pharmacophores. Flipper, fixpka, Omega2 and ROCS (from OpenEye Software) were used to prepare the structures, generate 3D conformations and perform shape similarity calculations. MOE was used for the pharmacophore (ph4) analysis and screening.

Pseudo-automatic patent data extraction example: structure and IC50 extraction (355 data points) reported in patent WO2013050938A1
The following example highlights a useful application of data extraction from patents. In this case D2S is needed but not sufficient: a combination of PDF OCR and Perl scripting/parsing was also required.
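
The parsing step can be illustrated with a short sketch. The original work used Perl; Python is used here instead, and the line layout ("Example N: IC50 = value unit") is an assumed format for OCR output, not the actual layout of WO2013050938A1. All names are hypothetical.

```python
import re

# Assumed OCR line layout, e.g. "Example 12: IC50 = 45 nM" (hypothetical).
IC50_LINE = re.compile(
    r"Example\s+(?P<example>\d+)\s*[:,]?\s*IC50\s*[=:]\s*"
    r"(?P<qualifier>[<>]?)\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>nM|uM)",
    re.IGNORECASE,
)
TO_NANOMOLAR = {"nm": 1.0, "um": 1000.0}


def parse_ic50_lines(text):
    """Extract (example number, qualifier, IC50 in nM) from OCR text lines."""
    rows = []
    for line in text.splitlines():
        match = IC50_LINE.search(line)
        if match is None:
            continue  # skip lines that do not look like an IC50 entry
        rows.append({
            "example": int(match["example"]),
            "qualifier": match["qualifier"],
            "ic50_nM": float(match["value"]) * TO_NANOMOLAR[match["unit"].lower()],
        })
    return rows


# Made-up sample text, for illustration only.
sample = "\n".join([
    "Example 1: IC50 = 12 nM",
    "some OCR noise ~~~",
    "Example 2, IC50 = 0.5 uM",
    "Example 3: IC50 = >1000 nM",
])
rows = parse_ic50_lines(sample)
```

The example number is what ties each activity value back to the structure that D2S extracted for the same example, which is why both pieces are needed.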

The Overall Process

Conclusions
PROS
Chemaxon Document to Structure and Filippov's OSRA, combined with an appropriate post-processing workflow, can effectively increase the knowledge on a particular target of interest.
Sometimes this can reach areas not covered by commercially available databases (particular papers, patents, papers in different languages).
The overall process allows the level of information extraction to be chosen:
  Structures active on a specific target (higher automation)
  Structures and associated curve data (lower automation)
An effective post-processing step needs to be in place; with it, large numbers of patents can be mined and a high level of automation can be achieved.
CONS
The process presented here is still far from perfect: there is great variability from patent to patent.
Extracting biological data and associating them with structures can still take some time (not fully automated).