Project Prospect and the InChI. Colin Batchelor

Project Prospect and the InChI Colin Batchelor batchelorc@rsc.org 2009-03-22

Project Prospect and the InChI: outline
  What can we do with InChIs that we couldn't do before?
  Where the InChIs come from
  Where the InChIs go
  The human factor

What we do with InChIs that we couldn't do before
  InChIs are canonical.
  InChIs are informative.
  Low-cost, low-effort route to running a molecular structure database.
  Supports the extraction of chemical structures from journal articles.
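Because an InChI is canonical, two records refer to the same structure exactly when their InChI strings are equal, so a plain key-value index already behaves like a small structure database. The sketch below illustrates that idea only; the article identifiers are hypothetical, the two InChIs are the standard InChIs for benzene and pyridine, and none of this is taken from the RSC's own database code.

```python
from collections import defaultdict

# Standard InChIs for two well-known compounds.
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
PYRIDINE = "InChI=1S/C5H5N/c1-2-4-6-5-3-1/h1-5H"


def build_index(mentions):
    """Map each InChI string to the set of articles that mention it."""
    index = defaultdict(set)
    for article_id, inchi in mentions:
        index[inchi].add(article_id)
    return index


# Hypothetical (article, InChI) pairs; the identifiers are made up.
mentions = [
    ("article-A", BENZENE),
    ("article-B", PYRIDINE),
    ("article-C", BENZENE),
]

index = build_index(mentions)
# Exact-structure lookup is just a dictionary lookup on the InChI string.
print(sorted(index[BENZENE]))
```

Exact-match lookup needs no chemistry toolkit at query time; substructure or similarity search would still need one.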

How does publishing really work?

Data capture; editing and proof-reading

Enhanced HTML; Database; Text mining (Oscar); Manual QA; Enhanced RSS

Where the InChIs come from
  Compounds with names: ~60% Oscar, ~20% PubChem lookup, ~20% ChemDraw
  Compounds with numbers: ~70% author-supplied ChemDraw, ~30% editor-drawn ChemDraw

Regular polysemy
  Even a simple chemical name can mean more than one thing.
  Corbett, Batchelor and Copestake, "Pyridine, pyridines and pyridine rings", Proceedings of BERBTM-08, next week.

Imidazole

An imidazole

The imidazole side-chain/group/ring/etc.

Can InChI handle this?
  No. But it was never meant to.
  So we use ChEBI, an ontology of classes and parts, nanoparticles and other things that InChI was never meant to handle.
  http://www.ebi.ac.uk/chebi/

Can ChEBI handle this?
  Imidazoles (!): CHEBI:24780
  Imidazole: CHEBI:16069
  Imidazole ring: not yet
  Imidazolyl group: not yet

Disambiguation
  One Sense per Discourse (Gale et al. 1992): this doesn't hold at all
  One Sense per Collocation (Yarowsky 1993): matches our intuitions

Disambiguation: toy model
  CLASS:
    w(-1) = a, an, the, this
    w(0) plural (bit of a cheat, as not a collocation)
  PART:
    w(-1) = bridging, terminal
    w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
    w(+1)w(+2) = building block, protecting group, side chain
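The toy model can be sketched as a handful of collocation rules over a tokenized sentence. This is an illustration under assumptions, not the rules as actually implemented: the word lists are only those shown on the slide (the slide itself says there are many more), checking PART cues before CLASS cues is an assumed precedence, and the plural test is a naive "ends with s".

```python
# Cue words copied from the slide; the real lists are longer.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT2 = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def classify(tokens, i):
    """Return 'PART', 'CLASS' or None for the chemical name at tokens[i]."""
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""

    # Assumption: PART cues win over CLASS cues ("the imidazole side chain").
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT2:
        return "PART"
    # Assumption: a trailing "s" is treated as a plural marker.
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return None


print(classify("an imidazole was added".split(), 1))
print(classify("the imidazole side chain".split(), 1))
```

A name with no matching cue returns None, which leaves room for a default sense or a later rule.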

Where the InChIs go
  HTML
  RSS
  Database

HTML
  http://www.rsc.org/delivery/_articlelinking/DisplayHTMLArticleforfree.asp?JournalCode=CC&Year=2009&ManuscriptID=b823340c&Iss=Advance_Article

RSS
  2007: First ever routinely-generated RSS feeds containing InChIs
  2009: RSS feed bringing together all Prospected articles from across the RSC

RSS: the gory details
  <item rdf:about="http://xlink.rsc.org/?doi=b716356h&rss=1">
    <title>[ title ]</title>
    <link>http://xlink.rsc.org/?doi=b716356h&rss=1</link>
    <description>[ blah ]</description>
    <content:encoded>[ human-readable stuff ]</content:encoded>
    [ dublin core stuff ]
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
        </rdf:li>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/so#so:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>

RSS: the gory details
  Content module from RSS 1.0: http://web.resource.org/rss/1.0/modules/content
  In future we would like to have proper RDF predicates, e.g. is_about, mentions.
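A feed consumer can recover the InChIs by reading the rdf:about attribute of each content:item and keeping the values that start with info:inchi/. The sketch below uses Python's standard xml.etree.ElementTree. The sample item is a cut-down, hypothetical one (the example.org URLs and the benzene InChI are placeholders), and it assumes the feed declares the usual RSS 1.0, RDF and content-module namespaces.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
# ElementTree stores namespaced attributes under "{namespace-uri}name".
RDF_ABOUT = "{%s}about" % NS["rdf"]

# Hypothetical, cut-down item; URLs and the InChI are placeholders.
ITEM = """<item xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      rdf:about="http://example.org/article">
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://example.org/ontology-term"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>"""


def inchis_in_item(xml_text):
    """Return the InChI strings referenced by content:item elements."""
    root = ET.fromstring(xml_text)
    prefix = "info:inchi/"
    abouts = (el.get(RDF_ABOUT, "") for el in root.iterfind(".//content:item", NS))
    return [about[len(prefix):] for about in abouts if about.startswith(prefix)]


print(inchis_in_item(ITEM))
```

Items that point at ontology terms rather than compounds are skipped by the prefix test.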

Chemical structure search

Search results

The human factor: questions from our internal FAQs
  How do I draw organometallic compounds?
  Why do organometallic compounds look a mess?
  Why does my structure look rubbish in the structure drawer?
  How do I draw a fullerene?

The human factor
  Interpreting the InChI!
  We are developing tools to validate and explain InChIs to technical editors.
  Documentation wiki
  Templates for tricky 3D systems
  Careful enumeration of examples

Example training slide
  Dots separate the formulae; semicolons separate the charges.
  InChI=1/C8H12.C8H18.C7H15.C6H12.Mg/c1-2-4-6-8-7-5-3-1;1-7(2,3)8(4,5)6;1-3-5-7-6-4-2;1-6-4-2-3-5-6;/h1-2,7-8H,3-6H2;1-6H3;5H,3-4,6-7H2,1-2H3;6H,2-5H2,1H3;/q;;-1;;+1/b2-1-,8-7-;;;;
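The rule on this slide can be checked mechanically: split the InChI on "/" to get the layers, split the formula layer on "." and the charge layer on ";", then pair them up by position. The sketch below does this for the training-slide InChI with the stray whitespace removed. It is a simplification for illustration: it assumes one formula per component and does not handle multiplier prefixes or other abbreviations that InChI can use for repeated components.

```python
INCHI = (
    "InChI=1/C8H12.C8H18.C7H15.C6H12.Mg/"
    "c1-2-4-6-8-7-5-3-1;1-7(2,3)8(4,5)6;1-3-5-7-6-4-2;1-6-4-2-3-5-6;/"
    "h1-2,7-8H,3-6H2;1-6H3;5H,3-4,6-7H2,1-2H3;6H,2-5H2,1H3;/"
    "q;;-1;;+1/b2-1-,8-7-;;;;"
)


def component_charges(inchi):
    """Pair each component formula with its charge (0 if none is given)."""
    layers = inchi.split("/")
    formulae = layers[1].split(".")      # dots separate the formulae
    charges = [""] * len(formulae)
    for layer in layers[2:]:
        if layer.startswith("q"):
            charges = layer[1:].split(";")  # semicolons separate the charges
    return [(f, int(c) if c else 0) for f, c in zip(formulae, charges)]


for formula, charge in component_charges(INCHI):
    print(formula, charge)
```

Listing the components this way makes it easy to see which fragment carries the negative charge and that the charges sum to zero.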

Worked examples
  Alkali metal salts
  Grignard reagents
  Delocalized systems
  Metallocenes
  cod
  Phosphine ligands
  Metal carbonyl complexes
  Metal carbenes
  Boranes
  Dative bonds

Example delocalized system

Fun with metal carbonyls

and more worked examples

Next steps
  Tools for explaining and validating InChIs.
  Parallel standard and non-standard InChIs maintained internally.
  Better handling of compounds with metal atoms.

What has InChI allowed us to achieve?
  Workflow set up for chemically-enriched journal articles in a publishing production environment within a year.
  Industry-leading markup technology
  Widespread interest
  Prizes

"My first forays show [Project Prospect is] brilliant. [It's] great to see the compounds and have machine readable SMILES and InChIs"
"Your new system is very impressive, I am sure it will become very useful to a large community. It is great and exciting!! Just from the very few minutes I looked into it I realized its great potential and immediately ran to show it to my students!"
"This is a fantastic resource for the community, and a great use of the GO and SO. Nice work"
"I have found it very intuitive/straightforward to use. [I] believe that it will make the manuscript even more appealing to readers."
