Project Prospect and the InChI Colin Batchelor batchelorc@rsc.org 2009-03-22
Project Prospect and the InChI: outline What can we do with InChIs that we couldn't do before? Where the InChIs come from Where the InChIs go The human factor 2
What we do with InChIs that we couldn't do before InChIs are canonical. InChIs are informative. Low-cost, low-effort route to running a molecular structure database. Supports the extraction of chemical structures from journal articles. 3
How does publishing really work? 4
Data capture Editing and proof-reading 5
Enhanced HTML Database Text mining (Oscar) Manual QA Enhanced RSS 6
Where the InChIs come from
Compounds with names: ~60% Oscar, ~20% PubChem lookup, ~20% ChemDraw
Compounds with numbers: ~70% author-supplied ChemDraw, ~30% editor-drawn ChemDraw 7
Regular polysemy Even a simple chemical name can mean more than one thing. Corbett, Batchelor and Copestake, Pyridine, pyridines and pyridine rings, Proceedings of BERBTM-08, next week. 8
Imidazole 9
An imidazole 10
The imidazole side-chain/ group/ring/etc. 11
Can InChI handle this? No. But it was never meant to. So we use ChEBI, an ontology of classes and parts, nanoparticles and other things that InChI was never meant to handle. http://www.ebi.ac.uk/chebi/ 12
Can ChEBI handle this? Imidazoles (!): CHEBI:24780. Imidazole: CHEBI:16069. Imidazole ring: not yet. Imidazolyl group: not yet. 13
Disambiguation One Sense per Discourse (Gale et al. 1992): this doesn't hold at all. One Sense per Collocation (Yarowsky 1993): matches our intuitions. 14
Disambiguation: toy model
CLASS: w(-1) = a, an, the, this; w(0) plural (bit of a cheat, as not a collocation)
PART: w(-1) = bridging, terminal; w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more); w(+1)w(+2) = building block, protecting group, side chain 15
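The toy model above can be sketched in a few lines of Python. This is a minimal illustration, not the rules as actually implemented: the function name `classify`, the `UNKNOWN` fallback, the naive trailing-"s" plural test, and the choice to check PART cues before CLASS cues (so that "the imidazole side chain" comes out as a part) are all assumptions. The word lists are only the ones shown on the slide, without the "many more".

```python
import re

# Collocation cues copied from the slide (the slide notes there are many more).
CLASS_BEFORE = {"a", "an", "the", "this"}
PART_BEFORE = {"bridging", "terminal"}
PART_AFTER = {"backbone", "bridge", "chain", "core", "dyad",
              "fluorophore", "fragment", "framework"}
PART_AFTER_BIGRAMS = {("building", "block"), ("protecting", "group"),
                      ("side", "chain")}


def tokenize(text):
    """Split on anything that is not a letter, so 'side-chain' gives two tokens."""
    return re.findall(r"[A-Za-z]+", text.lower())


def classify(tokens, i):
    """Label the chemical name at tokens[i] as 'PART', 'CLASS' or 'UNKNOWN'."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else ""

    # Assumption: PART cues win over CLASS cues when both are present.
    if prev in PART_BEFORE or nxt in PART_AFTER or (nxt, nxt2) in PART_AFTER_BIGRAMS:
        return "PART"
    # w(-1) is a determiner, or w(0) looks plural (the slide's "bit of a cheat").
    if prev in CLASS_BEFORE or word.endswith("s"):
        return "CLASS"
    return "UNKNOWN"


if __name__ == "__main__":
    for sentence in ["An imidazole was added",
                     "The imidazole side-chain of histidine",
                     "Imidazole was added"]:
        toks = tokenize(sentence)
        i = next(k for k, t in enumerate(toks) if t.startswith("imidazole"))
        print(sentence, "->", classify(toks, i))
```

Anything the rules do not fire on is left as `UNKNOWN` here; in practice that would most likely be the reading where the name denotes the compound itself.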
Where the InChIs go HTML RSS Database 16
HTML http://www.rsc.org/delivery/_articlelinking/DisplayHTMLArticleforfree.asp?JournalCode=CC&Year=2009&ManuscriptID=b823340c&Iss=Advance_Article 17
RSS 2007: First ever routinely-generated RSS feeds containing InChIs 2009: RSS feed bringing together all Prospected articles from across the RSC 20
RSS: the gory details
<item rdf:about="http://xlink.rsc.org/?doi=b716356h&rss=1">
  <title>[title]</title>
  <link>http://xlink.rsc.org/?doi=b716356h&rss=1</link>
  <description>[blah]</description>
  <content:encoded>[human-readable stuff]</content:encoded>
  [dublin core stuff]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/so#so:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
21
RSS: the gory details Content module from RSS 1.0 http://web.resource.org/rss/1.0/modules/content In future we would like to have proper RDF predicates, e.g. is_about, mentions. 22
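As an illustration of how a feed consumer might pull the InChIs back out of such an item, here is a minimal sketch using only the Python standard library. The sample document is a tidied, well-formed version of the fragment on the slide: the `rdf:RDF` wrapper, the namespace declarations, the escaped ampersand and the restored upper-case letters in the InChI are assumptions, and `inchis_in_feed` is a hypothetical helper name, not part of any RSC tooling.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RSS 1.0, RDF and the RSS 1.0 content module.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
INCHI_PREFIX = "info:inchi/"

# Assumed, cut-down version of the item shown on the slide.
SAMPLE = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?doi=b716356h&amp;rss=1">
    <title>[title]</title>
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
        </rdf:li>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/so#so:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def inchis_in_feed(xml_text):
    """Map each item's rdf:about URI to the InChIs listed in its content:items."""
    root = ET.fromstring(xml_text)
    result = {}
    for item in root.iterfind("rss:item", NS):
        uris = [el.get(RDF_ABOUT, "") for el in item.iterfind(".//content:item", NS)]
        # Keep only the info:inchi URIs; ontology terms and the like are skipped.
        result[item.get(RDF_ABOUT)] = [
            u[len(INCHI_PREFIX):] for u in uris
            if u.lower().startswith(INCHI_PREFIX)
        ]
    return result


if __name__ == "__main__":
    for article, inchis in inchis_in_feed(SAMPLE).items():
        print(article, inchis)
```

Because the item only says that an article contains these URIs, the sketch cannot tell whether the compound is the subject of the article or merely mentioned in it.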
Chemical structure search 23
Search results 24
The human factor: questions from our internal FAQs How do I draw organometallic compounds? Why do organometallic compounds look a mess? Why does my structure look rubbish in the structure drawer? How do I draw a fullerene? 25
The human factor Interpreting the InChI! We are developing tools to validate and explain InChIs to technical editors. Documentation wiki Templates for tricky 3D systems Careful enumeration of examples 26
Example training slide Dots separate the formulae; semicolons separate the charges.
InChI=1/C8H12.C8H18.C7H15.C6H12.Mg/c1-2-4-6-8-7-5-3-1;1-7(2,3)8(4,5)6;1-3-5-7-6-4-2;1-6-4-2-3-5-6;/h1-2,7-8H,3-6H2;1-6H3;5H,3-4,6-7H2,1-2H3;6H,2-5H2,1H3;/q;;-1;;+1/b2-1-,8-7-;;;; 27
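The rule on this slide is mechanical enough to sketch in code. The following is a hypothetical illustration of the kind of "explain an InChI" helper mentioned earlier, not the tool under development: the function names are made up, and it assumes a simple InChI like the one above (with the line-wrap spaces removed) in which every layer after the formula starts with a single-letter prefix and no component multipliers such as `2C8H12` appear.

```python
EXAMPLE = ("InChI=1/C8H12.C8H18.C7H15.C6H12.Mg"
           "/c1-2-4-6-8-7-5-3-1;1-7(2,3)8(4,5)6;1-3-5-7-6-4-2;1-6-4-2-3-5-6;"
           "/h1-2,7-8H,3-6H2;1-6H3;5H,3-4,6-7H2,1-2H3;6H,2-5H2,1H3;"
           "/q;;-1;;+1"
           "/b2-1-,8-7-;;;;")


def split_layers(inchi):
    """Split an InChI into its version, per-component formulae and layers."""
    head, formula, *layers = inchi.split("/")
    parsed = {
        "version": head.split("=", 1)[1],
        "formulae": formula.split("."),   # dots separate the formulae
    }
    for layer in layers:
        # First character names the layer (c, h, q, b, ...); the rest is
        # one semicolon-separated entry per component.
        parsed[layer[0]] = layer[1:].split(";")
    return parsed


def charges_by_component(inchi):
    """Pair each component formula with its charge from the /q layer."""
    parsed = split_layers(inchi)
    formulae = parsed["formulae"]
    charges = parsed.get("q", [""] * len(formulae))
    # An empty entry in the /q layer means the component is neutral.
    return [(f, int(q) if q else 0) for f, q in zip(formulae, charges)]


if __name__ == "__main__":
    for formula, charge in charges_by_component(EXAMPLE):
        print(f"{formula:6s} charge {charge:+d}")
```

For the training example this pairs C7H15 with a charge of -1 and Mg with +1, and leaves the other three components neutral, which is the reading the slide is teaching.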
Worked examples Alkali metal salts Grignard reagents Delocalized systems Metallocenes cod Phosphine ligands Metal carbonyl complexes Metal carbenes Boranes Dative bonds 28
Example delocalized system 29
Fun with metal carbonyls 30
and more worked examples 31
Next steps Tools for explaining and validating InChIs Parallel standard and non-standard InChIs maintained internally. Better handling of compounds with metal atoms. 32
What has InChI allowed us to achieve? Workflow set up for chemically-enriched journal articles in a publishing production environment within a year. Industry-leading markup technology Widespread interest Prizes 33
"My first forays show [Project Prospect is] brilliant. [It's] great to see the compounds and have machine readable SMILES and InChIs" "Your new system is very impressive, I am sure it will become very useful to a large community It is great and exciting!! Just from the very few minutes I looked into it I realized its great potential and immediately ran to show it to my students!" "This is a fantastic resource for the community, and a great use of the GO and SO. Nice work" "I have found it very intuitive/straightforward to use. [I] believe that it will make the manuscript even more appealing to readers." 34