InChI keys as standard global identifiers in chemistry web services
Russ Hillard. ACS, Salt Lake City, March 2009

Context of this talk
We have created a web service that:
Aggregates sources built independently
- Dozens of individual databases
- Containing molecules and reactions
- Created using non-standardized business rules (with respect to chemical representation)
Covers large record sets
- 30+ million unique molecules from combined sources
- 5+ million unique reactions from combined sources
Requires integration across all sources
- Based on shared chemical entities
- Where entity means chemical compound(s)
- And a chemical compound has several unique identifiers
  - Chemical structure elucidated by scientists
  - Systematic chemical name derived from structure
  - Graphic representation of structure assigned at registration
  - Trivial chemical name assigned to structure
  - Registry number assigned to structure
  - Key or string computed from structure

The basic problem...
ChemInform (FIZ Chemie) BRN3936786
Beilstein (Elsevier) BRN3936786
Curr. Chem Reactions (Thomson)
Registry numbers:
- 5693-99-2: stereochem unspecified
- 71403-94-6: relative stereochem
- 121651-02-3: absolute stereochem (2R,3S)
- 126720-47-6: absolute stereochem (2S,3R)
Names:
- trans-3-phenyloxirane-carboxaldehyde
- (2R*,3R*)-2,3-epoxycinnamaldehyde
- trans-cinnamaldehyde epoxide
- Epoxyzimtaldehyd
- (2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
We don't always have or know the BRN, CASRN, ChemSpiderID, MFCD#...
The relationship of Structure:RegNumbers is often 1:many

One solution
Define our own set of registration rules
Register all structures to one big database
- Normalize structures according to our rules
Assign a unique record identifier (URI) to the normalized structures
Correlate our URIs to the native sources
Use our URIs to correlate records across different databases
We have done this but have not exposed the URIs
- Even with modern computers this is resource intensive
- Problem is compounded when data is from different providers
- Does the world really need another Global Registry Number?
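
The register, assign, and correlate steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the service described in the talk: the class name `CompoundRegistry`, the `cmpd:` ID prefix, and the caller-supplied `normalize` function are all hypothetical, and real normalization rules are where the actual cost lies.

```python
import itertools


class CompoundRegistry:
    """Toy registry: one internal ID per normalized structure (hypothetical)."""

    def __init__(self, normalize):
        self._normalize = normalize         # caller-supplied registration rules
        self._ids = {}                      # normalized structure -> internal ID
        self._xrefs = {}                    # internal ID -> {(source, native ID)}
        self._counter = itertools.count(1)

    def register(self, structure, source, native_id):
        """Register a structure and remember which source record it came from."""
        canonical = self._normalize(structure)
        if canonical not in self._ids:
            # First time this normalized structure is seen: mint a new ID.
            self._ids[canonical] = f"cmpd:{next(self._counter)}"
        internal_id = self._ids[canonical]
        self._xrefs.setdefault(internal_id, set()).add((source, native_id))
        return internal_id

    def cross_references(self, internal_id):
        """Return every (source, native ID) pair correlated to an internal ID."""
        return sorted(self._xrefs.get(internal_id, ()))
```

Two records from different sources correlate only if `normalize` maps their structures to the same value, which is why the quality of the registration rules decides the quality of the cross-references.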

As currently implemented this gives:
ChemInform (FIZ Chemie)
Great for internal correlations:
Reactions
- Commercial Availability
- Toxicity
- Bioactivity... etc.
Molecules
- Synthetic preparations of
- Organic reactions of
- Toxicity... etc.
But what about external correlations?
- Anything we don't/can't index
- Commercial data
- Proprietary data

Alternative solution
Assume structures as registered are correct
- Accept that we cannot always normalize according to our rules
Use a derived (calculated) compound identifier
Is this possible?
- IUPAC Name
- Wiswesser Line Notation (WLN)
- Molfile and its derivatives
- SEMA Key
- MDL Line Notation
- SMILES
- Chemical Markup Language (CML)
- InChI Name
- InChI Key
- NEMA key
Will focus on the last two options (InChI Key and NEMA key)

IUPAC - International Chemical Identifier
"The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data compilations. The initial work focused on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. "Serializing" the result to create a string is the final, straightforward, step in creating an identifier."
From: http://www.iupac.org/web/ins/2000-025-1-800
For this presentation all InChI Keys are generated using the final standard InChI/InChIKey v. 1.02 software

The Morgan Algorithm
Invented by H. L. Morgan, J. Chem. Doc., 5, 107 (1965)
- Underpins many of the systems in use today
- The basis of CAS Online
Identifies atoms based on an extended connectivity value. The atom with the highest value becomes the first atom in the name, and its neighbors are then listed in descending order
Ties are resolved based on additional parameters, for example bond order and atomic number
Does not handle stereochemistry
SEMA developed to handle stereoisomers
- W. T. Wipke and T. M. Dyott, J. Amer. Chem. Soc., 96, 4825 (1974)
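
The extended connectivity idea can be illustrated with a short sketch. It is deliberately minimal and is not the full published procedure or any product's implementation: it only iterates the connectivity sums, with no tie-breaking by bond order or atomic number and no stereochemistry. The function name and the adjacency-dictionary input format are assumptions made for this example.

```python
def extended_connectivity(adjacency):
    """Return Morgan-style extended connectivity values per atom.

    adjacency maps each atom index to a list of neighbor indices
    (heavy-atom graph, hydrogens suppressed).
    """
    # Start from the plain connectivity: the number of neighbors.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # An atom's new value is the sum of its neighbors' current values.
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_values.values()))
        # Stop once another round no longer distinguishes more atoms,
        # and keep the values from the previous round.
        if new_classes <= n_classes:
            return values
        values, n_classes = new_values, new_classes


# Five-atom chain 0-1-2-3-4: the central atom gets the highest value.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ec = extended_connectivity(chain)
first_atom = max(ec, key=ec.get)  # atom 2
```

Atoms 1 and 3 end with equal values here, which is exactly the kind of tie the additional parameters are meant to resolve.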

NEMA
NEMA produces a unique name and key for a wider range of structures than SEMA. It extends perception to non-tetrahedral stereogenic centers, it supports both 2D and 3D stereochemistry perception, and it does not have an atom limit. It is proprietary to Symyx, but it is exposed in our products; for example, Symyx Draw and Symyx Direct generate NEMA keys.
The work of Wipke et al. identified the value of a constitutional key and a stereo key. This approach has been incorporated into NEMA.
- W. T. Wipke, S. Krishnan, and G. I. Ouchi, J. Chem. Inf. Comput. Sci., 18, 32 (1978)

Tautomers (mobile H atoms)
- Different structures
- Different systematic names
- Presumably exist in equilibrium
- InChI Keys are identical
- NEMA Keys are different

Both structures are registered to our collection
57531-38-1 assigned to both structures
- 4(5)-chloro-5(4)-nitroimidazole
- 5(4)-chloro-4(5)-nitroimidazole
- 4-chloro-5-nitroimidazole
- 5-chloro-4-nitroimidazole
- 4-chloro-5-nitro-1(3)H-imidazole

Tautomers (mobile hydrogen atoms)
- Different NEMA Keys
- Same InChI Key

Mesomers
Mesomers ideally would have the same identifier
- Different NEMA Keys
- Same InChI Key

Both structures are registered to our collection
Methylene blue, 61-73-4

Mesomers?
- Same InChI Key, different NEMA Keys
- Same InChI Key, same NEMA Keys

Stereoisomers
- No stereo
- Enantiomeric pair
- Pure enantiomer
InChI does not distinguish a pure enantiomer from the racemate

Relative versus absolute stereochemistry
Indistinguishable based on InChI Key

Absolute Stereochemistry
InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N
1S 1s 2R
3 unique NEMA Keys

InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N

Concern with stereochem goes back to...
ChemInform (FIZ Chemie) BRN3936786
Beilstein (Elsevier) BRN3936786
Curr. Chem Reactions (Thomson)
Registry numbers:
- 5693-99-2: stereochem unspecified
- 71403-94-6: relative stereochem
- 121651-02-3: absolute stereochem (2R,3S)
- 126720-47-6: absolute stereochem (2S,3R)
Names:
- trans-3-phenyloxirane-carboxaldehyde
- (2R*,3R*)-2,3-epoxycinnamaldehyde
- trans-cinnamaldehyde epoxide
- Epoxyzimtaldehyd
- (2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
We don't always have or know the BRN, CASRN, ChemSpiderID, MFCD#...
The relationship of Structure:RegNumbers is often 1:many

Typically problematic structures
Definitely the same compound
- Same InChI Key
- Different NEMA Keys

Typically problematic compounds
Just the tip of the iceberg
- Organometallics
- Inorganics

Layered structure of InChI Keys
AAAAAAAAAAAAAA-BBBBBBBBCD
- AAAAAAAAAAAAAA = skeleton
- BBBBBBBB = structural features: mobile hydrogens, isotopes, metal bonds...
- C = flag, InChI version...
- D = check character
The ability to reconstruct InChI Keys into classes of related structures sets them apart
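
A small sketch of how a key might be split into these layers follows. The names `KeyParts`, `split_inchikey`, and `same_skeleton` are hypothetical. Note that the key quoted on the absolute stereochemistry slide ends in `SA-N`, a longer tail than the `CD` pair in the layout above, so the sketch treats everything after the eight-letter block as an opaque suffix rather than trying to interpret it.

```python
import re
from typing import NamedTuple


class KeyParts(NamedTuple):
    skeleton: str  # 14 letters: hash of the skeleton (connectivity)
    features: str  # 8 letters: hash of the other structural features
    suffix: str    # trailing flag/version/check characters, left uninterpreted


# 14 letters, a hyphen, 8 letters, then any remaining letters or hyphens.
_KEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z-]*)$")


def split_inchikey(key: str) -> KeyParts:
    """Split an InChI Key into skeleton block, feature block, and suffix."""
    match = _KEY_RE.match(key.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable InChI Key: {key!r}")
    return KeyParts(*match.groups())


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the first (skeleton) block."""
    return split_inchikey(key_a).skeleton == split_inchikey(key_b).skeleton


parts = split_inchikey("XARGIVYWQPXRTC-DTWKUNHWSA-N")
# parts == KeyParts(skeleton='XARGIVYWQPXRTC', features='DTWKUNHW', suffix='SA-N')
```

Grouping records by `skeleton` alone is one way to collect a class of related structures that differ only in the feature block.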

InChI key resolution using ChemSpider
- Full InChI key search
- Partial InChI key search
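
The difference between the two searches can be sketched with an in-memory index. This is not ChemSpider's API or implementation; `KeyIndex` and its methods are hypothetical names, and "partial" is assumed here to mean matching on the first (skeleton) block only.

```python
from collections import defaultdict


class KeyIndex:
    """Toy lookup table supporting full and skeleton-only key searches."""

    def __init__(self):
        self._by_full = defaultdict(list)      # whole key -> record IDs
        self._by_skeleton = defaultdict(list)  # first block -> record IDs

    def add(self, inchikey, record_id):
        self._by_full[inchikey].append(record_id)
        # The text before the first hyphen is the skeleton block.
        self._by_skeleton[inchikey.split("-", 1)[0]].append(record_id)

    def full_search(self, inchikey):
        """Records whose key matches exactly."""
        return list(self._by_full.get(inchikey, []))

    def partial_search(self, inchikey_or_skeleton):
        """Records sharing the skeleton block, whatever their other layers."""
        skeleton = inchikey_or_skeleton.split("-", 1)[0]
        return list(self._by_skeleton.get(skeleton, []))
```

A full search answers "is this exact structure known?", while a partial search returns every record with the same skeleton, for example the different stereo forms of one compound.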

There is still plenty to do
Biologics
- Average pipeline contains 22% biologics
- Some companies are near 50%
- Peptides & modified peptides
- Nucleic acid sequences
Generics
- Markush structures
Polymers
- Repeating monomers
- Block copolymers
- Cross-linked polymers

So what should go into our web service?
Unique chemical structures registered to Compound Index
Unique reaction structures registered to Reaction Index
Assigned global identifiers as available
- Registry numbers (BRN, CASRN, MFCD#s, PubChemIDs...)
Computed global identifiers for all compounds
- InChI strings
- InChI Keys
- NEMA Keys
Register InChI Keys to ACD and other Symyx databases
Let the consumer decide which to use
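
One possible shape for such a record, keeping assigned and computed identifiers side by side so that a consumer can query by whichever they hold, is sketched below. The `CompoundRecord` class, the `find` helper, and the `cmpd:` IDs are hypothetical; the registry number and the key are the ones quoted on earlier slides, deliberately placed on separate records because the talk does not pair them.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    internal_id: str
    # Identifiers assigned by a registry, e.g. {"CASRN": "..."}.
    assigned: dict = field(default_factory=dict)
    # Identifiers computed from the structure, e.g. {"InChIKey": "..."}.
    computed: dict = field(default_factory=dict)

    def identifiers(self):
        """All identifiers of this record, assigned and computed together."""
        merged = dict(self.assigned)
        merged.update(self.computed)
        return merged


def find(records, id_type, value):
    """Return the records carrying the given identifier, whatever its origin."""
    return [r for r in records if r.identifiers().get(id_type) == value]


records = [
    CompoundRecord("cmpd:1", assigned={"CASRN": "61-73-4"}),
    CompoundRecord("cmpd:2", computed={"InChIKey": "XARGIVYWQPXRTC-DTWKUNHWSA-N"}),
]
hits = find(records, "InChIKey", "XARGIVYWQPXRTC-DTWKUNHWSA-N")
```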