InChI keys as standard global identifiers in chemistry web services
Russ Hillard
ACS, Salt Lake City, March 2009
Context of this talk
We have created a web service
That aggregates sources built independently
- Dozens of individual databases
- Containing molecules and reactions
- Created using non-standardized business rules (wrt chemical representation)
Covers large record sets
- 30+ million unique molecules from combined sources
- 5+ million unique reactions from combined sources
Requires integration across all sources
- Based on shared chemical entities
- Where entity means chemical compound(s)
- And a chemical compound has unique identifiers
- Chemical structure elucidated by scientists
- Systematic chemical name derived from structure
- Graphic representation of structure assigned at registration
- Trivial chemical name assigned to structure
- Registry number assigned to structure
- Key or string computed from structure
The basic problem...
ChemInform (FIZ Chemie) BRN3936786 Beilstein (Elsevier) BRN3936786 Curr. Chem. Reactions (Thomson)
5693-99-2 stereochem unspecified
71403-94-6 relative stereochem
121651-02-3 absolute stereochem (2R,3S)
126720-47-6 absolute stereochem (2S,3R)
trans-3-phenyloxirane-carboxaldehyde
(2R*,3R*)-2,3-epoxycinnamaldehyde
trans-cinnamaldehyde epoxide
Epoxyzimtaldehyd
(2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
Don't always have or know the BRN, CASRN, ChemSpiderID, MFCD#...
Relationship of Structure:RegNumbers is often 1:many
One solution
Define our own set of registration rules
Register all structures to one big database
- Normalize structures according to our rules
Assign a unique record identifier (URI) to the normalized structures
Correlate our URIs to the native sources
Use our URIs to correlate records across different databases
We have done this but have not exposed the URIs
- Even with modern computers this is resource intensive
- Problem is compounded when data is from different providers
- Does the world really need another Global Registry Number?
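The correlation step above can be sketched as a small cross-reference table. This is only an illustration of the idea, not the service's implementation: the class name `CrossReference`, the `CMPD-` identifier format, and the source names and native IDs in the usage lines are all hypothetical, and the "normalized structure" is stood in for by an opaque string.

```python
from collections import defaultdict


class CrossReference:
    """Hypothetical table: one internal ID per normalized structure,
    linked to every (source, native ID) pair that registered it."""

    def __init__(self):
        self._ids = {}                  # normalized structure -> internal ID
        self._xrefs = defaultdict(set)  # internal ID -> {(source, native ID)}

    def register(self, normalized_structure, source, native_id):
        # Assign a new internal ID only the first time a structure is seen.
        if normalized_structure not in self._ids:
            self._ids[normalized_structure] = f"CMPD-{len(self._ids) + 1:08d}"
        internal_id = self._ids[normalized_structure]
        self._xrefs[internal_id].add((source, native_id))
        return internal_id

    def native_ids(self, internal_id):
        # Sorted so the result is stable regardless of registration order.
        return sorted(self._xrefs.get(internal_id, ()))


# Usage with made-up sources and identifiers:
xref = CrossReference()
first = xref.register("structure-1", "source-A", "A-17")
second = xref.register("structure-1", "source-B", "B-42")
print(first == second)         # same structure, same internal ID
print(xref.native_ids(first))  # one internal ID, many native IDs
```

The expensive part in practice is producing the normalized structure in the first place, which this sketch deliberately leaves out.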
As currently implemented this gives: ChemInform (FIZ Chemie)
Great for internal correlations:
Reactions
- Commercial availability
- Toxicity
- Bioactivity... etc.
Molecules
- Synthetic preparations of
- Organic reactions of
- Toxicity... etc.
But what about external correlations?
- Anything we don't/can't index
- Commercial data
- Proprietary data
Alternative solution
Assume structures as registered are correct
- Accept that we cannot always normalize according to our rules
Use a derived (calculated) compound identifier
Is this possible?
- IUPAC Name
- Wiswesser Line Notation (WLN)
- Molfile and its derivatives
- SEMA Key
- MDL Line Notation
- SMILES
- Chemical Markup Language (CML)
- InChI Name
- InChI Key
- NEMA key
Will focus on two of these options: InChI Key and NEMA key
IUPAC - International Chemical Identifier
"The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data compilations. The initial work focused on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. "Serializing" the result to create a string is the final, straightforward, step in creating an identifier."
From: http://www.iupac.org/web/ins/2000-025-1-800
For this presentation, all InChI Keys are generated using the final standard InChI/InChIKey v. 1.02 software
The Morgan Algorithm
Invented by H. L. Morgan, J. Chem. Doc., 5, 107 (1965)
- Underpins many of the systems in use today
- The basis of CAS Online
Identifies atoms based on an extended connectivity value
- The atom with the highest value becomes the first atom in the name
- Its neighbors are then listed in descending order
- Ties are resolved based on additional parameters, for example bond order and atomic number
Does not handle stereochemistry
SEMA developed to handle stereoisomers
- W. T. Wipke and T. M. Dyott, J. Amer. Chem. Soc., 96, 4825, (1974).
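A minimal sketch of the extended connectivity idea, for illustration only. It assumes a connected molecular graph given as an adjacency dictionary of heavy atoms, starts each atom at its neighbor count, and repeatedly replaces each value with the sum of its neighbors' values until the number of distinct values stops growing. The function names are hypothetical, and ties are broken by the input atom label, which is a simplification: the published algorithm uses chemical parameters such as bond order and atomic number.

```python
def extended_connectivity(adjacency):
    """Return extended-connectivity values for {atom: [neighbors]}."""
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    distinct = len(set(values.values()))
    while True:
        # Each atom's new value is the sum of its neighbors' current values.
        updated = {atom: sum(values[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        updated_distinct = len(set(updated.values()))
        if updated_distinct <= distinct:
            return values  # no further discrimination; keep the previous round
        values, distinct = updated, updated_distinct


def morgan_numbering(adjacency):
    """Number atoms outward from the highest-valued atom (connected graphs only)."""
    ec = extended_connectivity(adjacency)

    def rank(atom):
        # Highest value first; ties fall back to the input label (simplification).
        return (-ec[atom], atom)

    start = min(adjacency, key=rank)
    order, seen = [start], {start}
    for atom in order:  # the list grows while it is being traversed
        for nbr in sorted(adjacency[atom], key=rank):
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
    return order


# Carbon chain of n-pentane, atoms labeled 1..5 along the chain.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(extended_connectivity(pentane))  # {1: 2, 2: 3, 3: 4, 4: 3, 5: 2}
print(morgan_numbering(pentane))       # [3, 2, 4, 1, 5]
```

The central carbon gets the highest value and is numbered first. Nothing in the graph encodes wedge or hash bonds, which is why this approach alone cannot separate stereoisomers.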
NEMA
NEMA produces a unique name and key for a wider range of structures than SEMA. It extends perception to non-tetrahedral stereogenic centers, it supports both 2D and 3D stereochemistry perception, and it does not have an atom limit. It is proprietary to Symyx, but it is exposed in our products; for example, Symyx Draw and Symyx Direct generate NEMA keys. The work of Wipke et al. identified the value of a constitutional key and a stereo key. This approach has been incorporated into NEMA.
W. T. Wipke, S. Krishnan, and G. I. Ouchi, J. Chem. Inf. Comput. Sci., 18, 32, 1978
Tautomers (mobile H atoms) Different structures Different systematic names Presumably exist in equilibrium InChI Keys are identical NEMA Keys are different
Both structures are registered to our collection 57531-38-1 assigned to both structures 4(5)-chloro-5(4)-nitroimidazole 5(4)-chloro-4(5)-nitroimidazole 4-chloro-5-nitroimidazole 5-chloro-4-nitroimidazole 4-chloro-5-nitro-1(3)H-imidazole
Tautomers (mobile hydrogen atoms) Different NEMA Keys Same InChI Key
Mesomers Mesomers ideally would have the same identifier Different NEMA Keys Same InChI Key
Both structures are registered to our collection Methylene blue 61-73-4
Mesomers? Same InChI Key Different NEMA Keys Same InChI Key Same NEMA Keys
Stereoisomers No stereo Enantiomeric pair Pure enantiomer InChI does not distinguish pure enantiomer from racemate
Relative versus absolute stereochemistry Indistinguishable based on InChI Key
Absolute Stereochemistry InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N 1S 1s 2R 3 unique NEMA Keys
InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N
Concern with stereochem goes back to...
ChemInform (FIZ Chemie) BRN3936786 Beilstein (Elsevier) BRN3936786 Curr. Chem. Reactions (Thomson)
5693-99-2 stereochem unspecified
71403-94-6 relative stereochem
121651-02-3 absolute stereochem (2R,3S)
126720-47-6 absolute stereochem (2S,3R)
trans-3-phenyloxirane-carboxaldehyde
(2R*,3R*)-2,3-epoxycinnamaldehyde
trans-cinnamaldehyde epoxide
Epoxyzimtaldehyd
(2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
Don't always have or know the BRN, CASRN, ChemSpiderID, MFCD#...
Relationship of Structure:RegNumbers is often 1:many
Typically problematic structures Definitely the same compound Same InChI Key Different NEMA Keys
Typically problematic compounds Just the tip of the iceberg Organometallics Inorganics
Layered structure of InChI Keys
AAAAAAAAAAAAAA-BBBBBBBBCD
AAAAAAAAAAAAAA = skeleton
BBBBBBBB = structural features (mobile hydrogens, isotopes, metal bonds...)
C = flag (InChI version...)
D = check character
The ability to group InChI Keys into classes of related structures sets them apart
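Splitting a key into its blocks can be sketched as follows. One caveat on layout: the keys quoted in this talk, such as XARGIVYWQPXRTC-DTWKUNHWSA-N, are 27 characters long with a third one-letter block, so the sketch assumes that three-block layout (14-letter skeleton hash, 8-letter hash of the remaining layers, an S/N standard flag, a version letter, and a final protonation letter) rather than the shorter pattern drawn on the slide. The names `split_inchikey` and `InChIKeyParts` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed 27-character layout: AAAAAAAAAAAAAA-BBBBBBBBFV-P
_KEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of the connectivity (first block)
    features: str      # hash of the remaining layers (stereo, isotopes, ...)
    is_standard: bool  # True when the flag letter is "S"
    version: str       # version letter
    protonation: str   # protonation indicator letter


def split_inchikey(key):
    """Split a key into its blocks, or raise ValueError if the layout differs."""
    match = _KEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    skeleton, features, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, features, flag == "S", version, protonation)


parts = split_inchikey("XARGIVYWQPXRTC-DTWKUNHWSA-N")
print(parts.skeleton)  # XARGIVYWQPXRTC
print(parts.features)  # DTWKUNHW
```

Two keys that share the first block but differ in the second describe the same skeleton with different features, which is what makes grouping into classes of related structures possible.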
InChI key resolution using ChemSpider Full InChI key search Partial InChI key search
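The difference between a full and a partial key search can be illustrated locally, without calling ChemSpider. The sketch below assumes that a partial search means matching on the first (skeleton) block only; the function names are hypothetical, and apart from the key quoted earlier in this talk the sample keys are placeholders made up for the example.

```python
from collections import defaultdict


def build_index(keys):
    """Group full keys by their first (skeleton) block."""
    by_skeleton = defaultdict(list)
    for key in keys:
        by_skeleton[key.split("-", 1)[0]].append(key)
    return by_skeleton


def search(by_skeleton, query):
    """Full key: exact match only. First block alone: every key sharing it."""
    skeleton = query.split("-", 1)[0]
    candidates = by_skeleton.get(skeleton, [])
    if "-" in query:
        return [key for key in candidates if key == query]
    return list(candidates)


keys = [
    "XARGIVYWQPXRTC-DTWKUNHWSA-N",  # key quoted in this talk
    "XARGIVYWQPXRTC-UHFFFAOYSA-N",  # illustrative: same skeleton, different second block
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # made-up placeholder skeleton
]
index = build_index(keys)
print(search(index, "XARGIVYWQPXRTC-DTWKUNHWSA-N"))  # one exact hit
print(search(index, "XARGIVYWQPXRTC"))               # both keys with that skeleton
```

A partial search therefore returns the whole family of records that share a skeleton, for example stereoisomers registered under different registry numbers.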
There is still plenty to do
Biologics
- Average pipeline contains 22% biologics
- Some companies are near 50%
- Peptides & modified peptides
- Nucleic acid sequences
Generics
- Markush structures
Polymers
- Repeating monomers
- Block copolymers
- Cross-linked polymers
So what should go into our web service?
Unique chemical structures registered to Compound Index
Unique reaction structures registered to Reaction Index
Assigned global identifiers as available
- Registry numbers (BRN, CASRN, MFCD#s, PubChemIDs...)
Computed global identifiers for all compounds
- InChI strings
- InChI Keys
- NEMA Keys
Register InChI Keys to ACD and other Symyx databases
Let the consumer decide which to use