InChI/InChIKey vs. NCI/CADD Structure Identifiers: A Comparison
Markus Sitzmann
Computer-Aided Drug Design Group (NCI/CADD), Laboratory of Medicinal Chemistry, NCI-Frederick, NIH, DHHS
Comparison Standard InChI/InChIKeys - NCI/CADD Structure Identifiers
The Adaption and Use of the IUPAC InChI/InChIKey
Chemical Structure Lookup Service: 74 million structure records, 46 million unique structures
Identifiers: InChI/InChIKey, Std. InChI/InChIKey, NCI/CADD Identifiers (FICTS, FICuS, uuuuu)
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers are based on hashcodes calculated by the chemoinformatics toolkit CACTVS.
CACTVS hashcodes (example: 9850FD9F9E2B4E25):
- represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
- have a high sensitivity to structural features of a compound
- change if connectivity changes
[Figure: CACTVS hashcodes of closely related forms of one example structure. Labels: tautomers, stereoisomers, salt, charged form, isotope, errors. Hashcodes shown: 6C16DE2351F9FF50, E92E4BA2869F3611, 8A7AD1EB498CC76A, 3ECEF579D7DF025A, 9850FD9F9E2B4E25, A3DAE0788050DDE4, 8F7A1DE5A733F0E0, B2FDA68AEDA06DB9, 60525E1AF41497B6]
Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers
Workflow: input structure -> structure normalization -> parent structure -> hashcode calculation (E_HASHISY) -> NCI/CADD Identifier
Input formats shown: MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB; database (MDL SDF, SMILES)
NCI/CADD Structure Identifiers: Structure Normalization
Adjustable levels of sensitivity (sensitive: feature is kept; un-sensitive: feature is normalized as follows):
  Fragments         keep only largest organic fragment
  Isotopes          ignore isotope labels
  Charges           uncharge
  Tautomers         find canonical tautomer
  Stereochemistry   discard stereo information
NCI/CADD Structure Identifiers: Structure Normalization
FICTS identifier: representation of the exact drawing
Sensitive to: Fragments (F), Isotopes (I), Charges (C), Tautomers (T), Stereochemistry (S)
NCI/CADD Structure Identifiers: Structure Normalization
FICuS identifier: comes closest to how a chemist perceives a compound
Sensitive to: Fragments (F), Isotopes (I), Charges (C), Stereochemistry (S)
Un-sensitive to: Tautomers (u)
NCI/CADD Structure Identifier: Structure Normalization
uuuuu identifier: closely related forms of the same compound
Un-sensitive to: Fragments, Isotopes, Charges, Tautomers, Stereochemistry (u u u u u)
NCI/CADD Structure Identifier: Structure Normalization
[Flowchart: input structure -> parent structures; branches are labeled n and d]
Normalization steps shown:
- correct structure: add hydrogen atoms, correct functional groups, correct metal atom bonds
- get largest fragment & uncharge: delete complex center, get largest organic fragment, delete radical center, uncharge structure
- discard isotope labels
- define canonical resonance form/protonation state
- define canonical tautomer
- normalize or discard stereo information
Parent structures: FICTS, FICTu, FICuS, FICuu, uuuts, uuutu, uuuus, uuuuu
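The tags on these slides appear to carry one letter per normalization feature, in the order Fragments, Isotopes, Charges, Tautomers, Stereochemistry. A minimal sketch of that reading follows. It assumes that a "u" at a position means un-sensitive and that any other letter means sensitive; the function name `tag_sensitivity` is hypothetical and not part of CACTVS.

```python
# Feature order taken from the slides: F, I, C, T, S.
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")


def tag_sensitivity(tag):
    """Map a five-letter tag such as 'FICuS' to per-feature sensitivity flags.

    Assumption: 'u' marks an un-sensitive feature, any other letter a
    sensitive one.
    """
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected a {len(FEATURES)}-letter tag, got {tag!r}")
    return {feature: letter != "u" for feature, letter in zip(FEATURES, tag)}


if __name__ == "__main__":
    for tag in ("FICTS", "FICuS", "uuuuu"):
        print(tag, tag_sensitivity(tag))
```

Under this reading, FICuS differs from FICTS only in the tautomer position, which matches the description of FICuS as tautomer-invariant.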
NCI/CADD Structure Identifier
Examples:
  9850FD9F9E2B4E25-FICTS-01-57
  9850FD9F9E2B4E25-FICuS-01-78
  9850FD9F9E2B4E25-uuuuu-01-27
Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
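The four-part format can be split mechanically. The sketch below parses an identifier string into its fields. It assumes a 16-digit uppercase hexadecimal hashcode, a five-letter tag, and two-character version and checksum fields, as in the three examples on the slide. The checksum algorithm is not given in the slides, so the checksum is only extracted, not verified. `StructureIdentifier` and `parse_identifier` are hypothetical names.

```python
import re
from typing import NamedTuple


class StructureIdentifier(NamedTuple):
    hashcode: int   # 64-bit unsigned value of the 16-digit hex hashcode
    tag: str        # e.g. "FICTS", "FICuS", "uuuuu"
    version: str    # e.g. "01"
    checksum: str   # kept as text; the algorithm is not described in the slides


# Assumed field widths, inferred from the example identifiers only.
_PATTERN = re.compile(r"^([0-9A-F]{16})-([A-Za-z]{5})-([0-9]{2})-([0-9A-F]{2})$")


def parse_identifier(text):
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its four fields."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognized structure identifier: {text!r}")
    hex_part, tag, version, checksum = match.groups()
    return StructureIdentifier(int(hex_part, 16), tag, version, checksum)


if __name__ == "__main__":
    parsed = parse_identifier("9850FD9F9E2B4E25-FICuS-01-78")
    print(parsed.tag, format(parsed.hashcode, "016X"))
```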
[Figure: FICTS identifiers for the related forms of the example structure. Labels: tautomers, stereoisomers, salt, charged form, isotope, errors. Identifiers shown: 6C16DE2351F9FF50-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, E5F83F10C5DB080A-FICTS (twice), 9850FD9F9E2B4E25-FICTS (twice), A3DAE0788050DDE4-FICTS, B2FDA68AEDA06DB9-FICTS]
[Figure: FICuS identifiers for the related forms of the example structure. Labels: tautomers, stereoisomers, salt, charged form, isotope, errors. Identifiers shown: 9850FD9F9E2B4E25-FICuS (three times), E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, E5F83F10C5DB080A-FICuS (twice), A3DAE0788050DDE4-FICuS, B2FDA68AEDA06DB9-FICuS]
[Figure: uuuuu identifiers for the related forms of the example structure. Labels: tautomers, stereoisomers, salt, charged form, isotope, errors. Identifiers shown: 9850FD9F9E2B4E25-uuuuu (eight times), 9850FD9F9E2B4E25-FICuS (once)]
[Figure: Std. InChIKeys for the related forms of the example structure. Labels: tautomers, stereoisomers, salt, charged form, isotope, errors. Keys shown: DVDQJCIGZP-UFFFAYSA- (four times), DVDQJCIGZP-RXMQYKEDSA-, DVDQJCIGZP-YFKPBYRVSA-, UPKBYGGMJTIM-UFFFAYSA-M (twice), DVDQJCIGZP-CDYZYAPPSA-]
Structure Normalization: Tautomers
Canonical tautomer?
Structure Normalization: Tautomers
CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
Rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
Types of tautomerism covered:
- 1.3, 1.5 keto/enol
- imine/enamine
- imine/amine
- lactam/lactim
- 1.3, 1.5, 1.7, 1.11 hydrogen atom shift on (aromatic) heteroatoms
- keten/ynol
- nitro/aci-nitro
- nitroso/oxime
- special cases: cyanic/iso-cyanic acid, phosphonic acid, formamidinesulfonic acid, isocyanide, furanones and more
Structure Normalization: Tautomers
21 SMIRKS transforms, examples:
transform: 1.3 keto-enol
  [,S,Se,Te;X1:1]=[Cx1:2][CX4R{0-2}:3][#1:4]>>[#1:4][,S,Se,Te;X2:1][Cx1,cx1:2]=[C,cx1,cx0:3]
transform: 1.3 heteroatom shift
  [,n,s,s,,o,se,te:1]=[x2,nx2,c,c,p,p:2][,n,s,,se,te:3][#1:4]>>[#1:4][,n,s,,se,te:1][X2,nX2,C,c,P,p:2]=[,n,S,s,,o,Se,Te:3]
transform: 1.5 heteroatom shift
  [nx2,x2,s,,se,te:1]=[c,c,nx2,x2:6][c,c:5]=[c,c,nx2:2][,n,s,s,,o,se,te:3][#1:4]>>[#1:4][,n,s,,se,te:1][C,c,nX2,X2:6]=[C,c:5][C,c,nX2:2]=[X2,S,,Se,Te:3]
Structure Normalization: Tautomers
Example: guanine
[Figure: 15 formal tautomers of guanine, each with its own FICTS identifier: A6199E68A788F2F5-FICTS, 67196F0B20B1D934-FICTS, D979CF9770AC0BA5-FICTS, 675R4FCC50F45026-FICTS, 959B273B619C709F-FICTS, 1AD375920BE60DAD-FICTS, BCCDA7D0CDACF120-FICTS, CE8F480C11DBFC4F-FICTS, 61248C4A7D045A47-FICTS, 0B345B47F6625113-FICTS, 181CA9BCE3EF47F4-FICTS, D46A1E6500B06AB6-FICTS, 56FFE8B5619FB01-FICTS, F802E527EC5C61BF-FICTS, EF060DA9D97091DE-FICTS]
All tautomers share one Std. InChIKey (UYTPUPDQBUYGX-UFFFAYSA-) and one FICuS identifier (BCCDA7D0CDACF120-FICuS).
Structure Normalization: Tautomerism & Stereochemistry
Example: methyl propenyl ketone
[Figure: E isomer, Z isomer and their tautomer; all map to 76D03F08ACDF6C0C-FICuS]
FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation.
InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry
Example: methyl propenyl ketone (all forms map to 76D03F08ACDF6C0C-FICuS)
  no double bond stereo:  InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3            LABTWGUMFABVFG-UFFFAYSA-
  E isomer:               InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3+      LABTWGUMFABVFG-EGZZKSA-
  Z isomer:               InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3-      LABTWGUMFABVFG-ARJAWSKDSA-
  tautomer:               InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4,6H,1H2,2H3/b5-4-  LYGWZVQSCPYDG-PLGDYQASA-
FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation.
InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry
Example: methyl propenyl ketone
FICTS sees four different structures (E isomer, Z isomer and two tautomers): 821D8C17ACE5040E-FICTS, 76D03F08ACDF6C0C-FICTS, 6EB4AA2BAA11965F-FICTS, 1677645190718885-FICTS
Structure Normalization: Charges in Resonance Systems
Canonical resonance structure?
[Figure: two resonance forms with the same hashcode F3A27F03AE77A722 are uncharged; the results have hashcodes 62FADCB01F197FC9 and 2E011EE4519F7920]
Problem: uncharging yields different protonation states.
Structure Normalization: Charges in Resonance Systems
Generation of all formal resonance structures for a given (charged) organic compound
Rule set of 14 transforms encoded as (CACTVS-extended) SMIRKS:
- shifting of charges: 5 rules
- recombination of charges: 5 rules
- separation of charges: 4 rules
Structure Normalization: Charges in Resonance Systems
Example: münchnones (no plausible unpolarized resonance structure can be drawn)
[Figure: resonance structures connected by transforms labeled 1.2 recombination, separation (pentavalent atom), 1.2 shift, 1.3 shift, 1.3 recombination]
Identifiers shown: IUYUGWCTLFFCL-UFFFAYSA- (Std. InChIKey), F68AC07DE0D3379F-FICuS
InChI/InChIKey - NCI/CADD Identifier comparison: "Chemical Structure Lookup Service" Database
74 million structure records (~46 million unique structures)
Sources:
- PubChem database (~47%), including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider
- ChemNavigator iResearch Library (~43%): compilation of commercially available screening compounds from ~250 international chemistry suppliers
- Commercial sources / others (~10%): Asinex, Comgenex, ...
InChI/InChIKey - NCI/CADD Identifier comparison: Unique Structure Counts
Structure records registered in CSLS: 74.2 million
Successful calculation of:
  Standard InChI/InChIKey:            73.8 million records
  NCI/CADD Structure Identifiers:     73.7 million records
Unique structure counts (compound sets):
  Standard InChI/InChIKey             48,027,940
  FICTS Identifier                    48,023,835
  FICuS Identifier                    46,715,521
  Standard InChIKey (first block)     43,055,589
  uuuuu Identifier                    41,671,010
Standard InChI/InChIKeys were calculated by stdinchi-1 (Linux i-386 executable) from the original SD file records.
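A unique structure count of this kind is the number of distinct identifier values per identifier type across all records. The sketch below shows the counting step on three toy records. The hashcodes are taken from the example slides, but their assignment to records here is illustrative only, and `unique_counts` is a hypothetical helper, not the CSLS implementation.

```python
from collections import defaultdict


def unique_counts(records):
    """Count distinct identifier values per identifier type.

    Each record is a dict mapping an identifier type (e.g. "FICTS")
    to the identifier value calculated for that structure record.
    """
    seen = defaultdict(set)
    for record in records:
        for kind, value in record.items():
            seen[kind].add(value)
    return {kind: len(values) for kind, values in seen.items()}


# Toy data: three records of related forms of one compound (illustrative).
EXAMPLE_RECORDS = [
    {"FICTS": "9850FD9F9E2B4E25", "FICuS": "9850FD9F9E2B4E25", "uuuuu": "9850FD9F9E2B4E25"},
    {"FICTS": "6C16DE2351F9FF50", "FICuS": "9850FD9F9E2B4E25", "uuuuu": "9850FD9F9E2B4E25"},
    {"FICTS": "E92E4BA2869F3611", "FICuS": "E92E4BA2869F3611", "uuuuu": "9850FD9F9E2B4E25"},
]

if __name__ == "__main__":
    print(unique_counts(EXAMPLE_RECORDS))
```

The less sensitive the identifier, the more records collapse onto one value, which is the pattern seen in the counts from FICTS down to uuuuu.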
InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison
Comparison 1: conflicts?
Sets compared:
- original structure record set (74.2 million)
- Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique)
- FICuS compound set (46.7 million unique)
InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison
Comparison 2: same InChI/InChIKey?
Sets compared:
- original structure record set (74.2 million)
- Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique)
- FICuS compound set (46.7 million unique)
- Standard InChI/InChIKey calculated by CACTVS from the FICuS compound structure
InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison
Comparison 1: no conflicts between Std. InChI/InChIKey and FICuS
                                                  structure records (million)
  all structure records                           73.7
  FICuS linked to a single InChI/InChIKey         62.3 (84.5%)
    both linked to a single structure record      34.4 (46.9%)
    both linked to multiple structure records     27.9 (38.0%)
InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison
Comparison 1: conflicts between Std. InChI/InChIKey and FICuS
                                                                structure records (million)
  all structure records                                         73.7
  FICuS is linked to multiple InChI/InChIKeys or vice versa     10.9 (14.7%)
    one FICuS is linked to multiple InChI/InChIKeys             6.8 (9.2%)
      number of InChIKey first block                            2.3 (3.1%)
    one InChI/InChIKey is linked to multiple FICuS              4.1 (5.5%)
      number of InChIKey first block                            1.0 (1.3%)
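A conflict in this sense arises when one identifier of a record is linked to several values of the other identifier across the record set. The sketch below classifies records given (FICuS, InChIKey) pairs. It is not the procedure used for the slides: the category names are hypothetical, and counting a record that falls into both conflict classes under the first class is an assumption made only to keep the classes disjoint.

```python
from collections import defaultdict


def classify_records(pairs):
    """Classify structure records by FICuS / InChIKey consistency.

    pairs: iterable of (ficus, inchikey) tuples, one per structure record.
    Returns record counts for three disjoint classes.
    """
    pairs = list(pairs)  # the data is traversed twice
    keys_per_ficus = defaultdict(set)
    ficus_per_key = defaultdict(set)
    for ficus, key in pairs:
        keys_per_ficus[ficus].add(key)
        ficus_per_key[key].add(ficus)

    counts = {"consistent": 0, "ficus_to_many_keys": 0, "key_to_many_ficus": 0}
    for ficus, key in pairs:
        if len(keys_per_ficus[ficus]) > 1:
            counts["ficus_to_many_keys"] += 1  # assumed precedence
        elif len(ficus_per_key[key]) > 1:
            counts["key_to_many_ficus"] += 1
        else:
            counts["consistent"] += 1
    return counts


if __name__ == "__main__":
    # Placeholder identifiers, not real FICuS values or InChIKeys.
    demo = [("F1", "K1"), ("F1", "K1"), ("F2", "K2"),
            ("F2", "K3"), ("F3", "K4"), ("F4", "K4")]
    print(classify_records(demo))
```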
InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison
Comparison 2: same InChI/InChIKey?
(all values in millions; structure records: 73.7 million in total)
                                  compounds (unique structures)      structure records
                                  all compounds   InChI changes      InChI changes
  FICTS                           48.0            3.8 (7.9%)         4.6 (6.2%)
  FICuS                           46.7            6.4 (13.7%)        9.3 (12.7%)
  uuuuu                           41.6            11.9 (28.6%)       21.9 (29.7%)
  uuuuu vs. InChIKey first block                  3.2 (7.6%)         6.3 (8.4%)
InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison
  compound classification           occurrence in FICuS set   occurrence in FICuS subset (InChI changes)
  (formal) tautomer count > 1       56.4%                     14.5%
  (formal) tautomer count > 3       25.4%                     18.5%
  (formal) tautomer count > 10      5.5%                      28.9%
  full stereo                       25.7%                     16.9%
  contains metal atoms              0.8%                      34.5%
  metal complexes                   0.2%                      52.1%
  salt                              1.0%                      18.6%
  has resonance charges             0.2%                      52.1%
  inorganic                         0.1%                      33.9%
InChI/InChIKey - NCI/CADD Identifier comparison
Example: ChemBlock A3422/0145215
- FICuS: 12 different structure records are linked to this structure.
- Std. InChI/InChIKey (stdinchi-1): calculates 3 different strings/keys for these 12 structure records (all have the same connectivity layer/first block).
- All 3 of these StdInChI/InChIKeys differ from the StdInChI/InChIKey calculated after FICuS normalization (including connectivity layer/first block).
InChI/InChIKey - NCI/CADD Identifier comparison
Example: ChemBlock A3422/0145215
[Figure: E and Z isomers, R and S stereoisomers and a tautomer of the compound. Question raised on the arrows: tautomeric interconversion?]
InChI/InChIKey - NCI/CADD Identifier comparison
Example: ChemBlock A3422/0145215
How many structures?
[Figure: E and Z isomers, R and S stereoisomers and a tautomer, possibly linked by tautomeric interconversion. Source records shown: ZINC04685909, ChemBlock A3422/0145215, ChemNavigator 47748165, NIST MS-Lib 1967005690, ChemNavigator 65635274, ChemNavigator 34903393]
InChI/InChIKey - NCI/CADD Identifier comparison
Example: ChemBlock A3422/0145215
How many structures?
[Figure: the R, S, E and Z forms and the tautomer all map to one FICuS parent structure, but receive three Std. InChIKeys (InChIKey A, InChIKey B, InChIKey C) with the same connectivity layer/block]
The Adaption and Use of the IUPAC InChI/InChIKey
Chemical Structure Lookup Service: 74 million structure records, 46 million unique structures
http://cactus.nci.nih.gov/lookup
Identifiers: InChI/InChIKey, Std. InChI/InChIKey, NCI/CADD Identifiers (FICTS, FICuS, uuuuu)
Web Service: Chemical Structure REST Service (beta)
URL scheme: http://cactus.nci.nih.gov/chemical/structure/{identifier}/{method}
Examples:
  http://cactus.nci.nih.gov/chemical/structure/inchikey=lfqscwfljttz-ufffaysa-/smiles
  http://cactus.nci.nih.gov/chemical/structure/inchikey=lfqscwfljttz-ufffaysa-/names
  http://cactus.nci.nih.gov/chemical/structure/inchikey=lfqscwfljttz-ufffaysa-/ficus
  http://cactus.nci.nih.gov/chemical/structure/inchikey=lfqscwfljttz-ufffaysa-/stdinchi
  http://cactus.nci.nih.gov/chemical/structure/inchikey=lfqscwfljttz-ufffaysa-/image
  http://cactus.nci.nih.gov/chemical/structure/ethanol/stdinchikey
  http://cactus.nci.nih.gov/chemical/structure/64-17-5/stdinchikey
Returns plain text or a GIF image. If the structure identifier is not resolvable, the service returns an HTTP 404 status code.
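The URL scheme lends itself to a small client. The sketch below builds request URLs from the scheme on the slide and treats a 404 response as "not resolvable". The helper names `build_url` and `resolve` are hypothetical, only text-returning methods are handled (not `image`), and since the service was labeled beta its current behavior may differ from the slide.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier: str, method: str) -> str:
    """Fill the {identifier}/{method} URL scheme shown on the slide."""
    # '=' is left unescaped so that forms like 'inchikey=...' stay readable.
    return f"{BASE_URL}/{quote(identifier, safe='=')}/{quote(method, safe='')}"


def resolve(identifier: str, method: str, timeout: float = 10.0) -> Optional[str]:
    """Return the plain-text response, or None on HTTP 404 (not resolvable)."""
    try:
        with urlopen(build_url(identifier, method), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:
            return None
        raise  # other HTTP errors are not part of the documented behavior


if __name__ == "__main__":
    # Performs a network request; requires the service to be reachable.
    print(resolve("ethanol", "stdinchikey"))
```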
Acknowledgments
CADD Group, LMC, NCI: Marc Nicklaus, Igor V. Filippov
ChemNavigator: Scott Hutton, Tad Hurst
CACTVS, Xemistry GmbH: Wolf-Dietrich Ihlenfeldt
Thanks to all database providers
Thanks to the InChI Team
Our web site: http://cactus.nci.nih.gov