The Electronic Representation of Chemical Structures: beyond the low hanging fruit

Similar documents
Table of Contents. Scope of the Database 3 Searching by Structure 3. Searching by Substructure 4. Searching by Text 11

Structure Searching in CrossFire Beilstein. DiscoveryGate SM Version 1.4 Participant s Guide

Introduction. Chemical Structure Graphs. Whitepaper

Searching CrossFire Beilstein Using DiscoveryGate. DiscoveryGate Version 2.2 Participant s Guide

DECEMBER 2014 REAXYS R201 ADVANCED STRUCTURE SEARCHING

InChI keys as standard global identifiers in chemistry web services. Russ Hillard ACS, Salt Lake City March 2009

Command-line tools of ChemAxon: tips and tricks

Meeting the challenges of representing large, modified biopolymers

Tautomerism in chemical information management systems

Searching Substances in Reaxys

Methods for tautomer enumeration, -searching and -duplicate filtering

Reaxys Pipeline Pilot Components Installation and User Guide

1.14 the orbital view of bonding:

Pipeline Pilot Integration

Information Extraction from Chemical Images. Discovery Knowledge & Informatics April 24 th, Dr. Marc Zimmermann

Developing CAS Products for Substructure Searching by Chemists. Linda Toler

ChemAxon. Content. By György Pirok. D Standardization D Virtual Reactions. D Fragmentation. ChemAxon European UGM Visegrad 2008

CHEMICAL REPRESENTATION GUIDE

BIOVIA ENHANCED STEREOCHEMICAL REPRESENTATION WHITE PAPER

Basic Techniques in Structure and Substructure

Structure-based approaches to the indexing and retrieval of patent chemistry. Tim Miller Head of Research May 2010

FROM MOLECULAR FORMULAS TO MARKUSH STRUCTURES

Marvin. Sketching, viewing and predicting properties with Marvin - features, tips and tricks. Gyorgy Pirok. Solutions for Cheminformatics

Imago: open-source toolkit for 2D chemical structure image recognition

Chemical Databases: Encoding, Storage and Search of Chemical Structures

Introduction to Chemoinformatics and Drug Discovery

Reaxys The Highlights

Project I. Heterocyclic and medicinal chemistry

CDK & Mass Spectrometry

The IUPAC Chemical Identifier

Navigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland

Bioinformatics Workshop - NM-AIST

ISIS/Draw "Quick Start"

BIOB111_CHBIO - Tutorial activity for Session 10. Conceptual multiple choice questions:

Marvin 5.4 A new generation of structure indexing at Elsevier. Dr. Michael Maier, Dr. Heike Nau, Elsevier

IUCLID Substance Data

Information Retrieval: SciFinder

Bio-elements. Living organisms requires only 27 of the 90 common chemical elements found in the crust of the earth, to be as its essential components.

Exam 1 (Monday, July 6, 2015)

An Integrated Approach to in-silico

Chapter 22. Organic and Biological Molecules

Analyzing Small Molecule Data in R

Comprehensive Chemoinformatics since Web-based, client/server, and toolkit approaches. Native Oracle (cartridge) and Microsoft technology.

Searching CrossFire Gmelin

Representation of molecular structures. Coutersy of Prof. João Aires-de-Sousa, University of Lisbon, Portugal

MEDICINAL CHEMISTRY I EXAM #1

Chapter 25: The Chemistry of Life: Organic and Biological Chemistry

Ákos Tarcsay CHEMAXON SOLUTIONS

Pipeline Pilot Integration

The Case for Use Cases

CHEM 261 HOME WORK Lecture Topics: MODULE 1: The Basics: Bonding and Molecular Structure Text Sections (N0 1.9, 9-11) Homework: Chapter 1:

Chapter 25 Organic and Biological Chemistry

Nuggets of Knowledge for Chapter 17 Dienes and Aromaticity Chem 2320

Chemistry 27 Professor Matthew Shair. Harvard University Spring Hour Exam 3 Friday, April 28, :07 AM 12:00 PM

MEDICINAL CHEMISTRY I EXAM #1

Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME

10. Amines (text )

DERWENT MARKUSH RESOURCE NOW AVAILABLE ON STN. BRIAN LARNER IP & SCIENCE 10 Dec 2015

Chemistry in Biology Section 1 Atoms, Elements, and Compounds

Chemical Ontologies. Chemical Ontologies. ChemAxon UGM May 23, 2012

Canonical Line Notations

Benzene and its Isomers

Chemical Data Retrieval and Management

Large Scale Evaluation of Chemical Structure Recognition 4 th Text Mining Symposium in Life Sciences October 10, Dr.

The shortest path to chemistry data and literature

Course Information. Instructor Information

2016 Pearson Education, Inc. Isolated and Conjugated Dienes

Isomerism CH 4 C 2 H 6 C 3 H 8 C 4 H 10 C 5 H 12. Constitutional isomers...

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE

ORGANIC CHEMISTRY. Fifth Edition. Stanley H. Pine

Large scale classification of chemical reactions from patent data

Introduction to Chemoinformatics

ORGANIC - BROWN 8E CH.1 - COVALENT BONDING AND SHAPES OF MOLECULES

Alkenes and Alkynes 10/27/2010. Chapter 7. Alkenes and Alkynes. Alkenes and Alkynes

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

A powerful site for all chemists CHOICE CRC Handbook of Chemistry and Physics

Organometallics & InChI. August 2017

LigandScout. Automated Structure-Based Pharmacophore Model Generation. Gerhard Wolber* and Thierry Langer

Data Mining in the Chemical Industry. Overview of presentation

Ping-Chiang Lyu. Institute of Bioinformatics and Structural Biology, Department of Life Science, National Tsing Hua University.

Organic Chemistry. Organic chemistry is the chemistry of compounds containing carbon.

Alcohols, Ethers, & Epoxides

5. (6 pts) Show how the following compound can be synthesized from the indicated starting material:

Dictionary of ligands

Chapter 2: An Introduction to Organic Compounds

Aromatic Hydrocarbons

Learning Guide for Chapter 17 - Dienes

Chemoinformatics and information management. Peter Willett, University of Sheffield, UK

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007

Organic Chemistry 112 A B C - Syllabus Addendum for Prospective Teachers

BioSolveIT. A Combinatorial Docking Approach for Dealing with Protonation and Tautomer Ambiguities

Enablement in the Chemical Arts. Carlyn A. Burton Osha Liang LLP October 21, 2009

Learning Organic Chemistry

antidisestablishmenttarianism an-ti-dis-es-tab-lish-ment-ta-ri-an-ism

NOVEMBER 2014 REAXYS R101 BASIC REACTION QUERIES

Biol 205 S08 Week 2 Lecture 1

ICM-Chemist How-To Guide. Version 3.6-1g Last Updated 12/01/2009

How to add your reactions to generate a Chemistry Space in KNIME

ORGANIC CHEMISTRY. Wiley STUDY GUIDE AND SOLUTIONS MANUAL TO ACCOMPANY ROBERT G. JOHNSON JON ANTILLA ELEVENTH EDITION. University of South Florida

CHM1321 Stereochemistry and Molecular Models Assignment. Introduction

Transcription:

The Electronic Representation of Chemical Structures: beyond the low hanging fruit How Accelrys Plans to Address the Remaining Challenges in Structure Representation and Searching: Chemically Modified Biologics, Non-specific Structures, and Organometallic Compounds The New Developments in Chemical Information: Best Practice Keith T Taylor, PhD, BSc, MRSC Product Manager, Chemistry Foundation Accelrys Inc, California

The Language of Chemistry The Chemical Structure Diagram What is it? psilocybin, [3-[2-(dimethylamino)ethyl]-1H-indol-4-yl] dihydrogen phosphate An: indole; phosphate ester; tertiary amine; acid; base

Valence Bond Model Chemical structure diagrams work for structures that can be described as atoms linked by a definite number of bonds to other atoms Works well for most drug-like structures that contain main group elements Second row elements can present difficulties Need to accommodate multiple valencies in periods three and higher Some interesting problems in period 2 Searching is best with a standardized representation Structure representation conventions are needed

Encoding approaches Two styles; Connection Table The structure is defined as a table of atom types and bond types that connect to the atom Each atom and bond is given an arbitrary number in a series Relative coordinates for each atom are usually included Molfile is the most common type SDfile is an extension of the molfile Line Notations Arbitrary atom is selected and then the structure is described as a sequence of atoms connected by symbols that represent bond types Includes labels to identify ring closures SMILES is the most common type InChI is a line notation

Examples Molfile ACCLDraw06271311182D 8 8 0 0 0 0 0 0 0 0999 V2000 15.8192-4.2369 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 14.6879-4.6046 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 14.6879-5.7940 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.6627-6.3940 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 12.6375-5.7940 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 12.6375-4.6060 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.6627-4.0181 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.6627-2.8370 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 0 3 2 2 0 0 0 0 3 4 1 0 0 0 0 4 5 2 0 0 0 0 5 6 1 0 0 0 0 6 7 2 0 0 0 0 7 2 1 0 0 0 0 7 8 1 0 0 0 0 M END Verbose Preserves layout SMILES Cc1ccccc1O Concise No layout information

Encoding Chemical Structures The structure is treated as a graph Atoms are Nodes Bonds are Edges Graph theory is used to match Nodes/Atoms and Edges/Graphs The mechanism for substructure searching The structure is canonicalized A unique layout that can be reproduced for any input version A name is generated from the canonical structure NEMA key (Accelrys) InChI Name and Key Canonical SMILES Ideally all approaches would produce the same canonical form but in practice different approaches produce different results Names are therefore algorithm dependent Names are used for exact structure matching

Canonicalization Structure SMILES Cc1ccccc1O Canonical Version Cc1ccccc1O And Layout c1c(o)c(c)ccc1 c1cccc(o)c1c

Different Representations Group 2 elements can give problems Carbon Monoxide divalent carbon Nitric Oxide divalent nitrogen Nitrogroup pentavalent nitrogen Tautomers Acetone or prop-1-en-2-ol Aromaticity

The Low-Hanging Fruit is not Easy Define a standard representation and enforce it in your databases Accelrys Available Chemicals Directory is ubiquitous and most companies have adopted its representation rules Search in-house and external databases with the same query

The Three Rs Representation A meaningful diagram Registration Storing a standardized and validated object in a database Retrieval Use an understandable query to retrieve all the objects in the database that match

Cross-searching of Non Specific Structures Imipramine Metabolites

Challenges - Addressed Generic Structure Combinatorial Library Patents Polymers Mixtures with known composition Acetaminophen (paracetamol), aspirin, caffeine, and excipients Non-specific structures Natural products Industrial preparations Metabolites Biologics Antibody Drug Conjugates Organometallics

Generic Structures Benzodiazepine library Contains 192 structures

Polymer Aluminized PET Jeffermine ED-2003

Mixture Acetaminophen (paracetamol), aspirin, caffeine, and excipients

Non-specific Structures Natural product Commercial mixtures Metabolites

Organometallics Coordination (dative) bonding Haptic bonding

Biologics Small biologics Up to ~30 residues Use full connectivity Visually confusing Depict the individual residues as abbreviations Visually cleaner But underlying structure remains cumbersome Large number of stereogenic centers slows down registration and searching

Self-Contained Sequence Representation A hybrid approach Use pseudoatoms for standard residues Embed explicit chemistry for non-standard residues and modifications Embed the full structure of standard residues to enable full structure features to be calculated Formula and formula weight Much more compact Registration and searching much faster No loss of structural information Structure is portable Visually resembles the abbreviated form Self-Contained Sequence Representation: Bridging the Gap between Bioinformatics and Cheminformatics William L. Chen*, Burton A. Leland, Joseph L. Durant, David L. Grier, Bradley D. Christie, James G. Nourse, and Keith T. Taylor J. Chem. Inf. Model., 2011, 51 (9), pp 2186 2208. (DOI: 10.1021/ci2001988)

Improved depiction of Chemically Modified Biologics Do not underestimate the scale of depiction work It is essential for successful adoption contracted expanded

PEGylated peptides

Antibody Drug Conjugates Example: Herceptin/Trastuzumab Large biologic structure Drug attached via a linker to a variable location Variable number of attached drug/linker entities Structure can be registered and searched using a combination of all the features described But Drawing needs simplification Depiction needs improvements Representation issue remains How to display a disulfide bridge that may be broken and replaced by the drug payload Account for the disulfide bridges that remain in the formula and formula weight In research

What works today Herceptin_DM1 Mut Cys

What works today Site-specific conjugation to an antibody with an unnatural amino acid glycosylated

Markush Chemically Modified Biologic Variable residue Variable attachment

Markush Homology Representation Major focus of Accelrys chemical representation development Represent parts of the structure as text identifier ALK represents any alkyl group Use for registration and searching Use for patent searching Patents contain all the features described so far (and more) Generic features Defined RGroups Atom lists Generic atoms Homology groups

Curently supported Generic features Defined RGroups Atom lists Generic atoms Reaxys homology groups Any group G Acyclic ACY Carbacyclic ABC Alkyl ALK Alkenyl AEL Alkynyl AYL Heteroacyclic AHC Alkoxy AOX Cyclic CYC Heterocyclic CHC Heteroaryl HAR Carbocyclic CBC Aryl ARY Cycloalkyl CAL Cycloalkenyl CEL Cyclic (no C) CXX

Mapping: Homology group screening Screen MDDR data set 129,237 structures screened in ~30s No pre-processing Hits = 470 Hits = 108 Hits = 45 Hits = 16 Key: Q = Any atom except C & H AHC = acyclic chain with a heteroatom AOX = alkoxy chaing Hits = 10

Stereochemistry Requires a separate discussion Tetrahedral stereochemistry covered Includes ABSOLUTE, AND and OR centers, and structures with a mixture of types Allenes and cumulenes Biaryls and any pair of rings with hindered rotation Work in progress Stereochemistry of organometallics Helicenes

Summary Discrete small molecules are covered Biologics well covered but work needed on User Interface (UI) design Stereochemistry well covered Organometallics and helicenes need work Significant enhancements to homology group handling required Underway