CHEMISTRY COLLECTION Basic Chemistry Guide


Copyright Notice

Copyright 2011 Accelrys Software Inc. All rights reserved. This product (software and/or documentation) is furnished under a License Agreement and may be used only in accordance with the terms of such agreement.

Trademarks

The registered trademarks or trademarks of Accelrys Software Inc. include but are not limited to:
  ACCELRYS
  ACCELRYS & Logo
  PIPELINE PILOT
All other trademarks are the property of their respective owners.

Restrictions on Government Use

This is a commercial product. Use, release, duplication, or disclosure by United States Government agencies is subject to restrictions set forth in DFARS or FAR, as applicable, and any successor rules and regulations.

Acknowledgments and References

To print photographs or files of computational results (figures and/or data) obtained using Accelrys software, acknowledge the source in an appropriate format. For example:
  Imaging results obtained using software programs from Accelrys Software Inc. Data management and analysis performed with the Pipeline Pilot Imaging collection. Graphical displays generated with the Discovery Studio Visualizer.
To reference an Accelrys Software Inc. publication in another publication, cite Accelrys Software Inc. as both the author and the publisher. For example:
  Accelrys Software Inc., Chemistry Collection: Basic Chemistry User Guide, Pipeline Pilot, San Diego: Accelrys Software Inc.

Request for Permission to Reprint and Acknowledgment

Accelrys may grant permission to republish or reprint its copyrighted materials. Requests should be submitted to Accelrys Scientific and Technical Support, either through e-mail to support@accelrys.com or in writing to:
  Accelrys Scientific and Technical Support
  Telesis Court Suite 100
  San Diego, CA
Please include an acknowledgment "Reprinted with permission from Accelrys Software Inc., [Document name], [Month Year], Accelrys Software Inc., San Diego." For example:
  Reprinted with permission from Accelrys Software, Inc., Pipeline Pilot Next Gen Sequencing Collection: User Guide, May, 2011, Accelrys Software, Inc., San Diego.

Contents

Chapter 1 Introduction
  Who Should Read this Guide
  Requirements
  Client-side Software Requirements
  Server-side Software Requirements
  Additional Information
Chapter 2 Readers
  Shared Parameters for Readers
  Source Files for Readers
Chapter 3 Viewers and Writers
  Viewers
  Writers
  Shared Parameters for Writers
Chapter 4 Converters
  Molecule From Text
  Molecule To Text
Chapter 5 Calculators
  ALogP
  Canonical Smiles
  Translating SMILES Output to Molecular Data
  E-state Keys
  What to Output
  Element Count
  Advanced Parameters
  Molecular Fingerprints
  Fingerprint Storage Formats
  Path-based Fingerprints
  MDL Public Key Fingerprints
  User Key Fingerprints
  Atom Environment Fingerprints
  Extended Connectivity Fingerprints
  Fingerprint Generation Method
  Calculate Extended Connectivity Fingerprints
  Fingerprint Feature Code Generation
  Hashing Schemes
  Functional Class Fingerprints
  Reaction Fingerprints
  Figuring Out the Fingerprint Name
  Molecular Formula
  Molecular Properties
  Molecular Property Counts
  Molecular Weight
  Num H AcceptorDonors
  Solubility
  Solubility Model
  Substructure Count from File
  Substructure Count from Tag
  Substructure Map
  Surface Area and Volume
  Solvent Accessible Surface Area
  Molecular Energy
  PiSystem Properties
Chapter 6 Manipulators
  2D Depiction Algorithms
  Standardize Molecule
  Standardize Charges
  3D Coordinates, 3D Conformations and Minimize Energy
  Generate Fragments
  Chain Assemblies
  Rings
  Ring Assemblies
  Bridge Assemblies
  Bemis-Murcko Assemblies
  Deprotonate Bases, Protonate Acids and Ionize Molecule at pH
Chapter 7 Filters
Chapter 8 Search and Similarity
  Molecular Similarity (Tanimoto, etc.)
Chapter 9 Database Content
  Components and Example Protocols
Appendix
  Appendix A: Substructure Searching
  Appendix B: Tetrahedral Stereo Perception
  Appendix C: Support for MDL Enhanced Stereo Representation
  Appendix D: Support for Features in ChemDraw Files
  Appendix E: Glossary of Terms

Chapter 1 Introduction

The Chemistry collection allows you to deploy Pipeline Pilot in a chemistry setting. Use these specialized tools to efficiently perform compound processing and cheminformatics research and analysis. This large collection of components includes data readers, writers and viewers, molecular property calculators, filters, manipulators, converters, and utilities.

With Chemistry components, you can create protocols for a variety of applications including:
  Compound library acquisition
  Library cleanup and standardization
  Substructure searching
  Extensive property profiling and subset selection
  Extended Connectivity Fingerprints (ECFP): The Chemistry collection uses Extended Connectivity Fingerprints (ECFP), SciTegic's proprietary method for calculating structural fingerprints. This method characterizes a molecule by indexing the environment of every atom, using up to four billion different structural features. ECFP is an efficient and useful method for searching and clustering.
  Partial support for Non-Specific Structures (NONS) read from SGroup lines in MDL SD or SKC files or from Accord files
  Representation, generation and enumeration of Markush structures
  Use of Markush structures as queries in substructure searching
  Representation, enumeration and depiction of repeat units
  Representation and depiction of custom data

Combined with the separately available Modeling collection:
  Substructure activity modeling
  Compound clustering

When combined with the Integration collection, the Molecular Toolkit's Java and Perl APIs provide programmatic access to the molecular data model and its searching methods.

Who Should Read this Guide

The Chemistry collection includes a large number of components organized into numerous folders. Information about using these components is available in two separate guides. The Chemistry components covered in this basic guide include:
  Readers
  Viewers
  Writers
  Converters
  Calculators
  Manipulators
  Search and Similarity

Requirements

Some collections require third-party software that is not included with Pipeline Pilot. This software might need to be installed on a client or on the server, depending on the component.

Client-side Software Requirements

Chemistry components that require third-party software on your client system include:

To use this component:            You need this software:
Accelrys DS Visualizer            Accelrys DS Visualizer or Accelrys Discovery Studio
ISISDraw Sketcher                 ISISDraw
Chemistry Sketcher                A sketch application such as ISISDraw, SymyxDraw, AccelrysDraw, or ChemDraw
ISIS for Excel Reader             ISISForExcel
ISIS for Excel Viewer
Accord for Excel Reader           AccordForExcel
Accord for Excel Viewer
Accord for Excel Writer

Server-side Software Requirements

Chemistry components that require third-party software on your server include:

To use this component:            You need this software:
ISIS Reader                       ISISBase
Bioisosteres components           Accelrys Bioster database license

See Also

For information on utility components and advanced chemistry concepts, such as reactions and enumeration, MCSS, and pKa, see the Advanced Chemistry User Guide. For detailed information about the latest changes in the Chemistry collection, see the Chemistry Release Notes.

Additional Information

For more information about the Chemistry collection for Pipeline Pilot and other Accelrys software products, visit the Accelrys website.

Chapter 2 Readers

A reader component generates a stream of data records that subsequent components in the pipeline both receive (as input) and send (as output). The data records that flow through the pipeline are based on input from a data source (usually a file or database). Readers are frequently used as the initial component in a pipeline. You can also use them in many other locations throughout a protocol to add more sources of data and to read temporary files. Since they are designed to generate data, readers do not expose input ports by default.

The Chemistry collection includes readers that provide import facilities for several commonly used molecular file formats, including SD, RG, RD, and RXN (from MDL), SMILES, SMIRKS, SMARTS, and TDT (from Daylight), MOL2 (from Tripos), ChemDraw and ChemDraw XML (from CambridgeSoft), Maestro (from Schrödinger), and public formats such as PDB. You can also read molecular data from databases such as ISIS and MDL Direct (from MDL), ActivityBase (from IDBS), and Accord (from Accelrys).

The current set of readers in the Chemistry collection includes:
  Accord for Excel Reader
  Accord Reader
  ChemDraw Reader
  ChemDraw XML Reader
  Embedded Molecules Reader
  ISISDraw Sketcher
  ISIS Reader
  Maestro Reader
  Mol2 Reader
  PDB Reader
  RD Reader
  RG Reader
  RXN Reader
  SD Reader
  SKC Reader
  SMARTS Reader
  SMILES Reader
  SMIRKS Reader
  TDT Reader

Shared Parameters for Readers

In addition to their own specific settings, most readers support the following parameters:
  Source: Specifies the location of the data to read into the component. You can select source files from the SciTegic server, a client machine, or another machine that is accessible on the network. You can also read URL sources, including HTTP and FTP.
  Maximum: Specifies a limit on the number of data records to read. (Reads all records if a value is not set.)
  Keep Properties: Allows you to preview the data records and define which properties on the data record to include in or exclude from the pipeline. (Retains all properties if a value is not set.)
  SourceTag: Labels data records based on their point of origin in the pipeline, which is useful for identifying the source of data records downstream or in the results. When set, an extra property called SourceTag is inserted into each data record to identify the data source location. The value can be the name of the source file location or a more general identifier (such as a number or letter). Use this property elsewhere in the pipeline to filter or group data records.
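As a rough illustration of how the Maximum, Keep Properties, and SourceTag parameters interact, the following Python sketch mimics a reader over in-memory rows. It is not Pipeline Pilot code: the function name read_records and its arguments are hypothetical, and the only behaviour it models is what the parameter descriptions above state.

```python
from itertools import islice


def read_records(rows, maximum=None, keep_properties=None, source_tag=None):
    """Yield data records (dicts), loosely imitating the shared reader parameters.

    All names here are hypothetical stand-ins for illustration only.
    """
    # maximum=None means "read all records", as with an unset Maximum.
    for row in islice(rows, maximum):
        record = dict(row)
        # An unset Keep Properties retains every property.
        if keep_properties is not None:
            record = {k: v for k, v in record.items() if k in keep_properties}
        # SourceTag adds one extra property that identifies the origin.
        if source_tag is not None:
            record["SourceTag"] = source_tag
        yield record


rows = [
    {"Name": "mol1", "MW": 59.07, "Note": "x"},
    {"Name": "mol2", "MW": 46.07, "Note": "y"},
    {"Name": "mol3", "MW": 18.02, "Note": "z"},
]
records = list(
    read_records(rows, maximum=2, keep_properties={"Name", "MW"}, source_tag="fileA")
)
```

Downstream code could then group or filter on record["SourceTag"], which is the use the SourceTag description suggests.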

Source Files for Readers

When you run a protocol, all files that the protocol uses must be accessible on the server. All readers let you select source files on the SciTegic server, the client, or another machine on the network. If the location of the source file is not shared, you are prompted to upload the file to the server so the protocol can run.

Tip: With the SD Reader component, you can read all .mol or .sd files in a folder at once. Manually edit the Source parameter to include an asterisk (*) in the filename (like the wildcard asterisk used in other Windows applications). For example, Parameter Value = data/examples/e*.sd.
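To see which files a pattern such as data/examples/e*.sd would select, the following Python sketch applies ordinary shell-style wildcard matching to a list of file names. It assumes the Source wildcard behaves like a conventional `*` wildcard, which the tip implies but does not spell out; the file names are invented for illustration.

```python
from fnmatch import fnmatchcase


def match_sources(pattern, filenames):
    """Return the file names that a shell-style wildcard pattern selects."""
    return sorted(name for name in filenames if fnmatchcase(name, pattern))


# Hypothetical directory listing.
files = [
    "data/examples/esters.sd",
    "data/examples/ethers.sd",
    "data/examples/amines.sd",
    "data/examples/e_notes.txt",
]
matched = match_sources("data/examples/e*.sd", files)
```

Only names that start with "e" and end in ".sd" are selected, so amines.sd and e_notes.txt are left out.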

Chapter 3 Viewers and Writers

Viewers

A viewer component displays information or results of a protocol on a client machine. Viewers are frequently used as the final component in a pipeline. You can also use them to display intermediate results or to provide Pipeline Pilot with additional required information at protocol runtime. The Chemistry collection integrates several popular third-party structural viewers. A variety of Web viewers are also available for graphically displaying your molecular data and property information.

The current set of viewers in the Chemistry collection includes:
  Accelrys DS Visualizer
  Accord for Excel Viewer
  Excel Structure Viewer
  HTML Molecular Cluster Viewer
  HTML Molecular Grouped Viewer
  HTML Molecular Table Viewer
  ISIS for Excel Viewer
  SAR Viewer
  Visual Molecule Selector

IMPORTANT! For viewer components that require third-party applications, the software needs to be installed on all clients that run the protocols. If you share protocols with other client users, ensure that the client machines can support the third-party applications. We recommend you configure all third-party applications to open automatically so that your protocols can run without interruptions from password prompts or login dialogs. (You might not see a password/login dialog if the Pipeline Pilot program window is maximized, because it can cover all other open windows.)

Writers

A writer is a component that saves data from a pipeline in a format that you specify when you set the value of the relevant parameter. Typically, the saved data is stored in a file or database. Writers are frequently used as the final component in a pipeline. They are also used in many other locations throughout protocols for storing intermediate data and for writing temporary files. You can also write data into different databases using these components. Most writers do not have output ports since they are designed to be the last component in a pipeline.

The Chemistry collection includes writers that provide export facilities for a number of commonly used molecular file formats, including HTML (hypertext markup language), MOL2 (from Tripos), ChemDraw and ChemDraw XML (from CambridgeSoft), Maestro (from Schrödinger), Accord (from Accelrys), SD, SKC, RD, RG, and RXN (from MDL), SMILES and TDT (from Daylight), and public formats such as PDB.

The current set of writers in the Chemistry collection includes:
  Accord for Excel Writer
  Accord Writer
  ChemDraw Writer
  ChemDraw XML Writer
  HTML Molecular Grouped Writer
  HTML Molecular Table Writer
  Maestro Writer
  PDB Writer
  RD Writer
  RG Writer
  RXN Writer
  SD Writer
  SKC Writer
  SMILES Writer
  Mol2 Writer
  TDT Writer

Shared Parameters for Writers

In addition to their own specific settings, data writers generally support the following parameters:
  Destination: Specifies the target location for the data. Writers only output data on the server.
  Maximum: Specifies a limit on the number of data records to output. (Writes all records if a value is not set.)
  IfFileExists: Specifies what to do if the file name already exists at the output destination. You can overwrite the existing file, append to it, or halt the pipeline.

Tip: Writer components can save files in compressed (zipped) or uncompressed format. To save a file as compressed, add .gz to the end of the filename.
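The writer parameters can be pictured with a small Python sketch. It is only an analogy: the function write_records and the option strings "overwrite", "append", and "halt" are hypothetical names chosen to mirror the three IfFileExists behaviours, and the .gz handling simply follows the filename-suffix rule from the tip above.

```python
import gzip
import os


def write_records(lines, destination, if_file_exists="overwrite", maximum=None):
    """Write text records to a file, imitating the shared writer parameters.

    Hypothetical sketch; not the Pipeline Pilot implementation.
    """
    if if_file_exists == "halt" and os.path.exists(destination):
        raise FileExistsError(destination)
    mode = "at" if if_file_exists == "append" else "wt"
    # A .gz suffix selects compressed output, as in the tip above.
    opener = gzip.open if destination.endswith(".gz") else open
    # maximum=None writes every record, as with an unset Maximum.
    selected = list(lines) if maximum is None else list(lines)[:maximum]
    with opener(destination, mode, encoding="utf-8") as handle:
        for line in selected:
            handle.write(line + "\n")
    return len(selected)
```

For example, write_records(["CCO", "CCN"], "out.smi.gz") would produce a gzip-compressed file, while a second call with if_file_exists="append" adds records to it.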

Chapter 4 Converters

The Converter components convert between molecules and text. They are organized into the following subfolders:
  Molecule From Text: Components that translate a textual molecule description into a molecule.
  Molecule To Text: Components that translate the molecular information into a block of text that is stored on a property.

Molecule From Text

These components convert text properties in the following formats into molecules: MDL Mol file (CTAB), SYBYL Mol2, Accelrys Accord formats, PDB, ChemDraw, ChemDraw XML, Maestro, Daylight SMARTS, Daylight SMILES, MDL RXN, MDL SKC, MDL Chime, and Pipeline Pilot Chemistry.

Molecule From Text components include:
  Identify Molecular Format
  Molecule from Accord
  Molecule from ChemDraw
  Molecule from ChemDraw XML
  Molecule from Chime
  Molecule from CTAB
  Molecule from InChI_AuxInfo
  Molecule from Maestro
  Molecule from MOL2
  Molecule from PDB
  Molecule from Pipeline Pilot Chemistry
  Molecule from SKC
  Molecule from SMARTS
  Molecule from SMILES
  Molecule from Text
  Reaction from RXN
  Reaction from SMIRKS

Molecule To Text

These components create a text property containing a representation of the molecular data record in one of the commonly used formats, or an image of the molecule in JPEG, PNG, or SVG format. This is useful for storing molecular information in a database in a format that can be used later to reconstruct the molecule. You can use the corresponding Molecule From Text component to reconstruct the molecule from a property value containing the text with the molecular representation.

Molecule To Text components include:
  Image From Molecule
  Molecule to Accord
  Molecule to ChemDraw
  Molecule to ChemDraw XML
  Molecule to Chime
  Molecule to CTAB
  Molecule to Image
  Molecule to InChI
  Molecule to JPEG
  Molecule to Maestro
  Molecule to NEMA
  Molecule to MOL2
  Molecule to PDB
  Molecule to Pipeline Pilot Chemistry
  Molecule to PNG
  Molecule to SMARTS
  Molecule to SKC
  Molecule to SMILES
  Molecule to SVG
  Molecule to Text
  Reaction to RXN

The following record shows an example of MDL CTAB text, the CTAB of an acetamide molecule (this record is a single text string with internal returns):

ACETAMIDE SciTegic D V C C O N M END

Note: CTAB, Chime, InChI, MOL2, NEMA, PDB, and SMILES are different ways to represent a molecule as text.
  CTAB, MOL2, and PDB preserve atom coordinates. They are larger in size, and their internal carriage returns can cause problems if written to some formats (such as delimited files). PDB is commonly used to store protein structures.
  Chime is a compressed and encrypted CTAB. It contains no internal carriage returns, but is not human readable.
  SMILES is more compact, and Canonical_SMILES can be used for molecular comparison, but atom coordinates are not preserved.
  NEMA is a unique format that can also be used for molecular comparison, but it is a one-way conversion (you cannot convert back).
  InChI is the international chemical identifier from IUPAC. With InChI, two strings can be calculated: InChI and InChI_AuxInfo. InChI is similar to Canonical_SMILES in that it can be used for molecular comparison and contains no atom coordinates, but it has the additional feature that different tautomers of the same compound have the same InChI string. InChI_AuxInfo is the string recommended for recreating the molecule (Molecule From InChI_AuxInfo). It is a more verbose string in which atom coordinates and bond orders are preserved. For more information about InChI, see the IUPAC InChI resources.

The Reaction to RXN component creates a text property containing an MDL RXN representation of the reaction in a molecular data record.

The Molecule to JPEG, Molecule to PNG, and Molecule to SVG components calculate images of the molecule in the following formats:
  JPEG: (Joint Photographic Experts Group) A format used for compressed high-color or true-color images such as photographs.
  PNG: (Portable Network Graphics) A newer format used for bitmapped images (similar to GIF without legal restrictions). It provides high color support and improved compression. Newer versions of browsers such as Internet Explorer support this format.
  SVG: (Scalable Vector Graphics) A modularized language for describing two-dimensional vector and mixed vector/raster graphics in XML.

The current coordinates of the molecule are used. For a given data record, a property containing a JPEG, PNG, or SVG image can be saved to an image file using the Text Writer with the output mode set to Binary.

Tip: If the molecule is not currently represented in 2D, use the 2D_Coords component to generate 2D coordinates.
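The earlier note that internal carriage returns can cause problems in delimited files can be demonstrated outside Pipeline Pilot. The Python sketch below writes a multi-line text value (a stand-in, not a valid CTAB) into a tab-delimited row twice: once by naive string joining, which splits the record across several lines, and once with the standard csv module, which quotes the field so that it round-trips. The SMILES string for acetamide is included for contrast because it has no line breaks.

```python
import csv
import io

# Stand-in for a multi-line molecule text block; NOT a valid CTAB.
multiline_text = "ACETAMIDE\nline two\nM  END"
smiles_text = "CC(N)=O"  # single-line representation of acetamide

# Naive delimited output: the embedded returns break the row apart.
naive = "\t".join(["1", multiline_text]) + "\n"
naive_rows = naive.splitlines()

# Quoted output: the csv module wraps the multi-line field in quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["1", multiline_text, smiles_text])

# Reading it back recovers one row with the original field contents.
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline=""), delimiter="\t"))
```

Whether a given downstream tool honours quoted multi-line fields varies, which is one reason a single-line format such as SMILES or Chime can be the safer choice for delimited files.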

Chapter 5 Calculators

An important feature of a Pipeline Pilot protocol is its ability to calculate some properties on-the-fly. Calculator components compute values from the molecular information and store them as properties on the data record. Components that implement these on-the-fly property calculations are called property calculators. They declare the properties they can calculate using the Output parameter. The properties these components calculate are called calculable properties. For example, the molecular property ALogP can be used within a PilotScript expression. If the value is not already defined, the required property calculator is invoked automatically.

The Chemistry collection includes property calculators that calculate numeric molecular descriptors. The current set of calculators in the Chemistry collection includes:

Physicochemical
  AlogP
  Solubility
  Surface Area and Volume
  Surface Area and Volume 3D

Structural
  Canonical Smiles
  Element Count
  Gasteiger Charges
  MDL Key Fingerprints
  Molecular Energy
  Molecular Fingerprints
  Molecular Formula
  Molecular Properties
  Molecular Property Counts
  Num H AcceptorDonors
  PiSystem Properties

Substructure Mapping
  Substructure Count from File
  Substructure Count from Tag
  Substructure Map from File
  Substructure Map from Tag

Topological indices
  Balaban, Wiener and Zagreb Indices
  Chi Indices
  E-state Keys
  InfoContent Descriptors
  Kappa Shape Indices
  Subgraph Counts

ALogP

The ALogP component calculates the Ghose/Crippen group-contribution estimate for LogP, where P is the relative solubility of a compound in octanol vs. water. For more details, see Ghose, A.K., Viswanadhan, V.N., and Wendoloski, J.J., Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragment Methods: An Analysis of AlogP and CLogP Methods. J. Phys. Chem. A, 1998, 102.

ALogP can calculate the following properties:
  AlogP: The Ghose/Crippen group-contribution estimate for LogP, where P is the relative solubility of a compound in oil (actually, octanol) vs. water.
  AlogP_MR: The Ghose/Crippen estimate of molar refractivity, which contains information about the molecular volume and polarizability of a compound.
  AlogP_Count: An array of 120 numbers, which correspond to the 120 Ghose/Crippen atom types. Each array element holds the number of atoms in the molecule of that particular atom type.
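The idea of a calculable property, introduced at the start of this chapter, is that a value is computed only when something asks for it and it is not yet defined. The following Python sketch shows that pattern in miniature. The Record class, the calculator registry, and the property name HeavyAtom_Count are all hypothetical; they are not Pipeline Pilot APIs, and the real mechanism may differ.

```python
class Record:
    """A data record whose missing properties are computed on demand."""

    def __init__(self, properties, calculators):
        self._properties = dict(properties)
        self._calculators = calculators  # property name -> function(record)
        self.calls = []  # log of calculators that were invoked

    def get(self, name):
        # Invoke the registered calculator only if the value is undefined.
        if name not in self._properties:
            self.calls.append(name)
            self._properties[name] = self._calculators[name](self)
        return self._properties[name]


def heavy_atom_count(record):
    """Hypothetical calculator: count the non-hydrogen atoms."""
    return sum(1 for symbol in record.get("Atoms") if symbol != "H")


calculators = {"HeavyAtom_Count": heavy_atom_count}
# Acetamide, C2H5NO, given as a flat list of element symbols.
record = Record({"Atoms": ["C", "C", "O", "N", "H", "H", "H", "H", "H"]}, calculators)
first = record.get("HeavyAtom_Count")
second = record.get("HeavyAtom_Count")  # served from the stored value
```

The second lookup does not re-run the calculator, which matches the statement that a calculator is invoked only when the value is not already defined.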

Canonical Smiles

The Canonical Smiles component is a type of molecular property calculator that calculates a SMILES representation of the input molecule, optimally canonicalized so it's independent of the original atom numbering. SMILES is a text-based representation for molecular information developed by Daylight. A canonical SMILES is independent of the original atom numbering and of whether hydrogens are explicit or implicit. The SMILES string is written as text to a property.

Canonical SMILES is unique to a given molecule, regardless of how it was drawn. You can use it as the key in merge or join operations to perform molecular comparisons without a molecular database. For example, the canonical SMILES representations of the first 10 molecules in Asinex are:

Canonical_Smiles
CN(C)c1ccc(\C=C\C(=O)\C=C\c2ccc(cc2)[N+](=O)[O-])cc1
Cc1ccccc1OCCNC(=S)SCC(=O)Nc2ccc(Cl)c(Cl)c2
Cc1ccc(OCCNC(=S)SCC(=O)OC(C)(C)C)cc1
CCCOC(=O)CSC(=S)NCCOc1ccccc1C
[O-][N+](=O)c1ncn(CCn2ncnc2[N+](=O)[O-])n1
Cc1ccc(cc1)C23CC4CC(CC(C4)(C2)c5ccc(C)cc5)C3
[O-][N+](=O)c1ccc(N\N=C\CCC\C=N\Nc2ccc(cc2[N+](=O)[O-])[N+](=O)[O-])c(c1)[N+](=O)[O-]
Clc1cccc(Oc2nc(CCCC#CC=C)nc(Oc3cccc(Cl)c3)n2)c1
C=CC#CCCCc1ccc(Oc2nc(Oc3ccccc3)nc(Oc4ccccc4)n2)cc1
CC(C)CC1NC(=S)N(C1=O)c2ccccc2

Note: The canonicalization algorithm is Accelrys'; while it is derived from the Daylight algorithm, it will not necessarily give identical results. Compare two SMILES for identity only when both are canonicalized by the same method.

Translating SMILES Output to Molecular Data

To translate the SMILES output of Canonical Smiles into a molecular data record, use the Molecule From Smiles component. This lets you write molecular information into spreadsheets and databases and recreate the molecule later, upon retrieval.

E-state Keys

This component calculates the Electrotopological State (E-State) descriptors defined by Kier and Hall (Hall L., Mohney B., Kier L., J. Chem. Inf. Comput. Sci., 1991, 31, 76-82; Hall L., Kier L., J. Chem. Inf. Comput. Sci., 2000, 40). The E-state keys are atomic indices that combine the electronic properties and the topological environment of each atom in the molecule. Keys are calculated for C, N, O, S, P, F, Cl, Br, I, Li, Be, B, Si, Ge, As, Se, Sn, and Pb atoms, which are classified into 79 atom types. These descriptors are widely used in structure similarity, library comparison, and QSAR/QSPR studies.
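The E-state Keys component can report either a count or a sum of E-state values per atom type, plus the number of atoms that could not be typed. The Python sketch below shows only that aggregation step. It assumes the per-atom types and values have already been assigned; the three type labels and all numeric values are invented for illustration (the real scheme has 79 types), and the output names follow the Estate_Counts, Estate_Keys, and Estate_NumUnknown values of the What to Output parameter.

```python
def summarize_estate(per_atom, atom_types):
    """Aggregate per-atom (type, value) pairs into per-type counts and sums.

    per_atom: list of (atom_type or None, estate_value) tuples (hypothetical input).
    atom_types: ordered list of known atom type labels.
    """
    counts = {atom_type: 0 for atom_type in atom_types}
    sums = {atom_type: 0.0 for atom_type in atom_types}
    num_unknown = 0
    for atom_type, value in per_atom:
        if atom_type not in counts:
            num_unknown += 1  # atom could not be classified
            continue
        counts[atom_type] += 1
        sums[atom_type] += value
    return {
        "Estate_Counts": [counts[t] for t in atom_types],
        "Estate_Keys": [sums[t] for t in atom_types],
        "Estate_NumUnknown": num_unknown,
    }


# Illustrative labels and values only; not computed E-state indices.
types = ["sCH3", "dO", "sNH2"]
atoms = [("sCH3", 1.5), ("sCH3", 2.0), ("dO", 10.0), (None, 0.0)]
summary = summarize_estate(atoms, types)
```

Each array position corresponds to one atom type, so the counts array is integer-valued and the sums array is floating-point.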

E-state Keys calculates either the sums of the E-state values or the counts of each atom type. The E-state Counts are the number of occurrences in the molecule of each of the 79 different atom types. E-state Sums are the sum of the E-state values for each of the 79 atom types. The output can be a number of individual properties, one for each atom type, or a single property with an array of values.

An option in the E-state calculator allows the display of the E-state type and E-state value for each atom in the molecule in the 2D depictions in the HTML Molecular Table Viewer.

Figure: E-state type and E-state value for each atom in the molecule

What to Output

The output type for E-state Keys is controlled by the What to Output parameter, which includes the following values:
  Estate_Keys_Properties: Calculates the E-state sums for all atom types and outputs them as individual properties.
  Estate_Counts_Properties: Calculates the E-state counts for all atom types and outputs them as individual properties.
  Estate_Keys: Calculates the E-state sums for all atom types and outputs them in one property as an array of double values.
  Estate_Counts: Calculates the E-state counts for all atom types and outputs them in one property as an array of integer values.
  Estate_NumUnknown: Outputs the number of atoms that could not be classified into an E-state atom type. E-state keys are calculated for organic elements (C, N, O, P, and S), halogens (F, Cl, Br, and I), and for Li, Be, B, Si, Ge, As, Se, Sn, and Pb.

Element Count

The Element Count component is a type of molecular property calculator that counts the atoms of each of the selected element types. The advanced parameters (described in detail below) allow you to return the total number of atoms using a list of element types. The Output parameter for this component contains a number of properties appended with the string _Count. The letters before the appended string are the atomic symbol for the given element. If you do not make a selection from the Output list, defaults are used to generate the output.

Note: You can query a number of non-standard element symbols, such as *, R, X, D and T. Depending on their source, they may represent unknown atom types, abstract atom types (for example, Q means non-hydrogen non-carbon atoms in MDL queries), or other features. For common data sources, these are not present.
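A small Python sketch can make the output naming concrete, together with the Elements, Total, and NotList advanced parameters of this component. The function element_count is hypothetical and works on a plain list of element symbols rather than a molecule object; it models only the behaviour described in this section.

```python
from collections import Counter


def element_count(atoms, elements, total=None, not_list=False):
    """Count atoms per element, or return one combined total.

    atoms: list of element symbols for one molecule (hypothetical input form).
    elements: element symbols to count.
    total: if given, the name of a single summed output property.
    not_list: if True, the total covers elements NOT in `elements`.
    """
    counts = Counter(atoms)
    if total is None:
        # One "<symbol>_Count" property per requested element.
        return {f"{symbol}_Count": counts.get(symbol, 0) for symbol in elements}
    listed = set(elements)
    value = sum(n for symbol, n in counts.items() if (symbol in listed) != not_list)
    return {total: value}


atoms = ["C", "C", "O", "Na", "Cl", "K"]
per_element = element_count(atoms, ["Li", "Na", "K", "Rb", "Cs"])
group_total = element_count(atoms, ["Li", "Na", "K", "Rb", "Cs"], total="Group1A_Count")
inorganic = element_count(
    atoms, ["O", "C", "N", "S", "P", "F", "Cl", "Br", "I"],
    total="Inorganic_Count", not_list=True,
)
```

When a total name is supplied, the individual counts are not produced, which mirrors the description of the Total parameter.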

Advanced Parameters

The advanced parameters that are available for this component include:
  Elements: An alternate mechanism for specifying the elements to count (list one- or two-letter codes, separated by commas). For example, enter Li,Na,K,Rb,Cs to generate the properties Li_Count, Na_Count, K_Count, Rb_Count, and Cs_Count.
  Total: The output name for the totaled count of atoms listed in Elements. If you enter a value in this parameter (for example, Group1A_Count), individual counts are not generated. Instead, a single count is generated that contains the sum of all of the individual element types.
  NotList: Specifies that the elements to count are those not listed in Elements. This inverts the logic of the total: it returns the count of all atoms with element types not contained in the given list.

For example, use the following parameter values to create a calculator for Inorganic_Count:
  Elements: O,C,N,S,P,F,Cl,Br,I
  Total: Inorganic_Count
  NotList: True
This example sets NotList to True and instructs the component to write the output to the property Inorganic_Count.

Molecular Fingerprints

This component calculates a variety of molecular fingerprints for the input molecules and reactions. It uses one of the following algorithms to calculate fingerprints:
  SciTegic extended-connectivity fingerprints
  Daylight-style path fingerprints
  Atom Environment fingerprints
  MDL public key fingerprints

For both the extended-connectivity and path fingerprints, a number of methods are available to define the atom abstraction used to generate the initial atom code. You should also specify the maximum path distance (such as the number of bonds) to use for indexing an individual fragment.

The next section provides more details about molecular fingerprints, including:
  Fingerprint Parameters
  Fingerprint Storage Formats
  Path-based Fingerprints
  MDL Public Key Fingerprints
  Atom Environment Fingerprints
  Extended Connectivity Fingerprints
  Fingerprint Generation Method
  Calculating Extended Connectivity Fingerprints
  Hashing Schemes
  Functional Class Fingerprints
  Figuring out the Fingerprint Name

Fingerprint Parameters

The Parameters tab for the Molecular Fingerprints component looks like this:

Figure: Parameters for Molecular Fingerprints

Type

This parameter is the type of fingerprint to calculate. You can use the following values:
  ExtendedConnectivity: Generates extended-connectivity fingerprints.
  Path: Generates Daylight-style path-based fingerprints.
  AtomEnvironment: Generates higher-order features from atom types using a method developed by Bender et al. This creates a String Fingerprint.
  HashAtomEnvironment: Uses a hash code to create an Integer Fingerprint representation of the AtomEnvironment fingerprints for ease of use (for example, in learned models).
  MDLPublicKeys: Generates the MDL Public key fingerprints.
  UserKeys: Generates fingerprints derived from substructures that you define (user key fingerprints).

AtomAbstraction

This parameter is only used with the ExtendedConnectivity and Path types. It determines the method for generating the initial atom feature codes for the heavy (non-hydrogen) atoms in the molecule. You can use the following values:
  FunctionalClass: Uses the rapid functional-role codes. This abstraction generates extended-connectivity fingerprints (FCFPs) and path fingerprints (FPFPs). The functional-role code is a combination of the following flags: hydrogen-bond acceptor, hydrogen-bond donor, positively ionized or positively ionizable, negatively ionized or negatively ionizable, aromatic, and halogen.
  AtomType: Uses a code derived from the number of connections to an atom, the element type, the charge, and the atomic mass. This abstraction generates extended-connectivity fingerprints (ECFPs) and path fingerprints (EPFPs).
  ALogPCode: Uses a code from the 120 atom types used in the calculation of ALogP. This abstraction generates extended-connectivity fingerprints (LCFPs) and path fingerprints (LPFPs).
  SYBYL: Uses the SYBYL atom types used in the Tripos Mol2 File Format.
  UserAtomTypes: Assumes that the property UserAtomTypes is defined on the molecule and contains an array of integers, one for each atom in the molecule. The i-th value in the array is the user atom type for the i-th atom in the molecule.
  Reaction: Uses type, charge, hybridization, reactant or product, and reaction site information. Only available for reaction inputs.

OutputType

This parameter controls the way the fingerprint is presented. There are two methods available:
  Fingerprint: A list of the features present in the molecule, with duplicates removed.
  Counts: A list of the features present in the molecule, with duplicates retained. If a feature occurs more than once in a molecule, that bit value is included more than once in the output list.
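The difference between the Fingerprint and Counts output types comes down to whether repeated feature codes are kept. The following Python sketch illustrates this with made-up integer feature codes. The function name and the choice to sort the output are assumptions for the example, not documented behaviour of the component.

```python
def present_features(feature_codes, output_type="Fingerprint"):
    """Return the feature list for one molecule in the requested style.

    feature_codes: raw feature codes, one per occurrence (hypothetical values).
    Sorting is an arbitrary choice here to make the output deterministic.
    """
    if output_type == "Fingerprint":
        return sorted(set(feature_codes))  # duplicates removed
    if output_type == "Counts":
        return sorted(feature_codes)  # duplicates retained
    raise ValueError(f"unknown output type: {output_type}")


codes = [17, -5, 17, 42, 17]
as_fingerprint = present_features(codes, "Fingerprint")
as_counts = present_features(codes, "Counts")
```

In the Counts form, the number of times a code appears carries information about how often that feature occurs in the molecule.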

MaximumDistance

This parameter is used with the ExtendedConnectivity, Path, AtomEnvironment and HashedAtomEnvironment types. For extended-connectivity fingerprints, it is the maximum diameter of the features generated. For path-based fingerprints, it is the length of the paths (in bonds) that are considered.

Fingerprint Naming Convention

For extended-connectivity and path-based fingerprints, the generated property name has a particular format.

The first character of the fingerprint name is the atom abstraction used:
  F: Functional class
  E: Atom type
  L: AlogP types
  S: SYBYL atom types
  R: Reaction atom typing

The second character represents the type of fingerprint:
  C: Extended-connectivity fingerprints
  P: Path-based fingerprints
  E: Atom Environment fingerprints
  H: Hashed Atom Environment fingerprints

The third character is always F. The fourth character is either P or C for Fingerprints or Counts, respectively. The fourth character is followed by an underscore and the maximum distance. For example, a functional class extended-connectivity fingerprint of maximum diameter 6 generates a property named FCFP_6.

Tip: Understanding the naming convention is useful with a learning component that refers to the properties by name. To identify what a particular set of parameter values generates, read a molecule, pipe it through a Molecular Fingerprints component with your settings, and view it in the Notepad Viewer. The name of the fingerprint is displayed just above the fingerprint values.

Options

This parameter provides options for the fingerprint calculation. The options include:
  IncludeStereo: (#S) includes information from stereoatoms in the fingerprint calculation.
  OutputBitDistance: (#D) outputs an array with the length or diameter of each bit.
  OutputBitSubstructure: (#C) outputs an array with SMARTS of the fragment example.
  OutputBitAllAtoms: (#A) outputs an array with the set of all atoms involved with a feature anywhere in the molecule.
  OutputBitFeatureAtoms: (#F) outputs an array with the set of atoms showing one example of the feature bit.

Note: IncludeStereo changes the fingerprint. The other options cause the calculation of other properties with associated information.

Fingerprint Storage Formats

Although you do not need to be concerned with how data is stored to perform most tasks in Pipeline Pilot, there are situations where being aware of storage types might come in handy, for example, if you are importing data from external sources and need to indicate that the data should be interpreted as fingerprint data. Two concepts are involved:
  Data types tell what is known about how to interpret the information associated with a given property.
  Data storage types tell what format is currently used to store the raw data.

Preferred Storage Types
Because fingerprints are stored and manipulated in a number of key operations, Pipeline Pilot provides native types for them. A given data type has a preferred storage type (for example, the LongType has a preferred storage type of LongStorage), though data of that type might be stored in any of a number of storage types (for example, data of LongType might be stored as StringStorage).

Note: For this discussion, a type and its preferred storage type are equivalent.

Pipeline Pilot has different native fingerprint types, and each type has a corresponding preferred storage type. These preferred storage types are generated by the fingerprint calculators used within the Molecular Fingerprints component.

Fingerprint Type          Corresponding Preferred Storage Type
LongFingerprintType       LongArrayStorage
DoubleFingerprintType     DoubleArrayStorage
StringFingerprintType     StringArrayStorage
BitFingerprintType        BitsetStorage

Extended-connectivity Fingerprints and Path-based Fingerprints generate properties of type LongFingerprintType and are stored as an array of long (32-bit) integers. MDL Public Key and User-key Fingerprints return properties of type BitFingerprintType and are stored as fixed-length bit arrays: 166 bits for MDL public keys, or M bits for user keys, where M is the number of the largest bit in any of the features in the user keys feature directory.

Tip: If you are working with an advanced task that involves expertise with fingerprint storage formats, contact Accelrys Technical Support for further assistance.

Path-based Fingerprints
Path-based fingerprints are modeled on the fingerprints developed by Daylight. SciTegic path-based fingerprints have the following characteristics:
They include the same options for initial atom coding as our extended-connectivity fingerprints, allowing for abstractions that are useful in learning and clustering.
They do not immediately fold the fingerprint down to a small set of a few hundred or thousand bits. The learning methods in the program can handle even thousands of different bits, and keeping them separate aids in learning and interpretation.
They are similar to extended-connectivity fingerprints in that a fingerprint for a given maximum path length also contains the bits for all paths of shorter lengths.

Path-based fingerprints are generated by detecting all paths up to a given length, and then generating a feature that represents each of those paths. The union of all different features present in a molecule is the path-based fingerprint for that molecule. For a particular path, the feature bit is generated as follows:
A path containing N bonds has N+1 atoms, so an array of 2N+1 elements is allocated.
The first element is filled with the initial atom code for one of the end atoms; the next element with the bond type to the following atom; the next with the initial atom code for that atom; and so on, until the entire path is in the array.
The array is hashed to give the feature code for that path.
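The path-encoding steps above can be sketched as follows. This is only an illustration under stated assumptions: `path_feature` is a hypothetical helper, CRC-32 stands in for the product's actual hash function (which this guide does not specify), and choosing the smaller of the forward and reversed arrays is an assumed way of picking the end atom to start from.

```python
import zlib


def path_feature(atom_codes, bond_types):
    """Encode one path as [atom, bond, atom, ..., atom] and hash it to 32 bits."""
    if len(atom_codes) != len(bond_types) + 1:
        raise ValueError("a path with N bonds must have N+1 atoms")

    # Interleave atom codes and bond types: 2N+1 elements for N bonds.
    array = []
    for atom, bond in zip(atom_codes, bond_types):
        array.extend((atom, bond))
    array.append(atom_codes[-1])

    # Assumption: make the result independent of which end the path is read from.
    array = min(array, array[::-1])

    # Stand-in hash (CRC-32), not the hash used by the product.
    data = ",".join(str(value) for value in array).encode("ascii")
    return array, zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical two-bond path: atom codes 6, 8, 7 joined by a single and a double bond.
array, feature = path_feature([6, 8, 7], [1, 2])
print(array)  # [6, 1, 8, 2, 7]
```

Reading the same path from the other end, `path_feature([7, 8, 6], [2, 1])`, yields the same array and therefore the same feature code.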


MDL Public Key Fingerprints
MDL keys are a set of 960, mostly substructural features, developed for rapid substructural searching of ISIS databases. They are also useful as descriptors for learning, although controversy exists about their true quality for this purpose. Molecular Design considered the full definition proprietary, but released the definition of 166 of the full set of 960. These are referred to as the MDL Public keys.

In Pipeline Pilot, selecting the Type parameter as MDLPublicKeys adds a property called MDLPublicKeys to the property list of the molecule. The fingerprint contains the list of key numbers for features that exist in the given molecule.

Calculation of Keys Parameters for MDLPublicKeys
For ease of inspection, most of these keys are implemented as MOL file queries. The list of queries is located in data\queries\mdlqueries. We do not recommend changing or editing the queries in this directory.

The following keys can be turned on without these substructures by checking atom types or functional groups internally (although for some bits there are substructural queries that can also turn them on):
Atom sets
Single atom types
Miscellaneous

Atom Sets
Key        Description
Bit 3      Group IVA, VA, VIA, PERIODS 4-6 (Ge...)
Bit 4      ACTINIDE
Bit 5      Group IIIB, IVB
Bit 6      LANTHANIDE
Bit 7      Group VB, VIB, VIIB
Bit 9      Group VIII
Bit 10     Group IIA
Bit 12     Group IB, IIB
Bit 18     Group IIIA
Bit 35     Group IA
Bit 44     OTHER
Bit 134    X

Single Atom Types
Key        Description
Bit 20     Silicon
Bit 27     Iodine
Bit 29     Phosphorus
Bit 42     Fluorine
Bit 46     Bromine
Bit 88     Sulfur
Bit 103    Chlorine
Bit 161    Nitrogen
Bit 164    Oxygen

Miscellaneous
Key        Description
Bit 68     QInRing > 0
Bit 1      numisotopes > 0
Bit 2      numunusual > 0
Bit 22     ringsofsize[3] > 0
Bit 11     ringsofsize[4] > 0
Bit 96     ringsofsize[5] > 0
Bit 99     numdoubleccbonds > 0
Bit 140    numdoubleccbonds > 1
Bit 163    ringsofsize[6] > 0
Bit 145    ringsofsize[6] > 1
Bit 19     ringsofsize[7] > 0
Bit 101    ringsofsize[8] > 0
Bit 137    numqinring > 0
Bit 120    numqinring > 1
Bit 121    numninring > 0
Bit 138    numrare1 > 0
Bit 140    numqrare1 > 0
Bit 141    numqrare1 > 1
Bit 141    nummethyl > 2
Bit 149    nummethyl > 1
Bit 160    nummethyl > 0
Bit 162    numaromaticrings > 0


Key        Description
Bit 125    numaromaticrings > 1
Bit 142    NumNitrogen > 1
Bit 159    NumOxygen > 1
Bit 146    NumOxygen > 2
Bit 140    NumOxygen > 3
Bit 165    numrings > 0
Bit 166    numfragments > 1

Tip: To create altered or novel fingerprints using substructural queries, work with the user key fingerprints described in the next section.

User Key Fingerprints
User key fingerprints are derived from the same underlying system as MDL public key fingerprints. Pipeline Pilot provides examples of user-defined fingerprints as illustrations of how you can create your own MDL key style fingerprints. To work with user-key fingerprints, open the Molecular Fingerprints component and select the value UserKeys for the Type parameter. The query files (in MOL or SD format) are located in data/queries/userqueries and include the following:
ExampleBit1.mol
ExampleBit2.sd
ExampleBit3.sd

Tip: You can remove these query files and add your own to create user key fingerprints.

ExampleBit1.mol
This query is for a non-hydrogen, non-carbon atom attached to a non-hydrogen, non-carbon atom. The name of the query, given on the first line of the MOL file, is 1. This is the number of the bit turned on when that feature is present. When this query is found in a molecule, the feature number is added to the fingerprint bit list. If you open ExampleBit1.mol in the Windows Notepad, the following information is displayed:


ExampleBit2.sd
You can also define the bit code using the Bit property in an SD format query file. This query is for a non-hydrogen, non-carbon atom attached to an oxygen. When found, the bit number 2 is added to the fingerprint bit list, as shown in the following example:

ExampleBit3.sd
Another option is the ability to test for multiple occurrences of a feature (a requirement that a feature be present a given number of times before the bit is turned on). In this example, the MinimumCount property declares that the given query must be found at least twice in the given molecule for that feature to be added to the output fingerprint. The following is an example of how this is done:

Atom Environment Fingerprints
Atom Environment fingerprints generate higher-order bits from the atom types specified by AtomAbstraction using an algorithm described in Bender, A., Mussa, H.Y., Glen, R.C., and Reiling, S., Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naive Bayesian Classifier. J. Chem. Inf. Comput. Sci. 2004, 44. This generates an output of String Fingerprint type.

Hashed Atom Environment fingerprints use a hashing algorithm to create an Integer Fingerprint representation of the Atom Environment fingerprint; this fingerprint is often easier to use as input to other components (such as Molecular Learners).


Extended Connectivity Fingerprints
Extended Connectivity Fingerprints (ECFPs) are a new class of fingerprint well-suited to the learning methods available in Pipeline Pilot. Each feature represents the presence of a structural (not substructural) unit. This distinguishes them from fingerprints whose features represent substructural units (such as MDL keys or Daylight path-based fingerprints).

The difference is best explained with an example. Assume there are features representing a para-substituted benzene ring in both an MDL fingerprint and in an ECFP:

Figure: Para-substituted benzene ring

In an MDL fingerprint, the feature is turned on whenever the structure is present as a substructure somewhere in the target molecule. For example, the following estrogenic structure turns on that feature:

Figure: Estrogenic structure

For ECFPs, the estrogen does not contain the feature, as there are substitutions on the ring at locations other than the specified attachment atoms marked A. Thus, an ECFP feature represents an exact structure with limited, specified attachment points. The following molecule contains the feature:

Figure: Estrogenic structure mapped by ECFPs

The reason for distinguishing ECFPs from substructure-based fingerprints is twofold. First, substructure-based fingerprints are intended for a different task: database searching. Substructural fingerprints have the property that all features contained within a query must also be contained within a target, if the query can map onto that target. This allows the fingerprint to rapidly eliminate molecules from consideration when performing a substructure search against a database. ECFPs are not required to be useful for database optimization.


Second, ECFPs represent a much larger set of features than what is common for other fingerprints. The virtual size of the fingerprint is four billion different features. For a given molecule, only a small subset of those features is present. (This means the fingerprints are usually stored as a list of features that are present, rather than as a binary bit array.) This allows ECFPs to represent a huge number of different structural units that may be valuable for learning or molecular comparison. A typical molecule may generate fingerprints containing tens or hundreds of features; a typical molecular catalog may contain several thousand or even millions of different features.

Advantages of ECFPs
There are several advantages to ECFPs, including:

They are fast to calculate, as explained later in this guide. Even large datasets can be processed rapidly without the need to pre-process the data in weekend-long batch jobs.

They represent a much larger set of features than many fingerprints, compared to the 960 features in the MDL private keys, or even the 25,000 features in products such as LeadScope. Further, these features are not pre-selected, but are generated directly from the molecules. Novel molecular classes are as easily handled as the more common classes present in pre-selected lists of interesting features.

They represent information about tertiary and quaternary centers, which is not the case for path-based fingerprints such as Daylight fingerprints. Even some stereochemical information can be represented.

The features represent the molecule at differing levels of detail. For example, some may represent single atoms, such as the presence of a halogen. Others may represent a large section of molecular structure, such as the A-B rings of a steroid ring system shown here:

Figure: Steroid ring system

Different atom abstractions can be used to generate different fingerprints. For example, standard ECFPs use the atom type as part of the initial atom code; this differentiates chlorine from bromine. A variant of ECFPs called functional-class fingerprints (FCFPs) uses the role of an atom in the initial atom code. In this case, both chlorine and bromine are seen as equivalent instances of halogen atoms.

Fingerprint Generation Method
The fingerprint generation method is based on one of the original algorithms in computational organic chemistry, called the Morgan algorithm. The goal of the Morgan algorithm is to assign a unique identity to each atom in a molecule, so that a molecule can be described in a way that is invariant to the original numbering of atoms. The algorithm has two parts: the assignment of an initial code to each atom, and an iterative part in which each atom code is updated to reflect the codes of each atom's neighbors.

A similar scheme is used in ECFPs, with two important changes. First, the Morgan algorithm is only interested in disambiguating atoms within a single molecule, so the generated codes are not comparable between different molecules. SciTegic uses a hashing scheme to generate codes comparable across molecules. Second, the Morgan algorithm iterates until every atom is unique, or as close to unique as symmetry allows, and intermediate results are discarded. However, it is exactly those intermediate results that are of interest, because they represent features that reflect many different levels of structural abstraction.

The following information describes the generation of the initial atom codes and the iteration that generates the fingerprint features.

Generation of Initial Atom Codes
The generation of an ECFP or FCFP fingerprint for a molecule begins with the assignment of an initial atom code for each heavy (non-hydrogen) atom in the molecule. In theory, any atom-typing rule can be used. There are two rules that are most useful: the ECFP rule and the FCFP rule. (Only differences in the initial atom code distinguish ECFPs and FCFPs; once the codes are assigned, both fingerprints are developed through the same process.)

For ECFPs, the initial atom code is derived from the following features:
Number of connections to the atom
Element type
Charge
Atom mass

Atoms that differ in any of these features generate a different ECFP initial atom code.

For FCFPs, the initial atom code is based on a quick estimate of the functional role the atom plays. The role is a combination of the following:
Hydrogen-bond acceptor
Hydrogen-bond donor
Positively ionized or positively ionizable
Negatively ionized or negatively ionizable
Aromatic
Halogen

An example of the initial FCFP atom codes for a small molecule is shown below (the nitrogen is given a code 3, as it is both an H-bond acceptor and an H-bond donor):

Figure: Initial FCFP atom codes

If you were to stop here, you would generate the fingerprint called FCFP_0, where the number zero is the maximum diameter explored around each atom. The fingerprint is the set of features {0, 1, 3, 16}. More typically, these features are used as a starting point for the iterative process described in the next section.
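As a minimal sketch of how the FCFP_0 set {0, 1, 3, 16} can arise, the snippet below combines role bits with a bitwise OR, using the bit values given in the Functional Class Fingerprints section of this guide (1 for acceptor, 2 for donor, 4 for negative ionizable, 8 for positive ionizable, 16 for aromatic, 32 for halogen). The helper name and the role assignments for the four atom kinds are assumptions chosen for illustration; the sketch does not perceive roles from a structure.

```python
ACCEPTOR = 1
DONOR = 2
NEGATIVE_IONIZABLE = 4
POSITIVE_IONIZABLE = 8
AROMATIC = 16
HALOGEN = 32


def fcfp_initial_code(*roles):
    """Combine the role bits of one atom with a bitwise OR (0 if no role applies)."""
    code = 0
    for role in roles:
        code |= role
    return code


# Assumed roles for the atom kinds in an aromatic amide (illustrative only).
atoms = {
    "amide N": fcfp_initial_code(ACCEPTOR, DONOR),
    "carbonyl O": fcfp_initial_code(ACCEPTOR),
    "carbonyl C": fcfp_initial_code(),
    "ring C": fcfp_initial_code(AROMATIC),
}

fcfp_0 = sorted(set(atoms.values()))
print(fcfp_0)  # [0, 1, 3, 16]
```

Because six independent bits are combined, every code falls in the range 0 to 63.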

Note: There is another initial abstraction available within Pipeline Pilot. It uses the ALogP atom type codes, a set of 120 different categories that atoms may fall into. The use of ALogP types within the extended-connectivity fingerprint calculation is an experimental feature that you can try if you have a lot of experience with Pipeline Pilot. It is not covered in detail in this guide, as ECFPs and FCFPs are the most widely used and best understood.

Iteration to Generate Higher-Order Features
An iterative process is used to generate features that represent each atom in larger and larger structural neighborhoods. After each iteration, the new feature codes for the atoms are added to the set of features from all previous steps. When the desired neighborhood size is reached, the process is complete, and the set of all features is returned as the fingerprint. A visual interpretation of the process is shown below:

Figure: Iterative fingerprint generation process

This sample shows the features generated for a single atom: the carbon atom in the aromatic ring where the amide functional group is attached. At iteration 0 (that is, before iterating), it only has information about the atom itself, encoded into its initial atom code. During the first iteration, it collects information from all the atom's immediate neighbors and generates a new code. That new code represents the presence of a molecular structure incorporating four atoms: the core atom and its immediate neighbors. This process is performed not only for this one atom, but also for every other atom in the molecule, so that all atoms have a new code representing the immediate neighborhood around them.

Note: A hashing scheme is used to generate the new code from the codes of an atom and its neighbors. It is not necessary to understand this scheme to successfully use extended-connectivity fingerprints. For more details, see Fingerprint Feature Code Generation Hashing Schemes.

For the second iteration, it repeats the process of collecting information from the neighbors and generates a new code. But this time, instead of using the initial atom codes for the atom and its neighbors, it uses the updated codes from iteration 1. The code generated from this step represents an even larger structure around the core atom, in this case, all atoms within two bonds of the core atom.
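The iterative process described above can be sketched in a few lines. This is a simplified illustration, not the product's implementation: CRC-32 is a stand-in for the actual hash, the atom numbering of the aromatic amide is hypothetical, and any duplicate-removal or stereo handling the real calculation may perform is omitted. Each iteration widens the neighborhood by two bonds, so a diameter of d takes d/2 iterations.

```python
import zlib


def hash_array(values):
    """Stand-in 32-bit hash (CRC-32); the product's hash function is not shown here."""
    return zlib.crc32(",".join(map(str, values)).encode("ascii")) & 0xFFFFFFFF


def ec_features(initial_codes, bonds, diameter):
    """Collect the atom codes from iteration 0 up to diameter // 2 iterations."""
    if diameter < 0 or diameter % 2:
        raise ValueError("diameter must be a non-negative even number")

    neighbors = {atom: [] for atom in initial_codes}
    for (a, b), bond_code in bonds.items():
        neighbors[a].append((bond_code, b))
        neighbors[b].append((bond_code, a))

    codes = dict(initial_codes)
    features = set(codes.values())            # iteration 0: the initial atom codes
    for _ in range(diameter // 2):
        updated = {}
        for atom, code in codes.items():
            # Use the codes from the previous iteration, sorted to remove order dependence.
            pairs = sorted((bond, codes[n]) for bond, n in neighbors[atom])
            updated[atom] = hash_array([code] + [v for pair in pairs for v in pair])
        codes = updated
        features.update(codes.values())       # keep features from every level
    return features


# Hypothetical numbering of an aromatic amide: 1 = N, 2 = carbonyl C, 3 = O, 4-9 = ring C.
initial_codes = {1: 3, 2: 0, 3: 1, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1,
         (4, 5): 4, (5, 6): 4, (6, 7): 4, (7, 8): 4, (8, 9): 4, (9, 4): 4}

level_0 = ec_features(initial_codes, bonds, 0)
level_2 = ec_features(initial_codes, bonds, 2)
print(sorted(level_0))     # [0, 1, 3, 16]
print(level_0 <= level_2)  # True
```

Since features from every level are retained, the set for a larger diameter always contains the set for a smaller one.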


The number of iterations performed is determined by the maximum diameter of the neighborhoods requested. This diameter is displayed in the fingerprint name as an appended number. For example, FCFP_6 generates features around each atom up to a diameter (in bonds) of six, which requires three iterations. (Because each iteration increases the diameter of the neighborhood by two bonds, there are no odd-numbered fingerprints such as FCFP_1. Instead, the series of legal fingerprints is FCFP_0, FCFP_2, FCFP_4, FCFP_6, etc.)

Calculate Extended Connectivity Fingerprints
There are different ways to calculate extended-connectivity fingerprints for your molecular data.

First, molecular learners (such as Learn Good Molecules) and clustering methods (such as Cluster Molecules) may list some extended-connectivity fingerprints by name in the PredefinedSet parameter. This is a predefined list of calculable properties useful for learning or clustering. For example, by default, the Cluster Molecules component uses a functional class fingerprint of maximum diameter 4 (FCFP_4) for clustering.

Figure: Molecular Fingerprints and Clustering

A second method for calculating fingerprints is to request their calculation by name using the Custom Manipulator (PilotScript) component (e.g., calculate('FCFP_4');).

A final method for calculating extended-connectivity fingerprints (and other fingerprint types) is by using the Molecular Fingerprints component. This component and its parameters are explained more fully in the section Fingerprint Parameters.

Fingerprint Feature Code Generation Hashing Schemes
This section provides information about how the fingerprint feature codes are developed with hashing schemes. You do not need to know this information for general use of extended-connectivity fingerprints.

In the Morgan algorithm, a prime-number scheme is used to generate higher-order codes for each atom during the iteration. In this scheme, each different code value is assigned a prime number, and the new code is the product of the prime number of the parent atom with those of all its neighbors. The products can get very large, so at the end of each cycle, each unique product is replaced with a small integer that represents the atom class. This method guarantees that no two atoms in different structural neighborhoods ever get the same code. This guarantee of uniqueness is vital because the Morgan algorithm is preparing the molecule for storage in a database, where any confusion can lead to lost data.

In the extended-connectivity fingerprint process, this uniqueness is not vital. In fact, by mapping all feature codes into an address space of 2^32 feature codes, there is always a slight risk that two different structural features will have the same code. Given the size of the space of feature codes, this risk is minimal, and even if it does occur, there is little effect on learning.
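One cycle of the prime-number scheme, as described above, can be sketched as follows. The function names are hypothetical and the example is a three-atom chain; the sketch follows the description in this guide literally and is not a complete canonical-numbering implementation.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def morgan_step(codes, neighbors):
    """One cycle: map codes to primes, multiply over atom and neighbors, then re-rank."""
    distinct = sorted(set(codes.values()))
    prime_of = dict(zip(distinct, first_primes(len(distinct))))

    products = {}
    for atom, code in codes.items():
        product = prime_of[code]
        for neighbor in neighbors[atom]:
            product *= prime_of[codes[neighbor]]
        products[atom] = product

    # Replace each unique (possibly large) product with a small integer class.
    rank = {p: i + 1 for i, p in enumerate(sorted(set(products.values())))}
    return {atom: rank[products[atom]] for atom in codes}


# Hypothetical three-atom chain in which every atom starts with the same code.
neighbors = {1: [2], 2: [1, 3], 3: [2]}
codes = morgan_step({1: 1, 2: 1, 3: 1}, neighbors)
print(codes)  # {1: 1, 2: 2, 3: 1}
```

The small integers are assigned per molecule, which is why codes produced this way cannot be compared between different molecules.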

This folding of features is done explicitly in Daylight fingerprints to reduce the fingerprint to a small size suitable for storage and manipulation in a binary array. Extended-connectivity fingerprints instead use a rapid hashing scheme, which has the additional advantage that codes from the hashing scheme are invariant across different molecules (something that is not possible with the Morgan-generated codes).

Look at how a single iteration is performed:

Figure: Generation of Atom Codes

The molecule (with its original atom numbering) is shown on the top left, and the molecule (with atoms marked with the initial FCFP atom codes) on the top right. Look at the generation of the new code for atom 5.

First, an array of numbers is generated that represents the local environment of the core atom. The array starts with a single number, the current atom code (16). Next, add two numbers to the array for each non-hydrogen attachment. The first of the two numbers is the bond type code for the bond to that attachment: 1 for a single bond, 2 for a double bond, 3 for a triple bond, and 4 for an aromatic bond. The second of the two numbers is the current atom type code of the neighbor. To avoid order-dependency in the attachment list, sort the attachments using their number pairs. In this case, the final order for the pairs is (1, 0), (4, 16), (4, 16).

Finally, take the array of numbers and apply a hashing function to generate a single number, in this case, the number 203,667,720. This is the number that represents the four-atom feature centered on atom 5. One way to think of this number is as the index of a bit in a large virtual bit array. A molecule containing this structural feature would have bit 203,667,720 on. Since most molecules have at most a few hundred features, the bits are usually stored as a list of on bits, rather than as actual on bits in a large, non-virtual bit array.

The final fingerprint is the collection of all features generated for each atom at each iteration level.
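The array construction for atom 5 can be reproduced with a short sketch. The bond type codes and the atom codes come from the text above; the helper names are hypothetical, and CRC-32 is only a stand-in for the product's hash function, so the resulting number will not match the feature code quoted in the text.

```python
import zlib

BOND_CODE = {"single": 1, "double": 2, "triple": 3, "aromatic": 4}


def neighborhood_array(core_code, attachments):
    """Build [core, bond, neighbor, bond, neighbor, ...] with the pairs sorted."""
    pairs = sorted((BOND_CODE[bond], code) for bond, code in attachments)
    array = [core_code]
    for pair in pairs:
        array.extend(pair)
    return array


def stand_in_hash(array):
    """Stand-in 32-bit hash (CRC-32), used only to show the shape of the last step."""
    return zlib.crc32(",".join(map(str, array)).encode("ascii")) & 0xFFFFFFFF


# Atom 5: current code 16, attached by one single bond to an atom with code 0
# and by two aromatic bonds to atoms with code 16 (given here in arbitrary order).
array = neighborhood_array(16, [("aromatic", 16), ("single", 0), ("aromatic", 16)])
print(array)  # [16, 1, 0, 4, 16, 4, 16]

# Store the feature as an entry in a list of "on" bits rather than a 2**32-bit array.
on_bits = sorted({stand_in_hash(array)})
```

Sorting the pairs before hashing is what makes the result independent of the order in which the attachments are visited.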
For the benzoic acid amide shown above, you can display the feature codes. Read the file data\queries\benzoicacidamide.mol using an SD Reader and a Custom Manipulator (PilotScript) component configured with the following expression:

calculate('fcfp_0','fcfp_2','fcfp_4');

Run the protocol and display the results in the Notepad Viewer. The results should look like this:

Figure: Protocol results displayed in Notepad

Normally, the fingerprint feature codes are not directly inspected, although they may become important if a particular feature is identified during learning. In this case, use the Learned Feature Filter to identify compounds with a particular feature or features.

Notice that fingerprints with larger diameters (such as FCFP_4) contain all the features present in the corresponding fingerprint at smaller diameters (such as FCFP_0 or FCFP_2). It is not necessary to include a series of such fingerprints, only the one with the largest diameter. This is how extended-connectivity fingerprints can contain features at a variety of levels of abstraction.

The features with negative signs are an artifact of the output procedure. Since the hash function uses all 32 bits in an integer, and most printing methods treat the first bit as a sign bit, some features are displayed as negative numbers.

Functional Class Fingerprints
Functional-class fingerprints (FCFPs) are a type of extended-connectivity fingerprint that use a simple, rapid, functional-class atom typing scheme for their initial atom codes. Each code is a number in the range [0, 63]. The initial code becomes the starting point for the extended-connectivity calculation.

The functional code is defined for each atom as described in the following C code. The final code is the logical OR of six different atomic feature bits. (If none of the features applies to a given atom, its code is zero.)

    code = 0;
    if (atom.isacceptor() > 0)
        code |= 1;
    if (atom.isdonor())
        code |= 2;
    if (atom.isnegativeionizable())
        code |= 4;
    if (atom.ispositiveionizable())
        code |= 8;
    if (atom.isaromatic())
        code |= 16;
    if (atom.ishalogen())
        code |= 32;

The function names are meant to be suggestive rather than definitive. For example, a precise estimation of whether an atom is ionizable requires a lengthy quantum-mechanical calculation. Our goal is simpler: the rapid partitioning of the atoms into general functional classes, for which an approximate method is satisfactory.

IsAcceptor is a complicated method that depends on the connectivity, charge, and atom type. A true value is only possible for the following:
atom.gettype() == Oxygen
atom.gettype() == Nitrogen
atom.gettype() == Sulfur
atom.gettype() == Phosphorus

IsDonor is a rapid test of whether an atom can be a hydrogen-bond donor. It returns true if the atom is oxygen or nitrogen and has one or more hydrogens attached.

IsNegativeIonizable is true if the atom carries a negative charge or is an ionizable (acidic) oxygen atom.

IsPositiveIonizable is true if the atom carries a positive charge or if the atom is a nitrogen with no hydrogens or sp2-hybridized neighbors.

IsAromatic is true if the atom is aromatic by our definition (based on a Hückel 4n+2 rule).

IsHalogen is true if the atom is a chlorine, fluorine, bromine, or iodine.

Reaction Fingerprints
Reaction fingerprints (RCFPs) are a type of extended-connectivity fingerprint that use reaction-specific information to determine the initial atom codes. The following contribute to the initial atom codes for RCFPs:
Element type
Charge
Hybridization
Whether the atom is a Reactant atom or Product atom
Whether or not the atom is in the Reaction Site

The Reaction Site is perceived from the atom-atom mappings of a reaction. It includes atoms that are changed by the reaction and atoms attached to bonds that are changed by the reaction. Atoms without mappings are automatically included in the site, as they are removed from the reactant side and added to the product side.
The Highlight Reaction Site component can be used to show how the reaction site is being perceived. Here are the reaction sites of two different esterification reactions. Note how the changed atoms are very similar in both reactions while the inert reagents are quite different:


Figure: Two different esterification reactions with atoms in each reaction site highlighted

Additionally, with RCFPs, only atoms within the Reaction Site can be bit centers. Neighboring non-site atoms are only considered at higher distances. This allows you to use the Distance parameter to configure how much of the non-site region to sample with the fingerprint. The two very different esterification reactions are indistinguishable using only the bit centers (RCFP_0), while the differences between the two show up at higher distances.

Fingerprint    Similarity
RCFP_0         1.0
RCFP_
RCFP_
RCFP_
RCFP_

Note: As a variant, an additional reaction fingerprint called QCFP can be calculated. The algorithm for calculating the initial atom code is the same as that for RCFP. And as with RCFP, only atoms within the reaction site can be centers. QCFP differs in that it does not consider atoms outside the site at higher distances. This variant is not available from the Molecular Fingerprints component interface, but is available on demand as a calculable property.

Because QCFPs do not explore outside the reaction site, they remain extremely specific at larger distances.


Fingerprint    Similarity
QCFP_0         1.0
QCFP_2         1.0
QCFP_4         1.0
QCFP_6         1.0
QCFP_8         1.0

Reaction Fingerprint Validation
Reaction Fingerprints (RCFPs) have been validated using several methods. One method is to analyze how similarities, clustering, and Bayesian learners perform on reaction datasets that have been tagged with descriptive keywords (e.g., alkylation, halogenation). These keywords can be used as the categories in a Bayesian categorical model. Here is an analysis of the leave-one-out cross-validation ROC scores for the 200+ keyword categories in a dataset of 70,000 metabolite reactions:

Fingerprint         EstXVAUC_Mean    EstXVAUC_StdDev
RCFP_
RCFP_
RCFP_
RCFP_
ECFP_
ECFP_
ECFP_
ECFP_
MDLRxnCenterKeys

The RCFPs produce better ROC scores than either the MDLRxnCenterKeys or molecular features alone (ECFPs).

Similarity studies were also conducted in which the average pairwise similarities for reactions within the same category were compared with the average pairwise similarities with reactions outside the class. The Enrichment Factor was calculated as the average similarity within the class divided by the average similarity outside the class. The following chart shows the results for the reactions in the metabolite dataset. QCFPs and RCFPs tend to perform slightly better than MDLRxnCenterKeys, and are clearly better than ECFPs:


Figure: Results for the reactions in the metabolite dataset

In another similarity study using a subset of ~70,000 reactions from the CIRX dataset representing 77 categories, different reaction fingerprints were used to find the 20 most similar reactions to each reaction in the subset and then to calculate the percentage of those similar reactions that contain all the category keywords present in the query reaction. The following chart shows the results as the average calculated over all the reactions. In this case, the MDLRxnCenterKeys did a little better than either RCFPs or QCFPs, and all these fingerprints did clearly better than ECFPs, which do not include any reaction-specific features:


[Figure: Results as the average calculated over all the reactions]

Figuring Out the Fingerprint Name

The Molecular Fingerprints component has many options that control the type of fingerprint to generate. The fingerprint name varies, based on the options that are selected. It's easiest to try a set of options and then find out the corresponding name. However, there is a method to this naming, described as follows:

Fingerprint Types without an Encoded Name

For the following two values of the parameter Type, the AtomAbstraction, OutputType, and MaxDistance parameters are not relevant and the following names are used:

Type             Fingerprint Name
MDLPublicKeys    MDLPublicKeys
UserKeys         UserKeys

Encoded Fingerprint Names

All other Types of fingerprints have a name in the form XXFX_N, where the values of AtomAbstraction, Type, and OutputType determine the first, second, and fourth letters respectively (the third character is always "F"), while the MaxDistance parameter determines the number following the underscore.

First Letter

The first letter of an encoded fingerprint name is determined by the AtomAbstraction parameter:

AtomAbstraction    First Letter
FunctionalClass    F
AtomType           E

AtomAbstraction    First Letter
ALogPCode          L
SYBYL              S
Reaction           R
UserAtomType       U

Second Letter

The second letter of an encoded fingerprint name is determined by the Type parameter:

Type                     Second Letter
ExtendedConnectivity     C
Path                     P
AtomEnvironment          E
HashedAtomEnvironment    H
MDLPublicKeys            NA - Fingerprint Name not encoded (see above)
UserKeys                 NA - Fingerprint Name not encoded (see above)

Third Letter

The third letter of an encoded fingerprint name is always F.

Fourth Letter

The fourth letter of an encoded fingerprint name is determined by the OutputType parameter. Fingerprint returns a list of the features present in the molecule, with duplicates removed, while Counts returns a list of the features present in the molecule, with duplicates retained; if a feature occurs more than once in a molecule, that bit value is included more than once in the output list.

OutputType     Fourth Letter
Fingerprint    P
Counts         C

Number Following the Underscore

The number following the underscore of an encoded fingerprint name is determined by the MaxDistance parameter. For extended connectivity fingerprints, this is the maximum diameter (in bond lengths) of the largest structure represented by the fingerprint. For path fingerprints, this is the maximum length of the path. For both, this is only a maximum; all bits at all lower levels are included. Note that this number is always even.

Examples

If you choose Path as the Type, ALogPCode as the AtomAbstraction, 4 as the MaxDistance, and Fingerprint as the OutputType, the name is LPFP_4, and you call this fingerprint (starting from the left) "ALogPCode path-based fingerprint of length 4". FCFC_6 is "functional-class extended-connectivity fingerprint count up to diameter 6".

Note: For backward compatibility, if you choose path-based fingerprints and AtomType, only the atomic number of the element is used, and not the full Daylight invariant, which also includes charge, mass, and connectivity.
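The naming rules above amount to a small lookup. The sketch below builds a name from the four parameters using the letter tables given in this section. The function `fingerprint_name` and its argument names are hypothetical helpers for illustration and are not part of Pipeline Pilot.

```python
# Letter tables transcribed from the naming rules in this section.
ATOM_ABSTRACTION_LETTER = {
    "FunctionalClass": "F",
    "AtomType": "E",
    "ALogPCode": "L",
    "SYBYL": "S",
    "Reaction": "R",
    "UserAtomType": "U",
}
TYPE_LETTER = {
    "ExtendedConnectivity": "C",
    "Path": "P",
    "AtomEnvironment": "E",
    "HashedAtomEnvironment": "H",
}
OUTPUT_TYPE_LETTER = {"Fingerprint": "P", "Counts": "C"}

# Types whose name is not encoded: the name is simply the Type value.
UNENCODED_TYPES = {"MDLPublicKeys", "UserKeys"}


def fingerprint_name(fp_type, atom_abstraction=None, output_type=None,
                     max_distance=None):
    """Return the property name implied by the parameter values
    (hypothetical helper, following the XXFX_N pattern)."""
    if fp_type in UNENCODED_TYPES:
        return fp_type
    return (
        ATOM_ABSTRACTION_LETTER[atom_abstraction]
        + TYPE_LETTER[fp_type]
        + "F"  # the third character is always F
        + OUTPUT_TYPE_LETTER[output_type]
        + "_"
        + str(max_distance)
    )


print(fingerprint_name("Path", "ALogPCode", "Fingerprint", 4))
print(fingerprint_name("ExtendedConnectivity", "FunctionalClass", "Counts", 6))
print(fingerprint_name("MDLPublicKeys"))
```

The first two calls reproduce the two names worked through in the Examples paragraph, LPFP_4 and FCFC_6.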

Fingerprint Options

Calculable property options are appended to a property name, and start with the character #. An example is the parameter to include stereo: when stereo is included in the extended connectivity calculation, #S is appended, as in FCFP_6#S.

To illustrate options, consider the following protocol:

The molecule is alanine, which is shown with atom numbers:

[Figure: Alanine molecule]

The output looks like this when displayed in the Notepad Viewer:

[Figure: Protocol output displayed in Notepad Viewer]


If you add the option #S, you get the fingerprint with stereochemistry:

[Figure: Protocol output with new property displayed in Notepad Viewer]

In this case, the option changed the calculation of the fingerprint to give a different result. However, this is not always the case. Many options cause the calculation of the fingerprint along with the calculation of additional properties. These additional properties offer information about the individual bits of the fingerprint.

For example, consider the output if you request a calculation of FCFP_6#F. This is a request for additional information about each feature bit; in this case, an example of a set of atoms in the molecule that illustrates that feature.

The output contains two new properties, FCFP_6 and FCFP_6#F:

[Figure: Protocol output with two new properties displayed in Notepad Viewer]

Each bit in the array of FCFP_6 has a corresponding member in the array of FCFP_6#F. The entry in FCFP_6#F is the set of atoms involved in generating the bit in FCFP_6. Thus, the option #F does not change the fingerprint output, but only controls the output of additional associated information.

A similar option is #A. Calculating FCFP_6#A gives the following output:

[Figure: Protocol output with new property displayed in Notepad Viewer]

In this case, the set is all atoms contained in any instance of a particular feature, rather than one example of the atoms in the feature as done by #F. Note how feature 0 is contained in three atoms (2, 3, and 6) because it was generated at different places in the molecule.
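The difference between the #F and #A outputs, and between the Fingerprint and Counts output types, can be illustrated with a small bookkeeping sketch. The occurrence list below is hypothetical and is not the actual alanine output. Picking the first occurrence as the #F-style example is an assumption, since the guide says only that #F gives "an example" atom set.

```python
from collections import Counter

# Hypothetical feature occurrences: (bit identifier, atoms that generated it).
# Bit 0 is generated at three different places, as in the #A discussion above.
occurrences = [
    (0, {2}),
    (3, {1}),
    (0, {3}),
    (5, {1, 2, 4}),
    (0, {6}),
]


def summarize(occurrences):
    """Derive per-bit summaries analogous to the Fingerprint/Counts
    outputs and the #F / #A options (illustrative only)."""
    counts = Counter(bit for bit, _ in occurrences)  # Counts: duplicates retained
    fingerprint = sorted(counts)                     # Fingerprint: duplicates removed
    example_atoms = {}  # like #F: one example atom set per bit (first seen, assumed)
    all_atoms = {}      # like #A: union of atoms over every instance of the bit
    for bit, atoms in occurrences:
        example_atoms.setdefault(bit, set(atoms))
        all_atoms.setdefault(bit, set()).update(atoms)
    return fingerprint, counts, example_atoms, all_atoms


fingerprint, counts, example_atoms, all_atoms = summarize(occurrences)
print(fingerprint)
print(example_atoms[0], all_atoms[0])
```

Neither summary changes the list of bits itself; both only attach extra per-bit information, which matches how the guide describes these options.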

A useful option is #C. Calculating FCFP_6#C gives the following output:

[Figure: Protocol output with two new properties displayed in Notepad Viewer]

In this case, the associated information is a SMARTS string that describes the substructure obtained by excising the feature from the remainder of the molecule, with the attachment atoms shown as * atoms. Keep in mind that these are examples of structures that generated a particular bit, and are not definitions of a feature. Depending on the initial atom abstraction, ring closures, and other details of the generating process, different substructures may be examples of the same bit.

Another useful option is #D. FCFP_6#D gives the following output:

[Figure: Protocol output with new property displayed in Notepad Viewer]

In this case, the associated information is the diameter of a particular feature (or length, for path-based fingerprints). For extended-connectivity fingerprints, you do not get a bit for each atom at every level. Bits that are duplicates of other bits (where duplicate is defined as two features defined by the same atom set) are not included. Indeed, in this example one bit already contains all of the information in the molecule (that is, contains every atom), so no new bits are generated at the next level. This avoids generating bits that are mere duplicates of information you already have elsewhere.
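The duplicate rule described above (two features are duplicates when they are defined by the same atom set) can be shown on a toy graph. This is a simplified sketch and not the extended-connectivity algorithm itself: it does no atom typing or hashing, and only grows a circular neighborhood around each atom and discards any neighborhood whose atom set has already been seen. The three-atom chain and the function names are hypothetical.

```python
def neighborhoods(adjacency, max_diameter):
    """Yield (diameter, center, atom_set) for circular neighborhoods,
    smallest diameters first. A radius of r bonds gives a diameter of 2 * r."""
    for radius in range(max_diameter // 2 + 1):
        for center in sorted(adjacency):
            atoms = {center}
            frontier = {center}
            for _ in range(radius):
                # Grow the neighborhood by one bond in every direction.
                frontier = {n for a in frontier for n in adjacency[a]} - atoms
                atoms |= frontier
            yield 2 * radius, center, frozenset(atoms)


def unique_features(adjacency, max_diameter):
    """Keep only the first feature seen for each distinct atom set."""
    seen = set()
    kept = []
    for diameter, center, atoms in neighborhoods(adjacency, max_diameter):
        if atoms in seen:
            continue  # duplicate: same atom set as an earlier feature
        seen.add(atoms)
        kept.append((diameter, center, sorted(atoms)))
    return kept


# Hypothetical three-atom chain 0-1-2 (heavy atoms only).
chain = {0: [1], 1: [0, 2], 2: [1]}
features = unique_features(chain, max_diameter=4)
for diameter, center, atoms in features:
    print(diameter, center, atoms)
```

In this toy case the neighborhood around the middle atom at diameter 2 already contains every atom, so every diameter-4 neighborhood is a duplicate and nothing new is kept at that level.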

This distance option also works with path-based fingerprints, as illustrated in the following example:

[Figure: Protocol output with path-based fingerprints displayed in Notepad Viewer]

Unfortunately, these options do not work with all fingerprint types. Currently, only extended-connectivity and path-based fingerprints support them.

The #Z option outputs the index of the central atom associated with each bit (OutputBitCentralAtom):

[Figure: Protocol output with two new properties displayed in Notepad Viewer]

In this case, the parallel FCFP_6_Z array shows which atom is central to the bit present in FCFP_6. Atom 1 creates the 3 bit, Atom 2 creates the 0 bit, and so on. If more than one atom is associated with a particular bit, only the first atom associated with that bit is listed. Using Counts instead of Fingerprints will preserve duplicate bits.

For a different view of the atoms central to each bit, use the #P option (AddBitsToCentralAtom):


[Figure: Protocol output showing bits added to central atom as atom properties in the HTML Molecular Table Viewer]

In this case, the correspondence between the atom and the associated fingerprint bits is made with an atom property.

A parameter called Options exposes fingerprint options. This parameter is a list of options: IncludeStereo, OutputBitDistance, OutputBitSubstructure, OutputBitAllAtoms, OutputBitFeatureAtoms, OutputBitCentralAtom, and AddBitsToCentralAtom. They correspond to the options #S, #D, #C, #A, #F, #Z, and #P, respectively.

[Figure: Options parameter for Molecular Fingerprints component]

Molecular Formula

This component calculates the formula of a molecule: a sequence of atomic symbols, each followed by the number of atoms of that element type in the molecule. For example, the molecular formulas of the first 10 molecules in Asinex are as follows:

Molecular_Formula
C19H18N2O3
C18H18N2O2S2Cl2
C16H23NO3S2
C15H21NO3S2
C6H6N8O4
C24H28
C17H16N8O8
C22H17N3O2Cl2
C28H23N3O3

C13H16N2OS

Molecular Properties

This component calculates the following whole-molecule properties:

FormalCharge: Total formal charge of the molecule.
CoordDimension: Indicator for the atomic coordinates: 0 (all coordinates are zero), 2 (have X,Y coordinates), 3 (have X,Y,Z coordinates).
IsChiral: Flag to indicate whether the molecule exists only in the represented absolute stereo configuration or as a pair of enantiomers. This flag mirrors the Chiral flag in the MDL CTAB format.
BondDistance_Table: Calculates the number of bonds in the shortest path between each pair of atoms in the molecule.
AverageBondLength: Calculates the average bond length for the molecule based on the atomic coordinates.

Molecular Property Counts

This component is a type of molecular property calculator that can calculate the following values:

Value: Number of
Num_Atoms: Heavy (non-hydrogen) atoms.
Num_Bonds: Bonds between heavy atoms.
Num_ExplicitAtoms: Heavy atoms and explicit hydrogens.
Num_ExplicitBonds: Bonds between any pair of atoms, including hydrogens.
Num_Hydrogens: Hydrogens, both implicit and explicit.
Num_ExplicitHydrogens: Explicit hydrogens.
Num_PositiveAtoms: Atoms with a positive charge.
Num_NegativeAtoms: Atoms with a negative charge.
Num_RingBonds: Bonds in a ring.
Num_RotatableBonds: Rotatable bonds, defined as single bonds between heavy atoms that are both not in a ring and not terminal (that is, connected to a heavy atom that is attached to only hydrogens). As a special case, amide C-N bonds are not rotatable.
Num_AromaticBonds: Bonds in aromatic ring systems.
Num_BridgeBonds: Bonds in bridgehead ring systems, defined as any rings that share more than one bond in common.
Num_SingleBonds: Number of single bonds between heavy atoms.
Num_DoubleBonds: Number of double bonds.
Num_TripleBonds: Number of triple bonds.
Num_AliphaticSingleBonds: Number of single bonds between heavy atoms that are not in aromatic rings.

Value: Number of
Num_AliphaticDoubleBonds: Number of double bonds that are not in aromatic rings.
Num_Rings: Base rings, defined as the number of rings in the smallest set of smallest rings (SSSR).
Num_AromaticRings: Base rings that are aromatic.
Num_RingAssemblies: Ring assemblies, defined as the fragments remaining when all non-ring bonds are removed from the molecule. For example, naphthalene has one ring assembly, while biphenyl has two.
Num_Rings3: Number of rings of size 3.
Num_Rings4: Number of rings of size 4.
Num_Rings5: Number of rings of size 5.
Num_Rings6: Number of rings of size 6.
Num_Rings7: Number of rings of size 7.
Num_Rings8: Number of rings of size 8.
Num_Rings9Plus: Number of rings of size 9 or bigger.
Num_Chains: Unbranched chains needed to cover all the non-ring bonds in the molecule.
Num_ChainAssemblies: Chain assemblies, defined as the fragments remaining when all ring bonds are removed from the molecule.
Num_Fragments: Total fragments in the molecule; two pieces are fragments if none of their atoms are connected via a covalent bond.
Num_StereoAtoms: Atoms marked as EvenAtomStereo, OddAtomStereo, or UnknownAtomStereo.
Num_StereoBonds: Bonds marked CisBondStereo, TransBondStereo, or UnknownBondStereo.
Num_UnknownStereoAtoms: Atoms marked UnknownAtomStereo.
Num_UnknownStereoBonds: Bonds marked UnknownBondStereo.
Num_TrueStereoAtoms: Atoms that are internally perceived as having stereo and that are marked as EvenAtomStereo or OddAtomStereo.
Num_UnknownTrueStereoAtoms: Atoms that are internally perceived as having stereo and that are not marked as EvenAtomStereo or OddAtomStereo.
Num_PseudoStereoAtoms: Stereo atoms that are diametrically opposite each other in a ring system.
Num_UnknownPseudoStereoAtoms: Atoms that are internally perceived as having pseudo stereo and that are not marked with wedge bonds as EvenAtomStereo or OddAtomStereo.
Num_MesoStereoAtoms: Atoms that are true stereo centers in a molecule that, due to symmetry, is not chiral.
Num_EnhancedStereoAtoms: Atoms that are marked with EnhancedStereo (e.g., relative stereo groups from V3000 CTAB import).
Num_AtomClasses: Different atom classes from symmetry perception (excluding hydrogens). For example, benzene would have a value "1" and toluene would have a value "5".

Value: Number of
Num_Macro_Chains: Chain records defined for macromolecules in PDB files.
Num_Macro_Residues: Residue records defined for macromolecules in PDB files.
Num_TerminalRotomers: Terminal groups such as -CF3, -CCl3, -COO, -NOO. A terminal rotomer is defined as either a non-terminal sp3 atom connected to three terminal atoms of the same type, or a non-terminal sp2 atom connected to two terminal atoms of the same type. Notice that groups such as CH3 and NH2 are not counted as terminal rotomers because the bond to the heavy atom is not considered rotatable (the heavy atom is attached to only hydrogens). This property can be used to adjust the Num_RotatableBonds count, which includes bonds to terminal rotomers. For example, the Num_RotatableBonds count calculated for C6H5-CF3 is 1, and the Num_TerminalRotomers count is also 1. A modified number of rotatable bonds that excludes terminal rotomers can be calculated as Num_RotatableBonds - Num_TerminalRotomers using PilotScript.
Num_SpiroAtoms: A spiro atom is a linkage between two rings consisting of a single atom common to both. A free spiro atom is a linkage that constitutes the only union, direct or indirect, between the two rings. Only free spiro atoms are counted.
Num_BridgeHeadAtoms: A bridgehead atom connects a bridge to a ring.
Num_MetalAtoms: Atoms classified as metallic.
Num_SGroups: Number of MDL SGroups present in the molecule, as determined by the SGroup M STY lines.
Num_RepeatUnits: Number of repeat units present in the molecule. Repeat units are represented as monomers with associated repetition counts or ranges and connection types (Head to Tail, Head to Head, etc.). They are read from MDL SD or SKC files with SGroups or from Accord files.
Num_CustomData: Number of custom data present in the molecule. Custom data are text objects with specific coordinates which can be associated with molecules, atoms, bonds, or repeat units. They are read from MDL SD or SKC files with SGroups or from Accord files.
Num_PiBonds: Number of pi bonds and pi systems present in molecules such as metallocenes and other organometallic compounds. Pi bonds are read from Accord files.
Num_Superatoms: Number of superatoms present in the molecule. A superatom is an SGroup of type SUP. It is a group of atoms that are to be replaced by a single textual node when the molecule is depicted.
Num_Isotopes: Number of atoms that are marked with an isotope. This includes cases where the marked isotope matches natural abundance.
Num_QueryAtoms: Number of atoms that contain query features.
Num_QueryBonds: Number of bonds that contain query features.
Num_V3000Templates: Number of V3000 template fragments.

Value: Number of
Num_RGroupFragments: Number of total fragments.

Molecular Weight

This component can calculate the molecular weight and mass of the input molecule and create new properties to hold the results. Molecular weight is calculated using the atomic weights of the individual atoms in the molecule. Molecular mass is calculated as the sum of the atomic masses of the most common isotope of each atom.

Num H AcceptorDonors

This component calculates the number of hydrogen acceptors and/or donors and adds a separate property to the data record for each result.

Hydrogen Bond Acceptors are defined as heteroatoms (Oxygen, Nitrogen, Sulfur, or Phosphorus) with one or more lone pairs, excluding atoms with positive formal charges, amide and pyrrole-type Nitrogens, and aromatic Oxygen and Sulfur atoms in heterocyclic rings.

Hydrogen Bond Donors are defined as heteroatoms (Oxygen, Nitrogen, Sulfur, or Phosphorus) with one or more attached Hydrogen atoms.

Solubility

This component calculates aqueous solubility. It outputs the aqueous solubility expressed as logS, where S is the solubility in mol/l. The method used to estimate the solubility is the multiple linear regression model based on Electrotopological State indices published by Tetko et al. ["Estimation of Aqueous Solubility of Chemical Compounds Using E-State Indices", J. Chem. Inf. Comput. Sci., 2001, 41].

Solubility Model

Water solubility is calculated using a multiple linear regression model based on E-state keys published by Tetko et al. (Tetko, I., Tanchuk, Yu. V., Kasheva, T., Villa, A., "Estimation of Aqueous Solubility of Chemical Compounds Using E-State Indices", J. Chem. Inf. Comput. Sci., 2001, 41). The following plot shows the correlation between solubility values calculated by Pipeline Pilot using this model and the values reported in the paper using their final neural net model based on the E-state keys for a set of test molecules used in the study.
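Two of the quantities described on this page are simple arithmetic. The sketch below illustrates the difference between molecular weight (average atomic weights) and molecular mass (most common isotope), and shows how a logS value converts back to a solubility in mol/l and g/l. It is an illustration only: the helper names are hypothetical, the element tables cover just four elements with commonly tabulated values, and the logS value is made up rather than predicted by the model.

```python
import re

# Standard average atomic weights and most-common-isotope masses (u).
AVERAGE_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}
MOST_COMMON_ISOTOPE_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
}


def parse_formula(formula):
    """Turn a formula such as 'C3H7NO2' into {'C': 3, 'H': 7, 'N': 1, 'O': 2}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def weighted_sum(formula, table):
    """Sum the per-element values in `table` over all atoms in the formula."""
    return sum(table[symbol] * n for symbol, n in parse_formula(formula).items())


def solubility_from_logs(log_s, molecular_weight):
    """Convert logS (S in mol/l) to (mol/l, g/l)."""
    mol_per_l = 10 ** log_s
    return mol_per_l, mol_per_l * molecular_weight


alanine = "C3H7NO2"
weight = weighted_sum(alanine, AVERAGE_WEIGHT)
mass = weighted_sum(alanine, MOST_COMMON_ISOTOPE_MASS)
mol_per_l, g_per_l = solubility_from_logs(-2.0, weight)  # hypothetical logS
print(f"weight={weight:.3f} mass={mass:.5f} S={mol_per_l} mol/l = {g_per_l:.3f} g/l")
```

For a small molecule such as alanine the two values differ only in the second decimal place, but the gap grows with molecule size and with elements such as chlorine or bromine that have abundant heavier isotopes.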


[Figure: Correlation between solubility values]

Substructure Count from File

This component evaluates each molecule for the presence of indicated substructure(s) using the queries found in a file. The number of times the substructure or substructures are found in the molecule is counted and written to a given property name.

The substructure or substructures are provided as MDL-format queries using the Source parameter. For example, you can use ISIS/Draw to sketch the molecule, select all, and export to a MOL file. You also provide a prefix (for example, "Nitro"). The component outputs a property whose name is the prefix followed by _Count (for example, "Nitro_Count").

[Figure: Parameters for Substructure Count from File]

Substructure Count from Tag

This component evaluates each molecule for the presence of indicated substructure(s) using queries that are tagged on the incoming data stream. The number of times the substructure or substructures are found in the molecule is counted and written to a given property name.

The substructure or substructures are provided as queries, tagged with a particular property name, given in the parameter QueryTag. You also provide a prefix (for example, "Nitro"). The component outputs a property whose name is the prefix followed by _Count (for example, "Nitro_Count").


[Figure: Parameters for Substructure Count from Tag]

Substructure Map

This component searches each molecule for the presence of one or more substructures. You can select different properties that you want to add to the property list. They indicate the number of matches and/or the atom and bond maps for each match.

NumQueries: Contains the total number of queries.
NumQueriesMapped: The number of queries that mapped.
QueriesMapped: Contains a list of the names of the mapped queries.

If SeparateQueryOutputs is True, the atom-to-atom mappings are contained in properties that begin with the query name and end with _Maps or _AllMapped. The former is an array of the individual mappings. Each mapping is a sequence of numbers in which the ith entry is the number of the target atom that the ith query atom maps onto. The latter is an array of all target atoms in any of the mappings, in no particular order.

If SeparateQueryOutputs is False, then all mappings are placed in Query_Maps, and the list of all atoms is placed in Query_AllMapped.

Similar properties can be output for the bond mappings. Their names end with _BondMaps or _AllBondsMapped for separate queries, or they are named Query_BondMaps and Query_AllBondsMapped for all queries together in the same array.

Surface Area and Volume

This component calculates a variety of surface area and volume properties for each molecule. It calculates one or more of the following:

Molecular_SurfaceArea and Molecular_PolarSurfaceArea: Calculates the total surface area and/or polar surface area for each molecule using a 2D approximation.
Molecular_Volume: Calculates the 3D volume for each molecule using the current 3D coordinates. The component will fail if there are no 3D coordinates for the molecule. The 3D Coords and/or Minimize Molecule component can be used prior to the molecule volume calculation if no 3D coordinates are present for the molecules on the input stream.
Molecular_SASA, Molecular_PolarSASA, and Molecular_SAVol: Calculates the total solvent accessible surface area, the polar solvent accessible surface area, and the solvent accessible volume for each molecule using a 2D approximation. The polar solvent accessible surface area is defined as the sum of the solvent accessible surface area of all the selected polar elements, which can include N, O, P, and S. Solvent accessible surface area and solvent accessible volume are calculated assuming a solvent probe radius of 1.4 Angstroms.

Solvent Accessible Surface Area

The Surface Area and Volume component includes options for calculating solvent-accessible surface area and other related properties including:

Solvent-accessible surface area (Molecular_SASA)
Polar solvent-accessible surface area (Molecular_PolarSASA)


Solvent-accessible volume (Molecular_SAVol)

All these quantities are calculated using models based on E-state keys as independent variables, which require only 2D structures and hence are very fast. The models were obtained by fitting solvent-accessible surface areas for 3D conformers of molecules from the NCI drug database. The following plot shows the correlation between the 3D solvent-accessible surface areas and the calculated values using the 2D approximation.

[Figure: Correlation between 3D solvent-accessible surface areas and calculated values using 2D approximation]

[Table: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations)]

Molecular Energy

This component calculates the energy of a molecule, either in its current configuration or after a rapid minimization procedure. It can calculate the following values:

Energy: Gives the energy of the molecule's current 3D conformation. It calculates the point energy of the current conformation.
Minimized_Energy: Gives the energy after a fast minimization procedure. It takes a bit longer to calculate, as it performs a quick minimization procedure before calculating the energy.
Strain_Energy: Gives the point strain energy. Strain_Energy is the difference between Energy and Minimized_Energy.

PiSystem Properties

This component calculates several properties pertaining to each pi system in the molecule. When more than one pi system is present in a molecule, the resultant properties are arrays.

PiSystem_Hapticity: The number of atoms in the pi system.

PiSystem_ElectronCount: The number of pi electrons in the pi system.
PiSystem_Charge: The charge that is delocalized across the pi system.
PiSystem_Radical: The radical count that is delocalized across the pi system.

Chapter 6 Manipulators

A data manipulator is a component that alters data records as they are passed through it. The Chemistry collection includes manipulators that modify structures in a variety of ways. You can normalize sets of molecules before comparing them. For optimal display characteristics, you can employ manipulators for structural alignment and for 2D layout.

The current set of manipulator components in the Chemistry collection includes:

3D Conformations
3D Coordinates
2D Coords
2D Coords (Advanced)
Add Bond Orders
Add Hydrogens
Aggregate Fragments
Align Molecules from Tag
Align Molecules using Substructure
Center Molecule
Clean Molecule
Convert Fingerprint
Deprotonate Bases
Generate Fragments
Generate Salts
Identify Salts
Identify Salts from Tag
Ionize Molecule at pH
Keep Largest Fragment
Merge Molecules
Minimize Molecule
Normalize Structure (Cheshire)
Protonate Acids
Remove Hydrogens
Remove Salts
Separate Fragments
Standardize Molecule
Strip Salts
Strip Salts from Tag
Tile Fragments

2D Depiction Algorithms

The 2D Coords component includes advanced parameters that give you more control over the depiction algorithm. You can use 2D templates for ring and bridge assemblies with predefined coordinates. A set of several thousand templates is provided by SciTegic in /data/templates2d/scitegic. You can define your own templates and place them in data/templates2d/user.

Additional options to try to resolve bumps that could be present in the 2D structures include:

Shorten bond length of terminal bonds
Flip torsions (single torsions and pairs of torsions)
Bend torsions (single torsions and pairs of torsions)
Rotate terminal atoms
Check for bond crossing

2D depiction algorithms for several molecules are shown below:


Standardize Molecule

Standardize Molecule is a molecular manipulator component that provides a number of useful actions for taking molecules from different sources and correcting non-uniform features. By enabling the Track Actions Taken parameter, you can monitor which actions resulted in changes made to the input molecule.

The available actions in the Standardize Molecule component include:

Action: Description
StandardizeStereo: Sets or repairs the stereo on a molecule to a standard form using the coordinates as the guide. Atoms perceived as true stereo atoms that have no stereochemical markings (UnknownAtomStereo, EvenAtomStereo, or OddAtomStereo) are set to UnknownAtomStereo. Atoms with stereochemical markings that are not true stereo atoms are set to NoAtomStereo. 2D or 3D coordinates are not used in this process. Similarly, bonds perceived as true stereo double bonds that have no stereochemical markings (UnknownBondStereo, CisBondStereo, or TransBondStereo) are set to UnknownBondStereo. Bonds with stereochemical markings that are not true stereo bonds are set to NoBondStereo. Again, 2D or 3D coordinates are not used in this process.
StandardizeCharges: Sets the charges on a molecule to a standard form. For example, nitro groups are detected and converted to a standard form. (A complete definition of the standardization rules is available later in this topic.)
CenterMolecule: Translates a molecule so its geometric center lies at the origin.
RemoveSingleAtomFragments: Removes any fragments that consist of only a single heavy atom.
KeepSmallestFragment: Keeps only the smallest fragment in the molecule.
KeepLargestFragment: Keeps only the largest fragment in the molecule.
MakeNon[H]Atoms[C]Atoms: Converts all non-hydrogen atoms in the molecule to Carbon.
MakeNon[C,H]Atoms[Q]Atoms: Converts all non-carbon, non-hydrogen atoms in the molecule to the Q query atom type.
MakeNon[H]Atoms[A]Atoms: Converts all non-hydrogen atoms in the molecule to the A query atom type.
MakeAllBondsSingle: Converts all bonds in the molecule to Single bonds.
ClearCoordinates: Sets all x, y, z coordinates to zero.
FixCoordinateDimension: Sets the coordinate dimension (0D, 2D, 3D) based on the atomic coordinates.
StraightenTripleBonds: Finds atoms with triple bonds with non-linear geometry and fixes them so that the bond angles are 180 degrees.
ClearMolecule: Deletes all atoms and bonds in the molecule, keeping the molecule object in the data record.
RemoveMolecule: Deletes the molecule object from the data record.
ClearStereo: Sets all atoms and bonds to NoStereo.
ClearEnhancedStereo: Removes all relative stereo groupings (e.g., MDL V3000 Enhanced Stereo).
ClearUnknownStereo: Sets all atoms and bonds marked UnknownStereo to NoStereo.

Action: Description
ClearUnknownAtomStereo: Sets all atoms marked UnknownStereo to NoStereo.
ClearUnknownCisTransBondStereo: Sets all bonds marked UnknownStereo to NoStereo.
ClearCisTransBondStereo: Sets all bonds marked CisStereo or TransStereo to UnknownStereo.
ClearCharges: Sets all formal charges to zero.
SetStereoFromCoordinates: Uses 2D coordinates and up/down bond markings (or 3D coordinates) to assign the stereochemistry of the atoms or bonds. Typically, this is done by readers or by molecule from text components. Occasionally components may create molecules that need to have their stereo reperceived.
RepositionStereoBonds: Repositions the stereo bond markings, trying to find the best bond to mark as a wedge bond for each stereo atom.
FixDirectionOfWedgeBonds: Checks the wedge bonds in the molecule to ensure that the wedge is drawn with the stereo atom at the narrow end of the wedge. Any wedge bond for which there is a stereo atom at the wide end, and no stereo atom at the narrow end, is reversed to point in the other direction. A separate option, Invert Wedge Bond When Changing Direction, controls inverting the bond stereo (up or down) when changing the direction of the wedge bond.
NeutralizeBondedZwitterions: Converts directly bonded zwitterions (positively charged atom bonded to negatively charged atom, A+B-) to the neutral representation (A=B).
ClearSGroupData: Clears any SGroup information from the molecule.
ClearRepeatUnits: Clears any repeat unit information from the molecule, leaving only the molecule with the monomers included, but without any repetition or linking information.
ClearCustomData: Clears any custom data information from the molecule.
ClearPiBonds: Clears any pi bonds and pi systems from the molecule.
ClearHighlightColors: Clears any highlight colors from atoms and bonds.
ClearQueryInfo: Deletes all query information from atoms and bonds.
ClearAtomLabels: Clears labels from atoms.
ClearBondLabels: Clears labels from bonds.
ClearUnusualValence: Clears any atom valence query features and resets all implicit hydrogen counts to their standard values.
ClearIsotopes: Clears all isotope markings from atoms.
LocalizeMarushRAtomsOnRings: R atoms bonded to the centers of rings are converted to R atoms at all open positions on the ring.
InvalidateCustomDataCoordinates: Clears any coordinates in custom data objects.
InvalidateRepeatUnitCoordinates: Clears any coordinates in repeat unit objects, which forces the locations of their brackets to be re-perceived.

InvalidateSGroupCoordinates: Clears any coordinates in MDL SGroup objects.

Standardize Charges

The Standardize Molecule component has a parameter value called Standardize Charges that uses molecular connectivity to set the charges on heteroatoms.

Rules for Heteroatoms (Standardize Charges)

The rules for specific heteroatoms are expressed in terms of the following variables, which describe the heteroatom (the core atom) and its environment.

Variable: Description
AtomIndex: Index of the core atom.
NumSingle, NumDouble, NumTriple, NumAromatic: A count of the number of bonds of the given type to the central atom.
NumSingleToOxygen, NumDoubleToOxygen, NumAromaticToOxygen: Number of single, double, or aromatic bonds to oxygen atoms.
NumAtt: Number of bonds attached to the central atom.
Oxy: Index of an attached oxygen (if any).
OxyNumAtt: Number of attachments to the attached oxygen (if any).
Bond<n>: nth attached bond.
Att<n>: nth attached atom.

Transformation Method

The following information describes the transformation method in C-style pseudocode:

    switch (AtomType(AtomIndex))
    {
    case Nitrogen:
        {
            // Simple quaternary
            if (NumSingle == 4)
            {
                SetCharge(AtomIndex, +1);
                if ((NumSingleToOxygen == 1) && (OxyNumAtt == 1))
                    SetCharge(Oxy, -1);
            }
            // Quaternary-style aromatic
            else if ((NumSingle == 1) && (NumAromatic == 2))
            {
                SetCharge(AtomIndex, +1);
                if ((NumSingleToOxygen == 1) && (OxyNumAtt == 1))
                    SetCharge(Oxy, -1);
            }
            // Nitro
            else if ((NumSingle == 2) && (NumDouble == 1) && (NumSingleToOxygen == 1))
            {
                SetCharge(AtomIndex, +1);
                if (OxyNumAtt == 1)
                    SetCharge(Oxy, -1);
            }
            // Double-bonded quaternary
            else if ((NumSingle == 2) && (NumDouble == 1))
                SetCharge(AtomIndex, +1);
        }
        break;
    case Oxygen:
        {
            // Simple quaternary
            if (NumAtt == 3)
                SetCharge(AtomIndex, +1);
            else if (NumAtt == 2)
            {
                // Single-single is OK
                if ((BondType(AtomIndex, Bond1) == SingleBond) && (BondType(AtomIndex, Bond2) == SingleBond))
                    break;
                // X=O-C is charged
                if ((BondType(AtomIndex, Bond1) == SingleBond) && (BondType(AtomIndex, Bond2) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att1) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
                if ((BondType(AtomIndex, Bond2) == SingleBond) && (BondType(AtomIndex, Bond1) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att2) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
            }
        }
        break;
    case Sulfur:
        {
            // Simple quaternary
            if (NumAtt == 3)
                SetCharge(AtomIndex, +1);
            else if (NumAtt == 2)
            {
                // Single-single is OK
                if ((BondType(AtomIndex, Bond1) == SingleBond) && (BondType(AtomIndex, Bond2) == SingleBond))
                    break;
                // X=S-C is charged
                if ((BondType(AtomIndex, Bond1) == SingleBond) && (BondType(AtomIndex, Bond2) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att1) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
                if ((BondType(AtomIndex, Bond2) == SingleBond) && (BondType(AtomIndex, Bond1) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att2) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
            }
        }
        break;
    }

3D Coordinates, 3D Conformations and Minimize Energy

The 3D Coords and 3D Conformations components provide a quick way to generate 3D coordinates. The resulting coordinates are checked to make sure that there are no bumps between atoms, but electrostatics are not taken into account. The algorithm used to generate 3D conformations changes only the torsion angles, keeping bond lengths and bond angles fixed.

The Minimize Energy component uses the Clean force field described in Receptor Surface Models. 1. Definition and Construction. M. Hahn; J. Med. Chem.; 1995; 38(12).

Generate Fragments

The Generate Fragments component extracts one or more types of fragments from the molecule, as specified by the user. The available types of fragments are:

Ring Assemblies: Contiguous ring systems
Bridge Assemblies: Contiguous ring systems that share two or more bonds
Rings: Individual rings
Chain Assemblies: Contiguous chains
BemisMurcko Assemblies: Bemis-Murcko assemblies are contiguous ring systems plus chains that link two or more rings, as defined in The Properties of Known Drugs. 1. Molecular Frameworks, Guy W. Bemis and Mark A. Murcko, J. Med. Chem. 1996, 39.

The different fragments that this component can generate are illustrated for the following molecule:

Chain Assemblies
Sets of contiguous chain atoms, including any ring atom that terminates a chain.

Rings
Individual rings.

Ring Assemblies
Contiguous ring systems, including fused rings and bridge systems.

Bridge Assemblies
Contiguous ring systems that share two or more bonds.

Bemis-Murcko Assemblies
Bemis-Murcko assemblies are ring systems and any chain that links two or more rings. Any other chains are clipped from the molecule.

The component has two parameters that control what to do with the attachment points when generating the fragments. Set IncludeAlphaAtoms to True to include the first atom outside the fragment as an attachment point (Z atom). Set MarkAttachmentAtoms to True to replace the atoms at the point of attachment by Z atoms. The following figures illustrate the use of these options for ring assembly fragments.

Ring Assemblies (IncludeAlphaAtoms = True)

Ring Assemblies (MarkAttachmentAtoms = True)

Deprotonate Bases, Protonate Acids and Ionize Molecule at pH

Deprotonate Bases and Protonate Acids perform a quick analysis of the molecule to identify simple acids and bases and neutralize them. Acid functional groups are defined as Oxygen or Sulfur atoms with a negative formal charge, attached to only one, uncharged, atom. Basic functional groups are defined as Nitrogen atoms with a positive formal charge and one or more attached Hydrogen atoms.

The Ionize Molecule at pH component uses the pKa framework to identify ionization sites and calculate their pKa values. It then ionizes the sites based on the calculated pKa values and the user-defined pH.
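The acid and base definitions above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the components' implementation: the Atom class, its fields, and the one-proton neutralization step are hypothetical, and it assumes that "attached to only one atom" counts non-hydrogen neighbors.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Hypothetical minimal atom record used only for this illustration."""
    element: str             # element symbol, e.g. "O", "S", "N"
    charge: int              # formal charge
    hydrogens: int           # number of attached hydrogens
    neighbor_charges: tuple  # formal charges of attached non-hydrogen atoms


def is_simple_acid(atom):
    # O or S with a negative formal charge, attached to exactly one uncharged atom
    return (atom.element in ("O", "S")
            and atom.charge < 0
            and len(atom.neighbor_charges) == 1
            and atom.neighbor_charges[0] == 0)


def is_simple_base(atom):
    # N with a positive formal charge and at least one attached hydrogen
    return atom.element == "N" and atom.charge > 0 and atom.hydrogens >= 1


def neutralize(atom):
    # Assumption: neutralizing adds or removes a single proton
    if is_simple_acid(atom):
        atom.charge += 1
        atom.hydrogens += 1
    elif is_simple_base(atom):
        atom.charge -= 1
        atom.hydrogens -= 1
    return atom


carboxylate_o = neutralize(Atom("O", -1, 0, (0,)))  # protonated: charge 0, one H
ammonium_n = neutralize(Atom("N", +1, 3, (0,)))     # deprotonated: charge 0, two H
nitro_o = neutralize(Atom("O", -1, 0, (+1,)))       # neighbor is charged: left unchanged
```

The nitro oxygen shows why the "uncharged neighbor" condition matters: its negative charge is balanced by the adjacent N+, so under these rules it is not treated as an acid site.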

Chapter 7 Filters

A filter is a component that identifies and diverts specific subsets of records. It provides a powerful way to customize a pipeline to process specific subsets of your data differently than other subsets. These components are designed to screen data according to criteria that depend on the nature of the component. For example, they can remove data records with a property value in some desired range and remove duplicate records. Filters typically have an input port, a pass port (displayed in green), and a fail port (displayed in red).

The Chemistry collection provides filters that act on molecular data and are capable of evaluating specific chemical and structural features of compound records. You can filter using substructural queries, similarity to reference molecules, molecular appropriateness filters (such as Lipinski's rule), molecule uniqueness, and fragment properties.

The current set of Filter components in the Chemistry collection includes:

Bad Isotope Filter
Bad Stereo Filter
Bad Triple Bond Filter
Bad Valence Filter
Bump Check Filter
Check and Normalize Structure
HTS Filter
Lipinski Filter
Most Frequent Fragments
Organic Filter
Query Features Filter
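The pass/fail routing described at the start of this chapter can be pictured with a short Python sketch. It is a conceptual illustration only and does not use any Pipeline Pilot API; the function name, the record layout, and the property name and threshold are all made up for the example.

```python
def run_filter(records, passes):
    """Route each record to a pass list or a fail list, like a filter's two output ports."""
    passed, failed = [], []
    for record in records:
        (passed if passes(record) else failed).append(record)
    return passed, failed


# Hypothetical records and criterion: keep records whose property is at most 500.
records = [
    {"Name": "A", "Molecular_Weight": 180.2},
    {"Name": "B", "Molecular_Weight": 612.7},
    {"Name": "C", "Molecular_Weight": 499.9},
]
passed, failed = run_filter(records, lambda r: r["Molecular_Weight"] <= 500)
print([r["Name"] for r in passed])  # records sent to the pass port
print([r["Name"] for r in failed])  # records sent to the fail port
```

Keeping the failed records, rather than discarding them, mirrors the point that a filter diverts a subset so it can be processed differently downstream.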

Chapter 8 Search and Similarity

The Search and Similarity components are designed to perform database-style searching of molecules over pipelined data, removing the need to pre-load a molecular database to search.

Search and Similarity components in the Chemistry collection include:

Find Novel Fragments
Find Novel Molecules
Remove Duplicate Molecules
Substructure Filter from File
Substructure Filter from Tag

In addition, the Data Modeling collection contains components for partitioning data into equal-sized groups and for performing similarity searches for a set of target compounds against a set of reference compounds using Tanimoto similarity.

Molecular Similarity (Tanimoto, etc.)

The Molecular Similarity (Tanimoto, etc.) component calculates similarity values for each target molecule with respect to one or more reference molecules using molecular fingerprint properties (ECFP, FPFP, etc.). This component can calculate several different similarity coefficients, the most common being the Tanimoto similarity coefficient. The Tanimoto similarity coefficient is defined by the expression:

$$\text{Tanimoto} = \frac{SA}{SA + SB + SC}$$

where:

SA = Number of bits defined in both the target and the reference
SB = Number of bits defined in the target but not the reference
SC = Number of bits defined in the reference but not the target

The Tanimoto similarity ranges from zero (there are no common bits between the reference and the target molecules) to one (the reference and the target molecules have exactly the same bits).

For more details, see Molecular Similarity (Tanimoto, etc.) in the Data Modeling Component Collection User Guide.
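As a worked illustration of the Tanimoto definition above, the following Python sketch computes the coefficient from two fingerprints represented as sets of "on" bit positions. The function name and the example bit sets are hypothetical, and returning 0.0 for two empty fingerprints is an assumption; the component itself may handle that case differently.

```python
def tanimoto(target_bits, reference_bits):
    """Tanimoto coefficient SA / (SA + SB + SC) for two fingerprint bit sets."""
    target = set(target_bits)
    reference = set(reference_bits)
    sa = len(target & reference)  # bits defined in both target and reference
    sb = len(target - reference)  # bits defined in the target only
    sc = len(reference - target)  # bits defined in the reference only
    total = sa + sb + sc
    if total == 0:
        return 0.0  # assumption: two empty fingerprints are given zero similarity
    return sa / total


# Made-up bit positions: SA = 2 (bits 4 and 7), SB = 2 (bits 1 and 9), SC = 1 (bit 8)
print(tanimoto({1, 4, 7, 9}, {4, 7, 8}))  # 2 / (2 + 2 + 1) = 0.4
```

Identical bit sets give SB = SC = 0 and therefore a value of one, while disjoint sets give SA = 0 and a value of zero, matching the range described above.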

Chapter 9 Database Content

Database Content components can be used to access content information from DiscoveryGate Web Service (DGWS) and other public Chemistry Web Services such as PubChem and NCI. A valid license key is needed to use the DGWS components. The license key is defined as a global in the Pipeline Pilot Administration Portal.

The Chemistry Web Services accessed by the Database Content components are:

DiscoveryGate Web Service: Molecule to Name, Molecule from Name, Identity, Similarity, Substructure Searches, Activity Searches, Reaction Searches, Content information
NCI/CADD: Molecule to Name, Molecule from Name
PubChem: Molecule from Name, Identity, Similarity, Substructure Searches, Molecular Properties
ChemSpider: Molecule from Name, Molecule from InChI Key
eMolecules: Molecule from Name
ChemExper: Molecule from Name

Depending on your DGWS license, you have access to the following databases:

Data Source   Name                                             Type
ACD           Available Chemicals Directory                    Molecule
SCD           Screening Compounds Directory                    Molecule
MDDR          MDL Drug Data Report                             Molecule
NCI           National Cancer Institute Databases              Molecule
CMC           Comprehensive Medicinal Chemistry                Molecule
TOX           Toxicity Database                                Molecule
CIRX          ChemInform Reaction Library                      Reaction
DJSM          Derwent Journal of Synthetic Methods             Reaction
ORGSYN        ORGSYN Database                                  Reaction
SPORE         Solid-Phase Organic Reactions                    Reaction
REFLIB        The Reference Library of Synthetic Methodology   Reaction
SPRESI        Storage and Retrieval of Chemical Structure Information (SPeicherung und REcherche Strukturchemischer Information)   Reaction

Most Database Content components use the Web Service (SOAP) component internally to access the Chemistry Web Services through their corresponding WSDL. An example of a component that connects to DGWS to retrieve structures based on a chemical name is shown below:


Subprotocol that uses the Web Service (SOAP) component

The internal Web Service (SOAP) components are parameterized to call specific WSDL methods, setting any required parameters from data in the input stream or from user-specified parameters in the wrapping components. The following are the parameter settings needed to get molecules by name using DGWS:


Parameterization of the Web Service (SOAP) component to call a DGWS method through a WSDL file

The DGWS components are configured to retrieve the content data as a hierarchical data record, preserving the structure and relationships of the data in the content databases. The hierarchical data is processed to extract each leaf node as a separate record, including all properties of the parent nodes all the way up to the root node.
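The leaf-extraction step described above can be sketched in Python as a recursive walk that carries parent properties down to each leaf. This is a minimal sketch under stated assumptions, not the DGWS components' code: the nested-dictionary layout (with "properties" and "children" keys), the function name, and the sample data are all hypothetical, and a child property is assumed to override a parent property of the same name.

```python
def flatten_leaves(node, inherited=None):
    """Yield one flat record per leaf node, merged with the properties of all its ancestors."""
    props = dict(inherited or {})
    props.update(node.get("properties", {}))  # assumption: child values override parent values
    children = node.get("children", [])
    if not children:
        yield props  # leaf: emit the accumulated properties as one record
        return
    for child in children:
        yield from flatten_leaves(child, props)


# Made-up hierarchy: a root record, two data sources, and suppliers under one of them.
tree = {
    "properties": {"Name": "example compound"},
    "children": [
        {
            "properties": {"Source": "ACD"},
            "children": [
                {"properties": {"Supplier": "Supplier A"}},
                {"properties": {"Supplier": "Supplier B"}},
            ],
        },
        {"properties": {"Source": "SCD"}},
    ],
}

records = list(flatten_leaves(tree))
for record in records:
    print(record)
```

Each of the three leaves becomes its own record, and every record still carries the root's Name, so downstream components can treat the output as ordinary flat data.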

Hierarchical data record with content retrieved from DGWS

Components and Example Protocols

The components are organized into four subfolders: Fetch Information, Search and Similarity, Utilities, and Viewers.

Fetch Information

Check Chemistry Web Services
Find Compound Activity
Find Compound ID
Find Molecule from InChI Key
Find Molecule from Name
Find Molecule from Name (Batched)
Find Molecule Names
Find Molecule by Activity
Get Content from Compound ID
Get Content from RXN MDLNumber
Get PubChem Properties
Get Service Information
Get Supplier Information
Get Vocabulary

Search and Similarity

Find Molecule in Reactions
Identity Search
Reactions Search
Similarity Search
Substructure Search

Utilities

This subfolder contains internal components used by other Database Content components, such as individual Molecule to/from Name converters for the different Chemistry Web Services, and other utilities.

Viewers

This subfolder contains one component, Content Viewer, which displays a table with DGWS Data Source information for each input molecule, with interactive links to retrieve other content information.

Database Content Component Examples

01 Get Molecules from Names
02 Get Names from Molecules
03 Find Compound IDs
04 Find Biological Activities
05 Find Similar Molecules
06 Find Molecules as Substructures
07 Search Reaction Databases
08 Get Content from DiscoveryGate Web Service
09 Filter Molecules by Procurement Data
10 View Content from DiscoveryGate Web Service

Web Services Utilities Examples

Check Response Times of Chemistry Web Services
Get DiscoveryGate Web Service Information
Get DiscoveryGate Web Service Vocabulary

Web Port Examples

08 Find Molecules by Name: This example opens a form that allows you to retrieve Data Sources and Procurement information for a given compound using DGWS:

WebPort example using DGWS to find structure and suppliers from a chemical name

09 JDraw Sketcher and DGWS Search: In this example, you can sketch a structure using JDraw and perform a Substructure, Similarity, or Identity search in DGWS.
