CHEMISTRY COLLECTION Basic Chemistry Guide


Copyright Notice

Copyright 2011 Accelrys Software Inc. All rights reserved. This product (software and/or documentation) is furnished under a License Agreement and may be used only in accordance with the terms of such agreement.

Trademarks

The registered trademarks or trademarks of Accelrys Software Inc. include but are not limited to: ACCELRYS, ACCELRYS & Logo, PIPELINE PILOT. All other trademarks are the property of their respective owners.

Restrictions on Government Use

This is a commercial product. Use, release, duplication, or disclosure by United States Government agencies is subject to restrictions set forth in DFARS or FAR, as applicable, and any successor rules and regulations.

Acknowledgments and References

To print photographs or files of computational results (figures and/or data) obtained using Accelrys software, acknowledge the source in an appropriate format. For example:

  Imaging results obtained using software programs from Accelrys Software Inc. Data management and analysis performed with the Pipeline Pilot Imaging collection. Graphical displays generated with the Discovery Studio Visualizer.

To reference an Accelrys Software Inc. publication in another publication, cite Accelrys Software Inc. as both the author and the publisher. For example:

  Accelrys Software Inc., Chemistry Collection: Basic Chemistry User Guide, Pipeline Pilot, San Diego: Accelrys Software Inc.

Request for Permission to Reprint and Acknowledgment

Accelrys may grant permission to republish or reprint its copyrighted materials. Requests should be submitted to Accelrys Scientific and Technical Support, either by email to support@accelrys.com or in writing to:

  Accelrys Scientific and Technical Support
  Telesis Court, Suite 100
  San Diego, CA

Please include an acknowledgment: Reprinted with permission from Accelrys Software Inc., [Document name], [Month Year], Accelrys Software Inc., San Diego.

For example: Reprinted with permission from Accelrys Software, Inc., Pipeline Pilot Next Gen Sequencing Collection: User Guide, May, 2011, Accelrys Software, Inc., San Diego.

Contents

Chapter 1 Introduction
  Who Should Read this Guide
  Requirements
  Client-side Software Requirements
  Server-side Software Requirements
  Additional Information
Chapter 2 Readers
  Shared Parameters for Readers
  Source Files for Readers
Chapter 3 Viewers and Writers
  Viewers
  Writers
  Shared Parameters for Writers
Chapter 4 Converters
  Molecule From Text
  Molecule To Text
Chapter 5 Calculators
  ALogP
  Canonical Smiles
  Translating SMILES Output to Molecular Data
  E-state Keys
  What to Output
  Element Count
  Advanced Parameters
  Molecular Fingerprints
  Fingerprint Storage Formats
  Path-based Fingerprints
  MDL Public Key Fingerprints
  User Key Fingerprints
  Atom Environment Fingerprints
  Extended Connectivity Fingerprints
  Fingerprint Generation Method
  Calculate Extended Connectivity Fingerprints
  Fingerprint Feature Code Generation
  Hashing Schemes
  Functional Class Fingerprints
  Reaction Fingerprints
  Figuring Out the Fingerprint Name
  Molecular Formula
  Molecular Properties
  Molecular Property Counts
  Molecular Weight
  Num H AcceptorDonors
  Solubility
  Solubility Model
  Substructure Count from File
  Substructure Count from Tag
  Substructure Map
  Surface Area and Volume
  Solvent Accessible Surface Area
  Molecular Energy
  PiSystem Properties
Chapter 6 Manipulators
  2D Depiction Algorithms
  Standardize Molecule
  Standardize Charges
  D Coordinates, 3D Conformations and Minimize Energy
  Generate Fragments
  Chain Assemblies
  Rings
  Ring Assemblies
  Bridge Assemblies
  Bemis-Murcko Assemblies
  Deprotonate Bases, Protonate Acids and Ionize Molecule at pH
Chapter 7 Filters
Chapter 8 Search and Similarity
  Molecular Similarity (Tanimoto, etc.)
Chapter 9 Database Content
  Components and Example Protocols
Appendix
  Appendix A: Substructure Searching
  Appendix B: Tetrahedral Stereo Perception
  Appendix C: Support for MDL Enhanced Stereo Representation
  Appendix D: Support for Features in ChemDraw Files
  Appendix E: Glossary of Terms

Chapter 1 Introduction

The Chemistry collection allows you to deploy Pipeline Pilot in a chemistry setting. Use these specialized tools to efficiently perform compound processing and cheminformatics research and analysis. This large collection of components includes data readers, writers and viewers, molecular property calculators, filters, manipulators, converters, and utilities.

With Chemistry components, you can create protocols for a variety of applications including:

  Compound library acquisition
  Library cleanup and standardization
  Substructure searching
  Extensive property profiling and subset selection
  Extended Connectivity Fingerprints (ECFP): The Chemistry collection uses Extended Connectivity Fingerprints (ECFP), SciTegic's proprietary method for calculating structural fingerprints. This method characterizes a molecule by indexing the environment of every atom in it, using up to four billion different structural features. ECFP is an efficient and useful method for searching and clustering.
  Partial support for Non-Specific Structures (NONS) read from SGroup lines in MDL SD or SKC files or from Accord files
  Representation, generation and enumeration of Markush structures
  Use of Markush structures as queries in substructure searching
  Representation, enumeration and depiction of repeat units
  Representation and depiction of custom data

Combined with the separately available Modeling collection:

  Substructure activity modeling
  Compound clustering

When combined with the Integration collection, the Molecular Toolkit's Java and Perl APIs provide programmatic access to the molecular data model and its searching methods.

Who Should Read this Guide

The Chemistry collection includes a large number of components organized into numerous folders. Information about using these components is available in two separate guides. The Chemistry components covered in this basic guide include:

  Readers
  Viewers
  Writers
  Converters
  Calculators
  Manipulators
  Search and Similarity

Requirements

Some collections require third-party software that is not included with Pipeline Pilot. This software might need to be installed on a client or on the server (depending on the component).

Client-side Software Requirements

Chemistry components that require third-party software on your client system include:

To use this component:        You need this software:
Accelrys DS Visualizer        Accelrys DS Visualizer or Accelrys Discovery Studio
ISISDraw Sketcher             ISISDraw
Chemistry Sketcher            A sketch application such as ISISDraw, SymyxDraw, AccelrysDraw, or ChemDraw
ISIS for Excel Reader         ISISForExcel
ISIS for Excel Viewer         ISISForExcel
Accord for Excel Reader       AccordForExcel
Accord for Excel Viewer       AccordForExcel
Accord for Excel Writer       AccordForExcel

Server-side Software Requirements

Chemistry components that require third-party software on your server include:

To use this component:        You need this software:
ISIS Reader                   ISISBase
Bioisosteres components       Accelrys Bioster database license

See Also

For information on utility components and advanced chemistry concepts, such as reactions and enumeration, MCSS, and pKa, see the Advanced Chemistry User Guide.

For detailed information about the latest changes in the Chemistry collection, see the Chemistry Release Notes.

Additional Information

For more information about the Chemistry collection for Pipeline Pilot and other Accelrys software products, visit the Accelrys website.

Chapter 2 Readers

A reader component generates a stream of data records that subsequent components in the pipeline both receive (as input) and send (as output). The data records that flow through the pipeline are based on input from a data source (usually a file or database). Readers are frequently used as the initial component in a pipeline. You can also use them in many other locations throughout a protocol to add more sources of data and to read temporary files. Because they are designed to generate data, readers do not expose input ports by default.

The Chemistry collection includes readers that provide import facilities for several commonly used molecular file formats including SD, RG, RD, and RXN (from MDL), SMILES, SMIRKS, SMARTS, and TDT (from Daylight), MOL2 (from Tripos), ChemDraw and ChemDraw XML (from CambridgeSoft), Maestro (from Schrodinger) and public formats such as PDB. You can also read molecular data from databases such as ISIS and MDL Direct (from MDL), ActivityBase (from IDBS), and Accord (from Accelrys).

The current set of readers in the Chemistry collection includes:

  Accord for Excel Reader
  Accord Reader
  ChemDraw Reader
  ChemDraw XML Reader
  Embedded Molecules Reader
  ISISDraw Sketcher
  ISIS Reader
  Maestro Reader
  Mol2 Reader
  PDB Reader
  RD Reader
  RG Reader
  RXN Reader
  SD Reader
  SKC Reader
  SMARTS Reader
  SMILES Reader
  SMIRKS Reader
  TDT Reader

Shared Parameters for Readers

In addition to their own specific settings, most readers support the following parameters:

  Source: Specifies the location of the data to read into the component. You can select source files on the SciTegic server, on a client machine, or on another machine that is accessible on the network. You can also read URL sources including HTTP and FTP.
  Maximum: Limits the number of data records to read. (Reads all records if a value is not set.)
  Keep Properties: Allows you to preview the data records and define which properties on the data record to include in or exclude from the pipeline. (Retains all properties if a value is not set.)
  SourceTag: Labels data records based on their point of origin in the pipeline, which helps identify the source of data records downstream or in the results. The reader inserts an extra property called SourceTag into each data record that identifies the data source location. The value can be the name of the source file location or a more general identifier (such as a number or letter). Use this property elsewhere in the pipeline to filter or group data records.
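As a rough illustration of how the shared reader parameters interact, the following Python sketch treats a reader as a generator of data records. It is not Pipeline Pilot code: the function name read_records, its arguments, and the sample records are hypothetical and only mirror the Maximum, Keep Properties, and SourceTag parameters described above.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Record = Dict[str, object]


def read_records(
    rows: Iterable[Record],
    maximum: Optional[int] = None,
    keep_properties: Optional[List[str]] = None,
    source_tag: Optional[str] = None,
) -> Iterator[Record]:
    """Yield data records one at a time, the way a reader feeds a pipeline."""
    for index, row in enumerate(rows):
        # Maximum: stop after the requested number of records (all if unset).
        if maximum is not None and index >= maximum:
            break
        # Keep Properties: retain every property if no selection is given.
        if keep_properties is None:
            record = dict(row)
        else:
            record = {k: v for k, v in row.items() if k in keep_properties}
        # SourceTag: label each record with its point of origin.
        if source_tag is not None:
            record["SourceTag"] = source_tag
        yield record


# Hypothetical input rows standing in for the contents of a source file.
rows = [
    {"Name": "acetamide", "MW": 59.07},
    {"Name": "benzene", "MW": 78.11},
    {"Name": "ethanol", "MW": 46.07},
]
records = list(
    read_records(rows, maximum=2, keep_properties=["Name"], source_tag="fileA")
)
print(records)
```

Downstream steps could then filter or group on the SourceTag value, for example when records from two readers are merged into one stream.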

Source Files for Readers

When you run a protocol, all files that the protocol uses must be accessible on the server. All readers let you select source files on the SciTegic server, on a client, or on another machine on the network. If the location of the source file is not shared, you are prompted to upload the file to the server so the protocol can run.

Tip: With the SD Reader component, you can read all .mol or .sd files in a folder at once. Manually edit the Source parameter to include an asterisk (*) in the filename (like the wildcard asterisk used in other Windows applications). For example, Parameter Value = data/examples/e*.sd.
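To see which files a wildcard such as the one in the tip would select, the pattern can be tried out with Python's standard fnmatch module. This is only an approximation of wildcard matching in general, not a description of how the SD Reader resolves its Source parameter, and the file names below are made up for the example.

```python
from fnmatch import fnmatchcase


def match_source(pattern, filenames):
    """Return the file names that match a shell-style wildcard pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical folder contents.
candidates = [
    "data/examples/estrogens.sd",
    "data/examples/enzymes.sd",
    "data/examples/drugs.sd",
    "data/examples/e_notes.txt",
]
matched = match_source("data/examples/e*.sd", candidates)
print(matched)
```

Only the two .sd files whose names start with "e" are selected; the other entries fail either the prefix or the extension part of the pattern.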

Chapter 3 Viewers and Writers

Viewers

A viewer component displays information or results of a protocol on a client machine. Viewers are frequently used as the final component in a pipeline. You can also use them to display intermediate results or to provide Pipeline Pilot with additional required information at protocol runtime.

The Chemistry collection integrates several popular third-party structural viewers. A variety of Web viewers are also available for graphically displaying your molecular data and property information.

The current set of viewers in the Chemistry collection includes:

  Accelrys DS Visualizer
  Accord for Excel Viewer
  Excel Structure Viewer
  HTML Molecular Cluster Viewer
  HTML Molecular Grouped Viewer
  HTML Molecular Table Viewer
  ISIS for Excel Viewer
  SAR Viewer
  Visual Molecule Selector

IMPORTANT! For viewer components that require third-party applications, the software needs to be installed on all clients that run the protocols. If you share protocols with other client users, ensure that the client machines can support the third-party applications. We recommend you configure all third-party applications to open automatically so that your protocols can run without interruptions from password prompts or login dialogs. (You might not see a password/login dialog if the Pipeline Pilot program window is maximized, because it can cover all other open windows.)

Writers

A writer is a component that saves data from a pipeline in a format that you specify through the component's parameters. Typically, the saved data is stored in a file or database. Writers are frequently used as the final component in a pipeline. They are also used in many other locations throughout protocols for storing intermediate data and for writing temporary files. You can also write data into different databases using these components. Most writers do not have output ports because they are designed to be the last component in a pipeline.

The Chemistry collection includes writers that provide export facilities for a number of commonly used molecular file formats including HTML (hypertext markup language), MOL2 (from Tripos), ChemDraw and ChemDraw XML (from CambridgeSoft), Maestro (from Schrodinger), Accord (from Accelrys), public formats such as SD, SKC, RD, RG, and RXN (from MDL), SMILES and TDT (from Daylight), and PDB.

The current set of writers in the Chemistry collection includes:

  Accord for Excel Writer
  Accord Writer
  ChemDraw Writer
  ChemDraw XML Writer
  HTML Molecular Grouped Writer
  HTML Molecular Table Writer
  Maestro Writer
  PDB Writer
  RD Writer
  RG Writer
  RXN Writer
  SD Writer
  SKC Writer
  SMILES Writer
  Mol2 Writer
  TDT Writer

Shared Parameters for Writers

In addition to their own specific settings, data writers generally support the following parameters:

  Destination: Specifies the target location for the data. Writers only output data on the server.
  Maximum: Limits the number of data records to output. (Writes all records if a value is not set.)
  IfFileExists: Specifies what to do if the file name already exists at the output destination. You can overwrite the existing file, append to it, or halt the pipeline.

Tip: Writer components can save files in compressed (zipped) or uncompressed format. To save files as compressed, add .gz to the end of the filename.
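The writer behavior described above (overwrite, append, or halt when the file exists, and compression selected by a .gz suffix) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used by the writer components: the function write_text and the option strings "overwrite", "append", and "halt" are hypothetical stand-ins for the IfFileExists choices.

```python
import gzip
import os
import tempfile


def write_text(destination, text, if_file_exists="overwrite"):
    """Write text to destination, compressing when the name ends in .gz."""
    if if_file_exists not in ("overwrite", "append", "halt"):
        raise ValueError("unknown if_file_exists option: " + if_file_exists)
    exists = os.path.exists(destination)
    if exists and if_file_exists == "halt":
        # Halt: refuse to touch an existing file.
        raise FileExistsError(destination)
    mode = "at" if (exists and if_file_exists == "append") else "wt"
    # A .gz suffix selects compressed output; anything else is written as-is.
    opener = gzip.open if destination.endswith(".gz") else open
    with opener(destination, mode, encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "out.smi.gz")
    write_text(path, "CCO\n")
    write_text(path, "CC(N)=O\n", if_file_exists="append")
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        contents = handle.read()
print(contents)
```

Appending to a gzip file adds a second compressed member; standard gzip readers return the members back to back, so the file reads as one continuous text.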

Chapter 4 Converters

The Converter components convert between molecules and text. They are organized into the following subfolders:

  Molecule From Text: Components that translate a textual molecule description into a molecule.
  Molecule To Text: Components that translate the molecular information into a block of text that is stored on a property.

Molecule From Text

These components convert text properties in the following formats into molecules: MDL Mol file (CTAB), SYBYL Mol2, Accelrys Accord formats, PDB, ChemDraw, ChemDraw XML, Maestro, Daylight SMARTS, Daylight SMILES, MDL RXN, MDL SKC, MDL Chime, and Pipeline Pilot Chemistry.

Molecule From Text components include:

  Identify Molecular Format
  Molecule from Accord
  Molecule from ChemDraw
  Molecule from ChemDraw XML
  Molecule from Chime
  Molecule from CTAB
  Molecule from InChI_AuxInfo
  Molecule from Maestro
  Molecule from MOL2
  Molecule from PDB
  Molecule from Pipeline Pilot Chemistry
  Molecule from SKC
  Molecule from SMARTS
  Molecule from SMILES
  Molecule from Text
  Reaction from RXN
  Reaction from SMIRKS

Molecule To Text

These components create a text property containing a representation of the molecular data record in one of the commonly used formats, or an image of the molecule in JPEG, PNG or SVG format. This is useful for storing molecular information in a database in a format that can be used later to reconstruct the molecule. You can use the corresponding Molecule From Text component to reconstruct the molecule from a property value containing the text with the molecular representation.

Molecule To Text components include:

  Image From Molecule
  Molecule to Accord
  Molecule to ChemDraw
  Molecule to ChemDraw XML
  Molecule to Chime
  Molecule to CTAB
  Molecule to Image
  Molecule to InChI
  Molecule to JPEG
  Molecule to Maestro
  Molecule to NEMA
  Molecule to MOL2
  Molecule to PDB
  Molecule to Pipeline Pilot Chemistry
  Molecule to PNG
  Molecule to SMARTS
  Molecule to SKC
  Molecule to SMILES
  Molecule to SVG
  Molecule to Text
  Reaction to RXN

The following record shows an example of MDL CTAB text, the CTAB of an acetamide molecule (this record is a single text string with internal returns):

  [CTAB listing: header lines ACETAMIDE and SciTegic, atom lines for C, C, O, and N, terminated by M END]

Note: CTAB, Chime, InChI, MOL2, NEMA, PDB, and SMILES are different ways to represent a molecule as text.

  CTAB, MOL2, and PDB preserve atom coordinates. They are larger in size, and their internal carriage returns can cause problems if they are written to some formats (such as delimited files). PDB is commonly used to store protein structures.
  Chime is a compressed and encrypted CTAB. It contains no internal carriage returns, but is not human readable.
  SMILES is more compact, and Canonical_SMILES can be used for molecular comparison, but atom coordinates are not preserved.
  NEMA is a unique format that can also be used for molecular comparison, but it is a one-way conversion (you cannot convert back).
  InChI is the international chemical identifier from IUPAC. With InChI, two strings can be calculated: InChI and InChI_AuxInfo. InChI is similar to Canonical_SMILES in that it can be used for molecular comparison and contains no atom coordinates, but it has the additional feature that different tautomers of the same compound have the same InChI string. InChI_AuxInfo is the string recommended for recreating the molecule (Molecule From InChI_AuxInfo). It is a more verbose string in which atom coordinates and bond orders are preserved. For more information about InChI, see the IUPAC InChI documentation.

The Reaction to RXN component creates a text property containing an MDL RXN representation of the reaction in a molecular data record.

The Molecule to JPEG, Molecule to PNG, and Molecule to SVG components calculate images of the molecule in the following formats:

  JPEG: (Joint Photographic Experts Group) A format used for compressed high-color or true-color images such as photographs.
  PNG: (Portable Network Graphics) A newer format used for bitmapped images (similar to GIF without legal restrictions). It provides high color support and improved compression. Newer versions of browsers such as Internet Explorer support this format.
  SVG: (Scalable Vector Graphics) A modularized language for describing two-dimensional vector and mixed vector/raster graphics in XML.

The current coordinates of the molecule are used. For a given data record, a property containing a JPEG, PNG, or SVG image can be saved to an image file using the Text Writer, with the output mode set to Binary.

Tip: If the molecule is not currently represented in 2D, use the 2D_Coords component to generate 2D coordinates.
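The note above about internal carriage returns in CTAB, MOL2, and PDB text can be demonstrated with a small Python example. It is a generic illustration of delimited-file behavior, not of any Pipeline Pilot writer; the multi-line string is a placeholder rather than a valid CTAB, and CC(N)=O is one SMILES spelling of acetamide.

```python
import csv
import io

# Placeholder multi-line text standing in for a CTAB (not a valid CTAB).
ctab_like = "ACETAMIDE\n  placeholder atom and bond lines\nM  END"
smiles = "CC(N)=O"

# Naive tab-delimited output: the embedded returns split one record
# across several physical lines.
naive_row = "\t".join(["1", ctab_like]) + "\n"
naive_line_count = len(naive_row.splitlines())

# The single-line SMILES text does not have this problem.
smiles_row = "\t".join(["1", smiles]) + "\n"
smiles_line_count = len(smiles_row.splitlines())

# A CSV-aware writer quotes the field, so a CSV-aware reader can
# recover the original text unchanged.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerow(["1", ctab_like])
round_trip = next(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))

print(naive_line_count, smiles_line_count, round_trip[1] == ctab_like)
```

Tools that split delimited files purely on line breaks see the naive row as three broken records, which is why compact single-line formats such as SMILES or Chime are often easier to store in delimited files.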

Chapter 5 Calculators

An important feature of a Pipeline Pilot protocol is its ability to calculate some properties on-the-fly. Calculator components derive values from the molecular information and store them as properties on the data record. Components that implement these on-the-fly property calculations are called property calculators. They declare the properties they can calculate using the Output parameter. The properties these components calculate are called calculable properties. For example, the molecular property ALogP can be used within a PilotScript expression. If the value is not already defined, the required property calculator is invoked automatically.

The Chemistry collection includes property calculators that calculate numeric molecular descriptors. The current set of calculators in the Chemistry collection includes:

Physicochemical
  AlogP
  Solubility
  Surface Area and Volume
  Surface Area and Volume 3D

Structural
  Canonical Smiles
  Element Count
  Gasteiger Charges
  MDL Key Fingerprints
  Molecular Energy
  Molecular Fingerprints
  Molecular Formula
  Molecular Properties
  Molecular Property Counts
  Num H AcceptorDonors
  PiSystem Properties

Substructure Mapping
  Substructure Count from File
  Substructure Count from Tag
  Substructure Map from File
  Substructure Map from Tag

Topological indices
  Balaban Wiener and Zagreb Indices
  Chi Indices
  E-state Keys
  InfoContent Descriptors
  Kappa Shape Indices
  Subgraph Counts

ALogP

The ALogP component calculates the Ghose/Crippen group-contribution estimate for LogP, where P is the relative solubility of a compound in octanol vs. water. For more details see Ghose, A.K., Viswanadhan, V.N., and Wendoloski, J.J., Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragment Methods: An Analysis of AlogP and CLogP Methods. J. Phys. Chem. A, 1998, 102.

ALogP can calculate the following properties:

  AlogP: The Ghose/Crippen group-contribution estimate for LogP, where P is the relative solubility of a compound in oil (actually, octanol) vs. water.
  AlogP_MR: The Ghose/Crippen estimate of molar refractivity, which contains information about the molecular volume and polarizability of a compound.
  AlogP_Count: Returns an array of 120 numbers, which correspond to the 120 Ghose/Crippen atom types. Each array element contains the number of atoms in the molecule of that particular atom type.
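In a group-contribution method of this kind, the estimate is a sum over atom types of (number of atoms of that type) times (contribution of that type). The sketch below shows that arithmetic on an array shaped like AlogP_Count. The contribution values and the positions used are invented placeholders, not the published Ghose/Crippen parameters, and the function name is hypothetical.

```python
NUM_ATOM_TYPES = 120  # size of the AlogP_Count array described above


def group_contribution_sum(type_counts, contributions):
    """Sum count * contribution over all atom types."""
    if len(type_counts) != len(contributions):
        raise ValueError("counts and contributions must have the same length")
    return sum(n * c for n, c in zip(type_counts, contributions))


# Placeholder data: two atoms of type 0 and one atom of type 5.
counts = [0] * NUM_ATOM_TYPES
counts[0] = 2
counts[5] = 1

# Placeholder per-type contributions (NOT real Ghose/Crippen values).
contributions = [0.0] * NUM_ATOM_TYPES
contributions[0] = 0.5
contributions[5] = -0.25

estimate = group_contribution_sum(counts, contributions)
print(estimate)  # 2 * 0.5 + 1 * (-0.25) = 0.75
```

The same summation with a different table of per-type values would give a molar refractivity style estimate, which is why a single typing scheme can serve both AlogP and AlogP_MR.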

Canonical Smiles

The Canonical Smiles component is a type of molecular property calculator that calculates a SMILES representation of the input molecule, optimally canonicalized so it is independent of the original atom numbering. SMILES is a text-based representation for molecular information developed by Daylight. A canonical SMILES is independent of the original atom numbering and of explicit vs. implicit hydrogens. The SMILES string is written as text to a property.

Canonical SMILES is unique to a given molecule, regardless of how it was drawn. You can use it as the key in merge or join operations to perform molecular comparisons without a molecular database. For example, the canonical SMILES representations of the first 10 molecules in Asinex are:

  Canonical_Smiles
  CN(C)c1ccc(\C=C\C(=O)\C=C\c2ccc(cc2)[N+](=O)[O-])cc1
  Cc1ccccc1OCCNC(=S)SCC(=O)Nc2ccc(Cl)c(Cl)c2
  Cc1ccc(OCCNC(=S)SCC(=O)OC(C)(C)C)cc1
  CCCOC(=O)CSC(=S)NCCOc1ccccc1C
  [O-][N+](=O)c1ncn(CCn2ncnc2[N+](=O)[O-])n1
  Cc1ccc(cc1)C23CC4CC(CC(C4)(C2)c5ccc(C)cc5)C3
  [O-][N+](=O)c1ccc(N\N=C\CCC\C=N\Nc2ccc(cc2[N+](=O)[O-])[N+](=O)[O-])c(c1)[N+](=O)[O-]
  Clc1cccc(Oc2nc(CCCC#CC=C)nc(Oc3cccc(Cl)c3)n2)c1
  C=CC#CCCCc1ccc(Oc2nc(Oc3ccccc3)nc(Oc4ccccc4)n2)cc1
  CC(C)CC1NC(=S)N(C1=O)c2ccccc2

Note: The canonicalization algorithm is Accelrys'; while it is derived from the Daylight algorithm, it will not necessarily give identical results. Compare two SMILES for identity only when both are canonicalized by the same method.

Translating SMILES Output to Molecular Data

To translate the SMILES output of Canonical Smiles into a molecular data record, use the Molecule From Smiles component. This lets you write molecular information into spreadsheets and databases and recreate the molecule later, upon retrieval.

E-state Keys

This component calculates the Electrotopological State (E-State) descriptors defined by Kier and Hall (Hall L., Mohney B., Kier L., J. Chem. Inf. Comput. Sci., 1991, 31, 76-82; Hall L., Kier L., J. Chem. Inf. Comput. Sci., 2000, 40). The E-state keys are atomic indices that combine the electronic properties and the topological environment for each atom in the molecule. Keys are calculated for C, N, O, S, P, F, Cl, Br, I, Li, Be, B, Si, Ge, As, Se, Sn, and Pb atoms, which are classified into 79 atom types. These descriptors are widely used in structure similarity, library comparison, and QSAR/QSPR studies.
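The component can report these atomic indices either as per-type counts or as per-type sums. The sketch below shows only that aggregation step, assuming each atom has already been assigned an atom type and an E-state value. The four type labels and all numeric values are illustrative placeholders (the component itself uses 79 atom types), and the function name is hypothetical.

```python
def estate_counts_and_sums(atoms, type_order):
    """Aggregate per-atom (type, value) pairs into count and sum arrays."""
    index = {name: i for i, name in enumerate(type_order)}
    counts = [0] * len(type_order)
    sums = [0.0] * len(type_order)
    unknown = 0
    for atom_type, value in atoms:
        if atom_type not in index:
            # Atom could not be classified into a known type.
            unknown += 1
            continue
        position = index[atom_type]
        counts[position] += 1
        sums[position] += value
    return counts, sums, unknown


# Four illustrative type labels instead of the full set of 79.
type_order = ["sCH3", "dssC", "dO", "sNH2"]

# Placeholder per-atom assignments and values (not computed E-state values).
atoms = [
    ("sCH3", 1.5),
    ("sCH3", 2.0),
    ("dssC", -0.5),
    ("dO", 10.0),
    ("??", 0.0),
]
counts, sums, unknown = estate_counts_and_sums(atoms, type_order)
print(counts, sums, unknown)
```

The counts array corresponds to an integer-array output and the sums array to a double-array output; writing each array element to its own property gives the individual-property form.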

E-state Keys calculates either the sums of the E-state values or the counts of each atom type. The E-state Counts are the number of occurrences in the molecule of each of the 79 different atom types. E-state Sums are the sum of the E-state values for each of the 79 atom types. The descriptors can be output as a number of individual properties, one for each atom type, or as a single property with an array of values.

An option in the E-state calculator allows the display of the E-state type and E-state value for each atom in the molecule in the 2D depictions in the HTML Molecular Table Viewer.

  [Figure: E-state type and E-state value for each atom in the molecule]

What to Output

The output type for E-state Keys is controlled by the What to Output parameter, which includes the following values:

  Estate_Keys_Properties: Calculates the E-state sums for all atom types and outputs them as individual properties.
  Estate_Counts_Properties: Calculates the E-state counts for all atom types and outputs them as individual properties.
  Estate_Keys: Calculates the E-state sums for all atom types and outputs them in one property as an array of double values.
  Estate_Counts: Calculates the E-state counts for all atom types and outputs them in one property as an array of integer values.
  Estate_NumUnknown: Outputs the number of atoms that could not be classified into an E-state atom type. E-state keys are calculated for organic elements (C, N, O, P, and S), halogens (F, Cl, Br, and I) and for Li, Be, B, Si, Ge, As, Se, Sn, and Pb.

Element Count

The Element Count component is a type of molecular property calculator that counts the atoms of each of the selected element types. The advanced parameters (described in detail below) allow you to return the total number of atoms using a list of element types.

The Output parameter for this component contains a number of properties appended with the string _Count. The letters before the appended string are the atomic symbol for the given element. If you do not make a selection from the Output list, defaults are used to generate the output.

Note: You can query a number of non-standard element symbols, such as *, R, X, D and T. Depending on their source, they may represent unknown atom types, abstract atom types (for example, Q means non-hydrogen non-carbon atoms in MDL queries), or other features. For common data sources, these are not present.
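The counting behavior, including the Elements, Total, and NotList advanced parameters covered in the next section, can be sketched as follows. This is an assumption-laden illustration rather than the component's code: the function element_counts is hypothetical, it works on a plain list of atomic symbols, and it applies the NotList inversion only when a Total name is given.

```python
def element_counts(symbols, elements, total=None, not_list=False):
    """Count atoms by element symbol, mimicking the described parameters."""
    wanted = set(elements)
    if total is not None:
        # Total: a single summed count replaces the individual counts.
        if not_list:
            # NotList: count atoms whose element is NOT in the list.
            n = sum(1 for s in symbols if s not in wanted)
        else:
            n = sum(1 for s in symbols if s in wanted)
        return {total: n}
    # Default: one <symbol>_Count property per requested element.
    return {e + "_Count": sum(1 for s in symbols if s == e) for e in elements}


# Explicit atoms of sodium acetate, hydrogens omitted for the example.
symbols = ["C", "C", "O", "O", "Na"]

per_element = element_counts(symbols, ["C", "O", "Na"])
group_total = element_counts(symbols, ["Li", "Na", "K", "Rb", "Cs"],
                             total="Group1A_Count")
inorganic = element_counts(symbols,
                           ["O", "C", "N", "S", "P", "F", "Cl", "Br", "I"],
                           total="Inorganic_Count", not_list=True)
print(per_element, group_total, inorganic)
```

With these inputs the sodium atom is the only one outside the organic element list, so Inorganic_Count is 1.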

Advanced Parameters

The advanced parameters that are available for this component include:

  Elements: An alternate mechanism for specifying the elements to count (list by one- or two-letter codes, separated by commas). For example, enter Li,Na,K,Rb,Cs to generate the properties Li_Count, Na_Count, K_Count, Rb_Count, and Cs_Count.
  Total: The output name for the totaled count of atoms listed in Elements. If you enter a value in this parameter (for example, Group1A_Count), individual counts are not generated. Instead, a single count is generated that contains the sum of all of the individual element types.
  NotList: Specifies that the elements to count are those not listed in Elements. This inverts the logic of the total. It returns the count of all atoms with element types not contained in the given list.

For example, use the following parameter values to create a calculator for Inorganic_Count:

  Elements: O,C,N,S,P,F,Cl,Br,I
  Total: Inorganic_Count
  NotList: True

This example sets NotList to True and instructs the component to write the output to the property Inorganic_Count.

Molecular Fingerprints

This component calculates a variety of molecular fingerprints for the input molecules and reactions. It uses one of the following algorithms to calculate fingerprints:

  SciTegic extended-connectivity fingerprints
  Daylight-style path fingerprints
  Atom Environment fingerprints
  MDL public key fingerprints

For both the extended-connectivity and path fingerprints, a number of methods are available to define the atom abstraction used to generate the initial atom code. You should also specify the maximum path distance (such as number of bonds) to use for indexing an individual fragment.

The next section provides more details about molecular fingerprints including:

  Fingerprint Parameters
  Fingerprint Storage Formats
  Path-based Fingerprints
  MDL Public Key Fingerprints
  Atom Environment Fingerprints
  Extended Connectivity Fingerprints
  Fingerprint Generation Method
  Calculating Extended Connectivity Fingerprints
  Hashing Schemes
  Functional Class Fingerprints
  Figuring out the Fingerprint Name

Fingerprint Parameters

The Parameters tab for the Molecular Fingerprints component looks like this:

  [Figure: Parameters for Molecular Fingerprints]

Type

This parameter is the type of fingerprint to calculate. You can use the following values:

  ExtendedConnectivity: Generates extended-connectivity fingerprints.
  Path: Generates Daylight-style path-based fingerprints.
  Atom Environment: Generates higher-order features from atom types using a method developed by Bender et al. This creates a String Fingerprint.
  HashAtomEnvironment: Uses a hash code to create an Integer Fingerprint representation of the AtomEnvironment fingerprints for ease of use (for example, in learned models).
  MDLPublicKeys: Generates the MDL Public key fingerprints.
  UserKeys: Generates fingerprints derived from substructures that you define (user key fingerprints).

AtomAbstraction

This parameter is only used with the ExtendedConnectivity and Path types. It determines the method for generating the initial atom feature codes for the heavy (non-hydrogen) atoms in the molecule. You can use the following values:

  FunctionalClass: Uses the rapid functional-role codes. This abstraction generates extended-connectivity fingerprints (FCFPs) and path fingerprints (FPFPs). The functional-role code is a combination of the following roles: hydrogen-bond acceptor, hydrogen-bond donor, positively ionized or positively ionizable, negatively ionized or negatively ionizable, aromatic, and halogen.
  AtomType: Uses a code derived from the number of connections to an atom, the element type, the charge, and the atomic mass. This abstraction generates extended-connectivity fingerprints (ECFPs) and path fingerprints (EPFPs).
  ALogPCode: Uses a code from the 120 atom types used in the calculation of ALogP. This abstraction generates extended-connectivity fingerprints (LCFPs) and path fingerprints (LPFPs).
  SYBYL: Uses the SYBYL atom types used in the Tripos Mol2 File Format.
  UserAtomTypes: Assumes that the property UserAtomTypes is defined on the molecule and contains an array of integers, one for each atom in the molecule. The i-th value in the array is the user atom type for the i-th atom in the molecule.
  Reaction: Uses type, charge, hybridization, reactant or product, and reaction site information. Only available for reaction inputs.

OutputType

This parameter controls the way the fingerprint is presented. There are two methods available:

  Fingerprint: A list of the features present in the molecule, with duplicates removed.
  Counts: A list of the features present in the molecule, with duplicates retained. If a feature occurs more than once in a molecule, that bit value is included more than once in the output list.
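The difference between the two OutputType settings comes down to whether repeated feature codes are kept. A minimal Python sketch, using made-up integer feature codes and hypothetical function names, and sorting the output only to make the result easy to read:

```python
def as_fingerprint(feature_codes):
    """Fingerprint: each feature listed once, duplicates removed."""
    return sorted(set(feature_codes))


def as_counts(feature_codes):
    """Counts: every occurrence kept, so repeated features appear repeatedly."""
    return sorted(feature_codes)


# Made-up feature codes; the value 734603939 occurs twice in this molecule.
features = [1559650422, 734603939, -1046436026, 734603939]

fingerprint = as_fingerprint(features)
counts = as_counts(features)
print(fingerprint)
print(counts)
```

Here the Fingerprint form has three entries and the Counts form has four, because the repeated feature is listed twice.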

MaximumDistance

This parameter is used with ExtendedConnectivity, Path, AtomEnvironment and HashedAtomEnvironment Types. For extended-connectivity fingerprints, it is the maximum diameter of the features generated. For path-based fingerprints, it is the length of the paths (in bonds) that are considered.

Fingerprint Naming Convention

For extended-connectivity and path-based fingerprints, the generated property name has a particular format. The first character of the fingerprint name is the atom abstraction used:

  F: Functional class
  E: Atom type
  L: AlogP types
  S: SYBYL atom types
  R: Reaction atom typing

The second character represents the type of fingerprints:

  C: Extended-connectivity fingerprints
  P: Path-based fingerprints
  E: Atom Environment fingerprints
  H: Hashed Atom Environment fingerprints

The third character is always F. The fourth character is either P or C for Fingerprints or Counts, respectively. The fourth character is followed by an underscore and the maximum distance. For example, a functional class extended-connectivity fingerprint of maximum diameter 6 generates a property named FCFP_6.

Tip: Understanding the naming convention is useful with a learning component that refers to the properties by name. To identify what a particular set of parameter values generates, read a molecule, pipe it through a Molecular Fingerprints component with your settings, and view it in the Notepad Viewer. The name of the fingerprint is displayed just above the fingerprint values.

Options

This parameter provides options for the fingerprint calculation. The options include:

  IncludeStereo: (#S) includes information from stereoatoms in the fingerprint calculation.
  OutputBitDistance: (#D) outputs an array with the length or diameter of each bit.
  OutputBitSubstructure: (#C) outputs an array with SMARTS of the fragment example.
  OutputBitAllAtoms: (#A) outputs an array with the set of all atoms involved with a feature anywhere in the molecule.
  OutputBitFeatureAtoms: (#F) outputs an array with the set of atoms showing one example of the feature bit.

Note: IncludeStereo changes the fingerprint. The other options cause the calculation of other properties with associated information.

Fingerprint Storage Formats

Although you do not need to be concerned with how data is stored to perform most tasks in Pipeline Pilot, there are situations where being aware of storage types is useful, for example, when you import data from external sources and need to indicate that the data should be interpreted as fingerprint data:

  Data types describe what is known about how to interpret the information associated with a given property.
  Data storage types describe the current format used in storing the raw data.

Preferred Storage Types

Because fingerprints are stored and manipulated in a number of key operations, Pipeline Pilot provides native types for them. A given data type has a preferred storage type (for example, the LongType has a preferred storage type of LongStorage), though data of that type might be stored in any of a number of storage types (for example, data of LongType might be stored as StringStorage).

Note: For this discussion, a type and its preferred storage type are equivalent.

Pipeline Pilot has different native fingerprint types, and each type has a corresponding preferred storage type. These preferred storage types are generated by the fingerprint calculators used within the Molecular Fingerprints component.

Fingerprint Type        Corresponding Preferred Storage Type
LongFingerprintType     LongArrayStorage
DoubleFingerprintType   DoubleArrayStorage
StringFingerprintType   StringArrayStorage
BitFingerprintType      BitsetStorage

Extended-connectivity fingerprints and path-based fingerprints generate properties of type LongFingerprintType and are stored as an array of long (32-bit) integers. MDL Public Key and User-key fingerprints return properties of type BitFingerprintType and are stored as fixed-length bit arrays: 166 bits for MDL public keys, or M bits for user keys, where M is the number of the largest bit in any of the features in the user keys feature directory.

Tip: If you are working with an advanced task that involves expertise with fingerprint storage formats, contact Accelrys Technical Support for further assistance.

Path-based Fingerprints

Path-based fingerprints are modeled on the fingerprints developed by Daylight. SciTegic path-based fingerprints have the following characteristics:

- They include the same options for initial atom coding as the extended-connectivity fingerprints, allowing for abstractions that are useful in learning and clustering.
- They are not immediately folded down to a small set of a few hundred or thousand bits. The learning methods in the program work well even with thousands of different bits, and keeping the bits separate aids in learning and interpretation.
- They are similar to extended-connectivity fingerprints in that a fingerprint for a given maximum path length also contains the bits for all paths of shorter lengths.

Path-based fingerprints are generated by detecting all paths up to a given length, and then generating a feature that represents each of those paths. The union of all different features present in a molecule is the path-based fingerprint for that molecule. For a particular path, the feature bit is generated as follows:

- A path containing N bonds has N+1 atoms, so an array of 2N+1 elements is allocated.
- The first element is filled with the initial atom code for one of the end atoms, the next element with the bond type to the following atom, the next with the initial atom code for that atom, and so on, until the entire path is in the array.
- The array is hashed to give the feature code for that path.
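The path-encoding steps above can be sketched as follows. This is a minimal illustration, not the product's implementation: CRC32 stands in for the unspecified hash function, the atom codes are hypothetical (atomic numbers are used here), and choosing the smaller of the two reading directions so that a path hashes the same from either end is an assumption.

```python
import zlib
from struct import pack


def path_array(atom_codes, bond_types):
    """Interleave atom codes and bond types: N bonds give 2N+1 entries."""
    assert len(atom_codes) == len(bond_types) + 1
    arr = [atom_codes[0]]
    for bond, atom in zip(bond_types, atom_codes[1:]):
        arr += [bond, atom]
    return arr


def path_feature(atom_codes, bond_types):
    """Hash a path to a 32-bit feature code (CRC32 is a stand-in hash)."""
    forward = path_array(atom_codes, bond_types)
    backward = forward[::-1]
    canonical = min(forward, backward)  # assumption: direction-independent
    return zlib.crc32(pack(f"<{len(canonical)}i", *canonical))


# Hypothetical three-atom path C-C=O: atom codes 6, 6, 8; bonds single, double.
print(path_array([6, 6, 8], [1, 2]))  # [6, 1, 6, 2, 8]
```

With two bonds the array has 2*2+1 = 5 entries, matching the 2N+1 rule, and the same path read from the other end yields the same feature code under the direction assumption.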

MDL Public Key Fingerprints

MDL keys are a set of 960, mostly substructural features, developed for rapid substructural searching of ISIS databases. They are also useful as descriptors for learning, although controversy exists about their true quality for this purpose. Molecular Design considered the full definition proprietary, but released the definition of 166 of the full set of 960. These are referred to as the MDL Public keys.

In Pipeline Pilot, setting the Type parameter to MDLPublicKeys adds a property called MDLPublicKeys to the property list of the molecule. The fingerprint contains the list of key numbers for features that exist in the given molecule.

Calculation of Keys Parameters for MDLPublicKeys

For ease of inspection, most of these keys are implemented as MOL file queries. The list of queries is located in data\queries\mdlqueries. We do not recommend changing or editing the queries in this directory.

The following keys can be turned on without these substructures by checking atom types or functional groups internally (although for some bits there are substructural queries that can also turn them on):

- Atom sets
- Single atom types
- Miscellaneous

Atom Sets

Key       Description
Bit 3     Group IVA, VA, VIA, PERIODS 4-6 (Ge...)
Bit 4     ACTINIDE
Bit 5     Group IIIB, IVB
Bit 6     LANTHANIDE
Bit 7     Group VB, VIB, VIIB
Bit 9     Group VIII
Bit 10    Group IIA
Bit 12    Group IB, IIB
Bit 18    Group IIIA
Bit 35    Group IA
Bit 44    OTHER
Bit 134   X

Single Atom Types

Key       Description
Bit 20    Silicon
Bit 27    Iodine
Bit 29    Phosphorus
Bit 42    Fluorine
Bit 46    Bromine
Bit 88    Sulfur
Bit 103   Chlorine
Bit 161   Nitrogen
Bit 164   Oxygen

Miscellaneous

Key       Description
Bit 68    QInRing > 0
Bit 1     numisotopes > 0
Bit 2     numunusual > 0
Bit 22    ringsofsize[3] > 0
Bit 11    ringsofsize[4] > 0
Bit 96    ringsofsize[5] > 0
Bit 99    numdoubleccbonds > 0
Bit 140   numdoubleccbonds > 1
Bit 163   ringsofsize[6] > 0
Bit 145   ringsofsize[6] > 1
Bit 19    ringsofsize[7] > 0
Bit 101   ringsofsize[8] > 0
Bit 137   numqinring > 0
Bit 120   numqinring > 1
Bit 121   numninring > 0
Bit 138   numrare1 > 0
Bit 140   numqrare1 > 0
Bit 141   numqrare1 > 1
Bit 141   nummethyl > 2
Bit 149   nummethyl > 1
Bit 160   nummethyl > 0
Bit 162   numaromaticrings > 0

Key       Description
Bit 125   numaromaticrings > 1
Bit 142   NumNitrogen > 1
Bit 159   NumOxygen > 1
Bit 146   NumOxygen > 2
Bit 140   NumOxygen > 3
Bit 165   numrings > 0
Bit 166   numfragments > 1

Tip: To create altered or novel fingerprints using substructural queries, work with the user key fingerprints described in the next section.

User Key Fingerprints

User key fingerprints are derived from the same underlying system as MDL public key fingerprints. Pipeline Pilot provides examples of user-defined fingerprints as illustrations of how you can create your own MDL key style fingerprints. To work with user-key fingerprints, open the Molecular Fingerprints component and select the value UserKeys for the Type parameter. The query files (in MOL or SD format) are located in data/queries/userqueries and include the following:

- ExampleBit1.mol
- ExampleBit2.sd
- ExampleBit3.sd

Tip: You can remove these query files and add your own to create user key fingerprints.

ExampleBit1.mol

This query is for a non-hydrogen, non-carbon atom attached to a non-hydrogen, non-carbon atom. The name of the query, which is the first line in a MOL file, is 1. This is the number of the bit turned on when that feature is present. When this query is found in a molecule, the feature number is added to the fingerprint bit list. If you open ExampleBit1.mol in Windows Notepad, the following information is displayed:

ExampleBit2.sd

You can also define the bit code using the Bit property in an SD format query file. This query is for a non-hydrogen, non-carbon atom attached to an oxygen. When it is found, the bit number 2 is added to the fingerprint bit list, as shown in the following example:

ExampleBit3.sd

Another option is the ability to test for multiple occurrences of a feature (a requirement that a feature be present a given number of times before the bit is turned on). The MinimumCount property declares the minimum number of times the query must be found in a molecule for that feature to be added to the output fingerprint. In this example, the query must be found at least twice. The following is an example of how this is done:

Atom Environment Fingerprints

Atom Environment fingerprints generate higher-order bits from the atom types specified by AtomAbstraction, using an algorithm described in Bender, A., Mussa, H.Y., Glen, R.C., and Reiling, S., Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naive Bayesian Classifier. J. Chem. Inf. Comput. Sci. 2004, 44. This generates an output of String Fingerprint type.

Hashed Atom Environment fingerprints use a hashing algorithm to create an Integer Fingerprint representation of the Atom Environment fingerprint. This fingerprint is often easier to use as input to other components (such as Molecular Learners).

Extended Connectivity Fingerprints

Extended Connectivity Fingerprints (ECFPs) are a new class of fingerprint well-suited to the learning methods available in Pipeline Pilot. Each feature represents the presence of a structural (not substructural) unit. This distinguishes ECFPs from fingerprints whose features represent substructural units (such as MDL keys or Daylight path-based fingerprints). The difference is best explained with an example. Assume there are features representing a para-substituted benzene ring in both an MDL fingerprint and an ECFP:

[Figure: Para-substituted benzene ring]

In an MDL fingerprint, the feature is turned on when the structure is present as a substructure anywhere in the target molecule. For example, the following estrogenic structure turns on that feature:

[Figure: Estrogenic structure]

For ECFPs, the estrogen does not contain the feature, as there are substitutions on the ring at locations other than the specified attachment atoms marked A. Thus, an ECFP feature represents an exact structure with limited, specified attachment points. The following molecule contains the feature:

[Figure: Estrogenic structure mapped by ECFPs]

The reason for the difference between ECFPs and substructure-based fingerprints is twofold. First, substructure-based fingerprints are intended for a different task: database searching. Substructural fingerprints have the property that all features contained within a query must also be contained within a target, if the query can map onto that target. This allows the fingerprint to rapidly eliminate molecules from consideration when performing a substructure search against a database. ECFPs are not required to be useful for this kind of database search optimization.
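The screening property of substructural fingerprints described above amounts to a subset test. The following sketch illustrates it with made-up bit numbers and molecule names; it is not how any particular database engine implements screening.

```python
def may_contain(query_bits, target_bits):
    """Screen: a target can match only if it has every feature of the query."""
    return set(query_bits) <= set(target_bits)


# Hypothetical key bits for a query and three database molecules.
query = {12, 40, 97}
database = {
    "mol_a": {3, 12, 40, 97, 150},
    "mol_b": {12, 97},
    "mol_c": {12, 40, 97},
}

candidates = [name for name, bits in database.items() if may_contain(query, bits)]
print(candidates)  # ['mol_a', 'mol_c']
```

Only the surviving candidates need the slower atom-by-atom substructure match; mol_b is eliminated because it lacks bit 40.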

Second, ECFPs represent a much larger set of features than is common for other fingerprints. The virtual size of the fingerprint is four billion different features. For a given molecule, only a small subset of those features is present. (This means the fingerprints are usually stored as a list of the features that are present, rather than as a binary bit array.) This allows ECFPs to represent a huge number of different structural units that may be valuable for learning or molecular comparison. A typical molecule may generate fingerprints containing tens or hundreds of features; a typical molecular catalog may contain from several thousand to millions of different features.

Advantages of ECFPs

There are several advantages to ECFPs, including:

- They are fast to calculate, as explained later in this guide. Even large datasets can be processed rapidly without the need to pre-process the data in weekend-long batch jobs.
- They represent a much larger set of features than many fingerprints, compared to 960 features in the MDL private keys, or even the 25,000 features in products such as LeadScope. Further, these features are not pre-selected, but are generated directly from the molecules. Novel molecular classes are as easily handled as the more common classes present in pre-selected lists of interesting features.
- They represent information about tertiary and quaternary centers, which is not the case for path-based fingerprints such as Daylight fingerprints. Even some stereochemical information can be represented.
- The features represent the molecule at differing levels of detail. For example, some may represent single atoms, such as the presence of a halogen. Others may represent a large section of molecular structure, such as the A-B rings of a steroid ring system shown here:

  [Figure: Steroid ring system]

- Different atom abstractions can be used to generate different fingerprints. For example, standard ECFPs use the atom type as part of the initial atom code; this differentiates chlorine from bromine. A variant of ECFPs called functional-class fingerprints (FCFPs) uses the role of an atom in the initial atom code. In this case, both chlorine and bromine are seen as equivalent instances of halogen atoms.

Fingerprint Generation Method

The fingerprint generation method is based on one of the original algorithms in computational organic chemistry, called the Morgan algorithm. The goal of the Morgan algorithm is to assign a unique identity to each atom in a molecule, so that a molecule can be described in a way that is invariant to the original numbering of atoms. The algorithm has two parts: the assignment of an initial code to each atom, and an iterative part in which each atom code is updated to reflect the codes of each atom's neighbors.

A similar scheme is used in ECFPs, with two important changes. First, the Morgan algorithm is only interested in disambiguating atoms within a single molecule, so the generated codes are not comparable between different molecules. SciTegic uses a hashing scheme to generate codes that are comparable across molecules. Second, the Morgan algorithm iterates until every atom is unique, or as close to unique as symmetry allows, and intermediate results are discarded. However, it is exactly those intermediate results that are of interest, because they represent features that reflect many different levels of structural abstraction.

The following information describes the generation of the initial atom codes and the iteration that generates the fingerprint features.

Generation of Initial Atom Codes

The generation of an ECFP or FCFP fingerprint for a molecule begins with the assignment of an initial atom code for each heavy (non-hydrogen) atom in the molecule. In theory, any atom-typing rule can be used. There are two rules that are most useful: the ECFP rule and the FCFP rule. (Only differences in the initial atom code distinguish ECFPs and FCFPs; once the codes are assigned, both fingerprints are developed through the same process.)

For ECFPs, the initial atom code is derived from the following features:

- Number of connections to the atom
- Element type
- Charge
- Atom mass

Atoms that differ in any of these features generate a different ECFP initial atom code.

For FCFPs, the initial atom code is based on a quick estimate of the functional role the atom plays. The role is a combination of the following:

- Hydrogen-bond acceptor
- Hydrogen-bond donor
- Positively ionized or positively ionizable
- Negatively ionized or negatively ionizable
- Aromatic
- Halogen

An example of the initial FCFP atom codes for a small molecule is shown below (the nitrogen is given a code of 3, as it is both an H-bond acceptor and an H-bond donor):

[Figure: Initial FCFP atom codes]

If you were to stop here, you would generate the fingerprint called FCFP_0, where the number zero is the maximum diameter explored around each atom. The fingerprint is the set of features {0, 1, 3, 16}. More typically, these features are used as a starting point for the iterative process described in the next section.
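The way role flags combine into an initial FCFP code can be sketched with bit flags. The bit values (1, 2, 4, 8, 16, 32) are the ones listed in the Functional Class Fingerprints section of this guide; the role assignments per atom below are illustrative assumptions chosen to reproduce the set {0, 1, 3, 16}, and the names `Role` and `initial_code` are hypothetical.

```python
from enum import IntFlag


class Role(IntFlag):
    ACCEPTOR = 1
    DONOR = 2
    NEG_IONIZABLE = 4
    POS_IONIZABLE = 8
    AROMATIC = 16
    HALOGEN = 32


def initial_code(*roles):
    """OR together the role bits; an atom with no role gets code 0."""
    code = 0
    for role in roles:
        code |= role
    return int(code)


# Illustrative heavy atoms of a small aromatic amide (assumed roles).
atoms = [
    (Role.AROMATIC,),             # six aromatic ring carbons
    (Role.AROMATIC,), (Role.AROMATIC,), (Role.AROMATIC,),
    (Role.AROMATIC,), (Role.AROMATIC,),
    (),                           # carbonyl carbon: no role
    (Role.ACCEPTOR,),             # carbonyl oxygen
    (Role.ACCEPTOR, Role.DONOR),  # amide nitrogen: 1 | 2 = 3
]

codes = [initial_code(*roles) for roles in atoms]
fcfp_0 = sorted(set(codes))
print(fcfp_0)  # [0, 1, 3, 16]
```

Stopping at this point corresponds to a diameter of zero: the fingerprint is just the set of distinct initial codes.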

Note: There is another initial abstraction available within Pipeline Pilot. It uses the ALogP atom type codes, a set of 120 different categories into which atoms may fall. The use of ALogP types within the extended-connectivity fingerprint calculation is an experimental feature that you can try if you have a lot of experience with Pipeline Pilot. It is not covered in detail in this guide, as ECFPs and FCFPs are the most widely used and best understood.

Iteration to Generate Higher-Order Features

An iterative process is used to generate features that represent each atom in larger and larger structural neighborhoods. After each iteration, the new feature codes for the atoms are added to the set of features from all previous steps. When the desired neighborhood size is reached, the process is complete, and the set of all features is returned as the fingerprint. A visual interpretation of the process is shown below:

[Figure: Iterative fingerprint generation process]

This sample shows the features generated for a single atom: the carbon atom in the aromatic ring where the amide functional group is attached. At iteration 0 (that is, before iterating), the atom only has information about itself, encoded into its initial atom code. During the first iteration, it collects information from all of its immediate neighbors and generates a new code. That new code represents the presence of a molecular structure incorporating four atoms: the core atom and its immediate neighbors. This process is performed not only for this one atom, but for each atom in the molecule, so that all atoms have a new code representing the immediate neighborhood around them.

Note: A hashing scheme is used to generate the new code from the codes of an atom and its neighbors. It is not necessary to understand this scheme to successfully use extended-connectivity fingerprints. For more details, see Fingerprint Feature Code Generation Hashing Schemes.
For the second iteration, the atom repeats the process of collecting information from its neighbors and generates a new code. But this time, instead of using the initial atom codes for the atom and its neighbors, it uses the updated codes from iteration 1. The code generated from this step represents an even larger structure around the core atom: in this case, all atoms within two bonds of the core atom.
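The growth of the neighborhood with each iteration can be made concrete with a breadth-first search. This sketch only tracks which atoms a code covers; it does not compute any codes. The atom numbering is hypothetical (atom 0 is the ring carbon bearing the amide group, atoms 1 to 5 the other ring carbons, atom 6 the carbonyl carbon, 7 the oxygen, 8 the nitrogen) and does not follow the numbering in the guide's figures.

```python
from collections import deque


def atoms_within(adjacency, core, radius):
    """Return the atoms reachable from core in at most radius bonds."""
    distance = {core: 0}
    queue = deque([core])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the requested radius
        for nbr in adjacency[atom]:
            if nbr not in distance:
                distance[nbr] = distance[atom] + 1
                queue.append(nbr)
    return set(distance)


# Heavy-atom graph of benzamide under the assumed numbering.
benzamide = {
    0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5],
    5: [4, 0], 6: [0, 7, 8], 7: [6], 8: [6],
}

for iteration in range(3):
    print(iteration, sorted(atoms_within(benzamide, 0, iteration)))
# 0 [0]
# 1 [0, 1, 5, 6]
# 2 [0, 1, 2, 4, 5, 6, 7, 8]
```

After one iteration the code for atom 0 covers four atoms, and after two it covers every atom within two bonds, as the text describes.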

The number of iterations performed is determined by the maximum diameter of the neighborhoods requested. This diameter is displayed in the fingerprint name as an appended number. For example, FCFP_6 generates features around each atom up to a diameter (in bonds) of six, which requires three iterations. (Because each iteration increases the diameter of the neighborhood by two bonds, there are no odd-numbered fingerprints such as FCFP_1. Instead, the series of legal fingerprints is FCFP_0, FCFP_2, FCFP_4, FCFP_6, etc.)

Calculate Extended Connectivity Fingerprints

There are different ways to calculate extended-connectivity fingerprints for your molecular data. First, molecular learners (such as Learn Good Molecules) and clustering methods (such as Cluster Molecules) may list some extended-connectivity fingerprints by name in the parameter PredefinedSet. This is a predefined list of calculable properties useful for the learning or clustering. For example, by default, the Cluster Molecules component uses a functional class fingerprint of maximum diameter 4 (FCFP_4) for clustering.

[Figure: Molecular Fingerprints and Clustering]

A second method for calculating fingerprints is to request their calculation by name using the Custom Manipulator (PilotScript) component (e.g., calculate('FCFP_4');).

A final method for calculating extended-connectivity fingerprints (and other fingerprint types) is by using the Molecular Fingerprints component. This component and its parameters are explained more fully in the section Fingerprint Parameters.

Fingerprint Feature Code Generation Hashing Schemes

This section provides information about how the fingerprint feature codes are developed with hashing schemes. You do not need to know this information for general use of extended-connectivity fingerprints.

In the Morgan algorithm, a prime-number scheme is used to generate higher-order codes for each atom during the iteration. In this scheme, each different code value is assigned a prime number, and the new code is the product of the prime number of the parent atom with those of all its neighbors. The products can get very large, so at the end of each cycle, each unique product is replaced with a small integer that represents the atom class. This method guarantees that no two atoms in different structural neighborhoods ever get the same code. This guarantee of uniqueness is vital because the Morgan algorithm is preparing the molecule for storage in a database, where any confusion can lead to lost data.

In the extended-connectivity fingerprint process, this uniqueness is not vital. In fact, because all feature codes are mapped into an address space of 2^32 feature codes, there is always a slight risk that two different structural features will have the same code. Given the size of the space of feature codes, this risk is minimal, and even if it does occur, there is little effect on learning.
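A rough sense of the collision risk can be obtained with a little arithmetic. The sketch below assumes that feature codes are spread uniformly over the 2^32 address space, which is an idealization and not a documented property of the product's hash function; the function names are hypothetical.

```python
SPACE = 2 ** 32  # size of the feature-code address space


def expected_colliding_pairs(n_features, space=SPACE):
    """Expected number of feature pairs sharing a code, for n distinct features."""
    return n_features * (n_features - 1) / (2 * space)


def chance_of_clash_for_one(n_others, space=SPACE):
    """Chance that one given feature shares its code with any of n others."""
    return 1 - (1 - 1 / space) ** n_others


# A single molecule with a few hundred features.
print(expected_colliding_pairs(300))
# One feature compared against a catalog of a million distinct features.
print(chance_of_clash_for_one(10 ** 6))
```

Under this assumption, a molecule with 300 features has on the order of 1e-5 expected colliding pairs, and a single feature has a chance of roughly 2 in 10,000 of clashing with any of a million others, which is consistent with the statement that collisions are rare and have little effect on learning.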

This folding of features is done explicitly in Daylight fingerprints to reduce the fingerprint to a small size suitable for storage and manipulation in a binary array. Extended-connectivity fingerprints use a rapid hashing scheme, which has the additional advantage that codes from the hashing scheme are comparable across different molecules (something that is not possible with the Morgan-generated codes). Look at how a single iteration is performed:

[Figure: Generation of Atom Codes]

The molecule (with its original atom numbering) is shown on the top left, and the molecule (with atoms marked with the initial FCFP atom codes) on the top right. Look at the generation of the new code for atom 5:

- First, an array of numbers is generated that represents the local environment of the core atom. The array starts with a single number, the current atom code (16).
- Next, add two numbers to the array for each non-hydrogen attachment. The first of the two numbers is the bond type code for the bond to that attachment: 1 for a single bond, 2 for a double bond, 3 for a triple bond, and 4 for an aromatic bond. The second of the two numbers is the current atom type code of the neighbor.
- To avoid order-dependency in the attachment list, sort the attachments using their number pairs. In this case, the final order for the pairs is (1, 0), (4, 16), (4, 16).
- Finally, take the array of numbers and apply a hashing function to generate a single number, in this case, the number 203,667,720. This is the number that represents the four-atom feature centered on atom 5.

One way to think of this number is as the index of a bit in a large virtual bit array. A molecule containing this structural feature would have bit 203,667,720 on. Since most molecules have at most a few hundred features, the bits are usually stored as a list of on bits, rather than as actual on bits in a large, non-virtual bit array.

The final fingerprint is the collection of all features generated for each atom at each iteration level.
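The single-iteration steps above can be sketched in code. This is a minimal illustration under stated assumptions: CRC32 stands in for the product's hash function, which this guide does not specify, so the codes produced here will not match 203,667,720 or any other value from Pipeline Pilot. The atom numbering and the bond list for the benzoic acid amide are hypothetical (atom 0 plays the role of the guide's atom 5).

```python
import zlib
from struct import pack


def environment_array(code, attachments):
    """Core atom code followed by the sorted (bond type, neighbor code) pairs."""
    flat = [code]
    for bond_type, nbr_code in sorted(attachments):
        flat += [bond_type, nbr_code]
    return flat


def hash32(values):
    """Stand-in 32-bit hash (CRC32); the real hash function is not documented here."""
    return zlib.crc32(pack(f"<{len(values)}q", *values))


def iterate_once(codes, bonds):
    """One iteration over all atoms; bonds is a list of (atom_i, atom_j, bond_type)."""
    attachments = [[] for _ in codes]
    for i, j, bond_type in bonds:
        attachments[i].append((bond_type, codes[j]))
        attachments[j].append((bond_type, codes[i]))
    return [hash32(environment_array(c, a)) for c, a in zip(codes, attachments)]


def fingerprint(initial_codes, bonds, max_diameter):
    """Collect the codes from every iteration level into one feature list."""
    codes = list(initial_codes)
    features = set(codes)
    for _ in range(max_diameter // 2):  # each iteration adds two bonds of diameter
        codes = iterate_once(codes, bonds)
        features.update(codes)
    return sorted(features)


# Initial FCFP codes: six aromatic carbons (16), carbonyl C (0), O (1), N (3).
codes = [16] * 6 + [0, 1, 3]
bonds = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 4, 4), (4, 5, 4), (5, 0, 4),
         (0, 6, 1), (6, 7, 2), (6, 8, 1)]

print(environment_array(16, [(4, 16), (1, 0), (4, 16)]))  # [16, 1, 0, 4, 16, 4, 16]
print(fingerprint(codes, bonds, 0))                       # [0, 1, 3, 16]
```

Because the features from every level are accumulated, the result for a larger diameter always includes the features of the smaller diameters.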
For the benzoic acid amide shown above, you can display the feature codes. Read the file data\queries\benzoicacidamide.mol using an SD Reader and a Custom Manipulator (PilotScript) component configured with the following expression:

    calculate('fcfp_0','fcfp_2','fcfp_4');

Run the protocol and display the results in the Notepad Viewer. The results should look like this:

[Figure: Protocol results displayed in Notepad]

Normally, the fingerprint feature codes are not directly inspected, although they may become important if a particular feature is identified during learning. In this case, use the Learned Feature Filter to identify compounds with a particular feature or features.

Notice that fingerprints with larger diameters (such as FCFP_4) contain all the features present in the corresponding fingerprints of smaller diameter (such as FCFP_0 or FCFP_2). It is not necessary to include a series of such fingerprints, only the one with the largest diameter. This is how extended-connectivity fingerprints can contain features at a variety of levels of abstraction.

The features with negative signs are an artifact of the output procedure. Since the hash function uses all 32 bits in an integer, and most printing methods treat the first bit as a sign bit, some features are displayed as negative numbers.

Functional Class Fingerprints

Functional-class fingerprints (FCFPs) are a type of extended-connectivity fingerprint that use a simple, rapid, functional-class atom typing scheme for their initial atom codes. Each code is a number in the range [0, 63]. The initial code becomes the starting point for the extended-connectivity calculation.

The functional code is defined for each atom as described in the following C code. The final code is the logical OR of six different atomic feature bits. (If none of the features applies to a given atom, its code is zero.)

    code = 0;
    if (atom.IsAcceptor() > 0)          code |= 1;
    if (atom.IsDonor())                 code |= 2;
    if (atom.IsNegativeIonizable())     code |= 4;
    if (atom.IsPositiveIonizable())     code |= 8;
    if (atom.IsAromatic())              code |= 16;
    if (atom.IsHalogen())               code |= 32;

The function names are meant to be suggestive rather than definitive. For example, a precise estimation of whether an atom is ionizable requires a lengthy quantum-mechanical calculation. Our goal is simpler: the rapid partitioning of the atoms into general functional classes, for which an approximate method is satisfactory.

- IsAcceptor is a complicated method that depends on the connectivity, charge, and atom type. A true value is only possible for the following:
    atom.GetType() == Oxygen
    atom.GetType() == Nitrogen
    atom.GetType() == Sulfur
    atom.GetType() == Phosphorus
- IsDonor is a rapid test of whether an atom can be a hydrogen-bond donor. It returns true if the atom is oxygen or nitrogen and has one or more hydrogens attached.
- IsNegativeIonizable is true if the atom carries a negative charge or is an ionizable (acidic) oxygen atom.
- IsPositiveIonizable is true if the atom carries a positive charge or is a nitrogen with no hydrogens or sp2-hybridized neighbors.
- IsAromatic is true if the atom is aromatic by our definition (based on a Hückel 4n+2 rule).
- IsHalogen is true if the atom is a chlorine, fluorine, bromine, or iodine.

Reaction Fingerprints

Reaction fingerprints (RCFPs) are a type of extended-connectivity fingerprint that use reaction-specific information to determine the initial atom codes. The following contribute to the initial atom codes for RCFPs:

- Element type
- Charge
- Hybridization
- Whether the atom is a Reactant atom or Product atom
- Whether or not the atom is in the Reaction Site

The Reaction Site is perceived from the atom-atom mappings of a reaction. It includes atoms that are changed by the reaction and atoms attached to bonds that are changed by the reaction. Atoms without mappings are automatically included in the site, as they are removed from the reactant side and added to the product side.
The Highlight Reaction Site component can be used to show how the reaction site is being perceived. Here are the reaction sites of two different esterification reactions. Note how the changed atoms are very similar in both reactions while the inert regions are quite different:

[Figure: Two different esterification reactions with atoms in each reaction site highlighted]

Additionally, with RCFPs, only atoms within the Reaction Site can be bit centers. Neighboring non-site atoms are only considered at higher distances. This allows you to use the Distance parameter to configure how much of the non-site region to sample with the fingerprint. The two very different esterification reactions are indistinguishable using only the bit centers (RCFP_0), while the differences between the two show up at higher distances.

Fingerprint   Similarity
RCFP_0        1.0

Note: As a variant, an additional reaction fingerprint called QCFP can be calculated. The algorithm for calculating the initial atom code is the same as that for RCFP. As with RCFP, only atoms within the reaction site can be centers. QCFP differs in that it does not consider atoms outside the site at higher distances. This variant is not available from the Molecular Fingerprints component interface, but is available on demand as a calculable property. Because QCFPs do not explore outside the reaction site, they remain extremely specific at larger distances.

Fingerprint   Similarity
QCFP_0        1.0
QCFP_2        1.0
QCFP_4        1.0
QCFP_6        1.0
QCFP_8        1.0

Reaction Fingerprint Validation

Reaction Fingerprints (RCFPs) have been validated using several methods. One method is to analyze how similarities, clustering, and Bayesian learners perform on reaction datasets that have been tagged with descriptive keywords (e.g., alkylation, halogenation, etc.). These keywords can be used as the categories in a Bayesian categorical model. Here is an analysis of the leave-one-out cross-validation ROC scores for the 200+ keyword categories in a dataset of 70,000 metabolite reactions:

[Table: Fingerprint, EstXVAUC_Mean, and EstXVAUC_StdDev for four RCFP fingerprints, four ECFP fingerprints, and MDLRxnCenterKeys]

The RCFPs produce better ROC scores than either the MDLRxnCenterKeys or molecular features alone (ECFPs).

Similarity studies were also conducted in which the average pairwise similarities for reactions within the same category were compared with the average pairwise similarities with reactions outside the class. The Enrichment Factor was calculated as the average similarity within the class divided by the average similarity outside the class. The following chart shows the results for the reactions in the metabolite dataset. QCFPs and RCFPs tend to perform slightly better than MDLRxnCenterKeys, and are clearly better than ECFPs:

[Figure: Results for the reactions in the metabolite dataset]

In another similarity study, using a subset of ~70,000 reactions from the CIRX dataset representing 77 categories, different reaction fingerprints were used to find the 20 reactions most similar to each reaction in the subset, and then to calculate the percentage of those similar reactions that contain all the category keywords present in the query reaction. The following chart shows the results as the average calculated over all the reactions. In this case, the MDLRxnCenterKeys did a little better than either RCFPs or QCFPs, and all these fingerprints did clearly better than ECFPs, which do not include any reaction-specific features:

[Figure: Results as the average calculated over all the reactions]

Figuring Out the Fingerprint Name

The Molecular Fingerprints component has many options that control the type of fingerprint to generate. The fingerprint name varies, based on the options that are selected. It is easiest to try a set of options and then find out the corresponding name. However, there is a method to this naming, described as follows.

Fingerprint Types without an Encoded Name

For the following two values of the parameter Type, the AtomAbstraction, OutputType, and MaximumDistance parameters are not relevant, and the following names are used:

Type             Fingerprint Name
MDLPublicKeys    MDLPublicKeys
UserKeys         UserKeys

Encoded Fingerprint Names

All other Types of fingerprints have a name in the form XXFX_N, where the values of AtomAbstraction, Type, and OutputType determine the first, second, and fourth letters respectively (the third character is always F), while the MaximumDistance parameter determines the number following the underscore.

First Letter

The first letter of an encoded fingerprint name is determined by the AtomAbstraction parameter:

AtomAbstraction    First Letter
FunctionalClass    F
AtomType           E

AtomAbstraction    First Letter
ALogPCode          L
SYBYL              S
Reaction           R
UserAtomType       U

Second Letter

The second letter of an encoded fingerprint name is determined by the Type parameter:

Type                     Second Letter
ExtendedConnectivity     C
Path                     P
AtomEnvironment          E
HashedAtomEnvironment    H
MDLPublicKeys            NA - Fingerprint Name not encoded (see above)
UserKeys                 NA - Fingerprint Name not encoded (see above)

Third Letter

The third letter of an encoded fingerprint name is always F.

Fourth Letter

The fourth letter of an encoded fingerprint name is determined by the OutputType parameter. Fingerprint returns a list of the features present in the molecule, with duplicates removed, while Counts returns a list of the features present in the molecule, with duplicates retained; if a feature occurs more than once in a molecule, that bit value is included more than once in the output list.

OutputType     Fourth Letter
Fingerprint    P
Counts         C

Number Following the Underscore

The number following the underscore of an encoded fingerprint name is determined by the MaximumDistance parameter. For extended-connectivity fingerprints, this is the maximum diameter (in bond lengths) of the largest structure represented by the fingerprint. For path fingerprints, this is the maximum length of the path. For both, this is only a maximum; all bits at all lower levels are included. Note that this number is always even.

Examples

If you choose Path as the Type, ALogPCode as the AtomAbstraction, 4 as the MaximumDistance, and Fingerprint as the OutputType, the name is LPFP_4, and you call this fingerprint (starting from the left) an ALogPCode path-based fingerprint of length 4. FCFC_6 is a functional-class extended-connectivity fingerprint count up to diameter 6.

Note: For backward compatibility, if you choose path-based fingerprints and AtomType, only the element atom number is used, and not the full Daylight invariant, which also includes charge, mass, and connectivity.
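The naming rules in this section can be collected into a small lookup. This is a sketch of the convention as documented above, not code from the product; the function name `fingerprint_name` and its argument names are hypothetical.

```python
ABSTRACTION_LETTER = {
    "FunctionalClass": "F", "AtomType": "E", "ALogPCode": "L",
    "SYBYL": "S", "Reaction": "R", "UserAtomType": "U",
}
TYPE_LETTER = {
    "ExtendedConnectivity": "C", "Path": "P",
    "AtomEnvironment": "E", "HashedAtomEnvironment": "H",
}
OUTPUT_LETTER = {"Fingerprint": "P", "Counts": "C"}


def fingerprint_name(fp_type, abstraction=None, output_type=None, max_distance=None):
    """Build the property name from the Molecular Fingerprints parameter values."""
    if fp_type in ("MDLPublicKeys", "UserKeys"):
        return fp_type  # these two types do not use an encoded name
    return (ABSTRACTION_LETTER[abstraction] + TYPE_LETTER[fp_type] + "F"
            + OUTPUT_LETTER[output_type] + "_" + str(max_distance))


print(fingerprint_name("Path", "ALogPCode", "Fingerprint", 4))                  # LPFP_4
print(fingerprint_name("ExtendedConnectivity", "FunctionalClass", "Counts", 6)) # FCFC_6
print(fingerprint_name("MDLPublicKeys"))                                        # MDLPublicKeys
```

The first two calls reproduce the LPFP_4 and FCFC_6 examples given above.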

Fingerprint Options

Calculable property options are appended to a property name and start with the character #. An example is the option to include stereo: when stereochemistry is included in the extended-connectivity calculation, #S is appended, as in FCFP_6#S.

To illustrate options, consider the following protocol:

The molecule is alanine, which is shown with atom numbers:

[Figure: Alanine molecule]

The output looks like this when displayed in the Notepad Viewer:

[Figure: Protocol output displayed in Notepad Viewer]

If you add the option #S, you get the fingerprint with stereochemistry:

[Figure: Protocol output with new property displayed in Notepad Viewer]

In this case, the option changed the calculation of the fingerprint to give a different result. However, this is not always the case. Many options leave the fingerprint calculation unchanged and add the calculation of further properties. These additional properties offer information about the individual bits of the fingerprint.

For example, consider the output if you request a calculation of FCFP_6#F. This is a request for additional information about each feature bit; in this case, an example of a set of atoms in the molecule that illustrates that feature.

The output contains two new properties, FCFP_6 and FCFP_6#F:

[Figure: Protocol output with two new properties displayed in Notepad Viewer]

Each bit in the array of FCFP_6 has a corresponding member in the array of FCFP_6#F. The entry in FCFP_6#F is the set of atoms involved in generating the bit in FCFP_6. Thus, the option #F does not change the fingerprint output, but only controls the output of additional associated information.
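A property request such as FCFP_6#F combines an encoded fingerprint name with an option letter. The following Python sketch shows one way to split such a string into its parts. It is an illustration only: the class and function names are hypothetical, the pattern covers only the encoded XXFX_N names described earlier, and it assumes at most one option follows the # character.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Encoded names have the form XXFX_N; the letter sets are taken from the naming tables.
_ENCODED_NAME = re.compile(r"^([FELSRU])([CPEH])F([PC])_(\d+)$")


@dataclass(frozen=True)
class FingerprintProperty:
    base_name: str                # for example "FCFP_6"
    atom_abstraction_letter: str  # first letter
    type_letter: str              # second letter
    output_letter: str            # fourth letter
    max_distance: int             # number after the underscore
    option: Optional[str]         # text after '#', or None when no option is given


def parse_fingerprint_property(prop: str) -> FingerprintProperty:
    """Split a property name into its base fingerprint name and option (hypothetical helper)."""
    base, sep, option = prop.partition("#")
    match = _ENCODED_NAME.match(base)
    if match is None:
        raise ValueError(f"not an encoded fingerprint name: {base!r}")
    abstraction, fp_type, output, distance = match.groups()
    return FingerprintProperty(
        base_name=base,
        atom_abstraction_letter=abstraction,
        type_letter=fp_type,
        output_letter=output,
        max_distance=int(distance),
        option=option if sep else None,
    )


parsed = parse_fingerprint_property("FCFP_6#F")
```

Here parsed.base_name is FCFP_6 and parsed.option is F, which matches the two property names that appear in the output.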

A similar option is #A. Calculating FCFP_6#A gives the following output:

[Figure: Protocol output with new property displayed in Notepad Viewer]

In this case, the set is all atoms contained in any instance of a particular feature, rather than one example of the atoms in the feature, as reported by #F. Note how feature 0 is contained in three atoms (2, 3, and 6) because it was generated at different places in the molecule.

A useful option is #C. Calculating FCFP_6#C gives the following output:

[Figure: Protocol output with two new properties displayed in Notepad Viewer]

In this case, the associated information is a SMARTS string that describes the substructure obtained by excising the feature from the remainder of the molecule, with the attachment atoms shown as * atoms. Keep in mind that these are examples of structures that generated a particular bit, and are not definitions of a feature. Depending on the initial atom abstraction, ring closures, and other details of the generating process, different substructures may be examples of the same bit.

Another useful option is #D. Calculating FCFP_6#D gives the following output:

[Figure: Protocol output with new property displayed in Notepad Viewer]

In this case, the associated information is the diameter of a particular feature (or the length, for path-based fingerprints).

For extended-connectivity fingerprints, you do not get a bit for each atom at every level. Bits that are duplicates of other bits (where duplicate is defined as two features defined by the same atom set) are not included. Indeed, bit contains all of the information in the molecule (that is, contains every atom), so no new bits are generated at the next level. This avoids generating bits that merely duplicate information you already have elsewhere.
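The duplicate rule described above can be pictured with a toy example. The Python sketch below is not the fingerprint algorithm itself: it only keeps the first feature seen for each distinct atom set, using hand-written candidate features for a hypothetical three-atom chain (atoms 1-2-3). The function name and the data are assumptions made for illustration.

```python
def unique_features(candidates):
    """Keep the first candidate for each distinct atom set.

    candidates: iterable of (diameter, atom_indices) pairs, ordered by
    increasing diameter. Returns the retained (diameter, frozenset) pairs.
    """
    seen = set()
    kept = []
    for diameter, atoms in candidates:
        atom_set = frozenset(atoms)
        if atom_set in seen:
            # Same atom set as an earlier feature: treated as a duplicate.
            continue
        seen.add(atom_set)
        kept.append((diameter, atom_set))
    return kept


# Hand-written neighborhoods for a hypothetical chain 1-2-3, one per atom per level.
CHAIN_CANDIDATES = [
    (0, {1}), (0, {2}), (0, {3}),
    (2, {1, 2}), (2, {1, 2, 3}), (2, {2, 3}),
    (4, {1, 2, 3}), (4, {1, 2, 3}), (4, {1, 2, 3}),
]

kept = unique_features(CHAIN_CANDIDATES)
largest_diameter = max(diameter for diameter, _ in kept)
```

In this toy case six features survive and none of them has diameter 4: once a feature at diameter 2 already covers every atom, the larger neighborhoods repeat the same atom set and are dropped.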

This distance option also works with path-based fingerprints, as illustrated in the following example:

[Figure: Protocol output with path-based fingerprints displayed in Notepad Viewer]

These options do not work with all fingerprint types. Currently, only extended-connectivity and path-based fingerprints support them.

The #Z option outputs the index of the central atom associated with each bit (OutputBitCentralAtom):

[Figure: Protocol output with two new properties displayed in Notepad Viewer]

In this case, the parallel FCFP_6_Z array shows which atom is central to the bit present in FCFP_6. Atom 1 creates the 3 bit, Atom 2 creates the 0 bit, and so on. If more than one atom is associated with a particular bit, only the first atom associated with that bit is listed. Using Counts instead of Fingerprint as the OutputType preserves duplicate bits.

For a different view of the atoms central to each bit, use the #P option (AddBitsToCentralAtom):

[Figure: Protocol output showing bits added to central atom as atom properties in the HTML Molecular Table Viewer]

In this case, the correspondence between the atom and the associated fingerprint bits is made with an atom property.

A parameter called Options exposes the fingerprint options. This parameter is a list of options, each of which corresponds to one of the # options:

IncludeStereo: #S
OutputBitDistance: #D
OutputBitSubstructure: #C
OutputBitAllAtoms: #A
OutputBitFeatureAtoms: #F
OutputBitCentralAtom: #Z
AddBitsToCentralAtom: #P

[Figure: Options parameter for Molecular Fingerprints component]

Molecular Formula

This component calculates the formula of a molecule: a sequence of atomic symbols, each followed by the number of atoms of that element in the molecule. For example, the molecular formulas of the first 10 molecules in Asinex are as follows:

Molecular_Formula
C19H18N2O3
C18H18N2O2S2Cl2
C16H23NO3S2
C15H21NO3S2
C6H6N8O4
C24H28
C17H16N8O8
C22H17N3O2Cl2
C28H23N3O3

C13H16N2OS

Molecular Properties

This component calculates the following whole-molecule properties:

FormalCharge: Total formal charge of the molecule.
CoordDimension: Indicator for the atomic coordinates: 0 (all coordinates are zero), 2 (have X,Y coordinates), 3 (have X,Y,Z coordinates).
IsChiral: Flag to indicate whether the molecule exists only in the represented absolute stereo configuration or as a pair of enantiomers. This flag mirrors the Chiral flag in the MDL CTAB format.
BondDistance_Table: Calculates the number of bonds in the shortest path between each pair of atoms in the molecule.
AverageBondLength: Calculates the average bond length for the molecule based on the atomic coordinates.

Molecular Property Counts

This component is a type of molecular property calculator. Each of the following values is the number of the items described:

Num_Atoms: Heavy (non-hydrogen) atoms.
Num_Bonds: Bonds between heavy atoms.
Num_ExplicitAtoms: Heavy atoms and explicit hydrogens.
Num_ExplicitBonds: Bonds between any pair of atoms, including hydrogens.
Num_Hydrogens: Hydrogens, both implicit and explicit.
Num_ExplicitHydrogens: Explicit hydrogens.
Num_PositiveAtoms: Atoms with a positive charge.
Num_NegativeAtoms: Atoms with a negative charge.
Num_RingBonds: Bonds in a ring.
Num_RotatableBonds: Rotatable bonds, defined as single bonds between heavy atoms that are both not in a ring and not terminal (that is, connected to a heavy atom that is attached to only hydrogens). As a special case, amide C-N bonds are not rotatable.
Num_AromaticBonds: Bonds in aromatic ring systems.
Num_BridgeBonds: Bonds in bridgehead ring systems, defined as any rings that share more than one bond in common.
Num_SingleBonds: Number of single bonds between heavy atoms.
Num_DoubleBonds: Number of double bonds.
Num_TripleBonds: Number of triple bonds.
Num_AliphaticSingleBonds: Number of single bonds between heavy atoms that are not in aromatic rings.

Num_AliphaticDoubleBonds: Number of double bonds that are not in aromatic rings.
Num_Rings: Base rings, defined as the number of rings in the smallest set of smallest rings (SSSR).
Num_AromaticRings: Base rings that are aromatic.
Num_RingAssemblies: Ring assemblies, defined as the fragments remaining when all non-ring bonds are removed from the molecule. For example, naphthalene has one ring assembly, while biphenyl has two.
Num_Rings3: Number of rings of size 3.
Num_Rings4: Number of rings of size 4.
Num_Rings5: Number of rings of size 5.
Num_Rings6: Number of rings of size 6.
Num_Rings7: Number of rings of size 7.
Num_Rings8: Number of rings of size 8.
Num_Rings9Plus: Number of rings of size 9 or bigger.
Num_Chains: Unbranched chains needed to cover all the non-ring bonds in the molecule.
Num_ChainAssemblies: Chain assemblies, defined as the fragments remaining when all ring bonds are removed from the molecule.
Num_Fragments: Total fragments in the molecule; two pieces are fragments if none of their atoms are connected via a covalent bond.
Num_StereoAtoms: Atoms marked as EvenAtomStereo, OddAtomStereo, or UnknownAtomStereo.
Num_StereoBonds: Bonds marked CisBondStereo, TransBondStereo, or UnknownBondStereo.
Num_UnknownStereoAtoms: Atoms marked UnknownAtomStereo.
Num_UnknownStereoBonds: Bonds marked UnknownBondStereo.
Num_TrueStereoAtoms: Atoms that are internally perceived as having stereo and that are marked as EvenAtomStereo or OddAtomStereo.
Num_UnknownTrueStereoAtoms: Atoms that are internally perceived as having stereo and that are not marked as EvenAtomStereo or OddAtomStereo.
Num_PseudoStereoAtoms: Stereo atoms that are diametrically opposite each other in a ring system.
Num_UnknownPseudoStereoAtoms: Atoms that are internally perceived as having pseudo stereo and that are not marked with wedge bonds as EvenAtomStereo or OddAtomStereo.
Num_MesoStereoAtoms: Atoms that are true stereo centers in a molecule that, due to symmetry, is not chiral.
Num_EnhancedStereoAtoms: Atoms that are marked with EnhancedStereo (e.g. relative stereo groups from V3000 CTAB import).
Num_AtomClasses: Different atom classes from symmetry perception (excluding hydrogens). For example, benzene would have a value "1" and toluene would have a value "5".

Num_Macro_Chains: Chain records defined for macromolecules in PDB files.
Num_Macro_Residues: Residue records defined for macromolecules in PDB files.
Num_TerminalRotomers: Terminal groups such as -CF3, -CCl3, -COO, -NOO. A terminal rotomer is defined as either a non-terminal sp3 atom connected to three terminal atoms of the same type, or a non-terminal sp2 atom connected to two terminal atoms of the same type. Notice that groups such as CH3 and NH2 are not counted as terminal rotomers because the bond to the heavy atom is not considered terminal (the heavy atom is attached to only hydrogens). This property can be used to adjust the Num_RotatableBonds count, which includes bonds to terminal rotomers. For example, the Num_RotatableBonds count calculated for C6H5-CF3 is 1, and the Num_TerminalRotomers count is also 1. A modified number of rotatable bonds that excludes terminal rotomers can be calculated as Num_RotatableBonds - Num_TerminalRotomers using PilotScript.
Num_SpiroAtoms: A spiro atom is a linkage between two rings consisting of a single atom common to both. A free spiro atom is a linkage that constitutes the only union, direct or indirect, between the two rings. Only free spiro atoms are counted.
Num_BridgeHeadAtoms: A bridgehead atom connects a bridge to a ring.
Num_MetalAtoms: Atoms classified as metallic.
Num_SGroups: Number of MDL SGroups present in the molecule, as determined by the SGroup M STY lines.
Num_RepeatUnits: Number of repeat units present in the molecule. Repeat units are represented as monomers with associated repetition counts or ranges and connection types (Head to Tail, Head to Head, etc.). They are read from MDL SD or SKC files with SGroups or from Accord files.
Num_CustomData: Number of custom data objects present in the molecule. Custom data are text objects with specific coordinates which can be associated with molecules, atoms, bonds, or repeat units. They are read from MDL SD or SKC files with SGroups or from Accord files.
Num_PiBonds: Number of pi bonds and pi systems present in molecules such as metallocenes and other organometallic compounds. Pi bonds are read from Accord files.
Num_Superatoms: Number of superatoms present in the molecule. A superatom is an SGroup of type SUP. It is a group of atoms that are to be replaced by a single textual node when the molecule is depicted.
Num_Isotopes: Number of atoms that are marked with an isotope. This includes cases where the marked isotope matches natural abundance.
Num_QueryAtoms: Number of atoms that contain query features.
Num_QueryBonds: Number of bonds that contain query features.
Num_V3000Templates: Number of V3000 template fragments.

Num_RGroupFragments: Number of total fragments.

Molecular Weight

This component can calculate the molecular weight and mass of the input molecule and create new properties to hold the results. Molecular weight is calculated using the atomic weights of the individual atoms in the molecule. Molecular mass is calculated as the sum of the atomic masses, using the most common isotope of each atom.

Num H AcceptorDonors

This component calculates the number of hydrogen acceptors and/or donors and adds a separate property to the data record for each result.

Hydrogen Bond Acceptors are defined as heteroatoms (Oxygen, Nitrogen, Sulfur, or Phosphorus) with one or more lone pairs, excluding atoms with positive formal charges, amide and pyrrole-type Nitrogens, and aromatic Oxygen and Sulfur atoms in heterocyclic rings.

Hydrogen Bond Donors are defined as heteroatoms (Oxygen, Nitrogen, Sulfur, or Phosphorus) with one or more attached Hydrogen atoms.

Solubility

This component calculates aqueous solubility. It outputs the aqueous solubility expressed as logS, where S is the solubility in mol/l. The method used to estimate the solubility is the multiple linear regression model based on Electrotopological State indices published by Tetko et al. ["Estimation of Aqueous Solubility of Chemical Compounds Using E-State Indices", J. Chem. Inf. Comput. Sci., 2001, 41].

Solubility Model

Water solubility is calculated using a multiple linear regression model based on E-state keys published by Tetko et al. (Tetko, I., Tanchuk Yu. V., Kasheva T., Villa A., "Estimation of Aqueous Solubility of Chemical Compounds Using E-State Indices", J. Chem. Inf. Comput. Sci., 2001, 41).

The following plot shows the correlation between solubility values calculated by Pipeline Pilot using this model and the values reported in the paper using their final neural net model based on the E-state keys, for a set of test molecules used in the study.

[Figure: Correlation between solubility values]

Substructure Count from File

This component evaluates each molecule for the presence of the indicated substructure(s) using the queries found in a file. The number of times the substructure or substructures are found in the molecule is counted and written to a given property name.

The substructure or substructures are provided as MDL-format queries using the Source parameter. For example, you can use ISIS/Draw to sketch the molecule, select all, and export to a MOL file. You also provide a prefix (for example, Nitro). The component outputs a property named with the prefix followed by _Count (for example, Nitro_Count).

[Figure: Parameters for Substructure Count from File]

Substructure Count from Tag

This component evaluates each molecule for the presence of the indicated substructure(s) using queries that are tagged on the incoming data stream. The number of times the substructure or substructures are found in the molecule is counted and written to a given property name.

The substructure or substructures are provided as queries, tagged with a particular property name given in the QueryTag parameter. You also provide a prefix (for example, Nitro). The component outputs a property named with the prefix followed by _Count (for example, Nitro_Count).

[Figure: Parameters for Substructure Count from Tag]

Substructure Map

This component searches each molecule for the presence of one or more substructures. You can select different properties that you want to add to the property list. They indicate the number of matches and/or the atom and bond maps for each match.

NumQueries: Contains the total number of queries.
NumQueriesMapped: The number of queries that mapped.
QueriesMapped: Contains a list of the names of the mapped queries.

If SeparateQueryOutputs is True, the atom-to-atom mappings are contained in properties that begin with the query name and end with _Maps or _AllMapped. The former is an array of the individual mappings. Each mapping is a sequence of numbers containing the number of the target atom that the ith query atom maps onto. The latter is an array of all target atoms in any of the mappings, in no particular order.

If SeparateQueryOutputs is False, all mappings are placed in Query_Maps, and the list of all atoms in Query_AllMapped.

Similar properties can be output for the bond mappings. They are named _BondMaps or _AllBondsMapped for separate queries, or Query_BondMaps and Query_AllBondsMapped for all queries together in the same array.

Surface Area and Volume

This component calculates a variety of surface area and volume properties for each molecule. It calculates one or more of the following:

Molecular_SurfaceArea and Molecular_PolarSurfaceArea: Calculates the total surface area and/or polar surface area for each molecule using a 2D approximation.
Molecular_Volume: Calculates the 3D volume for each molecule using the current 3D coordinates. The component will fail if there are no 3D coordinates for the molecule. The 3D Coords and/or Minimize Molecule component can be used prior to the molecule volume calculation if no 3D coordinates are present for the molecules on the input stream.
Molecular_SASA, Molecular_PolarSASA, and Molecular_SAVol: Calculates the total solvent accessible surface area, the polar solvent accessible surface area, and the solvent accessible volume for each molecule using a 2D approximation. The polar solvent accessible surface area is defined as the sum of the solvent accessible surface area of all the selected polar elements, which can include N, O, P, and S. Solvent accessible surface area and solvent accessible volume are calculated assuming a solvent probe radius of 1.4 Angstroms.

Solvent Accessible Surface Area

The Surface Area and Volume component includes options for calculating solvent-accessible surface area and other related properties, including:

Solvent-accessible surface area (Molecular_SASA)
Polar solvent-accessible surface area (Molecular_PolarSASA)

Solvent-accessible volume (Molecular_SAVol)

All these quantities are calculated using models based on E-state keys as independent variables. The models require only 2D structures and hence are very fast. They were obtained by fitting solvent-accessible surface areas for 3D conformers of molecules from the NCI drug database. The following plot shows the correlation between the 3D solvent-accessible surface areas and the values calculated using the 2D approximation.

[Figure: Correlation between 3D solvent-accessible surface areas and calculated values using 2D approximation]

[Table: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations)]

Molecular Energy

This component calculates the energy of a molecule, either in its current configuration or after a rapid minimization procedure. It can calculate the following values:

Energy: Gives the energy of the molecule's current 3D conformation. It calculates the point energy of the current conformation.
Minimized_Energy: Gives the energy after a fast minimization procedure. It takes a bit longer to calculate, as it performs a quick minimization procedure before calculating the energy.
Strain_Energy: Gives the point strain energy. Strain_Energy is the difference between Energy and Minimized_Energy.

PiSystem Properties

This component calculates several properties pertaining to each pi system in the molecule. When more than one pi system is present in a molecule, the resultant properties are arrays.

PiSystem_Hapticity: The number of atoms in the pi system.

PiSystem_ElectronCount: The number of pi electrons in the pi system.
PiSystem_Charge: The charge that is delocalized across the pi system.
PiSystem_Radical: The radical count that is delocalized across the pi system.

Chapter 6 Manipulators

A data manipulator is a component that alters data records as they are passed through it. The Chemistry collection includes manipulators that modify structures in a variety of ways. You can normalize sets of molecules before comparing them. For optimal display characteristics, you can employ manipulators for structural alignment and for 2D layout.

The current set of manipulator components in the Chemistry collection includes:

3D Conformations
3D Coordinates
2D Coords
2D Coords (Advanced)
Add Bond Orders
Add Hydrogens
Aggregate Fragments
Align Molecules from Tag
Align Molecules using Substructure
Center Molecule
Clean Molecule
Convert Fingerprint
Deprotonate Bases
Generate Fragments
Generate Salts
Identify Salts
Identify Salts from Tag
Ionize Molecule at pH
Keep Largest Fragment
Merge Molecules
Minimize Molecule
Normalize Structure (Cheshire)
Protonate Acids
Remove Hydrogens
Remove Salts
Separate Fragments
Standardize Molecule
Strip Salts
Strip Salts from Tag
Tile Fragments

2D Depiction Algorithms

The 2D Coords component includes advanced parameters that give you more control over the depiction algorithm. You can use 2D templates for ring and bridge assemblies with predefined coordinates. A set of several thousand templates is provided by SciTegic in /data/templates2d/scitegic. You can define your own templates and place them in data/templates2d/user.

Additional options that try to resolve bumps that could be present in the 2D structures include:

Shorten bond length of terminal bonds
Flip torsions (single torsions and pairs of torsions)
Bend torsions (single torsions and pairs of torsions)
Rotate terminal atoms
Check for bond crossing

2D depiction algorithms for several molecules are shown below:


Standardize Molecule

Standardize Molecule is a molecular manipulator component that provides a number of useful actions for taking molecules from different sources and correcting non-uniform features. By enabling the Track Actions Taken parameter, you can monitor which actions resulted in changes made to the input molecule.

The available actions in the Standardize Molecule component include:

StandardizeStereo: Sets or repairs the stereo on a molecule to a standard form using the coordinates as the guide. Atoms perceived as true stereo atoms that have no stereochemical markings (UnknownAtomStereo, EvenAtomStereo, or OddAtomStereo) are set to UnknownAtomStereo. Atoms with stereochemical markings that are not true stereo atoms are set to NoAtomStereo. 2D or 3D coordinates are not used in this process. Similarly, bonds perceived as true stereo double bonds that have no stereochemical markings (UnknownBondStereo, CisBondStereo, or TransBondStereo) are set to UnknownBondStereo. Bonds with stereochemical markings that are not true stereo bonds are set to NoBondStereo. Again, 2D or 3D coordinates are not used in this process.
StandardizeCharges: Sets the charges on a molecule to a standard form. For example, nitro groups are detected and converted to a standard form. (A complete definition of the standardization rules is available later in this topic.)
CenterMolecule: Translates a molecule so its geometric center lies at the origin.
RemoveSingleAtomFragments: Removes any fragments that consist of only a single heavy atom.
KeepSmallestFragment: Keeps only the smallest fragment in the molecule.
KeepLargestFragment: Keeps only the largest fragment in the molecule.
MakeNon[H]Atoms[C]Atoms: Converts all non-hydrogen atoms in the molecule to Carbon.
MakeNon[C,H]Atoms[Q]Atoms: Converts all non-carbon, non-hydrogen atoms in the molecule to the Q query atom type.
MakeNon[H]Atoms[A]Atoms: Converts all non-hydrogen atoms in the molecule to the A query atom type.
MakeAllBondsSingle: Converts all bonds in the molecule to Single bonds.
ClearCoordinates: Sets all x, y, z coordinates to zero.
FixCoordinateDimension: Sets the coordinate dimension (0D, 2D, 3D) based on the atomic coordinates.
StraightenTripleBonds: Finds atoms with triple bonds with non-linear geometry and fixes them so that the bond angles are 180 degrees.
ClearMolecule: Deletes all atoms and bonds in the molecule, keeping the molecule object in the data record.
RemoveMolecule: Deletes the molecule object from the data record.
ClearStereo: Sets all atoms and bonds to NoStereo.
ClearEnhancedStereo: Removes all relative stereo groupings (e.g. MDL V3000 Enhanced Stereo).
ClearUnknownStereo: Sets all atoms and bonds marked UnknownStereo to NoStereo.

ClearUnknownAtomStereo: Sets all atoms marked UnknownStereo to NoStereo.
ClearUnknownCisTransBondStereo: Sets all bonds marked UnknownStereo to NoStereo.
ClearCisTransBondStereo: Sets all bonds marked CisStereo or TransStereo to UnknownStereo.
ClearCharges: Sets all formal charges to zero.
SetStereoFromCoordinates: Uses 2D coordinates and up/down bond markings (or 3D coordinates) to assign the stereochemistry of the atoms or bonds. Typically, this is done by readers or by molecule from text components. Occasionally components may create molecules that need to have their stereo reperceived.
RepositionStereoBonds: Repositions the stereo bond markings, trying to find the best bond to mark as a wedge bond for each stereo atom.
FixDirectionOfWedgeBonds: Checks the wedge bonds in the molecule to ensure that the wedge is drawn with the stereo atom at the narrow end of the wedge. Any wedge bond for which there is a stereo atom at the wide end, and no stereo atom at the narrow end, is reversed to point in the other direction. A separate option, Invert Wedge Bond When Changing Direction, controls inverting the bond stereo (up or down) when changing the direction of the wedge bond.
NeutralizeBondedZwitterions: Converts directly bonded zwitterions (positively charged atom bonded to negatively charged atom, A+B-) to the neutral representation (A=B).
ClearSGroupData: Clears any SGroup information from the molecule.
ClearRepeatUnits: Clears any repeat unit information from the molecule, leaving only the molecule with the monomers included, but without any repetition or linking information.
ClearCustomData: Clears any custom data information from the molecule.
ClearPiBonds: Clears any pi bonds and pi systems from the molecule.
ClearHighlightColors: Clears any highlight colors from atoms and bonds.
ClearQueryInfo: Deletes all query information from atoms and bonds.
ClearAtomLabels: Clears labels from atoms.
ClearBondLabels: Clears labels from bonds.
ClearUnusualValence: Clears any atom valence query features and resets all implicit hydrogen counts to their standard values.
ClearIsotopes: Clears all isotope markings from atoms.
LocalizeMarushRAtomsOnRings: R atoms bonded to the centers of rings are converted to R atoms at all open positions on the ring.
InvalidateCustomDataCoordinates: Clears any coordinates in custom data objects.
InvalidateRepeatUnitCoordinates: Clears any coordinates in repeat unit objects, which forces the locations of their brackets to be re-perceived.

InvalidateSGroupCoordinates: Clears any coordinates in MDL SGroup objects.

Standardize Charges

The Standardize Molecule component has a parameter value called Standardize Charges that uses molecular connectivity to set the charges on heteroatoms.

Rules for Heteroatoms (Standardize Charges)

The rules use the following values, evaluated for each heteroatom:

AtomIndex: Index of the core atom.
NumSingle, NumDouble, NumTriple, NumAromatic: A count of the number of bonds of the given type to the central atom.
NumSingleToOxygen, NumDoubleToOxygen, NumAromaticToOxygen: Number of single, double, or aromatic bonds to oxygen atoms.
NumAtt: Number of bonds attached to the central atom.
Oxy: Index of an attached oxygen (if any).
OxyNumAtt: Number of attachments to the attached oxygen (if any).
Bond<n>: nth attached bond.
Att<n>: nth attached atom.

Transformation Method

The following C-style pseudocode describes the transformation method:

    switch (AtomType(AtomIndex))
    {
    case Nitrogen:
        {
            // Simple quaternary
            if (NumSingle == 4)
            {
                SetCharge(AtomIndex, +1);
                // A singly attached oxygen takes the balancing negative charge
                if ((NumSingleToOxygen == 1) && (OxyNumAtt == 1))
                    SetCharge(Oxy, -1);
            }
            // Quaternary-style aromatic
            else if ((NumSingle == 1) && (NumAromatic == 2))
            {
                SetCharge(AtomIndex, +1);
                if ((NumSingleToOxygen == 1) && (OxyNumAtt == 1))
                    SetCharge(Oxy, -1);
            }
            // Nitro
            else if ((NumSingle == 2) && (NumDouble == 1) && (NumSingleToOxygen == 1))
            {
                SetCharge(AtomIndex, +1);
                if (OxyNumAtt == 1)
                    SetCharge(Oxy, -1);
            }
            // Double-bonded quaternary
            else if ((NumSingle == 2) && (NumDouble == 1))
                SetCharge(AtomIndex, +1);
        }
        break;
    case Oxygen:
        {
            // Simple quaternary
            if (NumAtt == 3)
                SetCharge(AtomIndex, +1);
            else if (NumAtt == 2)
            {
                // Single-single is OK
                if ((BondType(AtomIndex, Bond1) == SingleBond) &&
                    (BondType(AtomIndex, Bond2) == SingleBond))
                    break;
                // X=O-C is charged
                if ((BondType(AtomIndex, Bond1) == SingleBond) &&
                    (BondType(AtomIndex, Bond2) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att1) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
                // Same test with the two bonds in the other order
                if ((BondType(AtomIndex, Bond2) == SingleBond) &&
                    (BondType(AtomIndex, Bond1) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att2) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
            }
        }
        break;
    case Sulfur:
        {
            // Simple quaternary
            if (NumAtt == 3)
                SetCharge(AtomIndex, +1);
            else if (NumAtt == 2)
            {
                // Single-single is OK
                if ((BondType(AtomIndex, Bond1) == SingleBond) &&
                    (BondType(AtomIndex, Bond2) == SingleBond))
                    break;
                // X=S-C is charged
                if ((BondType(AtomIndex, Bond1) == SingleBond) &&
                    (BondType(AtomIndex, Bond2) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att1) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
                // Same test with the two bonds in the other order
                if ((BondType(AtomIndex, Bond2) == SingleBond) &&
                    (BondType(AtomIndex, Bond1) == DoubleBond))
                {
                    if (AtomType(AtomIndex, Att2) == Carbon)
                    {
                        SetCharge(AtomIndex, +1);
                        break;
                    }
                }
            }
        }
        break;
    }

3D Coordinates, 3D Conformations and Minimize Energy

The 3D Coords and 3D Conformations components provide a quick way of generating 3D structures. The resulting coordinates are checked to make sure that there are no bumps between atoms, but electrostatics is not taken into account. The algorithm used to generate 3D conformations changes only the torsion angles, keeping bond lengths and bond angles fixed.

The Minimize Energy component uses the Clean force-field described in "Receptor Surface Models. 1. Definition and Construction", M. Hahn, J. Med. Chem., 1995, 38(12).

Generate Fragments

The Generate Fragments component extracts one or more types of fragment from the molecule, as specified by the user. The available types of fragments are:

Ring Assemblies: Contiguous ring systems
Bridge Assemblies: Contiguous ring systems that share two or more bonds
Rings: Individual rings
Chain Assemblies: Contiguous chains
BemisMurcko Assemblies: Bemis-Murcko assemblies are contiguous ring systems plus chains that link two or more rings, as defined in "The Properties of Known Drugs. 1. Molecular Frameworks", Guy W. Bemis and Mark A. Murcko, J. Med. Chem. 1996, 39.

The different fragments that this component can generate are illustrated for the following molecule:

Chain Assemblies

Sets of contiguous chain atoms, including any ring atom that terminates a chain:

Rings

Individual rings

Ring Assemblies

Contiguous ring systems, including fused rings and bridge systems

Bridge Assemblies
Contiguous ring systems that share two or more bonds.

Bemis-Murcko Assemblies
Ring systems and any chain that links two or more rings. Any other chains are clipped from the molecule.

The component has two parameters that control how attachment points are handled when generating the fragments:

IncludeAlphaAtoms: Set to True to include the first atom outside the fragment as an attachment point (Z atom).
MarkAttachmentAtoms: Set to True to replace the atoms at the point of attachment with Z atoms.

The following figures illustrate the use of these options for ring assembly fragments.

[Figure: Ring Assemblies (IncludeAlphaAtoms = True)]
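The fragment definitions above can be approximated on a plain molecular graph. The following Python sketch is not the component's implementation; it assumes a molecule is given only as a list of bonds between integer atom indices, and all function names are hypothetical. Ring assemblies are taken as atoms connected through ring bonds, the Bemis-Murcko assembly is obtained by repeatedly clipping terminal atoms, and the alpha atoms of a fragment are its neighbors outside the fragment.

```python
from collections import defaultdict


def build_adjacency(bonds):
    """Map each atom index to the set of atoms bonded to it."""
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def is_ring_bond(adj, a, b):
    """A bond is a ring bond if b stays reachable from a without using it."""
    stack, seen = [a], {a}
    while stack:
        cur = stack.pop()
        for nxt in adj[cur]:
            if cur == a and nxt == b:
                continue  # do not walk along the bond being tested
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def ring_assemblies(bonds):
    """Group atoms connected through ring bonds into contiguous ring systems."""
    adj = build_adjacency(bonds)
    ring_adj = defaultdict(set)
    for a, b in bonds:
        if is_ring_bond(adj, a, b):
            ring_adj[a].add(b)
            ring_adj[b].add(a)
    assemblies, seen = [], set()
    for start in sorted(ring_adj):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            cur = stack.pop()
            if cur not in component:
                component.add(cur)
                stack.extend(ring_adj[cur] - component)
        seen |= component
        assemblies.append(component)
    return assemblies


def bemis_murcko_atoms(bonds):
    """Clip terminal atoms until only rings and ring-linking chains remain."""
    adj = {a: set(n) for a, n in build_adjacency(bonds).items()}
    terminal = [a for a, n in adj.items() if len(n) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in adj:
            continue  # already clipped
        for nbr in adj.pop(atom):
            adj[nbr].discard(atom)
            if len(adj[nbr]) <= 1:
                terminal.append(nbr)
    return set(adj)


def alpha_atoms(bonds, fragment):
    """First atoms outside the fragment (candidate attachment points)."""
    adj = build_adjacency(bonds)
    return {n for a in fragment for n in adj[a]} - set(fragment)


# Hypothetical molecule: two three-membered rings joined through atom 3,
# with a two-atom side chain (atoms 7 and 8) on the first ring.
BONDS = [(0, 1), (1, 2), (2, 0),
         (2, 3), (3, 4),
         (4, 5), (5, 6), (6, 4),
         (0, 7), (7, 8)]

print(ring_assemblies(BONDS))
print(sorted(bemis_murcko_atoms(BONDS)))       # [0, 1, 2, 3, 4, 5, 6]
print(sorted(alpha_atoms(BONDS, {0, 1, 2})))   # [3, 7]
```

In this sketch the side chain is clipped from the Bemis-Murcko assembly while the linker atom is kept, and the alpha atoms of the first ring are the linker atom and the first side-chain atom.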

[Figure: Ring Assemblies (MarkAttachmentAtoms = True)]

Deprotonate Bases, Protonate Acids and Ionize Molecule at pH

Deprotonate Bases and Protonate Acids perform a quick analysis of the molecule to identify simple acids and bases and neutralize them. Acid functional groups are defined as oxygen or sulfur atoms with a negative formal charge, attached to only one, uncharged, atom. Basic functional groups are defined as nitrogen atoms with a positive formal charge and one or more attached hydrogen atoms.

The Ionize Molecule at pH component uses the pKa framework to identify ionization sites and calculate their pKa values. It then ionizes the sites based on the calculated pKa values and the user-defined pH.
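The two neutralization rules can be written down directly. The sketch below is only an illustration under stated assumptions: atoms are plain dictionaries with hypothetical keys (element, charge, h_count for attached hydrogens), bonds connect heavy atoms only, and the function names are invented here. The second function is the standard Henderson-Hasselbalch relation, included as one plausible way to relate a pKa value to a pH; the guide does not say how the component applies the calculated pKa values.

```python
def find_neutralizable_sites(atoms, bonds):
    """Return (sites to protonate, sites to deprotonate) as atom indices."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    to_protonate, to_deprotonate = [], []
    for i, atom in enumerate(atoms):
        nbrs = neighbors[i]
        # Acid group: O or S with a negative formal charge,
        # attached to only one atom, which must be uncharged.
        if (atom["element"] in ("O", "S") and atom["charge"] < 0
                and len(nbrs) == 1 and atoms[nbrs[0]]["charge"] == 0):
            to_protonate.append(i)
        # Basic group: N with a positive formal charge and at least one H.
        elif (atom["element"] == "N" and atom["charge"] > 0
                and atom["h_count"] >= 1):
            to_deprotonate.append(i)
    return to_protonate, to_deprotonate


def fraction_deprotonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a site in its deprotonated form."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def atom(element, charge=0, h_count=0):
    return {"element": element, "charge": charge, "h_count": h_count}


# Acetate, CH3-C(=O)-O(-): the charged oxygen qualifies as an acid site.
acetate = [atom("C", h_count=3), atom("C"), atom("O"), atom("O", charge=-1)]
print(find_neutralizable_sites(acetate, [(0, 1), (1, 2), (1, 3)]))

# Nitromethane, CH3-N(+)(=O)-O(-): O(-) sits on a charged atom and N(+)
# carries no hydrogen, so neither rule applies and the group is left alone.
nitro = [atom("C", h_count=3), atom("N", charge=1), atom("O"),
         atom("O", charge=-1)]
print(find_neutralizable_sites(nitro, [(0, 1), (1, 2), (1, 3)]))

# A hypothetical site with pKa 5.0 at pH 7.0 is almost fully deprotonated.
print(round(fraction_deprotonated(5.0, 7.0), 4))
```

The nitro example shows why the rules require an uncharged neighbor and an attached hydrogen: charge-separated groups are not treated as acids or bases.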

Chapter 7 Filters

A filter is a component that identifies and diverts specific subsets of records. It provides a powerful way to customize a pipeline so that some subsets of your data are processed differently from others. Each filter screens data according to criteria that depend on the nature of the component. For example, filters can remove data records with a property value in a given range, or remove duplicate records. Filters typically have an input port, a pass port (displayed in green), and a fail port (displayed in red).

The Chemistry collection provides filters that act on molecular data and are capable of evaluating specific chemical and structural features of compound records. You can filter using substructural queries, similarity to reference molecules, molecular appropriateness filters (such as Lipinski's rule), molecule uniqueness, and fragment properties.

The current set of Filter components in the Chemistry collection includes:

Bad Isotope Filter
Bad Stereo Filter
Bad Triple Bond Filter
Bad Valence Filter
Bump Check Filter
Check and Normalize Structure
HTS Filter
Lipinski Filter
Most Frequent Fragments
Organic Filter
Query Features Filter
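The pass port and fail port behavior can be pictured as splitting one record stream into two. The Python sketch below is an illustration only, not the Lipinski Filter component: the record keys (MW, LogP, HBD, HBA), the function names, and the choice of tolerating at most one violation are assumptions, and the two records are made-up values. The thresholds are the commonly quoted rule-of-five limits.

```python
# Commonly quoted rule-of-five limits (upper bounds).
LIPINSKI_LIMITS = {"MW": 500.0, "LogP": 5.0, "HBD": 5, "HBA": 10}


def lipinski_violations(record):
    """Count how many rule-of-five limits a record exceeds."""
    return sum(1 for key, limit in LIPINSKI_LIMITS.items()
               if record[key] > limit)


def run_filter(records, predicate):
    """Route each record to the pass list or the fail list."""
    passed, failed = [], []
    for record in records:
        (passed if predicate(record) else failed).append(record)
    return passed, failed


# Hypothetical records with invented property values.
records = [
    {"Name": "cpd-1", "MW": 320.4, "LogP": 2.1, "HBD": 2, "HBA": 5},
    {"Name": "cpd-2", "MW": 712.9, "LogP": 6.3, "HBD": 4, "HBA": 12},
]

# Assumption: a record passes when it has at most one violation.
passed, failed = run_filter(records, lambda r: lipinski_violations(r) <= 1)
print([r["Name"] for r in passed])   # ['cpd-1']
print([r["Name"] for r in failed])   # ['cpd-2']
```

Keeping the predicate separate from the routing step mirrors how any filter criterion can feed the same pair of output ports.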

Chapter 8 Search and Similarity

The Search and Similarity components perform database-style searching of molecules over pipelined data, removing the need to pre-load a molecular database to search. Search and Similarity components in the Chemistry collection include:

Find Novel Fragments
Find Novel Molecules
Remove Duplicate Molecules
Substructure Filter from File
Substructure Filter from Tag

In addition, the Data Modeling collection contains components for partitioning data into equal-sized groups and for performing similarity searches for a set of target compounds against a set of reference compounds using Tanimoto similarity.

Molecular Similarity (Tanimoto, etc.)

The Molecular Similarity (Tanimoto, etc.) component calculates similarity values for each target molecule with respect to one or more reference molecules using molecular fingerprint properties (ECFP, FPFP, etc.). This component can calculate several different similarity coefficients, the most common being the Tanimoto similarity coefficient, which is defined by the expression:

$$\text{Tanimoto} = \frac{SA}{SA + SB + SC}$$

where:

SA = Number of bits defined in both the target and the reference
SB = Number of bits defined in the target but not the reference
SC = Number of bits defined in the reference but not the target

The Tanimoto similarity ranges from zero (there are no common bits between the reference and the target molecules) to one (the reference and the target molecules have exactly the same bits). For more details, see Molecular Similarity (Tanimoto, etc.) in the Data Modeling Component Collection User Guide.
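As a worked illustration of the SA, SB, and SC counts, the following Python sketch computes the Tanimoto coefficient for two fingerprints represented as sets of "on" bit identifiers. The function name and the set representation are assumptions made for the example, as is the choice to return 0.0 when both fingerprints are empty; the component's own handling of that case is not described here.

```python
def tanimoto(target_bits, reference_bits):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    target, reference = set(target_bits), set(reference_bits)
    sa = len(target & reference)   # bits in both target and reference
    sb = len(target - reference)   # bits in the target only
    sc = len(reference - target)   # bits in the reference only
    total = sa + sb + sc
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return sa / total if total else 0.0


target = {1, 2, 3, 4}
reference = {3, 4, 5}

# SA = 2 (bits 3, 4), SB = 2 (bits 1, 2), SC = 1 (bit 5): 2 / (2 + 2 + 1)
print(tanimoto(target, reference))   # 0.4
print(tanimoto(target, target))      # 1.0
print(tanimoto({1, 2}, {3, 4}))      # 0.0
```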

Chapter 9 Database Content

Database Content components can be used to access content information from DiscoveryGate Web Service (DGWS) and other public Chemistry Web Services such as PubChem and NCI. A valid license key is needed to use the DGWS components. The license key is defined as a global in the Pipeline Pilot Administration Portal.

The Chemistry Web Services accessed by the Database Content components are:

DiscoveryGate Web Service: Molecule to Name, Molecule from Name, Identity, Similarity, Substructure Searches, Activity Searches, Reaction Searches, Content information
NCI/CADD: Molecule to Name, Molecule from Name
PubChem: Molecule from Name, Identity, Similarity, Substructure Searches, Molecular Properties
ChemSpider: Molecule from Name, Molecule from InChI Key
eMolecules: Molecule from Name
ChemExper: Molecule from Name

Depending on your DGWS license, you have access to the following databases:

Data Source  Name                                            Type
ACD          Available Chemicals Directory                   Molecule
SCD          Screening Compounds Directory                   Molecule
MDDR         MDL Drug Data Report                            Molecule
NCI          National Cancer Institute Databases             Molecule
CMC          Comprehensive Medicinal Chemistry               Molecule
TOX          Toxicity Database                               Molecule
CIRX         ChemInform Reaction Library                     Reaction
DJSM         Derwent Journal of Synthetic Methods            Reaction
ORGSYN       ORGSYN Database                                 Reaction
SPORE        Solid-Phase Organic Reactions                   Reaction
REFLIB       The Reference Library of Synthetic Methodology  Reaction
SPRESI       Storage and Retrieval of Chemical Structure     Reaction
             Information (SPeicherung und REcherche
             Strukturchemischer Information)

Most Database Content components use the Web Service (SOAP) component internally to access the Chemistry Web Services through their corresponding WSDL. An example of a component that connects to DGWS to retrieve structures based on a chemical name is shown below:

[Figure: Subprotocol that uses the Web Service (SOAP) component]

The internal Web Service (SOAP) components are parameterized to call specific WSDL methods, setting any required parameters from data in the input stream or from user-specified parameters in the wrapping components. The following are the parameter settings needed to get molecules by name using DGWS:

[Figure: Parameterization of the Web Service (SOAP) component to call a DGWS method through a WSDL file]

The DGWS components are configured to retrieve the content data as a hierarchical data record, preserving the structure and relationships of the data in the content databases. The hierarchical data is processed to extract each leaf node as a separate record, including all properties of the parent nodes all the way up to the root node.
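The leaf extraction described above can be sketched as a recursive walk that carries the accumulated parent properties down to each leaf. This Python example is an assumption-laden illustration, not the DGWS components' code: the node layout (a dictionary with "properties" and "children" keys), the function name, and the sample data are all hypothetical, and a child property is assumed to override a parent property of the same name.

```python
def flatten_leaves(node, inherited=None):
    """Return one flat record per leaf, including all ancestor properties."""
    properties = dict(inherited or {})
    properties.update(node.get("properties", {}))  # child overrides parent
    children = node.get("children", [])
    if not children:
        return [properties]
    records = []
    for child in children:
        records.extend(flatten_leaves(child, properties))
    return records


# Hypothetical hierarchical record: one compound with two supplier entries.
content = {
    "properties": {"CompoundID": "X-001", "Source": "ACD"},
    "children": [
        {"properties": {"Supplier": "Supplier A"},
         "children": [{"properties": {"PackSize": "1 g"}},
                      {"properties": {"PackSize": "5 g"}}]},
        {"properties": {"Supplier": "Supplier B"}},
    ],
}

for record in flatten_leaves(content):
    print(record)
```

Each printed record carries the root-level CompoundID and Source alongside the leaf's own values, so the three leaves yield three self-contained records.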

[Figure: Hierarchical data record with content retrieved from DGWS]

Components and Example Protocols

The components are organized into four subfolders: Fetch Information, Search and Similarity, Utilities, and Viewers.

Fetch Information

Check Chemistry Web Services
Find Compound Activity
Find Compound ID
Find Molecule from InChI Key
Find Molecule from Name
Find Molecule from Name (Batched)
Find Molecule Names
Find Molecule by Activity
Get Content from Compound ID
Get Content from RXN MDLNumber
Get PubChem Properties
Get Service Information
Get Supplier Information
Get Vocabulary

Search and Similarity

Find Molecule in Reactions
Identity Search
Reactions Search
Similarity Search
Substructure Search

Utilities

This subfolder contains internal components used by other Database Content components, such as individual Molecule to/from Name converters for the different Chemistry Web Services, and other utilities.

Viewers

This subfolder contains one component, Content Viewer, which displays a table with DGWS Data Source information for each input molecule, with interactive links to retrieve other content information.

Database Content Component Examples

01 Get Molecules from Names
02 Get Names from Molecules
03 Find Compound IDs
04 Find Biological Activities
05 Find Similar Molecules
06 Find Molecules as Substructures
07 Search Reaction Databases
08 Get Content from DiscoveryGate Web Service
09 Filter Molecules by Procurement Data
10 View Content from DiscoveryGate Web Service

Web Services Utilities Examples

Check Response Times of Chemistry Web Services
Get DiscoveryGate Web Service Information
Get DiscoveryGate Web Service Vocabulary

Web Port Examples

08 Find Molecules by Name: This example opens a form that allows you to retrieve Data Sources and Procurement information for a given compound using DGWS:

[Figure: WebPort example using DGWS to find structure and suppliers from a chemical name]

09 JDraw Sketcher and DGWS Search: In this example, you can sketch a structure using JDraw and perform a Substructure, Similarity, or Identity search in DGWS.


Comprehensive Chemoinformatics since Web-based, client/server, and toolkit approaches. Native Oracle (cartridge) and Microsoft technology. CambridgeSoft Solutions CambridgeSoft Research Informatics Louis Culot Executive Director, Research Informatics Division Informatics Overview ChemDraw since 1986. Comprehensive Chemoinformatics since 1998.

More information

Creating Questions in Word Importing Exporting Respondus 4.0. Importing Respondus 4.0 Formatting Questions

Creating Questions in Word Importing Exporting Respondus 4.0. Importing Respondus 4.0 Formatting Questions 1 Respondus Creating Questions in Word Importing Exporting Respondus 4.0 Importing Respondus 4.0 Formatting Questions Creating the Questions in Word 1. Each question must be numbered and the answers must

More information

Physical Chemistry Final Take Home Fall 2003

Physical Chemistry Final Take Home Fall 2003 Physical Chemistry Final Take Home Fall 2003 Do one of the following questions. These projects are worth 30 points (i.e. equivalent to about two problems on the final). Each of the computational problems

More information

Manual Railway Industry Substance List. Version: March 2011

Manual Railway Industry Substance List. Version: March 2011 Manual Railway Industry Substance List Version: March 2011 Content 1. Scope...3 2. Railway Industry Substance List...4 2.1. Substance List search function...4 2.1.1 Download Substance List...4 2.1.2 Manual...5

More information

VCell Tutorial. Building a Rule-Based Model

VCell Tutorial. Building a Rule-Based Model VCell Tutorial Building a Rule-Based Model We will demonstrate how to create a rule-based model of EGFR receptor interaction with two adapter proteins Grb2 and Shc. A Receptor-monomer reversibly binds

More information

Appendix B Microsoft Office Specialist exam objectives maps

Appendix B Microsoft Office Specialist exam objectives maps B 1 Appendix B Microsoft Office Specialist exam objectives maps This appendix covers these additional topics: A Excel 2003 Specialist exam objectives with references to corresponding material in Course

More information

ADRIANA.Code and SONNIA. Tutorial

ADRIANA.Code and SONNIA. Tutorial ADRIANA.Code and SONNIA Tutorial Modeling Corticosteroid Binding Globulin Receptor Activity Molecular Networks GmbH Computerchemie July 2008 http://www.molecular-networks.com Henkestr. 91 91052 Erlangen

More information

mylab: Chemical Safety Module Last Updated: January 19, 2018

mylab: Chemical Safety Module Last Updated: January 19, 2018 : Chemical Safety Module Contents Introduction... 1 Getting started... 1 Login... 1 Receiving Items from MMP Order... 3 Inventory... 4 Show me Chemicals where... 4 Items Received on... 5 All Items... 5

More information

Application Note. U. Heat of Formation of Ethyl Alcohol and Dimethyl Ether. Introduction

Application Note. U. Heat of Formation of Ethyl Alcohol and Dimethyl Ether. Introduction Application Note U. Introduction The molecular builder (Molecular Builder) is part of the MEDEA standard suite of building tools. This tutorial provides an overview of the Molecular Builder s basic functionality.

More information

Web Search of New Linearized Medical Drug Leads

Web Search of New Linearized Medical Drug Leads Web Search of New Linearized Medical Drug Leads Preprint Software Engineering Department The Jerusalem College of Engineering POB 3566, Jerusalem, 91035, Israel iaakov@jce.ac.il Categories and subject

More information

Introducing GIS analysis

Introducing GIS analysis 1 Introducing GIS analysis GIS analysis lets you see patterns and relationships in your geographic data. The results of your analysis will give you insight into a place, help you focus your actions, or

More information

So I have an SD File What do I do next? Rajarshi Guha & Noel O Boyle NCATS & NextMove So<ware

So I have an SD File What do I do next? Rajarshi Guha & Noel O Boyle NCATS & NextMove So<ware So I have an SD File What do I do next? Rajarshi Guha & Noel O Boyle NCATS & NextMove Soonal Mee>ng, Boston 2015 What do you want to do? Tasks to be considered Searching for structures Managing

More information

Esri UC2013. Technical Workshop.

Esri UC2013. Technical Workshop. Esri International User Conference San Diego, California Technical Workshops July 9, 2013 CAD: Introduction to using CAD Data in ArcGIS Jeff Reinhart & Phil Sanchez Agenda Overview of ArcGIS CAD Support

More information

Chem 253. Tutorial for Materials Studio

Chem 253. Tutorial for Materials Studio Chem 253 Tutorial for Materials Studio This tutorial is designed to introduce Materials Studio 7.0, which is a program used for modeling and simulating materials for predicting and rationalizing structure

More information

ProMass Deconvolution User Training. Novatia LLC January, 2013

ProMass Deconvolution User Training. Novatia LLC January, 2013 ProMass Deconvolution User Training Novatia LLC January, 2013 Overview General info about ProMass Features Basics of how ProMass Deconvolution works Example Spectra Manual Deconvolution with ProMass Deconvolution

More information

Moving into the information age: From records to Google Earth

Moving into the information age: From records to Google Earth Moving into the information age: From records to Google Earth David R. R. Smith Psychology, School of Life Sciences, University of Hull e-mail: davidsmith.butterflies@gmail.com Introduction Many of us

More information

ArcGIS Pro: Essential Workflows STUDENT EDITION

ArcGIS Pro: Essential Workflows STUDENT EDITION ArcGIS Pro: Essential Workflows STUDENT EDITION Copyright 2018 Esri All rights reserved. Course version 6.0. Version release date August 2018. Printed in the United States of America. The information contained

More information

Ligand Scout Tutorials

Ligand Scout Tutorials Ligand Scout Tutorials Step : Creating a pharmacophore from a protein-ligand complex. Type ke6 in the upper right area of the screen and press the button Download *+. The protein will be downloaded and

More information

Building Inflation Tables and CER Libraries

Building Inflation Tables and CER Libraries Building Inflation Tables and CER Libraries January 2007 Presented by James K. Johnson Tecolote Research, Inc. Copyright Tecolote Research, Inc. September 2006 Abstract Building Inflation Tables and CER

More information

Structure Input and Search Documentation

Structure Input and Search Documentation Structure Input and Search Documentation www.infochem.de Version 1.10 March 2016 Dr. Troll-Str. 14 Landsberger Str. 408 82194 Gröbenzell 81241 München Tel: +89 58 30 02 Tel: +89 58 30 02 Fax: +89 58 03

More information

BIOVIA ENHANCED STEREOCHEMICAL REPRESENTATION WHITE PAPER

BIOVIA ENHANCED STEREOCHEMICAL REPRESENTATION WHITE PAPER BIOVIA ENHANCED STEREOCHEMICAL REPRESENTATION WHITE PAPER THE CHALLENGE Applied synthetic chemistry has been placing increasing emphasis in recent years on stereochemistry. In the pharmaceutical industry,

More information

OF ALL THE CHEMISTRY RELATED SOFTWARE

OF ALL THE CHEMISTRY RELATED SOFTWARE ChemBioOffice Ultra 2010 - A Great Benefit for Academia by Josh Kocher, Illinois State University OF ALL THE CHEMISTRY RELATED SOFTWARE that I have used in both an industrial and academic setting, ChemBioOffice

More information

Understanding Your Spectra Module. Agilent OpenLAB CDS ChemStation Edition

Understanding Your Spectra Module. Agilent OpenLAB CDS ChemStation Edition Understanding Your Spectra Module Agilent OpenLAB CDS ChemStation Edition Notices Agilent Technologies, Inc. 1994-2012, 2013 No part of this manual may be reproduced in any form or by any means (including

More information

Universities of Leeds, Sheffield and York

Universities of Leeds, Sheffield and York promoting access to White Rose research papers Universities of Leeds, Sheffield and York http://eprints.whiterose.ac.uk/ This is an author produced version of a paper published in Organic & Biomolecular

More information

Preparing Spatial Data

Preparing Spatial Data 13 CHAPTER 2 Preparing Spatial Data Assessing Your Spatial Data Needs 13 Assessing Your Attribute Data 13 Determining Your Spatial Data Requirements 14 Locating a Source of Spatial Data 14 Performing Common

More information

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner Table of Contents Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner Introduction... 2 CSD-CrossMiner Terminology... 2 Overview of CSD-CrossMiner... 3 Features

More information

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom verview D-QSAR Definition Examples Features counts Topological indices D fingerprints and fragment counts R-group descriptors ow good are D descriptors in practice? Summary Peter Gedeck ovartis Institutes

More information

- Some properties of elements can be related to their positions on the periodic table.

- Some properties of elements can be related to their positions on the periodic table. 180 PERIODIC TRENDS - Some properties of elements can be related to their positions on the periodic table. ATOMIC RADIUS - The distance between the nucleus of the atoms and the outermost shell of the electron

More information

Canonical Line Notations

Canonical Line Notations Canonical Line otations InChI vs SMILES Krisztina Boda verview Compound naming InChI SMILES Molecular equivalency Isomorphism Kekule Tautomers Finding duplicates What s Your ame? 1. Unique numbers CAS

More information