Packing_Similarity_Dendrogram.py

Summary:
This command-line script compares the packing of a set of input structures of a molecule (polymorphs, co-crystals, solvates and hydrates). An all-to-all comparison of the structures is performed, considering only the heaviest component in each structure, and a packing-similarity dendrogram (tree) is constructed using hierarchical clustering. The dendrogram shows the similarity between groups of structures and how these groups relate to one another.

Requirements:
CSD Python API (v. 1.0 or later), matplotlib and standard Python packages.

Usage and Output:
Running

    python Packing_Similarity_Dendrogram.py -h

will show the help text:

    usage: Packing_Similarity_Dendrogram.py [-h] [-m similarity_matrix.txt]
                                            [-ns 25] [-nm 15] [-o]
                                            [--allow_molecular_differences]
                                            [--clustering_type {complete,single,average}]
                                            [-s] [-ct 0.5] [-at 25] [-dt 0.25]
                                            input_file

    Packing_Similarity_Dendrogram.py - Construct a dendrogram for an input set
    of structures based on packing-similarity analysis

    positional arguments:
      input_file            Set of structures to perform analysis on
                            [.mol2/cif/res/ind].

    optional arguments:
      -h, --help            show this help message and exit
      -m similarity_matrix.txt, --matrix similarity_matrix.txt
                            NumPy matrix containing existing packing similarity
                            results.
      -ns 25, --n_structures 25
                            Number of structures to take from input set.
      -nm 15, --n_molecules 15
                            Size of molecular packing shell to use for analysis
                            (must be consistent with input matrix, if used).
      -o                    Flag for whether to save packing similarity results
                            (text file and mol2 overlays).
      --allow_molecular_differences
                            Flag for whether to allow for molecular differences
                            between structures (e.g. for salts).
      --clustering_type {complete,single,average}
                            Type of clustering to employ
      -s, --strip           Strip all terminal atoms and alkyl chains, up to any
                            hetero atom (O, N, S) or cyclic atom. This cuts the
                            molecule down to core structural features and may
                            reveal more general structural similarities,
                            including those between molecules with different
                            conformations.
      -ct 0.5, --conf_tol 0.5
                            RMSD threshold for considering two conformations to
                            be the same (when merging at level 1).
      -at 25, --angle_tol 25
                            Tolerance for angles (in degrees) used by packing
                            similarity.
      -dt 0.25, --dist_tol 0.25
                            Fractional tolerance for distances (0.0-1.0) used by
                            packing similarity.

Basic usage (in a command prompt) is:

    python Packing_Similarity_Dendrogram.py input_file

where input_file should be in a format recognised by the CSD Python API (e.g. cif, res, mol2, gcd or ind).

The default output of the tool consists of figures (as .png files) of the packing-similarity dendrogram and a heat map of the packing similarity between the structures. The file names are prefixed with the stem of the input file (i.e. if roy.gcd is the input, then the similarity dendrogram will be called roy_packing_similarity_tree.png). If a large set of structures is supplied, the top N structures can be selected using the -ns option.

Using the -o option will result in overlays being saved for each comparison (as .mol2 files). The matrix of similarities is also saved as a raw NumPy matrix (i.e. as roy_similarity_matrix.txt), and can be read back in (with -m) to skip the packing-similarity analysis if, for example, a different clustering algorithm is desired or part of the script has been changed.

The --allow_molecular_differences option can be used when comparing crystal structures of closely related molecules, e.g. salts and free forms.

The -s option will strip all terminal atoms and carbon atom chains up to hetero atoms (e.g. the methyl of a methoxy will be removed), which may be useful for identifying more coarse-grained similarity that ignores small changes in the periphery of the molecule. A new cif is created containing the stripped molecule and analysis is performed using this file.
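
As an illustration of the save-and-reuse workflow described above, the following minimal sketch builds the two command lines involved: a first run with -o to save the similarity matrix, and a second run that reads it back with -m and switches the clustering type. The helper name rerun_commands is hypothetical, and the sketch assumes the matrix file is written to the current working directory; only the flags and the file-naming convention are taken from the text above.

```python
from pathlib import Path

SCRIPT = "Packing_Similarity_Dendrogram.py"


def rerun_commands(input_file, clustering_type="complete"):
    """Return argument lists for a first run that saves results and a
    second run that reuses the saved similarity matrix.

    Hypothetical helper: the matrix file name follows the
    <stem>_similarity_matrix.txt convention quoted in the guide, and is
    assumed (not confirmed) to be written to the working directory.
    """
    stem = Path(input_file).stem  # "roy.gcd" -> "roy"
    matrix_file = f"{stem}_similarity_matrix.txt"

    # First run: full packing-similarity analysis, saving matrix and overlays.
    first_run = ["python", SCRIPT, "-o", input_file]

    # Second run: skip the analysis and only redo the clustering.
    second_run = [
        "python", SCRIPT,
        "-m", matrix_file,
        "--clustering_type", clustering_type,
        input_file,
    ]
    return first_run, second_run


if __name__ == "__main__":
    for command in rerun_commands("roy.gcd"):
        print(" ".join(command))
```

Each list can be passed to subprocess.run from the standard library, or the printed lines can be typed directly at a command prompt. If -nm was changed in the first run, the same value should be given in the second, since the help text requires the shell size to be consistent with the input matrix.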
Note that disordered experimental structures present in the input database will cause problems when the -s option is selected and should be removed from the file.

The remaining options control the packing-similarity settings, such as the number of molecules, thresholds etc. Matches of only one molecule between structures may not necessarily correspond to good agreement between two conformations (as the distance and angle thresholds can be more coarse-grained than an RMSD tolerance). The -ct option can therefore be used to specify an RMSD tolerance for whether two conformations are matched and, in turn, whether two clusters that have only one molecule in common should be merged at that level.

Understanding the Dendrogram:
An example dendrogram (produced using CBZdataset.ind and related files; see supporting material of Cryst. Growth Des., 2009, 9, 1869-1888) is shown below in Figure 1. To produce this dendrogram, the script compares all 50 structures in the input to each other. For a default cluster size of 15, the match can range from 0, which indicates a completely different conformation and no similarity in packing, to 15, where the two structures are isostructural.

In comparing the structures, different linkage criteria can be used. The default behaviour is to link structures based on the smallest difference between them (referred to as single-linkage clustering). Structures are initially grouped together based on their best matches, such that if A matches B with 15/15 and C with 14/15, and B matches C with 15/15, then A, B and C are all grouped together at level 15. Groups are then merged with the group with which they share the best overlay, based on a single structure in each group. This means that if D has an 8/15 molecule match with B, then the group A, B, C will join with D at level 8, even if A and C have smaller similarities with D.

At the highest level (e.g. 15 in the example below), isostructural crystal structures are listed together (e.g. CBZDBF18, CBZDBF19). At lower levels, where groups or structures share a similarity, the parent groups are drawn one level higher (with blue dots) and then join at the level at which they match. For example, structures 37 and 4 at the bottom of the plot have 5/15 molecules in common, so both are drawn at level 6 and then merge at level 5.
This group of 37 and 4 then shares a similarity with 32, 09 and 30, as well as all other structures apart from 31, at level 3, so all of them merge.

Alternative clustering modes can be invoked by using --clustering_type, which takes one of three arguments: single, complete and average. Single-linkage clustering is the default discussed above. Complete-linkage clustering will instead only join two clusters at the level where all the structures in the two clusters match, which is equivalent to joining clusters at the worst or lowest level connecting them. Figure 2 shows the same database as Figure 1 using this approach, which gives more definite clustering but will also hide some similarities due to the hierarchical nature of the clustering.
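
The effect of the three linkage criteria can be sketched on the A, B, C, D example used above. The code below is a generic agglomerative-clustering illustration in plain Python, not the script's own implementation: the A-D and C-D match counts are invented for the example, "average" is assumed to be an unweighted mean over all cross-group pairs, and the level-1 conformation check controlled by -ct is not modelled.

```python
from itertools import combinations
from statistics import mean

# Match counts out of 15. A-B, A-C, B-C and B-D follow the example in the
# text; A-D and C-D are invented values for illustration only.
MATCHES = {
    frozenset("AB"): 15,
    frozenset("AC"): 14,
    frozenset("BC"): 15,
    frozenset("AD"): 5,
    frozenset("BD"): 8,
    frozenset("CD"): 6,
}

# single: best cross-group match; complete: worst; average: unweighted mean.
LINKAGE = {"single": max, "complete": min, "average": mean}


def group_similarity(group1, group2, linkage):
    """Similarity between two groups under the chosen linkage criterion."""
    scores = [MATCHES[frozenset((a, b))] for a in group1 for b in group2]
    return LINKAGE[linkage](scores)


def cluster(labels, linkage="single"):
    """Repeatedly merge the two most similar groups until one remains.

    Returns a list of (level, members) tuples, one per merge, where level
    is the similarity at which the merge happens.
    """
    groups = [frozenset([label]) for label in labels]
    merges = []
    while len(groups) > 1:
        g1, g2 = max(
            combinations(groups, 2),
            key=lambda pair: group_similarity(pair[0], pair[1], linkage),
        )
        level = group_similarity(g1, g2, linkage)
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
        merges.append((level, "".join(sorted(g1 | g2))))
    return merges


if __name__ == "__main__":
    for name in ("single", "complete", "average"):
        print(name, cluster("ABCD", name))
```

With these numbers, single linkage joins D to the A, B, C group at level 8 (its best match, with B), complete linkage delays that merge to level 5 (its worst match, with A), and average linkage produces non-integer levels.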

Figure 1: Example dendrogram based on a database of carbamazepine solid forms using the default single-linkage clustering.

Figure 2: Example dendrogram based on a database of carbamazepine solid forms using complete-linkage clustering.

The final option for clustering type is to merge clusters based on their average packing-similarity agreement. The result for carbamazepine is shown in Figure 3. With average-linkage clustering, two clusters can join at non-integer values and therefore the plot may become more complex visually. There are several alternative schemes for linking clusters (e.g. weighting the average based on the cluster sizes) that could also be implemented.

Figure 3: Example dendrogram based on a database of carbamazepine solid forms linking clusters based on their average packing similarity.

Caveats:
The tool considers only the heaviest component in each crystal structure to enable comparison of multi-component forms with pure forms. Results for Z > 1 systems may not fully reflect differences between structures, as the best match is retained by default in the packing-similarity analysis.