Generation of crystal structures using known crystal structures as analogues

Similar documents
Generating Small Molecule Conformations from Structural Data

Analyzing Molecular Conformations Using the Cambridge Structural Database. Jason Cole Cambridge Crystallographic Data Centre

Validation of Experimental Crystal Structures

Supplementary Information for Evaluating the. energetic driving force for co-crystal formation

Exploring symmetry related bias in conformational data from the Cambridge Structural Database: A rare phenomenon?

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner

Ligand Scout Tutorials

Packing_Similarity_Dendrogram.py

Molecular Aggregation

Performing a Pharmacophore Search using CSD-CrossMiner

Crystal Structure Prediction using CRYSTALG program

Applications of the CSD to Structure Determination from Powder Data

Ground-state Structure and Dynamics

(+-)-3-Carboxy-2-(imidazol-3-ium-1-yl)- propanoate

Introduction to Spark

Joana Pereira Lamzin Group EMBL Hamburg, Germany. Small molecules How to identify and build them (with ARP/wARP)

Crystal structure prediction is one of the most challenging and

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

Investigating crystal engineering principles using a data set of 50 pharmaceutical cocrystals

STATIC & DYNAMIC CHARACTERISTICS OF MEASUREMENT SYSTEM

Nitrile Groups as Hydrogen-Bond Acceptors in a Donor-Rich Hydrogen-Bonding Network. Supplementary Information

Data Mining with the PDF-4 Databases. FeO Non-stoichiometric Oxides

Screening for cocrystals of succinic acid and 4-aminobenzoic acid. Supplementary Information

Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports

User Guide for LeDock

The PhilOEsophy. There are only two fundamental molecular descriptors

Degrees of Freedom in Regression Ensembles

85. Geo Processing Mineral Liberation Data

Studies on 3D Fusion Reactions in TiDx under Ion Beam Implantation

BUDE. A General Purpose Molecular Docking Program Using OpenCL. Richard B Sessions

GAISE Framework 3. Formulate Question Collect Data Analyze Data Interpret Results

Comparative Network Analysis

85. Geo Processing Mineral Liberation Data

organic papers Malonamide: an orthorhombic polymorph Comment

Organometallics & InChI. August 2017

Chapter 3 Engineering Solutions. 3.4 and 3.5 Problem Presentation

ISTITUTO NAZIONALE DI FISICA NUCLEARE

Simulation of Protein Crystal Nucleation

Probability and Statistics

CSD Conformer Generator User Guide

Database Speaks. Ling-Kang Liu ( 劉陵崗 ) Institute of Chemistry, Academia Sinica Nangang, Taipei 115, Taiwan

CSD. CSD-Enterprise. Access the CSD and ALL CCDC application software

Supporting Online Material for

BLAST: Target frequencies and information content Dannie Durand

arxiv: v1 [math.ca] 7 Apr 2019

CSD Conformer Generator User Guide

Hydrogen bonding in oxalic acid and its complexes: A database study of neutron structures

What is Classical Molecular Dynamics?

Adaptive Heterogeneous Computing with OpenCL: Harnessing hundreds of GPUs and CPUs

1.b What are current best practices for selecting an initial target ligand atomic model(s) for structure refinement from X-ray diffraction data?!

Softwares for Molecular Docking. Lokesh P. Tripathi NCBS 17 December 2007

Information-Based Similarity Index

The complexity of factoring univariate polynomials over the rationals

CONTRIBUTION OF WEAK INTERMOLECULAR INTERACTIONS IN 3-ACETYL COUMARIN DERIVATIVES

CAMBRIDGE STRUCTURAL DATABASE SYSTEM 2010 RELEASE WORKED EXAMPLES

MOLECULAR DYNAMICS SIMULATION OF HETEROGENEOUS NUCLEATION OF LIQUID DROPLET ON SOLID SURFACE

research papers Intramolecular hydrogen bonds: common motifs, probabilities of formation and implications for supramolecular organization

VIRTUAL LATTICE TECHNIQUE AND THE INTERATOMIC POTENTIALS OF ZINC-BLEND-TYPE BINARY COMPOUNDS

CHEM 121 Introduction to Fundamental Chemistry. Summer Quarter 2008 SCCC. Lecture 2

Analysis of 2x2 Cross-Over Designs using T-Tests

Supplementary Figure 1 Experimental setup for crystal growth. Schematic drawing of the experimental setup for C 8 -BTBT crystal growth.

Statistics of Radioactive Decay

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

Dynamic characteristics Pertain to a system where quantities to be measured vary rapidly with time.

Assignment 2: Conformation Searching (50 points)

Biologically Relevant Molecular Comparisons. Mark Mackey

Hunting for Anomalies in PMU Data

Supplementary Information: Construction of Hypothetical MOFs using a Graph Theoretical Approach. Peter G. Boyd and Tom K. Woo*

AB INITIO PREDICTIONS OF STRUCTURES AND DENSITIES OF ENERGETIC SOLIDS

Prentice Hall Mathematics, Geometry 2009 Correlated to: Connecticut Mathematics Curriculum Framework Companion, 2005 (Grades 9-12 Core and Extended)

CSD. Unlock value from crystal structure information in the CSD

A two-dimensional Cd II coordination polymer: poly[diaqua[l 3-5,6-bis(pyridin-2-yl)pyrazine-2,3- dicarboxylato-j 5 O 2 :O 3 :O 3,N 4,N 5 ]cadmium]

Study of Ni/Al Interface Diffusion by Molecular Dynamics Simulation*

20 Unsupervised Learning and Principal Components Analysis (PCA)

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

RNA protects a nucleoprotein complex against radiation damage

Density Curves and the Normal Distributions. Histogram: 10 groups

First Principles Calculation of Defect and Magnetic Structures in FeCo

Modeling Mutagenicity Status of a Diverse Set of Chemical Compounds by Envelope Methods

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

PREDICTION OF THE CRYSTAL STRUCTURE OF BYNARY AND TERNARY INORGANIC COMPOUNDS USING SYMMETRY RESTRICTIONS AND POWDER DIFFRACTION DATA

ECLT 5810 Data Preprocessing. Prof. Wai Lam

The Analysis of the Equilibrium Cluster Structure in Supercritical Carbon Dioxide

IMPROVING THE ACCURACY OF RIETVELD-DERIVED LATTICE PARAMETERS BY AN ORDER OF MAGNITUDE

Probable Factors. Problem Statement: Equipment. Introduction Setting up the simulation. Student Activity

Random Coattention Forest for Question Answering

Space Group & Structure Solution

A.DE NINNO, A. FRATTOLILLO, A. RIZZO. ENEA C.R. Frascati,Via E. Fermi 45, Frascati (Rome), Italy E. DEL GIUDICE

Algebra 1 Correlation of the ALEKS course Algebra 1 to the Washington Algebra 1 Standards

Commentary on candidate 9 evidence (Refractive Index)

Supporting information for Polymer interactions with Reduced Graphene Oxide: Van der Waals binding energies of Benzene on defected Graphene

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Conformational Searching using MacroModel and ConfGen. John Shelley Schrödinger Fellow

Supporting Information

Overview. More counting Permutations Permutations of repeated distinct objects Combinations

b = (9) Å c = (7) Å = (1) V = (16) Å 3 Z =4 Data collection Refinement

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

Schrodinger ebootcamp #3, Summer EXPLORING METHODS FOR CONFORMER SEARCHING Jas Bhachoo, Senior Applications Scientist

Land-Line Technical information leaflet

Statistics for Managers using Microsoft Excel 6 th Edition

Transcription:

Supporting information Volume 72 (2016) Supporting information for article: Generation of crystal structures using known crystal structures as analogues Jason C. Cole, Colin R. Groom, Murray G. Read, Ilenia Giangreco, Patrick McCabe, Anthony M. Reilly and Gregory P. Shields

Supporting information, sup-1 S1. Shape match score probability table The USR shape match algorithm returns shape match values in the range 0 (no match) to 1 (exact match). The generation of candidate crystal structures may involve other forms of scored choice, for example conformer selection scored by log probability. To aid combined scoring, the USR shape match scores were converted to log probabilities by interpolation in a pre-calculated table. The table maps shape match values in 0.01 increments to the log probability of getting or exceeding that shape match value when a molecule, of specified atom count, is shape matched against another molecule of any size. Shape match value frequency data was collected by shape matching every molecule in 1/20 th of the shape database against every molecule in a different 1/20 th of the shape database. The log probability tables were calculated from the frequency data. CSD entries in the same family (those that had the same first 6 letters of their CSD refcode) were excluded from comparison. See the associated excel spread sheet for the complete table of log probabilities. The distribution of shape match probabilities can vary with the number of atoms in the molecules being compared. Normally very high shape match scores (> 0.95) are extremely rare, but for low atom count molecules there is a higher than normal chance of getting a high match score, and high atom count molecules have a higher than normal chance of getting a low match score. There are entries in the table for molecules with between 1 and 20 atoms. Above 20 atoms, the probability distribution was not seen to vary significantly. Figure S1 shows how these probability distributions vary for a number of different atom counts.

0.00 0.04 0.08 0.12 0.16 0.20 0.24 0.28 0.32 0.36 0.40 0.44 0.48 0.52 0.56 0.60 0.64 0.68 0.72 0.76 0.80 0.84 0.88 0.92 0.96 1.00 Acta Cryst. (2016). B72, doi:10.1107/s2052520616006533 Supporting information, sup-2 1.2 1 0.8 0.6 0.4 1 atom 3 atoms 9 atoms 20+ atoms 0.2 0 Figure S1 Probability distribution (y axis) of shape match scores (x axis) for various atom counts. S2. Structure Scoring S2.1. Inter-molecular score The inter-molecular score S(inter) for a crystal structure is calculated on a per-molecule basis using inter-molecular atom-atom interaction Buckingham potentials: V B (r) = Ae kr C (1) r 6 To keep the potential finite ranged we calculate the approximate long range r value (r d ) where V(r d ) = 10 3 and smoothly switch off V beyond this value (e.g. 11.6 Å for carbon-carbon interactions), over a range of 1 Å so that V(r) = V B (r) 1 2 (1 + cos((r r d)π)) r d r r d + 1 (2) r > r = 0 d + 1 To prevent the well-known divergence of V B (r) at short range, we linearly extrapolate from the point of inflection (r i ) on the repulsive wall, located between the local maximum at short range and the well minimum, where V B has slope V B (r i ) so that V(r) = V B (r i ) + V B (r i )(r r i ) r r i (3)

Supporting information, sup-3 The parameters A, k and C, based on the Unimol inter-molecular force-field (Filippini & Gavezzotti, 1993 and Gavezzotti & Filippini, 1994), were tuned to improve their ability to replicate CSD crystal structures. The performance of a set of parameters was evaluated by minimising 100-200 CSD crystals that used only those parameters, using a drift index (Gavezzotti, 2011) after minimisation as a badness score. These sets of parameters were then optimised with Limited Memory BFGS (Liu & Nocedal, 1989) to minimise this badness score using numeric gradients. S3. Effect of refcode family on space group In the CSD, crystal structure entries are placed into refcode families when the underlying molecular structure of the component in the lattice is equivalent. This means that redeterminations of crystal structures at different experimental conditions, polymorphic structures and repeat references to a single structure in the literature are contained within a single family. The reference codes (refcodes) for the distinct entries then differ by a two digit number appended to the reference. For example, structures of the compound sulphanilamide have refcodes SULAMD, SULAMD01, SULAMD03 etc. Naturally, later codes are often added to the database later in time, when new studies are published. Consequently our conjecture is that later refcodes tend to reflect more extensive studies on compounds than initial studies. In Figure S2, we see the distribution of the space groups in the CSD of the last refcode in refcode families broken down by the count of distinct refcodes in the family for rare and common space groups. The distribution shows a clear trend towards more space group diversity in later family members. One speculative but plausible hypothesis to explain such a distribution is that structures that are later in refcode families are examples of structures which were harder to find due to the conditions required for them to form, and were only found when more significant investment was made to discover them.

Supporting information, sup-4 90.00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% Rare Common 20.00% 10.00% 0.00% All structures Family size of 2 Family size of 3 Family size greater than 3 Figure S2 Percentages of rare and common space groups for the last refcode in a refcode family in the CSD broken down by the size of the refcode family. Rare structures are defined as those in space groups where there are less than 10000 occurrences in the CSD. Common space groups are those where there are more than 10000 occurrences in the CSD. 45.00% 40.00% 35.00% 30.00% 25.00% 20.00% 15.00% 10.00% Ncomponents = 1 Ncomponents = 2 Ncomponents = 3 Ncomponents = 4 Ncomponents = 5 Ncomponents = 6 Ncomponents = 7 5.00% 0.00% Figure S3 Space group (x axis) against frequency (y axis) split by number of components (Z ꞌꞌ )

Supporting information, sup-5 S4. Response surface in optimisation We investigated the size of the capture space in local optimization. Each test set structure was perturbed randomly. All variables (rotations, translations and cell parameters were perturbed.). The perturbation applied to each variable was a random amount up to a defined percentage threshold. Following perturbation, the structure was optimised. The resulting structure was then compared with the observed structure to ascertain if the observed structure was reproduced (structures were considered to be reproduced if the crystal packing similarity (Chisholm & Motherwell, 2005) with a distance tolerance of 20%, an angle tolerance of 20 and a cluster size of 15 molecules gave a (heavy-atom) RMSD less than 0.5 Å.) The process was repeated 10 times for each structure in the test set with each percentage threshold, and the resulting data was used to establish the decline of success with increasing perturbation. The resultant curve is shown in Figure S4. 100.00% 90.00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00% 10.00% 0.00% 0 10 20 30 40 50 60 70 80 90 100 Number with 15 in common (%) Figure S4 Percentage of local optimisations that lead to the observed structure as a function of starting point perturbation.

Percentage of pairs with the same spacegroup Acta Cryst. (2016). B72, doi:10.1107/s2052520616006533 Supporting information, sup-6 S5. Can shape predict space group? In addition to analysing unit cell retrieval rates, we were interested in understanding whether shape aids in predicting space group. We achieved this by using USR to do a pairwise comparison of all pairs of structures that could be generated from a subset of 133,428 single-component entries in the CSD which were organic, fully determined and had an R-factor < 10.0. This led to 8,901,448,878 comparisons. These comparisons were then sorted by degree of shape similarity, and the percentage of comparisons with the same space group was calculated as a function of shape similarity. For comparisons with a shape similarity of 0.5, we find that 21% of pairs occur in the same space group. This value can be taken as a baseline: if shape were useful for predicting space group, we would expect higher degrees of shape similarity to lead to higher percentages of pairs in the same space group. As is apparent from Figure S5, there is a slow and gradual increase with similarity, but truly useful enrichment only occurs between very similar structures (with a shape similarity greater than 0.95). 100.00% 90.00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00% 10.00% 0.00% 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Shape similarity of pair Figure S5 Using shape to predict space group. S6. Other supplementary files evaluation_set.gcd : a text file listing the CSD entries used in the test set. Analogue Pairs.csv : pairs of CSD entries containing the structure predicted and the analogue structure used to predict it.

Supporting information, sup-7 shape_match_probability_table.xlsx: an excel format spreadsheet containing shape match log probabilities References Chisholm, J. A. & Motherwell, W. D. S. (2005). J. Appl. Cryst. 38, 228-231. Filippini, G. & Gavezzotti, A. (1993). Acta Cryst. B49, 868-880. Gavezzotti, A. & Filippini, G. (1994). J. Phys. Chem. 98, 4831-4837. Gavezzotti, A. (2011). New J. Chem. 35, 1360-1368. Liu, D. C. & Nocedal, J. (1989). Math. Program. 45, 503-528.