2D-QSAR Descriptors

Peter Gedeck
Novartis Institutes for BioMedical Research
NHRC, Horsham, UK

NOVARTIS

Overview

Definition
Examples
  Feature counts
  Topological indices
  2D fingerprints and fragment counts
  R-group descriptors
How good are 2D descriptors in practice?
Summary

We cannot cover them all, so here is a selection.

Definition: What are descriptors?

A descriptor assigns a number, a vector, or a function to an atom, a group, or a molecule.
Physicochemical property (log P)
Derived properties (distribution of surface electrostatic potential)
Abstract properties (fingerprint = fragment count)

2D-QSAR models are based on descriptors derived from a two-dimensional graph representation of a molecule.
1D - molecular formula (for example, molecular weight)
2D - molecular connectivity / topology
3D - molecular geometry / stereochemistry
4D/5D/... - conformational ensembles

Descriptor overview

Good descriptors should characterize molecular properties important for molecular interactions: hydrophobic, electronic, steric / size / shape, hydrogen bonding.
A recently published encyclopaedia describes more than a thousand molecular descriptors used in QSAR and molecular modelling:
R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley, 2000
Feature counts

Hydrogen bond donors
Hydrogen bond acceptors
Number of rings
Number of rotatable bonds
Features are usually defined using substructures or SMARTS patterns, for example [!#6;!H0] (a non-carbon atom carrying at least one hydrogen).
A SMILES and SMARTS tutorial can be found at www.daylight.com

Feature counts, application: AlogP

Ghose and Crippen developed an atom-based model for log P (AlogP).
Ghose AK, Crippen GM. Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure-activity relationships. I. Partition coefficients as a measure of hydrophobicity. J Comput Chem 7 (1986) 565-577.
Wildman SA, Crippen GM. Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 39 (1999) 868-873.
The atoms of a molecule are classified into different atom types: aromatic carbon, primary and secondary aliphatic carbon, and so on.
Linear model for log P, where n_i is the number of atoms of type i and f_i is the contribution of that atom type:

$$ \log P = \sum_{i \,\in\, \text{atom types}} f_i \, n_i $$

The approach has been extended to molar refractivity.

Feature counts, application: Polar Surface Area (PSA)

The polar surface area (PSA) is the sum of the surface contributions of polar atoms (usually oxygens, nitrogens, and attached hydrogens).
This descriptor is easy to interpret and, most importantly, it correlates very well with drug transport properties.
Ertl P, Rohde B, Selzer P, J Med Chem 43 (2000) 3714-3717

[Figure: fragment-based PSA versus 3D-PSA]

2D fingerprints

Generalisation of feature counts based on 2D fingerprints.
Fragment dictionary fingerprints
  Defined structural features (public keys)
  Pre-defined fragments may not be suitable for the dataset
Hashed fingerprints
  Automatically generated fragments
  Convert each fragment to a unique number, which sets a position in the fingerprint
  Fold the large fingerprint into a short representation (Daylight, HQSAR), or use it as is (SciTegic)
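The linear atom-contribution model for log P described in this section can be illustrated with a minimal Python sketch. The atom-type names and the contribution values in `CONTRIBUTIONS` are hypothetical placeholders chosen for illustration; they are not the published Ghose-Crippen or Wildman-Crippen parameters, and the atom typing is done by hand rather than by substructure matching.

```python
from collections import Counter

# Hypothetical per-atom-type contributions f_i (illustrative values only,
# NOT the published Ghose-Crippen or Wildman-Crippen parameters).
CONTRIBUTIONS = {
    "C_aromatic": 0.30,
    "C_aliphatic": 0.15,
    "O_hydroxyl": -0.30,
    "H_on_carbon": 0.10,
    "H_on_heteroatom": -0.25,
}


def atom_contribution_logp(atom_types, contributions=CONTRIBUTIONS):
    """Return sum_i f_i * n_i, where n_i counts the atoms of type i."""
    counts = Counter(atom_types)
    return sum(contributions[atom_type] * n for atom_type, n in counts.items())


# Phenol with hand-assigned (hypothetical) atom types.
phenol = (
    ["C_aromatic"] * 6
    + ["O_hydroxyl"]
    + ["H_on_carbon"] * 5
    + ["H_on_heteroatom"]
)
logp_estimate = atom_contribution_logp(phenol)
```

In a real implementation the atom types would be assigned by matching substructure patterns, and the contributions would be fitted to measured log P values by regression.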
HQSAR: Fragment-based fingerprints

Break the structure into fragments and count occurrences (list of unique fragments)
Convert each fragment to a unique number
Combine the counts for all possible fragments into a vector of numbers = hologram
Interpret the numbers as bits or counts
Reduce the length of the vector by folding = reduced hologram

HQSAR: Application

Level of detail encoded
  Atoms
  Atoms / Bonds
  Atoms / Bonds / Connections
  Atoms / Bonds / Connections / Chirality (R/S)
Minimum and maximum size of fragments
Length of reduced hologram
A large-scale QSAR study comparing various methods is described later.

R-group descriptors

Example of a descriptor developed for a very specific application.
Hirons L, Holliday JD, Jelfs SP, Willett P, Gedeck P. Use of the R-Group Descriptor for Alignment-Free QSAR. QSAR Comb. Sci. 24 (2005) 611-619.

Lead optimisation datasets
  Series of compounds with a common core structure
  Systematic variation of substituents
  Modification often localised at a small part of the molecule

Scientific Rationale: What determines binding?

[Figure: ligand with substituents R1, R2, ... in a binding site (glutamate, glutamine, isoleucine); Muszynski IC et al., QSAR]

Protein-ligand interactions
  hydrophobic
  hydrogen bond
  electrostatic
The position of the pharmacophore in space is important.
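The hologram construction outlined in this section (fragment, count, convert to a number, fold) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fragments are restricted to linear atom paths, atoms carry only their element symbol, `zlib.crc32` stands in for whatever hashing scheme a real implementation uses, and the hologram length of 53 is an arbitrary choice.

```python
import zlib
from collections import Counter


def linear_fragments(atoms, bonds, min_size=2, max_size=4):
    """Count linear atom paths containing min_size..max_size atoms."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen_paths = set()  # each path is stored once, independent of direction
    counts = Counter()

    def extend(path):
        if len(path) >= min_size:
            key = min(tuple(path), tuple(reversed(path)))
            if key not in seen_paths:
                seen_paths.add(key)
                symbols = [atoms[i] for i in path]
                # Direction-independent fragment name.
                name = min("".join(symbols), "".join(reversed(symbols)))
                counts[name] += 1
        if len(path) < max_size:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return counts


def hologram(fragment_counts, length=53):
    """Hash every fragment to a bin and add its count (folding by modulo)."""
    bins = [0] * length
    for name, n in fragment_counts.items():
        bins[zlib.crc32(name.encode()) % length] += n
    return bins


# Heavy atoms of ethanol: C-C-O
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
frags = linear_fragments(atoms, bonds, min_size=2, max_size=3)
holo = hologram(frags, length=53)
```

Because folding maps many fragments onto few bins, different fragments can collide in the same bin; the hologram length therefore trades compactness against resolution.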
Scientific Rationale: How to capture binding information?

The influence of the core on binding is constant for a lead series.
Differences in substituent properties cause differences in binding.
The influence of substituents is almost additive for the same binding mode.

Scientific Rationale

The descriptor needs to
  encode the position of pharmacophoric features in space
  have properties that correlate with binding interactions
    hydrophobic: atomic polarisability
    hydrogen bond: hydrogen bond donor/acceptor counts, polar surface area
    electrostatic: atomic charge
The distance of functional groups from the core is important for binding.
Only substituents with a single attachment point are of interest.
Substituents are fairly small.

R-Group Descriptors

Assign properties to the atoms of the substituent
Determine the distance of each atom to the attachment point
Combine properties and distances to form the descriptor: a vector of property values ordered by distance from the attachment point, padded with zeros

[Figure: example descriptor profiles (atomic polarisability and atomic charge) for phenyl, aminophenyl, fluorophenyl, and hydroxycyclohexyl substituents]

Descriptor Variations

Atom-based: based upon the sum of atomic properties
  Atomic weights and partial charges
  Atomic contributions to LogP, MR, and PSA
  H-bond acceptor and donor counts (HBA and HBD)
Surface-based: based upon maximum-positive and minimum-negative surface potentials
  Molecular Electrostatic Potentials (MEP)
  Molecular Lipophilicity Potentials (MLP)
Field-based: based upon molecular interaction fields (MIF, GRID)
  Dry probe - hydrophobic interactions
  Carbonyl oxygen probe - HBD interactions
  Amide nitrogen probe - HBA interactions

QSAR

[Diagram: Data and Structure feed the Descriptors, which feed the Model, which produces Predictions]
Descriptors represent properties of the structure.
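The three steps of the R-group descriptor (assign atomic properties, determine the distance to the attachment point, combine both) can be sketched as follows. This is a minimal illustration under assumptions: distance is taken as the number of bonds from the attachment atom, the property values of all atoms at the same distance are summed, and the function name, the `max_distance` parameter, and the example values are hypothetical.

```python
from collections import deque


def r_group_descriptor(properties, bonds, attachment=0, max_distance=7):
    """Sum an atomic property per bond distance from the attachment atom."""
    neighbours = {i: [] for i in range(len(properties))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Breadth-first search yields the bond distance of every atom.
    distance = {attachment: 0}
    queue = deque([attachment])
    while queue:
        atom = queue.popleft()
        for nxt in neighbours[atom]:
            if nxt not in distance:
                distance[nxt] = distance[atom] + 1
                queue.append(nxt)

    # One vector element per distance; unused distances stay zero.
    descriptor = [0.0] * (max_distance + 1)
    for atom, d in distance.items():
        if d <= max_distance:
            descriptor[d] += properties[atom]
    return descriptor


# Phenyl substituent (heavy atoms only); atom 0 is the attachment point.
ring_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
atom_counts = r_group_descriptor([1.0] * 6, ring_bonds)
```

With a property of 1.0 on every atom the descriptor simply counts atoms per distance shell; replacing the property list with atomic charges or polarisabilities gives one descriptor block per property.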
QSAR

Function relating descriptors to biological activity:
  activity = f(molecular descriptors)
QSAR models
  explain which molecular features are responsible for activity
  help to design new compounds with enhanced features

[Diagram: data matrix with compounds as rows and properties as variables]

R-group QSAR Development

Descriptor generation: compounds are split into R-groups, atomic properties are assigned, and one R descriptor is calculated per substitution site (R1, R2, ...). The descriptors of all sites together describe the compound.

R-group QSAR: Data sets

Four data sets selected from the literature: benzodiazepines, serotonin, triazines, tropanes.

R-group QSAR: Results

[Table: R², Q², and predictive r² for the R-group QSAR (PLS) and CoMFA models on the benzodiazepine, serotonin, triazine, and tropane data sets]

R-group QSAR: Serotonin data set

Exception: the low q² for the serotonin data set is not surprising.
A literature result using CoMFA is available for comparison.
Substituents are large and structurally very diverse.
This demonstrates the limitation of R-group descriptors.

Simulated Lead-Optimisation

Initialisation: retrieve the initial lead compounds
Optimisation, repeated while compounds remain:
  Generate QSAR model
  Select the best predictions
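The simulated lead-optimisation loop (initialise, build a model, select the best predictions, repeat while compounds remain) can be sketched as below. This is a hedged illustration, not the study's implementation: a nearest-neighbour predictor stands in for the QSAR model, and all function names, the `batch_size` parameter, and the toy data are hypothetical.

```python
def nearest_neighbour_predict(train, query):
    """Stand-in model: activity of the closest known compound."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    descriptor, activity = min(
        train, key=lambda item: squared_distance(item[0], query)
    )
    return activity


def simulate_lead_optimisation(compounds, initial_ids, batch_size=1):
    """compounds maps id -> (descriptor, activity).

    Returns the list of compound ids selected in each iteration.
    """
    known = set(initial_ids)
    remaining = set(compounds) - known
    history = []
    while remaining:
        # Rebuild the model from everything measured so far.
        train = [compounds[c] for c in sorted(known)]
        # Rank untested compounds by predicted activity (ties broken by id).
        ranked = sorted(
            remaining,
            key=lambda c: (-nearest_neighbour_predict(train, compounds[c][0]), c),
        )
        batch = ranked[:batch_size]
        history.append(batch)
        # "Measure" the selected compounds by revealing their activities.
        known.update(batch)
        remaining.difference_update(batch)
    return history


# Toy data set: one-dimensional descriptor, activity rising with it.
toy = {
    "a": ((0.0,), 5.0),
    "b": ((0.5,), 6.0),
    "c": ((2.0,), 7.0),
    "d": ((3.0,), 8.0),
    "e": ((4.0,), 9.0),
}
selection_order = simulate_lead_optimisation(toy, ["a", "c"], batch_size=1)
```

Plotting the measured activities of each batch against the iteration number gives the kind of time-course view used to compare the simulated selection with the historical one.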
Simulated Lead-Optimisation

Retrospective analysis using three in-house datasets with known time course.
[Figure: distribution of activities (pIC50 values) per iteration for the three programmes]

Box plots improve the clarity of the visualisation.
[Figure: box plots of activity per iteration; legend: outliers, upper adjacent value, upper quartile, median, lower quartile, lower adjacent value]

Two strategies
  Chronological starting point
  Diverse starting point
QSAR-supported lead optimisation identifies potent compounds more rapidly.
[Figure: activity per iteration for the three programmes; rows: chemist, chronological start, diverse start]

How good are 2D descriptors in practice?

Datasets

Datasets extracted from the corporate database
Dataset sizes range from tens to thousands of datapoints
Full datasets (full DS) contain estimated data (for example, activities reported only as above a concentration threshold in µM)
Pruned datasets (pruned DS) contain only exact measurements
The two groups of datasets overlap
Average activity spread between 1 and 2 log(mol/l)

Descriptors

Different descriptors studied:
3D descriptors (GRID): single conformation used (Concord); default settings; DRY, O, and N1 probes
2D descriptors
  AlogP: counts of atom types
  FCFCx (three values of x; SciTegic): counts of extended connectivity fragments using pharmacophore atom typing; three levels of complexity
  HQSAR: count of fragment occurrences; default settings
  Similog: descriptor based on counts of pharmacophore triplets
Fingerprints
  A public key fingerprint
  A fingerprint developed at Novartis, optimised for searching/filtering in the corporate database
PCA was required for descriptors with a large number of variables, including FCFCx and Similog.
Model building

Each dataset is sorted by activity and split into a training and a test set.
  50-50 experiment: every other data point is used for the test set (interpolation)
  Extrapolation experiment: the top and bottom fraction of the dataset is used for testing (extrapolation)
PLS model (Sybyl implementation)
Optimal number of components determined by cross-validation on the training set

Characterisation of model performance

Predictive performance of the model on the test set.
Multivariate predictive r²pred:

$$ r^2_{pred} = 1 - \frac{\sum_{i \in test} \left( y_i^{pred} - y_i^{act} \right)^2}{\sum_{i \in test} \left( y_i^{act} - \bar{y}^{act} \right)^2} $$

Correlation of actual versus predicted values, r²corr:

$$ r^2_{corr} = \frac{\left[ \sum_{i \in test} \left( y_i^{pred} - \bar{y}^{pred} \right)\left( y_i^{act} - \bar{y}^{act} \right) \right]^2}{\sum_{i \in test} \left( y_i^{pred} - \bar{y}^{pred} \right)^2 \, \sum_{i \in test} \left( y_i^{act} - \bar{y}^{act} \right)^2} $$

Validation through randomisation experiments

Carried out on a subset of the datasets and descriptors.
Random test/training set splits: assessed by the median standard deviation of the r²pred values.
y-scrambling: r²pred values dropped.
[Figure: distribution of r²pred values per dataset over the randomisation experiments]

Results

1. Performance of individual descriptors
2. Dependence on dataset characteristics
3. Comparing descriptors
4. r²pred or r²corr?

1. Performance of individual descriptors

50-50 experiment, full dataset; dependence on cut-off.
None of the descriptors is best all the time.
HQSAR and one other descriptor perform best; these descriptors are biased towards features of the dataset.
FCFCx should be similar, but too many features introduce too much noise.
One FCFC variant is slightly better than the other two.
AlogP, two FCFC variants, one further descriptor, and Similog occupy the middle ground.
One descriptor performs worst.
[Figure: percentage of good models versus the r²pred cut-off for good models, one curve per method and in total]
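The interpolation and extrapolation splits and the two test-set statistics can be sketched in plain Python. Several details are assumptions rather than facts from the slides: which half of the alternating split becomes the test set, the use of the test-set mean in the denominator of r²pred, and the hypothetical `fraction` parameter for the extrapolation split.

```python
def split_interpolation(sorted_items):
    """Alternate data points (sorted by activity) between train and test."""
    return sorted_items[0::2], sorted_items[1::2]  # train, test


def split_extrapolation(sorted_items, fraction):
    """Use the top and bottom `fraction` of the sorted data as the test set."""
    k = int(len(sorted_items) * fraction)
    n = len(sorted_items)
    test = sorted_items[:k] + sorted_items[n - k:]
    train = sorted_items[k:n - k]
    return train, test


def r2_pred(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares) on the test set."""
    mean_act = sum(actual) / len(actual)
    ss_res = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_act) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def r2_corr(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    mean_act = sum(actual) / len(actual)
    mean_pred = sum(predicted) / len(predicted)
    cov = sum((p - mean_pred) * (a - mean_act) for p, a in zip(predicted, actual))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    var_act = sum((a - mean_act) ** 2 for a in actual)
    return cov ** 2 / (var_pred * var_act)


# Predictions with a constant offset: perfectly correlated, poorly predictive.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
corr_value = r2_corr(actual, predicted)
pred_value = r2_pred(actual, predicted)

train_i, test_i = split_interpolation(list(range(10)))
train_e, test_e = split_extrapolation(list(range(10)), 0.2)
```

The offset example shows why the two measures can disagree: r²corr ignores systematic shifts and scaling of the predictions, while r²pred penalises them and can even become negative.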
1. Performance of individual descriptors

Number of good models (r²pred above the cut-off), for all descriptors combined and per descriptor:
  50-50 experiment: pruned DS and full DS
  Extrapolation experiment: pruned DS and full DS
Adding estimated data (= inactives) improves QSAR models (50-50 experiment).
[Figure: percentage of good models per descriptor for the pruned and full datasets in both experiments]

2. Descriptor dependence: dataset size

Dataset size for the 50-50 experiment.
The red line is a local LOESS regression.
Similar results are obtained for the extrapolation experiment.
Trend as expected: good models are easier to achieve for larger datasets.

2. Descriptor dependence: spread of biological activities

Spread of biological activities for the 50-50 experiment.
The red line is a local LOESS regression.
Similar results are obtained for the extrapolation experiment.
Trend as expected: a spread of about one log unit (mol/l) is the minimum requirement for good models; a larger spread is better.

3. Comparing descriptors

Example: two descriptors compared directly; 50-50 experiment, full dataset, negative r²pred values set to 0.
The results are highly correlated, yet the more complex descriptor is consistently better.

3. Comparing descriptors: density plots

The graphs compare r²pred values calculated for different descriptors using densities (50-50 experiment, full dataset).
Two descriptors can show high correlation but a shifted curve.
The overview shows that Similog and some other descriptors are different.
Two FCFC variants behave very similarly.
The visualisation is too complex.

3. Comparing descriptors: correlation matrices

Visualisation of correlation matrices: r²pred is used to cluster the descriptors and to calculate the correlation matrix.
Descriptors are reordered in each graph using hierarchical clustering.
Colours (blue, green, yellow, orange) correspond to levels of the correlation coefficient.
FCFC, Similog, and some other descriptors are very different.
Some descriptors are highly correlated, but the quality of their models is very different.
4. r²pred or r²corr?

HQSAR, full dataset; the black line is the identity.
Similar results are obtained for the other descriptors.
50-50 experiment: only little difference between the two statistical measures.
Extrapolation experiment: accurate prediction of extrapolated activity data is difficult.

Summary

The study compares different QSAR methods using a large number of real-life datasets.
Nine different types of descriptors were used.
Some descriptors are better than others, but none is perfect.
Why is the percentage of good models so low?
  Quality of biological data
  Small-dataset QSAR is unreliable (but not useless!)
  Maybe it looks worse than it is: the percentage of good models depends on the r²pred cut-off
The performance of the 3D descriptors was disappointing, but it may improve if more care is taken to identify the correct conformations.
For a new dataset, try HQSAR first: it is fast and often performs well.

Acknowledgements

Christian Bartels, Bernd Rohde (GPS, NIBR, Basel, Switzerland): large-scale QSAR study
Peter Ertl, Paul Selzer (GPS, NIBR, Basel, Switzerland): PSA model
Steven Jelfs, Prof. Peter Willett, Dr. John Holliday, Linda Hirons (University of Sheffield, UK): R-group descriptors