Topliss batchwise scheme reviewed in the era of Open Data


Topliss batchwise scheme reviewed in the era of Open Data
Lars Richter, Gerhard F. Ecker
Pharmacoinformatics Research Group, Department of Pharmaceutical Chemistry
gerhard.f.ecker@univie.ac.at
pharminfo.univie.ac.at

Topliss batchwise scheme

Topliss ranking schemes (Topliss et al., J Med Chem 1977)

subst.     π     σ     -σ    π+σ   Es
3,4-Cl2    1     1     5     1     2-5
4-Cl       2     2     4     2     2-5
4-CH3      3     4     2     3     2-5
4-OCH3     4-5   5     1     5     2-5
H          4-5   3     3     4     1

Topliss substituent proposals

scheme   new substituent selection
π        3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11;
         4-CH(CH3)2; 4-C(CH3)3; 3,4-(CH3)2; 4-O(CH2)3CH3; 4-OCH2Ph; 4-N(C2H5)2
σ        3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11
-σ       4-N(C2H5)2; 4-N(CH3)2; 4-NH2; 4-NHC4H9; 4-OH; 4-OCH(CH3)2; 3-CH3, 4-OCH3
π+σ      3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11

Topliss batchwise scheme

Series of five phenyl-substituted propafenone derivatives measured against P-Glycoprotein

substitution   EC50 (µM)   rank
3,4-Cl2        0.150       5
4-Cl           0.132       4
4-CH3          0.063       2
4-OCH3         0.045       1
H              0.079       3

Which compound should be synthesized next?

Topliss batchwise scheme

Topliss ranking schemes compared with the propafenone dataset

subst.     π     σ     -σ    π+σ   Es    | EC50 (µM)   rank
3,4-Cl2    1     1     5     1     2-5   | 0.150       5
4-Cl       2     2     4     2     2-5   | 0.132       4
4-CH3      3     4     2     3     2-5   | 0.063       2
4-OCH3     4-5   5     1     5     2-5   | 0.045       1
H          4-5   3     3     4     1     | 0.079       3

The propafenone dataset follows the -σ scheme.

Topliss substituent proposals

scheme   new substituent selection
π        3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11;
         4-CH(CH3)2; 4-C(CH3)3; 3,4-(CH3)2; 4-O(CH2)3CH3; 4-OCH2Ph; 4-N(C2H5)2
σ        3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11
-σ       4-N(C2H5)2; 4-N(CH3)2; 4-NH2; 4-NHC4H9; 4-OH; 4-OCH(CH3)2; 3-CH3, 4-OCH3
π+σ      3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11
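The scheme lookup on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the names `SCHEMES`, `rank_by_potency` and `matching_schemes` are made up here, a range such as "4-5" is encoded as the set of ranks it allows, and tied EC50 values are not handled.

```python
# Topliss ranking schemes as allowed ranks per substituent (1 = most active).
# A range such as "4-5" or "2-5" is encoded as the set of ranks it permits.
SUBSTITUENTS = ("3,4-Cl2", "4-Cl", "4-CH3", "4-OCH3", "H")
SCHEMES = {
    "pi":       ({1}, {2}, {3}, {4, 5}, {4, 5}),
    "sigma":    ({1}, {2}, {4}, {5}, {3}),
    "-sigma":   ({5}, {4}, {2}, {1}, {3}),
    "pi+sigma": ({1}, {2}, {3}, {5}, {4}),
    "Es":       ({2, 3, 4, 5},) * 4 + ({1},),
}

def rank_by_potency(ec50):
    """Rank substituents from 1 (lowest EC50, most active) upwards."""
    ordered = sorted(ec50, key=ec50.get)
    return {substituent: i + 1 for i, substituent in enumerate(ordered)}

def matching_schemes(ranks):
    """Return the names of all schemes whose allowed ranks fit the observed ranks."""
    return [
        name
        for name, allowed in SCHEMES.items()
        if all(ranks[s] in a for s, a in zip(SUBSTITUENTS, allowed))
    ]

# Propafenone EC50 values as listed on the slide.
propafenone = {"3,4-Cl2": 0.150, "4-Cl": 0.132, "4-CH3": 0.063,
               "4-OCH3": 0.045, "H": 0.079}
propafenone_ranks = rank_by_potency(propafenone)
print(propafenone_ranks, matching_schemes(propafenone_ranks))
```

A series that fits none of the five schemes simply returns an empty list, which is how non-Topliss patterns would show up in such a lookup.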

Topliss batchwise scheme

Propafenone dataset (-σ scheme)

substitution   EC50 (µM)   rank
3,4-Cl2        0.150       5
4-Cl           0.132       4
4-CH3          0.063       2
4-OCH3         0.045       1
H              0.079       3

The 4-N(CH3)2 derivative was synthesized and tested: no affinity increase.

- How often do Topliss schemes (π, σ, -σ, π+σ, Es) occur in large databases?
- How useful do Topliss schemes prove in activity optimization?

www.openphacts.org


How often do Topliss patterns occur?

1. Return 3,4-dichloro substituted compounds in postgresql ChEMBL 20 using the RDKit cartridge: 9312 cpds
2a. For each 3,4-Cl2 substituent, check for availability of 4-Cl, 4-OCH3, 4-CH3 and H substitutions
3. Check each compound series for bioactivity data (pchembl) measured
   - on the same target in the same assay
   - with activity type = IC50 or Ki
   - plus, if available, activity for the new substituent selection

[Flow diagram: SQL query; 540 x; 200 series, e.g. activities 3 nM, 5 nM, 8 nM, 9 nM, 10 nM ranked 1 2 3 4 5; new substitution selection: 1108 bioactivity data for additional substituents]
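Steps 2a and 3 amount to grouping activities by scaffold and assay and keeping only the groups that cover all five substituents. A minimal Python sketch of that filtering follows. It assumes the query result is available as tuples; the record layout, the identifiers and the function name `complete_series` are hypothetical and not taken from the original workflow.

```python
from collections import defaultdict

# The five substituents that define a Topliss batchwise series.
REQUIRED = ("3,4-Cl2", "4-Cl", "4-CH3", "4-OCH3", "H")

def complete_series(records):
    """Group activities by (scaffold, assay) and keep complete five-member series.

    `records` is a hypothetical list of
    (scaffold_id, assay_id, substituent, pchembl) tuples.
    """
    groups = defaultdict(dict)
    for scaffold_id, assay_id, substituent, pchembl in records:
        # If a substituent was measured twice in one assay, the last value wins.
        groups[(scaffold_id, assay_id)][substituent] = pchembl
    return {
        key: activities
        for key, activities in groups.items()
        if all(s in activities for s in REQUIRED)
    }

# Illustrative rows: scaffold "A" is complete, scaffold "B" lacks 4-OCH3.
example = [
    ("A", "assay1", "3,4-Cl2", 6.3), ("A", "assay1", "4-Cl", 7.0),
    ("A", "assay1", "4-CH3", 7.4), ("A", "assay1", "4-OCH3", 7.6),
    ("A", "assay1", "H", 8.0), ("A", "assay1", "4-CF3", 6.9),
    ("B", "assay2", "3,4-Cl2", 5.1), ("B", "assay2", "4-Cl", 5.5),
    ("B", "assay2", "4-CH3", 5.9), ("B", "assay2", "H", 6.2),
]
series = complete_series(example)
print(sorted(series))
```

Additional substituents such as 4-CF3 stay attached to their series, so they remain available for the later comparison with the new substituent selection.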

Raw data output after mining ChEMBL

200 series; new substitution selection: 1108 bioactivity data for additional substituents.
[Figure: example series with activities 3 nM, 5 nM, 8 nM, 9 nM, 10 nM ranked 1 2 3 4 5]

How often do Topliss patterns occur?

subst.        π     σ     -σ    π+σ   Es
3,4-Cl2       1     1     5     1     2-5
4-Cl          2     2     4     2     2-5
4-CH3         3     4     2     3     2-5
4-OCH3        4-5   5     1     5     2-5
H             4-5   3     3     4     1
# of series   13    7     3     2     34

57 of 200 series (29%) extracted from ChEMBL 20 follow a Topliss pattern.

[Figure: distribution of the 200 series over π, σ, -σ, π+σ, Es and others]

How useful do Topliss schemes prove in activity optimization?

Topliss pattern   # of series   substituent selection [1]   more active [2]   percentage
π                 13            29                          9                 31 %
σ                 7             9                           1                 11 %
-σ                3             5                           1                 20 %
π+σ               2             2                           1                 50 %

[1] For each series, bioactivity data for the substituents proposed by the Topliss new substituent selection were collected from ChEMBL 20, if available.
[2] Check whether the proposed substituents lead to more active compounds.

scheme   new substituent selection
π        3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11;
         4-CH(CH3)2; 4-C(CH3)3; 3,4-(CH3)2; 4-O(CH2)3CH3; 4-OCH2Ph; 4-N(C2H5)2
σ        3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11
-σ       4-N(C2H5)2; 4-N(CH3)2; 4-NH2; 4-NHC4H9; 4-OH; 4-OCH(CH3)2; 3-CH3, 4-OCH3
π+σ      3-CF3, 4-Cl; 3-CF3, 4-NO2; 4-CF3; 2,4-Cl2; 4-c-C5H9; 4-c-C6H11

- For the series found in ChEMBL, the Topliss approach seems to have difficulties in activity optimization for series following the σ scheme.
- The poor performance of -σ is in agreement with the propafenone data.
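The percentage column is the share of tested proposals that turned out more active. A short Python check of that arithmetic, using the counts from the table (the dictionary names are illustrative only):

```python
# (proposed substituents with data, of which more active) per Topliss pattern,
# copied from the table on this slide.
counts = {"pi": (29, 9), "sigma": (9, 1), "-sigma": (5, 1), "pi+sigma": (2, 1)}

# Success rate in percent, rounded to whole numbers as on the slide.
rates = {
    pattern: round(100 * more_active / tested)
    for pattern, (tested, more_active) in counts.items()
}

for pattern, rate in rates.items():
    tested, more_active = counts[pattern]
    print(f"{pattern}: {more_active}/{tested} = {rate} %")
```

With only 2 to 29 data points per pattern, these rates are rough indications rather than precise estimates.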

How useful do Topliss schemes prove in activity optimization?

Propafenone dataset (-σ scheme)

substitution   EC50 (µM)   rank
3,4-Cl2        0.150       5
4-Cl           0.132       4
4-CH3          0.063       2
4-OCH3         0.045       1
H              0.079       3

The Topliss proposal for the propafenone dataset, 4-N(CH3)2, did not show an activity gain.

Are there -σ series in ChEMBL with bioactivity data for the 4-N(CH3)2 substitution?

target                                  type   4-OCH3 (nM)   4-N(CH3)2 (nM)
P-Glycoprotein                          EC50   45            82
Alpha-1a adrenergic receptor (ChEMBL)   Ki     0.3           0.8
µ-opioid receptor (ChEMBL)              Ki     0.50          63

In the two ChEMBL cases, too, the -σ proposal 4-N(CH3)2 failed to increase activity.

Topliss batchwise scheme

Propafenone aryloxy dataset (non-Topliss)

substitution   EC50    rank
3,4-Cl2        0.522   5
4-Cl           0.190   4
4-CH3          0.063   1
4-OCH3         0.180   3
H              0.079   2

The ranking pattern 5 4 1 3 2 in this dataset can't be assigned to an existing Topliss scheme.

- How often does the pattern 5 4 1 3 2 occur in ChEMBL?
- In general, which other, non-Topliss patterns occur frequently in ChEMBL?

Which non-Topliss patterns occur in ChEMBL?

subst.     new1   new2   new3   aryloxy
3,4-Cl2    1      5      5      5
4-Cl       2      2      3      4
4-CH3      4      4      1      1
4-OCH3     3      1      4      3
H          5      3      2      2
# series   6      4      4      0

The pattern found in the aryloxy dataset does not occur in ChEMBL. However, it is highly similar to new3.

- Do we find an underlying physicochemical driving force in the new3 pattern?
- Can we extrapolate to the aryloxy dataset?

[Figure: distribution of the 200 series over π, σ, -σ, π+σ, Es, new1, new2, new3]
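One way to quantify the similarity between the aryloxy ranking and the new patterns is Spearman's rank correlation. The Python sketch below uses the standard tie-free formula rho = 1 - 6 Σd² / (n(n² - 1)); that this is the measure behind the slide's similarity statement is an assumption, and the function name `spearman_rho` is illustrative.

```python
def spearman_rho(a, b):
    """Spearman rank correlation for two rankings of equal length without ties."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Ranks in the order 3,4-Cl2, 4-Cl, 4-CH3, 4-OCH3, H (taken from the table).
aryloxy = (5, 4, 1, 3, 2)
new_patterns = {
    "new1": (1, 2, 4, 3, 5),
    "new2": (5, 2, 4, 1, 3),
    "new3": (5, 3, 1, 4, 2),
}

similarity = {name: spearman_rho(aryloxy, p) for name, p in new_patterns.items()}
closest = max(similarity, key=similarity.get)
print(similarity, closest)
```

The aryloxy ranking differs from new3 only by swapping the ranks of 4-Cl and 4-OCH3, which gives rho of about 0.9 and makes new3 the closest of the three new patterns.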

Correlation analysis within new3 series

target name          pattern     # of cpds in series [1]   r(π)   r(σ)   r(vdw_area)
Prostanoid EP1 rec   5 3 1 4 2   5 + 8                                   -0.81**
Adenosine A3 rec     5 3 1 4 2   5 + 8                                   -0.54*
Adenosine A3 rec     5 3 1 4 2   5 + 8                                   -0.67**
Chymase              5 3 1 4 2   5 + 13                                  -0.49**
P-Glycoprotein       5 4 1 3 2   5

** p < 0.05, * p < 0.10

[1] In addition to the 5 data points from 3,4-Cl2, 4-Cl, 4-OCH3, 4-CH3 and H, bioactivity data from other substituents listed in Topliss et al. 1977 were selected for correlation analysis.

Correlation analyses were undertaken to calculate the Pearson correlation coefficient (r) between the physicochemical features π, σ and vdw_area and the respective bioactivity data.

Statistically significant negative vdw_area correlations indicate that the new3 pattern and the aryloxy series bind to a tight pocket.

Discover the ranking globe

How to look at the ranking space globally?

- There are 120 (5!) ranking possibilities (patterns), for example (1,2,3,4,5), (2,1,3,4,5), (1,3,2,4,5) and (5,4,3,2,1).
- Calculation of Spearman's rank correlation distance matrix for the 120 possibilities (R function cordist).
- Spherical MDS to represent the distance matrix on the surface of a sphere (R function smacofSphere), Kruskal stress = 0.15.
- Each point represents a pattern (e.g. 1,2,3,4,5); similar patterns lie in the vicinity of each other.

Frequency contour map: color coding based on frequency of patterns. Red = high frequency, blue = low frequency.
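The first two steps can be sketched in Python. The slides use an R function called cordist whose exact definition is not given, so the distance below, 1 - rho with rho being Spearman's rank correlation, is an assumption chosen for illustration. The spherical MDS step itself is not reproduced here.

```python
from itertools import permutations

def spearman_distance(a, b):
    """Distance 1 - rho between two tie-free rankings (assumed definition)."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(a, b))
    rho = 1 - 6 * d_squared / (n * (n * n - 1))
    return 1 - rho

# All 5! = 120 possible rankings of the five substituents.
patterns = list(permutations(range(1, 6)))

# Full 120 x 120 distance matrix; this would be the input for the MDS step.
dist = [[spearman_distance(p, q) for q in patterns] for p in patterns]

print(len(patterns), dist[0][0], dist[0][-1])
```

Under this definition identical rankings have distance 0 and fully reversed rankings have the maximum distance 2, so neighbouring points on the globe correspond to rankings that differ by small swaps.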

Map analysis

Frequency contour map
- Color coding based on frequency of patterns. Red = high frequency, blue = low frequency.
- Map regions: trench, steric island, π and σ continent.
- -σ: only three -σ patterns in ChEMBL. In the investigated cases, poor predictability of the -σ scheme.

Van der Waals contour map
- Color coding based on vdw_area correlations with bioactivity. Red = positive correlation, blue = negative correlation.
- Map regions: steric island, π and σ continent.
- The aryloxy pattern is surrounded by Es patterns and lies in an area with negative vdw_area correlation.
- Only series with activity data for five additional derivatives (e.g. 4-CF3, 4-OH, ...) are used in the correlation analysis (n >= 10). Resulting correlations with p > 0.1 were omitted. The remaining coefficients were used for color coding.
- Only Topliss patterns (π, σ, π+σ, Es) and ranking patterns with four or more series (new1, new2, new3) are shown.

Summary & Outlook

- Open medicinal chemistry data such as those in ChEMBL allow analysis of complex SAR patterns.
- Connecting these data with data from pathways and diseases, as implemented in the Open PHACTS Discovery Platform, will open up completely new possibilities for linking chemical SAR patterns to biological endpoints.
- Quality of data is key for the analysis (assays).

Next steps
- Look for X-ray structures of complexes.
- Analyse with respect to target classes.


Data mining workflow

-> SQL query: get all 3,4-Cl2 compounds (RDKit cartridge, ChEMBL 20 in postgresql, > 13 000 000 activities)
-> SMILES
-> Data processing in Python (RDKit chemoinformatics toolkit 2014.03)
-> 200 series

Spherical MDS in R software
-> 120 ranking possibilities are created
-> Spearman ranking distance matrix is calculated
-> Spherical MDS is undertaken
-> X, Y, Z coordinates are exported as CSV file (Coordinates.csv)

Python data preprocessing
-> Basemap toolkit provides a list of globe projections: 2D equidistant cylindrical projection, 3D orthographic projection
-> Create contour maps

- For each series, bioactivity data for 3,4-Cl2, 4-Cl, 4-CH3, 4-OCH3 and 4-H are available.
- For the majority of the series (91%), bioactivity data for more substituents, e.g. 4-CF3, 4-OH, 4-F, ..., are available. (Substituents taken from the new substituent selection.)
- More than 57% of the series have activity data for five or more additional substituents.
- For series with 5 or more additional substituents (n >= 10), correlation analyses were run.

Series_8

substituent   pIC50   vdw_area
3,4-Cl2       6.3     134
4-Cl          7.0     117
4-CH3         7.4     116
4-OCH3        7.6     131
4-H           8       99
4-CF3         6.9     129
4-F           7.7     103
4-OH          6.6     109
3,4-(CH3)2    7       134
4-C(CH3)3     6.1     152

In this example: R = -0.70, p = 0.03. Series 8, with pattern 5 4 3 2 1, has R(vdw) = -0.7.
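The Series 8 correlation can be reproduced with a plain Pearson calculation. The Python sketch below uses only the ten value pairs listed for this series; the function name `pearson_r` is illustrative and the p value is not recomputed.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Series 8: pIC50 and van der Waals area for the ten listed substituents, in the
# order 3,4-Cl2, 4-Cl, 4-CH3, 4-OCH3, 4-H, 4-CF3, 4-F, 4-OH, 3,4-(CH3)2, 4-C(CH3)3.
pic50 = [6.3, 7.0, 7.4, 7.6, 8.0, 6.9, 7.7, 6.6, 7.0, 6.1]
vdw_area = [134, 117, 116, 131, 99, 129, 103, 109, 134, 152]

r_vdw = pearson_r(pic50, vdw_area)
print(round(r_vdw, 2))
```

The negative sign means that larger substituents tend to go with lower pIC50 in this series, which is the basis for the steric interpretation.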

Details on multidimensional scaling in R

- First: 2D MDS, bad, Kruskal stress-1 > 0.2
- Second: 3D MDS, good, Kruskal stress-1 = 0.11, but visualization not helpful
- Third: spherical MDS, moderate, Kruskal stress-1 = 0.15, good visualization

get120possibilities() ... creates a vector with 120 rankings [(1,2,3,4,5), (2,1,3,4,5), ...]
cordist() ... calculates Spearman's rank correlation distance
smacofSphere() ... runs spherical MDS; type = "ordinal" because we have rankings; algorithm = "primal"; ... handling of ties
xyz.120 ... x, y, z coordinates of the MDS run

Coordinates (xyz.120) are exported to a CSV file and are the input for Basemap.

Potential questions:

-> How long does such a search take once the workflow is in place?
About 1 day (4-processor machine, 8 GB RAM).

-> How are salts handled?
The script is written so that salts are not taken into account. In principle, the dichloro compound could therefore be a sodium salt and the methyl derivative a potassium salt. However, this never occurred in the 200 series and therefore plays no role.

-> What about chirality?
I did not take chirality into account in the query. This would have been possible, but since the encoding of chirality in ChEMBL is not comprehensive, I did not consider it.

-> How large must the difference between the bioactivities be for a set of compounds to be accepted as a series?
In the Topliss paper, rankings have a difference of log > 0.1 between the compounds. We did not take this into account and used all data (as, incidentally, did the group that carried out a similar analysis in 2014). The data analysis of the 200 series shows:
- 43 series have a difference of at least 0.1 log between all rankings.
- 77 series have one violation of this rule, i.e. the difference between two rankings is smaller than 0.1 once.
- 80 series have two or more violations.

-> Why did you not consider the other patterns (2π-π², π-σ, etc.)?
The complexity would have been considerably higher without any notable gain in information. For delineation: the new patterns (new1, new2, new3) fall into none of the patterns postulated by Topliss, not even the extended selection (2π-π², π-3σ, etc.).
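The 0.1 log rule can be checked per series by sorting the five activities and counting neighbouring pairs that lie closer together than the threshold. The Python sketch below is one possible reading of that rule, not the original script; the function name `count_violations` and the second example series are made up for illustration.

```python
def count_violations(p_activities, min_gap=0.1):
    """Count adjacent ranks whose activities differ by less than `min_gap` log units."""
    ordered = sorted(p_activities, reverse=True)
    gaps = [higher - lower for higher, lower in zip(ordered, ordered[1:])]
    return sum(1 for gap in gaps if gap < min_gap)

# Series 8 core substituents (pIC50 values from the slides): no violations.
series_8 = [6.3, 7.0, 7.4, 7.6, 8.0]
# Invented series with two closely spaced pairs: two violations.
crowded = [6.00, 6.05, 6.5, 7.0, 7.02]

print(count_violations(series_8), count_violations(crowded))
```

Applied to all 200 series, a count of 0, 1, or 2 and more would sort each series into one of the three groups listed above.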