Encoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models. Igor V. Tetko


Encoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models Igor V. Tetko IBPC, Ukrainian Academy of Sciences, Kyiv, Ukraine and Institute for Bioinformatics, Munich, Germany March 14th, ACS

Encoding molecular structures as SHUFFLED ranks of models: A new, secure way for sharing chemical data and development of ADME/T models Igor V. Tetko IBPC, Ukrainian Academy of Sciences, Kyiv, Ukraine and Institute for Bioinformatics, Munich, Germany

Structure-Property correlations require a representation (description) of the molecule in a format that can be used by machine learning methods, i.e. MLRA, neural networks, PLS. Two major systems: topological and 3D-based. Fragment-based indices, topological indices, E-state indices, quantum-chemical parameters, VolSurf descriptors, molecular shape parameters.

Three scenarios for structure decoding: Can we identify the molecule provided we have it in our portfolio? -- the most difficult scenario. Can we do the same knowing that the molecule originates from one of several chemical series? Can we identify the molecule provided we do not know anything about it? -- the practical scenario.

Can we identify the molecule provided we have it in our portfolio? Topological indices: the ability to unambiguously identify a molecule is limited by the information content of the indices. If the indices contain sufficient information, identification is possible. Information content of a molecule: CCCCC -- 11111 (5 bits), C1CCCC1 -- 12111123 (11 bits), with per-symbol code lengths such as C -- 1 bit, 1 -- 2 bits, rarer symbols -- 3 bits.

Information content of molecules in a set of 12908 molecules (PHYSPROP database). [Table and chart: per-symbol frequencies and code lengths, e.g. C -- 78777, c -- 76965, ) -- 42336, ( -- 42336, = -- 23648; bits/atom plotted against the number of non-hydrogen atoms.] Simple fixed-length coding is not optimal -- Huffman coding, arithmetic coding and general-purpose compressors do better: gz, zip -- 3.5 bits/atom, bzip2 -- 2.9 bits/atom.
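A minimal sketch of the entropy estimate behind the Huffman figures on this slide (the tiny SMILES strings below are illustrative only; the 2.9-3.5 bits/atom values come from compressing the full PHYSPROP set):

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(smiles_list):
    """Shannon entropy of the character distribution over a set of SMILES
    strings -- a lower bound on the bits/symbol of any Huffman code."""
    counts = Counter("".join(smiles_list))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# A single repeated symbol carries no information per symbol:
assert entropy_bits_per_symbol(["CCCCC"]) == 0.0
# Two equally frequent symbols need exactly 1 bit each:
assert abs(entropy_bits_per_symbol(["CN"]) - 1.0) < 1e-12
```

On a realistic corpus the entropy per symbol sits between the naive fixed-length cost and the bzip2 figure quoted on the slide, since bzip2 also exploits repeated substrings, not just symbol frequencies.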

Information content of a molecule: 30-40 atoms correspond to 90-110 bits. One double value carries 32 bits, so 3-4 topological indices potentially contain sufficient information to unambiguously decode a molecule with 40 atoms! In reality a larger number of indices may be required because of rounding effects and non-optimal storage of information. Thus, the encoding of molecules using topological indices can be insecure.
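The slide's arithmetic made explicit (the 2.9 bits/atom figure is the bzip2 estimate from the previous slide; 32 bits per stored value is the slide's own assumption):

```python
BITS_PER_ATOM = 2.9    # bzip2 estimate for PHYSPROP SMILES (previous slide)
BITS_PER_VALUE = 32    # bits assumed per stored index value

def indices_needed(n_atoms):
    """Smallest number of 32-bit index values whose raw capacity matches
    the information content of a molecule with n_atoms heavy atoms."""
    info_bits = n_atoms * BITS_PER_ATOM
    return int(-(-info_bits // BITS_PER_VALUE))  # ceiling division

print(indices_needed(40))  # 40 atoms ~ 116 bits -> 4 index values
```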

When is reverse engineering impossible? A practical scenario. The ALOGPS program uses 75 indices per molecule for logP and 33 indices per molecule for logS. We use a decreased resolution of the data, i.e. just 3 significant digits per index (7-10 bits instead of 32 bits); additional bits come from the range, giving ~11 bits per index => 10-12 indices for a molecule with 40 atoms. The information encoded in the indices could (theoretically) be adequate to decode molecules with < 50 heavy atoms. [Figure: number of indices (0-30) vs number of molecules, showing the decoding limit and the average number of indices per molecule.] But this may be too pessimistic a conclusion: the theoretical possibility to decode does not provide a way to actually do it!

ALOGPS 2.1. LogP: 75 input variables corresponding to electronic and topological properties of atoms (E-state indices), 12908 molecules in the database, 64 neural networks in the ensemble. Calculated results: RMSE=0.35, MAE=0.26, n=76 outliers (>1.5 log units). LogS: 33 input E-state indices, 1291 molecules in the database, 64 neural networks in the ensemble. Calculated results: RMSE=0.49, MAE=0.35, n=18 outliers (>1.5 log units). Tetko, Tanchuk & Villa, JCICS, 2001, 41, 1407-1421. Tetko, Tanchuk, Kasheva & Villa, JCICS, 2001, 41, 1488-1493. Tetko & Tanchuk, JCICS, 2002, 42, 1136-1145.

http://www.vcclab.org

Artificial Feed-Forward Back-propagation Neural Network

Early Stopping over Ensemble (ESE). The initial training set is repeatedly partitioned into learning/validation sets: set no 1 trains network no 1, set no 2 trains network no 2, ..., set no 199 trains network no 199, set no 200 trains network no 200. Each network is stopped early on its own validation partition, and the trained networks form an Artificial Neural Network Ensemble for analysis.
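A sketch of the ESE bookkeeping only (the network training itself is omitted; the 50/50 split and the NumPy RNG are illustrative assumptions, not taken from the slide):

```python
import numpy as np

def ese_partitions(n_samples, n_networks=200, seed=0):
    """Early Stopping over Ensemble: each ensemble member gets its own
    random learning/validation partition of the initial training set.
    Network k is trained on its learning half and stopped early when
    the error on its validation half starts to grow."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(n_networks):
        order = rng.permutation(n_samples)
        half = n_samples // 2
        partitions.append((order[:half], order[half:]))
    return partitions

splits = ese_partitions(100, n_networks=200)
learn_idx, valid_idx = splits[0]  # indices for network no 1
```

Because every network sees a different validation set, the ensemble as a whole uses all of the data while each member still has an unbiased early-stopping signal.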

ASNN: an example of the correction. Two structures, Morphinan-3-ol, 17-methyl- (experimental logP = 3.11) and Levallorphan (experimental logP = 3.48), are each represented by the vector of 64 ensemble-network outputs (net 1 ... net 64), and a 1-k correction is applied. Morphinan-3-ol, 17-methyl-: calculated logP = 3.65, δ = +0.54 --> 3.65 - 0.76 = 2.89 (δ = +0.22). Levallorphan: calculated logP = 4.24, δ = +0.76 --> 4.24 - 0.54 = 3.70 (δ = +0.22). Both molecules are each other's nearest neighbors (r² = 0.47) in the space of residuals!

Associative Neural Network (ASNN). A prediction for case i: the ensemble maps [x_i] to a vector of M model outputs [z_i] = (z_i^1, ..., z_i^k, ..., z_i^M). Ensemble approach: <z_i> = (1/M) Σ_{k=1..M} z_i^k. Pearson's (or Spearman's) correlation coefficient r_ij = R(z_i, z_j) > 0 measures similarity in the space of residuals. ASNN bias correction: z'_i = <z_i> + (1/k) Σ_{j in N_k(x_i)} (y_j - <z_j>). The correction of the neural network ensemble value is performed using the errors (biases) calculated for the neighbor cases of the analyzed case x_i detected in the space of neural network models.
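A minimal NumPy sketch of this correction (the function names, the choice of Spearman correlation, and k are illustrative assumptions; the real ASNN selects neighbors by a correlation threshold over the 64 trained networks):

```python
import numpy as np

def _ranks(v):
    return np.argsort(np.argsort(v))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return np.corrcoef(_ranks(a), _ranks(b))[0, 1]

def asnn_predict(z_query, Z_lib, y_lib, k=1):
    """z_query: M ensemble outputs for the new case.
    Z_lib: (n, M) array of ensemble outputs for library cases with
    known targets y_lib. The plain ensemble mean is corrected by the
    average bias (y_j - mean(z_j)) of the k most correlated cases."""
    sims = np.array([spearman(z_query, z) for z in Z_lib])
    neighbours = np.argsort(sims)[-k:]
    bias = np.mean(y_lib[neighbours] - Z_lib[neighbours].mean(axis=1))
    return z_query.mean() + bias
```

With k = 1 and a query identical to a library case, the corrected prediction returns that case's experimental value exactly, which is the behaviour the later slides call LIBRARY mode.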

Prediction: the space of the model does not cover the in-house compounds. [Figure: training set data vs in-house data.] Each new molecule is encoded as a rank of models.

Encoding of a molecule as ranks of models. ΔlogP = logP(exp) - logP(calc): the 64 values are replaced by their ranks. For example, the value sets 0.89 0.88 0.86 0.90 0.91 0.85 0.885 0.83 0.95 0.94, 0.09 0.08 0.06 0.10 0.11 0.05 0.085 0.03 0.15 0.14, and 0.59 0.38 0.26 0.60 0.71 0.15 0.485 0.03 0.95 0.84 all yield the same ranks: 5 7 8 4 3 9 6 10 1 2. Millions of solutions provide the same ranks of responses --> no way to decode (the previous name of the paper, but...)
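The rank encoding above can be sketched as follows (rank 1 = largest value, matching the slide's example); any monotone transformation of the 64 values leaves the ranks, and hence the shared encoding, unchanged:

```python
import numpy as np

def to_ranks(values):
    """Replace model outputs/residuals by their ranks (1 = largest)."""
    v = np.asarray(values, dtype=float)
    return len(v) - np.argsort(np.argsort(v))

slide_values = [0.89, 0.88, 0.86, 0.90, 0.91, 0.85, 0.885, 0.83, 0.95, 0.94]
print(to_ranks(slide_values))  # [ 5  7  8  4  3  9  6 10  1  2]

# Shifting every value by -0.8 (the slide's second row) preserves the ranks:
shifted = [x - 0.8 for x in slide_values]
assert (to_ranks(slide_values) == to_ranks(shifted)).all()
```

This invariance is exactly why decoding is hopeless: infinitely many residual vectors collapse onto the same rank vector.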

How selective is rank coding? 8x64 = 512 bits (comparable to MDL keys). 126 out of 121281 Asiprox compounds (0.1%); 12 out of 12908 PHYSPROP compounds (0.1%).

ALOGPS: Extrapolation vs Interpolation. ALOGPS logP (blind): MAE = 1.27, RMSE = 1.63. ALOGPS logP (LIBRARY): MAE = 0.49, RMSE = 0.70. Tetko, JCICS, 2002, 42, 717-742. Tetko & Bruneau, J. Pharm. Sci., 2004, 93, 3103-3110.

Analysis of Pfizer data. Pallas PrologD: MAE = 1.06, RMSE = 1.41. ACDlogD (v. 7.19): MAE = 0.97, RMSE = 1.32. ALOGPS: MAE = 0.92, RMSE = 1.17. ALOGPS LIBRARY: MAE = 0.43, RMSE = 0.64. Tetko & Poda, J. Med. Chem., 2004, 47, 5601-5604.

PHYSPROP data set. Total: 12908 molecules. [Diagram: nova set and star set; CLOGP -- 9429 and 3479 molecules; XLOGP -- 1873.] Training on the nova set --> prediction of the star set.

Prediction performance as a function of similarity in the space of models (star set). Blind prediction: maximum correlation coefficient of a test compound to training set compounds, MAE = 0.43. LIBRARY mode: maximum correlation coefficient of a test compound to LIBRARY compounds, MAE = 0.28 (0.26).

Reliability of new compound predictions.
r²:    0-0.2  0.2-0.4  0.4-0.6  0.6-0.8  0.8-1
error: >0.7   ~0.6     ~0.5     ~0.4     <0.3
Libraries: NCI (250,000), http://asinex.com (120,000), http://ambinter.com (650,000), PHYSPROP.

Reliability of new compound predictions: the same r² vs error dependence holds with Aurora data added to the libraries (NCI 250,000, http://asinex.com 120,000, http://ambinter.com 650,000, PHYSPROP).

Is identification possible? PHYSPROP -- Asinex study. [Structures shown: a query molecule and its nearest neighbors in model space, r² = 0.97, 0.85, 0.82.]

Is identification possible? PHYSPROP -- Asinex study. [Structures shown: nearest neighbors with r² = 0.77, 0.70, 0.67, 0.67, 0.64.]

Is identification possible? PHYSPROP -- Asinex study. [Structures shown: nearest neighbors with r² = 0.61, 0.60, 0.60.]

Securing the data -- shuffling the ranks! [Figure: shuffling the rank vector lowers the self-correlation, e.g. to r² = 0.8 or r² = 0.6.]

Rank shuffling. After shuffling, the rank-encoded molecule is less similar to itself than the molecules from other series that will be picked up --> secure encoding. Different molecules have different distributions of neighbors as a function of similarity => the level of security (e.g. 1 in 10^5, 1 in 10^6) can be determined individually for each single compound using an external library (e.g. complete enumeration, a compilation of public libraries). Everything can be done in a completely automatic mode.
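A sketch of the shuffling knob (the fraction parameter and the NumPy RNG are illustrative assumptions; the slide only states that more shuffling lowers the self-correlation r²):

```python
import numpy as np

def shuffle_ranks(ranks, fraction, seed=0):
    """Randomly permute a given fraction of the rank positions.
    More shuffling -> the encoded molecule is less similar to itself,
    trading prediction quality for security."""
    r = np.array(ranks)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(r), size=int(fraction * len(r)), replace=False)
    r[idx] = rng.permutation(r[idx])
    return r

original = np.arange(1, 65)                 # 64 ranks, as in ALOGPS
mild = shuffle_ranks(original, 0.2)         # high self-correlation remains
strong = shuffle_ranks(original, 0.8)       # self-correlation degraded
```

Calibrating the fraction per compound against an external library, as the slide suggests, then amounts to increasing it until the shuffled vector's best match in the library is no longer the compound itself.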

Possible approaches. Raw topological indices: development of new global models; after the development the data can be discarded. There is a theoretical possibility to decode the structure, particularly for smaller numbers of atoms in a molecule (it is not clear whether such an algorithm can be realized); a one-to-one contract may be required. Ranks of models: allows explicit structural parameters to be incorporated as feature elements; no limitation on the number of indices; the quality of the local correction is comparable to retraining; very appealing for sharing on the WWW; security can be controlled by shuffling, but shuffling deteriorates the prediction quality of the model. Development of new models: develop new models in-house; provide them to be included in the set of models; predict new data using an ensemble of diverse models (ASNN in the space of models of different companies). A complete set of automated tools to develop them can be provided.

Acknowledgement. Part of this presentation was done thanks to the Virtual Computational Chemistry Laboratory INTAS-INFO 00-0363 project (http://www.vcclab.org). I thank Prof. Hugo Kubinyi, Drs. Pierre Bruneau and Gennadiy Poda for collaboration, and Prof. Tudor Oprea for inviting me to participate in this conference. Thank you for your attention!