Structural interpretation of QSAR models a universal approach

Similar documents
Exploring the black box: structural and functional interpretation of QSAR models.

Interpretation of QSAR models

Structure-Activity Modeling - QSAR. Uwe Koch

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

Machine learning for ligand-based virtual screening and chemogenomics!

E. Muratov 1, E. Varlamova 2, A. Artemenko 2, D. Fourches 1, V. Kuz'min 2, A. Tropsha 1

Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule

Applications of multi-class machine

In Silico Prediction of ADMET properties with confidence: potential to speed-up drug discovery

QSAR Modeling of Human Liver Microsomal Stability Alexey Zakharov

QSAR Study on N- Substituted Sulphonamide Derivatives as Anti-Bacterial Agents

QSAR/QSPR modeling. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships

Evaluation. Andrea Passerini Machine Learning. Evaluation

Chemical Space: Modeling Exploration & Understanding

Predicting Binding Affinity of CSAR Ligands Using Both Structure- Based and Ligand-Based Approaches

Evaluation requires to define performance measures to be optimized

Machine Learning Concepts in Chemoinformatics

Molecular Descriptors Family on Structure Activity Relationships 5. Antimalarial Activity of 2,4-Diamino-6-Quinazoline Sulfonamide Derivates

CS6220: DATA MINING TECHNIQUES

Classification techniques focus on Discriminant Analysis

Hierarchical QSAR technology based on the Simplex representation of molecular structure

Linear and Logistic Regression. Dr. Xiaowei Huang

(Big) Data analysis using On-line Chemical database and Modelling platform. Dr. Igor V. Tetko

Chemical library design

Holdout and Cross-Validation Methods Overfitting Avoidance

QSPR MODELLING FOR PREDICTING TOXICITY OF NANOMATERIALS

Ligand-receptor interactions

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLĂ„RNING

CS6220: DATA MINING TECHNIQUES

Logistic Regression. COMP 527 Danushka Bollegala

Materials Informatics: Statistical Modeling in Material Science

OCHEM. Product features and highlights

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

UniStra activities within the BigChem project:

Virtual screening in drug discovery

Screening and prioritisation of substances of concern: A regulators perspective within the JANUS project

Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity

ESS2222. Lecture 4 Linear model

(e.g.training and prediction set, algorithm, ecc...). 2.9.Availability of another QMRF for exactly the same model: No other information available

Modeling Mutagenicity Status of a Diverse Set of Chemical Compounds by Envelope Methods

molecules ISSN

Hierarchical models for the rainfall forecast DATA MINING APPROACH

Molecular Modeling Studies of RNA Polymerase II Inhibitors as Potential Anticancer Agents

Nonlinear Classification

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Statistical learning theory, Support vector machines, and Bioinformatics

CSCI-567: Machine Learning (Spring 2019)

QSAR in Green Chemistry

Translating Methods from Pharma to Flavours & Fragrances

Scoring functions for of protein-ligand docking: New routes towards old goals

Learning Decision Trees

Support Vector Inductive Logic Programming

Listwise Approach to Learning to Rank Theory and Algorithm

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

Identification of Active Ligands. Identification of Suitable Descriptors (molecular fingerprint)

A Deep Interpretation of Classifier Chains

Emerging patterns mining and automated detection of contrasting chemical features

Optimizing Model Development and Validation Procedures of Partial Least Squares for Spectral Based Prediction of Soil Properties

Prediction of Acute Toxicity of Emerging Contaminants on the Water Flea Daphnia magna by Ant Colony Optimization - Support Vector Machine QSTR models

FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION

What is a property-based similarity?

Machine Learning Practice Page 2 of 2 10/28/13

Plan. Day 2: Exercise on MHC molecules.

Introduction to Machine Learning and Cross-Validation

Machine Learning: Evaluation

Machine learning methods to infer drug-target interaction network

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Pose and affinity prediction by ICM in D3R GC3. Max Totrov Molsoft

Click Prediction and Preference Ranking of RSS Feeds

Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015

New opportunities for high-resolution countrywide tree type mapping

CheS-Mapper 2.0 for visual validation of (Q)SAR models

Medicinal Chemistry/ CHEM 458/658 Chapter 3- SAR and QSAR

Stephen Scott.

Discovery Through Situational Awareness

De Novo molecular design with Deep Reinforcement Learning

A Bias Correction for the Minimum Error Rate in Cross-validation

CS145: INTRODUCTION TO DATA MINING

Electrical and Computer Engineering Department University of Waterloo Canada

ABC-LogitBoost for Multi-Class Classification

Qsar study of anthranilic acid sulfonamides as inhibitors of methionine aminopeptidase-2 using different chemometrics tools

BAGGING PREDICTORS AND RANDOM FOREST

A Magiv CV Theory for Large-Margin Classifiers

3D QSAR analysis of quinolone based s- triazines as antimicrobial agent

OECD QSAR Toolbox v.4.1. Step-by-step example for building QSAR model

Discriminative Learning and Big Data

Kinome-wide Activity Models from Diverse High-Quality Datasets

Effect of 3D parameters on Antifungal Activities of Some Heterocyclic Compounds

ECS289: Scalable Machine Learning

bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012

Statistical Machine Learning from Data

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

Estimation of Melting Points of Brominated and Chlorinated Organic Pollutants using QSAR Techniques. By: Marquita Watkins

CS 6375 Machine Learning

Learning with multiple models. Boosting.

Tutorials on Library Design E. Lounkine and J. Bajorath (University of Bonn) C. Muller and A. Varnek (University of Strasbourg)

An Introduction to Statistical Machine Learning - Theoretical Aspects -

International Journal of Chemistry and Pharmaceutical Sciences

Transcription:

Methods and Applications of Computational Chemistry - 5 Kharkiv, Ukraine, 1 5 July 2013 Structural interpretation of QSAR models a universal approach Victor Kuz min, Pavel Polishchuk, Anatoly Artemenko, Eugene Muratov A.V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine Odessa, Ukraine pavel_polishchuk@ukr.net 1

QSAR interpretation: interpretability vs. complexity Model interpretability Popular misbelief MLR PLS DT knn RF SVM ANN Model complexity 2

QSAR interpretation approaches Model-specific approaches: Rule-based (Decision tree) Regression coefficients (MLR, PLS) Latent variables (PLS) Weights and biases (ANN) Model-independent approaches: Variable importance Local gradients or partial derivatives 3

Model-independent interpretation approaches Variable importance Imp i MSE(x i ) MSE(x permut i ) Local gradients or partial derivatives C i f(x i ) f(x x i i Δx i ) 4

QSAR interpretation: common workflow Model Variables contributions Structureproperty relationship f(x) Var_1 Var_2 Mol_1-0.23 1.82 Mol_2 2.36 1.27 Mol_3 5.01 2.30 Mol_4 0.69-0.58 5

Matched molecular pairs approach logs = -3.18 logs = -0.60 ΔlogS = 2.58 H OH ΔlogS = 1.59 logs = -2.21 logs = -0.62 6

Exemplified dataset 7

Universal structural QSAR interpretation - = logs pred = -1.55 logs pred = -1.61 ΔlogS pred = 0.06 - = logs pred = -1.55 logs pred = -1.35 ΔlogS pred = -0.20 8

Universal structural QSAR interpretation - = logs pred = -1.93 logs pred = -4.32 ΔlogS pred = 2.39 9

Simplex representation of molecular structure (SiRMS) Simplex generation example Atom-property labeling Kuz min, V. E. et al, Journal of Molecular Modelling 2005, 11, 457-467. Kuz min, V. et al, Journal of Computer-Aided Molecular Design 2008, 22, 403-421. 10

Case studies End points: Solubility (regression) Inhibition of Transglutaminase 2 TG2 (regression) Mutagenicity (binary classification) Descriptors: Simplex representation of molecular structure (SiRMS) Dragon Machine learning methods: Random Forest (RF) Support vector machine (SVM) Projects to latent structures (PLS) 11

Solubility: dataset and models Overall number of compounds 1033 Huuskonen, J. J. Chem. Inf. Comp. Sci. 2000, 40, 773-777 5-fold external cross validation results (10 runs) SiRMS Dragon Endpoint Model R 2 CV RMSE R 2 CV RMSE PLS 0.84 0.82 0.91 0.60 Solubility, RF 0.88 0.71 0.91 0.62 logs SVM 0.87 0.72 0.92 0.59 12

Solubility: interpretation SiRMS vs. Dragon 13

Solubility: fragment ranking SiRMS 14

Solubility: pair-wise contribution correlations 15

Transglutaminase 2 inhibition: dataset and models R1 = acyl groups( preferably acryl); R2 = NO 2, F, Br, CF 3, CH 3, OCH 3. R1 = acyl groups (preferably acryl and its derivatives); R3 = acyl groups (preferably Boc, Cbz and its derivatives), substituted phenyl and pyridyl. Prime, M. E. et al, J. Med. Chem. 2012, 55, 1021-1046. 5-fold external cross validation results (10 runs) SiRMS Dragon Endpoint Model R 2 CV RMSE R 2 CV RMSE PLS 0.70 0.67 0.65 0.72 TG2 inhibition, RF 0.74 0.62 0.64 0.74 pic 50 SVM 0.70 0.67 0.68 0.70 16

TG2 inhibition: ranking R1 substituents 17

TG2 inhibition: ranking R2 substituents 18

Ames mutagenicity: dataset and models + 2344 mutagens 2017 non-mutagens 4361 overall 5-fold external cross validation results (10 runs) Descriptors Algorithm Balanced Accuracy SiRMS RF 0.817 SVM 0.800 Dragon RF 0.816 SVM 0.793 19

Ames mutagenicity: fragments ranking 20

Universal structural QSAR interpretation: benefits Estimation of contribution of fragments with single (terminal groups) and multiple attachment points (scaffolds or linkers) Non-additivity of calculated contributions (depends on an investigated property) Estimation of mutual fragment influence on a property Calculated fragment contributions are independent from used descriptors and machine learning methods 21

Related projects SiRMS project on GitHub: https://github.com/drrdom/sirms A.V. Bogatsky Physico-Chemical Institute, Chemoinformatic group: http://qsar4u.com 22