Machine learning for ligand-based virtual screening and chemogenomics!

Similar documents
Kernels for small molecules

Machine learning methods to infer drug-target interaction network

Rchemcpp Similarity measures for chemical compounds. Michael Mahr and Günter Klambauer. Institute of Bioinformatics, Johannes Kepler University Linz

Statistical learning theory, Support vector machines, and Bioinformatics

Xia Ning,*, Huzefa Rangwala, and George Karypis

Structure-Activity Modeling - QSAR. Uwe Koch

Package Rchemcpp. August 14, 2018

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Introduction. OntoChem

Machine Learning Concepts in Chemoinformatics

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

Exploring the black box: structural and functional interpretation of QSAR models.

An Integrated Approach to in-silico

Research Article. Chemical compound classification based on improved Max-Min kernel

COMS 4771 Introduction to Machine Learning. Nakul Verma

Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach

Protein function prediction via graph kernels

Introduction to Chemoinformatics and Drug Discovery

In silico pharmacology for drug discovery

Kernel-based Machine Learning for Virtual Screening

CS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines

Development of a Structure Generator to Explore Target Areas on Chemical Space

De Novo molecular design with Deep Reinforcement Learning

CADASTER & MC ITN ECO

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Machine learning in molecular classification

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Cross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic

Structural interpretation of QSAR models a universal approach

A Linear Programming Approach for Molecular QSAR analysis

Kernel Methods. Konstantin Tretyakov MTAT Machine Learning

Drug-Target Interaction Prediction by Learning From Local Information and Neighbors

Using Self-Organizing maps to accelerate similarity search

Kernel Methods. Konstantin Tretyakov MTAT Machine Learning

Supervised classification for structured data: Applications in bio- and chemoinformatics

Introduction to Chemoinformatics

Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics

Computational chemical biology to address non-traditional drug targets. John Karanicolas

Navigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland

Computational Approaches towards Life Detection by Mass Spectrometry

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines

Drug Informatics for Chemical Genomics...

QSAR Modeling of Human Liver Microsomal Stability Alexey Zakharov

QSAR/QSPR modeling. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships

Chemoinformatics and information management. Peter Willett, University of Sheffield, UK

1. Some examples of coping with Molecular informatics data legacy data (accuracy)

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

The Changing Requirements for Informatics Systems During the Growth of a Collaborative Drug Discovery Service Company. Sally Rose BioFocus plc

Structural biology and drug design: An overview

A NETFLOW DISTANCE BETWEEN LABELED GRAPHS: APPLICATIONS IN CHEMOINFORMATICS

Electrical and Computer Engineering Department University of Waterloo Canada

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

S1324: Support Vector Machines

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Abstract. 1 Introduction 1 INTRODUCTION 1

The PhilOEsophy. There are only two fundamental molecular descriptors

Interactive Feature Selection with

Support Vector Machines

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

Support Vector Machine (SVM) and Kernel Methods

Improved Machine Learning Models for Predicting Selective Compounds

Fast similarity searching making the virtual real. Stephen Pickett, GSK

Plan. Day 2: Exercise on MHC molecules.

Drug Design 2. Oliver Kohlbacher. Winter 2009/ QSAR Part 4: Selected Chapters

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

DATA ANALYTICS IN NANOMATERIALS DISCOVERY

Identification of Active Ligands. Identification of Suitable Descriptors (molecular fingerprint)

PAC-learning, VC Dimension and Margin-based Bounds

Using Phase for Pharmacophore Modelling. 5th European Life Science Bootcamp March, 2017

Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods

Chemical Space: Modeling Exploration & Understanding

Computational Methods and Drug-Likeness. Benjamin Georgi und Philip Groth Pharmakokinetik WS 2003/2004

Lecture 10: A brief introduction to Support Vector Machine

Ignasi Belda, PhD CEO. HPC Advisory Council Spain Conference 2015

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

Bridging the Dimensions:

Application of kernel functions for accurate similarity search in large chemical databases

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Chemical Space. Space, Diversity, and Synthesis. Jeremy Henle, 4/23/2013

Support Vector Machine (continued)

A Submodular Framework for Graph Comparison

Development of a Protein-Ligand Extended Connectivity (PLEC) fingerprint and its application for binding affinity predictions.

Introduction to Support Vector Machines

Brief Introduction to Machine Learning

Support Vector Machine (SVM) and Kernel Methods

Applied Machine Learning Annalisa Marsico

Similarity methods for ligandbased virtual screening

On learning with kernels for unordered pairs

Support Vector Machines: Maximum Margin Classifiers

Subspace Analysis for Facial Image Recognition: A Comparative Study. Yongbin Zhang, Lixin Lang and Onur Hamsici

Novel Approaches for Small Biomolecule Classification and Structural Similarity Search

Review: Support vector machines. Machine learning techniques and image analysis

LIBRARY DESIGN FOR COLLABORATIVE DRUG DISCOVERY: EXPANDING DRUGGABLE CHEMOGENOMIC SPACE

Raed Saeed Khashan. Chapel Hill Approved by: Dr. Alexander Tropsha. Dr. Weifan Zheng. Dr. Alexander Golbraikh. Dr. Wei Wang. Dr.

Transcription:

Machine learning for ligand-based virtual screening and chemogenomics! Jean-Philippe Vert Institut Curie - INSERM U900 - Mines ParisTech In silico discovery of molecular probes and drug-like compounds: Success & Challenges INSERM workshop, Saint-Raphaël, France, March 25, 2010

Outline 1. Machine learning for ligand-based virtual screening 2. 2D kernels 3. 3D kernels 4. Towards in silico chemogenomics

Machine learning for ligand-based virtual screening

Ligand-based virtual screening / QSAR From http://cactus.nci.nih.gov!

Represent each molecule as a vector

and discriminate with machine learning - LDA - PLS - Neural network - Decision trees - Nearest neighbour - SVM,

Support Vector Machine (SVM) - Nonlinear - Large margin (useful in high dim) - Need pairwise distance / similarity as input instead of vectors / fingerprints

From descriptors to similarities Molecules Representation Vectors / Fingerprints Tanimoto Inner product «Kernel» Pairwise distance / similarity Discrimination - Neural Net - LDA - Decision trees - PLS, - SVM - Kernel PLS - Kernel LDA -

2D kernels

2D fragment kernels (walks) Kashima et al. (2003), Gärtner et al. (2003)

Properties of the 2D fragment kernel Corresponds to a fingerprint of infinite size Can be computed efficiently in O( x ^3 x ^3) (much faster in practice) Solves the problem of clashes and memory storage (fingerprints are not computed explicitly) Kashima et al. (2003), Gärtner et al. (2003)

2D kernel computational trick

Extension 1: label enrichment - Increases the expressiveness of the kernel - Faster computation with more labels - Other relabeling schemes are possible (pharmacophores) Mahé et al. (2005)

Extension 2: subtree patterns «All subtree patterns» Mahé and V., Mach. Learn, 2009. Ramon et al. (2004), Mahé & V. (2009)

2D subtree vs walk kernel NCI 60 dataset Mahé & V. (2009)

Other 2D kernels Indexing by all shortest paths (Borgwardt & Kriegel 2005) Indexing by all small subgraphs (Shervashidze et al. 2009) Optimal assignment kernel (Fröhlich et al. 2005)

3D pharmacophore kernel

3-point pharmacophores Mahé et al., J. Chem. Inf. Model., 2006.

3D fingerprint kernel

Removing discretization artifacts 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0 0 0 0 1 2 3 4 5 3D Fingerprint 0 1 2 3 4 5 3D Fuzzy Fingerprint 0 1 2 3 4 5 3D Kernel

From the fingerprint kernel to the pharmacophore kernel

Experiments Mahé et al., J. Chem. Inf. Model., 2006.

Towards in silico chemogenomics

Chemogenomics Chemical space Target family

In silico Chemogenomics Chemical space Target family

Fingerprint for a (target,molecule) pair? t= c = = - Sequence - Structure - Evolution - Expression - = - 2D - 3D - Pharmacophore - MW, logp, =???

Fingerprint for a (target,molecule) pair? T= c = = - Sequence - Structure - Evolution - Expression - = - 2D - 3D - Pharmacophore - logp, 10 6 10 3 10 3

Similarity for (target,molecule) pairs t= c = = - Sequence - Structure - Evolution - Expression - = - 2D - 3D - Pharmacophore - logp,

Summary: SVM for chemogenomics 1. Choose a kernel (similarity) for targets 2. Choose a kernel (similarity) for ligands 3. Train a SVM model with the product kernel for (target/ligand) pairs

Important remark New methods are being actively developed in machine learning for learning over pairs «Collaborative filtering», «transfer learning», «multitask learning», «MMMF», «pairwise SVM», etc 37k registered teams from 180 countries!

Application: virtual screening of GPCR Data: GLIDA database filtered for drug-like compounds - 2446 ligands - 80 GPCR - 4051 interactions - 4051 negative interactions generated randomly Ligand similarity - 2D Tanimoto - 3D pharmacophore Target similarities - 0/1 Dirac (no similarity) - Multitask (uniform similarity) - GLIDA s hierarchy similarity - Binding pocket similarity (31 AA) (Jacob et al., BMC Bioinformatics, 2008)

Results (mean accuracy over GPCRs) 5-fold cross-validation Orphan GPCRs setup (Jacob et al., BMC Bioinformatics, 2008)

Influence of the number of known ligands Number of ligands / GPCR Performance improvement (hierarchy vs Dirac) (Jacob et al., BMC Bioinformatics, 2008)

Screening of enzymes, GPCRs, ion channels Data: KEGG BRITE database, redundancy removed Enzymes - 675 targets - 524 molecules - 1218 interactions - 1218 negatives GPCRs - 100 targets - 219 molecules - 399 interactions - 399 negatives Ion channels - 114 targets - 462 molecules - 1165 interactions - 1165 negatives (Jacob and V., Bioinformatics, 2008)

Results (mean AUC) 10-fold CV Orphan setting (Jacob and V., Bioinformatics, 2008)

Influence of the number of known ligands Enzymes GPCRs Ion channels Relative improvement : hierarchy vs Dirac (Jacob and V., Bioinformatics, 2008)

Conclusion SVM offer state-of-the-art performance in many chemo- and bio-informatics applications The kernel trick is useful to Work implicitly with many features without computing them (2D fragment kernels) Work with similarity measures that cannot be derived from descriptors (optimal alignment kernel) Relax the need for discretization (3D pharmacophore kernel) Work in a product space (chemogenomics) Promising direction: Multiple kernel learning Collaborative filtering in product space

Thank you! Collaborators: P. Mahé, L. Jacob, V. Stoven, B. Hoffmann References : http://cbio.ensmp.fr/~jvert Open-source kernels for chemoinformatics: http://chemcpp.sourceforge.net