Using Self-Organizing maps to accelerate similarity search

Similar documents
Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME

Machine Learning Concepts in Chemoinformatics

Fast similarity searching making the virtual real. Stephen Pickett, GSK

Chemical Space. Space, Diversity, and Synthesis. Jeremy Henle, 4/23/2013

Machine learning for ligand-based virtual screening and chemogenomics!

Introduction to Chemoinformatics and Drug Discovery

Similarity Search. Uwe Koch

Introduction. OntoChem

Navigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland

Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule

UniStra activities within the BigChem project:

Similarity methods for ligandbased virtual screening

Bioengineering & Bioinformatics Summer Institute, Dept. Computational Biology, University of Pittsburgh, PGH, PA

Structure-Activity Modeling - QSAR. Uwe Koch

Chemical Space: Modeling Exploration & Understanding

Drug Design 2. Oliver Kohlbacher. Winter 2009/ QSAR Part 4: Selected Chapters

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Introducing a Bioinformatics Similarity Search Solution

Molecular Complexity Effects and Fingerprint-Based Similarity Search Strategies

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

Computational chemical biology to address non-traditional drug targets. John Karanicolas

QSAR/QSPR modeling. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing

Drug Informatics for Chemical Genomics...

Chemical Databases: Encoding, Storage and Search of Chemical Structures

An Integrated Approach to in-silico

Artificial Neural Networks. Edward Gatt

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

arxiv: v1 [cs.ds] 25 Jan 2016

The PhilOEsophy. There are only two fundamental molecular descriptors

Using Phase for Pharmacophore Modelling. 5th European Life Science Bootcamp March, 2017

E. Muratov 1, E. Varlamova 2, A. Artemenko 2, D. Fourches 1, V. Kuz'min 2, A. Tropsha 1

Chemoinformatics and information management. Peter Willett, University of Sheffield, UK

Targeting protein-protein interactions: A hot topic in drug discovery

Development of a Structure Generator to Explore Target Areas on Chemical Space

Data Mining in the Chemical Industry. Overview of presentation

KNIME-based scoring functions in Muse 3.0. KNIME User Group Meeting 2013 Fabian Bös

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

Tailored Bregman Ball Trees for Effective Nearest Neighbors

OECD QSAR Toolbox v.3.3. Step-by-step example of how to build a userdefined

Analysis of Interest Rate Curves Clustering Using Self-Organising Maps

Deep Learning: What Makes Neural Networks Great Again?

SPATIAL INDEXING. Vaibhav Bajpai

Design and characterization of chemical space networks

How Diverse Are Diversity Assessment Methods? A Comparative Analysis and Benchmarking of Molecular Descriptor Space

Imago: open-source toolkit for 2D chemical structure image recognition

OECD QSAR Toolbox v.4.1. Tutorial illustrating new options of the structure similarity

Performing a Pharmacophore Search using CSD-CrossMiner

OECD QSAR Toolbox v.4.1. Step-by-step example for building QSAR model

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

est Drive K20 GPUs! Experience The Acceleration Run Computational Chemistry Codes on Tesla K20 GPU today

Functional Group Fingerprints CNS Chemistry Wilmington, USA

has its own advantages and drawbacks, depending on the questions facing the drug discovery.

Tutorials on Library Design E. Lounkine and J. Bajorath (University of Bonn) C. Muller and A. Varnek (University of Strasbourg)

De Novo molecular design with Deep Reinforcement Learning

Models, Data, Learning Problems

Data Mining Part 5. Prediction

Searching Substances in Reaxys

Marginal Balance of Spread Designs

DATA FUSION APPROACHES IN LIGAND-BASED VIRTUAL SCREENING: RECENT DEVELOPMENTS OVERVIEW

Pavel Zezula, Giuseppe Amato,

COMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA

Design and Synthesis of the Comprehensive Fragment Library

In Silico Investigation of Off-Target Effects

Catching the Drift Indexing Implicit Knowledge in Chemical Digital Libraries

Expanding the scope of literature data with document to structure tools PatentInformatics applications at Aptuit

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

Xia Ning,*, Huzefa Rangwala, and George Karypis

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Kernels for small molecules

Learning Theory Continued

Development and application of ligand-based computational methods for de-novo drug. design and virtual screening. Alexander Richard Geanes.

Ligand Scout Tutorials

QSAR in Green Chemistry

Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 Palermo 31 Luglio - 4 Agosto 2017

OECD QSAR Toolbox v.4.1. Step-by-step example for predicting skin sensitization accounting for abiotic activation of chemicals

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

Universities of Leeds, Sheffield and York

Molecular Dynamics Graphical Visualization 3-D QSAR Pharmacophore QSAR, COMBINE, Scoring Functions, Homology Modeling,..

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

Pivot Selection Techniques

Evaluation of Molecular Similarity and Molecular Diversity Methods Using Biological Activity Data

Receptor Based Drug Design (1)

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Nonlinear Classification

Computational Methods and Drug-Likeness. Benjamin Georgi und Philip Groth Pharmakokinetik WS 2003/2004

CS 188: Artificial Intelligence. Outline

Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science

OECD QSAR Toolbox v.3.2. Step-by-step example of how to build and evaluate a category based on mechanism of action with protein and DNA binding

Modern Information Retrieval

Using AutoDock for Virtual Screening

Chapter 2 The Chemical Space of Flavours

FROM MOLECULAR FORMULAS TO MARKUSH STRUCTURES

Composite Quantization for Approximate Nearest Neighbor Search

Retrieving hits through in silico screening and expert assessment M. N. Drwal a,b and R. Griffith a

Transcription:

YOU LOGO Using Self-Organizing maps to accelerate similarity search Fanny Bonachera, Gilles Marcou, Natalia Kireeva, Alexandre Varnek, Dragos Horvath Laboratoire d Infochimie, UM 7177. 1, rue Blaise Pascal, 67000 Strasbourg

The Problem YOU LOGO We have used the ChemAxon API to develop high-quality, high information content, ph sensitive and otherwise chemically meaningful descriptors Fuzzy Pharmacophore Fingerprints (FPT), ISIDA Coloured Fragment counts 2

Fuzzy Pharmacophore Triplets YOU LOGO 3 3 3 4 5 3 5 4 4 7 3 4 5 5 6 0 0 0 0 0 +6 +3 0 2 D i (m) = total occupancy of basis triplet i in molecule m.

Microspecies-Specific Labeling of Fragments YOU LOGO µspecies increment counters Lower of & contained Upper fragments by their population levels Fragment sizes A/D A/D are user-defined N /A D D A-***-D +95 D-***-D +95 A-***-D +95 D-***-D +95 A-**A*-D +95 D-**A*-D +95 N-***-D +5 N-**A*-D +5 /A D Population: 95% Molecular Fingerprint 5%

Augmented Atoms YOU LOGO Branched fragments, representing an atom and (an user-defined number of ) its successive coordination spheres Strict Typing with Bond Info (-b) Strict Typing, no Bond Info D(-(*)*)(-H(-H)=A) D(-(*)*A)(-H(-H)=A) D(())(H(H)A) D(()A)(H(H)A) /A D H H A All but Central and Terminal Atoms may be wildcards (-b -w) D(-(*)*)(-H(-H)=A) D(-?(*)*)(-H(-H)=A) D(-?(*)*A)(-H(-H)=A) Tree descriptors have wildcards for all but Central & Terminal: D(-?(*)*A)(-?(-H)=A)

The Problem YOU LOGO We have used the ChemAxon API to develop high-quality, high information content, ph sensitive and otherwise chemically meaningful descriptors Fuzzy Pharmacophore Fingerprints (FPT), ISIDA Coloured Fragment counts It is time to exploit them in similarity-driven virtual screening. search for similar compounds of similar fingerprints to a given query, and therefore, hopefully, a similar activity. BUT, these fingerprints are neither binary, nor really short (>10 5 -dimensional, with some fragmentation schemes). We are short-lived mortals who d nevertless like to allow public web-based screening of something like the ZINC database, for mortal web site users. 2

A Possible Solution YOU LOGO Self Organizing Map (SOM)-enhancement: map molecules, search neighborhood Pharmacophore patterns of database compounds (3 to 8 millions) Is this an interesting node? Underpopulated? Chemically appealing? External compound ( query ) Compare query only to neighboring references: Neuron adius to be defined Page 7

YOU LOGO Data sets Similarity Screening Sets: QS - Query Set (2000 compounds) : random subsets of 11 different analogue series used to model structure-property relationships in literature, marketed drugs and biological reference compounds and commercially available molecules (picked randomly from the ZINC database) DB - Database (55613 compounds) : including the remainders of the 11 abovecited series, further marketed drugs and biological reference compounds, 1870 ligands from the Pubchem database tested on the heg channel, and a majority of randomly picked ZINC compounds. No overlap between DB and QS. For each molecule in QS, the list of its top neighbors from DB is found by classical calculation of Tanimoto and Euclidean coefficients against the entire DB, then selecting top 300 hits at Tanimoto>0.75, and respectively Euclidean<9. A maximum of these Tanimoto and Euclidean hits must then be found again in SOM-enhanced VS but in a much shorter time Page 8

YOU LOGO Build maps: Data sets Sets for Map training: Extended - SOM training set (53206 compounds) : a subset of previous molecules (DB+QS), excluding the analogue series members, the Pubchem compounds and some 900 ZINC molecules. Smallef - SOM training set (11168 compounds) : features all drugs and biological reference compounds seen in Extended, but significantly less ZINC molecules. and external testing ExtDB - External database for real-life tests (~160000 compounds) : from the corporate collection of one of our industrial partners. ExtQ - External query set for real-life tests (12491 compounds) : taken basically from Smallef, and completed with randomly picked commercial compounds. Page 9

YOU LOGO Building the SOMs Generated maps : For each training set (Smallef and Extended) 36 explored geometries Varying X and Y from 8x6 to 30x30 Map fitting steps : Explored training iterations Brute training : 1000 iterations efinement : 10000 iterations Hyperefinement : tests between 50000, 80000, 100000, 200000 iterations Page 10

YOU LOGO Monitoring map convergence 22x28 rectangle bubble trained on Extended Page 11

YOU LOGO Monitoring map convergence 22x28 rectangle bubble : Smallef vs Ext. Page 12

YOU LOGO Map-enhanced similarity searching How do we define Q? Map Quality criterion Q Virtual Screening enhancement factor of a map Scanning for the best time enhancement vs. etrieval rate compromise over increasing adii. Q max[ @ (1 f @ ) 2 ] Q, with respect to dissimilarity metric Σ, for the map κ needs to be optimized by scanning for the best time enhancement @ 1 @ vs. retrieval rate compromise over increasing radii. f Page 13

YOU LOGO Impact of the training set size comparison of Q factor Page 14

YOU LOGO An overview of a good map Best map, Q=0.77 SOM trained on the Smallef dataset. 18x20 = 360 neurons. 3 training steps (Brute + efinement + Hyperefinement at 50k iterations) ectangular topology and Bubble neighbourhood function Colored according to Lipinski-rules violations (ed = 0 violations, green = 1, yellow = 2, blue = 3). Mapping of DUD compounds to check consistency Page 15

YOU LOGO An overview of a good map Page 16

YOU LOGO eal-life testing MAP Neuron adius #Detected Similar Pairs Time (mins) small_but_good 1 29718 158 The top1 good behavior 1 of the maps, 27107 as evidenced 65 at their primary top2 benchmarking 1* stage, 24588 was confirmed. 74 top2 3 27096 139 top2 5* 27707 260 top2 10* 28571 597 top3 2 27837 96 Similar compounds not located in neighbours neurons are at risk to be dispatched Anywhere in the map retrieving them by increasing might be very costly. Page 17

Conclusions YOU LOGO Too much learning is harmful, even for artificial brains Maps may, and should, be trained on relatively small though diverse compound sets. Too many input molecules just make convergence more difficult. Good news: no need to retrain your maps when expanding your database! Too much fitting of the code vectors may be detrimental interestingly, unsupervised learning methods may suffer from overfitting too! SOM acceleration works in both Tanimoto and Euclidean spaces, and with different sets of descriptors: get ~90% of expected virtual hits in 10% of time. but loosing a few virtual hits was never a Greek tragedy. If this should become an issue (Greeks are unpredictable), it is enough to choose a larger Neuron adius. 2