Similarity methods for ligandbased virtual screening

Similar documents
Chemoinformatics and information management. Peter Willett, University of Sheffield, UK

has its own advantages and drawbacks, depending on the questions facing the drug discovery.

DATA FUSION APPROACHES IN LIGAND-BASED VIRTUAL SCREENING: RECENT DEVELOPMENTS OVERVIEW

Chemoinformatics: the first half century

Universities of Leeds, Sheffield and York

This is a repository copy of Multiple search methods for similarity-based virtual screening: analysis of search overlap and precision.

Universities of Leeds, Sheffield and York

This is a repository copy of Chemoinformatics techniques for data mining in files of two-dimensional and three-dimensional chemical molecules.

Molecular Complexity Effects and Fingerprint-Based Similarity Search Strategies

Similarity Search. Uwe Koch

Universities of Leeds, Sheffield and York

COMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA

An Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection

Universities of Leeds, Sheffield and York

A Framework For Genetic-Based Fusion Of Similarity Measures In Chemical Compound Retrieval

Computational chemical biology to address non-traditional drug targets. John Karanicolas

Evaluation of Molecular Similarity and Molecular Diversity Methods Using Biological Activity Data

Dr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre

Data Mining in the Chemical Industry. Overview of presentation

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME

est Drive K20 GPUs! Experience The Acceleration Run Computational Chemistry Codes on Tesla K20 GPU today

Using Self-Organizing maps to accelerate similarity search

Fast similarity searching making the virtual real. Stephen Pickett, GSK

Machine Learning Concepts in Chemoinformatics

Clustering Ambiguity: An Overview

Introducing a Bioinformatics Similarity Search Solution

An Integrated Approach to in-silico

Development of a Structure Generator to Explore Target Areas on Chemical Space

Molecular Similarity Searching Using Inference Network

COMBINATORIAL CHEMISTRY: CURRENT APPROACH

Catching the Drift Indexing Implicit Knowledge in Chemical Digital Libraries

Machine learning for ligand-based virtual screening and chemogenomics!

De Novo molecular design with Deep Reinforcement Learning

Chemical Similarity Searching

Functional Group Fingerprints CNS Chemistry Wilmington, USA

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

Cheminformatics analysis and learning in a data pipelining environment

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

Mixture of metrics optimization for machine learning problems

Introduction. OntoChem

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

Navigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland

Research Article. Chemical compound classification based on improved Max-Min kernel

Performing a Pharmacophore Search using CSD-CrossMiner

MSc Drug Design. Module Structure: (15 credits each) Lectures and Tutorials Assessment: 50% coursework, 50% unseen examination.

Mathematical Correction for Fingerprint Similarity Measures to Improve Chemical Retrieval

bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012

Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?

Introduction to Chemoinformatics

Introduction to Chemoinformatics and Drug Discovery

Drug Design 2. Oliver Kohlbacher. Winter 2009/ QSAR Part 4: Selected Chapters

AUTOMATIC GENERATION OF TAUTOMERS

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

Virtual affinity fingerprints in drug discovery: The Drug Profile Matching method

Targeting protein-protein interactions: A hot topic in drug discovery

The Schrödinger KNIME extensions

Analysis of Random Fragment Profiles for the Detection of Structure-Activity Relationships

1. Some examples of coping with Molecular informatics data legacy data (accuracy)

DivCalc: A Utility for Diversity Analysis and Compound Sampling

Characterization of Pharmacophore Multiplet Fingerprints as Molecular Descriptors. Robert D. Clark 2004 Tripos, Inc.

Marginal Balance of Spread Designs

October 6 University Faculty of pharmacy Computer Aided Drug Design Unit

Virtual screening in drug discovery

arxiv: v1 [cs.ds] 25 Jan 2016

AMRI COMPOUND LIBRARY CONSORTIUM: A NOVEL WAY TO FILL YOUR DRUG PIPELINE

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

Introduction to Chemoinformatics

Developing CAS Products for Substructure Searching by Chemists. Linda Toler

Improving structural similarity based virtual screening using background knowledge

Chemical Space. Space, Diversity, and Synthesis. Jeremy Henle, 4/23/2013

Course Plan (Syllabus): Drug Design and Discovery

Expanded Interaction Fingerprint Method for Analyzing Ligand Binding Modes in Docking and Structure-Based Drug Design

COMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE

THE REVIEW OF VIRTUAL SCREENING TECHNIQUES

Exploring the chemical space of screening results

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

György M. Keserű H2020 FRAGNET Network Hungarian Academy of Sciences

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

EMPIRICAL VS. RATIONAL METHODS OF DISCOVERING NEW DRUGS

Hit Finding and Optimization Using BLAZE & FORGE

In silico pharmacology for drug discovery

JCICS Major Research Areas

Structure-Activity Modeling - QSAR. Uwe Koch

Electrical and Computer Engineering Department University of Waterloo Canada

Web tools for Monomer selection, Library Design and Compound Acquisition. Andrew Leach GlaxoSmithKline Research and Development Stevenage

Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems.

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner

Receptor Based Drug Design (1)

Analyzing Small Molecule Data in R

Kernels for small molecules

Chemical Databases: Encoding, Storage and Search of Chemical Structures

OECD QSAR Toolbox v.4.1. Tutorial illustrating new options of the structure similarity

Virtual Screening: How Are We Doing?

Building innovative drug discovery alliances. Just in KNIME: Successful Process Driven Drug Discovery

Searching Substances in Reaxys

Ignasi Belda, PhD CEO. HPC Advisory Council Spain Conference 2015

Computational Methods and Drug-Likeness. Benjamin Georgi und Philip Groth Pharmakokinetik WS 2003/2004

CHEMINFORMATICS MODELING OF DIVERSE AND DISPARATE BIOLOGICAL DATA AND THE USE OF MODELS TO DISCOVER NOVEL BIOACTIVE MOLECULES

Transcription:

Similarity methods for ligandbased virtual screening Peter Willett, University of Sheffield Computers in Scientific Discovery 5, 22 nd July 2010

Overview Molecular similarity and its use in virtual screening Use of fragment weighting schemes Comparison of fusion rules

Chemoinformatics The pharmaceutical industry has been one of the great success stories of scientific research in the latter half of the twentieth century Range of novel drugs for important therapeutic areas Agrochemicals and other fine-chemicals Chemoinformatics has played an increasingly important role in these developments Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information (Greg Paris, quoted at http://www.warr.com/warrzone.htm) Particular focus on the manipulation of information about chemical structures (2D or 3D) Virtual screening now a key area of study

Virtual screening Ranking the molecules in a database in order of decreasing probability of activity Focus interest on just those at the top of the ranking Range of methods available, varying in the types of information available Use of structure-based methods when an X-ray structure for the biological target is available Use of ligand-based methods when no such information is available Database searching a common approach

Searching chemical databases Three main types of search Structure search Find me information about this molecule Substructure search Find me molecules that contain this partial structure Similarity search Find me molecules like this molecule

Similarity searching Substructure searching very powerful but requires a clear view of the types of structures of interest Given a reference structure find molecules in a database that are most similar to it ( give me ten more like this ) The similar property principle states that structurally similar molecules tend to have similar properties (cf neighbourhood principle) HO O OH O O OH O O O O O Morphine Codeine Heroin

How to define chemical similarity? eed for a similarity measure A structure representation A weighting scheme A similarity coefficient Very many different similarity measures: the most common uses 2D fingerprints and the Tanimoto coefficient First suggested in early Seventies but operational implementations not till mid-eighties

Similarity searching with 2D fingerprints and the Tanimoto coefficient H H OH H O H 2 Query H H 2 HO OH H H H 2

C C C C C O C C C Fingerprints A simple, but approximate, representation that encodes the presence of fragment substructures in a bit-string or fingerprint Cf keywords indexing textual documents Each bit in the bit-string (binary vector) records the presence ( 1 ) or absence ( 0 ) of a particular fragment in the molecule. Typical length is a few hundred or few thousand bits Two fingerprints are regarded as similar if they have many common bits set

Tanimoto coefficient for binary bit strings C S RD R D C C bits set in common between Reference and Database structures R bits set in Reference structure D bits set in Database structure S RD equal to one (or zero) corresponds to identical fingerprints (or no bits in common) More complex form for use with non-binary data, e.g., when one has non-binary fragment weights Many other similarity coefficients exist, e.g. cosine coefficient, Euclidean distance, Tversky index

Experimental details Use of MDDR (ca. 102K structures) and WOMBAT (ca. 130K structures) databases Sets of molecules with known biological activities Molecules represented by various types of fingerprint Simulated virtual screening using an active as the reference structure How many of the top-ranked molecules from a similarity search are also active?

Use of fingerprint weighting Binary fingerprints work well, but can we do better, given additional information? Use of frequency information Focus for this work Use of activity information Powerful machine learning methods, but need to have many actives and inactives

Types of frequency information Frequency within a molecule If two molecules have multiple occurrences of a fragment in common then more similar than if just a single occurrence in common Frequency within a database If two molecules share a very rare fragment then more similar than if share a very common fragment

Weighting in textual information retrieval Weighting of keywords in textual IR Both types of weighting improve performance as compared to simple binary weighting Is this also the case in similarity-based virtual screening? Previous studies on small-scale and equivocal results

Weighting in chemoinformatics: I S RD i k R k i 1 k 2 R D D k 2 i i 1 i 1 i 1 i i R D i i Experiments show that Use of occurrence, rather than incidence, data is generally useful Best results using the square root of the occurrence frequencies in both the reference and database structures

Weighting in chemoinformatics: II For a fragment occurring in T of the molecules in a database use the inverse frequency weight log(/t) Experiments show that: If the actives are closely related then this weight enhances performance over unweighted searching. If the actives are structurally diverse sets then unweighted searching is superior

Data fusion Originally developed for signal processing but an entirely general approach: Improved performance can be obtained by combining evidence from several different sources When used for similarity searching, combine multiple rankings of a database to give a single, fused ranking Similarity fusion A single reference structure with multiple similarity measures (e.g., different fingerprints or different similarity coefficients) Group fusion A single similarity measure but multiple reference structures How to combine different rankings?

Fusion rules Given multiple input rankings, a fusion rule outputs a single, combined ranking The rankings can be either the computed similarity values or the resulting rank positions Previous work has identified use of: CombMAX for similarity data CombSUM for rank data Many others can be used (15 in all here)

Fusion rules for the x-th database structure CombMax = max{s1(x), S2(x)..Si(x)..Sn(x)} Also CombMI CombSum = ΣSi(x) Also CombMED and other averages CombRKP = Σ(1/Ri(x)) Can only be used with rank data

Experimental details Searches carried out using Similarity fusion and group fusion Various percentages of the ranked database Different fusion rules Results show conclusively that: Use just the top 1-5% of each ranked list Use the CombRKP fusion rule

Use of CombRKP: I Virtual screening seeks to rank molecules in decreasing order of probability of activity: MDDR searches (J. Med. Chem., 2005, 48, 7049) show a hyperbola-like plot

Use of CombRKP: II Probability of activity approximated by (1/Rank), and hence CombRKP likely to perform well

Conclusions Similarity-based virtual screening using fingerprints well-established Can enhance screening effectiveness by: Using fragment occurrence data Combining the rankings from multiple searches using the CombRKP fusion rule

Acknowledgments Shereen Arif John Holliday Christoph Mueller urul Malim