Topology based deep learning for biomolecular data

Similar documents
Topology based deep learning for drug discovery

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

In silico pharmacology for drug discovery

arxiv: v1 [q-bio.qm] 31 Mar 2017

Introduction to Chemoinformatics and Drug Discovery

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

Receptor Based Drug Design (1)

CAP 5510 Lecture 3 Protein Structures

Early Stages of Drug Discovery in the Pharmaceutical Industry

FRAUNHOFER IME SCREENINGPORT

Persistent homology analysis of protein structure, flexibility and folding

Artificial Intelligence Methods for Molecular Property Prediction. Evan N. Feinberg Pande Lab

Structure-Activity Modeling - QSAR. Uwe Koch

Dr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre

Cheminformatics platform for drug discovery application

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Mitosis Detection in Breast Cancer Histology Images with Multi Column Deep Neural Networks

Structural biology and drug design: An overview

Molecular Dynamics Graphical Visualization 3-D QSAR Pharmacophore QSAR, COMBINE, Scoring Functions, Homology Modeling,..

Quantitative Structure-Activity Relationship (QSAR) computational-drug-design.html

GC and CELPP: Workflows and Insights

Neural Network Potentials

De Novo molecular design with Deep Reinforcement Learning

Drug binding and subtype selectivity in G-protein-coupled receptors

Integrating Globus into a Science Gateway for Cryo-EM

Plan. Day 2: Exercise on MHC molecules.

Neuronal structure detection using Persistent Homology

Virtual affinity fingerprints in drug discovery: The Drug Profile Matching method

Using Phase for Pharmacophore Modelling. 5th European Life Science Bootcamp March, 2017

bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012

DATA ANALYTICS IN NANOMATERIALS DISCOVERY

BIRKBECK COLLEGE (University of London)

Introducing a Bioinformatics Similarity Search Solution

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

BUDE. A General Purpose Molecular Docking Program Using OpenCL. Richard B Sessions

Persistent homology analysis of protein structure, flexibility, and folding

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

MSc Drug Design. Module Structure: (15 credits each) Lectures and Tutorials Assessment: 50% coursework, 50% unseen examination.

Computational chemical biology to address non-traditional drug targets. John Karanicolas

Predicting Protein Functions and Domain Interactions from Protein Interactions

CHEM 4170 Problem Set #1

The Quantum Landscape

Principles of Drug Design

Principles of Drug Design

ESPRESSO (Extremely Speedy PRE-Screening method with Segmented compounds) 1

An Integrated Approach to in-silico

Cheminformatics Role in Pharmaceutical Industry. Randal Chen Ph.D. Abbott Laboratories Aug. 23, 2004 ACS

Radial Basis Function Neural Networks in Protein Sequence Classification ABSTRACT

Modeling Biomolecular Systems II. BME 540 David Sept

Computational Biology From The Perspective Of A Physical Scientist

COMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA

arxiv: v1 [q-bio.qm] 31 Mar 2017

CH MEDICINAL CHEMISTRY

Protein structure alignments

Structure-Based Drug Discovery An Overview

STRUCTURAL BIOINFORMATICS. Barry Grant University of Michigan

ALL LECTURES IN SB Introduction

Retrieving hits through in silico screening and expert assessment M. N. Drwal a,b and R. Griffith a

Structure to Function. Molecular Bioinformatics, X3, 2006

Machine learning for ligand-based virtual screening and chemogenomics!

Computational Molecular Biology (

Asaf Bar Zvi Adi Hayat. Semantic Segmentation

Introduction to Chemoinformatics

TRAINING REAXYS MEDICINAL CHEMISTRY

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

SUPPLEMENTARY MATERIALS

Learning from persistence diagrams

Bridging the Dimensions:

Introduction to Machine Learning Midterm Exam

Supplementary Information. Broad Spectrum Anti-Influenza Agents by Inhibiting Self- Association of Matrix Protein 1

Tutorial 1 Geometry, Topology, and Biology Patrice Koehl and Joel Hass

Notes of Dr. Anil Mishra at 1

On drug transport after intravenous administration

Deep Learning for Gravitational Wave Analysis Results with LIGO Data

DISCRETE TUTORIAL. Agustí Emperador. Institute for Research in Biomedicine, Barcelona APPLICATION OF DISCRETE TO FLEXIBLE PROTEIN-PROTEIN DOCKING:

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like

Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME

Introduction to Protein Structures - Molecular Graphics Tool

K-means-based Feature Learning for Protein Sequence Classification

Research Statement. 1. Electrostatics Computation and Biomolecular Simulation. Weihua Geng Department of Mathematics, University of Michigan

Thermodynamic Integration with Enhanced Sampling (TIES)

Deep Convolutional Neural Networks for Pairwise Causality

Bioengineering & Bioinformatics Summer Institute, Dept. Computational Biology, University of Pittsburgh, PGH, PA

PREDICTION OF HETERODIMERIC PROTEIN COMPLEXES FROM PROTEIN-PROTEIN INTERACTION NETWORKS USING DEEP LEARNING

Generalization to a zero-data task: an empirical study

Few-shot learning with KRR

Anticipating Visual Representations from Unlabeled Data. Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

TUTORIAL PART 1 Unsupervised Learning

Bio-Medical Text Mining with Machine Learning

Development of a Structure Generator to Explore Target Areas on Chemical Space

Cross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic

Contents. xiii. Preface v

A COMPARATIVE STUDY OF MACHINE-LEARNING-BASED SCORING FUNCTIONS IN PREDICTING PROTEIN-LIGAND BINDING AFFINITY. Hossam Mohamed Farg Ashtawy A THESIS

Computational Biology 1

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

Development of Pharmacophore Model for Indeno[1,2-b]indoles as Human Protein Kinase CK2 Inhibitors and Database Mining

Ligand Scout Tutorials

Supplementary Methods

Transcription:

Topology based deep learning for biomolecular data Guowei Wei Departments of Mathematics Michigan State University http://www.math.msu.edu/~wei American Institute of Mathematics July 23-28, 2017 Grant support: NSF, NIH and Michigan State University

Welcome to big-data era CPU GPU TPU Half of all jobs will be done by robots in the near future

Yearly Growth of Total Structures in the Protein Data Bank Biological sciences are undergoing a historic transition: From qualitative, phenomenological, and descriptive to quantitative, analytical and predictive, as quantum physics did a century ago

Machine learning for drug design and discovery 1) Disease identification 2) Target hypothesis 3) Virtual screening 4) Drug structural optimization in the target binding site 5) Preclinical in vitro and in vivo test 6) Clinical test 7) Optimize drug s efficacy, toxicity, pharmacokinetics, and pharmacodynamics properties (quantitative systems pharmacology) Influenza -- flu virus M2 channel Amantadine M2-A complex

Deep learning Fukushima (1980) Neo-Cognitron; LeCun (1998) Convolutional Neural Networks (CNN);

How to do deep learning for 3D biomolecular data? Obstacles for deep learning of 3D biomolecules: Molecules are too small to be visible. Geometric models for visualization hold partial truth and are non-unique. Geometric dimensionality: R 3N, where N~5500 for a protein. Machine learning dimensionality: > m1024 3, where m is the number of atom types in a protein. Complexity: Atom types, nonstationary & non-markovian. Solution: Topological simplification Dimensionality reduction

Topological fingerprints of an alpha helix Short bars are NOT noise! (Xia & Wei, IJNMBE, 2014)

Topological fingerprints of beta barrel Protein:2GR8 (Xia & Wei, IJNMBE, 2014)

DNA topological fingerprints DNA: 2MJK Personalized topological genome library

Topological signature of Gaussian noise SNR=0.1 SNR=1.0 SNR=10 SNR=100 Histogram plots of the barcodes show the Gaussian distribution of the noise

Persistent homology for ill-posed inverse problems Original data: microtubule Fitted with onetype of tubulins Fitted with twotypes of tubulins PCC=0.96 PCC=0.96 (Xia, Wei, 2014)

Objective oriented persistent homology G = ò g area [ ]dr, (Wang & Wei, JCP, 2016) Objective: Minimal surface energy ( ) area where gamma is the surface tension, and S is a surface characteristic function: S S=0 S=1 Generalized Laplace-Beltrami flow S t S S S

Objective oriented persistent homology Level sets generated from Laplace-Beltrami flow S t S S S (Wang & Wei, JCP, 2016)

Resolution Multiresolution induced multiscale of a fractal Introducing the resolution: (Xia, Zhao & Wei, JCB, 2015) r ( r, ) exp 2 j r j 2 Betti-0 Betti-1

Resolution Resolution Multiresolution induced multidimensional topology of a fractal r ( r, ) exp 2 j r j 2 Barcode Histogram (Rank) log10(n) Betti-1 Density (Xia, Zhao & Wei, JCB, 2015) Density

Betti-0 Betti-2 High dimensional persistence generated from multiparameter filtration --- Each independent parameter leads to a genuine dimension! Resolution Density (Xia & Wei, JCC, 2015) 2YGD N k k j j c r r r r 1 2 2 ), ( exp ), ( N k k z x j z j x j j z x c r z z y y x x r 1 2 2 2 2 2 ),, ( exp ),, ( 2-dimensional: 3-dimensional:

Multiscale topological persistence of a virus Virus 1DYL has 12 pentagons and 30 hexagons with icosahedral symmetry. The diameter is about 700 Angstrom. Scale=12A Scale=8A Scale=4A Scale=2A

Topological analysis of protein folding ID: 1I2T EBond L j 0 j ETotal 1/ L j 1 j Quantitative! (Xia, Wei, IJNMBE, 2014)

2D persistence in protein 1UBQ unfolding Radius log10(n) 0 1 2 Time (Xia & Wei, JCC, 2015)

Component Multicomponent and multichannel persistent homology for a protein-drug complex Radius Components are generated from element specific persistent homology. Eight channels are constructed from births, deaths and persistences at Betti-0, Betti-1 and Betti-2. (Cang & Wei, IJNMBE, 2017)

Topology based learning architecture (Cang & Wei, Bioinfomatics, 2017) Proteinligand complex Element specific groups Correlation matrix topological fingerprints Training and prediction Known labels Training data Learning algorithm Trained model Feature vector Query Prediction

Topological fingerprint based machine learning method for the classification of 2400 proteins Influenza A virus drug inhibition: 96% Accuracy Protein domains: 85% Accuracy (Alzheimer s disease) (Cang et al, MBMB, 2015) Hemoglobins in their relaxed and taut forms: 80% accuracy 55 classification tasks of protein superfamilies over 1357 proteins from Protein Classification Benchmark Collection: 82% accuracy

Topology based convolutional deep Learning Convolution (128x200) Pooling (128x100) Flattening (1xN) Multichannel images (54x200) Prediction Original Complex Classify atoms into element specific groups Generate topological fingerprints Convolutional deep learning neural network (Cang & Wei, Plos CB, 2017)

Blind binding affinity prediction of PDBBind v2013 core set of 195 protein-ligand complexes

Topology based binding affinity prediction of PDBBind v2007 core set of 195 protein-ligand complexes With C-C Betti-1 and Betti-2 With C-C Betti-0 Without C-C Betti-0

Directory of Useful Decoy (DUD) Classification of 98266 compounds containing 95316 decoys and 2950 active ligands binding to 40 targets from six families

D3R Grand Challenge 2 Given: Farnesoid X receptor (FXR) and 102 ligands Tasks: Dock 102 ligands to FXR, and compute their poses, binding free energies and energy ranking

Topological Multi-Task Deep Learning Globular protein mutation impacts Membrane protein mutation impacts Convolution and pooling Task specific representation Topological feature extraction Multi-task topological deep learning

Blind prediction of mutation energies Prediction correlations for 2648 mutations on global proteins Prediction correlations for 223 mutations on membrane proteins (Cang & Wei, Bioinformatics, 2017)

Comparison of the predictabilities of mutation predictors (MPs) constructed by Topology (T-MP), Electrostatics (E-MP), Geometry (G-MP), Sequence (S-MP) and High-level feature (H-MP) (Cang & Wei, Bioinformatics, 2017)

Prediction of partition coefficients: Star Set (223 molecules) Wu, Wang, Zhao, Wang, Wei, 2016

Prediction of drug solubility Wu, Wang, Zhao, Wang, Wei, 2016

Mutagenicity test sets Cao, Cang, Wu, Wei, 2017

Concluding remarks Multicomponent persistent homology, multidimensional persistent homology, element specific persistent homology, object orientated persistent homology are proposed to retain essential chemical and biological information during the topological simplification of biomolecular geometric complexity. The abovementioned approaches are integrated with advanced machine learning and deep learning algorithms to achieve the state-of-the-art predictions of protein-ligand binding affinities, mutation induced protein stability changes, drug toxicity, solubility and partition coefficients.

N Baker (PNNL) P Bates (MSU) Z Burton (MSU) K Dong (MSU) M Feig (MSU) H Hong (MSU) J Hu (MSU) J Wang Y Tong (NSF) (MSU) X Ye (UKLR)