
In silico generation of novel, drug-like chemical matter using the LSTM deep neural network
Peter Ertl
Novartis Institutes for BioMedical Research, Basel, CH
September 2018

Neural networks in cheminformatics
[slide images: "Treasure Island" visualization by P. Selzer and P. Ertl, 2001; analysis of the Novartis archive by Kohonen NNs, published in 1993]
References: J. Chem. Inf. Model. 46, 2319 (2006); Drug Disc. Today (2005)

Neural networks 2.0
A dramatic increase in computational power, much larger datasets and, in particular, novel network architectures available as open source have caused a quantum leap in NN applications. The LSTM (long short-term memory) recurrent neural network is one of these novel network types; it powers Google's speech and image recognition, Apple's Siri and Amazon's Alexa. Very simplistically expressed, an LSTM can learn the inner structure or grammar of properly encoded objects and then answer questions about these objects, or generate objects similar to those in the training set. A disadvantage of LSTM networks is that they require very large training sets and long training times.

Major challenge: the proper network architecture
Designing a network with the correct architecture is not easy; there are no general rules, so one has to experiment and try different architectures and parameters. Numerous network parameters need to be set up:
- number, types and sizes of network layers
- learning rate
- drop-out rate
- loss and activation functions
The combination of these parameters makes the number of possible network architectures practically unlimited. It took more than 3 weeks of heavy computational experiments to find a properly working network architecture for designing new molecules.

The LSTM architecture used
After several trials, the architecture shown on the slide was chosen. The following parameters were used: dropout rate 0.2, the RMSprop optimizer with learning rate 0.01, and categorical crossentropy as the loss criterion during optimization. The implementation was done using standard software: Python 3 + Keras + TensorFlow.
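
The slide's architecture diagram is not reproduced in this transcription, so the following Keras sketch is only a minimal illustration of such a character-level SMILES model: the two-layer stack, the 256 units per layer and the vocabulary size of 50 are assumptions, while the dropout rate, optimizer, learning rate and loss function are the ones given on the slide.

```python
# Minimal sketch of a character-level SMILES LSTM in Keras.
# Layer count, layer sizes and vocabulary size are illustrative assumptions;
# dropout 0.2, RMSprop with learning rate 0.01 and categorical crossentropy
# follow the parameters given on the slide.
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop

WINDOW = 40   # characters of context (see the SMILES slide below)
VOCAB = 50    # assumed size of the SMILES character alphabet

model = Sequential([
    LSTM(256, input_shape=(WINDOW, VOCAB), return_sequences=True),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.2),
    Dense(VOCAB, activation="softmax"),   # probability of the next character
])
model.compile(optimizer=RMSprop(learning_rate=0.01),
              loss="categorical_crossentropy")
model.summary()
```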

SMILES: Simplified Molecular-Input Line-Entry System
Example (aspirin): CC(=O)Oc1ccccc1C(O)=O
- uppercase letter: aliphatic atom
- lowercase letter: aromatic atom
- "=": double bond
- parentheses: branching
- pair of numbers: ring closure
The network is trained on the 40 previous characters; for the example above, the next output should be O.
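
As a concrete illustration of this encoding, the sketch below turns SMILES strings into one-hot training windows of 40 characters with the following character as the target; the padding character and the newline end-of-SMILES marker are assumptions, not taken from the slide.

```python
# Sketch of the sliding-window encoding implied by the slide: the input is
# a one-hot matrix of the 40 previous characters, the target is the next one.
import numpy as np

WINDOW = 40
PAD = " "                                              # assumed padding character
smiles = ["CC(=O)Oc1ccccc1C(O)=O"]                     # aspirin, the slide's example
alphabet = sorted(set("".join(smiles)) | {PAD, "\n"})  # "\n" ends a SMILES
char_to_idx = {c: i for i, c in enumerate(alphabet)}

X, y = [], []
for s in smiles:
    s = PAD * WINDOW + s + "\n"          # left-pad so every character has context
    for i in range(WINDOW, len(s)):
        X.append([char_to_idx[c] for c in s[i - WINDOW:i]])
        y.append(char_to_idx[s[i]])

# one-hot encode: X has shape (samples, WINDOW, alphabet), y (samples, alphabet)
X_onehot = np.eye(len(alphabet))[np.array(X)]
y_onehot = np.eye(len(alphabet))[np.array(y)]
print(X_onehot.shape, y_onehot.shape)
```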

Training
- the network was trained on ~550,000 bioactive SMILES strings from ChEMBL
- the goal was to learn the grammar of SMILES encoding bioactive molecules and then to use the trained network to generate SMILES for novel molecules
- some level of randomness also had to be added (we do not want to reproduce the ChEMBL structures exactly, but to generate novel, ChEMBL-like molecules); see the sampling sketch below
- training took ~1 week on a single CPU, as parallelization is not yet supported in Keras (according to my first experiments, using GPUs would speed up training about 5-fold)
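
The slide does not say how the randomness was introduced; a common technique for character-level generators is temperature sampling, sketched below under that assumption (the temperature value is illustrative).

```python
# Temperature sampling: instead of always taking the most probable next
# character, sample from a re-weighted softmax output. Higher temperature
# gives more random, more novel strings; lower temperature stays closer
# to the training data.
import numpy as np

def sample_next_char(probs, temperature=0.8):
    """Sample an index from a predicted next-character distribution."""
    logits = np.log(np.clip(probs, 1e-10, 1.0)) / temperature
    p = np.exp(logits)
    p /= p.sum()                          # renormalize to a valid distribution
    return np.random.choice(len(p), p=p)
```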

Generation of novel molecules
- new SMILES were generated character by character, based on the learned structure of the ~550k ChEMBL SMILES
- invalid SMILES were also generated (a single wrong character makes a SMILES non-parsable): 54% of the generated SMILES could be discarded by a simple text check (not all brackets or rings paired), and an additional 14% were discarded after parsing (problems with aromaticity or incorrect valences); in the end, 32% of the generated SMILES led to correct molecules (this ratio could be increased by longer training, but we do not want to reproduce ChEMBL exactly); a filtering sketch follows below
- generating 1 million correct SMILES required less than 2 hours on 300 CPUs using rather crude code; optimization, a switch to GPUs and more processors would allow generating hundreds of millions, even billions, of novel molecules
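
A minimal sketch of such a two-stage filter, assuming RDKit as the parser (the slide does not name the toolkit used for the parsing step):

```python
# Two-stage validity filter: a cheap text check (paired brackets and
# ring-closure digits) rejects obvious garbage before the more expensive
# RDKit parse, which also catches aromaticity and valence problems.
from rdkit import Chem

def text_check(smi):
    if smi.count("(") != smi.count(")"):
        return False
    ring_digits = [c for c in smi if c.isdigit()]
    return all(ring_digits.count(d) % 2 == 0 for d in set(ring_digits))

def is_valid(smi):
    return text_check(smi) and Chem.MolFromSmiles(smi) is not None

generated = ["CC(=O)Oc1ccccc1C(O)=O", "CC(=O(Oc1cccc"]   # toy examples
print([s for s in generated if is_valid(s)])
```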

Novelty of generated structures
- out of 1 million generated molecules, only 2,774 were contained in the ChEMBL training set: the generated molecules are novel
- similarity to ChEMBL structures (standard RDKit similarity; distance to the closest neighbor) is medium
- the 1 million generated structures contain 627k unique scaffolds (the 550k ChEMBL structures contain 172k scaffolds), and the overlap between the two sets is only 18k scaffolds: the generated structures are diverse and contain many new chemotypes (a scaffold-extraction sketch follows below)
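
The slide does not state the scaffold definition; assuming Bemis-Murcko scaffolds as implemented in RDKit, the comparison could be sketched as:

```python
# Extract Bemis-Murcko scaffolds with RDKit and compare the two sets.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_smiles(smi):
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))

# toy stand-ins for the 1M generated and 550k ChEMBL sets
generated_scaffolds = {scaffold_smiles(s) for s in ["CC(=O)Oc1ccccc1C(O)=O"]}
chembl_scaffolds = {scaffold_smiles(s) for s in ["Oc1ccccc1"]}
print(len(generated_scaffolds & chembl_scaffolds), "shared scaffolds")
```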

Functionalities of novel structures
The distribution of functional groups (% in the new set vs. % in ChEMBL) is very similar in both sets. The generated structures are novel, but of the same type as bioactive ChEMBL molecules.
P. Ertl, An algorithm to identify functional groups in organic molecules, J. Cheminformatics 9:36 (2017)

Substructure analysis

Feature             ChEMBL   generated   baseline
no rings               0.4         0.4        0.1
1 ring                 2.8         4.3       13.2
2 rings               14.8        23.1       17.7
3 rings               32.2        43.5       27.3
4 rings               32.7        23.9       25.2
>4 rings              17.2         4.8       16.5
fused arom. rings     38.8        30.9        0.2
large rings (>8)       0.4         1.8       75.9
spiro rings            1.9         0.6        0.6
without N, O, S        0.0         0.2        2.6
contains N            96.5        96.1       92.3
contains O            93.0        92.0       85.5
contains S            35.6        27.9       39.6
contains halogen      40.7        38.8       49.4
(values in % of molecules)

Average molecular formula (X = halogen):
ChEMBL     C20.6 H22.1 N3.3 O2.8 S0.4 X0.8
generated  C18.7 H19.7 N3.0 O2.5 S0.3 X0.6
baseline   C25.1 H35.0 N3.8 O2.3 S0.5 X0.7

The substructure features of the generated molecules are practically identical to those of the ChEMBL structures; the baseline structures are very different and contain many macrocycles.
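
The slide does not say how these features were counted; a sketch of how a few of them could be computed per molecule with RDKit:

```python
# Compute a few of the table's substructure features for one molecule.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def features(smi):
    mol = Chem.MolFromSmiles(smi)
    symbols = {a.GetSymbol() for a in mol.GetAtoms()}
    return {
        "rings": rdMolDescriptors.CalcNumRings(mol),
        "spiro": rdMolDescriptors.CalcNumSpiroAtoms(mol) > 0,
        "large_ring": any(len(r) > 8 for r in mol.GetRingInfo().AtomRings()),
        "contains_N": "N" in symbols,
        "contains_O": "O" in symbols,
        "contains_S": "S" in symbols,
        "contains_halogen": bool(symbols & {"F", "Cl", "Br", "I"}),
    }

print(features("CC(=O)Oc1ccccc1C(O)=O"))
```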

Molecular properties
[slide contains only graphics]

Summary
- properly configured LSTM deep neural networks are able to learn the general structural features of bioactive molecules and then generate novel molecules of this type
- the novelty, molecular properties, distribution of functional groups and synthetic accessibility of the generated structures are very good
- the network is able to generate a practically unlimited stream (hundreds of millions, even billions) of novel molecules
- obvious applications of the generated drug-like molecules are:
  - virtual screening
  - identifying holes in the corporate chemical space
  - generation of specialized molecule sets (natural product-like, target X-like, ...)
- additional applications (possibly using other deep learning techniques) can be discussed

Acknowledgements
Niko Fechner, Brian Kelley, Richard Lewis, Eric Martin, Valery Polyakov, Stephan Reiling, Bernd Rohde, Gianluca Santarossa, Nadine Schneider, Ansgar Schuffenhauer, Lingling Shen, Finton Sirockin, Clayton Springer, Nik Stiefl, Wolfgang Zipfel

In silico generation of novel, drug-like chemical matter using the LSTM neural network
Peter Ertl, Richard Lewis, Eric Martin, Valery Polyakov
arXiv:1712.07449 (2017)

The Python code (BSD license) is available on request.