An overview of deep learning methods for genomics

Matthew Ploenzke, STAT115/215/BIO/BIST282, Harvard University, April 19, 2018

Snapshot
1. Brief introduction to convolutional neural networks: What is deep learning? How are these techniques related to the methods we've learned in class?
2. Applications to genomics: What are the advantages/disadvantages of these models for this application? How do we interpret what our model has learned?
3. Extensions in genomics: What else are these models used for?

Introduction
Consider observations $X_i \in \mathbb{R}^P$ for $i = 1, \ldots, N$ with corresponding labels $Y_i \in \{0, 1\}$. Taken together, $X$ is an $N \times P$ matrix and $Y$ is an $N \times 1$ vector. A simple linear model might try to fit weights $\omega_j$ for $j \in 1, \ldots, P$ to minimize a provided loss function (e.g. minimize empirical risk). For example, under squared loss:
$$\min_{\omega \in \mathbb{R}^P} R_\omega(X) = \min_{\omega \in \mathbb{R}^P} \left( (X\omega - Y)^2 \right)$$
Alternatively, back in module 2 we learned about support vector machines (SVMs) for training a classifier (hinge loss):
$$R_\omega(X) = \max(0, 1 - Y X\omega)$$

Introduction: Logistic regression recap
Under squared loss, the model fit is a simple linear model $Y = X\omega$. While an extremely powerful formulation, it is limited to modeling linear relationships. Further, in our case ($Y \in \{0, 1\}$) we want to restrict the range of $Y$. Consider a transformation $g(\cdot)$ of the linear model which constrains the predictions to lie in $(0, 1)$.¹ For logistic regression, the transformation is:
$$g_\omega(X) = \frac{1}{1 + e^{-X\omega}} = P(Y = 1 \mid X, \omega)$$
What is $g(\cdot)$ under squared loss? Can you think of any other possible $g(\cdot)$?
¹ Statistics refers to these as link functions. In deep learning, they are referred to as activation functions.
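
To make the transformation concrete, here is a minimal numpy sketch (not from the slides) of applying the logistic activation to a linear predictor; the toy dimensions, random design matrix, and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 5, 3                      # toy dimensions: 5 observations, 3 features
X = rng.standard_normal((N, P))  # design matrix
omega = rng.standard_normal(P)   # weight vector

linear = X @ omega               # unbounded linear predictor, shape (N,)

def g(z):
    """Logistic (sigmoid) link: squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

probs = g(linear)                # interpreted as P(Y = 1 | X, omega)
print(probs)                     # every entry lies strictly between 0 and 1
```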

Introduction: Going deep
Let's do a quick recap. We're modeling $Y$ with a nonlinear function:
$$P(Y = 1 \mid X, \omega) = g_\omega(X)$$
The $\omega$ vector ($P \times 1$) of model weights is used to transform the $N \times P$ design matrix $X$ into an $N \times 1$ vector of predictions, call it $A$. But what if $\omega$ is a $P \times K$ matrix instead? Then $A$ is an $N \times K$ matrix, and we could repeat the procedure from before:
$$P(Y = 1 \mid X, \omega) = g_{\omega_2}(A) = g_{\omega_2}\big(g_{\omega_1}(X)\big) = G_\omega(X)$$
$G_\omega(\cdot)$ is a composition of nonlinear functions, and there's no need to stop at just two...
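
A minimal sketch of this two-layer composition, assuming a sigmoid for both $g_{\omega_1}$ and $g_{\omega_2}$ and an arbitrary hidden width $K$; all values below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 5, 3, 4                      # toy sizes; K hidden units is an arbitrary choice

X = rng.standard_normal((N, P))
omega1 = rng.standard_normal((P, K))   # first-layer weights: P x K
omega2 = rng.standard_normal((K, 1))   # second-layer weights: K x 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = sigmoid(X @ omega1)                # g_{omega_1}(X): N x K hidden representation
G = sigmoid(A @ omega2)                # g_{omega_2}(A): N x 1 vector of predictions

print(G.ravel())                       # the composition G_omega(X) = g2(g1(X))
```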

Introduction: Efficient training
As shown earlier, under least squares the $\omega$ are found by minimizing:
$$\min_{\omega \in \mathbb{R}^P} R_\omega(X) = \min_{\omega \in \mathbb{R}^P} \left( (X\omega - Y)^2 \right)$$
One may show that the $\omega$ admit a closed-form solution, $\omega = (X^T X)^{-1} X^T Y$. However, there is no such closed-form solution once the logistic link function has been applied. So how do we fit the $\omega$? In generalized linear models (GLMs) such as the above, the loss (risk) functions $R_\omega$ correspond to negative log-likelihoods. The $\omega$ are then obtained via an iterative procedure such as Newton-Raphson.
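
For comparison, a short numpy sketch of the closed-form least-squares solution on simulated data; the true weights and noise level are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.standard_normal((N, P))
omega_true = np.array([1.5, -2.0, 0.5])
Y = X @ omega_true + 0.1 * rng.standard_normal(N)   # noisy linear responses

# Closed-form least-squares solution: omega = (X^T X)^{-1} X^T Y.
omega_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(omega_hat)   # close to omega_true; no such formula exists after the logistic link
```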

Introduction: Efficient training
The same is true for deep models, except now the risk functions no longer correspond to negative log-likelihoods. The $G_\omega(\cdot)$ are very complicated functions involving compositions of many layers. As long as the risk is differentiable, however, the chain rule may be used to calculate the derivatives of the risk function with respect to each model weight. This procedure is termed back-propagation, and it is the engine behind fitting the thousands (or millions) of model weights. In addition, slightly different optimization techniques and heuristics are used to improve the fitting procedure.
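
As a hedged illustration of the idea (not the exact procedure used to fit the deep models discussed later), here is plain gradient descent for logistic regression, where the chain rule yields the familiar gradient $X^T(p - Y)$ averaged over observations; the simulated data, learning rate, and number of steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 3
X = rng.standard_normal((N, P))
omega_true = np.array([2.0, -1.0, 0.5])
Y = (1 / (1 + np.exp(-(X @ omega_true))) > rng.uniform(size=N)).astype(float)

omega = np.zeros(P)
lr = 0.1
for step in range(500):
    p = 1 / (1 + np.exp(-(X @ omega)))       # forward pass: predicted probabilities
    # Chain rule: d(negative log-likelihood)/d(omega) = X^T (p - Y) / N
    grad = X.T @ (p - Y) / N
    omega -= lr * grad                       # gradient-descent update

print(omega)   # approaches omega_true as the risk is iteratively minimized
```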

Introduction: Spatial features
One such heuristic is the use of convolutional filters. We'll consider the feed-forward model pictured below²:
[Figure: a CNN over mini-batches of one-hot encoded input sequences; two convolutional layers (64 and 128 filters), each followed by ReLU and max pooling, feed fully-connected layers and a softmax output layer giving P(Y=1|X) and P(Y=0|X), trained with a cross-entropy loss against the labels.]
Up to this point we've only discussed the fully-connected layers, but the convolutional (early) layers are really no different. The key difference is that instead of performing a full outer-product matrix multiplication, the filter is performing an inner-product multiplication at each position along the input.
² Feed-forward in the sense that the entire observation is fed forward through the series of layers.
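
The sketch below contrasts the two operations on a toy one-hot encoded sequence: a fully-connected layer takes one inner product against the entire flattened input, while a convolutional filter takes an inner product with each window as it slides along the sequence. The sequence, filter length, and random weights are made up for illustration.

```python
import numpy as np

# A one-hot encoded DNA sequence: rows index position, columns index A, C, G, T.
seq = "ACGTGCA"
lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
X = np.zeros((len(seq), 4))
for pos, base in enumerate(seq):
    X[pos, lookup[base]] = 1.0

rng = np.random.default_rng(0)

# A fully-connected layer multiplies the whole flattened input by a weight matrix:
W_fc = rng.standard_normal((X.size, 1))
fc_out = X.reshape(1, -1) @ W_fc             # one inner product against the entire input

# A convolutional filter of length 3 instead slides along the sequence,
# taking an inner product with each length-3 window (shared weights).
filt = rng.standard_normal((3, 4))
conv_out = np.array([np.sum(X[j:j + 3] * filt) for j in range(len(seq) - 3 + 1)])

print(fc_out.shape, conv_out.shape)           # (1, 1) vs (5,): one value per window
```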

Introduction: Other techniques
[Figure: the same CNN architecture as the previous slide, shown again: one-hot encoded input sequences, two convolutional layers (64 and 128 filters) with ReLU and max pooling, fully-connected layers, and a softmax output trained with cross-entropy loss.]
- ReLUs/TanHs
- Pooling
- Dropout
- Batch normalization
- Residual networks

Introduction: Sequential features
Alternatively, consider a model that incorporates temporality through the addition of the next nucleotide in the sequence³.
[Figure: a bi-directional recurrent network over a one-hot encoded input sequence (A C T G C C A), with hidden states at each position feeding a predicted state P(Y=1|X).]
³ The recurrent neural network will not be discussed in this material.

Problem Formulation: Overview
Consider a set of genomic reads $S$ in which half of the set contains some motif ($Y = 1$) and half of the set does not contain the given motif ($Y = 0$). The motif in our example will be the TAL1 motif:
[Figure: sequence logo of the TAL1 motif.]
The goal is to train a binary classifier on the genomic sequences, $S_i$, and to understand to what extent the classifier has learned the inserted motif. Data such as these could come from peak sequences called from a ChIP-seq experiment.

Problem Formulation: Notation
Let $S$ represent the collection of nucleotide sequences of length $L$ for $N$ observations indexed by $i$ and composed of nucleotides $n \in \{A, C, G, T\}$, with corresponding labels $Y$. We wish to learn a function $G(\cdot)$ mapping $S \to Y$ through risk minimization. Define the empirical risk as:
$$R_G = -Y \left[ \log\left( \frac{1}{1 + e^{-G(S)}} \right) \right] - (1 - Y) \left[ \log\left( \frac{e^{-G(S)}}{1 + e^{-G(S)}} \right) \right]$$
As long as $R_G$ is differentiable we may use the chain rule (backpropagation) to calculate the derivative and perform gradient descent to update the parameter values $\omega$, in turn minimizing the empirical risk.
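
A small sketch of this empirical risk written directly in terms of the logit $G(S)$; the logits and labels below are toy values chosen for illustration.

```python
import numpy as np

def empirical_risk(logits, y):
    """Binary cross-entropy as on the slide, with G(S) playing the role of the logit."""
    p = 1.0 / (1.0 + np.exp(-logits))                   # P(Y = 1 | S)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Toy values: G(S) for four sequences and their labels.
logits = np.array([2.3, -1.1, 0.4, -3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(empirical_risk(logits, y))   # small because the predictions agree with the labels
```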

Problem Formulation: Notation
We consider $G(\cdot)$ to be composed of $M$ compositional functions (layers) such that:
$$G(S_i) = g_M\big(g_{M-1}(\cdots(g_2(g_1(S_i))))\big)$$
and require $g_1(S_i)$ to be a convolutional layer such that, for convolutional filter $f$ of length $l_f$ at sequence position index $J$:
$$g_1^f(S_{i,J}) = \sum_{j=J}^{J+l_f} \; \sum_{n \in \{A,C,G,T\}} \omega^f_{1,j,n} \, \mathbb{1}_{S_{i,j} = n}$$
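
A literal numpy implementation of this first-layer convolution, with the indicator $\mathbb{1}_{S_{i,j}=n}$ realized through one-hot encoding; the example sequence, filter length, and random weights are assumptions made for illustration.

```python
import numpy as np

NUCS = "ACGT"

def one_hot(seq):
    """L x 4 indicator matrix: entry (j, n) is 1 exactly when S_{i,j} = n."""
    X = np.zeros((len(seq), 4))
    for j, base in enumerate(seq):
        X[j, NUCS.index(base)] = 1.0
    return X

def conv_filter_output(seq, w):
    """First-layer output g_1^f(S_{i,J}) for every valid start position J.

    w is an l_f x 4 weight matrix, i.e. omega^f_{1,j,n} over relative
    positions j and nucleotides n.
    """
    X = one_hot(seq)
    l_f = w.shape[0]
    return np.array([np.sum(X[J:J + l_f] * w)             # double sum over j and n
                     for J in range(len(seq) - l_f + 1)])

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))           # a single filter f of length l_f = 4
print(conv_filter_output("ACGTTAGC", w))  # one activation per position J
```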

Problem Formulation: Visualization
For example, our CNN may look like this:
[Figure: the CNN architecture from the earlier slides: one-hot encoded input sequences, two convolutional layers (64 and 128 filters) with ReLU and max pooling, fully-connected layers, and a softmax output trained with cross-entropy loss.]
We then train our model for several epochs and retain the model weights from the iteration with the lowest test-set error.

Model Interpretation: Overview
Understanding model rationale is an active field of research. What has my black box learned? A first and easy distinction to make is between:
1. encouraging interpretable learning while training a model (L1/L2 regularization, interpretable CNNs [12], etc.)
2. interpreting learned knowledge with a trained model

Model Interpretation: Importance Scores
Given a trained model, model interpretation may be performed by computing importance scores. How important is nucleotide $n$ in contributing to the final model prediction? There are two methodological approaches for computing such scores, or rather, visualizing learning:
1. Forward-, or perturbation-based [13, 2, 6]
2. Backward-, or backpropagation-based [4, 11, 9, 3]⁴
⁴ Happy to discuss these approaches if we have time.

Model Interpretation: Forward-based
Forward-based approaches are quite simple:
1. For a given observation, obtain a predicted value
2. Modify the value of a single feature (e.g. nucleotide A → C)
3. Obtain a new prediction
4. Calculate the difference, either at the network level or the node level
Figure 1: Zhou, Jian, and Olga G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods 12.10 (2015): 931-934.
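
A rough sketch of this perturbation loop; `predict` below is only a stand-in (a toy scoring function) for the forward pass of a trained network, and the example sequence is made up.

```python
import numpy as np

NUCS = "ACGT"

def predict(seq):
    """Stand-in for a trained model's P(Y = 1 | sequence).

    Here it is just a toy score that rewards the substring 'TAG'; in practice
    this would be the forward pass of the fitted network.
    """
    return 1.0 / (1.0 + np.exp(-(3.0 * seq.count("TAG") - 1.0)))

def perturbation_scores(seq):
    """Change in prediction from mutating each position to every other nucleotide."""
    base_pred = predict(seq)
    scores = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for k, n in enumerate(NUCS):
            if seq[pos] == n:
                continue                                   # no perturbation needed
            mutated = seq[:pos] + n + seq[pos + 1:]
            scores[pos, k] = predict(mutated) - base_pred  # forward-based importance
    return scores

print(perturbation_scores("ACTAGCA").round(3))
```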

Model Interpretation: Forward-based
Figure 4A: Alipanahi, Babak, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33.8 (2015): 831-838.

Model Interpretation: Learning Motifs
Alternatively, given the trained network, which observations maximize network activations (either individual network nodes or the final network output)? What sequence(s) has the network learned to recognize?
1. Pass test observations through the first convolutional layer
2. Per filter, zero out low values below a threshold (noise)
3. Extract motif-length sequences around the non-zero activations
4. Use the sequences to compute a position-weight matrix (PWM)
Figure 3B: Kelley, David R., Jasper Snoek, and John L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26.7 (2016): 990-999.
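
A hedged sketch of steps 1-4 above, reusing the toy one-hot convolution from earlier; the activation threshold, pseudocount, example sequences, and random filter are arbitrary choices, and in practice the activations would come from a trained network's first layer.

```python
import numpy as np

NUCS = "ACGT"

def one_hot(seq):
    X = np.zeros((len(seq), 4))
    for j, base in enumerate(seq):
        X[j, NUCS.index(base)] = 1.0
    return X

def filter_activations(seq, w):
    X, l_f = one_hot(seq), w.shape[0]
    return np.array([np.sum(X[J:J + l_f] * w) for J in range(len(seq) - l_f + 1)])

def pwm_from_filter(seqs, w, threshold=0.0):
    """Steps 2-4: drop sub-threshold activations, collect the aligned
    motif-length subsequences, and turn their counts into a PWM."""
    l_f = w.shape[0]
    counts = np.full((l_f, 4), 1e-3)                   # tiny pseudocount
    for seq in seqs:
        acts = filter_activations(seq, w)
        for J in np.where(acts > threshold)[0]:        # keep only strong activations
            counts += one_hot(seq[J:J + l_f])
    return counts / counts.sum(axis=1, keepdims=True)  # relative frequencies f_{j,n}

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
test_seqs = ["ACGTTAGC", "TTAGCACG", "GGTAGCAT"]       # toy "test set"
print(pwm_from_filter(test_seqs, w).round(2))
```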

Model Interpretation: Learning Motifs
For motif $m$ of length $l$, define the information (height) at position $j \in \{1, \ldots, l\}$ as:
$$R_j = \log_2 4 - H_j$$
with $H_j = \sum_{n \in \{A,C,G,T\}} H_{j,n}$ defined as the total entropy at position $j$ over nucleotides $n \in \{A, C, G, T\}$. Write the entropy at position $j$ for nucleotide $n$ as:
$$H_{j,n} = -f_{j,n} \log_2 f_{j,n}$$
for the relative frequency $f_{j,n}$ of nucleotide $n$ at position $j$; $f_{j,n}$ is calculated from the sequences surrounding the non-zero activations.
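
The corresponding computation on a toy PWM; the small epsilon guarding against $\log_2 0$ is an implementation detail not on the slide.

```python
import numpy as np

def information_content(pwm, eps=1e-9):
    """R_j = log2(4) - H_j, with H_j = -sum_n f_{j,n} log2 f_{j,n}."""
    H = -np.sum(pwm * np.log2(pwm + eps), axis=1)   # per-position entropy H_j
    return np.log2(4) - H                           # height of position j in the logo

# A toy PWM for a length-4 motif (rows are positions, columns are A, C, G, T).
pwm = np.array([[0.97, 0.01, 0.01, 0.01],   # nearly deterministic: ~2 bits
                [0.25, 0.25, 0.25, 0.25],   # uniform: ~0 bits
                [0.40, 0.10, 0.40, 0.10],
                [0.10, 0.70, 0.10, 0.10]])
print(information_content(pwm).round(2))
```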

Model Interpretation: Learning Motifs
Supplementary Figure 4: Kelley, David R., Jasper Snoek, and John L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26.7 (2016): 990-999.

Summary
- The change in model predictions may be used to assess sensitivity to a given nucleotide in the sequence.
- Convolutional filters learn motifs.
- These analyses/interpretations are largely visualization based (there is no accompanying statistical framework); however, they recover past biological findings.
- Models are typically trained on ChIP-seq/protein-binding data, although any sequence data could in theory work. It is important to consider the problem at hand.
- Models show higher accuracy and sensitivity than techniques such as k-mer SVMs, although they require more observations, hyper-parameter tuning, more difficult interpretation, etc.

Summary
We saw figures from the first three papers [13, 2, 6]; however, there is a ton of development in this field. Follow-up work includes improved architectures [8, 1, 10], improved interpretations [4, 11, 9, 3, 7], and diverse applications, to name a few. We also focused exclusively on genomics and didn't even touch the primary applications of image analysis and text/speech recognition!

Extensions
- Epigenetics (protein-binding, cell-type specific, etc.)
- Alternative splicing
- Model interpretability
- Population genetics
- Single-cell RNA-seq
- GANs [5]

References I
[1] Amr Mohamed Alexandari, Avanti Shrikumar, and Anshul Kundaje. Separable Fully Connected Layers Improve Deep Learning Models For Genomics. In: bioRxiv (2017), p. 146431.
[2] Babak Alipanahi et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. In: Nature Biotechnology 33.8 (2015), pp. 831-838.
[3] Marco Ancona et al. A unified view of gradient-based attribution methods for Deep Neural Networks. In: arXiv preprint arXiv:1711.06104 (2017).
[4] Alexander Binder et al. Layer-wise relevance propagation for neural networks with local renormalization layers. In: International Conference on Artificial Neural Networks. Springer, 2016, pp. 63-71.

References II
[5] Ian Goodfellow et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems. 2014, pp. 2672-2680.
[6] David R. Kelley, Jasper Snoek, and John L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. In: Genome Research 26.7 (2016), pp. 990-999.
[7] Jack Lanchantin et al. Deep motif: Visualizing genomic sequence classifications. In: arXiv preprint arXiv:1605.01133 (2016).
[8] Daniel Quang and Xiaohui Xie. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. In: Nucleic Acids Research 44.11 (2016), e107.

References III
[9] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning Important Features Through Propagating Activation Differences. In: CoRR abs/1704.02685 (2017). arXiv: 1704.02685. URL: http://arxiv.org/abs/1704.02685.
[10] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Reverse-complement parameter sharing improves deep learning models for genomics. In: bioRxiv (2017), p. 13663.
[11] Matthew D. Zeiler et al. Deconvolutional networks. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2528-2535.
[12] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable Convolutional Neural Networks. In: arXiv preprint arXiv:1710.00935 (2017).

References IV
[13] Jian Zhou and Olga G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. In: Nature Methods 12.10 (2015), pp. 931-934.