An overview of deep learning methods for genomics

Matthew Ploenzke
STAT115/215/BIO/BIST282, Harvard University
April 19, 2018

Snapshot
1. Brief introduction to convolutional neural networks
   What is deep learning?
   How are these techniques related to the methods we've learned in class?
2. Applications to genomics
   What are the advantages/disadvantages of these models for this application?
   How do we interpret what our model has learned?
3. Extensions in genomics
   What else are these models used for?

Introduction
Consider observations $X_i \in \mathbb{R}^P$ for $i = 1, \dots, N$ with corresponding labels $Y_i \in \{0, 1\}$. Taken together, $X$ is an $N \times P$ matrix and $Y$ is an $N \times 1$ vector.
A simple linear model might try to fit weights $\omega_j$, for $j \in 1, \dots, P$, to minimize a provided loss function (e.g. minimize empirical risk). For example, under squared-loss:
$$\min_{\omega \in \mathbb{R}^P} R_\omega(X) = \min_{\omega \in \mathbb{R}^P} (X\omega - Y)^2$$
Alternatively, back in module 2 we learned about support vector machines (SVMs) for training a classifier (hinge-loss, with the labels recoded to $\{-1, +1\}$ as is usual for SVMs):
$$R_\omega(X) = \max(0, 1 - Y X\omega)$$
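The two risk functions on this slide can be sketched in a few lines of plain Python. This is only an illustration: the helper names, the toy data, and the choice to average over observations are assumptions, and the hinge loss recodes the {0, 1} labels to {-1, +1}, which is the usual SVM convention.

```python
def linear_scores(X, w):
    """Return the linear scores X w for a list-of-rows X and a weight list w."""
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) for row in X]

def squared_loss(X, y, w):
    """Average squared difference between the linear scores and the labels."""
    scores = linear_scores(X, w)
    return sum((s - y_i) ** 2 for s, y_i in zip(scores, y)) / len(y)

def hinge_loss(X, y, w):
    """Average hinge loss; labels in {0, 1} are recoded to {-1, +1} first."""
    scores = linear_scores(X, w)
    signed = [2 * y_i - 1 for y_i in y]
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(signed, scores)) / len(y)

# Toy data (hypothetical): N = 3 observations, P = 2 features.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
w = [0.0, 1.0]

print("squared loss:", squared_loss(X, y, w))
print("hinge loss:  ", hinge_loss(X, y, w))
```

Only the third observation sits inside the margin here, so it is the only one that contributes to the hinge loss.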

Introduction: Logistic regression recap
Under squared-loss, the model fit is a simple linear model $Y = X\omega$. While an extremely powerful formulation, it is limited to modeling linear relationships. Further, in our case ($Y \in \{0, 1\}$) we want to restrict the range of $Y$.
Consider a transformation $g(\cdot)$ on the linear model which constrains the predictions to lie in $(0, 1)$ (see footnote 1). For logistic regression, the transformation is:
$$g_\omega(X) = \frac{1}{1 + e^{-X\omega}} = P(Y = 1 \mid X, \omega)$$
What is $g(\cdot)$ under squared-loss? Can you think of any other possible $g(\cdot)$?
Footnote 1: Statistics describes these through link functions (strictly, $g$ is the inverse of the link). In deep learning, these are referred to as activation functions.
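As a small illustration of the logistic transformation, the sketch below maps linear scores into (0, 1). The function names and toy inputs are assumptions; the two-branch form is just one common way to avoid overflow for large negative scores.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(X, w):
    """P(Y = 1 | X, w) for each row of X under logistic regression."""
    return [sigmoid(sum(x_j * w_j for x_j, w_j in zip(row, w))) for row in X]

# Toy inputs (hypothetical): the two rows have linear scores 2.5 and -2.5.
probs = predict_proba([[1.0, 2.0], [1.0, -3.0]], [0.5, 1.0])
print(probs)
```

Because the two scores are symmetric around zero, the two probabilities sum to one, which is a quick sanity check on the implementation.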

Introduction: Going deep
Let's do a quick recap. We're modeling $Y$ with a nonlinear function:
$$P(Y = 1 \mid X, \omega) = g_\omega(X)$$
The $\omega$ vector ($P \times 1$) of model weights is used to transform the $N \times P$ design matrix $X$ into an $N \times 1$ vector of predictions, call it $A$. But what if $\omega$ is a $P \times K$ matrix instead? Then $A$ is an $N \times K$ matrix, and we could repeat the procedure from before:
$$P(Y = 1 \mid X, \omega) = g_{\omega_2}(A) = g_{\omega_2}(g_{\omega_1}(X)) = G_\omega(X)$$
$G_\omega(\cdot)$ is a composition of nonlinear functions, and there's no need to stop at just two...
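The composition of two layers can be sketched as follows. This is a minimal illustration, assuming the logistic function as the activation in both layers; the weight values and the names `layer`, `W1`, and `W2` are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(A, W):
    """Apply g(A W) elementwise: A is N x P (list of rows), W is P x K."""
    n_in, n_out = len(W), len(W[0])
    return [[sigmoid(sum(row[p] * W[p][k] for p in range(n_in)))
             for k in range(n_out)]
            for row in A]

# Toy dimensions (hypothetical): N = 2 observations, P = 3 features, K = 2 hidden units.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # P x K
W2 = [[1.0], [-1.0]]                          # K x 1

A = layer(X, W1)       # N x K intermediate representation
P_hat = layer(A, W2)   # N x 1 predicted probabilities, G(X) = g2(g1(X))
print(P_hat)
```

Stacking more layers only means calling `layer` again on the previous output with another weight matrix of matching shape.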

Introduction: Efficient training
As shown in the introduction, under least squares the $\omega$ are found by minimizing:
$$\min_{\omega \in \mathbb{R}^P} R_\omega(X) = \min_{\omega \in \mathbb{R}^P} (X\omega - Y)^2$$
One may show that the $\omega$ may be found with the closed-form solution $\omega = (X^T X)^{-1} X^T Y$. However, there is no such closed-form solution once the logistic link function has been applied. So how do we fit the $\omega$?
In generalized linear models (GLMs) such as the above, the loss (risk) functions $R_\omega$ correspond to negative log-likelihoods. The $\omega$ are then obtained via an iterative process such as Newton-Raphson.
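To make the contrast concrete, the sketch below fits a single weight (no intercept) two ways: the closed form for least squares, and Newton-Raphson on the logistic negative log-likelihood. Restricting to one feature is an assumption made so the example needs no matrix library; the data and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ols_one_feature(x, y):
    """Closed form (x^T x)^{-1} x^T y for a single feature, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def logistic_gradient(w, x, y):
    """Derivative of the logistic negative log-likelihood with respect to w."""
    return sum((sigmoid(w * a) - b) * a for a, b in zip(x, y))

def logistic_hessian(w, x):
    """Second derivative of the logistic negative log-likelihood."""
    return sum(sigmoid(w * a) * (1.0 - sigmoid(w * a)) * a * a for a in x)

def newton_logistic_one_feature(x, y, n_iter=25):
    """Newton-Raphson: repeat w <- w - gradient / hessian from w = 0."""
    w = 0.0
    for _ in range(n_iter):
        w -= logistic_gradient(w, x, y) / logistic_hessian(w, x)
    return w

# Toy data (hypothetical); the classes overlap, so the logistic fit is finite.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]

w_ols = ols_one_feature(x, y)
w_logit = newton_logistic_one_feature(x, y)
print("least squares weight:", w_ols)
print("logistic weight:     ", w_logit)
```

The least squares weight comes from one formula, while the logistic weight needs repeated updates until the gradient vanishes.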

Introduction: Efficient training
The same is true for deep models; however, now the risk functions no longer correspond to negative log-likelihoods. The $G_\omega(\cdot)$ are very complicated functions involving compositions of many layers.
As long as $G_\omega(\cdot)$ is differentiable, however, the chain rule may be used to calculate the derivatives of the risk function with respect to each model weight. This procedure is termed backpropagation, and it is the engine behind fitting the thousands (or millions) of model weights.
In addition, slightly different optimization techniques and heuristics are used to improve the fitting procedure.
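The chain-rule computation can be sketched on a tiny two-layer network. This is an illustrative example only: it assumes logistic activations and a cross-entropy risk for a single observation, and every name and number in it is hypothetical. A finite-difference check is included to confirm that the analytic derivative matches a numerical one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    """Hidden activations h = g(x W1) and output p = g(h . w2)."""
    h = [sigmoid(sum(x[i] * W1[i][k] for i in range(len(x))))
         for k in range(len(w2))]
    p = sigmoid(sum(h_k * w_k for h_k, w_k in zip(h, w2)))
    return h, p

def loss(x, y, W1, w2):
    """Cross-entropy risk for one observation."""
    _, p = forward(x, W1, w2)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def backprop(x, y, W1, w2):
    """Gradients of the loss with respect to W1 and w2 via the chain rule."""
    h, p = forward(x, W1, w2)
    d_out = p - y                                  # dLoss / d(output pre-activation)
    grad_w2 = [d_out * h_k for h_k in h]
    # Propagate back through the output weights and the hidden activation.
    d_hidden = [d_out * w2[k] * h[k] * (1.0 - h[k]) for k in range(len(w2))]
    grad_W1 = [[x[i] * d_hidden[k] for k in range(len(w2))]
               for i in range(len(x))]
    return grad_W1, grad_w2

x, y = [1.0, -2.0, 0.5], 1
W1 = [[0.1, -0.3], [0.2, 0.4], [-0.5, 0.6]]
w2 = [0.7, -0.2]
grad_W1, grad_w2 = backprop(x, y, W1, w2)

# Central finite difference on one first-layer weight as a sanity check.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[1][0] += eps
W1_minus[1][0] -= eps
numeric = (loss(x, y, W1_plus, w2) - loss(x, y, W1_minus, w2)) / (2 * eps)
print("analytic:", grad_W1[1][0], "numeric:", numeric)
```

In practice these derivatives are computed by automatic differentiation in a deep learning framework rather than by hand.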

Introduction: Spatial features
One such heuristic is the use of convolutional filters. We'll consider the feed-forward model pictured below (see footnote 2):
[Figure: CNN architecture, panels A) to E). One-hot encoded input sequences (mini-batch) pass through convolution layer 1 (64 filters, ReLU and max pooling) and convolution layer 2 (128 filters, ReLU and max pooling); the convolution output feeds the fully-connected layers and a softmax output layer giving P(Y=1|X) and P(Y=0|X), trained with a cross-entropy loss against the labels.]
Up to this point, we've only discussed the fully-connected layers, but the convolutional (early) layers are really no different. The key difference is that instead of performing a full outer-product matrix multiplication, the filter performs an inner-product multiplication at each position along the input.
Footnote 2: Feed-forward in the sense that the entire observation is fed forward through the series of layers.
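The position-by-position inner product can be sketched as follows. This is a minimal illustration, not the model from the figure: the single hand-written filter, the toy sequence, and the helper names are assumptions, and a real first layer would hold many learned filters.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of length-4 indicator rows (A, C, G, T)."""
    return [[1.0 if base == n else 0.0 for n in ALPHABET] for base in seq]

def conv1d(encoded, filt):
    """Slide filt (l_f x 4) along encoded (L x 4): one inner product per position."""
    l_f = len(filt)
    out = []
    for start in range(len(encoded) - l_f + 1):
        window = encoded[start:start + l_f]
        out.append(sum(window[j][n] * filt[j][n]
                       for j in range(l_f) for n in range(4)))
    return out

# Hypothetical filter of length 3 that rewards the 3-mer "CTG".
filt = [[0, 1, 0, 0],   # C
        [0, 0, 0, 1],   # T
        [0, 0, 1, 0]]   # G

activations = conv1d(one_hot("ACTGCCGA"), filt)
print(activations)
```

The activation peaks where the window matches the filter exactly, and partial matches give smaller values.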

Introduction: Other techniques
[Figure: the CNN architecture from the previous slide, repeated.]
- ReLUs/TanHs
- Pooling
- Dropout
- Batch normalization
- Residual networks

Introduction: Sequential features
Alternatively, consider a model that incorporates temporality through the addition of the next nucleotide in the sequence (see footnote 3).
[Figure: bi-directional recurrent network. A one-hot encoded input sequence (A C T G C C A) feeds bi-directional hidden states, which produce a predicted state P(Y=1|X) at each position.]
Footnote 3: The recurrent neural network will not be discussed in this material.

Problem Formulation: Overview
Consider a set of genomic reads $S$ in which half of the set contains some motif ($Y = 1$) and half of the set does not contain the given motif ($Y = 0$). The motif in our example will be the TAL1 motif:
[Figure: TAL1 motif.]
The goal is to train a binary classifier on the genomic sequences, $S_i$, and understand to what extent the classifier has learned the inserted motif. Data such as these could come from peak sequences called from a ChIP-seq experiment.

Problem Formulation: Notation
Let $S$ represent the collection of nucleotide sequences of length $L$ for $N$ observations indexed with $i$ and composed of nucleotides $n \in \{A, C, G, T\}$, with corresponding labels $Y$. We wish to learn a function $G(\cdot)$ mapping $S \to Y$ through risk minimization. Define the empirical risk as:
$$R_G = -Y \left[ \log\left( \frac{1}{1 + e^{-G(S)}} \right) \right] - (1 - Y) \left[ \log\left( \frac{e^{-G(S)}}{1 + e^{-G(S)}} \right) \right]$$
As long as $R_G$ is differentiable we may use the chain rule (backpropagation) to calculate the derivative and perform gradient descent to update the parameter values $\omega$ and in turn minimize the empirical risk.
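The empirical risk above is the binary cross-entropy on the network output $G(S)$. A small sketch, assuming the risk is averaged over observations (the slide does not state the aggregation) and using hypothetical function names:

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def empirical_risk(scores, labels):
    """Average binary cross-entropy for network outputs G(S) and labels Y."""
    total = 0.0
    for g, y in zip(scores, labels):
        # log(e^{-g} / (1 + e^{-g})) is the same as log_sigmoid(-g).
        total += -y * log_sigmoid(g) - (1 - y) * log_sigmoid(-g)
    return total / len(labels)

# Toy outputs (hypothetical): confident and correct, uncertain, confident and wrong.
print(empirical_risk([4.0, 0.0, -4.0], [1, 1, 1]))
```

Working with log-sigmoids directly avoids taking the logarithm of a probability that has underflowed to zero.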

Problem Formulation: Notation
We consider $G(\cdot)$ to be composed of $M$ compositional functions (layers) such that:
$$G(S_i) = g_M(g_{M-1}(\dots(g_2(g_1(S_i)))))$$
and require $g_1(S_i)$ to be a convolutional layer such that, for convolutional filter $f$ of length $l_f$ at sequence position index $J$:
$$g_1^f(S_{i,J}) = \sum_{j=J}^{J+l_f} \sum_{n \in \{A,C,G,T\}} \omega^f_{1,j,n} \, \mathbb{1}_{S_{i,j} = n}$$
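The indicator in the convolution formula simply selects one weight per position, so the filter output can be computed by looking up weights by nucleotide. The sketch below assumes the filter covers exactly $l_f$ positions starting at $J$ and that weights are indexed by offset within the filter; the weight values and names are hypothetical.

```python
def filter_activation(seq, weights, start):
    """Sum, over filter offsets j, the weight for the nucleotide observed at start + j.

    weights[j][n] plays the role of omega_{1,j,n}; the indicator 1[S_ij = n]
    picks out exactly one of the four weights at each position.
    """
    return sum(weights[j][seq[start + j]] for j in range(len(weights)))

# Hypothetical filter of length 3 whose largest weights spell "CAG".
weights = [
    {"A": 0.1, "C": 1.2, "G": -0.4, "T": -0.9},
    {"A": 1.5, "C": -0.3, "G": -0.7, "T": -0.5},
    {"A": -0.6, "C": -0.2, "G": 1.1, "T": -0.3},
]

seq = "TTCAGAT"
scores = [filter_activation(seq, weights, J)
          for J in range(len(seq) - len(weights) + 1)]
best = max(range(len(scores)), key=lambda J: scores[J])
print(scores, "best position:", best)
```

This lookup form gives the same values as an inner product between the filter and a one-hot encoding of the window.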

Problem Formulation: Visualization
For example, our CNN may look like this:
[Figure: CNN architecture, panels A) to E). One-hot encoded input sequences (mini-batch) pass through convolution layer 1 (64 filters, ReLU and max pooling) and convolution layer 2 (128 filters, ReLU and max pooling); the convolution output feeds the fully-connected layers and a softmax output layer giving P(Y=1|X) and P(Y=0|X), trained with a cross-entropy loss against the labels.]
We then train our model for several epochs and keep the model weights from the iteration with the highest test set accuracy.

Model Interpretation: Overview
Understanding model rationale is an active field of research. What has my black box learned?
A first and easy distinction to make is between:
1. encouraging interpretable learning while training a model (L1/L2 regularization, interpretable CNNs [12], etc.)
2. interpreting learned knowledge with a trained model

Model Interpretation: Importance Scores
Given a trained model, model interpretation may be performed by computing importance scores. How important is nucleotide $n$ in contributing to the final model prediction?
There are two methodological approaches for computing such scores, or rather, visualizing learning:
1. Forward-, or perturbation-based [13, 2, 6]
2. Backward-, or backpropagation-based [4, 11, 9, 3] (see footnote 4)
Footnote 4: Happy to discuss these approaches if we have time.

Model Interpretation: Forward-based
Forward-based approaches are quite simple:
1. For a given observation, obtain a predicted value
2. Modify the value of a single feature (e.g. nucleotide A → C)
3. Obtain a new prediction
4. Calculate the difference, either at the network level or node level
Figure 1: Zhou, Jian, and Olga G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature methods 12.10 (2015): 931.
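The four perturbation steps above can be sketched as a loop over every single-nucleotide substitution. The `predict` function here is a toy stand-in for a trained network (an assumption for illustration, not any published model), and the motif, sequence, and names are hypothetical.

```python
import math

ALPHABET = "ACGT"
MOTIF = "CAG"  # hypothetical motif the toy model responds to

def predict(seq):
    """Toy stand-in for a trained network: a smooth function of the best motif match."""
    best = max(sum(a == b for a, b in zip(seq[i:i + len(MOTIF)], MOTIF))
               for i in range(len(seq) - len(MOTIF) + 1))
    return 1.0 / (1.0 + math.exp(-(2.0 * best - 4.0)))

def mutagenesis_scores(seq):
    """Return {(position, new_base): change in prediction} for each substitution."""
    reference = predict(seq)                              # step 1
    scores = {}
    for pos, ref_base in enumerate(seq):
        for alt in ALPHABET:
            if alt == ref_base:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]     # step 2
            scores[(pos, alt)] = predict(mutated) - reference  # steps 3 and 4
    return scores

seq = "TTCAGTT"
scores = mutagenesis_scores(seq)

# Per-position importance: the largest drop caused by any substitution there.
importance = [max(-scores[(pos, alt)] for alt in ALPHABET if alt != seq[pos])
              for pos in range(len(seq))]
print([round(v, 3) for v in importance])
```

Positions inside the motif show a large drop in the prediction when mutated, while flanking positions show none, which is the pattern the mutation maps in the cited figures visualize.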

Model Interpretation: Forward-based
Figure 4A: Alipanahi, Babak, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature biotechnology 33.8 (2015): 831.

Model Interpretation: Learning Motifs
Alternatively, given the trained network, which observations maximize network activations (either individual network nodes or final network output)? What sequence(s) has the network learned to recognize?
1. Pass test observations through the first convolutional layer
2. Per filter, zero out low values below threshold (noise)
3. Extract motif-length sequences around non-zero activations
4. Use sequences to compute position-weight matrix (PWM)
Figure 3B: Kelley, David R., Jasper Snoek, and John L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research 26.7 (2016): 990-999.
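The four steps above can be sketched for a single filter as follows. This is an illustrative example under stated assumptions: the filter weights, the threshold, and the test sequences are invented, and the PWM is reported as column-wise relative frequencies.

```python
ALPHABET = "ACGT"

def filter_scores(seq, weights):
    """Step 1: first-layer activations of one filter at every valid position."""
    l_f = len(weights)
    return [sum(weights[j][seq[i + j]] for j in range(l_f))
            for i in range(len(seq) - l_f + 1)]

def extract_hits(seqs, weights, threshold):
    """Steps 2 and 3: keep subsequences whose activation exceeds the threshold."""
    l_f = len(weights)
    hits = []
    for seq in seqs:
        for i, score in enumerate(filter_scores(seq, weights)):
            if score > threshold:   # lower activations are treated as noise
                hits.append(seq[i:i + l_f])
    return hits

def position_frequencies(hits):
    """Step 4: column-wise nucleotide frequencies of the aligned hits."""
    pwm = []
    for j in range(len(hits[0])):
        counts = {n: 0.0 for n in ALPHABET}
        for hit in hits:
            counts[hit[j]] += 1.0
        pwm.append({n: counts[n] / len(hits) for n in ALPHABET})
    return pwm

# Hypothetical filter that prefers C, then A, then G (T tolerated at the end).
weights = [
    {"A": -1, "C": 1, "G": -1, "T": -1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 1, "T": 0},
]
seqs = ["TTCAGTT", "ACATGGA", "GGCAGCA", "TTTTTTT"]

hits = extract_hits(seqs, weights, threshold=1.5)
pwm = position_frequencies(hits)
print(hits)
print(pwm[2])
```

The sketch assumes at least one activation passes the threshold; with real data one would also guard against filters that never activate.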

Model Interpretation: Learning Motifs
For motif $m$ of length $l$, define the information (height) at position $j \in \{1, \dots, l\}$ as:
$$R_j = \log_2 4 - H_j \quad \text{with} \quad H_j = \sum_{n \in \{A,C,G,T\}} H_{j,n}$$
where $H_j$ is the total entropy at position $j$ over nucleotides $n \in \{A, C, G, T\}$. Write the entropy at position $j$ for nucleotide $n$ as:
$$H_{j,n} = -f_{j,n} \log_2 f_{j,n}$$
for relative frequency $f_{j,n}$ of nucleotide $n$ at position $j$. $f_{j,n}$ is calculated from the sequences surrounding the non-zero activations.
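The information content can be computed directly from one column of relative frequencies. A minimal sketch, treating $0 \log_2 0$ as zero (a standard convention that the slide does not state) and using made-up frequency columns:

```python
import math

def information_content(freqs):
    """R_j = log2(4) - H_j, with H_j = -sum_n f_jn * log2(f_jn)."""
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(4) - entropy

# Hypothetical frequency columns for three motif positions.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
mixed = {"A": 0.0, "C": 0.0, "G": 2.0 / 3.0, "T": 1.0 / 3.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, column in [("conserved", conserved), ("mixed", mixed), ("uniform", uniform)]:
    print(name, round(information_content(column), 3))
```

A fully conserved position reaches the maximum of 2 bits, and a position with all four nucleotides equally likely carries no information.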

Model Interpretation: Learning Motifs
Supplementary Figure 4: Kelley, David R., Jasper Snoek, and John L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research 26.7 (2016): 990-999.

Summary
- The change in model predictions may be used to assess sensitivity to a given nucleotide in the sequence.
- Convolutional filters learn motifs.
- These analyses/interpretations are largely visualization-based, with no accompanying statistical framework; they do, however, recover past biological findings.
- Models are typically trained on ChIP-seq/protein-binding data, although any sequence data could in theory work. It is important to consider the problem at hand.
- Models show higher accuracy and sensitivity than techniques such as k-mer SVMs, although they require more observations and hyper-parameter tuning and are harder to interpret.

Summary
- We saw figures from the first three papers [13, 2, 6]; however, there is a ton of development in this field.
- Follow-up work includes improved architectures [8, 1, 10], improved interpretations [4, 11, 9, 3, 7], and diverse applications, to name a few.
- We also focused exclusively on genomics and didn't even touch the primary applications of image analysis and text/speech recognition!

Extensions
- Epigenetics (protein-binding, cell-type specific, etc.)
- Alternative splicing
- Model interpretability
- Population genetics
- Single cell RNA-seq
- GANs [5]

References
[1] Amr Mohamed Alexandari, Avanti Shrikumar, and Anshul Kundaje. Separable Fully Connected Layers Improve Deep Learning Models For Genomics. In: bioRxiv (2017), p. 146431.
[2] Babak Alipanahi et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. In: Nature biotechnology 33.8 (2015), pp. 831-838.
[3] Marco Ancona et al. A unified view of gradient-based attribution methods for Deep Neural Networks. In: arXiv preprint arXiv:1711.06104 (2017).
[4] Alexander Binder et al. Layer-wise relevance propagation for neural networks with local renormalization layers. In: International Conference on Artificial Neural Networks. Springer. 2016, pp. 63-71.
[5] Ian Goodfellow et al. Generative adversarial nets. In: Advances in neural information processing systems. 2014, pp. 2672-2680.
[6] David R Kelley, Jasper Snoek, and John L Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. In: Genome research 26.7 (2016), pp. 990-999.
[7] Jack Lanchantin et al. Deep motif: Visualizing genomic sequence classifications. In: arXiv preprint arXiv:1605.01133 (2016).
[8] Daniel Quang and Xiaohui Xie. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. In: Nucleic acids research 44.11 (2016), e107-e107.
[9] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning Important Features Through Propagating Activation Differences. In: CoRR abs/1704.02685 (2017). arXiv: 1704.02685. url: http://arxiv.org/abs/1704.02685.
[10] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Reverse-complement parameter sharing improves deep learning models for genomics. In: bioRxiv (2017), p. 103663.
[11] Matthew D Zeiler et al. Deconvolutional networks. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE. 2010, pp. 2528-2535.
[12] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable Convolutional Neural Networks. In: arXiv preprint arXiv:1710.00935 (2017).
[13] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. In: Nature methods 12.10 (2015), pp. 931-934.