Genome 541 Introduction to Computational Molecular Biology. Max Libbrecht

1 Genome 541 Introduction to Computational Molecular Biology Max Libbrecht

2 Genome 541 units
Max Libbrecht: Gene regulation and epigenomics (Postdoc, Bill Noble's lab)
Yi Yin: Bayesian statistics (Postdoc, Jay Shendure's lab)
Adrian Verster: Analysis of the microbiome (Postdoc, Elhanan Borenstein's lab)
Jesse Bloom: Molecular phylogenetics (Professor, Fred Hutch Basic Sciences division)
Frank DiMaio: Protein structure (Professor, Biochemistry)

3 Course requirements The entire course grade is based on the homework assignments. The homeworks involve writing programs for data analysis and running them on a computer that you have access to (we cannot provide computers). We don't require a specific language. Unless otherwise specified, homeworks are assigned on Thursday and due before class the following Thursday. Late homework will be penalized. Specifically, each assignment is worth 100 points, from which 10 points will be deducted for each day (or fraction thereof) that you turn it in late. The maximum deduction for being late is 60 points. It is OK to run your program on someone else's input data file and compare outputs to see if you get the same results. However, it is not OK to share programs or to get someone else to debug your program. A key part of the course is being able to write and debug your own program. Homework assignments should be turned in using the Catalyst Tools Dropbox. Please submit your code along with your answers.

4 Have you seen ChIP-seq? ChIP-exo? DamID? ATAC-seq? DNase-FLASH? Chromatin state annotation (Segway, ChromHMM)? Peak calling from ChIP-seq? 3C? Hi-C? ChIA-PET? 3D architecture inference? Topological domain inference? Convexity? Gradient descent? Newton's method? Stochastic or coordinate-wise optimization? Lagrange multipliers? EM algorithm? Central path optimization? Multidimensional scaling (MDS)? Neural networks?

5 One-minute class responses At the end of class, please write a brief response describing:
Something you found confusing
A question for next class
What you found most interesting

6 Unit 1: Gene regulation and epigenomics This unit

7 Unit 1: Gene regulation and epigenomics
Lecture 1: Transcription factor binding (sequence)
Lecture 2: Transcription factor binding (genomics assays)
Lecture 3: Integrating genomics assays
Lecture 4: Chromatin conformation

8 Methods focus: the generative framework Model Train Predict Data

9 Lecture 1: Transcription factor binding from genome sequence Sequence models of transcription factor binding Method: The EM algorithm for sequence motifs Method: Neural networks (deep learning)

10 Lecture 1: Transcription factor binding from genome sequence Sequence models of transcription factor binding Method: The EM algorithm for sequence motifs Method: Neural networks (deep learning)

11 Motifs

12 Motifs Motif (n): a succession of notes that has some special importance in or is characteristic of a composition

13 Sequence-specific transcription factors drive gene regulation Motif (n): a recurring genomic sequence pattern CTCCACGGC

14 The most common model of sequence motifs is the position-specific frequency matrix (PSFM) A C G T
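As a rough illustration of how a PSFM is used, the sketch below scores every window of a sequence against a small matrix and reports the best-matching start. The 3-position matrix and its frequencies are made up for this example, and the function names are hypothetical; the only idea taken from the slide is that each motif position has its own distribution over A, C, G, T.

```python
# Hypothetical 3-position PSFM: one dict of base frequencies per motif position.
# The numbers are invented for illustration; each column sums to 1.
PSFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def psfm_probability(psfm, site):
    """Probability of a site under the PSFM, treating positions as independent."""
    if len(site) != len(psfm):
        raise ValueError("site length must equal the motif width")
    prob = 1.0
    for column, base in zip(psfm, site):
        prob *= column[base]
    return prob


def best_site(psfm, sequence):
    """Return (start, probability) of the highest-scoring window in the sequence."""
    width = len(psfm)
    scored = [
        (psfm_probability(psfm, sequence[s:s + width]), s)
        for s in range(len(sequence) - width + 1)
    ]
    prob, start = max(scored)
    return start, prob
```

Calling `best_site(PSFM, "TTACGTT")` picks the window `ACG`, which is the only one that matches the high-frequency base at every position.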

15 Motivating question: How accurately does a PSFM predict the binding of a given transcription factor?

16 Lecture 1: Transcription factor binding from genome sequence Sequence models of transcription factor binding Method: The EM algorithm for sequence motifs Method: Neural networks (deep learning)

17 MEME problem statement
Input: Collection of sequences.
Assumption: One TFBS per sequence.
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
Output: High-likelihood PSFM

18 MEME likelihood model
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
$$P(\text{sequence} \mid \text{motif starts at position } i, \text{PSFM}) = \prod_{j=i}^{i+l} \text{PSFM}(\text{sequence}_j, j-i) \prod_{j \notin [i,\, i+l]} \text{background}(\text{sequence}_j)$$
Graphical model nodes: TFBS start$_i$, sequence$_i$
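The likelihood on this slide can be sketched in a few lines: positions inside the motif window are scored by the PSFM column for their offset, and every other position is scored by the background. The toy PSFM, the uniform background, and the function name are assumptions for illustration. The sketch also assumes the window is the half-open range `[start, start + width)`, with `width` taken from the PSFM, and works in log space to avoid underflow on long sequences.

```python
import math

# Hypothetical toy model, invented for illustration.
PSFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_likelihood(sequence, start, psfm, background):
    """log P(sequence | motif starts at `start`, PSFM).

    Assumes the motif occupies positions start .. start + width - 1.
    """
    width = len(psfm)
    if not 0 <= start <= len(sequence) - width:
        raise ValueError("motif window falls outside the sequence")
    total = 0.0
    for j, base in enumerate(sequence):
        if start <= j < start + width:
            # Inside the motif: use the PSFM column for offset j - start.
            total += math.log(psfm[j - start][base])
        else:
            # Outside the motif: use the background distribution.
            total += math.log(background[base])
    return total
```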

19 The EM algorithm optimizes latent variable (missing data) models TFBS start position C A T G G

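20 The EM algorithm optimizes latent variable (missing data) models
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
Notation: $X$ = sequence, $Y$ = motif start position, $\theta$ = PSFM
$$P(\text{sequence} \mid \text{motif starts at position } i, \text{PSFM}) = \prod_{j=i}^{i+l} \text{PSFM}(\text{sequence}_j, j-i) \prod_{j \notin [i,\, i+l]} \text{background}(\text{sequence}_j)$$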

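21 The EM algorithm optimizes latent variable (missing data) models
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
Notation: $X$ = sequence, $Y$ = motif start position, $\theta$ = PSFM
$$P(\text{sequence} \mid \text{motif starts at position } i, \text{PSFM}) = \prod_{j=i}^{i+l} \text{PSFM}(\text{sequence}_j, j-i) \prod_{j \notin [i,\, i+l]} \text{background}(\text{sequence}_j)$$
Want to optimize:
$$P(X \mid \theta) = \sum_Y P(X, Y \mid \theta)$$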

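22 The EM algorithm optimizes latent variable (missing data) models
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
Notation: $X$ = sequence, $Y$ = motif start position, $\theta$ = PSFM
$$P(\text{sequence} \mid \text{motif starts at position } i, \text{PSFM}) = \prod_{j=i}^{i+l} \text{PSFM}(\text{sequence}_j, j-i) \prod_{j \notin [i,\, i+l]} \text{background}(\text{sequence}_j)$$
Want to optimize:
$$P(X \mid \theta) = \sum_Y P(X, Y \mid \theta)$$
Easier to optimize:
E-step: $Q(Y \mid X, \theta) \leftarrow P(Y \mid X, \theta)$
M-step: $\theta' \leftarrow \arg\max_{\theta'} \mathbb{E}_{Y \sim Q(Y \mid X, \theta)}\left[\log P(X \mid Y, \theta')\right]$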

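23 Idea #1: Compute probability distribution jointly over all TFBS
Graphical model: TFBS starts ($Y_1, Y_2, \ldots$) and sequences ($X$)
E-step: $Q(Y \mid X, \theta) \leftarrow P(Y \mid X, \theta)$
$$P(Y_{1:N} = y_{1:N} \mid X_{1:N}, \theta) = \frac{\prod_i P(X_i \mid Y_i = y_i, \theta)}{\sum_{y'} \prod_i P(X_i \mid Y_i = y'_i, \theta)}$$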

24 In order to efficiently compute a probability distribution over many variables, that distribution must be factorizable
Factorizable structures: independent, chain-structured, tree-structured
Related terms: junction tree algorithm, belief propagation, factor graph, tree-width, running intersection property

25 The MEME probability distribution is efficiently computable because each sequence is independent given θ
$X_i$: TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
Graphical model nodes: θ, TFBS start $i$ ($Y_i$), sequence $i$ ($X_i$)
E-step, distribution for sequence $i$:
$$Q_i(Y_i = y_i \mid X_i, \theta) = \frac{P(X_i \mid Y_i = y_i, \theta)}{\sum_{y'_i} P(X_i \mid Y_i = y'_i, \theta)}$$
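Because each sequence is handled independently, the E-step for one sequence amounts to normalizing the per-start likelihoods. The sketch below does this in log space with a max-subtraction for numerical stability. The toy PSFM, uniform background, and function names are assumptions for illustration, and a uniform prior over start positions is assumed.

```python
import math

# Hypothetical toy model, invented for illustration.
PSFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_likelihood(sequence, start, psfm, background):
    """log P(sequence | motif occupies start .. start + width - 1)."""
    width = len(psfm)
    total = 0.0
    for j, base in enumerate(sequence):
        if start <= j < start + width:
            total += math.log(psfm[j - start][base])
        else:
            total += math.log(background[base])
    return total


def e_step(sequence, psfm, background):
    """Return Q(Y = k) for every valid start k, assuming a uniform prior on k."""
    width = len(psfm)
    log_liks = [
        log_likelihood(sequence, k, psfm, background)
        for k in range(len(sequence) - width + 1)
    ]
    # Subtract the maximum before exponentiating so the largest term is exp(0).
    top = max(log_liks)
    weights = [math.exp(v - top) for v in log_liks]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]
```

For `"TTACGTT"` the posterior concentrates on start 2, where the window `ACG` matches the toy motif.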

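26 MEME update rule
$X_i$: TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
M-step: $\theta' \leftarrow \arg\max_{\theta'} \mathbb{E}_{Y \sim Q(Y \mid X, \theta)}\left[\log P(X \mid Y, \theta')\right]$
Update rule:
$$\text{PSFM}(A, j) = \frac{\sum_i \sum_k Q(Y_i = k)\, \mathbb{1}(X_{i,k+j} = A)}{\sum_i \sum_k Q(Y_i = k)}$$
where $i$ is the sequence number, $k$ is the position number, and $\mathbb{1}(\cdot)$ is the indicator function.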
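The update rule is a weighted count: each candidate start position contributes its window to the column counts in proportion to its posterior weight. Below is a minimal sketch under that reading. The function name and the `pseudocount` parameter are assumptions added here (the slide's rule corresponds to `pseudocount=0.0`); a small positive value is a common way to keep probabilities away from zero.

```python
def m_step(sequences, posteriors, width, alphabet="ACGT", pseudocount=0.0):
    """Re-estimate the PSFM from posterior weights over start positions.

    posteriors[i][k] is Q(Y_i = k) for sequence i and start position k.
    pseudocount is an illustrative extra, not part of the slide's update rule.
    """
    counts = [{a: pseudocount for a in alphabet} for _ in range(width)]
    for sequence, q in zip(sequences, posteriors):
        for k, q_k in enumerate(q):
            for j in range(width):
                # Weighted indicator: base at position k + j counts toward column j.
                counts[j][sequence[k + j]] += q_k
    psfm = []
    for column in counts:
        total = sum(column.values())
        psfm.append({a: c / total for a, c in column.items()})
    return psfm
```

With hard assignments (all posterior mass on one start per sequence) this reduces to ordinary frequency counting over the chosen sites.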

27 MEME initializes EM separately using each subsequence as a starting point
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
1. Build an initial PSFM (rows A, C, G, T) from each subsequence.
2. Run one round of EM from each initialization.
3. Choose the highest-likelihood initializations and run EM to convergence.
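One way to turn a subsequence into a starting PSFM is to give the observed base at each position a fixed share of the probability and split the remainder evenly. The sketch below assumes this scheme; the function names and the `peak=0.5` value are illustrative choices, not taken from the slide.

```python
def seed_psfm(subsequence, peak=0.5, alphabet="ACGT"):
    """Initial PSFM that favors `subsequence`.

    peak is the probability assigned to the observed base at each position;
    0.5 is an illustrative choice, not a value from the slide.
    """
    rest = (1.0 - peak) / (len(alphabet) - 1)
    return [
        {a: (peak if a == base else rest) for a in alphabet}
        for base in subsequence
    ]


def all_seeds(sequence, width):
    """One seed PSFM per length-`width` window of the sequence."""
    return [
        seed_psfm(sequence[s:s + width])
        for s in range(len(sequence) - width + 1)
    ]
```

Each seed would then get one round of EM, and only the best-scoring ones would be run to convergence.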

28 Lecture 1: Transcription factor binding from genome sequence Sequence models of transcription factor binding Method: The EM algorithm for sequence motifs Method: Neural networks (deep learning)

29 How can we get a better model than sequence motifs?
DNA physical shape
Variable gaps
Cooperativity between TFs
Nucleosome interactions

30 Supervised framework
Input: CTCF ChIP-seq peak sequence (~150 bp)
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
Output: f(sequence), answering "CTCF binds in this sequence?"

31 Sequence representation: one-hot encoding A C C G T One-hot encoding Input sequence ACCGTCGGTATAGGCTTATAAATCTCGG
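A one-hot encoding maps each base to a length-4 indicator vector. The sketch below uses one row per sequence position with columns ordered A, C, G, T; that orientation and the function name are assumptions, and the transpose (one row per base) is equally common.

```python
ALPHABET = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of rows, one per base, columns A, C, G, T."""
    rows = []
    for base in sequence:
        if base not in ALPHABET:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows
```

For example, `one_hot("ACCGT")` produces five rows, each with a single 1 in the column of the corresponding base.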

32 Idea #1: linear classifier
$$f(x) = \mathrm{sign}\left(\sum_i \theta_i x_i\right)$$
Input $x$: one-hot encoding of A C C G T
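A linear classifier over a one-hot input takes a weighted sum of the indicator features and returns its sign. The sketch below flattens the encoding into a single vector. The weights are hand-picked for illustration (in practice they would be learned), the function names are hypothetical, and treating a sum of exactly zero as positive is an arbitrary convention.

```python
def one_hot_flat(sequence, alphabet="ACGT"):
    """Flatten the one-hot encoding: four indicator features per position."""
    return [1.0 if base == letter else 0.0 for base in sequence for letter in alphabet]


def linear_classifier(weights, x):
    """Return +1 or -1 according to the sign of the weighted sum."""
    if len(weights) != len(x):
        raise ValueError("weights and features must have the same length")
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else -1  # a sum of exactly zero is mapped to +1 here


# Hypothetical weights for length-2 inputs: reward A at the first position,
# penalize anything else there, and ignore the second position.
WEIGHTS = [1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
```

Note that with one weight per (position, base) pair, this classifier is closely related to scoring with a single fixed-position motif.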

33 Neural networks
Output $x_3$: $f_{3,1}(x_2) = \mathrm{sign}(W_{3,1} x_2)$
Layer $x_2$: $f_{2,1}(x_1) = W_{2,1} x_1$
Layer $x_1$: $f_{1,1}(x_0) = W_{1,1} x_0$, $f_{1,2}(x_0) = W_{1,2} x_0$
Input $x_0$: one-hot encoding of A C C G T

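34 A nonlinear activation function is applied at each layer
Logistic function: $f_{1,1}(x_0) = \mathrm{logistic}(W_{1,1} x_0)$, where $\mathrm{logistic}(z) = 1 / (1 + e^{-z})$
Rectified linear (ReLU) function: $\mathrm{ReLU}(z) = \max(0, z)$
Input $x_0$: one-hot encoding of A C C G T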
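The two activation functions named on this slide are short enough to write out. The sketch below also includes a small dense-layer helper (a hypothetical name) showing where the activation is applied: after each weighted sum. The logistic is written in a form that avoids overflow for large negative inputs.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, 1 / (1 + exp(-z)), computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def dense_layer(weights, x, activation):
    """Apply activation(w . x) for each weight row w in `weights`."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)))
        for row in weights
    ]
```

Without the activation, stacking layers would collapse into a single linear map, which is why the nonlinearity matters.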

35 Neural networks (layers from top to bottom)
Output
Higher-level intermediate features (binding subdomains, cofactor binding)
Intermediate features (DNA shape)
Raw features (one-hot encoding of A C C G T)

36 A convolutional network applies the same function across portions of the input
$$f(x_{1:4}) = \mathrm{logistic}(W x_{1:4})$$
Input: A C C G T C G G T A T A
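A one-dimensional convolution over DNA can be sketched as sliding one small weight matrix along the one-hot encoded sequence and applying the same activation at every offset. The filter below is hand-written to respond to the dinucleotide CG and is purely illustrative, as are the function names; a trained network would learn many such filters.

```python
import math

ALPHABET = "ACGT"


def one_hot(sequence):
    """One row per base, columns ordered A, C, G, T."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in sequence]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv1d(encoded, filt, activation):
    """Apply the same filter (width x 4 weights) at every window of the input."""
    width = len(filt)
    outputs = []
    for start in range(len(encoded) - width + 1):
        z = sum(
            filt[j][c] * encoded[start + j][c]
            for j in range(width)
            for c in range(len(ALPHABET))
        )
        outputs.append(activation(z))
    return outputs


# Hypothetical width-2 filter: weight on C at offset 0 and on G at offset 1.
CG_FILTER = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
```

On `"ACCGT"` the strongest response is at the window starting at index 2, where the sequence reads CG.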

37 A multi-task approach shares representations between factors
Prediction: Binds / Doesn't bind
CTCF classifier, NRSF classifier
Intermediate representation
Shared feature extraction
Input sequence: TCTCTCTCCACGGCTAATTAGGTGATCATGA

38 DeepSEA architecture
TF-specific layers
Shared fully connected layers
Shared convolutional layers
Input: ACCGTCGGTATAGGCTTATAAATCTCGG
Zhou et al, Nature Methods 2015

39 Deep neural network accurately predicts TF binding and DNase hypersensitivity Reference sequence ACCGTCGGTATATCGTCGG CTCF binding in liver cells (ChIP-seq) Train Test DeepSEA model Mean AUC: Zhou et al, Nature Methods 2015

40 Deep neural network accurately predicts known regulatory variants Reference sequence Genetic variant ACCGTCGGTATATTCGTCCCGTCGGTATAGGCACCGTCGGTATA ACCGTCGGTATATTCGTCCCGTCGGTATTGGCACCGTCGGTATA Trained neural network Predicted change in binding strength?
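The variant-scoring idea on this slide can be sketched as scoring the reference and variant sequences with the same model and taking the difference. The code below is not DeepSEA's interface: `toy_model` is a stand-in scoring function invented for this example (it just returns GC content), and `variant_effect` is a hypothetical helper name.

```python
def variant_effect(model, reference, position, alt_base):
    """Predicted change in score when `reference[position]` is replaced by `alt_base`."""
    if reference[position] == alt_base:
        raise ValueError("alternate base equals the reference base")
    variant = reference[:position] + alt_base + reference[position + 1:]
    return model(variant) - model(reference)


def toy_model(sequence):
    """Stand-in for a trained network: returns the fraction of G/C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)
```

With a real trained model in place of `toy_model`, the same difference would be the predicted change in binding strength for the variant.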

41 When should you use a complex model (DeepSEA) over a simpler model (MEME)?
Complex models generally: have higher accuracy; need lots of training data.
Simpler models generally: are interpretable; are generalizable.
Neural networks specifically: require lots of compute (GPUs); work well with structured input (DNA sequence, images, audio).

42 Administrivia Lecture notes are posted on the lab web page. I sent out an email this morning to the class mailing list; talk to me if you didn't get it. HW1 out soon: Learning motif PSFMs with EM. Please write a one-minute response before you leave.
