QB LECTURE #4: Motif Finding Adam Siepel Nov. 20, 2015
2 Plan for Today
- Probability models for binding sites
- Scoring and detecting binding sites
- De novo motif finding
3 Transcription Initiation
[Figure: transcription initiation at a gene: chromatin, a distal TFBS within a CRM, a proximal TFBS, a co-activator complex, and the transcription initiation complex.]
4 Binding Sites
(a) Source binding sites (8 sites, 14 bp each):
Site 1  GACCAAATAAGGCA
Site 2  GACCAAATAAGGCA
Site 3  TGACTATAAAAGGA
Site 4  TGACTATAAAAGGA
Site 5  TGCCAAAAGTGGTC
Site 6  CAACTATCTTGGGC
Site 7  CAACTATCTTGGGC
Site 8  CTCCTTACATGGGC
(b) Consensus sequence (IUPAC codes): BRMCWAWHRWGGBM
5 Probability Model for Motifs
Let $x = (x_1, \ldots, x_k)$ be a sequence possibly representing a binding site of length k. We represent the motif as a sequence of position-specific multinomial models, $\pi = (\pi_{1,A}, \pi_{1,C}, \pi_{1,G}, \pi_{1,T}, \pi_{2,A}, \ldots, \pi_{k,T})$, such that $\pi_{i,j}$ is the probability of base j at position i. The likelihood is:
$L(x \mid \pi) = \prod_{i=1}^{k} P(x_i \mid \pi_{i,\cdot}) = \prod_{i=1}^{k} \pi_{i,x_i}$
6 Background Model
Assume an iid multinomial background model $\theta = (\theta_A, \theta_C, \theta_G, \theta_T)$, so that $L(x \mid \theta) = \prod_{i=1}^{k} \theta_{x_i}$. As with alignment, classical theory says a good statistic for discrimination is the log-odds score:
$\log \frac{L(x \mid \pi)}{L(x \mid \theta)} = \sum_{i=1}^{k} \left( \log \pi_{i,x_i} - \log \theta_{x_i} \right) = \sum_{i=1}^{k} s_{i,x_i}$, where $s_{i,a} = \log \pi_{i,a} - \log \theta_a$.
7 Weight Matrix
(a, b) Source binding sites and consensus sequence as on slide 4.
(c) Position frequency matrix (PFM):
Pos:  1  2  3  4  5  6  7  8  9 10 11 12 13 14
A:    0  4  4  0  3  7  4  3  5  4  2  0  0  4
C:    3  0  4  8  0  0  0  3  0  0  0  0  2  4
G:    2  3  0  0  0  0  0  0  1  0  6  8  5  0
T:    3  1  0  0  5  1  4  2  2  4  0  0  1  0
(d) Position weight matrix (PWM) of log-odds scores $\{s_{i,a}\}$:
A:  -1.93  0.79  0.79 -1.93  0.45  1.50  0.79  0.45  1.07  0.79  0.00 -1.93 -1.93  0.79
C:   0.45 -1.93  0.79  1.68 -1.93 -1.93 -1.93  0.45 -1.93 -1.93 -1.93 -1.93  0.00  0.79
G:   0.00  0.45 -1.93 -1.93 -1.93 -1.93 -1.93 -1.93 -0.66 -1.93  1.30  1.68  1.07 -1.93
T:   0.45 -0.66 -1.93 -1.93  1.07 -0.66  0.79  0.00  0.00  0.79 -1.93 -1.93 -0.66 -1.93
8 Estimating the Model
If we have several training examples, we can estimate the parameters in the usual way for multinomial models, as relative frequencies from the PFM (slide 7). Problem: sparse data. For the 8 example sites:
$\hat\pi_{1,A} = 0/8 = 0$, $\hat\pi_{1,C} = 3/8$, $\hat\pi_{1,G} = 2/8 = 1/4$, ..., $\hat\pi_{14,T} = 0/8 = 0$
9 Example of Estimates with Pseudocounts
Adding a pseudocount of 1 to each count in the PFM (slide 7) gives, for the same 8 sites:
$\hat\pi_{1,A} = 1/12$, $\hat\pi_{1,C} = 4/12$, $\hat\pi_{1,G} = 3/12$, ..., $\hat\pi_{14,T} = 1/12$
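The pseudocount estimates above are easy to compute directly. A minimal sketch, using the eight example sites from slide 4 (function and variable names are illustrative):

```python
# Estimate position-specific multinomial parameters from aligned sites,
# with a pseudocount of 1 per base, as on this slide.
BASES = "ACGT"

sites = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA", "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC", "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]

def estimate_motif(sites, pseudocount=1.0):
    k = len(sites[0])
    n = len(sites)
    # pi[i][b] = (count of base b at position i + pseudocount) / (n + 4*pseudocount)
    return [
        {b: (sum(s[i] == b for s in sites) + pseudocount) / (n + 4 * pseudocount)
         for b in BASES}
        for i in range(k)
    ]

pi = estimate_motif(sites)
print(pi[0]["A"])   # 1/12, as on the slide
print(pi[0]["C"])   # 4/12
```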
10 Prediction of Binding Sites
We predict a binding site if and only if
$S(x) = \sum_{i=1}^{k} s_{i,x_i} \ge T$,
where T is chosen to achieve the desired tradeoff between sensitivity and specificity.
- Sensitivity is the fraction of true sites that are predicted (1 − false negative rate)
- Specificity is the fraction of false sites that are not predicted (1 − false positive rate)
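To make the prediction rule concrete, here is a minimal sketch of scanning a sequence with a PWM and calling sites where $S(x) \ge T$; the two-position motif and the threshold are toy values, not from the slides:

```python
import math

# Score every window of a sequence with a PWM; predict a site where S(x) >= T.
BASES = "ACGT"
pi = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},   # toy motif: prefers "AG"
      {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
theta = {b: 0.25 for b in BASES}                   # iid uniform background

# s_{i,a} = log2 pi_{i,a} - log2 theta_a
s = [{a: math.log2(p[a]) - math.log2(theta[a]) for a in BASES} for p in pi]

def score(window):
    return sum(s[i][a] for i, a in enumerate(window))

def predict_sites(seq, T):
    k = len(s)
    return [(i, score(seq[i:i + k])) for i in range(len(seq) - k + 1)
            if score(seq[i:i + k]) >= T]

hits = predict_sites("TTAGCAGT", T=2.0)
print(hits)  # the "AG" windows at offsets 2 and 5 exceed T
```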
11 [Figure: score distributions under the background (null) and binding-site (alternative) models. True sites scoring below the threshold T on S(x) are false negatives; background sequences scoring above T are false positives.]
12 Sorting Out Terms
Prediction outcome vs. true condition:
               True   False
Predicted Pos   TP     FP    (PPV)
Predicted Neg   FN     TN    (NPV)
- Sens = TP / (TP + FN); Spec = TN / (FP + TN)
- FP rate = type I error = α = FP / (FP + TN) = 1 − Spec
- FN rate = type II error = β = FN / (TP + FN) = 1 − Sens
- Power = 1 − β (for bounded α)
- A p-value is an estimate of α for a given observation
13 How to Choose T?
- If known positive and negative examples are available, we can estimate sensitivity and specificity directly and adjust T accordingly
- Can control the false positive rate only, using a reasonable proxy for background (sometimes permuted data)
- Can generate synthetic data from the background model and use it to simulate from the null distribution of S(x)
- Can compute the exact null distribution by dynamic programming in some cases
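The simulation option can be sketched as follows; the log-odds matrix here is randomly generated purely for illustration (in practice it would come from an estimated motif):

```python
import random

# Choose T by simulating the null distribution of S(x) under the background model.
random.seed(0)
BASES = "ACGT"
k = 8
s = [{a: random.gauss(0.0, 1.0) for a in BASES} for _ in range(k)]  # toy PWM
theta = [0.25, 0.25, 0.25, 0.25]                                    # iid background

def score(x):
    return sum(s[i][a] for i, a in enumerate(x))

null_scores = sorted(
    score(random.choices(BASES, weights=theta, k=k)) for _ in range(10000)
)
# Threshold at the 99th percentile of the null, targeting a ~1% FP rate.
T = null_scores[int(0.99 * len(null_scores))]
fp_rate = sum(sc >= T for sc in null_scores) / len(null_scores)
print(round(fp_rate, 3))  # ~0.01 by construction
```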
14 Computing p-values
Similar methods can be used to compute p-values for predicted motifs.
- First characterize the null distribution of log-odds scores, $f(S(x))$, empirically or analytically
- Then assign a p-value to a prediction by computing $p = \sum_{y \ge S(x)} f(y)$
- Must be corrected for multiple testing
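A sketch of the empirical version of this calculation, with a stand-in simulated null and a Bonferroni correction for the number of windows scanned (all values here are illustrative):

```python
import random

# Empirical p-value for an observed score: p = fraction of null scores >= S(x),
# i.e., an estimate of sum_{y >= S(x)} f(y).
random.seed(1)
null_scores = [random.gauss(0.0, 1.0) for _ in range(100000)]  # stand-in null

def p_value(observed, null):
    return sum(y >= observed for y in null) / len(null)

p = p_value(3.0, null_scores)
m = 1000                       # hypothetical number of windows tested
p_corrected = min(1.0, m * p)  # Bonferroni correction for multiple testing
print(p, p_corrected)
```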
15 Improving the Background Model
- Bases are not independent: CpGs, poly-A runs, simple sequence repeats, transposons, etc.
- In some cases, nonindependence will inflate false positive rates
- A better background model is needed; typically, higher-order Markov models are used
16 Markov Models
We are interested in the joint distribution of $X_1, \ldots, X_k$ and for convenience have assumed independence:
$P(X_1, \ldots, X_k) = P(X_1) \cdots P(X_k)$
It may be slightly less egregious to assume:
$P(X_1, \ldots, X_k) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2) \cdots P(X_k \mid X_{k-1})$
This is a 1st-order Markov model. In an Nth-order model, each $X_i$ depends on $X_{i-N}, \ldots, X_{i-1}$.
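The first-order likelihood factors directly into code. A minimal sketch with toy transition probabilities (the CpG-suppressed values below are made up for illustration):

```python
import math

# Log-likelihood under a first-order Markov model:
# P(x) = P(x_1) * prod_{i>1} P(x_i | x_{i-1})
marginal = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
# transition[b][a] = P(next = a | current = b); each row sums to 1
transition = {
    "A": {"A": 0.3,  "C": 0.2,  "G": 0.2,  "T": 0.3},
    "C": {"A": 0.3,  "C": 0.25, "G": 0.05, "T": 0.4},   # CpG suppressed
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.3,  "C": 0.2,  "G": 0.2,  "T": 0.3},
}

def log_lik_markov(seq):
    ll = math.log(marginal[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(transition[prev][cur])
    return ll

print(log_lik_markov("CG"))  # log(0.2 * 0.05): CG dinucleotides are penalized
print(log_lik_markov("CA"))
```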
17 Markov Scores
Now the background model is $\theta = (\theta_{A|A}, \theta_{A|C}, \ldots, \theta_{T|G}, \theta_{T|T})$, where $\theta_{x_1|x_0}$ is specially defined to denote the marginal probability of $x_1$. The log-odds scores are:
$\log \frac{L(x \mid \pi)}{L(x \mid \theta)} = \sum_{i=1}^{k} \left( \log \pi_{i,x_i} - \log \theta_{x_i|x_{i-1}} \right) = \sum_{i=1}^{k} s_{i,x_i|x_{i-1}}$,
where $s_{i,a|b} = \log \pi_{i,a} - \log \theta_{a|b}$.
18 Effect of Better Background Model
[Figure: null and alternative score distributions under the first (iid) background model and under the better (Markov) background model. With the better model the two distributions separate more cleanly, so at a given threshold T there are fewer false positives and false negatives.]
19 An Aside on Information Theory Invented by Claude Shannon in the late 1940s, at the dawn of the digital age Motivated by problems in information transmission, especially data compression Has deep connections with probability theory, computer science, statistical mechanics, gambling and investment, etc. You benefit from it every time you gzip a file or look at a JPEG image!
20 Entropy
The entropy of a (discrete) rv X is:
$H(X) = -\sum_x p(x) \log p(x) = E\left[ \log \frac{1}{p(x)} \right]$
Interpretations of H(X):
- Min. ave. length of a binary encoding of X
- Ave. information gained by observing X
- Min. ave. number of yes/no questions to find out X
- Min. ave. number of fair coins required to generate X
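The definition above can be sketched in a few lines; the third example previews the biased coin used on the next slides:

```python
import math

# Entropy in bits: H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1 bit: one fair coin toss
print(entropy([0.25] * 4))   # 2 bits: uniform over the 4 DNA bases
print(entropy([0.2, 0.8]))   # ~0.722 bits: a coin biased toward tails
```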
21 Encoding Example
Suppose we want to encode n coin tosses as a binary sequence. If the coin is fair, we can do no better than to use one bit per toss, e.g., 00101110 for TTHTHHHT; it will always take n bits to encode the sequence. Suppose, however, that the coin is biased, with probability of heads θ = 0.2. Can we do better? It turns out we can (for large enough n), by encoding subsequences and giving shorter codes to more probable subsequences.
22 Encoding Example, cont.
X    P(X)   Code
TTT  0.512  0
TTH  0.128  100
THT  0.128  101
HTT  0.128  110
THH  0.032  11100
HTH  0.032  11101
HHT  0.032  11110
HHH  0.008  11111
Expected length: 0.512·1 + 3·(0.128·3) + 3·(0.032·5) + 0.008·5 = 2.184 bits, so 2.184/3 = 0.728 bits/coin are needed. For the naive code: 1 bit/coin. Entropy: 0.722 bits/coin.
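The slide's arithmetic can be checked directly, and compared against the entropy bound:

```python
import math

# Expected code length for the 3-toss code on this slide, for a theta = 0.2 coin.
codes = {
    "TTT": "0",     "TTH": "100",   "THT": "101",   "HTT": "110",
    "THH": "11100", "HTH": "11101", "HHT": "11110", "HHH": "11111",
}
theta = 0.2  # P(heads)

def prob(word):
    return math.prod(theta if c == "H" else 1 - theta for c in word)

expected_len = sum(prob(w) * len(c) for w, c in codes.items())
bits_per_toss = expected_len / 3
entropy_per_toss = -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))
print(expected_len, bits_per_toss, entropy_per_toss)  # 2.184, 0.728, ~0.722
```

The code beats the naive 1 bit/toss but cannot beat the entropy; longer blocks would approach the 0.722-bit bound.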
23 Entropy for a Bernoulli rv with Parameter p
[Figure: H(X) plotted as a function of p, maximized at p = 0.5.] H(X) is always concave and nonnegative.
24 Perfect Code
Suppose X has pdf: p(a) = 1/2, p(b) = 1/4, p(c) = 1/8, p(d) = 1/8. An optimal binary encoding is:
a → 0, b → 10, c → 110, d → 111
Expected length = H(X) = 1.75 bits. Naive encoding: 2 bits.
25 Entropy and Information
- Before an event X, your uncertainty about it is measured by H(X); therefore, when you observe X, your ave. gain in information is measured by H(X)
- However, you may not observe X directly; after observing a noisy message Y, there may still be uncertainty about X
- We can measure the (ave.) information content of Y as $H_{\mathrm{before}}(X) - H_{\mathrm{after}}(X)$
26 Relative Entropy
The relative entropy of pdf p wrt pdf q is:
$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
It represents the average additional bits needed to encode X if it comes from p but the code was optimized for q:
$D(p \| q) = H_{pq}(X) - H_{pp}(X) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
Useful as a measure of divergence between distributions.
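A small sketch, reusing the "perfect code" distribution from slide 24 to show the extra-bits interpretation:

```python
import math

# Relative entropy in bits: D(p||q) = sum_x p(x) log2 (p(x)/q(x)).
def relative_entropy(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # the "perfect code" distribution (slide 24)
q = [0.25, 0.25, 0.25, 0.25]    # uniform: the naive 2-bit code is optimal for q
print(relative_entropy(p, q))   # 2 - H(p) = 2 - 1.75 = 0.25 extra bits/symbol
print(relative_entropy(p, p))   # 0: no penalty for using the right model
```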
27 Mutual Information
The mutual information in rvs X and Y is:
$I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
I(X;Y) is the relative entropy of P(X,Y) wrt P(X)P(Y). It represents the reduction in uncertainty about X due to knowledge of Y. Mutual information can be thought of as a test statistic for independence (connected with the χ² test and the G test).
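Computed from a joint probability table, mutual information is zero exactly when the factorization P(X,Y) = P(X)P(Y) holds; a sketch with two extreme cases:

```python
import math

# Mutual information in bits: I(X;Y) = sum_{x,y} p(x,y) log2 [p(x,y) / (p(x)p(y))].
def mutual_information(joint):
    px = [sum(row) for row in joint]          # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(joint)) for j in range(len(joint[i]))
        if joint[i][j] > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]        # p(x,y) = p(x)p(y)
perfectly_coupled = [[0.5, 0.0], [0.0, 0.5]]      # Y determines X
print(mutual_information(independent))            # 0.0
print(mutual_information(perfectly_coupled))      # 1.0 bit
```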
28 Likelihood Connection
Suppose we have n iid random variables X_1, ..., X_n. What is the expected log likelihood?
$n \sum_x p(x) \log p(x) = -nH(X)$
Similarly, what is the expected log-odds score of model 1 wrt model 2, if the variables are drawn from model 1?
$n \sum_x p_1(x) \log \frac{p_1(x)}{p_2(x)} = nD(p_1 \| p_2)$
If drawn from model 2?
$n \sum_x p_2(x) \log \frac{p_1(x)}{p_2(x)} = -nD(p_2 \| p_1)$
29 Motif Information Content
The entropy of the distribution for each position determines the information content of that position:
$IC_i = 2 - H(X_i)$
This can be considered the ave. reduction in uncertainty wrt random DNA. It is also the relative entropy wrt random DNA:
$\sum_b p(X = b) \log \frac{p(X = b)}{1/4} = \sum_b p(X = b) \log p(X = b) - \sum_b p(X = b) \log \frac{1}{4} = 2 - H(X)$
Also related to the binding energy and to evolutionary constraint. Visualized in widely used sequence logos.
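Per-position information content is a one-liner given the entropy; the first example uses the pseudocount estimates for position 1 from slide 9:

```python
import math

# Information content per motif position: IC_i = 2 - H(X_i),
# where 2 = log2(4) bits is the entropy of uniform random DNA.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(probs):
    return 2.0 - entropy(probs)

pos1 = [1/12, 4/12, 3/12, 4/12]            # A, C, G, T at position 1 (slide 9)
print(information_content(pos1))           # weakly informative position
print(information_content([0.25] * 4))     # 0 bits: uninformative position
print(information_content([1, 0, 0, 0]))   # 2 bits: fully conserved position
```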
30 (a–d) Source binding sites, consensus sequence, PFM, and PWM as on slide 7.
(e) Site scoring for the sequence TTACATAAGTAGTC:
0.45 −0.66 0.79 1.68 0.45 −0.66 0.79 0.45 −0.66 0.79 0.00 1.68 −0.66 0.79
Σ = 5.23, 78% of maximum
(f) [Figure: sequence logo, in bits, over positions 1–14.]
31 Motif Discovery
Consider the problem of estimating a motif model from N sequences, each believed to contain a binding site for some TF. As before, we assume a motif model of width k with a multinomial distribution $\theta_l$ at each position l, and an iid multinomial background model $\theta_{bg}$. The goal in this case is to learn the parameters of the motif model. The location of the binding site in each sequence i, denoted $z_i$, is a latent variable.
32 Illustration
[Figure: iterative scheme: initialize the position-specific parameters $\theta_1, \theta_2, \ldots, \theta_k$; then alternate between sampling or averaging over site positions given the parameters, and re-estimating the parameters given the positions.]
33 EM vs. Gibbs Sampling
- In EM, we average over potential positions; in Gibbs sampling, we sample positions
- In EM, you estimate parameters that maximize the likelihood (locally); in Gibbs, you sample both binding sites and parameters, allowing for uncertainty in both
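A compact sketch of the EM alternative, in the spirit of a one-site-per-sequence model (as in MEME's OOPS setting). The toy data, names, and pseudocount below are illustrative, not from the lecture:

```python
import math
import random

# EM for motif discovery, one site per sequence:
# E-step: posterior over the latent site position z_i in each sequence;
# M-step: re-estimate position-specific multinomials from expected counts.
random.seed(2)
BASES = "ACGT"
K = 6  # motif width

def site_prob(window, pi):
    return math.prod(pi[i][b] for i, b in enumerate(window))

def em_step(seqs, pi, bg=0.25):
    counts = [{b: 0.5 for b in BASES} for _ in range(K)]  # small pseudocount
    for seq in seqs:
        # Posterior over start positions (background terms cancel up to a constant).
        w = [site_prob(seq[z:z + K], pi) / bg**K for z in range(len(seq) - K + 1)]
        total = sum(w)
        for z, wz in enumerate(w):
            for i in range(K):
                counts[i][seq[z + i]] += wz / total
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

# Toy sequences with a planted TATAAT-like motif.
seqs = ["ACGTATAATCCG", "TATAATGGCACG", "GGCTATAATTAC"]
pi = [{b: random.uniform(0.2, 0.3) for b in BASES} for _ in range(K)]
pi = [{b: v / sum(d.values()) for b, v in d.items()} for d in pi]
for _ in range(50):
    pi = em_step(seqs, pi)
consensus = "".join(max(d, key=d.get) for d in pi)
print(consensus)  # often recovers TATAAT, though EM can get stuck in local optima
```

A Gibbs sampler would replace the averaged E-step with a random draw of each $z_i$ from its conditional distribution, and could likewise sample the parameters.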
34 That's All