Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Size: px

Start display at page:

Download "Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?"

Joy Foster
5 years ago
Views:

1 Neyman-Pearson More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery Given a sample x 1, x 2,..., x n, from a distribution f(... #) with parameter #, want to test hypothesis # =! 1 vs # =! 2. Might as well look at likelihood ratio: f(x 1, x 2,..., x n! 1 ) f(x 1, x 2,..., x n! 2 ) > $ What s best WMM? Given 20 sequences s 1 of length 8, assumed to be generated at random according to a WMM defined by 8 x (4-1) parameters!, what s the best!? E.g., what MLE for! given data s 1? Answer: count frequencies per position. Weight Matrix Models 8 Sequences: GTG GTG TTG Log-Likelihood Ratio: log 2 f xi,i f xi, f xi = 1 4 Freq. Col 1 Col 2 Col3 A C G T A " -" G 0 -" 2.00 T "

2 Non-uniform Background E. coli - DNA approximately 25% A, C, G, T M. jannaschi - 68% A-T, 32% G-C LLR from previous example, assuming f A = f T = 3/8 f C = f G = 1/8 A.74 -" -" G " 3.00 T " e.g., G in col 3 is 8 x more likely via WMM than background, so (log 2 ) score = 3 (bits). WMM: How Informative? Mean score of site vs bkg? For any fixed length sequence x, let P(x) = Prob. of x according to WMM Q(x) = Prob. of x according to background Recall Relative Entropy: H(P Q) = P (x) P (x) log 2 Q(x) x Ω -H(Q P) H(P Q) is expected log likelihood score of a sequence randomly chosen from WMM; -H(Q P) is expected score of Background H(P Q) For WMM, you can show (based on the assumption of independence between columns), that : H(P Q) = i H(P i Q i ) where P i and Q i are the WMM/background distributions for column i. WMM Example, cont. Uniform A " -" G 0 -" T " RelEnt Freq. Col 1 Col 2 Col3 A C G T Non-uniform A.74 -" -" G " T " RelEnt

3 Pseudocounts Are the -! s a problem? Certain that a given residue never occurs in a given position? Then -! just right Else, it may be a small-sample artifact Typical fix: add a pseudocount to each observed count small constant (e.g.,.5, 1) Sounds ad hoc; there is a Bayesian justification How-to Questions Given aligned motif instances, build model? Frequency counts (above, maybe with pseudocounts) Given a model, find (probable) instances? Scanning, as above Given unaligned strings thought to contain a motif, find it? (e.g., upstream regions for coexpressed genes from a microarray experiment) Hard... next few lectures. Motif Discovery: 3 example approaches Greedy Best-First Approach [Hertz & Stormo] Greedy search Expectation Maximization Gibbs sampler Note: finding a site of max relative entropy in a set of unaligned sequences is NP-hard (Akutsu) Input: Sequence s 1 ; motif length I; breadth d Algorithm: create singleton set with each length l subsequence of each s 1 for each set, add each possible length l subsequence not already present compute relative entropy of each discard all but d best repeat until all have k sequences usual greedy problems

4 Expectation Maximization [MEME, Bailey & Elkan, 1995] Input (as above): Sequence s 1 ; motif length l; background model; again assume one instance per sequence (variants possible) Algorithm: EM Visible data: the sequences Hidden data: where s the motif { 1 if motif in sequence i begins at position j Y i,j = 0 otherwise Parameters ": The WMM MEME Outline Typical EM algorithm: Given parameters " t at t th iteration, use them to estimate where the motif instances are (the hidden variables) Use those estimates to re-estimate the parameters " to maximize likelihood of observed data, giving " t+1 Repeat Expectation Step (where are the motif instances?) Ŷ i,j = E(Y i,j s i, θ t ) = P (Y i,j = 1 s i, θ t ) = P (s i Y i,j = 1, θ t ) P (Y i, θ t ) P (s i θ t ) = cp (s i Y i,j = 1, θ t ) = c l k=1 P (s i,j+k 1 θ t ) where c is chosen so that j Ŷi,j = 1. E = 0 P (0) + 1 P (1) Bayes Ŷ i,j Sequence i Maximization Step (what is the motif?) Find " maximizing expected value: Q(θ θ t ) = E Y θ t[log P (s, Y θ)] = E Y θ t[log k P (s i, Y i θ)] log P (s i, Y i θ)] = k = k Y i,j log P (s i, Y i,j = 1 θ)] Y i,j log(p (s i Y i,j = 1, θ)p (Y i,j = 1 θ))] E Y θ t[y i,j ] log P (s i Y i,j = 1, θ) + C Ŷ i,j log P (s i Y i,j = 1, θ) + C

5 M-Step (cont.) Q(θ θ t ) = k Exercise: Show this is maximized by counting letter frequencies over all possible motif instances, with counts weighted by Ŷ i,j, again the obvious thing. si l+1 Ŷ i,j log P (s i Y i,j = 1, θ) + C s 1 : ACGGATT s k : GC... TCGGAC Ŷ 1,1 Ŷ 1,2 Ŷ 1,3. Ŷ k,l 1 Ŷ k,l ACGG CGGA GGAT. CGGA GGAC Initialization 1. Try every motif-length substring, and use as initial " a WMM with, say 80% of weight on that sequence, rest uniform 2. Run a few iterations of each 3. Run best few to convergence (Having a supercomputer helps)

CSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery

CSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery CSE 527 Autumn 2006 Lectures 8-9 (& part of 10) Motifs: Representation & Discovery 1 DNA Binding Proteins A variety of DNA binding proteins ( transcription factors ; a significant fraction, perhaps 5-10%,