Probabilistic models of biological sequence motifs
1 Probabilistic models of biological sequence motifs Discovery of new motifs Master in Bioinformatics UPF Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain
2 What is a sequence motif? A sequence (or set of sequences) of biological significance, usually represented together as a pattern (e.g. a matrix). Examples of sequence motifs: protein binding sites in DNA; protein sequences corresponding to common functions; conserved pieces of structure.
3 Sequence Motifs Pattern matching: search for known motifs in a sequence (e.g. using a PWM). Pattern discovery: search for new motifs that are common to many sequences (e.g. using the Expectation Maximization algorithm).
4 Biological relevance of motif finding: finding a common or similar motif in a set of sequences that play a similar biological role can serve as evidence that this motif has a specific function. E.g. promoter sequences of co-expressed genes in yeast.
5 Motifs and Weight Matrices Given a set of aligned sequences, we already know that we can construct a weight matrix to characterize the motif (see lecture on PWMs)
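As a reminder of that construction, here is a minimal sketch in Python that builds a frequency profile from a set of aligned motif instances (the example sequences and the function name are illustrative, not from the slides):

```python
from collections import Counter

def profile_matrix(aligned, alphabet="acgt"):
    # Column-wise letter frequencies of a set of aligned, equal-length motifs
    k = len(aligned[0])
    profile = {c: [0.0] * k for c in alphabet}
    for col in range(k):
        counts = Counter(seq[col] for seq in aligned)
        for c in alphabet:
            profile[c][col] = counts[c] / len(aligned)
    return profile

motifs = ["acgtac", "aagtac", "acgttc"]
P = profile_matrix(motifs)
print(P["a"])  # frequency of 'a' in each column
```

Each column of the resulting matrix sums to 1, which is what lets us treat it as a position-wise probability model later on.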
6 Motifs and Weight Matrices How can we construct a weight matrix if the sequences are not aligned? Moreover, in general we do not know what the motif looks like.
7 But, what is the problem? Let's consider the reverse situation. Consider 10 DNA sequences:
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
8 We implant the motif AAAAAAAGGGGGGG:
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
9 Where is the implanted motif? (the same ten sequences as in the previous slide)
10 We implant the motif AAAAAAAGGGGGGG with four mutations:
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
11 Where is the implanted motif now? (the same ten mutated sequences as in the previous slide)
12 Why is motif finding difficult? Motifs can mutate at non-important bases, but these are a priori unknown. In the previous sequences, the implanted pattern appears only in mutated forms, e.g. AgAAgAAAGGttGGG and caataaaacggcggg.
13 The Motif Finding Problem Motif Finding Problem: given a list of t sequences, each of length n, find the best pattern of length l that appears in each of the t sequences.
14 The Motif Finding Problem We use heuristics: rules derived from our own experience that provide quick, approximate solutions to complex problems.
15 The motif finding (pattern discovery) problem INPUT: N DNA sequences that are biologically related (e.g. promoters of co-expressed genes). OUTPUT: 1. The motif: a representation (e.g. a weight matrix) of a motif common to all (or some of) the sequences; this can be any type of model. 2. The positions: the best occurrence of the motif in each sequence.
16 Motif finding (iterative) algorithms:
while (convergence condition FALSE){
  # calculate model
  # modify model
  # evaluate new model
  # converged?
  # go to next iteration
}
If, after many iterations, the quality of the solution does not improve, we say that, given the INPUT, the program has converged to a solution (supposedly the best).
17 Before motif finding How do we obtain a set of sequences on which to run motif finding? I.e., how do we get genes that we believe are regulated by the same transcription factor?
Expression data: microarrays, RNA-Seq, etc. measure the activity of thousands of genes in parallel under one or more conditions. Collect a set of genes with similar expression (activity) profiles and do motif finding on these.
Protein binding data: ChIP-on-chip, ChIP-Seq, RNA IP (RIP), CLIP, etc. measure whether a particular factor binds to each of the sequences, and provide hundreds or thousands of DNA/RNA target sequences. Collect the set to which the protein binds and do motif finding on these.
18 Greedy Algorithm for Motif Search
19 Greedy Algorithm find a way to greedily change possible motif locations until we have converged to a highly likely hidden motif.
20 Greedy Algorithm Consider t input sequences. Let s = (s_1, ..., s_t) be the set of starting positions for k-mers in our t sequences (select arbitrary positions in the sequences). This defines t substrings corresponding to these starting positions, which will form: a t x k alignment matrix and a 4 x k profile matrix P. The profile matrix P will be defined in terms of the frequency of letters, not the count of letters.
21 Greedy Algorithm Profile matrix from random 6-mers in the t input sequences (positions 1-6 in the motif):
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
22 Greedy Algorithm Define the most probable k-mer from each sequence using the profile P built above. E.g., given the sequence ctataaaccttacat, find the most probable 6-mer according to P:
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
23 Greedy Algorithm Compute prob(a|P) for every possible 6-mer a of ctataaaccttacat:
ctataa (position 1): 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 = 0
tataaa (position 2): 1/8 x 7/8 x 0 x 0 x 1/8 x 0 = 0
ataaac (position 3): 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 = 0
taaacc (position 4): 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 = 0
aaacct (position 5): 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 ≈ .0336
aacctt (position 6): 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 ≈ .0299
acctta (position 7): 1/2 x 0 x 1/2 x 0 x 1/4 x 0 = 0
ccttac (position 8): 1/8 x 0 x 0 x 0 x 1/8 x 0 = 0
cttaca (position 9): 1/8 x 1/8 x 0 x 0 x 3/8 x 0 = 0
ttacat (position 10): 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 ≈ .0004
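This scan can be sketched in a few lines of Python, using the profile values from the table above (the function names are illustrative):

```python
P = {"a": [1/2, 7/8, 3/8, 0, 1/8, 0],
     "c": [1/8, 0, 1/2, 5/8, 3/8, 0],
     "t": [1/8, 1/8, 0, 0, 1/4, 7/8],
     "g": [1/4, 0, 1/8, 3/8, 1/4, 1/8]}

def kmer_prob(kmer, profile):
    # prob(a | P): product of the per-position probabilities
    p = 1.0
    for i, c in enumerate(kmer):
        p *= profile[c][i]
    return p

def most_probable_kmer(seq, profile, k):
    # Slide every window of width k over the sequence, keep the best
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(kmers, key=lambda a: kmer_prob(a, profile))

best = most_probable_kmer("ctataaaccttacat", P, 6)
print(best, round(kmer_prob(best, P), 4))  # aaacct 0.0336
```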
24 Greedy Algorithm The most probable 6-mer in the sequence is aaacct, with prob(a|P) ≈ .0336 (see the table on the previous slide).
25 Greedy Algorithm Find the P-most probable k-mer in each of the sequences:
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct
26 Greedy Algorithm (the P-most probable k-mer, highlighted in the original slide, is found in each sequence; same profile and sequences as the previous slide)
27 Greedy Algorithm The t most probable k-mers define a new profile:
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct
28 Greedy Algorithm (repeat of the previous slide's alignment and profile)
29 Greedy Algorithm New profile:
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
Previous matrix:
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Some frequencies have increased and some have decreased (shown in red and blue, respectively, in the original slide).
30 Greedy Algorithm The steps:
1) Select random starting positions.
2) Create a profile P from the substrings at these starting positions.
3) Find the most probable k-mer a in each sequence and change the starting position to the starting position of a.
4) Compute a new profile based on the new starting positions after each iteration, and proceed until we cannot increase the score anymore.
31 Greedy Algorithm The pseudocode:
1. GreedyProfileMotifSearch (DNA, t, n, l)
2. Randomly select starting positions s = (s_1, ..., s_t) from DNA
3. bestScore ← 0
4. while Score(s, DNA) > bestScore
5.   Form profile P from s
6.   bestScore ← Score(s, DNA)
7.   for i ← 1 to t
8.     Find the most probable l-mer a from the i-th sequence
9.     s_i ← starting position of a
10. return bestScore
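A runnable sketch of this pseudocode follows. The scoring function used here, the sum of the chosen k-mers' probabilities under P, is one simple stand-in for Score(s, DNA), and the sketch returns the final starting positions rather than the score:

```python
import random

def profile_from(dna, starts, k, alphabet="acgt"):
    # Frequency profile of the k-mers at the current starting positions
    cols = [[seq[s + i] for seq, s in zip(dna, starts)] for i in range(k)]
    return {c: [col.count(c) / len(dna) for col in cols] for c in alphabet}

def kmer_prob(kmer, profile):
    p = 1.0
    for i, c in enumerate(kmer):
        p *= profile[c][i]
    return p

def greedy_profile_motif_search(dna, k, rng=random.Random(0)):
    # 1) random starting positions
    starts = [rng.randrange(len(seq) - k + 1) for seq in dna]
    best_score = -1.0
    while True:
        # 2) build profile P from the current positions, evaluate them
        P = profile_from(dna, starts, k)
        score = sum(kmer_prob(seq[s:s + k], P) for seq, s in zip(dna, starts))
        if score <= best_score:      # no improvement: stop
            return starts
        best_score = score
        # 3) move each start to the most P-probable k-mer in its sequence
        starts = [max(range(len(seq) - k + 1),
                      key=lambda j, seq=seq: kmer_prob(seq[j:j + k], P))
                  for seq in dna]
```

As the next slide notes, in practice this would be restarted many times from different random seeds.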
32 Greedy Algorithm We have described it here with a probability matrix, but a more correct description would use a weight matrix model (log-likelihood ratios). Note that the model could also be of another type, e.g. a Markov model of any order. Since we choose starting positions randomly, there is little chance that our initial guess will be close to an optimal motif, meaning it would take a long time to find the optimal motif. In practice, this algorithm is run many times, with the hope that some random starting positions will be close to the optimum solution simply by chance.
33 The Gibbs sampling algorithm
34 Gibbs Sampling Gibbs sampling is an iterative procedure, similar to the greedy algorithm, but it discards one k-mer after each iteration and replaces it with a new one. Gibbs sampling proceeds more slowly than the greedy algorithm; however, it chooses the new k-mer at each iteration at random (using the score distribution), which increases the chance of finding the correct solution.
35 Gibbs Sampling
36 Gibbs Sampling Can we make an educated guess?
37-42 Gibbs Sampling (figure sequence: choose one sequence, build a weight matrix M_temp from the other sequences, score and re-sample the chosen sequence's site, and iterate until there is no change in the sites or in M_temp)
43 Gibbs Sampling Given N sequences of length L and desired motif width W:
Step 1) Choose a starting position in each sequence at random: a_1 in seq 1, a_2 in seq 2, ..., a_N in seq N.
Step 2) Choose a sequence at random from the set (say, seq 1).
Step 3) Make a weight matrix model of width W from the sites in all sequences except the one chosen in step 2.
Step 4) Assign a probability to each position in seq 1 using the weight matrix model constructed in step 3: p = { p_1, p_2, p_3, ..., p_{L-W+1} }.
Step 5) Sample a starting position in seq 1 based on this probability distribution and set a_1 to this new position.
Step 6) Update the weight matrix model.
Step 7) Repeat from step 2 until convergence.
Lawrence et al., Science 1993
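The seven steps above can be sketched in Python. As a simplification, a profile with pseudocounts stands in for the weight matrix model, and the function name and iteration count are illustrative:

```python
import random

def gibbs_motif_sampler(seqs, W, iters=200, rng=random.Random(1)):
    # Step 1: choose a random starting position in each sequence
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(iters):
        # Step 2: choose one sequence at random and hold it out
        h = rng.randrange(len(seqs))
        held_out = seqs[h]
        sites = [s[a:a + W]
                 for i, (s, a) in enumerate(zip(seqs, starts)) if i != h]
        # Step 3: profile with pseudocounts from the remaining sites
        P = {c: [(sum(m[k] == c for m in sites) + 1) / (len(sites) + 4)
                 for k in range(W)] for c in "ACGT"}
        # Step 4: probability of every start position in the held-out sequence
        probs = []
        for j in range(len(held_out) - W + 1):
            p = 1.0
            for k in range(W):
                p *= P[held_out[j + k]][k]
            probs.append(p)
        # Step 5: sample the new start from this distribution (steps 6-7: repeat)
        starts[h] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return starts
```

`random.choices` does the weighted sampling of step 5; the pseudocounts keep every start position reachable even when a letter is absent from a profile column.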
44 Gibbs Sampling: an Example Input: t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
45 Gibbs Sampling: an Example 1) Randomly choose starting positions s = (s_1, s_2, s_3, s_4, s_5) in the 5 sequences:
s_1=7 GTAAACAATATTTATAGC
s_2=11 AAAATTTACCTTAGAAGG
s_3=9 CCGTACTGTCAAGCGTGG
s_4=4 TGAGTAAACGACGTCCCA
s_5=1 TACTTAACACCCTGTCAA
46 Gibbs Sampling: an Example 2) Choose one of the sequences at random: sequence 2, AAAATTTACCTTAGAAGG.
s_1=7 GTAAACAATATTTATAGC
s_2=11 AAAATTTACCTTAGAAGG
s_3=9 CCGTACTGTCAAGCGTGG
s_4=4 TGAGTAAACGACGTCCCA
s_5=1 TACTTAACACCCTGTCAA
47 Gibbs Sampling: an Example 3) Build a motif description model with the remaining sequences:
s_1=7 GTAAACAATATTTATAGC
s_3=9 CCGTACTGTCAAGCGTGG
s_4=4 TGAGTAAACGACGTCCCA
s_5=1 TACTTAACACCCTGTCAA
48 Gibbs Sampling: an Example Create a model (e.g. profile P) from the k-mers in the remaining 4 sequences:
1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 2/4 0
Consensus string: T A A A T C G A
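The consensus string is just the most frequent letter per column. A one-line sketch using the example profile (values as reconstructed from the four aligned k-mers):

```python
P = {"A": [1/4, 2/4, 2/4, 3/4, 1/4, 1/4, 1/4, 2/4],
     "C": [0,   1/4, 1/4, 0,   0,   2/4, 0,   1/4],
     "T": [2/4, 1/4, 1/4, 1/4, 2/4, 1/4, 1/4, 1/4],
     "G": [1/4, 0,   0,   0,   1/4, 0,   2/4, 0]}

# Per column, pick the character with the highest frequency
consensus = "".join(max("ACGT", key=lambda c: P[c][k]) for k in range(8))
print(consensus)  # TAAATCGA
```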
49 Gibbs Sampling: an Example 4) Calculate prob(a|P) for every possible 8-mer in the removed sequence AAAATTTACCTTAGAAGG (each candidate 8-mer was highlighted in red in the original slide):
AAAATTTA (position 1) 0.000732
AAATTTAC (position 2) 0.000122
AATTTACC (position 3) 0
ATTTACCT (position 4) 0
TTTACCTT (position 5) 0
TTACCTTA (position 6) 0
TACCTTAG (position 7) 0
ACCTTAGA (position 8) 0.000183
CCTTAGAA (position 9) 0
CTTAGAAG (position 10) 0
TTAGAAGG (position 11) 0
50 Gibbs Sampling: an Example 5) Create a distribution from the probabilities prob(a|P) and randomly select a new starting position based on this distribution. Define the probabilities of the starting positions according to the ratios:
Probability (Selecting Starting Position 1): P1/(P1 + P2 + P8) = 0.706
Probability (Selecting Starting Position 2): P2/(P1 + P2 + P8) = 0.118
Probability (Selecting Starting Position 8): P8/(P1 + P2 + P8) = 0.176
51 Gibbs Sampling: an Example Select the start position (in seq 2) according to the computed ratios:
P(selecting starting position 1) = .706
P(selecting starting position 2) = .118
P(selecting starting position 8) = .176
52 Gibbs Sampling: an Example Suppose we select the k-mer with the highest probability (starting position 1); we are then left with the following new k-mers and starting positions:
s_1=7 GTAAACAATATTTATAGC
s_2=1 AAAATTTACCTCGCAAGG
s_3=9 CCGTACTGTCAAGCGTGG
s_4=4 TGAGTAAACGACGTCCCA
s_5=1 TACTTAACACCCTGTCAA
53 Gibbs Sampling: an Example 6) Update the matrix: select one sequence at random, build the matrix with the rest, and repeat the calculation. 7) Iterate until we cannot improve the score any more. Stop condition: the overlap score does not change, the relative entropy of the model (matrix) stabilizes, etc.
54 Gibbs Sampling: summary Gibbs sampling needs to be modified when applied to samples with unequal distributions of residues Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs. Not guaranteed to converge to same motif every time Needs to be run with many randomly chosen seeds to achieve good results.
55 Gibbs Sampling: summary Randomized algorithms make random rather than deterministic decisions. The main advantage is that no input can reliably produce worst-case results, because the algorithm runs differently each time. These algorithms are commonly used in situations where no exact and fast algorithm is known. Gibbs sampling works for protein, DNA and RNA motifs.
56 The Expectation Maximization (EM) algorithm
57 EM Algorithm What we must specify:
W: motif length
M: motif description (probability matrix)
B: background (probability vector)
Z: where the motif appears in the sequences
58 Motif Representation A motif is assumed to have a fixed width, W. A motif is represented by a matrix of probabilities: M_ck is the probability of character c in column k. This matrix M estimates the probabilities of the nucleotides at positions likely to contain the motif. Example: a DNA motif is a 4 x W matrix M with one row per nucleotide (A, C, G, T).
59 Background Representation We also represent the background (i.e. the sequence outside the motif) probability of each character: B_c is the probability of character c in the background. This vector B estimates the probabilities of the nucleotides at positions outside the likely motif. Example:
B = (A 0.26, C 0.24, G 0.23, T 0.27)
60 Motif Position Representation Z_a is a location indicator for each sequence S_a: Z_aj is the probability that the motif starts at position j in sequence S_a, i.e. Z_aj = P(w_aj = 1 | M, B, S_a). The elements Z_aj of the matrix Z are updated to represent the probability that the motif starts at position j in sequence a. Example: given 4 DNA sequences of length 6 with W = 3, Z is a 4 x 4 matrix (L - W + 1 = 4 start positions per sequence), normalized per sequence.
61 EM Algorithm
given: length parameter W, training set of sequences
set initial values for M and B
do
  estimate Z from M and B   (E-step)
  estimate M and B from Z   (M-step)
until change in p < threshold
return: B, M, Z
(M and B are iteratively updated)
62 EM Algorithm Simplest model: We will assume that each sequence S a contains a single instance of the motif and that the starting position of the motif is chosen uniformly at random from among the positions of S a The positions of motifs in different sequences are independent
63 EM Algorithm We need to calculate the probability P of a training sequence S_a of length L, given a hypothesized starting position j:

P(S_a | w_aj = 1, M, B) = ∏_{i=1}^{j-1} B_{c_i} · ∏_{i=j}^{j+W-1} M_{c_i, i-j+1} · ∏_{i=j+W}^{L} B_{c_i}
                           (before motif)    (motif)                  (after motif)

where S_a is one of the training sequences, w_aj is 1 if the motif starts at position j in sequence a, and c_i is the character at position i in the sequence S_a.
64 EM Algorithm We can also use a log-likelihood ratio score for the hypothesis that the motif in sequence S_a (of length L) starts at position j:

L(S_a | w_aj = 1, M, B) = log [ ∏_{i=j}^{j+W-1} M_{c_i, i-j+1} / ∏_{i=j}^{j+W-1} B_{c_i} ] = Σ_{i=j}^{j+W-1} log ( M_{c_i, i-j+1} / B_{c_i} )

where S_a is one of the training sequences, w_aj is 1 if the motif starts at position j in sequence a, and c_i is the character at position i in the sequence S_a.
65 Initialization (W=3) We calculate M and B from the randomly selected start positions, e.g.:
B = (A 0.25, C 0.25, G 0.25, T 0.25)
M:
A 0.1 0.5 0.2
C 0.4 0.2 0.1
G 0.3 0.1 0.6
T 0.2 0.2 0.1
Remember: S_a = G C T G T A G. The probability that the motif starts at position 3 in sequence S_a is
P(S_a | w_a3 = 1, M, B) = B_G B_C M_T,1 M_G,2 M_T,3 B_A B_G = 0.25 · 0.25 · 0.2 · 0.1 · 0.1 · 0.25 · 0.25 ≈ 7.8 x 10^-6
66 Initialization (W=3) With the same M and B and S_a = G C T G T A G, the log-likelihood score that the motif starts at position 3 in sequence S_a is
L(S_a | w_a3 = 1, M, B) = log(M_T,1 / B_T) + log(M_G,2 / B_G) + log(M_T,3 / B_T) = log(0.2/0.25) + log(0.1/0.25) + log(0.1/0.25)
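The probability P(S_a | w_aj = 1, M, B) takes only a few lines of Python. The M values below are assumed example values, chosen to be consistent with the Z computations on the following slides:

```python
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = {"A": [0.1, 0.5, 0.2], "C": [0.4, 0.2, 0.1],
     "G": [0.3, 0.1, 0.6], "T": [0.2, 0.2, 0.1]}  # assumed example values

def seq_prob(seq, j, M, B, W):
    # Background before and after the motif, motif model inside (1-based j)
    p = 1.0
    for i, c in enumerate(seq, start=1):
        p *= M[c][i - j] if j <= i < j + W else B[c]
    return p

print(seq_prob("GCTGTAG", 3, M, B, 3))  # ~7.8e-06
```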
67 Estimating Z (E-step): Z_aj = P(w_aj = 1 | M, B, S_a) is the probability that the motif starts at position j in sequence S_a.
68 Estimating Z (E-step): This follows from Bayes' theorem:

Z_aj = P(w_aj = 1 | M, B, S_a) = P(S_a, w_aj = 1 | M, B) / P(S_a | M, B)
     = P(S_a | w_aj = 1, M, B) P(w_aj = 1) / Σ_{k=1}^{L-W+1} P(S_a | w_ak = 1, M, B) P(w_ak = 1)
     = P(S_a | w_aj = 1, M, B) / Σ_{k=1}^{L-W+1} P(S_a | w_ak = 1, M, B)

We assume that a priori the motif is equally likely to start at any position of the sequence, so the priors P(w_ak = 1) cancel. These are the probabilities that we were calculating before; the Z values must be normalized over the possible start positions.
69 Estimating Z (E-step): The E-step computes the expected value of every Z_aj using the matrix M and vector B (S_a = G C T G T A G, W = 3):
Z_a1 ∝ 0.3 · 0.2 · 0.1 (GCT)
Z_a2 ∝ 0.4 · 0.2 · 0.6 (CTG)
Z_a3 ∝ 0.2 · 0.1 · 0.1 (TGT)
Z_a4 ∝ 0.3 · 0.2 · 0.2 (GTA)
Z_a5 ∝ 0.2 · 0.5 · 0.6 (TAG)
(before normalizing; with a uniform background, the common factor from the B terms is omitted)
70 Estimating Z (E-step): The E-step computes the expected value of every Z_aj using the matrix M and vector B (S_a = G C T G T A G). We can also work with (log-likelihood ratio) scores:
Z_a1 = log(0.3/0.25) + log(0.2/0.25) + log(0.1/0.25)
Z_a2 = log(0.4/0.25) + log(0.2/0.25) + log(0.6/0.25)
Z_a3 = log(0.2/0.25) + log(0.1/0.25) + log(0.1/0.25)
Z_a4 = log(0.3/0.25) + log(0.2/0.25) + log(0.2/0.25)
Z_a5 = log(0.2/0.25) + log(0.5/0.25) + log(0.6/0.25)
71 Estimating Z (E-step): We normalize, dividing each value by the total sum over start positions:

Z_aj ← Z_aj / Σ_{k=1}^{L-W+1} Z_ak,  so that Σ_{j=1}^{L-W+1} Z_aj = 1

Z_aj now represents the normalized score/probability that the motif starts at position j in sequence S_a. For 3 sequences, using probabilities:
S_1 = ACAGCA: Z_1,1 = 0.1, Z_1,2 = 0.7, Z_1,3 = 0.1, Z_1,4 = 0.1
S_2 = AGGCAG: Z_2,1 = 0.4, Z_2,2 = 0.1, Z_2,3 = 0.1, Z_2,4 = 0.4
S_3 = TCAGTC: Z_3,1 = 0.2, Z_3,2 = 0.6, Z_3,3 = 0.1, Z_3,4 = 0.1
Because the positions of motifs in different sequences are independent, the expectation of Z_a depends only on the sequence S_a.
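The E-step for the single sequence S_a = GCTGTAG can be sketched as follows, again using an assumed motif matrix M consistent with the scores above and a uniform background:

```python
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = {"A": [0.1, 0.5, 0.2], "C": [0.4, 0.2, 0.1],
     "G": [0.3, 0.1, 0.6], "T": [0.2, 0.2, 0.1]}  # assumed example values
W, seq = 3, "GCTGTAG"

def seq_prob(seq, j):
    # P(S_a | w_aj = 1, M, B) with a 1-based start position j
    p = 1.0
    for i, c in enumerate(seq, start=1):
        p *= M[c][i - j] if j <= i < j + W else B[c]
    return p

# E-step: unnormalized Z over the L - W + 1 start positions, then normalize
Z = [seq_prob(seq, j) for j in range(1, len(seq) - W + 2)]
total = sum(Z)
Z = [z / total for z in Z]
print([round(z, 3) for z in Z])  # position 5 (TAG) is the most likely start
```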
72 Estimating M and B (M-step) We want to estimate M from Z, where M_ck is the probability that character c occurs at the k-th position of the motif:

n_ck = Σ_a Σ_{j : S_a(j+k-1) = c} Z_aj

That is, we sum every Z_aj such that the character of sequence S_a at the k-th position of the motif is c; n_ck is the expected number of occurrences of character c at the k-th position of the motif. We obtain a probability by normalizing these values over the characters and adding pseudocounts:

M_ck = (n_ck + 1) / Σ_{b ∈ {A,C,G,T}} (n_bk + 1),  so that Σ_{c ∈ {A,C,G,T}} M_ck = 1
73 Estimating M and B (M-step) Example: S_1 = ACAGCA, S_2 = AGGCAG, S_3 = TCAGTC, with the 3 x 4 matrix Z from the previous slide. We estimate the probabilities in M and B from the occurrences in the sequences, weighted by Z.
74 Estimating M and B (M-step) E.g., to estimate the probability of finding an A in the first position of the motif, the relevant entries are Z_1,1, Z_1,3, Z_2,1 and Z_3,3: the start positions that place an A at position 1 of the motif.
75 Estimating M and B (M-step) With pseudocounts:

M_A,1 = (Z_1,1 + Z_1,3 + Z_2,1 + Z_3,3 + 1) / (Z_1,1 + Z_1,2 + ... + Z_3,3 + Z_3,4 + 4)

which is proportional to the sum of the probabilities of the motif start positions that put an A at position 1.
76 Estimating M and B (M-step) (S_1 = ACAGCA, S_2 = AGGCAG, S_3 = TCAGTC) E.g., to estimate the probability of finding an A in the first position of the motif, sum the probabilities of an A at position 1 of the motif over all sequences:

n_A,1 = Z_1,1 + Z_1,3 + Z_2,1 + Z_3,3

and divide by the sum over all characters, with pseudocounts:

M_A,1 = (n_A,1 + 1) / (n_A,1 + n_C,1 + n_G,1 + n_T,1 + 4)
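Using the three sequences and Z values from the example, the expected counts n_ck and the pseudocounted matrix M can be computed as:

```python
seqs = ["ACAGCA", "AGGCAG", "TCAGTC"]
Z = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.1, 0.1, 0.4], [0.2, 0.6, 0.1, 0.1]]
W, ALPHA = 3, "ACGT"

# n_ck: expected number of occurrences of character c at motif position k
n = {c: [0.0] * W for c in ALPHA}
for s, row in zip(seqs, Z):
    for j, zj in enumerate(row):          # j = 0-based start position
        for k in range(W):
            n[s[j + k]][k] += zj

# Normalize over characters, adding pseudocounts
M = {c: [(n[c][k] + 1) / (sum(n[b][k] for b in ALPHA) + 4)
         for k in range(W)] for c in ALPHA}
print(round(n["A"][0], 2), round(M["A"][0], 3))  # 0.7 0.243
```

Note that n_A,1 = 0.1 + 0.1 + 0.4 + 0.1 = 0.7 matches the hand computation Z_1,1 + Z_1,3 + Z_2,1 + Z_3,3 above.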
77 Estimating M and B (M-step) To calculate B from Z, we estimate the probability that each character occurs outside the motif. The expected number of occurrences of character c outside the motifs is

n_c,bg = n_c − Σ_{k=1}^{W} n_ck

where n_c is the total number of c's in the data set and the subtracted sum (recall n_ck = Σ_a Σ_{j : S_a(j+k-1) = c} Z_aj) is the expected number of c's in the selected candidate motifs. We obtain a probability by normalizing these values over the characters and adding pseudocounts:

B_c = (n_c,bg + 1) / Σ_{b ∈ {A,C,G,T}} (n_b,bg + 1),  so that Σ_c B_c = 1
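The background update can be sketched the same way, on the same example (the expected counts n are recomputed so the block is self-contained):

```python
seqs = ["ACAGCA", "AGGCAG", "TCAGTC"]
Z = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.1, 0.1, 0.4], [0.2, 0.6, 0.1, 0.1]]
W, ALPHA = 3, "ACGT"

# Expected motif counts n_ck, exactly as in the M-step for M
n = {c: [0.0] * W for c in ALPHA}
for s, row in zip(seqs, Z):
    for j, zj in enumerate(row):
        for k in range(W):
            n[s[j + k]][k] += zj

# Background: total counts minus expected counts inside motifs, plus pseudocounts
total = {c: sum(s.count(c) for s in seqs) for c in ALPHA}
n_bg = {c: total[c] - sum(n[c]) for c in ALPHA}
denom = sum(n_bg[b] + 1 for b in ALPHA)
Bg = {c: (n_bg[c] + 1) / denom for c in ALPHA}
print({c: round(Bg[c], 3) for c in ALPHA})
```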
78 EM Algorithm (summarized)
0. Initialization: select randomly a motif of W positions in each sequence; align these motifs to calculate the matrix M and the vector B for the first time.
1. E-step: use M and B to score each possible candidate segment (candidate score). For each sequence, normalize the candidate scores with respect to all candidates of the same sequence; store the candidate segments and their scores.
2. M-step: update the matrix M and the vector B using the scores. M: for each nucleotide, update its corresponding position in the matrix according to its occurrence in the candidates, weighted by the associated scores. B: for each nucleotide outside a candidate, update the vector B. Normalize M and B.
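Putting the E- and M-steps together gives a minimal OOPS-style EM sketch. As simplifications, the background is held fixed and uniform, a fixed number of iterations replaces a convergence test, and the function and variable names are illustrative:

```python
import random

ALPHA = "ACGT"

def em_motif(seqs, W, iters=50, rng=random.Random(0)):
    # 0) Initialization: one random W-mer per sequence -> first M; uniform B
    picks = []
    for s in seqs:
        j = rng.randrange(len(s) - W + 1)
        picks.append(s[j:j + W])
    M = {c: [(sum(p[k] == c for p in picks) + 1) / (len(picks) + 4)
             for k in range(W)] for c in ALPHA}
    B = {c: 0.25 for c in ALPHA}
    Z = []
    for _ in range(iters):
        # 1) E-step: Z_aj proportional to P(S_a | w_aj = 1, M, B)
        Z = []
        for s in seqs:
            row = []
            for j in range(len(s) - W + 1):
                p = 1.0
                for i, c in enumerate(s):
                    p *= M[c][i - j] if j <= i < j + W else B[c]
                row.append(p)
            total = sum(row)
            Z.append([z / total for z in row])
        # 2) M-step: expected counts with pseudocounts (B kept fixed here)
        n = {c: [0.0] * W for c in ALPHA}
        for s, row in zip(seqs, Z):
            for j, zj in enumerate(row):
                for k in range(W):
                    n[s[j + k]][k] += zj
        M = {c: [(n[c][k] + 1) / (sum(n[b][k] for b in ALPHA) + 4)
                 for k in range(W)] for c in ALPHA}
    return M, Z
```

The pseudocounts keep every entry of M strictly positive, so the E-step normalizer can never be zero.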
79 EM Algorithm Intuitively, the new motif model (M, B) derives its character frequencies directly from the sequence data by counting the (expected) number of occurrences of each character at each position of the motif. Because the location of the motif is uncertain, the derivation uses a weighted sum of frequencies given all possible locations, with more favorable locations getting more weight. If a strong motif is present, repeated iterations of EM will hopefully adjust the weights in favor of the true motif start sites.
80 EM Algorithm EM converges to a local maximum of the likelihood of the data given the model: ∏_a P(S_a | B, M). It usually converges in a small number of iterations, but is sensitive to the initial starting point (i.e. the initial parameter values).
81 MEME Enhancements to the Basic EM Approach MEME builds on the basic EM approach in the following ways: trying many starting points; not assuming that there is exactly one motif occurrence in every sequence; allowing multiple motifs to be learned; incorporating Dirichlet prior distributions (instead of pseudocounts).
82 Starting Points in MEME For every distinct subsequence of length W in the training set: derive an initial p matrix from this subsequence and run EM for 1 iteration. Then choose the motif model (i.e. the p matrix) with the highest likelihood, and run EM to convergence from that starting point.
83 The ZOOPS Model The approach outlined before assumes that each sequence has exactly one motif occurrence; this is the OOPS (one occurrence per sequence) model. The ZOOPS model assumes zero or one occurrences per sequence.
84 The TCM Model The TCM (two-component mixture model) assumes zero or more motif occurrences per sequence
85 References
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1999.
Problems and Solutions in Biological Sequence Analysis. Mark Borodovsky, Svetlana Ekisheva. Cambridge University Press, 2006.
An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). Neil C. Jones, Pavel A. Pevzner. MIT Press, 2004.
More informationFinding Regulatory Motifs in DNA Sequences
Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median
More informationAlgorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25,
Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25, 200707 Motif Finding This exposition is based on the following sources, which are all recommended reading:.
More informationHMM : Viterbi algorithm - a toy example
MM : Viterbi algorithm - a toy example 0.6 et's consider the following simple MM. This model is composed of 2 states, (high GC content) and (low GC content). We can for example consider that state characterizes
More informationGibbs Sampling Methods for Multiple Sequence Alignment
Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical
More informationCAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan
CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationHidden Markov Models
Andrea Passerini passerini@disi.unitn.it Statistical relational learning The aim Modeling temporal sequences Model signals which vary over time (e.g. speech) Two alternatives: deterministic models directly
More informationVL Bioinformatik für Nebenfächler SS2018 Woche 9
VL Bioinformatik für Nebenfächler SS2018 Woche 9 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of Teil 1 (Woche
More informationGene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji
Gene Regula*on, ChIP- X and DNA Mo*fs Statistics in Genomics Hongkai Ji (hji@jhsph.edu) Genetic information is stored in DNA TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTC
More informationDe novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes
De novo identification of motifs in one species Modified from Serafim Batzoglou s lecture notes Finding Regulatory Motifs... Given a collection of genes that may be regulated by the same transcription
More informationEM-algorithm for motif discovery
EM-algorithm for motif discovery Xiaohui Xie University of California, Irvine EM-algorithm for motif discovery p.1/19 Position weight matrix Position weight matrix representation of a motif with width
More informationIntroduction to Bioinformatics
CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics
More informationHidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationChapter 7: Regulatory Networks
Chapter 7: Regulatory Networks 7.2 Analyzing Regulation Prof. Yechiam Yemini (YY) Computer Science Department Columbia University The Challenge How do we discover regulatory mechanisms? Complexity: hundreds
More informationStephen Scott.
1 / 27 sscott@cse.unl.edu 2 / 27 Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative
More informationAlgorithms for Bioinformatics
These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php Algorithms for Bioinformatics
More informationProbabilistic models of biological sequence motifs
Probabilistic models of biological sequence motifs Description of Known Motifs AGB - Master in Bioinformatics UPF 2016-2017 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona,
More informationToday s Lecture: HMMs
Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models
More informationMCMC: Markov Chain Monte Carlo
I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov
More informationA Combined Motif Discovery Method
University of New Orleans ScholarWorks@UNO University of New Orleans Theses and Dissertations Dissertations and Theses 8-6-2009 A Combined Motif Discovery Method Daming Lu University of New Orleans Follow
More informationCSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery
CSE 527 Autumn 2006 Lectures 8-9 (& part of 10) Motifs: Representation & Discovery 1 DNA Binding Proteins A variety of DNA binding proteins ( transcription factors ; a significant fraction, perhaps 5-10%,
More informationDNA Binding Proteins CSE 527 Autumn 2007
DNA Binding Proteins CSE 527 Autumn 2007 A variety of DNA binding proteins ( transcription factors ; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein coding
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationOn the Monotonicity of the String Correction Factor for Words with Mismatches
On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.
More informationSearching Sear ( Sub- (Sub )Strings Ulf Leser
Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf
More informationGiri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748
CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/14/07 CAP5510 1 CpG Islands Regions in DNA sequences with increased
More informationAlignment. Peak Detection
ChIP seq ChIP Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008 ChIP Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis Alignment ELAND Bowtie
More informationCSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:
Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative sscott@cse.unl.edu 1 / 27 2
More informationBLAST: Target frequencies and information content Dannie Durand
Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences
More informationCS 6140: Machine Learning Spring 2017
CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Logis@cs Assignment
More informationHidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes
Hidden Markov Models based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes music recognition deal with variations in - actual sound -
More informationData Mining in Bioinformatics HMM
Data Mining in Bioinformatics HMM Microarray Problem: Major Objective n Major Objective: Discover a comprehensive theory of life s organization at the molecular level 2 1 Data Mining in Bioinformatics
More informationCSCI1950 Z Computa3onal Methods for Biology Lecture 24. Ben Raphael April 29, hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs
CSCI1950 Z Computa3onal Methods for Biology Lecture 24 Ben Raphael April 29, 2009 hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs Subnetworks with more occurrences than expected by chance. How to
More informationIntroduction to Hidden Markov Models for Gene Prediction ECE-S690
Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Outline Markov Models The Hidden Part How can we use this for gene prediction? Learning Models Want to recognize patterns (e.g. sequence
More informationLecture: Mixture Models for Microbiome data
Lecture: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data Outline: - - Sequencing thought experiment Mixture Models (tangent) - (esp. Negative Binomial) - Differential abundance
More informationCSCE 471/871 Lecture 3: Markov Chains and
and and 1 / 26 sscott@cse.unl.edu 2 / 26 Outline and chains models (s) Formal definition Finding most probable state path (Viterbi algorithm) Forward and backward algorithms State sequence known State
More informationMultiple Sequence Alignment using Profile HMM
Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,
More informationSI Materials and Methods
SI Materials and Methods Gibbs Sampling with Informative Priors. Full description of the PhyloGibbs algorithm, including comprehensive tests on synthetic and yeast data sets, can be found in Siddharthan
More informationHMMs and biological sequence analysis
HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the
More information6.096 Algorithms for Computational Biology. Prof. Manolis Kellis
6.096 Algorithms for Computational Biology Prof. Manolis Kellis Today s Goals Introduction Class introduction Challenges in Computational Biology Gene Regulation: Regulatory Motif Discovery Exhaustive
More informationMEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY
Command line Training Set First Motif Summary of Motifs Termination Explanation MEME - Motif discovery tool MEME version 3.0 (Release date: 2002/04/02 00:11:59) For further information on how to interpret
More information10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison
10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:
More informationBio nformatics. Lecture 3. Saad Mneimneh
Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per
More informationMotifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.
Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer
More informationSTRUCTURAL BIOINFORMATICS I. Fall 2015
STRUCTURAL BIOINFORMATICS I Fall 2015 Info Course Number - Classification: Biology 5411 Class Schedule: Monday 5:30-7:50 PM, SERC Room 456 (4 th floor) Instructors: Vincenzo Carnevale - SERC, Room 704C;
More informationLecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data
Lecture 3: Mixture Models for Microbiome data 1 Lecture 3: Mixture Models for Microbiome data Outline: - Mixture Models (Negative Binomial) - DESeq2 / Don t Rarefy. Ever. 2 Hypothesis Tests - reminder
More informationPage 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence
Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)
More informationHidden Markov Models. x 1 x 2 x 3 x K
Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)
More informationHMM : Viterbi algorithm - a toy example
MM : Viterbi algorithm - a toy example 0.5 0.5 0.5 A 0.2 A 0.3 0.5 0.6 C 0.3 C 0.2 G 0.3 0.4 G 0.2 T 0.2 T 0.3 et's consider the following simple MM. This model is composed of 2 states, (high GC content)
More informationHidden Markov Models in computational biology. Ron Elber Computer Science Cornell
Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1 Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2 An accessible
More informationComputational Biology: Basics & Interesting Problems
Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationLearning in Bayesian Networks
Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks
More informationInferring Protein-Signaling Networks
Inferring Protein-Signaling Networks Lectures 14 Nov 14, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More information13 Comparative RNA analysis
13 Comparative RNA analysis Sources for this lecture: R. Durbin, S. Eddy, A. Krogh und G. Mitchison, Biological sequence analysis, Cambridge, 1998 D.W. Mount. Bioinformatics: Sequences and Genome analysis,
More informationComputational Genomics. Systems biology. Putting it together: Data integration using graphical models
02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput
More informationO 3 O 4 O 5. q 3. q 4. Transition
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
More informationPosition-specific scoring matrices (PSSM)
Regulatory Sequence nalysis Position-specific scoring matrices (PSSM) Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d ix-marseille, France Technological dvances for Genomics and Clinics
More informationProtein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.
Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein
More informationModeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions)
Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions) Computational Genomics Course Cold Spring Harbor Labs Oct 31, 2016 Gary D. Stormo Department of Genetics
More informationHidden Markov Models
Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm
More informationThe value of prior knowledge in discovering motifs with MEME
To appear in: Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology July, 1995 AAAI Press The value of prior knowledge in discovering motifs with MEME Timothy L.
More informationWhat is the expectation maximization algorithm?
primer 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology What is the expectation maximization algorithm? Chuong B Do & Serafim Batzoglou The expectation maximization algorithm arises
More informationLearning Sequence Motif Models Using Gibbs Sampling
Learning Sequence Motif Models Using Gibbs Samling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Sring 2018 Anthony Gitter gitter@biostat.wisc.edu These slides excluding third-arty material are licensed under
More informationPredicting Protein Functions and Domain Interactions from Protein Interactions
Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput
More informationDNA Feature Sensors. B. Majoros
DNA Feature Sensors B. Majoros What is Feature Sensing? A feature is any DNA subsequence of biological significance. For practical reasons, we recognize two broad classes of features: signals short, fixed-length
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationStephen Scott.
1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for
More informationStatistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department
More information6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II
6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution Lecture 05 Hidden Markov Models Part II 1 2 Module 1: Aligning and modeling genomes Module 1: Computational foundations Dynamic programming:
More informationHidden Markov Models. Three classic HMM problems
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems
More informationAssignments for lecture Bioinformatics III WS 03/04. Assignment 5, return until Dec 16, 2003, 11 am. Your name: Matrikelnummer: Fachrichtung:
Assignments for lecture Bioinformatics III WS 03/04 Assignment 5, return until Dec 16, 2003, 11 am Your name: Matrikelnummer: Fachrichtung: Please direct questions to: Jörg Niggemann, tel. 302-64167, email:
More informationInferring Transcriptional Regulatory Networks from Gene Expression Data II
Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday
More informationCS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS
CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * The contents are adapted from Dr. Jean Gao at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Primer on Probability Random
More informationCISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)
CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models
More informationMarkov Models & DNA Sequence Evolution
7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under
More informationLecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More information