Probabilistic models of biological sequence motifs


1 Probabilistic models of biological sequence motifs Discovery of new motifs Master in Bioinformatics UPF Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain

2 Sequence Motifs What is a sequence motif? A sequence (or set of sequences) of biological significance, usually represented together as a pattern/matrix/etc. Examples: protein binding sites in DNA; protein sequences corresponding to common functions; conserved pieces of structure.

3 Sequence Motifs Pattern Matching: search for known motifs in a sequence (e.g. using a PWM). Pattern Discovery: search for new motifs that are common to many sequences (Expectation Maximization algorithm).

4 Biological Relevance of Motif Finding: finding a pattern or similar motif in a set of sequences that play a similar biological role can serve as evidence that this motif has a specific function. E.g. promoter sequences of co-expressed genes in yeast.

5 Motifs and Weight Matrices Given a set of aligned sequences, we already know that we can construct a weight matrix to characterize the motif (see lecture on PWMs)

6 Motifs and Weight Matrices how can we construct a weight matrix if the sequences are not aligned? Moreover, in general we do not know what the motif looks like

7 But, what is the problem? Let s consider the reverse situation Consider 10 DNA sequences atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

8 We implant the motif AAAAAAAGGGGGGG atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga"

9 Where is the implanted motif? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga"

10 We implant the motif AAAAAAAGGGGGGG with four mutations atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga"

11 Where is the implanted motif now? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

12 Why is motif finding difficult? Motifs can mutate on non important bases, but these are a priori unknown atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga AgAAgAAAGGttGGG caataaaacggcggg

13 The Motif Finding Problem Motif Finding Problem: Given a list of t sequences each of length n, find the best pattern of length l that appears in each of the t sequences.

14 The Motif Finding Problem We use HEURISTICS: rules derived from our own experience that provide quick approximate solutions to complex problems.

15 The motif finding (pattern discovery) problem INPUT: N biologically related DNA sequences (e.g. promoters of co-expressed genes). OUTPUT: 1. The motif: a representation (e.g. weight matrix) of a motif common to all (or some of) the sequences; this can be any type of model. 2. The positions: the best occurrence of the motif in each sequence.

16 Motif finding (Iterative) Algorithms
while (convergence condition is FALSE) {
  # calculate model
  # modify model
  # evaluate new model
  # converged? if not, go to next movement
}
If after many iterations the quality of the solution does not improve, we say that, given the INPUT, the program has converged to a solution (supposedly the best).

17 Before motif finding How do we obtain a set of sequences on which to run motif finding? I.e. how do we get genes that we believe are regulated by the same transcription factor?
E.g. Expression data: microarrays, RNA-Seq, etc. These measure the activity of thousands of genes in parallel under one or more conditions. Collect a set of genes with similar expression (activity) profiles and do motif finding on these.
Protein binding data: ChIP-on-chip, ChIP-Seq, RNA IP (RIP), CLIP, etc. These measure whether a particular factor binds to each of the sequences and provide hundreds or thousands of DNA/RNA target sequences. Collect the set to which the protein binds, and do motif finding on these.

18 Greedy Algorithm for Motif Search

19 Greedy Algorithm find a way to greedily change possible motif locations until we have converged to a highly likely hidden motif.

20 Greedy Algorithm Consider t input sequences. Let s = (s1, ..., st) be the set of starting positions for k-mers in our t sequences (select arbitrary positions in the sequences). This defines t substrings corresponding to these starting positions, which will form: a t x k alignment matrix and a 4 x k profile matrix P. The profile matrix P will be defined in terms of the frequency of letters, and not as the count of letters.

21 Greedy Algorithm Profile matrix from random 6-mers in the t input sequences (positions 1-6 in the motif):
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8

22 Greedy Algorithm Define the most probable k-mer from each sequence using the profile P that we built. E.g. given a sequence = ctataaaccttacatc, find the most probable k-mer according to P:
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8

23 Greedy Algorithm Compute prob(a|P) for every possible 6-mer (the candidate 6-mer is highlighted in red on the slide):
ctataaaccttacat: 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 = 0
ctataaaccttacat: 1/2 x 7/8 x 0 x 0 x 1/8 x 0 = 0
ctataaaccttacat: 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 = 0
ctataaaccttacat: 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 = 0
ctataaaccttacat: 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .0336
ctataaaccttacat: 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 = .0299
ctataaaccttacat: 1/2 x 0 x 1/2 x 0 x 1/4 x 0 = 0
ctataaaccttacat: 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 = 0
ctataaaccttacat: 1/8 x 1/8 x 0 x 0 x 3/8 x 0 = 0
ctataaaccttacat: 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 = .0004

24 Greedy Algorithm The most probable 6-mer in the sequence is aaacct, with prob(a|P) = .0336, the highest value in the table above.
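As an illustration, here is a minimal Python sketch of this calculation (not from the lecture); it scores every 6-mer of the example sequence against the profile P above and should recover aaacct with probability about 0.0336:

```python
# Minimal sketch: find the profile-most-probable k-mer in one sequence.
# The profile values are the ones shown on the slides above.

profile = {  # profile[c][k] = probability of character c at motif position k
    'a': [1/2, 7/8, 3/8, 0, 1/8, 0],
    'c': [1/8, 0, 1/2, 5/8, 3/8, 0],
    't': [1/8, 1/8, 0, 0, 1/4, 7/8],
    'g': [1/4, 0, 1/8, 3/8, 1/4, 1/8],
}

def kmer_prob(kmer, profile):
    """prob(a|P): product of the column probabilities of each character."""
    p = 1.0
    for k, c in enumerate(kmer):
        p *= profile[c][k]
    return p

def most_probable_kmer(seq, k, profile):
    """Return (start, kmer, prob) of the P-most probable k-mer in seq."""
    best = (0, seq[0:k], kmer_prob(seq[0:k], profile))
    for j in range(1, len(seq) - k + 1):
        kmer = seq[j:j + k]
        p = kmer_prob(kmer, profile)
        if p > best[2]:
            best = (j, kmer, p)
    return best

if __name__ == "__main__":
    start, kmer, p = most_probable_kmer("ctataaaccttacat", 6, profile)
    print(start, kmer, round(p, 4))   # expected: 4 aaacct 0.0336
```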

25 Greedy Algorithm Find the P-most probable k-mer in each of the sequences.
P =
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtgaaccttacatc
tgcattcaatagctta
tgtcctgtccactcac
ctccaaatcctttaca
ggtctacctttatcct

26 Greedy Algorithm (The P-most probable 6-mer is highlighted in each of the eight sequences on the slide; the selected 6-mers are listed on the next slide.)

27 Greedy Algorithm The t most probable k-mers (extracted from the 8 sequences above) define a new profile:
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8


29 Greedy Algorithm New profile:
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
Previous matrix:
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
Red: frequency increased; blue: frequency decreased.

30 Greedy Algorithm The Steps
1) Select random starting positions.
2) Create a profile P from the substrings at these starting positions.
3) Find the most probable k-mer a in each sequence and change the starting position to the starting position of a.
4) Compute a new profile based on the new starting positions after each iteration, and proceed until we cannot increase the score anymore.

31 Greedy Algorithm The pseudocode:
1. GreedyProfileMotifSearch(DNA, t, n, l)
2.   Randomly select starting positions s = (s1, ..., st) from DNA
3.   bestScore ← 0
4.   while Score(s, DNA) > bestScore
5.     Form profile P from s
6.     bestScore ← Score(s, DNA)
7.     for i ← 1 to t
8.       Find the most probable k-mer a from the i-th sequence
9.       s_i ← starting position of a
10. return bestScore
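A runnable Python sketch of this pseudocode follows (an illustration, not the original code); since the slides do not define Score(s, DNA), the sketch assumes the consensus score, i.e. the sum over motif columns of the count of the most frequent nucleotide:

```python
import random

def profile_from_positions(dna, s, k):
    """Frequency profile (4 x k) of the k-mers starting at positions s."""
    counts = {c: [0] * k for c in "acgt"}
    for seq, start in zip(dna, s):
        for j, c in enumerate(seq[start:start + k]):
            counts[c][j] += 1
    t = len(dna)
    return {c: [v / t for v in counts[c]] for c in "acgt"}

def consensus_score(dna, s, k):
    """Assumed Score(s, DNA): sum over columns of the max nucleotide count."""
    counts = {c: [0] * k for c in "acgt"}
    for seq, start in zip(dna, s):
        for j, c in enumerate(seq[start:start + k]):
            counts[c][j] += 1
    return sum(max(counts[c][j] for c in "acgt") for j in range(k))

def most_probable_position(seq, k, profile):
    """Start of the P-most probable k-mer in seq."""
    def prob(j):
        p = 1.0
        for i, c in enumerate(seq[j:j + k]):
            p *= profile[c][i]
        return p
    return max(range(len(seq) - k + 1), key=prob)

def greedy_profile_motif_search(dna, k):
    s = [random.randrange(len(seq) - k + 1) for seq in dna]
    best_score = 0
    while consensus_score(dna, s, k) > best_score:
        profile = profile_from_positions(dna, s, k)
        best_score = consensus_score(dna, s, k)
        s = [most_probable_position(seq, k, profile) for seq in dna]
    return best_score, s

if __name__ == "__main__":
    dna = ["ctataaacgttacatc", "atagcgattcgactg", "cagcccagaaccct",
           "cggtgaaccttacatc", "tgcattcaatagctta", "tgtcctgtccactcac",
           "ctccaaatcctttaca", "ggtctacctttatcct"]
    print(greedy_profile_motif_search(dna, 6))
```

As the next slide notes, a single run depends strongly on the random starting positions, so in practice the search is repeated from many random starts and the best-scoring result is kept.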

32 Greedy Algorithm We have described it here with a probability matrix, but a more correct description would be a weight matrix model (log-likelihood ratios). Note that the model could be something else, e.g. a Markov model of any order. Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a long time to find the optimal motif. In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.

33 The Gibbs sampling algorithm

34 Gibbs Sampling Gibbs Sampling is an iterative procedure, similar to the Greedy algorithm, but it discards one k-mer after each iteration and replaces it with a new one. Gibbs Sampling proceeds more slowly than the Greedy algorithm. However, it chooses one new k-mer at each iteration at random (using the score distribution), which increases the chance of finding the correct solution.

35-42 Gibbs Sampling (figure slides) Can we make an educated guess? Remove one sequence, build a weight matrix M_temp from the sites in the other sequences, and use it to choose a new site in the removed sequence; iterate until there is no change in the sites or in M_temp.

43 Gibbs Sampling Given N sequences of length L and desired motif width W:
Step 1) Choose a starting position in each sequence at random: a_1 in seq 1, a_2 in seq 2, ..., a_N in seq N.
Step 2) Choose a sequence at random from the set (say, seq 1).
Step 3) Make a weight matrix model of width W from the sites in all sequences except the one chosen in step 2.
Step 4) Assign a probability to each position in seq 1 using the weight matrix model constructed in step 3: p = { p_1, p_2, p_3, ..., p_{L-W+1} }.
Step 5) Sample a starting position in seq 1 based on this probability distribution and set a_1 to this new position.
Step 6) Update the weight matrix model.
Step 7) Repeat from step 2) until convergence.
Lawrence et al., Science 1993
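A minimal Python sketch of these steps follows (an illustration of the procedure above, not code from the lecture; the pseudocount of 1 in the weight matrix and the fixed number of iterations used as the stop condition are assumptions):

```python
import random

def build_profile(kmers, w, pseudocount=1.0):
    """Weight-matrix-like profile from the sites of all but one sequence."""
    counts = {c: [pseudocount] * w for c in "ACGT"}
    for kmer in kmers:
        for j, c in enumerate(kmer):
            counts[c][j] += 1
    total = len(kmers) + 4 * pseudocount
    return {c: [v / total for v in counts[c]] for c in "ACGT"}

def window_prob(seq, j, w, profile):
    p = 1.0
    for i in range(w):
        p *= profile[seq[j + i]][i]
    return p

def gibbs_motif_search(seqs, w, iterations=1000, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting position a_i in each sequence
    a = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        # Step 2: pick one sequence at random
        i = rng.randrange(len(seqs))
        # Step 3: weight matrix from the sites of all other sequences
        others = [seqs[x][a[x]:a[x] + w] for x in range(len(seqs)) if x != i]
        profile = build_profile(others, w)
        # Step 4: probability of each start position in the chosen sequence
        probs = [window_prob(seqs[i], j, w, profile)
                 for j in range(len(seqs[i]) - w + 1)]
        # Step 5: sample a new start position from this distribution
        a[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        # Steps 6-7: the profile is rebuilt from the updated sites next round
    return a, [seqs[i][a[i]:a[i] + w] for i in range(len(seqs))]

if __name__ == "__main__":
    seqs = ["GTAAACAATATTTATAGC", "AAAATTTACCTCGCAAGG", "CCGTACTGTCAAGCGTGG",
            "TGAGTAAACGACGTCCCA", "TACTTAACACCCTGTCAA"]
    print(gibbs_motif_search(seqs, w=8))
```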

44 Gibbs Sampling: an Example Input: t = 5 sequences, motif length l = 8:
1. GTAAACAATATTTATAGC
2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA

45 Gibbs Sampling: an Example 1) Randomly choose starting positions s = (s1, s2, s3, s4, s5) in the 5 sequences:
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA

46 Gibbs Sampling: an Example 2) Choose one of the sequences at random: Sequence 2: AAAATTTACCTTAGAAGG s 1 =7 GTAAACAATATTTATAGC s 2 =11 AAAATTTACCTTAGAAGG s 3 =9 CCGTACTGTCAAGCGTGG s 4 =4 TGAGTAAACGACGTCCCA s 5 =1 TACTTAACACCCTGTCAA

47 Gibbs Sampling: an Example 3) Build a motif description model with the remaining ones: s 1 =7 GTAAACAATATTTATAGC s 3 =9 CCGTACTGTCAAGCGTGG s 4 =4 TGAGTAAACGACGTCCCA s 5 =1 TACTTAACACCCTGTCAA

48 Gibbs Sampling: an Example Create a model (e.g. profile P) from the k-mers in the remaining 4 sequences:
1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 2/4 0
Consensus string: T A A A T C G A
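The counts and the consensus string above can be reproduced with a few lines of Python (an illustrative sketch):

```python
from collections import Counter

# k-mers from the remaining sequences (slide above)
kmers = ["AATATTTA", "TCAAGCGT", "GTAAACGA", "TACTTAAC"]

profile = []      # one Counter per motif column, counts out of 4
consensus = ""
for column in zip(*kmers):
    counts = Counter(column)
    profile.append(counts)
    consensus += counts.most_common(1)[0][0]

for c in "ACTG":
    print(c, [f"{profile[j][c]}/4" for j in range(len(kmers[0]))])
print("consensus:", consensus)   # expected: TAAATCGA
```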

49 Gibbs Sampling: an Example 4) Calculate prob(a|P) for every possible 8-mer in the removed sequence AAAATTTACCTTAGAAGG. Only the 8-mers starting at positions 1, 2 and 8 have a non-zero probability (call them P1, P2 and P8); every other starting position gives probability 0.

50 Gibbs Sampling: an Example 5) Create a distribution of probabilities of k-mers, prob(a|P), and randomly select a new starting position based on this distribution. Define the probabilities of the starting positions according to the ratios:
Probability (selecting starting position 1): P1/(P1 + P2 + P8) = 0.706
Probability (selecting starting position 2): P2/(P1 + P2 + P8) = 0.118
Probability (selecting starting position 8): P8/(P1 + P2 + P8) = 0.176

51 Gibbs Sampling: an Example Select the start position (in seq 2) according to computed ratios: P(selecting starting position 1):.706 P(selecting starting position 2):.118 P(selecting starting position 8):.176
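With Python, the new start position can be sampled directly from these ratios, for example with random.choices (a small illustrative snippet using the normalized values from the slide):

```python
import random

# Normalized selection probabilities from the slide (positions 1, 2 and 8)
positions = [1, 2, 8]
weights = [0.706, 0.118, 0.176]

new_start = random.choices(positions, weights=weights, k=1)[0]
print("sampled start position in seq 2:", new_start)
```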

52 Gibbs Sampling: an Example If we select the k-mer with the highest probability, we are left with the following new k-mers and starting positions:
s1=7 GTAAACAATATTTATAGC
s2=1 AAAATTTACCTCGCAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA

53 Gibbs Sampling: an Example 6) Update the matrix: select one sequence at random, build the matrix with the rest, and repeat the calculation. 7) Iterate until we cannot improve the score any more. Stop condition: the overlap score does not change, the relative entropy of the model (matrix), etc.

54 Gibbs Sampling: summary Gibbs sampling needs to be modified when applied to samples with unequal distributions of residues Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs. Not guaranteed to converge to same motif every time Needs to be run with many randomly chosen seeds to achieve good results.

55 Gibbs Sampling: summary Randomized algorithms make random rather than deterministic decisions. The main advantage is that no input can reliably produce worst-case results, because the algorithm runs differently each time. These algorithms are commonly used in situations where no exact and fast algorithm is known. Gibbs sampling works for protein, DNA and RNA motifs.

56 The Expectation Maximization (EM) algorithm

57 EM Algorithm What we must specify: W: Motif length M: Motif description (probability matrix) B: Background (probability vector) Z: where the motif appears in the sequences

58 Motif Representation A motif is assumed to have a fixed width, W. A motif is represented by a matrix of probabilities: M_ck represents the probability of character c in column k. This matrix M estimates the probabilities of the nucleotides in positions likely to contain the motif. Example: a DNA motif of width W is given as a 4 x W matrix M with one row per nucleotide (A, C, G, T).

59 Background Representation We also represent the background (i.e. the sequence outside the motif) probability of each character: B_c represents the probability of character c in the background. This vector B estimates the probabilities of the nucleotides in positions outside the likely motif. Example: B = (A: 0.26, C: 0.24, G: 0.23, T: 0.27).

60 Motif Position Representation Z_a is a location indicator for each sequence S_a: Z_aj is the probability that the motif starts at position j in sequence S_a, i.e. Z_aj = P(w_aj = 1 | M, B, S_a). The elements Z_aj of the matrix Z are updated to represent the probability that the motif starts at position j in sequence a. Example: given 4 DNA sequences of length 6 and motif width W, Z is a 4 x (6 - W + 1) matrix with one row per sequence, normalized per sequence.

61 EM Algorithm
given: length parameter W, training set of sequences
set initial values for M and B
do (M and B are iteratively updated):
  estimate Z from M and B   (E-step)
  estimate M and B from Z   (M-step)
until change in p < threshold
return: B, M, Z

62 EM Algorithm Simplest model: we will assume that each sequence S_a contains a single instance of the motif, that the starting position of the motif is chosen uniformly at random from among the positions of S_a, and that the positions of motifs in different sequences are independent.

63 EM Algorithm We need to calculate the probability P of a training sequence S_a of length L, given a hypothesized starting position j:

P(S_a | w_{aj}=1, M, B) = \prod_{i=1}^{j-1} B_{c_i} \prod_{i=j}^{j+W-1} M_{c_i, i-j+1} \prod_{i=j+W}^{L} B_{c_i}

(the three products run over the positions before the motif, within the motif, and after the motif). S_a is one of the training sequences; w_aj is 1 if the motif starts at position j in sequence a; c_i is the character at position i in the sequence S_a.

64 EM Algorithm We can also use a score, the log-likelihood ratio that the motif in sequence S_a (of length L) starts at position j:

L(S_a | w_{aj}=1, M, B) = \log \frac{\prod_{i=j}^{j+W-1} M_{c_i, i-j+1}}{\prod_{i=j}^{j+W-1} B_{c_i}} = \sum_{i=j}^{j+W-1} \log \frac{M_{c_i, i-j+1}}{B_{c_i}}

S_a is one of the training sequences; w_aj is 1 if the motif starts at position j in sequence a; c_i is the character at position i in the sequence S_a.
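Both quantities translate directly into code. A minimal Python sketch follows (illustrative only; the M and B values are assumed here and match the initialization example on the following slides):

```python
import math

# Background vector B and motif matrix M (width W = 3), assumed for illustration
B = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
M = {'A': [0.1, 0.5, 0.2],
     'C': [0.4, 0.2, 0.1],
     'G': [0.3, 0.1, 0.6],
     'T': [0.2, 0.2, 0.1]}
W = 3

def prob_given_start(seq, j, M, B, W):
    """P(S_a | w_aj = 1, M, B): background outside the motif, M inside.
    j is 1-based, as in the slides."""
    p = 1.0
    for i, c in enumerate(seq, start=1):
        p *= M[c][i - j] if j <= i <= j + W - 1 else B[c]
    return p

def loglik_ratio(seq, j, M, B, W):
    """L(S_a | w_aj = 1, M, B): sum of log(M/B) over the motif window."""
    return sum(math.log(M[seq[i - 1]][i - j] / B[seq[i - 1]])
               for i in range(j, j + W))

if __name__ == "__main__":
    Sa = "GCTGTAG"
    print(prob_given_start(Sa, 3, M, B, W))   # about 7.8e-06
    print(loglik_ratio(Sa, 3, M, B, W))
```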

65 Initialization, W=3: We calculate M and B from the randomly selected start positions:
B = (A: 0.25, C: 0.25, G: 0.25, T: 0.25)
M = (A: 0.1 0.5 0.2; C: 0.4 0.2 0.1; G: 0.3 0.1 0.6; T: 0.2 0.2 0.1)
Remember: S_a = G C T G T A G. Probability that the motif starts at position 3 in sequence S_a:
P(S_a | w_a3 = 1, M, B) = B_G B_C M_{T,1} M_{G,2} M_{T,3} B_A B_G = 0.25 x 0.25 x 0.2 x 0.1 x 0.1 x 0.25 x 0.25 ≈ 7.8 x 10^-6

66 Initialization, W=3: With the same M and B, and S_a = G C T G T A G, the log-likelihood that the motif starts at position 3 in sequence S_a is:
L(S_a | w_a3 = 1, M, B) = log(M_{T,1}/B_T) + log(M_{G,2}/B_G) + log(M_{T,3}/B_T) = log(0.2/0.25) + log(0.1/0.25) + log(0.1/0.25)

67 Estimating Z (E-step): Z_aj = P(w_aj = 1 | M, B, S_a), the probability that the motif starts at position j in sequence S_a.

68 Estimating Z (E-step): This follows from Bayes' theorem:

Z_{aj} = P(w_{aj}=1 | M, B, S_a) = \frac{P(S_a, w_{aj}=1 | M, B)}{P(S_a | M, B)} = \frac{P(S_a | w_{aj}=1, M, B) P(w_{aj}=1)}{\sum_{k=1}^{L-W+1} P(S_a | w_{ak}=1, M, B) P(w_{ak}=1)} = \frac{P(S_a | w_{aj}=1, M, B)}{\sum_{k=1}^{L-W+1} P(S_a | w_{ak}=1, M, B)}

We assume that a priori it is equally likely that the motif starts at any position of the sequence, so the prior terms cancel. These are the probabilities that we were calculating before; the Z values must be normalized over the possible start positions.

69 Estimating Z (E-step): The E-step computes the expected value of every Z_aj using the matrix M and the vector B. For S_a = G C T G T A G and W = 3 (the candidate motif window shifts along the sequence):
Z_a1 = 0.3 x 0.2 x 0.1 x (0.25)^4
Z_a2 = 0.4 x 0.2 x 0.6 x (0.25)^4
Z_a3 = 0.2 x 0.1 x 0.1 x (0.25)^4
Z_a4 = 0.3 x 0.2 x 0.2 x (0.25)^4
Z_a5 = 0.2 x 0.5 x 0.6 x (0.25)^4
(before normalizing)

70 Estimating Z (E-step): The E-step computes the expected value of every Z_aj using the matrix M and the vector B, for S_a = G C T G T A G. We can work with (log-likelihood ratio) scores:
Z_a1 = log(0.3/0.25) + log(0.2/0.25) + log(0.1/0.25)
Z_a2 = log(0.4/0.25) + log(0.2/0.25) + log(0.6/0.25)
Z_a3 = log(0.2/0.25) + log(0.1/0.25) + log(0.1/0.25)
Z_a4 = log(0.3/0.25) + log(0.2/0.25) + log(0.2/0.25)
Z_a5 = log(0.2/0.25) + log(0.5/0.25) + log(0.6/0.25)

71 Estimating Z (E-step): We normalize, dividing each value by the total sum over start positions:

Z_{aj} \leftarrow \frac{Z_{aj}}{\sum_{k=1}^{L-W+1} Z_{ak}} \quad so that \quad \sum_{j=1}^{L-W+1} Z_{aj} = 1

Z_aj now represents the normalized score/probability associated with the motif starting at j in sequence S_a. For 3 sequences, using probabilities:
S_1 = A C A G C A: Z_1,1 = 0.1, Z_1,2 = 0.7, Z_1,3 = 0.1, Z_1,4 = 0.1
S_2 = A G G C A G: Z_2,1 = 0.4, Z_2,2 = 0.1, Z_2,3 = 0.1, Z_2,4 = 0.4
S_3 = T C A G T C: Z_3,1 = 0.2, Z_3,2 = 0.6, Z_3,3 = 0.1, Z_3,4 = 0.1
Because the positions of motifs in different sequences are independent, the expectation of Z_a depends only on the sequence S_a.
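A self-contained Python sketch of the E-step (illustrative; the M and B values are the initialization used in the running example above):

```python
# E-step sketch: compute normalized Z_aj for every sequence.
def e_step(seqs, M, B, W):
    Z = []
    for seq in seqs:
        L = len(seq)
        row = []
        for j in range(L - W + 1):           # 0-based start positions
            p = 1.0
            for i, c in enumerate(seq):
                p *= M[c][i - j] if j <= i < j + W else B[c]
            row.append(p)
        total = sum(row)
        Z.append([p / total for p in row])   # normalize per sequence
    return Z

if __name__ == "__main__":
    B = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    M = {'A': [0.1, 0.5, 0.2], 'C': [0.4, 0.2, 0.1],
         'G': [0.3, 0.1, 0.6], 'T': [0.2, 0.2, 0.1]}
    Z = e_step(["GCTGTAG"], M, B, 3)
    print([round(z, 3) for z in Z[0]])
```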

72 Estimating M and B (M-step) We want to estimate M from Z, where M_ck is the probability that character c occurs at the k-th position of the motif:

n_{ck} = \sum_{a} \sum_{\{j : S_a(j+k-1) = c\}} Z_{aj}

We sum every Z_aj such that the character in the sequence S_a at the k-th position of the motif is c; n_ck is the expected number of occurrences of the character c at the k-th position of the motif. We obtain a probability by normalizing (over the characters) these values and using pseudocounts:

M_{ck} = \frac{n_{ck} + 1}{\sum_{b \in \{A,C,G,T\}} (n_{bk} + 1)} \quad so that \quad \sum_{c \in \{A,C,G,T\}} M_{ck} = 1

73 Estimating M and B (M-step)
S_1 = A C A G C A
S_2 = A G G C A G
S_3 = T C A G T C
Z_aj:
S_1: 0.1 0.7 0.1 0.1
S_2: 0.4 0.1 0.1 0.4
S_3: 0.2 0.6 0.1 0.1
We estimate the probabilities in M and B from the occurrences in Z. E.g. estimate the probability of finding an A in the first position of the motif:

74 Estimating M and B (M-step) S_1 = A C A G C A, S_2 = A G G C A G, S_3 = T C A G T C. The entries Z_1,1, Z_1,3, Z_2,1 and Z_3,3 correspond to start positions where the first motif character is an A. We estimate the probabilities in M and B from the occurrences in Z. E.g. estimate the probability of finding an A in the first position of the motif:

75 Estimating M and B (M-step) S_1 = A C A G C A, S_2 = A G G C A G, S_3 = T C A G T C (relevant entries: Z_1,1, Z_1,3, Z_2,1, Z_3,3). E.g. estimate the probability of finding an A in the first position of the motif:

M_{A,1} = \frac{Z_{1,1} + Z_{1,3} + Z_{2,1} + Z_{3,3} + 1}{Z_{1,1} + Z_{1,2} + \cdots + Z_{3,3} + Z_{3,4} + 4}

This is proportional to the (sum of) probabilities of the motif start positions in the sequences at positions with an A.

76 Estimating M and B (M-step) S_1 = A C A G C A, S_2 = A G G C A G, S_3 = T C A G T C. E.g. estimate the probability of finding an A in the first position of the motif. Sum the probabilities of A at position 1 of the motif in all sequences:

n_{A,1} = Z_{1,1} + Z_{1,3} + Z_{2,1} + Z_{3,3}

Divide by the sum of all the probabilities, adding pseudocounts:

M_{A,1} = \frac{n_{A,1} + 1}{n_{A,1} + n_{C,1} + n_{G,1} + n_{T,1} + 4}

77 Estimating M and B (M-step) To calculate B from Z, we estimate the probability that each character occurs outside the motif. The expected number of occurrences of character c outside the motifs is:

n_{c,bg} = n_c - \sum_{k=1}^{W} n_{ck}

where n_c is the total number of c's in the data set and \sum_k n_{ck} is the total number of c's in the selected candidate motifs (recall that n_{ck} = \sum_{a} \sum_{\{j : S_a(j+k-1) = c\}} Z_{aj}). We obtain a probability by normalizing (over the characters) these values and using pseudocounts:

B_c = \frac{n_{c,bg} + 1}{\sum_{b \in \{A,C,G,T\}} (n_{b,bg} + 1)} \quad so that \quad \sum_{c \in \{A,C,G,T\}} B_c = 1
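A minimal Python sketch of the M-step with pseudocounts of 1, matching the expressions above (illustrative, using the example sequences and Z values from the previous slides):

```python
# M-step sketch: re-estimate M (motif) and B (background) from Z.
def m_step(seqs, Z, W):
    alphabet = "ACGT"
    # n_ck: expected number of character c at motif position k
    n = {c: [0.0] * W for c in alphabet}
    for seq, z_row in zip(seqs, Z):
        for j, z in enumerate(z_row):          # j = candidate start (0-based)
            for k in range(W):
                n[seq[j + k]][k] += z
    M = {c: [(n[c][k] + 1) / sum(n[b][k] + 1 for b in alphabet)
             for k in range(W)] for c in alphabet}
    # background: total counts minus expected counts inside the motif
    n_bg = {c: sum(seq.count(c) for seq in seqs) - sum(n[c]) for c in alphabet}
    B = {c: (n_bg[c] + 1) / sum(n_bg[b] + 1 for b in alphabet) for c in alphabet}
    return M, B

if __name__ == "__main__":
    seqs = ["ACAGCA", "AGGCAG", "TCAGTC"]
    Z = [[0.1, 0.7, 0.1, 0.1],
         [0.4, 0.1, 0.1, 0.4],
         [0.2, 0.6, 0.1, 0.1]]
    M, B = m_step(seqs, Z, W=3)
    print(round(M['A'][0], 3), B)
```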

78 EM Algorithm (summarized)
0. Initialisation: select randomly a motif of W positions in each sequence; align the motifs to calculate for the first time the matrix M and the vector B.
1. E-step: use M and B to score each possible candidate segment (candidate_score). For each sequence, normalize the candidate_scores with respect to all candidates of the same sequence. Store the candidate segments and their scores.
2. M-step: update M and B using the scores. M: for each nucleotide, update its corresponding position in the matrix according to its occurrence in the candidates, using the associated scores. B: for each nucleotide outside a candidate, update the vector B. Normalize M and B.
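The summarized algorithm can be assembled into one runnable Python sketch (OOPS model, one motif occurrence per sequence; the random initialization, pseudocounts of 1 and the parameter-change threshold used as the stop condition are assumptions of this illustration):

```python
import random

def em_motif_search(seqs, W, max_iter=100, tol=1e-6, seed=0):
    """Illustrative OOPS EM: one motif occurrence per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # 0) Initialization: a random motif position in each sequence
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]
    counts = {c: [0.0] * W for c in alphabet}
    for s, j in zip(seqs, starts):
        for k in range(W):
            counts[s[j + k]][k] += 1
    M = {c: [(v + 1) / (len(seqs) + 4) for v in counts[c]] for c in alphabet}
    B = {c: 1 / 4 for c in alphabet}

    prev = None
    Z = []
    for _ in range(max_iter):
        # 1) E-step: normalized score of every candidate start position
        Z = []
        for s in seqs:
            row = []
            for j in range(len(s) - W + 1):
                p = 1.0
                for i, c in enumerate(s):
                    p *= M[c][i - j] if j <= i < j + W else B[c]
                row.append(p)
            tot = sum(row)
            Z.append([p / tot for p in row])
        # 2) M-step: update M and B from Z (with pseudocounts)
        n = {c: [0.0] * W for c in alphabet}
        for s, row in zip(seqs, Z):
            for j, z in enumerate(row):
                for k in range(W):
                    n[s[j + k]][k] += z
        M = {c: [(n[c][k] + 1) / sum(n[b][k] + 1 for b in alphabet)
                 for k in range(W)] for c in alphabet}
        nbg = {c: sum(s.count(c) for s in seqs) - sum(n[c]) for c in alphabet}
        B = {c: (nbg[c] + 1) / sum(nbg[b] + 1 for b in alphabet)
             for c in alphabet}
        # stop when the change in the parameters is below the threshold
        flat = [x for c in alphabet for x in M[c]] + [B[c] for c in alphabet]
        if prev is not None and max(abs(a - b) for a, b in zip(flat, prev)) < tol:
            break
        prev = flat
    return M, B, Z

if __name__ == "__main__":
    seqs = ["GTAAACAATATTTATAGC", "AAAATTTACCTCGCAAGG", "CCGTACTGTCAAGCGTGG",
            "TGAGTAAACGACGTCCCA", "TACTTAACACCCTGTCAA"]
    M, B, Z = em_motif_search(seqs, W=8)
    best_starts = [max(range(len(z)), key=lambda j: z[j]) for z in Z]
    print(best_starts)
```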

79 EM Algorithm Intuitively, the new motif model (M, B) derives its character frequencies directly from the sequence data by counting the (expected) number of occurrences of each character in each position of the motif. Because the location of the motif is uncertain, the derivation uses a weighted sum of frequencies given all possible locations, with more favorable locations getting more weight. If a strong motif is present, repeated iterations of EM will hopefully adjust the weights in favor of the true motif start sites.

80 EM Algorithm EM converges to a local maximum in the likelihood of the data given the model, \prod_a P(S_a | B, M). It usually converges in a small number of iterations. It is sensitive to the initial starting point (i.e. the values in p).

81 MEME: Enhancements to the Basic EM Approach MEME builds on the basic EM approach in the following ways:
- trying many starting points
- not assuming that there is exactly one motif occurrence in every sequence
- allowing multiple motifs to be learned
- incorporating Dirichlet prior distributions (instead of pseudocounts)

82 Starting Points in MEME For every distinct subsequence of length W in the training set:
- derive an initial p matrix from this subsequence
- run EM for 1 iteration
Choose the motif model (i.e. p matrix) with the highest likelihood, and run EM to convergence.

83 The ZOOPS Model The approach outlined before assumes that each sequence has exactly one motif occurrence; this is the OOPS model. The ZOOPS model assumes zero or one occurrences per sequence.

84 The TCM Model The TCM (two-component mixture model) assumes zero or more motif occurrences per sequence

85 References
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1999.
Problems and Solutions in Biological Sequence Analysis. Mark Borodovsky, Svetlana Ekisheva. Cambridge University Press, 2006.
An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). Neil C. Jones, Pavel A. Pevzner. MIT Press, 2004.
