VL Bioinformatik für Nebenfächler SS2018 Woche 8


1 VL Bioinformatik für Nebenfächler SS2018 Woche 8 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of

2 Part 1 (weeks 1-3): Introduction to Python & algorithms. Part 2 (weeks 4-6): Finding unknown patterns (without errors) in large texts, using the oriC region as an example. Part 3 (week 7 onward): Finding unknown patterns (with errors) in large texts, using transcription factor binding sites as an example.

3 Motif finding using randomized algorithms

5 Summary of the upcoming block (rough!)
Genes have certain features at the sequence level: changes in the composition of the sequence, and specific motifs that mark, e.g., the start or the end.
Problem: there is no clue what these motifs might look like.
Possible approach: randomized algorithms.

6 Basic Gene Structures
Eukaryotic genes: exons, introns, translation starts and stops, splice (donor/acceptor) junctions, ...

7 How Do Plants Know Whether It Is Day or Night?
Plants need to change the expression of many genes (photosynthesis, flowering, frost resistance, etc.) when switching from day to night and vice versa. Who are the molecular managers telling genes to switch expression on and off in billions of cells? Just 3 genes call the shots: CCA1, LHY, and TOC1.
Night-blooming cereus only blooms at night. Sunflower follows the sun.

8 How Does CCA1 Know Where to Bind?
CCA1 binds upstream of Gene1, Gene2, Gene3, and Gene4. There must be some hidden messages in these regions that tell CCA1 WHERE to bind.

9 Transcription Factor Binding Sites
cagtataaagtctactgatgcaacctgactcatgacgaggaa Gene1
agtcgactgacttaacaaatctcggatcgattcgtccgagga Gene2
cgtcagctctgtcgggattcgccccgtattcaaaaaagctac Gene3
accgtctccacaaaacctgctcgtgccactgatgcaacctga Gene4
The hidden messages (motif AAAAAATCT): transcription factor binding sites of CCA1

10 Outline From Implanted Patterns to Regulatory Motifs Implanting Patterns into Strings Implanted Motif Problem Motif Finding Problem Median String Problem Greedy Motif Search How Rolling Dice Helps Us Find Regulatory Motifs Randomized Motif Search How do Bacteria Hibernate? Gibbs Sampling Pseudocounts

11 Generate Ten Random Sequences atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

12 Implant a Pattern AAAAAAAAGGGGGGG at Randomly Chosen Positions atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

13 Where Are the Implanted Patterns Hiding? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga The implanted patterns can be found by the Frequent Words algorithm.

14 Implant a Pattern AAAAAAAAGGGGGGG with 4 Random Mutations at Random Positions atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
A (k,d)-motif: a k-mer that appears in each sequence with at most d mismatches. AAAAAAAAGGGGGGG is a (15,4)-motif.
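The implanting step can be sketched in Python. This is only an illustration: the function names `mutate` and `implant`, the background length of 80, and the fixed random seed are assumptions, not part of the slides.

```python
import random

def mutate(pattern, d, rng):
    """Return a copy of pattern with exactly d positions changed to a different nucleotide."""
    letters = list(pattern)
    for i in rng.sample(range(len(pattern)), d):
        letters[i] = rng.choice([c for c in "ACGT" if c != letters[i]])
    return "".join(letters)

def implant(text, pattern, d, rng):
    """Overwrite a randomly chosen window of text with a mutated copy of pattern."""
    instance = mutate(pattern, d, rng)
    pos = rng.randrange(len(text) - len(pattern) + 1)
    return text[:pos] + instance + text[pos + len(pattern):], pos

rng = random.Random(0)  # fixed seed (assumption), only to make the run repeatable
pattern = "AAAAAAAAGGGGGGG"
background = "".join(rng.choice("ACGT") for _ in range(80))
implanted, pos = implant(background, pattern, 4, rng)
print(implanted[pos:pos + len(pattern)])  # the mutated instance hidden in the text
```

Each implanted instance differs from the pattern in exactly 4 positions here, which satisfies the "at most d mismatches" condition of a (15,4)-motif.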

16 Now Where are the Implanted Patterns? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga Would the Frequent Words algorithm find this hidden message? No.

18 Implanted Motifs Problem
Implanted Motif Problem. Finding (k,d)-motifs in a set of strings.
Input: A set of strings Dna, and integers k (motif length) and d (maximal number of mismatches in a motif).
Output: All (k,d)-motifs in Dna.
Idea: The DNA strings are generated randomly. The only thing that is not random inside them is the implanted motif. If we try to find the most similar k-mers in the first two strings, with at most d differences, maybe they will point us to the motifs?

20 Finding Implanted Motifs by Pairwise Comparison atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
Implanted instances in the first two strings: AgAAgAAAGGttGGG and caataaaacggcggg

21 Why Pairwise Comparison Won't Work
AgAAgAAAGGttGGG (implanted instance 1): 4 mismatches to the pattern
AAAAAAAAGGGGGGG (pattern)
cAAtAAAAcGGcGGG (implanted instance 2): 4 mismatches to the pattern
The two implanted instances can differ from each other in up to 4 + 4 = 8 positions, so they don't even look like similar strings!

22 Resorting to Motif Enumeration Instead
Should we explore ALL 4^k k-mers to solve the Implanted Motif Problem? Not necessarily: if a k-mer is far away from all k-mers in the strings that we analyze, there is no reason to explore it. We will analyze only k-mers that differ by at most d mismatches from some k-mer in Dna.
MotifEnumeration(Dna, k, d)
    for each k-mer a in Dna
        generate all possible k-mers a' differing from a by at most d mutations
        for each such k-mer a'
            if a' appears in each sequence in Dna with at most d mismatches
                output a'
Would this simple (albeit slow) algorithm work for finding real biological motifs?
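A minimal Python sketch of MotifEnumeration, following the pseudocode on this slide. The helper names (`hamming`, `neighbors`, `appears_with_mismatches`) and the four short test strings are assumptions chosen for illustration.

```python
def hamming(p, q):
    """Number of mismatching positions between two equally long strings."""
    return sum(a != b for a, b in zip(p, q))

def neighbors(pattern, d):
    """All k-mers that differ from pattern in at most d positions."""
    if d == 0 or not pattern:
        return {pattern}
    result = set()
    for suffix in neighbors(pattern[1:], d):
        if hamming(pattern[1:], suffix) < d:
            # one mismatch still available: any first letter is allowed
            result.update(c + suffix for c in "ACGT")
        else:
            result.add(pattern[0] + suffix)
    return result

def appears_with_mismatches(pattern, text, d):
    """True if some window of text is within d mismatches of pattern."""
    k = len(pattern)
    return any(hamming(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))

def motif_enumeration(dna, k, d):
    """All (k,d)-motifs: k-mers occurring in every string with at most d mismatches."""
    found = set()
    for text in dna:
        for i in range(len(text) - k + 1):
            for candidate in neighbors(text[i:i + k], d):
                if all(appears_with_mismatches(candidate, s, d) for s in dna):
                    found.add(candidate)
    return found

dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(sorted(motif_enumeration(dna, 3, 1)))
```

The candidates are generated only from neighborhoods of k-mers that actually occur in Dna, which is the pruning idea described above.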

23 Resorting to Motif Enumeration Instead
Would this simple (albeit slow) algorithm work for finding real biological motifs? MotifEnumeration assumes that EACH sequence has a k-mer similar to an (unknown) implanted pattern. This condition does not hold for noisy biological datasets: some genes that are controlled by the circadian clock do not have a particular motif implanted. We need a way to search for motifs even if some sequences do not contain any implanted motif. That is why we need to define a score for every set of motifs, even when some of them don't look very similar to the canonical motif.

24 Outline From Implanted Patterns to Regulatory Motifs Implanting Patterns into Strings Do We Have a Clock Gene? Implanted Motif Problem Motif Finding Problem Median String Problem Greedy Motif Search How Rolling Dice Helps Us Find Regulatory Motifs Randomized Motif Search How do Bacteria Hibernate? Gibbs Sampling Pseudocounts

25 Transcription Factor Binding Sites
Let us assume for now that we know the motifs.
cagtataaagtctactgatgcaacctgactcatgacgaggaa Gene1
agtcgactgacttaacaaatctcggatcgattcgtccgagga Gene2
cgtcagctctgtcgggattcgccccgtattcaaaaaagctac Gene3
accgtctccacaaaacctgctcgtgccactgatgcaacctga Gene4
Transcription factor binding sites of CCA1 (motif AAAAAATCT)

26 From Motif to Consensus String
Motifs:
T C G G G G G T T T T T
C C G G T G A C T T A C
A C G G G G A T T T T C
T T G G G G A C T T T T
A A G G G G A C T T C C
T T G G G G A C T T C C
T C G G G G A T T C A T
T C G G G G A T T C C T
T A G G G G A A C T A C
T C G G G T A T A A C C

27 From Motif to Consensus String
Visualizing the Motifs matrix using a motif logo.
Height of the letters in a logo column: 2 minus the randomness of the letters in that column. The randomness (see also: entropy, measured in bits) lies between 0 (no randomness, all letters identical) and 2 (full randomness, all four nucleotides equally frequent).
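The "randomness" of a column can be made concrete with Shannon entropy. The sketch below assumes the plain definition (base-2 logarithm, no small-sample correction, which some logo tools apply); the name `column_entropy` is an illustrative choice.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the letters in one motif column."""
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in Counter(column).values())

motifs = [
    "TCGGGGGTTTTT", "CCGGTGACTTAC", "ACGGGGATTTTC", "TTGGGGACTTTT",
    "AAGGGGACTTCC", "TTGGGGACTTCC", "TCGGGGATTCAT", "TCGGGGATTCCT",
    "TAGGGGAACTAC", "TCGGGTATAACC",
]

# zip(*motifs) walks through the matrix column by column
for col in zip(*motifs):
    h = column_entropy(col)
    print("".join(col), f"entropy={h:.3f}", f"logo height={2 - h:.3f}")
```

Fully conserved columns (such as the third one, all G) get entropy 0 and therefore the maximal logo height of 2.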

31 From Motif to Consensus String
Motifs:
T C G G G G g T T T t t
c C G G t G A c T T a C
a C G G G G A T T T t C
T t G G G G A c T T t t
a a G G G G A c T T C C
T t G G G G A c T T C C
T C G G G G A T T c a t
T C G G G G A T T c C t
T a G G G G A a c T a C
T C G G G t A T a a C C
The most popular nucleotide in each column is shown in uppercase (and bold on the slide). Other (unpopular) nucleotides in each column are shown in lowercase.
Consensus(Motifs) = T C G G G G A T T T C C (most popular symbol in each column)
Score(Motifs) = 30 (number of unpopular symbols)
Goal: minimize the score! How? Select similar k-mers, one from each string.
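Consensus and Score can be computed column by column. The following Python sketch uses the motif matrix from the slide; the function names are illustrative, and ties between equally frequent letters (which do not occur in this matrix) would be broken arbitrarily.

```python
from collections import Counter

motifs = [
    "TCGGGGGTTTTT", "CCGGTGACTTAC", "ACGGGGATTTTC", "TTGGGGACTTTT",
    "AAGGGGACTTCC", "TTGGGGACTTCC", "TCGGGGATTCAT", "TCGGGGATTCCT",
    "TAGGGGAACTAC", "TCGGGTATAACC",
]

def consensus(motifs):
    """Most popular symbol in each column, joined into one string."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))

def score(motifs):
    """Total number of unpopular symbols, counted column by column."""
    return sum(len(col) - Counter(col).most_common(1)[0][1]
               for col in zip(*motifs))

print(consensus(motifs))  # TCGGGGATTTCC
print(score(motifs))      # 30
```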

33 The Motif Finding Problem
Motif Finding Problem. Given a set of sequences, find a set of k-mers (one from each sequence) with minimal score among all choices of k-mers.
Input: A set of sequences Dna and an integer k.
Output: A set of k-mers Motifs, one from each sequence in Dna, minimizing Score(Motifs).
BruteForceMotifSearch (t: number of strings, n: length of each string):
There are $(n-k+1)^t$ ways to form a motif matrix ($n-k+1$ choices of a k-mer in each of the t sequences).
Scoring one matrix requires $k \cdot t$ steps.
Since $k \ll n$ in practice, the runtime is $O(n^t \cdot k \cdot t)$: too slow!
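A brute-force search can be sketched with `itertools.product`, which enumerates every way of picking one k-mer per string. The three short strings are a made-up toy input (they share the 4-mer ACGT); on realistic inputs this approach is infeasible for the reason given above.

```python
from collections import Counter
from itertools import product

def score(motifs):
    """Number of unpopular symbols, counted column by column."""
    return sum(len(col) - Counter(col).most_common(1)[0][1]
               for col in zip(*motifs))

def brute_force_motif_search(dna, k):
    """Try all (n-k+1)^t motif matrices and return one with minimal score."""
    windows = [[s[i:i + k] for i in range(len(s) - k + 1)] for s in dna]
    return min(product(*windows), key=score)

dna = ["TTACGTAA", "CCACGTCC", "GGGACGTG"]  # toy data, assumption
k = 4
best = brute_force_motif_search(dna, k)
print(best, score(best))
print((len(dna[0]) - k + 1) ** len(dna), "motif matrices were scored")
```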

35 Reformulating the Motif Finding Problem
Consensus(Motifs) = T C G G G G A T T T C C (most popular symbol in each column)
Score(Motifs) = 30 (number of unpopular symbols)

36 Reformulating the Motif Finding Problem
Hamming distance: number of mismatches between two k-mers.
GATTCTCA
GACGCTGA
d(GATTCTCA, GACGCTGA) = 3
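The Hamming distance is a one-liner in Python. The length check is an added safeguard (an assumption, since the slide only considers k-mers of equal length).

```python
def hamming_distance(p, q):
    """Number of positions at which two equally long strings differ."""
    if len(p) != len(q):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance("GATTCTCA", "GACGCTGA"))  # 3
```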

37 Reformulating the Motif Finding Problem
The unpopular symbols can also be counted ROW-BY-ROW: the count in row i is the Hamming distance from the consensus to that motif, d(Consensus, Motif_i).
Consensus(Motifs) = T C G G G G A T T T C C (most popular symbol in each column)
Score(Motifs) = 30 (number of unpopular symbols)
Hamming distance: number of mismatches between two k-mers, e.g. d(GATTCTCA, GACGCTGA) = 3.

38 Reformulating the Motif Finding Problem
What is the distance between a k-mer and Motifs = {Motif_1, ..., Motif_t}?
Define d(k-mer, Motifs) as the sum of Hamming distances between the k-mer and each Motif_i: $d(\text{k-mer}, Motifs) = \sum_{i=1}^{t} d(\text{k-mer}, Motif_i)$
What is d(Consensus(Motifs), Motifs)?

39 Score(Motifs) = d(Consensus(Motifs), Motifs)
Counting the unpopular symbols ROW-BY-ROW gives the Hamming distance from the consensus to each motif, d(Consensus, Motif_i). Counting them COLUMN-BY-COLUMN gives Score(Motifs).
Consensus(Motifs) = T C G G G G A T T T C C, Score(Motifs) = 30
Score(Motifs) = number of unpopular symbols counted column-by-column = number of unpopular symbols counted row-by-row = d(Consensus(Motifs), Motifs)
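The equality is just a change in the order of summation. Written out (the bracket notation $[\cdot]$, which is 1 if the condition holds and 0 otherwise, is an assumption introduced here and not used on the slides):

```latex
\[
\mathrm{Score}(Motifs)
  = \sum_{j=1}^{k} \sum_{i=1}^{t} \bigl[\, Motif_i[j] \neq Consensus[j] \,\bigr]
  = \sum_{i=1}^{t} \sum_{j=1}^{k} \bigl[\, Motif_i[j] \neq Consensus[j] \,\bigr]
  = \sum_{i=1}^{t} d(Consensus, Motif_i)
\]
% For the 10 x 12 example matrix:
% column by column: 3 + 4 + 0 + 0 + 1 + 1 + 1 + 5 + 2 + 3 + 6 + 4 = 30
% row by row:       3 + 4 + 2 + 4 + 3 + 2 + 3 + 2 + 4 + 3         = 30
```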

40 Yet Another (Equivalent) Motif Finding Problem
Original formulation: search for Motifs in Dna minimizing Score(Motifs).
Equivalent formulation: search for a k-mer minimizing d(k-mer, Motifs) among all possible k-mers and all possible Motifs in Dna.
Equivalent Motif Finding Problem. Find a k-mer Pattern and a set of k-mers Motifs in Dna minimizing the distance d(Pattern, Motifs) over all possible choices of Pattern and Motifs.
Input: A set of sequences Dna and an integer k.
Output: A k-mer Pattern and a set of k-mers Motifs in Dna minimizing d(Pattern, Motifs).

41 Have We Just Made Our Task More Difficult?
Motif Finding Problem. Find a set of k-mers Motifs, one from each sequence in Dna, minimizing Score(Motifs).
Equivalent Motif Finding Problem. Find a k-mer Pattern AND a set of k-mers Motifs, one from each sequence in Dna, minimizing d(Pattern, Motifs).
The key insight for solving the Equivalent Motif Finding Problem: given Pattern, we can quickly find Motifs minimizing d(Pattern, Motifs). There is no need to explore all Motifs as in the Motif Finding Problem! Instead, the Median String Problem explores all k-mers Pattern.
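The key insight can be sketched as follows: for a fixed Pattern, the best Motifs are found independently per string by taking the closest k-mer in each one. The function names and the five example strings are illustrative assumptions.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def closest_kmer(pattern, text):
    """The k-mer of text with the smallest Hamming distance to pattern (first one on ties)."""
    k = len(pattern)
    windows = [text[i:i + k] for i in range(len(text) - k + 1)]
    return min(windows, key=lambda w: hamming(pattern, w))

def motifs_for_pattern(pattern, dna):
    """Motifs minimizing d(Pattern, Motifs): one closest k-mer per string."""
    return [closest_kmer(pattern, text) for text in dna]

dna = ["TTACCTTAAC", "GATATCTGTC", "ACGGCGTTCG", "CCCTAAAGAG", "CGTCAGAGGT"]
motifs = motifs_for_pattern("AAA", dna)
print(motifs)
print(sum(hamming("AAA", m) for m in motifs))  # d(Pattern, Motifs)
```

This takes roughly n · k steps per string, so no enumeration over all (n-k+1)^t motif matrices is needed once Pattern is fixed.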

43-53 Distance between a k-mer and a (longer) String
d(Pattern, String): minimum Hamming distance between Pattern and all k-mers in String.
Example: d(GATTCTCA, GCAAAGACGCTGACCAA) = ?
Slide the pattern GATTCTCA along the string, one position at a time, and compute the Hamming distance to each 8-mer window. The first window (GCAAAGAC) has distance 7, the second window (CAAAGACG) has distance 6, and so on.
The minimum is reached at the window GACGCTGA, so d(GATTCTCA, GCAAAGACGCTGACCAA) = 3.
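The sliding-window computation shown on these slides can be sketched in Python. The helper name `window_distances` is an assumption; it returns the distance for every window so that the individual animation steps can be inspected.

```python
def hamming_distance(p, q):
    return sum(a != b for a, b in zip(p, q))

def window_distances(pattern, text):
    """Hamming distance between pattern and every k-mer window of text, left to right."""
    k = len(pattern)
    return [hamming_distance(pattern, text[i:i + k])
            for i in range(len(text) - k + 1)]

def distance_to_string(pattern, text):
    """d(Pattern, String): the minimum over all windows."""
    return min(window_distances(pattern, text))

distances = window_distances("GATTCTCA", "GCAAAGACGCTGACCAA")
print(distances)
print(distance_to_string("GATTCTCA", "GCAAAGACGCTGACCAA"))  # 3
```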

54 Distance between a k-mer and a Set of Strings
Distance between a k-mer and a set of strings Dna = {Dna_1, ..., Dna_t}: $d(\text{k-mer}, Dna) = \sum_{i=1}^{t} d(\text{k-mer}, Dna_i)$, summed over all strings in Dna.
Example with Pattern = AAA:
ttaccttaac  1
gatatctgtc  1
acggcgttcg  2
ccctaaagag  0
cgtcagaggt  1
d(AAA, Dna) = 1 + 1 + 2 + 0 + 1 = 5
A median string for the set of strings Dna: a k-mer minimizing the distance d(k-mer, Dna) over all possible k-mers.
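The example on this slide can be reproduced with a few lines of Python. The function names are illustrative; the strings are the five from the slide, written in uppercase.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def distance_to_string(pattern, text):
    """Minimum Hamming distance between pattern and any k-mer of text."""
    k = len(pattern)
    return min(hamming(pattern, text[i:i + k])
               for i in range(len(text) - k + 1))

def distance_to_dna(pattern, dna):
    """d(k-mer, Dna): sum of the per-string distances."""
    return sum(distance_to_string(pattern, text) for text in dna)

dna = ["TTACCTTAAC", "GATATCTGTC", "ACGGCGTTCG", "CCCTAAAGAG", "CGTCAGAGGT"]
print([distance_to_string("AAA", s) for s in dna])  # [1, 1, 2, 0, 1]
print(distance_to_dna("AAA", dna))                  # 5
```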

55 Median String Problem
Median String Problem. Finding a median string.
Input: A set of sequences Dna and an integer k.
Output: A k-mer minimizing the distance d(k-mer, Dna) among all k-mers.
MedianString(Dna, k)
    best-k-mer ← AAA...AA
    for each k-mer from AAA...AA to TTT...TT
        if d(k-mer, Dna) < d(best-k-mer, Dna)
            best-k-mer ← k-mer
    return best-k-mer
Runtime: $4^k \cdot n \cdot t \cdot k$ (for Dna with t sequences of length n). Computing d(k-mer, sequence) requires n-k+1 string comparisons, each with k character comparisons; since $k \ll n$, that is about $n \cdot k$ steps per sequence. With t sequences in Dna, this gives $n \cdot t \cdot k$ steps per k-mer, repeated for all $4^k$ k-mers.
Motif Finding Problem versus Median String Problem: runtime $n^t \cdot k \cdot t$ versus runtime $4^k \cdot n \cdot t \cdot k$.
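A direct Python translation of the MedianString pseudocode might look like this. It is a sketch: `itertools.product` generates the k-mers in the order AAA...AA to TTT...TT, and the three toy strings (which share the 4-mer ACGT) are an assumption for illustration. If several k-mers tie for the minimum, the first one in that order is returned.

```python
from itertools import product

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def distance_to_dna(pattern, dna):
    """d(k-mer, Dna): sum over all strings of the minimum window distance."""
    k = len(pattern)
    return sum(min(hamming(pattern, text[i:i + k])
                   for i in range(len(text) - k + 1))
               for text in dna)

def median_string(dna, k):
    """Explore all 4^k k-mers and keep the one closest to Dna."""
    best_kmer = "A" * k
    best_distance = distance_to_dna(best_kmer, dna)
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        distance = distance_to_dna(kmer, dna)
        if distance < best_distance:
            best_kmer, best_distance = kmer, distance
    return best_kmer

dna = ["TTACGTAA", "CCACGTCC", "GGGACGTG"]  # toy data, assumption
print(median_string(dna, 4))  # ACGT
```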

56 Motif Finding Problem = Median String Problem
Motif Finding Problem versus Median String Problem: runtime $n^t \cdot k \cdot t$ versus runtime $4^k \cdot n \cdot t \cdot k$.
Although MedianString is much faster than BruteForceMotifSearch, it is still slow for large k.
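To get a feeling for the two runtime formulas, the step counts can simply be evaluated. The values n = 100 and t = 10 below are assumed for illustration only; constant factors are ignored.

```python
def brute_force_steps(n, t, k):
    """Rough step count of BruteForceMotifSearch: n^t * k * t."""
    return n**t * k * t

def median_string_steps(n, t, k):
    """Rough step count of MedianString: 4^k * n * t * k."""
    return 4**k * n * t * k

n, t = 100, 10  # assumed example sizes
for k in (8, 15, 20, 30):
    print(k, f"{brute_force_steps(n, t, k):.2e}", f"{median_string_steps(n, t, k):.2e}")
```

Each additional motif position multiplies the MedianString step count by more than 4, which is why it becomes slow for large k even though it beats the brute-force search for short motifs.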

57 BA2H BA2B

58 Assignment 3
1. Create the sub-directory assignment3 within your GitHub account.
2. Solve Rosalind exercises BA2H, BA2B.
3. Upload the code of all solutions and a screenshot of each program run using the Rosalind examples (into the assignment3 sub-directory).
4. Send an e-mail to your tutor, containing the URL of your GitHub repository sub-directory.

59
1. Gibbs Sampling
2. Hidden Markov Models
3. Find Patterns in Databases (using Profile HMMs)

VL Bioinformatik für Nebenfächler SS2018 Woche 9

VL Bioinformatik für Nebenfächler SS2018 Woche 9 VL Bioinformatik für Nebenfächler SS2018 Woche 9 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of Teil 1 (Woche

More information

Finding Regulatory Motifs in DNA Sequences

Finding Regulatory Motifs in DNA Sequences Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median

More information

Finding Regulatory Motifs in DNA Sequences

Finding Regulatory Motifs in DNA Sequences Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median

More information

Partial restriction digest

Partial restriction digest This lecture Exhaustive search Torgeir R. Hvidsten Restriction enzymes and the partial digest problem Finding regulatory motifs in DNA Sequences Exhaustive search methods T.R. Hvidsten: 1MB304: Discrete

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics Lecture 2: Exhaustive search and motif finding 6.9.202 Outline Implanted motifs - an introduction

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics

More information

Probabilistic models of biological sequence motifs

Probabilistic models of biological sequence motifs Probabilistic models of biological sequence motifs Discovery of new motifs Master in Bioinformatics UPF 2015-2016 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain what

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php Algorithms for Bioinformatics

More information

Finding Regulatory Motifs in DNA Sequences

Finding Regulatory Motifs in DNA Sequences Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median

More information

9/8/09 Comp /Comp Fall

9/8/09 Comp /Comp Fall 9/8/09 Comp 590-90/Comp 790-90 Fall 2009 1 Genomes contain billions of bases (10 9 ) Within these there are 10s of 1000s genes (10 4 ) Genes are 1000s of bases long on average (10 3 ) So only 1% of DNA

More information

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16 VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

How can one gene have such drastic effects?

How can one gene have such drastic effects? Slides revised and adapted Computational Biology course IST Ana Teresa Freitas 2011/2012 A recent microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed How can

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1. Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12:00 4001 Motif finding This exposition was developed by Knut Reinert and Clemens Gröpl. It is based on the following

More information

Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25,

Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25, Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25, 200707 Motif Finding This exposition is based on the following sources, which are all recommended reading:.

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM? Neyman-Pearson More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery Given a sample x 1, x 2,..., x n, from a distribution f(... #) with parameter #, want to test

More information

Jianlin Cheng, PhD. Department of Computer Science University of Missouri, Columbia. Fall, 2014

Jianlin Cheng, PhD. Department of Computer Science University of Missouri, Columbia. Fall, 2014 Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia Fall, 2014 Free for academic use. Copyright @ Jianlin Cheng & original sources for some materials Find a set of sub-sequences

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser
