Lecture (VL): Bioinformatics for Non-Majors, SS2018, Week 8

1 Lecture (VL): Bioinformatics for Non-Majors, SS2018, Week 8. Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, FU Berlin. Based on slides by P. Compeau and P. Pevzner, authors of

2 Course overview
Part 1 (weeks 1-3): Introduction to Python & algorithms
Part 2 (weeks 4-6): Finding unknown patterns (without errors) in large texts, using the oriC region as an example
Part 3 (week 7 onward): Finding unknown patterns (with errors) in large texts, using transcription factor binding sites as an example

3 Motif finding using randomized algorithms

5 Summary of the upcoming block (rough!)
Genes have certain features at the sequence level:
- changes in the composition of the sequence
- specific motifs that, for example, mark the start or the end
Problem: there is no clue what these motifs might look like.
Possible approach: randomized algorithms

6 Basic Gene Structures
Eukaryotic genes: exons, introns, translation starts and stops, splice (donor/acceptor) junctions, ...

7 How do Plants Know Whether it is Day or Night?
Plants need to change the expression of many genes (photosynthesis, flowering, frost resistance, etc.) when switching from day to night and vice versa.
Who are the molecular managers telling genes to switch expression on and off in billions of cells?
Just 3 genes call the shots: CCA1, LHY, and TOC1.
Examples: the night-blooming cereus only blooms at night; the sunflower follows the sun.

8 How Does CCA1 Know Where to Bind?
CCA1 binds upstream of Gene1, Gene2, Gene3, and Gene4.
There must be some hidden messages in these regions that tell CCA1 WHERE to bind.

9 Transcription Factor Binding Sites
cagtataaagtctactgatgcaacctgactcatgacgaggaa  Gene1
agtcgactgacttaacaaatctcggatcgattcgtccgagga  Gene2
cgtcagctctgtcgggattcgccccgtattcaaaaaagctac  Gene3
accgtctccacaaaacctgctcgtgccactgatgcaacctga  Gene4
The hidden messages (motif AAAAAATCT): transcription factor binding sites of CCA1

10 Outline
From Implanted Patterns to Regulatory Motifs
Implanting Patterns into Strings
Implanted Motif Problem
Motif Finding Problem
Median String Problem
Greedy Motif Search
How Rolling Dice Helps Us Find Regulatory Motifs
Randomized Motif Search
How do Bacteria Hibernate?
Gibbs Sampling
Pseudocounts

11 Generate Ten Random Sequences atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

12 Implant a Pattern AAAAAAAAGGGGGGG at Randomly Chosen Positions
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

13 Where Are the Implanted Patterns Hiding?
(The same ten sequences as on slide 12.)
The implanted patterns can be found by the Frequent Words algorithm, because every sequence contains an identical copy of the pattern.

14 Implant a Pattern AAAAAAAAGGGGGGG with 4 Random Mutations at Random Positions
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
A (k,d)-motif: a k-mer that appears in each sequence with at most d mismatches.
AAAAAAAAGGGGGGG is a (15,4)-motif.
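The implanting step on this slide can be sketched in Python. This is a minimal illustration, not the procedure used to produce the slide data: the helper names `mutate` and `implant` are hypothetical, and it assumes that "4 random mutations" means exactly 4 substitutions at distinct positions and that the instance overwrites a window of the background sequence.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(pattern, d, rng):
    """Return a copy of pattern with exactly d substituted positions."""
    letters = list(pattern)
    for pos in rng.sample(range(len(pattern)), d):
        # Choose a nucleotide different from the current one,
        # so every chosen position becomes a real mismatch.
        letters[pos] = rng.choice([n for n in NUCLEOTIDES if n != letters[pos]])
    return "".join(letters)


def implant(sequence, pattern, d, rng):
    """Overwrite a random window of sequence with a mutated copy of pattern.

    Returns the new sequence and the start position of the implanted instance.
    """
    start = rng.randrange(len(sequence) - len(pattern) + 1)
    instance = mutate(pattern, d, rng)
    return sequence[:start] + instance + sequence[start + len(pattern):], start


rng = random.Random(0)  # fixed seed so the run is reproducible
background = "".join(rng.choice("acgt") for _ in range(40))
implanted, start = implant(background, "AAAAAAAAGGGGGGG", 4, rng)
print(start, implanted)
```

The implanted instance is written in uppercase here only so that it is easy to spot in the lowercase background.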

16 Now Where are the Implanted Patterns?
(The same ten mutated sequences as on slide 14.)
Would the Frequent Words algorithm find this hidden message? No. After the mutations the implanted instances are no longer identical, so no single 15-mer occurs frequently.

18 Implanted Motif Problem
Implanted Motif Problem. Find all (k,d)-motifs in a set of strings.
Input: A set of strings Dna, and integers k (motif length) and d (maximal number of mismatches in a motif).
Output: All (k,d)-motifs in Dna.
Idea:
- The DNA strings are generated randomly.
- The only thing that is not random inside them is the implanted motif.
- If we find the most similar k-mers in the first two strings, with at most d differences, maybe they will point us to the motif?

20 Finding Implanted Motifs by Pairwise Comparison
(The same ten mutated sequences as on slide 14.)
The implanted instances in the first two sequences are:
AgAAgAAAGGttGGG (sequence 1)
cAAtAAAAcGGcGGG (sequence 2)

21 Why Pairwise Comparison Won't Work
AgAAgAAAGGttGGG  implanted instance 1 (4 mismatches to the pattern)
AAAAAAAAGGGGGGG  pattern
cAAtAAAAcGGcGGG  implanted instance 2 (4 mismatches to the pattern)
Two implanted instances can differ from each other in up to 4 + 4 = 8 positions: they don't even look like similar strings!

22 Resorting to Motif Enumeration Instead
Should we explore ALL 4^k k-mers to solve the Implanted Motif Problem?
Not necessarily: if a k-mer is far away from all k-mers in the strings that we analyze, there is no reason to explore it. We only analyze k-mers that differ by at most d mismatches from some k-mer occurring in Dna.
MotifEnumeration(Dna, k, d)
    for each k-mer a in Dna
        generate all possible k-mers a' differing from a by at most d mutations
        for each such k-mer a'
            if a' is a (k,d)-motif, i.e. appears in each sequence in Dna with at most d mismatches
                output a'
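The MotifEnumeration pseudocode can be sketched in Python as follows. This is one possible implementation under simple assumptions (uppercase strings over the alphabet ACGT); the helper names `hamming`, `neighbors`, and `appears_with_mismatches` are illustrative and do not come from the slides.

```python
def hamming(p, q):
    """Number of mismatching positions between two equally long strings."""
    return sum(a != b for a, b in zip(p, q))


def neighbors(pattern, d):
    """All k-mers whose Hamming distance to pattern is at most d."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    result = set()
    for suffix in neighbors(pattern[1:], d):
        if hamming(pattern[1:], suffix) < d:
            # Budget left: the first letter may be anything.
            result.update(n + suffix for n in "ACGT")
        else:
            # Budget used up: the first letter must stay unchanged.
            result.add(pattern[0] + suffix)
    return result


def appears_with_mismatches(pattern, text, d):
    """True if some k-mer of text is within d mismatches of pattern."""
    k = len(pattern)
    return any(hamming(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))


def motif_enumeration(dna, k, d):
    """Return the set of all (k,d)-motifs in the list of strings dna."""
    found = set()
    for text in dna:
        for i in range(len(text) - k + 1):
            for candidate in neighbors(text[i:i + k], d):
                if all(appears_with_mismatches(candidate, s, d) for s in dna):
                    found.add(candidate)
    return found


print(sorted(motif_enumeration(["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"], 3, 1)))
```

Only candidates that are close to some k-mer actually occurring in Dna are tested, which is the point made on the slide; the result is still the complete set of (k,d)-motifs.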

23 Resorting to Motif Enumeration Instead
Would this simple (albeit slow) algorithm work for finding real biological motifs?
MotifEnumeration assumes that EACH sequence has a k-mer similar to an (unknown) implanted pattern. This condition does not hold for noisy biological datasets: some genes that are controlled by the circadian clock do not contain the motif at all.
We need a way to search for motifs even if some sequences do not have any motif implanted. That is why we need to define a score for every set of motifs, even when some of them don't look very similar to the canonical motif.

24 Outline
From Implanted Patterns to Regulatory Motifs
Implanting Patterns into Strings
Do We Have a Clock Gene?
Implanted Motif Problem
Motif Finding Problem
Median String Problem
Greedy Motif Search
How Rolling Dice Helps Us Find Regulatory Motifs
Randomized Motif Search
How do Bacteria Hibernate?
Gibbs Sampling
Pseudocounts

25 Transcription Factor Binding Sites
Let us assume for now that we know the motifs.
cagtataaagtctactgatgcaacctgactcatgacgaggaa  Gene1
agtcgactgacttaacaaatctcggatcgattcgtccgagga  Gene2
cgtcagctctgtcgggattcgccccgtattcaaaaaagctac  Gene3
accgtctccacaaaacctgctcgtgccactgatgcaacctga  Gene4
Transcription factor binding sites of CCA1 (motif AAAAAATCT)

26 From Motif to Consensus String
Motifs:
T C G G G G G T T T T T
C C G G T G A C T T A C
A C G G G G A T T T T C
T T G G G G A C T T T T
A A G G G G A C T T C C
T T G G G G A C T T C C
T C G G G G A T T C A T
T C G G G G A T T C C T
T A G G G G A A C T A C
T C G G G T A T A A C C

27 From Motif to Consensus String
Motifs: the same 10 x 12 motif matrix as on slide 26.
Visualizing using a motif logo.
Height of a logo letter: 2 minus the randomness of the letters in its column.
Randomness ranges from 0 (no randomness: all letters in the column are identical) to 2 (full randomness: all four nucleotides are equally frequent). See also: entropy.
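The "randomness" of a column can be made concrete with a short sketch. It assumes that randomness means the Shannon entropy of the column in bits, which is the usual convention for sequence logos but is not spelled out on the slide; the function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the letters in one motif column."""
    total = len(column)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(column).values())


def logo_column_height(column):
    """Height of the logo stack for a DNA column: 2 bits minus the entropy."""
    return 2 - column_entropy(column)


# Column 3 of the motif matrix is all G: no randomness, full height.
print(logo_column_height("GGGGGGGGGG"))
# A column with all four nucleotides equally frequent: full randomness, height 0.
print(logo_column_height("ACGT"))
# Column 8 of the motif matrix (read top to bottom) lies in between.
print(logo_column_height("TCTCCCTTAT"))
```

In a typical logo the stack height computed this way is then divided among the letters in proportion to their frequency in the column.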

31 From Motif to Consensus String
Motifs:
T C G G G G g T T T t t
c C G G t G A c T T a C
a C G G G G A T T T t C
T t G G G G A c T T t t
a a G G G G A c T T C C
T t G G G G A c T T C C
T C G G G G A T T c a t
T C G G G G A T T c C t
T a G G G G A a c T a C
T C G G G t A T a a C C
The most popular nucleotide in each column is shown in uppercase (bold on the slide). The other (unpopular) nucleotides in each column are shown in lowercase.
Consensus(Motifs) = T C G G G G A T T T C C (the most popular symbol in each column)
Score(Motifs) = 3 + 4 + 0 + 0 + 1 + 1 + 1 + 5 + 2 + 3 + 6 + 4 = 30 (the number of unpopular symbols, counted column by column)
Goal: minimize Score! How? Select similar k-mers, one from each string.
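Consensus and Score can be computed directly from the motif matrix. The sketch below uses the ten motifs from the slide; the function names `consensus` and `score` mirror the slide's notation, but the implementation itself is only one straightforward way to write it.

```python
from collections import Counter

MOTIFS = [
    "TCGGGGGTTTTT",
    "CCGGTGACTTAC",
    "ACGGGGATTTTC",
    "TTGGGGACTTTT",
    "AAGGGGACTTCC",
    "TTGGGGACTTCC",
    "TCGGGGATTCAT",
    "TCGGGGATTCCT",
    "TAGGGGAACTAC",
    "TCGGGTATAACC",
]


def consensus(motifs):
    """Most popular symbol in each column (zip(*motifs) yields the columns)."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*motifs))


def score(motifs):
    """Number of unpopular symbols, counted column by column."""
    return sum(len(column) - Counter(column).most_common(1)[0][1]
               for column in zip(*motifs))


print(consensus(MOTIFS))  # TCGGGGATTTCC
print(score(MOTIFS))      # 30
```

If two symbols tie for most popular in a column, this sketch simply keeps whichever `Counter` reports first; the score is unaffected by that choice.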

33 The Motif Finding Problem
Motif Finding Problem. Given a set of sequences, find a set of k-mers (one from each sequence) with minimal score among all choices of k-mers.
Input: A set of sequences Dna and an integer k.
Output: A set of k-mers Motifs, one from each sequence in Dna, minimizing Score(Motifs).
BruteForceMotifSearch (t = number of strings, n = length of each string):
- $(n-k+1)^t$ ways to form a motif matrix ($n-k+1$ choices of a k-mer in each of the t sequences)
- scoring one matrix requires $k \cdot t$ steps
- $k \ll n$ (in practice, k is much smaller than n)
- runtime: $O(n^t \cdot k \cdot t)$, which is too slow!

37 Reformulating the Motif Finding Problem
Consensus(Motifs) = T C G G G G A T T T C C (the most popular symbol in each column)
Score(Motifs) = 30 (the number of unpopular symbols)
Hamming distance: the number of mismatches between two k-mers.
Example: GATTCTCA and GACGCTGA differ in 3 positions, so d(GATTCTCA, GACGCTGA) = 3.
Counted ROW-BY-ROW, the number of unpopular symbols in row i is the Hamming distance from the consensus to that motif: d(Consensus, Motif_i).

38 Reformulating the Motif Finding Problem
What is the distance between a k-mer and Motifs = {Motif_1, ..., Motif_t}?
Define d(k-mer, Motifs) as the sum of the Hamming distances between the k-mer and each Motif_i:
$d(\text{k-mer}, Motifs) = \sum_{i=1}^{t} d(\text{k-mer}, Motif_i)$
What is d(Consensus(Motifs), Motifs)?

39 Score(Motifs) = d(Consensus(Motifs), Motifs)
Consensus(Motifs) = T C G G G G A T T T C C
Score(Motifs) = 30
Counting COLUMN-BY-COLUMN: the number of unpopular symbols in each column.
Counting ROW-BY-ROW: the Hamming distance from the consensus to each motif, d(Consensus, Motif_i).
Both ways count the same symbols, so
Score(Motifs) = # unpopular symbols counted column by column = # unpopular symbols counted row by row = d(Consensus(Motifs), Motifs)
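The equality of the two ways of counting can be checked on a small example. The four 3-mers below are made-up toy data chosen so that the numbers are easy to verify by hand, and the function names are illustrative.

```python
from collections import Counter


def hamming(p, q):
    """Number of mismatches between two k-mers of equal length."""
    return sum(a != b for a, b in zip(p, q))


def consensus(motifs):
    """Most popular symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))


def score_by_columns(motifs):
    """Unpopular symbols counted column by column."""
    return sum(len(col) - Counter(col).most_common(1)[0][1]
               for col in zip(*motifs))


def score_by_rows(motifs):
    """Unpopular symbols counted row by row: d(Consensus(Motifs), Motifs)."""
    cons = consensus(motifs)
    return sum(hamming(cons, motif) for motif in motifs)


toy = ["ATG", "ATC", "AAG", "TTG"]       # hypothetical motifs
print(consensus(toy))                     # ATG
print(score_by_columns(toy))              # 1 + 1 + 1 = 3
print(score_by_rows(toy))                 # 0 + 1 + 1 + 1 = 3
print(hamming("GATTCTCA", "GACGCTGA"))    # 3, the Hamming example from slide 37
```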

40 Yet Another (Equivalent) Motif Finding Problem
Original view: search for Motifs in Dna minimizing Score(Motifs); the consensus string follows from Motifs.
Equivalent view: search for a k-mer Pattern minimizing d(Pattern, Motifs) among all possible k-mers and all possible Motifs in Dna.
Equivalent Motif Finding Problem. Find a k-mer Pattern and a set of k-mers Motifs in Dna that minimize d(Pattern, Motifs) over all possible choices of Pattern and Motifs.
Input: A set of sequences Dna and an integer k.
Output: A k-mer Pattern and a set of k-mers Motifs in Dna minimizing d(Pattern, Motifs).

41 Have We Just Made Our Task More Difficult?
Motif Finding Problem. Find a set of k-mers Motifs, one from each sequence in Dna, minimizing Score(Motifs).
Equivalent Motif Finding Problem. Find a k-mer Pattern AND a set of k-mers Motifs, one from each sequence in Dna, minimizing d(Pattern, Motifs).
The key insight for solving the Equivalent Motif Finding Problem: given Pattern, we can quickly find the Motifs minimizing d(Pattern, Motifs) by picking, in each sequence, the k-mer closest to Pattern.
There is no need to explore all Motifs as in the Motif Finding Problem! Instead, the Median String Problem explores all k-mers Pattern.
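The key insight can be illustrated with a short sketch: for a fixed Pattern, the best Motifs are found independently in each sequence. The input strings are made-up toy data, and the helper names `closest_kmer` and `motifs_for` are hypothetical.

```python
def hamming(p, q):
    """Number of mismatches between two k-mers of equal length."""
    return sum(a != b for a, b in zip(p, q))


def closest_kmer(pattern, text):
    """The k-mer of text with the smallest Hamming distance to pattern.

    If several k-mers are equally close, the leftmost one is returned.
    """
    k = len(pattern)
    kmers = (text[i:i + k] for i in range(len(text) - k + 1))
    return min(kmers, key=lambda kmer: hamming(pattern, kmer))


def motifs_for(pattern, dna):
    """Motifs minimizing d(pattern, Motifs): one closest k-mer per sequence."""
    return [closest_kmer(pattern, text) for text in dna]


dna = ["TTGACT", "CGTCAA", "AGGTCC"]  # hypothetical sequences
motifs = motifs_for("GAC", dna)
print(motifs)                                    # ['GAC', 'GTC', 'GTC']
print(sum(hamming("GAC", m) for m in motifs))    # d(Pattern, Motifs) = 2
```

Each sequence is scanned once, so the work per Pattern grows linearly with the number of sequences instead of exponentially.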

43-53 Distance between a k-mer and a (longer) String
d(Pattern, String): the minimum Hamming distance between Pattern and all k-mers in String.
Example: d(GATTCTCA, GCAAAGACGCTGACCAA) = ?
Slide Pattern along String one position at a time and compute the Hamming distance to each window: 7 for the first window (GCAAAGAC), 6 for the second (CAAAGACG), and so on.
The minimum is reached at the window GACGCTGA, which differs from GATTCTCA in 3 positions, so d(GATTCTCA, GCAAAGACGCTGACCAA) = 3.
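The sliding-window computation from these slides can be written out as a short Python sketch. The function names are illustrative; the strings are the ones from the slide.

```python
def hamming(p, q):
    """Number of mismatches between two k-mers of equal length."""
    return sum(a != b for a, b in zip(p, q))


def window_distances(pattern, text):
    """Hamming distance from pattern to every k-mer (window) of text, left to right."""
    k = len(pattern)
    return [hamming(pattern, text[i:i + k]) for i in range(len(text) - k + 1)]


def distance_to_string(pattern, text):
    """d(Pattern, String): the minimum over all windows."""
    return min(window_distances(pattern, text))


pattern = "GATTCTCA"
text = "GCAAAGACGCTGACCAA"
print(window_distances(pattern, text))    # [7, 6, 7, 5, 8, 3, 8, 6, 4, 6]
print(distance_to_string(pattern, text))  # 3
```

The string has 17 letters and the pattern 8, so there are 17 - 8 + 1 = 10 windows to compare.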

54 Distance between a k-mer and a Set of Strings
Distance between a k-mer and a set of strings Dna = {Dna_1, ..., Dna_t}:
$d(\text{k-mer}, Dna) = \sum_{i=1}^{t} d(\text{k-mer}, Dna_i)$ (sum over all strings in Dna)
Example with Pattern = AAA:
ttaccttaac  1
gatatctgtc  1
acggcgttcg  2
ccctaaagag  0
cgtcagaggt  1
d(AAA, Dna) = 5
A median string for the set of strings Dna: a k-mer minimizing the distance d(k-mer, Dna) over all possible k-mers.
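The example on this slide can be reproduced with a few lines of Python. This is a minimal sketch with illustrative function names; it compares letters case-insensitively because the slide writes the pattern in uppercase and the sequences in lowercase.

```python
def hamming(p, q):
    """Number of mismatches between two k-mers of equal length."""
    return sum(a != b for a, b in zip(p, q))


def distance_to_string(pattern, text):
    """d(Pattern, String): minimum Hamming distance over all k-mers of text."""
    pattern, text = pattern.upper(), text.upper()
    k = len(pattern)
    return min(hamming(pattern, text[i:i + k])
               for i in range(len(text) - k + 1))


def distance_to_dna(pattern, dna):
    """d(Pattern, Dna): sum of d(Pattern, String) over all strings in dna."""
    return sum(distance_to_string(pattern, text) for text in dna)


DNA = ["ttaccttaac", "gatatctgtc", "acggcgttcg", "ccctaaagag", "cgtcagaggt"]
print([distance_to_string("AAA", text) for text in DNA])  # [1, 1, 2, 0, 1]
print(distance_to_dna("AAA", DNA))                        # 5
```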

55 Median String Problem
Median String Problem. Find a median string.
Input: A set of sequences Dna and an integer k.
Output: A k-mer minimizing the distance d(k-mer, Dna) among all k-mers.
MedianString(Dna, k)
    best-k-mer ← AAA...AA
    for each k-mer from AAA...AA to TTT...TT
        if d(k-mer, Dna) < d(best-k-mer, Dna)
            best-k-mer ← k-mer
    return best-k-mer
Runtime: $4^k \cdot n \cdot t \cdot k$ (for Dna with t sequences of length n).
- Computing d(k-mer, sequence) requires n-k+1 string comparisons; each string comparison takes k character comparisons; since $k \ll n$, this is about $n \cdot k$ steps.
- Since there are t sequences in Dna, computing d(k-mer, Dna) takes about $n \cdot t \cdot k$ steps per k-mer.
- There are $4^k$ k-mers to try.
Motif Finding Problem versus Median String Problem:
Motif Finding Problem, runtime: $n^t \cdot k \cdot t$
Median String Problem, runtime: $4^k \cdot n \cdot t \cdot k$
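A direct Python translation of the MedianString pseudocode might look as follows. It is a sketch under simple assumptions (uppercase input over ACGT, k-mers enumerated in lexicographic order with `itertools.product`); the input strings in the example are made-up toy data.

```python
from itertools import product


def hamming(p, q):
    """Number of mismatches between two k-mers of equal length."""
    return sum(a != b for a, b in zip(p, q))


def distance_to_string(pattern, text):
    """d(Pattern, String): minimum Hamming distance over all k-mers of text."""
    k = len(pattern)
    return min(hamming(pattern, text[i:i + k])
               for i in range(len(text) - k + 1))


def distance_to_dna(pattern, dna):
    """d(Pattern, Dna): sum of d(Pattern, String) over all strings."""
    return sum(distance_to_string(pattern, text) for text in dna)


def median_string(dna, k):
    """Try all 4^k k-mers from AAA...AA to TTT...TT and keep the best one."""
    best = "A" * k
    best_distance = distance_to_dna(best, dna)
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        distance = distance_to_dna(kmer, dna)
        if distance < best_distance:  # strict <, so the first best k-mer wins ties
            best, best_distance = kmer, distance
    return best


dna = ["TTGACT", "CGACAA", "AGGACC"]  # hypothetical sequences, all containing GAC
print(median_string(dna, 3))          # GAC
```

The outer loop runs 4^k times, which is the factor that makes this approach slow for large k.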

56 Motif Finding Problem = Median String Problem
Motif Finding Problem, runtime: $n^t \cdot k \cdot t$
Median String Problem, runtime: $4^k \cdot n \cdot t \cdot k$
Although MedianString is much faster than BruteForceMotifSearch, it is still slow for large k.
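To get a feeling for the two runtime formulas, one can plug in numbers. The values n = 600, t = 10, and k = 15 below are assumptions chosen only for illustration; they do not come from the slides.

```python
def brute_force_steps(n, t, k):
    """Approximate step count of BruteForceMotifSearch: n^t * k * t."""
    return n**t * k * t


def median_string_steps(n, t, k):
    """Approximate step count of MedianString: 4^k * n * t * k."""
    return 4**k * n * t * k


n, t, k = 600, 10, 15  # hypothetical problem size
print(f"brute force:   {brute_force_steps(n, t, k):.2e} steps")
print(f"median string: {median_string_steps(n, t, k):.2e} steps")

# MedianString grows by more than a factor of 4 with every additional motif position.
for motif_length in (10, 15, 20):
    print(motif_length, f"{median_string_steps(n, t, motif_length):.2e}")
```

Python integers have arbitrary precision, so these products are computed exactly before being formatted.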

57 BA2H BA2B

58 Assignment 3
1. Create a sub-directory assignment 3 within your GitHub account
2. Solve the Rosalind exercises BA2H and BA2B
3. Upload the code of all solutions and a screenshot of each program run using the Rosalind examples (into the assignment3 subdirectory)
4. Send an email to your tutor containing the URL of your GitHub repository sub-directory

59
1. Gibbs sampling
2. Hidden Markov Models
3. Finding patterns in databases (using profile HMMs)
