Algorithms for Bioinformatics

Size: px
Start display at page:

Download "Algorithms for Bioinformatics"

Transcription

1 These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from Algorithms for Bioinformatics Lecture 2: Exhaustive search and randomized algorithms for motif discovery

2 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 2 / 59

3 DNA Sequence Motif A DNA sequence motif is a recurring pattern in DNA presumed to have some biological funtion. Often the occurences of the motif in DNA are binding sites where proteins such as transcription factors can attach. 3 / 59

4 Transcription Factors Every gene contains a regulatory region typically stretching bp upstream of the transcriptional start site. Transcription factor (TF) is a regulatory protein that can bind to a regulatory region and promote or inhibit the transcription of the gene. A single TF typically regulates multiple (co-regulated) genes. 4 / 59

5 Transcription Factor Binding Sites and Motifs Transcription Factor Binding Site (TFBS) is the DNA location where the TF binds to. May be located anywhere in the regulatory region. A TF binds to a specific kind of DNA sequence. Similar but not identical in different co-regulated genes. Some positions are more important and thus conserved while other positions allow variation. Motif is a pattern that describes the TFBS sequences of a given TF. 5 / 59

6 TBFSs ATCCCG gene TTCC GG gene ATCCCG gene AT GCCG gene AT GCC C gene 6 / 59

7 Motif Representations Set of sequences Consensus sequence Profile matrix TGGGGGA TGAGAGA TGGGGGA TGAGAGA TGAGGGA TGAGGGA A C G T Motif logo 7 / 59

8 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 8 / 59

9 Identifying Motifs Identify a set of co-regulated genes. Extract their regulatory regions. Find a pattern shared by all of the regions: the motif. The motif can be used for finding other (previously unknown) genes regulated by the same TF. cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc 9 / 59

10 The Motif Finding Problem: Formulation Goal: Find the best motif in a set of DNA sequences Input: A collection of t strings Dna and an integer k Output: The collection of t k-mers, one from each input string, forming the best motif. We will later consider what is the best motif. t = 5 k = 8 cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc }{{} n = / 59

11 Planted (k, d)-motif Problem Artificially generated problem instance 1. Generate a random k-mer M (the motif) 2. Generate t random sequences of length n 3. Implant M into a random position in each of the t sequences 4. In each implanted copy of M, make d random mutations 11 / 59

12 Example: Generate sequences gatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtac cctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttata tgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttca ttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagc 12 / 59

13 Example: Implant motif AAAAAAAAGGGGGGG gatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggg cctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttata tgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggg 13 / 59

14 Where are the implanted motifs? gatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggg cctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttata tgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggg 14 / 59

15 Example: Add four mutations per motif occurrence gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg cctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttata tgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgg 15 / 59

16 Where are the implanted motifs??? gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg cctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttata tgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgg 16 / 59

17 Why finding (15,4)-motifs is hard? gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg Occurrences can differ in 8 positions AgAAgAAAGGttGGG caataaaacggcggg Difficult to recognize similarity even if you know the positions 17 / 59

18 Planted (k, d)-motif Problem Common benchmark or challenge problem for motif finding Adjustable difficulty For example, the planted (15, 4)-motif problem is difficult but not impossible Real data is noisier and more complex No upper bound d on mutation Some positions have little variation, others have a lot 18 / 59

19 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 19 / 59

20 Scoring Motifs The output of motif finding is a set of k-mers, one from each input sequence In a desired output, the k-mers are as similar to each other as possible, representing a coherent motif How to measure the goodness of the motif? 20 / 59

21 Consensus Motif and Score a G g t a c T t C c A t a c g t Motif occurrences a c g t T A g t a c g t C c A t C c g t a c g G A Counts C G T Consensus A C G T A C G T t motifs occurrences, one from each sequence Count symbols in each column Consensus formed by most frequent symbols Score is the number of differences from consensus Score =10 21 / 59

22 Consensus Motif and Score Good model for planted motif problem In real data, mutations should be expensive in conserved positions but much cheaper in variable positions 22 / 59

23 BruteForceMotifSearch Compute the score for every possible combination of motifs Output the set of motifs with the smallest score 23 / 59

24 Running Time of BruteForceMotifSearch (n k + 1) different k-mers in each sequence (n k + 1) t different combinations of motifs kt time to compute score for one set of motifs kt(n k + 1) t = O(ktn t ) time in total E.g. for t = 20, n = 600, k = 15 we must perform approximately computations it would take billions of years 24 / 59

25 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 25 / 59

26 The Median String Problem Given a set of t DNA sequences find a pattern that appears in all t sequences with the minimum number of total mismatches This pattern will be the shared motif 26 / 59

27 Hamming Distance The Hamming distance d(v, w) the number of mismatches between two k-mers v and w For example: d(aaaaaa, ACAAAC) = 2 27 / 59

28 Computing Score a G g t a c T t 2 C c A t a c g t 2 Motifs a c g t T A g t 2 a c g t C c A t 2 C c g t a c g G 2 Score(Motifs) =10 Consensus(Motifs) A C G T A C G T Score is the number of mismatching symbols Can be computed column by column or row by row Row sums are Hamming distances 28 / 59

29 Computing Score Define Motifs = {Motif 1, Motif 2,..., Motif t } d(pattern, Motifs) = t i=1 d(pattern, Motif i) Then Score(Motifs) = d(consensus(motifs), Motifs) 29 / 59

30 Best Match Distance Assume String > Pattern = k The best match distance d(pattern, String) is the smallest Hamming distance d(pattern, Motif ) between Pattern and any k-mer Motif in String Example: d(acgtacgt, gcaaaaggtacttccaa) = 2 Generalize for a set of strings Dna = {Dna 1, Dna 2,..., Dna t } d(pattern, Dna) = t i=1 d(pattern, Dna i) 30 / 59

31 The Median String Problem Goal: Given a set of DNA sequences, find a median string Input: A collection of strings Dna and an integer k Output: A k-mer Pattern minimizing d(pattern, Dna) among all k-mers Pattern 31 / 59

32 Motif Finding Problem = Median String Problem Input: Dna, k Motifs: output of Motif Finding (t k-mers) Pattern: output of Median String Score(Motifs) = d(pattern, Dna) 32 / 59

33 Motif Finding Problem = Median String Problem Input: Dna, k Motifs: output of Motif Finding (t k-mers) Pattern: output of Median String Score(Motifs) = d(pattern, Dna) Why? If Score(Motifs) < d(pattern, Dna), we could choose Consensus(Motifs) as a better Pattern If Score(Motifs) > d(pattern, Dna), we could choose the best match occurrences of Pattern as better Motifs 32 / 59

34 Median String Algorithm MedianString(DNA, k) 1: BestPattern AAA...A 2: for each k-mer Pattern from AAA...A to TTT...T do 3: if d(pattern, DNA) < d(bestpattern, DNA) then 4: BestPattern Pattern 5: return BestPattern 33 / 59

35 Running Time of MedianString 4 k different k-mers O(k n) time to compute the best match distance to one string O(knt4 k ) time in total E.g. for t = 20, n = 600, k = 15 this is about about still a lot but much less than / 59

36 Running Time of MedianString 4 k different k-mers O(k n) time to compute the best match distance to one string O(knt4 k ) time in total E.g. for t = 20, n = 600, k = 15 this is about about still a lot but much less than Reformulating a problem can help! 34 / 59

37 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 35 / 59

38 Search Space BruteForceMotifSearch and MedianString algorithms have exponential running time This is because the search space, the set of possible solutions, is exponential n t different ways to choose Motifs 4 k different ways to choose Pattern 36 / 59

39 Exploring Only Part of Search Space Branch and bound algorithms (covered in study groups) Avoid regions that cannot improve solution Still exponential in the worst case Greedy algorithms Search the most promising directions No guarantee of finding an optimal solution Randomized algorithms Add randomness to greedy search Avoids getting stuck in a dead end 37 / 59

40 Profile Matrix a G g t a c T t C c A t a c g t Motifs a c g t T A g t a c g t C c A t C c g t a c g G A Count(Motifs) C G T A Profile(Motifs) C G T Profile represents the probability of each nucleotide in each position More detailed summary of the set of motifs than consensus Consensus(Motifs) A C G T A C G T 38 / 59

41 k-mer Probabilities A Profile C G T The probability of a k-mer given a profile Pr(AGGTACTT Profile) = = Measure how well the k-mer matches the motif Does imply a good match? 39 / 59

42 Profile-Most Probable k-mer The k-mer with the highest probability in a string Considered the best matching motif Example: The Profile-most probable 8-mer in gcaaaaggtacttccaa is AGGTACTT Pr(AGGTACTT Profile) = A Profile C G T / 59

43 Problem: Zero Probabilities A Profile C G T Consensus A C G T A C G T Pr(TCGTACGT Profile) = = 0 Only one mismatch compared to consensus Should this probability really be 0? 41 / 59

44 Pseudocounts Add one to all counts Avoids zero counts A Count C G T A PseudoCount C G T / 59

45 Laplace s Rule of Succession Use pseudocounts instead of counts to compute probabilities As if we had seen one occurrence of each symbol before the main data A PseudoCount C G T A 4/9 1/9 2/9 1/9 4/9 2/9 2/9 1/9 Profile C 3/9 5/9 1/9 1/9 2/9 5/9 1/9 1/9 G 1/9 2/9 5/9 1/9 1/9 1/9 4/9 2/9 T 1/9 1/9 1/9 6/9 2/9 1/9 2/9 5/9 Pr(TCGTACGT Profile) = 1/9 5/9 5/9 6/9 4/9 5/9 4/5 5/9 = 60000/ = / 59

46 Greedy Motif Search Solve Motif Finding problem For each input string, choose the profile-most probable k-mer as the motif Greedy choice The profile is computed from motifs of earlier strings In first string, try all k-mers 44 / 59

47 Greedy Motif Search GreedyMotifSearch(DNA, k, t) 1: BestMotifs the first k-mer of each string in DNA 2: for each k-mer Motif in the first string in DNA do 3: Motif 1 Motif 4: for i 2 to t do 5: form Profile from Motif 1,..., Motif i 1 6: Motif i Profile-most probable k-mer in the i-th string in DNA 7: Motifs Motif 1,..., Motif t 8: if Score(Motifs) < Score(BestMotifs) then 9: BestMotifs Motifs 10: return BestMotifs 45 / 59

48 Performance of GreedyMotifSearch Running time O(n t k (t + n)) polynomial not exponential May not find the best motifs Early choices may lead to a wrong direction 46 / 59

49 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 47 / 59

50 Randomized Algorithms Make random choices during computation Use random number generator to toss coins or to roll dice Why Randomness Helps? If a greedy algorithm fails for some input, it will always fail for that input If a randomized algorithm fails, it is unlikely to fail again in the same way We can run it many times and choose the best output 48 / 59

51 Monte Carlo and Las Vegas Algorithms Monte Carlo algorithm May return an incorrect or inoptimal result Returns a correct answer or a good approximation with high probability (if repeated sufficiently many times) Las Vegas algorithm Always returns a correct/optimal result Very long runtime is possible but very unlikely 49 / 59

52 Turning Monte Carlo into Las Vegas 1. Run the Monte Carlo algorithm 2. If the result is good, stop. Otherwise return to Step / 59

53 Turning Monte Carlo into Las Vegas 1. Run the Monte Carlo algorithm 2. If the result is good, stop. Otherwise return to Step 1. Requires that a correct or optimal result can be easily recognized This is not the case with the Motif Finding problem The following algoriths are Monte Carlo algorithms 50 / 59

54 Randomized Motif Search Improving a set of motifs Starting with a set of motifs (one from each sequence) 1. Compute a profile from the motifs 2. Find the profile-most probable motif in each sequence The result is a potentially better set of motifs Repeat this as long as the set of motifs keeps improving Randomization Start with a random set of motifs 51 / 59

55 Randomized Motif Search RandomizedMotifSearch(DNA, k, t) 1: randomly select k-mers Motifs = (Motif 1,..., Motif t ), one from each string in DNA 2: BestMotifs Motifs 3: while forever do 4: Profile Profile(Motifs) 5: for i 1 to t do 6: Motif i Profile-most probable k-mer in the i-th string in DNA 7: Motifs Motif 1,..., Motif t 8: if Score(Motifs) < Score(BestMotifs) then 9: BestMotifs Motifs 10: else 11: return BestMotifs 52 / 59

56 Why Randomized Motif Search Works? If Motifs is a random set, the expectation is that Profile(Motifs) has about the same probability 0.25 for each symbol in each column If Motifs contains some of the true motifs, it is not random and Profile(Motifs) reflects this Then Profile(Motifs) is more likely to match the other true motifs Thus we might need just a few of the true motifs in the initial set This will happen eventually if repeated many times (may require thousands of repeats) 53 / 59

57 Gibbs Sampler Gibbs Sampler is a more refined randomized algorithm Compared to Randomized Motif Search Gibbs Sampler is More cautious More randomized 54 / 59

58 Gibbs Sampler Is More Cautious Randomized Motif Search might get some true motifs right but throw them all away in the next round Gibbs Sampler changes just one motif in each round 55 / 59

59 Gibbs Sampler Is More Randomized Randomized Motif Search uses randomness only in the beginning Gibbs Sampler uses randomness in every round Choose a random motif to discard Replace it with a random motif (from the same sequence) The second random choice is biased: a profile-randomly generated k-mer 56 / 59

60 Profile-Randomly Generated k-mer Given a Profile and a String 1. Compute probabilities of all k-mers in String 2. Choose one of the k-mers randomly but biased by the probabilities The probabilities with respect to Profile do not usually sum up to 1 and have to be normalized: Replace p 1,..., p n with p 1 /C,... p 1 /C, where C = n i=1 p i Example p 1 = 0.1, p 2 = 0.2, p 3 = 0.3 C = = 0.6 p1 /C = 1/6, p 2 /C = 1/3, p 3 /C = 1/2 p1 /C + p 2 /C + p 3 /C = 1/6 + 1/3 + 1/2 = 1 57 / 59

61 Gibbs Sampler GibbsSampler(DNA, k, t, N) 1: randomly select k-mers Motifs = (Motif 1,..., Motif t ) in each string in Dna 2: BestMotifs Motifs 3: for j 1 to N do 4: i Random(t) 5: Profile profile matrix constructed from all strings in Motifs except for Motif i 6: Motif i profile-randomly generated k-mer in the i-th sequence in DNA 7: if Score(Motifs) < Score(BestMotifs) then 8: BestMotifs Motifs 9: return BestMotifs 58 / 59

62 Gibbs Sampler Because of randomness in every round, Gibbs Sampler can keep on running without getting stuck to single solution However, it may end up exploring the same small set of solutions repeatedly: It gets stuck in a local optimum This can be corrected by restarting from a new random set of motifs every now and then 59 / 59

Algorithms for Bioinformatics

Algorithms for Bioinformatics These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics

More information

Finding Regulatory Motifs in DNA Sequences

Finding Regulatory Motifs in DNA Sequences Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics Lecture 2: Exhaustive search and motif finding 6.9.202 Outline Implanted motifs - an introduction

More information

Finding Regulatory Motifs in DNA Sequences

Finding Regulatory Motifs in DNA Sequences Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median

More information

Partial restriction digest

Partial restriction digest This lecture Exhaustive search Torgeir R. Hvidsten Restriction enzymes and the partial digest problem Finding regulatory motifs in DNA Sequences Exhaustive search methods T.R. Hvidsten: 1MB304: Discrete

More information

9/8/09 Comp /Comp Fall

9/8/09 Comp /Comp Fall 9/8/09 Comp 590-90/Comp 790-90 Fall 2009 1 Genomes contain billions of bases (10 9 ) Within these there are 10s of 1000s genes (10 4 ) Genes are 1000s of bases long on average (10 3 ) So only 1% of DNA

More information

How can one gene have such drastic effects?

How can one gene have such drastic effects? Slides revised and adapted Computational Biology course IST Ana Teresa Freitas 2011/2012 A recent microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed How can

More information

VL Bioinformatik für Nebenfächler SS2018 Woche 9

VL Bioinformatik für Nebenfächler SS2018 Woche 9 VL Bioinformatik für Nebenfächler SS2018 Woche 9 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of Teil 1 (Woche

More information

VL Bioinformatik für Nebenfächler SS2018 Woche 8

VL Bioinformatik für Nebenfächler SS2018 Woche 8 VL Bioinformatik für Nebenfächler SS2018 Woche 8 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of Teil 1 (Woche

More information

Lecture 24: Randomized Algorithms

Lecture 24: Randomized Algorithms Lecture 24: Randomized Algorithms Chapter 12 11/26/2013 Comp 465 Fall 2013 1 Randomized Algorithms Randomized algorithms incorporate random, rather than deterministic, decisions Commonly used in situations

More information

Lecture 23: Randomized Algorithms

Lecture 23: Randomized Algorithms Lecture 23: Randomized Algorithms Chapter 12 11/20/2014 Comp 555 Bioalgorithms (Fall 2014) 1 Randomized Algorithms Randomized algorithms incorporate random, rather than deterministic, decisions Commonly

More information

Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25,

Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25, Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25, 200707 Motif Finding This exposition is based on the following sources, which are all recommended reading:.

More information

Exhaustive search. CS 466 Saurabh Sinha

Exhaustive search. CS 466 Saurabh Sinha Exhaustive search CS 466 Saurabh Sinha Agenda Two different problems Restriction mapping Motif finding Common theme: exhaustive search of solution space Reading: Chapter 4. Restriction Mapping Restriction

More information

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12:00 4001 Motif finding This exposition was developed by Knut Reinert and Clemens Gröpl. It is based on the following

More information

Alignment. Peak Detection

Alignment. Peak Detection ChIP seq ChIP Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008 ChIP Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis Alignment ELAND Bowtie

More information

Matrix-based pattern discovery algorithms

Matrix-based pattern discovery algorithms Regulatory Sequence Analysis Matrix-based pattern discovery algorithms Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

Randomized algorithms. Inge Li Gørtz

Randomized algorithms. Inge Li Gørtz Randomized algorithms Inge Li Gørtz Randomized algorithms Today What are randomized algorithms? Properties of randomized algorithms Two examples: Median/Select. Quick-sort Randomized Algorithms Randomized

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

De novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes

De novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes De novo identification of motifs in one species Modified from Serafim Batzoglou s lecture notes Finding Regulatory Motifs... Given a collection of genes that may be regulated by the same transcription

More information

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM? Neyman-Pearson More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery Given a sample x 1, x 2,..., x n, from a distribution f(... #) with parameter #, want to test

More information

Intro Gene regulation Synteny The End. Today. Gene regulation Synteny Good bye!

Intro Gene regulation Synteny The End. Today. Gene regulation Synteny Good bye! Today Gene regulation Synteny Good bye! Gene regulation What governs gene transcription? Genes active under different circumstances. Gene regulation What governs gene transcription? Genes active under

More information

Finding Regulatory Motifs in DNA Sequences

Finding Regulatory Motifs in DNA Sequences Finding Regulatory Motifs in DNA Sequences Outline Implanting Patterns in Random Text Gene Regulation Regulatory Motifs The Gold Bug Problem The Motif Finding Problem Brute Force Motif Finding The Median

More information

Probabilistic models of biological sequence motifs

Probabilistic models of biological sequence motifs Probabilistic models of biological sequence motifs Discovery of new motifs Master in Bioinformatics UPF 2015-2016 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain what

More information

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

More Protein Synthesis and a Model for Protein Transcription Error Rates

More Protein Synthesis and a Model for Protein Transcription Error Rates More Protein Synthesis and a Model for Protein James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University October 3, 2013 Outline 1 Signal Patterns Example

More information

Self Similar (Scale Free, Power Law) Networks (I)

Self Similar (Scale Free, Power Law) Networks (I) Self Similar (Scale Free, Power Law) Networks (I) E6083: lecture 4 Prof. Predrag R. Jelenković Dept. of Electrical Engineering Columbia University, NY 10027, USA {predrag}@ee.columbia.edu February 7, 2007

More information

Towards More Effective Formulations of the Genome Assembly Problem

Towards More Effective Formulations of the Genome Assembly Problem Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS June 26, 2015 1 / 25 2 / 25 CENTRAL DOGMA OF BIOLOGY

More information

Inferring Protein-Signaling Networks

Inferring Protein-Signaling Networks Inferring Protein-Signaling Networks Lectures 14 Nov 14, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1

More information

CSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery

CSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery CSE 527 Autumn 2006 Lectures 8-9 (& part of 10) Motifs: Representation & Discovery 1 DNA Binding Proteins A variety of DNA binding proteins ( transcription factors ; a significant fraction, perhaps 5-10%,

More information

Bio nformatics. Lecture 3. Saad Mneimneh

Bio nformatics. Lecture 3. Saad Mneimneh Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per

More information

Bio nformatics. Lecture 23. Saad Mneimneh

Bio nformatics. Lecture 23. Saad Mneimneh Bio nformatics Lecture 23 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

DNA Binding Proteins CSE 527 Autumn 2007

DNA Binding Proteins CSE 527 Autumn 2007 DNA Binding Proteins CSE 527 Autumn 2007 A variety of DNA binding proteins ( transcription factors ; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein coding

More information

Variable Selection in Structured High-dimensional Covariate Spaces

Variable Selection in Structured High-dimensional Covariate Spaces Variable Selection in Structured High-dimensional Covariate Spaces Fan Li 1 Nancy Zhang 2 1 Department of Health Care Policy Harvard University 2 Department of Statistics Stanford University May 14 2007

More information

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix) Computat onal Biology Lecture 21 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Chapter 7: Regulatory Networks

Chapter 7: Regulatory Networks Chapter 7: Regulatory Networks 7.2 Analyzing Regulation Prof. Yechiam Yemini (YY) Computer Science Department Columbia University The Challenge How do we discover regulatory mechanisms? Complexity: hundreds

More information

Pattern Matching (Exact Matching) Overview

Pattern Matching (Exact Matching) Overview CSI/BINF 5330 Pattern Matching (Exact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Pattern Matching Exhaustive Search DFA Algorithm KMP Algorithm

More information

Quantitative Bioinformatics

Quantitative Bioinformatics Chapter 9 Class Notes Signals in DNA 9.1. The Biological Problem: since proteins cannot read, how do they recognize nucleotides such as A, C, G, T? Although only approximate, proteins actually recognize

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Hidden Markov Models 1

Hidden Markov Models 1 Hidden Markov Models Dinucleotide Frequency Consider all 2-mers in a sequence {AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT} Given 4 nucleotides: each with a probability of occurrence of. 4 Thus, one

More information

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough Search Search is a key component of intelligent problem solving Search can be used to Find a desired goal if time allows Get closer to the goal if time is not enough section 11 page 1 The size of the search

More information

Outline CSE 527 Autumn 2009

Outline CSE 527 Autumn 2009 Outline CSE 527 Autumn 2009 5 Motifs: Representation & Discovery Previously: Learning from data MLE: Max Likelihood Estimators EM: Expectation Maximization (MLE w/hidden data) These Slides: Bio: Expression

More information

A Simple Protein Synthesis Model

A Simple Protein Synthesis Model A Simple Protein Synthesis Model James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University September 3, 213 Outline A Simple Protein Synthesis Model

More information

Local Search & Optimization

Local Search & Optimization Local Search & Optimization CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 4 Some

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar

Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar Motif Finding Algorithms Sudarsan Padhy IIIT Bhubaneswar Outline Gene Regulation Regulatory Motifs The Motif Finding Problem Brute Force Motif Finding Consensus and Pattern Branching: Greedy Motif Search

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How

More information

CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms

CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms Antoine Vigneron King Abdullah University of Science and Technology December 5, 2012 Antoine Vigneron (KAUST) CS 372 Lecture

More information

Local Search & Optimization

Local Search & Optimization Local Search & Optimization CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 4 Outline

More information

6.842 Randomness and Computation Lecture 5

6.842 Randomness and Computation Lecture 5 6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its

More information

A Combined Motif Discovery Method

A Combined Motif Discovery Method University of New Orleans ScholarWorks@UNO University of New Orleans Theses and Dissertations Dissertations and Theses 8-6-2009 A Combined Motif Discovery Method Daming Lu University of New Orleans Follow

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Alexandru Tomescu, Leena Salmela, Veli Mäkinen, Esa Pitkänen 582670 Algorithms for Bioinformatics Lecture 5: Combinatorial Algorithms and Genomic Rearrangements 1.10.2015 Background

More information

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data 1 Gene Networks Definition: A gene network is a set of molecular components, such as genes and proteins, and interactions between

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny

PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny Rahul Siddharthan 1,2, Eric D. Siggia 1, Erik van Nimwegen 1,3* 1 Center for Studies in Physics and Biology, The Rockefeller University,

More information

How much non-coding DNA do eukaryotes require?

How much non-coding DNA do eukaryotes require? How much non-coding DNA do eukaryotes require? Andrei Zinovyev UMR U900 Computational Systems Biology of Cancer Institute Curie/INSERM/Ecole de Mine Paritech Dr. Sebastian Ahnert Dr. Thomas Fink Bioinformatics

More information

Complete all warm up questions Focus on operon functioning we will be creating operon models on Monday

Complete all warm up questions Focus on operon functioning we will be creating operon models on Monday Complete all warm up questions Focus on operon functioning we will be creating operon models on Monday 1. What is the Central Dogma? 2. How does prokaryotic DNA compare to eukaryotic DNA? 3. How is DNA

More information

Name: Exam 2 Solutions. March 13, 2017

Name: Exam 2 Solutions. March 13, 2017 Department of Mathematics University of Notre Dame Math 00 Finite Math Spring 07 Name: Instructors: Conant/Galvin Exam Solutions March, 07 This exam is in two parts on pages and contains problems worth

More information

CS 273 Prof. Serafim Batzoglou Prof. Jean-Claude Latombe Spring Lecture 12 : Energy maintenance (1) Lecturer: Prof. J.C.

CS 273 Prof. Serafim Batzoglou Prof. Jean-Claude Latombe Spring Lecture 12 : Energy maintenance (1) Lecturer: Prof. J.C. CS 273 Prof. Serafim Batzoglou Prof. Jean-Claude Latombe Spring 2006 Lecture 12 : Energy maintenance (1) Lecturer: Prof. J.C. Latombe Scribe: Neda Nategh How do you update the energy function during the

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

Gene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji

Gene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji Gene Regula*on, ChIP- X and DNA Mo*fs Statistics in Genomics Hongkai Ji (hji@jhsph.edu) Genetic information is stored in DNA TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTC

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

networks in molecular biology Wolfgang Huber

networks in molecular biology Wolfgang Huber networks in molecular biology Wolfgang Huber networks in molecular biology Regulatory networks: components = gene products interactions = regulation of transcription, translation, phosphorylation... Metabolic

More information

Algorithms for Molecular Biology

Algorithms for Molecular Biology Algorithms for Molecular Biology BioMed Central Research A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series Sara C Madeira* 1,2,3 and Arlindo

More information

Greedy Algorithms. CS 498 SS Saurabh Sinha

Greedy Algorithms. CS 498 SS Saurabh Sinha Greedy Algorithms CS 498 SS Saurabh Sinha Chapter 5.5 A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of length l. Enumerative approach O(l n

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Computing Bi-Clusters for Microarray Analysis

Computing Bi-Clusters for Microarray Analysis December 21, 2006 General Bi-clustering Problem Similarity of Rows (1-5) Cheng and Churchs model General Bi-clustering Problem Input: a n m matrix A. Output: a sub-matrix A P,Q of A such that the rows

More information

Introduction to Computer Science and Programming for Astronomers

Introduction to Computer Science and Programming for Astronomers Introduction to Computer Science and Programming for Astronomers Lecture 8. István Szapudi Institute for Astronomy University of Hawaii March 7, 2018 Outline Reminder 1 Reminder 2 3 4 Reminder We have

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

BIOINFORMATICS. Neighbourhood Thresholding for Projection-Based Motif Discovery. James King, Warren Cheung and Holger H. Hoos

BIOINFORMATICS. Neighbourhood Thresholding for Projection-Based Motif Discovery. James King, Warren Cheung and Holger H. Hoos BIOINFORMATICS Vol. no. 25 Pages 7 Neighbourhood Thresholding for Projection-Based Motif Discovery James King, Warren Cheung and Holger H. Hoos University of British Columbia Department of Computer Science

More information

Honors Biology Reading Guide Chapter 11

Honors Biology Reading Guide Chapter 11 Honors Biology Reading Guide Chapter 11 v Promoter a specific nucleotide sequence in DNA located near the start of a gene that is the binding site for RNA polymerase and the place where transcription begins

More information

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns.

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns. Curriculum, fourth lecture: Niels Richard Hansen November 30, 2011 NRH: Handout pages 1-8 (NRH: Sections 2.1-2.5) Keywords: binomial distribution, dice games, discrete probability distributions, geometric

More information

CS-E5880 Modeling biological networks Gene regulatory networks

CS-E5880 Modeling biological networks Gene regulatory networks CS-E5880 Modeling biological networks Gene regulatory networks Jukka Intosalmi (based on slides by Harri Lähdesmäki) Department of Computer Science Aalto University January 12, 2018 Outline Modeling gene

More information

GS 559. Lecture 12a, 2/12/09 Larry Ruzzo. A little more about motifs

GS 559. Lecture 12a, 2/12/09 Larry Ruzzo. A little more about motifs GS 559 Lecture 12a, 2/12/09 Larry Ruzzo A little more about motifs 1 Reflections from 2/10 Bioinformatics: Motif scanning stuff was very cool Good explanation of max likelihood; good use of examples (2)

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, 2015 31 11 Motif Finding Sources for this section: Rouchka, 1997, A Brief Overview of Gibbs Sapling. J. Buhler, M. Topa:

More information

Sequence Alignment. Johannes Starlinger

Sequence Alignment. Johannes Starlinger Sequence Alignment Johannes Starlinger his Lecture Approximate String Matching Edit distance and alignment Computing global alignments Local alignment Johannes Starlinger: Bioinformatics, Summer Semester

More information

Simulation Modeling. Random Numbers

Simulation Modeling. Random Numbers Mathematical Modeling Lia Vas Simulation Modeling. Random Numbers In many cases one of the following situations might occur: - It is not possible to observe the behavior directly or to conduct experiments.

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

3.B.1 Gene Regulation. Gene regulation results in differential gene expression, leading to cell specialization.

3.B.1 Gene Regulation. Gene regulation results in differential gene expression, leading to cell specialization. 3.B.1 Gene Regulation Gene regulation results in differential gene expression, leading to cell specialization. We will focus on gene regulation in prokaryotes first. Gene regulation accounts for some of

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 26 Computational Intractability Polynomial Time Reductions Sofya Raskhodnikova S. Raskhodnikova; based on slides by A. Smith and K. Wayne L26.1 What algorithms are

More information

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance Jingbo Shang, Jian Peng, Jiawei Han University of Illinois, Urbana-Champaign May 6, 2016 Presented by Jingbo Shang 2 Outline

More information

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: Hidden Markov Models Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: www.ioalgorithms.info Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm

More information

18.440: Lecture 19 Normal random variables

18.440: Lecture 19 Normal random variables 18.440 Lecture 19 18.440: Lecture 19 Normal random variables Scott Sheffield MIT Outline Tossing coins Normal random variables Special case of central limit theorem Outline Tossing coins Normal random

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

A Phylogenetic Gibbs Recursive Sampler for Locating Transcription Factor Binding Sites

A Phylogenetic Gibbs Recursive Sampler for Locating Transcription Factor Binding Sites A for Locating Transcription Factor Binding Sites Sean P. Conlan 1 Lee Ann McCue 2 1,3 Thomas M. Smith 3 William Thompson 4 Charles E. Lawrence 4 1 Wadsworth Center, New York State Department of Health

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Network motifs in the transcriptional regulation network (of Escherichia coli):

Network motifs in the transcriptional regulation network (of Escherichia coli): Network motifs in the transcriptional regulation network (of Escherichia coli): Janne.Ravantti@Helsinki.Fi (disclaimer: IANASB) Contents: Transcription Networks (aka. The Very Boring Biology Part ) Network

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information