Algorithms for Bioinformatics

Size: px

Start display at page:

Download "Algorithms for Bioinformatics"

Carol Wilcox
6 years ago
Views:

1 These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from Algorithms for Bioinformatics Lecture 2: Exhaustive search and randomized algorithms for motif discovery

2 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 2 / 59

3 DNA Sequence Motif A DNA sequence motif is a recurring pattern in DNA presumed to have some biological funtion. Often the occurences of the motif in DNA are binding sites where proteins such as transcription factors can attach. 3 / 59

4 Transcription Factors Every gene contains a regulatory region typically stretching bp upstream of the transcriptional start site. Transcription factor (TF) is a regulatory protein that can bind to a regulatory region and promote or inhibit the transcription of the gene. A single TF typically regulates multiple (co-regulated) genes. 4 / 59

5 Transcription Factor Binding Sites and Motifs Transcription Factor Binding Site (TFBS) is the DNA location where the TF binds to. May be located anywhere in the regulatory region. A TF binds to a specific kind of DNA sequence. Similar but not identical in different co-regulated genes. Some positions are more important and thus conserved while other positions allow variation. Motif is a pattern that describes the TFBS sequences of a given TF. 5 / 59

6 TBFSs ATCCCG gene TTCC GG gene ATCCCG gene AT GCCG gene AT GCC C gene 6 / 59

7 Motif Representations Set of sequences Consensus sequence Profile matrix TGGGGGA TGAGAGA TGGGGGA TGAGAGA TGAGGGA TGAGGGA A C G T Motif logo 7 / 59

8 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 8 / 59

9 Identifying Motifs Identify a set of co-regulated genes. Extract their regulatory regions. Find a pattern shared by all of the regions: the motif. The motif can be used for finding other (previously unknown) genes regulated by the same TF. cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc 9 / 59

10 The Motif Finding Problem: Formulation Goal: Find the best motif in a set of DNA sequences Input: A collection of t strings Dna and an integer k Output: The collection of t k-mers, one from each input string, forming the best motif. We will later consider what is the best motif. t = 5 k = 8 cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc }{{} n = / 59

11 Planted (k, d)-motif Problem Artificially generated problem instance 1. Generate a random k-mer M (the motif) 2. Generate t random sequences of length n 3. Implant M into a random position in each of the t sequences 4. In each implanted copy of M, make d random mutations 11 / 59

12 Example: Generate sequences gatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtac cctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttata tgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttca ttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagc 12 / 59

13 Example: Implant motif AAAAAAAAGGGGGGG gatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggg cctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttata tgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggg 13 / 59

14 Where are the implanted motifs? gatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggg cctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttata tgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggg 14 / 59

15 Example: Add four mutations per motif occurrence gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg cctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttata tgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgg 15 / 59

16 Where are the implanted motifs??? gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg cctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg ttggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttata tgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgca gcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttca ttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaa caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgg 16 / 59

17 Why finding (15,4)-motifs is hard? gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg Occurrences can differ in 8 positions AgAAgAAAGGttGGG caataaaacggcggg Difficult to recognize similarity even if you know the positions 17 / 59

18 Planted (k, d)-motif Problem Common benchmark or challenge problem for motif finding Adjustable difficulty For example, the planted (15, 4)-motif problem is difficult but not impossible Real data is noisier and more complex No upper bound d on mutation Some positions have little variation, others have a lot 18 / 59

19 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 19 / 59

20 Scoring Motifs The output of motif finding is a set of k-mers, one from each input sequence In a desired output, the k-mers are as similar to each other as possible, representing a coherent motif How to measure the goodness of the motif? 20 / 59

21 Consensus Motif and Score a G g t a c T t C c A t a c g t Motif occurrences a c g t T A g t a c g t C c A t C c g t a c g G A Counts C G T Consensus A C G T A C G T t motifs occurrences, one from each sequence Count symbols in each column Consensus formed by most frequent symbols Score is the number of differences from consensus Score =10 21 / 59

22 Consensus Motif and Score Good model for planted motif problem In real data, mutations should be expensive in conserved positions but much cheaper in variable positions 22 / 59

23 BruteForceMotifSearch Compute the score for every possible combination of motifs Output the set of motifs with the smallest score 23 / 59

24 Running Time of BruteForceMotifSearch (n k + 1) different k-mers in each sequence (n k + 1) t different combinations of motifs kt time to compute score for one set of motifs kt(n k + 1) t = O(ktn t ) time in total E.g. for t = 20, n = 600, k = 15 we must perform approximately computations it would take billions of years 24 / 59

25 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 25 / 59

26 The Median String Problem Given a set of t DNA sequences find a pattern that appears in all t sequences with the minimum number of total mismatches This pattern will be the shared motif 26 / 59

27 Hamming Distance The Hamming distance d(v, w) the number of mismatches between two k-mers v and w For example: d(aaaaaa, ACAAAC) = 2 27 / 59

28 Computing Score a G g t a c T t 2 C c A t a c g t 2 Motifs a c g t T A g t 2 a c g t C c A t 2 C c g t a c g G 2 Score(Motifs) =10 Consensus(Motifs) A C G T A C G T Score is the number of mismatching symbols Can be computed column by column or row by row Row sums are Hamming distances 28 / 59

29 Computing Score Define Motifs = {Motif 1, Motif 2,..., Motif t } d(pattern, Motifs) = t i=1 d(pattern, Motif i) Then Score(Motifs) = d(consensus(motifs), Motifs) 29 / 59

30 Best Match Distance Assume String > Pattern = k The best match distance d(pattern, String) is the smallest Hamming distance d(pattern, Motif ) between Pattern and any k-mer Motif in String Example: d(acgtacgt, gcaaaaggtacttccaa) = 2 Generalize for a set of strings Dna = {Dna 1, Dna 2,..., Dna t } d(pattern, Dna) = t i=1 d(pattern, Dna i) 30 / 59

31 The Median String Problem Goal: Given a set of DNA sequences, find a median string Input: A collection of strings Dna and an integer k Output: A k-mer Pattern minimizing d(pattern, Dna) among all k-mers Pattern 31 / 59

32 Motif Finding Problem = Median String Problem Input: Dna, k Motifs: output of Motif Finding (t k-mers) Pattern: output of Median String Score(Motifs) = d(pattern, Dna) 32 / 59

33 Motif Finding Problem = Median String Problem Input: Dna, k Motifs: output of Motif Finding (t k-mers) Pattern: output of Median String Score(Motifs) = d(pattern, Dna) Why? If Score(Motifs) < d(pattern, Dna), we could choose Consensus(Motifs) as a better Pattern If Score(Motifs) > d(pattern, Dna), we could choose the best match occurrences of Pattern as better Motifs 32 / 59

34 Median String Algorithm MedianString(DNA, k) 1: BestPattern AAA...A 2: for each k-mer Pattern from AAA...A to TTT...T do 3: if d(pattern, DNA) < d(bestpattern, DNA) then 4: BestPattern Pattern 5: return BestPattern 33 / 59

35 Running Time of MedianString 4 k different k-mers O(k n) time to compute the best match distance to one string O(knt4 k ) time in total E.g. for t = 20, n = 600, k = 15 this is about about still a lot but much less than / 59

36 Running Time of MedianString 4 k different k-mers O(k n) time to compute the best match distance to one string O(knt4 k ) time in total E.g. for t = 20, n = 600, k = 15 this is about about still a lot but much less than Reformulating a problem can help! 34 / 59

37 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 35 / 59

38 Search Space BruteForceMotifSearch and MedianString algorithms have exponential running time This is because the search space, the set of possible solutions, is exponential n t different ways to choose Motifs 4 k different ways to choose Pattern 36 / 59

39 Exploring Only Part of Search Space Branch and bound algorithms (covered in study groups) Avoid regions that cannot improve solution Still exponential in the worst case Greedy algorithms Search the most promising directions No guarantee of finding an optimal solution Randomized algorithms Add randomness to greedy search Avoids getting stuck in a dead end 37 / 59

40 Profile Matrix a G g t a c T t C c A t a c g t Motifs a c g t T A g t a c g t C c A t C c g t a c g G A Count(Motifs) C G T A Profile(Motifs) C G T Profile represents the probability of each nucleotide in each position More detailed summary of the set of motifs than consensus Consensus(Motifs) A C G T A C G T 38 / 59

41 k-mer Probabilities A Profile C G T The probability of a k-mer given a profile Pr(AGGTACTT Profile) = = Measure how well the k-mer matches the motif Does imply a good match? 39 / 59

42 Profile-Most Probable k-mer The k-mer with the highest probability in a string Considered the best matching motif Example: The Profile-most probable 8-mer in gcaaaaggtacttccaa is AGGTACTT Pr(AGGTACTT Profile) = A Profile C G T / 59

43 Problem: Zero Probabilities A Profile C G T Consensus A C G T A C G T Pr(TCGTACGT Profile) = = 0 Only one mismatch compared to consensus Should this probability really be 0? 41 / 59

44 Pseudocounts Add one to all counts Avoids zero counts A Count C G T A PseudoCount C G T / 59

45 Laplace s Rule of Succession Use pseudocounts instead of counts to compute probabilities As if we had seen one occurrence of each symbol before the main data A PseudoCount C G T A 4/9 1/9 2/9 1/9 4/9 2/9 2/9 1/9 Profile C 3/9 5/9 1/9 1/9 2/9 5/9 1/9 1/9 G 1/9 2/9 5/9 1/9 1/9 1/9 4/9 2/9 T 1/9 1/9 1/9 6/9 2/9 1/9 2/9 5/9 Pr(TCGTACGT Profile) = 1/9 5/9 5/9 6/9 4/9 5/9 4/5 5/9 = 60000/ = / 59

46 Greedy Motif Search Solve Motif Finding problem For each input string, choose the profile-most probable k-mer as the motif Greedy choice The profile is computed from motifs of earlier strings In first string, try all k-mers 44 / 59

47 Greedy Motif Search GreedyMotifSearch(DNA, k, t) 1: BestMotifs the first k-mer of each string in DNA 2: for each k-mer Motif in the first string in DNA do 3: Motif 1 Motif 4: for i 2 to t do 5: form Profile from Motif 1,..., Motif i 1 6: Motif i Profile-most probable k-mer in the i-th string in DNA 7: Motifs Motif 1,..., Motif t 8: if Score(Motifs) < Score(BestMotifs) then 9: BestMotifs Motifs 10: return BestMotifs 45 / 59

48 Performance of GreedyMotifSearch Running time O(n t k (t + n)) polynomial not exponential May not find the best motifs Early choices may lead to a wrong direction 46 / 59

49 Outline DNA Sequence Motifs Motif Finding Problem Brute Force Motif Finding Median String Problem Greedy Motif Search Randomized Algorithms 47 / 59

50 Randomized Algorithms Make random choices during computation Use random number generator to toss coins or to roll dice Why Randomness Helps? If a greedy algorithm fails for some input, it will always fail for that input If a randomized algorithm fails, it is unlikely to fail again in the same way We can run it many times and choose the best output 48 / 59

51 Monte Carlo and Las Vegas Algorithms Monte Carlo algorithm May return an incorrect or inoptimal result Returns a correct answer or a good approximation with high probability (if repeated sufficiently many times) Las Vegas algorithm Always returns a correct/optimal result Very long runtime is possible but very unlikely 49 / 59

52 Turning Monte Carlo into Las Vegas 1. Run the Monte Carlo algorithm 2. If the result is good, stop. Otherwise return to Step / 59

53 Turning Monte Carlo into Las Vegas 1. Run the Monte Carlo algorithm 2. If the result is good, stop. Otherwise return to Step 1. Requires that a correct or optimal result can be easily recognized This is not the case with the Motif Finding problem The following algoriths are Monte Carlo algorithms 50 / 59

54 Randomized Motif Search Improving a set of motifs Starting with a set of motifs (one from each sequence) 1. Compute a profile from the motifs 2. Find the profile-most probable motif in each sequence The result is a potentially better set of motifs Repeat this as long as the set of motifs keeps improving Randomization Start with a random set of motifs 51 / 59

55 Randomized Motif Search RandomizedMotifSearch(DNA, k, t) 1: randomly select k-mers Motifs = (Motif 1,..., Motif t ), one from each string in DNA 2: BestMotifs Motifs 3: while forever do 4: Profile Profile(Motifs) 5: for i 1 to t do 6: Motif i Profile-most probable k-mer in the i-th string in DNA 7: Motifs Motif 1,..., Motif t 8: if Score(Motifs) < Score(BestMotifs) then 9: BestMotifs Motifs 10: else 11: return BestMotifs 52 / 59

56 Why Randomized Motif Search Works? If Motifs is a random set, the expectation is that Profile(Motifs) has about the same probability 0.25 for each symbol in each column If Motifs contains some of the true motifs, it is not random and Profile(Motifs) reflects this Then Profile(Motifs) is more likely to match the other true motifs Thus we might need just a few of the true motifs in the initial set This will happen eventually if repeated many times (may require thousands of repeats) 53 / 59

57 Gibbs Sampler Gibbs Sampler is a more refined randomized algorithm Compared to Randomized Motif Search Gibbs Sampler is More cautious More randomized 54 / 59

58 Gibbs Sampler Is More Cautious Randomized Motif Search might get some true motifs right but throw them all away in the next round Gibbs Sampler changes just one motif in each round 55 / 59

59 Gibbs Sampler Is More Randomized Randomized Motif Search uses randomness only in the beginning Gibbs Sampler uses randomness in every round Choose a random motif to discard Replace it with a random motif (from the same sequence) The second random choice is biased: a profile-randomly generated k-mer 56 / 59

60 Profile-Randomly Generated k-mer Given a Profile and a String 1. Compute probabilities of all k-mers in String 2. Choose one of the k-mers randomly but biased by the probabilities The probabilities with respect to Profile do not usually sum up to 1 and have to be normalized: Replace p 1,..., p n with p 1 /C,... p 1 /C, where C = n i=1 p i Example p 1 = 0.1, p 2 = 0.2, p 3 = 0.3 C = = 0.6 p1 /C = 1/6, p 2 /C = 1/3, p 3 /C = 1/2 p1 /C + p 2 /C + p 3 /C = 1/6 + 1/3 + 1/2 = 1 57 / 59

61 Gibbs Sampler GibbsSampler(DNA, k, t, N) 1: randomly select k-mers Motifs = (Motif 1,..., Motif t ) in each string in Dna 2: BestMotifs Motifs 3: for j 1 to N do 4: i Random(t) 5: Profile profile matrix constructed from all strings in Motifs except for Motif i 6: Motif i profile-randomly generated k-mer in the i-th sequence in DNA 7: if Score(Motifs) < Score(BestMotifs) then 8: BestMotifs Motifs 9: return BestMotifs 58 / 59

62 Gibbs Sampler Because of randomness in every round, Gibbs Sampler can keep on running without getting stuck to single solution However, it may end up exploring the same small set of solutions repeatedly: It gets stuck in a local optimum This can be corrected by restarting from a new random set of motifs every now and then 59 / 59

Algorithms for Bioinformatics

These slides are based on previous years slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics