Seed-based sequence search: some theory and some applications


1 Seed-based sequence search: some theory and some applications Gregory Kucherov, CNRS/LIGM, Marne-la-Vallée, joint work with Laurent Noé (LIFL, Lille). Journées GDR IM, Lyon, January 2013

2-5 Filtration for approximate pattern matching [sequence of illustrative figures]

6-8 Filtration for sequence alignment [sequence of illustrative figures]

9-12 Contiguous seeds
Example: pattern size 18, 2 allowed substitution errors (Hamming distance)
######
ATCAGTGAATGCGCAAGA
|||||:||||||:|||||
ATCAGCGAATGCTCAAGA
6 consecutive matching characters (######) constitute a lossless filter
7 consecutive matching characters (#######) constitute a lossy filter (≈ 5% of occurrences missed)
BLAST [Altschul et al., J.Mol.Biol. 1990]: about 43,000 citations according to Google Scholar
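
Both claims can be checked by brute force over all error placements. A minimal Python sketch (not part of the slides; the function name is mine):

    from itertools import combinations

    def contiguous_lossless(run_len, m, k):
        # True iff every length-m similarity with k mismatches contains
        # run_len consecutive matches, i.e. the contiguous filter is lossless
        for errs in combinations(range(m), k):
            sim = [i not in errs for i in range(m)]   # True = match
            if not any(all(sim[i + j] for j in range(run_len))
                       for i in range(m - run_len + 1)):
                return False
        return True

    print(contiguous_lossless(6, 18, 2))   # True:  ###### is lossless
    print(contiguous_lossless(7, 18, 2))   # False: ####### misses some occurrences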

13-16 Spaced seeds
Definition (spaced seed). A spaced seed π is a pattern composed of symbols {#, -}, where # stands for a match (|), and - (joker) stands for either a match (|) or a mismatch (:). s(π): span (length), w(π): weight (number of #'s).
Example: seed π = ###--#-### (of weight 7!) is lossless for pattern length 18 and 2 errors
     ###--#-###
ATCAGTGAATGCGCAAGA
||||||||:||:||||||
ATCAGTGACTGTGCAAGA
seed π = ###-#--###-# (of weight 8!!) is lossless too

17 Spaced seeds (cont) Example: consider similarities of length 64 with Pr(|) = 0.7 and Pr(:) = 0.3 (Bernoulli model)
...atcagtgaatgcgtaagact...
   |||||:|||||:|:||||:|
...ATCAGCGAATGTGCAAGAGT...
A random similarity contains
  seed ########### (BLAST, weight 11) with probability ≈ 0.30
  seed ###-#--#-#--##-### (PatternHunter, weight 11) with probability ≈ 0.47
  seed ########## (weight 10) with probability ≈ 0.4
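
These probabilities can be reproduced approximately by simulation; a minimal sketch, assuming the Bernoulli model above (names are mine):

    import random

    def hit_probability(seed, m, p_match, trials=100000):
        # Monte-Carlo estimate of Pr(seed occurs in a random length-m similarity)
        anchors = [i for i, c in enumerate(seed) if c == '#']
        span = len(seed)
        hits = 0
        for _ in range(trials):
            sim = [random.random() < p_match for _ in range(m)]
            if any(all(sim[t + i] for i in anchors) for t in range(m - span + 1)):
                hits += 1
        return hits / trials

    print(hit_probability('#' * 11, 64, 0.7))              # ~0.30 (BLAST)
    print(hit_probability('###-#--#-#--##-###', 64, 0.7))  # ~0.47 (PatternHunter)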

18 Filtration: Sensitivity and Selectivity
Sensitivity (recall) = measure of true similarities detected (true positives)
Selectivity (precision, specificity) = measure of spurious candidates (false positives)
text, pattern → seed filter → candidate occurrences → (costly) check
Selectivity is measured by the number of #'s
In the lossless case, sensitivity = 100%: all true occurrences are reported

19-21 Why are spaced seeds better?
Some probabilistic observations:
  For spaced seeds, occurrences at successive positions are more independent events
  For contiguous vs spaced seeds of the same weight, the expected number of occurrences is (roughly) the same, but the probabilities of at least one occurrence are very different
Question: Which of aababb and aaaaaa is more likely to appear first in an i.i.d. binary sequence?
Answer: aababb
Spaced seeds have a smaller waiting time than contiguous seeds
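
The waiting-time claim is easy to verify empirically; a small simulation sketch (not from the talk):

    import random

    def mean_waiting_time(word, trials=20000):
        # average position at which `word` first ends in an i.i.d.
        # uniform sequence over {a, b}
        n, total = len(word), 0
        for _ in range(trials):
            buf, steps = '', 0
            while buf != word:
                buf = (buf + random.choice('ab'))[-n:]
                steps += 1
            total += steps
        return total / trials

    print(mean_waiting_time('aababb'))   # ~64  = 2^6     (no self-overlap)
    print(mean_waiting_time('aaaaaa'))   # ~126 = 2^7 - 2 (maximal self-overlap)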

22-27 Seed Analysis: Formalization
π = ###--#-###
###--#-###
ATCAGTGAATGCGCAAGA
|||||:|||||:||||||
ATCAGCGAATGTGCAAGA
111110111110111111   (the similarity encoded as a binary string)
Lossless case: every similarity of length m with k mismatches (0's) and (m - k) matches (1's) is covered by π
Lossy case:
  specify a probabilistic model F of similarity, e.g. Pr(1) = p, Pr(0) = 1 - p
  sensitivity_F(π) = Pr(seed covers a random similarity)
  specify a probabilistic model B of random alignment ("background")
  selectivity_B(π) = Pr(seed covers a random alignment)

28-29 Basic Problems
Lossless case: for a given seed π, verify if π is lossless for similarity length m and number of errors k; construct lossless seeds of maximal weight (selectivity)
Lossy case: for a given seed π and models F and B, compute sensitivity_F(π) and selectivity_B(π); construct seeds that maximize sensitivity_F(π) (minimize selectivity_B(π))

30 Spaced Seeds: Historical Remarks
  Burkhardt, Kärkkäinen, CPM 01: spaced seeds for (lossless) approximate pattern matching
  Ma, Tromp, Li 02 (PatternHunter): spaced seeds for (lossy) similarity search in DNA sequences
  YASS [Noé, Kucherov 04]
  references in ...noe/spaced_seeds.html

31 (m, k)-problem
m: length of similarity, k: number of errors (substitutions)
Definition ((m, k)-problem). A seed π ∈ {#,-}^s is (m, k)-lossless iff all C(m, k) similarities are covered by π
Reminder: similarities are represented as binary strings over {0, 1}, where # matches 1 and - matches both 0 and 1
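
For small m and k the definition can be checked directly; a brute-force sketch over all C(m, k) similarities (function names are mine):

    from itertools import combinations

    def covers(seed, sim):
        # True iff some placement of the seed aligns all its #'s with 1's
        span = len(seed)
        return any(all(sim[t + j] for j, c in enumerate(seed) if c == '#')
                   for t in range(len(sim) - span + 1))

    def is_lossless(seed, m, k):
        # enumerate all C(m, k) similarities with exactly k mismatches (0's)
        return all(covers(seed, [i not in errs for i in range(m)])
                   for errs in combinations(range(m), k))

    print(is_lossless('###--#-###', 18, 2))    # True (the weight-7 seed above)
    print(is_lossless('###-#--###-#', 18, 2))  # True (the weight-8 seed above)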

32-33 Verifying losslessness: existing algorithms [Burkhardt, Kärkkäinen 01]
Theorem. One can verify if a seed π is (m, k)-lossless in time O(m · k · 2^s(π))
2^s(π) can be improved to 2^(s(π)-w(π))

34-38 Seed automaton [Buhler, Keich, Sun 03] [Kucherov, Noé, Roytberg 05]
seed π = #-#-#
detected similarity fragments: {10101, 10111, 11101, 11111}
[automaton diagram: states q0, ..., q9 and a final state qf]
the number of states can be bounded by O(w(π) · 2^(s(π)-w(π)))
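
The automaton can be simulated on the fly with a shift-and bitset whose reachable values play the role of states; a minimal sketch (my own encoding, not the minimized automaton of the slide):

    def detects(seed, sim):
        # bit i of D is set iff the last i+1 letters are compatible with the
        # seed prefix of length i+1; '-' accepts 0 and 1, '#' requires a 1
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]   # masks[letter]
        final = 1 << (len(seed) - 1)
        D = 0
        for b in sim:
            D = ((D << 1) | 1) & masks[b]
            if D & final:
                return True   # a full seed occurrence ends here
        return False

    # the four detected fragments of #-#-# from the slide:
    for frag in ([1,0,1,0,1], [1,0,1,1,1], [1,1,1,0,1], [1,1,1,1,1]):
        print(detects('#-#-#', frag))   # True each time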

39-44 Verifying losslessness: automata technique
π = #-#, m = 7, k = 2
[seed automaton: states q0, q1, q2, q3 and a final state qf; product with the automaton counting positions up to m]
Verify that every word of length m with k 0's is accepted

45-56 π = #-#-#, m = 7, k = 1
[seed automaton diagram; slides 45-56 step through the forward computation of the cost values state by state]
cost(0) = 1, cost(1) = 0
cost(q^(0) → q^(1) → ... → q^(i)) = Σ_j cost(q^(j) → q^(j+1))
cost(q) = min { cost(q_init → ... → q) }
Theorem. π is (m, k)-lossless iff, after reading m letters, cost(q) > k for every non-final state q
forward update: cost(q) = min over transitions q' → q of { cost(q') + cost(q' → q) }
running time O(m · w(π) · 2^(s(π)-w(π)))
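
The same computation, written directly over the reachable shift-and states (a sketch of this min-plus DP; the state encoding is mine, not the slide's automaton):

    def min_errors_undetected(seed, m):
        # minimal number of 0's (errors) in a length-m similarity NOT detected
        # by the seed; the seed is (m, k)-lossless iff this value is > k
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]
        final = 1 << (len(seed) - 1)
        cost = {0: 0}                        # state -> min errors so far
        for _ in range(m):
            nxt = {}
            for D, c in cost.items():
                for b in (0, 1):
                    D2 = ((D << 1) | 1) & masks[b]
                    if D2 & final:
                        continue             # detected: not a counter-example
                    c2 = c + (1 - b)
                    if c2 < nxt.get(D2, c2 + 1):
                        nxt[D2] = c2
            cost = nxt
        return min(cost.values(), default=float('inf'))

    print(min_errors_undetected('#-#-#', 7))          # 2, so lossless for k = 1
    print(min_errors_undetected('###-#--###-#', 18))  # 3, so lossless for k = 2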

57-60 Sensitivity computation [Kucherov, Noé, Roytberg 05]
π = #-#-#, m = 7, F: Pr(1) = 0.8, Pr(0) = 0.2
[seed automaton diagram: states q0, ..., qf]
cost(0) = 0.2, cost(1) = 0.8
cost(q^(0) → q^(1) → ... → q^(i)) = Π_j cost(q^(j) → q^(j+1))
cost(q) = Σ { cost(q_init → ... → q) }
Theorem. sensitivity_F(π) = cost(q_Final)
forward update: cost(q) = Σ over transitions q' → q of cost(q') · cost(q' → q)
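
Replacing (min, +) by (+, ×) in the previous sketch gives the exact sensitivity; shown below with the length-64 example of slide 17 as a check (same hand-rolled state encoding as before):

    def sensitivity(seed, m, p1):
        # Pr(seed detects a random length-m similarity with Pr(1) = p1)
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]
        final = 1 << (len(seed) - 1)
        prob = {0: 1.0}
        detected = 0.0
        for _ in range(m):
            nxt = {}
            for D, pr in prob.items():
                for b, pb in ((0, 1.0 - p1), (1, p1)):
                    D2 = ((D << 1) | 1) & masks[b]
                    if D2 & final:
                        detected += pr * pb     # the final state is absorbing
                    else:
                        nxt[D2] = nxt.get(D2, 0.0) + pr * pb
            prob = nxt
        return detected

    print(round(sensitivity('#' * 11, 64, 0.7), 2))              # 0.30
    print(round(sensitivity('###-#--#-#--##-###', 64, 0.7), 2))  # 0.47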

61-64 Common setting: dynamic programming on weighted automata over semirings [Mohri, Handbook of Weighted Automata 09]
semiring S = (S, ⊕, ⊗, 0, 1)
  verifying losslessness: tropical semiring (S = N ∪ {∞}, ⊕ = min, ⊗ = +, 0 = ∞, 1 = 0)
  computing sensitivity: probability semiring (S = R+, ⊕ = +, ⊗ = ×, 0 = 0, 1 = 1)
  counting the number of detected similarities (lossless setting): counting semiring (S = N, ⊕ = +, ⊗ = ×, 0 = 0, 1 = 1)
  detecting similarities exceeding a given (additive) score: max-plus algebra (S = R ∪ {-∞}, ⊕ = max, ⊗ = +, 0 = -∞, 1 = 0)
  similarity with maximum probability to be detected (Viterbi algorithm): Viterbi semiring (S = [0, 1], ⊕ = max, ⊗ = ×, 0 = 0, 1 = 1)
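
The two sketches above differ only in the pair of operations used, which is exactly this semiring abstraction; a generic version (names and encoding mine):

    def seed_dp(seed, m, plus, times, zero, one, letter_weight):
        # forward DP over the (shift-and) seed automaton in an arbitrary semiring
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]
        final = 1 << (len(seed) - 1)
        vals, absorbed = {0: one}, zero
        for _ in range(m):
            nxt = {}
            for D, v in vals.items():
                for b in (0, 1):
                    w = times(v, letter_weight(b))
                    D2 = ((D << 1) | 1) & masks[b]
                    if D2 & final:
                        absorbed = plus(absorbed, w)
                    else:
                        nxt[D2] = plus(nxt.get(D2, zero), w)
            vals = nxt
        return absorbed, vals

    # probability semiring: sensitivity of #-#-# for m = 7, Pr(1) = 0.8
    s, _ = seed_dp('#-#-#', 7, lambda a, b: a + b, lambda a, b: a * b,
                   0.0, 1.0, lambda b: 0.8 if b else 0.2)
    print(s)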

65-67 Lossless property is NP-complete [Nicolas, Rivals 05] [Ma, Li 06]
From the previous algorithm: one can verify if a seed π is (m, k)-lossless in time O(m · w(π) · 2^(s(π)-w(π)))
Theorem. Verifying if a given seed π is (m, k)-lossless is NP-hard
A similar result holds for the lossy case:
Theorem. Computing the sensitivity of a given seed π under a Bernoulli model of similarities is NP-hard

68 Design of lossless seeds
similarities: length m, number of errors k
seeds: span s, weight w (number of #'s)
  design lossless seeds (for given s, w) that maximize k for a given m (or minimize m for a given k)
  given m and k, design lossless seeds that maximize w

69-72 Asymptotics [Kucherov, Noé, Roytberg 04] [Farach-Colton, Landau, Sahinalp, Tsur 05]
Theorem (fixed number of jokers). Let the number of jokers d be fixed. The maximal number k of errors detected by an optimal (m, k)-lossless seed of span s is ((d+2)/2) · (m/s) ± O(max{1, m/s^2})
Theorem (fixed number of errors). Let the number k of errors be fixed. Then the maximal weight w of an optimal (m, k)-lossless seed is m - Θ(m^(k/(k+1)))
Proofs are constructive, but the results are only asymptotic

73-75 Practical design: periodic seeds [Kucherov, Noé, Roytberg 04]
Notation: [π1, π2]^i = (π1 π2)^i π1
Theorem (periodic seeds). If seed π solves a cyclic (m, k)-problem, then for all i, seed [π, -^(m-s(π))]^i is (m(i+1) + s(π) - 1, k)-lossless.
Example: ###-# solves the cyclic (7, 2)-problem. Then [###-#, --]^i is (11 + 7i, 2)-lossless for all i. For i = 1, 2, 3 this produces optimal (maximally weighted) seeds. E.g. ###-#--###-# is (18, 2)-lossless.
From the asymptotic bound (previous slide), periodic seeds are not asymptotically optimal, as they have a constant fraction of jokers
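
The construction itself is one line; a sketch (the losslessness guarantees in the comments follow the theorem above):

    def periodic_seed(pi, m, i):
        # [pi, -^(m - s(pi))]^i = (pi followed by jokers) repeated i times, then pi
        return (pi + '-' * (m - len(pi))) * i + pi

    print(periodic_seed('###-#', 7, 1))   # ###-#--###-#          (18, 2)-lossless
    print(periodic_seed('###-#', 7, 2))   # ###-#--###-#--###-#   (25, 2)-lossless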

76-79 Multiple seeds
Using several seeds simultaneously, in a complementary fashion
Example: m = 18, k = 3

seed(s)                                                  weight   Pr(false positive) (i.i.d. on 4 letters)
####                                                     4        ≈ 0.0039
###-##                                                   5        ≈ 0.00098
{ ##-#-####, ###---#--##-# }                             7        ≈ 0.00012
{ ##-##-#####, ###-####--##, ###-##---#-###,
  ##----####-###, ###---#-#-##-##, ###-#-#-#-----### }   9        ≈ 0.000023
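
The brute-force losslessness test extends directly to seed families; a sketch in the same conventions as the earlier checker (the expected output follows the slide's table):

    from itertools import combinations

    def family_lossless(seeds, m, k):
        # (m, k)-losslessness for a family: every similarity with k mismatches
        # must be covered by at least one seed of the family (brute force)
        def covered(sim):
            return any(
                any(all(sim[t + j] for j, c in enumerate(sd) if c == '#')
                    for t in range(m - len(sd) + 1))
                for sd in seeds)
        return all(covered([i not in errs for i in range(m)])
                   for errs in combinations(range(m), k))

    pair = ['##-#-####', '###---#--##-#']
    print(family_lossless(pair, 18, 3))   # expected True: the weight-7 pair above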

80 Multiple seeds (cont)
  Multiple seeds further improve the sensitivity in the lossy setting [PatternHunter II 03]
  Multiple seeds incur a computational overhead, which is compensated by the gain in efficiency
  The automaton approach applies to multiple seeds too (but the seed automata become larger, of course)
  Seed design is even harder for multiple seeds, but the idea of periodic seeds can be applied [Kucherov, Noé, Roytberg 04]

81 Bioinformatics applications: Next-Generation Sequencing
  NGS technologies, which appeared in 2005, produce tens of Gb of sequence data in a single run
  these data consist of short (35-400 letters) reads of DNA sequence
  many methods exist for mapping these reads to a reference genomic sequence; some of them (usually the most precise ones) are based on seeds (SHRiMP, MAQ, ZOOM, PASS, PerM, ...)
  lossless seeds are applied (e.g. ZOOM), periodic seeds as well (PerM)
  specific seeding techniques have been developed: positioned seeds [Noé, Gîrdea, Kucherov 10]

82-86 Anti-summary: what I have not talked about
  More complex models for similarities and seeds: subset seeds, Markov and Hidden Markov models for similarities
  Seeds for protein sequences (on 20 letters)
  Implementation issues, extension step
  software: YASS, iedera, hedera, SToRM
