Seed-based sequence search: some theory and some applications


1 Seed-based sequence search: some theory and some applications Gregory Kucherov, CNRS/LIGM, Marne-la-Vallée, joint work with Laurent Noé (LIFL, Lille). Journées GDR IM, Lyon, January 2013

2-5 Filtration for approximate pattern matching [sequence of illustrative figures]

6-8 Filtration for sequence alignment [sequence of illustrative figures]

9-12 Contiguous seeds
Example: pattern size 18, 2 allowed substitution errors (Hamming distance)
######
ATCAGTGAATGCGCAAGA
|||||:||||||:|||||
ATCAGCGAATGCTCAAGA
6 consecutive matching characters (######) constitute a lossless filter
7 consecutive matching characters (#######) constitute a lossy filter (≈ 5% of occurrences missed)
BLAST [Altschul et al., J.Mol.Biol. 1990]: about 43,000 citations according to Google Scholar
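
Both claims can be checked by brute force over all error placements. A minimal Python sketch (not part of the slides; the function name is mine):

    from itertools import combinations

    def contiguous_lossless(run_len, m, k):
        # True iff every length-m similarity with k mismatches contains
        # run_len consecutive matches, i.e. the contiguous filter is lossless
        for errs in combinations(range(m), k):
            sim = [i not in errs for i in range(m)]   # True = match
            if not any(all(sim[i + j] for j in range(run_len))
                       for i in range(m - run_len + 1)):
                return False
        return True

    print(contiguous_lossless(6, 18, 2))   # True:  ###### is lossless
    print(contiguous_lossless(7, 18, 2))   # False: ####### misses some occurrences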

13-16 Spaced seeds
Definition (spaced seed). A spaced seed π is a pattern composed of symbols {#, -}, where # stands for a match (|), and - (joker) stands for either a match (|) or a mismatch (:). s(π): span (length), w(π): weight (number of #'s).
Example: seed π = ###--#-### (of weight 7!) is lossless for pattern length 18 and 2 errors
     ###--#-###
ATCAGTGAATGCGCAAGA
||||||||:||:||||||
ATCAGTGACTGTGCAAGA
seed π = ###-#--###-# (of weight 8!!) is lossless too

17 Spaced seeds (cont) Example: consider similarities of length 64 with Pr(|) = 0.7 and Pr(:) = 0.3 (Bernoulli model)
...atcagtgaatgcgtaagact...
   |||||:|||||:|:||||:|
...ATCAGCGAATGTGCAAGAGT...
A random similarity contains
  seed ########### (BLAST, weight 11) with probability ≈ 0.30
  seed ###-#--#-#--##-### (PatternHunter, weight 11) with probability ≈ 0.47
  seed ########## (weight 10) with probability ≈ 0.4
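
These probabilities can be reproduced approximately by simulation; a minimal sketch, assuming the Bernoulli model above (names are mine):

    import random

    def hit_probability(seed, m, p_match, trials=100000):
        # Monte-Carlo estimate of Pr(seed occurs in a random length-m similarity)
        anchors = [i for i, c in enumerate(seed) if c == '#']
        span = len(seed)
        hits = 0
        for _ in range(trials):
            sim = [random.random() < p_match for _ in range(m)]
            if any(all(sim[t + i] for i in anchors) for t in range(m - span + 1)):
                hits += 1
        return hits / trials

    print(hit_probability('#' * 11, 64, 0.7))              # ~0.30 (BLAST)
    print(hit_probability('###-#--#-#--##-###', 64, 0.7))  # ~0.47 (PatternHunter)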

18 Filtration: Sensitivity and Selectivity
Sensitivity (recall) = measure of true similarities detected (true positives)
Selectivity (precision, specificity) = measure of spurious candidates (false positives)
text, pattern → seed filter → candidate occurrences → (costly) check
Selectivity is measured by the number of #'s
In the lossless case, sensitivity = 100%: all true occurrences are reported

19-21 Why are spaced seeds better?
Some probabilistic observations:
  For spaced seeds, occurrences at successive positions are more independent events
  For contiguous vs spaced seeds of the same weight, the expected number of occurrences is (roughly) the same, but the probabilities of at least one occurrence are very different
Question: Which of aababb and aaaaaa is more likely to appear first in an i.i.d. binary sequence?
Answer: aababb
Spaced seeds have a smaller waiting time than contiguous seeds
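
The waiting-time claim is easy to verify empirically; a small simulation sketch (not from the talk):

    import random

    def mean_waiting_time(word, trials=20000):
        # average position at which `word` first ends in an i.i.d.
        # uniform sequence over {a, b}
        n, total = len(word), 0
        for _ in range(trials):
            buf, steps = '', 0
            while buf != word:
                buf = (buf + random.choice('ab'))[-n:]
                steps += 1
            total += steps
        return total / trials

    print(mean_waiting_time('aababb'))   # ~64  = 2^6     (no self-overlap)
    print(mean_waiting_time('aaaaaa'))   # ~126 = 2^7 - 2 (maximal self-overlap)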

22-27 Seed Analysis: Formalization
π = ###--#-###
###--#-###
ATCAGTGAATGCGCAAGA
|||||:|||||:||||||
ATCAGCGAATGTGCAAGA
111110111110111111   (the similarity encoded as a binary string)
Lossless case: every similarity of length m with k mismatches (0's) and (m - k) matches (1's) is covered by π
Lossy case:
  specify a probabilistic model F of similarity, e.g. Pr(1) = p, Pr(0) = 1 - p
  sensitivity_F(π) = Pr(seed covers a random similarity)
  specify a probabilistic model B of random alignment ("background")
  selectivity_B(π) = Pr(seed covers a random alignment)

28-29 Basic Problems
Lossless case: for a given seed π, verify if π is lossless for similarity length m and number of errors k; construct lossless seeds of maximal weight (selectivity)
Lossy case: for a given seed π and models F and B, compute sensitivity_F(π) and selectivity_B(π); construct seeds that maximize sensitivity_F(π) (minimize selectivity_B(π))

30 Spaced Seeds: Historical Remarks
  Burkhardt, Kärkkäinen, CPM 01: spaced seeds for (lossless) approximate pattern matching
  Ma, Tromp, Li 02 (PatternHunter): spaced seeds for (lossy) similarity search in DNA sequences
  YASS [Noé, Kucherov 04]
  references in ...noe/spaced_seeds.html

31 (m, k)-problem
m: length of similarity, k: number of errors (substitutions)
Definition ((m, k)-problem). A seed π ∈ {#,-}^s is (m, k)-lossless iff all C(m, k) similarities are covered by π
Reminder: similarities are represented as binary strings over {0, 1}, where # matches 1 and - matches both 0 and 1
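
For small m and k the definition can be checked directly; a brute-force sketch over all C(m, k) similarities (function names are mine):

    from itertools import combinations

    def covers(seed, sim):
        # True iff some placement of the seed aligns all its #'s with 1's
        span = len(seed)
        return any(all(sim[t + j] for j, c in enumerate(seed) if c == '#')
                   for t in range(len(sim) - span + 1))

    def is_lossless(seed, m, k):
        # enumerate all C(m, k) similarities with exactly k mismatches (0's)
        return all(covers(seed, [i not in errs for i in range(m)])
                   for errs in combinations(range(m), k))

    print(is_lossless('###--#-###', 18, 2))    # True (the weight-7 seed above)
    print(is_lossless('###-#--###-#', 18, 2))  # True (the weight-8 seed above)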

32-33 Verifying losslessness: existing algorithms [Burkhardt, Kärkkäinen 01]
Theorem. One can verify if a seed π is (m, k)-lossless in time O(m · k · 2^s(π))
2^s(π) can be improved to 2^(s(π)-w(π))

34-38 Seed automaton [Buhler, Keich, Sun 03] [Kucherov, Noé, Roytberg 05]
seed π = #-#-#
detected similarity fragments: {10101, 10111, 11101, 11111}
[automaton diagram: states q0, ..., q9 and a final state qf]
the number of states can be bounded by O(w(π) · 2^(s(π)-w(π)))
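
The automaton can be simulated on the fly with a shift-and bitset whose reachable values play the role of states; a minimal sketch (my own encoding, not the minimized automaton of the slide):

    def detects(seed, sim):
        # bit i of D is set iff the last i+1 letters are compatible with the
        # seed prefix of length i+1; '-' accepts 0 and 1, '#' requires a 1
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]   # masks[letter]
        final = 1 << (len(seed) - 1)
        D = 0
        for b in sim:
            D = ((D << 1) | 1) & masks[b]
            if D & final:
                return True   # a full seed occurrence ends here
        return False

    # the four detected fragments of #-#-# from the slide:
    for frag in ([1,0,1,0,1], [1,0,1,1,1], [1,1,1,0,1], [1,1,1,1,1]):
        print(detects('#-#-#', frag))   # True each time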

39-44 Verifying losslessness: automata technique
π = #-#, m = 7, k = 2
[seed automaton: states q0, q1, q2, q3 and a final state qf; product with the automaton counting positions up to m]
Verify that every word of length m with k 0's is accepted

45-56 π = #-#-#, m = 7, k = 1
[seed automaton diagram; slides 45-56 step through the forward computation of the cost values state by state]
cost(0) = 1, cost(1) = 0
cost(q^(0) → q^(1) → ... → q^(i)) = Σ_j cost(q^(j) → q^(j+1))
cost(q) = min { cost(q_init → ... → q) }
Theorem. π is (m, k)-lossless iff, after reading m letters, cost(q) > k for every non-final state q
forward update: cost(q) = min over transitions q' → q of { cost(q') + cost(q' → q) }
running time O(m · w(π) · 2^(s(π)-w(π)))
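
The same computation, written directly over the reachable shift-and states (a sketch of this min-plus DP; the state encoding is mine, not the slide's automaton):

    def min_errors_undetected(seed, m):
        # minimal number of 0's (errors) in a length-m similarity NOT detected
        # by the seed; the seed is (m, k)-lossless iff this value is > k
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]
        final = 1 << (len(seed) - 1)
        cost = {0: 0}                        # state -> min errors so far
        for _ in range(m):
            nxt = {}
            for D, c in cost.items():
                for b in (0, 1):
                    D2 = ((D << 1) | 1) & masks[b]
                    if D2 & final:
                        continue             # detected: not a counter-example
                    c2 = c + (1 - b)
                    if c2 < nxt.get(D2, c2 + 1):
                        nxt[D2] = c2
            cost = nxt
        return min(cost.values(), default=float('inf'))

    print(min_errors_undetected('#-#-#', 7))          # 2, so lossless for k = 1
    print(min_errors_undetected('###-#--###-#', 18))  # 3, so lossless for k = 2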

57-60 Sensitivity computation [Kucherov, Noé, Roytberg 05]
π = #-#-#, m = 7, F: Pr(1) = 0.8, Pr(0) = 0.2
[seed automaton diagram: states q0, ..., qf]
cost(0) = 0.2, cost(1) = 0.8
cost(q^(0) → q^(1) → ... → q^(i)) = Π_j cost(q^(j) → q^(j+1))
cost(q) = Σ { cost(q_init → ... → q) }
Theorem. sensitivity_F(π) = cost(q_Final)
forward update: cost(q) = Σ over transitions q' → q of cost(q') · cost(q' → q)
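
Replacing (min, +) by (+, ×) in the previous sketch gives the exact sensitivity; shown below with the length-64 example of slide 17 as a check (same hand-rolled state encoding as before):

    def sensitivity(seed, m, p1):
        # Pr(seed detects a random length-m similarity with Pr(1) = p1)
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]
        final = 1 << (len(seed) - 1)
        prob = {0: 1.0}
        detected = 0.0
        for _ in range(m):
            nxt = {}
            for D, pr in prob.items():
                for b, pb in ((0, 1.0 - p1), (1, p1)):
                    D2 = ((D << 1) | 1) & masks[b]
                    if D2 & final:
                        detected += pr * pb     # the final state is absorbing
                    else:
                        nxt[D2] = nxt.get(D2, 0.0) + pr * pb
            prob = nxt
        return detected

    print(round(sensitivity('#' * 11, 64, 0.7), 2))              # 0.30
    print(round(sensitivity('###-#--#-#--##-###', 64, 0.7), 2))  # 0.47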

61-64 Common setting: dynamic programming on weighted automata over semirings [Mohri, Handbook of Weighted Automata 09]
semiring S = (S, ⊕, ⊗, 0, 1)
  verifying losslessness: tropical semiring (S = N ∪ {∞}, ⊕ = min, ⊗ = +, 0 = ∞, 1 = 0)
  computing sensitivity: probability semiring (S = R+, ⊕ = +, ⊗ = ×, 0 = 0, 1 = 1)
  counting the number of detected similarities (lossless setting): counting semiring (S = N, ⊕ = +, ⊗ = ×, 0 = 0, 1 = 1)
  detecting similarities exceeding a given (additive) score: max-plus algebra (S = R ∪ {-∞}, ⊕ = max, ⊗ = +, 0 = -∞, 1 = 0)
  similarity with maximum probability to be detected (Viterbi algorithm): Viterbi semiring (S = [0, 1], ⊕ = max, ⊗ = ×, 0 = 0, 1 = 1)
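
The two sketches above differ only in the pair of operations used, which is exactly this semiring abstraction; a generic version (names and encoding mine):

    def seed_dp(seed, m, plus, times, zero, one, letter_weight):
        # forward DP over the (shift-and) seed automaton in an arbitrary semiring
        jokers = sum(1 << i for i, c in enumerate(seed) if c == '-')
        masks = [jokers, (1 << len(seed)) - 1]
        final = 1 << (len(seed) - 1)
        vals, absorbed = {0: one}, zero
        for _ in range(m):
            nxt = {}
            for D, v in vals.items():
                for b in (0, 1):
                    w = times(v, letter_weight(b))
                    D2 = ((D << 1) | 1) & masks[b]
                    if D2 & final:
                        absorbed = plus(absorbed, w)
                    else:
                        nxt[D2] = plus(nxt.get(D2, zero), w)
            vals = nxt
        return absorbed, vals

    # probability semiring: sensitivity of #-#-# for m = 7, Pr(1) = 0.8
    s, _ = seed_dp('#-#-#', 7, lambda a, b: a + b, lambda a, b: a * b,
                   0.0, 1.0, lambda b: 0.8 if b else 0.2)
    print(s)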

65-67 Lossless property is NP-complete [Nicolas, Rivals 05] [Ma, Li 06]
From the previous algorithm: one can verify if a seed π is (m, k)-lossless in time O(m · w(π) · 2^(s(π)-w(π)))
Theorem. Verifying if a given seed π is (m, k)-lossless is NP-hard
A similar result holds for the lossy case:
Theorem. Computing the sensitivity of a given seed π under a Bernoulli model of similarities is NP-hard

68 Design of lossless seeds
similarities: length m, number of errors k
seeds: span s, weight w (number of #'s)
  design lossless seeds (for given s, w) that maximize k for a given m (or minimize m for a given k)
  given m and k, design lossless seeds that maximize w

69-72 Asymptotics [Kucherov, Noé, Roytberg 04] [Farach-Colton, Landau, Sahinalp, Tsur 05]
Theorem (fixed number of jokers). Let the number of jokers d be fixed. The maximal number k of errors detected by an optimal (m, k)-lossless seed of span s is ((d+2)/2) · (m/s) ± O(max{1, m/s^2})
Theorem (fixed number of errors). Let the number k of errors be fixed. Then the maximal weight w of an optimal (m, k)-lossless seed is m - Θ(m^(k/(k+1)))
Proofs are constructive, but the results are only asymptotic

73-75 Practical design: periodic seeds [Kucherov, Noé, Roytberg 04]
Notation: [π1, π2]^i = (π1 π2)^i π1
Theorem (periodic seeds). If seed π solves a cyclic (m, k)-problem, then for all i, seed [π, -^(m-s(π))]^i is (m(i+1) + s(π) - 1, k)-lossless.
Example: ###-# solves the cyclic (7, 2)-problem. Then [###-#, --]^i is (11 + 7i, 2)-lossless for all i. For i = 1, 2, 3 this produces optimal (maximally weighted) seeds. E.g. ###-#--###-# is (18, 2)-lossless.
From the asymptotic bound (previous slide), periodic seeds are not asymptotically optimal, as they have a constant fraction of jokers
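
The construction itself is one line; a sketch (the losslessness guarantees in the comments follow the theorem above):

    def periodic_seed(pi, m, i):
        # [pi, -^(m - s(pi))]^i = (pi followed by jokers) repeated i times, then pi
        return (pi + '-' * (m - len(pi))) * i + pi

    print(periodic_seed('###-#', 7, 1))   # ###-#--###-#          (18, 2)-lossless
    print(periodic_seed('###-#', 7, 2))   # ###-#--###-#--###-#   (25, 2)-lossless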

76-79 Multiple seeds
Using several seeds simultaneously, in a complementary fashion
Example: m = 18, k = 3

seed(s)                                                  weight   Pr(false positive) (i.i.d. on 4 letters)
####                                                     4        ≈ 0.0039
###-##                                                   5        ≈ 0.00098
{ ##-#-####, ###---#--##-# }                             7        ≈ 0.00012
{ ##-##-#####, ###-####--##, ###-##---#-###,
  ##----####-###, ###---#-#-##-##, ###-#-#-#-----### }   9        ≈ 0.000023
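
The brute-force losslessness test extends directly to seed families; a sketch in the same conventions as the earlier checker (the expected output follows the slide's table):

    from itertools import combinations

    def family_lossless(seeds, m, k):
        # (m, k)-losslessness for a family: every similarity with k mismatches
        # must be covered by at least one seed of the family (brute force)
        def covered(sim):
            return any(
                any(all(sim[t + j] for j, c in enumerate(sd) if c == '#')
                    for t in range(m - len(sd) + 1))
                for sd in seeds)
        return all(covered([i not in errs for i in range(m)])
                   for errs in combinations(range(m), k))

    pair = ['##-#-####', '###---#--##-#']
    print(family_lossless(pair, 18, 3))   # expected True: the weight-7 pair above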

80 Multiple seeds (cont)
  Multiple seeds further improve the sensitivity in the lossy setting [PatternHunter II 03]
  Multiple seeds incur a computational overhead, which is compensated by the gain in efficiency
  The automaton approach applies to multiple seeds too (but the seed automata become larger, of course)
  Seed design is even harder for multiple seeds, but the idea of periodic seeds can be applied [Kucherov, Noé, Roytberg 04]

81 Bioinformatics applications: Next-Generation Sequencing
  NGS technologies, which appeared in 2005, produce tens of Gb of sequence data in a single run
  these data consist of short (35-400 letters) reads of DNA sequence
  many methods exist for mapping these reads to a reference genomic sequence; some of them (usually the most precise ones) are based on seeds (SHRiMP, MAQ, ZOOM, PASS, PerM, ...)
  lossless seeds are applied (e.g. ZOOM), periodic seeds as well (PerM)
  specific seeding techniques have been developed: positioned seeds [Noé, Gîrdea, Kucherov 10]

82-86 Anti-summary: what I have not talked about
  More complex models for similarities and seeds: subset seeds, Markov and Hidden Markov models for similarities
  Seeds for protein sequences (on 20 letters)
  Implementation issues, extension step
  software: YASS, iedera, hedera, SToRM
