Succincter text indexing with wildcards

Size: px

Start display at page:

Download "Succincter text indexing with wildcards"

Julie Grant
5 years ago
Views:

1 University of British Columbia CPM 2011 June 27, 2011

2 Problem overview

3 Problem overview

4 Problem overview

5 Problem overview

6 Problem overview

7 Problem overview

8 Problem overview

9 Problem overview

10 Problem overview

11 Problem overview Problem Create a succinct index for: a text T of length n over an alphabet of size σ containing d wildcard positions to support efficient pattern matching queries.

12 Problem overview Problem Create a succinct index for: a text T of length n over an alphabet of size σ containing d wildcard positions to support efficient pattern matching queries. Assumption (in presentation) No adjacent wildcards in T.

13 Motivation SNPs in the human genome T 3 billion bases σ = 4 d 3 million positions also protein collections (σ = 20)

14 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words

15 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words Succint indexes Index space (bits) Query working space (bits) Tam et al., 2009 (3 + o(1))n log σ O(n log d)

16 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words Succint indexes Index space (bits) Query working space (bits) Tam et al., 2009 (3 + o(1))n log σ O(n log d) (reduced working space by increasing query time) Tam et al., 2009 (3 + o(1))n log σ O(n log σ)

17 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words Succint indexes Index space (bits) Query working space (bits) Tam et al., 2009 (3 + o(1))n log σ O(n log d) (reduced working space by increasing query time) Tam et al., 2009 (3 + o(1))n log σ O(n log σ) This work (2 + o(1))n log σ + O(n) O(m log n + md) (Ignoring smaller order terms)

18 3 Types of matches Type 1: pattern contained within a text segment [Lam et al., 2007, Tam et al., 2009]

19 3 Types of matches Type 1: pattern contained within a text segment Type 2: pattern spanning one wildcard [Lam et al., 2007, Tam et al., 2009]

20 3 Types of matches Type 1: pattern contained within a text segment Type 2: pattern spanning one wildcard Type 3: pattern contains at least one text segment [Lam et al., 2007, Tam et al., 2009]

21 Review: SAs and the Burrows Wheeler transform mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi Naïve BWT algorithm For a string S:

22 Review: SAs and the Burrows Wheeler transform mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ)

23 Review: SAs and the Burrows Wheeler transform mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ) 2 list all cyclic rotations of S

24 Review: SAs and the Burrows Wheeler transform $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ) 2 list all cyclic rotations of S 3 sort rows of matrix (lexicographically)

25 Review: SAs and the Burrows Wheeler transform $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ) 2 list all cyclic rotations of S 3 sort rows of matrix (lexicographically) 4 output last column of matrix

26 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [8, 11] [Ferragina & Manzini 2002]

27 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [8, 11] [Ferragina & Manzini 2002]

28 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [3, 4] [Ferragina & Manzini 2002]

29 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [3, 4] [Ferragina & Manzini 2002]

30 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [9, 9] [Ferragina & Manzini 2002]

31 Type 1 matching Build F, a compressed suffix array of T. Lemma All occ 1 Type 1 matches of P can be reported using: Query time O( P log σ + occ 1 log 1+ɛ n) Index space (1 + o(1))n log σ bits Work space O(log n) bits

32 Matching prefixes of text segments

33 Matching prefixes of text segments rank (BWT, i) queries determine a lexicographic range of text segments

34 Type 2 matching Use F and R to find candidate text segments

35 Type 2 matching Use Q to find compatible candidate [Bose et al., 2009]

36 Type 2 matching Lemma All occ 2 Type 2 matches of P can be reported using: Query time Index space Work space O( P log σ + ( P + occ 2 ) log d (2 + o(1))n log σ bits O(m log n) bits log log d )

37 Type 3 matching We will enhance F to support dictionary queries calculate matching statistics for each suffix, return text segments contained as a prefix

38 Review: computing matching statistics suffix length SA range c 1 [15,19] cc 2 [19,19] acc 3 [14,14] cacc 4 [18,18] acacc 3 [13,13] [Ohlebusch, Gog & Kügel 2010]

39 Finding dictionary matches Case 1

40 Finding dictionary matches Case 1

41 Finding dictionary matches Case 1 [Munro & Ramen 2002]

42 Finding dictionary matches Case 1 [Ramen et al., 2002]

43 Finding dictionary matches Case 1 [Grossi & Vitter 2000]

44 Finding dictionary matches Case 1

45 Finding dictionary matches Case 1

46 Finding dictionary matches Case 1 if BP[pos] = ) then find matching (

47 Finding dictionary matches Case 1 if BP[pos] = ) then find matching (

48 Finding dictionary matches Case 1 if BP[pos] = ) then find matching (

49 Finding dictionary matches Case 2

50 Finding dictionary matches Case 2 if BP[pos] = ( then find parent interval

51 Finding dictionary matches Case 2 if BP[pos] = ( then find parent interval

52 Summary of the full-text dictionary Theorem A full-text dictionary using (1 + o(1))n log σ + O(n) + O(d log n) bits supports the following operations efficiently: report/count text segments containing P report/count text segments prefixed by P report/count text segments contained in P

53 Type 3 matching Preliminaries Every text segment contained within P is a candidate

54 Type 3 matching Preliminaries Every text segment contained within P is a candidate A candidate is valid if it satisfies: 1 suffix condition 2 prefix condition

55 Algorithm overview We design a dynamic programming algorithm: P phases of algorithm (i = P,..., 2, 1) phase i considers candidates prefixing P[i.. P ]

56 Algorithm overview We design a dynamic programming algorithm: P phases of algorithm (i = P,..., 2, 1) phase i considers candidates prefixing P[i.. P ] potential matches are tracked in a working array W

57 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j + 1

58 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition

59 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition 2 if P must begin in text segment j 1 then verify prefix condition

60 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition 2 if P must begin in text segment j 1 then verify prefix condition

61 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition 2 if P must begin in text segment j 1 then verify prefix condition 3 else record partial match

62 Type 3 matching Case 2 Text segment j matches P[i.. P ] (case 2) P must contain text segment j + 1

63 Type 3 matching Case 2 Text segment j matches P[i.. P ] (case 2) P must contain text segment j verify suffix condition by checking W

64 Type 3 matching Case 2 Text segment j matches P[i.. P ] (case 2) P must contain text segment j verify suffix condition by checking W 2 if P must begin in text segment j 1 then verify prefix condition 3 else record partial match

65 Type 3 matching Lemma All occ 3 Type 3 matches of P can be reported using: Query time O( P log σ + γ) Index space (2 + o(1))n log σ bits Work space O(m log n + md) bits where γ is number of text segments contained in P

66 Main result Theorem Given a text T of length n containing d wildcards, all matches of a pattern P can be reported using: Query time Index space Work space O( P log σ + m log d log log d + occ log d 1 log n + occ 2 log log d + γ) (2 + o(1))n log σ + O(n)+O(m log n) bits O(m log n + md) bits

67 Main result Theorem Given a text T of length n containing d wildcards, all matches of a pattern P can be reported using: Query time Index space Work space O( P log σ + m log d log log d + occ log d 1 log n + occ 2 log log d + γ) (2 + o(1))n log σ + O(n)+O(m log n) bits O(m log n + md) bits

68 Main result Theorem Given a text T of length n containing d wildcards, all matches of a pattern P can be reported using: Query time Index space Work space O( P log σ + m log d log log d + occ log d 1 log n + occ 2 log log d + γ) (2 + o(1))n log σ + O(n)+O(m log n) bits O(m log n + md) bits

69 Significance of working space reduction Short read alignment to human genome T 3 billion bases d 3 million bases m Working space reduced from GBs [Tam et al., 2009] to 10 s of MBs

70 Questions Future Work Can index space be reduced to (1 + o(1))n log σ bits? YES, but with query time: O(max(m 2 log σ log log σ, current query time)) Thanks to Anne Condon and anonymous reviewers.

Compact Data Strutures

(To compress is to Conquer) Compact Data Strutures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23 TH AUGUST 2017 Agenda