
Advanced Text Indexing Techniques

Johannes Fischer

SS 2009


1 Suffix Trees, -Arrays and -Trays

1.1 Recommended Reading

- Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
- U. Manber, E. W. Myers: Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22(5): 935-948 (1993).
- V. Heun: Algorithmen auf Sequenzen. Skriptum, Ludwig-Maximilians-Universität München, Version 2.
- R. Cole, T. Kopelowitz, M. Lewenstein: Suffix Trays and Suffix Trists. In: M. Bugliesi et al. (Eds.): Proceedings of the 33rd Intl. Colloquium on Automata, Languages and Programming (ICALP 2006), Part I. Lecture Notes in Computer Science (LNCS) 4051, Springer, 2006.

1.2 Suffix Trees

In this section we introduce suffix trees, which, among many other things, can be used to solve the string matching task (find a pattern P of length m in a text T of length n in O(n + m) time). There are other methods (Boyer-Moore, e.g.) which solve this task in the same time. So why do we need suffix trees in the context of string matching? The advantage of suffix trees over the other string-matching algorithms (Boyer-Moore, KMP, etc.) is that suffix trees are an index of the text. So, if T is static and there are several patterns to be matched against it, the O(n) task of building the index needs to be done only once, and each subsequent matching task can be done in O(m) time. If m << n, this is a clear advantage over the other algorithms.

Throughout this section, let T = t_1 t_2 ... t_n be a text over an alphabet Σ of size |Σ| =: σ.

Definition 1. A compact Σ⁺-tree is a rooted tree S = (V, E) with edge labels from Σ⁺ that fulfills the following two constraints:
- For all v ∈ V: all outgoing edges of v start with a different a ∈ Σ.
- Apart from the root, no node has out-degree 1.

Definition 2. Let S = (V, E) be a compact Σ⁺-tree. For v ∈ V, v̄ denotes the concatenation of all edge labels on the path from the root of S to v. |v̄| is called the string-depth of v and is denoted by d(v). S is said to display α ∈ Σ* iff there exist v ∈ V and β ∈ Σ* with v̄ = αβ. If v̄ = α for some v ∈ V and α ∈ Σ*, we also write ᾱ to denote v. words(S) denotes all strings in Σ* that are displayed by S: words(S) = {α ∈ Σ* : S displays α}. For i ∈ {1, 2, ..., n}, t_i t_{i+1} ... t_n is called the i-th suffix of T and is denoted by T_{i..n}. In general, we use the notation T_{i..j} as an abbreviation of t_i t_{i+1} ... t_j.

Example 1. (Figure: a compact Σ⁺-tree S over Σ = {A, C, G, T}; the marked node v has depth 1 and string-depth d(v) = 2, and words(S) contains, among others, ε, A, AG, and AGA.)

We are now ready to define suffix trees.

Definition 3. Let substring(T) denote the set of all substrings of T, substring(T) = {T_{i..j} : 1 ≤ i ≤ j ≤ n}. The suffix tree of T is a compact Σ⁺-tree S with words(S) = substring(T).

For several reasons, we shall find it useful that each suffix ends in a leaf of S. This can be accomplished by adding a new character $ ∉ Σ to the end of T, and building the suffix tree over T$. From now on, we assume that T terminates with $, and we define $ to be lexicographically smaller than all other characters in Σ: $ < a for all a ∈ Σ. This gives a one-to-one correspondence between T's suffixes and the leaves of S, which implies that we can label the leaves with a function l by the start index of the suffix they represent: l(v) = i ⟺ v̄ = T_{i..n} for all leaf nodes v. This also explains the name suffix tree.

Implementation Remark: The outgoing edges at internal nodes v of the suffix tree can be implemented in two fundamentally different ways:
1. as arrays of size σ,
2. as arrays of size s_v, where s_v denotes the number of v's children.

Example 2. Suffix tree implemented in the first way:

(Figure: the suffix tree of a text T over Σ = {A, C, G, T} plus the terminator $, where every node stores a child array of size σ; most entries are null pointers.)

Example 3. Suffix tree implemented in the second way:

(Figure: the same suffix tree, where every node v stores a child array of size s_v; no null pointers remain.)

Approach (1) has the advantage that the outgoing edge whose edge label starts with α ∈ Σ can be located in O(1) time, but the complete suffix tree uses space O(nσ), which can be as bad as O(n²). Hence, we assume that approach (2) is used, which implies that locating the correct outgoing edge takes O(log σ) time (using binary search). Note that the space consumption of approach (2) is always O(n), independent of σ.

We state a final theorem, which we proved in the lecture Advanced Methods for Sequence Analysis:

Theorem 1. The suffix tree for a text of length n over an integer alphabet can be built in O(n) time.

A note on the alphabet size: alphabets can be classified into different types, according to their size σ:

1. Constant alphabets with σ = O(1).
2. Good-natured alphabets with σ = o(n/log n).

3. Integer alphabets with σ = O(n).
4. Unbounded alphabets, where no upper bound on σ exists.

This list clearly forms a hierarchy; e.g., integer alphabets subsume constant and good-natured alphabets. In this lecture, all results are valid for good-natured alphabets, unless stated otherwise. Note that this restriction is not too severe, as alphabets usually grow much more slowly than the corresponding texts (if they grow at all), so it is usually safe to assume σ = o(n/log n).

1.3 Searching in Suffix Trees

Let P be a pattern of length m. Throughout the whole lecture, we will be concerned with the following two problems:

Problem 1 (Counting). Return the number of matches of P in T. Formally, return the size of O_P = {i ∈ [1, n] : T_{i..i+m-1} = P}.

Problem 2 (Reporting). Return all occurrences of P in T, i.e., return the set O_P.

Example 4. (Figure: the suffix tree of a text T over Σ = {A, C, G, T} plus $; for the depicted pattern P, the leaf labels below the node reached by matching P give O_P = {8, 5, 4}.)

With suffix trees, the counting problem can be solved in O(m log σ) time: traverse the tree from the root downwards, in each step locating the correct outgoing edge, until P has been scanned completely. More formally, suppose that P_{1..i-1} has already been parsed for some 1 ≤ i < m, and our position in the suffix tree S is at node v (v̄ = P_{1..i-1}). We then find v's outgoing edge e whose label starts with P_i. This takes O(log σ) time. We then compare the label of e character by character with P_{i..m}, until we have read all of P (i = m), or until we have reached a position j ≥ i for which P_{1..j} is a node v' in S, in which case we continue the procedure at v'. This takes a total of O(m log σ) time.

Suppose the search procedure has brought us successfully to a node v, or to the incoming edge of node v. We then output the size of S_v, the subtree of S rooted at v. This can be done in constant time, assuming that we have labeled all nodes in S with their subtree sizes. This answers the counting query. For the reporting query, we output the labels of all leaves in S_v (recall that the leaves are labeled with text positions).

Theorem 2. The suffix tree allows us to answer counting queries in O(m log σ) time, and reporting queries in O(m log σ + |O_P|) time.
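To make the descent concrete, here is a small Python sketch (our own illustration, not part of the original script). For brevity it builds an uncompacted suffix trie in O(n²) space instead of the compact suffix tree, and it answers reporting queries by collecting leaf labels instead of storing subtree sizes; the example string is a made-up stand-in.

    def build_suffix_trie(T):
        root = {}
        for i in range(len(T)):                # insert suffix T[i:]
            node = root
            for c in T[i:]:
                node = node.setdefault(c, {})
            node.setdefault('leaves', []).append(i + 1)   # 1-based start index
        return root

    def report(root, P):
        node = root
        for c in P:                            # descend along P
            if c not in node:
                return []                      # P does not occur in T
            node = node[c]
        out, stack = [], [node]                # collect all leaf labels below
        while stack:
            v = stack.pop()
            for key, child in v.items():
                if key == 'leaves':
                    out.extend(child)
                else:
                    stack.append(child)
        return sorted(out)

    T = "AACTATTAC$"                           # hypothetical example text
    print(report(build_suffix_trie(T), "TA"))  # [4, 7]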

1.4 Suffix- and LCP-Arrays

We will now introduce two arrays that are closely related to the suffix tree, the suffix array A and the LCP-array H.

Definition 4. The suffix array A of T is a permutation of {1, 2, ..., n} such that A[i] is the i-th smallest suffix in lexicographic order: T_{A[i-1]..n} < T_{A[i]..n} for all 1 < i ≤ n.

The second array builds on the suffix array:

Definition 5. The LCP-array H of T is defined such that H[1] = 0, and for all i > 1, H[i] holds the length of the longest common prefix of T_{A[i]..n} and T_{A[i-1]..n}.

Example 5. (Figure: the suffix array A and the LCP-array H of an example string T, written below the positions 1, ..., n.)

Lemma 3. Both the LCP-array H and the suffix array A can be computed in O(n) time.

The following observations relate the suffix array A to the suffix tree S.

Observation 1. If we do a lexicographically-driven depth-first search through S (visiting the children in lexicographic order of the first characters of their corresponding edge labels), then the leaf labels seen in this order give the suffix array A. (Remember that $ <_lex a for all a ∈ Σ.)

Example 6. (Figure: the suffix tree of an example string T with the leaves annotated by their suffix numbers; reading them off in a lexicographic depth-first search yields A.)
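For experimenting, A and H can be computed as follows (our sketch; building A by comparison sorting is not the linear-time method promised by Lemma 3, but H is computed from A in O(n) time with the algorithm of Kasai et al.):

    def suffix_array(T):
        # comparison sort of the suffixes; simple, but not O(n)
        return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

    def lcp_array(T, A):
        # Kasai et al.'s O(n) algorithm; positions and ranks are 1-based
        n = len(T)
        inv = [0] * (n + 1)
        for rank, start in enumerate(A, 1):
            inv[start] = rank                  # inv = A^{-1}
        H = [0] * (n + 1)                      # H[1] = 0 by definition
        l = 0
        for i in range(1, n + 1):              # text order; l drops by <= 1 per step
            k = inv[i]
            if k > 1:
                j = A[k - 2]                   # suffix preceding T_{i..n} in A
                while i + l <= n and j + l <= n and T[i + l - 1] == T[j + l - 1]:
                    l += 1
                H[k] = l
                l = max(l - 1, 0)
            else:
                l = 0
        return H[1:]                           # H[1..n] as a 0-based Python list

    T = "AACTATTAC$"
    A = suffix_array(T)    # [10, 1, 8, 2, 5, 9, 3, 7, 4, 6]
    H = lcp_array(T, A)    # [0, 0, 1, 2, 1, 0, 1, 0, 2, 1]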

Definition 6. Given a tree S = (V, E) and two nodes v, w ∈ V, the lowest common ancestor of v and w is the deepest node in S that is an ancestor of both v and w. This node is denoted by lca(v, w).

Observation 2. The string-depth of the lowest common ancestor of the leaves labeled A[i] and A[i-1] is given by the corresponding entry H[i] of the LCP-array, in symbols: for all i > 1: H[i] = d(lca(T_{A[i]..n}, T_{A[i-1]..n})).

In summary, these observations give a deep connection between S and H/A.

1.5 Searching in Suffix Arrays

We can use a plain suffix array A to search for a pattern P, using the ideas of binary search, since the suffixes in A are sorted lexicographically and hence the occurrences of P in T form an interval in A. The algorithm below performs two binary searches. The first search locates the starting position s of P's interval in A, and the second search determines the end position r. A counting query returns r - s + 1, and a reporting query returns the numbers A[s], A[s+1], ..., A[r].

Algorithm 1: function SAsearch(P_{1..m})
    l ← 1; r ← n + 1;
    while l < r do
        q ← ⌊(l + r)/2⌋;
        if P >_lex T_{A[q]..min{A[q]+m-1, n}} then l ← q + 1; else r ← q;
    end
    s ← l; l ← s - 1; r ← n;
    while l < r do
        q ← ⌈(l + r)/2⌉;
        if P =_lex T_{A[q]..min{A[q]+m-1, n}} then l ← q; else r ← q - 1;
    end
    return [s, r];

Note that both while-loops in Alg. 1 make sure that either l is increased or r is decreased, so they are both guaranteed to terminate. In fact, in the first while-loop, r always points one position behind the current search interval, and r is decreased in case of equality (when P =_lex T_{A[q]..min{A[q]+m-1, n}}). This makes sure that the first while-loop finds the leftmost position of P in A. The second loop works symmetrically.

Theorem 4. The suffix array allows us to answer counting queries in O(m log n) time, and reporting queries in O(m log n + |O_P|) time.
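A direct Python transcription of Alg. 1 (ours; ranks are 1-based, and Python's string comparison plays the role of <_lex):

    def sa_search(T, A, P):
        n, m = len(T), len(P)
        def pref(q):                       # first <= m characters of suffix A[q]
            start = A[q - 1]               # A is a Python list; ranks are 1-based
            return T[start - 1 : start - 1 + m]
        l, r = 1, n + 1                    # first loop: leftmost position s
        while l < r:
            q = (l + r) // 2
            if P > pref(q):
                l = q + 1
            else:
                r = q
        s = l
        l, r = s - 1, n                    # second loop: rightmost position r
        while l < r:
            q = (l + r + 1) // 2           # ceiling, so the loop terminates
            if P == pref(q):
                l = q
            else:
                r = q - 1
        return s, r

    T = "AACTATTAC$"
    A = suffix_array(T)                    # from the sketch in Sect. 1.4
    s, r = sa_search(T, A, "TA")           # s = 8, r = 9
    print(r - s + 1, [A[i - 1] for i in range(s, r + 1)])  # 2 occurrences: [7, 4]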

1.6 Range Minimum Queries (RMQs)

Definition 7. Given an array H[1, n] of integers (or any other objects from a totally ordered universe) and two indices 1 ≤ i ≤ j ≤ n, RMQ_H(i, j) returns the position of the minimum in H[i, j]: RMQ_H(i, j) = argmin_{i ≤ k ≤ j} H[k].

Example 7. (Figure: an array H with a query range [i, j] marked, and RMQ_H(i, j) pointing to the position of the minimum inside that range.)

In the lecture Advanced Methods for Sequence Analysis, we proved:

Theorem 5. A static array H can be preprocessed in linear time into a data structure of size O(n) that allows us to answer RMQs on H in constant time.

1.7 Longest Common Prefixes and Suffixes

An indispensable tool in pattern matching are efficient implementations of functions that compute longest common prefixes and longest common suffixes of two strings (usually suffixes or prefixes of the same string).

Definition 8. Given two strings U and V in Σ*, lcp(U, V) denotes the length of their longest common prefix, in symbols: lcp(U, V) = max{k ≥ 0 : U_{1..k} = V_{1..k}}.

Example 8. lcp(ACATG, ACC) = 2.

Note that lcp(·,·) only gives the length of the matching prefix; if one is actually interested in the prefix itself, it can be obtained as U_{1..lcp(U,V)}, or equivalently as V_{1..lcp(U,V)}.

As mentioned above, we will be particularly interested in longest common prefixes of suffixes from the same string T:

Definition 9. For a text T of length n and two indices 1 ≤ i, j ≤ n, lcp_T(i, j) denotes the length of the longest common prefix of the suffixes starting at positions i and j in T, in symbols: lcp_T(i, j) = lcp(T_{i..n}, T_{j..n}).

Example 9. For T = ACCAAACA: lcp_T(2, 7) = lcp(CCAAACA, CA) = 1, and lcp_T(4, 5) = lcp(AAACA, AACA) = 2.

Note that the LCP-array H from Sect. 1.4 holds the lengths of the longest common prefixes of lexicographically consecutive suffixes: H[i] = lcp(T_{A[i]..n}, T_{A[i-1]..n}) = lcp_T(A[i], A[i-1]). Here and in the remainder of this chapter, A is again the suffix array of the text T. But how do we get the lcp-values of suffixes that are not in lexicographic neighborhood? The key is to employ RMQs over the LCP-array, as shown in the next lemma.

Definition 10. The inverse suffix array A^{-1} is defined by A^{-1}[A[i]] = i for all 1 ≤ i ≤ n.
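For illustration, here is a sparse-table implementation of RMQs (ours): it answers queries in O(1) time, but its preprocessing uses O(n log n) words, so it does not achieve the O(n) space bound of Theorem 5.

    def rmq_preprocess(H):
        # M[k][i-1] = position of the minimum of H[i, i + 2^k - 1] (1-based)
        n = len(H)
        M = [list(range(1, n + 1))]
        k = 1
        while (1 << k) <= n:
            prev, half = M[k - 1], 1 << (k - 1)
            row = []
            for i in range(1, n - (1 << k) + 2):
                a, b = prev[i - 1], prev[i - 1 + half]
                row.append(a if H[a - 1] <= H[b - 1] else b)
            M.append(row)
            k += 1
        return M

    def rmq(H, M, i, j):
        # cover [i, j] by two overlapping power-of-two windows
        k = (j - i + 1).bit_length() - 1
        a = M[k][i - 1]
        b = M[k][j - (1 << k)]     # the window ending at j starts at j - 2^k + 1
        return a if H[a - 1] <= H[b - 1] else b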

Example 10. (Figure: an example text T with its suffix array A and inverse suffix array A^{-1} written below the positions 1, ..., n.)

Lemma 6. Let i ≠ j be two indices in T with A^{-1}[i] < A^{-1}[j] (otherwise swap i and j). Then lcp_T(i, j) = H[RMQ_H(A^{-1}[i] + 1, A^{-1}[j])].

Proof. First note that any common prefix ω of T_{i..n} and T_{j..n} must be a common prefix of T_{A[k]..n} for all A^{-1}[i] ≤ k ≤ A^{-1}[j], because these suffixes are lexicographically between T_{i..n} and T_{j..n} and must hence start with ω. Let m = RMQ_H(A^{-1}[i] + 1, A^{-1}[j]) and l = H[m]. By the definition of H, T_{i..i+l-1} is a common prefix of all suffixes T_{A[k]..n} for A^{-1}[i] ≤ k ≤ A^{-1}[j]. Hence, T_{i..i+l-1} is a common prefix of T_{i..n} and T_{j..n}. Now assume that T_{i..i+l} is also a common prefix of T_{i..n} and T_{j..n}. Then, by the lexicographic order of A, T_{i..i+l} is also a common prefix of T_{A[m-1]..n} and T_{A[m]..n}. But |T_{i..i+l}| = l + 1, contradicting the fact that H[m] = l tells us that T_{A[m-1]..n} and T_{A[m]..n} share no common prefix of length more than l. □

Lemma 6 implies that with the inverse suffix array A^{-1}, the LCP-array H, and constant-time RMQs on H, we can answer lcp-queries for arbitrary suffixes in O(1) time.
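In code, Lemma 6 reads as follows (our sketch, building on suffix_array, lcp_array and the RMQ functions from the previous sketches):

    def lcp_query(T, A, Ainv, H, M, i, j):
        # lcp of the suffixes T_{i..n} and T_{j..n} in O(1) time (Lemma 6)
        if i == j:
            return len(T) - i + 1
        x, y = Ainv[i], Ainv[j]
        if x > y:
            x, y = y, x
        return H[rmq(H, M, x + 1, y) - 1]   # H is stored as a 0-based list

    T = "AACTATTAC$"
    A = suffix_array(T)
    Ainv = [0] * (len(T) + 1)
    for rank, start in enumerate(A, 1):
        Ainv[start] = rank
    H = lcp_array(T, A)
    M = rmq_preprocess(H)
    print(lcp_query(T, A, Ainv, H, M, 4, 7))   # 2: lcp(TATTAC$, TAC$) = |TA|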

1.8 Accelerated Search in Suffix Arrays

The simple binary search (Alg. 1) may perform many unnecessary character comparisons, as in every step it compares P from scratch. With the help of the lcp-function from the previous section, we can improve the search in suffix arrays from O(m log n) to O(m + log n) time. The idea is to remember the number of matching characters of P with T_{A[l]..n} and T_{A[r]..n}, where [l, r] denotes the current interval of the binary search procedure. Let λ and ρ denote these numbers: λ = lcp(P, T_{A[l]..n}) and ρ = lcp(P, T_{A[r]..n}). Initially, both λ and ρ are 0. Let us consider an iteration of the first while-loop in function SAsearch(P), where we wish to determine whether to continue in [l, q] or [q, r]. (Alg. 1 would actually continue searching in [q+1, r] in the second case, but this minor improvement is not possible in the accelerated search.)

(Figure: the suffix array A with the current positions l ≤ q ≤ r; P agrees with T_{A[l]..n} in its first λ characters and with T_{A[r]..n} in its first ρ characters.)

Without loss of generality, assume λ ≥ ρ (otherwise swap the roles of l and r). We then look up ξ = lcp_T(A[l], A[q]), the length of the longest common prefix of the suffixes T_{A[l]..n} and T_{A[q]..n}. We look at three different cases:

1. ξ > λ. (Figure: T_{A[l]..n} and T_{A[q]..n} share a prefix α of length ξ > λ; in particular T_{A[l]+λ} = T_{A[q]+λ}.) Because P_{λ+1} >_lex T_{A[l]+λ} = T_{A[q]+λ}, we know that P >_lex T_{A[q]..n}, and can hence set l ← q and continue the search without any character comparison. Note that ρ and in particular λ correctly remain unchanged.

2. ξ = λ. (Figure: T_{A[l]..n} and T_{A[q]..n} share exactly the prefix α of length λ.) In this case we continue comparing P_{λ+1} with T_{A[q]+λ}, P_{λ+2} with T_{A[q]+λ+1}, and so on, until P is matched completely, or until a mismatch occurs. Say we have done this comparison up to P_{λ+k}. If P_{λ+k} >_lex T_{A[q]+λ+k-1}, we set l ← q and λ ← λ + k - 1. Otherwise, we set r ← q and ρ ← λ + k - 1.

3. ξ < λ. (Figure: T_{A[l]..n} and T_{A[q]..n} share only a prefix α of length ξ, so P_{ξ+1} = T_{A[l]+ξ} <_lex T_{A[q]+ξ}.) First note that ξ ≥ ρ, as lcp_T(A[l], A[r]) ≥ ρ, and T_{A[q]..n} lies lexicographically between T_{A[l]..n} and T_{A[r]..n}. So we can set r ← q and ρ ← ξ, and continue the binary search without any character comparison.

This algorithm either halves the search interval (cases 1 and 3) without any character comparison, or increases λ or ρ with each successful character comparison. Because neither λ nor ρ is ever decreased, and the search stops when λ = ρ = m, we see that the total number of character comparisons (= the total work in case 2) is O(m). So far we have proved the following theorem:

Theorem 7. Together with lcp-information, the suffix array supports counting and reporting queries in O(m + log n) and O(m + log n + |O_P|) time, respectively (recall that O_P is the set of occurrences of P in T).

1.9 Suffix Trays

We now show how the O(m log σ) search algorithm for suffix trees and the O(m + log n) search algorithm for suffix arrays can be combined to obtain a faster O(m + log σ) search algorithm. The general idea is to start the search in the suffix tree, where some additional information has been stored to speed up the search, and then, at an appropriate point, continue the search in the suffix array with a sufficiently small interval. Note that the accelerated search algorithm from Sect. 1.8, if executed on a sub-interval I = A[x, y] of A instead of the complete array A[1, n], runs in O(m + log |I|) time. This is O(m + log σ) for |I| = σ^{O(1)}.

We first classify the nodes in T's suffix tree S as follows:
1. Node v is called heavy if the number of leaves below v is at least σ.
2. Otherwise, node v is called light.

(Figure: a heavy node v has ≥ σ leaves below it, a light node has < σ.)

Note that all heavy nodes in S have only heavy ancestors by definition, and hence the heavy nodes form a connected subtree of S. Heavy nodes v are further classified into
(a) branching, if at least two of v's children are heavy,
(b) non-branching, if exactly one of v's children is heavy,
(c) terminal, if none of v's children are heavy.

Example 11.

(Figure: the suffix tree of an example text T with σ = 3, with every node marked as branching heavy, non-branching heavy, terminal heavy, or light.)

Lemma 8. The number of branching heavy nodes is O(n/σ).

Proof. First count the terminal heavy nodes. By definition, every heavy node has ≥ σ leaves below itself. Note that every leaf in S is below at most one terminal heavy node. For the sake of contradiction, suppose there were more than n/σ terminal heavy nodes. Then the tree S would contain more than (n/σ) · σ = n leaves. Contradiction! Hence, the number of terminal heavy nodes is at most n/σ. Now look at the subtree of S consisting of the terminal and branching heavy nodes only. This is a tree with at most n/σ leaves in which every internal node has at least two children. For such a tree we know that the number of internal nodes is bounded by the number of leaves. The claim follows. □

We now augment the suffix tree with additional information at every heavy node v:

(a) At every branching heavy node v, we store an array B_v of size σ, in which we store pointers to v's children according to the first character on the corresponding edge (like in the bad suffix tree implementation with O(nσ) space!).
(b) At every non-branching heavy node v, we store a pointer h_v to its single heavy child.
(c) All terminal heavy nodes, and all light children of branching or non-branching heavy nodes, are called interval nodes. Every interval node v is augmented with its suffix interval I_v, which contains the start and end position in the suffix array A of the leaf labels below v. Furthermore, the suffix intervals of adjacent light children of a non-branching heavy node are contracted into one single interval. Thus, non-branching heavy nodes v have at most 3 children (the heavy child h_v, and one interval node each to the left/right of h_v). Everything in the tree below the interval nodes can be deleted.

The resulting tree, together with the suffix array A, is called the suffix tray.

Example 12. The suffix tray for the text of Example 11 is shown in the following picture (the dotted part has been deleted from the tree):

(Figure: the suffix tray, with a child array B_v at the branching heavy root, heavy-child pointers h_v at the non-branching heavy nodes, suffix intervals I_v at the interval nodes, and the suffix array A below.)

Lemma 9. The size of the resulting data structure is O(n).

Proof. From Lemma 8 we know that there are only O(n/σ) branching heavy nodes. So the total space for the branching heavy nodes is O((n/σ) · σ) = O(n). All other information is constant at each node. The claim follows. □

Now we describe the search procedure. We start as with normal suffix trees and try reading P from the root downwards. There are three different cases to consider (assume the characters P_{1..i-1} have already been matched, and we have arrived at a node v of the suffix tray with v̄ = P_{1..i-1}):

(a) At branching heavy nodes v, we can find the correctly labeled edge (i.e., the edge whose label starts with P_i) in O(1) time, by consulting B_v[P_i].
(b) At non-branching heavy nodes v, we first check whether the search continues towards the heavy child, by comparing the next character P_i with the first character a ∈ Σ on the edge (v, h_v). If this is the case (P_i = a), we compare all characters on the edge (v, h_v) to the pattern and, if necessary, continue the search procedure at the heavy child h_v. Otherwise (P_i ≠ a), there are two possibilities: if P_i < a, we continue at the (only) interval node left of h_v with case (c); if P_i > a, we do the same for the only interval node right of h_v.
(c) At interval nodes v, we switch to the suffix array search algorithm from Sect. 1.8, using I_v as the start interval.

Lemma 10. The length of the intervals stored at the interval nodes is O(σ²).

Proof. The intervals of at most σ - 1 light nodes are contracted into a single interval, and each light node has at most σ - 1 leaves below itself. □

Theorem 11. The suffix tray supports counting and reporting queries in O(m + log σ) and O(m + log σ + |O_P|) time, respectively.

Proof. We either advance by one character in P with a constant amount of work, or we arrive at an interval node v, where we perform the accelerated binary search in O(m + log |I_v|) = O(m + log σ²) = O(m + 2 log σ) = O(m + log σ) time, by Lemma 10. □

2 The Burrows-Wheeler Transformation

The Burrows-Wheeler Transformation was originally invented for text compression. Nonetheless, it was soon noted that it is also a very useful tool in text indexing. In this chapter, we introduce the transformation and briefly review its merits for compression. The subsequent chapter on backwards search will then explain how it is used in the indexing scenario.

2.1 Recommended Reading

- G. Navarro and V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys 39(1), Article no. 2 (61 pages), 2007. Section 5.3.
- M. Burrows and D. J. Wheeler: A Block-sorting Lossless Data Compression Algorithm. SRC Research Report 124, 1994.
- D. Adjeroh, T. Bell, and A. Mukherjee: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching. Springer, 2008.

2.2 The Transformation

Definition 11. Let T_{1..n} be a text of length n, where t_n = $ is a unique character lexicographically smaller than all other characters in Σ. Then the i-th cyclic shift of T is T_{i..n} T_{1..i-1}. We denote it by T^{(i)}.

Example 13. For T = AACTATTAC$ we have T^{(6)} = TTAC$AACTA.

The Burrows-Wheeler Transformation T_bwt of T is obtained by the following steps:
1. Write all cyclic shifts T^{(i)}, 1 ≤ i ≤ n, column-wise next to each other.
2. Sort the columns lexicographically.
3. Output the last row. This is T_bwt.

Example 14.

For T = AACTATTAC$, writing the cyclic shifts T^{(1)}, ..., T^{(10)} column-wise gives the left matrix; sorting the columns lexicographically gives the right one:

    A A C T A T T A C $        $ A A A A C C T T T   <- F (first)
    A C T A T T A C $ A        A A C C T $ T A A T
    C T A T T A C $ A A        A C $ T T A A C T A
    T A T T A C $ A A C        C T A A A A T $ T C
    A T T A C $ A A C T        T A A T C C T A A $
    T T A C $ A A C T A        A T C T $ T A A C A
    T A C $ A A C T A T        T T T A A A C C $ A
    A C $ A A C T A T T        T A A C A T $ T A C
    C $ A A C T A T T A        A C T $ C T A A A T
    $ A A C T A T T A C        C $ T A T A A T C A   <- T_bwt = L (last)

The text T_bwt in the last row is also denoted by L (last), and the text in the first row by F (first). Note:
- Every row in the BWT matrix is a permutation of the characters in T.
- Row F is a sorted list of all characters in T.
- In row L = T_bwt, similar characters are grouped together. This is why T_bwt can be compressed more easily than T.

2.3 Construction of the BWT

The BWT matrix need not be constructed explicitly in order to obtain T_bwt. Since T is terminated with the special character $, which is lexicographically smaller than any a ∈ Σ, the shifts T^{(i)} are sorted exactly like T's suffixes. Because the last row consists of the characters preceding the corresponding suffixes, we have

    T_bwt[i] = t_{A[i]-1} (= the last character of T^{(A[i])}),

where A again denotes T's suffix array, and t_0 is defined to be t_n (read T cyclically!). Because the suffix array can be constructed in linear time (shown in the lecture Advanced Methods for Sequence Analysis), we get:

Theorem 12. The BWT of a length-n text over an integer alphabet can be constructed in O(n) time.
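In code (our sketch; it uses the comparison-sort suffix_array from Chapter 1 instead of a linear-time construction):

    def bwt(T):
        # T_bwt[i] = t_{A[i]-1}, with t_0 := t_n (read T cyclically)
        A = suffix_array(T)                 # from Sect. 1.4; not linear-time here
        return "".join(T[a - 2] for a in A) # T[a-2] is t_{a-1}; for a = 1 it is T[-1] = t_n

    print(bwt("AACTATTAC$"))                # C$TATAATCA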

Example 15. For T = AACTATTAC$ we have A = [10, 1, 8, 2, 5, 9, 3, 7, 4, 6], and reading the characters t_{A[i]-1} gives T_bwt = C$TATAATCA, in accordance with the matrix of Example 14.

2.4 The Reverse Transformation

The amazing property of the BWT is that it is not a random permutation of T's letters, but that it can be transformed back to the original text. For this, we need the following definition:

Definition 12. Let F and L be the strings resulting from the BWT. Then the last-to-front mapping lf is a function lf: [1, n] → [1, n], defined by lf(i) = j ⟺ T^{(A[j])} = (T^{(A[i])})^{(n)} (equivalently, A[j] = A[i] - 1, read cyclically). (Remember that T^{(A[i])} is the i-th column in the BWT matrix, and (T^{(A[i])})^{(n)} is that column rotated by one character downwards.) Thus, lf(i) tells us the position in F where L[i] occurs.

Example 16. For T = AACTATTAC$ (so L = C$TATAATCA), the last-to-front mapping is LF = [6, 1, 8, 2, 9, 3, 4, 10, 7, 5].

Observation 3. Equal characters preserve the same order in F and L. That is, if L[i] = L[j] and i < j, then lf(i) < lf(j). To see why this is so, recall that the BWT matrix is sorted lexicographically. Because both the lf(i)-th and the lf(j)-th column start with the same character a = L[i] = L[j], they must be sorted according to what follows this character a, say α and β. But since i < j, we know α <_lex β, hence lf(i) < lf(j).

(Figure: columns lf(i) and lf(j) of the BWT matrix, both starting with a and continuing with α and β, respectively.)

This observation allows us to compute the lf-mapping without knowing the suffix array of T.

Definition 13. Let T be a text of length n over an alphabet Σ, and let L = T_bwt be its BWT. Define C: Σ → [0, n] such that C(a) is the number of occurrences in T of characters that are lexicographically smaller than a ∈ Σ. Define occ: Σ × [1, n] → [0, n] such that occ(a, i) is the number of occurrences of a in L's length-i prefix L[1, i].

Lemma 13. With the definitions above, lf(i) = C(L[i]) + occ(L[i], i).

Proof. Follows immediately from the observation above. □

This gives rise to the following algorithm to recover T from L = T_bwt:

1. Scan L = T_bwt and compute the array C[1, σ].
2. Compute the first row F from C.
3. Compute occ(L[i], i) for all i.
4. Recover T from right to left: we know that t_n = $, and the corresponding cyclic shift T^{(n)} appears in column 1 of the BWT matrix. Hence, t_{n-1} = L[1]. Shift T^{(n-1)} appears in column lf(1), and thus t_{n-2} = L[lf(1)]. This continues until the whole text has been recovered:

    t_{n-k} = L[lf(lf(... (lf(1)) ...))]   with k - 1 applications of lf.
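A compact Python version of this recovery algorithm (ours; the lf-values are tabulated in one scan instead of answering occ-queries on the fly):

    def inverse_bwt(L):
        # C[a] = number of characters in L smaller than a
        n = len(L)
        sigma = sorted(set(L))
        C, acc = {}, 0
        for a in sigma:
            C[a] = acc
            acc += L.count(a)
        # lf[i] = C[L[i]] + occ(L[i], i), computed in one left-to-right scan
        lf, seen = [0] * (n + 1), {a: 0 for a in sigma}
        for i in range(1, n + 1):
            seen[L[i - 1]] += 1
            lf[i] = C[L[i - 1]] + seen[L[i - 1]]
        # recover T from right to left, starting at t_n = '$'
        out, k = ['$'], 1
        for _ in range(n - 1):
            out.append(L[k - 1])
            k = lf[k]
        return "".join(reversed(out))

    print(inverse_bwt("C$TATAATCA"))   # AACTATTAC$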

Example 17. For T = AACTATTAC$:

    F            = $ A A A A C C T T T
    L            = C $ T A T A A T C A
    occ(L[i], i) = 1 1 1 1 2 2 3 3 2 4

t_n = $; k = 1: t_{n-1} = L[1] = C; k = lf(1) = 6: t_{n-2} = L[6] = A; k = lf(6) = 3: t_{n-3} = L[3] = T; k = lf(3) = 8: t_{n-4} = L[8] = T; k = lf(8) = 10: t_{n-5} = L[10] = A; etc. Reversing the recovered sequence gives back T.

2.5 Compression

Storing T_bwt plainly needs the same space as storing the original text. However, because equal characters are grouped together in T_bwt, we can compress T_bwt in a second stage. We review two different compression methods in this section.

2.5.1 Move-to-Front (MTF) & Huffman Coding

Initialize a list Y containing each character of Σ in alphabetic order. In a left-to-right scan of T_bwt (i = 1, ..., n), compute a new array R[1, n]:
- Write the position of character T_bwt[i] in Y to R[i].
- Move character T_bwt[i] to the front of Y.

Encode the resulting string/array R with any kind of reversible compressor, e.g. Huffman, into a string R'.

Example 18.

For T_bwt = C$TATAATCA and Y initialized to ($, A, C, T):

    i   Y_old      T_bwt[i]   R[i]   Y_new
    1   $ A C T    C          3      C $ A T
    2   C $ A T    $          2      $ C A T
    3   $ C A T    T          4      T $ C A
    4   T $ C A    A          4      A T $ C
    5   A T $ C    T          2      T A $ C
    6   T A $ C    A          2      A T $ C
    7   A T $ C    A          1      A T $ C
    8   A T $ C    T          2      T A $ C
    9   T A $ C    C          4      C T A $
    10  C T A $    A          3      A C T $

This yields R = [3, 2, 4, 4, 2, 2, 1, 2, 4, 3].

Observation 4. MTF produces many small numbers for equal characters that are close together in T_bwt. These can then be compressed using an order-0 compressor, e.g. Huffman, as in the next example.

Example 19. (Table: the frequencies of the values occurring in R and the resulting Huffman code; frequent small values of R receive short codewords.)

Both steps (Huffman & MTF) are easy to reverse.

2.5.2 Run-Length Encoding

We can also directly exploit the fact that T_bwt consists of many equal-letter runs. Each such run a^l can be encoded as a pair (a, l) with a ∈ Σ, l ∈ [1, n].

Example 20. (A BWT string consisting of five runs with lengths 4, 4, 1, 1, and 1 is encoded as five such pairs.)
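Both encoders in a few lines of Python (our sketch; mtf_encode keeps Y as a plain list, so a move-to-front step costs O(σ)):

    def mtf_encode(L, alphabet):
        Y = sorted(alphabet)                  # list Y in alphabetic order
        R = []
        for c in L:
            p = Y.index(c) + 1                # 1-based position of c in Y
            R.append(p)
            Y.insert(0, Y.pop(p - 1))         # move c to the front of Y
        return R

    def rle_encode(L):
        runs, i = [], 0
        while i < len(L):
            j = i
            while j < len(L) and L[j] == L[i]:
                j += 1
            runs.append((L[i], j - i))        # run a^l becomes the pair (a, l)
            i = j
        return runs

    L = "C$TATAATCA"
    print(mtf_encode(L, set(L)))              # [3, 2, 4, 4, 2, 2, 1, 2, 4, 3]
    print(rle_encode("CCCCAAAATTA"))          # [('C', 4), ('A', 4), ('T', 2), ('A', 1)]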

3 Backwards Search and FM-Indices

We are now going to explore how the BWT-transformed text is helpful for (indexed) pattern matching. Indices building on the BWT are called FM-indices, most likely in honour of their inventors P. Ferragina and G. Manzini. From now on, we shall always assume that the alphabet Σ is good-natured: σ = o(n/log n).

3.1 Recommended Reading

- G. Navarro and V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys 39(1), Article no. 2 (61 pages), 2007. Sect. 4.1, 4.2, 5.1, 5.4, 6.1, 9.1, and 9.2.

3.2 Model of Computation and Space Measurement

For the rest of this lecture, we work with the word-RAM model of computation. This means that we have a processor with registers of width w (usually w = 32 or w = 64), where the usual arithmetic operations (additions, shifts, comparisons, etc.) on w-bit wide words can be computed in constant time. Note that this matches all current computer architectures. We further assume that n, the input size, satisfies n ≤ 2^w, for otherwise we could not even address the whole input.

From now on, we measure the space of all data structures in bits instead of words, in order to be able to differentiate between the various text indexes. For example, an array of n numbers from the range [1, n] occupies n log n bits, as each array cell stores a binary number consisting of log n bits. As another example, a length-n text over an alphabet of size σ occupies n log σ bits. In this light, all text indexes we have seen so far (suffix trees, suffix arrays, suffix trays) occupy O(n log n + n log σ) bits. Note that the difference between log n and log σ can be quite large: e.g., for the human genome with σ = 4 and n ≈ 3 × 10⁹ we have log σ = 2, whereas log n ≈ 32. So the suffix array occupies about 16 times more memory than the genome itself!

3.3 Backward Search

We first focus our attention on the counting problem (Problem 1), i.e., on finding the number of occurrences of a pattern P_{1..m} in T_{1..n}. Recall from Chapter 2 that
- A denotes T's suffix array,
- F/L denote the first/last rows of the BWT matrix,
- lf(·) denotes the last-to-front mapping,
- C(a) denotes the number of occurrences in T of characters lexicographically smaller than a ∈ Σ,
- occ(a, i) denotes the number of occurrences of a in L[1, i].

Our aim is to identify the interval of P in A by searching P from right to left (= backwards). To this end, suppose we have already matched P_{i+1..m}, and know that the suffixes starting with P_{i+1..m} form the interval [s_{i+1}, e_{i+1}] in A. In a backwards search step, we wish to calculate the interval [s_i, e_i] of P_{i..m}. First note that [s_i, e_i] must be a sub-interval of [C(P_i) + 1, C(P_i + 1)], where P_i + 1 denotes the character that follows P_i in Σ.

(Figure: the suffix array A; the target interval [s_i, e_i] of P_{i..m} lies inside the interval [C(P_i) + 1, C(P_i + 1)] of all suffixes starting with P_i, while [s_{i+1}, e_{i+1}] is the already-known interval of P_{i+1..m}.)

So, among the suffixes starting with P_i, we need to identify those which continue with P_{i+1..m}. Looking at row L in the range from s_{i+1} to e_{i+1}, we see that there are exactly e_i - s_i + 1 positions j ∈ [s_{i+1}, e_{i+1}] with L[j] = P_i.

(Figure: the positions j in L[s_{i+1}, e_{i+1}] with L[j] = P_i correspond exactly to the occurrences of P_{i..m}.)

From the BWT decompression algorithm, we know that characters preserve their order in F and L (Observation 3). Hence, if there are x occurrences of P_i before position s_{i+1} in L, then s_i starts x positions behind C(P_i) + 1. This x is given by occ(P_i, s_{i+1} - 1). Likewise, if there are y occurrences of P_i within L[s_{i+1}, e_{i+1}], then e_i = s_i + y - 1. Again, y can be computed from the occ-function.

(Figure: the backwards search step, with x = occ(P_i, s_{i+1} - 1) occurrences of P_i in L strictly before position s_{i+1}.)

This gives rise to the following elegant algorithm for backwards search:

Algorithm 2: function backwards-search(P_{1..m})
    s ← 1; e ← n;
    for i = m .. 1 do
        s ← C(P_i) + occ(P_i, s - 1) + 1;
        e ← C(P_i) + occ(P_i, e);
        if s > e then return "no match";
    end
    return [s, e];

The reader should compare this to the normal binary search algorithm in suffix arrays (Alg. 1). Apart from matching backwards, there are two other notable deviations:
1. The suffix array A is not accessed during the search.
2. There is no need to access the input text. Hence, T and A can be deleted once T_bwt has been computed.

It remains to show how the array C and the function occ are implemented. Array C is actually very small and can be stored plainly, using σ log n bits.¹ Because σ = o(n/log n), this is o(n) bits. For occ, we have several options that are explored in the rest of this chapter. This is where the different FM-indices deviate from each other. In fact, we will see that there is a natural trade-off between time and space: using more space leads to a faster computation of the occ-values, while using less space implies a higher query time.

Theorem 14. With backwards search, we can solve the counting problem in O(m · t_occ) time, where t_occ denotes the time to answer an occ(·)-query.

3.4 First Ideas for Implementing Occ

For answering occ(c, i), there are two simple possibilities:
1. Scan L every time an occ(·)-query has to be answered. This occupies no extra space, but needs O(n) time for answering a single occ(·)-query, leading to a total query time of O(mn) for backwards search.
2. Store all answers to occ(c, i) in a two-dimensional table. This table occupies O(nσ log n) bits of space, but allows constant-time occ(·)-queries. The total time for backwards search is then the optimal O(m).

For a more practical implementation between these two extremes, let us define the following:

Definition 14. Given a bit-vector B[1, n], rank_1(B, i) counts the number of 1s in B's prefix B[1, i]. The operation rank_0(B, i) is defined similarly for 0-bits.

¹ More precisely, we should say σ⌈log n⌉ bits, but we will usually omit floors and ceilings from now on.
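The following sketch implements Alg. 2 in Python, with occ stored as the full table of prefix counts, i.e., the fast-but-large option 2 from above (our code):

    def bw_search(L, P):
        n = len(L)
        sigma = sorted(set(L))
        C, acc = {}, 0                        # C[a] = #characters smaller than a
        for a in sigma:
            C[a] = acc
            acc += L.count(a)
        occ = {a: [0] * (n + 1) for a in sigma}   # occ[a][i] = #a in L[1, i]
        for i, ch in enumerate(L, 1):
            for a in sigma:
                occ[a][i] = occ[a][i - 1] + (ch == a)
        s, e = 1, n
        for c in reversed(P):                 # the backwards search steps
            if c not in C:
                return None
            s = C[c] + occ[c][s - 1] + 1
            e = C[c] + occ[c][e]
            if s > e:
                return None                   # no match
        return s, e

    L = "C$TATAATCA"                          # BWT of T = AACTATTAC$
    print(bw_search(L, "TA"))                 # (8, 9): the interval of "TA" in A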

We shall see presently that a bit-vector B, together with additional information for constant-time rank-operations, can be stored in n + o(n) bits. This can be used as follows for implementing occ: for each character c ∈ Σ, store an indicator bit-vector B_c[1, n] such that B_c[i] = 1 iff L[i] = c. Then occ(c, i) = rank_1(B_c, i). The total space for all σ indicator bit-vectors is thus σn + o(σn) bits. Note that for reporting queries, we still need the suffix array to output the values in A[s, e] after the backwards search.

Theorem 15. With backwards search and constant-time rank operations on bit-vectors, we can answer counting queries in optimal O(m) time. The space (in bits) is σn + o(σn) + σ log n.

Example 21. For L = C$TATAATCA:

    B_C = 1 0 0 0 0 0 0 0 1 0
    B_A = 0 0 0 1 0 1 1 0 0 1
    B_T = 0 0 1 0 1 0 0 1 0 0

3.5 Compact Data Structures on Bit-Vectors

We now show that a bit-vector B of length n can be augmented with a data structure of size o(n) bits such that rank-queries can be answered in O(1) time. First note that rank_0(B, i) = i - rank_1(B, i), so considering rank_1 will be enough.

We conceptually divide the bit-vector B into blocks of length s = (log n)/2 and super-blocks of length s' = s² = Θ(log² n).

(Figure: B divided into super-blocks of length s', each consisting of blocks of length s.)

The idea is to decompose a rank_1-query into 3 sub-queries that are aligned with the block or super-block boundaries. To this end, we store three types of arrays:

1. For each of the n/s' super-blocks, M'[i] stores the number of 1s from B's beginning up to the end of the i-th super-block. This table needs O((n/s') · log n) = O(n/log n) = o(n) bits: there are n/s' = n/Θ(log² n) entries, each a value from [1, n] of log n bits.

2. For each of the n/s blocks, M[i] stores the number of 1s from the beginning of the super-block in which block i is contained up to the end of the i-th block. This needs O((n/s) · log s') = O(n log log n / log n) = o(n) bits of space: each of the n/s entries is a value from [1, s'] of O(log log n) bits.

3. For all bit-vectors V of length s and all 1 ≤ i ≤ s, P[V][i] stores the number of 1-bits in V[1, i]. Because there are only 2^s = 2^{(log n)/2} = √n such vectors V, the space for table P is O(√n · s · log s) = O(√n log n log log n) = o(n) bits (√n possible blocks, s queries each, values from [1, s]).

Example 22. For s = 3, s' = 9, and B = 110 101 101 100 110 001 (blanks only for readability):

    M' = [6, 10]
    M  = [2, 4, 6, 1, 3, 4]
    P[110] = [1, 2, 2], P[101] = [1, 1, 2], P[100] = [1, 1, 1], P[001] = [0, 0, 1], ...
    (one row of P for each of the 2³ = 8 possible blocks)

Thus, computing the block number as q = ⌊(i-1)/s⌋ and the super-block number as q' = ⌊(i-1)/s'⌋, we can answer

    rank_1(B, i) = M'[q'] + M[q] + P[B[qs+1, (q+1)s]][i - qs]

in constant time (with the conventions M'[0] = 0, and M[q] = 0 whenever block q lies in a different super-block than position i, i.e., whenever q is a multiple of s'/s).

(Figure: a query rank_1(B, i) decomposed into a super-block query (precomputed in M'), a block query (precomputed in M), and an in-block query (precomputed in P).)

Example 23. Continuing the example above, we answer rank_1(B, 17) as follows: the block number is q = ⌊16/3⌋ = 5, and the super-block number is q' = ⌊16/9⌋ = 1. Further, i's block is B[5·3+1, 6·3] = B[16, 18] = 001, and the index in that block is 17 - 5·3 = 2. Hence, rank_1(B, 17) = M'[1] + M[5] + P[001][2] = 6 + 3 + 0 = 9.
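A small Python model of this three-level scheme (ours; for simplicity we assume that the length of B is a multiple of the super-block length s', and the in-block table P is indexed by the block's bit string):

    def rank_build(B, s=3):
        # B is a string of '0'/'1'; blocks of length s, super-blocks of length s*s
        Msuper, M = [], []
        total = since_super = 0
        for q in range(len(B) // s):
            block = B[q * s : (q + 1) * s]
            since_super += block.count('1')
            total += block.count('1')
            M.append(since_super)
            if (q + 1) % s == 0:               # a super-block ends here
                Msuper.append(total)
                since_super = 0
        P = {}                                  # P[V][i] = number of 1s in V[1, i]
        for v in range(2 ** s):
            V = format(v, '0%db' % s)
            P[V] = [V[:i + 1].count('1') for i in range(s)]
        return Msuper, M, P

    def rank1(B, Msuper, M, P, i, s=3):
        q, q2 = (i - 1) // s, (i - 1) // (s * s)
        super_part = Msuper[q2 - 1] if q2 > 0 else 0
        block_part = M[q - 1] if q > 0 and q % s != 0 else 0   # M[q] = 0 at boundaries
        return super_part + block_part + P[B[q * s : (q + 1) * s]][i - q * s - 1]

    B = "110101101100110001"
    Msuper, M, P = rank_build(B)
    print(rank1(B, Msuper, M, P, 17))           # 9, as in Example 23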

This finishes the description of the data structure for O(1) rank-queries. In addition to that, we also define the inverse of rank, a function that will be helpful in subsequent chapters:

Definition 15. Given a bit-vector B[1, n], select_1(B, i) returns the position of the i-th 1-bit in B, or n + 1 if B contains less than i 1s. The operation select_0 is defined similarly.

Note that rank_1(B, select_1(B, i)) = i. The converse, select_1(B, rank_1(B, i)) = i, is only true if B[i] = 1. Note also that select_0 cannot be computed easily from select_1 (as was the case for rank), so select_1 and select_0 have to be considered separately.

Solving select-queries is only a little bit more complicated than solving rank-queries. We divide the range of arguments for select_1 into sub-ranges of size κ = log² n, and store in N[i] the answer to select_1(B, iκ). This table N[1, n/κ] needs O((n/κ) · log n) = O(n/log n) bits, and divides B into blocks of different sizes, each containing κ 1s (apart from the last).

(Figure: B divided at the positions N[1], N[2], N[3], ..., each block containing κ 1-bits.)

A block is called long if it spans more than κ² = Θ(log⁴ n) positions in B, and short otherwise. For the long blocks, we store the answers to all select_1-queries explicitly. Because there are at most n/log⁴ n long blocks, this requires O((n/κ²) · κ · log n) = O((n/log⁴ n) · log² n · log n) = O(n/log n) = o(n) bits (n/κ² long blocks, κ arguments each, values from [1, n]).

Short blocks contain κ 1-bits and span at most κ² positions in B. We again divide their range of arguments into sub-ranges of size κ' = log² κ = Θ(log² log n). In N'[i], we store the answer to select_1(B, iκ'), relative to the beginning of the block in which that 1-bit occurs:

    N'[i] = select_1(B, iκ') - N[⌊iκ'/κ⌋].

Because the values in N' are in the range [1, κ²], the table N'[1, n/κ'] needs O((n/κ') · log κ²) = O((n/log² log n) · log log n) = O(n/log log n) = o(n) bits.

The table N' divides the blocks into miniblocks, each containing κ' 1-bits. Miniblocks are long if they span more than κ'' = Θ(log n) bits, and short otherwise. For the long miniblocks, we again store the answers to all select-queries explicitly, relative to the beginning of the corresponding block. Because the miniblocks are contained in short blocks of length at most κ², such an answer takes O(log κ) bits of space. Thus, the total space for the long miniblocks is O((n/κ'') · κ' · log κ) = O(n log³ log n / log n) = o(n) bits.

Finally, because short miniblocks are of length O(log n), we can use a global lookup table (analogous to P in the solution for rank) to answer select_1-queries within short miniblocks.

(Figure: B divided into long and short blocks of κ 1s each; the short blocks are further divided into long and short miniblocks of κ' 1s each.)

Answering select_1(B, i) is then done by the same three-level decomposition: jump to the nearest precomputed answer N[⌊i/κ⌋]; if the block containing the i-th 1 is long, read the answer from the explicit list; otherwise descend analogously via N' into the corresponding miniblock, and finish either with the explicitly stored answer (long miniblock) or with the global lookup table (short miniblock). The structures need to be duplicated for select_0.

We summarize this section in the following theorem.

Theorem 16. An n-bit vector B can be augmented with data structures of size o(n) bits such that rank_b(B, i) and select_b(B, i) can be answered in constant time (b ∈ {0, 1}).

3.6 Wavelet Trees

Armed with constant-time rank-queries, we now develop a more space-efficient implementation of the occ-function, sacrificing the optimal query time. The idea is to use a wavelet tree on the BWT-transformed text.

The wavelet tree of a sequence L[1, n] over an alphabet Σ[1, σ] is a balanced binary search tree of height O(log σ). It is obtained as follows. We create a root node v, where we divide Σ into two halves Σ_l = Σ[1, ⌈σ/2⌉] and Σ_r = Σ[⌈σ/2⌉ + 1, σ] of roughly equal size. Hence, Σ_l holds the lexicographically first half of the characters of Σ, and Σ_r contains the other characters. At v we store a bit-vector B_v of length n (together with data structures for O(1) rank-queries), where a 0 at position i indicates that character L[i] belongs to Σ_l, and a 1 indicates that it belongs to Σ_r.

This defines two (virtual) sequences L_v and R_v, where L_v is obtained from L by concatenating all characters L[i] with B_v[i] = 0, in the order in which they appear in L. The sequence R_v is obtained in a similar manner for the positions i with B_v[i] = 1. The left child l_v is recursively defined to be the root of the wavelet tree for L_v, and the right child r_v to be the root of the wavelet tree for R_v. This process continues until a sequence consists of only one symbol, in which case we create a leaf.

Example 24. For L = C$TATAATCA: at the root, Σ = {$, A, C, T} is split into Σ_l = {$, A} and Σ_r = {C, T}, giving B_root = 1010100110. The left child represents the subsequence $AAAA (bit-vector 01111, splitting {$} from {A}), and the right child represents CTTTC (bit-vector 01110, splitting {C} from {T}); all further children are leaves.

Note that the sequences themselves are not stored explicitly; node v only stores the bit-vector B_v and structures for O(1) rank-queries.

Theorem 17. The wavelet tree for a sequence of length n over an alphabet of size σ can be stored in n log σ (1 + o(1)) bits.

Proof. We concatenate all bit-vectors at the same depth d into a single bit-vector B^d of length at most n, and prepare it for O(1) rank-queries (see Sect. 3.5). Hence, at any level, the space needed is n + o(n) bits. Because the depth of the tree is log σ, the claim on the space follows. □

In order to know the sub-interval of a particular node v in the concatenated bit-vector B^d at level d, we can store two indices α_v and β_v such that B^d[α_v, β_v] is the bit-vector B_v associated with node v. This accounts for additional O(σ log n) bits. A rank-query is then answered as follows (b ∈ {0, 1}):

    rank_b(B_v, i) = rank_b(B^d, α_v + i - 1) - rank_b(B^d, α_v - 1),

where it is assumed that i ≤ β_v - α_v + 1, for otherwise the result is not defined.

How does the wavelet tree help for implementing the occ-function? Suppose we want to compute occ(c, i), i.e., the number of occurrences of c ∈ Σ in L[1, i]. We start at the root r of the wavelet tree, and check whether c belongs to the first or to the second half of the alphabet. In the first case, we know that the c's are stored in the left child of the root, namely in L_r. Hence, the number of c's in L[1, i] equals the number of c's in L_r[1, rank_0(B_r, i)]. If, on the other hand, c belongs to the second half of the alphabet, we know that the c's are stored in the subsequence R_r corresponding to the right child of r, and we hence compute the number of occurrences of c in R_r[1, rank_1(B_r, i)] as the number of c's in L[1, i]. This leads to the following recursive procedure for computing occ(c, i), to be invoked with W-occ(c, i, 1, σ, r), where r is the root of the wavelet tree. (Recall that we assume that the characters in Σ can be accessed as Σ[1], ..., Σ[σ].)

Algorithm 3: function W-occ(c, i, σ_l, σ_r, v)
    if σ_l = σ_r then return i;
    σ_m ← ⌊(σ_l + σ_r)/2⌋;
    if c ≤ Σ[σ_m] then
        return W-occ(c, rank_0(B_v, i), σ_l, σ_m, l_v);
    else
        return W-occ(c, rank_1(B_v, i), σ_m + 1, σ_r, r_v);
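A runnable model of the wavelet tree and W-occ (our sketch; rank on the bit-vectors is computed naively by counting, where the real structure would use the o(n)-bit directories of Sect. 3.5):

    class WaveletNode:
        def __init__(self, seq, alphabet):
            self.alphabet = alphabet
            self.left = self.right = self.bits = None
            if len(alphabet) > 1:
                mid = len(alphabet) // 2                 # split the alphabet in half
                left_set = set(alphabet[:mid])
                self.bits = [0 if c in left_set else 1 for c in seq]
                self.left = WaveletNode([c for c in seq if c in left_set], alphabet[:mid])
                self.right = WaveletNode([c for c in seq if c not in left_set], alphabet[mid:])

        def occ(self, c, i):
            # number of occurrences of c in the node's sequence prefix of length i
            if len(self.alphabet) == 1:
                return i
            mid = len(self.alphabet) // 2
            if c in self.alphabet[:mid]:
                return self.left.occ(c, self.bits[:i].count(0))   # rank_0(B_v, i)
            return self.right.occ(c, self.bits[:i].count(1))      # rank_1(B_v, i)

    L = "C$TATAATCA"
    wt = WaveletNode(L, sorted(set(L)))
    print(wt.occ('T', 8))   # 3 = occ(T, 8), cf. Example 17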

Due to the depth of the wavelet tree, the time for W-occ(·) is O(log σ). This leads to the following theorem.

Theorem 18. With backwards search and a wavelet tree on T_bwt, we can answer counting queries in O(m log σ) time. The space (in bits) is O(σ log n) [for the α_v's] + n log σ [wavelet tree] + o(n log σ) [rank data structures].

4 Compressed Suffix Arrays

Until now, for enumerating the occurrences of a search pattern P in T, we still have to sacrifice O(n log n) bits for the suffix array A. Note that all other structures for backwards searching (the array C and the wavelet tree for computing occ) occupy O(n log σ) bits, the same as the text. We will show in this section that O(n log σ) bits also suffice for representing A. The drawback of this compressed suffix array is that the time for retrieving an entry of A is not constant any more, but rises from O(1) to O(log^ε n), for some arbitrarily small constant 0 < ε ≤ 1.

4.1 Recommended Reading

- K. Sadakane: New Text Indexing Functionalities of the Compressed Suffix Arrays. J. Algorithms 48(2): 294-313 (2003).
- G. Navarro and V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys 39(1), Article no. 2 (61 pages), 2007. Sect. 4.4, 4.5, 6.1.

4.2 The ψ-Function

The most important component of the compressed suffix array (abbreviated as CSA henceforth) is a function ψ that allows us to jump one character forward in the suffix array.

Definition 16. Define ψ: [1, n] → [1, n] such that ψ(i) = j ⟺ A[j] = A[i] + 1, where position n + 1 is interpreted as position 1 (read the text circularly!).

Example 25. (Figure: a text T of length 16 with its suffix array A and the ψ-values written below; all suffixes in A[2, 9] start with the letter A.)

Note the similarity of the ψ-function to suffix links in suffix trees: both cut off the first character of the corresponding substring. We remark that ψ is actually the inverse function of the lf-mapping from Sect. 3.3 (recall Def. 12): while ψ allows us to move from suffix T_{A[i]..n} to T_{A[i]+1..n}, lf moves us from T_{A[i]..n} to T_{A[i]-1..n}; in symbols: ψ(lf(i)) = i = lf(ψ(i)).

This is also the reason why ψ is increasing in areas where the corresponding suffixes start with the same character. For instance, in Example 25 all suffixes in A[2, 9] start with the letter A, and indeed ψ[2, 9] is an increasing sequence. This is summarized in the following lemma, which can be proved similarly to Observation 3:

Lemma 19. If i < j and T_{A[i]} = T_{A[j]}, then ψ(i) < ψ(j).

This lemma will be used in Sect. 4.6 to store ψ in a space-efficient form.

4.3 The Idea of the Compressed Suffix Array

We now present the general approach to storing A in a space-efficient form. Instead of storing every entry of A, in a new bit-vector B_0[1, n] we mark the positions in A where the corresponding entry is even: B_0[i] = 1 ⟺ A[i] ≡ 0 (mod 2). The bit-vector B_0 is prepared for O(1) rank-queries (Sect. 3.5). We further store the ψ-values at positions i with B_0[i] = 0 in a new array ψ_0[1, n/2]. Finally, we store the even values of A, in the order in which they appear in A, in a new array A_1[1, n/2], and divide all values in A_1 by 2.

Example 26. (Figure: a text T with A, B_0, ψ_0, and A_1 written below.)

Now the three arrays B_0, ψ_0 and A_1 completely substitute A: to retrieve the value A[i], we first check whether B_0[i] = 1. If so, we know that A[i]/2 is stored in A_1, and that its exact position in A_1 is given by the number of 1-bits in B_0 up to position i. Hence, A[i] = 2 · A_1[rank_1(B_0, i)]. If, on the other hand, B_0[i] = 0, we follow ψ(i) in order to get to the position of the (A[i] + 1)-st suffix, which must be even (and is hence stored in A_1). The value ψ(i) is stored in ψ_0, and its position therein is equal to the number of 0-bits in B_0 up to position i. Hence, A[i] = A[ψ_0[rank_0(B_0, i)]] - 1, which can be calculated by the mechanism of the previous paragraph.

As we shall see later, ψ_0 can be stored very efficiently (basically using O(n log σ) bits). Hence, we have almost halved the space with this approach (from n log n bits for A to (n/2) log(n/2) bits for A_1).
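The one-level scheme in Python (our sketch; ψ, B_0, ψ_0 and A_1 are built naively from A, and we assume n is even so that following ψ once always reaches an even entry):

    def csa_level0(T):
        n = len(T)
        A = suffix_array(T)                       # 1-based positions
        Ainv = [0] * (n + 2)
        for r, a in enumerate(A, 1):
            Ainv[a] = r
        psi = [Ainv[A[i - 1] + 1] if A[i - 1] < n else Ainv[1] for i in range(1, n + 1)]
        B0 = [1 if A[i - 1] % 2 == 0 else 0 for i in range(1, n + 1)]
        psi0 = [psi[i - 1] for i in range(1, n + 1) if B0[i - 1] == 0]
        A1 = [A[i - 1] // 2 for i in range(1, n + 1) if B0[i - 1] == 1]
        return B0, psi0, A1

    def csa_lookup(B0, psi0, A1, i):
        # A[i] = 2*A1[rank_1(B0, i)] if B0[i] = 1, else A[psi0[rank_0(B0, i)]] - 1
        if B0[i - 1] == 1:
            return 2 * A1[sum(B0[:i]) - 1]        # rank_1 computed naively
        return csa_lookup(B0, psi0, A1, psi0[i - sum(B0[:i]) - 1]) - 1

    T = "AACTATTAC$"
    B0, psi0, A1 = csa_level0(T)
    print([csa_lookup(B0, psi0, A1, i) for i in range(1, 11)])
    # [10, 1, 8, 2, 5, 9, 3, 7, 4, 6] = A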

4.4 Hierarchical Decomposition

We can use the idea from the previous section recursively in order to save more space: instead of representing A_1 plainly, we replace it by a bit-vector B_1, an array ψ_1 and an array A_2. Array A_2 can in turn be replaced by B_2, ψ_2, and A_3, and so on. In general, the array A_k[1, n_k], with n_k = n/2^k, implicitly represents those suffixes of T that start at positions that are multiples of 2^k, in the order in which they appear in the original array A_0 := A.

Example 27. (Figure: a text T of length 16 with level 0 (A_0, ψ_0, B_0), level 1 (A_1, ψ_1, B_1), and A_2.)

A_k can be seen as the suffix array of a new string T^k, where the i-th character of T^k is the concatenation of the 2^k characters T_{i·2^k..(i+1)·2^k-1} (we assume that T is padded with sufficiently many $-characters). This means that the alphabet of T^k is Σ^{2^k}, i.e., all 2^k-tuples over Σ.

Example 28. A_2 = [4, 1, 3, 2] can be regarded as the suffix array of a string T² of length 4, whose four characters are the four 4-tuples of T's characters.

This way, on level k we only store B_k and ψ_k. Only on the last level h do we store A_h explicitly. We choose h = log log_σ n, so that n_h = n/2^h = n/log_σ n, and the space for storing A_h is O(n_h log n_h) = O(n_h log n) = O((n/log_σ n) · log n) = O(n log σ) bits.

However, storing B_k and ψ_k on all h levels would take too much space. Instead, we use only a constant number 1/ε of levels, namely the levels 0, hε, 2hε, ..., h (for a constant 0 < ε ≤ 1).

Example 29.

(Figure: n = 16, h = 4, ε = 1/2; only the levels 0, 2 and 4 are stored, i.e., B_0, ψ_0, B_2, ψ_2, and A_4 = [1].)

Hence, on a stored level k, the bit-vector B_k has a 1 at position i iff A_k[i] is a multiple of 2^{hε} (equivalently, iff the corresponding text position A_k[i] · 2^k is a multiple of 2^{hε+k}). Given all this, we have the following algorithm to compute A[i], to be invoked with lookup(i, 0):

Algorithm 4: function lookup(i, k)
    if k = h then return A_h[i];
    if i = ω_k then return n_k;
    if B_k[i] = 1 then
        return 2^{hε} · lookup(rank_1(B_k, i), k + hε);
    else
        return lookup(ψ_k(rank_0(B_k, i)), k) - 1;

Here, ω_k stores the position of the last suffix, i.e., A_k[ω_k] = n_k. Checking whether i = ω_k is necessary in order to avoid following ψ_k from the last suffix to the first, because this would give incorrect results.

Example 30.

    A[15] = lookup(15, 0) = lookup(ψ_0(11), 0) - 1 = lookup(6, 0) - 1
          = 2² · lookup(3, 2) - 1
          = 2² · (lookup(ψ_2(2), 2) - 1) - 1
          = 2² · (lookup(1, 2) - 1) - 1
          = 2² · (4 - 1) - 1 = 11

To analyze the running time of the lookup procedure, we first note that on every stored level k, we need to follow ψ_k at most 2^{hε} times until we hit a position i with B_k[i] = 1 (second case of the last if-statement). Because the number of implemented levels, 1/ε, is constant (remember that ε is constant!), the total time of the lookup procedure is

    O(2^{hε}) = O((2^{log log_σ n})^ε) = O(log^ε_σ n),

which is sub-logarithmic for ε < 1.


More information

Online Sorted Range Reporting and Approximating the Mode

Online Sorted Range Reporting and Approximating the Mode Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted

More information

Optimal-Time Text Indexing in BWT-runs Bounded Space

Optimal-Time Text Indexing in BWT-runs Bounded Space Optimal-Time Text Indexing in BWT-runs Bounded Space Travis Gagie Gonzalo Navarro Nicola Prezza Abstract Indexing highly repetitive texts such as genomic databases, software repositories and versioned

More information

An O(N) Semi-Predictive Universal Encoder via the BWT

An O(N) Semi-Predictive Universal Encoder via the BWT An O(N) Semi-Predictive Universal Encoder via the BWT Dror Baron and Yoram Bresler Abstract We provide an O(N) algorithm for a non-sequential semi-predictive encoder whose pointwise redundancy with respect

More information

arxiv: v2 [cs.ds] 8 Apr 2016

arxiv: v2 [cs.ds] 8 Apr 2016 Optimal Dynamic Strings Paweł Gawrychowski 1, Adam Karczmarz 1, Tomasz Kociumaka 1, Jakub Łącki 2, and Piotr Sankowski 1 1 Institute of Informatics, University of Warsaw, Poland [gawry,a.karczmarz,kociumaka,sank]@mimuw.edu.pl

More information

Algorithms for pattern involvement in permutations

Algorithms for pattern involvement in permutations Algorithms for pattern involvement in permutations M. H. Albert Department of Computer Science R. E. L. Aldred Department of Mathematics and Statistics M. D. Atkinson Department of Computer Science D.

More information

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account: cd ~cs9319/papers Original readings of each lecture will be placed there. 2 Course

More information

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b) Introduction to Algorithms October 14, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Srini Devadas and Constantinos (Costis) Daskalakis Quiz 1 Solutions Quiz 1 Solutions Problem

More information

A Space-Efficient Frameworks for Top-k String Retrieval

A Space-Efficient Frameworks for Top-k String Retrieval A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott

More information

COS597D: Information Theory in Computer Science October 19, Lecture 10

COS597D: Information Theory in Computer Science October 19, Lecture 10 COS597D: Information Theory in Computer Science October 9, 20 Lecture 0 Lecturer: Mark Braverman Scribe: Andrej Risteski Kolmogorov Complexity In the previous lectures, we became acquainted with the concept

More information

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits Chris Calabro January 13, 2016 1 RAM model There are many possible, roughly equivalent RAM models. Below we will define one in the fashion

More information

Succincter text indexing with wildcards

Succincter text indexing with wildcards University of British Columbia CPM 2011 June 27, 2011 Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd 4.8 Huffman Codes These lecture slides are supplied by Mathijs de Weerd Data Compression Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

Introduction to Turing Machines. Reading: Chapters 8 & 9

Introduction to Turing Machines. Reading: Chapters 8 & 9 Introduction to Turing Machines Reading: Chapters 8 & 9 1 Turing Machines (TM) Generalize the class of CFLs: Recursively Enumerable Languages Recursive Languages Context-Free Languages Regular Languages

More information

Text Indexing: Lecture 6

Text Indexing: Lecture 6 Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 2: Text Compression Lecture 7: Burrows Wheeler Compression Juha Kärkkäinen 21.11.2017 1 / 16 Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is a transformation

More information

Optimal lower bounds for rank and select indexes

Optimal lower bounds for rank and select indexes Optimal lower bounds for rank and select indexes Alexander Golynski David R. Cheriton School of Computer Science, University of Waterloo agolynski@cs.uwaterloo.ca Technical report CS-2006-03, Version:

More information

Internal Pattern Matching Queries in a Text and Applications

Internal Pattern Matching Queries in a Text and Applications Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 12.1 Introduction Today we re going to do a couple more examples of dynamic programming. While

More information

Proof Techniques (Review of Math 271)

Proof Techniques (Review of Math 271) Chapter 2 Proof Techniques (Review of Math 271) 2.1 Overview This chapter reviews proof techniques that were probably introduced in Math 271 and that may also have been used in a different way in Phil

More information

CSE 202 Homework 4 Matthias Springer, A

CSE 202 Homework 4 Matthias Springer, A CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and

More information

SORTING SUFFIXES OF TWO-PATTERN STRINGS.

SORTING SUFFIXES OF TWO-PATTERN STRINGS. International Journal of Foundations of Computer Science c World Scientific Publishing Company SORTING SUFFIXES OF TWO-PATTERN STRINGS. FRANTISEK FRANEK and WILLIAM F. SMYTH Algorithms Research Group,

More information

Define M to be a binary n by m matrix such that:

Define M to be a binary n by m matrix such that: The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]

More information

On Compressing and Indexing Repetitive Sequences

On Compressing and Indexing Repetitive Sequences On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv

More information

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Notes on Logarithmic Lower Bounds in the Cell Probe Model Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)

More information

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12 Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

Compressed Representations of Sequences and Full-Text Indexes

Compressed Representations of Sequences and Full-Text Indexes Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte

More information

arxiv: v1 [cs.ds] 19 Apr 2011

arxiv: v1 [cs.ds] 19 Apr 2011 Fixed Block Compression Boosting in FM-Indexes Juha Kärkkäinen 1 and Simon J. Puglisi 2 1 Department of Computer Science, University of Helsinki, Finland juha.karkkainen@cs.helsinki.fi 2 Department of

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200

Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Fast String Kernels Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Alex.Smola@anu.edu.au joint work with S.V.N. Vishwanathan Slides (soon) available

More information

On improving matchings in trees, via bounded-length augmentations 1

On improving matchings in trees, via bounded-length augmentations 1 On improving matchings in trees, via bounded-length augmentations 1 Julien Bensmail a, Valentin Garnero a, Nicolas Nisse a a Université Côte d Azur, CNRS, Inria, I3S, France Abstract Due to a classical

More information

Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076

More information

Rank and Select Operations on Binary Strings (1974; Elias)

Rank and Select Operations on Binary Strings (1974; Elias) Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo

More information

Lecture 6 September 21, 2016

Lecture 6 September 21, 2016 ICS 643: Advanced Parallel Algorithms Fall 2016 Lecture 6 September 21, 2016 Prof. Nodari Sitchinava Scribe: Tiffany Eulalio 1 Overview In the last lecture, we wrote a non-recursive summation program and

More information

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS Manolis Christodoulakis 1, Costas S. Iliopoulos 1, Yoan José Pinzón Ardila 2 1 King s College London, Department of Computer Science London WC2R

More information

State of the art Image Compression Techniques

State of the art Image Compression Techniques Chapter 4 State of the art Image Compression Techniques In this thesis we focus mainly on the adaption of state of the art wavelet based image compression techniques to programmable hardware. Thus, an

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

Compact Indexes for Flexible Top-k Retrieval

Compact Indexes for Flexible Top-k Retrieval Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

CSci 311, Models of Computation Chapter 4 Properties of Regular Languages

CSci 311, Models of Computation Chapter 4 Properties of Regular Languages CSci 311, Models of Computation Chapter 4 Properties of Regular Languages H. Conrad Cunningham 29 December 2015 Contents Introduction................................. 1 4.1 Closure Properties of Regular

More information

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes Information Theory with Applications, Math6397 Lecture Notes from September 3, 24 taken by Ilknur Telkes Last Time Kraft inequality (sep.or) prefix code Shannon Fano code Bound for average code-word length

More information

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata CISC 4090: Theory of Computation Chapter Regular Languages Xiaolan Zhang, adapted from slides by Prof. Werschulz Section.: Finite Automata Fordham University Department of Computer and Information Sciences

More information

arxiv: v1 [cs.ds] 25 Nov 2009

arxiv: v1 [cs.ds] 25 Nov 2009 Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,

More information

arxiv: v1 [cs.ds] 8 Sep 2018

arxiv: v1 [cs.ds] 8 Sep 2018 Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology

More information

Complementary Contextual Models with FM-index for DNA Compression

Complementary Contextual Models with FM-index for DNA Compression 2017 Data Compression Conference Complementary Contextual Models with FM-index for DNA Compression Wenjing Fan,WenruiDai,YongLi, and Hongkai Xiong Department of Electronic Engineering Department of Biomedical

More information

1 Definition of a Turing machine

1 Definition of a Turing machine Introduction to Algorithms Notes on Turing Machines CS 4820, Spring 2017 April 10 24, 2017 1 Definition of a Turing machine Turing machines are an abstract model of computation. They provide a precise,

More information

Space-Efficient Construction Algorithm for Circular Suffix Tree

Space-Efficient Construction Algorithm for Circular Suffix Tree Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes

More information

Covering Linear Orders with Posets

Covering Linear Orders with Posets Covering Linear Orders with Posets Proceso L. Fernandez, Lenwood S. Heath, Naren Ramakrishnan, and John Paul C. Vergara Department of Information Systems and Computer Science, Ateneo de Manila University,

More information

Automata and Computability. Solutions to Exercises

Automata and Computability. Solutions to Exercises Automata and Computability Solutions to Exercises Fall 28 Alexis Maciel Department of Computer Science Clarkson University Copyright c 28 Alexis Maciel ii Contents Preface vii Introduction 2 Finite Automata

More information

1 Basic Definitions. 2 Proof By Contradiction. 3 Exchange Argument

1 Basic Definitions. 2 Proof By Contradiction. 3 Exchange Argument 1 Basic Definitions A Problem is a relation from input to acceptable output. For example, INPUT: A list of integers x 1,..., x n OUTPUT: One of the three smallest numbers in the list An algorithm A solves

More information

arxiv: v1 [cs.ds] 22 Nov 2012

arxiv: v1 [cs.ds] 22 Nov 2012 Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:

More information

Slides for CIS 675. Huffman Encoding, 1. Huffman Encoding, 2. Huffman Encoding, 3. Encoding 1. DPV Chapter 5, Part 2. Encoding 2

Slides for CIS 675. Huffman Encoding, 1. Huffman Encoding, 2. Huffman Encoding, 3. Encoding 1. DPV Chapter 5, Part 2. Encoding 2 Huffman Encoding, 1 EECS Slides for CIS 675 DPV Chapter 5, Part 2 Jim Royer October 13, 2009 A toy example: Suppose our alphabet is { A, B, C, D }. Suppose T is a text of 130 million characters. What is

More information

Suffix Array of Alignment: A Practical Index for Similar Data

Suffix Array of Alignment: A Practical Index for Similar Data Suffix Array of Alignment: A Practical Index for Similar Data Joong Chae Na 1, Heejin Park 2, Sunho Lee 3, Minsung Hong 3, Thierry Lecroq 4, Laurent Mouchard 4, and Kunsoo Park 3, 1 Department of Computer

More information

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem

More information

Dynamic Programming: Shortest Paths and DFA to Reg Exps

Dynamic Programming: Shortest Paths and DFA to Reg Exps CS 374: Algorithms & Models of Computation, Spring 207 Dynamic Programming: Shortest Paths and DFA to Reg Exps Lecture 8 March 28, 207 Chandra Chekuri (UIUC) CS374 Spring 207 / 56 Part I Shortest Paths

More information

Automata and Computability. Solutions to Exercises

Automata and Computability. Solutions to Exercises Automata and Computability Solutions to Exercises Spring 27 Alexis Maciel Department of Computer Science Clarkson University Copyright c 27 Alexis Maciel ii Contents Preface vii Introduction 2 Finite Automata

More information

Optimal Color Range Reporting in One Dimension

Optimal Color Range Reporting in One Dimension Optimal Color Range Reporting in One Dimension Yakov Nekrich 1 and Jeffrey Scott Vitter 1 The University of Kansas. yakov.nekrich@googlemail.com, jsv@ku.edu Abstract. Color (or categorical) range reporting

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21, Sequene Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fisher) January 21, 201511 9 Suffix Trees and Suffix Arrays This leture is based on the following soures, whih are all reommended

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information