
Advanced Text Indexing Techniques

Johannes Fischer

SS 2009


1 Suffix Trees, -Arrays and -Trays

1.1 Recommended Reading

- Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
- U. Manber, E. W. Myers: Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22(5): 935-948 (1993).
- V. Heun: Algorithmen auf Sequenzen. Skriptum, Ludwig-Maximilians-Universität München, Version 2.
- R. Cole, T. Kopelowitz, M. Lewenstein: Suffix Trays and Suffix Trists. In: M. Bugliesi et al. (Eds.): Proceedings of the 33rd Intl. Colloquium on Automata, Languages and Programming (ICALP 2006), Part I. Lecture Notes in Computer Science (LNCS) 4051, Springer, 2006.

1.2 Suffix Trees

In this section we introduce suffix trees, which, among many other things, can be used to solve the string matching task (find a pattern P of length m in a text T of length n in O(n + m) time). There are other methods (Boyer-Moore, e.g.) which solve this task in the same time. So why do we need suffix trees in the context of string matching? The advantage of suffix trees over the other string-matching algorithms (Boyer-Moore, KMP, etc.) is that suffix trees are an index of the text. So, if T is static and there are several patterns to be matched against it, the O(n) task of building the index needs to be done only once, and each subsequent matching task can be done in O(m) time. If m << n, this is a clear advantage over the other algorithms.

Throughout this section, let T = t_1 t_2 ... t_n be a text over an alphabet Σ of size |Σ| =: σ.

Definition 1. A compact Σ⁺-tree is a rooted tree S = (V, E) with edge labels from Σ⁺ that fulfills the following two constraints:
- For all v ∈ V: all outgoing edges of v start with a different a ∈ Σ.
- Apart from the root, no node has out-degree 1.

Definition 2. Let S = (V, E) be a compact Σ⁺-tree. For v ∈ V, v̄ denotes the concatenation of all edge labels on the path from the root of S to v. |v̄| is called the string-depth of v and is denoted by d(v). S is said to display α ∈ Σ* iff there exist v ∈ V and β ∈ Σ* with v̄ = αβ. If v̄ = α for some v ∈ V and α ∈ Σ*, we also write ᾱ to denote v. words(S) denotes all strings in Σ* that are displayed by S: words(S) = {α ∈ Σ* : S displays α}. For i ∈ {1, 2, ..., n}, t_i t_{i+1} ... t_n is called the i-th suffix of T and is denoted by T_{i..n}. In general, we use the notation T_{i..j} as an abbreviation of t_i t_{i+1} ... t_j.

Example 1. (Figure: a compact Σ⁺-tree S over Σ = {A, C, G, T}; the marked node v has depth 1 and string-depth d(v) = 2, and words(S) contains, among others, ε, A, AG, and AGA.)

We are now ready to define suffix trees.

Definition 3. Let substring(T) denote the set of all substrings of T, substring(T) = {T_{i..j} : 1 ≤ i ≤ j ≤ n}. The suffix tree of T is a compact Σ⁺-tree S with words(S) = substring(T).

For several reasons, we shall find it useful that each suffix ends in a leaf of S. This can be accomplished by adding a new character $ ∉ Σ to the end of T, and building the suffix tree over T$. From now on, we assume that T terminates with $, and we define $ to be lexicographically smaller than all other characters in Σ: $ < a for all a ∈ Σ. This gives a one-to-one correspondence between T's suffixes and the leaves of S, which implies that we can label the leaves with a function l by the start index of the suffix they represent: l(v) = i ⟺ v̄ = T_{i..n} for all leaf nodes v. This also explains the name suffix tree.

Implementation Remark: The outgoing edges at internal nodes v of the suffix tree can be implemented in two fundamentally different ways:
1. as arrays of size σ,
2. as arrays of size s_v, where s_v denotes the number of v's children.

Example 2. Suffix tree implemented in the first way:

(Figure: the suffix tree of a text T over Σ = {A, C, G, T} plus the terminator $, where every node stores a child array of size σ; most entries are null pointers.)

Example 3. Suffix tree implemented in the second way:

(Figure: the same suffix tree, where every node v stores a child array of size s_v; no null pointers remain.)

Approach (1) has the advantage that the outgoing edge whose edge label starts with α ∈ Σ can be located in O(1) time, but the complete suffix tree uses space O(nσ), which can be as bad as O(n²). Hence, we assume that approach (2) is used, which implies that locating the correct outgoing edge takes O(log σ) time (using binary search). Note that the space consumption of approach (2) is always O(n), independent of σ.

We state a final theorem, which we proved in the lecture Advanced Methods for Sequence Analysis:

Theorem 1. The suffix tree for a text of length n over an integer alphabet can be built in O(n) time.

A note on the alphabet size: alphabets can be classified into different types, according to their size σ:

1. Constant alphabets with σ = O(1).
2. Good-natured alphabets with σ = o(n/log n).

3. Integer alphabets with σ = O(n).
4. Unbounded alphabets, where no upper bound on σ exists.

This list clearly forms a hierarchy; e.g., integer alphabets subsume constant and good-natured alphabets. In this lecture, all results are valid for good-natured alphabets, unless stated otherwise. Note that this restriction is not too severe, as alphabets usually grow much more slowly than the corresponding texts (if they grow at all), so it is usually safe to assume σ = o(n/log n).

1.3 Searching in Suffix Trees

Let P be a pattern of length m. Throughout the whole lecture, we will be concerned with the following two problems:

Problem 1 (Counting). Return the number of matches of P in T. Formally, return the size of O_P = {i ∈ [1, n] : T_{i..i+m-1} = P}.

Problem 2 (Reporting). Return all occurrences of P in T, i.e., return the set O_P.

Example 4. (Figure: the suffix tree of a text T over Σ = {A, C, G, T} plus $; for the depicted pattern P, the leaf labels below the node reached by matching P give O_P = {8, 5, 4}.)

With suffix trees, the counting problem can be solved in O(m log σ) time: traverse the tree from the root downwards, in each step locating the correct outgoing edge, until P has been scanned completely. More formally, suppose that P_{1..i-1} has already been parsed for some 1 ≤ i < m, and our position in the suffix tree S is at node v (v̄ = P_{1..i-1}). We then find v's outgoing edge e whose label starts with P_i. This takes O(log σ) time. We then compare the label of e character by character with P_{i..m}, until we have read all of P (i = m), or until we have reached a position j ≥ i for which P_{1..j} is a node v' in S, in which case we continue the procedure at v'. This takes a total of O(m log σ) time.

Suppose the search procedure has brought us successfully to a node v, or to the incoming edge of node v. We then output the size of S_v, the subtree of S rooted at v. This can be done in constant time, assuming that we have labeled all nodes in S with their subtree sizes. This answers the counting query. For the reporting query, we output the labels of all leaves in S_v (recall that the leaves are labeled with text positions).

Theorem 2. The suffix tree allows us to answer counting queries in O(m log σ) time, and reporting queries in O(m log σ + |O_P|) time.
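To make the descent concrete, here is a small Python sketch (our own illustration, not part of the original script). For brevity it builds an uncompacted suffix trie in O(n²) space instead of the compact suffix tree, and it answers reporting queries by collecting leaf labels instead of storing subtree sizes; the example string is a made-up stand-in.

    def build_suffix_trie(T):
        root = {}
        for i in range(len(T)):                # insert suffix T[i:]
            node = root
            for c in T[i:]:
                node = node.setdefault(c, {})
            node.setdefault('leaves', []).append(i + 1)   # 1-based start index
        return root

    def report(root, P):
        node = root
        for c in P:                            # descend along P
            if c not in node:
                return []                      # P does not occur in T
            node = node[c]
        out, stack = [], [node]                # collect all leaf labels below
        while stack:
            v = stack.pop()
            for key, child in v.items():
                if key == 'leaves':
                    out.extend(child)
                else:
                    stack.append(child)
        return sorted(out)

    T = "AACTATTAC$"                           # hypothetical example text
    print(report(build_suffix_trie(T), "TA"))  # [4, 7]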

1.4 Suffix- and LCP-Arrays

We will now introduce two arrays that are closely related to the suffix tree, the suffix array A and the LCP-array H.

Definition 4. The suffix array A of T is a permutation of {1, 2, ..., n} such that A[i] is the i-th smallest suffix in lexicographic order: T_{A[i-1]..n} < T_{A[i]..n} for all 1 < i ≤ n.

The second array builds on the suffix array:

Definition 5. The LCP-array H of T is defined such that H[1] = 0, and for all i > 1, H[i] holds the length of the longest common prefix of T_{A[i]..n} and T_{A[i-1]..n}.

Example 5. (Figure: the suffix array A and the LCP-array H of an example string T, written below the positions 1, ..., n.)

Lemma 3. Both the LCP-array H and the suffix array A can be computed in O(n) time.

The following observations relate the suffix array A to the suffix tree S.

Observation 1. If we do a lexicographically-driven depth-first search through S (visiting the children in lexicographic order of the first characters of their corresponding edge labels), then the leaf labels seen in this order give the suffix array A. (Remember that $ <_lex a for all a ∈ Σ.)

Example 6. (Figure: the suffix tree of an example string T with the leaves annotated by their suffix numbers; reading them off in a lexicographic depth-first search yields A.)
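For experimenting, A and H can be computed as follows (our sketch; building A by comparison sorting is not the linear-time method promised by Lemma 3, but H is computed from A in O(n) time with the algorithm of Kasai et al.):

    def suffix_array(T):
        # comparison sort of the suffixes; simple, but not O(n)
        return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

    def lcp_array(T, A):
        # Kasai et al.'s O(n) algorithm; positions and ranks are 1-based
        n = len(T)
        inv = [0] * (n + 1)
        for rank, start in enumerate(A, 1):
            inv[start] = rank                  # inv = A^{-1}
        H = [0] * (n + 1)                      # H[1] = 0 by definition
        l = 0
        for i in range(1, n + 1):              # text order; l drops by <= 1 per step
            k = inv[i]
            if k > 1:
                j = A[k - 2]                   # suffix preceding T_{i..n} in A
                while i + l <= n and j + l <= n and T[i + l - 1] == T[j + l - 1]:
                    l += 1
                H[k] = l
                l = max(l - 1, 0)
            else:
                l = 0
        return H[1:]                           # H[1..n] as a 0-based Python list

    T = "AACTATTAC$"
    A = suffix_array(T)    # [10, 1, 8, 2, 5, 9, 3, 7, 4, 6]
    H = lcp_array(T, A)    # [0, 0, 1, 2, 1, 0, 1, 0, 2, 1]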

Definition 6. Given a tree S = (V, E) and two nodes v, w ∈ V, the lowest common ancestor of v and w is the deepest node in S that is an ancestor of both v and w. This node is denoted by lca(v, w).

Observation 2. The string-depth of the lowest common ancestor of the leaves labeled A[i] and A[i-1] is given by the corresponding entry H[i] of the LCP-array, in symbols: for all i > 1: H[i] = d(lca(T_{A[i]..n}, T_{A[i-1]..n})).

In summary, these observations give a deep connection between S and H/A.

1.5 Searching in Suffix Arrays

We can use a plain suffix array A to search for a pattern P, using the ideas of binary search, since the suffixes in A are sorted lexicographically and hence the occurrences of P in T form an interval in A. The algorithm below performs two binary searches. The first search locates the starting position s of P's interval in A, and the second search determines the end position r. A counting query returns r - s + 1, and a reporting query returns the numbers A[s], A[s+1], ..., A[r].

Algorithm 1: function SAsearch(P_{1..m})
    l ← 1; r ← n + 1;
    while l < r do
        q ← ⌊(l + r)/2⌋;
        if P >_lex T_{A[q]..min{A[q]+m-1, n}} then l ← q + 1; else r ← q;
    end
    s ← l; l ← s - 1; r ← n;
    while l < r do
        q ← ⌈(l + r)/2⌉;
        if P =_lex T_{A[q]..min{A[q]+m-1, n}} then l ← q; else r ← q - 1;
    end
    return [s, r];

Note that both while-loops in Alg. 1 make sure that either l is increased or r is decreased, so they are both guaranteed to terminate. In fact, in the first while-loop, r always points one position behind the current search interval, and r is decreased in case of equality (when P =_lex T_{A[q]..min{A[q]+m-1, n}}). This makes sure that the first while-loop finds the leftmost position of P in A. The second loop works symmetrically.

Theorem 4. The suffix array allows us to answer counting queries in O(m log n) time, and reporting queries in O(m log n + |O_P|) time.
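A direct Python transcription of Alg. 1 (ours; ranks are 1-based, and Python's string comparison plays the role of <_lex):

    def sa_search(T, A, P):
        n, m = len(T), len(P)
        def pref(q):                       # first <= m characters of suffix A[q]
            start = A[q - 1]               # A is a Python list; ranks are 1-based
            return T[start - 1 : start - 1 + m]
        l, r = 1, n + 1                    # first loop: leftmost position s
        while l < r:
            q = (l + r) // 2
            if P > pref(q):
                l = q + 1
            else:
                r = q
        s = l
        l, r = s - 1, n                    # second loop: rightmost position r
        while l < r:
            q = (l + r + 1) // 2           # ceiling, so the loop terminates
            if P == pref(q):
                l = q
            else:
                r = q - 1
        return s, r

    T = "AACTATTAC$"
    A = suffix_array(T)                    # from the sketch in Sect. 1.4
    s, r = sa_search(T, A, "TA")           # s = 8, r = 9
    print(r - s + 1, [A[i - 1] for i in range(s, r + 1)])  # 2 occurrences: [7, 4]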

1.6 Range Minimum Queries (RMQs)

Definition 7. Given an array H[1, n] of integers (or any other objects from a totally ordered universe) and two indices 1 ≤ i ≤ j ≤ n, RMQ_H(i, j) returns the position of the minimum in H[i, j]: RMQ_H(i, j) = argmin_{i ≤ k ≤ j} H[k].

Example 7. (Figure: an array H with a query range [i, j] marked, and RMQ_H(i, j) pointing to the position of the minimum inside that range.)

In the lecture Advanced Methods for Sequence Analysis, we proved:

Theorem 5. A static array H can be preprocessed in linear time into a data structure of size O(n) that allows us to answer RMQs on H in constant time.

1.7 Longest Common Prefixes and Suffixes

An indispensable tool in pattern matching are efficient implementations of functions that compute longest common prefixes and longest common suffixes of two strings (usually suffixes or prefixes of the same string).

Definition 8. Given two strings U and V in Σ*, lcp(U, V) denotes the length of their longest common prefix, in symbols: lcp(U, V) = max{k ≥ 0 : U_{1..k} = V_{1..k}}.

Example 8. lcp(ACATG, ACC) = 2.

Note that lcp(·,·) only gives the length of the matching prefix; if one is actually interested in the prefix itself, it can be obtained as U_{1..lcp(U,V)}, or equivalently as V_{1..lcp(U,V)}.

As mentioned above, we will be particularly interested in longest common prefixes of suffixes from the same string T:

Definition 9. For a text T of length n and two indices 1 ≤ i, j ≤ n, lcp_T(i, j) denotes the length of the longest common prefix of the suffixes starting at positions i and j in T, in symbols: lcp_T(i, j) = lcp(T_{i..n}, T_{j..n}).

Example 9. For T = ACCAAACA: lcp_T(2, 7) = lcp(CCAAACA, CA) = 1, and lcp_T(4, 5) = lcp(AAACA, AACA) = 2.

Note that the LCP-array H from Sect. 1.4 holds the lengths of the longest common prefixes of lexicographically consecutive suffixes: H[i] = lcp(T_{A[i]..n}, T_{A[i-1]..n}) = lcp_T(A[i], A[i-1]). Here and in the remainder of this chapter, A is again the suffix array of the text T. But how do we get the lcp-values of suffixes that are not in lexicographic neighborhood? The key is to employ RMQs over the LCP-array, as shown in the next lemma.

Definition 10. The inverse suffix array A^{-1} is defined by A^{-1}[A[i]] = i for all 1 ≤ i ≤ n.
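For illustration, here is a sparse-table implementation of RMQs (ours): it answers queries in O(1) time, but its preprocessing uses O(n log n) words, so it does not achieve the O(n) space bound of Theorem 5.

    def rmq_preprocess(H):
        # M[k][i-1] = position of the minimum of H[i, i + 2^k - 1] (1-based)
        n = len(H)
        M = [list(range(1, n + 1))]
        k = 1
        while (1 << k) <= n:
            prev, half = M[k - 1], 1 << (k - 1)
            row = []
            for i in range(1, n - (1 << k) + 2):
                a, b = prev[i - 1], prev[i - 1 + half]
                row.append(a if H[a - 1] <= H[b - 1] else b)
            M.append(row)
            k += 1
        return M

    def rmq(H, M, i, j):
        # cover [i, j] by two overlapping power-of-two windows
        k = (j - i + 1).bit_length() - 1
        a = M[k][i - 1]
        b = M[k][j - (1 << k)]     # the window ending at j starts at j - 2^k + 1
        return a if H[a - 1] <= H[b - 1] else b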

Example 10. (Figure: an example text T with its suffix array A and inverse suffix array A^{-1} written below the positions 1, ..., n.)

Lemma 6. Let i ≠ j be two indices in T with A^{-1}[i] < A^{-1}[j] (otherwise swap i and j). Then lcp_T(i, j) = H[RMQ_H(A^{-1}[i] + 1, A^{-1}[j])].

Proof. First note that any common prefix ω of T_{i..n} and T_{j..n} must be a common prefix of T_{A[k]..n} for all A^{-1}[i] ≤ k ≤ A^{-1}[j], because these suffixes are lexicographically between T_{i..n} and T_{j..n} and must hence start with ω. Let m = RMQ_H(A^{-1}[i] + 1, A^{-1}[j]) and l = H[m]. By the definition of H, T_{i..i+l-1} is a common prefix of all suffixes T_{A[k]..n} for A^{-1}[i] ≤ k ≤ A^{-1}[j]. Hence, T_{i..i+l-1} is a common prefix of T_{i..n} and T_{j..n}. Now assume that T_{i..i+l} is also a common prefix of T_{i..n} and T_{j..n}. Then, by the lexicographic order of A, T_{i..i+l} is also a common prefix of T_{A[m-1]..n} and T_{A[m]..n}. But |T_{i..i+l}| = l + 1, contradicting the fact that H[m] = l tells us that T_{A[m-1]..n} and T_{A[m]..n} share no common prefix of length more than l. □

Lemma 6 implies that with the inverse suffix array A^{-1}, the LCP-array H, and constant-time RMQs on H, we can answer lcp-queries for arbitrary suffixes in O(1) time.
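In code, Lemma 6 reads as follows (our sketch, building on suffix_array, lcp_array and the RMQ functions from the previous sketches):

    def lcp_query(T, A, Ainv, H, M, i, j):
        # lcp of the suffixes T_{i..n} and T_{j..n} in O(1) time (Lemma 6)
        if i == j:
            return len(T) - i + 1
        x, y = Ainv[i], Ainv[j]
        if x > y:
            x, y = y, x
        return H[rmq(H, M, x + 1, y) - 1]   # H is stored as a 0-based list

    T = "AACTATTAC$"
    A = suffix_array(T)
    Ainv = [0] * (len(T) + 1)
    for rank, start in enumerate(A, 1):
        Ainv[start] = rank
    H = lcp_array(T, A)
    M = rmq_preprocess(H)
    print(lcp_query(T, A, Ainv, H, M, 4, 7))   # 2: lcp(TATTAC$, TAC$) = |TA|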

1.8 Accelerated Search in Suffix Arrays

The simple binary search (Alg. 1) may perform many unnecessary character comparisons, as in every step it compares P from scratch. With the help of the lcp-function from the previous section, we can improve the search in suffix arrays from O(m log n) to O(m + log n) time. The idea is to remember the number of matching characters of P with T_{A[l]..n} and T_{A[r]..n}, where [l, r] denotes the current interval of the binary search procedure. Let λ and ρ denote these numbers: λ = lcp(P, T_{A[l]..n}) and ρ = lcp(P, T_{A[r]..n}). Initially, both λ and ρ are 0. Let us consider an iteration of the first while-loop in function SAsearch(P), where we wish to determine whether to continue in [l, q] or [q, r]. (Alg. 1 would actually continue searching in [q+1, r] in the second case, but this minor improvement is not possible in the accelerated search.)

(Figure: the suffix array A with the current positions l ≤ q ≤ r; P agrees with T_{A[l]..n} in its first λ characters and with T_{A[r]..n} in its first ρ characters.)

Without loss of generality, assume λ ≥ ρ (otherwise swap the roles of l and r). We then look up ξ = lcp_T(A[l], A[q]), the length of the longest common prefix of the suffixes T_{A[l]..n} and T_{A[q]..n}. We look at three different cases:

1. ξ > λ. (Figure: T_{A[l]..n} and T_{A[q]..n} share a prefix α of length ξ > λ; in particular T_{A[l]+λ} = T_{A[q]+λ}.) Because P_{λ+1} >_lex T_{A[l]+λ} = T_{A[q]+λ}, we know that P >_lex T_{A[q]..n}, and can hence set l ← q and continue the search without any character comparison. Note that ρ and in particular λ correctly remain unchanged.

2. ξ = λ. (Figure: T_{A[l]..n} and T_{A[q]..n} share exactly the prefix α of length λ.) In this case we continue comparing P_{λ+1} with T_{A[q]+λ}, P_{λ+2} with T_{A[q]+λ+1}, and so on, until P is matched completely, or until a mismatch occurs. Say we have done this comparison up to P_{λ+k}. If P_{λ+k} >_lex T_{A[q]+λ+k-1}, we set l ← q and λ ← λ + k - 1. Otherwise, we set r ← q and ρ ← λ + k - 1.

3. ξ < λ. (Figure: T_{A[l]..n} and T_{A[q]..n} share only a prefix α of length ξ, so P_{ξ+1} = T_{A[l]+ξ} <_lex T_{A[q]+ξ}.) First note that ξ ≥ ρ, as lcp_T(A[l], A[r]) ≥ ρ, and T_{A[q]..n} lies lexicographically between T_{A[l]..n} and T_{A[r]..n}. So we can set r ← q and ρ ← ξ, and continue the binary search without any character comparison.

This algorithm either halves the search interval (cases 1 and 3) without any character comparison, or increases λ or ρ with each successful character comparison. Because neither λ nor ρ is ever decreased, and the search stops when λ = ρ = m, we see that the total number of character comparisons (= the total work in case 2) is O(m). So far we have proved the following theorem:

Theorem 7. Together with lcp-information, the suffix array supports counting and reporting queries in O(m + log n) and O(m + log n + |O_P|) time, respectively (recall that O_P is the set of occurrences of P in T).

1.9 Suffix Trays

We now show how the O(m log σ) search algorithm for suffix trees and the O(m + log n) search algorithm for suffix arrays can be combined to obtain a faster O(m + log σ) search algorithm. The general idea is to start the search in the suffix tree, where some additional information has been stored to speed up the search, and then, at an appropriate point, continue the search in the suffix array with a sufficiently small interval. Note that the accelerated search algorithm from Sect. 1.8, if executed on a sub-interval I = A[x, y] of A instead of the complete array A[1, n], runs in O(m + log |I|) time. This is O(m + log σ) for |I| = σ^{O(1)}.

We first classify the nodes in T's suffix tree S as follows:
1. Node v is called heavy if the number of leaves below v is at least σ.
2. Otherwise, node v is called light.

(Figure: a heavy node v has ≥ σ leaves below it, a light node has < σ.)

Note that all heavy nodes in S have only heavy ancestors by definition, and hence the heavy nodes form a connected subtree of S. Heavy nodes v are further classified into
(a) branching, if at least two of v's children are heavy,
(b) non-branching, if exactly one of v's children is heavy,
(c) terminal, if none of v's children are heavy.

Example 11.

(Figure: the suffix tree of an example text T with σ = 3, with every node marked as branching heavy, non-branching heavy, terminal heavy, or light.)

Lemma 8. The number of branching heavy nodes is O(n/σ).

Proof. First count the terminal heavy nodes. By definition, every heavy node has ≥ σ leaves below itself. Note that every leaf in S is below at most one terminal heavy node. For the sake of contradiction, suppose there were more than n/σ terminal heavy nodes. Then the tree S would contain more than (n/σ) · σ = n leaves. Contradiction! Hence, the number of terminal heavy nodes is at most n/σ. Now look at the subtree of S consisting of the terminal and branching heavy nodes only. This is a tree with at most n/σ leaves in which every internal node has at least two children. For such a tree we know that the number of internal nodes is bounded by the number of leaves. The claim follows. □

We now augment the suffix tree with additional information at every heavy node v:

(a) At every branching heavy node v, we store an array B_v of size σ, in which we store pointers to v's children according to the first character on the corresponding edge (like in the bad suffix tree implementation with O(nσ) space!).
(b) At every non-branching heavy node v, we store a pointer h_v to its single heavy child.
(c) All terminal heavy nodes, and all light children of branching or non-branching heavy nodes, are called interval nodes. Every interval node v is augmented with its suffix interval I_v, which contains the start and end position in the suffix array A of the leaf labels below v. Furthermore, the suffix intervals of adjacent light children of a non-branching heavy node are contracted into one single interval. Thus, non-branching heavy nodes v have at most 3 children (the heavy child h_v, and one interval node each to the left/right of h_v). Everything in the tree below the interval nodes can be deleted.

The resulting tree, together with the suffix array A, is called the suffix tray.

Example 12. The suffix tray for the text of Example 11 is shown in the following picture (the dotted part has been deleted from the tree):

(Figure: the suffix tray, with a child array B_v at the branching heavy root, heavy-child pointers h_v at the non-branching heavy nodes, suffix intervals I_v at the interval nodes, and the suffix array A below.)

Lemma 9. The size of the resulting data structure is O(n).

Proof. From Lemma 8 we know that there are only O(n/σ) branching heavy nodes. So the total space for the branching heavy nodes is O((n/σ) · σ) = O(n). All other information is constant at each node. The claim follows. □

Now we describe the search procedure. We start as with normal suffix trees and try reading P from the root downwards. There are three different cases to consider (assume the characters P_{1..i-1} have already been matched, and we have arrived at a node v of the suffix tray with v̄ = P_{1..i-1}):

(a) At branching heavy nodes v, we can find the correctly labeled edge (i.e., the edge whose label starts with P_i) in O(1) time, by consulting B_v[P_i].
(b) At non-branching heavy nodes v, we first check whether the search continues towards the heavy child, by comparing the next character P_i with the first character a ∈ Σ on the edge (v, h_v). If this is the case (P_i = a), we compare all characters on the edge (v, h_v) to the pattern and, if necessary, continue the search procedure at the heavy child h_v. Otherwise (P_i ≠ a), there are two possibilities: if P_i < a, we continue at the (only) interval node left of h_v with case (c); if P_i > a, we do the same for the only interval node right of h_v.
(c) At interval nodes v, we switch to the suffix array search algorithm from Sect. 1.8, using I_v as the start interval.

Lemma 10. The length of the intervals stored at the interval nodes is O(σ²).

Proof. The intervals of at most σ - 1 light nodes are contracted into a single interval, and each light node has at most σ - 1 leaves below itself. □

Theorem 11. The suffix tray supports counting and reporting queries in O(m + log σ) and O(m + log σ + |O_P|) time, respectively.

Proof. We either advance by one character in P with a constant amount of work, or we arrive at an interval node v, where we perform the accelerated binary search in O(m + log |I_v|) = O(m + log σ²) = O(m + 2 log σ) = O(m + log σ) time, by Lemma 10. □

2 The Burrows-Wheeler Transformation

The Burrows-Wheeler Transformation was originally invented for text compression. Nonetheless, it was soon noted that it is also a very useful tool in text indexing. In this chapter, we introduce the transformation and briefly review its merits for compression. The subsequent chapter on backwards search will then explain how it is used in the indexing scenario.

2.1 Recommended Reading

- G. Navarro and V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys 39(1), Article no. 2 (61 pages), 2007. Section 5.3.
- M. Burrows and D. J. Wheeler: A Block-sorting Lossless Data Compression Algorithm. SRC Research Report 124, 1994.
- D. Adjeroh, T. Bell, and A. Mukherjee: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching. Springer, 2008.

2.2 The Transformation

Definition 11. Let T_{1..n} be a text of length n, where t_n = $ is a unique character lexicographically smaller than all other characters in Σ. Then the i-th cyclic shift of T is T_{i..n} T_{1..i-1}. We denote it by T^{(i)}.

Example 13. For T = AACTATTAC$ we have T^{(6)} = TTAC$AACTA.

The Burrows-Wheeler Transformation T_bwt of T is obtained by the following steps:
1. Write all cyclic shifts T^{(i)}, 1 ≤ i ≤ n, column-wise next to each other.
2. Sort the columns lexicographically.
3. Output the last row. This is T_bwt.

Example 14.

For T = AACTATTAC$, writing the cyclic shifts T^{(1)}, ..., T^{(10)} column-wise gives the left matrix; sorting the columns lexicographically gives the right one:

    A A C T A T T A C $        $ A A A A C C T T T   <- F (first)
    A C T A T T A C $ A        A A C C T $ T A A T
    C T A T T A C $ A A        A C $ T T A A C T A
    T A T T A C $ A A C        C T A A A A T $ T C
    A T T A C $ A A C T        T A A T C C T A A $
    T T A C $ A A C T A        A T C T $ T A A C A
    T A C $ A A C T A T        T T T A A A C C $ A
    A C $ A A C T A T T        T A A C A T $ T A C
    C $ A A C T A T T A        A C T $ C T A A A T
    $ A A C T A T T A C        C $ T A T A A T C A   <- T_bwt = L (last)

The text T_bwt in the last row is also denoted by L (last), and the text in the first row by F (first). Note:
- Every row in the BWT matrix is a permutation of the characters in T.
- Row F is a sorted list of all characters in T.
- In row L = T_bwt, similar characters are grouped together. This is why T_bwt can be compressed more easily than T.

2.3 Construction of the BWT

The BWT matrix need not be constructed explicitly in order to obtain T_bwt. Since T is terminated with the special character $, which is lexicographically smaller than any a ∈ Σ, the shifts T^{(i)} are sorted exactly like T's suffixes. Because the last row consists of the characters preceding the corresponding suffixes, we have

    T_bwt[i] = t_{A[i]-1} (= the last character of T^{(A[i])}),

where A again denotes T's suffix array, and t_0 is defined to be t_n (read T cyclically!). Because the suffix array can be constructed in linear time (shown in the lecture Advanced Methods for Sequence Analysis), we get:

Theorem 12. The BWT of a length-n text over an integer alphabet can be constructed in O(n) time.
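In code (our sketch; it uses the comparison-sort suffix_array from Chapter 1 instead of a linear-time construction):

    def bwt(T):
        # T_bwt[i] = t_{A[i]-1}, with t_0 := t_n (read T cyclically)
        A = suffix_array(T)                 # from Sect. 1.4; not linear-time here
        return "".join(T[a - 2] for a in A) # T[a-2] is t_{a-1}; for a = 1 it is T[-1] = t_n

    print(bwt("AACTATTAC$"))                # C$TATAATCA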

Example 15. For T = AACTATTAC$ we have A = [10, 1, 8, 2, 5, 9, 3, 7, 4, 6], and reading the characters t_{A[i]-1} gives T_bwt = C$TATAATCA, in accordance with the matrix of Example 14.

2.4 The Reverse Transformation

The amazing property of the BWT is that it is not a random permutation of T's letters, but that it can be transformed back to the original text. For this, we need the following definition:

Definition 12. Let F and L be the strings resulting from the BWT. Then the last-to-front mapping lf is a function lf: [1, n] → [1, n], defined by lf(i) = j ⟺ T^{(A[j])} = (T^{(A[i])})^{(n)} (equivalently, A[j] = A[i] - 1, read cyclically). (Remember that T^{(A[i])} is the i-th column in the BWT matrix, and (T^{(A[i])})^{(n)} is that column rotated by one character downwards.) Thus, lf(i) tells us the position in F where L[i] occurs.

Example 16. For T = AACTATTAC$ (so L = C$TATAATCA), the last-to-front mapping is LF = [6, 1, 8, 2, 9, 3, 4, 10, 7, 5].

Observation 3. Equal characters preserve the same order in F and L. That is, if L[i] = L[j] and i < j, then lf(i) < lf(j). To see why this is so, recall that the BWT matrix is sorted lexicographically. Because both the lf(i)-th and the lf(j)-th column start with the same character a = L[i] = L[j], they must be sorted according to what follows this character a, say α and β. But since i < j, we know α <_lex β, hence lf(i) < lf(j).

(Figure: columns lf(i) and lf(j) of the BWT matrix, both starting with a and continuing with α and β, respectively.)

This observation allows us to compute the lf-mapping without knowing the suffix array of T.

Definition 13. Let T be a text of length n over an alphabet Σ, and let L = T_bwt be its BWT. Define C: Σ → [0, n] such that C(a) is the number of occurrences in T of characters that are lexicographically smaller than a ∈ Σ. Define occ: Σ × [1, n] → [0, n] such that occ(a, i) is the number of occurrences of a in L's length-i prefix L[1, i].

Lemma 13. With the definitions above, lf(i) = C(L[i]) + occ(L[i], i).

Proof. Follows immediately from the observation above. □

This gives rise to the following algorithm to recover T from L = T_bwt:

1. Scan L = T_bwt and compute the array C[1, σ].
2. Compute the first row F from C.
3. Compute occ(L[i], i) for all i.
4. Recover T from right to left: we know that t_n = $, and the corresponding cyclic shift T^{(n)} appears in column 1 of the BWT matrix. Hence, t_{n-1} = L[1]. Shift T^{(n-1)} appears in column lf(1), and thus t_{n-2} = L[lf(1)]. This continues until the whole text has been recovered:

    t_{n-k} = L[lf(lf(... (lf(1)) ...))]   with k - 1 applications of lf.
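A compact Python version of this recovery algorithm (ours; the lf-values are tabulated in one scan instead of answering occ-queries on the fly):

    def inverse_bwt(L):
        # C[a] = number of characters in L smaller than a
        n = len(L)
        sigma = sorted(set(L))
        C, acc = {}, 0
        for a in sigma:
            C[a] = acc
            acc += L.count(a)
        # lf[i] = C[L[i]] + occ(L[i], i), computed in one left-to-right scan
        lf, seen = [0] * (n + 1), {a: 0 for a in sigma}
        for i in range(1, n + 1):
            seen[L[i - 1]] += 1
            lf[i] = C[L[i - 1]] + seen[L[i - 1]]
        # recover T from right to left, starting at t_n = '$'
        out, k = ['$'], 1
        for _ in range(n - 1):
            out.append(L[k - 1])
            k = lf[k]
        return "".join(reversed(out))

    print(inverse_bwt("C$TATAATCA"))   # AACTATTAC$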

Example 17. For T = AACTATTAC$:

    F            = $ A A A A C C T T T
    L            = C $ T A T A A T C A
    occ(L[i], i) = 1 1 1 1 2 2 3 3 2 4

t_n = $; k = 1: t_{n-1} = L[1] = C; k = lf(1) = 6: t_{n-2} = L[6] = A; k = lf(6) = 3: t_{n-3} = L[3] = T; k = lf(3) = 8: t_{n-4} = L[8] = T; k = lf(8) = 10: t_{n-5} = L[10] = A; etc. Reversing the recovered sequence gives back T.

2.5 Compression

Storing T_bwt plainly needs the same space as storing the original text. However, because equal characters are grouped together in T_bwt, we can compress T_bwt in a second stage. We review two different compression methods in this section.

2.5.1 Move-to-Front (MTF) & Huffman Coding

Initialize a list Y containing each character of Σ in alphabetic order. In a left-to-right scan of T_bwt (i = 1, ..., n), compute a new array R[1, n]:
- Write the position of character T_bwt[i] in Y to R[i].
- Move character T_bwt[i] to the front of Y.

Encode the resulting string/array R with any kind of reversible compressor, e.g. Huffman, into a string R'.

Example 18.

For T_bwt = C$TATAATCA and Y initialized to ($, A, C, T):

    i   Y_old      T_bwt[i]   R[i]   Y_new
    1   $ A C T    C          3      C $ A T
    2   C $ A T    $          2      $ C A T
    3   $ C A T    T          4      T $ C A
    4   T $ C A    A          4      A T $ C
    5   A T $ C    T          2      T A $ C
    6   T A $ C    A          2      A T $ C
    7   A T $ C    A          1      A T $ C
    8   A T $ C    T          2      T A $ C
    9   T A $ C    C          4      C T A $
    10  C T A $    A          3      A C T $

This yields R = [3, 2, 4, 4, 2, 2, 1, 2, 4, 3].

Observation 4. MTF produces many small numbers for equal characters that are close together in T_bwt. These can then be compressed using an order-0 compressor, e.g. Huffman, as in the next example.

Example 19. (Table: the frequencies of the values occurring in R and the resulting Huffman code; frequent small values of R receive short codewords.)

Both steps (Huffman & MTF) are easy to reverse.

2.5.2 Run-Length Encoding

We can also directly exploit the fact that T_bwt consists of many equal-letter runs. Each such run a^l can be encoded as a pair (a, l) with a ∈ Σ, l ∈ [1, n].

Example 20. (A BWT string consisting of five runs with lengths 4, 4, 1, 1, and 1 is encoded as five such pairs.)
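Both encoders in a few lines of Python (our sketch; mtf_encode keeps Y as a plain list, so a move-to-front step costs O(σ)):

    def mtf_encode(L, alphabet):
        Y = sorted(alphabet)                  # list Y in alphabetic order
        R = []
        for c in L:
            p = Y.index(c) + 1                # 1-based position of c in Y
            R.append(p)
            Y.insert(0, Y.pop(p - 1))         # move c to the front of Y
        return R

    def rle_encode(L):
        runs, i = [], 0
        while i < len(L):
            j = i
            while j < len(L) and L[j] == L[i]:
                j += 1
            runs.append((L[i], j - i))        # run a^l becomes the pair (a, l)
            i = j
        return runs

    L = "C$TATAATCA"
    print(mtf_encode(L, set(L)))              # [3, 2, 4, 4, 2, 2, 1, 2, 4, 3]
    print(rle_encode("CCCCAAAATTA"))          # [('C', 4), ('A', 4), ('T', 2), ('A', 1)]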

3 Backwards Search and FM-Indices

We are now going to explore how the BWT-transformed text is helpful for (indexed) pattern matching. Indices building on the BWT are called FM-indices, most likely in honour of their inventors P. Ferragina and G. Manzini. From now on, we shall always assume that the alphabet Σ is good-natured: σ = o(n/log n).

3.1 Recommended Reading

- G. Navarro and V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys 39(1), Article no. 2 (61 pages), 2007. Sect. 4.1, 4.2, 5.1, 5.4, 6.1, 9.1, and 9.2.

3.2 Model of Computation and Space Measurement

For the rest of this lecture, we work with the word-RAM model of computation. This means that we have a processor with registers of width w (usually w = 32 or w = 64), where the usual arithmetic operations (additions, shifts, comparisons, etc.) on w-bit wide words can be computed in constant time. Note that this matches all current computer architectures. We further assume that n, the input size, satisfies n ≤ 2^w, for otherwise we could not even address the whole input.

From now on, we measure the space of all data structures in bits instead of words, in order to be able to differentiate between the various text indexes. For example, an array of n numbers from the range [1, n] occupies n log n bits, as each array cell stores a binary number consisting of log n bits. As another example, a length-n text over an alphabet of size σ occupies n log σ bits. In this light, all text indexes we have seen so far (suffix trees, suffix arrays, suffix trays) occupy O(n log n + n log σ) bits. Note that the difference between log n and log σ can be quite large: e.g., for the human genome with σ = 4 and n ≈ 3 × 10⁹ we have log σ = 2, whereas log n ≈ 32. So the suffix array occupies about 16 times more memory than the genome itself!

3.3 Backward Search

We first focus our attention on the counting problem (Problem 1), i.e., on finding the number of occurrences of a pattern P_{1..m} in T_{1..n}. Recall from Chapter 2 that
- A denotes T's suffix array,
- F/L denote the first/last rows of the BWT matrix,
- lf(·) denotes the last-to-front mapping,
- C(a) denotes the number of occurrences in T of characters lexicographically smaller than a ∈ Σ,
- occ(a, i) denotes the number of occurrences of a in L[1, i].

Our aim is to identify the interval of P in A by searching P from right to left (= backwards). To this end, suppose we have already matched P_{i+1..m}, and know that the suffixes starting with P_{i+1..m} form the interval [s_{i+1}, e_{i+1}] in A. In a backwards search step, we wish to calculate the interval [s_i, e_i] of P_{i..m}. First note that [s_i, e_i] must be a sub-interval of [C(P_i) + 1, C(P_i + 1)], where P_i + 1 denotes the character that follows P_i in Σ.

(Figure: the suffix array A; the target interval [s_i, e_i] of P_{i..m} lies inside the interval [C(P_i) + 1, C(P_i + 1)] of all suffixes starting with P_i, while [s_{i+1}, e_{i+1}] is the already-known interval of P_{i+1..m}.)

So, among the suffixes starting with P_i, we need to identify those which continue with P_{i+1..m}. Looking at row L in the range from s_{i+1} to e_{i+1}, we see that there are exactly e_i - s_i + 1 positions j ∈ [s_{i+1}, e_{i+1}] with L[j] = P_i.

(Figure: the positions j in L[s_{i+1}, e_{i+1}] with L[j] = P_i correspond exactly to the occurrences of P_{i..m}.)

From the BWT decompression algorithm, we know that characters preserve their order in F and L (Observation 3). Hence, if there are x occurrences of P_i before position s_{i+1} in L, then s_i starts x positions behind C(P_i) + 1. This x is given by occ(P_i, s_{i+1} - 1). Likewise, if there are y occurrences of P_i within L[s_{i+1}, e_{i+1}], then e_i = s_i + y - 1. Again, y can be computed from the occ-function.

(Figure: the backwards search step, with x = occ(P_i, s_{i+1} - 1) occurrences of P_i in L strictly before position s_{i+1}.)

This gives rise to the following elegant algorithm for backwards search:

Algorithm 2: function backwards-search(P_{1..m})
    s ← 1; e ← n;
    for i = m .. 1 do
        s ← C(P_i) + occ(P_i, s - 1) + 1;
        e ← C(P_i) + occ(P_i, e);
        if s > e then return "no match";
    end
    return [s, e];

The reader should compare this to the normal binary search algorithm in suffix arrays (Alg. 1). Apart from matching backwards, there are two other notable deviations:
1. The suffix array A is not accessed during the search.
2. There is no need to access the input text. Hence, T and A can be deleted once T_bwt has been computed.

It remains to show how the array C and the function occ are implemented. Array C is actually very small and can be stored plainly, using σ log n bits.¹ Because σ = o(n/log n), this is o(n) bits. For occ, we have several options that are explored in the rest of this chapter. This is where the different FM-indices deviate from each other. In fact, we will see that there is a natural trade-off between time and space: using more space leads to a faster computation of the occ-values, while using less space implies a higher query time.

Theorem 14. With backwards search, we can solve the counting problem in O(m · t_occ) time, where t_occ denotes the time to answer an occ(·)-query.

3.4 First Ideas for Implementing Occ

For answering occ(c, i), there are two simple possibilities:
1. Scan L every time an occ(·)-query has to be answered. This occupies no extra space, but needs O(n) time for answering a single occ(·)-query, leading to a total query time of O(mn) for backwards search.
2. Store all answers to occ(c, i) in a two-dimensional table. This table occupies O(nσ log n) bits of space, but allows constant-time occ(·)-queries. The total time for backwards search is then the optimal O(m).

For a more practical implementation between these two extremes, let us define the following:

Definition 14. Given a bit-vector B[1, n], rank_1(B, i) counts the number of 1s in B's prefix B[1, i]. The operation rank_0(B, i) is defined similarly for 0-bits.

¹ More precisely, we should say σ⌈log n⌉ bits, but we will usually omit floors and ceilings from now on.
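The following sketch implements Alg. 2 in Python, with occ stored as the full table of prefix counts, i.e., the fast-but-large option 2 from above (our code):

    def bw_search(L, P):
        n = len(L)
        sigma = sorted(set(L))
        C, acc = {}, 0                        # C[a] = #characters smaller than a
        for a in sigma:
            C[a] = acc
            acc += L.count(a)
        occ = {a: [0] * (n + 1) for a in sigma}   # occ[a][i] = #a in L[1, i]
        for i, ch in enumerate(L, 1):
            for a in sigma:
                occ[a][i] = occ[a][i - 1] + (ch == a)
        s, e = 1, n
        for c in reversed(P):                 # the backwards search steps
            if c not in C:
                return None
            s = C[c] + occ[c][s - 1] + 1
            e = C[c] + occ[c][e]
            if s > e:
                return None                   # no match
        return s, e

    L = "C$TATAATCA"                          # BWT of T = AACTATTAC$
    print(bw_search(L, "TA"))                 # (8, 9): the interval of "TA" in A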

We shall see presently that a bit-vector B, together with additional information for constant-time rank-operations, can be stored in n + o(n) bits. This can be used as follows for implementing occ: for each character c ∈ Σ, store an indicator bit-vector B_c[1, n] such that B_c[i] = 1 iff L[i] = c. Then occ(c, i) = rank_1(B_c, i). The total space for all σ indicator bit-vectors is thus σn + o(σn) bits. Note that for reporting queries, we still need the suffix array to output the values in A[s, e] after the backwards search.

Theorem 15. With backwards search and constant-time rank operations on bit-vectors, we can answer counting queries in optimal O(m) time. The space (in bits) is σn + o(σn) + σ log n.

Example 21. For L = C$TATAATCA:

    B_C = 1 0 0 0 0 0 0 0 1 0
    B_A = 0 0 0 1 0 1 1 0 0 1
    B_T = 0 0 1 0 1 0 0 1 0 0

3.5 Compact Data Structures on Bit-Vectors

We now show that a bit-vector B of length n can be augmented with a data structure of size o(n) bits such that rank-queries can be answered in O(1) time. First note that rank_0(B, i) = i - rank_1(B, i), so considering rank_1 will be enough.

We conceptually divide the bit-vector B into blocks of length s = (log n)/2 and super-blocks of length s' = s² = Θ(log² n).

(Figure: B divided into super-blocks of length s', each consisting of blocks of length s.)

The idea is to decompose a rank_1-query into 3 sub-queries that are aligned with the block or super-block boundaries. To this end, we store three types of arrays:

1. For each of the n/s' super-blocks, M'[i] stores the number of 1s from B's beginning up to the end of the i-th super-block. This table needs O((n/s') · log n) = O(n/log n) = o(n) bits: there are n/s' = n/Θ(log² n) entries, each a value from [1, n] of log n bits.

2. For each of the n/s blocks, M[i] stores the number of 1s from the beginning of the super-block in which block i is contained up to the end of the i-th block. This needs O((n/s) · log s') = O(n log log n / log n) = o(n) bits of space: each of the n/s entries is a value from [1, s'] of O(log log n) bits.

3. For all bit-vectors V of length s and all 1 ≤ i ≤ s, P[V][i] stores the number of 1-bits in V[1, i]. Because there are only 2^s = 2^{(log n)/2} = √n such vectors V, the space for table P is O(√n · s · log s) = O(√n log n log log n) = o(n) bits (√n possible blocks, s queries each, values from [1, s]).

Example 22. For s = 3, s' = 9, and B = 110 101 101 100 110 001 (blanks only for readability):

    M' = [6, 10]
    M  = [2, 4, 6, 1, 3, 4]
    P[110] = [1, 2, 2], P[101] = [1, 1, 2], P[100] = [1, 1, 1], P[001] = [0, 0, 1], ...
    (one row of P for each of the 2³ = 8 possible blocks)

Thus, computing the block number as q = ⌊(i-1)/s⌋ and the super-block number as q' = ⌊(i-1)/s'⌋, we can answer

    rank_1(B, i) = M'[q'] + M[q] + P[B[qs+1, (q+1)s]][i - qs]

in constant time (with the conventions M'[0] = 0, and M[q] = 0 whenever block q lies in a different super-block than position i, i.e., whenever q is a multiple of s'/s).

(Figure: a query rank_1(B, i) decomposed into a super-block query (precomputed in M'), a block query (precomputed in M), and an in-block query (precomputed in P).)

Example 23. Continuing the example above, we answer rank_1(B, 17) as follows: the block number is q = ⌊16/3⌋ = 5, and the super-block number is q' = ⌊16/9⌋ = 1. Further, i's block is B[5·3+1, 6·3] = B[16, 18] = 001, and the index in that block is 17 - 5·3 = 2. Hence, rank_1(B, 17) = M'[1] + M[5] + P[001][2] = 6 + 3 + 0 = 9.
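A small Python model of this three-level scheme (ours; for simplicity we assume that the length of B is a multiple of the super-block length s', and the in-block table P is indexed by the block's bit string):

    def rank_build(B, s=3):
        # B is a string of '0'/'1'; blocks of length s, super-blocks of length s*s
        Msuper, M = [], []
        total = since_super = 0
        for q in range(len(B) // s):
            block = B[q * s : (q + 1) * s]
            since_super += block.count('1')
            total += block.count('1')
            M.append(since_super)
            if (q + 1) % s == 0:               # a super-block ends here
                Msuper.append(total)
                since_super = 0
        P = {}                                  # P[V][i] = number of 1s in V[1, i]
        for v in range(2 ** s):
            V = format(v, '0%db' % s)
            P[V] = [V[:i + 1].count('1') for i in range(s)]
        return Msuper, M, P

    def rank1(B, Msuper, M, P, i, s=3):
        q, q2 = (i - 1) // s, (i - 1) // (s * s)
        super_part = Msuper[q2 - 1] if q2 > 0 else 0
        block_part = M[q - 1] if q > 0 and q % s != 0 else 0   # M[q] = 0 at boundaries
        return super_part + block_part + P[B[q * s : (q + 1) * s]][i - q * s - 1]

    B = "110101101100110001"
    Msuper, M, P = rank_build(B)
    print(rank1(B, Msuper, M, P, 17))           # 9, as in Example 23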

This finishes the description of the data structure for O(1) rank-queries. In addition to that, we also define the inverse of rank, a function that will be helpful in subsequent chapters:

Definition 15. Given a bit-vector B[1, n], select_1(B, i) returns the position of the i-th 1-bit in B, or n + 1 if B contains less than i 1s. The operation select_0 is defined similarly.

Note that rank_1(B, select_1(B, i)) = i. The converse, select_1(B, rank_1(B, i)) = i, is only true if B[i] = 1. Note also that select_0 cannot be computed easily from select_1 (as was the case for rank), so select_1 and select_0 have to be considered separately.

Solving select-queries is only a little bit more complicated than solving rank-queries. We divide the range of arguments for select_1 into sub-ranges of size κ = log² n, and store in N[i] the answer to select_1(B, iκ). This table N[1, n/κ] needs O((n/κ) · log n) = O(n/log n) bits, and divides B into blocks of different sizes, each containing κ 1s (apart from the last).

(Figure: B divided at the positions N[1], N[2], N[3], ..., each block containing κ 1-bits.)

A block is called long if it spans more than κ² = Θ(log⁴ n) positions in B, and short otherwise. For the long blocks, we store the answers to all select_1-queries explicitly. Because there are at most n/log⁴ n long blocks, this requires O((n/κ²) · κ · log n) = O((n/log⁴ n) · log² n · log n) = O(n/log n) = o(n) bits (n/κ² long blocks, κ arguments each, values from [1, n]).

Short blocks contain κ 1-bits and span at most κ² positions in B. We again divide their range of arguments into sub-ranges of size κ' = log² κ = Θ(log² log n). In N'[i], we store the answer to select_1(B, iκ'), relative to the beginning of the block in which that 1-bit occurs:

    N'[i] = select_1(B, iκ') - N[⌊iκ'/κ⌋].

Because the values in N' are in the range [1, κ²], the table N'[1, n/κ'] needs O((n/κ') · log κ²) = O((n/log² log n) · log log n) = O(n/log log n) = o(n) bits.

The table N' divides the blocks into miniblocks, each containing κ' 1-bits. Miniblocks are long if they span more than κ'' = Θ(log n) bits, and short otherwise. For the long miniblocks, we again store the answers to all select-queries explicitly, relative to the beginning of the corresponding block. Because the miniblocks are contained in short blocks of length at most κ², such an answer takes O(log κ) bits of space. Thus, the total space for the long miniblocks is O((n/κ'') · κ' · log κ) = O(n log³ log n / log n) = o(n) bits.

Finally, because short miniblocks are of length O(log n), we can use a global lookup table (analogous to P in the solution for rank) to answer select_1-queries within short miniblocks.

(Figure: B divided into long and short blocks of κ 1s each; the short blocks are further divided into long and short miniblocks of κ' 1s each.)

Answering select_1(B, i) is then done by the same three-level decomposition: jump to the nearest precomputed answer N[⌊i/κ⌋]; if the block containing the i-th 1 is long, read the answer from the explicit list; otherwise descend analogously via N' into the corresponding miniblock, and finish either with the explicitly stored answer (long miniblock) or with the global lookup table (short miniblock). The structures need to be duplicated for select_0.

We summarize this section in the following theorem.

Theorem 16. An n-bit vector B can be augmented with data structures of size o(n) bits such that rank_b(B, i) and select_b(B, i) can be answered in constant time (b ∈ {0, 1}).

3.6 Wavelet Trees

Armed with constant-time rank-queries, we now develop a more space-efficient implementation of the occ-function, sacrificing the optimal query time. The idea is to use a wavelet tree on the BWT-transformed text.

The wavelet tree of a sequence L[1, n] over an alphabet Σ[1, σ] is a balanced binary search tree of height O(log σ). It is obtained as follows. We create a root node v, where we divide Σ into two halves Σ_l = Σ[1, ⌈σ/2⌉] and Σ_r = Σ[⌈σ/2⌉ + 1, σ] of roughly equal size. Hence, Σ_l holds the lexicographically first half of the characters of Σ, and Σ_r contains the other characters. At v we store a bit-vector B_v of length n (together with data structures for O(1) rank-queries), where a 0 at position i indicates that character L[i] belongs to Σ_l, and a 1 indicates that it belongs to Σ_r.

This defines two (virtual) sequences L_v and R_v, where L_v is obtained from L by concatenating all characters L[i] with B_v[i] = 0, in the order in which they appear in L. The sequence R_v is obtained in a similar manner for the positions i with B_v[i] = 1. The left child l_v is recursively defined to be the root of the wavelet tree for L_v, and the right child r_v to be the root of the wavelet tree for R_v. This process continues until a sequence consists of only one symbol, in which case we create a leaf.

Example 24. For L = C$TATAATCA: at the root, Σ = {$, A, C, T} is split into Σ_l = {$, A} and Σ_r = {C, T}, giving B_root = 1010100110. The left child represents the subsequence $AAAA (bit-vector 01111, splitting {$} from {A}), and the right child represents CTTTC (bit-vector 01110, splitting {C} from {T}); all further children are leaves.

Note that the sequences themselves are not stored explicitly; node v only stores the bit-vector B_v and structures for O(1) rank-queries.

Theorem 17. The wavelet tree for a sequence of length n over an alphabet of size σ can be stored in n log σ (1 + o(1)) bits.

Proof. We concatenate all bit-vectors at the same depth d into a single bit-vector B^d of length at most n, and prepare it for O(1) rank-queries (see Sect. 3.5). Hence, at any level, the space needed is n + o(n) bits. Because the depth of the tree is log σ, the claim on the space follows. □

In order to know the sub-interval of a particular node v in the concatenated bit-vector B^d at level d, we can store two indices α_v and β_v such that B^d[α_v, β_v] is the bit-vector B_v associated with node v. This accounts for additional O(σ log n) bits. A rank-query is then answered as follows (b ∈ {0, 1}):

    rank_b(B_v, i) = rank_b(B^d, α_v + i - 1) - rank_b(B^d, α_v - 1),

where it is assumed that i ≤ β_v - α_v + 1, for otherwise the result is not defined.

How does the wavelet tree help for implementing the occ-function? Suppose we want to compute occ(c, i), i.e., the number of occurrences of c ∈ Σ in L[1, i]. We start at the root r of the wavelet tree, and check whether c belongs to the first or to the second half of the alphabet. In the first case, we know that the c's are stored in the left child of the root, namely in L_r. Hence, the number of c's in L[1, i] equals the number of c's in L_r[1, rank_0(B_r, i)]. If, on the other hand, c belongs to the second half of the alphabet, we know that the c's are stored in the subsequence R_r corresponding to the right child of r, and we hence compute the number of occurrences of c in R_r[1, rank_1(B_r, i)] as the number of c's in L[1, i]. This leads to the following recursive procedure for computing occ(c, i), to be invoked with W-occ(c, i, 1, σ, r), where r is the root of the wavelet tree. (Recall that we assume that the characters in Σ can be accessed as Σ[1], ..., Σ[σ].)

Algorithm 3: function W-occ(c, i, σ_l, σ_r, v)
    if σ_l = σ_r then return i;
    σ_m ← ⌊(σ_l + σ_r)/2⌋;
    if c ≤ Σ[σ_m] then
        return W-occ(c, rank_0(B_v, i), σ_l, σ_m, l_v);
    else
        return W-occ(c, rank_1(B_v, i), σ_m + 1, σ_r, r_v);
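A runnable model of the wavelet tree and W-occ (our sketch; rank on the bit-vectors is computed naively by counting, where the real structure would use the o(n)-bit directories of Sect. 3.5):

    class WaveletNode:
        def __init__(self, seq, alphabet):
            self.alphabet = alphabet
            self.left = self.right = self.bits = None
            if len(alphabet) > 1:
                mid = len(alphabet) // 2                 # split the alphabet in half
                left_set = set(alphabet[:mid])
                self.bits = [0 if c in left_set else 1 for c in seq]
                self.left = WaveletNode([c for c in seq if c in left_set], alphabet[:mid])
                self.right = WaveletNode([c for c in seq if c not in left_set], alphabet[mid:])

        def occ(self, c, i):
            # number of occurrences of c in the node's sequence prefix of length i
            if len(self.alphabet) == 1:
                return i
            mid = len(self.alphabet) // 2
            if c in self.alphabet[:mid]:
                return self.left.occ(c, self.bits[:i].count(0))   # rank_0(B_v, i)
            return self.right.occ(c, self.bits[:i].count(1))      # rank_1(B_v, i)

    L = "C$TATAATCA"
    wt = WaveletNode(L, sorted(set(L)))
    print(wt.occ('T', 8))   # 3 = occ(T, 8), cf. Example 17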

Due to the depth of the wavelet tree, the time for W-occ(·) is O(log σ). This leads to the following theorem.

Theorem 18. With backwards search and a wavelet tree on T_bwt, we can answer counting queries in O(m log σ) time. The space (in bits) is O(σ log n) [for the α_v's] + n log σ [wavelet tree] + o(n log σ) [rank data structures].

4 Compressed Suffix Arrays

Until now, for enumerating the occurrences of a search pattern P in T, we still have to sacrifice O(n log n) bits for the suffix array A. Note that all other structures for backwards searching (the array C and the wavelet tree for computing occ) occupy O(n log σ) bits, the same as the text. We will show in this section that O(n log σ) bits also suffice for representing A. The drawback of this compressed suffix array is that the time for retrieving an entry of A is not constant any more, but rises from O(1) to O(log^ε n), for some arbitrarily small constant 0 < ε ≤ 1.

4.1 Recommended Reading

- K. Sadakane: New Text Indexing Functionalities of the Compressed Suffix Arrays. J. Algorithms 48(2): 294-313 (2003).
- G. Navarro and V. Mäkinen: Compressed Full-Text Indexes. ACM Computing Surveys 39(1), Article no. 2 (61 pages), 2007. Sect. 4.4, 4.5, 6.1.

4.2 The ψ-Function

The most important component of the compressed suffix array (abbreviated as CSA henceforth) is a function ψ that allows us to jump one character forward in the suffix array.

Definition 16. Define ψ: [1, n] → [1, n] such that ψ(i) = j ⟺ A[j] = A[i] + 1, where position n + 1 is interpreted as position 1 (read the text circularly!).

Example 25. (Figure: a text T of length 16 with its suffix array A and the ψ-values written below; all suffixes in A[2, 9] start with the letter A.)

Note the similarity of the ψ-function to suffix links in suffix trees: both cut off the first character of the corresponding substring. We remark that ψ is actually the inverse function of the lf-mapping from Sect. 3.3 (recall Def. 12): while ψ allows us to move from suffix T_{A[i]..n} to T_{A[i]+1..n}, lf moves us from T_{A[i]..n} to T_{A[i]-1..n}; in symbols: ψ(lf(i)) = i = lf(ψ(i)).

This is also the reason why ψ is increasing in areas where the corresponding suffixes start with the same character. For instance, in Example 25 all suffixes in A[2, 9] start with the letter A, and indeed ψ[2, 9] is an increasing sequence. This is summarized in the following lemma, which can be proved similarly to Observation 3:

Lemma 19. If i < j and T_{A[i]} = T_{A[j]}, then ψ(i) < ψ(j).

This lemma will be used in Sect. 4.6 to store ψ in a space-efficient form.

4.3 The Idea of the Compressed Suffix Array

We now present the general approach to storing A in a space-efficient form. Instead of storing every entry of A, in a new bit-vector B_0[1, n] we mark the positions in A where the corresponding entry is even: B_0[i] = 1 ⟺ A[i] ≡ 0 (mod 2). The bit-vector B_0 is prepared for O(1) rank-queries (Sect. 3.5). We further store the ψ-values at positions i with B_0[i] = 0 in a new array ψ_0[1, n/2]. Finally, we store the even values of A, in the order in which they appear in A, in a new array A_1[1, n/2], and divide all values in A_1 by 2.

Example 26. (Figure: a text T with A, B_0, ψ_0, and A_1 written below.)

Now the three arrays B_0, ψ_0 and A_1 completely substitute A: to retrieve the value A[i], we first check whether B_0[i] = 1. If so, we know that A[i]/2 is stored in A_1, and that its exact position in A_1 is given by the number of 1-bits in B_0 up to position i. Hence, A[i] = 2 · A_1[rank_1(B_0, i)]. If, on the other hand, B_0[i] = 0, we follow ψ(i) in order to get to the position of the (A[i] + 1)-st suffix, which must be even (and is hence stored in A_1). The value ψ(i) is stored in ψ_0, and its position therein is equal to the number of 0-bits in B_0 up to position i. Hence, A[i] = A[ψ_0[rank_0(B_0, i)]] - 1, which can be calculated by the mechanism of the previous paragraph.

As we shall see later, ψ_0 can be stored very efficiently (basically using O(n log σ) bits). Hence, we have almost halved the space with this approach (from n log n bits for A to (n/2) log(n/2) bits for A_1).
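The one-level scheme in Python (our sketch; ψ, B_0, ψ_0 and A_1 are built naively from A, and we assume n is even so that following ψ once always reaches an even entry):

    def csa_level0(T):
        n = len(T)
        A = suffix_array(T)                       # 1-based positions
        Ainv = [0] * (n + 2)
        for r, a in enumerate(A, 1):
            Ainv[a] = r
        psi = [Ainv[A[i - 1] + 1] if A[i - 1] < n else Ainv[1] for i in range(1, n + 1)]
        B0 = [1 if A[i - 1] % 2 == 0 else 0 for i in range(1, n + 1)]
        psi0 = [psi[i - 1] for i in range(1, n + 1) if B0[i - 1] == 0]
        A1 = [A[i - 1] // 2 for i in range(1, n + 1) if B0[i - 1] == 1]
        return B0, psi0, A1

    def csa_lookup(B0, psi0, A1, i):
        # A[i] = 2*A1[rank_1(B0, i)] if B0[i] = 1, else A[psi0[rank_0(B0, i)]] - 1
        if B0[i - 1] == 1:
            return 2 * A1[sum(B0[:i]) - 1]        # rank_1 computed naively
        return csa_lookup(B0, psi0, A1, psi0[i - sum(B0[:i]) - 1]) - 1

    T = "AACTATTAC$"
    B0, psi0, A1 = csa_level0(T)
    print([csa_lookup(B0, psi0, A1, i) for i in range(1, 11)])
    # [10, 1, 8, 2, 5, 9, 3, 7, 4, 6] = A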

4.4 Hierarchical Decomposition

We can use the idea from the previous section recursively in order to save more space: instead of representing A_1 plainly, we replace it by a bit-vector B_1, an array ψ_1 and an array A_2. Array A_2 can in turn be replaced by B_2, ψ_2, and A_3, and so on. In general, the array A_k[1, n_k], with n_k = n/2^k, implicitly represents those suffixes of T that start at positions that are multiples of 2^k, in the order in which they appear in the original array A_0 := A.

Example 27. (Figure: a text T of length 16 with level 0 (A_0, ψ_0, B_0), level 1 (A_1, ψ_1, B_1), and A_2.)

A_k can be seen as the suffix array of a new string T^k, where the i-th character of T^k is the concatenation of the 2^k characters T_{i·2^k..(i+1)·2^k-1} (we assume that T is padded with sufficiently many $-characters). This means that the alphabet of T^k is Σ^{2^k}, i.e., all 2^k-tuples over Σ.

Example 28. A_2 = [4, 1, 3, 2] can be regarded as the suffix array of a string T² of length 4, whose four characters are the four 4-tuples of T's characters.

This way, on level k we only store B_k and ψ_k. Only on the last level h do we store A_h explicitly. We choose h = log log_σ n, so that n_h = n/2^h = n/log_σ n, and the space for storing A_h is O(n_h log n_h) = O(n_h log n) = O((n/log_σ n) · log n) = O(n log σ) bits.

However, storing B_k and ψ_k on all h levels would take too much space. Instead, we use only a constant number 1/ε of levels, namely the levels 0, hε, 2hε, ..., h (for a constant 0 < ε ≤ 1).

Example 29.

(Figure: n = 16, h = 4, ε = 1/2; only the levels 0, 2 and 4 are stored, i.e., B_0, ψ_0, B_2, ψ_2, and A_4 = [1].)

Hence, on a stored level k, the bit-vector B_k has a 1 at position i iff A_k[i] is a multiple of 2^{hε} (equivalently, iff the corresponding text position A_k[i] · 2^k is a multiple of 2^{hε+k}). Given all this, we have the following algorithm to compute A[i], to be invoked with lookup(i, 0):

Algorithm 4: function lookup(i, k)
    if k = h then return A_h[i];
    if i = ω_k then return n_k;
    if B_k[i] = 1 then
        return 2^{hε} · lookup(rank_1(B_k, i), k + hε);
    else
        return lookup(ψ_k(rank_0(B_k, i)), k) - 1;

Here, ω_k stores the position of the last suffix, i.e., A_k[ω_k] = n_k. Checking whether i = ω_k is necessary in order to avoid following ψ_k from the last suffix to the first, because this would give incorrect results.

Example 30.

    A[15] = lookup(15, 0) = lookup(ψ_0(11), 0) - 1 = lookup(6, 0) - 1
          = 2² · lookup(3, 2) - 1
          = 2² · (lookup(ψ_2(2), 2) - 1) - 1
          = 2² · (lookup(1, 2) - 1) - 1
          = 2² · (4 - 1) - 1 = 11

To analyze the running time of the lookup procedure, we first note that on every stored level k, we need to follow ψ_k at most 2^{hε} times until we hit a position i with B_k[i] = 1 (second case of the last if-statement). Because the number of implemented levels, 1/ε, is constant (remember that ε is constant!), the total time of the lookup procedure is

    O(2^{hε}) = O((2^{log log_σ n})^ε) = O(log^ε_σ n),

which is sub-logarithmic for ε < 1.


More information

Online Sorted Range Reporting and Approximating the Mode

Online Sorted Range Reporting and Approximating the Mode Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted

More information

Optimal-Time Text Indexing in BWT-runs Bounded Space

Optimal-Time Text Indexing in BWT-runs Bounded Space Optimal-Time Text Indexing in BWT-runs Bounded Space Travis Gagie Gonzalo Navarro Nicola Prezza Abstract Indexing highly repetitive texts such as genomic databases, software repositories and versioned

More information

An O(N) Semi-Predictive Universal Encoder via the BWT

An O(N) Semi-Predictive Universal Encoder via the BWT An O(N) Semi-Predictive Universal Encoder via the BWT Dror Baron and Yoram Bresler Abstract We provide an O(N) algorithm for a non-sequential semi-predictive encoder whose pointwise redundancy with respect

More information

arxiv: v2 [cs.ds] 8 Apr 2016

arxiv: v2 [cs.ds] 8 Apr 2016 Optimal Dynamic Strings Paweł Gawrychowski 1, Adam Karczmarz 1, Tomasz Kociumaka 1, Jakub Łącki 2, and Piotr Sankowski 1 1 Institute of Informatics, University of Warsaw, Poland [gawry,a.karczmarz,kociumaka,sank]@mimuw.edu.pl

More information

Algorithms for pattern involvement in permutations

Algorithms for pattern involvement in permutations Algorithms for pattern involvement in permutations M. H. Albert Department of Computer Science R. E. L. Aldred Department of Mathematics and Statistics M. D. Atkinson Department of Computer Science D.

More information

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account: cd ~cs9319/papers Original readings of each lecture will be placed there. 2 Course

More information

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b) Introduction to Algorithms October 14, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Srini Devadas and Constantinos (Costis) Daskalakis Quiz 1 Solutions Quiz 1 Solutions Problem

More information

A Space-Efficient Frameworks for Top-k String Retrieval

A Space-Efficient Frameworks for Top-k String Retrieval A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott

More information

COS597D: Information Theory in Computer Science October 19, Lecture 10

COS597D: Information Theory in Computer Science October 19, Lecture 10 COS597D: Information Theory in Computer Science October 9, 20 Lecture 0 Lecturer: Mark Braverman Scribe: Andrej Risteski Kolmogorov Complexity In the previous lectures, we became acquainted with the concept

More information

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits Chris Calabro January 13, 2016 1 RAM model There are many possible, roughly equivalent RAM models. Below we will define one in the fashion

More information

Succincter text indexing with wildcards

Succincter text indexing with wildcards University of British Columbia CPM 2011 June 27, 2011 Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd 4.8 Huffman Codes These lecture slides are supplied by Mathijs de Weerd Data Compression Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

Introduction to Turing Machines. Reading: Chapters 8 & 9

Introduction to Turing Machines. Reading: Chapters 8 & 9 Introduction to Turing Machines Reading: Chapters 8 & 9 1 Turing Machines (TM) Generalize the class of CFLs: Recursively Enumerable Languages Recursive Languages Context-Free Languages Regular Languages

More information

Text Indexing: Lecture 6

Text Indexing: Lecture 6 Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 2: Text Compression Lecture 7: Burrows Wheeler Compression Juha Kärkkäinen 21.11.2017 1 / 16 Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is a transformation

More information

Optimal lower bounds for rank and select indexes

Optimal lower bounds for rank and select indexes Optimal lower bounds for rank and select indexes Alexander Golynski David R. Cheriton School of Computer Science, University of Waterloo agolynski@cs.uwaterloo.ca Technical report CS-2006-03, Version:

More information

Internal Pattern Matching Queries in a Text and Applications

Internal Pattern Matching Queries in a Text and Applications Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 12.1 Introduction Today we re going to do a couple more examples of dynamic programming. While

More information

Proof Techniques (Review of Math 271)

Proof Techniques (Review of Math 271) Chapter 2 Proof Techniques (Review of Math 271) 2.1 Overview This chapter reviews proof techniques that were probably introduced in Math 271 and that may also have been used in a different way in Phil

More information

CSE 202 Homework 4 Matthias Springer, A

CSE 202 Homework 4 Matthias Springer, A CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and

More information

SORTING SUFFIXES OF TWO-PATTERN STRINGS.

SORTING SUFFIXES OF TWO-PATTERN STRINGS. International Journal of Foundations of Computer Science c World Scientific Publishing Company SORTING SUFFIXES OF TWO-PATTERN STRINGS. FRANTISEK FRANEK and WILLIAM F. SMYTH Algorithms Research Group,

More information

Define M to be a binary n by m matrix such that:

Define M to be a binary n by m matrix such that: The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]

More information

On Compressing and Indexing Repetitive Sequences

On Compressing and Indexing Repetitive Sequences On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv

More information

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Notes on Logarithmic Lower Bounds in the Cell Probe Model Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)

More information

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12 Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

Compressed Representations of Sequences and Full-Text Indexes

Compressed Representations of Sequences and Full-Text Indexes Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte

More information

arxiv: v1 [cs.ds] 19 Apr 2011

arxiv: v1 [cs.ds] 19 Apr 2011 Fixed Block Compression Boosting in FM-Indexes Juha Kärkkäinen 1 and Simon J. Puglisi 2 1 Department of Computer Science, University of Helsinki, Finland juha.karkkainen@cs.helsinki.fi 2 Department of

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200

Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Fast String Kernels Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Alex.Smola@anu.edu.au joint work with S.V.N. Vishwanathan Slides (soon) available

More information

On improving matchings in trees, via bounded-length augmentations 1

On improving matchings in trees, via bounded-length augmentations 1 On improving matchings in trees, via bounded-length augmentations 1 Julien Bensmail a, Valentin Garnero a, Nicolas Nisse a a Université Côte d Azur, CNRS, Inria, I3S, France Abstract Due to a classical

More information

Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076

More information

Rank and Select Operations on Binary Strings (1974; Elias)

Rank and Select Operations on Binary Strings (1974; Elias) Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo

More information

Lecture 6 September 21, 2016

Lecture 6 September 21, 2016 ICS 643: Advanced Parallel Algorithms Fall 2016 Lecture 6 September 21, 2016 Prof. Nodari Sitchinava Scribe: Tiffany Eulalio 1 Overview In the last lecture, we wrote a non-recursive summation program and

More information

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS Manolis Christodoulakis 1, Costas S. Iliopoulos 1, Yoan José Pinzón Ardila 2 1 King s College London, Department of Computer Science London WC2R

More information

State of the art Image Compression Techniques

State of the art Image Compression Techniques Chapter 4 State of the art Image Compression Techniques In this thesis we focus mainly on the adaption of state of the art wavelet based image compression techniques to programmable hardware. Thus, an

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

Compact Indexes for Flexible Top-k Retrieval

Compact Indexes for Flexible Top-k Retrieval Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

CSci 311, Models of Computation Chapter 4 Properties of Regular Languages

CSci 311, Models of Computation Chapter 4 Properties of Regular Languages CSci 311, Models of Computation Chapter 4 Properties of Regular Languages H. Conrad Cunningham 29 December 2015 Contents Introduction................................. 1 4.1 Closure Properties of Regular

More information

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes Information Theory with Applications, Math6397 Lecture Notes from September 3, 24 taken by Ilknur Telkes Last Time Kraft inequality (sep.or) prefix code Shannon Fano code Bound for average code-word length

More information

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata CISC 4090: Theory of Computation Chapter Regular Languages Xiaolan Zhang, adapted from slides by Prof. Werschulz Section.: Finite Automata Fordham University Department of Computer and Information Sciences

More information

arxiv: v1 [cs.ds] 25 Nov 2009

arxiv: v1 [cs.ds] 25 Nov 2009 Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,

More information

arxiv: v1 [cs.ds] 8 Sep 2018

arxiv: v1 [cs.ds] 8 Sep 2018 Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology

More information

Complementary Contextual Models with FM-index for DNA Compression

Complementary Contextual Models with FM-index for DNA Compression 2017 Data Compression Conference Complementary Contextual Models with FM-index for DNA Compression Wenjing Fan,WenruiDai,YongLi, and Hongkai Xiong Department of Electronic Engineering Department of Biomedical

More information

1 Definition of a Turing machine

1 Definition of a Turing machine Introduction to Algorithms Notes on Turing Machines CS 4820, Spring 2017 April 10 24, 2017 1 Definition of a Turing machine Turing machines are an abstract model of computation. They provide a precise,

More information

Space-Efficient Construction Algorithm for Circular Suffix Tree

Space-Efficient Construction Algorithm for Circular Suffix Tree Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes

More information

Covering Linear Orders with Posets

Covering Linear Orders with Posets Covering Linear Orders with Posets Proceso L. Fernandez, Lenwood S. Heath, Naren Ramakrishnan, and John Paul C. Vergara Department of Information Systems and Computer Science, Ateneo de Manila University,

More information

Automata and Computability. Solutions to Exercises

Automata and Computability. Solutions to Exercises Automata and Computability Solutions to Exercises Fall 28 Alexis Maciel Department of Computer Science Clarkson University Copyright c 28 Alexis Maciel ii Contents Preface vii Introduction 2 Finite Automata

More information

1 Basic Definitions. 2 Proof By Contradiction. 3 Exchange Argument

1 Basic Definitions. 2 Proof By Contradiction. 3 Exchange Argument 1 Basic Definitions A Problem is a relation from input to acceptable output. For example, INPUT: A list of integers x 1,..., x n OUTPUT: One of the three smallest numbers in the list An algorithm A solves

More information

arxiv: v1 [cs.ds] 22 Nov 2012

arxiv: v1 [cs.ds] 22 Nov 2012 Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:

More information

Slides for CIS 675. Huffman Encoding, 1. Huffman Encoding, 2. Huffman Encoding, 3. Encoding 1. DPV Chapter 5, Part 2. Encoding 2

Slides for CIS 675. Huffman Encoding, 1. Huffman Encoding, 2. Huffman Encoding, 3. Encoding 1. DPV Chapter 5, Part 2. Encoding 2 Huffman Encoding, 1 EECS Slides for CIS 675 DPV Chapter 5, Part 2 Jim Royer October 13, 2009 A toy example: Suppose our alphabet is { A, B, C, D }. Suppose T is a text of 130 million characters. What is

More information

Suffix Array of Alignment: A Practical Index for Similar Data

Suffix Array of Alignment: A Practical Index for Similar Data Suffix Array of Alignment: A Practical Index for Similar Data Joong Chae Na 1, Heejin Park 2, Sunho Lee 3, Minsung Hong 3, Thierry Lecroq 4, Laurent Mouchard 4, and Kunsoo Park 3, 1 Department of Computer

More information

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem

More information

Dynamic Programming: Shortest Paths and DFA to Reg Exps

Dynamic Programming: Shortest Paths and DFA to Reg Exps CS 374: Algorithms & Models of Computation, Spring 207 Dynamic Programming: Shortest Paths and DFA to Reg Exps Lecture 8 March 28, 207 Chandra Chekuri (UIUC) CS374 Spring 207 / 56 Part I Shortest Paths

More information

Automata and Computability. Solutions to Exercises

Automata and Computability. Solutions to Exercises Automata and Computability Solutions to Exercises Spring 27 Alexis Maciel Department of Computer Science Clarkson University Copyright c 27 Alexis Maciel ii Contents Preface vii Introduction 2 Finite Automata

More information

Optimal Color Range Reporting in One Dimension

Optimal Color Range Reporting in One Dimension Optimal Color Range Reporting in One Dimension Yakov Nekrich 1 and Jeffrey Scott Vitter 1 The University of Kansas. yakov.nekrich@googlemail.com, jsv@ku.edu Abstract. Color (or categorical) range reporting

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21, Sequene Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fisher) January 21, 201511 9 Suffix Trees and Suffix Arrays This leture is based on the following soures, whih are all reommended

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information