
Optimal Parallel Suffix Tree Construction*

Ramesh Hariharan†

April 1, 1997

* Work supported by NSF grants CCR , CCR and CCR .
† Max-Planck-Institut für Informatik, Im Stadtwald, Saarbrücken, Germany. Email: ramesh@mpi-sb.mpg.de. Work done when the author was a student at the Courant Institute of Mathematical Sciences, New York University.

Abstract

An O(n)-work, O(n)-space, O(log^4 n)-time CREW-PRAM algorithm for constructing the suffix tree of a string s of length n drawn from any fixed alphabet set is obtained. This is the first known work and space optimal parallel algorithm for this problem. It can be generalized to a string s drawn from any general alphabet set to perform in O(log^4 n) time and O(n log |Σ|) work and space, after the characters in s have been sorted alphabetically, where |Σ| is the number of distinct characters in s. In this case too, the algorithm is work-optimal.

Keywords. Suffix tree, strings, pattern matching, string periodicity.

1 Introduction

A suffix tree of a string s is a compacted trie of all suffixes of s. It is a powerful data structure which finds applications in many string processing algorithms. Some examples are string matching, finding all squares or repetitions in a string [AP83], computing substring statistics [AP85], approximate string matching [LV86], text compression [RPE81], analyzing genetic sequences [CHM86], etc. A very powerful feature of the suffix tree is that after construction and suitable preprocessing, the longest common prefix of any two substrings of s can be found in constant time; further, the preprocessing required for this step can be performed optimally in logarithmic time.

The first sequential algorithm for constructing the suffix tree was obtained by Weiner [W73]. This algorithm takes O(n log |Σ|) time, where |Σ| is the number of distinct characters in s and n = |s|. McCreight [M76] gave a more efficient construction with the same asymptotic time bound. For strings from a fixed alphabet, both algorithms take linear, i.e., O(n), time.

The first parallel algorithm for this problem was due to Landau and Vishkin [LV86]; it runs in O(log n) time and does O(n^2) work. Apostolico et al. [AILSV86] used the technique of "naming" [KMR72] to obtain an algorithm which takes O(log n) time and does O(n log n) work in the arbitrary CRCW-PRAM model. This algorithm requires superlinear, i.e., O(n^2), space, which can be reduced to O(n^(1+ε)) at the expense of an O(1/ε) factor in the time bound. A variant of this algorithm runs in O(log^2 n) time, does O(n log^2 n) work and takes linear space in the CREW-PRAM model.

Note that constructing the suffix tree of s implicitly involves ordering the characters in s by alphabet; therefore, no parallel algorithm can have a better work bound than the algorithm in [AILSV86] when the characters in s are drawn from a general and potentially infinite alphabet. However, the algorithm of Apostolico et al. [AILSV86] runs in the above-mentioned time, space and work bounds even for a binary alphabet. The existence of a linear work algorithm for the case when the characters in s are drawn from a fixed alphabet has been an important open problem in string processing.

The Main Result. We give the first work-optimal parallel algorithm to construct the suffix tree of a binary string s. The algorithm does O(n) work and takes O(log^4 n) time and O(n) space in the CREW-PRAM model. This algorithm can be generalized to handle strings s drawn from any fixed alphabet set to perform in the same bounds. For strings s drawn from a general alphabet set, the algorithm can be generalized to perform in O(log^4 n) time, O(n log |Σ|) work, and O(n log |Σ|) space after the characters in s have been sorted by alphabet, where |Σ| is the number of distinct characters in s. Since constructing the suffix tree of s implicitly involves ordering the characters in s by alphabet, no parallel algorithm can have a better work bound.

An Application. As an application, we obtain the first optimal work algorithm for multi-pattern matching, i.e., matching a set of patterns of collective size M against a text of size N, where the patterns and text are drawn over some fixed alphabet set.
This algorithm takes O(log^4 M_l) time, O(M + N) work and O(M + N) space in the CREW-PRAM model, where M_l is the length of the longest pattern. Previously, the algorithm with the smallest work bound was due to Muthukrishnan and Palem [MP93]; this algorithm took O(log M_l) time, O(N sqrt(log M_l) + M) work and O(M^(1+ε) + N) space in the arbitrary CRCW-PRAM model.

Other Recent Developments. Independent of this work, Sahinalp and Vishkin [SV93] obtained an algorithm for suffix tree construction which takes O(log^2 n) time, O(n log log n) work and O(n^(1+ε)) space in the arbitrary CRCW-PRAM model. A randomized version of their algorithm takes O(log^2 n) time, O(n log n) work and O(n) space. Both algorithms work for strings drawn from an alphabet of size polynomial in n and use the "naming" technique. They also modify this algorithm for the case of a fixed alphabet to run in O(log^2 n) time, O(n) work and O(n^(1+ε)) space in the arbitrary CRCW-PRAM model. Note that this algorithm is time-wise better than ours but is not space optimal and requires a stronger model; further, the techniques used by them (symmetry breaking and naming) are very different from the combinatorial techniques we use. More recently, Farach and Muthukrishnan [FM93] have obtained a randomized algorithm which takes O(log n) time, O(n) work and O(n) space in the arbitrary CRCW-PRAM model for strings drawn from a constant sized alphabet.

Main Techniques Used. The algorithm of Apostolico et al. [AILSV86] for suffix tree construction uses some interesting algorithmic techniques but does not use any of the properties intrinsic to strings. Our approach is quite different and exploits combinatorial and periodicity properties of strings along with the repetitive nature of the suffix tree. It is interesting to note that many parallel algorithms use superlinear space in order to exploit the fixed (or restricted) alphabet assumption [MP93, BD+91, KLP89] in order to reduce the work bound; in contrast, we exploit the fixed alphabet assumption to obtain an optimal work algorithm using just linear space. The following are two of the essential ingredients of the algorithm which may also be of independent interest.

Our scheme hinges on an upper bound we show for the following combinatorial question. If i_x is the number of times substring x occurs in s, then what is the sum of min{i_1x, i_0x}, over all binary strings x of length r, as a function of n = |s| and r? Here 0x just denotes the string x preceded by the symbol 0, and similarly for 1x. Note that this sum is closely related to the number of nodes in the suffix tree of s which have two suffix links incident upon them. We show that this sum is upper bounded by (n log n)/r.

We give a concurrent version of McCreight's sequential algorithm for suffix tree construction which constructs a single data structure in which the suffix trees of the binary strings s_1, ..., s_k are merged. This algorithm runs in O(max_i{|s_i|} log k) time, does O(sum_i |s_i|) work, and takes O(sum_i |s_i|) space. In this algorithm, one processor is assigned to each string s_i; each processor performs McCreight's sequential algorithm on its respective string. All processors work on the same data structure. This leads to two problems, the more critical of which is the fact that a processor which inserts a node x into this data structure might have to wait for the suffix link of x's parent to be in place before proceeding.
Indeed, long paths may appear in the data structure, with each node in this path waiting for the suffix link of its parent to be in place. We give a non-trivial amortization argument to show that the total work done is indeed linear, in spite of processors waiting for suffix links to be in place.

Algorithm Overview. The suffix tree construction algorithm uses the following scheme. Compacted tries are built for substrings of s of successively increasing lengths in O(log n) stages; the final trie will be the suffix tree. In the first stage, the substrings have size r, where r is polylogarithmic in n. In the subsequent stages, they have lengths 2r, 2^2 r, 2^3 r, ....

The first stage provides us with the first challenge in solving this problem as it is not clear how even this stage can be accomplished in O(n) work. We use the concurrent version of McCreight's algorithm described above. s is split into pieces of length 2r, each pair of adjacent pieces overlapping in exactly r characters. One processor is assigned to each piece. Each processor performs McCreight's sequential algorithm to insert all substrings of length r in its piece into a compacted trie. All processors insert substrings into the same data structure. This takes O(r log n) time and O(n) work.

Each subsequent stage can be performed easily by sorting up to n items in each stage. However, this leads to an O(n log^2 n) work algorithm. The main challenge now is to get around the problem of sorting up to n items in each stage. In order to avoid performing this computation, we use the following scheme. Each stage proceeds in two steps.
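The doubling schedule of stages described in the overview can be sketched as follows. This is only an illustration: the function name `stage_lengths` is hypothetical, and the stopping rule (the first length that reaches n + 1, the length of s followed by an end marker) is taken from the goal stated in Section 2.

```python
def stage_lengths(n, r):
    """Substring lengths r, 2r, 4r, ... handled in successive stages.

    Stops at the first length 2^i * r that is at least n + 1, so that the
    trie built in the last stage is the full suffix tree.
    (Hypothetical helper, not part of the paper.)
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    lengths = [r]
    while lengths[-1] < n + 1:
        lengths.append(2 * lengths[-1])
    return lengths
```

For n = 1000 and r = 100, for instance, this yields the lengths 100, 200, 400, 800, 1600, so the number of stages grows only logarithmically in n/r.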

Recall that in a given stage, all substrings of a particular length are processed. In the first step, the compacted trie is computed for a carefully chosen subset of these substrings. Using the combinatorial property mentioned above and exploiting periodicity properties, we show that the number of such substrings which are not periodic with small periods is O(n / log^2 n), and that those substrings which are periodic with small periods can be grouped into O(n / log^2 n) families. We can then "sort" these chosen substrings using Cole's parallel merge sort algorithm [C88] so that the total work done in sorting is O(n / log n) per stage, which is O(n) over all stages.

In the second step, by exploiting the repetitive structure of the trie, we obtain the compacted trie for all substrings using the compacted trie for the substrings chosen above. These chosen substrings enable a graph defined on the leaves of the trie obtained in the previous stage to be partitioned into trees; this graph partitioning is critical to the efficient performance of the second step.

This paper is organized as follows. Section 2 gives the requisite definitions and describes some preliminary procedures, Section 3 describes a key combinatorial property of strings, Section 4 describes how the first stage is performed, and Sections 5, 6, 7, 8, and 9 describe the remaining stages.

2 Definitions and Preliminaries

The compacted trie of a set of strings s_1, s_2, ..., s_k, none of which is a prefix of another string in the set, is a tree T defined as follows. T has a leaf lf(s_i) for each s_i, 1 ≤ i ≤ k. Each internal node in T has at least two children. Associated with each edge e in T is a string. The strings associated with the edges on the path from the root to lf(s_i) yield the string s_i when concatenated in path order.
Further, for any two leaves lf(s_i) and lf(s_j) with least common ancestor x, the strings associated with the edges on the path from the root to x, when concatenated in path order, give the longest common prefix of s_i and s_j. It follows that if e and e' are edges leading from a node x to any two of its children, the strings associated with e and e' begin with distinct characters. In addition, the edges leading from x to its children are ordered from left to right in increasing lexicographic order of the associated strings. Fig. 1 shows the compacted trie of strings 1011; 0100; 1101; 1110;

The above definition of compacted tries is generalized to the case when some of the strings s_1, ..., s_k are identical by having one leaf per distinct string rather than one leaf per string. All identical strings are associated with the same leaf in T.

Let s be a binary string of length n. The suffix tree ST of s is defined as follows. Let p be the string s$, where $ is a new symbol. Then, clearly, no suffix of p is a prefix of another suffix of p. ST is a compacted trie of all suffixes of p except the suffix '$'. Fig. 1 shows the suffix tree of the string

The r-suffix tree, r-ST, of s, r ≥ 1, is defined to be the compacted trie of all substrings of p of length r and all suffixes of p of length at most r, excluding the suffix '$'. Clearly, for any r > n, ST = r-ST. In any r-ST, let str(x) denote the string obtained by concatenating the strings associated with the edges on the path from the root to the node x in r-ST.

Figure 1: Tries and Suffix Trees (panels: a trie; a suffix tree).

An r-ST will be represented as follows. The nodes will be stored in contiguous array locations. Each node x of the tree has pointers to its children (which are at most 3 in number), a parent pointer par(x), a suffix link pointer suf(x), and a field len(x) which equals |str(x)|. If x is an internal node then the suffix link pointer suf(x) points to a node y such that |str(y)| = |str(x)| - 1 and str(y) is a suffix of str(x). For every internal node x in any r-ST, such a node y is guaranteed to exist. For the root root, suf(root) = par(root) = root and len(root) = 0. The string p[i...j] associated with an edge e is denoted by substr(e) and is represented by the pair of indices i, j.

Let the characters in p be indexed from 1 to n + 1. Two strings u and v are said to be consecutive substrings of p if for some index j there is an occurrence of u in p beginning at index j and an occurrence of v in p beginning at index j + 1. A subtree of some r-ST T induced by a subset L of its leaves is the tree obtained by removing all subtrees of T containing only leaves outside L and then compacting paths of degree two nodes in the resulting tree into a single edge.

We impose the order blank < 0 < 1 < $ on the alphabet, where blank stands for the empty symbol. We say that u < v for strings u, v if string u is lexicographically less than v. We say that l_1 < l_2 for nodes l_1, l_2 of some r-ST if str(l_1) < str(l_2). We assume the CREW-PRAM model of computation in the rest of this paper. The following primitives will be useful.

Comparing Leaves. The following primitive is due to Schieber and Vishkin [SV88]. A tree with n nodes can be preprocessed in O(log n) time and O(n) work following which the relative order of any two leaves can be determined in constant time and work.

Computing Induced Tries. Given an ordered set S of substrings of p and an oracle which gives the length of the longest common prefix of u, v for any two consecutive strings u, v ∈ S, there is a scheme for constructing the compacted trie for the strings in S in O(log |s|) time and O(|S|) work [FM93]. This trie is called the trie induced by the strings in S.
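To make the notion of an induced trie concrete, here is a minimal sequential sketch, not the parallel scheme of [FM93]. It assumes the input strings are distinct, sorted, and that none is a prefix of another. The names `Node`, `lcp`, `induced_trie` and `leaves` are hypothetical, and `lcp` stands in for the oracle by comparing characters directly.

```python
def lcp(u, v):
    """Length of the longest common prefix of u and v (stand-in for the oracle)."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


class Node:
    def __init__(self, depth, label=None):
        self.depth = depth    # len(x) = |str(x)|
        self.label = label    # the string itself for a leaf, None for an internal node
        self.children = []    # kept in left-to-right (lexicographic) order


def induced_trie(sorted_strings):
    """Compacted trie of distinct, prefix-free strings given in sorted order."""
    root = Node(0)
    stack = [root]            # nodes on the current rightmost root-to-leaf path
    prev = None
    for s in sorted_strings:
        h = lcp(prev, s) if prev is not None else 0
        last = None
        while stack[-1].depth > h:
            last = stack.pop()
        if stack[-1].depth < h:
            # Break the edge above `last` at depth h with a new internal node.
            mid = Node(h)
            stack[-1].children[-1] = mid   # `last` was the rightmost child
            mid.children.append(last)
            stack.append(mid)
        leaf = Node(len(s), s)
        stack[-1].children.append(leaf)
        stack.append(leaf)
        prev = s
    return root


def leaves(node):
    """Leaf labels in left-to-right order."""
    if node.label is not None:
        return [node.label]
    return [x for child in node.children for x in leaves(child)]
```

Each string is pushed once and each node is popped at most once, so apart from the oracle calls the sketch does work linear in the number of strings, which mirrors the O(|S|) work bound quoted above.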

Figure 2: A Periodic String.

We also need the concept of periodicities in strings. x is said to be a period of string u if for all j, 1 ≤ j ≤ |u| - x, u[j] = u[j + x]. u is said to be periodic if its smallest period x is at most |u|/2; in fact, u is said to be periodic with period x (see Fig. 2). A string u is said to be primitive if for no u' and k > 1 is u = (u')^k. Lemma 2.1 is classic and follows from [LS62].

Lemma 2.1 Suppose string u is periodic. Then there exists a primitive string v and a prefix w of v such that u = v^k w, for some k ≥ 2. Further, x ≤ |u| - |v| is a period of u if and only if x is a multiple of |v|. In addition, the smallest period of the string u' = ua, where a ≠ u[|u| - |v| + 1], is at least |u| - |v| + 1.

Our goal is to construct the suffix tree ST of s in polylogarithmic time and linear work, using just a linear amount of space. We do so by constructing the sequence of trees r-ST, 2r-ST, 2^2 r-ST, ..., 2^i r-ST, where i is the smallest number such that 2^i r ≥ n + 1 and r, to be fixed later, is polylogarithmic in n. We assume that n ≥ 8.

3 A Basic Property

In this section, we describe a basic combinatorial property of strings which will be critical to the performance of the algorithm. Consider any binary string v.

Definitions. For a binary string u of length less than |v|, let i_u denote the number of occurrences of u in v. For 1 ≤ r ≤ |v| - 1, let n_r = sum_u min{i_0u, i_1u}, where the summation is over all binary strings u of length r. sum_{r=1}^{|v|-1} n_r is called the bifurcation number of v.

Lemma 3.1 n_r ≤ n_{r-1}, for 1 ≤ r ≤ |v| - 1.

Proof. Let z be a binary string of length r - 1. We show below that min{i_0z0, i_1z0} + min{i_0z1, i_1z1} ≤ min{i_0z, i_1z}. Since n_r = sum_z [min{i_0z0, i_1z0} + min{i_0z1, i_1z1}] and n_{r-1} = sum_z min{i_0z, i_1z}, where the summations are over all binary strings z of length r - 1, the lemma follows. Note that i_0z0 + i_0z1 ≤ i_0z because if either 0z0 or 0z1 occurs starting at v[i], then 0z occurs starting at v[i]. Similarly, i_1z0 + i_1z1 ≤ i_1z. Therefore, min{i_0z0, i_1z0} + min{i_0z1, i_1z1} ≤ min{i_0z0 + i_0z1, i_1z0 + i_1z1} ≤ min{i_0z, i_1z}, as claimed. □

Lemma 3.2 The bifurcation number sum_{r=1}^{|v|-1} n_r of v is at most |v| log |v|.
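The quantities n_r and the bifurcation number can be computed by brute force on small strings, which is a convenient way to check the statements of Lemmas 3.1 and 3.2 numerically. The sketch below is only an illustration: the function names are hypothetical, and the logarithm in the bound is assumed to be base 2.

```python
import math
from collections import Counter


def occurrence_counts(v, length):
    """Map each substring of v of the given length to its number of occurrences."""
    return Counter(v[i:i + length] for i in range(len(v) - length + 1))


def n_r(v, r):
    """n_r: sum over binary strings u of length r of min(i_0u, i_1u)."""
    counts = occurrence_counts(v, r + 1)
    # Only strings u for which 0u actually occurs can contribute a non-zero term.
    return sum(min(c, counts.get("1" + x[1:], 0))
               for x, c in counts.items() if x[0] == "0")


def bifurcation_number(v):
    """Sum of n_r for r = 1, ..., |v| - 1."""
    return sum(n_r(v, r) for r in range(1, len(v)))


def lemma_3_2_bound(v):
    """The bound |v| log |v|, with the logarithm taken to base 2 (an assumption)."""
    return len(v) * math.log2(len(v))
```

For v = 0010, for example, the only non-zero term is n_1 = 1 (both 00 and 10 occur once), so the bifurcation number is 1, well below the bound of 8.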

Proof. Consider the suffix tree ST of v', the reverse of v. Each prefix of v corresponds to a distinct leaf in ST. Consider those internal nodes x with at least two edges e and e' leading to children y and y' of x, respectively, such that substr(e) begins with a 0 and substr(e') begins with a 1. Let l^0_x be the number of leaves in the subtree rooted at y and let l^1_x be the number of leaves in the subtree rooted at y'. Clearly, the bifurcation number of v is exactly the sum of min{l^1_x, l^0_x} over all such internal nodes in ST. We call min{l^1_x, l^0_x} the count at node x. Clearly, for each leaf of ST, there are at most log |v| internal nodes at which it can contribute to the count. The lemma follows. □

Corollary 3.3 n_r ≤ (|v| log |v|)/r, for 1 ≤ r ≤ |v| - 1.

Remark. Leszek Gasieniec [G93] has shown that there exist strings v for which n_r = Ω((|v| log |v|)/r), for r = Ω(log |v|). So Corollary 3.3 is tight to within constant factors for r = Ω(log |v|).

4 Optimal r-Suffix Tree Construction in O(r log n) Time

Given some r, we show how T_0, the r-ST of s, is constructed in O(r log n) time and O(n) work on a common CRCW-PRAM. Recall that the r-ST is a compacted trie of all substrings of p of length r and all suffixes of p of length at most r, except the suffix '$'. The value of r will be fixed later at some power of log n.

4.1 The Algorithm

The procedure for constructing T_0 has two steps.

Step 1. In this step, a tree T'_0 is built in O(r log n) time and O(n) work. s is partitioned into pieces of length 2r (the rightmost piece could be shorter), each pair of adjacent pieces overlapping in exactly r characters. There are O(n/r) such pieces. Let these pieces be denoted by the strings s_1, s_2, ..., s_k. Let p_i be the string s_i $_i, where $_i ≠ $_j, for any i ≠ j. T'_0 is the compacted trie of all suffixes of each of the p_i's, except the suffixes $_i. In other words, T'_0 is a single data structure containing the suffix trees of all the s_i's. To construct T'_0, one processor is assigned to each string p_i.
The processor associated with p_i inserts the suffixes of p_i, except $_i, in left to right order into T'_0 using the sequential algorithm of McCreight [M76]. Each processor takes O(r log n) time and does O(r) work. In fact, there are O(r) time-steps and each processor takes O(1) time and does O(1) work in each time-step; in addition, after each time-step, reorganizing the processors in a manner to be described takes O(log n) time and O(n/r) work. Thus, the total time taken is O(r log n) and the total work done is O(n). This construction will be described in detail shortly.
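The partition of s used in Step 1 can be sketched as follows. The function names are hypothetical, and modelling the distinct end markers $_i as tuples `("$", i)` is just one convenient assumption for the illustration.

```python
def split_into_pieces(s, r):
    """Pieces of length 2r starting at offsets 0, r, 2r, ...

    Adjacent pieces overlap in exactly r characters; the rightmost piece
    may be shorter than 2r. (Hypothetical helper, not from the paper.)
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    pieces = []
    start = 0
    while True:
        pieces.append(s[start:start + 2 * r])
        if start + 2 * r >= len(s):
            break
        start += r
    return pieces


def with_terminators(pieces):
    """Form p_i = s_i $_i, modelling the distinct end marker $_i as ("$", i)."""
    return [list(piece) + [("$", i)] for i, piece in enumerate(pieces, start=1)]
```

Because consecutive pieces start r positions apart and have length 2r, every substring of s of length r lies entirely inside at least one piece, which is why inserting the suffixes of the pieces suffices for building the r-ST.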

Step 2. T'_0 is post-processed to give T_0 in three steps, each of which takes O(log n) time and O(n) work.

Step 2.1. T'_0 is truncated as follows. All edges e in T'_0 are considered in parallel. Suppose e is between a node x and its parent y such that len(x) > r and len(y) ≤ r. Then the edge e and the subtree rooted at x are removed; in addition, if len(y) < r then a leaf x' is inserted with parent y such that str(x') is the prefix of str(x) of length r.

Step 2.2. Next, all subtrees containing only leaves x such that str(x) is a suffix of p_k', k' ≠ k, are removed. We do not do the above operation for p_k, for the following technical reason: the symbol $_k can serve as $, the last symbol of p (recall p = s$).

Step 2.3. All chains of degree 2 are contracted. T_0 is the resulting tree.

The total time taken in Steps 1 and 2 is O(r log n) and the total work done is O(n).

Step 1 Description. As stated earlier, the processor associated with p_i, 1 ≤ i ≤ k, inserts all suffixes of p_i, except $_i, in left to right order into T'_0 using the sequential algorithm of McCreight [M76].

McCreight's Algorithm. For the sake of completeness, we briefly recapitulate McCreight's sequential algorithm for constructing the suffix tree T of a string s'. Let p' = s'$. The algorithm inserts the various suffixes of p' in left to right order, i.e., in decreasing order of the lengths of the suffixes. In this process, the characters in p' are scanned from left to right in sequence and the nodes in the current suffix tree are scanned in some order. Three variables, i, j and x, are maintained at every step. i denotes the index of the rightmost character in p' which has been read. j denotes the index in p' at which the current suffix which the algorithm seeks to insert begins. x denotes the current node in T being scanned by the algorithm. This algorithm alternates between two phases, the scanning phase and the rescanning phase. Initially the algorithm starts in the scanning phase with i = 1, j = 1, and x being the root.
We describe a snapshot of the algorithm starting with a scanning phase and ending just before the next scanning phase. In a scanning phase, the characters p'[i], p'[i + 1], ... are compared with the characters in the strings associated with the edges along the appropriate path starting at x until a mismatch occurs. Suppose a mismatch occurs when character p'[i'] is compared with a character in the substring associated with the edge e between node y and one of its children y'. Then the edge e is broken at the appropriate point and a node z is inserted as a child of y and parent of y'. A new leaf z' is inserted as a child of z; z' will be the leaf corresponding to the suffix j.

A rescanning phase begins now. Suffixes j + 1, j + 2, ... are inserted one by one in this rescanning phase. Suffix j + g is inserted as follows after suffix j + g - 1 has been inserted, where g ≥ 1. Suppose edge e_1 between node w and one of its children was broken in order to insert the suffix j + g - 1; let f_1 be the node which was used to break edge e_1. Then the edge e'_1 which has to be broken in order to insert suffix j + g is found by traversing the appropriate path starting at node suf(w) until a node w' is reached such that len(w') ≥ len(z) - g. If len(w') > len(z) - g then e'_1 is the edge between w' and its parent; suf(f_1) is then set to f'_1, the node which is used to break e'_1; further, the rescanning phase then continues with suffix j + g + 1. If len(w') = len(z) - g then the current rescanning phase comes to an end (without having inserted suffix j + g yet) and the next scanning phase begins with i = i', j = j + g and x = w'; further, suf(f_1) is set to w'.

An important fact to note is that in the rescanning phase, traversing an edge e takes constant time, while in the scanning phase, characters in the strings associated with e are compared in sequence and this takes time proportional to the number of comparisons. McCreight showed that the total time spent in the above rescanning phase is O(h + i' - i), where h is the number of suffixes inserted in the rescanning phase; the entire algorithm can then be easily seen to take O(|s'|) time.

Problems Encountered. Note that since all the O(n/r) processors which run McCreight's algorithm on the strings p_1, ..., p_k work on a common data structure, these processors interfere with each other. The following two problems are encountered.

1. A number of processors may simultaneously attempt to break an edge e between node x and its child y and insert different nodes between these two nodes.

2. When a node y is inserted in T'_0, the suffix link suf(x) of the current parent node x of y may not have been determined. The analysis of McCreight's algorithm requires that this link be in place when y is inserted.

These problems are solved as follows.

Solution to Problem 2. After a node y is inserted as a child of x, the processor which inserted y waits at x until the suffix link of x is set before proceeding further. It needs to be shown that the total time spent by a processor waiting at various nodes is bounded by O(r). This analysis will be described in Section 4.2.

Definition. We define the current string length of the processor associated with p_l, 1 ≤ l ≤ k, at any instant as follows.
If i is the index of the rightmost character in p_l scanned till that instant and j is the index at which the last suffix of p_l inserted begins then the current string length of the above processor is i - j. The following fact holds for McCreight's algorithm.

Fact 1 The current string length of the processor associated with p_l equals |str(x)| + 1 immediately before it inserts a node x.

Solution to Problem 1. The run of the algorithm is divided into time-steps. As we will show in Section 4.2, there are O(r) time-steps in all. At each time-step, each processor executes one step of McCreight's algorithm in O(1) time. Following this the processors are organized into ordered lists in a manner to be described; this will take O(log n) time and O(n/r) work per time-step. Thus, the total time taken is O(r log n) and the total work done is O(n).

At each time-step, two ordered lists are maintained for each node x in T'_0. The first list, l_1, contains those processors which either are at node x in the rescanning mode, or are in the scanning mode comparing characters in the strings associated with one of the edges leading down from x. The second list, l_2, contains processors which are waiting at node x, i.e., waiting for the suffix link of node x to be set (see solution to Problem 2). If suf(x) is defined then the list l_2 at x is empty. Each list is ordered by the current string lengths of the processors it contains. As we shall show, the only operations needed to update these lists at each time-step are to partition each list into contiguous sublists, divide each list into a constant number of non-contiguous sublists and to merge a constant number of lists into a single list. These operations can be accomplished in O(log n) time and O(n/r) work per time-step.

Consider a node x. At each time-step the following operations are performed, in addition to those that are routinely performed in McCreight's algorithm.

1. The list of processors in l_1 at node x is split into two ordered lists, one containing processors which seek to go down the edge whose associated substring begins with a 0 and the other containing processors which seek to go down the edge whose associated substring begins with a 1. Let L_0 be the former list and L_1 be the latter list. We concentrate our description on L_0. L_1 is processed similarly.

2. All processors in L_0 read in the pointer to the appropriate child y of x. Those processors in L_0 which are in the scanning phase compare their next character; those which are in the rescanning phase check whether the nodes they seek to insert are to be between x and y or not. Following this, an ordered sublist L'_0 of L_0 comprising those processors which seek to break the edge e between x and y is obtained. Since processors in l_1 were ordered by their current string lengths, by Fact 1, the processors in L'_0 appear in top to bottom order of the point at which they seek to break e.
This also implies that all processors which seek to break e at the same point occur consecutively in L'_0.

3. L'_0 is divided into sublists; each processor in a sublist seeks to break edge e at the same point. Let x_1, ..., x_h be the distinct nodes, in order from top to bottom, which processors in L'_0 seek to insert between x and y. Each sublist is further reorganized into groups of maximal size such that all processors in the same group have identical characters in the last scanned positions in their respective strings. Note that except for at most two groups (corresponding to last scanned characters 0, 1, respectively) in each sublist, all other groups are singleton groups (because the $_i's are all mutually distinct). Consider a particular x_i. The first processor in the sublist associated with x_i creates the node x_i and makes it a child of x_{i-1} if i > 1, and of x otherwise. In addition, if i = h, this processor makes y a child of x_h. Next, it sets suf(z_i), suf(z'_i) to x_i, where z_i, z'_i were the last nodes inserted by the processors in the sublist of L'_0 associated with x_i (note that there are at most two such nodes). The first processor in each group in the sublist associated with x_i creates a new child of x_i. For the purpose of analysis, each such processor is said to have inserted x_i while all other processors in the above sublist are said to have sought to insert x_i.

4. For each x_i and newly created child v of x_i, let P denote the group of processors such that the first processor in P creates v. The first processor in P either continues its rescanning phase or starts a new rescanning phase. Consider a processor which is in P but is not the first processor in P. If it is in the rescanning phase, then it begins a new scanning phase from x_i at the next time-step. (In McCreight's algorithm, when the node a processor seeks to insert already exists, the processor begins a new scanning phase from that node. The above situation is similar.) Suppose instead that the processor was already in the scanning phase. Further, suppose it is associated with p_l, 1 ≤ l ≤ k. It sought to insert node x_i after performing an unsuccessful comparison involving p_l[len(x_i) + 1]. It then continues its scanning phase in the next time-step by comparing p_l[len(x_i) + 2] with the second character in the string associated with the edge between x_i and v; note that p_l[len(x_i) + 1] is guaranteed to match the first character in this string.

5. The new lists l_1 and l_2 are obtained for the nodes x, x_1, ..., x_h as follows. Let z, z' be the two nodes, if any, such that suf(z) = x and suf(z') = x (at least one of z, z' exists after the first suffix of each p_l, 1 ≤ l ≤ k, has been inserted). Let y' be the parent of x at the beginning of the current time-step. Note that y' may no longer be the parent of x. The new list l_1 at x comprises processors derived from the old lists l_1 at x, y', z and z' and the old lists l_2 at z, z'. The new list l_1 at x_i, 1 ≤ i ≤ h, comprises processors derived from the old lists l_1 at x, z_i and z'_i, and the old lists l_2 at z_i and z'_i (z_i, z'_i were defined in Step 3). If suf(x) is defined then the new list l_2 at x is empty; otherwise, the new list l_2 at x is obtained by inserting into the old list l_2 at x those processors in the sublist associated with x_1 which are the first processors in their respective groups. The new list l_2 at x_h is empty. The new list l_2 at x_i, 1 ≤ i < h, contains those processors in the sublist associated with x_{i+1} which are the first processors in their respective groups. Note that these processors are now in the rescanning phase.

Note that the processors in the old list l_1 at x contribute only to the new lists l_1 at x, x_1, ..., x_h, y, suf(x).
The old list $l_2$ at $x$ either remains unchanged, or has some processors from the old list $l_1$ at $x$ inserted into it, or becomes part of the new list $l_1$ at $\mathrm{suf}(x)$. Further, recall that those processors in the old list $l_1$ at $x$ which move to the new lists $l_1$ at $x_1, \ldots, x_h$ appear in order in the old list $l_1$ at $x$, i.e., a processor which moves to the new list $l_1$ at $x_j$ appears after a processor which moves to the new list $l_1$ at $x_{j'}$ if $j > j'$. In addition, note that the current string length of each processor in an old list $l_1$ can change by at most one, and the current string length of each processor in an old list $l_2$ remains unchanged. It can easily be checked that all new lists can be derived from old lists by partitioning each old list into a number of contiguous sublists, dividing each old list into a constant number of non-contiguous sublists, and merging together a constant number of the lists. Therefore the new lists can easily be obtained in $O(\log m)$ time and $O(m/r)$ work.

The following fact is noteworthy; it follows from point 4 above.

Fact 2 Consider one scanning phase of a processor which processes $p_l$, $1 \le l \le k$. Suppose this scanning phase begins with a comparison at $p_l[i]$ and ends with an unsuccessful comparison at $p_l[j]$. Then each character in $p_l[i \ldots j]$ is compared exactly once in this scanning phase, one character in each time-step.

Remark. An array storing, for each index $j$, $1 \le j \le m$, the leaf $x$ of $T_0$ such that $p[j \ldots j+r-1] = \mathrm{str}(x)$ can be obtained in the process of constructing $T_0$ without any time or work overhead. In addition, for each leaf $x$ of $T_0$, a list of the indices $j$ such that $p[j \ldots j+r-1] = \mathrm{str}(x)$ can also be obtained in the process of constructing $T_0$.

[Figure 3: The sequences $H_i$; panels (a), (b), (c).]

The Analysis

Since processors spend time waiting in the $l_2$ lists at various nodes in $T'_0$, it is not obvious that the total number of time-steps is $O(r)$. We show that this is indeed the case.

Definitions. $p(x)$ is defined to be the parent of $x$ when $x$ is inserted. Let $\mathit{root}$ denote the root node. $p(\mathit{root})$ is defined to be $\mathit{root}$. For a node $x$ in $T'_0$, let $\mathit{at}_x$ denote the time-step in which node $x$ is inserted in $T'_0$. Consider a particular processor which is associated with some piece of $s$. This processor inserts a sequence of $h$ leaves, $h \le 2r$; let $x_i$ be $p(\cdot)$ of the $i$th leaf in this sequence, i.e., its parent at the time it is inserted. The $h$ leaves correspond to the suffixes (in decreasing order of length) of the piece of $s$ processed by this processor.

The Node Sequences $H_i$. We define node sequences $H_i$, for $1 \le i \le h$. The first node in each sequence is $\mathit{root}$. The last node in sequence $H_i$ will be $x_i$. See Fig. 3a. For each node $x$ in sequence $H_i$, we will define a time-step $t_{x,i}$. $t_{\mathit{root},i}$ is defined to be 1 for all $i$. If $x$ follows $y$ in $H_i$ then $t_{x,i}$ will be greater than $t_{y,i}$ and $x$ will be a strict descendant of $y$ in $T'_0$. As we will show, the above definitions will guarantee that node $x$ is in $T'_0$ by time-step $t_{x,i}$, i.e., $\mathit{at}_x \le t_{x,i}$.

Sequence $H_1$. The first node in $H_1$ is the root node $\mathit{root}$ and $t_{\mathit{root},1} = 1$. Consider the sequence $S$ of nodes finally on the path from $\mathit{root}$ to $x_1$ in that order, with both endpoints excluded. The first node $z$ in $S$ such that $\mathit{at}_z \le t_{\mathit{root},1} + 1$ is added to $H_1$ and $t_{z,1}$ is defined to be $t_{\mathit{root},1} + 1$. Next, the first node $z'$ in $S$ following $z$ such that $\mathit{at}_{z'} \le t_{\mathit{root},1} + 2$ is added to $H_1$ and $t_{z',1}$ is defined to be $t_{\mathit{root},1} + 2$. This process is continued until no further nodes in $S$ can be added to $H_1$. Then $x_1$ is added to $H_1$. $t_{x_1,1}$ is defined to be $t_{\mathit{root},1} + \mathrm{len}(x_1) - \mathrm{len}(\mathit{root}) + 1 = \mathrm{len}(x_1) + 2$. Note that if a processor traverses the path from $\mathit{root}$ downwards towards $x_1$ starting at time-step 1, then the only nodes it can encounter are the nodes added to $H_1$ above.

The Remaining Sequences. Sequence $H_{j+1}$, $j \ge 1$, is defined next, assuming that $H_j$ has already been defined. $H_{j+1}$ will contain as a subsequence (not necessarily contiguous) the nodes $\mathrm{suf}(x)$, where $x$ is in $H_j$. Suppose $H_{j+1}$ has been defined up to $\mathrm{suf}(x)$ for some $x$ in $H_j$. Let $y = \mathrm{suf}(x)$. Let $z$ be the node following $x$ in $H_j$, if any. There are two cases to consider. If $z$ is defined then we describe how to augment $H_{j+1}$ until $\mathrm{suf}(z)$ has been included. If $z$ is not defined then we describe how the rest of the sequence $H_{j+1}$ is constructed.

First, suppose $x \ne x_j$, i.e., $z$ is defined. See Fig. 3b. Let $t = \max\{t_{y,j+1}, t_{z,j}\}$. Consider the sequence $S$ of nodes finally on the path from $y$ to $\mathrm{suf}(z)$ in that order, with $y$ and $\mathrm{suf}(z)$ excluded. The first node $z'$ in $S$ such that $\mathit{at}_{z'} \le t + 1$ is added to $H_{j+1}$ and $t_{z',j+1}$ is defined to be $t + 1$. Next, the first node $z''$ in $S$ following $z'$ such that $\mathit{at}_{z''} \le t + 2$ is added to $H_{j+1}$ and $t_{z'',j+1}$ is defined to be $t + 2$. This process is continued until no further nodes in $S$ can be added to $H_{j+1}$. Then $\mathrm{suf}(z)$ is added to $H_{j+1}$. If $S$ is non-empty then $t_{\mathrm{suf}(z),j+1}$ is defined to be $t_{w,j+1} + 1$, where $w \in S$ is the node preceding $\mathrm{suf}(z)$ in $H_{j+1}$. If $S$ is empty then $t_{\mathrm{suf}(z),j+1} = t + 1$. Note that if a processor traverses the path from $y$ downwards towards $\mathrm{suf}(z)$ starting at time-step $t$, then the only nodes it can encounter are the nodes added to $H_{j+1}$ above.

Next, suppose $z$ is not defined, i.e., $x = x_j$. If $y = x_{j+1}$ then $H_{j+1}$ is fully defined. So suppose $y \ne x_{j+1}$. See Fig. 3c. Then the processor begins a new scanning phase from $y$.
In this case, the construction of $H_{j+1}$ is similar to that of $H_1$. Let $t = t_{y,j+1}$. Consider the sequence $S$ of nodes finally on the path from $y$ to $x_{j+1}$ in that order, with both endpoints excluded. The first node $z'$ in $S$ such that $\mathit{at}_{z'} \le t + 1$ is added to $H_{j+1}$ and $t_{z',j+1}$ is defined to be $t + 1$. Next, the first node $z''$ in $S$ following $z'$ such that $\mathit{at}_{z''} \le t + 2$ is added to $H_{j+1}$ and $t_{z'',j+1}$ is defined to be $t + 2$. This process is continued until no further nodes in $S$ can be added to $H_{j+1}$. Then $x_{j+1}$ is added to $H_{j+1}$ and $t_{x_{j+1},j+1}$ is defined to be $t_{y,j+1} + \mathrm{len}(x_{j+1}) - \mathrm{len}(y) + 1$. Again, note that if a processor traverses the path from $y$ downwards towards $x_{j+1}$ starting at time-step $t$, then the only nodes it can encounter are the nodes added to $H_{j+1}$ above.

Analyzing the Sequences. The following key lemma holds for the sequences defined above. The proof of the lemma is given in Appendix 1. Here we only sketch the intuition.

Lemma 4.1 If $x \in H_i$, $1 \le i \le h$, then $\mathit{at}_x \le t_{x,i}$.

Intuition. The intuition behind this key lemma is the following. First, note that if $x$ does not have a suffix link into it from a node in $H_{i-1}$ and if $x \ne x_i$, the lemma is true by definition. We give the intuition here as to why the lemma is true for nodes which have suffix links from nodes in the previous sequence.

Let $z_0, z_f$ be two consecutive nodes in sequence $H_i$. Assume that $\mathit{at}_{z_f} \le t_{z_f,i}$ and $\mathit{at}_{\mathrm{suf}(z_0)} \le t_{\mathrm{suf}(z_0),i+1}$. We show that $\mathit{at}_{\mathrm{suf}(z_f)} < t_{\mathrm{suf}(z_f),i+1}$.

We illustrate the intuition with the easier case, i.e., when $z_0 = p(z_f)$. Then the processor which inserts $z_f$ will wait at $z_0$ at most until time-step $\max\{\mathit{at}_{z_f}, \mathit{at}_{\mathrm{suf}(z_0)}\} \le \max\{t_{z_f,i}, t_{\mathrm{suf}(z_0),i+1}\}$. After this time-step, the above processor starts its search for $\mathrm{suf}(z_f)$. By the way $H_{i+1}$ is defined, the only nodes that the processor can encounter in this process are those in $H_{i+1}$ strictly between $\mathrm{suf}(z_0)$ and $\mathrm{suf}(z_f)$; let $f$ be the number of such nodes. Therefore, by time-step $\max\{t_{z_f,i}, t_{\mathrm{suf}(z_0),i+1}\} + f + 1 \le t_{\mathrm{suf}(z_f),i+1}$, the above processor would either have located $\mathrm{suf}(z_f)$, if it already exists, or inserted $\mathrm{suf}(z_f)$, otherwise.

The harder case is when $z_0 \ne p(z_f)$. Suppose there exist nodes $z_1, \ldots, z_{f-1}$ such that $p(z_f) = z_{f-1}$, $p(z_{f-1}) = z_{f-2}$ and so on until $p(z_1) = z_0$. This is really the bad case because the processor that inserted $z_f$ could be waiting at $z_{f-1}$, the processor that inserted $z_{f-1}$ could be waiting at $z_{f-2}$, and so on. However, the desired result can again be obtained by repeating the argument of the previous paragraph with $z_0, z_1$, then $z_1, z_2$, and so on until $z_{f-1}, z_f$, and combining the results of each of these arguments.

We use Lemma 4.1 as follows to complete the analysis. The remaining portion is akin in spirit to the analysis of McCreight's algorithm.

Definitions. We define a sequence of nodes, called the path, derived from the sequences $H_1, \ldots, H_h$ as follows. Associated with each node $x$ in the path is a value $\mathrm{seq}(x)$ which is $i$ if $x$ is derived from $H_i$. The path begins at $x_h$ and ends at $\mathit{root}$. See Fig. 4a. $\mathrm{seq}(x_h) = h$ and the seq values of the nodes in the path are non-increasing. If $x$ is a node in the path, $\mathrm{seq}(x) = i$, $x \ne \mathit{root}$, then the node $y$ in the path which follows $x$ is determined as follows. If $i = 1$ then $y$ is the node preceding $x$ in $H_1$ and $\mathrm{seq}(y) = 1$. Suppose $i > 1$.
If there is no node $x' \in H_{i-1}$ such that $\mathrm{suf}(x') = x$ (i.e., $x = x_i$), then $y$ is the node preceding $x$ in $H_i$ and $\mathrm{seq}(y) = i$. Otherwise, suppose that there is a node $x' \in H_{i-1}$ such that $\mathrm{suf}(x') = x$. Let $z'$ be the node which immediately precedes $x'$ in $H_{i-1}$ and let $z = \mathrm{suf}(z')$. Then if $t_{z,i} > t_{x',i-1}$, $y = z$; otherwise, $y = x'$. In the former case $\mathrm{seq}(y) = i$ and in the latter case $\mathrm{seq}(y) = i - 1$.

The path is divided into maximal subsequences such that for all consecutive $x, y$ in a particular subsequence, $\mathrm{seq}(y) = \mathrm{seq}(x) - 1$, i.e., $\mathrm{suf}(y) = x$. Note that the last subsequence is a singleton subsequence containing only $\mathit{root}$, as $\mathrm{suf}(\mathit{root}) = \mathit{root}$. If $x, y$ are nodes in some $H_i$, $i > 1$, with $y$ preceding $x$, then we write $\delta_i(x, y)$ for the number of nodes in $H_i$ between $y$ and $x$.

[Figure 4: The path and a maximal subsequence; panels (a), (b).]

Lemma 4.2 Let $y_1, y_2, \ldots, y_j$ be one of the maximal subsequences of the path defined above and suppose $\mathrm{seq}(y_1) = e$, i.e., $\mathrm{seq}(y_2) = e - 1, \ldots, \mathrm{seq}(y_j) = e - j + 1$ (see Fig. 4b). Let $z$ be the node which follows $y_j$ in the path, i.e., $\mathrm{seq}(z) = e - j + 1$. If $j = 1$ then $t_{y_1,e} \le t_{z,e} + 2(\mathrm{len}(y_1) - \mathrm{len}(z))$. If $j > 1$ then $t_{y_1,e} \le t_{z,e-j+1} + 2(\mathrm{len}(y_1) - \mathrm{len}(z)) + 4j - 3$.

Proof. First, we show that

$$t_{y_f,e-f+1} \le t_{z,e-j+1} + 2(\mathrm{len}(y_j) - \mathrm{len}(z)) + 2(j - f + 1) - (\mathrm{len}(y_f) - \mathrm{len}(y'_f)) - 1$$

for $f = j, \ldots, 1$, where $y'_f$ is the node preceding $y_f$ in $H_{e-f+1}$. This claim is shown by induction on $f$, $f = j, \ldots, 1$, in the subsequent paragraphs.

The lemma follows when $j = 1$ because then $y'_1 = z$ and $\mathrm{len}(y_1) - \mathrm{len}(z) > 0$ and therefore

$$t_{y_1,e} \le t_{z,e} + (\mathrm{len}(y_1) - \mathrm{len}(z)) + 1 \le t_{z,e} + 2(\mathrm{len}(y_1) - \mathrm{len}(z));$$

for $j > 1$, the lemma follows because

$$\begin{aligned}
t_{y_1,e} &\le t_{z,e-j+1} + 2(\mathrm{len}(y_j) - \mathrm{len}(z)) + 2j - (\mathrm{len}(y_1) - \mathrm{len}(y'_1)) - 1 \\
&\le t_{z,e-j+1} + 2(\mathrm{len}(y_j) - \mathrm{len}(z)) + 2j - 1 \\
&\le t_{z,e-j+1} + 2(\mathrm{len}(y_1) + j - 1 - \mathrm{len}(z)) + 2j - 1 \\
&= t_{z,e-j+1} + 2(\mathrm{len}(y_1) - \mathrm{len}(z)) + 4j - 3.
\end{aligned}$$

As the base case, consider $f = j$. Then $y'_j = z$. If $y_j = x_{e-j+1}$ then $t_{y_j,e-j+1} \le t_{z,e-j+1} + (\mathrm{len}(y_j) - \mathrm{len}(z)) + 1$. If $y_j \ne x_{e-j+1}$ then, by the construction of the path, $t_{y_j,e-j+1} = t_{z,e-j+1} + \delta_{e-j+1}(y_j, z) + 1 \le t_{z,e-j+1} + (\mathrm{len}(y_j) - \mathrm{len}(z)) + 1$, where $\delta_{e-j+1}(y_j, z)$ is the number of nodes in $H_{e-j+1}$ between $z$ and $y_j$.

Next, assume that $t_{y_f,e-f+1} \le t_{z,e-j+1} + 2(\mathrm{len}(y_j) - \mathrm{len}(z)) + 2(j - f + 1) - (\mathrm{len}(y_f) - \mathrm{len}(y'_f)) - 1$, $f > 1$, and consider $t_{y_{f-1},e-f+2}$:

$$\begin{aligned}
t_{y_{f-1},e-f+2} &= t_{y_f,e-f+1} + \delta_{e-f+2}(y_{f-1}, \mathrm{suf}(y'_f)) + 1 \\
&\le t_{y_f,e-f+1} + (\mathrm{len}(y'_{f-1}) - \mathrm{len}(\mathrm{suf}(y'_f))) + 1 \\
&\le t_{z,e-j+1} + 2(\mathrm{len}(y_j) - \mathrm{len}(z)) + 2(j - f + 1) - (\mathrm{len}(y_f) - \mathrm{len}(y'_f)) - 1 + (\mathrm{len}(y'_{f-1}) - \mathrm{len}(\mathrm{suf}(y'_f))) + 1.
\end{aligned}$$

Since $\mathrm{len}(y_{f-1}) - \mathrm{len}(y'_{f-1}) = (\mathrm{len}(y_f) - \mathrm{len}(y'_f)) - (\mathrm{len}(y'_{f-1}) - \mathrm{len}(\mathrm{suf}(y'_f)))$, we get

$$t_{y_{f-1},e-f+2} \le t_{z,e-j+1} + 2(\mathrm{len}(y_j) - \mathrm{len}(z)) + 2(j - (f - 1) + 1) - (\mathrm{len}(y_{f-1}) - \mathrm{len}(y'_{f-1})) - 1,$$

as claimed. $\Box$

Lemma 4.3 $t_{x_h,h} \le 14r$.

Proof. Let $u_i, v_i$ be the extreme nodes in the $i$th subsequence and let $\mathrm{size}(i)$ be the number of nodes in the $i$th subsequence. Let $f$ be the number of subsequences. $u_f = v_f = \mathit{root}$ and $u_1 = x_h$. The sum $\sum [4\,\mathrm{size}(i) - 3]$ over all non-singleton subsequences is clearly at most $5h$. Then, by Lemma 4.2,

$$t_{x_h,h} - t_{\mathit{root},\mathrm{seq}(\mathit{root})} \le \sum_{1 \le i < f} \left[2(\mathrm{len}(u_i) - \mathrm{len}(u_{i+1}))\right] + 5h \le 2\,\mathrm{len}(x_h) + 5h \le 4r + 10r = 14r. \quad \Box$$
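The last chain of inequalities in the proof of Lemma 4.3 rests on a telescoping sum. The short derivation below spells it out as a sketch; it assumes that the node following the $i$th subsequence in the path is $u_{i+1}$ (so that Lemma 4.2 applies with $y_1 = u_i$ and $z = u_{i+1}$), that $\mathrm{len}(\mathit{root}) = 0$, and that $\mathrm{len}(x_h) \le 2r$ and $h \le 2r$, which are the bounds implied by the final step $4r + 10r$.

```latex
\begin{align*}
\sum_{1 \le i < f} 2\bigl(\mathrm{len}(u_i) - \mathrm{len}(u_{i+1})\bigr)
  &= 2\bigl(\mathrm{len}(u_1) - \mathrm{len}(u_f)\bigr)
   = 2\,\mathrm{len}(x_h) - 2\,\mathrm{len}(\mathit{root})
   = 2\,\mathrm{len}(x_h), \\
2\,\mathrm{len}(x_h) + 5h
  &\le 2\,(2r) + 5\,(2r) = 14r .
\end{align*}
```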

Corollary 4.4 The total number of time-steps taken by each processor is at most $14r$.

Theorem 4.5 There exists an algorithm which constructs the $r$-st of $s$ in $O(r \log m)$ time and $O(m)$ work.

5 Completing the Suffix Tree

Given an $r$-suffix tree $T_0$ of $s$, we show how to compute the complete suffix tree $ST$ of $s$. This is done in $\log m - \log r$ iterations. In the $i$th iteration, the $2^i r$-st of $s$ is obtained.

Definitions. Let $T_i$ denote the $2^i r$-st of $s$. Let $\mathrm{leaf}_i(j)$ be the leaf in $T_i$ such that there is an occurrence of $\mathrm{str}(\mathrm{leaf}_i(j))$ beginning at index $j$ in $p$. For any leaf $l \in T_i$, let $\mathrm{indices}_i(l)$ be the set of indices $j$ such that there is an instance of $\mathrm{str}(l)$ beginning at $j$, i.e., $\mathrm{indices}_i(l) = \{j \mid \mathrm{leaf}_i(j) = l\}$.

A Naive Algorithm. First, we describe a naive algorithm which computes $ST$ in $O(m \log^2 m)$ work. At the end of iteration $i - 1$, for each index $j$, $1 \le j \le m$, we keep track of $\mathrm{leaf}_{i-1}(j)$. Consider the $i$th iteration. $T_{i-1}$ is preprocessed in $O(m)$ work and $O(\log m)$ time for order queries on leaves (see Section 2). Following this, the relative order and longest common prefix of any two substrings of $p$ of length at most $2^i r$ can be determined in constant time and work. Next, all leaves $l$ of $T_{i-1}$ such that $\mathrm{str}(l)$ is not a suffix of $p$ are processed in parallel. Consider one such leaf $l$. The substrings of $p$ of length $2^i r$ beginning at indices in $\mathrm{indices}_{i-1}(l)$ are sorted; subsequently, one representative is selected from each equivalence class of identical substrings. Each representative substring is said to represent all the substrings in its equivalence class. Let $L_i(l)$ denote the ordered list of these representative substrings. The trie induced by the strings in $L_i(l)$ is constructed using the induced trie construction algorithm mentioned in Section 2, and the root of this trie is merged with $l$ (note that following this merger, $l$ may no longer be a node in $T_i$). Computing $L_i(l)$ takes $O(\log m)$ time and $O(m \log m)$ work over all leaves $l$ of $T_{i-1}$.
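The per-leaf step of the naive iteration can be illustrated with a small sequential sketch. It is not the parallel procedure of the paper: leaves are represented simply by their strings, substrings are compared directly instead of through constant-time order queries, and the helper names `classes_of` and `naive_iteration` are hypothetical. It also assumes that $p$ ends in a terminator `$` and that Python's default character order is an acceptable stand-in for the alphabet order.

```python
from collections import defaultdict


def classes_of(p, length):
    """Group 0-based start indices of p by the substring of at most
    `length` characters starting there (a stand-in for indices(l))."""
    groups = defaultdict(list)
    for j in range(len(p)):
        groups[p[j:j + length]].append(j)
    return dict(groups)


def naive_iteration(p, prev_classes, new_length):
    """For every leaf string s that is not a suffix of p, return the
    sorted list of distinct substrings of length `new_length` that
    start at the indices of s (a stand-in for the list L(l))."""
    rep_lists = {}
    for s, starts in prev_classes.items():
        if p.endswith(s):
            # str(l) is a suffix of p: this leaf is not processed.
            continue
        # Sort and keep one representative per class of equal substrings.
        rep_lists[s] = sorted({p[j:j + new_length] for j in starts})
    return rep_lists


if __name__ == "__main__":
    p = "0101$"
    classes = classes_of(p, 1)
    print(naive_iteration(p, classes, 2))
```

Each list returned here would then be handed to an induced trie construction and the resulting trie merged at the corresponding leaf, as described above.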
Computing the trie induced by the strings in $L_i(l)$ takes $O(\log m)$ time and $O(m)$ work, over all leaves $l$ of $T_{i-1}$. The overall algorithm is made to run in $O(\log^2 m)$ time and $O(m \log^2 m)$ work by starting with $r = 1$ and performing $O(\log m)$ of the above iterations.

Reducing the Work. There are two main components in the above algorithm which lead to superlinear work. The first is obtaining the ordered list $L_i(l)$ of representative substrings. The second involves preprocessing $T_0, T_1, \ldots$ for order queries. In our scheme too, we construct, for each leaf $l$ of $T_{i-1}$, the trie induced by the strings in $L_i(l)$ in iteration $i$. This is done as before by first obtaining $L_i(l)$ for each leaf $l$ of $T_{i-1}$ and then using the induced trie construction algorithm. In order to restrict the total work done to $O(m)$, we use the fact that $ST$ has a highly repetitive structure. Exploiting this repetitive structure, we show how to obtain the ordered lists $L_i(l)$ for all leaves $l$ of $T_{i-1}$ by sorting and preprocessing (for order queries) only $O(m / \log^2 m)$ items rather than up to $m$ items. In the process, we make critical use of the property defined in Corollary 3.3 and also of periodicity properties of strings. The overall algorithm then takes linear work.

[Figure 5: 1-, 0-, and \$-links ($u$ is a string and $a$ = 0/1).]

The New Scheme. $T_0$ is constructed with $r = 2\lceil \log^3 m \rceil$ in $O(\log^4 m)$ time and $O(m)$ work. Note that by Corollary 3.3, $n_r = O(m / \log^2 m)$. Next, $\log m$ iterations are performed, each iteration taking $O(\log^3 m)$ time. In iteration $i$, the ordered list $L_i(l)$ is computed for all leaves $l$ of $T_{i-1}$ in two steps. First, $L_i(l)$ is obtained for a subset of the leaves $l$ of $T_{i-1}$; these leaves are called sources. Next, $L_i(l)$ is obtained for all leaves of $T_{i-1}$ using the computation in the first step and the repetitive nature of $ST$.

$ST$ is repetitive in the following sense. If nodes $x, y$ are such that $y = \mathrm{suf}(x)$ and there are no other suffix links pointing to $y$, then the subtree of $ST$ rooted at $y$ is "identical" to that at $x$. If nodes $x, y, z$ are such that $y = \mathrm{suf}(x)$ and $y = \mathrm{suf}(z)$, then the subtree of $ST$ rooted at $y$ is a "merger" of the subtrees rooted at $x$ and $z$.

The issues which need to be addressed next are how sources are chosen, how $L_i(l)$ is obtained efficiently for sources $l$, and how $L_i(l)$ is obtained for all remaining leaves $l$. We discuss each of these three issues in turn. First, we need some definitions and primitives.

Definitions. Let $H_{i-1}$ be the set of leaves of $T_{i-1}$. We define 0-links, 1-links, and \$-links as follows (see Fig. 5). The 0-link of $l \in H_{i-1}$ points to the leaf $l'$ in $H_{i-1}$, if any, such that $\mathrm{str}(l)$ and $\mathrm{str}(l')$ are consecutive substrings of $p$ and $\mathrm{str}(l')$ ends in 0. The 1-link of $l \in H_{i-1}$ points to the leaf $l'$ in $H_{i-1}$, if any, such that $\mathrm{str}(l)$ and $\mathrm{str}(l')$ are consecutive substrings of $p$ and $\mathrm{str}(l')$ ends in 1. The \$-link of $l \in H_{i-1}$ points to the leaf $l'$ in $H_{i-1}$, if any, such that $\mathrm{str}(l)$ and $\mathrm{str}(l')$ are consecutive substrings of $p$ and $\mathrm{str}(l')$ ends in \$. 0-links, 1-links, and \$-links are collectively called next-links.
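The next-link definitions can be made concrete with a small sequential sketch. It identifies each leaf with its string, takes "consecutive substrings" to mean windows starting at adjacent indices $j$ and $j+1$, and only considers windows of full length $r$ that fit inside $p$; these simplifications and the function name `next_links` are assumptions made for illustration.

```python
def next_links(p, r):
    """Return a dict mapping (leaf_string, symbol) to the target leaf
    string, where the target window starts one position later and ends
    in `symbol` (so symbol '0' gives a 0-link, '1' a 1-link, and '$' a
    $-link)."""
    links = {}
    for j in range(len(p) - r):          # both windows must fit in p
        src = p[j:j + r]                 # window starting at j
        dst = p[j + 1:j + 1 + r]         # consecutive window
        # The target is determined by src and the symbol, since
        # dst == src[1:] + symbol, so repeated occurrences agree.
        links[(src, dst[-1])] = dst
    return links


if __name__ == "__main__":
    for (src, symbol), dst in sorted(next_links("0110$", 2).items()):
        print(f"{symbol}-link: {src} -> {dst}")
```

Because a target string `dst` can only be reached from the strings `a + dst[:-1]` with `a` a single character, each vertex of the resulting digraph has few incoming edges, which is the in-degree the later definitions refer to.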
Let $J_{i-1}$ denote the digraph whose vertices are the leaves in $H_{i-1}$ and whose edges are the next-links among these leaves.

In Section 6, we consider the digraph $G = J_0 = (H_0, E)$, where $E$ is the set of next-links between vertices in $H_0$, and show how to choose $O(m \log m / r) = O(m / \log^2 m)$ leaves of $T_0$, called origin leaves. In iteration $i$, source leaves will be determined using these origin leaves. In Section 7, we show how to obtain $L_i(l)$ for source leaves in iteration $i$. In Section 8, we show how to obtain $L_i(l)$ for the rest of the leaves in iteration $i$. In both sections, for each leaf $l$ of $T_{i-1}$, we show how a list $L'_i(l)$ with the following description is obtained first. $L'_i(l)$ is an ordered list of one representative string from each equivalence class of the set of substrings of $p$ of length $k \cdot 2^{i-1} r$ which begin at indices in $\mathrm{indices}_{i-1}(l)$, for some $k$, $2 \le k \le 3$. The value of $k$ is different for different leaves $l$. In particular, for source leaves, $k = 3$. Note that if $k = 2$ for leaf $l$, $L'_i(l) = L_i(l)$. Clearly, if the longest common prefix of adjacent strings in $L'_i(l)$ is known, which will indeed be the case, $L_i(l)$ can easily be obtained from $L'_i(l)$ in constant time and $O(|L'_i(l)|)$ work. Henceforth, we refer to $L'_i(l)$ as the list at $l$. In Section 9, we show how some data structures defined in Sections 7 and 8 are maintained.

6 Choosing Origin Leaves

Recall from the remark at the end of Section 4.1 that an array storing $\mathrm{leaf}_0(j)$, for each index $j$, $1 \le j \le m$, was obtained while constructing $T_0$. The next-links for the set of leaves $H_0$ in $T_0$ can be set up easily from the above information and therefore $G$ can be obtained easily. Equivalently, $G$ can be obtained during the construction of $T_0$ itself.

To select origin leaves, we remove edges from $G$ until each connected component in the resulting graph is a rooted tree of $O(\log^2 m)$ height. There will be $O(m / \log^2 m)$ such trees. The roots of these trees are chosen as origin leaves. In iteration $i$, a similar (but implicit) removal of edges from graph $J_{i-1}$ will be performed. These removals will result in connected components which are rooted trees of $O(\log^2 m)$ height; the roots of these trees will be the sources in iteration $i$. Before describing how origin leaves are determined, we need the following definitions and lemmas.

Definitions. Consider an index $j$, $2 \le j \le m$. Let $u = p[j \ldots j+r-1]$, $v = p[j-1]u$, and let $v'$ equal $v$ with the first character complemented. $j$ is called a -index if both the following conditions hold.

1. $v$ and $v'$ occur at least once each in $p$.
2. Either $v$ occurs fewer times than $v'$ in $p$, or $v$ occurs as many times as $v'$ and $p[j-1] = 0$.
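The two conditions above can be checked directly with a brute-force sketch. It assumes a binary alphabet with terminator `$`, counts occurrences naively (overlaps included) rather than through the lists of indices stored at the leaves of $T_0$, and uses 1-based indices $j$ as in the text; the function names are hypothetical.

```python
def count_occurrences(p, w):
    """Number of (possibly overlapping) occurrences of w in p."""
    return sum(1 for k in range(len(p) - len(w) + 1) if p.startswith(w, k))


def satisfies_both_conditions(p, r, j):
    """Check conditions 1 and 2 for a 1-based index j >= 2."""
    u = p[j - 1:j - 1 + r]        # p[j .. j+r-1] in 1-based notation
    a = p[j - 2]                  # p[j-1] in 1-based notation
    if a not in "01" or len(u) < r:
        return False              # complement undefined or window too short
    v = a + u
    v_comp = ("1" if a == "0" else "0") + u
    n_v = count_occurrences(p, v)
    n_v_comp = count_occurrences(p, v_comp)
    if n_v == 0 or n_v_comp == 0:                        # condition 1
        return False
    return n_v < n_v_comp or (n_v == n_v_comp and a == "0")  # condition 2


if __name__ == "__main__":
    p = "0010110$"
    print([j for j in range(2, len(p) + 1)
           if satisfies_both_conditions(p, 1, j)])
```

Condition 2 breaks the symmetry between $v$ and $v'$: whenever both occur, the indices belonging to exactly one of the two strings pass the check.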
For technical reasons, index 1 is also defined to be a -index. The leaf $\mathrm{leaf}_0(j)$ is called a -node in $G$ if $j$ is a -index. The leaf $\mathrm{leaf}_{i-1}(j)$ is called a -node in $J_{i-1}$ if $j$ is a -index. Recall from the remark in Section 4.1 that for each leaf $x$ of $T_0$, a list of indices $j$ such that $p[j \ldots j+r-1] = \mathrm{str}(x)$ was obtained in the process of constructing $T_0$. From this information and using the next-links among the leaves of $T_0$, all -indices and -nodes in $G$ can be determined easily using the following lemma.

Lemma 6.1 With the exception of $\mathrm{leaf}_0(1)$, a node in $G$ is a -node only if it has in-degree 2 in $G$. Further, any node in $J_h$ which has in-degree 2 is a -node, where $0 \le h \le \log m$. In addition, if there are edges from nodes $l', l''$ to $l$ in $J_h$, $0 \le h \le \log m$,
