Constant-Space String-Matching in Sublinear Average Time
(Extended Abstract)

Maxime Crochemore *
Université de Marne-la-Vallée

Leszek Gąsieniec †
Max-Planck-Institut für Informatik

Wojciech Rytter ‡
Warsaw University and University of Liverpool

Abstract

Given two strings: pattern $P$ of length $m$ and text $T$ of length $n$. The string-matching problem is to find all occurrences of the pattern $P$ in the text $T$. We present simple string-matching algorithms which work in average $o(n)$ time with constant additional space for one-dimensional texts and two-dimensional arrays. This is the first attempt at the small-space string-matching problem in which sublinear time algorithms are delivered. More precisely, we show that all occurrences of one- or two-dimensional patterns can be found in $O(n/r)$ average time with constant memory, where $r$ is the repetition size (size of the longest repeated subword) of $P$.

* Institut Gaspard Monge, Université de Marne-la-Vallée, France (mac@univ-mlv.fr).
† Max-Planck-Institut für Informatik, Im Stadtwald, D-66123 Saarbrücken, Germany (leszek@mpi-sb.mpg.de).
‡ Institute of Informatics, Warsaw University, Poland and Department of Computer Science, University of Liverpool, U.K. Supported by the grant KBN 8T11C208 (rytter@mimuw.edu.pl).

1 Introduction

The string-matching problem is defined as follows. Assume we are given two strings: pattern $P$ of length $m$ and text $T$ of length $n$. The pattern occurs at position $i$ in text $T$ iff $P = T[i..i+m-1]$. We consider algorithms that determine all occurrences of the pattern $P$ in the text $T$. The complexity of a string-matching algorithm is measured by the number of symbol comparisons of pattern and text symbols.

The algorithms solving the string-matching problem in linear time and constant space are perhaps the most interesting ones among all designed for the problem. The first algorithm which uses a constant amount of additional memory was proposed by Galil and Seiferas in [8]. Later Crochemore and Perrin [4] presented an algorithm that achieves a smaller (at most $2n$) number of comparisons while preserving the small amount of memory. Then, another improvement ($\frac{3}{2}$) on the number of comparisons was presented by Breslauer in [2]. In the meantime, alternative algorithms were introduced by Gąsieniec, Plandowski and Rytter in [9] ($2 + \varepsilon$) and [10] ($1 + \varepsilon$).

Besides, there are known algorithms which make a sublinear number of comparisons on the average. The first such method was proposed in [11] for strings. An attempt at two-dimensional pattern matching that is fast on the average is due to Baeza-Yates and Régnier [1]. However, all known sublinear average time algorithms use linear-size additional memory to keep a table of shifts as in the Boyer-Moore algorithm (see e.g. [11], [7]), or for the representation of a directed subword graph or equivalent data structures (see e.g. [3] and [6]). The latter algorithms have the best possible $O(\frac{n \log m}{m})$ average time complexity due to the lower bound of Yao [12]. One can try to find a trade-off between small space and good average time by applying techniques from [3] to the subwords of the pattern $P$. This might lead to an algorithm which uses $O(s)$ space (size of the preprocessed subwords) and has $O(\frac{n \log s}{s})$ average time.
Until now there was no algorithm both performing an average sublinear number of comparisons and using only constant memory space. In this paper we present the novel idea of such an algorithm for one-dimensional strings as well as for two-dimensional arrays. The idea of the algorithms is based on the use of subword repetitions.

For simplicity of presentation we assume that all strings considered in the paper are built over a binary alphabet $\Sigma = \{a, b\}$. We say that the word $w \in \Sigma^*$ has a period $q$ ($0 < q \le |w|$) if $w[i] = w[i + q]$ for

all positions $1 \le i \le |w| - q$. The shortest period of $w$ is called the period of $w$. If it satisfies $q \le |w|/2$, then the word $w$ is called periodic; otherwise, $w$ is called nonperiodic.

2 Nonperiodic one-dimensional patterns

In this section we assume that the pattern $P$ is nonperiodic. Let us denote by $\mathit{rep\_size}(P)$ the length of a longest repeated subword of $P$.

Example 1. The repeated subword babbaabab occurs twice in the example text given below.
$\mathit{rep\_size}(ababbaababaaababbaababba) = 9$.

The number of logarithmic-size subwords of a text is large enough to guarantee that at least one of them repeats: $P$ has $m - k + 1$ subwords of length $k$, while there are only $2^k$ distinct binary words of that length. This implies easily the following fact.

Lemma 1 For each pattern $P$ of size $m$, $\mathit{rep\_size}(P) = \Omega(\log m)$.

Denote $r = \mathit{rep\_size}(P)$, and let $w$ be a longest repeated subword. Assume

$P[p-r..p-1] = P[q-r..q-1]$, $p \le q - r$ and $P[p] \ne P[q]$.

In Example 1 we have $(w, r, p, q) = (babbaabab, 9, 11, 23)$. The positions $p, q$ are mismatches w.r.t. the repetition of the word $w$. In general, if there are no mismatch positions based on the repetition $w$ to the right of the two copies of $w$, then we try to find them to the left, reversing the string-matching process. If no mismatch is found either to the right or to the left, the repetition occurs at the borders of the pattern. This case is handled similarly to the periodic case discussed in the next section.

We say that a position $i$ in $T$ is a mismatch position iff $T[i+p-1] \ne T[i+q-1]$. We call a window any interval of positions $[i..i+r-1]$ on $T$, for $1 \le i \le n-r+1$. Assume w.l.o.g. that we already know the 4-tuple $(w, r, p, q)$.
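The definitions of period, periodic word and 4-tuple given above can be checked mechanically. The following is a minimal Python sketch, not part of the paper's algorithms: the function names `smallest_period`, `is_periodic` and `is_valid_tuple` are hypothetical, and the brute-force period computation is chosen for clarity rather than efficiency. The tuple check keeps the 1-based positions $p$ and $q$ used in the text.

```python
def smallest_period(w: str) -> int:
    """Smallest q with 0 < q <= len(w) and w[i] == w[i + q] wherever both exist."""
    n = len(w)
    for q in range(1, n + 1):
        if all(w[i] == w[i + q] for i in range(n - q)):
            return q
    return n  # only reached for the empty word


def is_periodic(w: str) -> bool:
    """A word is periodic when its smallest period is at most half its length."""
    return 2 * smallest_period(w) <= len(w)


def is_valid_tuple(P: str, r: int, p: int, q: int) -> bool:
    """Check a 4-tuple (w, r, p, q), with p and q given as 1-based positions.

    Requires P[p-r..p-1] == P[q-r..q-1], p <= q - r and P[p] != P[q].
    """
    if not (r >= 1 and p - r >= 1 and p <= q - r and q <= len(P)):
        return False
    first = P[p - r - 1:p - 1]    # 1-based P[p-r..p-1]
    second = P[q - r - 1:q - 1]   # 1-based P[q-r..q-1]
    return first == second and P[p - 1] != P[q - 1]
```

For instance, `is_valid_tuple("ababbaababaaababbaababba", 9, 11, 23)` confirms that the two copies of babbaabab end just before positions 11 and 23 and that the symbols at those two positions differ.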

Denote by Leftmost_Mismatch($W$) the procedure that finds the first (from the left) mismatch position in a given window $W$. If there is no such mismatch position then a special value nil is returned.

Lemma 2 (1) If Leftmost_Mismatch($W$) = nil, no position of $P$ in $T$ is in $W$. (2) Otherwise, no position of $P$ in $T$ is in $W - \{$Leftmost_Mismatch($W$)$\}$.

Proof. If $P$ occurs at position $j$ then $T[j+p-1] = P[p] \ne P[q] = T[j+q-1]$, so $j$ is a mismatch position. Moreover, for every $i$ with $j - r < i < j$, the symbols $T[i+p-1]$ and $T[i+q-1]$ lie at the same offset inside the two copies of $w$ within that occurrence, so $i$ is not a mismatch position. Hence an occurrence in $W$ can only start at the leftmost mismatch position of $W$: the mismatch is used as a constant-size deterministic sample. □

Denote by Naive_Check($i$) the procedure that tests a possible occurrence of $P$ starting at a given position $i$ in $T$ by testing the equality of corresponding symbols from left to right. In the worst case $m$ comparisons are done, but for random binary texts $T$ the average time is small. We assume that symbols of the text are uniformly distributed.

Lemma 3 On random texts each of the procedures Naive_Check and Leftmost_Mismatch makes on the average less than 2 comparisons.

Proof. The sum $\sum_{i \ge 1} i/2^i$ is bounded by 2. □

Lemma 4 Assume that pattern $P$ is nonperiodic. Then, for a random text $T$, we can find all the occurrences of $P$ in $T$ in $O(\frac{n}{\mathit{rep\_size}(P)})$, which is $O(\frac{n}{\log m})$, average time using constant additional memory. The worst-case running time of the algorithm is $O(n)$.

Proof. There are $O(n/r)$ iterations in the algorithm Nonperiodic_Pattern_Searching below. Each iteration uses at most 4 comparisons on the average for the execution of both Naive_Check and Leftmost_Mismatch, due to Lemma 3. The comparisons done during different iterations can depend on each other, but independence is not needed, since the average value of a sum of random variables is the sum of their average values. Therefore the algorithm makes altogether $O(n/r)$ comparisons on the average.
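The two procedures and the window-by-window scan described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, text positions are 0-based, $p$ and $q$ keep the paper's 1-based meaning, the tuple $(r, p, q)$ is assumed to be known and valid, and the last window is clamped so that no symbol beyond the end of the text is read.

```python
from typing import List, Optional


def leftmost_mismatch(T: str, start: int, end: int, p: int, q: int) -> Optional[int]:
    """First 0-based position i in [start, end) with T[i+p-1] != T[i+q-1], else None.

    p and q are the 1-based mismatch positions of the pattern.
    """
    for i in range(start, end):
        if T[i + p - 1] != T[i + q - 1]:
            return i
    return None


def naive_check(T: str, P: str, i: int) -> bool:
    """Compare P with T[i:i+len(P)] symbol by symbol, left to right."""
    for k in range(len(P)):
        if T[i + k] != P[k]:
            return False
    return True


def nonperiodic_search(T: str, P: str, r: int, p: int, q: int) -> List[int]:
    """Return all 0-based occurrences of P in T, one window of r positions at a time."""
    n, m = len(T), len(P)
    hits = []
    i = 0
    while i <= n - m:                    # i ranges over valid 0-based start positions
        end = min(i + r, n - m + 1)      # clamp the last window to valid starts
        cand = leftmost_mismatch(T, i, end, p, q)
        # By Lemma 2, only the leftmost mismatch position can start an occurrence.
        if cand is not None and naive_check(T, P, cand):
            hits.append(cand)
        i += r
    return hits
```

With the pattern of Example 1 and the tuple $(9, 11, 23)$, a call such as `nonperiodic_search(T, P, 9, 11, 23)` inspects one candidate per window of 9 text positions; constant extra space is used apart from the output list.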

ALGORITHM Nonperiodic_Pattern_Searching; { nonperiodic pattern }
    i := 1; r := rep_size(P);
    while i ≤ n − m + 1 do begin
        W := [i..i + r − 1];
        i' := Leftmost_Mismatch(W);
        if i' ≠ nil then
            if Naive_Check(i') then report match at i';
        i := i + r;
    end

Similarly to the algorithm presented in [10], we can guarantee the linear worst-case time of the algorithm Nonperiodic_Pattern_Searching since the shifts are based on a longest repeated subword of the pattern. This completes the proof. □

3 Periodic one-dimensional patterns

Assume now that $P$ is periodic, so obviously its repetition size is large.

Lemma 5 If $P$ is periodic then $\mathit{rep\_size}(P) \ge \frac{m}{2}$.

In this situation we cannot use the approach based on 4-tuples $(w, r, p, q)$. Thus we derive a slightly different algorithm, which is even more efficient than the one used in the nonperiodic case.

Lemma 6 Assume $P$ is periodic. Then for a random text $T$ we can find all occurrences of $P$ in $T$ in $O(\frac{n}{m})$ average time using constant additional memory. The worst-case time of the algorithm is linear.

Proof. Assume $p$ is the period of $P$, where $p \le |P|/2$. We can partition the positions in

$T$ into disjoint consecutive large windows; each window consists of $m/2$ consecutive positions of $T$ (the last one can be smaller). The first large window is $[1..m/2]$. The algorithm makes $\frac{n}{m/2}$ iterations. We process each large window as follows. Assume that the current window is $[i+1..i+m/2]$.

Phase 1. Find the rightmost mismatch in $T$ according to the period $p$ in the segment $[i+1..i+m]$, that is, the rightmost pair of positions $k-p$, $k$ in the segment with $T[k-p] \ne T[k]$. If a mismatch is found then switch to the next window $[i+m/2+1..i+m]$ and execute Phase 1 again; otherwise go to Phase 2.

Phase 2. Search naively for an occurrence of $P$ starting in the current window.

The probability that we do not have a mismatch in Phase 1 is exponentially small, so the expected cost of the second phase is very small even if we search for the occurrence naively. The expected time to find a mismatch in the first phase is $O(1)$. There are $O(n/m)$ iterations, so the total cost is as required. This completes the proof. □

The nonperiodic case in which the repetition is placed on the borders is handled in the same way but with windows of size $O(r)$. Lemma 4 and Lemma 6 imply the following result.

Theorem 7 For a random text $T$ we can find all occurrences of $P$ in $T$ in $O(\frac{n}{\mathit{rep\_size}(P)})$ average time (which is $O(\frac{n}{\log m})$) using constant additional memory. The worst-case time of the algorithm is linear.

4 Two-dimensional pattern-matching

In this section we show that also for the 2d-pattern matching problem the efficiency of a search depends on the repetition size. Assume the pattern $P$ and the text $T$ are $m \times m$ and $n \times n$ symbol arrays, respectively. Denote $N = n^2$, $M = m^2$. We say that the pattern occurs in $T$ at position $(i, j)$ iff $P[x, y] = T[i+x-1, j+y-1]$ for all integers $1 \le x, y \le m$. A 2-dimensional pattern $P$ has a period $[a, b]$ if $P[i, j] = P[i+a, j+b]$ for all $1 \le i \le m-a$ and $1 \le j \le m-b$.
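Returning to the one-dimensional periodic case of Section 3, the two-phase processing of large windows can be sketched in Python as follows. This is a hedged sketch rather than the authors' procedure: the function name `periodic_search` is hypothetical, positions are 0-based, and $p$ is assumed to be a period of $P$ with $2p \le m$. One detail is an assumption added to keep the sketch correct in every case: when Phase 1 finds its rightmost mismatch at $k$, the sketch still verifies the starting positions to the right of $k - p$ in the current window, because only occurrences starting at or before $k - p$ are excluded by that mismatch. On random texts this set of surviving candidates is empty or very small.

```python
from typing import List


def periodic_search(T: str, P: str, p: int) -> List[int]:
    """Return all 0-based occurrences of P in T.

    Assumes p is a period of P with 2 * p <= len(P) (not verified here).
    """
    n, m = len(T), len(P)
    half = max(1, m // 2)                # size of a large window
    hits = []
    s = 0                                # 0-based start of the current large window
    while s <= n - m:
        last = min(s + half, n - m + 1)  # candidate starts are s .. last - 1
        first = s
        # Phase 1: scan the segment T[s .. s+m-1] right to left for T[k] != T[k-p].
        for k in range(s + m - 1, s + p - 1, -1):
            if T[k] != T[k - p]:
                # An occurrence starting at j <= k - p in this window would
                # cover both k - p and k, so only starts j > k - p survive.
                first = k - p + 1
                break
        # Phase 2: naive verification of the surviving candidates.
        for j in range(first, last):
            if T.startswith(P, j):
                hits.append(j)
        s += half
    return hits
```

For example, with $P$ = aaaa and $p = 1$, the text aaaaaabaaaa yields the occurrences 0, 1, 2 and 7; the windows that see the symbol b near their right end discard their candidates after a handful of comparisons.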

If pattern $P$ has a period $[a, b]$ such that $\max\{a, b\} \le \frac{m}{2}$ then it is called periodic. Denote by $\mathit{1rep\_size}(P)$ the maximum repetition size of a row of $P$.

Theorem 8 Assume $P$ and $T$ are two-dimensional texts. For a random two-dimensional text $T$ there is an algorithm that finds all the occurrences of $P$ in $T$ in $O(\frac{N}{\mathit{1rep\_size}(P)})$, which is $O(\frac{N}{\log M})$, average time using constant additional memory. If $P$ contains a periodic row then the algorithm performs only $O(\frac{N}{m})$ comparisons.

Proof. As in the 1-dimensional case, we consider the periodic and nonperiodic cases separately. The algorithm is almost the same as for one dimension. We can construct a 2-dimensional version of the algorithm Nonperiodic_Pattern_Searching. In the case where all rows of the pattern are nonperiodic, the algorithm takes the first row of the pattern and looks for it by scanning each row of $T$ partitioned into windows of size $\mathit{1rep\_size}(P)$. For each window at most one position involves a test for an occurrence of the whole pattern. Instead of Naive_Check($i'$), a version for 2 dimensions, 2d-Naive_Check($i'$, $j'$), is used. According to Lemma 1 we have altogether $N/\mathit{1rep\_size}(P)$ windows, and in each of them the average number of comparisons is constant. Hence the total number of comparisons is $O(N/\mathit{1rep\_size}(P))$, which is $O(\frac{N}{\log M})$ since $\mathit{1rep\_size}(P) = \Omega(\log M)$.

In the case where pattern $P$ has at least one periodic row, the algorithm chooses one such row and then proceeds in a similar way as in the 1-dimensional case. Each row of $T$ is partitioned into large windows. There are $O(\frac{N}{m})$ such windows, and in each of them the algorithm makes a constant number of comparisons on the average. Hence the total number of comparisons is $O(\frac{N}{m})$. This completes the proof. □

In the case of a periodic pattern $P$ the text search can be done faster.

Theorem 9 If the pattern $P$ is periodic the search for it in $T$ can be done in time $O(\frac{N}{M})$.

Proof. Since the pattern $P$ is periodic it has two repeated subrectangles of size at least $\frac{m}{2} \times \frac{m}{2}$ (see Fig. 1, and the shaded areas named A), which defines a set of pairs of equal symbols of size $\Omega(M)$. We consider the right bottom quadrants D and E of these rectangles. The 2-dimensional sampling uses this set as follows. Assume that there

is a pair of different symbols $(x, y)$ in the text $T$ whose positions differ exactly by a vector that is a short period in $P$. Let symbol $x$ belong to square D and let $y$ belong to E. Then there is no occurrence of pattern $P$ in the window B. Using the latter observation, the text $T$ is divided into windows of size at least $\frac{m}{4} \times \frac{m}{4} = \Omega(M)$ (corresponding to the first quadrant of A). The search in every window starts with the test of equality of symbols in pairs between windows E and D. Since the text is random the algorithm makes only a constant number of tests on the average in every window, and this finally gives the desired $O(\frac{N}{M})$ bound. □

[Figure 1: Sampling in 2 dimensions. If there is a mismatch between positions x and y then there is no occurrence of P starting in the indicated window. Labels in the figure: pattern P, text T, large repeated squares A (side > m/2), subsquares D and E (side > m/4), short period, mismatch, the window.]

We can define the 2-dimensional repetition size of a 2d-pattern $P$ ($\mathit{2drep\_size}(P)$, in short) as the largest repeated subsquare area of $P$. Similarly to the 1-dimensional case we can prove the following.

Theorem 10 For a random two-dimensional text $T$ there is an algorithm that finds all the occurrences of $P$ in $T$ in $O(\frac{N}{\mathit{2drep\_size}(P)})$ average time using constant additional memory.

5 Summary

The main result of the paper is a constant space algorithm that performs $O(n/\log m)$ comparisons on the average for one-dimensional as well as for two-dimensional texts.

In the case of periodic patterns the average behavior of the algorithm is even better, reaching the asymptotic bound of $O(\frac{n}{m})$. Our paper initiates a discussion about pattern matching algorithms that use small space and are fast on the average. In this paper we have made some steps towards the goal but we think that the most interesting problem is still open: what is the exact average complexity of constant-space string matching? Or respectively: what is the space bound needed by any algorithm making $O(\frac{n}{m} \log m)$ comparisons on the average?

References

[1] R. Baeza-Yates and M. Régnier, Fast algorithms for two-dimensional and multiple pattern matching. In Proc. of 2nd Scandinavian Workshop on Algorithm Theory, SWAT'90, LNCS 447, pp. 332-347.
[2] D. Breslauer, Saving comparisons in the Crochemore-Perrin string matching algorithm. In Proc. of 1st European Symp. on Algorithms, pp. 61-72, 1993.
[3] M. Crochemore, A. Czumaj, L. Gąsieniec, S. Jarominek, T. Lecroq, W. Plandowski and W. Rytter, Speeding up two string matching algorithms. Algorithmica 12 (1994), pp. 247-267.
[4] M. Crochemore and D. Perrin, Two-way string-matching. J. Assoc. Comput. Mach., 38(3), pp. 651-675, 1991.
[5] M. Crochemore and W. Rytter, Periodic prefixes in texts. In Proc. of Sequences'91 Workshop "Sequences II: Methods in Communication, Security and Computer Science", pp. 153-165, Springer-Verlag, 1993.
[6] M. Crochemore and W. Rytter, Text Algorithms. Oxford University Press.
[7] Z. Galil, On improving the worst case running time of the Boyer-Moore string searching algorithm. CACM 22 (1979), pp. 505-508.
[8] Z. Galil and J. Seiferas, Time-space-optimal string matching. J. Comput. System Sci., 26, pp. 280-294, 1983.
[9] L. Gąsieniec, W. Plandowski and W. Rytter, The zooming method: a recursive approach to time-space efficient string-matching. Theoret. Comput. Sci., 1996.

[10] L. Gąsieniec, W. Plandowski and W. Rytter, Sequential sampling: a new approach to constant space pattern-matching. CPM 1995.
[11] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings. SIAM J. Comput., 6, pp. 322-350, 1977.
[12] A.C. Yao, The complexity of pattern matching for a random string. SIAM Journal on Computing, 8(3), pp. 368-387, August 1979.