Constant-Space String-Matching in Sublinear Average Time (Extended Abstract)

Maxime Crochemore (Institut Gaspard Monge, Université de Marne-la-Vallée, France; mac@univ-mlv.fr)
Leszek Gasieniec (Max-Planck-Institut für Informatik, Im Stadtwald, D-66123 Saarbrücken, Germany; leszek@mpi-sb.mpg.de)
Wojciech Rytter (Institute of Informatics, Warsaw University, Poland, and Department of Computer Science, University of Liverpool, U.K.; rytter@mimuw.edu.pl; supported by the grant KBN 8T11C208)

Abstract

Given two strings, a pattern P of length m and a text T of length n, the string-matching problem is to find all occurrences of the pattern P in the text T. We present simple string-matching algorithms that work in o(n) average time with constant additional space for one-dimensional texts and two-dimensional arrays. This is the first attempt at the small-space string-matching problem in which sublinear-time algorithms are delivered. More precisely, we show that all occurrences of one- or two-dimensional patterns can be found in $O(n/r)$ average time with constant memory, where r is the repetition size (the size of the longest repeated subword) of P.

1 Introduction

The string-matching problem is defined as follows. Assume we are given two strings: a pattern P of length m and a text T of length n. The pattern occurs at position i in the text T iff $P = T[i..i+m-1]$. We consider algorithms that determine all occurrences of the pattern P in the text T. The complexity of a string-matching algorithm is measured by the number of comparisons between pattern and text symbols.

The algorithms solving the string-matching problem in linear time and constant space are perhaps the most interesting ones among all designed for the entire problem. The first algorithm that uses a constant amount of additional memory was proposed by Galil and Seiferas [8]. Later, Crochemore and Perrin [4] presented an algorithm that achieves a smaller number of comparisons (at most 2n) while preserving the small amount of memory. Then another improvement ($\frac{3}{2}$) on the number of comparisons was presented by Breslauer [2]. In the meantime, alternative algorithms were introduced by Gasieniec, Plandowski and Rytter in [9] ($2 + \varepsilon$) and [10] ($1 + \varepsilon$).

Besides, there are known algorithms that make a sublinear number of comparisons on the average. The first such method was proposed in [11] for strings. An attempt at 2-dimensional pattern matching that is fast on the average is due to Baeza-Yates and Regnier [1]. However, all known sublinear average-time algorithms use linear-size additional memory, either to keep a table of shifts as in the Boyer-Moore algorithm (see e.g. [11], [7]) or to represent a directed subword graph or an equivalent data structure (see e.g. [3] and [6]). The latter algorithms have the best possible $O(\frac{n \log m}{m})$ average time complexity, due to the lower bound of Yao [12]. One can try to find a trade-off between small space and good average time by applying techniques from [3] to subwords of the pattern P. This might lead to an algorithm that uses O(s) space (the size of the preprocessed subwords) and has $O(\frac{n \log s}{s})$ average time.

Until now there was no algorithm both performing a sublinear number of comparisons on the average and using only constant memory space. In this paper we present the novel idea of such an algorithm for one-dimensional strings as well as for two-dimensional arrays. The idea of the algorithms is based on the use of subword repetitions. For simplicity of presentation we assume that all strings considered in the paper are built over a binary alphabet $\Sigma = \{a, b\}$. We say that a word $w \in \Sigma^*$ has a period q ($0 < q \le |w|$) if $w[i] = w[i+q]$ for all positions $1 \le i \le |w| - q$. The shortest period of w is called the period of w. If it satisfies $q \le |w|/2$, then the word w is called periodic; otherwise, w is called nonperiodic.

2 Nonperiodic one-dimensional patterns

In this section we assume that the pattern P is nonperiodic. Let us denote by rep_size(P) the length of a longest repeated subword of P.

Example 1. The repeated subword of the example text below is babbaabab. rep_size(ababbaababaaababbaababba) = 9.

The number of logarithmic-size subwords of a text is large enough to guarantee that at least one of them repeats. This easily implies the following fact.

Lemma 1. For each pattern P of size m, $\mathit{rep\_size}(P) = \Omega(\log m)$.

Denote r = rep_size(P), and let w be a longest repeated subword. Assume $P[p-r..p-1] = P[q-r..q-1]$, $p \le q - r$ and $P[p] \ne P[q]$. In Example 1 we have (w, r, p, q) = (babbaabab, 9, 11, 23). The positions p, q are mismatches w.r.t. the repetition of the word w. In general, if there are no mismatch positions based on the repetition w to the right of the two copies of w, then we try to find them to the left, reversing the string-matching process. If no mismatch is found either to the right or to the left, the repetition occurs at the borders of the pattern. This case is handled similarly to the periodic case discussed in the next section.

We say that a position i in T is a mismatch position iff $T[i+p-1] \ne T[i+q-1]$. We call a window any interval of positions $[i..i+r-1]$ on T, for $1 \le i \le n-r+1$. Assume w.l.o.g. that we already know the 4-tuple (w, r, p, q).
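The definitions of period and of the 4-tuple (w, r, p, q) can be checked mechanically. The following is a minimal Python sketch, not taken from the paper: the function names `shortest_period`, `is_periodic` and `is_valid_tuple` are hypothetical, and the paper's 1-based positions are converted to Python's 0-based slices.

```python
def shortest_period(w: str) -> int:
    """Smallest q with 0 < q <= |w| such that w[i] == w[i + q] wherever both exist."""
    n = len(w)
    for q in range(1, n + 1):
        if all(w[i] == w[i + q] for i in range(n - q)):
            return q          # q = n always qualifies, so a non-empty word returns here
    return n                  # only reached for the empty word


def is_periodic(w: str) -> bool:
    """A word is periodic when its shortest period is at most half its length."""
    return 2 * shortest_period(w) <= len(w)


def is_valid_tuple(P: str, r: int, p: int, q: int) -> bool:
    """Check P[p-r..p-1] == P[q-r..q-1], p <= q - r and P[p] != P[q] (1-based p, q)."""
    if not (r >= 1 and p - r >= 1 and p <= q - r and q <= len(P)):
        return False
    left = P[p - r - 1 : p - 1]     # 1-based P[p-r .. p-1]
    right = P[q - r - 1 : q - 1]    # 1-based P[q-r .. q-1]
    return left == right and P[p - 1] != P[q - 1]


P = "ababbaababaaababbaababba"           # the text of Example 1
print(is_valid_tuple(P, 9, 11, 23))      # tuple stated for Example 1
```

The check only confirms that a given tuple satisfies the stated conditions; it does not search for a longest repeated subword.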

Denote by Leftmost_Mismatch(W) the procedure that finds the first (from the left) mismatch position in a given window W. If there is no such mismatch position, then the special value nil is returned.

Lemma 2. (1) If Leftmost_Mismatch(W) = nil, no occurrence of P in T starts in W. (2) Otherwise, no occurrence of P in T starts in W - {Leftmost_Mismatch(W)}.

The mismatch is used as a constant-size deterministic sample. □

Denote by Naive_Check(i) the procedure that tests a possible occurrence of P starting at a given position i in T by testing the equality of corresponding symbols from left to right. In the worst case m comparisons are done, but for random binary texts T the average time is very small. We assume that the symbols of the text are uniformly distributed.

Lemma 3. On random texts each of the procedures Naive_Check and Leftmost_Mismatch makes on the average less than 2 comparisons.

The sum $\sum_{i \ge 1} i/2^i$ is bounded by 2. □

Lemma 4. Assume that the pattern P is nonperiodic. Then, for a random text T, we can find all the occurrences of P in T in $O(n/\mathit{rep\_size}(P))$, which is $O(n/\log m)$, average time using constant additional memory. The worst-case running time of the algorithm is O(n).

There are $O(n/r)$ iterations in the algorithm Nonperiodic_Pattern_Searching below. Each iteration uses at most 4 comparisons on the average for the execution of both Naive_Check and Leftmost_Mismatch, due to Lemma 3. The comparisons done during different iterations can depend on each other, but independence is not needed, since the average value of a sum of random variables is the sum of their average values. Therefore the algorithm makes altogether $O(n/r)$ comparisons on the average.
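The two procedures and the window loop that combines them can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, text positions are 0-based Python indices, and the tuple values r, p, q are supplied by the caller in the paper's 1-based convention.

```python
from typing import List, Optional


def leftmost_mismatch(T: str, start: int, r: int, p: int, q: int) -> Optional[int]:
    """First position i in the window [start, start + r) with
    T[i + p - 1] != T[i + q - 1]; None plays the role of nil."""
    for i in range(start, start + r):
        if i + q - 1 >= len(T):       # P cannot fit at i or at any later position
            return None
        if T[i + p - 1] != T[i + q - 1]:
            return i
    return None


def naive_check(T: str, P: str, i: int) -> bool:
    """Compare P with T[i : i + len(P)] symbol by symbol, left to right."""
    if i + len(P) > len(T):
        return False
    return all(T[i + k] == P[k] for k in range(len(P)))


def nonperiodic_search(T: str, P: str, r: int, p: int, q: int) -> List[int]:
    """Report all 0-based occurrences of P in T using windows of size r.
    Assumes (r, p, q) satisfies the conditions stated for the 4-tuple."""
    if r < 1:
        raise ValueError("window size r must be positive")
    found = []
    i = 0
    while i + len(P) <= len(T):
        cand = leftmost_mismatch(T, i, r, p, q)
        # By Lemma 2, only the leftmost mismatch position can start an occurrence.
        if cand is not None and naive_check(T, P, cand):
            found.append(cand)
        i += r
    return found


# Pattern aab with w = a, r = 1, p = 2, q = 3 (a hypothetical small example).
print(nonperiodic_search("aaabaab", "aab", 1, 2, 3))   # [1, 4]
```

Only constant extra state is kept besides the output list: the window start, the candidate position and the comparison index.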

ALGORITHM Nonperiodic_Pattern_Searching; { nonperiodic pattern }
    i := 1; r := rep_size(P);
    while i ≤ n - m do
    begin
        W := [i..i + r - 1];
        i' := Leftmost_Mismatch(W);
        if i' ≠ nil then
            if Naive_Check(i') then report match at i';
        i := i + r;
    end

Similarly to the algorithm presented in [10], we can guarantee the linear worst-case time of the algorithm Nonperiodic_Pattern_Searching, since the shifts are based on a longest repeated subword of the pattern. This completes the proof. □

3 Periodic one-dimensional patterns

Assume now that P is periodic, so obviously its repetition size is large.

Lemma 5. If P is periodic then $\mathit{rep\_size}(P) \ge m/2$.

In this situation we cannot use the approach based on 4-tuples (w, r, p, q). Thus we derive a slightly different algorithm, which is even more efficient than the one used in the nonperiodic case.

Lemma 6. Assume P is periodic. Then for a random text T we can find all occurrences of P in T in $O(n/m)$ average time using constant additional memory. The worst-case time of the algorithm is linear.

Assume p is the period of P, where $p \le |P|/2$. We can partition the positions in T into disjoint consecutive large windows; each window consists of m/2 consecutive positions of T (the last one can be smaller). The first large window is $[1..m/2]$. The algorithm makes $\frac{n}{m/2}$ iterations. We process each large window as follows. Assume that the current window is $[i+1..i+m/2]$.

Phase 1. Find the rightmost mismatch in T according to the period p in the segment $[i+1..i+m]$. If a mismatch is found then switch to the next window $[i+m/2+1..i+m]$ and execute Phase 1 again; otherwise
Phase 2. Search naively for an occurrence of P starting in the current window.

The probability that we do not have a mismatch in Phase 1 is exponentially small, so the expected cost of the second phase is very small even if we search for the occurrence naively. The expected time to find a mismatch in the first phase is O(1). There are $O(n/m)$ iterations, so the total cost is as required. This completes the proof. □

The nonperiodic case in which the repetition is placed on the borders is handled in the same way, but with windows of size O(r). Lemma 4 and Lemma 6 imply the following result.

Theorem 7. For a random text T we can find all occurrences of P in T in $O(n/\mathit{rep\_size}(P))$ average time (which is $O(n/\log m)$) using constant additional memory. The worst-case time of the algorithm is linear.

4 Two-dimensional pattern-matching

In this section we show that also for the 2d-pattern matching problem the efficiency of a search depends on the repetition size. Assume the pattern P and the text T are $m \times m$ and $n \times n$ symbol arrays, respectively. Denote $N = n^2$, $M = m^2$. We say that the pattern occurs in T at position (i, j) iff $P[x, y] = T[i+x-1, j+y-1]$ for all integers $1 \le x, y \le m$. A 2-dimensional pattern P has a period [a, b] if $P[i, j] = P[i+a, j+b]$ for all $1 \le i \le m-a$ and $1 \le j \le m-b$.
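The two-phase processing of large windows from Section 3 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' code: names are hypothetical, positions are 0-based, and p is any period of P supplied by the caller. One deliberate deviation is labeled here: where the text moves to the next window as soon as Phase 1 finds a mismatch, the sketch is more conservative and still checks naively the window positions lying to the right of the rightmost mismatch, since a mismatch at the pair (j, j + p) only excludes occurrences that start at or before j.

```python
from typing import List


def naive_check(T: str, P: str, s: int) -> bool:
    """Left-to-right test for an occurrence of P at 0-based position s of T."""
    return s + len(P) <= len(T) and all(T[s + k] == P[k] for k in range(len(P)))


def periodic_search(T: str, P: str, p: int) -> List[int]:
    """All 0-based occurrences of P in T, where p >= 1 is a period of P."""
    n, m = len(T), len(P)
    if m == 0 or p < 1:
        raise ValueError("need a non-empty pattern and a positive period")
    half = max(1, m // 2)            # size of a large window
    found = []
    base = 0                         # current large window is [base, base + half)
    while base + m <= n:
        # Phase 1: scan the pairs (j, j + p) inside T[base : base + m] from the right.
        first = base                 # leftmost start not yet excluded
        for j in range(base + m - 1 - p, base - 1, -1):
            if T[j] != T[j + p]:
                first = j + 1        # a start <= j would cover both j and j + p
                break
        # Phase 2: naive search over the surviving starts of this window.
        for s in range(first, base + half):
            if naive_check(T, P, s):
                found.append(s)
        base += half
    return found


print(periodic_search("ababab", "abab", 2))   # [0, 2]
```

On a random text the right-to-left scan of Phase 1 stops after a constant expected number of pairs, and the number of surviving starts is then also constant in expectation, which matches the O(1) expected cost per window claimed in the proof of Lemma 6.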

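The two-dimensional definitions of an occurrence at position (i, j) and of a period [a, b] translate directly into code. The following minimal Python sketch assumes that arrays are stored as lists of equal-length row strings and uses 0-based indices; the names `occurs_at` and `has_period_2d` are hypothetical, and `occurs_at` plays the role of a two-dimensional naive check.

```python
from typing import List

Array = List[str]   # one string per row; the pattern is m x m


def occurs_at(P: Array, T: Array, i: int, j: int) -> bool:
    """Does P occur in T with its top-left corner at 0-based position (i, j)?"""
    m = len(P)
    if i + m > len(T) or j + m > len(T[0]):
        return False
    return all(T[i + x][j:j + m] == P[x] for x in range(m))


def has_period_2d(P: Array, a: int, b: int) -> bool:
    """P[i][j] == P[i + a][j + b] for all i < m - a and j < m - b (a, b >= 0)."""
    m = len(P)
    return all(P[i][j] == P[i + a][j + b]
               for i in range(m - a) for j in range(m - b))


P = ["ab",
     "ab"]
T = ["bab",
     "bab",
     "aaa"]
print(occurs_at(P, T, 0, 1), has_period_2d(P, 1, 0))
```
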
If the pattern P has a period [a, b] such that $\max\{a, b\} \le m/2$, then it is called periodic. Denote by 1rep_size(P) the maximum repetition size of a row of P.

Theorem 8. Assume P and T are two-dimensional texts. For a random two-dimensional text T there is an algorithm that finds all the occurrences of P in T in $O(N/\mathit{1rep\_size}(P))$, which is $O(N/\log M)$, average time using constant additional memory. If P contains a periodic row then the algorithm performs only $O(N/m)$ comparisons.

Similarly to the 1-dimensional case, we consider the periodic and nonperiodic cases separately. The algorithm is almost the same as for one dimension. We can construct a 2-dimensional version of the algorithm Nonperiodic_Pattern_Searching. In the case where all rows of the pattern are nonperiodic, the algorithm takes the first row of the pattern and looks for it, scanning each row of T partitioned into windows of size 1rep_size(P). For each window at most one position involves a test for an occurrence of the whole pattern. Instead of Naive_Check(i'), a version for 2 dimensions, 2d-Naive_Check(i', j'), is used. According to Lemma 1 we have altogether N/1rep_size(P) windows, and in each of them the average number of comparisons is constant. Hence the total number of comparisons is $O(N/\mathit{1rep\_size}(P))$, which is $O(N/\log M)$ since $\mathit{1rep\_size}(P) = \Omega(\log M)$.

In the case where the pattern P has at least one periodic row, the algorithm chooses one such row and then proceeds in a similar way as in the 1-dimensional case. Each row of T is partitioned into large windows. There are $O(N/m)$ such windows, and in each of them the algorithm makes a constant number of comparisons on the average. Hence the total number of comparisons is $O(N/m)$. This completes the proof. □

In the case of a periodic pattern P the text search can be done faster.

Theorem 9. If the pattern P is periodic, the search for it in T can be done in time $O(N/M)$.

Since the pattern P is periodic, it has two repeated subrectangles of size at least $\frac{m}{2} \times \frac{m}{2}$ (see Fig. 1, the shaded areas named A), which defines a set of pairs of equal symbols of size $\Omega(M)$. We consider the right bottom quadrants D and E of these rectangles. The 2-dimensional sampling uses this set as follows. Assume that there is a pair of different symbols (x, y) in the text T whose positions differ exactly by a vector that is a short period in P. Let symbol x belong to square D and let y belong to E. Then there is no occurrence of the pattern P in the window B. Using the latter observation, the text T is divided into windows of size at least $\frac{m}{4} \times \frac{m}{4} = \Omega(M)$ (corresponding to the first quadrant of A). The search in every window starts from the test of equality of symbols in pairs between windows E and D. Since the text is random, the algorithm makes only a constant number of tests on the average in every window, and this finally gives the desired $O(N/M)$ bound. □

[Figure 1: Sampling in 2 dimensions. If there is a mismatch between positions x and y, then there is no occurrence of P starting in the indicated window. Labels: pattern P, text T, large repeated squares A (side > m/2), subsquares D and E (side > m/4), short period, mismatch, the window.]

We can define the 2-dimensional repetition size of a 2d-pattern P (2drep_size(P), in short) as the largest repeated subsquare area of P. Similarly to the 1-dimensional case we can prove the following.

Theorem 10. For a random two-dimensional text T there is an algorithm that finds all the occurrences of P in T in $O(N/\mathit{2drep\_size}(P))$ average time using constant additional memory.

5 Summary

The main result of the paper is a constant-space algorithm that performs $O(n/\log m)$ comparisons on the average for one-dimensional as well as for two-dimensional texts.

In the case of periodic patterns the average behavior of the algorithm is even better, reaching the asymptotic bound of $O(n/m)$. Our paper initiates a discussion about pattern-matching algorithms that use small space and are fast on the average. In this paper we have made some steps towards this goal, but we think that the most interesting problem is still open: what is the exact average complexity of constant-space string matching? Or, respectively: what is the space bound needed by any algorithm making $O(\frac{n \log m}{m})$ comparisons on the average?

References

[1] R. Baeza-Yates and M. Regnier, Fast Algorithms for Two-Dimensional and Multiple Pattern Matching. In Proc. of 2nd Scandinavian Workshop on Algorithm Theory, SWAT'90, LNCS 447, pp. 332-347.
[2] D. Breslauer, Saving Comparisons in the Crochemore-Perrin String Matching Algorithm. In Proc. of 1st European Symp. on Algorithms, pp. 61-72, 1993.
[3] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter, Speeding up two string matching algorithms. Algorithmica 12 (1994), pp. 247-267.
[4] M. Crochemore and D. Perrin, Two-way string-matching. J. Assoc. Comput. Mach., 38(3), pp. 651-675, 1991.
[5] M. Crochemore and W. Rytter, Periodic Prefixes in Texts. In Proc. of Sequences'91 Workshop "Sequences II: Methods in Communication, Security and Computer Science", pp. 153-165, Springer-Verlag, 1993.
[6] M. Crochemore and W. Rytter, Text Algorithms. Oxford University Press.
[7] Z. Galil, On improving the worst case running time of the Boyer-Moore string searching algorithm. CACM 22 (1979), pp. 505-508.
[8] Z. Galil and J. Seiferas, Time-space-optimal string matching. J. Comput. System Sci., 26, pp. 280-294, 1983.
[9] L. Gasieniec, W. Plandowski and W. Rytter, The zooming method: a recursive approach to time-space efficient string-matching. Theoret. Comput. Sci., 1996.

[10] L. Gasieniec, W. Plandowski and W. Rytter, Sequential sampling: a new approach to constant space pattern-matching. CPM 1995.
[11] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings. SIAM J. Comput., 6, pp. 322-350, 1977.
[12] A.C. Yao, The Complexity of Pattern Matching for a Random String. SIAM Journal on Computing, 8(3), pp. 368-387, August 1979.