arxiv: v2 [cs.ds] 14 Jan 2016

Size: px

Start display at page:

Download "arxiv: v2 [cs.ds] 14 Jan 2016"

Gwendoline Wood
5 years ago
Views:

1 On the Average-case Coplexity of Pattern Matching with Wildcards Carl Barton Blizard Institute, Bart s and the London School of Medicine and Dentistry, Queen Mary University of London, UK c.barton@qul.ac.uk arxiv: v [cs.ds] 4 Jan 06 Abstract Pattern atching with wildcards is the proble of finding all factors of a text t of length n that atch a pattern x of length, where wildcards (characters that atch everything) ay be present. In this paper we present a nuber of fast average-case algoriths for pattern atching where wildcards are restricted to either the pattern or the text, however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the average-case coplexity of these algoriths and show the first non-trivial tie bounds. These are the first results on the average-case coplexity of pattern atching with wildcards which, as a by product, provide with first provable separation in tie coplexity between exact pattern atching and pattern atching with wildcards in the word RAM odel. Introduction Pattern atching with wildcards is a string atching proble where the alphabet consists of standard letters and a wildcard φ which atches every character in the alphabet. Given a text t of length n and a pattern x of length < n the proble consists of finding all factors of the text that atch the pattern. Pattern atching with wildcards naturally arises in a nuber of probles in bioinforatics, they are priarily used to odel single nucleotide polyorphiss (SNP), being used in the identification of diseases and in genoe wide association studies as arkers for gene apping. An early result in pattern atching with wildcards was the Fast Fourier Transfor (FFT) based algorith of Fischer and Paterson [9] with runtie O(n log log σ) for an alphabet of size σ. A subsequent study by Pinter [5] outlined why the intransitivity of the atch relation prevents algoriths such as KMP [9] being used or easily odified when wildcards are present. After Fischer and Patterson, uch work focused on iproving the algoriths by reoving the dependency on the alphabet size, with randoized O(n log n) and O(n log ) solutions being proposed in [6] and [7] respectively. Deterinistic O(n log ) solutions were proposed soon after, first by Cole and Hariharan [7] and then siplified by Clifford and Clifford [5]. The only known lower bound for the worst case of this proble is due to Muthukrishnan and Pale [3] who showed that in the worst case the proble is equivalent to coputing boolean convolutions. The indexing version of the proble was studied first in [5] where an index supporting queries in O( + α) tie was presented where wildcards are allowed in the pattern. A short coing of this approach is that in the worst-case α ay be Θ(hn) where h is the nuber of groups of wildcards. In [6] Cole et al. presented an index which given a text with k wildcards and an integer d, allows searching for any pattern with at ost d wildcards. For a pattern containing g d wildcards, the atching takes O( + g log k n log log n + occ) tie ; when wildcards are restricted to either the pattern or the text the query tie becoes We use the notation of [0] as it is ore understandable than [6]. licensed under Creative Coons License CC-BY Leibniz International Proceedings in Inforatics Schloss Dagstuhl Leibniz-Zentru für Inforatik, Dagstuhl Publishing, Gerany

2 On the Average-case Coplexity of Pattern Matching with Wildcards Proble Query Tie Wildcards in t O( log n + γ + occ) Wildcards in x O( + hβ) Wildcards in t and x O( log n + hβ + γ + occ) Opt wildcards in t O( log n + log n + γ log n + occ) Opt wildcards in x O( + ghβ) Opt wildcards in t and x O( log n + log n + ghβ + γ log n + occ) Table Properties of the indexes presented in [0] O( + g log log n + occ) and O( + log k n log log n + occ) respectively. A drawback of the index of Cole et al. is that once the index has been built it can only be used to search for patterns with at ost d wildcards. Very recently a nuber of indexes were presented by Bille et al. [4], where a linear space index with query tie O( + σ g log log n + occ) and a linear query tie index with space coplexity O(σ d n log d log n). The results of Billie et al. can be iproved using recent work on weighted ancestor queries [3]. In the area of linear size indexes for strings with wildcards, La et al. [0] have presented indexes for a nuber of probles shown in Table, where for a pattern x consisting of strings x 0, x,.., x j interleaved by j wildcards, t i is analogously defined for a text t of length n, occ(u, v) denotes the nuber of occurrences of u in v, γ = Σ l+ j= occ(t j, x), h is the nuber groups of consecutive wildcards in the text, g is the total nuber of wildcards in the text, β = in i h+ {occ(x i, t)}. Succinct indexes have been presented in [6] with a space usage of (( + o())n log σ + O(n) + O(h log n) + O(j log j)) bits for a text containing h groups of j wildcards in total. The authors of [4] proposed a copressed index where wildcards can only occur in the text with space usage nh y + o(n log σ) + O(h log n) bits, where H y is the y-th-order epirical entropy (y = o(log σ n)) of the text. The first non-trivial o(n log n) bit indexes were recently presented in []. In this paper we focus on the average-case coplexity of the proble. To the best of our knowledge no results are known. Preliinaries An alphabet Σ is a finite non-epty set, of size σ, whose eleents are called letters. A string on an alphabet Σ is a finite, possibly epty, sequence of eleents of Σ. The zero-letter sequence is called the epty string, and is denoted by ε. The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by x. All strings of length q are denoted by Σ q and refer to any x Σ q as a q-gra. We denote by x[i], for all 0 i < x, the letter at index i of x. Each index i, for all 0 i < x, is a position in x when x ε. It follows that the i-th letter of x is the letter at position i in x, and that x = x[0.. x ]. A string x is a factor of a string y if there exist two strings u and v, such that y = uxv. Consider the strings x, y, u, and v, such that y = uxv. If u = ε, then x is a prefix of y. If v = ε, then x is a suffix of y. A wildcard character is a special character that does not belong to alphabet Σ, and atches with itself as well as with any character of Σ; it is denoted by φ. Two characters a

3 Carl Barton 3 and b of alphabet Σ {φ} are said to correspond (denoted by a φ b) if they are equal or at least one of the is the wildcard letter. Let x be a non-epty string and y be a string. We say that there exists an occurrence of x in y or, ore siply, that x occurs in y when x is a factor of y. Every occurrence of x can be characterised by a position in y. Thus we say that x occurs at the starting position i in y when y[i.. i + x ] = x. It is soeties ore suitable to consider the ending position i + x. To be consistent with previous works on pattern atching with wildcards we consider the word RAM odel of coputation with word size Ω(log n) In this paper the probles we consider are the following. Proble (Wildcards in the Text). Given a text t of length n drawn fro Σ {φ}, and a pattern x of length drawn fro Σ. Find all i such that t[i.. i + ] φ x[0.. ]. Proble (Wildcards in the Pattern). Given a text t of length n drawn fro Σ, and a pattern x of length drawn fro Σ {φ}. Find all i such that t[i.. i+ ] φ x[0.. ]. 3 Background on Average-case Analysis Here we give soe background inforation on the literature concerning average-case coplexity. The ter average-case has been used to refer to various different assuptions when discussing online pattern atching in strings, here we discuss these different assuptions and justify the odel we use. In the literature soeties it is assued that the pattern is randoly drawn fro the alphabet, whilst others consider that the pattern is arbitrary. An arbitrary pattern is assued in [8, 7, 9], however, in later work such as [0,,,,, 4, ] the assuption of a rando pattern is ade. Clearly the notion of average-case coplexity with arbitrary patterns is stronger than with rando patterns and that is what we consider here. Soething else we consider is the fixed or adaptive nature of the sequence of probing positions applied to the text. We refer to the sequence of probing positions as the inspection schee. In [7] it was shown that having a predeterined inspection schee negatively effects the runtie of exact pattern atching algoriths for < n < when copared with an adaptive inspection schee. In this paper we explore the effect this property can have on the average-case perforance of algoriths for pattern atching with wildcards and refer to algoriths which exaine the character inside a window in a fixed order as fixed algoriths. For the purpose of showing lower bounds in this paper we consider a siplified odel of coputation for fixed algoriths and define the as follows. Definition (Fixed Algorith). Consider t partitioned into non-overlapping blocks of size. Let (i, i,.., i ) be an arbitrary but fixed perutation of (0,,, 3,.., ). An algorith is fixed with respect to (i, i,.., i ) if, for every block, the sequence of probing positions is (i, i,.., i ). We wish to find all factors of t corresponding to x that are entirely within a block. We consider the following siplified definition of non-fixed algoriths. Definition (Non-fixed Algorith). Consider t partitioned into non-overlapping blocks of size. An algorith is non-fixed if characters of t can be inspected in any order. We wish to find all factors of t corresponding to x that are entirely within a block. The order characters are read in the text.

4 4 On the Average-case Coplexity of Pattern Matching with Wildcards We show a tight upper bound on the best perforance for fixed algoriths and for non-fixed algoriths we show upper and lower bounds which atch within a logarithic factor for all but the ost extree values of g. The upper bounds for fixed and non-fixed algoriths are quite different. It is iportant to point out that when discussing the average-case coplexity of online string atching probles it is custoary to ake a distinction between tie taking to preprocess the pattern and the search tie. average-case optial custoarily refers to achieving the optial search tie, not necessarily considering the preprocessing tie required to achieve it. Our Contribution: In this article, we present fast average-case algoriths for pattern atching with wildcards restricted to the text or pattern. We present algoriths with low preprocessing and average-case search tie O( n log σ ) and O( n(g+log σ ) g ) for Probles and which are optial in the non-fixed and fixed odels respectively. We show an algorith with optial average-case search tie for Proble in the non-fixed odel that has coplexity between O( n log σ g ) and O( n log σ log ). 4 Algoriths In the following section we present a nuber of filtering algoriths for pattern atching in the presence of wildcards. The presented algoriths consist of two distinct schees: the filtering schee, which deterines if the currently considered text window potentially has a valid occurrence; in case the window ay contain a valid occurrence, we are required to check the window for valid occurrences of the pattern; this is done through the verification schee. Intuitively, the algoriths consider a sliding window of length of the text, and reads q-gras backwards fro the centre of the window until it is likely to have found a difference in every possible occurrence, allowing us to skip the window. That is, we wish to ake the probability of a verification being triggered sufficiently unlikely whilst also ensuring we can shift the window a reasonable aount. The verification schee used in this algorith consists of naively checking all possible alignents of the pattern against the text. Clearly each check takes no ore than O() tie and there are possible start positions for a window of size so O( ) in total. For the rest of the article we refer to this verification schee as VER(i, x) where i is the start of the window and x is the pattern. For the rest of the article we assue that the text t is of length n and is rando and uniforly drawn fro Σ or Σ {φ}, depending on the proble, and that Σ is a finite but not necessarily constant alphabet. 4. Filtering Schee The filtering schee of the presented algorith requires the preprocessing and indexing of the pattern x. We first present the preprocessing required and then present the searching technique itself. Preprocessing. We outline two indexing schees that will be used in this paper. The first is a basic q-gra index adapted to allow wildcards in the query, which siply considers if a atching q-gra exists in the indexed text. To build this index we generate all strings of

5 Carl Barton 5 length q fro the alphabet Σ {φ}, and check each against the set of q length factors fro the pattern. Since there are (σ + ) q different generated strings, at ost factors of length q in the pattern and each check requires O(q) tie, the total tie spent is O(q(σ + ) q ). We can store the results of this processing in a binary array where each string is converted to a nuerical representation for efficient lookup. A dictionary built in this way has a search tie of O(q). We refer to this schee as q-basic. The second indexing schee is based on the siple q-gra index but allows wildcards in the text and for ultiple queries allows us to enforce a weak ordering on the q-gras. The intuition for this schee is that if we index a text with g wildcards, it ay be the case that all of these wildcards occur sequentially. In this situation if q < g then the basic q-gra index will report a atch for every single query. This akes it ipossible for us to use the basic q- gra index as part of a filtering schee unless q > g which gives an exponential dependency on g. To get around this issue we build a q-gra index by considering every length q factor of the pattern and generating all atching strings over Σ. Let M[0.. σ q, 0.. ] be a -d array and q be the q-gra starting at x[i], then for every s Σ q such that s = q we store M[ρ(s)][i] 0 where ρ(s) is the nuerical representation of s.. This procedure is done for every q-gra in the pattern. Clearly for each q-gra considered there are at ost σ q atching factors and no ore than q-gras will be considered giving O(σ q ) tie. Siilarly M is of size at ost σ q and each list is not longer than giving O(σ q ) space. We preprocess each array in M for next saller value queries. Next saller value queries take O() tie after linear preprocessing [3]. We refer to this schee as q-weak. Consider a sequence s, s,.., s l of l consecutive, non-overlapping q-gras read fro a text t, we ake use of the observation that any occurrence of the pattern containing those q-gras ust contain those q-gras in the sae order that they appear in the text. Whilst it is possible to check if all possible start locations have the q-gras occurring consecutively this search ay be costly. We weaken the restriction on q-gras and require only that for soe s i occurring at position j in the pattern that there exists an occurrence of s i+ at soe h > j + q. If there are ultiple occurrences then we take the iniu starting position greater than j + q. Given a q-weak index, l query q-gras such that s i occurs at position j, this corresponds to checking if s i+ exists and then perforing a next saller value query on M[ρ(s i+ )][j + q]. We denote this query by Weak-Order-Query(s, s,.., s l ) and with this we get the following. Lea 3. Let x be a pattern of length with g wildcards, t be a text of length n with no wildcards and let s, s,.., s l be a sequence of l consecutive, non-overlapping q-gras read fro t. Given a q-weak-fast index of x we can perfor Weak-Order-Query(s, s,.., s l ) in O(lq + l). 4. Wildcards in Text Only We begin by considering the proble where wildcards ay appear only in the text and refer to the algorith of this section as Algorith Wt. First we will establish the probability of a rando string of length x log σ+ containing wildcards atching soe factor of the pattern. This choice of length will becoe clear later. Lea 4. Let u be a rando string of length x log σ+ over Σ {φ} and let v be a string of length over Σ. The probability of u atching a factor of v is at ost x. Applying index q-basic we can see that for q = 3 log σ+ this is a polynoial preprocessing schee and the axiu exponent occurs when σ =. For σ = we have that

6 6 On the Average-case Coplexity of Pattern Matching with Wildcards 3 3 log and so the total preprocessing is O( 9.3 log ). Asyptotically on the alphabet size this becoes O( 3 log ). Consider a sliding window of length placed over the text and that for each window we will check if the suffix of length 3 log σ+ corresponds with any factor of the pattern. If the suffix corresponds with a factor of the pattern there ay be a atch and we are required to verify this window. To verify the window we run algorith VER(i, x) in tie O( ). For each window on the text we have a total possible running tie of O( + log σ+ ). The probability of the suffix corresponding with a factor of the pattern is / by Lea 4, so the expected tie spent at each window is O(log σ+ ) or O(log σ ). If the window is verified then it is possible to shift by characters, otherwise the window is not verified n and can we shift by q characters. This eans there there are at ost q windows, each has an expected runtie of O(log σ ), giving us an overall average-case running tie of O(n log σ /). Fro the above discussion, and noting that Yao s lower bound [7] for exact string atching is O(n log σ /), we get the following result: Theore 5. Algorith Wt has optial average-case search tie O(n log σ /) with 3 log σ+ 3 log σ+ O((σ + ) logσ ) preprocessing tie and O((σ + ) ) space. 4.3 Wildcards in the Pattern In this section we consider the case where wildcards ay appear in both the pattern and text. For a pattern x we denote the nuber of wildcards in the pattern by g. We refer to the algorith of this section as Algorith Wp. We could use dictionary q-basic but for this proble we would have that q > g. This eans that the space coplexity of the dictionary would becoe Ω((σ + ) g ) which is far too high. Instead we will use dictionary q-weak-fast. We now establish the probability of rando string atching a pattern with g wildcards. Lea 6. Let u be a rando string of size g + x log σ over Σ and let v be a string of size > g + log σ over Σ {φ} with g wildcards. The probability of u atching a factor v is no ore than x. For this algorith we create a sliding window on the text of length and build a q-weak-fast index over x with q = 3 log σ. For each window on the text we denote the start position by i and read g q + q-gras starting at t[i + g q ] and perfor a Weak-Order-Query on the. If the Weak-Order-Query returns true then we perfor VER(i, x) to verify all possible start positions in the current window and shift by positions. If the Weak-Order-Query returns false we shift by g q. The iniu shift the algorith akes is g 3 log σ, n so there will be at ost g 3 log σ windows on the text and at each window we ay do O( + g + log σ ) work in the worst case. The probability that we will need to verify a window is / by Lea 6, this gives us an expected tie of O(g + log σ ) per window. n For the algorith to achieve the claied runtie it ust be the case that g 3 log σ = O( n ). To satisfy this it follows that g + 3 log σ c for soe 0 < c <. This places the following condition on the wildcard ratio of our algorith: g < c 3 log σ We also have an additional restriction, we ust be able to guarantee that we are reading enough new rando characters after each shift that Lea 6 still holds. This places the

7 Carl Barton 7 additional restriction that ust be at least twice the length of the shortest shift. So it ust hold that > (g + 3 log σ ) to ensure that in all cases we read enough new q-gras fro a window for the above analysis to hold. After rearrangeent this places the following restriction on our algorith: g < 3 log σ Clearly the second condition places the strictest condition on the wildcard ratio. Fro the above discussion we achieve the following result: Theore 7. Algorith Wp has average-case search tie O( n(g+log σ ) ) with O( 4 ) preprocessing tie and O( 4 ) space, for g < 3 log σ. The wildcard ratio we specify is quite perissive as 3 log σ = o() so for any g/ < / it is possible to pick a sufficiently large value of such that the algorith can run in the claied running tie. In the following theore we show that for any fixed algorith it is ipossible to do any better than this. We show that for any integer g, there exists a lower bound of Ω( ng ) character inspections for any fixed algoriths. Theore 8. Algorith Wp has average-case search tie O( n(g+log σ ) ) and no fixed algorith can do better. 5 A General Lower Bound In the previous section we have considered the average-case coplexity of each algorith. Fixed algoriths consider the characters in each window of the text in a fixed order, an approach that is ubiquitous in string algoriths. We have shown that for any fixed inspection schee there exists patterns that perfor badly on average. In this section we consider nonfixed algoriths and derive an average-case lower bound for any algorith solving proble with an arbitrary pattern and any value for g. This bound gives a provable separation in coplexity between exact string atching and wildcard atching. Clearly a lower bound for this proble also lower bounds the proble where wildcards appear in both the pattern and the text. When considering this proble the order characters are inspected becoes very iportant. Soe inspection schees do not have uch effect on the expected nuber of candidates not yet ruled out. It is the case that inspections that, should wildcards not be present, would lead to a large reduction in the expected nuber of candidates ay give very little inforation when wildcards exist. Consider that we have a pattern of length with g < wildcard characters and a text of length n. Partition the text into non-overlapping blocks of size, and only consider that we ust report all atches within each block. This is an optiistic assuption as this excludes those atches which overlap two blocks. In the following section we will deterine a lower bound for the nuber of character inspections required for one block. The lower bound for the general proble can then be derived. For each block we call 0,.., possible starting positions candidates and when we inspect a character fro the block we call this a block access. The candidates affected by a block access are intersected by it. Given a block access to position z in block b we can only rule out candidate c if there exists soe y such that x[y] b[z] and c + y = z. For all non-wildcard positions intersected by a given block access i j, there is a probability of at ost /σ that the candidate will not be ruled out. For those candidates where this block

8 8 On the Average-case Coplexity of Pattern Matching with Wildcards access intersects a wildcard there is probability it will not be ruled out. We outline a few optiistic assuptions used in our analysis. Any block access intersects all candidates. Intersections are distributed uniforly across all candidates. The effect of this is that g candidates have a chance of being ruled out at every block access. After k block accesses in this odel we have ade ( g)k intersections and we assue that these are distributed uniforly across all candidates. This is optiistic as this eans that we always intersect candidates with the highest probability of occurrence, soething which ay not actually be possible. This eans that we ay only overestiate the nuber of candidates ruled out and our result is a lower bound. Clearly the expected nuber of candidate not ruled out is then be evaluated as follows: i=0 σ ( g)k = σ ( g)k For each candidate we need to either rule it out as a possible starting position or declare a atch. So the optial is to deterine when we would expect to have ruled out every candidate position or read characters. We iniise the following so that we expect to have at ost one candidate left or until we have read all positions. Rearranging this we get the following: σ ( g)k log σ log σ g ( g)k Now we know that the lower bound for each block is Ω( log σ g ) and there are n/ blocks. The result below then follows: Theore 9. The average-case lower bound for wildcard atching with wildcards only in the pattern is Ω( n log σ g ). Clearly this results suggests that pattern atching with wildcards is on average asyptotically harder than exact string atching when g = f(), where f() is soe function asyptotically saller than O(). Now we discuss a strategy for inspecting positions of a block and show that a greedy schee perfors an optial nuber of character coparisons. By greedy we ean that at each step the block access which would ost greatly reduce the expected nuber of reaining positions is chosen. For a candidate i let i be the nuber of ties it has been intersected at a non-wildcard position. Now for each position in a block 0 i < let B i be the set of candidates that the block access i intersects. The effect on the expected nuber of candidates not ruled out by inspecting soe position l is given by the following. Let U = {0,,.., }\B l and i denote the nuber of ties candidate i has been intersected before the block access to l. k i=0 σ i j U σ j h B l σ h+

9 Carl Barton 9 The greedy inspection axiises the last two ters of the above suation at each step. The intuition behind this schee is based on our proof of the lower bound presented above. In the proof we saw that evenly distributing accesses across all candidates iniises the expectation. The greedy schee attepts to siulate this behaviour by picking the access which ost iniises the probability at each step. We now show that this is in fact optial. Theore 0. The greedy inspection schee perfors an optial nuber of character coparisons. We have shown that a greedy inspection schee iniises the nuber of character inspections required for any g, but this does not give us any explicit upper bound. We now give an upper bound on the average-case search tie significantly better than that in algorith Wp for all but the ost extree values of g. 6 An Iproved Upper Bound As before consider that we have a pattern of length which contains g < wildcard characters and a text of length n. We partition the text into non-overlapping blocks of size and only consider that we have to report all atches within each block. Although this is an optiistic assuption, we will show how to convert this to a true upper bound later in this section. In the rest of this section we will deterine an upper bound for the nuber of character inspections required for one block. First we recall the definition of the ε-dense set cover proble along with a known approxiation results. Proble 3. Given ε > 0, a set of eleents U = {,, 3,.., r}, a faily S of l sets such that every eleent of U occurs in εr sets and the union of S equals U. Find the iniu nuber of sets fro S such that their union is U. The general set cover proble is known to be NP-hard, however the following approxiation result is known for the ε-dense set cover proble. Lea ([8]). There exists an approxiation algorith for the ε-dense set cover proble with output size log r where U = r. ε We define the following faily of sets for j < S j = { i {,,.., } : i < j and c > 0 s.t. x[c] φ and c + i = j} Our goal in this section is to deterine how any inspections it takes to ake the probability of any candidate not being rules out saller enough to ake verification tie negligible. One way to do this is to ensure that all candidates have been intersected at least 3 log σ, causing the probability one is not ruled out to be at ost /. The difficulty in deriving the upper bound is we ust explicitly consider how each candidate is affected by a block access. To do this we use the above result on ε-dense set covers. By the definition of S = S,.., S it is clear that each eleent occurs in g sets, so the proble can be seen as an g -dense set cover proble. We will apply the ε-dense set cover proble, reove the approxiate set cover and apply the ε-dense set cover proble again until we achieve 3 log σ intersections per candidate. We now define a faily of functions whose arguent is the pattern x which relate to repeated applications of the ε-dense set cover proble applied on S. F 0 (x) = log + g

10 0 On the Average-case Coplexity of Pattern Matching with Wildcards F i (x) = log + g + log F 0 (x) + log F (x) log F i (x) We are now in a position to define a density function which gives the density of the reaining sets after i applications of the ε-dense set cover proble. D 0 (x) = g D i (x) = g log F 0 (x) log F (x).. log F i (x) The nuber of characters inspected to guarantee i > 0 intersections per candidate is then given by G i (x) = log F 0 (x) + log F (x) log F i (x) We now show a nuber of properties of the above functions. We wish to show that for soe integer i 0, a constant 0 < ɛ < and g + log F 0 (x) + log F (x) log F i (x) < ɛ. log + ɛ li F i (x) log We proceed by induction on i. Let i = 0 and g < ɛ then clearly log + ɛ li + g log Let i = k, 0 < ɛ < and assue for all natural nubers less than k that the induction hypothesis holds. We focus on the lower bound as the upper bound is obvious. Let i = k + and g + log F 0 (x) + log F (x) log < ɛ. Applying the induction hypothesis this is at least F k (x) g + log + log log < ɛ as and therefore. log +ɛ log +ɛ log +ɛ }{{} k+ ties It then follows that li + g + log F 0 (x) + log F (x) log F k (x) ɛ log + ɛ li F k+ (x) log With this result we get that for sufficiently large and g+ log F 0 (x) + log F (x) +..+ log F i (x) < ɛ. G i (x) = O(i log ) Recall that we wish to intersect each candidate at least 3 log σ so for i = 3 log σ + we get the nuber of inspected characters is G 3 log σ + (x) = O(log σ log ) for sufficiently large. This upper bound is for a block of the text of size and assues all character inspections can be considered as independent tests. To convert this to a general bound for a text of size n consider the text partitioned into blocks of size that overlap by characters. After inspecting O(log σ log ) characters we expect that we can discard the entire block and can shift by characters. It ay be the case that we have already read soe of the characters we need to inspect, but as long as the window is twice the size of O(log σ log )

11 Carl Barton the analysis still holds. Coputing the greedy inspection schee can be trivially done in O( 3 ) and an index can be built by building a finite state achine eorising the candidates still valid for any possible character read in the order specified, this takes no ore than σ O(log σ log ) which is polynoial for σ = O() and as σ O(log σ log ) = O(log σ (log )) is quasi-polynoial for larger alphabets, therefore. Theore. For any constant 0 < ɛ < there exists an algorith for Proble with average-case search tie O( n log σ log ) when g + log F 0 (x) + log F (x) log F 3 log (x) < ɛ. Cobining all the above we know that there exists an algorith that has the following bounds on the nuber of character required to be inspected. Let OPT denote the nuber of characters inspected by the optial schee with the specified conditions on g. 7 Conclusions and Future Work ( n logσ ) ( n logσ log O OPT O ) g In this paper we have investigated the average-case coplexity of two wildcard atching probles. We have considered two odels of coputation and have shown a big jup in the coplexity of the two. The question of a tight bound on the search coplexity of pattern atching with wildcards reains open. We have shown an algorith which has optial average-case search tie and upper/lower bounds on the tie coplexity that are only different by a logarithic factor. This lower bound shows the first provable separation in tie coplexity between wildcard atching and exact atching in the word RAM odel. Although [3] showed that in the worst case the proble is equivalent to coputing boolean convolutions and any believe this cannot be solved in linear tie, no superlinear lower bound on this is known. We conjecture that the optial nuber of character inspections is in fact the lower bound stated in Theore 9. References Ricardo Baeza-Yates and Gonzalo Navarro. New and faster filters for ultiple approxiate string atching. In Rando Structures and Algoriths (RSA), page 00, 998. Ricardo. A. Baeza-Yates. Algoriths for string searching. SIGIR Foru, 3(3-4):34 58, April Jéréy Barbay, Johannes Fischer, and Gonzalo Navarro. Lr-trees: Copressed indices, adaptive sorting, and copressed perutations. Theo. Cop. Sci., 459:6 4, 0. 4 Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. In Fedor V. Foin and Petteri Kaski, editors, Algorith Theory, volue 7357 of Lecture Notes in Coputer Science, pages Peter Clifford and Raphaël Clifford. Siple deterinistic wildcard atching. Inf. Process. Lett., 0():53 54, January Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary atching and indexing with errors and don t cares. In Proc. of the 36th annual ACM STOC, pages ACM, Richard Cole and Raesh Hariharan. Verifying candidate atches in sparse and wildcard atching. In Proc. of the 34th annual ACM STOC, pages 59 60, Maxie Crocheore, Artur Czuaj, Leszek Gasieniec, Stefan Jaroinek, Thierry Lecroq, Wojciech Plandowski, and Wojciech Rytter. Speeding up two string-atching algoriths. Algorithica, (4-5):47 67, 994.

12 On the Average-case Coplexity of Pattern Matching with Wildcards 9 Michael J. Fischer and Michael S. Paterson. String-atching and other products. Technical report, Cabridge, MA, USA, Kio Fredriksson and Gonzalo Navarro. Average-optial ultiple approxiate string atching. In Ricardo Baeza-Yates, Edgar Chávez, and Maxie Crocheore, editors, Cobinatorial Pattern Matching, volue 676 of LNCS, pages Kio Fredriksson and Gonzalo Navarro. Average-optial single and ultiple approxiate string atching. J. Exp. Algorithics, 9, Deceber 004. Kio Fredriksson and Gonzalo Navarro. Flexible usic retrieval in sublinear tie. In In Proc. 0th Prague Stringology Conference, pages 74 88, Paweł Gawrychowski, Moshe Lewenstein, and PatrickK. Nicholson. Weighted ancestors in suffix trees. In Andreas S. Schulz and Dorothea Wagner, editors, Algoriths - ESA 04, volue 8737 of LNCS, pages Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Shara V. Thankachan, and Jeffrey Scott Vitter. Copressed text indexing with wildcards. In Roberto Grossi, Fabrizio Sebastiani, and Fabrizio Silvestri, editors, SPIRE, volue 704 of LNCS, pages Costas S. Iliopoulos and M. Sohel Rahan. Pattern atching algoriths with don t cares. In 33rd International Conference on Current Trends in Theory and Practice of Coputer Science, SOFSEM (), pages 6 6, Piotr Indyk. Faster algoriths for string atching probles: Matching the convolution bound. In Proc. of the 39th Syposiu on Foundations of Coputer Science, pages 66 73, Ada Kalai. Efficient pattern-atching with don t cares. In Proc. of the 3th annual ACM-SIAM SODA, pages , Marek Karpinski and Alexander Zelikovsky. Approxiating dense cases of covering probles. Technical report, Donald E. Knuth, Jaes H. Morris Jr., and Vaughan R. Pratt. Fast pattern atching in strings. SIAM J. Coput., 6():33 350, Tak-Wah La, Wing-Kin Sung, Siu-Lung Ta, and Siu-Ming Yiu. Space efficient indexes for string atching with dont cares. In Takeshi Tokuyaa, editor, Algoriths and Coputation, volue 4835 of LNCS, pages Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern atching. CoRR, abs/40.065, 04. Moritz G. Maaß. Average-case analysis of approxiate trie search. In SuleyanCenk Sahinalp, S. Muthukrishnan, and Ugur Dogrusoz, editors, Cobinatorial Pattern Matching, volue 309 of LNCS, pages S. Muthukrishnan and Krishna Pale. Non-standard stringology: Algoriths and coplexity. In Proceedings of the Twenty-sixth Annual ACM Syposiu on Theory of Coputing, STOC 94, pages , New York, NY, USA, 994. ACM. 4 Gonzalo Navarro and Mathieu Raffinot. Flexible Pattern Matching in Strings: Practical On-line Search Algoriths for Texts and Biological Sequences Ron Y. Pinter. Efficient string atching with don t-care patterns. In Alberto Apostolico and Zvi Galil, editors, Cobinatorial Algoriths on Words, volue of NATO ASI Series, pages Chris Thachuk. Succincter text indexing with wildcards. In Raffaele Giancarlo and Giovanni Manzini, editors, CPM, volue 666 of LNCS, pages Andrew Chi-Chih Yao. The coplexity of pattern atching for a rando string. SIAM J. Coput., 8(3): , 979.

13 Carl Barton 3 Appendix Proof of Lea 4 Proof. The probability of a character fro Σ atching a randoly picked character of Σ {φ} is σ+. The probability of r characters atching in this way is given by ( σ+ )r, by setting r = x log σ+ we get that ( σ+ )r =. There are at ost x log x σ+ + factors of this length in v and so the probability of occurrence is no ore than x. Proof of Lea 6 Proof. In the worst case all g wildcards of v appear in a single factor of size g + x log σ, this factor contains x log σ positions which are not wildcards and the probability that these atch the corresponding characters in u is no ore than. For any factor with less x than g wildcards the probability is. Pessiistically assue all factors of v of length x g + x log σ atch with probability, there are (g + x log x σ ) factors of this length so the probability is no ore than x. Proof of Theore Proof. Recall that the lower bound for exact string atching is Ω(n log σ /). We now show that any fixed algorith has a lower bound of Ω( ng ). Assue the text is partitioned into non-overlapping blocks of size and that we only want to find occurrences contained entirely within these blocks. Let π denote all the perutations of (0,,.., ) and assue that we exaine the characters of each block in the sae fixed but arbitrary order (i 0, i,.., i ) π. We can construct patterns, for any g <, such that we ust exaine at least g + characters before all start positions can be ruled out in the following way. If for 0 j < g all i j occur within a range of positions then we place the wildcards in positions i 0, i,.., i g of the pattern. Otherwise for 0 j < g and i j < place a wildcard at position i j. Any reaining wildcards ay be placed anywhere in the pattern; the reaining positions of the pattern are rando characters fro Σ. After inspecting characters i 0, i,.., i g of the block at least the first position can neither be ruled out nor declared as a atch. Cobining the lower bound of Yao and this we see that any fixed algorith has a lower bound of Ω(n(g + log σ )/) for this proble. Algorith Wp runs in average-case tie O(n(g + log σ )/), atching the lower bound; therefore the algorith is optial in the faily of fixed algoriths. Proof of Theore 5 Proof. The optial nuber of character coparisons is achieved by a schee iniising the nuber of block accesses needed to expect that less than candidate reains. An inspection schee L k is siply a k subset of the probing positions {0,,,.., }. For soe inspection schee L k we denote by E[L k ] the expected nuber of reaining candidates for inspection schee L k. We proceed by induction on the nuber of block access and clai that the greedy schee is an optial schee. Let G i = {ð, ð,.., ð i } be the first i block accesses ade by the greedy inspection schee. G i+ is then defined as G i ð i+ where ð i+ is the block access with the biggest effect on the expected nuber of candidates not ruled out. The base case is siple, we pick the block access which iniises the expected nuber of candidates not ruled out, by definition this is the iniu.

14 4 On the Average-case Coplexity of Pattern Matching with Wildcards Assue that for soe i it is true that for the greedy schee the nuber of expected candidates is the sallest possible after i accesses. Let K i = {ℸ, ℸ,.., ℸ i } be an arbitrary inspection schee with i block accesses. If G i = K i then E[G i+ ] E[K i+ ] and we are done, otherwise we have a few cases to consider. Consider G i+ and K i+, if ℸ i+ = ð i+ = l. Let B l be the set of candidates that the block access l intersects. If the following holds: Then so does the following: Additionally, if the following is true: Then so is the following: h B l σ ℸ h h B l σ ð h σ ℸ h+ σ ð h+ h B l h B l h B l σ ℸ h h B l σ ð h σ ℸ h+ σ ð h+ h B l h B l Therefore, should ℸ i+ = ð i+ then E[G i+ ] E[K i+ ]. Let C = {{0,,.. }\G i }\K i and ℸ i+ ð i+ if ð i+, ℸ i+ C then E[G i+ ] E[K i+ ], otherwise either ð i+ C or ℸ i+ C. If ℸ i+ ð i+ then either ℸ i+ G i or ð i+ K i. It is sufficient to show that there exists at least one possible choice for ð i+ that would cause E[G i+ ] E[K i+ ] in each case. Case : ℸ i+ G i Let ℸ u K i \G i, ℸ u ust exist as K i G i, and consider the inspection schee K i = {K i \ℸ u } ℸ i+. We clai that ð i+ = ℸ u ensures optiality. By the induction hypothesis E[K i ] E[G i]. Now let ð i+ = ℸ i+ = ℸ u, by the above analysis for when ð i+ = ℸ i+, E[G i+ ] E[K i+ ] and therefore E[G i+] E[K i+ ]. Case : ð i+ K i If ℸ i+ C, the greedy schee could also pick the block access ℸ i+ and reain optial so it ust be the case that E[G i+ ] E[K i+ ]. If ℸ i+ C then ℸ i+ G and by case the greedy schee is optial. Therefore E[G i+ ] E[K i+ ] and the stateent is proved.

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University